Sqoop Journey 3: Sqoop Import

sqoop-import

Introduction

The basic syntax for an import is:

$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

The general command format is:

$ sqoop import
--connect \                (connect string)
--username \               (username)
--password [-P | --password-file] \  (plain-text password, interactive prompt, or password file; the file usually has 400 permissions)
--query \                  (query string)
--warehouse-dir \          (parent HDFS directory for the import)
--fields-terminated-by \   (field delimiter; defaults to a comma)
-m [--num-mappers]         (number of map tasks, which controls import parallelism)

Connecting to a Database Server

Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port.

  • Connect to the database server
  • Specify the --connect argument
  • The connect string is URL-like and sometimes must include a port
# 1. Basic form
# Connects to the employees database on the host database.example.com
$ sqoop import --connect jdbc:mysql://database.example.com/employees

# 2. With username and password (plain text)
$ sqoop import \
--connect jdbc:mysql://database.example.com/employees \
--username root \
--password 12345

# 3. Reading the password from a file instead of the command line
$ sqoop import \
--connect jdbc:mysql://database.example.com/employees \
--username root \
--password-file ${user.home}/.password
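
The password file used in example 3 can be created like this (a minimal sketch; depending on your setup the file may need to live on HDFS rather than the local filesystem):

$ echo -n "12345" > $HOME/.password
$ chmod 400 $HOME/.password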

# 4. Specifying a port
# ubuntu is the hostname and db the database name. On a distributed
# cluster, do not use localhost as the hostname: each node would then
# connect to itself instead of the database server.
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db

Selecting the Data to Import

Sqoop typically imports data in a table-centric fashion. Use the --table argument to select the table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in a database.

By default, all columns within a table are selected for import. Imported data is written to HDFS in its "natural order".

You can select a subset of columns and control their ordering by using the --columns argument. For example: --columns "name,employee_id,jobtitle".

You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column_list> FROM <table_name>. Append a WHERE clause to this with the --where argument. For example: --where "id > 400".

  • By default, --table imports every row of the named table;
  • the target can be a table, a view, or another table-like entity;
  • columns are imported in their natural order by default, but --columns can select a subset and reorder them;
  • rows can be filtered by adding a WHERE clause with the --where argument.
# With a WHERE condition
# --target-dir is the output directory; -m is the number of map tasks
$ sqoop import \
--connect jdbc:mysql://database.example.com/employees \
--username root \
--password 12345 \
--where "name='xiaoming'" \
--table help_keyword \
--target-dir /sqoop/hadoop/data \
-m 1

# Selecting specific columns
# Equivalent SQL: select name from employees where name='xiaoming'
$ sqoop import \
--connect jdbc:mysql://database:3306/mysql \
--username root \
--password root \
--columns "name" \
--where "name='xiaoming'" \
--table employees \
--target-dir /sqoop/hadoop/data \
-m 1

# Using a custom query
$ sqoop import \
--connect jdbc:mysql://database:3306/mysql \
--username root \
--password root \
--target-dir /sqoop/hadoop/data \
--query 'select id,name from mysql.employees where $CONDITIONS and name="xiaoming"' \
--split-by id \
-m 4

# Importing a subset of rows
# ubuntu is the hostname and sqooptest the database name; a free-form
# query also needs --target-dir, plus --split-by or a single mapper
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/sqooptest \
--username root \
--password 123456 \
--query "select * from sqoop where student_id <= 500 AND \$CONDITIONS" \
--target-dir /sqoop/hadoop/data \
-m 1

Two points to keep in mind:

  1. Quoting:
    1. with single quotes outside and double quotes inside, $CONDITIONS needs no escaping;
    2. with double quotes outside and single quotes inside, it must be escaped as \$CONDITIONS.
  2. A custom query passed to --query must contain WHERE $CONDITIONS, joined to any additional conditions with AND.

If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"


Free-form Query Imports

Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.

When importing a free-form query, you must specify a destination directory with --target-dir.

Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

  • Besides naming a table, columns, and a WHERE condition, you can import the result of an arbitrary SQL statement with --query;
  • you must specify the destination directory with --target-dir;
  • the query must contain $CONDITIONS, and a splitting column must be chosen with --split-by;
  • mind the single-quote vs. double-quote rules described above.
# --split-by names the splitting column; --target-dir the output directory
$ sqoop import \
--query 'select a.*,b.* from a join b on (a.id = b.id) where $CONDITIONS' \
--split-by a.id \
--target-dir /usr/foo/joinresults

# With a single mapper, no splitting column is needed
$ sqoop import \
--query 'select a.*,b.* from a join b on (a.id = b.id) where $CONDITIONS' \
-m 1 \
--target-dir /usr/foo/joinresults

Controlling Parallelism

Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument.

By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16.

By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range.

If a table does not have a primary key defined and the --split-by <col> argument is not provided, then the import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used. The option --autoreset-to-one-mapper is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.

  • -m (or --num-mappers) takes an integer and sets the number of map tasks, i.e. the degree of parallelism;
  • to divide the load, Sqoop uses the --split-by column: it retrieves the column's min and max values and splits that range evenly among the mappers;
  • the default parallelism is 4; some databases perform better with 8 or 16;
  • if the splitting column is not uniformly distributed, the tasks will be unbalanced, and Sqoop currently cannot split on multiple columns; see the sketch below.
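
A minimal sketch of these options (connection, table, and column names are illustrative):

# Split the range [min(id), max(id)] evenly across 8 mappers
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--table employees \
--split-by id \
-m 8

# For a table with no primary key, either force a single mapper ...
$ sqoop import --connect jdbc:mysql://ubuntu:3306/db --table logs -m 1
# ... or let Sqoop fall back to one mapper automatically
$ sqoop import --connect jdbc:mysql://ubuntu:3306/db --table logs --autoreset-to-one-mapper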

Controlling Distributed Cache

  • Sqoop copies the jars in the $SQOOP_HOME/lib folder to the job cache every time a Sqoop job starts.
  • Passing the --skip-dist-cache option, e.g. when Sqoop is launched by Oozie, skips this copy step and saves massive I/O; see the sketch below.
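
A minimal sketch (connection and table names are illustrative):

# Skip copying $SQOOP_HOME/lib jars to the job cache
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--table employees \
--skip-dist-cache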

Controlling the Import Process

By default, the import process will use JDBC which provides a reasonable cross-vendor import channel.

Some databases can perform imports in a more high-performance fashion by using database-specific data movement tools. For example, MySQL provides the mysqldump tool.

By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC.

By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument.

--target-dir is incompatible with --warehouse-dir.

  • Sqoop uses JDBC as the import channel by default; some databases offer a faster database-specific tool, such as mysqldump for MySQL, selected with --direct (sketched after the examples below);
  • by default, data is imported into a directory named after the table under the current user's HDFS home directory;
  • --target-dir and --warehouse-dir cannot be used together.
# Data will land under /shared/foo/
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/mysql \
--table foo \
--warehouse-dir /shared

# Data will land directly in /dest/
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/mysql \
--table foo \
--target-dir /dest
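
The direct channel can be requested like this (a minimal sketch for MySQL; host and table names are illustrative):

# Use mysqldump instead of JDBC where the connector supports it
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/mysql \
--table foo \
--direct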

Controlling transaction isolation

By default, Sqoop uses the read committed transaction isolation level in the mappers to import data. This may not be ideal in all ETL workflows, and it may be desirable to reduce the isolation guarantees.

The --relaxed-isolation option can be used to instruct Sqoop to use read uncommitted isolation level.

The option --relaxed-isolation may not be supported on all databases.

  • By default, Sqoop imports data in the mappers under the read committed isolation level, which is not ideal for every workflow;
  • --relaxed-isolation instructs Sqoop to use the read uncommitted isolation level instead, but not every database supports it; see the sketch below.
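
A minimal sketch (connection and table names are illustrative):

# Import under the read uncommitted isolation level
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--table employees \
--relaxed-isolation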

Incremental Imports
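
Sqoop can also import only rows that are newer than a previous import. Two modes are available: --incremental append (for a monotonically increasing key column) and --incremental lastmodified (for a timestamp column), both driven by --check-column and --last-value. A minimal append-mode sketch, with illustrative connection and column names:

$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--table employees \
--incremental append \
--check-column id \
--last-value 1000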

File Formats

Two file formats: delimited text or SequenceFiles. Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument.

These delimiters may be commas, tabs, or other characters.

SequenceFiles are a binary format that store individual records in custom record-specific data types. These data types are manifested as Java classes.

By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.

  • Two main file formats: delimited text (the default) and SequenceFiles;
  • the delimiter may be a comma, a tab, or another character;
  • SequenceFiles are a binary format that stores each record in a custom, record-specific data type;
  • data is not compressed by default but can be, using the arguments sketched below.
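
A minimal sketch of these format and compression arguments (connection and table names are illustrative):

# Text format (the default), compressed with gzip
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--table employees \
--as-textfile \
--compress

# SequenceFile format with an explicit Hadoop codec
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--table employees \
--as-sequencefile \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec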

Importing Data Into Hive

Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.

If the Hive table already exists, you can specify the --hive-overwrite option to indicate that existing table in hive must be replaced.
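
A minimal sketch combining these options (connection and table names are illustrative; --hive-table is optional and defaults to the source table name):

$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--table employees \
--hive-import \
--hive-overwrite \
--hive-table employees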

Sqoop will by default import NULL values as string null.

You should append parameters --null-string and --null-non-string in case of import job or --input-null-string and --input-null-non-string in case of an export job if you wish to properly preserve NULL values.

You can import compressed tables into Hive using the --compress and --compression-codec options.

$ sqoop import  ... --null-string '\\N' --null-non-string '\\N'

Importing Data Into HBase

By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS.

You can manually specify the row key column with --hbase-row-key. Each output column will be placed in the same column family, which must be specified with --column-family.

If the input table has a composite key, the --hbase-row-key must be in the form of a comma-separated list of composite key attributes.
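
A minimal sketch (connection, table, and column-family names are illustrative):

$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--table employees \
--hbase-table employees \
--column-family info \
--hbase-row-key id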
