Lab 2

Sqoop (2): Importing a MySQL Table into Hive

Creating a Hive Database

  1. First, to avoid name collisions between tables, this lab recommends creating your Hive and Impala tables inside your own Hive database. Use the following commands to create a Hive database named after your username:
    [user00@master ~]$ export DBID=$USER
    [user00@master ~]$ hive -e "create database $DBID"
    Hive history file=/tmp/user00/hive_job_log_d1559058-06ad-4c8c-bdc5-750f47dc5cf3_1409161801.txt
    OK
    Time taken: 0.439 seconds
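
    To double-check that the database was created, you can list all databases; one named after your username should appear in the output (an optional sanity check, not part of the original lab steps):
    [user00@master ~]$ hive -e "show databases"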
    

Creating the Hive Table

  1. Next, delete the HDFS directory mysql_data created in Lab 1, Sqoop (1):
    [user00@master ~]$ hadoop fs -rm -r -skipTrash mysql_data
    Deleted mysql_data
    
    Otherwise, when you run sqoop create-hive-table or sqoop import later, you will see the following message:
    15/08/05 10:15:00 INFO hive.HiveImport: Hive import complete.
    15/08/05 10:15:00 INFO hive.HiveImport: Export directory is not empty, keeping it.
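
    To confirm the directory is gone before proceeding, you can list your HDFS home directory; mysql_data should no longer appear (an optional check):
    [user00@master ~]$ hadoop fs -ls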
    
  2. Run the sqoop create-hive-table command to create the Hive table:
    [user00@master ~]$ sqoop create-hive-table --connect "jdbc:mysql://localhost/$DBID" --table mysql_data --username $DBID -P --hive-table mysql_data --hive-database $DBID
    Enter password: (type your MySQL password)
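
    The -P flag prompts for the password interactively. As an alternative on Sqoop 1.4.4 or later, you can keep the password in a protected local file and pass it with --password-file (a sketch; adjust the file path to your environment):
    [user00@master ~]$ echo -n "your_password" > .mysql-pass
    [user00@master ~]$ chmod 400 .mysql-pass
    [user00@master ~]$ sqoop create-hive-table --connect "jdbc:mysql://localhost/$DBID" --table mysql_data --username $DBID --password-file file://$HOME/.mysql-pass --hive-table mysql_data --hive-database $DBID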
    
  3. If the Hive table was created successfully, you should see messages similar to the following:
    15/08/05 10:18:52 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
    15/08/05 10:18:52 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
    15/08/05 10:18:52 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
    15/08/05 10:18:53 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `mysql_data` AS t LIMIT 1
    15/08/05 10:18:53 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `mysql_data` AS t LIMIT 1
    15/08/05 10:18:53 INFO hive.HiveImport: Loading uploaded data into Hive
    15/08/05 10:18:55 INFO hive.HiveImport: Hive history file=/tmp/user00/hive_job_log_7add82d6-f6b6-49cd-837a-1868d8e3c4a3_1164160356.txt
    15/08/05 10:18:56 INFO hive.HiveImport: OK
    15/08/05 10:18:56 INFO hive.HiveImport: Time taken: 0.575 seconds
    15/08/05 10:18:56 INFO hive.HiveImport: Hive import complete.
    
  4. Check that the table directory was created under Hive's warehouse path on HDFS:
    [user00@master ~]$ hadoop fs -ls /user/hive/warehouse/$DBID.db
    Found 1 items
    drwxr-xr-x   - user00 hive          0 2015-08-05 14:20 /user/hive/warehouse/user00.db/mysql_data
    
  5. Check whether the Hive mysql_data table has any contents; the result should be empty.
    [user00@master ~]$ hadoop fs -ls /user/hive/warehouse/$DBID.db/mysql_data
    
  6. The sqoop create-hive-table command only reads the schema from MySQL and automatically generates a matching schema for the Hive table:
    [user00@master ~]$ hive -e "show tables; use $DBID; describe mysql_data;" 
    Hive history file=/tmp/user00/hive_job_log_54734e8c-7dc5-44b3-8e78-f2ef6479033e_334353074.txt
    OK
    Time taken: 0.764 seconds
    OK
    Time taken: 0.016 seconds
    OK
    id  int 
    name    string  
    Time taken: 0.17 seconds
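
    To see the full DDL that Sqoop generated, including the field delimiter, you can ask Hive for the CREATE TABLE statement (supported in Hive 0.10 and later):
    [user00@master ~]$ hive -e "use $DBID; show create table mysql_data"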
    

Importing the Records

  1. Enter the following command (the difference from the earlier import is the added --hive-import --hive-table mysql_data --hive-database $DBID parameters):
    [user00@master ~]$ export DBID=$USER
    [user00@master ~]$ sqoop import --connect "jdbc:mysql://localhost/$DBID" --table mysql_data --username $DBID -P --hive-import --hive-table mysql_data --hive-database $DBID
    Enter password: (type your MySQL password)
    
  2. If the table import completed successfully, you should see messages similar to the following:
    15/08/05 14:30:04 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
    15/08/05 14:30:04 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
    15/08/05 14:30:04 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
    15/08/05 14:30:04 INFO tool.CodeGenTool: Beginning code generation
    15/08/05 14:30:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `mysql_data` AS t LIMIT 1
    15/08/05 14:30:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `mysql_data` AS t LIMIT 1
    15/08/05 14:30:05 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
    Note: /tmp/sqoop-user00/compile/3a4745a9ba0072591338f577df6c14bb/mysql_data.java uses or overrides a deprecated API.
    Note: Recompile with -Xlint:deprecation for details.
    15/08/05 14:30:07 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-user00/compile/3a4745a9ba0072591338f577df6c14bb/mysql_data.jar
    15/08/05 14:30:07 WARN manager.MySQLManager: It looks like you are importing from mysql.
    15/08/05 14:30:07 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
    15/08/05 14:30:07 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
    15/08/05 14:30:07 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
    15/08/05 14:30:07 INFO mapreduce.ImportJobBase: Beginning import of mysql_data
    15/08/05 14:30:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    15/08/05 14:30:09 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `mysql_data`
    15/08/05 14:30:09 INFO mapred.JobClient: Running job: job_201508050909_0014
    15/08/05 14:30:10 INFO mapred.JobClient:  map 0% reduce 0%
    15/08/05 14:30:20 INFO mapred.JobClient:  map 100% reduce 0%
    15/08/05 14:30:21 INFO mapred.JobClient: Job complete: job_201508050909_0014
    15/08/05 14:30:21 INFO mapred.JobClient: Counters: 23
    15/08/05 14:30:21 INFO mapred.JobClient:   File System Counters
    15/08/05 14:30:21 INFO mapred.JobClient:     FILE: Number of bytes read=0
    15/08/05 14:30:21 INFO mapred.JobClient:     FILE: Number of bytes written=392280
    15/08/05 14:30:21 INFO mapred.JobClient:     FILE: Number of read operations=0
    15/08/05 14:30:21 INFO mapred.JobClient:     FILE: Number of large read operations=0
    15/08/05 14:30:21 INFO mapred.JobClient:     FILE: Number of write operations=0
    15/08/05 14:30:21 INFO mapred.JobClient:     HDFS: Number of bytes read=197
    15/08/05 14:30:21 INFO mapred.JobClient:     HDFS: Number of bytes written=16
    15/08/05 14:30:21 INFO mapred.JobClient:     HDFS: Number of read operations=2
    15/08/05 14:30:21 INFO mapred.JobClient:     HDFS: Number of large read operations=0
    15/08/05 14:30:21 INFO mapred.JobClient:     HDFS: Number of write operations=2
    15/08/05 14:30:21 INFO mapred.JobClient:   Job Counters 
    15/08/05 14:30:21 INFO mapred.JobClient:     Launched map tasks=2
    15/08/05 14:30:21 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=13832
    15/08/05 14:30:21 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
    15/08/05 14:30:21 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    15/08/05 14:30:21 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    15/08/05 14:30:21 INFO mapred.JobClient:   Map-Reduce Framework
    15/08/05 14:30:21 INFO mapred.JobClient:     Map input records=2
    15/08/05 14:30:21 INFO mapred.JobClient:     Map output records=2
    15/08/05 14:30:21 INFO mapred.JobClient:     Input split bytes=197
    15/08/05 14:30:21 INFO mapred.JobClient:     Spilled Records=0
    15/08/05 14:30:21 INFO mapred.JobClient:     CPU time spent (ms)=2800
    15/08/05 14:30:21 INFO mapred.JobClient:     Physical memory (bytes) snapshot=325623808
    15/08/05 14:30:21 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3196616704
    15/08/05 14:30:21 INFO mapred.JobClient:     Total committed heap usage (bytes)=471465984
    15/08/05 14:30:21 INFO mapreduce.ImportJobBase: Transferred 16 bytes in 13.8284 seconds (1.157 bytes/sec)
    15/08/05 14:30:21 INFO mapreduce.ImportJobBase: Retrieved 2 records.
    15/08/05 14:30:21 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `mysql_data` AS t LIMIT 1
    15/08/05 14:30:21 INFO hive.HiveImport: Removing temporary files from import process: hdfs://master:8020/user/user00/mysql_data/_logs
    15/08/05 14:30:21 INFO hive.HiveImport: Loading uploaded data into Hive
    15/08/05 14:30:23 INFO hive.HiveImport: Hive history file=/tmp/user00/hive_job_log_ede2d76e-c4f2-4983-9bcd-eea96ed17f11_50835470.txt
    15/08/05 14:30:24 INFO hive.HiveImport: OK
    15/08/05 14:30:24 INFO hive.HiveImport: Time taken: 0.69 seconds
    15/08/05 14:30:25 INFO hive.HiveImport: Loading data to table user00.mysql_data
    15/08/05 14:30:25 INFO hive.HiveImport: chgrp: changing ownership of '/user/hive/warehouse/user00.db/mysql_data/_SUCCESS': User does not belong to hive
    15/08/05 14:30:25 INFO hive.HiveImport: chgrp: changing ownership of '/user/hive/warehouse/user00.db/mysql_data/part-m-00000': User does not belong to hive
    15/08/05 14:30:25 INFO hive.HiveImport: chgrp: changing ownership of '/user/hive/warehouse/user00.db/mysql_data/part-m-00001': User does not belong to hive
    15/08/05 14:30:25 INFO hive.HiveImport: Table user00.mysql_data stats: [num_partitions: 0, num_files: 3, num_rows: 0, total_size: 16, raw_data_size: 0]
    15/08/05 14:30:25 INFO hive.HiveImport: OK
    15/08/05 14:30:25 INFO hive.HiveImport: Time taken: 0.896 seconds
    15/08/05 14:30:25 INFO hive.HiveImport: Hive import complete.
    15/08/05 14:30:25 INFO hive.HiveImport: Export directory is empty, removing it.
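
    Note that re-running this sqoop import appends another copy of the rows to the Hive table. If you need to re-import from scratch, add the --hive-overwrite flag so the existing table contents are replaced (a variant of the command above):
    [user00@master ~]$ sqoop import --connect "jdbc:mysql://localhost/$DBID" --table mysql_data --username $DBID -P --hive-import --hive-overwrite --hive-table mysql_data --hive-database $DBID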
    

Checking the Import Results

  1. The imported results can be found under the HDFS path where Hive stores its data:
    [user00@master ~]$ hadoop fs -ls /user/hive/warehouse/$DBID.db/mysql_data/
    Found 3 items
    -rw-r--r--   1 user00 user00          0 2015-08-05 14:30 /user/hive/warehouse/user00.db/mysql_data/_SUCCESS
    -rw-r--r--   1 user00 user00          8 2015-08-05 14:30 /user/hive/warehouse/user00.db/mysql_data/part-m-00000
    -rw-r--r--   1 user00 user00          8 2015-08-05 14:30 /user/hive/warehouse/user00.db/mysql_data/part-m-00001
    
  2. View the converted part-m-***** output files:
    [user00@master ~]$ hadoop fs -cat /user/hive/warehouse/$DBID.db/mysql_data/part-*
    1Hello
    2World
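
    The id and name fields look glued together because Sqoop used Hive's default field delimiter, the non-printing Ctrl-A (\001) character. You can make it visible by piping the output through cat -v, which should show something like:
    [user00@master ~]$ hadoop fs -cat /user/hive/warehouse/$DBID.db/mysql_data/part-* | cat -v
    1^AHello
    2^AWorld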
    

Observations
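
  1. As a final check (a suggested observation, not spelled out in the original hand-out), you can query the imported records directly from Hive and confirm they match the source MySQL table:
    [user00@master ~]$ hive -e "use $DBID; select * from mysql_data"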

