Lab 4

Hive (1): Importing CSV Data into Hive

Downloading the sample CSV data: Moneyball

  1. Big data applications have reached many corners of everyday life; the movie "Moneyball" is one example. Sean Lahman publishes baseball statistics in CSV format on his website, released under the Creative Commons Attribution-ShareAlike 3.0 Unported license, so we will use this dataset as our example for learning how to import CSV data into Hive.
  2. First, download the 2012 season statistics and unzip the archive.
    [user00@master ~]$ mkdir baseball
    [user00@master ~]$ cd baseball/
    [user00@master baseball]$ wget http://seanlahman.com/files/database/lahman2012-csv.zip
    [user00@master baseball]$ unzip lahman2012-csv.zip
    
  3. Next, we must upload the data to HDFS. Let's create a directory named baseball on HDFS and upload all the CSV files into it.
    [user00@master baseball]$ hadoop fs -mkdir baseball
    [user00@master baseball]$ hadoop fs -put *.csv baseball
    [user00@master baseball]$ hadoop fs -ls baseball
    Found 24 items
    -rw-r--r--   1 user00 user00     198529 2015-08-05 11:50 baseball/AllstarFull.csv
    -rw-r--r--   1 user00 user00    5730747 2015-08-05 11:50 baseball/Appearances.csv
    -rw-r--r--   1 user00 user00       7304 2015-08-05 11:50 baseball/AwardsManagers.csv
    -rw-r--r--   1 user00 user00     240867 2015-08-05 11:50 baseball/AwardsPlayers.csv
    -rw-r--r--   1 user00 user00      16719 2015-08-05 11:50 baseball/AwardsShareManagers.csv
    -rw-r--r--   1 user00 user00     220135 2015-08-05 11:50 baseball/AwardsSharePlayers.csv
    -rw-r--r--   1 user00 user00    6488747 2015-08-05 11:50 baseball/Batting.csv
    -rw-r--r--   1 user00 user00     644669 2015-08-05 11:50 baseball/BattingPost.csv
    -rw-r--r--   1 user00 user00    8171830 2015-08-05 11:50 baseball/Fielding.csv
    -rw-r--r--   1 user00 user00     298470 2015-08-05 11:50 baseball/FieldingOF.csv
    -rw-r--r--   1 user00 user00     573945 2015-08-05 11:50 baseball/FieldingPost.csv
    -rw-r--r--   1 user00 user00     175990 2015-08-05 11:50 baseball/HallOfFame.csv
    -rw-r--r--   1 user00 user00     130719 2015-08-05 11:50 baseball/Managers.csv
    -rw-r--r--   1 user00 user00       3662 2015-08-05 11:50 baseball/ManagersHalf.csv
    -rw-r--r--   1 user00 user00    3049250 2015-08-05 11:50 baseball/Master.csv
    -rw-r--r--   1 user00 user00    3602473 2015-08-05 11:50 baseball/Pitching.csv
    -rw-r--r--   1 user00 user00     381812 2015-08-05 11:50 baseball/PitchingPost.csv
    -rw-r--r--   1 user00 user00     700024 2015-08-05 11:50 baseball/Salaries.csv
    -rw-r--r--   1 user00 user00      42933 2015-08-05 11:50 baseball/Schools.csv
    -rw-r--r--   1 user00 user00     180758 2015-08-05 11:50 baseball/SchoolsPlayers.csv
    -rw-r--r--   1 user00 user00       8369 2015-08-05 11:50 baseball/SeriesPost.csv
    -rw-r--r--   1 user00 user00     550032 2015-08-05 11:50 baseball/Teams.csv
    -rw-r--r--   1 user00 user00       3238 2015-08-05 11:50 baseball/TeamsFranchises.csv
    -rw-r--r--   1 user00 user00       1609 2015-08-05 11:50 baseball/TeamsHalf.csv
    

Creating a Hive Database

  1. Although the CSV files include column definitions, Hive cannot automatically infer the data type of each column, so we must define the Hive table schema ourselves according to the data source's documentation.
  2. To see which columns each CSV file contains, consult the 2012readme.txt documentation file. The data comes from sites such as Open Source Sports, which publish updated statistics every year. The documentation also mentions an MS Access version of the database; if needed, you can download that version and convert it from MS Access to MS SQL Server and then into Hive tables, saving you the step of defining the schema yourself.
    MS Access Versions:
          lahman2012.mdb
          2012readme.txt
    
  3. The overall workflow for importing CSV data into Hive is: create a database, create a table with a matching schema, and then load the CSV data into the table.
  4. First, like most other database systems, Hive provides an interactive shell environment. Run the command hive at the command line to enter Hive's interactive query interface.
    [user00@master baseball]$ hive
    Hive history file=/tmp/user00/hive_job_log_f6a9efc9-aaad-4d2a-9c6e-469664367645_315092396.txt
    hive> 
    
  5. In the steps that follow, commands shown with a hive> prompt are executed inside the Hive interactive shell.
  6. You can use show databases to list the existing databases.
    hive> show databases;
    OK
    default
    user00
    Time taken: 0.698 seconds
    
  7. To leave the Hive shell, enter quit; or exit;
    hive> quit;
    
  8. Sometimes, if you just want to run a short query, you can execute it directly with hive -e <HiveQL statement>.
  9. Next, let's learn the syntax for creating a database in Hive.
  10. Hive's query language is called Hive Query Language (HiveQL for short).
  11. The HiveQL syntax for creating a database is create database <database name>
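  12. Putting the two together, creating your own database can be sketched as follows (userXX is a placeholder for your database name), either inside the Hive shell or as a hive -e one-liner:
    hive> create database userXX;
    
    [user00@master baseball]$ hive -e "create database userXX;"
    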

Creating a Hive Table

  1. Next, let's pick a CSV file we want to analyze, for example Master.csv, and first understand its columns and data types. The documentation says:
      MASTER - Player names, DOB, and biographical info
    ---
    2.1 MASTER table
    ---
    lahmanID       Unique number assigned to each player
    playerID       A unique code assigned to each player.  The playerID links
                   the data in this file with records in the other files.
    managerID      An ID for individuals who served as managers
    hofID          An ID for individuals who are in the baseball Hall of Fame
    birthYear      Year player was born
    birthMonth     Month player was born
    birthDay       Day player was born
    birthCountry   Country where player was born
    birthState     State where player was born
    birthCity      City where player was born
    deathYear      Year player died
    deathMonth     Month player died
    deathDay       Day player died
    deathCountry   Country where player died
    deathState     State where player died
    deathCity      City where player died
    nameFirst      Player's first name
    nameLast       Player's last name
    nameNote       Note about player's name (usually signifying that they changed
                   their name or played under two different names)
    nameGiven      Player's given name (typically first and middle)
    nameNick       Player's nickname
    weight         Player's weight in pounds
    height         Player's height in inches
    bats           Player's batting hand (left, right, or both)
    throws         Player's throwing hand (left or right)
    debut          Date that player made first major league appearance
    finalGame      Date of player's final major league appearance (blank if still active)
    college        College attended
    lahman40ID     ID used in Lahman Database version 4.0
    lahman45ID     ID used in Lahman database version 4.5
    retroID        ID used by retrosheet
    holtzID        ID used by Sean Holtz's Baseball Almanac
    bbrefID        ID used by Baseball Reference website
    
    Data types for the Master table:
    Field         Example                              Type
    lahmanID      1                                    INT
    playerID      aaronha01                            STRING
    managerID     NULL                                 INT
    hofID         aaronha01h                           STRING
    birthYear     1934                                 INT
    birthMonth    2                                    INT
    birthDay      5                                    INT
    birthCountry  USA                                  STRING
    birthState    AL                                   STRING
    birthCity     Mobile                               STRING
    deathYear     NULL                                 INT
    deathMonth    NULL                                 INT
    deathDay      NULL                                 INT
    deathCountry  NULL                                 STRING
    deathState    NULL                                 STRING
    deathCity     NULL                                 STRING
    nameFirst     Hank                                 STRING
    nameLast      Aaron                                STRING
    nameNote      NULL                                 STRING
    nameGiven     Henry Louis                          STRING
    nameNick      "Hammer, Hammerin' Hank, Bad Henry"  STRING
    weight        180                                  INT
    height        72                                   INT
    bats          R                                    STRING
    throws        R                                    STRING
    debut         4/13/1954                            STRING
    finalGame     10/3/1976                            STRING
    college       NULL                                 STRING
    lahman40ID    aaronha01                            STRING
    lahman45ID    aaronha01                            STRING
    retroID       aaroh101                             STRING
    holtzID       aaronha01                            STRING
    bbrefID       aaronha01                            STRING
  2. The HiveQL syntax for creating a table is create table <database name>.<table name> ( column1 type1, ..., columnN typeN ). In the command below, $DBID is assumed to be a shell environment variable that holds your database name (e.g. user00):
    [user00@master baseball]$ hive -e "create table $DBID.Master
    ( lahmanID INT, playerID STRING, managerID INT, hofID STRING, 
      birthYear INT, birthMonth INT, birthDay INT, birthCountry STRING, 
      birthState STRING, birthCity STRING, deathYear INT, deathMonth INT,
      deathDay INT, deathCountry STRING, deathState STRING, deathCity STRING,
      nameFirst STRING, nameLast STRING, nameNote STRING, nameGiven STRING,
      nameNick STRING, weight INT, height INT, bats STRING, throws STRING,
      debut STRING, finalGame STRING, college STRING, lahman40ID STRING,
      lahman45ID STRING, retroID STRING, holtzID STRING, bbrefID STRING )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;"
    

Importing CSV Data into a Hive Table

  1. The HiveQL syntax for loading data from a CSV file into a Hive table is LOAD DATA LOCAL INPATH <file path> [OVERWRITE] INTO TABLE <table name>
  2. Here is how to run it with the hive -e command:
    [user00@master baseball]$ hive -e "LOAD DATA LOCAL INPATH \"Master.csv\" OVERWRITE INTO TABLE $DBID.master;"
    
    If the command runs successfully, you will see output like the following:
    Hive history file=/tmp/user00/hive_job_log_a399a8dd-f313-4b63-929a-f545bfbdd559_1877336452.txt
    Copying data from file:/home/user00/baseball/Master.csv
    Copying file: file:/home/user00/baseball/Master.csv
    Loading data to table user00.master
    rmr: DEPRECATED: Please use 'rm -r' instead.
    Moved: 'hdfs://master:8020/user/hive/warehouse/user00.db/master' to trash at: hdfs://master:8020/user/user00/.Trash/Current
    chgrp: changing ownership of '/user/hive/warehouse/user00.db/master': User does not belong to hive
    Table user00.master stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 3049250, raw_data_size: 0]
    OK
    Time taken: 1.558 seconds
    
  3. If you are using the Hive shell instead, run LOAD DATA LOCAL INPATH "Master.csv" OVERWRITE INTO TABLE userXX.master; and replace userXX in the HiveQL by hand (where XX is your number).
  4. Because the LOCAL keyword is used here, Master.csv is read from the current working directory. Without the LOCAL keyword, the path refers to a file on HDFS, in which case an absolute path is recommended, e.g. LOAD DATA INPATH "/user/userXX/baseball/Master.csv" OVERWRITE INTO TABLE userXX.master;. The result looks like this:
    hive> LOAD DATA INPATH "/user/user00/baseball/Master.csv" OVERWRITE INTO TABLE user00.master;
    Loading data to table user00.master
    rmr: DEPRECATED: Please use 'rm -r' instead.
    Moved: 'hdfs://master:8020/user/hive/warehouse/user00.db/master' to trash at: hdfs://master:8020/user/user00/.Trash/Current
    chgrp: changing ownership of '/user/hive/warehouse/user00.db/master/Master.csv': User does not belong to hive
    Table user00.master stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 3049250, raw_data_size: 0]
    OK
    Time taken: 1.309 seconds
    
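  5. As an optional sanity check (a sketch; the warehouse path below is taken from the log messages above and assumes database user00), you can confirm that the loaded file now sits under Hive's warehouse directory on HDFS:
    [user00@master baseball]$ hadoop fs -ls /user/hive/warehouse/user00.db/master
    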

Verifying the Import Results

  1. First, let's use HiveQL to check that the database was created properly. The syntax is SHOW DATABASES;
    [user00@master baseball]$ hive
    Hive history file=/tmp/user00/hive_job_log_33cf301e-65b1-4965-9ea6-af0895bd05e6_1775863642.txt
    hive> SHOW DATABASES;
    OK
    default
    user00
    Time taken: 0.665 seconds
    
  2. Next, let's switch the default database to the userXX database we just created. The syntax is USE <database name>;
    hive> USE userXX;
    OK
    Time taken: 0.014 seconds
    
  3. We can use SQL-like syntax to list the tables in the current database. The syntax is SHOW TABLES;
    hive> SHOW TABLES;
    OK
    master
    mysql_data
    Time taken: 0.086 seconds
    
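    Besides listing tables, standard HiveQL also lets you inspect a table's columns and types with DESCRIBE, a handy way to confirm the schema we defined earlier:
    hive> DESCRIBE master;
    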
  4. Next, we can check whether the contents of the newly created userXX.master table look correct. Since HiveQL is highly compatible with SQL-92, we can retrieve data with the standard SELECT * FROM <table name> WHERE <condition>; syntax.
    hive> SELECT * FROM Master LIMIT 10;
    hive> SELECT lahmanID FROM Master LIMIT 10;
    
  5. After running the queries above, you may have noticed that the more conditions a query has, the longer Hive takes to execute it. That is because Hive must translate your query into MapReduce jobs.
    hive> SELECT lahmanID FROM Master WHERE birthyear > 1900 LIMIT 10;
    hive> SELECT COUNT( * ) FROM Master;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=
    Starting Job = job_201508050909_0008, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201508050909_0008
    Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201508050909_0008
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2015-08-05 12:23:57,169 Stage-1 map = 0%,  reduce = 0%
    2015-08-05 12:24:02,201 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.74 sec
    2015-08-05 12:24:03,212 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.74 sec
    2015-08-05 12:24:04,220 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.74 sec
    2015-08-05 12:24:05,229 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.74 sec
    2015-08-05 12:24:06,237 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.74 sec
    2015-08-05 12:24:07,251 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.37 sec
    2015-08-05 12:24:08,261 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.37 sec
    2015-08-05 12:24:09,271 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.37 sec
    MapReduce Total cumulative CPU time: 3 seconds 370 msec
    Ended Job = job_201508050909_0008
    MapReduce Jobs Launched: 
    Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.37 sec   HDFS Read: 3049479 HDFS Write: 6 SUCCESS
    Total MapReduce CPU Time Spent: 3 seconds 370 msec
    OK
    18126
    Time taken: 16.922 seconds
    hive> quit;
    
  6. Adding the -S flag to the hive command suppresses the MapReduce progress messages written to standard error (STDERR).
    [user00@master baseball]$ hive -S -e "use $DBID; SELECT birthyear, lahmanID, nameFirst FROM master WHERE birthyear > 1900 limit 10;"
    1934    1   Hank
    1939    2   Tommie
    1954    3   Don
    1972    4   Andy
    1951    11  Glenn
    1972    12  Jeff
    1967    13  Jim
    1969    14  Kurt
    1968    15  Kyle
    1967    17  Paul
    

References

  1. "Build a data library with Hive", Peter J. Jamack, Big Data Analytics Consultant, 21 May 2013, IBM developerWorks.

This document was last updated: