實作十一

Impala (4) : PARTITION 範例

面對持續增加的資料來源(incremental data source),有時候為了增加定時匯入的彈性,會採用 PARTITION (分區) 的方式來建立資料表。其中,最典型的是時間序列日誌資料,就可以使用時間戳記(Time Stamp)來當作 Partition。

建立 PARTITION 範例資料表

以下範例是建立一個依序根據『年』、『月』、『日』、『主機名稱』當作 Partition 的資料表 $DBID.logs ,請於 impala-shell 中執行以下指令:

use userXX;
CREATE TABLE logs ( field1 STRING , field2 STRING , field3 STRING ) PARTITIONED BY ( year STRING , month STRING , day STRING , host STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
INSERT INTO logs PARTITION ( year = "2014" , month = "11" , day = "30" , host = "host1" ) VALUES ( "FTP", "192.168.70.10" , "/home/jazz/test1.log" ) ;
INSERT INTO logs PARTITION ( year = "2014" , month = "11" , day = "30" , host = "host2" ) VALUES ( "FTP", "192.168.70.11" , "/home/jazz/test2.log" ) ;
INSERT INTO logs PARTITION ( year = "2014" , month = "12" , day = "01" , host = "host1" ) VALUES ( "FTP", "192.168.70.10" , "/home/jazz/test3.log" ) ;
INSERT INTO logs PARTITION ( year = "2014" , month = "12" , day = "01" , host = "host2" ) VALUES ( "FTP", "192.168.70.11" , "/home/jazz/test4.log" ) ;
INSERT INTO logs PARTITION ( year = "2014" , month = "12" , day = "02" , host = "host1" ) VALUES ( "FTP", "192.168.70.10" , "/home/jazz/test5.log" ) ;
INSERT INTO logs PARTITION ( year = "2014" , month = "12" , day = "02" , host = "host2" ) VALUES ( "FTP", "192.168.70.11" , "/home/jazz/test6.log" ) ;

實際執行結果如下所示:

[user00@master ~]$ impala-shell 
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

[master:21000] > use user00;
Query: use user00
[master:21000] > CREATE TABLE logs ( field1 STRING , field2 STRING , field3 STRING ) PARTITIONED BY ( year STRING , month STRING , day STRING , host STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
Query: create TABLE logs ( field1 STRING , field2 STRING , field3 STRING ) PARTITIONED BY ( year STRING , month STRING , day STRING , host STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Fetched 0 row(s) in 1.03s
[master:21000] > INSERT INTO logs PARTITION ( year = "2014" , month = "11" , day = "30" , host = "host1" ) VALUES ( "FTP", "192.168.70.10" , "/home/jazz/test1.log" ) ;
Query: insert INTO logs PARTITION ( year = "2014" , month = "11" , day = "30" , host = "host1" ) VALUES ( "FTP", "192.168.70.10" , "/home/jazz/test1.log" )

Inserted 1 row(s) in 7.93s
[master:21000] > INSERT INTO logs PARTITION ( year = "2014" , month = "11" , day = "30" , host = "host2" ) VALUES ( "FTP", "192.168.70.11" , "/home/jazz/test2.log" ) ;
Query: insert INTO logs PARTITION ( year = "2014" , month = "11" , day = "30" , host = "host2" ) VALUES ( "FTP", "192.168.70.11" , "/home/jazz/test2.log" )
Inserted 1 row(s) in 3.22s
[master:21000] > INSERT INTO logs PARTITION ( year = "2014" , month = "12" , day = "01" , host = "host1" ) VALUES ( "FTP", "192.168.70.10" , "/home/jazz/test3.log" ) ;
Query: insert INTO logs PARTITION ( year = "2014" , month = "12" , day = "01" , host = "host1" ) VALUES ( "FTP", "192.168.70.10" , "/home/jazz/test3.log" )
Inserted 1 row(s) in 3.22s
[master:21000] > INSERT INTO logs PARTITION ( year = "2014" , month = "12" , day = "01" , host = "host2" ) VALUES ( "FTP", "192.168.70.11" , "/home/jazz/test4.log" ) ;
Query: insert INTO logs PARTITION ( year = "2014" , month = "12" , day = "01" , host = "host2" ) VALUES ( "FTP", "192.168.70.11" , "/home/jazz/test4.log" )
Inserted 1 row(s) in 3.22s
[master:21000] > INSERT INTO logs PARTITION ( year = "2014" , month = "12" , day = "02" , host = "host1" ) VALUES ( "FTP", "192.168.70.10" , "/home/jazz/test5.log" ) ;
Query: insert INTO logs PARTITION ( year = "2014" , month = "12" , day = "02" , host = "host1" ) VALUES ( "FTP", "192.168.70.10" , "/home/jazz/test5.log" )
Inserted 1 row(s) in 3.22s
[master:21000] > INSERT INTO logs PARTITION ( year = "2014" , month = "12" , day = "02" , host = "host2" ) VALUES ( "FTP", "192.168.70.11" , "/home/jazz/test6.log" ) ;
Query: insert INTO logs PARTITION ( year = "2014" , month = "12" , day = "02" , host = "host2" ) VALUES ( "FTP", "192.168.70.11" , "/home/jazz/test6.log" )
Inserted 1 row(s) in 3.22s
[master:21000] > 
[master:21000] > exit;
Goodbye user00

觀察目錄結構

前面我們已經知道 Impala 呼叫 INSERT INTO … VALUES 時,會在對應的 HDFS 路徑 /user/hive/warehouse 中,建立檔案。讓我們觀察一下若是啟用 PARTITION 的支援,其目錄結構會以何種方式呈現。請在 Bash Shell 環境中執行以下指令:

hadoop fs -ls -R /user/hive/warehouse/$DBID.db/logs

執行結果如下所示,我們可以注意到 userXX 資料庫的 logs 資料表,其擁有者身份為 impala 而非執行 Impala SQL 語法的 userXX 身份。 其次,看起來子目錄會以 Partition 名稱命名。

[user00@master ~]$ hadoop fs -ls -R /user/hive/warehouse/$DBID.db/logs
drwxrwxr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/_impala_insert_staging
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=11
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=11/day=30
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=11/day=30/host=host1
-rw-r--r--   1 impala hive         39 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=11/day=30/host=host1/a344c3a1f775f2bd-16782a8c004673b9_1736974311_data.0.
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=11/day=30/host=host2
-rw-r--r--   1 impala hive         39 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=11/day=30/host=host2/154ef1f9d8a02eaa-46031f5be8b2f9af_2025069347_data.0.
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=12
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=12/day=01
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=12/day=01/host=host1
-rw-r--r--   1 impala hive         39 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=12/day=01/host=host1/f24db370508f27ed-2ab4e41ed8cd5aab_1687234502_data.0.
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=12/day=01/host=host2
-rw-r--r--   1 impala hive         39 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=12/day=01/host=host2/6b4f4f6081cf2022-2c451c0633aad291_1506300064_data.0.
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=12/day=02
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=12/day=02/host=host1
-rw-r--r--   1 impala hive         39 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=12/day=02/host=host1/294d1d964126515f-cb5888e0afb21b6_1015019318_data.0.
drwxr-xr-x   - impala hive          0 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=12/day=02/host=host2
-rw-r--r--   1 impala hive         39 2015-08-06 00:26 /user/hive/warehouse/user00.db/logs/year=2014/month=12/day=02/host=host2/fb47e1da3ef966bb-314a42f0797b4aa2_264314387_data.0.

根據以上的目錄結構,對照建立資料表的語法,不難發現每個 Partition 本身就是一個子目錄,並以宣告的順序來決定其階層架構。以下是 $DBID.logs 執行六個 INSERT 語句後的目錄結構:

/user/hive/warehouse/$DBID.db/logs/
└── year=2014
    ├── month=11
    │   └── day=30
    │       ├── host=host1
    │       └── host=host2
    └── month=12
        ├── day=01
        │   ├── host=host1
        │   └── host=host2
        └── day=02
            ├── host=host1
            └── host=host2

執行範例查詢

雖然目錄結構長得很複雜,但是對於查詢而言並沒有太大的影響。讓我們使用以下語句,來查詢 $DBID.logs 資料表:

impala-shell -q "SELECT * FROM $DBID.logs;"

執行結果如下:

[user00@master ~]$ impala-shell -q "SELECT * FROM $DBID.logs;"
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Query: select * FROM user00.logs
+--------+---------------+----------------------+------+-------+-----+-------+
| field1 | field2        | field3               | year | month | day | host  |
+--------+---------------+----------------------+------+-------+-----+-------+
| FTP    | 192.168.70.11 | /home/jazz/test2.log | 2014 | 11    | 30  | host2 |
| FTP    | 192.168.70.11 | /home/jazz/test4.log | 2014 | 12    | 01  | host2 |
| FTP    | 192.168.70.10 | /home/jazz/test5.log | 2014 | 12    | 02  | host1 |
| FTP    | 192.168.70.10 | /home/jazz/test1.log | 2014 | 11    | 30  | host1 |
| FTP    | 192.168.70.11 | /home/jazz/test6.log | 2014 | 12    | 02  | host2 |
| FTP    | 192.168.70.10 | /home/jazz/test3.log | 2014 | 12    | 01  | host1 |
+--------+---------------+----------------------+------+-------+-----+-------+
Fetched 6 row(s) in 0.27s

參考資料

  1. Partitioning

本文件最後更新於: