實作十

Impala (3) : LOAD DATA 語法

接下來,讓我們使用類似 Hive 的方式,將已經存在的 CSV 檔,匯入 Impala 資料表。

準備範例資料

首先,讓我們產生兩個範例 CSV 檔 table1.csv 與 table2.csv。請在 SSH 的 Shell 環境中,分別執行以下三段指令:

cat > table1.csv << EOF
1,true,123.123,2012-10-24 08:55:00 
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
EOF

cat > table2.csv << EOF
1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
5,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325
EOF

hadoop fs -mkdir -p /tmp/$USER
hadoop fs -chmod a+w /tmp/$USER
hadoop fs -put table1.csv /tmp/$USER/table1.csv
hadoop fs -put table2.csv /tmp/$USER/table2.csv
hadoop fs -ls /tmp/$USER

執行結果如下所示:

[user00@master ~]$ cat > table1.csv << EOF
> 1,true,123.123,2012-10-24 08:55:00 
> 2,false,1243.5,2012-10-25 13:40:00
> 3,false,24453.325,2008-08-22 09:33:21.123
> 4,false,243423.325,2007-05-12 22:32:21.33454
> 5,true,243.325,1953-04-22 09:11:33
> EOF
[user00@master ~]$ cat > table2.csv << EOF
> 1,true,12789.123
> 2,false,1243.5
> 3,false,24453.325
> 4,false,2423.3254
> 5,true,243.325
> 60,false,243565423.325
> 70,true,243.325
> 80,false,243423.325
> 90,true,243.325
> EOF
[user00@master ~]$ hadoop fs -mkdir -p /tmp/$USER
[user00@master ~]$ hadoop fs -chmod a+w /tmp/$USER
[user00@master ~]$ hadoop fs -put table1.csv /tmp/$USER/table1.csv
[user00@master ~]$ hadoop fs -put table2.csv /tmp/$USER/table2.csv
[user00@master ~]$ hadoop fs -ls /tmp/$USER
Found 2 items
-rw-r--r--   1 user00 supergroup        193 2015-08-06 00:06 /tmp/user00/table1.csv
-rw-r--r--   1 user00 supergroup        158 2015-08-06 00:07 /tmp/user00/table2.csv

建立 Impala 資料表

讓我們在 userXX 資料庫底下,建立兩個資料表 t1 與 t2 。請執行以下指令:

impala-shell -q "CREATE TABLE $DBID.t1 ( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
CREATE TABLE $DBID.t2 ( id INT, col_1 BOOLEAN, col_2 DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
LOAD DATA INPATH '/tmp/$USER/table1.csv' INTO TABLE $DBID.t1 ;
LOAD DATA INPATH '/tmp/$USER/table2.csv' INTO TABLE $DBID.t2 ;"
執行結果如下:
[user00@master ~]$ impala-shell -q "CREATE TABLE $DBID.t1 ( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
CREATE TABLE $DBID.t2 ( id INT, col_1 BOOLEAN, col_2 DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
LOAD DATA INPATH '/tmp/$USER/table1.csv' INTO TABLE $DBID.t1 ;
LOAD DATA INPATH '/tmp/$USER/table2.csv' INTO TABLE $DBID.t2 ;"
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Query: create TABLE user00.t1 ( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Fetched 0 row(s) in 1.03s
Query: create TABLE user00.t2 ( id INT, col_1 BOOLEAN, col_2 DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Fetched 0 row(s) in 1.02s
Query: load DATA INPATH '/tmp/user00/table1.csv' INTO TABLE user00.t1
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Fetched 1 row(s) in 4.84s
Query: load DATA INPATH '/tmp/user00/table2.csv' INTO TABLE user00.t2
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Fetched 1 row(s) in 6.09s

驗證匯入資料

讓我們先來檢驗一下,資料是否正確地匯入到資料表 userXX.t1 與 userXX.t2 。請執行以下指令:
impala-shell -q "use $DBID; SELECT * FROM t1; SELECT * FROM t2;"

執行結果如下所示:

[user00@master ~]$ impala-shell -q "use $DBID; SELECT * FROM t1; SELECT * FROM t2;"
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Query: use user00
Query: select * FROM t1
+----+-------+------------+-------------------------------+
| id | col_1 | col_2      | col_3                         |
+----+-------+------------+-------------------------------+
| 1  | true  | 123.123    | 2012-10-24 08:55:00           |
| 2  | false | 1243.5     | 2012-10-25 13:40:00           |
| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5  | true  | 243.325    | 1953-04-22 09:11:33           |
+----+-------+------------+-------------------------------+
Fetched 5 row(s) in 0.30s
Query: select * FROM t2
+----+-------+---------------+
| id | col_1 | col_2         |
+----+-------+---------------+
| 1  | true  | 12789.123     |
| 2  | false | 1243.5        |
| 3  | false | 24453.325     |
| 4  | false | 2423.3254     |
| 5  | true  | 243.325       |
| 60 | false | 243565423.325 |
| 70 | true  | 243.325       |
| 80 | false | 243423.325    |
| 90 | true  | 243.325       |
+----+-------+---------------+
Fetched 9 row(s) in 0.34s

執行複雜查詢

讓我們執行一個比較複雜的 JOIN 查詢,看看需要花多久時間才能得到回應。請執行以下指令:

impala-shell -q "use $DBID; SELECT t1.id, t1.col_1, MAX(t2.col_2), MIN(t2.col_2) FROM t2 JOIN t1 USING(id) GROUP BY id,col_1 ORDER BY id limit 10;"

實際執行結果如下所示:

[user00@master ~]$ impala-shell -q "use $DBID; SELECT t1.id, t1.col_1, MAX(t2.col_2), MIN(t2.col_2) FROM t2 JOIN t1 USING(id) GROUP BY id,col_1 ORDER BY id limit 10;"
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Query: use user00
Query: select t1.id, t1.col_1, MAX(t2.col_2), MIN(t2.col_2) FROM t2 JOIN t1 USING(id) GROUP BY id,col_1 ORDER BY id limit 10
+----+-------+---------------+---------------+
| id | col_1 | max(t2.col_2) | min(t2.col_2) |
+----+-------+---------------+---------------+
| 1  | true  | 12789.123     | 12789.123     |
| 2  | false | 1243.5        | 1243.5        |
| 3  | false | 24453.325     | 24453.325     |
| 4  | false | 2423.3254     | 2423.3254     |
| 5  | true  | 243.325       | 243.325       |
+----+-------+---------------+---------------+
Fetched 5 row(s) in 0.75s

EXPLAIN 語句

實務上可以使用 EXPLAIN 語句來瞭解 Impala SQL 查詢的 Query Plan。以下是上述複雜查詢的 Query Plan 結果:

[user00@master ~]$ impala-shell -q "use $DBID; EXPLAIN SELECT t1.id, t1.col_1, MAX(t2.col_2), MIN(t2.col_2) FROM t2 JOIN t1 USING(id) GROUP BY id,col_1 ORDER BY id limit 10;"
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Query: use user00
Query: explain SELECT t1.id, t1.col_1, MAX(t2.col_2), MIN(t2.col_2) FROM t2 JOIN t1 USING(id) GROUP BY id,col_1 ORDER BY id limit 10
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=2.16GB VCores=2                            |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| user00.t1, user00.t2                                                               |
|                                                                                    |
| 08:MERGING-EXCHANGE [UNPARTITIONED]                                                |
| |  order by: t1.id ASC                                                             |
| |  limit: 10                                                                       |
| |                                                                                  |
| 04:TOP-N [LIMIT=10]                                                                |
| |  order by: t1.id ASC                                                             |
| |                                                                                  |
| 07:AGGREGATE [FINALIZE]                                                            |
| |  output: max:merge(t2.col_2), min:merge(t2.col_2)                                |
| |  group by: t1.id, t1.col_1                                                       |
| |                                                                                  |
| 06:EXCHANGE [HASH(t1.id,t1.col_1)]                                                 |
| |                                                                                  |
| 03:AGGREGATE                                                                       |
| |  output: max(t2.col_2), min(t2.col_2)                                            |
| |  group by: t1.id, t1.col_1                                                       |
| |                                                                                  |
| 02:HASH JOIN [INNER JOIN, BROADCAST]                                               |
| |  hash predicates: user00.t2.id = user00.t1.id                                    |
| |                                                                                  |
| |--05:EXCHANGE [BROADCAST]                                                         |
| |  |                                                                               |
| |  01:SCAN HDFS [user00.t1]                                                        |
| |     partitions=1/1 files=1 size=193B                                             |
| |                                                                                  |
| 00:SCAN HDFS [user00.t2]                                                           |
|    partitions=1/1 files=1 size=158B                                                |
+------------------------------------------------------------------------------------+
Fetched 31 row(s) in 0.03s

參考資料

  1. LOAD DATA Statement
  2. SELECT Statement
  3. EXPLAIN Statement

本文件最後更新於: