Next, let's import existing CSV files into Impala tables, using an approach similar to Hive's.
First, let's generate two sample CSV files, table1.csv and table2.csv. In your SSH shell session, run the following three blocks of commands:
cat > table1.csv << EOF
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
EOF
cat > table2.csv << EOF
1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
5,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325
EOF
hadoop fs -mkdir -p /tmp/$USER
hadoop fs -chmod a+w /tmp/$USER
hadoop fs -put table1.csv /tmp/$USER/table1.csv
hadoop fs -put table2.csv /tmp/$USER/table2.csv
hadoop fs -ls /tmp/$USER
The output should look like this:
[user00@master ~]$ cat > table1.csv << EOF
> 1,true,123.123,2012-10-24 08:55:00
> 2,false,1243.5,2012-10-25 13:40:00
> 3,false,24453.325,2008-08-22 09:33:21.123
> 4,false,243423.325,2007-05-12 22:32:21.33454
> 5,true,243.325,1953-04-22 09:11:33
> EOF
[user00@master ~]$ cat > table2.csv << EOF
> 1,true,12789.123
> 2,false,1243.5
> 3,false,24453.325
> 4,false,2423.3254
> 5,true,243.325
> 60,false,243565423.325
> 70,true,243.325
> 80,false,243423.325
> 90,true,243.325
> EOF
[user00@master ~]$ hadoop fs -mkdir -p /tmp/$USER
[user00@master ~]$ hadoop fs -chmod a+w /tmp/$USER
[user00@master ~]$ hadoop fs -put table1.csv /tmp/$USER/table1.csv
[user00@master ~]$ hadoop fs -put table2.csv /tmp/$USER/table2.csv
[user00@master ~]$ hadoop fs -ls /tmp/$USER
Found 2 items
-rw-r--r--   1 user00 supergroup        193 2015-08-06 00:06 /tmp/user00/table1.csv
-rw-r--r--   1 user00 supergroup        158 2015-08-06 00:07 /tmp/user00/table2.csv
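Before uploading CSV files like these, it can be worth confirming that every row has as many fields as the target table has columns; with text-format tables, Impala quietly turns missing or unparseable fields into NULLs rather than rejecting the load. A minimal local sanity check with awk might look like this (the /tmp/table1_check.csv path is just for illustration, and the expected field count of 4 mirrors the four-column schema of t1 used below):

```shell
# Recreate the first sample file locally so the check is self-contained.
cat > /tmp/table1_check.csv << 'EOF'
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
EOF

# t1 has 4 columns, so every row should have exactly 4 comma-separated fields.
awk -F, 'NF != 4 { bad++ } END { print bad + 0, "malformed rows" }' /tmp/table1_check.csv
```

The same check with `NF != 3` would apply to table2.csv, which feeds the three-column table t2.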
Next, let's create two tables, t1 and t2, under the userXX database. Run the following command:
impala-shell -q "CREATE TABLE $DBID.t1 ( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ; CREATE TABLE $DBID.t2 ( id INT, col_1 BOOLEAN, col_2 DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ; LOAD DATA INPATH '/tmp/$USER/table1.csv' INTO TABLE $DBID.t1 ; LOAD DATA INPATH '/tmp/$USER/table2.csv' INTO TABLE $DBID.t2 ;"

Note that LOAD DATA INPATH moves the files from /tmp/$USER into each table's HDFS directory rather than copying them. The output looks like this:
[user00@master ~]$ impala-shell -q "CREATE TABLE $DBID.t1 ( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ; CREATE TABLE $DBID.t2 ( id INT, col_1 BOOLEAN, col_2 DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ; LOAD DATA INPATH '/tmp/$USER/table1.csv' INTO TABLE $DBID.t1 ; LOAD DATA INPATH '/tmp/$USER/table2.csv' INTO TABLE $DBID.t2 ;"
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Query: create TABLE user00.t1 ( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Fetched 0 row(s) in 1.03s
Query: create TABLE user00.t2 ( id INT, col_1 BOOLEAN, col_2 DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Fetched 0 row(s) in 1.02s
Query: load DATA INPATH '/tmp/user00/table1.csv' INTO TABLE user00.t1
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Fetched 1 row(s) in 4.84s
Query: load DATA INPATH '/tmp/user00/table2.csv' INTO TABLE user00.t2
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Fetched 1 row(s) in 6.09s
To verify that the data landed in both tables, run:

impala-shell -q "use $DBID; SELECT * FROM t1; SELECT * FROM t2;"
The output looks like this:
[user00@master ~]$ impala-shell -q "use $DBID; SELECT * FROM t1; SELECT * FROM t2;"
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Query: use user00
Query: select * FROM t1
+----+-------+------------+-------------------------------+
| id | col_1 | col_2      | col_3                         |
+----+-------+------------+-------------------------------+
| 1  | true  | 123.123    | 2012-10-24 08:55:00           |
| 2  | false | 1243.5     | 2012-10-25 13:40:00           |
| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5  | true  | 243.325    | 1953-04-22 09:11:33           |
+----+-------+------------+-------------------------------+
Fetched 5 row(s) in 0.30s
Query: select * FROM t2
+----+-------+---------------+
| id | col_1 | col_2         |
+----+-------+---------------+
| 1  | true  | 12789.123     |
| 2  | false | 1243.5        |
| 3  | false | 24453.325     |
| 4  | false | 2423.3254     |
| 5  | true  | 243.325       |
| 60 | false | 243565423.325 |
| 70 | true  | 243.325       |
| 80 | false | 243423.325    |
| 90 | true  | 243.325       |
+----+-------+---------------+
Fetched 9 row(s) in 0.34s
Now let's run a more complex JOIN query and see how long it takes to return. Run the following command:
impala-shell -q "use $DBID; SELECT t1.id, t1.col_1, MAX(t2.col_2), MIN(t2.col_2) FROM t2 JOIN t1 USING(id) GROUP BY id,col_1 ORDER BY id limit 10;"
The actual output looks like this:
[user00@master ~]$ impala-shell -q "use $DBID; SELECT t1.id, t1.col_1, MAX(t2.col_2), MIN(t2.col_2) FROM t2 JOIN t1 USING(id) GROUP BY id,col_1 ORDER BY id limit 10;"
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Query: use user00
Query: select t1.id, t1.col_1, MAX(t2.col_2), MIN(t2.col_2) FROM t2 JOIN t1 USING(id) GROUP BY id,col_1 ORDER BY id limit 10
+----+-------+---------------+---------------+
| id | col_1 | max(t2.col_2) | min(t2.col_2) |
+----+-------+---------------+---------------+
| 1  | true  | 12789.123     | 12789.123     |
| 2  | false | 1243.5        | 1243.5        |
| 3  | false | 24453.325     | 24453.325     |
| 4  | false | 2423.3254     | 2423.3254     |
| 5  | true  | 243.325       | 243.325       |
+----+-------+---------------+---------------+
Fetched 5 row(s) in 0.75s
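As a sanity check on what the query computes (not on how Impala executes it), the same join-then-aggregate can be sketched locally with coreutils on copies of the two CSV files; the /tmp file names here are hypothetical. Since both files are already sorted on id, join(1) can pair them on the first field, and awk can take the per-group max/min of t2's col_2 (field 6 of each joined row):

```shell
# Local copies of the two tables (same rows as uploaded earlier).
cat > /tmp/t1_local.csv << 'EOF'
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
EOF
cat > /tmp/t2_local.csv << 'EOF'
1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
5,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325
EOF

# Inner-join on the first field, then track max/min of t2.col_2 per id.
join -t, /tmp/t1_local.csv /tmp/t2_local.csv |
  awk -F, '{
      if (!($1 in mx) || $6 + 0 > mx[$1] + 0) mx[$1] = $6
      if (!($1 in mn) || $6 + 0 < mn[$1] + 0) mn[$1] = $6
      c1[$1] = $2
    }
    END { for (id in mx) print id, c1[id], mx[id], mn[id] }' |
  sort -n > /tmp/join_check.txt
cat /tmp/join_check.txt
```

Only ids 1 through 5 appear in both files, so only those five groups survive the inner join, matching the five rows Impala returned above.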
In practice, you can use the EXPLAIN statement to inspect the query plan of an Impala SQL query. Here is the query plan of the JOIN query above:
[user00@master ~]$ impala-shell -q "use $DBID; EXPLAIN SELECT t1.id, t1.col_1, MAX(t2.col_2), MIN(t2.col_2) FROM t2 JOIN t1 USING(id) GROUP BY id,col_1 ORDER BY id limit 10;"
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Query: use user00
Query: explain SELECT t1.id, t1.col_1, MAX(t2.col_2), MIN(t2.col_2) FROM t2 JOIN t1 USING(id) GROUP BY id,col_1 ORDER BY id limit 10
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=2.16GB VCores=2                            |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| user00.t1, user00.t2                                                               |
|                                                                                    |
| 08:MERGING-EXCHANGE [UNPARTITIONED]                                                |
| |  order by: t1.id ASC                                                             |
| |  limit: 10                                                                       |
| |                                                                                  |
| 04:TOP-N [LIMIT=10]                                                                |
| |  order by: t1.id ASC                                                             |
| |                                                                                  |
| 07:AGGREGATE [FINALIZE]                                                            |
| |  output: max:merge(t2.col_2), min:merge(t2.col_2)                                |
| |  group by: t1.id, t1.col_1                                                       |
| |                                                                                  |
| 06:EXCHANGE [HASH(t1.id,t1.col_1)]                                                 |
| |                                                                                  |
| 03:AGGREGATE                                                                       |
| |  output: max(t2.col_2), min(t2.col_2)                                            |
| |  group by: t1.id, t1.col_1                                                       |
| |                                                                                  |
| 02:HASH JOIN [INNER JOIN, BROADCAST]                                               |
| |  hash predicates: user00.t2.id = user00.t1.id                                    |
| |                                                                                  |
| |--05:EXCHANGE [BROADCAST]                                                         |
| |  |                                                                               |
| |  01:SCAN HDFS [user00.t1]                                                        |
| |     partitions=1/1 files=1 size=193B                                             |
| |                                                                                  |
| 00:SCAN HDFS [user00.t2]                                                           |
|    partitions=1/1 files=1 size=158B                                                |
+------------------------------------------------------------------------------------+
Fetched 31 row(s) in 0.03s
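Note the WARNING in the plan: the planner has no table or column statistics for t1 and t2, so its memory estimate and join-strategy choice are based on guesses (such as the raw file sizes shown in the SCAN HDFS nodes). Running COMPUTE STATS on each table fills in those statistics; a sketch of the command, to be run in the same lab environment, might look like this:

```shell
# Gather table and column statistics so the planner can make
# better estimates (cluster-dependent; assumes $DBID is set as above).
impala-shell -q "COMPUTE STATS $DBID.t1; COMPUTE STATS $DBID.t2;"
```

After the statistics are in place, re-running the EXPLAIN above should no longer print the missing-statistics warning.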