實作九

Impala (2) : INSERT INTO … VALUES 語法

我們可以用以下的指令查看目前資料庫的狀態:
SHOW DATABASES;
SHOW TABLES;
執行結果如下:
[master:21000] > show databases;
Query: show databases
+------------------+
| name             |
+------------------+
| _impala_builtins |
| default          |
| user00           |
+------------------+
Fetched 3 row(s) in 0.00s
[master:21000] > show tables;
Query: show tables

Fetched 0 row(s) in 0.01s
看起來跟 Hive 的結果不太一致,那是因為 Impala 還沒跟 Hive 進行同步,讓我們使用以下指令,強迫 Impala 與 Hive 的 metastore 進行同步。
invalidate metadata;
SHOW DATABASES;
use userXX;
SHOW TABLES;
執行結果如下:
[master:21000] > invalidate metadata;
Query: invalidate metadata

Fetched 0 row(s) in 3.88s
[master:21000] > SHOW DATABASES;
Query: show DATABASES
+------------------+
| name             |
+------------------+
| _impala_builtins |
| default          |
| user00           |
+------------------+
Fetched 3 row(s) in 0.00s
[master:21000] > use user00;
Query: use user00
[master:21000] > SHOW TABLES;
Query: show TABLES
+-------------------+
| name              |
+-------------------+
| batting           |
| fact_player_stats |
| master            |
| mysql_data        |
| pitching          |
+-------------------+
Fetched 5 row(s) in 0.00s

由於 Impala 支援 INSERT INTO … VALUES 語法(Hive 必須要到 0.14 以後才有支援)。因此,讓我們先使用 INSERT 語法來建立一個示範用的資料表。請執行以下 Impala SQL 來建立範例資料表:

use userXX;
CREATE TABLE Users ( id INT, name STRING );
CREATE TABLE Logins ( id INT, login TIMESTAMP );
INSERT INTO Users  VALUES ( 1, "Jason" ), ( 2, "Jazz" ), ( 3, "Jim" ), ( 4, "Steven" ), ( 5, "Tim" );
INSERT INTO Logins VALUES ( 1, "2014-12-21 17:18:53" ), ( 2, "2014-12-20 21:38:42" ), ( 3, "2014-12-22 18:17:38" ), ( 4, "2014-12-23 09:25:35" ), ( 5, "2014-12-24 13:35:17" );
SELECT users.id, users.name, logins.login FROM users JOIN logins ON ( users.id = logins.id );

執行結果如下所示。由於這些動作都只發生在 impalad 跟 catalogd 之間,查詢的回應速度大約是秒至毫秒等級。

[master:21000] > use user00;
Query: use user00
[master:21000] > CREATE TABLE Users ( id INT, name STRING );
Query: create TABLE Users ( id INT, name STRING )

Fetched 0 row(s) in 1.04s
[master:21000] > CREATE TABLE Logins ( id INT, login TIMESTAMP );
Query: create TABLE Logins ( id INT, login TIMESTAMP )

Fetched 0 row(s) in 1.04s
[master:21000] > INSERT INTO Users  VALUES ( 1, "Jason" ), ( 2, "Jazz" ), ( 3, "Jim" ), ( 4, "Steven" ), ( 5, "Tim" );
Query: insert INTO Users  VALUES ( 1, "Jason" ), ( 2, "Jazz" ), ( 3, "Jim" ), ( 4, "Steven" ), ( 5, "Tim" )
Inserted 5 row(s) in 6.21s
[master:21000] > INSERT INTO Logins VALUES ( 1, "2014-12-21 17:18:53" ), ( 2, "2014-12-20 21:38:42" ), ( 3, "2014-12-22 18:17:38" ), ( 4, "2014-12-23 09:25:35" ), ( 5, "2014-12-24 13:35:17" );
Query: insert INTO Logins VALUES ( 1, "2014-12-21 17:18:53" ), ( 2, "2014-12-20 21:38:42" ), ( 3, "2014-12-22 18:17:38" ), ( 4, "2014-12-23 09:25:35" ), ( 5, "2014-12-24 13:35:17" )
Inserted 5 row(s) in 6.00s
[master:21000] > SELECT users.id, users.name, logins.login FROM users JOIN logins ON ( users.id = logins.id );
Query: select users.id, users.name, logins.login FROM users JOIN logins ON ( users.id = logins.id )
+----+--------+---------------------+
| id | name   | login               |
+----+--------+---------------------+
| 1  | Jason  | 2014-12-21 17:18:53 |
| 2  | Jazz   | 2014-12-20 21:38:42 |
| 3  | Jim    | 2014-12-22 18:17:38 |
| 4  | Steven | 2014-12-23 09:25:35 |
| 5  | Tim    | 2014-12-24 13:35:17 |
+----+--------+---------------------+
Fetched 5 row(s) in 0.40s

離開 Impala Shell 的方式與 Hive Shell 相同,請使用 quit;exit;
[master:21000] > quit;
Goodbye user00

問題討論

當使用 Hive 時,必須先使用 LOAD DATA 的語法將資料來源匯入指定資料表。資料會存放於 Hive 的 HDFS 存放目錄(如:hdfs://lab.3du.me:9000/user/hive/warehouse 底下),那在 Impala 如果使用 INSERT INTO 語法,資料會存放在哪裡呢?

由於 Impala 與 Hive 共用 metastore,因此實際上的資料也會存放於 /user/hive/warehouse 底下!可以使用以下指令檢查:
er00@master ~]$ hadoop fs -ls -R /user/hive/warehouse/$DBID.db
drwxr-xr-x   - user00 supergroup          0 2015-08-05 23:26 /user/hive/warehouse/user00.db/batting
-rw-r--r--   1 user00 supergroup    6488747 2015-08-05 23:26 /user/hive/warehouse/user00.db/batting/Batting.csv
drwxr-xr-x   - user00 supergroup          0 2015-08-05 23:32 /user/hive/warehouse/user00.db/fact_player_stats
-rw-r--r--   1 user00 supergroup   13057255 2015-08-05 23:32 /user/hive/warehouse/user00.db/fact_player_stats/000000_0
drwxrwxr-x   - impala hive                0 2015-08-05 23:49 /user/hive/warehouse/user00.db/logins
-rw-r--r--   1 impala hive              110 2015-08-05 23:49 /user/hive/warehouse/user00.db/logins/464fad09d2508e04-ac0cdeb713d2dfa3_1008088474_data.0.
drwxrwxr-x   - impala hive                0 2015-08-05 23:49 /user/hive/warehouse/user00.db/logins/_impala_insert_staging
drwxr-xr-x   - user00 supergroup          0 2015-08-05 23:20 /user/hive/warehouse/user00.db/master
-rw-r--r--   1 user00 supergroup    3049250 2015-08-05 23:20 /user/hive/warehouse/user00.db/master/Master.csv
drwxr-xr-x   - user00 hive                0 2015-08-05 23:15 /user/hive/warehouse/user00.db/mysql_data
-rw-r--r--   1 user00 user00              0 2015-08-05 23:15 /user/hive/warehouse/user00.db/mysql_data/_SUCCESS
-rw-r--r--   1 user00 user00              8 2015-08-05 23:15 /user/hive/warehouse/user00.db/mysql_data/part-m-00000
-rw-r--r--   1 user00 user00              8 2015-08-05 23:15 /user/hive/warehouse/user00.db/mysql_data/part-m-00001
drwxr-xr-x   - user00 supergroup          0 2015-08-05 23:30 /user/hive/warehouse/user00.db/pitching
-rw-r--r--   1 user00 supergroup    3602473 2015-08-05 23:30 /user/hive/warehouse/user00.db/pitching/Pitching.csv
drwxrwxr-x   - impala hive                0 2015-08-05 23:49 /user/hive/warehouse/user00.db/users
-rw-r--r--   1 impala hive               36 2015-08-05 23:49 /user/hive/warehouse/user00.db/users/2d4f2425a22921e6-dee9b22ea2933490_1342285975_data.0.
drwxrwxr-x   - impala hive                0 2015-08-05 23:49 /user/hive/warehouse/user00.db/users/_impala_insert_staging

實驗建立一個新的資料表,名為 test。

[user00@master ~]$ impala-shell -q "use $DBID; create table test;"
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Query: use user00
Query: create table test
ERROR: AnalysisException: Table requires at least 1 column
Could not execute command: create table test
[user00@master ~]$ impala-shell -q "use $DBID; create table test (id int);"
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Query: use user00
Query: create table test (id int)

Fetched 0 row(s) in 1.04s

查詢 Hive 的 HDFS 存放目錄,會發現多了一個名為 test 的目錄。

[user00@master ~]$ hadoop fs -ls -R /user/hive/warehouse/$DBID.db
drwxr-xr-x   - user00 supergroup          0 2015-08-05 23:26 /user/hive/warehouse/user00.db/batting
-rw-r--r--   1 user00 supergroup    6488747 2015-08-05 23:26 /user/hive/warehouse/user00.db/batting/Batting.csv
drwxr-xr-x   - user00 supergroup          0 2015-08-05 23:32 /user/hive/warehouse/user00.db/fact_player_stats
-rw-r--r--   1 user00 supergroup   13057255 2015-08-05 23:32 /user/hive/warehouse/user00.db/fact_player_stats/000000_0
drwxrwxr-x   - impala hive                0 2015-08-05 23:49 /user/hive/warehouse/user00.db/logins
-rw-r--r--   1 impala hive              110 2015-08-05 23:49 /user/hive/warehouse/user00.db/logins/464fad09d2508e04-ac0cdeb713d2dfa3_1008088474_data.0.
drwxrwxr-x   - impala hive                0 2015-08-05 23:49 /user/hive/warehouse/user00.db/logins/_impala_insert_staging
drwxr-xr-x   - user00 supergroup          0 2015-08-05 23:20 /user/hive/warehouse/user00.db/master
-rw-r--r--   1 user00 supergroup    3049250 2015-08-05 23:20 /user/hive/warehouse/user00.db/master/Master.csv
drwxr-xr-x   - user00 hive                0 2015-08-05 23:15 /user/hive/warehouse/user00.db/mysql_data
-rw-r--r--   1 user00 user00              0 2015-08-05 23:15 /user/hive/warehouse/user00.db/mysql_data/_SUCCESS
-rw-r--r--   1 user00 user00              8 2015-08-05 23:15 /user/hive/warehouse/user00.db/mysql_data/part-m-00000
-rw-r--r--   1 user00 user00              8 2015-08-05 23:15 /user/hive/warehouse/user00.db/mysql_data/part-m-00001
drwxr-xr-x   - user00 supergroup          0 2015-08-05 23:30 /user/hive/warehouse/user00.db/pitching
-rw-r--r--   1 user00 supergroup    3602473 2015-08-05 23:30 /user/hive/warehouse/user00.db/pitching/Pitching.csv
drwxrwxr-x   - impala hive                0 2015-08-05 23:53 /user/hive/warehouse/user00.db/test
drwxrwxr-x   - impala hive                0 2015-08-05 23:49 /user/hive/warehouse/user00.db/users
-rw-r--r--   1 impala hive               36 2015-08-05 23:49 /user/hive/warehouse/user00.db/users/2d4f2425a22921e6-dee9b22ea2933490_1342285975_data.0.
drwxrwxr-x   - impala hive                0 2015-08-05 23:49 /user/hive/warehouse/user00.db/users/_impala_insert_staging

統計指令

  1. 有時候為了瞭解資料表的特徵,可以使用以下指令來查詢特定資料表的欄位個數、欄位資料特徵與檔案個數等:
use userXX;
COMPUTE STATS users;
COMPUTE STATS logins;
SHOW TABLE STATS users;
SHOW TABLE STATS logins;
SHOW COLUMN STATS users;
SHOW COLUMN STATS logins;

執行結果如下所示:

[user00@master ~]$ impala-shell 
Starting Impala Shell without Kerberos authentication
Connected to master:21000
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

[master:21000] > use user00;
Query: use user00
[master:21000] > COMPUTE STATS users;
Query: compute STATS users
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 2 column(s). |
+-----------------------------------------+
Fetched 1 row(s) in 5.23s
[master:21000] > COMPUTE STATS logins;
Query: compute STATS logins
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 2 column(s). |
+-----------------------------------------+
Fetched 1 row(s) in 4.92s
[master:21000] > SHOW TABLE STATS users;
Query: show TABLE STATS users
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| 5     | 1      | 36B  | NOT CACHED   | TEXT   | false             |
+-------+--------+------+--------------+--------+-------------------+
Fetched 1 row(s) in 0.01s
[master:21000] > SHOW TABLE STATS logins;
Query: show TABLE STATS logins
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| 5     | 1      | 110B | NOT CACHED   | TEXT   | false             |
+-------+--------+------+--------------+--------+-------------------+
Fetched 1 row(s) in 0.00s
[master:21000] > SHOW COLUMN STATS users;
Query: show COLUMN STATS users
+--------+--------+------------------+--------+----------+-------------------+
| Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size          |
+--------+--------+------------------+--------+----------+-------------------+
| id     | INT    | 5                | -1     | 4        | 4                 |
| name   | STRING | 5                | -1     | 6        | 4.199999809265137 |
+--------+--------+------------------+--------+----------+-------------------+
Fetched 2 row(s) in 0.00s
[master:21000] > SHOW COLUMN STATS logins;
Query: show COLUMN STATS logins
+--------+-----------+------------------+--------+----------+----------+
| Column | Type      | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+-----------+------------------+--------+----------+----------+
| id     | INT       | 5                | -1     | 4        | 4        |
| login  | TIMESTAMP | 5                | -1     | 16       | 16       |
+--------+-----------+------------------+--------+----------+----------+
Fetched 2 row(s) in 0.00s

參考資料

  1. INSERT Statement
  2. SHOW Statement
  3. COMPUTE STATS Statement

本文件最後更新於: