Trafodion Bulk Load 初试

    xiaoxiao2023-03-24  6

    如在文章《Trafodion 数据加载介绍》中提到,Trafodion的Bulk Load与Tricke Load不同,Bulk Load主要是针对大数据量的且一般是批量装载的方式。Bulk Load通常使用LOAD语句实现。

    下面介绍几种Bulk Load方式,

    从Trafodion表加载数据从HDFS文件加载数据(Hive外部表)从Hive表加载数据从外部数据库加载数据(Mysql to Hive)

    从Trafodion表加载数据

    (1)创建两个表结构相同的测试表

    SQL>showddl test_tbl; CREATE TABLE TRAFODION.SEABASE.TEST_TBL ( ID INT NO DEFAULT NOT NULL NOT DROPPABLE SERIALIZED , NAME VARCHAR(10) CHARACTER SET ISO88591 COLLATE DEFAULT DEFAULT NULL SERIALIZED ) ; --- SQL operation complete. SQL>showddl test_tbl2; CREATE TABLE TRAFODION.SEABASE.TEST_TBL2 ( ID INT NO DEFAULT NOT NULL NOT DROPPABLE SERIALIZED , NAME VARCHAR(10) CHARACTER SET ISO88591 COLLATE DEFAULT DEFAULT NULL SERIALIZED ) ; --- SQL operation complete. (2)表1有几条数据,表2没有数据

    SQL>select * from test_tbl; ID NAME ----------- ---------- 1 LIU 2 zhang 3 LIU 4 wu 5 feng --- 5 row(s) selected. SQL>select * from test_tbl2; --- 0 row(s) selected. (3)从表1加载数据到表2

    SQL>load into test_tbl2 select * from test_tbl; UTIL_OUTPUT -------------------------------------------------------------------------------------------------------------------------------- Task: LOAD Status: Started Object: TRAFODION.SEABASE.TEST_TBL2 Task: CLEANUP Status: Started Object: TRAFODION.SEABASE.TEST_TBL2 Task: CLEANUP Status: Ended Object: TRAFODION.SEABASE.TEST_TBL2 Task: PREPARATION Status: Started Object: TRAFODION.SEABASE.TEST_TBL2 Rows Processed: 5 Task: PREPARATION Status: Ended ET: 00:00:00.263 Task: COMPLETION Status: Started Object: TRAFODION.SEABASE.TEST_TBL2 Task: COMPLETION Status: Ended ET: 00:00:00.531 --- SQL operation complete. SQL>select * from test_tbl2; ID NAME ----------- ---------- 1 LIU 2 zhang 3 LIU 4 wu 5 feng --- 5 row(s) selected.

    从HDFS文件加载数据(Hive 外部表)

    (1)创建测试文件并利用scp或者其他方式上传到Hadoop Cluster上,内容如下

    [centos@cent-1 tmp]$ cat test_table Lucy,20,F Jason,22,M Stephen,21,M Lily,19,F Vincent,25,M

    (2)将测试文件导入HDFS中

    [hdfs@cent-1 ~]$ hadoop fs -mkdir -p /hive/test/test_tbl [hdfs@cent-1 ~]$ hadoop fs -copyFromLocal /tmp/test_table /hive/test/test_tbl [hdfs@cent-1 ~]$ hadoop fs -ls /hive/test/test_tbl Found 1 items -rw-r--r--   3 hdfs supergroup         57 2016-09-28 02:31 /hive/test/test_tbl/test_table

    (3)在Trafodion中创建测试表

    SQL>create table test_table ( name varchar(10), age int, sex varchar(1)); --- SQL operation complete. SQL>showddl test_table; CREATE TABLE TRAFODION.SEABASE.TEST_TABLE ( NAME VARCHAR(10) CHARACTER SET ISO88591 COLLATE DEFAULT DEFAULT NULL SERIALIZED , AGE INT DEFAULT NULL SERIALIZED , SEX VARCHAR(1) CHARACTER SET ISO88591 COLLATE DEFAULT DEFAULT NULL SERIALIZED ) ; --- SQL operation complete.

    (4)在Hive中创建Hive外部表,指向前面对应的HDFS文件

    [hdfs@cent-1 ~]$ hive Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/hive-common-1.1.0-cdh5.4.8.jar!/hive-log4j.properties WARNING: Hive CLI is deprecated and migration to Beeline is recommended. hive> create external table test_tbl (name string,age int,sex string) row format delimited fields terminated by ',' location '/hive/test/test_tbl'; OK Time taken: 0.085 seconds hive> show tables; OK test_tbl Time taken: 0.364 seconds, Fetched: 1 row(s)

    (5)加载数据到Trafodion表

    SQL>load into test_table +>select * from hive.hive.test_tbl; UTIL_OUTPUT -------------------------------------------------------------------------------------------------------------------------------- Task: LOAD Status: Started Object: TRAFODION.SEABASE.TEST_TABLE Task: CLEANUP Status: Started Object: TRAFODION.SEABASE.TEST_TABLE Task: CLEANUP Status: Ended Object: TRAFODION.SEABASE.TEST_TABLE Task: PREPARATION Status: Started Object: TRAFODION.SEABASE.TEST_TABLE Rows Processed: 5 Task: PREPARATION Status: Ended ET: 00:00:01.272 Task: COMPLETION Status: Started Object: TRAFODION.SEABASE.TEST_TABLE Task: COMPLETION Status: Ended ET: 00:00:00.254 --- SQL operation complete. SQL>select * from test_table; NAME AGE SEX ---------- ----------- ---- Lucy 20 F Jason 22 M Stephen 21 M Lily 19 F Vincent 25 M --- 5 row(s) selected.

    从Hive表加载数据

    (1)在Hive中创建测试表并插入几条数据

    hive> describe hive_tbl; OK name                    string age                     int sex                     string Time taken: 0.107 seconds, Fetched: 3 row(s) hive> select * from hive_tbl; OK Lucy 20 F Jason 22 M Stephen 21 M Time taken: 0.093 seconds, Fetched: 3 row(s)

    (2)Trafodion中的表继续沿用上述测试表,先清空数据

    SQL>purgedata test_table; --- SQL operation complete. SQL>showddl test_table; CREATE TABLE TRAFODION.SEABASE.TEST_TABLE ( NAME VARCHAR(10) CHARACTER SET ISO88591 COLLATE DEFAULT DEFAULT NULL SERIALIZED , AGE INT DEFAULT NULL SERIALIZED , SEX VARCHAR(1) CHARACTER SET ISO88591 COLLATE DEFAULT DEFAULT NULL SERIALIZED ) ; --- SQL operation complete. (3)从Hive加载数据到Trafodion

    (注:在做从hive到trafodion数据加载的时候有几个参数可以配置,用来提升加载性能,如HIVE_MAX_STRING_LENGTH、ALLOW_INCOMPATIBLE_ASSIGNMENT

       典型的用法如下:

        CQD HIVE_MAX_STRING_LENGTH '1000'; --用于当列的最大长度为1KB时

        CQD ALLOW_INCOMPATIBLE_ASSIGNMENT 'ON'; --用于当有时间相关的列的时候

    SQL>load into test_table select * from hive.hive.hive_tbl; UTIL_OUTPUT -------------------------------------------------------------------------------------------------------------------------------- Task: LOAD Status: Started Object: TRAFODION.SEABASE.TEST_TABLE Task: CLEANUP Status: Started Object: TRAFODION.SEABASE.TEST_TABLE Task: CLEANUP Status: Ended Object: TRAFODION.SEABASE.TEST_TABLE Task: PREPARATION Status: Started Object: TRAFODION.SEABASE.TEST_TABLE Rows Processed: 3 Task: PREPARATION Status: Ended ET: 00:00:00.390 Task: COMPLETION Status: Started Object: TRAFODION.SEABASE.TEST_TABLE Task: COMPLETION Status: Ended ET: 00:00:00.447 --- SQL operation complete. SQL>select * from test_table; NAME AGE SEX ---------- ----------- ---- Lucy 20 F Jason 22 M Stephen 21 M --- 3 row(s) selected.

    从外部数据库加载数据(Mysql to Hive)

    (1)在集群上安装并启动Sqoop,Sqoop安装步骤请参照其官方文档 点击打开链接

    (2)下载mysql connector Jar包,并放到/var/lib/sqoop目录下

    (3)在某台机器上安装Mysql,安装Mysql步骤此处不作详细说明

    (4)在mysql数据中创建测试表,并插入几条数据

    mysql> use test; Database changed mysql> create table test_tbl(name varchar(10), age int, sex char(1)); Query OK, 0 rows affected (0.00 sec) mysql> select * from test; ERROR 1146 (42S02): Table 'test.test' doesn't exist mysql> select * from test_tbl; Empty set (0.00 sec) mysql> insert into test_tbl values('Wang wei', 30, 'M'); Query OK, 1 row affected (0.00 sec) mysql> insert into test_tbl values('Li ping', 29, 'F'); Query OK, 1 row affected (0.00 sec) mysql> mysql> mysql> select * from test_tbl; +----------+------+------+ | name | age | sex | +----------+------+------+ | Wang wei | 30 | M | | Li ping | 29 | F | +----------+------+------+ 2 rows in set (0.00 sec)

    (5)使用sqoop命令从mysql往hive导入数据

    [hdfs@cent-2 ~]$ sqoop import --connect jdbc:mysql://localhost:3306/test --driver com.mysql.jdbc.Driver --username centos --table test_tbl --split-by name --hive-import --create-hive-table --hive-table test_tabl_hive Warning: /opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail. ... Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/hive-common-1.1.0-cdh5.4.8.jar!/hive-log4j.properties OK Time taken: 3.052 seconds Loading data to table default.test_tabl_hive Table default.test_tabl_hive stats: [numFiles=4, totalSize=27] OK Time taken: 0.748 seconds

    (6)从Trafodion或Hive Shell中查看数据是否导入成功

    SQL>select * from hive.hive.test_tabl_hive; NAME AGE SEX -------------------------------------------------------------------------------------------------------------------------------- ----------- -------------------------------------------------------------------------------------------------------------------------------- Li ping 29 F Wang wei 30 M --- 2 row(s) selected.

    转载请注明原文地址: https://ju.6miu.com/read-1202445.html
    最新回复(0)