I. Hive CLI (Query Processor)
1. Query Compiler
- Parser: parses the HQL text into an AST (abstract syntax tree)
- Semantic Analyzer: performs semantic analysis and produces QBs (query blocks)
- Logical Plan Generator: turns the query blocks into the logical plan (operator tree)
- Logical Optimizer: optimizes the logical plan
- Physical Plan Generator: generates the physical plan
- Physical Optimizer: optimizes the physical plan
2. Execution Engine
- MR
- Spark
- Tez
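To see what these compilation stages produce for a concrete statement, EXPLAIN prints the resulting stage/operator plan (EXPLAIN EXTENDED adds more detail, and in some releases the AST as well). A minimal sketch against the user_dimension table created later in this article:

explain
select province, count(*) as cnt
from user_dimension
group by province;

-- more verbose plan output:
explain extended
select province, count(*) as cnt
from user_dimension
group by province;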
II. MetaStore (Metadata)
1. Stores table metadata:
- database and table basic information
- partition information
- column information
- storage format information
- miscellaneous table properties
- permission information
2. The metadata is kept in a relational database
3. MetaStore Server
- a general-purpose catalog server
- can be accessed from multiple languages
- local/remote mode, configured as follows:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hadoopA:8020</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hadoopA/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123</value>
</property>

III. YARN (Compute Platform)
IV. HDFS (Data Storage)
Hive CLI
  hive
  hive -h host -p port
Hive Beeline
  beeline -u jdbc:hive2://host:port -n user -p password
Programmatic access
- JDBC/ODBC
- Thrift

1. Define column types appropriately
- basic types
- complex types (ARRAY, MAP, STRUCT)
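A minimal sketch of a table mixing basic and complex column types; the table and column names here are hypothetical and not part of the project schema:

create table if not exists user_profile (
  uid STRING,
  age INT,
  tags ARRAY<STRING>,
  props MAP<STRING,STRING>,
  addr STRUCT<province:STRING, city:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':';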
2. Use partition columns
- choose partition columns from the query dimensions and business requirements
- date partitions are the most common; dynamic partitions are also supported
- set hive.exec.dynamic.partition=true
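A sketch of a dynamic-partition insert into the record table defined later in this article; record_stage is a hypothetical staging table with the same columns plus partition_date and hour as ordinary columns, and the mode setting below is usually needed as well:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table record partition (partition_date, hour)
select rid, uid, bid, trancation_date, price,
       source_province, target_province, site,
       express_number, express_company,
       partition_date, hour      -- the dynamic partition columns must come last
from record_stage;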
3. Define bucket columns
4. Use columnar storage
- Parquet
- ORC
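Bucketing and columnar storage are both declared in the DDL. A sketch of a bucketed, ORC-backed variant of the project's record table (the table name, bucket column and bucket count are illustrative); the TBLPROPERTIES line anticipates the compression item below:

create table if not exists record_bucketed (
  rid STRING,
  uid STRING,
  bid STRING,
  price INT
)
partitioned by (partition_date STRING)
clustered by (uid) into 32 buckets
stored as ORC
tblproperties ('orc.compress'='SNAPPY');   -- ORC-internal compression codec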
5. Use compressed storage
- Bzip2, Gzip
- LZO
- Snappy
1. Enable data compression
- reduces storage and I/O
- compress Hive output and intermediate results:
  hive.exec.compress.output=true
  hive.exec.compress.intermediate=true
- set the file format used for Hive query/intermediate result files:
  hive.query.result.fileformat=SequenceFile
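Putting the settings above together at the start of a script might look like the sketch below; the codec choices are examples, paired with codec properties from Hadoop 2/Hive 2 that are not part of the original notes:

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.query.result.fileformat=SequenceFile;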
2. Job execution optimization
Run independent jobs of a query in parallel
- hive.exec.parallel=true (default false)
- hive.exec.parallel.thread.number=8 (default 8; a combined example follows the local-mode settings below)
Local execution mode
- hive.exec.mode.local.auto=true
- hive.exec.mode.local.auto.inputbytes.max (128 MB by default)
- hive.exec.mode.local.auto.tasks.max (4 by default)
- the job may have at most one reduce task (num_reduce_tasks <= 1)
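A sketch that enables both of the above (parallel stage execution and automatic local mode) for a session; the values are illustrative, and note that newer Hive releases may expose the task limit as hive.exec.mode.local.auto.input.files.max instead:

-- run independent stages of one query in parallel
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

-- let sufficiently small jobs run locally instead of on YARN
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;   -- 128 MB
set hive.exec.mode.local.auto.tasks.max=4;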
3. Choose a suitable execution engine
- MR
- Spark
- Tez
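The engine can be switched per session (or globally in hive-site.xml); MR is deprecated in Hive 2, as the warning in the execution logs later in this article also points out. A sketch:

set hive.execution.engine=tez;
-- or
set hive.execution.engine=spark;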
4. MapReduce parameter tuning
Map stage optimization
- setting mapreduce.job.maps directly has no effect
- num_map_tasks is determined by the input-split size parameters
- i.e. by the split algorithm applied to the input files
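Because the map count cannot be set directly, it is usually influenced through the input-split settings. A sketch, assuming the default CombineHiveInputFormat is in use; exact property names vary somewhat between Hadoop/Hive releases and the values are illustrative:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- smaller max split size => more map tasks; larger min split size => fewer map tasks
set mapreduce.input.fileinputformat.split.maxsize=256000000;
set mapreduce.input.fileinputformat.split.minsize=128000000;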
Reduce stage optimization
- mapred.reduce.tasks can be set directly
- otherwise num_reduce_tasks is estimated from the input size and the relevant parameters (the split algorithm)
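These are the same knobs Hive prints as hints when it launches a job (see the execution logs later in this article); a sketch with illustrative values:

-- average input bytes one reducer should handle (drives the estimate)
set hive.exec.reducers.bytes.per.reducer=268435456;
-- upper bound on the estimated reducer count
set hive.exec.reducers.max=64;
-- force an exact reducer count; -1 means let Hive estimate it
set mapreduce.job.reduces=8;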
Shuffle stage optimization
- compress the intermediate (map output) data; configuration:
  mapreduce.map.output.compress=true
  mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.LzoCodec
    (or org.apache.hadoop.io.compress.SnappyCodec)
1. Common join: the join is executed on the reduce side.
2. Map join
- the small table is loaded into memory and joined with every map of the large table during the map stage
- hive.auto.convert.join=true (default false in old releases, true since Hive 0.11)
- hive.mapjoin.smalltable.filesize=600M (default 25M)
- select /*+ MAPJOIN(a) */ ... from a join b ... forces a map join on table a
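Applied to the project tables created later in this article, a map join might look like the sketch below; the hint is only needed when automatic conversion is turned off:

set hive.auto.convert.join=true;

-- brand_dimension is small enough to be loaded into memory and shipped to every mapper
select /*+ MAPJOIN(b) */ b.brand, sum(r.price) as totalPrice
from record r
join brand_dimension b on r.bid = b.bid
group by b.brand;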
3. Bucket map join
- set hive.optimize.bucketmapjoin=true
- works together with map join
- all joined tables must be bucketed, and the bucket count of the big table must be an integer multiple of the small table's bucket count
- the bucketed column must be the join (equality) column
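A sketch of the layout a bucket map join requires; the table names and bucket counts are hypothetical, the point being that both tables are bucketed on the join key and 64 is an integer multiple of 16:

create table big_orders (oid STRING, uid STRING, price INT)
clustered by (uid) into 64 buckets;

create table small_users (uid STRING, name STRING)
clustered by (uid) into 16 buckets;

set hive.optimize.bucketmapjoin=true;

select /*+ MAPJOIN(u) */ o.oid, u.name
from big_orders o
join small_users u on o.uid = u.uid;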
Step 1. Project description
- Set up a Hadoop cluster with HDFS, YARN, Hive and the other components installed.
- Create three Hive tables: user_dimension, brand_dimension and record; user_dimension and brand_dimension are external tables.
- Load data into the three tables. The source data has already been prepared and matches the format of each table.
Step 2. Create the user_dimension, brand_dimension and record tables
[hadoop@hadoopa command]$ cat create_table_user.sql
create external table if not exists user_dimension (
  uid STRING,
  name STRING,
  gender STRING,
  birth DATE,
  province STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
location 'hdfs://hadoopA:8020/warehouse/user_dimension';
[hadoop@hadoopa command]$

[hadoop@hadoopa command]$ cat create_table_brand.sql
create external table if not exists brand_dimension (
  bid STRING,
  category STRING,
  brand STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
location 'hdfs://hadoopA:8020/warehouse/brand_dimension';
[hadoop@hadoopa command]$

[hadoop@hadoopa command]$ cat create_table_record.sql
create table if not exists record (
  rid STRING,
  uid STRING,
  bid STRING,
  trancation_date TIMESTAMP,
  price INT,
  source_province STRING,
  target_province STRING,
  site STRING,
  express_number STRING,
  express_company STRING
) PARTITIONED BY (
  partition_date STRING,
  hour INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

[hadoop@hadoopa command]$ hive -f create_table_user.sql
[hadoop@hadoopa command]$ hive -f create_table_brand.sql
[hadoop@hadoopa command]$ hive -f create_table_record.sql

Step 3. Load data into the three tables
hive> load data local inpath '/home/hadoop/hadooptraining/datasource/brand.list' overwrite into table brand_dimension;
hive> load data local inpath '/home/hadoop/hadooptraining/datasource/user.list' overwrite into table user_dimension;
hive> load data inpath 'hdfs://hadoopA:8020/flume/record/2017-03-10/2220/transaction_log.1489155600805' overwrite into table record partition(partition_date='2017-03-10',hour=22);
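Before running the analysis queries it is worth confirming that the partition was registered and that rows actually arrived; one quick check (a sketch):

hive> show partitions record;
hive> select count(*) from record where partition_date='2017-03-10' and hour=22;
hive> select count(*) from user_dimension;
hive> select count(*) from brand_dimension;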
Step 4. Verify the results with HQL queries
Query total sales per brand
1. HQL statement

[hadoop@hadoopa command]$ cat brand_price_list.sql
select brand,sum(price) as totalPrice
from record join brand_dimension on record.bid=brand_dimension.bid
group by brand_dimension.brand
order by totalPrice desc;
[hadoop@hadoopa command]$

2. Execution result

[hadoop@hadoopa command]$ hive -f brand_price_list.sql
which: no hbase in (/usr/lib64/qt-3.3/bin:.:/home/hadoop/apache-hive-2.1.0-bin/bin:/home/hadoop/maven3.3.9/bin:/home/hadoop/hadoop-2.7.3/bin:/home/hadoop/hadoop-2.7.3/sbin:/home/hadoop/apache-flume-1.7.0-bin/bin:/home/hadoop/jdk1.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313114243_fdc4d60e-75f7-426e-ba69-cf7198eaedc6
Total jobs = 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-03-13 11:43:12 Starting to launch local task to process map join; maximum memory = 518979584
2017-03-13 11:43:15 Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/5558bbca-d4b6-4112-8932-6e8a662d02ac/hive_2017-03-13_11-42-43_902_6704644829066910043-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:43:16 Uploaded 1 File to: file:/tmp/hadoop/5558bbca-d4b6-4112-8932-6e8a662d02ac/hive_2017-03-13_11-42-43_902_6704644829066910043-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (34008 bytes)
2017-03-13 11:43:16 End of local task; Time Taken: 3.194 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0031, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0031/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0031
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:43:40,015 Stage-2 map = 0%, reduce = 0%
2017-03-13 11:43:54,259 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.86 sec
2017-03-13 11:44:08,440 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 5.47 sec
MapReduce Total cumulative CPU time: 5 seconds 470 msec
Ended Job = job_1489117385526_0031
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0032, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0032/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0032
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:44:32,410 Stage-3 map = 0%, reduce = 0%
2017-03-13 11:44:52,170 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.4 sec
2017-03-13 11:45:07,571 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 5.73 sec
MapReduce Total cumulative CPU time: 5 seconds 730 msec
Ended Job = job_1489117385526_0032
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 5.47 sec HDFS Read: 62768 HDFS Write: 1666 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 5.73 sec HDFS Read: 6942 HDFS Write: 1480 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 200 msec
OK
SKYWORTH 11992
SAMSUNG 10240
YILI 9872
TCL 6741
OLAY 6442
MEIZU 6345
ASUS 5705
PEAK 5431
APPLE 5213
MOUTAI 4772
SHARP 4721
PEACEBIRD 4680
MIZUNO 4599
DHC 4585
NIULANSHAN 4582
CAMEL 4569
NIKE 4358
SEPTWOLVES 4345
OPPO 4306
NB 4237
KAPPA 4092
ZARA 4068
GUANGMING 4054
HP 4043
HISENSE 3995
HLA 3963
HUAWEI 3927
KANS 3884
LANGJIU 3857
NIVEA 3579
LINING 3559
CLINIQUE 3552
LENOVO 3534
PUMA 3531
HTC 3405
GXG 3322
UNIQLO 3271
HAIER 3106
LOREAL 2948
WULIANGYE 2912
ADIDAS 2847
MOTOROLA 2819
VIVO 2809
DELL 2785
SANYUAN 2776
LANCOME 2714
SELECTED 2633
INNISFREE 2452
SONY 2353
ACER 2339
XIAOMI 2260
HONGXING 2113
ANTA 1990
MENGNIU 1776
IPHONE 1628
SEMIR 1589
PHILIPS 1205
361 718
MEIFUBAO 448
Time taken: 146.0 seconds, Fetched: 59 row(s)

Query total consumption by user age group
1. HQL statement

[hadoop@hadoopa command]$ cat age_price_list.sql
select cast(DATEDIFF(CURRENT_DATE, birth)/365 as int) as age,
       sum(price) as totalPrice
from record join user_dimension on record.uid=user_dimension.uid
group by cast(DATEDIFF(CURRENT_DATE, birth)/365 as int)
order by totalPrice desc;

2. Execution result

[hadoop@hadoopa command]$ hive -f age_price_list.sql
which: no hbase in (/usr/lib64/qt-3.3/bin:.:/home/hadoop/apache-hive-2.1.0-bin/bin:/home/hadoop/maven3.3.9/bin:/home/hadoop/hadoop-2.7.3/bin:/home/hadoop/hadoop-2.7.3/sbin:/home/hadoop/apache-flume-1.7.0-bin/bin:/home/hadoop/jdk1.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313114910_f92cc413-d572-4574-a30e-2cfa9fdcded8
Total jobs = 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-03-13 11:49:32 Starting to launch local task to process map join; maximum memory = 518979584
2017-03-13 11:49:35 Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/e1bbe409-d63f-4ef7-b610-bdd0a4caf873/hive_2017-03-13_11-49-10_204_7283272927191169557-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:49:35 Uploaded 1 File to: file:/tmp/hadoop/e1bbe409-d63f-4ef7-b610-bdd0a4caf873/hive_2017-03-13_11-49-10_204_7283272927191169557-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (30382 bytes)
2017-03-13 11:49:35 End of local task; Time Taken: 3.073 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0033, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0033/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0033
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:49:57,659 Stage-2 map = 0%, reduce = 0%
2017-03-13 11:50:18,269 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.14 sec
2017-03-13 11:50:32,451 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 6.71 sec
MapReduce Total cumulative CPU time: 6 seconds 710 msec
Ended Job = job_1489117385526_0033
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0034, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0034/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0034
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:50:50,851 Stage-3 map = 0%, reduce = 0%
2017-03-13 11:51:06,025 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.22 sec
2017-03-13 11:51:19,065 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 5.14 sec
MapReduce Total cumulative CPU time: 5 seconds 140 msec
Ended Job = job_1489117385526_0034
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 6.71 sec HDFS Read: 64575 HDFS Write: 1104 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 5.14 sec HDFS Read: 6338 HDFS Write: 1037 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 850 msec
OK
46 9653
44 9436
10 8068
2 7918
15 7470
9 7456
40 7391
14 7388
43 7109
37 6671
6 5980
11 5969
18 5858
30 5853
29 5841
22 5819
39 5737
20 5597
13 5564
41 5503
21 5306
12 4998
23 4991
28 4830
35 4829
33 4528
19 4347
25 4109
36 4017
32 3910
17 3698
1 3468
7 3459
16 3344
26 3328
31 3286
3 3188
4 3141
8 2639
42 2557
45 2473
24 2457
34 2454
47 2254
27 2137
38 1796
5 1534
0 1232
Time taken: 131.25 seconds, Fetched: 48 row(s)

Query total consumption per province
1. HQL statement

[hadoop@hadoopa command]$ cat province_prince_list.sql
select province,sum(price) as totalPrice
from record join user_dimension on record.uid=user_dimension.uid
group by user_dimension.province
order by totalPrice desc;

2. Execution result

[hadoop@hadoopa command]$ hive -f province_prince_list.sql
which: no hbase in (/usr/lib64/qt-3.3/bin:.:/home/hadoop/apache-hive-2.1.0-bin/bin:/home/hadoop/maven3.3.9/bin:/home/hadoop/hadoop-2.7.3/bin:/home/hadoop/hadoop-2.7.3/sbin:/home/hadoop/apache-flume-1.7.0-bin/bin:/home/hadoop/jdk1.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313115315_d285100f-73a1-44b4-8ffd-9537cbea48e9
Total jobs = 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-03-13 11:53:44 Starting to launch local task to process map join; maximum memory = 518979584
2017-03-13 11:53:48 Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/a03a87ad-de38-4374-b1b9-6e0042dd455d/hive_2017-03-13_11-53-15_150_9191727649602197768-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:53:48 Uploaded 1 File to: file:/tmp/hadoop/a03a87ad-de38-4374-b1b9-6e0042dd455d/hive_2017-03-13_11-53-15_150_9191727649602197768-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (35297 bytes)
2017-03-13 11:53:48 End of local task; Time Taken: 4.051 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0035, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0035/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0035
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:54:20,334 Stage-2 map = 0%, reduce = 0%
2017-03-13 11:54:45,944 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 5.22 sec
2017-03-13 11:55:06,202 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 9.18 sec
MapReduce Total cumulative CPU time: 9 seconds 180 msec
Ended Job = job_1489117385526_0035
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0036, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0036/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0036
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:55:31,708 Stage-3 map = 0%, reduce = 0%
2017-03-13 11:55:51,447 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.66 sec
2017-03-13 11:56:15,088 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 6.75 sec
MapReduce Total cumulative CPU time: 6 seconds 750 msec
Ended Job = job_1489117385526_0036
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 9.18 sec HDFS Read: 62835 HDFS Write: 1045 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 6.75 sec HDFS Read: 6333 HDFS Write: 938 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 930 msec
OK
ZheJiang 11501
ShanDong 11336
LiaoNing 10519
JiLin 10341
FuJian 9742
XiangGang 9371
QingHai 9329
ShanXi3 9283
GuiZhou 9148
HaiNan 9037
HuBei 8868
NeiMengGu 8245
BeiJing 8035
JiangXi 7706
AnHui 7382
HuNan 6666
ShangHai 6441
JiangSu 6025
TaiWan 5988
ShanXi1 5862
ChongQing 5854
XinJiang 5557
HeNan 5390
XiZang 5322
TianJin 4984
YunNan 4695
GuangXi 4646
HeiLongJiang 4639
GanSu 4288
GuangDong 3753
SiChuan 3627
Aomen 3233
NingXia 2444
HeBei 1334
Time taken: 182.306 seconds, Fetched: 34 row(s)

Step 5. Create the user_dimension_orc, brand_dimension_orc and record_orc tables, which store their data in ORC format
[hadoop@hadoopa command]$ cat create_table_user_orc.sql
create external table if not exists user_dimension_orc (
  uid STRING,
  name STRING,
  gender STRING,
  birth DATE,
  province STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
location 'hdfs://hadoopA:8020/warehouse/user_dimension';

[hadoop@hadoopa command]$ cat create_table_brand_orc.sql
create external table if not exists brand_dimension_orc (
  bid STRING,
  category STRING,
  brand STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
location 'hdfs://hadoopA:8020/warehouse/brand_dimension';

[hadoop@hadoopa command]$ cat create_table_record_orc.sql
create table if not exists record_orc (
  rid STRING,
  uid STRING,
  bid STRING,
  trancation_date TIMESTAMP,
  price INT,
  source_province STRING,
  target_province STRING,
  site STRING,
  express_number STRING,
  express_company STRING
) PARTITIONED BY (
  partition_date STRING,
  hour INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;

Step 6. Load the data into the three ORC tables
hive> insert into table brand_dimension_orc select * from brand_dimension;
hive> insert into table user_dimension_orc select * from user_dimension;
hive> insert overwrite table record_orc partition (partition_date='2017-03-10',hour=22)
      select rid,uid,bid,trancation_date,price,source_province,target_province,site,express_number,express_company
      from record;
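To confirm the ORC copies hold the same data, the brand-revenue query from Step 4 can be re-run against the ORC tables and compared with the earlier result (a sketch):

select brand, sum(price) as totalPrice
from record_orc join brand_dimension_orc on record_orc.bid = brand_dimension_orc.bid
group by brand_dimension_orc.brand
order by totalPrice desc;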