I. Download and extract the package
[hadoop@hadoopa ~]$ tar -zxvf apache-hive-2.1.0-bin.tar.gz
II. Install the MySQL database
1. Run the install command
[hadoop@hadoopa ~]$ sudo yum install -y mysql-server mysql mysql-devel
2. Configure MySQL to start on boot
[hadoop@hadoopa ~]$ sudo chkconfig mysqld on
3. Start MySQL
[hadoop@hadoopa ~]$ sudo service mysqld start
4. Set a password for the default root user
[hadoop@hadoopa ~]$ mysqladmin -u root password '123'
5. Log in to MySQL and grant privileges to the root user
mysql> grant all on *.* to root@'%' identified by '123';
mysql> grant all on *.* to root@'localhost' identified by '123';
mysql> grant all on *.* to root@'hadoopA' identified by '123';
mysql> flush privileges;
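Before moving on, it is worth confirming that the grants took effect. A minimal check (assuming the password '123' set above):

[hadoop@hadoopa ~]$ mysql -u root -p123 -h hadoopA -e "select user, host from mysql.user;"

If this prints the root entries for %, localhost, and hadoopA, Hive will be able to connect with the credentials configured below.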
III. Edit the configuration files
1. Add the Hadoop installation directory to hive-env.sh
HADOOP_HOME=/home/hadoop/hadoop-2.7.3
2. Edit hive-site.xml
[hadoop@hadoopa ~]$ cat /home/hadoop/apache-hive-2.1.0-bin/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoopA:9083</value>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoopA/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/warehouse</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoopA:8020</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoStartMechanism</name>
    <value>SchemaTable</value>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateTables</name>
    <value>true</value>
  </property>
  <property>
    <name>beeline.hs2.connection.user</name>
    <value>hadoop</value>
  </property>
  <property>
    <name>beeline.hs2.connection.password</name>
    <value>123</value>
  </property>
</configuration>
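One step the listing above assumes: the MySQL JDBC driver named in javax.jdo.option.ConnectionDriverName does not ship with Hive, so the connector jar has to be placed on Hive's classpath. A minimal sketch (the connector version here is illustrative, not taken from this setup):

[hadoop@hadoopa ~]$ cp mysql-connector-java-5.1.40-bin.jar /home/hadoop/apache-hive-2.1.0-bin/lib/

Without this jar, the metastore fails at startup with a ClassNotFoundException for com.mysql.jdbc.Driver.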
3. Add hive-log4j2.properties
[hadoop@hadoopa conf]$ cd /home/hadoop/apache-hive-2.1.0-bin/conf/
[hadoop@hadoopa conf]$ cat hive-log4j2.properties
status = INFO
name = HiveLog4j2
packages = org.apache.hadoop.hive.ql.log

# list of properties
property.hive.log.level = INFO
property.hive.root.logger = DRFA
property.hive.log.dir = /home/hadoop/apache-hive-2.1.0-bin/logs
property.hive.log.file = hive.log
property.hive.perflogger.log.level = INFO

# list of all appenders
appenders = console, DRFA

# console appender
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n

# daily rolling file appender
appender.DRFA.type = RollingRandomAccessFile
appender.DRFA.name = DRFA
appender.DRFA.fileName = ${sys:hive.log.dir}/${sys:hive.log.file}
# Use %pid in the filePattern to append <process-id>@<host-name> to the filename if you want separate log files for different CLI sessions
appender.DRFA.filePattern = ${sys:hive.log.dir}/${sys:hive.log.file}.%d{yyyy-MM-dd}
appender.DRFA.layout.type = PatternLayout
appender.DRFA.layout.pattern = %d{ISO8601} %-5p [%t]: %c{2} (%F:%M(%L)) - %m%n
appender.DRFA.policies.type = Policies
appender.DRFA.policies.time.type = TimeBasedTriggeringPolicy
appender.DRFA.policies.time.interval = 1
appender.DRFA.policies.time.modulate = true
appender.DRFA.strategy.type = DefaultRolloverStrategy
appender.DRFA.strategy.max = 30

# list of all loggers
loggers = NIOServerCnxn, ClientCnxnSocketNIO, DataNucleus, Datastore, JPOX, PerfLogger
logger.NIOServerCnxn.name = org.apache.zookeeper.server.NIOServerCnxn
logger.NIOServerCnxn.level = WARN
logger.ClientCnxnSocketNIO.name = org.apache.zookeeper.ClientCnxnSocketNIO
logger.ClientCnxnSocketNIO.level = WARN
logger.DataNucleus.name = DataNucleus
logger.DataNucleus.level = ERROR
logger.Datastore.name = Datastore
logger.Datastore.level = ERROR
logger.JPOX.name = JPOX
logger.JPOX.level = ERROR
logger.PerfLogger.name = org.apache.hadoop.hive.ql.log.PerfLogger
logger.PerfLogger.level = ${sys:hive.perflogger.log.level}

# root logger
rootLogger.level = ${sys:hive.log.level}
rootLogger.appenderRefs = root
rootLogger.appenderRef.root.ref = ${sys:hive.root.logger}
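With datanucleus.autoCreateSchema set to true in hive-site.xml, Hive creates the metastore tables in MySQL on first start. On Hive 2.x the documented alternative is to initialize the schema explicitly with schematool before starting anything; a hedged sketch:

[hadoop@hadoopa apache-hive-2.1.0-bin]$ bin/schematool -dbType mysql -initSchema

Either way works for this walkthrough; schematool just surfaces problems (bad JDBC URL, missing driver jar, wrong credentials) before the services come up.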
IV. Start Hive
1. Start the metastore service
[hadoop@hadoopa apache-hive-2.1.0-bin]$ bin/hive --service metastore &
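To confirm the metastore is actually up before starting HiveServer2, check that something is listening on the Thrift port configured in hive.metastore.uris (9083 here), for example:

[hadoop@hadoopa apache-hive-2.1.0-bin]$ netstat -nlt | grep 9083

Startup problems are written to /home/hadoop/apache-hive-2.1.0-bin/logs/hive.log, per the Log4j2 configuration above.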
2. Start the HiveServer2 service
[hadoop@hadoopa apache-hive-2.1.0-bin]$ bin/hive --service hiveserver2 &
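Once HiveServer2 is up, you can also connect through Beeline using the thrift port and the beeline.hs2.connection.* credentials from hive-site.xml; a minimal sketch:

[hadoop@hadoopa apache-hive-2.1.0-bin]$ bin/beeline -u jdbc:hive2://hadoopA:10000 -n hadoop -p 123
0: jdbc:hive2://hadoopA:10000> show databases;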
V. Enter the Hive CLI
[hadoop@hadoopa apache-hive-2.1.0-bin]$ bin/hive
VI. Handling errors when starting the Hive services: fixing permission errors
Hive's data model
1. Databases: the same idea as a database in a relational DBMS.
2. Tables: the same idea as a table in a relational DBMS.
   - Each table corresponds to a directory on HDFS.
3. Partitions (optional): special columns used to optimize storage and queries.
   - Each table's directory is subdivided into one subdirectory per partition.
   - Partitioning a table avoids unnecessary brute-force scans of the data.
   - To avoid producing too many small files, prefer partitioning on coarse, low-cardinality columns.
4. Buckets (optional): a special way of organizing the data within a partition.
   - A column with many distinct values can be split into a fixed number of buckets.
   - Used via CLUSTERED BY ... INTO n BUCKETS (see the sketch after this list).
5. Files: the physical storage units of the actual data.
6. File formats, chosen by the user:
   - The default is plain text (TEXTFILE); for text files the user must explicitly specify the delimiters.
   - Other supported formats: SequenceFile, Avro, RC/ORC/Parquet, and user-defined formats (custom InputFormat and OutputFormat).
   - Data compression is supported: Bzip2/Gzip, LZO, Snappy.
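A hedged illustration of items 3 and 4, not taken from this cluster: a table partitioned on a low-cardinality column and bucketed on a high-cardinality one. The table and column names are hypothetical; the point is the PARTITIONED BY / CLUSTERED BY combination described above.

CREATE TABLE IF NOT EXISTS user_actions (
  user_id BIGINT,   -- high-cardinality column, used for bucketing
  action  STRING
)
PARTITIONED BY (dt STRING)              -- one HDFS subdirectory per day
CLUSTERED BY (user_id) INTO 32 BUCKETS  -- a fixed number of files per partition
STORED AS TEXTFILE;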
Running Hive commands
1. Commands run from the shell
[hadoop@hadoopa apache-hive-2.1.0-bin]$ hive -e "show databases"
[hadoop@hadoopa command]$ hive -f employees.sql
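hive -e is handy in scripts; adding -S (silent mode) suppresses the log output so only the query result reaches stdout. A small sketch (the output file name is illustrative):

[hadoop@hadoopa command]$ hive -S -e "select * from employees" > employees.out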
2. Commands run inside the Hive shell
hive> ! pwd;
/home/hadoop/apache-hive-2.1.0-bin/conf
hive> dfs -ls /out;
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2017-03-11 18:47 /out/_SUCCESS
-rw-r--r--   3 hadoop supergroup          0 2017-03-11 18:47 /out/part-r-00000
hive>
3. Syntax for creating a complex table
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name(
  col_name data_type, ...
)
[PARTITIONED BY (col_name data_type, ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)]
[[ROW FORMAT row_format] [STORED AS file_format]]
[LOCATION hdfs_path]

Table creation statement:
[hadoop@hadoopa command]$ cat employees.sql
CREATE TABLE IF NOT EXISTS employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  decutions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/hadoop/apache-hive-2.1.0-bin/examples/data/employees.txt'
OVERWRITE INTO TABLE employees;

View the table schema:
hive> describe employees;
OK
name                    string
salary                  float
subordinates            array<string>
decutions               map<string,float>
address                 struct<street:string,city:string,state:string,zip:int>
Time taken: 1.449 seconds, Fetched: 5 row(s)

View the generated CREATE statement:
hive> show create table employees;
OK
CREATE TABLE `employees`(
  `name` string,
  `salary` float,
  `subordinates` array<string>,
  `decutions` map<string,float>,
  `address` struct<street:string,city:string,state:string,zip:int>)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'colelction.delim'='\u0002',
  'field.delim'='\u0001',
  'line.delim'='\n',
  'mapkey.delim'='\u0003',
  'serialization.format'='\u0001')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hadoopA:8020/warehouse/employees'
TBLPROPERTIES (
  'numFiles'='1',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='429',
  'transient_lastDdlTime'='1489303083')
Time taken: 1.528 seconds, Fetched: 26 row(s)

Load data:
LOAD DATA LOCAL INPATH '/home/hadoop/apache-hive-2.1.0-bin/examples/data/employees.txt'
OVERWRITE INTO TABLE employees;

Data format of employees.txt (^A, ^B, ^C are the \001, \002, \003 delimiters):
John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600
Mary Smith^A80000.0^ABill King^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A100 Ontario St.^BChicago^BIL^B60601
Todd Jones^A70000.0^A^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.1^A200 Chicago Ave.^BOak Park^BIL^B60700
Bill King^A60000.0^A^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.1^A300 Obscure Dr.^BObscuria^BIL^B60100
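To see how the complex columns behave, a hedged query sketch against the table above, using standard collection and struct access syntax:

hive> SELECT name, subordinates[0], decutions['Federal Taxes'], address.city FROM employees;

For John Doe this should return Mary Smith as the first subordinate, 0.2 for federal taxes, and Chicago as the city, matching the ^A/^B/^C-delimited data shown above.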
4. Create a partitioned table
Dynamic partitioning requires non-strict mode:
set hive.exec.dynamic.partition.mode=nonstrict;

Create the partitioned table:
CREATE TABLE IF NOT EXISTS employees_part (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  decutions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (state STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

FROM employees e
INSERT OVERWRITE TABLE employees_part PARTITION(state)
SELECT e.*, e.address.state;

Inspect the data loaded into the dynamic partitions:
hive> dfs -lsr /warehouse/;
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x   - hadoop supergroup          0 2017-03-12 14:59 /warehouse/demo1
drwxr-xr-x   - hadoop supergroup          0 2017-03-12 15:18 /warehouse/employees
-rwxr-xr-x   3 hadoop supergroup        429 2017-03-12 15:18 /warehouse/employees/employees.txt
drwxr-xr-x   - hadoop supergroup          0 2017-03-12 16:15 /warehouse/employees_part
drwxr-xr-x   - hadoop supergroup          0 2017-03-12 16:15 /warehouse/employees_part/state=IL
-rwxr-xr-x   3 hadoop supergroup        441 2017-03-12 16:15 /warehouse/employees_part/state=IL/000000_0
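The partitions themselves can be listed without touching HDFS; since every sample row has an Illinois address, a single partition is expected:

hive> show partitions employees_part;
OK
state=IL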
5. Create an external table
Statement for creating an external table:
CREATE EXTERNAL TABLE Los(
  domain_id INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hivetest/logs';
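The defining property of an external table is that Hive does not own the data: as noted in the drop-table section below, dropping it removes only the metadata. A hedged sketch:

hive> DROP TABLE Los;
hive> dfs -ls /user/hivetest/logs;

The second command should still list the files; only the table definition is gone.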
6. Using different file formats
STORED AS file_format
- STORED AS PARQUET
- STORED AS ORC
- STORED AS SEQUENCEFILE
- STORED AS AVRO
- STORED AS TEXTFILE
- custom, user-defined formats
7. How to create a compressed ORC table: the steps
Raw data (stored on HDFS) -> Hive staging table (text or SequenceFile) -> Hive ORC table
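A hedged sketch of those steps; the table and column names are illustrative, and 'orc.compress'='SNAPPY' is the standard ORC table property for enabling Snappy compression:

-- staging table over the raw text data on HDFS
CREATE EXTERNAL TABLE logs_text (line STRING)
LOCATION '/data/raw/logs';

-- compressed ORC table
CREATE TABLE logs_orc (line STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- rewrite the text data into ORC
INSERT OVERWRITE TABLE logs_orc SELECT line FROM logs_text;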
8. Drop databases and tables
DROP DATABASE [IF EXISTS] database_name;
DROP TABLE [IF EXISTS] table_name;
- The metadata is deleted.
- For managed (internal) tables, the data is deleted as well.
- For external tables, the data is NOT deleted.

HiveQL
1. DML: Load, Insert, Update, Delete
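One caveat worth a sketch: in Hive 2.1, UPDATE and DELETE only work on tables created as transactional (stored as ORC, bucketed, with 'transactional'='true') and with the transaction manager enabled. The table name and bucket count below are hypothetical:

set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

CREATE TABLE employees_acid (id INT, name STRING)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE employees_acid SET name = 'John' WHERE id = 1;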
2. Select
- WITH clause as a temporary table:
  with q1 as (select key from src where key = '5')
  select * from q1;
- Group By
- Joins: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, LEFT SEMI-JOIN, Map-side Joins (see the sketch after this list)
  Two distributed join algorithms: Map-side Join (broadcast join) and Reduce-side Join (shuffle join)
- Sort/Distribute/Cluster/Order By
- Transform and Map-Reduce Scripts
- Operators and User-Defined Functions (UDFs)
- Union
- Lateral View:
  SELECT pageid, adid FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;
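A hedged sketch of the map-side (broadcast) join mentioned in the Joins item: when one side is small enough to fit in memory, Hive ships it to every mapper and skips the shuffle. The dimension table dim_state here is hypothetical:

set hive.auto.convert.join=true;                 -- let Hive pick a map-side join automatically
set hive.mapjoin.smalltable.filesize=25000000;   -- size threshold in bytes for the small table

SELECT /*+ MAPJOIN(d) */ e.name, d.state_name
FROM employees_part e JOIN dim_state d ON e.state = d.state_code;

The MAPJOIN hint forces the broadcast strategy; with auto-conversion enabled, the hint is usually unnecessary.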
3. Explain
1. Syntax
EXPLAIN [EXTENDED] query
2. Purpose: show the execution plan of a query
3. Example
hive> explain select * from employees;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: employees
          Statistics: Num rows: 1 Data size: 429 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: name (type: string), salary (type: float), subordinates (type: array<string>), decutions (type: map<string,float>), address (type: struct<street:string,city:string,state:string,zip:int>)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4
            Statistics: Num rows: 1 Data size: 429 Basic stats: COMPLETE Column stats: NONE
            ListSink

Time taken: 2.996 seconds, Fetched: 17 row(s)