Big Data Basics (10): Building a Hadoop Log-Cleaning Project with Maven (Part 2)

xiaoxiao  2025-01-28

Maven Hadoop log-cleaning project (Part 2). Environment: Hadoop 2.7.2, Hive 2.1.0, Sqoop 1.4.6. Reference: http://www.cnblogs.com/edisonchou/p/4464349.html

1. Load the cleaned HDFS files into Hive

To run the statistics with Hive, we first need to get the cleaned data into Hive, which means creating a table. The original article uses a partitioned table with the date as the partition key; here a plain (non-partitioned) external table mapped directly to the cleaned output directory is used instead (a partitioned variant is sketched just before step 7.2 below). The key point is the HDFS location the table maps to; in my case it is /user/root/logcleanjob_output, the directory where the cleaned data was written.

The files:

root@py-server:/projects/data# hadoop fs -ls /user/root/logcleanjob_output
Found 2 items
-rw-r--r--   2 root supergroup          0 2016-08-13 18:46 /user/root/logcleanjob_output/_SUCCESS
-rw-r--r--   2 root supergroup   50810594 2016-08-13 18:46 /user/root/logcleanjob_output/part-r-00000

hive> create database logtest;
hive> CREATE EXTERNAL TABLE techbbs(ip string, atime string, url string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/user/root/logcleanjob_output';

Verify:

hive> select * from techbbs;
119.127.191.86   20130531235956  forum.php?mod=viewthread&tid=11528&page=60&authorid=53387
157.56.177.164   20130531235957  api.php?mod=js&bid=65
223.240.215.151  20130531235958  source/plugin/pcmgr_url_safeguard/url_api.inc.php
112.64.235.246   20130531235957  home.php?mod=misc&ac=sendmail&rand=1370014195
49.74.113.251    20130531235958  home.php?mod=spacecp&ac=follow&op=checkfeed&rand=1370015996
117.79.176.9     20130531235958  home.php?mod=space&do=notice
Time taken: 0.097 seconds, Fetched: 672261 row(s)

2. PV (page views)

Page views (PV) are the total number of pages viewed by all users: every page an individual user opens is counted once. Here we only need to count the records in the log, so the HQL is a simple COUNT:

hive> CREATE TABLE techbbs_pv_2015_04_25 AS SELECT COUNT(1) AS PV FROM techbbs;
hive> select * from techbbs_pv_2015_04_25;
OK
672261
Time taken: 0.128 seconds, Fetched: 1 row(s)

3. Registered users

The forum's registration page is member.php, and clicking "register" requests the URL member.php?mod=register. So we only need to count the log entries whose URL contains member.php?mod=register:

hive> CREATE TABLE techbbs_reguser_2015_04_25 AS SELECT COUNT(1) AS REGUSER FROM techbbs WHERE INSTR(url,'member.php?mod=register')>0;
hive> select * from techbbs_reguser_2015_04_25;
OK
551
Time taken: 0.087 seconds, Fetched: 1 row(s)

4. Key metric 3: unique IPs

The number of distinct IP addresses that visited the site within one day; an IP counts once no matter how many pages it viewed. As in plain SQL, the DISTINCT keyword does the job in HQL:

hive> CREATE TABLE techbbs_ip_2015_04_25 AS SELECT COUNT(DISTINCT ip) AS IP FROM techbbs;
hive> select * from techbbs_ip_2015_04_25;
OK
31535
Time taken: 0.118 seconds, Fetched: 1 row(s)

5. Key metric 4: bounce count

The number of visits that viewed only a single page before leaving the site. We group the records by IP; any group containing exactly one record is a bounce, and counting those groups gives the bounce count:

hive> CREATE TABLE techbbs_jumper_2015_04_25 AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM techbbs GROUP BY ip HAVING times=1) e;
hive> show tables;
OK
techbbs
techbbs_ip_2015_04_25
techbbs_jumper_2015_04_25
techbbs_pv_2015_04_25
techbbs_reguser_2015_04_25
Time taken: 0.01 seconds, Fetched: 5 row(s)
hive> select * from techbbs_jumper_2015_04_25;
OK
10622
Time taken: 0.081 seconds, Fetched: 1 row(s)

6. Gather all key metrics into one summary table for the Sqoop export

To export everything to MySQL in one go with Sqoop, we combine the results above into a single summary table by joining them; since each source table holds exactly one row, joining ON 1=1 simply produces the cross product, i.e. one combined row:

hive> CREATE TABLE techbbs_2015_04_25 AS SELECT '2015_04_25', a.pv, b.reguser, c.ip, d.jumper FROM techbbs_pv_2015_04_25 a JOIN techbbs_reguser_2015_04_25 b ON 1=1 JOIN techbbs_ip_2015_04_25 c ON 1=1 JOIN techbbs_jumper_2015_04_25 d ON 1=1;
hive> select * from techbbs_2015_04_25;
OK
2015_04_25  672261  551  31535  10622
Time taken: 0.104 seconds, Fetched: 1 row(s)

7. Export to MySQL with Sqoop

7.1 Preparation: create the result table in MySQL

(1) Step 1: create a new database, techbbs:

mysql> create database techbbs;
Query OK, 1 row affected (0.00 sec)

(2) Step 2: create a new table, techbbs_logs_stat:

mysql> create table techbbs_logs_stat(logdate varchar(10) primary key, pv int, reguser int, ip int, jumper int);
Query OK, 0 rows affected (0.24 sec)
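An aside before the export: step 1 mentioned the partitioned-table approach used in the original article. Below is a minimal sketch of what that variant could look like, assuming the cleaned output were laid out one subdirectory per day; the table name techbbs_part and that directory layout are illustrative assumptions, not part of this walkthrough.

-- Sketch only: assumes cleaned data at /user/root/logcleanjob_output/<logdate>/
CREATE EXTERNAL TABLE techbbs_part(ip string, atime string, url string)
PARTITIONED BY (logdate string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/root/logcleanjob_output';

-- Attach each day's directory as a partition:
ALTER TABLE techbbs_part ADD PARTITION (logdate='2015_04_25')
LOCATION '/user/root/logcleanjob_output/2015_04_25';

With that layout, each day's statistics can simply filter on WHERE logdate='...' instead of remapping the techbbs table to a new day's data.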
7.2 The export, via sqoop export

The Hive warehouse directories holding the result tables:

root@py-server:/projects/data# hadoop fs -ls /user/hive/warehouse/logtest.db
Found 5 items
drwxrwxr-x   - root supergroup          0 2016-08-13 21:59 /user/hive/warehouse/logtest.db/techbbs_2015_04_25
drwxrwxr-x   - root supergroup          0 2016-08-13 21:54 /user/hive/warehouse/logtest.db/techbbs_ip_2015_04_25
drwxrwxr-x   - root supergroup          0 2016-08-13 21:57 /user/hive/warehouse/logtest.db/techbbs_jumper_2015_04_25
drwxrwxr-x   - root supergroup          0 2016-08-13 21:48 /user/hive/warehouse/logtest.db/techbbs_pv_2015_04_25
drwxrwxr-x   - root supergroup          0 2016-08-13 21:53 /user/hive/warehouse/logtest.db/techbbs_reguser_2015_04_25

mysql> show tables;
+---------------------------+
| Tables_in_hive            |
+---------------------------+
| techbbs_logs_stat         |
+---------------------------+

(Note: the Tables_in_hive header above and the JDBC URL below show that the target table actually lives in the MySQL database named hive, not in the techbbs database created in step 7.1.)

(1) Step 1: run the export command:

sqoop export --connect jdbc:mysql://py-server:3306/hive --username root --password fm1106 --table techbbs_logs_stat --fields-terminated-by '\001' --export-dir '/user/hive/warehouse/logtest.db/techbbs_2015_04_25'

Here --export-dir points at the summary table's location under the Hive warehouse, in my case /user/hive/warehouse/logtest.db/techbbs_2015_04_25, and --fields-terminated-by '\001' matches Hive's default field delimiter.

16/08/13 22:13:04 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=539988
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=849
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=19
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
	Job Counters
		Launched map tasks=4
		Data-local map tasks=4
		Total time spent by all maps in occupied slots (ms)=17659
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=17659
		Total vcore-milliseconds taken by all map tasks=17659
		Total megabyte-milliseconds taken by all map tasks=18082816
	Map-Reduce Framework
		Map input records=1
		Map output records=1
		Input split bytes=741
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=463
		CPU time spent (ms)=3280
		Physical memory (bytes) snapshot=711569408
		Virtual memory (bytes) snapshot=7689871360
		Total committed heap usage (bytes)=581435392
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=0
16/08/13 22:13:04 INFO mapreduce.ExportJobBase: Transferred 849 bytes in 17.7487 seconds (47.8344 bytes/sec)
16/08/13 22:13:04 INFO mapreduce.ExportJobBase: Exported 1 records.

(2) Step 2: check the exported result:

mysql> select * from techbbs_logs_stat;
+------------+--------+---------+-------+--------+
| logdate    | pv     | reguser | ip    | jumper |
+------------+--------+---------+-------+--------+
| 2015_04_25 | 672261 |     551 | 31535 |  10622 |
+------------+--------+---------+-------+--------+
1 row in set (0.00 sec)
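With the summary row now in MySQL, derived ratios can be computed with ordinary SQL. A small sketch against the techbbs_logs_stat table created in step 7.1 (the ratio definitions are illustrative, not from the original article):

-- Share of distinct IPs that bounced, and average page views per distinct IP:
SELECT logdate,
       jumper / ip AS bounce_ratio,
       pv / ip     AS avg_pv_per_ip
FROM techbbs_logs_stat;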

Postscript:

For the cron-based scheduling and the date-parameterized version of these steps, please refer to the original article (http://www.cnblogs.com/edisonchou/p/4464349.html); a rough sketch of the parameterization follows.
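A minimal sketch of what that date parameterization could look like, assuming a wrapper cron job invokes Hive as hive -hiveconf logdate=2015_04_26 -f daily_stat.hql (the file name and variable are assumptions; Hive substitutes ${hiveconf:...} as plain text before parsing, so it can appear inside table names):

-- daily_stat.hql (sketch): one day's PV table, named after the supplied date
CREATE TABLE techbbs_pv_${hiveconf:logdate} AS
SELECT COUNT(1) AS PV FROM techbbs;
-- The reguser, ip, jumper, and summary tables from steps 3-6 would be
-- parameterized the same way, followed by the sqoop export.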

Thanks to the original author!

When reposting, please credit the original source: https://ju.6miu.com/read-1295854.html