Big Data Fundamentals (9): Building a Hadoop Log-Cleaning Project with Maven (Part 1)

xiaoxiao, 2025-01-28

Maven Hadoop Log-Cleaning Project (Part 1)

Hadoop 2.7.2

References:
Maven + Hadoop:
http://www.cnblogs.com/Leo_wl/p/4862820.html
http://blog.csdn.net/kongxx/article/details/42339581
Log cleaning:
http://www.cnblogs.com/edisonchou/p/4458219.html

1. Create the Maven project

In Eclipse: New -> Maven Project.
Look up the Hadoop artifacts at http://mvnrepository.com/search?q=hadoop-mapreduce-client
groupId: com, artifactId: first

Dependencies:
hadoop-common
hadoop-hdfs
hadoop-mapreduce-client-core
hadoop-mapreduce-client-jobclient
hadoop-mapreduce-client-common
I also added hadoop-yarn-common; you can leave it out.

pom.xml [note: change the version to match your own Hadoop]:

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>jdk.tools</groupId>
        <artifactId>jdk.tools</artifactId>
        <version>1.8</version>
        <scope>system</scope>
        <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-yarn-common</artifactId>
        <version>2.7.2</version>
    </dependency>
</dependencies>

Save the file and the build kicks off. Once it completes you can see the dependency jars under Maven Dependencies.

2. Create the LogCleanJob class

The code is in the appendix at the end of this post [for a line-by-line explanation see the original article: http://www.cnblogs.com/edisonchou/p/4458219.html].
Note: pom.xml must include the assembly plugin; exporting the jar straight from Eclipse kept failing and I never found out why. Also, the @Override on the run method in the original code would not compile, so I commented it out. (This usually means the project is still compiling at the Java 1.5 source level, where @Override is not allowed on methods that implement an interface method; raising maven-compiler-plugin's source/target to 1.7 or 1.8, see the note after the complete pom.xml below, normally fixes it.)
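A side note on the driver code: the new Job(conf, name) constructor used in the appendix still works but is marked deprecated in Hadoop 2.x. As a sketch (same class names as the appendix; only the construction and the return value change), run() could be written with the non-deprecated factory method instead:

    // Drop-in variant of run() using the non-deprecated Job.getInstance factory.
    // Everything else (imports, mapper, reducer) stays as in the appendix below.
    public int run(String[] args) throws Exception {
        // reuse the Configuration that ToolRunner injected instead of creating a new one
        Job job = Job.getInstance(getConf(), LogCleanJob.class.getSimpleName());
        job.setJarByClass(LogCleanJob.class);
        FileInputFormat.setInputPaths(job, args[0]);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // remove the output directory if it already exists
        FileSystem fs = FileSystem.get(new URI(args[0]), getConf());
        Path outPath = new Path(args[1]);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        // returning 1 on failure lets the process exit code reflect the job status,
        // which the appendix version (always "return 0") does not
        return job.waitForCompletion(true) ? 0 : 1;
    }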
3. Package the jar

E:\fm-workspace\workspace_2\first>mvn assembly:assembly

Then under first\target you get first-0.0.1-SNAPSHOT-jar-with-dependencies.jar:

E:\fm-workspace\workspace_2\first\target>dir
2016/08/13  18:21    <DIR>          .
2016/08/13  18:21    <DIR>          ..
2016/08/13  18:19    <DIR>          archive-tmp
2016/08/13  17:34    <DIR>          classes
2016/08/13  18:21        42,996,951 first-0.0.1-SNAPSHOT-jar-with-dependencies.jar
2016/08/13  18:21             9,266 first-0.0.1-SNAPSHOT.jar
2016/08/13  18:19    <DIR>          maven-archiver
2016/08/13  17:31    <DIR>          maven-status
2016/08/13  18:19    <DIR>          surefire-reports
2016/08/13  17:34    <DIR>          test-classes
               2 File(s)     43,006,217 bytes
               8 Dir(s)  113,821,888,512 bytes free

4. Copy the jar to the Linux server

Rename first-0.0.1-SNAPSHOT-jar-with-dependencies.jar to first.jar and copy it over:

root@py-server:/projects/data# ll
total 42008
drwxr-xr-x 4 root root     4096 Aug 13 18:52 ./
drwxr-xr-x 7 root root     4096 Aug 11 16:29 ../
-rw-r--r-- 1 root root 42996951 Aug 13 18:21 first.jar
drwxr-xr-x 2 root root     4096 Aug 13 15:36 hadoop-logs/
drwxr-xr-x 2 root root     4096 Aug  3 21:04 test/

5. Upload the data to HDFS

Grab the data files from the original post (http://www.cnblogs.com/edisonchou/p/4458219.html); they are about 200 MB in total. You can also use your own log files, as long as the format matches.

root@py-server:/projects/data/hadoop-logs# ll
total 213056
drwxr-xr-x 2 root root      4096 Aug 13 15:36 ./
drwxr-xr-x 4 root root      4096 Aug 13 18:25 ../
-rw-r--r-- 1 root root  61084192 Apr 26  2015 access_2013_05_30.log
-rw-r--r-- 1 root root 157069653 Apr 26  2015 access_2013_05_31.log

The HDFS default path is /user/root/:

root@py-server:/projects/data# hadoop fs -put hadoop-logs/ .
root@py-server:/projects/data# hadoop fs -ls
Found 14 items
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 .sparkStaging
drwxr-xr-x   - root supergroup          0 2016-08-13 15:38 hadoop-logs
-rw-r--r--   2 root supergroup      85285 2016-08-06 07:59 imdb_labelled.txt
-rw-r--r--   2 root supergroup         72 2016-08-04 09:29 kmeans_data.txt
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 kmeans_result
drwxr-xr-x   - root supergroup          0 2016-08-05 16:16 kmeans_result.txt
-rw-r--r--   2 root supergroup      43914 2016-08-04 12:33 ks_aio.py
drwxr-xr-x   - root supergroup          0 2016-08-09 10:51 mymlresult
drwxr-xr-x   - root supergroup          0 2016-08-09 10:28 naive_bayes_result
-rw-r--r--   2 root supergroup      66288 2016-08-09 23:57 price_data.txt
-rw-r--r--   2 root supergroup       1619 2016-08-08 17:54 price_data2.txt
-rw-r--r--   2 root supergroup       1619 2016-08-09 09:13 price_train_data.txt
-rw-r--r--   2 root supergroup        120 2016-08-04 09:24 sample_kmeans_data.txt
-rw-r--r--   2 root supergroup     104736 2016-08-08 17:14 sample_libsvm_data.txt

6. Hadoop test

root@py-server:/projects/data# hadoop jar first.jar /user/root/hadoop-logs/ /user/root/logcleanjob_output

Result: super fast, the whole run took only 36 seconds!

Check it in the Hadoop UI (mine is at py-server:8088):

User: root
Name: LogCleanJob
Application Type: MAPREDUCE
Application Tags:
YarnApplicationState: FINISHED
FinalStatus Reported by AM: SUCCEEDED
Started: Sat Aug 13 18:46:18 +0800 2016
Elapsed: 36sec
Tracking URL: History
Diagnostics:

Clean process success!

root@py-server:/projects/data# hadoop fs -ls /user/root/
Found 15 items
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 /user/root/.sparkStaging
drwxr-xr-x   - root supergroup          0 2016-08-13 18:45 /user/root/hadoop-logs
-rw-r--r--   2 root supergroup      85285 2016-08-06 07:59 /user/root/imdb_labelled.txt
-rw-r--r--   2 root supergroup         72 2016-08-04 09:29 /user/root/kmeans_data.txt
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 /user/root/kmeans_result
drwxr-xr-x   - root supergroup          0 2016-08-05 16:16 /user/root/kmeans_result.txt
-rw-r--r--   2 root supergroup      43914 2016-08-04 12:33 /user/root/ks_aio.py
drwxr-xr-x   - root supergroup          0 2016-08-13 18:46 /user/root/logcleanjob_output
drwxr-xr-x   - root supergroup          0 2016-08-09 10:51 /user/root/mymlresult
drwxr-xr-x   - root supergroup          0 2016-08-09 10:28 /user/root/naive_bayes_result
-rw-r--r--   2 root supergroup      66288 2016-08-09 23:57 /user/root/price_data.txt
-rw-r--r--   2 root supergroup       1619 2016-08-08 17:54 /user/root/price_data2.txt
-rw-r--r--   2 root supergroup       1619 2016-08-09 09:13 /user/root/price_train_data.txt
-rw-r--r--   2 root supergroup        120 2016-08-04 09:24 /user/root/sample_kmeans_data.txt
-rw-r--r--   2 root supergroup     104736 2016-08-08 17:14 /user/root/sample_libsvm_data.txt

root@py-server:/projects/data# hadoop fs -ls /user/root/logcleanjob_output
Found 2 items
-rw-r--r--   2 root supergroup          0 2016-08-13 18:46 /user/root/logcleanjob_output/_SUCCESS
-rw-r--r--   2 root supergroup   50810594 2016-08-13 18:46 /user/root/logcleanjob_output/part-r-00000

root@py-server:/projects/data# hadoop fs -cat /user/root/logcleanjob_output/part-r-00000
118.112.191.88 20130530204006 source/plugin/wsh_wx/img/wsh_zk.css
113.107.237.31 20130530204005 thread-10500-1-1.html
110.251.129.203 20130531081904 forum.php?mod=ajax&action=forumchecknew&fid=111&time=1369959258&inajax=yes
118.112.191.88 20130530204006 data/cache/style_1_common.css?y7a
220.231.55.69 20130530204005 home.php?mod=spacecp&ac=pm&op=checknewpm&rand=1369917603
110.75.174.58 20130531081903 thread-21066-1-1.html
118.112.191.88 20130530204006 data/cache/style_1_forum_viewthread.css?y7a
110.75.174.55 20130531081904 home.php?do=thread&from=space&mod=space&uid=71469&view=me
14.17.29.89 20130530204006 home.php?mod=misc&ac=sendmail&rand=1369917604
121.25.131.148 20130531081906 data/attachment/common/c2/common_12_usergroup_icon.jpg
59.174.191.135 20130530204003 forum.php?mod=forumdisplay&fid=111&page=1&filter=author&orderby=dateline
118.112.191.88 20130530204007 data/attachment/common/65/common_11_usergroup_icon.jpg
121.25.131.148 20130531081905 home.php?mod=misc&ac=sendmail&rand=1369959541
101.229.199.98 20130530204007 data/cache/style_1_widthauto.css?y7a
59.174.191.135 20130530204005 home.php?mod=space&uid=71081&do=profile&from=space

#######################################
Troubleshooting:

1. What to do when the Maven build is interrupted
See http://www.cnblogs.com/tangyanbo/p/4329303.html. Right-click the project: Maven -> Update Project and tick the Force option. With Force checked you do not need to delete the leftover files of jars that failed to download, which helps when a large number of jars did not come down. Then rebuild.

2. Do not pass the main class name to hadoop jar. The manifest of the jar-with-dependencies already names com.first.LogCleanJob as the main class, so if you put the class name right after first.jar (e.g. "hadoop jar first.jar com.first.LogCleanJob ..."), Hadoop passes it through as args[0], the input path, and the job complains that the input folder cannot be found. The correct invocation is:
hadoop jar first.jar /user/root/hadoop-logs/ /user/root/logcleanjob_output
#######################################
Appendix: LogCleanJob.java

package com.first;
//package techbbs;
import java.net.URI;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LogCleanJob extends Configured implements Tool {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        try {
            int res = ToolRunner.run(conf, new LogCleanJob(), args);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    //@Override  // does not compile at the 1.5 source level; see step 2
    public int run(String[] args) throws Exception {
        final Job job = new Job(new Configuration(),
                LogCleanJob.class.getSimpleName());
        // make the job runnable from a packaged jar
        job.setJarByClass(LogCleanJob.class);
        FileInputFormat.setInputPaths(job, args[0]);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // remove the output directory if it already exists
        FileSystem fs = FileSystem.get(new URI(args[0]), getConf());
        Path outPath = new Path(args[1]);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }

        boolean success = job.waitForCompletion(true);
        if (success) {
            System.out.println("Clean process success!");
        } else {
            System.out.println("Clean process failed!");
        }
        return 0;
    }

    static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser logParser = new LogParser();
        Text outputValue = new Text();

        protected void map(
                LongWritable key,
                Text value,
                org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text>.Context context)
                throws java.io.IOException, InterruptedException {
            final String[] parsed = logParser.parse(value.toString());
            // step 1: drop requests for static resources
            if (parsed[2].startsWith("GET /static/")
                    || parsed[2].startsWith("GET /uc_server")) {
                return;
            }
            // step 2: strip the leading method prefix
            if (parsed[2].startsWith("GET /")) {
                parsed[2] = parsed[2].substring("GET /".length());
            } else if (parsed[2].startsWith("POST /")) {
                parsed[2] = parsed[2].substring("POST /".length());
            }
            // step 3: strip the trailing protocol string
            if (parsed[2].endsWith(" HTTP/1.1")) {
                parsed[2] = parsed[2].substring(0, parsed[2].length()
                        - " HTTP/1.1".length());
            }
            // step 4: write out only the first three fields (ip, time, url)
            outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
            context.write(key, outputValue);
        }
    }

    static class MyReducer extends
            Reducer<LongWritable, Text, Text, NullWritable> {
        protected void reduce(
                LongWritable k2,
                java.lang.Iterable<Text> v2s,
                org.apache.hadoop.mapreduce.Reducer<LongWritable, Text, Text, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        };
    }

    /*
     * Log parser
     */
    static class LogParser {
        public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
                "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
                "yyyyMMddHHmmss");

        public static void main(String[] args) throws ParseException {
            final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
            LogParser parser = new LogParser();
            final String[] array = parser.parse(S1);
            System.out.println("Sample line: " + S1);
            System.out.format(
                    "Parsed result: ip=%s, time=%s, url=%s, status=%s, traffic=%s",
                    array[0], array[1], array[2], array[3], array[4]);
        }

        /**
         * Parse the English-format timestamp.
         *
         * @param string
         * @return
         * @throws ParseException
         */
        private Date parseDateFormat(String string) {
            Date parse = null;
            try {
                parse = FORMAT.parse(string);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return parse;
        }

        /**
         * Parse one log line.
         *
         * @param line
         * @return a 5-element array: ip, time, url, status, traffic
         */
        public String[] parse(String line) {
            String ip = parseIP(line);
            String time = parseTime(line);
            String url = parseURL(line);
            String status = parseStatus(line);
            String traffic = parseTraffic(line);
            return new String[] { ip, time, url, status, traffic };
        }

        private String parseTraffic(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String traffic = trim.split(" ")[1];
            return traffic;
        }

        private String parseStatus(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String status = trim.split(" ")[0];
            return status;
        }

        private String parseURL(String line) {
            final int first = line.indexOf("\"");
            final int last = line.lastIndexOf("\"");
            String url = line.substring(first + 1, last);
            return url;
        }

        private String parseTime(String line) {
            final int first = line.indexOf("[");
            final int last = line.indexOf("+0800]");
            String time = line.substring(first + 1, last).trim();
            Date date = parseDateFormat(time);
            return dateformat1.format(date);
        }

        private String parseIP(String line) {
            String ip = line.split("- -")[0].trim();
            return ip;
        }
    }
}
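Since the pom below already declares junit 3.8.1 in test scope, a quick sanity check for LogParser can live under src/test/java/com/first. This is only a sketch: the class name LogParserTest is my own choice, and the expected values follow from the sample line in LogParser.main above. Run it with mvn test.

package com.first;

import junit.framework.TestCase;

// JUnit 3 style test, matching the junit 3.8.1 dependency in the pom below
public class LogParserTest extends TestCase {

    public void testParseSampleLine() {
        String line = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
        String[] parsed = new LogCleanJob.LogParser().parse(line);
        assertEquals("27.19.74.143", parsed[0]);                              // ip
        assertEquals("20130530173820", parsed[1]);                            // time, reformatted
        assertEquals("GET /static/image/common/faq.gif HTTP/1.1", parsed[2]); // raw request
        assertEquals("200", parsed[3]);                                       // status
        assertEquals("1127", parsed[4]);                                      // traffic
    }
}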
######################################################
Complete pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com</groupId>
  <artifactId>first</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>first</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-common</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
      <version>1.8</version>
      <scope>system</scope>
      <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-common</artifactId>
      <version>2.7.2</version>
    </dependency>
  </dependencies>
  <build>
    <defaultGoal>compile</defaultGoal>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.first.LogCleanJob</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
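One last note, tied to the @Override issue from step 2: if the project still compiles at the old 1.5 source level, adding maven-compiler-plugin to the <plugins> block above should raise it. A sketch (the version number is an assumption; pick whatever is current for you):

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version> <!-- assumed version; any recent release should do -->
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>

With source level 1.6 or higher, @Override is legal on methods that implement an interface method, so the commented-out annotation on run() can be restored.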
Please credit the original source when reposting: https://ju.6miu.com/read-1295865.html