Hadoop 2.7.2
References:
Maven + Hadoop: http://www.cnblogs.com/Leo_wl/p/4862820.html and http://blog.csdn.net/kongxx/article/details/42339581
Log cleaning: http://www.cnblogs.com/edisonchou/p/4458219.html

1. Create a Maven project

In Eclipse: New -> Maven Project. The Hadoop artifacts are listed at http://mvnrepository.com/search?q=hadoop-mapreduce-client

groupId: com
artifactId: first

Dependencies:
hadoop-common
hadoop-hdfs
hadoop-mapreduce-client-core
hadoop-mapreduce-client-jobclient
hadoop-mapreduce-client-common

I also added hadoop-yarn-common, but it can be left out.

pom.xml (note: change the versions to match your own installation):

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>jdk.tools</groupId>
        <artifactId>jdk.tools</artifactId>
        <version>1.8</version>
        <scope>system</scope>
        <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-yarn-common</artifactId>
        <version>2.7.2</version>
    </dependency>
</dependencies>

Save the file to start the build. When it finishes, the dependency jars appear under Maven Dependencies.

2. Create the LogCleanJob class

The code is in the appendix at the end of this post (for a detailed walkthrough see the original article: http://www.cnblogs.com/edisonchou/p/4458219.html).

Note: pom.xml must include the assembly plugin; exporting the jar directly from Eclipse kept failing, and I never found the cause. Also, the @Override on the run method in the original article did not compile, so I commented it out.

E:\fm-workspace\workspace_2\first>mvn assembly:assembly

cd into first\target; the fat jar is first-0.0.1-SNAPSHOT-jar-with-dependencies.jar:

E:\fm-workspace\workspace_2\first\target>dir
2016/08/13 18:21 <DIR>          .
2016/08/13 18:21 <DIR>          ..
2016/08/13 18:19 <DIR>          archive-tmp
2016/08/13 17:34 <DIR>          classes
2016/08/13 18:21     42,996,951 first-0.0.1-SNAPSHOT-jar-with-dependencies.jar
2016/08/13 18:21          9,266 first-0.0.1-SNAPSHOT.jar
2016/08/13 18:19 <DIR>          maven-archiver
2016/08/13 17:31 <DIR>          maven-status
2016/08/13 18:19 <DIR>          surefire-reports
2016/08/13 17:34 <DIR>          test-classes
               2 File(s)     43,006,217 bytes
               8 Dir(s)  113,821,888,512 bytes free

Rename first-0.0.1-SNAPSHOT-jar-with-dependencies.jar to first.jar and copy it to the Linux machine:

root@py-server:/projects/data# ll
total 42008
drwxr-xr-x 4 root root     4096 Aug 13 18:52 ./
drwxr-xr-x 7 root root     4096 Aug 11 16:29 ../
-rw-r--r-- 1 root root 42996951 Aug 13 18:21 first.jar
drwxr-xr-x 2 root root     4096 Aug 13 15:36 hadoop-logs/
drwxr-xr-x 2 root root     4096 Aug  3 21:04 test/

5. Upload the data to HDFS

The data files (around 200 MB) can be downloaded via the original article: http://www.cnblogs.com/edisonchou/p/4458219.html. You can also use your own log files, as long as the format is the same.

root@py-server:/projects/data/hadoop-logs# ll
total 213056
drwxr-xr-x 2 root root      4096 Aug 13 15:36 ./
drwxr-xr-x 4 root root      4096 Aug 13 18:25 ../
-rw-r--r-- 1 root root  61084192 Apr 26  2015 access_2013_05_30.log
-rw-r--r-- 1 root root 157069653 Apr 26  2015 access_2013_05_31.log

The default HDFS working directory is /user/root/:

root@py-server:/projects/data# hadoop fs -put hadoop-logs/ .
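To double-check the upload without the shell, the directory can also be listed through the HDFS Java API. Here is a minimal sketch (my own illustration, not from the original article; the class name HdfsLs is made up, and it assumes the cluster's core-site.xml is on the classpath so fs.defaultFS resolves):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLs {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Print path and size for each uploaded log file
        for (FileStatus st : fs.listStatus(new Path("/user/root/hadoop-logs"))) {
            System.out.println(st.getPath() + "\t" + st.getLen() + " bytes");
        }
        fs.close();
    }
}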
root@py-server:/projects/data# hadoop fs -ls
Found 14 items
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 .sparkStaging
drwxr-xr-x   - root supergroup          0 2016-08-13 15:38 hadoop-logs
-rw-r--r--   2 root supergroup      85285 2016-08-06 07:59 imdb_labelled.txt
-rw-r--r--   2 root supergroup         72 2016-08-04 09:29 kmeans_data.txt
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 kmeans_result
drwxr-xr-x   - root supergroup          0 2016-08-05 16:16 kmeans_result.txt
-rw-r--r--   2 root supergroup      43914 2016-08-04 12:33 ks_aio.py
drwxr-xr-x   - root supergroup          0 2016-08-09 10:51 mymlresult
drwxr-xr-x   - root supergroup          0 2016-08-09 10:28 naive_bayes_result
-rw-r--r--   2 root supergroup      66288 2016-08-09 23:57 price_data.txt
-rw-r--r--   2 root supergroup       1619 2016-08-08 17:54 price_data2.txt
-rw-r--r--   2 root supergroup       1619 2016-08-09 09:13 price_train_data.txt
-rw-r--r--   2 root supergroup        120 2016-08-04 09:24 sample_kmeans_data.txt
-rw-r--r--   2 root supergroup     104736 2016-08-08 17:14 sample_libsvm_data.txt

6. Run the job on Hadoop

root@py-server:/projects/data# hadoop jar first.jar /user/root/hadoop-logs/ /user/root/logcleanjob_output

Result: impressively fast, the whole run took only 36 seconds.
Check it in the Hadoop UI (mine is at py-server:8088):
User: root
Name: LogCleanJob
Application Type: MAPREDUCE
Application Tags:
YarnApplicationState: FINISHED
FinalStatus Reported by AM: SUCCEEDED
Started: Sat Aug 13 18:46:18 +0800 2016
Elapsed: 36sec
Tracking URL: History
Diagnostics: Clean process success!

root@py-server:/projects/data# hadoop fs -ls /user/root/
Found 15 items
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 /user/root/.sparkStaging
drwxr-xr-x   - root supergroup          0 2016-08-13 18:45 /user/root/hadoop-logs
-rw-r--r--   2 root supergroup      85285 2016-08-06 07:59 /user/root/imdb_labelled.txt
-rw-r--r--   2 root supergroup         72 2016-08-04 09:29 /user/root/kmeans_data.txt
drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 /user/root/kmeans_result
drwxr-xr-x   - root supergroup          0 2016-08-05 16:16 /user/root/kmeans_result.txt
-rw-r--r--   2 root supergroup      43914 2016-08-04 12:33 /user/root/ks_aio.py
drwxr-xr-x   - root supergroup          0 2016-08-13 18:46 /user/root/logcleanjob_output
drwxr-xr-x   - root supergroup          0 2016-08-09 10:51 /user/root/mymlresult
drwxr-xr-x   - root supergroup          0 2016-08-09 10:28 /user/root/naive_bayes_result
-rw-r--r--   2 root supergroup      66288 2016-08-09 23:57 /user/root/price_data.txt
-rw-r--r--   2 root supergroup       1619 2016-08-08 17:54 /user/root/price_data2.txt
-rw-r--r--   2 root supergroup       1619 2016-08-09 09:13 /user/root/price_train_data.txt
-rw-r--r--   2 root supergroup        120 2016-08-04 09:24 /user/root/sample_kmeans_data.txt
-rw-r--r--   2 root supergroup     104736 2016-08-08 17:14 /user/root/sample_libsvm_data.txt

root@py-server:/projects/data# hadoop fs -ls /user/root/logcleanjob_output
Found 2 items
-rw-r--r--   2 root supergroup          0 2016-08-13 18:46 /user/root/logcleanjob_output/_SUCCESS
-rw-r--r--   2 root supergroup   50810594 2016-08-13 18:46 /user/root/logcleanjob_output/part-r-00000

root@py-server:/projects/data# hadoop fs -cat /user/root/logcleanjob_output/part-r-00000
(excerpt)
118.112.191.88 20130530204006 source/plugin/wsh_wx/img/wsh_zk.css
113.107.237.31 20130530204005 thread-10500-1-1.html
110.251.129.203 20130531081904 forum.php?mod=ajax&action=forumchecknew&fid=111&time=1369959258&inajax=yes
118.112.191.88 20130530204006 data/cache/style_1_common.css?y7a
220.231.55.69 20130530204005 home.php?mod=spacecp&ac=pm&op=checknewpm&rand=1369917603
110.75.174.58 20130531081903 thread-21066-1-1.html
118.112.191.88 20130530204006 data/cache/style_1_forum_viewthread.css?y7a
110.75.174.55 20130531081904 home.php?do=thread&from=space&mod=space&uid=71469&view=me
14.17.29.89 20130530204006 home.php?mod=misc&ac=sendmail&rand=1369917604
121.25.131.148 20130531081906 data/attachment/common/c2/common_12_usergroup_icon.jpg
59.174.191.135 20130530204003 forum.php?mod=forumdisplay&fid=111&page=1&filter=author&orderby=dateline
118.112.191.88 20130530204007 data/attachment/common/65/common_11_usergroup_icon.jpg
121.25.131.148 20130531081905 home.php?mod=misc&ac=sendmail&rand=1369959541
101.229.199.98 20130530204007 data/cache/style_1_widthauto.css?y7a
59.174.191.135 20130530204005 home.php?mod=space&uid=71081&do=profile&from=space
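The same spot check can be scripted from Java; here is a minimal sketch that prints the first few cleaned records (again my own illustration, not from the original article; the class name HeadOutput is made up, and it assumes fs.defaultFS is configured on the classpath):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HeadOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path part = new Path("/user/root/logcleanjob_output/part-r-00000");
        // Stream the reducer output and show only the first 10 records
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(part), "UTF-8"))) {
            String line;
            int shown = 0;
            while ((line = in.readLine()) != null && shown++ < 10) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}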
#######################################

Troubleshooting:

1. What to do when the Maven build is interrupted (see http://www.cnblogs.com/tangyanbo/p/4329303.html): right-click the project, select Maven -> Update Project, and tick the Force option. With Force checked you do not need to delete the partially downloaded files first, which is handy when a large number of jars failed to download. Then rebuild.

2. hadoop jar and the main class: the assembly plugin already writes the main class into the jar's manifest, so putting the class name right after first.jar makes Hadoop treat it as the first program argument; the job then complains that the input folder cannot be found. Run it without the class name:

hadoop jar first.jar /user/root/hadoop-logs/ /user/root/logcleanjob_output
#######################################

Appendix: LogCleanJob.java

package com.first;
//package techbbs;

import java.net.URI;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LogCleanJob extends Configured implements Tool {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        try {
            int res = ToolRunner.run(conf, new LogCleanJob(), args);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // @Override  (commented out: it did not compile in my environment)
    public int run(String[] args) throws Exception {
        final Job job = new Job(new Configuration(), LogCleanJob.class.getSimpleName());
        // Allow the job to be packaged into a jar and run on the cluster
        job.setJarByClass(LogCleanJob.class);
        FileInputFormat.setInputPaths(job, args[0]);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Delete the output directory if it already exists
        FileSystem fs = FileSystem.get(new URI(args[0]), getConf());
        Path outPath = new Path(args[1]);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }

        boolean success = job.waitForCompletion(true);
        if (success) {
            System.out.println("Clean process success!");
        } else {
            System.out.println("Clean process failed!");
        }
        return 0;
    }

    static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser logParser = new LogParser();
        Text outputValue = new Text();

        protected void map(
                LongWritable key,
                Text value,
                org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text>.Context context)
                throws java.io.IOException, InterruptedException {
            final String[] parsed = logParser.parse(value.toString());

            // Step 1: filter out requests for static resources
            if (parsed[2].startsWith("GET /static/") || parsed[2].startsWith("GET /uc_server")) {
                return;
            }
            // Step 2: strip the leading "GET /" or "POST /"
            if (parsed[2].startsWith("GET /")) {
                parsed[2] = parsed[2].substring("GET /".length());
            } else if (parsed[2].startsWith("POST /")) {
                parsed[2] = parsed[2].substring("POST /".length());
            }
            // Step 3: strip the trailing " HTTP/1.1"
            if (parsed[2].endsWith(" HTTP/1.1")) {
                parsed[2] = parsed[2].substring(0, parsed[2].length() - " HTTP/1.1".length());
            }
            // Step 4: emit only the first three fields (ip, time, url)
            outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
            context.write(key, outputValue);
        }
    }

    static class MyReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
        protected void reduce(
                LongWritable k2,
                java.lang.Iterable<Text> v2s,
                org.apache.hadoop.mapreduce.Reducer<LongWritable, Text, Text, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        }
    }

    /*
     * Log parser
     */
    static class LogParser {
        public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
                "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
                "yyyyMMddHHmmss");

        public static void main(String[] args) throws ParseException {
            final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
            LogParser parser = new LogParser();
            final String[] array = parser.parse(S1);
            System.out.println("Sample line: " + S1);
            System.out.format(
                    "Parsed result: ip=%s, time=%s, url=%s, status=%s, traffic=%s",
                    array[0], array[1], array[2], array[3], array[4]);
        }

        /**
         * Parse the English-locale timestamp string.
         */
        private Date parseDateFormat(String string) {
            Date parse = null;
            try {
                parse = FORMAT.parse(string);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return parse;
        }

        /**
         * Parse one log line.
         *
         * @return an array of 5 elements: ip, time, url, status, traffic
         */
        public String[] parse(String line) {
            String ip = parseIP(line);
            String time = parseTime(line);
            String url = parseURL(line);
            String status = parseStatus(line);
            String traffic = parseTraffic(line);
            return new String[] { ip, time, url, status, traffic };
        }

        private String parseTraffic(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
            String traffic = trim.split(" ")[1];
            return traffic;
        }

        private String parseStatus(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
            String status = trim.split(" ")[0];
            return status;
        }

        private String parseURL(String line) {
            final int first = line.indexOf("\"");
            final int last = line.lastIndexOf("\"");
            String url = line.substring(first + 1, last);
            return url;
        }

        private String parseTime(String line) {
            final int first = line.indexOf("[");
            final int last = line.indexOf("+0800]");
            String time = line.substring(first + 1, last).trim();
            Date date = parseDateFormat(time);
            return dateformat1.format(date);
        }

        private String parseIP(String line) {
            String ip = line.split("- -")[0].trim();
            return ip;
        }
    }
}
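Since junit 3.8.1 is already declared in the pom, the parser's behavior can be pinned down with a small JUnit 3-style test. This is my own sketch, not part of the original article; the file location src/test/java/com/first/LogParserTest.java is an assumption (it must sit in package com.first because LogParser is package-private):

package com.first;

import junit.framework.TestCase;

public class LogParserTest extends TestCase {

    public void testParseSampleLine() {
        String line = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] "
                + "\"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
        String[] fields = new LogCleanJob.LogParser().parse(line);

        assertEquals("27.19.74.143", fields[0]);   // ip
        assertEquals("20130530173820", fields[1]); // time, reformatted to yyyyMMddHHmmss
        assertEquals("GET /static/image/common/faq.gif HTTP/1.1", fields[2]); // url
        assertEquals("200", fields[3]);            // status
        assertEquals("1127", fields[4]);           // traffic
    }
}

Run it with mvn test.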
######################################################

Complete pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com</groupId>
    <artifactId>first</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>first</name>
    <url>http://maven.apache.org</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>jdk.tools</groupId>
            <artifactId>jdk.tools</artifactId>
            <version>1.8</version>
            <scope>system</scope>
            <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-common</artifactId>
            <version>2.7.2</version>
        </dependency>
    </dependencies>
    <build>
        <defaultGoal>compile</defaultGoal>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.first.LogCleanJob</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>