Mahout k-means in Practice

xiaoxiao, 2021-03-25

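Of the clustering algorithms, k-means is the one I used most and actually operated hands-on during the project, so this post covers only the overall input-process-output (IPO) flow of running Mahout k-means, plus a few details.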

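1. Main k-means parameters and points to note:

① --input (-i): the input, either a single file or a directory. During the project I mainly used .txt and .csv files.

② --output (-o): the output directory.

③ --distanceMeasure (-dm): the distance measure; the default is squared Euclidean distance.

④ --clusters (-c): the path to the initial cluster centers, which must be a SequenceFile. If the K parameter (-k) is set, the data at this path is overwritten.

⑤ --convergenceDelta (-cd): the threshold for exiting the loop, 0.5 by default. It is used to decide whether the criterion function has reached the threshold.

⑥ --maxIter (-x): the maximum number of iterations.

⑦ --overwrite (-ow): if present, the output path is overwritten.

⑧ --clustering (-cl): whether to assign the data points to clusters. If present, a clusteredPoints directory is generated.

⑨ --method (-xm): the execution method, single machine (sequential) or cluster (mapreduce); the default is cluster.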

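Notes: both the input and the output paths are directories on HDFS. The K value is optional; once K is set, the initial cluster centers no longer need to be configured. The maximum number of iterations must always be set, otherwise the loop keeps running for as long as the convergence threshold has not been reached. K can be chosen from experience, from the user's requirements, or from a rough pre-clustering with Canopy.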
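To make the interplay of -cd and -x concrete, here is a minimal single-machine sketch of a k-means loop. It is not Mahout's implementation: the class name KMeansStopSketch and the rule "stop once no center moved by more than convergenceDelta under squared Euclidean distance" are assumptions chosen for illustration.

```java
import java.util.Arrays;

/** Toy k-means loop showing how convergenceDelta and maxIter end the iteration (illustrative only). */
class KMeansStopSketch {

    /** Squared Euclidean distance between two points of equal length. */
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the center closest to the point. */
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredEuclidean(point, centers[c]) < squaredEuclidean(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Updates centers in place and returns the number of iterations executed.
     * Stops early when every center moved by at most convergenceDelta,
     * otherwise stops after maxIter passes.
     */
    static int run(double[][] points, double[][] centers, double convergenceDelta, int maxIter) {
        int dim = points[0].length;
        for (int iter = 1; iter <= maxIter; iter++) {
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            boolean converged = true;
            for (int c = 0; c < centers.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (squaredEuclidean(updated, centers[c]) > convergenceDelta) {
                    converged = false;
                }
                centers[c] = updated;
            }
            if (converged) {
                return iter;
            }
        }
        return maxIter;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centers = {{0, 0}, {10, 10}};
        int iterations = run(points, centers, 0.1, 20);
        System.out.println(iterations + " iteration(s), centers = " + Arrays.deepToString(centers));
    }
}
```

With a larger delta the loop stops sooner at the cost of less settled centers; the iteration cap is what bounds the run time when the threshold is never met.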

2. The basic procedure is as follows:

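One of the data sets used in the project was a pediatric pneumonia data set from traditional Chinese medicine, with roughly 70 attributes and 6,000 records. The format is roughly as follows (I forgot to save the original input file...):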

1 0 3 2 -1 3 4 2 3 2 2 0 1 3 4 1 1 1 1 -1 3 -1 3 2 0 0 0 -1 3 3 3 -1 2 0 -2

Here the source file is referred to as pneumonia.txt.
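Since the conversion step expects whitespace-separated numbers, it can be worth checking the text file before uploading it. The following is a small sketch for that, assuming every record should have a fixed number of numeric fields; the class name RecordParserSketch and the method parseRecord are hypothetical helpers, not part of Mahout.

```java
import java.util.Arrays;

/** Parses one whitespace-separated record and checks its dimensionality (illustrative helper). */
class RecordParserSketch {

    /**
     * Splits the line on whitespace and converts every field to a double.
     * Throws IllegalArgumentException when the field count differs from expectedDims,
     * and NumberFormatException when a field is not a number (for example two values glued together).
     */
    static double[] parseRecord(String line, int expectedDims) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != expectedDims) {
            throw new IllegalArgumentException(
                    "expected " + expectedDims + " fields but found " + fields.length);
        }
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        // A shortened five-field record in the same style as the sample above.
        double[] record = parseRecord("1 0 3 2 -1", 5);
        System.out.println(Arrays.toString(record));
    }
}
```

Running this over every line of the file, with expectedDims set to the real attribute count, surfaces malformed rows before they reach the cluster.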

(1) First upload the input file to HDFS with the following commands:

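bin/hadoop fs -mkdir lyn/mahout
bin/hadoop fs -put <local path> lyn/mahout

(2) The uploaded file is plain text, but Mahout only processes files in SequenceFile format, and clustering requires vectors. The text file therefore has to be converted into a SequenceFile of vectors: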

bin/mahout org.apache.mahout.clustering.conversion.InputDriver -i lyn/mahout/pneumonia.txt -o lyn/mahout/vecfile -v org.apache.mahout.math.RandomAccessSparseVector

(3) Run the k-means algorithm:

bin/mahout kmeans -i lyn/mahout/vecfile -o lyn/mahout/result -c lyn/mahout/clu -x 20 -k 3 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl

(4) Analyze the results:

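The output directory contains several folders: clusteredPoints, clusters-x, clusters-(x+1)-final and so on, where x is the iteration number. The result of each iteration is stored in clusters-x, and the result of the last iteration (x+1) is stored in clusters-(x+1)-final. clusteredPoints also holds the final clustering result, but the two do not store the same thing: one stores clusters, the other stores points.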

The format stored in clusters-(x+1)-final is similar to the figure at http://blog.csdn.net/dr_guo/article/details/52861328.

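In clusteredPoints, the key is the cluster a point belongs to and the value is the data object itself. (To my surprise, I found the results from back then on a USB drive.)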

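The files in the k-means result directories are SequenceFiles, so opening them directly with cat only shows garbage rather than the readable content described above. Use the clusterdump command to convert the SequenceFiles to text and write the text to a local path: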

bin/mahout clusterdump -i lyn/mahout/result/clusters-3-final -p lyn/mahout/result/clusteredPoints -o <local path>

The readable results can then be found at that local path.
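The number in the final directory name depends on how many iterations actually ran, so it has to be looked up before calling clusterdump. Below is a small sketch that picks the final directory from a listing of names. It assumes the directories follow the pattern clusters-N-final; the class name FinalClusterDirSketch and the hard-coded listing are illustrative only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Picks the final-iteration directory out of a list of output directory names (illustrative helper). */
class FinalClusterDirSketch {

    // Assumed naming pattern of the final iteration directory.
    private static final Pattern FINAL_DIR = Pattern.compile("clusters-(\\d+)-final");

    /** Returns the matching name with the highest iteration number, or empty if none matches. */
    static Optional<String> findFinalDir(List<String> names) {
        String best = null;
        int bestIteration = -1;
        for (String name : names) {
            Matcher m = FINAL_DIR.matcher(name);
            if (m.matches()) {
                int iteration = Integer.parseInt(m.group(1));
                if (iteration > bestIteration) {
                    bestIteration = iteration;
                    best = name;
                }
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Hypothetical listing of the result directory.
        List<String> listing = Arrays.asList(
                "clusteredPoints", "clusters-0", "clusters-1", "clusters-2", "clusters-3-final");
        System.out.println(findFinalDir(listing).orElse("no final directory found"));
    }
}
```

In practice the listing would come from something like hadoop fs -ls on the result directory, and the returned name would be passed to the -i option.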

3. Post-processing the k-means results

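The project needed to visualize the results, so the result file had to be processed further: the points of each cluster are written to a separate file per cluster. The code is as follows: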

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test1 {

    // One dumped record: group 2 is the cluster id, group 6 is the vector body inside [...].
    private static final Pattern RECORD =
            Pattern.compile("(Key: )([\\d]{1,3})(:)(.*)(\\[)(.*)(\\])");

    /* Earlier test routine (originally a second main method run against e://test.txt):
     * prints cluster id + column index + value for every entry of every record.
     */
    static void printEntries(String input) throws IOException {
        Pattern p2 = Pattern.compile("([\\d]{1,3})(:)(.)(.000)");
        DecimalFormat df = new DecimalFormat("000");
        System.out.println("begin");
        try (BufferedReader bf = new BufferedReader(new FileReader(input))) {
            String s;
            while ((s = bf.readLine()) != null) {
                Matcher m = RECORD.matcher(s);
                while (m.find()) {
                    String cla = df.format(Integer.parseInt(m.group(2)));
                    Matcher m2 = p2.matcher(m.group(6));
                    while (m2.find()) {
                        // column name + value
                        System.out.println(cla + df.format(Integer.parseInt(m2.group(1))) + m2.group(3));
                    }
                }
            }
        }
        System.out.println("end");
    }

    /* Formats the output of the clustering run so that it can be used
     * for visualization and as input for association rules.
     * The parameter input is the path of the file to process.
     */
    static void guanlian(String input) throws IOException {
        Pattern p2 = Pattern.compile("([\\d]+)(:)([-]?\\d)(.)");
        DecimalFormat df = new DecimalFormat("00");
        try (BufferedReader bf = new BufferedReader(new FileReader(input))) {
            String line;
            while ((line = bf.readLine()) != null) {
                Matcher m = RECORD.matcher(line);
                while (m.find()) {
                    Matcher m1 = p2.matcher(m.group(6));
                    StringBuilder newline = new StringBuilder();
                    while (m1.find()) {
                        // two-digit column index followed by the integer part of the value
                        newline.append(df.format(Integer.parseInt(m1.group(1))))
                               .append(m1.group(3)).append(' ');
                    }
                    // One file per cluster id; append mode creates the file if it is missing.
                    String fileName = "e:\\guanlian\\" + m.group(2) + ".txt";
                    System.out.println(fileName);
                    System.out.println(newline);
                    try (BufferedWriter bw = new BufferedWriter(new FileWriter(fileName, true))) {
                        bw.write(newline.toString());
                        bw.newLine();
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        guanlian("e://bat.txt");
        System.out.println("complete");
    }
}

The processed result files and their contents look like this:

The contents of 93.txt are as follows:

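At this point the results are essentially processed. The next step can use them as the input file for association-rule mining or classification algorithms.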
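To show what the two regular expressions in the code above extract, here is a self-contained sketch that applies them to a single line. The sample line is invented for illustration and only mimics the "Key: ... [index:value, ...]" shape the patterns expect; it is not taken from the original results. The class name ClusterLineSketch is likewise hypothetical.

```java
import java.text.DecimalFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Applies the record and entry patterns from the post to one dumped line (illustrative helper). */
class ClusterLineSketch {

    // Same patterns as in the post: cluster id is group 2, vector body is group 6.
    private static final Pattern RECORD =
            Pattern.compile("(Key: )([\\d]{1,3})(:)(.*)(\\[)(.*)(\\])");
    private static final Pattern ENTRY = Pattern.compile("([\\d]+)(:)([-]?\\d)(.)");

    /** Returns {clusterId, formattedEntries}, or null when the line is not a record. */
    static String[] format(String line) {
        Matcher record = RECORD.matcher(line);
        if (!record.find()) {
            return null;
        }
        DecimalFormat twoDigits = new DecimalFormat("00");
        StringBuilder entries = new StringBuilder();
        Matcher entry = ENTRY.matcher(record.group(6));
        while (entry.find()) {
            // two-digit column index followed by the integer part of the value
            entries.append(twoDigits.format(Integer.parseInt(entry.group(1))))
                   .append(entry.group(3))
                   .append(' ');
        }
        return new String[] {record.group(2), entries.toString()};
    }

    public static void main(String[] args) {
        // Invented sample line, not real output.
        String sample = "Key: 93: Value: 1.0: [0:1.000, 1:0.000, 4:-1.000, 12:3.000]";
        String[] result = format(sample);
        System.out.println("file: " + result[0] + ".txt");
        System.out.println("line: " + result[1]);
    }
}
```

For this sample the line written to 93.txt would be "001 010 04-1 123 ": each token is a two-digit column index followed directly by the value, which means the token boundaries are only unambiguous as long as values are single digits with an optional minus sign.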

Appendix: writing the command lines into a batch (shell) script and running that script from Java

import java.io.*;

public class bat2 {

    // Writes the HDFS and Mahout commands, one per line, into the script file s.
    public void creatBat(String s) {
        String str = "/mahout/mahout-distribution-0.9/p.txt";
        String[] strarray = str.split("/");
        String strfilename = strarray[strarray.length - 1];
        String strpath = "/user/hadoop/mahout6/" + strfilename;
        String strclass = "20";
        Integer kind = Integer.parseInt(strclass); // passed to kmeans as the -x value
        try (FileWriter fw = new FileWriter(s)) {
            fw.write("hadoop fs -rmr /user/hadoop/mahout6" + "\n");
            fw.write("hadoop fs -mkdir -p /user/hadoop/mahout6" + "\n");
            fw.write("hadoop fs -put " + str + " /user/hadoop/mahout6" + "\n");
            fw.write("mahout org.apache.mahout.clustering.conversion.InputDriver -i " + strpath
                    + " -o /user/hadoop/mahout6/vecfile -v org.apache.mahout.math.RandomAccessSparseVector" + "\n");
            fw.write("mahout kmeans -i /user/hadoop/mahout6/vecfile -o /user/hadoop/mahout6/result1"
                    + " -c /user/hadoop/mahout6/clu1 -x " + kind + " -k 3 -cd 0.1"
                    + " -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl" + "\n");
            fw.write("mahout seqdumper -i /user/hadoop/mahout6/result1/clusteredPoints/part-m-00000"
                    + " -o /mahout/test/bat.txt");
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1); // non-zero exit status signals the failure
        }
    }

    // Makes the script executable, runs it, echoes its output, and reports "success" or "fail".
    private String execute(String batname) {
        File file = new File(batname);
        if (!file.exists()) {
            System.out.println("File does not exist.");
            return "fail";
        }
        file.setExecutable(true);
        file.setReadable(true);
        file.setWritable(true);
        try {
            // Merge stderr into stdout so that unread error output cannot block the script.
            Process process = new ProcessBuilder(batname).redirectErrorStream(true).start();
            try (BufferedReader br = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            }
            if (process.waitFor() != 0) {
                System.out.println("fail");
                return "fail";
            }
            System.out.println(batname + " run successful!");
            return "success";
        } catch (Exception e) {
            e.printStackTrace();
            return "fail";
        }
    }

    public static void main(String[] args) {
        String batname = "del.sh";
        File file = new File(batname);
        if (file.exists()) {
            file.delete();
        }
        bat2 df = new bat2();
        System.out.println(file.getAbsolutePath());
        df.creatBat(file.getAbsolutePath());
        df.execute(file.getAbsolutePath());
    }
}

When reposting, please cite the original address: https://ju.6miu.com/read-20615.html
