RDD的partition分片,每个partition由一个task来处理 //parallelize演示 val num=sc.parallelize(1 to 10) 创建一个1到10的数组,默认和executor的个数一样 val doublenum=num.map(_*2) 数组的每个值乘以 2 val threenum=doublenum.filter(_%3 ==0) 过滤出 每个与3取余等于0的值 threenum.collect action操作 threenum.toDebugString 查看依赖
val num1=sc.parallelize(1 to 10,6) 设置partition为6 val doublenum1=num1.map(_*2) val threenum1=doublenum1.filter(_%3==0) threenum1.collect threenum1.toDebugString
threenum.cache() 缓存到内存,只有当有action的操作的时候才会真正执行 val fournum=threenum.map(x=>x*x) fournum.collect fournum.toDebugString threenum.unpersist() 取消缓存
num.reduce(+) 汇总 num.take(5) 取前五个 num.first 第一个 num.count() 统计 num.take(5).foreach(println) 取前五个 循环打印出来
k-v 样例 val kv1=sc.parallelize(List((“A”,1),(“B”,2),(“C”,3),(“A”,4),(“B”,5))) kv1.sortByKey().collect 根据key来排序 Array[(String, Int)] = Array((A,1), (A,4), (B,2), (B,5), (C,3)) kv1.groupByKey().collect 根据Key来汇总 Array[(String, Iterable[Int])] = Array((B,CompactBuffer(2, 5)), (A,CompactBuffer(1, 4)), (C,CompactBuffer(3))) kv1.reduceByKey(+).collect 根据Key来统计 Array[(String, Int)] = Array((B,7), (A,5), (C,3))
val kv2=sc.parallelize(List((“A”,4),(“A”,4),(“C”,3),(“A”,4),(“B”,5))) kv2.distinct.collect 去重 Array[(String, Int)] = Array((A,4), (B,5), (C,3)) kv1.union(kv2).collect 关联 Array[(String, Int)] = Array((A,1), (B,2), (C,3), (A,4), (B,5), (A,4), (A,4), (C,3), (A,4), (B,5))
val kv3=sc.parallelize(List((“A”,10),(“B”,20),(“D”,30))) kv1.join(kv3).collect Array[(String, (Int, Int))] = Array((B,(2,20)), (B,(5,20)), (A,(1,10)), (A,(4,10)))
kv1.cogroup(kv3).collect 同一个Key对应的Value组合到一起 Array[(String, (Iterable[Int], Iterable[Int]))] = Array((B,(CompactBuffer(2, 5),CompactBuffer(20))), (A,(CompactBuffer(1, 4),CompactBuffer(10))), (C,(CompactBuffer(3),CompactBuffer())), (D,(CompactBuffer(),CompactBuffer(30))))
val kv4=sc.parallelize(List(List(1,2),List(3,4))) kv4.flatMap(x=>x.map(_+1)).collect
读取文件 val rdd1=sc.textFile(“hdfs://192.168.18.144:9000/data/customers.csv”) val words=rdd1.flatMap(_.split(“,”)) 分词 val wordscount=words.map(x=>(x,1)).reduceByKey(+) 将相同的key相加起来 wordscount.collect