Optimizing Spark RDDs with Alluxio
1.1 Reading an RDD
> val domain = sc.textFile("alluxio://sparktest-m1:19998/chengdu/dnsDomainChengdu_distinct")
1.2 Writing an RDD
> val result = domain.map(line => (line, 1L)).reduceByKey(_ + _)
> result.saveAsTextFile("hdfs://sparktest-m1:9000/chengdu/tes")
2. Caching RDDs
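2.1 Caching an RDD with the Spark API
When computing with Spark, the usual way to improve performance is to call cache or persist from the Spark API. Spark then caches the RDD's data in the JVM memory of the executors, and the next use of that data reads it directly from that memory. The storage levels include: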
1) MEMORY_ONLY: stores Java objects in the Spark JVM memory
2) MEMORY_ONLY_SER: stores serialized Java objects in the Spark JVM memory
3) DISK_ONLY: stores the data on the local disk
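The three storage levels listed above can be tried side by side. The following is a minimal sketch rather than a reference implementation: the object name PersistSketch, the helper countTwice, and the generated sample data are all hypothetical, and the local master is only a fallback for a quick trial on one machine.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Persists the RDD at the given level, counts it twice, then frees the cache.
  // The first count computes the RDD and fills the cache; the second reads from it.
  def countTwice(rdd: RDD[String], level: StorageLevel): (Long, Long) = {
    rdd.persist(level)
    val first = rdd.count()
    val second = rdd.count()
    // A different level can only be assigned after the RDD has been unpersisted.
    rdd.unpersist(blocking = true)
    (first, second)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: fall back to a local master when none is supplied by spark-submit.
    val conf = new SparkConf().setAppName("PersistSketch").setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical sample data standing in for a real input file.
    val rdd = sc.parallelize(1 to 100000).map(i => s"line-$i")

    val levels = Seq(StorageLevel.MEMORY_ONLY, StorageLevel.MEMORY_ONLY_SER, StorageLevel.DISK_ONLY)
    for (level <- levels) {
      val (first, second) = countTwice(rdd, level)
      println(s"${level.description}: first=$first, second=$second")
    }

    sc.stop()
  }
}
```

Calling rdd.cache() or rdd.persist() without an argument is equivalent to persist(StorageLevel.MEMORY_ONLY), so only the other two levels need to be named explicitly.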
> import org.apache.spark.storage.StorageLevel
> rdd.persist(StorageLevel.MEMORY_ONLY)   // or rdd.persist(), or rdd.cache()
> rdd.count()
2.2 Caching an RDD with Alluxio
When caching with Alluxio, the Spark APIs typically used are saveAsTextFile and saveAsObjectFile.
1) saveAsTextFile: writes the RDD as a text file, where each element is a line in the file
2) saveAsObjectFile: writes the RDD out to a file, by using Java serialization on each element
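For the second option, the data is read back with sc.objectFile rather than sc.textFile. A minimal sketch of that round trip follows. The object name, the roundTrip helper, and the sample records are hypothetical, and the target path is taken from the command line. It assumes that, for an alluxio:// URI, the Alluxio client jar is on the Spark driver and executor classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AlluxioObjectFileSketch {
  // Writes the RDD with Java serialization, then loads it back as a new RDD.
  def roundTrip(sc: SparkContext, rdd: RDD[(String, Long)], path: String): RDD[(String, Long)] = {
    rdd.saveAsObjectFile(path)          // fails if the output path already exists
    sc.objectFile[(String, Long)](path) // the element type must be stated explicitly
  }

  def main(args: Array[String]): Unit = {
    // Assumption: args(0) is an output path that does not exist yet,
    // for example an alluxio:// URI on the cluster in use.
    val path = args(0)
    val conf = new SparkConf().setAppName("AlluxioObjectFileSketch").setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical (domain, count) records, similar in shape to the result RDD in section 1.2.
    val counts = sc.parallelize(Seq(("a.example", 3L), ("b.example", 1L)))

    val restored = roundTrip(sc, counts, path)
    println(s"records read back: ${restored.count()}")

    sc.stop()
  }
}
```

Unlike saveAsTextFile, this keeps the element type, so the (String, Long) pairs do not have to be re-parsed from text after loading.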
> rdd.saveAsTextFile(alluxioPath)
> val rddFromAlluxio = sc.textFile(alluxioPath)
> rddFromAlluxio.count()
2.3 Performance of the different caching approaches under a count operation
From the figure, it is clear that reading from an RDD saved in Alluxio results in very stable and predictable performance. However, when persisting data within Spark, the performance is high for smaller data set sizes, but large data set sizes cause a significant decrease in performance. For instance, when using persist(MEMORY_ONLY) on a machine with 61GB of memory, once the data set size exceeds 10GB, the data can no longer fit entirely in Spark memory, and the runtime slows down.
This figure also shows that when using saveAsTextFile with files in Alluxio memory, the performance is slower than using the Spark cache directly for smaller data set sizes. However, for larger data set sizes, reading the RDD from Alluxio files performs significantly better, because it scales linearly with the data size. Therefore, for a given size of memory for a node, Alluxio enables applications to process more data at memory speeds.
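In summary, performance with Alluxio caching is relatively stable and grows predictably, roughly linearly, with data size, whereas Spark caching degrades severely as the data set grows. For relatively small data sets the Spark cache is faster, and writing and re-reading files through Alluxio with saveAsTextFile is slower than the Spark cache. For example, when running persist(MEMORY_ONLY) on a machine with 61GB of memory, performance starts to degrade once the data set exceeds 10GB.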
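A comparison of this kind can be reproduced in outline by timing count under both approaches. The sketch below is only an illustration under stated assumptions, not the benchmark behind the figure: the object name, the timed helper, and both command-line paths are hypothetical, and a single wall-clock measurement per approach is far too coarse for reliable conclusions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheTimingSketch {
  // Evaluates body once and returns its result with the elapsed wall-clock milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // Assumptions: args(0) is an existing text input, args(1) is an output path
    // that does not exist yet (for example an alluxio:// URI).
    val inputPath = args(0)
    val cachePath = args(1)
    val conf = new SparkConf().setAppName("CacheTimingSketch").setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile(inputPath)

    // Spark cache: materialize once, then time a count served from the cache.
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.count()
    val (sparkCount, sparkMs) = timed(rdd.count())
    rdd.unpersist(blocking = true)

    // Alluxio: write the RDD once, then time a count over the re-read files.
    rdd.saveAsTextFile(cachePath)
    val (fileCount, fileMs) = timed(sc.textFile(cachePath).count())

    println(s"Spark cache: $sparkCount lines in $sparkMs ms")
    println(s"File re-read: $fileCount lines in $fileMs ms")
    sc.stop()
  }
}
```

Running it over inputs of increasing size, on either side of what fits in executor memory, is what would show the divergence described above.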
