上接博文。
使用RowMatrix类计算列的统计量。每一行为某一样本的特征向量
import org.apache.spark.mllib.linalg.distributed.RowMatrix val vectors = data.map(lp => lp.features) val matrix = new RowMatrix(vectors) val matrixSummary = matrix.computeColumnSummaryStatistics() //每一列的常用统计量 println(matrixSummary.mean) //均值 println(matrixSummary.min) //最小值 println(matrixSummary.max) //最大值 println(matrixSummary.variance)//方差 println(matrixSummary.numNonzeros)//非零的个数
使用去均值归一化方法:
(x−μ)/sqrt(variance) //对数据进行标准化预处理,选择性的去均值操作,和标准方差操作 import org.apache.spark.mllib.feature.StandardScaler val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors) val scaledData = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features))) // 验证逻辑回归算法性能改善情况。NB和DT算法不受数据归一化的影响 val lrModelScaled = LogisticRegressionWithSGD.train(scaledData, numIterations) val lrTotalCorrectScaled = scaledData.map { point => if (lrModelScaled.predict(point.features) == point.label) 1 else 0 }.sum val lrAccuracyScaled = lrTotalCorrectScaled / numData val lrPredictionsVsTrue = scaledData.map { point => (lrModelScaled.predict(point.features), point.label) } val lrMetricsScaled = new BinaryClassificationMetrics(lrPredictionsVsTrue) val lrPr = lrMetricsScaled.areaUnderPR val lrRoc = lrMetricsScaled.areaUnderROC println(f"${lrModelScaled.getClass.getSimpleName}\nAccuracy:${lrAccuracyScaled * 100}%2.4f%%\nArea under PR: ${lrPr * 100.0}%2.4f%%\nArea under ROC: ${lrRoc * 100.0}%2.4f%%")***LogisticRegressionModel Accuracy:62.0419% Area under PR: 72.7254% Area under ROC: 61.9663%*
LogisticRegressionModel Accuracy:66.5720% Area under PR: 75.7964% Area under ROC: 66.5483%
朴素贝叶斯更适用于类别特征,仅仅使用类别特征对样本进行分类实验:
// 生成仅有类别属性的特征向量 val dataNB = records.map { r => val trimmed = r.map(_.replaceAll("\"", "")) val label = trimmed(r.size - 1).toInt val categoryIdx = categories(r(3)) val categoryFeatures = Array.ofDim[Double](numCategories) categoryFeatures(categoryIdx) = 1.0 LabeledPoint(label, Vectors.dense(categoryFeatures)) } //验证NB算法的性能 val nbModelCats = NaiveBayes.train(dataNB) val nbTotalCorrectCats = dataNB.map { point => if (nbModelCats.predict(point.features) == point.label) 1 else 0 }.sum val nbAccuracyCats = nbTotalCorrectCats / numData val nbPredictionsVsTrueCats = dataNB.map { point => (nbModelCats.predict(point.features), point.label) } val nbMetricsCats = new BinaryClassificationMetrics(nbPredictionsVsTrueCats) val nbPrCats = nbMetricsCats.areaUnderPR val nbRocCats = nbMetricsCats.areaUnderROC println(f"${nbModelCats.getClass.getSimpleName}\nAccuracy:${nbAccuracyCats * 100}%2.4f%%\nArea under PR: ${nbPrCats * 100.0}%2.4f%%\nArea under ROC: ${nbRocCats * 100.0}%2.4f%%")结果: NaiveBayesModel Accuracy: 60.9601% Area under PR: 74.0522% Area under ROC: 60.5138% 从结果看,NB算法有了很大提升,说明数据特征对模型的适应性。
已讨论的对模型性能影响因素:特征提取、特征的选择、数据格式和对数据分布的假设 接下来,讨论模型参数对性能的影响。