Machine learning has exploded in popularity, and suddenly everyone seems to be pushing into the field. I am a first-year graduate student, and although this is not my research direction, for various reasons I have joined in too. A complete beginner teaching myself from scratch, after two months of stumbling around, picking up a bit here and a bit there, I have settled on an initial study plan and started this blog. Let's encourage each other.

This series of posts mainly draws on three books: *Python for Data Analysis*, *Learning Data Mining with Python*, and *Machine Learning* (Zhou Zhihua), with the latter two as the main line. The first serves as a reference book, filling in background on Python, pandas, and so on; the second is the practice book, using scikit-learn to exercise the algorithms, tune parameters, and more; the third is the theory book, read alongside the second to deepen the understanding of the algorithms. Of course, the real inner strength comes from linear algebra, probability theory, and the like. I tried working through the mathematics first, but theory without practice is forgotten the moment you turn around, so I will interleave the mathematical basics while reading these three books.
We start from the simple K-nearest-neighbors (K-NN) algorithm to get familiar with the most basic scikit-learn workflow; the explanations are in the code comments. For readability, each `import` statement is placed immediately above the code that first uses that library.
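Before the library version, it may help to see the K-NN idea itself: to classify a query point, find its k closest training points and take a majority vote of their labels. A minimal sketch in plain Python, on hypothetical toy data (not the Ionosphere dataset used below):

```python
from collections import Counter
import math

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # sort all training points by Euclidean distance to the query
    dists = sorted(
        (math.dist(point, query), label) for point, label in zip(train_x, train_y)
    )
    # count the labels of the k closest points and return the most common one
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# toy 2-D data: class 'a' clusters near the origin, class 'b' near (5, 5)
train_x = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_y = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(train_x, train_y, (0.5, 0.5)))  # prints 'a'
```

`KNeighborsClassifier` below does essentially this, with an efficient neighbor search and many more options.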
```python
import numpy as np
import os

# path to the Ionosphere dataset (raw string so the backslashes are not treated as escapes)
data_filename = os.path.join(r"C:\Users\Han Chunhui", "Ionosphere", "ionosphere.data")

x = np.zeros((351, 34), dtype='float')  # space for the data: 351 rows, 34 columns
y = np.zeros((351,), dtype='bool')      # space for the labels: 351 rows, 1 column

import csv
with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        data = [float(datum) for datum in row[:-1]]
        x[i] = data
        y[i] = row[-1] == 'g'  # x holds the data, y the labels ('g' = good)

# note: in scikit-learn >= 0.20 this module was removed; use sklearn.model_selection instead
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=14)  # split into training and test sets

from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()
estimator.fit(x_train, y_train)  # fit() trains the KNeighborsClassifier object on the training set
y_predicted = estimator.predict(x_test)  # predict() labels the test set
accuracy = np.mean(y_test == y_predicted) * 100  # compare the predictions with the truth to get the accuracy
print("The accuracy is {0:.1f}%".format(accuracy))  # prints "The accuracy is 86.4%"
```

Next we test the algorithm's performance with cross-validation. In short, cross-validation repeatedly splits the same dataset into different training and test sets.
```python
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator, x, y, scoring='accuracy')  # cross-validation
average_accuracy = np.mean(scores) * 100
print("The average accuracy is {0:.1f}%".format(average_accuracy))  # prints "The average accuracy is 82.3%"
```

The code above uses the default parameters (i.e. the default number of neighbors in K-NN). Next we adjust the parameter by hand and see how different values perform.
```python
avg_scores = []  # mean score for each parameter value
all_scores = []  # all fold scores for each parameter value
parameter_values = list(range(1, 21))  # vary the parameter from 1 to 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)  # set the parameter
    scores = cross_val_score(estimator, x, y, scoring='accuracy')  # cross-validation
    avg_scores.append(np.mean(scores))  # record the mean score
    all_scores.append(scores)

from matplotlib import pyplot as plt
plt.plot(parameter_values, avg_scores, '-o')  # plot accuracy against n_neighbors
plt.show()
```

Real datasets are rarely tidy, so a series of preprocessing steps is usually needed, the most basic being normalization. Preprocessing steps like these often come in a fixed sequence, so for convenience, and to avoid putting them in the wrong order, we can wrap the steps in a "pipeline", much as a function encapsulates (stands for) a series of operations. The slightly upgraded scikit-learn workflow is:
```python
X_broken = np.array(x)
X_broken[:, ::2] /= 10  # corrupt the data: divide every other feature (column) by 10

from sklearn.preprocessing import MinMaxScaler  # normalization to [0, 1]
from sklearn.pipeline import Pipeline
scaling_pipeline = Pipeline([('scale', MinMaxScaler()),
                             ('predict', KNeighborsClassifier())])  # pipeline: normalization, then the classifier
scores = cross_val_score(scaling_pipeline, X_broken, y, scoring='accuracy')  # cross-validation
print("The pipeline scored an average accuracy of {0:.1f}%".format(np.mean(scores) * 100))  # prints "The pipeline scored an average accuracy of 82.3%"
```

The code comes from *Learning Data Mining with Python* and has been verified. Environment: PyCharm, Python 2.7.
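The snippets above target Python 2.7 with an old scikit-learn; in current versions, `sklearn.cross_validation` has been removed in favour of `sklearn.model_selection`. As a sketch, the same pipeline experiment can be run under a recent scikit-learn on one of its bundled datasets (so no local `ionosphere.data` file is needed); the iris dataset here is just a stand-in, not the data used above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score  # replaces sklearn.cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

x, y = load_iris(return_X_y=True)  # a small bundled dataset standing in for Ionosphere

# same pipeline as above: scale features to [0, 1], then classify by nearest neighbors
pipeline = Pipeline([('scale', MinMaxScaler()),
                     ('predict', KNeighborsClassifier())])
scores = cross_val_score(pipeline, x, y, scoring='accuracy', cv=5)  # 5-fold cross-validation
print("The average accuracy is {0:.1f}%".format(np.mean(scores) * 100))
```

The only changes from the book's version are the import path and the explicit `cv=5`; the `Pipeline`, `MinMaxScaler`, and `KNeighborsClassifier` usage is unchanged.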