Machine learning has exploded in popularity, and suddenly everyone seems to be pushing into the field. I am a first-year graduate student, and although this is not my research direction, for various reasons I have joined in too. A complete beginner teaching myself from scratch, after two months of stumbling around, picking up a bit here and a bit there, I have settled on an initial study plan and started this blog. Let's encourage each other.

This series of posts mainly draws on three books: *Python for Data Analysis*, *Learning Data Mining with Python*, and *Machine Learning* (Zhou Zhihua), with the latter two as the main line. The first serves as a reference book, filling in background on Python, pandas, and so on; the second is the practice book, using scikit-learn to exercise the algorithms, tune parameters, and more; the third is the theory book, read alongside the second to deepen the understanding of the algorithms. Of course, the real inner strength comes from linear algebra, probability theory, and the like. I tried working through the mathematics first, but theory without practice is forgotten the moment you turn around, so I will interleave the mathematical basics while reading these three books.
We start from the simple K-nearest-neighbors (K-NN) algorithm to get familiar with the most basic scikit-learn workflow; the explanations are in the code comments. For readability, each `import` statement is placed immediately above the code that first uses that library.
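Before the library version, it may help to see the K-NN idea itself: to classify a query point, find its k closest training points and take a majority vote of their labels. A minimal sketch in plain Python, on hypothetical toy data (not the Ionosphere dataset used below):

```python
from collections import Counter
import math

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # sort all training points by Euclidean distance to the query
    dists = sorted(
        (math.dist(point, query), label) for point, label in zip(train_x, train_y)
    )
    # count the labels of the k closest points and return the most common one
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# toy 2-D data: class 'a' clusters near the origin, class 'b' near (5, 5)
train_x = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_y = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(train_x, train_y, (0.5, 0.5)))  # prints 'a'
```

`KNeighborsClassifier` below does essentially this, with an efficient neighbor search and many more options.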
```python
import numpy as np
import os

# path to the Ionosphere dataset (raw string so the backslashes are not treated as escapes)
data_filename = os.path.join(r"C:\Users\Han Chunhui", "Ionosphere", "ionosphere.data")

x = np.zeros((351, 34), dtype='float')  # space for the data: 351 rows, 34 columns
y = np.zeros((351,), dtype='bool')      # space for the labels: 351 rows, 1 column

import csv
with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        data = [float(datum) for datum in row[:-1]]
        x[i] = data
        y[i] = row[-1] == 'g'  # x holds the data, y the labels ('g' = good)

# note: in scikit-learn >= 0.20 this module was removed; use sklearn.model_selection instead
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=14)  # split into training and test sets

from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()
estimator.fit(x_train, y_train)  # fit() trains the KNeighborsClassifier object on the training set
y_predicted = estimator.predict(x_test)  # predict() labels the test set
accuracy = np.mean(y_test == y_predicted) * 100  # compare the predictions with the truth to get the accuracy
print("The accuracy is {0:.1f}%".format(accuracy))  # prints "The accuracy is 86.4%"
```

Next we test the algorithm's performance with cross-validation. In short, cross-validation repeatedly splits the same dataset into different training and test sets.
```python
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator, x, y, scoring='accuracy')  # cross-validation
average_accuracy = np.mean(scores) * 100
print("The average accuracy is {0:.1f}%".format(average_accuracy))  # prints "The average accuracy is 82.3%"
```

The code above uses the default parameters (i.e. the default number of neighbors in K-NN). Next we adjust the parameter by hand and see how different values perform.
```python
avg_scores = []  # mean score for each parameter value
all_scores = []  # all fold scores for each parameter value
parameter_values = list(range(1, 21))  # vary the parameter from 1 to 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)  # set the parameter
    scores = cross_val_score(estimator, x, y, scoring='accuracy')  # cross-validation
    avg_scores.append(np.mean(scores))  # record the mean score
    all_scores.append(scores)

from matplotlib import pyplot as plt
plt.plot(parameter_values, avg_scores, '-o')  # plot accuracy against n_neighbors
plt.show()
```

Real datasets are rarely tidy, so a series of preprocessing steps is usually needed, the most basic being normalization. Preprocessing steps like these often come in a fixed sequence, so for convenience, and to avoid putting them in the wrong order, we can wrap the steps in a "pipeline", much as a function encapsulates (stands for) a series of operations. The slightly upgraded scikit-learn workflow is:
```python
X_broken = np.array(x)
X_broken[:, ::2] /= 10  # corrupt the data: divide every other feature (column) by 10

from sklearn.preprocessing import MinMaxScaler  # normalization to [0, 1]
from sklearn.pipeline import Pipeline
scaling_pipeline = Pipeline([('scale', MinMaxScaler()),
                             ('predict', KNeighborsClassifier())])  # pipeline: normalization, then the classifier
scores = cross_val_score(scaling_pipeline, X_broken, y, scoring='accuracy')  # cross-validation
print("The pipeline scored an average accuracy of {0:.1f}%".format(np.mean(scores) * 100))  # prints "The pipeline scored an average accuracy of 82.3%"
```

The code comes from *Learning Data Mining with Python* and has been verified. Environment: PyCharm, Python 2.7.
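The snippets above target Python 2.7 with an old scikit-learn; in current versions, `sklearn.cross_validation` has been removed in favour of `sklearn.model_selection`. As a sketch, the same pipeline experiment can be run under a recent scikit-learn on one of its bundled datasets (so no local `ionosphere.data` file is needed); the iris dataset here is just a stand-in, not the data used above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score  # replaces sklearn.cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

x, y = load_iris(return_X_y=True)  # a small bundled dataset standing in for Ionosphere

# same pipeline as above: scale features to [0, 1], then classify by nearest neighbors
pipeline = Pipeline([('scale', MinMaxScaler()),
                     ('predict', KNeighborsClassifier())])
scores = cross_val_score(pipeline, x, y, scoring='accuracy', cv=5)  # 5-fold cross-validation
print("The average accuracy is {0:.1f}%".format(np.mean(scores) * 100))
```

The only changes from the book's version are the import path and the explicit `cv=5`; the `Pipeline`, `MinMaxScaler`, and `KNeighborsClassifier` usage is unchanged.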