sklearn K-means MiniBatch-K-Means

xiaoxiao · 2025-01-31

K-means caveats: K-means is not suitable for features on different scales (flat, elongated data) or for non-convex cluster shapes; PCA preprocessing is recommended. From an estimate of the covariance matrix one can see that make_blobs generates clusters with identity covariance; cluster_std sets each cluster's standard deviation. The "Anisotropicly Distributed Blobs" case below applies a strong linear transformation (no noise added) that induces a strong negative correlation: after the transformation the correlation coefficient is about -0.95065634126728737, so the sample distribution is non-uniform and the features are no longer independent. K-means is of course a poor fit there, which shows the importance of PCA. Below are several data shapes for which K-means is poorly suited, although the last two cases do not come out too badly:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))

n_samples = 1500
random_state = 170
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

# Wrong number of clusters: asking for 2 clusters on 3-blob data.
y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)
plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Incorrect Number of Blobs")

# Anisotropic data: a linear transformation stretches and correlates the blobs.
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X_aniso = np.dot(X, transformation)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)
plt.subplot(222)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
plt.title("Anisotropicly Distributed Blobs")

# Unequal variance across clusters.
X_varied, y_varied = make_blobs(n_samples=n_samples,
                                cluster_std=[1.0, 2.5, 0.5],
                                random_state=random_state)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)
plt.subplot(223)
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plt.title("Unequal Variance")

# Very uneven cluster sizes.
X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_filtered)
plt.subplot(224)
plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")

plt.show()
```

Mini-batch k-means differs from classic k-means in that centroids are no longer updated one point at a time: each update recomputes the centroids from a small batch of samples. It is faster than K-means, at the cost of somewhat lower accuracy. Relevant parameters:

- batch_size sets the number of samples used per update.
- init='k-means++' selects well-spread initial centroids (it does not choose n_clusters automatically; the number of clusters must still be specified).
- n_init sets how many runs with different initializations to try (default 10); the best run (smallest sum of squared distances to the centroids, i.e. inertia) is kept.
- max_no_improvement stops the mini-batch solver after this many consecutive mini-batches without a noticeable improvement in the objective.
- verbose controls how much of the solving process is printed; larger values print more detail.

subplots_adjust adjusts the figure margins; by default each margin is roughly 0.1 in figure coordinates.
sklearn.metrics.pairwise_distances_argmin computes distances between two 2-D arrays: for each row of the first array, it returns the index of the nearest row of the second array. np.logical_not takes the element-wise logical negation of an array. Below is an example comparing K-means with MiniBatchKMeans (note: make_blobs is imported from sklearn.datasets; the old sklearn.datasets.samples_generator module has been removed from recent scikit-learn releases):

```python
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin
from sklearn.datasets import make_blobs

np.random.seed(0)

batch_size = 45
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

# Classic K-means.
k_means = KMeans(init="k-means++", n_clusters=3, n_init=10)
t0 = time.time()
k_means.fit(X)
t_batch = time.time() - t0
k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_
k_means_labels_unique = np.unique(k_means_labels)

# Mini-batch K-means.
mbk = MiniBatchKMeans(init='k-means++', n_clusters=3, batch_size=batch_size,
                      n_init=10, max_no_improvement=10, verbose=0)
t0 = time.time()
mbk.fit(X)
t_mini_batch = time.time() - t0
mbk_means_labels = mbk.labels_
mbk_means_cluster_centers = mbk.cluster_centers_
mbk_means_labels_unique = np.unique(mbk_means_labels)

fig = plt.figure(figsize=(8, 3))
fig.subplots_adjust(left=0.02, right=0.98, bottom=0.05, top=0.9)
colors = ['#4EACC5', '#FF9C34', '#4E9A06']

# Match each K-means center to its nearest MiniBatchKMeans center so
# corresponding clusters get the same color in both panels.
order = pairwise_distances_argmin(k_means_cluster_centers, mbk_means_cluster_centers)

# KMeans panel.
ax = fig.add_subplot(1, 3, 1)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
ax.set_title('KMeans')
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, 'train time: %.2fs\ninertia: %f' % (t_batch, k_means.inertia_))

# MiniBatchKMeans panel.
ax = fig.add_subplot(1, 3, 2)
for k, col in zip(range(n_clusters), colors):
    my_members = mbk_means_labels == order[k]
    cluster_center = mbk_means_cluster_centers[order[k]]
    ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
ax.set_title("MiniBatchKMeans")
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, 'train time: %.2fs\ninertia: %f' % (t_mini_batch, mbk.inertia_))

# Points assigned to different clusters by the two algorithms.
different = (mbk_means_labels == 4)  # all False initially (labels are 0..2)
ax = fig.add_subplot(1, 3, 3)
for k in range(n_clusters):
    different += ((k_means_labels == k) != (mbk_means_labels == order[k]))
identic = np.logical_not(different)
ax.plot(X[identic, 0], X[identic, 1], 'w', markerfacecolor='#bbbbbb', marker='.')
ax.plot(X[different, 0], X[different, 1], 'w', markerfacecolor='m', marker='.')
ax.set_title('Difference')
ax.set_xticks(())
ax.set_yticks(())
plt.show()
```
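The two helpers used above are easy to check in isolation. A minimal sketch with toy arrays (the values here are purely illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances_argmin

A = np.array([[0.0, 0.0], [10.0, 10.0]])
B = np.array([[9.0, 9.0], [1.0, 1.0], [50.0, 50.0]])

# For each row of A, the index of the nearest row of B:
# (0,0) is closest to B[1]=(1,1); (10,10) is closest to B[0]=(9,9).
idx = pairwise_distances_argmin(A, B)
print(idx)  # [1 0]

# np.logical_not negates each boolean element.
mask = np.array([True, False, True])
print(np.logical_not(mask))  # [False  True  False]
```

In the example, this is exactly how `order` aligns the MiniBatchKMeans centroid labels with the K-means ones, and how `identic` is derived from `different`.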
Please credit the original source when reposting: https://ju.6miu.com/read-1295958.html