MATLAB Implementation of (KNN) Adaptive Spectral Clustering


    Adaptive spectral clustering

    Spectral clustering commonly builds its similarity matrix from a K-nearest-neighbor (KNN) relation. The code below (adapted from the references) first computes the sparse KNN distance matrix from the input data and the neighbor count K: get_knn_distance.m

    function [A] = get_knn_distance(data, K)
    % Input : data : N-by-D data matrix, where N is the number of data points
    %                and D is the number of dimensions
    %         K    : number of nearest neighbors
    % Output: A    : sparse symmetric KNN distance matrix
    dataT = data';
    [n, d] = size(data);
    dist_matrix = dist(dataT);                 % N-by-N pairwise Euclidean distances
    % Find the K nearest neighbors
    [value, index] = sort(dist_matrix, 1);     % sort each column ascending; value = sorted distances, index = corresponding row indices
    KNNindex = index(2:K+1, :);                % drop the self-distance (0); indices of the K nearest neighbors
    rowindex = reshape(KNNindex, size(KNNindex, 1)*size(KNNindex, 2), 1);         % neighbor indices as an (N*K)-by-1 vector
    tempindex = repmat(1:n, K, 1);             % K-by-N matrix of column indices
    columnindex = reshape(tempindex, size(tempindex, 1)*size(tempindex, 2), 1);   % column indices as an (N*K)-by-1 vector
    tempvalue = value(2:K+1, :);               % drop the self-distance (0)
    value1 = reshape(tempvalue, size(tempvalue, 1)*size(tempvalue, 2), 1);        % neighbor distances as an (N*K)-by-1 vector
    % Build the sparse KNN matrix and symmetrize it
    A = sparse(rowindex, columnindex, double(value1), n, n);
    A1 = triu(A);                              % upper triangular part
    A1 = A1 + A1';
    A2 = tril(A);                              % lower triangular part
    A2 = A2 + A2';
    A = max(A1, A2);                           % connect i and j if either is among the other's K nearest neighbors
    %A = full(A);
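    As a quick sanity check, the sketch below (a minimal example on a toy dataset; the variable names are illustrative and not part of the original code) calls get_knn_distance and inspects the resulting sparse matrix. Note that dist is a Neural Network Toolbox function; squareform(pdist(data)) from the Statistics Toolbox gives the same pairwise distance matrix if dist is unavailable.

    % Minimal usage sketch (illustrative only): 20 random 2-D points, 5 nearest neighbors
    rng(0);                            % fix the random seed for reproducibility
    toy = rand(20, 2);                 % N = 20 points, D = 2 dimensions
    A = get_knn_distance(toy, 5);      % sparse symmetric KNN distance matrix
    disp(size(A));                     % 20-by-20
    disp(nnz(A));                      % number of nonzero (neighbor) entries
    spy(A);                            % visualize the sparsity pattern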

    sc.m then computes the eigenvectors corresponding to the K smallest eigenvalues of the normalized Laplacian (implemented as the K largest eigenvalues of D^(-1/2)*S*D^(-1/2)). For spectral clustering, the parameter sigma is a very sensitive hyperparameter; many papers derive it adaptively from the local density, and interested readers can look into that literature.

    function [cluster_labels, evd_time, kmeans_time, total_time] = sc(A, sigma_matrix, num_clusters)
    %SC Spectral clustering using a sparse similarity matrix (t-nearest-neighbor).
    %
    % Input  : A            : N-by-N sparse distance matrix, where
    %                         N is the number of data points
    %          sigma_matrix : matrix of sigma terms used in computing the similarity
    %          num_clusters : number of clusters
    %
    % Output : cluster_labels : N-by-1 vector containing cluster labels
    %          evd_time       : running time for the eigendecomposition
    %          kmeans_time    : running time for k-means
    %          total_time     : total running time
    %
    % Convert the sparse distance matrix to a sparse similarity matrix,
    % where S = exp(-(A.^2 ./ sigma_matrix)) and sigma_matrix plays the role of 2*sigma^2.
    % Note: this step can be skipped if A is already a sparse similarity matrix.
    %
    disp('Converting distance matrix to similarity matrix...');
    tic;
    A = A.*A;
    %A = -A/(2*sigma*sigma);  % negated so that the largest eigenvalues are the ones we want
    A = -A./sigma_matrix;     % sigma_matrix holds the locally adaptive parameters
    n = size(A, 1);
    % Compute the similarity matrix (exp is applied only to the nonzero entries)
    S = spfun(@exp, A);
    clear A;
    toc;
    %
    % Do Laplacian, L = D^(-1/2) * S * D^(-1/2)
    %
    disp('Doing Laplacian...');
    D = sum(S, 2) + (1e-10);
    D = sqrt(1./D);           % D^(-1/2)
    D = spdiags(D, 0, n, n);
    L = D * S * D;
    clear D S;
    time1 = toc;
    %
    % Do eigendecomposition, if L =
    %   D^(-1/2) * S * D^(-1/2)     : set 'LM' (Largest Magnitude), or
    %   I - D^(-1/2) * S * D^(-1/2) : set 'SM' (Smallest Magnitude).
    %L = eye(n) - (D^(-1/2) * S * D^(-1/2));  % normalized Laplacian
    %
    disp('Performing eigendecomposition...');
    OPTS.disp = 0;
    [V, val] = eigs(L, num_clusters, 'LM', OPTS);  % V: eigenvectors, val: diagonal matrix of eigenvalues
    time2 = toc;
    %
    % Do k-means
    %
    disp('Performing kmeans...');
    % Normalize each row to be of unit length
    sq_sum = sqrt(sum(V.*V, 2)) + 1e-20;
    U = V ./ repmat(sq_sum, 1, num_clusters);
    clear sq_sum V;
    cluster_labels = k_means(U, [], num_clusters);
    total_time = toc;
    %
    % Calculate and show time statistics
    %
    evd_time = time2 - time1;
    kmeans_time = total_time - time2;
    disp('Finished!');
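    The 'LM' versus 'SM' comment in sc.m reflects the fact that the largest eigenvectors of D^(-1/2)*S*D^(-1/2) are exactly the smallest eigenvectors of the normalized Laplacian I - D^(-1/2)*S*D^(-1/2). The standalone sketch below (a small dense check, not part of the original code; all names are illustrative) illustrates that equivalence.

    % Check (illustrative only): top eigenvectors of D^(-1/2)*S*D^(-1/2) span the same
    % subspace as the bottom eigenvectors of I - D^(-1/2)*S*D^(-1/2).
    n = 6;
    S = rand(n); S = (S + S')/2; S(1:n+1:end) = 0;   % small symmetric similarity matrix, zero diagonal
    D = diag(1./sqrt(sum(S, 2)));                    % D^(-1/2)
    M = D*S*D; M = (M + M')/2;                       % form used in sc.m (symmetrized numerically)
    Lap = eye(n) - M;                                % normalized Laplacian
    [V, E] = eig(M);  [~, ordM] = sort(diag(E), 'descend');
    V1 = V(:, ordM(1:2));                            % 2 largest eigenvectors of M
    [W, F] = eig(Lap); [~, ordL] = sort(diag(F), 'ascend');
    V2 = W(:, ordL(1:2));                            % 2 smallest eigenvectors of Lap
    disp(norm(V1*V1' - V2*V2'));                     % projectors agree: should be close to 0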

    As is well known, k-means is sensitive to the choice of initial centers. The k_means implementation used here selects the initialization method from the value of the centers argument: a K-by-D matrix of explicit centers, the string 'random' for random initialization, or an empty matrix [] for orthogonal initialization.

    function cluster_labels = k_means(data, centers, num_clusters)
    %K_MEANS Euclidean k-means clustering algorithm.
    %
    % Input  : data         : N-by-D data matrix, where N is the number of data,
    %                         D is the number of dimensions
    %          centers      : K-by-D matrix, where K is num_clusters, or
    %                         'random', random initialization, or
    %                         [], empty matrix, orthogonal initialization
    %          num_clusters : number of clusters
    %
    % Output : cluster_labels : N-by-1 vector of cluster assignments
    %
    % Reference: Dimitrios Zeimpekis, Efstratios Gallopoulos, 2006.
    %            http://scgroup.hpclab.ceid.upatras.gr/scgroup/Projects/TMG/
    %
    % Parameter setting
    %
    iter = 0;
    qold = inf;
    threshold = 0.001;
    %
    % Check if with initial centers
    %
    if strcmp(centers, 'random')
      disp('Random initialization...');
      centers = random_init(data, num_clusters);
    elseif isempty(centers)
      disp('Orthogonal initialization...');
      centers = orth_init(data, num_clusters);
    end
    %
    % Double type is required for sparse matrix multiply
    %
    data = double(data);
    centers = double(centers);
    %
    % Calculate the distance (square) between data and centers
    %
    n = size(data, 1);
    x = sum(data.*data, 2)';                   % 1-by-n
    X = x(ones(num_clusters, 1), :);           % num_clusters-by-n
    y = sum(centers.*centers, 2);              % num_clusters-by-1
    Y = y(:, ones(n, 1));                      % num_clusters-by-n
    P = X + Y - 2*centers*data';               % squared distances
    %
    % Main program
    %
    while 1
      iter = iter + 1;
      % Find the closest cluster for each data point
      [val, ind] = min(P, [], 1);              % row vector: row index of the minimum in each column
      % Sum up data points within each cluster
      P = sparse(ind, 1:n, 1, num_clusters, n);  % assignment matrix (cluster index vs. data index)
      centers = P*data;                        % sum of the points in each cluster
      % Size of each cluster; clusters of size 0 stay empty
      cluster_size = P*ones(n, 1);             % number of points per cluster
      % For empty clusters, initialize again
      zero_cluster = find(cluster_size == 0);  % indices of empty clusters
      if length(zero_cluster) > 0              % empty cluster centers: re-initialize
        disp('Zero centroid. Initialize again...');
        centers(zero_cluster, :) = random_init(data, length(zero_cluster));
        cluster_size(zero_cluster) = 1;
      end
      % Update centers
      centers = spdiags(1./cluster_size, 0, num_clusters, num_clusters)*centers;
      % Update distance (square) to new centers
      y = sum(centers.*centers, 2);
      Y = y(:, ones(n, 1));
      P = X + Y - 2*centers*data';
      % Calculate objective function value
      qnew = sum(sum(sparse(ind, 1:n, 1, size(P, 1), size(P, 2)).*P));
      mesg = sprintf('Iteration %d:\n\tQold=%g\t\tQnew=%g', iter, full(qold), full(qnew));
      disp(mesg);
      % Check if the relative change of the objective is below the threshold
      if threshold >= abs((qnew-qold)/qold)
        mesg = sprintf('\nkmeans converged!');
        disp(mesg);
        break;
      end
      qold = qnew;
    end
    cluster_labels = ind';

    %----------------------------------------------------------
    function init_centers = random_init(data, num_clusters)
    %RANDOM_INIT Initialize centroids by choosing num_clusters rows of data at random.
    %
    % Input : data         : N-by-D data matrix, where N is the number of data,
    %                        D is the number of dimensions
    %         num_clusters : number of clusters
    %
    % Output: init_centers : K-by-D matrix, where K is num_clusters
    rand('twister', sum(100*clock));
    init_centers = data(ceil(size(data, 1)*rand(1, num_clusters)), :);

    function init_centers = orth_init(data, num_clusters)
    %ORTH_INIT Initialize orthogonal centers for the k-means clustering algorithm.
    %
    % Input : data         : N-by-D data matrix, where N is the number of data,
    %                        D is the number of dimensions
    %         num_clusters : number of clusters
    %
    % Output: init_centers : K-by-D matrix, where K is num_clusters
    %
    % Find the num_clusters centers which are orthogonal to each other
    %
    Uniq = unique(data, 'rows');                % avoid duplicate centers
    num = size(Uniq, 1);
    first = ceil(rand(1)*num);                  % randomly select the first center
    init_centers = zeros(num_clusters, size(data, 2));   % storage for centers
    init_centers(1, :) = Uniq(first, :);
    Uniq(first, :) = [];
    c = zeros(num-1, 1);   % accumulated orthogonality values of the remaining candidates w.r.t. the existing centers
    % Find the remaining num_clusters-1 centers
    for j = 2:num_clusters
      c = c + abs(Uniq*init_centers(j-1, :)');
      [minimum, i] = min(c);                    % select the most orthogonal candidate as the next center
      init_centers(j, :) = Uniq(i, :);
      Uniq(i, :) = [];
      c(i) = [];
    end
    clear c Uniq;
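    A minimal usage sketch of k_means on toy data (illustrative only; the blob data and variable names are assumptions, not part of the original code) showing the two initialization modes:

    % Two well-separated Gaussian blobs, clustered with orthogonal or random initialization
    rng(1);
    blobs = [randn(50, 2); randn(50, 2) + 5];    % 100 points, 2 clusters
    labels = k_means(blobs, [], 2);              % [] -> orthogonal initialization
    % labels = k_means(blobs, 'random', 2);      % 'random' -> random initialization
    plot(blobs(labels==1,1), blobs(labels==1,2), 'r+', blobs(labels==2,1), blobs(labels==2,2), 'b.');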

    The adaptive spectral clustering implemented here modifies the 2*sigma*sigma term of the Gaussian kernel, following references [3] and [4]. Reference [3] uses each point's distance to its sigma_k-th nearest neighbor as its own sigma; reference [4] builds on [3] by additionally counting the shared nearest neighbors (SNN):

        Sim(x_i, x_j) = exp( -d(x_i, x_j)^2 / (2σ^2) )
        Sim(x_i, x_j) = exp( -d(x_i, x_j)^2 / (2σ_i σ_j) )
        Sim(x_i, x_j) = exp( -d(x_i, x_j)^2 / (2σ_i σ_j (SNN(x_i, x_j) + 1)) )

    function [sigma_matrix] = density_adaptive_sigma(data, sigma_k, sigma_type)
    % by wangboajia 2017.4.14
    % Input : data       : N-by-D data matrix
    %         sigma_k    : number of nearest neighbors used to estimate sigma
    %         sigma_type : string  'KNN' -> sigma_k-th nearest neighbor distance
    %                              'SNN' -> sigma_k-th nearest neighbor distance + shared nearest neighbors
    %                      numeric value -> use it directly as a global sigma
    % Output: sigma_matrix : matrix of pairwise sigma terms (the denominator of the Gaussian kernel)
    dataT = data';
    [n, d] = size(data);
    dist_matrix = dist(dataT);
    % Find the sigma_k nearest neighbors
    [value, index] = sort(dist_matrix, 1);   % sort each column ascending; value = sorted distances, index = corresponding row indices
    KNNindex = index(2:sigma_k+1, :);        % drop the self-distance (0); indices of the sigma_k nearest neighbors
    KNNvalue = value(2:sigma_k+1, :);        % drop the self-distance (0); distances to the sigma_k nearest neighbors
    if ischar(sigma_type)
      if strcmp(sigma_type, 'KNN')
        sigma_matrix = sigma_knn(KNNvalue, sigma_k, n);
      elseif strcmp(sigma_type, 'SNN')
        sigma_matrix = sigma_snn(KNNvalue, KNNindex, sigma_k, n);
      end
    elseif isnumeric(sigma_type)
      sigma_matrix = sigma_num(sigma_type, n);
    end

    %----------------------------------------------------------
    % Use each point's distance to its sigma_k-th nearest neighbor as its sigma
    function sigma_matrix = sigma_knn(KNNvalue, sigma_k, n)
    sigma_knn_value = KNNvalue(sigma_k, :);
    sigma_matrix = sigma_knn_value'*sigma_knn_value*2;               % 2*sigma_i*sigma_j

    %----------------------------------------------------------
    % Shared-nearest-neighbor count among the sigma_k neighbors of each pair of points
    function sigma_matrix = sigma_snn(KNNvalue, KNNindex, sigma_k, n)
    % Count shared nearest neighbors
    sigma_snn = zeros(n, n);
    for i = 1:n-1
      for j = i+1:n
        sigma_snn(i, j) = size(intersect(KNNindex(:, i), KNNindex(:, j)), 1);
      end
    end
    temp = triu(sigma_snn);                  % upper triangular part
    sigma_snn = temp + temp';
    sigma_knn_value = KNNvalue(sigma_k, :);
    sigma_matrix = sigma_knn_value'*sigma_knn_value*2.*(sigma_snn+1);  % 2*sigma_i*sigma_j*(SNN+1)

    %----------------------------------------------------------
    % Assign a single global sigma value
    function sigma_matrix = sigma_num(sigma_type, n)
    sigma_matrix = ones(n, n);
    sigma_matrix = sigma_matrix*sigma_type*sigma_type;               % sigma^2 everywhere
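    To see how the three sigma_type options map onto the formulas above, the sketch below (illustrative only; the toy dataset and variable names are assumptions) compares one entry of the resulting sigma matrices. The SNN variant scales the KNN variant entry-wise by (SNN_ij + 1).

    % Illustrative comparison of the three sigma_type modes on toy data
    rng(2);
    toy = rand(30, 2);
    sig_knn = density_adaptive_sigma(toy, 7, 'KNN');   % 2*sigma_i*sigma_j
    sig_snn = density_adaptive_sigma(toy, 7, 'SNN');   % 2*sigma_i*sigma_j*(SNN_ij + 1)
    sig_fix = density_adaptive_sigma(toy, 7, 0.1);     % constant sigma^2 (no factor of 2 in this branch)
    i = 1; j = 2;
    disp([sig_knn(i,j), sig_snn(i,j), sig_snn(i,j)/sig_knn(i,j)]);   % ratio equals SNN_ij + 1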

    Test script: testspectral.m

    load('two.txt');
    dataSet = two;
    num_clusters = 2;
    [n, d] = size(dataSet);
    K = 10;                        % K can range from 1 to n-1; K = n-1 reduces to the original spectral clustering
    [A] = get_knn_distance(dataSet, K);                                     % KNN distance matrix
    sigma_k = 7;                   % adaptive sigma neighbor count; the papers suggest 7
    sigma_type = 'SNN';            % 'KNN' without the shared-nearest-neighbor term, 'SNN' with it, or a numeric value such as 0.1
    %sigma_type = 'KNN';
    %sigma_type = 0.1;
    [sigma_matrix] = density_adaptive_sigma(dataSet, sigma_k, sigma_type);  % adaptive sigma matrix
    [cluster_labels, evd_time, kmeans_time, total_time] = sc(A, sigma_matrix, num_clusters);  % spectral clustering
    disp(['Number of data samples: ', num2str(n)]);
    disp(['Time for the first K eigenvalues/eigenvectors: ', num2str(evd_time)]);
    disp(['K-means clustering time: ', num2str(kmeans_time)]);
    disp(['Total spectral clustering time: ', num2str(total_time)]);
    X = dataSet(cluster_labels==1, 1);
    Y = dataSet(cluster_labels==1, 2);
    plot(dataSet(cluster_labels==1,1), dataSet(cluster_labels==1,2), 'r+', ...
         dataSet(cluster_labels==2,1), dataSet(cluster_labels==2,2), 'b.');

    Data and code download

    Download here.

    References:
    [1] MATLAB documentation for the functions used above.
    [2] Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Chang. Parallel Spectral Clustering in Distributed Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 3, pp. 568-586, March 2011 (source code available for download).
    [3] Zelnik-Manor, L., Perona, P., 2004. Self-tuning spectral clustering. In Proceedings of NIPS'04, pp. 1601-1608.
    [4] Local density adaptive similarity measurement for spectral clustering. Pattern Recognition Letters 32 (2011) 352-358.

