K-means is a very simple hard clustering technique. The goal of k-means clustering is to minimize the total squared distance from all the points to their assigned centroids, $\arg\min_{C} \sum_i \|x_i - C_i\|_2^2$, where $x_i$ is the $i$-th data point and $C_i$ is the centroid assigned to it. There are two iterative steps in k-means clustering: centroid assignment and centroid update. In the first step, assign each data point to its nearest centroid, $C_i = \arg\min_j \|x_i - C_j\|_2^2$. In the second step, for each centroid, take the mean of all the data points assigned to it and use it as the new centroid, $C_j^* = \frac{1}{N_j} \sum_{x_i \in C_j} x_i$, where $N_j$ is the number of points assigned to cluster $j$. Repeat these two steps until no centroid changes; the model converges to a local optimum. K-means is an initialization-dependent method, which means a good initial state affects both the quality and the convergence rate of the clustering.
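Below is a minimal single-machine sketch of these two steps in Python with NumPy; the random initialization, the fixed iteration cap, and the handling of empty clusters are illustrative assumptions, not part of the notes above.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm sketch: X is (n, d); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Illustrative initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (squared L2 distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it.
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when no centroid changes (convergence to a local optimum).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign
```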
When the data set is very large, k-means can be slow even for a single iteration, so a distributed computing framework can be applied. Here is a brief introduction to the MapReduce framework, which consists of two parts, Map and Reduce.
In Map, the computation is data-parallel over the elements, and each element emits a (key, value) pair.
In Reduce, the (key, value) pairs are aggregated over keys to produce a new (key, value*) pair for each key.
For example, when we want to count the occurrences of all the words in a corpus, we can first divide the corpus into many sets of documents and send them to multiple machines. Each machine generates (word, count) pairs and sends each pair to a machine in the second layer using a hash function $h: v_i \rightarrow \text{machine}$. Here each word is processed by only one machine, so there is no data conflict. Combining (key, value) pairs can be communication-expensive, so a local reduce (combining counts on each mapper before sending) helps improve efficiency.
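A toy single-process simulation of this word-count pipeline in Python is sketched below; the number of reducer machines, the use of Python's built-in hash as the partitioning function h, and the local reduce inside each mapper are illustrative assumptions.

```python
from collections import Counter, defaultdict

NUM_REDUCERS = 4  # hypothetical number of second-layer machines

def map_phase(document):
    """Map: emit (word, count) pairs; the Counter acts as a local reduce (combine),
    so each mapper sends at most one pair per distinct word."""
    return Counter(document.split()).items()

def partition(word):
    """Hash function h: word -> machine; every occurrence of a word is routed
    to the same reducer, so there is no conflict between reducers."""
    return hash(word) % NUM_REDUCERS

def reduce_phase(pairs):
    """Reduce: aggregate the (word, count) pairs over keys."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return totals

# Simulate the shuffle: route each mapper's output to its reducer via h(word).
documents = ["the cat sat", "the dog sat", "the cat ran"]
shuffled = defaultdict(list)
for doc in documents:                      # each document set could live on a different machine
    for word, count in map_phase(doc):
        shuffled[partition(word)].append((word, count))

word_counts = {}
for reducer_id, pairs in shuffled.items():
    word_counts.update(reduce_phase(pairs))
print(word_counts)  # e.g. {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1} (order may vary)
```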