Machine Learning: K-Means Mathematics
Published On 2021/12/30 Thursday, Singapore
This post covers the mathematics behind K-means, specifically the following:
1. Cost Function
Assume we have a training dataset $D=\{\mathbf{x}_1,\dots,\mathbf{x}_n\}$, where $\mathbf{x}_i\in \mathbb{R}^m$ is the feature vector of the $i$-th sample, $m$ is the number of feature variables, and $n$ is the number of samples.
The K-means algorithm divides the $n$ samples into $K$ disjoint clusters $C=\{C_1,C_2,\dots,C_K\}$, each cluster described by its centroid $\boldsymbol{\mu}_k$, which is calculated as the mean of the samples in the cluster $C_k$.
The objectives are twofold:
- Within each cluster, the samples are as similar to each other as possible.
- Across different clusters, the samples are as different from each other as possible.
Thus, the K-means algorithm aims to choose the centroids $\boldsymbol{\mu}=\{\mu_1,\mu_2,\dots,\mu_K\}$ and assign a cluster id $\mathbf{c}=\{c_1,c_2,\dots,c_n\}$ to the samples so as to minimize the within-cluster sum of squares:
\[\begin{align*} \arg\min_{\boldsymbol{\mu},\mathbf{c}}E(\boldsymbol{\mu},\mathbf{c}) &= \arg\min_{\boldsymbol{\mu},\mathbf{c}} \sum^{K}_{k=1}\sum_{\mathbf{x}_i \in C_{k}}||\mathbf{x}_i-\mu_k||^2\\ &=\arg\min_{\boldsymbol{\mu},\mathbf{c}} \sum^{n}_{i=1}||\mathbf{x}_i-\mu_{c_i}||^2 \end{align*}\]where $\mathbf{x}_i \in C_k$ indicates that the sample $\mathbf{x}_i$ belongs to the cluster $k$, $\sum_{\mathbf{x}_i \in C_k}||\mathbf{x}_i-\mu_k||^2$ is the sum of squared distances from the samples within the cluster $C_k$ to the centroid $\mu_k$, and $E$ is the sum of the squared distances of all samples to their corresponding cluster centroids.
The cost function of K-means is the average of the squared distances of all samples to their corresponding centroids, also called the distortion:
\[J(\boldsymbol{\mu},\mathbf{c}) = \frac{1}{n} \sum^{n}_{i=1}||\mathbf{x}_i-\mu_{c_i}||^2\tag{1}\]where $\mu_{c_i}$ is the centroid of sample $\mathbf{x}_i$.
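As a concrete illustration, here is a minimal NumPy sketch of Equation (1); the function name `distortion` and the array layout (`X` of shape $(n, m)$, `centroids` of shape $(K, m)$, and `labels` holding each $c_i$) are assumptions for this example:

```python
import numpy as np

def distortion(X, centroids, labels):
    """Average squared distance from each sample to its assigned centroid, Eq. (1)."""
    diffs = X - centroids[labels]                # (n, m): x_i - mu_{c_i} for every sample
    return np.mean(np.sum(diffs ** 2, axis=1))   # mean of squared Euclidean distances
```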
2. Optimization
To find the optimal cluster centroids and assignments that minimize the cost function:
- First, randomly initialize $K$ cluster centroids $\{\boldsymbol{\mu}_1,\dots,\boldsymbol{\mu}_k,\dots,\boldsymbol{\mu}_K\}$, where $\boldsymbol{\mu}_k\in \mathbb{R}^m$.
- Second, for each sample from $i = 1$ to $n$, find the cluster centroid closest to sample $\mathbf{x}_i$; that is, find the index $c_i$ of the closest centroid:
\[c_i = \arg\min_{k \in \{1,2,\dots,K\}} ||\mathbf{x}_i-\mu_k||^2\tag{2}\]The cluster-assignment step can be understood as minimizing the cost function while holding the centroids $\{\boldsymbol{\mu}_1,\dots,\boldsymbol{\mu}_K\}$ unchanged.
- Third, for each cluster from $k = 1$ to $K$, update the cluster centroid as the average of the points assigned to cluster $k$:
\[\mu_k = \frac{1}{|C_k|}\sum_{\mathbf{x}_i \in C_k}\mathbf{x}_i\tag{3}\]The move-centroid step can be understood as minimizing the cost function with respect to the centroids while holding the assignments $\{c_1,c_2,\dots,c_n\}$ unchanged:
\[\boldsymbol{\mu} = \arg\min_{\boldsymbol{\mu}} J(\boldsymbol{\mu},\mathbf{c})\]
The second and third steps are repeated until convergence, as in the sketch below.
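Putting the two steps together, here is a minimal NumPy sketch of the loop; the function name, the random-sample initialization, and the centroid-movement stopping test are illustrative choices for this example, not a reference implementation:

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means: assignment step, Eq. (2), and move-centroid step, Eq. (3)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as K randomly chosen training samples (see Section 3).
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)]

    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for each sample, Eq. (2).
        dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)  # (n, K)
        labels = np.argmin(dists, axis=1)

        # Move-centroid step: each centroid becomes the mean of its samples, Eq. (3).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])

        # Stop once the centroids no longer move appreciably.
        if np.max(np.abs(new_centroids - centroids)) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids

    return centroids, labels
```

For example, `centroids, labels = kmeans(X, K=3)` recovers both the final centroids and the cluster id of each sample.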
3. Initialization
Recommended ways of doing random initialization are:
- When $K<n$, randomly pick $K$ training examples as the initial centroids.
- K-means might converge to a local optimum. Multiple random initializations can be used; then pick the clustering that gives the lowest cost/distortion. This is especially useful when $K$ is small, say ranging from 2 to 10. However, when $K$ is large enough, multiple random initializations no longer provide a significant performance improvement.
- An alternative way to deal with local optima is smart initialization, which is implemented in scikit-learn as `k-means++` (see the sketch after this list).
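For instance, scikit-learn's `KMeans` combines both ideas: `n_init` controls the number of random restarts (the run with the lowest within-cluster sum of squares, exposed as `inertia_`, is kept), and `init='k-means++'` selects the smart initialization. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # toy data for illustration

# k-means++ initialization with 10 random restarts; the run with the
# lowest inertia (within-cluster sum of squares) is kept.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)          # within-cluster sum of squares of the best run
print(km.cluster_centers_)  # final centroids
print(km.labels_[:10])      # cluster ids of the first 10 samples
```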
4. How to Choose the Number of Clusters
When it comes to choosing the number of clusters, the following approaches are generally used:
- Visualize the samples to see whether distinct groups exist.
- Elbow criterion: plot the number of clusters $K$ on the x-axis against the distortion/cost on the y-axis, and check for an elbow, i.e., a point after which the cost/distortion decreases at a slower rate (see the sketch after this list). However, the plot may not show a clear elbow, which makes it harder to decide $K$.
- Evaluate K-means based on metrics from downstream tasks; specifically, how well the clustering serves the later tasks.
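A minimal sketch of the elbow plot, computing the cost over a range of $K$ on toy data (scikit-learn's `inertia_` is $n$ times the distortion in Equation (1), which does not change where the elbow appears):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # toy data for illustration

ks = range(1, 11)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), costs, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("cost (inertia)")
plt.show()
```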
5. Evaluating Performance
One internal metric is the Dunn index: the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter. A higher value indicates better-separated and more compact clusters.
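A minimal sketch of the Dunn index; this uses the single-linkage distance between clusters and the full diameter within clusters, which is one of several variants of the index:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Dunn index: min inter-cluster distance / max intra-cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Smallest distance between any two points in different clusters.
    min_between = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    # Largest distance between any two points within the same cluster.
    max_within = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_between / max_within
```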
References & Resources
- Week 8: Unsupervised Learning, Machine Learning from Stanford University on Coursera
- K-Means, Scikit-learn Documentation