Machine Learning: K-Means Mathematics
Published On 2021/12/30 Thursday, Singapore
This post covers the mathematics behind K-means, specifically the following:
1. Cost Function
Assume we have a training dataset $D=\{\mathbf{x}_1,\dots,\mathbf{x}_n\}$, where $\mathbf{x}_i\in \mathbb{R}^m$ is the feature vector of the $i$-th sample, $m$ is the number of feature variables, and $n$ is the number of samples.
The K-means algorithm divides the $n$ samples into $K$ disjoint clusters $C=\{C_1,C_2,\dots,C_K\}$, each cluster described by its centroid $\boldsymbol{\mu}_k$, which is calculated as the mean of the samples in the cluster $C_k$.
The objectives are twofold:
- Within each cluster, the samples are as similar to each other as possible.
- Across different clusters, the samples are as different from each other as possible.
Thus, the K-means algorithm aims to choose the centroids $\boldsymbol{\mu}=\{\mu_1,\mu_2,\dots,\mu_K\}$ and assign a cluster id $\mathbf{c}=\{c_1,c_2,\dots,c_n\}$ to the samples so as to minimize the within-cluster sum of squares:
\[\begin{align*} \arg\min_{\boldsymbol{\mu},\mathbf{c}}E(\boldsymbol{\mu},\mathbf{c}) &= \arg\min_{\boldsymbol{\mu},\mathbf{c}} \sum^{K}_{k=1}\sum_{\mathbf{x}_i \in C_{k}}||\mathbf{x}_i-\mu_k||^2\\ &=\arg\min_{\boldsymbol{\mu},\mathbf{c}} \sum^{n}_{i=1}||\mathbf{x}_i-\mu_{c_i}||^2 \end{align*}\]where $\mathbf{x}_i \in C_k$ indicates that the sample $\mathbf{x}_i$ belongs to the cluster $k$, $\sum_{\mathbf{x}_i \in C_k}||\mathbf{x}_i-\mu_k||^2$ is the sum of squared distances from the samples within the cluster $C_k$ to the centroid $\mu_k$, and $E$ is the sum of the squared distances of all samples to their corresponding cluster centroids.
The cost function of K-means is the average of the squared distances of all samples to their corresponding centroids, also called the distortion:
\[J(\boldsymbol{\mu},\mathbf{c}) = \frac{1}{n} \sum^{n}_{i=1}||\mathbf{x}_i-\mu_{c_i}||^2\tag{1}\]where $\mu_{c_i}$ is the centroid of sample $\mathbf{x}_i$.
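As a concrete illustration, here is a minimal NumPy sketch of Equation (1); the function name `distortion` and the array layout (`X` of shape $(n, m)$, `centroids` of shape $(K, m)$, and `labels` holding each $c_i$) are assumptions for this example:

```python
import numpy as np

def distortion(X, centroids, labels):
    """Average squared distance from each sample to its assigned centroid, Eq. (1)."""
    diffs = X - centroids[labels]                # (n, m): x_i - mu_{c_i} for every sample
    return np.mean(np.sum(diffs ** 2, axis=1))   # mean of squared Euclidean distances
```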
2. Optimization
To find the optimal cluster centroids and assignments that minimize the cost function:
- First, randomly initialize $K$ cluster centroids $\{\boldsymbol{\mu}_1,\dots,\boldsymbol{\mu}_k,\dots,\boldsymbol{\mu}_K\}$, where $\boldsymbol{\mu}_k\in \mathbb{R}^m$.
- Second, for each sample from $i = 1$ to $n$, find the cluster centroid closest to sample $\mathbf{x}_i$; that is, find the index $c_i$ of the closest centroid:
\[c_i = \arg\min_{k \in \{1,2,\dots,K\}} ||\mathbf{x}_i-\mu_k||^2\tag{2}\]The cluster-assignment step can be understood as minimizing the cost function while holding the centroids $\{\boldsymbol{\mu}_1,\dots,\boldsymbol{\mu}_K\}$ unchanged.
- Third, for each cluster from $k = 1$ to $K$, update the cluster centroid as the average of the points assigned to cluster $k$:
\[\mu_k = \frac{1}{|C_k|}\sum_{\mathbf{x}_i \in C_k}\mathbf{x}_i\tag{3}\]The move-centroid step can be understood as minimizing the cost function with respect to the centroids while holding the assignments $\{c_1,c_2,\dots,c_n\}$ unchanged:
\[\boldsymbol{\mu} = \arg\min_{\boldsymbol{\mu}} J(\boldsymbol{\mu},\mathbf{c})\]
The second and third steps are repeated until convergence, as in the sketch below.
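Putting the two steps together, here is a minimal NumPy sketch of the loop; the function name, the random-sample initialization, and the centroid-movement stopping test are illustrative choices for this example, not a reference implementation:

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means: assignment step, Eq. (2), and move-centroid step, Eq. (3)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as K randomly chosen training samples (see Section 3).
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)]

    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for each sample, Eq. (2).
        dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)  # (n, K)
        labels = np.argmin(dists, axis=1)

        # Move-centroid step: each centroid becomes the mean of its samples, Eq. (3).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])

        # Stop once the centroids no longer move appreciably.
        if np.max(np.abs(new_centroids - centroids)) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids

    return centroids, labels
```

For example, `centroids, labels = kmeans(X, K=3)` recovers both the final centroids and the cluster id of each sample.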
3. Initialization
Recommended ways of doing random initialization are:
- When $K<n$, randomly pick $K$ training examples as the initial centroids.
- K-means might converge to a local optimum. Multiple random initializations can be used; then pick the clustering that gives the lowest cost/distortion. This is especially useful when $K$ is small, say ranging from 2 to 10. However, when $K$ is large enough, multiple random initializations no longer provide a significant performance improvement.
- An alternative way to deal with local optima is smart initialization, which is implemented in scikit-learn as `k-means++` (see the sketch after this list).
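For instance, scikit-learn's `KMeans` combines both ideas: `n_init` controls the number of random restarts (the run with the lowest within-cluster sum of squares, exposed as `inertia_`, is kept), and `init='k-means++'` selects the smart initialization. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # toy data for illustration

# k-means++ initialization with 10 random restarts; the run with the
# lowest inertia (within-cluster sum of squares) is kept.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)          # within-cluster sum of squares of the best run
print(km.cluster_centers_)  # final centroids
print(km.labels_[:10])      # cluster ids of the first 10 samples
```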
4. How to Choose the Number of Clusters
When it comes to choosing the number of clusters, the following approaches are generally used:
- Visualize the samples to see whether distinct groups exist.
- Elbow criterion: plot the number of clusters $K$ on the x-axis against the distortion/cost on the y-axis, and check for an elbow, i.e., a point after which the cost/distortion decreases at a slower rate (see the sketch after this list). However, the plot may not show a clear elbow, which makes it harder to decide $K$.
- Evaluate K-means based on metrics from downstream tasks; specifically, how well the clustering serves the later tasks.
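A minimal sketch of the elbow plot, computing the cost over a range of $K$ on toy data (scikit-learn's `inertia_` is $n$ times the distortion in Equation (1), which does not change where the elbow appears):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # toy data for illustration

ks = range(1, 11)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), costs, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("cost (inertia)")
plt.show()
```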
5. Evaluating Performance
One internal metric is the Dunn index: the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter. A higher value indicates better-separated and more compact clusters.
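A minimal sketch of the Dunn index; this uses the single-linkage distance between clusters and the full diameter within clusters, which is one of several variants of the index:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Dunn index: min inter-cluster distance / max intra-cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Smallest distance between any two points in different clusters.
    min_between = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    # Largest distance between any two points within the same cluster.
    max_within = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_between / max_within
```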
References & Resources
- Week 8: Unsupervised Learning, Machine Learning from Stanford University on Coursera
- K-Means, Scikit-learn Documentation