Machine Learning: Prepare Data for K-Means Clustering

Published On 2022/01/01 Saturday, Singapore

This post covers two data preprocessing steps for K-Means clustering: standardization and dimension reduction.

To illustrate the preprocessing steps, we use the Wholesale Customers Dataset, which records annual spending, in monetary units, on several product categories.

Feature	Description	Type
Fresh	Annual spending on fresh products	Continuous
Milk	Annual spending on milk products	Continuous
Grocery	Annual spending on grocery products	Continuous
Frozen	Annual spending on frozen products	Continuous
Detergents_Paper	Annual spending on detergents and paper products	Continuous
Delicassen	Annual spending on delicatessen products (the CSV column is spelled Delicassen)	Continuous
Channel	Customer channel: Horeca (Hotel/Restaurant/Cafe) or Retail	Nominal
Region	Customer region: Lisbon, Oporto or Other	Nominal

You can load the data with the following script:

import pandas as pd

# Build the URL by plain concatenation: os.path.join is meant for filesystem
# paths and would insert a backslash on Windows.
uci_data_folder = "https://archive.ics.uci.edu/ml/machine-learning-databases/00292/"
uci_data_file = "Wholesale%20customers%20data.csv"
df = pd.read_csv(uci_data_folder + uci_data_file)

1. Standardization

K-Means optimization is based on distance metrics, so features with larger ranges contribute more to the distance and dominate the clustering. When the features fall in very different ranges, standardization is needed. There are different ways of doing standardization; among them, the most popular are Standard Scaler and Min Max Scaler.

In the example dataset, Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen all record annual spending, but they are on different scales. In this case, we can apply standardization to them.
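
To make the scale problem concrete, here is a minimal sketch using three made-up customers (the values are invented for illustration and are not taken from the dataset; `min_max` is a hypothetical helper). On the raw values, the Euclidean distance is driven almost entirely by Fresh, the feature with the larger range. After rescaling each feature to [0, 1], Frozen counts just as much, and the nearest neighbour of customer `a` changes.

```python
import math

# Hypothetical customers as (Fresh, Frozen) annual spending; values are invented.
a = (12000.0, 200.0)
b = (12500.0, 900.0)   # close to a on Fresh, far from a on Frozen
c = (15000.0, 210.0)   # far from a on Fresh, close to a on Frozen

def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Distances on the raw values: the Fresh gap dominates.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# Rescale each feature separately, then rebuild the three points.
fresh = min_max([a[0], b[0], c[0]])
frozen = min_max([a[1], b[1], c[1]])
sa, sb, sc = zip(fresh, frozen)

scaled_ab = math.dist(sa, sb)
scaled_ac = math.dist(sa, sc)

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values `b` is much closer to `a` than `c` is; after rescaling, the two distances are nearly equal and `c` is marginally closer. K-Means would therefore group these customers differently depending on whether the features were scaled.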

Standard Scaler standardizes features by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as:

\[z = (x - u) / s\]

where $u$ is the mean of the training samples, and $s$ is the standard deviation of the training samples.

  from sklearn.preprocessing import StandardScaler
  cols = ['Fresh', 'Milk', 'Grocery', 'Frozen','Detergents_Paper', 'Delicassen']
  std_cols = [f"std_{col}" for col in cols]
  scaler = StandardScaler()
  df[std_cols] = scaler.fit_transform(df[cols])
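
As a check on the formula, the standard score can be worked out by hand. The sketch below uses only the Python standard library and a made-up list of values (not taken from the dataset). It uses the population standard deviation, which is what StandardScaler uses as well (equivalent to `numpy.std` with `ddof=0`).

```python
import statistics

# Hypothetical values for a single feature; invented for illustration.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

u = statistics.fmean(x)    # mean of the samples
s = statistics.pstdev(x)   # population standard deviation (divides by n, not n - 1)

# z = (x - u) / s applied to every sample
z = [(v - u) / s for v in x]

print(u, s)
print(z)
```

For these values the mean is 5 and the standard deviation is 2, so the sample 9.0 maps to a standard score of 2. The transformed values have mean 0 and unit variance.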

Min Max Scaler scales and translates each feature individually such that it is in the given range on the training set. By default, it scales to a range between zero and one.

  from sklearn.preprocessing import MinMaxScaler
  cols = ['Fresh', 'Milk', 'Grocery', 'Frozen','Detergents_Paper', 'Delicassen']
  mm_cols = [f"mm_{col}" for col in cols]
  scaler = MinMaxScaler()
  df[mm_cols] = scaler.fit_transform(df[cols])
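
For the default range, the transformation is x' = (x - min) / (max - min), where min and max are taken per feature. The following minimal sketch applies it by hand to a made-up list of values (not taken from the dataset). The function `min_max_scale` and its `feature_range` argument are hypothetical names for illustration, not part of scikit-learn.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so the minimum maps to the low end of
    feature_range and the maximum to the high end.

    Assumes the values are not all equal; otherwise max - min is zero.
    """
    lo, hi = min(values), max(values)
    a, b = feature_range
    return [a + (v - lo) / (hi - lo) * (b - a) for v in values]

# Hypothetical values for a single feature; invented for illustration.
x = [2.0, 4.0, 6.0, 10.0]

scaled = min_max_scale(x)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value maps to 0 and the largest to 1, with everything else placed proportionally in between.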

2. Dimension Reduction

3. Reference & Resources

Introduction to Customer Segmentation in Python, Coursera
