Machine Learning: Prepare Data for K-Means Clustering
Published On 2022/01/01 Saturday, Singapore
This post covers data preprocessing steps for K-Means clustering, specifically standardization and dimension reduction.
To illustrate the data preprocessing steps, we are using the Wholesale Customers Dataset. It includes the annual spending in monetary units on diverse product categories.
Feature | Description | Type |
---|---|---|
Fresh | Annual spending on fresh products | Continuous |
Milk | Annual spending on milk products | Continuous |
Grocery | Annual spending on grocery products | Continuous |
Frozen | Annual spending on frozen products | Continuous |
Detergents_Paper | Annual spending on detergents and paper products | Continuous |
Delicassen | Annual spending on delicatessen products | Continuous |
Channel | Customer channel: Horeca (Hotel/Restaurant/Cafe) or Retail | Nominal |
Region | Customer region: Lisbon, Oporto or Other | Nominal |
You can access the data with the following script:
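import pandas as pd

# Location of the dataset on the UCI Machine Learning Repository
uci_data_folder = "https://archive.ics.uci.edu/ml/machine-learning-databases/00292/"
uci_data_file = "Wholesale%20customers%20data.csv"
# Build the URL by concatenation; os.path.join would insert backslashes on Windows
df = pd.read_csv(uci_data_folder + uci_data_file)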
1. Standardization
K-means optimization is based on distance metrics, so features with larger ranges dominate the distance computation. When the features fall in completely different ranges, standardization is needed. There are different ways of doing standardization; the most popular are Standard Scaler and Min Max Scaler.
In the example dataset, Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen are all annual spending, but they are on different scales. In this case, we can apply standardization to them.
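To make the scale problem concrete, here is a minimal sketch of how Euclidean distance behaves before and after rescaling. The three customers and the per-feature spreads are made-up values chosen for illustration; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical customers as [Fresh, Delicassen] annual spending (made-up values).
a = np.array([12000.0, 100.0])
b = np.array([12500.0, 900.0])   # similar Fresh, very different Delicassen
c = np.array([14000.0, 110.0])   # different Fresh, similar Delicassen

def euclidean(p, q):
    """Plain Euclidean distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# On the raw values, the large-range feature (Fresh) dominates the distance.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Assumed per-feature spreads, used only to put both features on a comparable scale.
spread = np.array([12000.0, 2800.0])
scaled_ab = euclidean(a / spread, b / spread)
scaled_ac = euclidean(a / spread, c / spread)

print("raw:   ", raw_ab, raw_ac)
print("scaled:", scaled_ab, scaled_ac)
```

On the raw values, customer b looks closer to a than c does, even though b spends nine times as much on delicatessen products. Once both features are on a comparable scale, the ordering flips.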
- Standard Scaler standardizes features by removing the mean and scaling to unit variance. The standard score of a sample $x$ is calculated as:
\[z = (x - u) / s\]
where $u$ is the mean of the training samples, and $s$ is the standard deviation of the training samples.
from sklearn.preprocessing import StandardScaler

cols = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
std_cols = [f"std_{col}" for col in cols]
scaler = StandardScaler()
# Fit on the spending columns and store the standardized values in new std_* columns
df[std_cols] = scaler.fit_transform(df[cols])
- Min Max Scaler scales and translates each feature individually such that it is in the given range on the training set. By default, it scales to a range between zero and one.
from sklearn.preprocessing import MinMaxScaler

cols = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
mm_cols = [f"mm_{col}" for col in cols]
scaler = MinMaxScaler()
# Fit on the spending columns and store the rescaled values in new mm_* columns
df[mm_cols] = scaler.fit_transform(df[cols])
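As a sanity check on the two scalers, the sketch below reproduces their output by hand on a small toy frame. The `toy` values are made up for illustration and are not rows from the dataset. Two details worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), and `MinMaxScaler` computes `(x - min) / (max - min)` before mapping to the requested `feature_range`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up toy values standing in for two spending columns (not real dataset rows).
toy = pd.DataFrame({
    "Fresh": [1000.0, 3000.0, 5000.0, 11000.0],
    "Milk": [200.0, 400.0, 1000.0, 2400.0],
})
cols = ["Fresh", "Milk"]

# Standard Scaler: z = (x - u) / s, with s the population standard deviation (ddof=0).
std_sklearn = StandardScaler().fit_transform(toy[cols])
std_manual = ((toy[cols] - toy[cols].mean()) / toy[cols].std(ddof=0)).to_numpy()

# Min Max Scaler: (x - min) / (max - min), which maps each column onto [0, 1].
mm_sklearn = MinMaxScaler().fit_transform(toy[cols])
col_min, col_max = toy[cols].min(), toy[cols].max()
mm_manual = ((toy[cols] - col_min) / (col_max - col_min)).to_numpy()

# A different target range can be requested through the feature_range parameter.
mm_wide = MinMaxScaler(feature_range=(-1, 1)).fit_transform(toy[cols])

print(np.allclose(std_sklearn, std_manual))
print(np.allclose(mm_sklearn, mm_manual))
print(mm_wide.min(axis=0), mm_wide.max(axis=0))
```

Because pandas defaults to the sample standard deviation (`ddof=1`), calling `.std()` without `ddof=0` would give slightly different values from `StandardScaler`.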
2. Dimension Reduction
3. Reference & Resources