Machine Learning: Prepare Data for K-Means Clustering

Published On 2022/01/01 Saturday, Singapore

This post covers data preprocessing steps for K-Means clustering, specifically standardization and dimension reduction.


To illustrate the data preprocessing steps, we are using the Wholesale Customers Dataset. It includes the annual spending in monetary units on diverse product categories.

Feature          | Description                                                 | Type
Fresh            | Annual spending on fresh products                           | Continuous
Milk             | Annual spending on milk products                            | Continuous
Grocery          | Annual spending on grocery products                         | Continuous
Frozen           | Annual spending on frozen products                          | Continuous
Detergents_Paper | Annual spending on detergents and paper products            | Continuous
Delicatessen     | Annual spending on delicatessen products                    | Continuous
Channel          | Customer channel: Horeca (Hotel/Restaurant/Cafe) or Retail  | Nominal
Region           | Customer region: Lisbon, Oporto, or Other                   | Nominal


You can load the data with the following script:

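import pandas as pd

# The dataset is hosted in the UCI Machine Learning Repository.
uci_data_folder = "https://archive.ics.uci.edu/ml/machine-learning-databases/00292/"
uci_data_file = "Wholesale%20customers%20data.csv"

# Build the URL by string concatenation; os.path.join is meant for
# filesystem paths, not URLs.
df = pd.read_csv(uci_data_folder + uci_data_file)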


1. Standardization

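K-means optimizes a distance-based objective, so features with larger numeric ranges dominate the distance computation. When the features fall in very different ranges, they should be rescaled before clustering. There are several ways to do this. The two most popular are standardization with scikit-learn's StandardScaler (zero mean and unit variance per feature) and min-max scaling with MinMaxScaler (each feature rescaled to a fixed range, [0, 1] by default).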
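To make the effect of scale on distance concrete, here is a minimal sketch using NumPy only. The four customers and their spending values are made up for illustration and are not rows from the Wholesale Customers Dataset; the helper squared_distance_share is also hypothetical. It measures how much each feature contributes to the squared Euclidean distance between two customers, before and after z-score standardization.

```python
import numpy as np

# Hypothetical annual spending for four customers (NOT real dataset rows).
# Columns: [Fresh, Delicatessen]
X = np.array([
    [10000.0, 10.0],
    [10500.0, 90.0],
    [12000.0, 12.0],
    [3000.0, 50.0],
])


def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq_diff = (a - b) ** 2
    return sq_diff / sq_diff.sum()


# On the raw data, the large-range feature (Fresh) dominates the distance
# between the first two customers.
raw_share = squared_distance_share(X[0], X[1])

# Z-score standardization: subtract each column's mean, divide by its
# (population) standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_share = squared_distance_share(X_scaled[0], X_scaled[1])

print("share per feature (raw):   ", raw_share.round(3))
print("share per feature (scaled):", scaled_share.round(3))
```

In this toy example the first two customers differ much more in Delicatessen relative to its own spread, yet the raw distance is driven almost entirely by Fresh. After standardization, the relative difference in Delicatessen is what counts.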

In the example dataset, Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen are all annual spending amounts, but they are on different scales. In this case, we can apply standardization to them.
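A minimal sketch of both scalers applied to the six spending columns, assuming scikit-learn is installed. To keep the example self-contained, it uses a small hypothetical DataFrame with made-up values in place of the downloaded data, and the column names follow the feature table above. The header in the actual CSV may spell some columns differently, so check df.columns before selecting them.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical rows standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "Channel": [2, 2, 1, 1],
    "Region": [3, 3, 3, 1],
    "Fresh": [12000, 7000, 6300, 13000],
    "Milk": [9600, 9800, 8800, 1200],
    "Grocery": [7500, 9500, 7600, 4200],
    "Frozen": [200, 1700, 2400, 6400],
    "Detergents_Paper": [2600, 3200, 3500, 500],
    "Delicatessen": [1300, 1700, 7800, 1700],
})

# Only the continuous spending columns are scaled; Channel and Region are
# nominal codes and are left untouched here.
spending_cols = [
    "Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen",
]

# StandardScaler: each column ends up with mean 0 and unit variance.
standardized = pd.DataFrame(
    StandardScaler().fit_transform(df[spending_cols]),
    columns=spending_cols,
    index=df.index,
)

# MinMaxScaler: each column is rescaled to the range [0, 1] by default.
min_max_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df[spending_cols]),
    columns=spending_cols,
    index=df.index,
)

print(standardized.round(2))
print(min_max_scaled.round(2))
```

Either scaled frame can then be passed to a clustering step in place of the raw spending columns. Which scaler suits the data better is a judgment call; StandardScaler is shown first only because the section is about standardization.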

2. Dimension Reduction


3. Reference & Resources




