K-Means Clustering
K-means clustering is one of the most widely used unsupervised learning algorithms in Machine Learning and Statistics. It partitions data into (K) clusters such that points within the same cluster are as similar as possible.
Objective Function
K-means minimizes the within-cluster sum of squares (WCSS):
\min_{{C_k,\mu_k}} \sum_{k=1}^{K} \sum_{x_i \in C_k} |x_i – \mu_k|^2
where:
-
(C_k) is cluster (k)
-
(\mu_k) is the centroid (mean) of cluster (k)
-
(|x_i – \mu_k|^2) is the squared Euclidean distance
Algorithm Steps
-
Choose the number of clusters (K).
-
Initialize (K) centroids (often using K-means++).
-
Assign each point to the nearest centroid.
-
Recompute each centroid as the mean of assigned points.
-
Repeat steps 3᎓4 until assignments no longer change or the objective stabilizes.
Example
Suppose customer data contain two features:
-
Annual income
-
Spending score
Using (K=3), K-means may identify clusters such as:
-
High income, high spending
-
Moderate income, moderate spending
-
Low income, low spending
Choosing the Number of Clusters
Common methods include:
-
Elbow method
-
Silhouette score
-
Gap statistic
Advantages
-
Simple and fast
-
Scales to large datasets
-
Easy to interpret
Limitations
-
Must choose (K) in advance
-
Sensitive to initialization
-
Assumes roughly spherical, similarly sized clusters
-
Sensitive to outliers
Applications
-
Customer segmentation
-
Image compression
-
Document clustering
-
Anomaly detection (as a preprocessing step)
-
Gene expression analysis
K-Means vs Other Clustering Methods
| Method | Strength |
|---|---|
| K-means | Fast and scalable |
| Hierarchical clustering | Produces dendrograms |
| DBSCAN | Detects arbitrary-shaped clusters and noise |
| Gaussian Mixture Models | Provides probabilistic assignments |
Summary
K-means clustering partitions data into (K) groups by iteratively assigning points to the nearest centroid and updating centroids to minimize within-cluster variance. It is a foundational algorithm for exploratory data analysis and unsupervised learning.