kMeans

Sources:

Overview

It defines k centroids (k is manually selected).
k $\Rightarrow$ number of clusters.
Every instance MUST be assigned to a centroid.
Each data point owns to a single cluster.

The algorithm

Randomly selects k centroids from the dataset.
Computes the distances (euclidean, cosine...) between for each non-centroid point to the k centroids.
Assigns each non-centroid point to a cluster, based on the smallest computed distances.
Recalculates the cluster centroids: the new centroid is the average or the mean value of all the cluster instances.
Back to Step 2 until:
1. The maximum number of iterations is reached
2. The clusters stabilize: when clusters remain the same, centroids remain the same of distances are small enough.

Always converges after enough iterations to a local optimum. However, it doesn't provide a deterministic solution, due to the random initialization.

How to choose the optimal k?

Plotting data by picking a different color for each data point gives a good overview.
Evaluate the inertia of each cluster. The goal is:
- a low metric value
- a low number number of clusters
The elbow method is a good tool to select the optimal value of k.

PreviousClustering metrics NextGaussian Mixture Model

Last updated 5 years ago

Was this helpful?