kMeans
It defines k centroids (k is manually selected).
k is the number of clusters.
Every instance MUST be assigned to a centroid.
Each data point belongs to a single cluster.
Randomly selects k centroids from the dataset.
Computes the distance (Euclidean, cosine...) from each non-centroid point to each of the k centroids.
Assigns each non-centroid point to the cluster of its nearest centroid, i.e. the one with the smallest computed distance.
Recalculates the cluster centroids: the new centroid is the mean of all the instances in the cluster.
Goes back to step 2 (the distance computation) and repeats until one of the following holds:
The maximum number of iterations is reached
The clusters stabilize: the cluster assignments remain the same, the centroids remain the same, or the distances the centroids move are small enough.
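The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name k_means, its parameters (max_iter, tol, seed) and the sample points are invented for the example, the distance is Euclidean, and every point (including the ones picked as initial centroids) goes through the assignment step.

```python
import math
import random

def k_means(points, k, max_iter=100, tol=1e-9, seed=None):
    """Illustrative k-means for points given as tuples of floats."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: assign each point to its nearest centroid (Euclidean).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Largest distance moved by any centroid in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        converged = new_labels == labels or shift <= tol
        centroids, labels = new_centroids, new_labels
        if converged:
            break
    return centroids, labels

# Two small, well-separated groups (made-up sample data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = k_means(points, k=2, seed=0)
print(centroids, labels)
```

Which cluster gets index 0 depends on the random initialization, so only the grouping of the labels is meaningful, not their numeric values.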
Always converges to a local optimum after enough iterations. However, the solution is not deterministic: because of the random initialization, different runs can end in different local optima.
Plotting the data with a different color for each cluster gives a good overview.
Evaluate the inertia of each cluster. The goal is:
a low metric value
a low number of clusters
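As a rough illustration of the metric, the sketch below assumes that inertia means the sum of squared Euclidean distances between each point and its assigned centroid, which is the usual convention. The function name inertia_per_cluster and the sample values are made up for the example.

```python
import math

def inertia_per_cluster(points, labels, centroids):
    """Sum of squared distances to the assigned centroid, one total per cluster."""
    totals = [0.0] * len(centroids)
    for p, lab in zip(points, labels):
        totals[lab] += math.dist(p, centroids[lab]) ** 2
    return totals

# Made-up sample: two clusters with their centroids at the cluster means.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
centroids = [(1.0, 0.0), (10.0, 2.0)]

per_cluster = inertia_per_cluster(points, labels, centroids)
total = sum(per_cluster)
print(per_cluster, total)  # [2.0, 8.0] 10.0
```

The second cluster is more spread out, so it contributes more to the total. Adding clusters can only lower the total, which is why the inertia has to be weighed against the number of clusters.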
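The elbow method is a good tool to select the optimal value of k: plot the inertia against k and pick the value at which the curve bends, i.e. where adding more clusters brings only a small further decrease.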
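A possible way to produce the numbers behind an elbow curve is sketched below. To stay self-contained it uses a compact one-dimensional k-means with several random restarts (keeping the lowest inertia found for each k), which is a simplification chosen for the example. The helper name best_inertia, its parameters and the sample values are assumptions, not part of any library.

```python
import random

def best_inertia(values, k, restarts=20, max_iter=100, seed=0):
    """Lowest inertia found over several random restarts of 1-D k-means."""
    rng = random.Random(seed)
    best = float("inf")
    for _restart in range(restarts):
        centroids = rng.sample(values, k)  # random initialization
        for _step in range(max_iter):
            # Assign each value to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for v in values:
                nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
                clusters[nearest].append(v)
            # Recompute centroids; an empty cluster keeps its old centroid.
            updated = [sum(c) / len(c) if c else centroids[i]
                       for i, c in enumerate(clusters)]
            if updated == centroids:
                break
            centroids = updated
        inertia = sum((v - centroids[i]) ** 2
                      for i, c in enumerate(clusters) for v in c)
        best = min(best, inertia)
    return best

# Made-up sample: three tight groups on a line.
values = [0.8, 1.0, 1.2, 4.7, 5.0, 5.3, 8.9, 9.0, 9.1]
inertias = {k: best_inertia(values, k) for k in range(1, 7)}
for k in sorted(inertias):
    drop = inertias[k - 1] - inertias[k] if k > 1 else float("nan")
    print(f"k={k}  inertia={inertias[k]:8.3f}  drop from k-1={drop:8.3f}")
```

With this sample the inertia should fall steeply up to k = 3 and only marginally afterwards, which is the bend the elbow method looks for. The restarts are there because a single run can stop in a poor local optimum and distort the curve.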