kMeans

Overview

  • It defines k centroids (k is manually selected).

  • k \Rightarrow number of clusters.

  • Every instance MUST be assigned to a centroid.

  • Each data point owns to a single cluster.

The algorithm

  1. Randomly selects k centroids from the dataset.

  2. Computes the distances (euclidean, cosine...) between for each non-centroid point to the k centroids.

  3. Assigns each non-centroid point to a cluster, based on the smallest computed distances.

  4. Recalculates the cluster centroids: the new centroid is the average or the mean value of all the cluster instances.

  5. Back to Step 2 until:

    1. The maximum number of iterations is reached

    2. The clusters stabilize: when clusters remain the same, centroids remain the same of distances are small enough.

Always converges after enough iterations to a local optimum. However, it doesn't provide a deterministic solution, due to the random initialization.

How to choose the optimal k?

  • Plotting data by picking a different color for each data point gives a good overview.

  • Evaluate the inertia of each cluster. The goal is:

    • a low metric value

    • a low number number of clusters

  • The elbow method is a good tool to select the optimal value of k.

Last updated