Blogs/K Means

# K Means

peterwashington Nov 01 2021 8 min read 0 viewsUnsupervised

The simplest method for clustering is called *k-means clustering.* In k-means clustering, you input an argument k which represents the number of clusters you want to find.

The first step is to choose k random “center points”. In the image below, the circles are the data points and the squares are the k=3 initialized cluster centers:

Each data point is assigned to the center point that it is closest to. We draw a circle around these “clusters” which are defined by which center point (square) each data point (circle) is closest to:

With the newly defined clusters, the mean point is re-calculated as the mean coordinate of all the points in that cluster:

Each data point is reassigned to its closest center point based on the updated center points:

This process of assigning points to the closest center point and updating the assignment of data points to center points based on distance is repeated until either the clusters no longer change or several iterations of the algorithm have run.

Now that we understand the intuition for how k-means clustering works, let’s look at the formal mathematical definition which defines that process. The goal of k-means clustering is to find the sets of points S (which there are K of) and mean(i) denotes the mean point of centroid i:

The outer term, the argmin over C, is saying that we are finding the set (of clusters of points), named C, that is minimal over the equation to follow. That equation to follow starts with a summation from i=1 up to i=K, meaning we are summing over all K clusters. The inner summation is summing, for a particular cluster of points i, the distance from each point * x * to the mean point of the cluster *C**i*.

Putting it all together, we are finding the configuration of points such that the sum, over all clusters of points, of the distances between each point in each cluster to the average point in that cluster is minimized. This is just a mathematical way of stating the process that we described with pictures above.