K-Means is an unsupervised machine learning algorithm. The algorithm divides the data points into k groups (called clusters), where each data point can belong to only one cluster. K-Means aims to group together similar data points into the same cluster, while keeping different clusters as far apart as possible.
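Each cluster has a center (its centroid) that represents the cluster. The initial centers are picked from the data points, but after the first update a center is the average of the cluster's points and need not coincide with any actual data point. Each data point is assigned to the cluster whose center is closest to it. Closeness is measured by the squared Euclidean distance, and K-Means as a whole tries to minimize the sum of these squared distances between the points and their cluster centers.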
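For reference, the sum of squared distances can be written out explicitly. The notation is introduced here for illustration only and is not part of the original text: x is a data point, C_j is the set of points assigned to cluster j, and \mu_j is the center of that cluster.

```latex
J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

A smaller J means the points sit closer to their assigned centers.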
The algorithm proceeds as follows:
- Select the number of clusters, k.
- Choose k data points as the initial cluster centers, either at random or spread as far apart as possible.
- Repeat the following until the cluster assignments stop changing:
  - For each data point, calculate the squared distance between it and each cluster center.
  - Assign each point to the cluster with the closest center.
  - Recalculate each cluster's center as the average of all data points assigned to that cluster.
- K-Means is highly sensitive to the initially chosen cluster centers and can settle on a poor grouping. It is therefore common to run it several times with different starting centers and keep the run with the lowest total squared distance.
- If you do not know the best number of clusters, run the algorithm with several values of k and compare the results, for example by checking where the total squared distance stops dropping sharply as k grows.
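The steps and notes above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`squared_distance`, `mean_point`, `kmeans`, `best_of_restarts`), the `max_iter` cap, the choice to leave an empty cluster's center where it was, and the sample points are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, rng, max_iter=100):
    """One K-Means run; returns (centers, labels, total squared distance)."""
    centers = rng.sample(points, k)  # random initial centers taken from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged, so stop
            break
        labels = new_labels
        # Update step: move each center to the average of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean_point(members)
    cost = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, cost


def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run K-Means from several random starts and keep the lowest-cost run."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])


if __name__ == "__main__":
    # Made-up sample data: two visibly separate groups of three points.
    points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
    for k in (1, 2, 3):
        centers, labels, cost = best_of_restarts(points, k)
        print(k, labels, round(cost, 3))
```

Printing the cost for several values of k is one simple way to compare them: the cost keeps falling as k grows, so what matters is where the improvement levels off, not the smallest value.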
In agglomerative hierarchical clustering, each data point is initially treated as its own cluster. At each step, the two closest clusters are merged into one. This continues until only a single cluster remains. The sequence of merges forms a tree (a dendrogram), which can then be cut at the desired level to obtain as many clusters as needed.
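The merging procedure can be sketched in plain Python as well. The text does not say how the distance between two clusters is measured, so this sketch assumes single linkage (the distance between the closest pair of points); the function names (`single_linkage`, `agglomerate`, `cut`) and the sample points are likewise illustrative. Rather than building an explicit tree, it records the clusters after every merge, so "cutting the tree" amounts to picking the recorded level with the desired number of clusters.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def single_linkage(c1, c2):
    """Cluster-to-cluster distance: closest pair of points (an assumed choice)."""
    return min(squared_distance(p, q) for p in c1 for q in c2)


def agglomerate(points):
    """Merge the two closest clusters repeatedly; return the clusters at each level."""
    clusters = [[p] for p in points]        # start: every point is its own cluster
    levels = [[list(c) for c in clusters]]  # levels[i] holds len(points) - i clusters
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels


def cut(levels, n_clusters):
    """Pick the level of the merge history that has exactly n_clusters clusters."""
    return levels[len(levels) - n_clusters]


if __name__ == "__main__":
    # Made-up sample data: two visibly separate groups of three points.
    points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
    levels = agglomerate(points)
    for cluster in cut(levels, 2):
        print(cluster)
```

This brute-force search over all cluster pairs is slow for large data sets. It is meant only to mirror the description above.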