SQLblog.com - The SQL Server blog spot on the web

# Data Mining Algorithms – K-Means Clustering

Hierarchical clustering can be very useful: a dendrogram makes it easy to see the optimal number of clusters, and it visualizes both the clusters and the process of building them. However, hierarchical methods don’t scale well. Just imagine how cluttered a dendrogram would be if it had to show 10,000 cases.

K-Means is a distance-based partitioning algorithm that divides a data set into a predetermined (“k”) number of clusters around their average locations (“means”). Intuitively, you already know how to group people or any other cases. Groups do not need to have an equal number of members, and you can group according to one or more attributes.

The algorithm comes from geometry. Imagine a record space with the attributes as dimensions. Each record (case) is then uniquely located in this space by the values of its attributes (variables).
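In this geometric view, the distance between two cases can be measured as the straight-line (Euclidean) distance between their points in attribute space. A minimal sketch in Python, with made-up attribute values:

```python
import math

# Two hypothetical cases (records) described by two attributes, e.g. age and income
case_a = (30.0, 45000.0)
case_b = (40.0, 52000.0)

# Euclidean distance between the two points in attribute space
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(case_a, case_b)))
print(round(distance, 2))  # → 7000.01
```

Note how the attribute with the larger scale (income) dominates the distance; this is why input variables are typically normalized before clustering.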

The algorithm initially creates k fictitious members and defines them as the means of the clusters. These fictitious cases are also called centroids. The values of the input variables for these centroids can be selected randomly. Some implementations also use a bit of heuristics: they take the marginal distributions of the attributes as a starting point and randomly perturb from there.

The algorithm then assigns each record to the nearest centroid, which gives the initial clusters. Once the clusters are defined, the algorithm can calculate their actual centroids and thus gets new centroids. After the new centroids are calculated, the algorithm reassigns each record to the nearest centroid; some records jump from cluster to cluster.

Now the algorithm can calculate new centroids and then new clusters again. It iterates these last two steps until the cluster boundaries stop changing. You can also stop the iterations as soon as fewer cases than a minimum number, defined as a parameter, jump from cluster to cluster.
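The steps above can be sketched in Python. This is an illustrative implementation, not the code any specific product uses; the cases, starting centroids, and parameter names are all made up:

```python
import random

def k_means(cases, k, initial_centroids=None, min_jumps=0, max_iter=100):
    """Minimal k-means sketch: start from k centroids (random cases unless
    given explicitly), then alternate the assignment and update steps until
    no more than min_jumps cases change cluster."""
    centroids = list(initial_centroids) if initial_centroids else random.sample(cases, k)
    assignment = [None] * len(cases)
    for _ in range(max_iter):
        jumps = 0
        # Assignment step: each case goes to its nearest centroid
        for i, case in enumerate(cases):
            nearest = min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(case, centroids[c])),
            )
            if nearest != assignment[i]:
                jumps += 1
                assignment[i] = nearest
        if jumps <= min_jumps:
            break  # cluster boundaries stopped changing
        # Update step: move each centroid to the mean of its current members
        for c in range(k):
            members = [case for i, case in enumerate(cases) if assignment[i] == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Six cases in two obvious groups; fixed starting centroids keep the run deterministic
cases = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
labels, centroids = k_means(cases, k=2, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels)  # → [0, 0, 1, 1, 0, 1]
```

With random starting centroids, different runs can converge to different clusterings, which is why production implementations usually restart from several random initializations and keep the best result.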

Here is a graphical representation of the process. You can see the cases in a two-dimensional space. The dark brown cases are the fictitious centroids. The green case is the one that will jump between clusters.

After the centroids are selected, the algorithm assigns each case to the nearest centroid.

Now we have our three initial clusters. The algorithm can calculate the real centroids of those three clusters. This means that the centroids move.

The algorithm has to recalculate the cluster membership. The green case jumps from the middle cluster to the bottom left cluster.

In the next iteration, no case jumps from one cluster to another. Therefore, the algorithm can stop.

K-Means clustering scales much better than hierarchical methods. However, it has drawbacks as well. First of all, what is the optimal number of clusters? You can’t know in advance. Therefore, you need to create different models with different numbers of clusters and then select the one that fits your data best.
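One common way to compare models with different numbers of clusters is the within-cluster sum of squares: it drops sharply as long as an extra cluster splits a real group, and only slowly afterwards. A small illustration with made-up data and two hand-picked partitions:

```python
# Hypothetical 2-D cases forming two tight groups
cases = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]

def wcss(clusters):
    """Within-cluster sum of squares: total squared distance of each case
    to its cluster's centroid. Lower means tighter clusters."""
    total = 0.0
    for members in clusters:
        centroid = tuple(sum(dim) / len(members) for dim in zip(*members))
        total += sum(sum((x - c) ** 2 for x, c in zip(case, centroid)) for case in members)
    return total

one_cluster = wcss([cases])                  # k = 1: everything together
two_clusters = wcss([cases[:3], cases[3:]])  # k = 2: the two natural groups
print(one_cluster > two_clusters)  # → True, and the drop is dramatic
```

Building models for a range of k values and plotting this measure (the so-called elbow method) is one practical way to pick the number of clusters.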

The next problem is the meaning of the clusters. There are no labels for the clusters known in advance. Once the model is built, you need to check the distributions of the input variables in each cluster to understand what kind of cases constitute it. Only after this step can you label the clusters.
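A simple way to profile clusters is to look at the mean (or the full distribution) of every input variable per cluster. A sketch with hypothetical cases and cluster numbers:

```python
# Hypothetical output of a clustering model: each case keeps its input
# variables plus the cluster number the algorithm assigned to it
cases = [
    {"age": 24, "income": 28000, "cluster": 0},
    {"age": 27, "income": 31000, "cluster": 0},
    {"age": 55, "income": 90000, "cluster": 1},
    {"age": 61, "income": 95000, "cluster": 1},
]

# Group the cases by cluster, then profile each cluster by its variable means
profiles = {}
for case in cases:
    profiles.setdefault(case["cluster"], []).append(case)
for cluster, members in sorted(profiles.items()):
    mean_age = sum(m["age"] for m in members) / len(members)
    mean_income = sum(m["income"] for m in members) / len(members)
    print(f"cluster {cluster}: mean age {mean_age:.1f}, mean income {mean_income:.0f}")
```

In this made-up example, cluster 0 could be labeled something like “young, lower income” and cluster 1 “older, higher income”.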

Published Friday, April 17, 2015 9:03 AM by Dejan Sarka
