THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Dejan Sarka

Data Mining Algorithms – K-Means Clustering

Hierarchical clustering could be very useful because it is easy to see the optimal number of clusters in a dendrogram and because the dendrogram visualizes the clusters and the process of building of that clusters. However, hierarchical methods don’t scale well. Just imagine how cluttered a dendrogram would be if 10,000 cases would be shown on it.

K-Means is a distance-based partitioning algorithm that divides data set in predetermined (“k”) number of clusters around the average location (“mean”). In your mind, you intuitively know how to group people or any other cases. Groups do not need to have an equal number of members. You can do grouping according to one or more attributes.

The algorithm comes from geometry. Imagine record space with attributes as dimensions. Each record (case) is uniquely located in space with values of the attributes (variables).

The algorithm initially creates k fictitious members and defines them at the means of the clusters. These fictitious cases are also called centroids. The values of the input variables for these centroids could be selected randomly. Some algorithms use also a bit of heuristics and use the marginal distributions of the attributes as a starting point and randomly perturb from there.

The algorithm then assigns each record to nearest centroid. This way, you get the initial clusters. When the clusters are defined, the algorithm can calculate the actual centroids of clusters and get new centroids. After the new centroids are calculated, the algorithm reassigns each record to the nearest centroid. Some records jump from cluster to cluster.

Now the algorithm can calculate new centroids and then new clusters. The algorithm iterates last two steps until cluster boundaries stop changing. You can stop the iterations when there is less than the minimum number of cases defined as a parameter that can jump from cluster to cluster.

Here is a graphical representation of the process. You can see the cases in a two-dimensional space. The dark brown cases are the fictitious centroids. The green case is the one that will jump between clusters.

image

After the centroids were selected, the algorithm assigns each case to the nearest centroid.

image

Now we have our three initial clusters. The algorithm can calculate the real centroids of those three clusters. This means that the centroids move.

image 

The algorithm has to recalculate the cluster membership. The green case jumps from the middle cluster to the bottom left cluster.

image

In the next iteration, no case jumps from a cluster to a cluster. Therefore, the algorithm can stop.

K-means clustering scales much better than hierarchical methods. However, it has drawbacks as well. First of all, what is the optimal number of clusters? You can’t know in advance. Therefore, you need to create different models with different number of clusters and than select the one that fits your data the best.

The next problem is the meaning of the clusters. There are no labels for the clusters that would be known in advance. Once the model is built, you need to check the distributions of the input variables in each cluster to understand what kind of cases constitute each cluster. Only after this step you can label the clusters.

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

No Comments

Leave a Comment

(required) 
(required) 
Submit

About Dejan Sarka

Dejan Sarka, MCT and SQL Server MVP, is an independent consultant, trainer, and developer focusing on database & business intelligence applications. His specialties are advanced topics like data modeling, data mining, and data quality. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of fourteen books about databases and SQL Server. Dejan Sarka also developed and is developing many courses and seminars for SolidQ, Microsoft and Pluralsight. He is a regular speaker at many conferences worldwide for more than 15 years, including conferences like Microsoft TechEd, PASS Summit and others.

This Blog

Syndication

Privacy Statement