
Dejan Sarka

Data Mining Algorithms – EM Clustering

With the K-Means algorithm, each object is assigned to exactly one cluster: it belongs to that cluster with probability 1.0 and to every other cluster with probability 0.0. This is hard clustering.

Instead of distance, you can use a probabilistic measure to determine cluster membership. For example, you can cover the objects with bell curves, one per dimension, each with a specific mean and standard deviation. A case is then assigned to every cluster with a certain probability. Because clusters can overlap, this is called soft clustering. The Expectation-Maximization (EM) method changes the parameters of the bell curves in each iteration to improve how well they cover the data.
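To make the idea of soft membership concrete, here is a minimal Python sketch. It is purely illustrative and not the Microsoft Clustering implementation; the cluster means, standard deviations, and mixing weights are invented for the example.

import numpy as np
from scipy.stats import norm

# Hypothetical clusters on a single dimension: (mean, standard deviation, mixing weight)
clusters = [(10.0, 2.0, 0.5), (20.0, 3.0, 0.3), (35.0, 4.0, 0.2)]

x = 14.0  # one case

# Weighted density of the case under each cluster's bell curve
densities = np.array([w * norm.pdf(x, mu, sigma) for mu, sigma, w in clusters])

# Soft clustering: normalize so the memberships over all clusters sum to 1.0
membership = densities / densities.sum()
print(membership)

# Hard clustering (the K-Means view) would keep only the winning cluster
print(membership.argmax())

The membership vector sums to 1.0, so every case gets a share in every cluster instead of a single forced assignment.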

The Expectation-Maximization (EM) Clustering algorithm extends the K-Means paradigm in a different way. Instead of assigning each object to a dedicated cluster, it assigns each object to clusters according to weights that represent the probabilities of membership. In other words, there are no strict boundaries between clusters. Therefore, the new means are computed from weighted measures.

The EM algorithm iterates between two steps. In the first step (the "expectation" step), the algorithm calculates the cluster membership of each case, i.e., the probability that the case belongs to each of the initially defined k clusters. In the second step (the "maximization" step), the algorithm uses these cluster memberships to re-estimate the model's parameters, such as the location and scale parameters of the Gaussian distributions.
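The following from-scratch sketch shows both steps on a one-dimensional toy data set. It illustrates the general EM idea under assumed data, k, and starting values; it is not the algorithm as implemented in SQL Server Analysis Services.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Toy data drawn from two bell curves; k = 2 is assumed to be known
x = np.concatenate([rng.normal(10, 2, 200), rng.normal(25, 3, 100)])
k = 2
means, stds, weights = np.array([5.0, 30.0]), np.ones(k), np.full(k, 1.0 / k)

for _ in range(100):
    # Expectation step: the probability that each case belongs to each cluster
    dens = np.array([w * norm.pdf(x, m, s) for m, s, w in zip(means, stds, weights)]).T
    resp = dens / dens.sum(axis=1, keepdims=True)   # each row sums to 1.0

    # Maximization step: re-estimate the parameters from weighted measures
    nk = resp.sum(axis=0)
    means = (resp * x[:, None]).sum(axis=0) / nk                         # weighted means
    stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk)  # weighted scales
    weights = nk / len(x)

print(means, stds, weights)   # should end up near the generating values (10, 25), (2, 3), (2/3, 1/3)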

The algorithm assumes that the data is drawn from a mixture of Gaussian distributions (bell curves). Take a look at the graphics below. In the first row, the algorithm initializes the mixture distribution, here a mixture of several bell curves. In the second and third rows, the algorithm modifies the mixture distribution based on the data. The iteration stops when it meets the specified stopping criterion, for example, when the likelihood improves by less than a specified rate between iterations. A usage sketch with a library implementation follows the figures.

Step 1: Initializing the mixture distribution

[figure]

Step 2: Modifying the mixture distribution

[figure]

Step 3: Final modification

[figure]
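In practice you rarely code the loop yourself. As a hedged usage sketch, here is the same idea with scikit-learn's GaussianMixture class (an outside substitute, not the Microsoft Clustering algorithm this post describes): n_components is the initial number of clusters k, n_init controls how many random initializations of the mixture are tried, and tol is the likelihood-improvement threshold that stops the iteration. The data is invented for the example.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy two-dimensional data from two clusters
X = np.vstack([rng.normal(10, 2, (200, 2)), rng.normal(25, 3, (100, 2))])

gm = GaussianMixture(n_components=2, n_init=5, tol=1e-3, max_iter=100, random_state=0)
gm.fit(X)

print(gm.converged_, gm.n_iter_)   # whether and when the stopping criterion was met
print(gm.means_)                   # the final cluster centers
print(gm.predict_proba(X[:3]))     # soft memberships of the first three cases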

You use EM Clustering for the same purposes as K-Means Clustering.

In addition, with the EM algorithm you can search for outliers based on combinations of values of all input variables. You check each case's highest membership probability over all clusters. The cases whose highest probability is still low do not fit well into any cluster. Said differently, they are not like other cases, and therefore you can assume that they are outliers. See the last figure in this blog post below.

[figure]

The green case belongs to cluster D with probability 0.95, to cluster C with probability 0.002, to cluster E with probability 0.0003, and so on.

The red case belongs to cluster C with probability 0.03, to cluster B with probability 0.02, to cluster D with probability 0.003, and so on. The highest probability for the red case is still a low value; therefore, this case does not fit well into any of the clusters and thus might represent an outlier.
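Here is a minimal sketch of this check, reusing the invented one-dimensional clusters from the first snippet. The membership score below is the bell-curve value scaled so that a case sitting exactly on a cluster center scores 1.0, and the 0.05 cut-off is an arbitrary assumption for illustration.

from scipy.stats import norm

clusters = [(10.0, 2.0), (20.0, 3.0), (35.0, 4.0)]   # hypothetical (mean, std) pairs

def max_membership(x):
    # Highest membership of the case over all clusters, scaled to 1.0 at a cluster center
    return max(norm.pdf(x, mu, sigma) / norm.pdf(mu, mu, sigma) for mu, sigma in clusters)

green_case, red_case = 10.5, 50.0
print(max_membership(green_case))             # close to 1.0: fits one cluster well
print(max_membership(red_case))               # low for every cluster: candidate outlier

threshold = 0.05                              # arbitrary cut-off
print(max_membership(red_case) < threshold)   # True: flag the case as a potential outlier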

Outliers can also represent potentially fraudulent transactions, so EM Clustering is also useful for fraud detection. Finally, you can use EM Clustering for advanced data profiling to find the rows with suspicious combinations of column values.


Comments

 

Dennis Mark Thorsen said:

Nice explanation. The EM algorithm is indeed a beautiful algorithm.

May 13, 2015 4:29 AM
 

Dejan Sarka said:

Thanks:-)

May 13, 2015 5:54 AM


About Dejan Sarka

Dejan Sarka, MCT and SQL Server MVP, is an independent consultant, trainer, and developer focusing on database and business intelligence applications. His specialties are advanced topics like data modeling, data mining, and data quality. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of fourteen books about databases and SQL Server. He has also developed, and continues to develop, many courses and seminars for SolidQ, Microsoft, and Pluralsight. He has been a regular speaker at many conferences worldwide for more than 15 years, including Microsoft TechEd, the PASS Summit, and others.
