THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Dejan Sarka

  • DevWeek 2016 BI in SQL Server 2016 Workshop Setup

    I got some questions about virtual machine / notebook setup for my Business Intelligence in SQL Server 2016 DevWeek post-conference workshop. I am writing this blog because I want to spread this information as quickly as possible.

    There will be no labs during the seminar, no time for this. However, I will make all of the code available. Therefore, if the attendees would like to test the code, they need to prepare their own setup. I will use the following SW:

    Windows Server 2012 R2

    SQL Server 2016 components

    • Database Engine
    • SSIS
    • SSRS
    • SSAS
    • DQS
    • MDS
    • R Services


    • SQL Server Management Studio (this is not included in SQL Server setup anymore)
    • SQL Server Data Tools
    • R Tools for Visual Studio
    • R Studio

    Excel 2016 Professional Plus with add-ins

    • MDS add-in
    • Power Pivot
    • Power Query
    • Power Map
    • Power View
    • Azure ML add-in

    Excel 2013 Professional Plus with add-ins

    • Data Mining add-in (this add-in does not work for Excel 2016 yet, this one is announced for Excel 2016 only later this year, after SQL Server 2016 release)

    Power BI Apps and Services

    • Power BI Desktop
    • Power BI Service (they need to create a free account at
    • Azure ML (they need to create a free account at

    Mobile Publisher

    AdventureWorks demo databases version 2016, 2014 or 2012

    I know the list is long:-) However, nobody needs to test everything. Just pick the parts you need and you want to learn about.

    See you soon!

  • Conferences, User Groups, and Seminars Q1 and Q2 2016–Update

    Traditionally, I write down a list of presentations I am giving on different events every semester. This semester, I am already a bit late. I am still missing some info. So here is the list of the events I am planning to attend. I will add events and correct the list as needed later. Here is the updated info. Of course, more updates will come when I get the relevant information.

    1. Bulgarian UG meeting, Sofia, January 14th: presentation Introducing R and Azure ML
    2. Slovenian UG meeting, Ljubljana, February 18th: presentation Introducing R and Using R in SQL Server 2016, Power BI, and Azure ML
    3. SQL Server Konferenz 2016, Darmstadt, February 23rd – 25th:
      • pre-conference seminar Data Mining Algorithms in SSAS, Excel, R, and Azure ML
      • presentation SQL Server & Power BI Geographic and Temporal Predictions
    4. PASS SQL Saturday #495, Pordenone, February 27th:
      • presentation SQL Server 2012-2016 Columnar Storage
      • presentation Enterprise Information Management with SQL Server 2016
    5. DevWeek 2016, London, April 22nd – 26th:
      • post-conference seminar Business Intelligence in SQL Server 2016
      • presentation Using R in SQL Server 2016 Database Engine and Reporting Services
      • presentation SQL Server Isolation Levels and Locking
    6. SQL Nexus, Copenhagen, May 2nd – 4th: presentation Identity Mapping and De-Duplicating
    7. SQL Bits 2016, Liverpool, May 4th – 7th: presentation Using R in SQL Server 2016 Database Engine and SSRS
    8. SQL Day, Wroclaw, May 16th – 18th:
      • pre-conference seminar Data Mining Algorithms in SSAS, Excel, R, and Azure ML
      • presentation: Statistical Analysis with T-SQL
      • presentation: Anomaly Detection
    9. PASS SQL Saturday #508, Kyiv, May 21st: information to follow.
    10. PASS SQL Saturday #510, Paris, June 25th: information to follow.
    11. PASS SQL Saturday #520, Cambridge, September 10th: information to follow. And yes, this is already quarter 3, but I am late with this ist anywaySmile
  • Data Mining Algorithms – Neural Network

    A neural network is a powerful data modeling tool that is able to capture and represent complex input/output relationships. The motivation for the development of neural network technology stemmed from the desire to develop an artificial system that could perform "intelligent" tasks similar to those performed by the human brain. Neural networks resemble the human brain in the following two ways:

    • A neural network acquires knowledge through learning
    • A neural network's knowledge is stored within inter-neuron connection strengths known as synaptic weights

    The Neural Network algorithm is an artificial intelligence technique that explores more possible data relationships than other algorithms. Because it is such a thorough technique, the processing of it is usually slower than the processing of other classification algorithms.

    A neural network consists of basic units modeled after biological neurons. Each unit has many inputs that it combines into a single output value. These inputs are connected together, so the outputs of some units are used as inputs into other units. The network can have one or more middle layers called hidden layers. The simplest are feed-forward networks (pictured), where there is only a one-way flow through the network from the inputs to the outputs. There are no cycles in the feed-forward networks.

    As mentioned, units combine inputs into a single output value. This combination is called the unit’s activation function. Consider this example: The human ear can function near a working jet engine. Yet, if it were only 10 times more sensitive, you would be able to hear a single molecule hitting the membrane in your ears! What does that mean? When you go from 0.01 to 0.02, the difference should be comparable with going from 100 to 200. In biology, there are many types of non-linear behavior.

    Thus, an activation function has two parts. The first part is the combination function that merges all of the inputs into a single value (weighted sum, for example). The second part is the transfer function, which transfers the value of the combination function to the output value of the unit. The linear transfer function would do just the linear regression. The transfer functions are S-shaped, like the sigmoid function:

    Sigmoid(x) = 1 / (1 + e(-x)).

    A single hidden layer is optimal, so the Neural Network algorithm always uses a maximum of one (or zero for Logistic Regression).

    The Neural Network algorithm uses the hyperbolic tangent activation function in the hidden layer and the sigmoid function in output layer. You can see a Neural Network with a single hidden layer in the following picture.


    Training a neural network is the process of setting the best weights on the inputs of each of the units. This backpropagation process does the following:

    • Gets a training example and calculates outputs
    • Calculates the error – the difference between the calculated and the expected (known) result
    • Adjusts the weights to minimize the error

    Like the Decision Trees algorithm, you can use the Neural Network algorithm for classification and prediction. The interpretation of the Neural Network algorithm results is somewhat more complex than the interpretation of the Decision Trees algorithm results. Consequently, the Decision Trees algorithm is more popular.

  • PASS SQL Saturday #460 Slovenia 2015 Recapitulation

    Our third SQL Saturday in Ljubljana is over. Two weeks seems to be enough time to sleep over and see things a bit from a distance. Without any further delays, I can declare that it is clear that the event was a pure successSmile

    Let me start with the numbers, comparing total number of people, including speakers, sponsors, attendees, and organizers, with previous two SQL Saturdays in Ljubljana:

    • 2013: 135 people from 12 countries
    • 2014: 215 people from 16 countries
    • 2015: 253 people from 20 countries

    You can clearly see the growth. Even the keynote was full, like the following picture shows.


    We again experienced very small drop rate; more than 90% or registered attendees showed up. That’s very nice, showing respect to the speakers and sponsors. So thank you, attendees, for being fair and respectful again!

    We had more sponsors than previous years. This was extremely important, because this time we did not get the venue for free, and therefore we needed more money than for the previous two events. Thank you, sponsors, for enabling the event!

    Probably the most important part of these SQL Saturday events are the speakers. We got 125 sessions submitted by 51 speakers from 20 countries! We were really surprised. We take this a sign of our good work in the past. 30 great sessions with two state of the art precon seminars is more than we expected, yet still not enough to accommodate all speakers that sent the submissions. Thank you all speakers, those who were selected and those who were not! I hope we see you again in Slovenia next year. You can see some of the most beautiful speakers and volunteers in the following picture (decide by yourself if there is somebody spoiling the pictureSmile).


    Next positive surprise were the volunteers. With these number of speakers and attendees, we would not be able to handle the event without them. We realized that we have a great community, consisting of some really helpful people, that we can always count on. Thank you all!

    I think I can say for all three organizers, Mladen Prajdić, Matija Lah, and me, that we were more tired than any year before. However, hosting a satisfied crowd is the best payback you can imagineSmile And the satisfaction level was high even among the youngest visitors, as you can see from the following picture.


    Of course, we experienced also some negative things. However, just a day before the New Year evening, I am not going to deal with them now. Let me finish this post in a positive waySmile

  • Data Mining Algorithms – Decision Trees

    Decision Trees is a directed technique. Your target variable is the one that holds information about a particular decision, divided into a few discrete and broad categories (yes / no; liked / partially liked / disliked, etc.). You are trying to explain this decision using other gleaned information saved in other variables (demographic data, purchasing habits, etc.). With limited statistical significance, you are going to predict the target variable for a new case using its known values of the input variables based on results of your trained model.

    Recursive partitioning is used to build the tree. The data is split into partitions using a certain value of one of the explaining variables. The partitions are then split again and again. Initially the data is in one big box.

    The algorithm tries all possible breaks of both input (explaining) variables for the initial split. The goal is to get purer partitions considering the classes of the target variable. You know intuitively that purity is related to the percentage of the cases in each class of the target variable. There are many better, but more complicated measures of the purity, for example entropy or information gain.

    The tree continues to grow using the two new partitions as separate starting points and splitting them more. You have to stop the process somewhere. Otherwise, you could get a completely fitted tree that has only one case in each class. The class would be, of course, absolutely pure. This would not make any sense. The results could not be used for any meaningful prediction. This phenomenon is called “over-fitting”. There are two basic approaches to solve this problem: pre-pruning (bonsai) and post-pruning techniques.

    The pre-pruning (bonsai) methods prevent growth of the tree in advance by applying tests at each node to determine whether a further split is going to be useful; the tests can be simple (number of cases) or complicated (complexity penalty). The post-pruning methods allow the tree to grow and then prune off the useless branches. The post-pruning methods tend to give more accurate results, but they require more computation than pre-pruning methods.

    Imagine the following example. You have the answers to a simple question: Did you like the famous Woodstock movie? You also have some demographic data: age (20 to 60) and education (ranged in 7 classes from the lowest to the highest). In all, 55% of the interviewees liked the movie and 45% did not like it.

    Can you discover factors that have an influence on whether they liked the movie?

    Starting point: 55% of the interviewees liked the movie and 45% did not like it.


    After checking all possible splits, you find the best initial split made at the age of 35.


    With further splitting, you finish with a full-grown tree. Note that not all branches lead to purer classes. Some of them are not useful at all and should be pruned.


    Decision trees are used for classification and prediction. Typical usage scenarios include:

    • Predicting which customers will leave
    • Targeting the audience for mailings and promotional campaigns
    • Explain reasons for a decision
    • Answering questions such as “What movies do young female customers buy?”

    Decision Trees is the most popular data mining algorithm. This is because the results are very understandable and simple to interpret, and the quality of the predictions is usually very high.

  • PASS SQL Saturday #460 Slovenia Schedule

    It is alive! It was really hard to make the choices. Nevertheless, the schedule is public now. To everybody that submitted proposals – thank you very much again! We are sorry we cannot accommodate everybody. Please note that even if you were not selected, we would be happy to see you in Ljubljana.

  • PASS SQL Saturday #460 Slovenia – Speakers and Sessions Submitted

    With start of October, we closed the call for speakers for SQL Saturday Slovenia. We are really excited by the number of speakers and the number of sessions they submitted. We got 51 different speakers from 20 different countries submitting 125 session proposals! You can see the breakdown of number of speakers and sessions per country in the following Excel Pivot Table with data bars.


    Now we have a really heavy duty: the selection. With so many excellent proposals, this is an incredibly complex task. If you are not selected, please understand that this is not due to a bad proposal or session description; although we would like to, we simply cannot accommodate every speaker that submitted. Anyway, I wanted to express my deep thankfulness for your proposals. No matter of selection, we are definitely making the most advanced and the most international conference in Slovenia. You can see another representation of speaker and session counts in a Power Map report.


    This will be English language only event. Therefore, also attendees from any country around the world are more than welcome. You should consider visiting Ljubljana for couple of days, not just for SQL Saturday. Why? Here are some possible reasons.

  • Data Mining Algorithms – Naive Bayes

    I am continuing with my data mining and machine learning algorithms series. Naive Bayes is a nice algorithm for classification and prediction.

    It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on the known input attributes. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent. Starting with a single dependent and a single independent variable, the algorithm is not too complex to understand (I am using an example from my old book about statistics - Thomas H. Wonnacott, Ronald J. Wonnacot: Introductory Statistics, Wiley 1990).

    Let’s say I am buying a used car. In an auto magazine I find that 30% of second-hand cars are faulty. I take with me a mechanic who can make a shrewd guess on a basis of a quick drive. Of course, he isn’t always right. Of the faulty cars he examined in the past he correctly pronounced 90 % faulty and wrongly pronounced 10% ok. When judging good cars, he correctly pronounced 80% of them as good, and wrongly 20% as faulted. In the graph, we can see that 27% (90% of 30%) of all cars are actually faulty and then correctly identified as such. 14% (20% of 70%) are judged faulty, although they are good. Altogether, 41% (27% + 14%) of cars are judged faulty. Of these cars, 67% (27% / 41%) are actually faulty. To sum up: Once the car has been pronounced faulty by the mechanic, the chance that it is actually faulty rises from the original 30% up to 67%. The following figure shows this process.


    The calculations in the previous slide can be summarized in another tree, the reverse tree. You can start branching with opinion of the mechanic (59% ok and 41% faulty). Moving to the right, the second branching shows the actual conditions of the cars, and this is the valuable information for you. For example, the top branch says: Once the car is judged faulty, the chance that it actually turns faulty is 67%. The third branch from the top displays clearly: Once the car is judged good, the chance that it is actually faulty is just 5%. You can see the reverse tree in the following figure.


    As mentioned, Naïve Bayes treats all of the input attributes as independent of each other with respect to the target variable. While this could be a wrong assumption, it allows multiplying the probabilities to determine the likelihood of each state of the target variable based on states of input variables. For example, let’s say that you need to analyze whether there is any association between NULLs in different columns of your Products table. You realize that if Color is missing, 80% of Weight values are missing as well; and if Class is missing, 60% of Weight values are missing as well. You can multiply these probabilities. If Weight is missing, you can calculate the product:

    0.8 (Color missing for Weight missing) * 0.6 (Class missing for Weight missing) = 0.48

    You can also check what happens to the not missing state of the target variable, the Weight:

    0.2 (Color missing for Weight not missing) * 0.4 (Class missing for Weight not missing) = 0.08

    You can easily see that the likelihood that Weight is missing is much higher than the likelihood it is not missing when Color and Class are unknown. You can convert the likelihoods to probabilities by normalizing their sum to 1:

    P (Weight missing if Color and Class are missing) = 0.48 / (0.48 + 0.08) = 0.857

    P (Weight not missing if Color and Class are missing) = 0.08 / (0.48 + 0.08) = 0.143

    Now when you know that the Color value is NULL and the Class value is null, then you have nearly 86% chances that you get NULL also in the Weight attribute. This might lead you to some conclusions where to start improving your data quality.

    In general, you use the Naive Bayes algorithm for classification. You want to extract models describing important data classes and then assign new cases to predefined classes. Some typical usage scenarios include:

    • Categorizing bank loan applications (safe or risky) based on previous experience
    • Determining which home telephone lines are used for Internet access
    • Assigning customers to predefined segments.
    • Quickly obtaining a basic comprehension of the data by checking the correlation between input variables.
  • Conferences and Seminars 2015 Q3 and Q4

    I am finishing my list of conferences and seminars I am attending in the second half of the year 2015. Here is my list.

    1. Kulendayz 2015 – September 4th-5th. Although I will have huge problems to get there on time, I would never like to miss it. I have one talk there.
    2. SQL Saturday #413 Denmark – September 17th-19th. You can join me already on Friday for the Data Mining Algorithms seminar.
    3. SQL Saturday #434 Holland – September 25th-26th. If you miss the Denmark Data Mining Algorithms seminar, I am repeating it in Utrecht.
    4. SQL Server Days 2015 Belgium – September 28th-29th. This will be my first talk at SQL Server Days.
    5. SQL Saturday #454 Turin – October 10th. I was not confirmed as a speaker yet, but I still plan to go there, to combine the SQL part with the Expo in Milan part.
    6. PASS Summit 2015 Seattle – October 27th-30th. I still continue to be present at every single summit:-) This year I have one presentation.
    7. SharePoint Days 2015 Slovenia – November 17th-18th. No, I don’t like SPS. I will just have one small BI presentation there.
    8. SQL Saturday #475 Belgrade – November 28th. First SQL Saturday in Serbia. I simply must be there.
    9. SQL Saturday #460 Slovenia – December 11th-12th. I am co-organizing this event. Everybody is welcome, this will be fully English-speaking event. Don’t miss beautiful, relaxed and friendly Ljubljana!

    That’s it. For now:-)

  • Data Mining Algorithms – Pluralsight Course

    This is a bit different post in the series about the data mining and machine learning algorithms. This time I am honored and humbled to announce that my fourth Pluralsight course is alive. This is the Data Mining Algorithms in SSAS, Excel, and R course. besides explaining the algorithms, I also show demos in different products. This gives you even better understanding than just reading the blog posts.

    Of course, I will continue with describing the algorithms here as well.

  • PASS SQL Saturday #460 Slovenia 2015

    So we are back. PASS SQL Saturday is coming to Slovenia again on December 12th, 2015. Remember last two years? We had two great events. According to feedback, everybody was satisfied and happy. Let's make another outstanding event! How can you help?

    First of all, these events are free for attendees. Of course, this is possible only because sponsors pay the bills, and speakers come on their own expenses, using their own free time. For a start, please respect these facts.

    We always need more sponsors. Maybe your company is interested, maybe you know a company that would be interested for the sponsorship. Please spread the word about the event. This event is second to none in Slovenia. No other event in Slovenia attracts so many top speakers from all around the world as our SQL Saturday. Last year there were around two hundred attendees, with more than one third coming abroad. Therefore, as a sponsor, you can reach quite a broad audience, no matter that the event is in a small country.

    The second pillar of the event are the speakers. Please consider speaking at the event. We are not limited to top speakers only; one of the goals of the event is also to raise new speakers. We are successful with this task. Every year we get somebody new. Therefore, don't hesitate; if you would like to join the SQL Server speakers community, this is a good time and place!

    Organization of the event takes a lot of time as well. Maybe you can help here? When you register, you can also volunteer for a help. Or you can contact Matija Lah, Mladen Prajdić, or me directly.

    There is never enough advertisement. Again, please spread the word about the event. Maybe your coworkers and friends didn't hear about the event yet. Please tell them. If you attended SQL Saturday Slovenia before, please share your experience.

    Finally, please register and then attend the event. Last year we reached less than 5% drop-off rate, which is, as far as we know, a world record. Let's keep up the good work!

  • Data Mining Algorithms – Support Vector Machines

    Support vector machines are both, unsupervised and supervised learning models for classification and regression analysis (supervised) and for anomaly detection (unsupervised). Given a set of training examples, each marked as belonging to one of categories, an SVM training algorithm builds a model that assigns new examples into one category. An SVM model is a representation of the cases as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

    A support vector machine constructs a hyper-plane or set of hyper-planes in a high-dimensional space defined by the input variables, which can be used for classification, regression, or other tasks. A SVM is a discrete linear classifier. A good separation is achieved by the hyper-plane that has the largest distance to the nearest training data point of any class (so-called functional margin). The larger the margin the lower the generalization error. Let me show you this on a graphical example. Of course, I am showing a two-dimensional space defined by only two input variables, and therefore my separating hyper-plane is just a line.

    The first figure shows the two-dimensional space with cases and a possible single-dimensional hyper-plane (line). Of course, you can see that this line cannot be a separator at all, because there are some cases on the line, or said differently, on both sides of the line.


    The next try is better. The line is separating the cases in the space. However, this is not the best possible separation. Some cases are pretty close to the line.


    The third picture shows the best possible separation. The hyper-plane that separates the cases the best is found, and the model is trained.


    Support Vector Machines are powerful for some specific classifications:

    • Text and hypertext categorization
    • Images classification
    • Classifications in medicine
    • Hand-written characters recognition

    One-class SVM can be used for anomaly detection, like detection of dirty data, or fraud detection. It uses a classification function without parameters, the one selected for the separation without regard to a target variable. Cases that are close to the separation hyper-plane are the suspicious cases. Therefore, the result is dichotomous: 1=regular case, 0=outlier.

  • Data Mining Algorithms – Principal Component Analysis

    Principal component analysis (PCA) is a technique used to emphasize the majority of the variation and bring out strong patterns in a dataset. It is often used to make data easy to explore and visualize. It is closely connected to eigenvectors and eigenvalues.

    A short definition of the algorithm: PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric.

    Initially, variables used in the analysis form a multidimensional space, or matrix, of dimensionality m, if you use m variables. The following picture shows a two-dimensional space. Values of the variables v1 and v2 define cases in this 2-D space. Variability of the cases is spread across both source variables approximately equally.


    Finding principal components means finding new m axes, where m is exactly equal to the number of the source variables. However, these new axes are selected in such way that the most of the variability of the cases is spread over a single new variable, or over a principal component, like shown in the following picture.


    We can deconstruct the data points matrix into eigenvectors and eigenvalues. Every eigenvector has a corresponding eigenvalue. A eigenvector is a direction of the line and a eigenvalue is a number, telling how much variance there is in the data in that direction, or how spread out the data is on the line. he eigenvector with the highest eigenvalue is therefore the principal component. Here is an example of calculation of eigenvectors and eigenvalues for a simple two-dimensional matrix.


    The interpretation of the principal components is up to you and might be pretty complex. This fact might limit PCA usability for business-oriented problems. PCA is used more in machine learning and statistics than in data mining, which is more end user oriented, and the results thus should be easy understandable. You use the PCA to:

    • Explore the data to explain the variability;
    • Reduce the dimensionality – replace the m variables with n principal components, where n < m, in such a way that preserves the most of the variability;
    • Use the residual variability not explained by the PCs for anomaly detection.
  • Data Mining Algorithms – EM Clustering

    With the K-Means algorithm, each object is assigned to exactly one cluster. It is assigned to this cluster with a probability equal to 1.0. It is assigned to all other clusters with a probability equal to 0.0. This is hard clustering.

    Instead of distance, you can use a probabilistic measure to determine cluster membership. For example, you can cover the objects with bell curves for each dimension with a specific mean and standard deviation. A case is assigned to every cluster with a certain probability. Because clusters can overlap, this is called soft clustering. The Expectation-Maximization (EM) method changes the parameters of the bell curve to improve covering in each iteration.

    The Expectation - Maximization (EM) Clustering algorithm extends the K-Means paradigm in a different way. Instead of assigning each object to a dedicated cluster, it assigns each object to a cluster according to a weight representing the probability of the membership. In other words, there are no strict boundaries between clusters. Therefore, new means are computed based on weighted measures.

    The EM algorithm iterates between two steps. In the first step—the "expectation" step—the algorithm calculates the cluster membership of each case (i.e., the probability that a case belongs to a given cluster from the initially defined k number of clusters). In the second step—the "maximization" step—the algorithm uses these cluster memberships to re-estimate the models' parameters, such as the location and scale parameters of Gaussian distribution.

    The algorithm assumes that the data is drawn from a mixture of Gaussian distributions (bell curves). Take a look at the graphics. In the first row, the algorithm initializes the mixture distribution, which is the mixture of several bell curves here. In the second and third rows, the algorithm modifies the mixture distribution based on the data. The iteration stops when it meets the specified stopping criteria—for example, when it reaches a certain likelihood-of-improvement rate between iterations.

    Step 1: Initializing the mixture distribution


    Step 2: Modifying the mixture distribution


    Step 3: Final modification


    You use the EM Clustering for the same purposes as the K-Means Clustering.

    In addition, you can search for outliers based on combinations of values of all input variables with the EM algorithm. You check the highest probability of cases over all clusters. The cases where the highest probability is still low do not fit well into any cluster. Said differently, they are not like other cases, and therefore you can assume that they are outliers. See the last figure in this blog post bellow.


    The green case belongs to the cluster D with probability 0.95, to the cluster C with probability 0.002, to the cluster E with probability 0.0003, and so on.

    The red case belongs to the cluster C with probability 0.03, to the cluster B with probability 0.02, to the cluster D with probability 0.003, and so on. The highest probability for the red case is still a low value; therefore, this case does not fit well to any of the clusters and thus might represent an outlier.

    Outliers can also represent potentially fraudulent transactions. EM Clustering is therefore useful also for fraud detection. Finally, you can use EM Clustering for advanced data profiling to find the rows with suspicious combinations of column values.

  • Data Mining Algorithms – K-Means Clustering

    Hierarchical clustering could be very useful because it is easy to see the optimal number of clusters in a dendrogram and because the dendrogram visualizes the clusters and the process of building of that clusters. However, hierarchical methods don’t scale well. Just imagine how cluttered a dendrogram would be if 10,000 cases would be shown on it.

    K-Means is a distance-based partitioning algorithm that divides data set in predetermined (“k”) number of clusters around the average location (“mean”). In your mind, you intuitively know how to group people or any other cases. Groups do not need to have an equal number of members. You can do grouping according to one or more attributes.

    The algorithm comes from geometry. Imagine record space with attributes as dimensions. Each record (case) is uniquely located in space with values of the attributes (variables).

    The algorithm initially creates k fictitious members and defines them at the means of the clusters. These fictitious cases are also called centroids. The values of the input variables for these centroids could be selected randomly. Some algorithms use also a bit of heuristics and use the marginal distributions of the attributes as a starting point and randomly perturb from there.

    The algorithm then assigns each record to nearest centroid. This way, you get the initial clusters. When the clusters are defined, the algorithm can calculate the actual centroids of clusters and get new centroids. After the new centroids are calculated, the algorithm reassigns each record to the nearest centroid. Some records jump from cluster to cluster.

    Now the algorithm can calculate new centroids and then new clusters. The algorithm iterates last two steps until cluster boundaries stop changing. You can stop the iterations when there is less than the minimum number of cases defined as a parameter that can jump from cluster to cluster.

    Here is a graphical representation of the process. You can see the cases in a two-dimensional space. The dark brown cases are the fictitious centroids. The green case is the one that will jump between clusters.


    After the centroids were selected, the algorithm assigns each case to the nearest centroid.


    Now we have our three initial clusters. The algorithm can calculate the real centroids of those three clusters. This means that the centroids move.


    The algorithm has to recalculate the cluster membership. The green case jumps from the middle cluster to the bottom left cluster.


    In the next iteration, no case jumps from a cluster to a cluster. Therefore, the algorithm can stop.

    K-means clustering scales much better than hierarchical methods. However, it has drawbacks as well. First of all, what is the optimal number of clusters? You can’t know in advance. Therefore, you need to create different models with different number of clusters and than select the one that fits your data the best.

    The next problem is the meaning of the clusters. There are no labels for the clusters that would be known in advance. Once the model is built, you need to check the distributions of the input variables in each cluster to understand what kind of cases constitute each cluster. Only after this step you can label the clusters.

This Blog


Privacy Statement