Search results matching tag 'data analysis'http://sqlblog.com/search/SearchResults.aspx?o=DateDescending&tag=data+analysis&orTags=0en-USCommunityServer 2.1 SP2 (Build: 61129.1)Data Mining Algorithms – Logistic Regressionhttp://sqlblog.com/blogs/dejan_sarka/archive/2016/08/31/data-mining-algorithms-logistic-regression.aspxWed, 31 Aug 2016 08:37:14 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:61811Dejan Sarka<p>It’s been a while since I wrote the last blog on the data mining / machine learning algorithms, in which I described the <a href="http://sqlblog.com/blogs/dejan_sarka/archive/2016/01/26/data-mining-algorithms-neural-network.aspx">Neural Network</a> algorithm. In addition, this is a good time to write another post in order to remind the readers of the two upcoming seminars about the algorithms I have in <a href="https://www.eventbrite.com/e/data-mining-algorithms-in-ssas-excel-r-and-azure-ml-by-dejan-sarka-tickets-24897467997">Oslo</a>, Friday, September 2nd, 2016, and in <a href="https://www.eventbrite.co.uk/e/data-mining-algorithms-in-ssas-excel-r-and-azure-ml-by-dejan-sarka-tickets-21560337550">Cambridge</a>, Thursday, September 8th. I hope to see you at one of the seminars. Finally, to conclude this marketing part: if you are interested in the R language, I am preparing another seminar, “EmbRace R”, which will cover R from the basics to advanced analytics. Stay tuned.</p> <p>Now for the algorithm. If you remember the post, a neural network has an input layer, an output layer, and one or more hidden layers. The Neural Network algorithm uses the hyperbolic tangent activation function in the hidden layer and the sigmoid function in the output layer. The sigmoid function is also called the logistic function. Therefore, after the Neural Network post, describing the Logistic Regression algorithm is simple: if a neural network has only input neurons that are directly connected to the output neurons, it is Logistic Regression. 
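</p>

<p>To make this concrete, here is a minimal sketch of such a zero-hidden-layer network in plain Python. The weights and bias are made up for illustration; this is not the SSAS implementation, just the shape of the computation.</p>

```python
import math

def sigmoid(z):
    # Logistic (sigmoid) function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def logistic_regression_predict(inputs, weights, bias):
    # A neural network with zero hidden layers:
    # the inputs feed a single sigmoid output neuron directly
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Hypothetical trained weights for two input variables
probability = logistic_regression_predict([1.0, 2.0], [0.5, -0.25], 0.1)
```

<p>The returned value can be read as the probability of the positive class; with all-zero weights and bias, the output is sigmoid(0) = 0.5. </p> <p>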
Or, to repeat the same thing in a different way: Logistic Regression is a Neural Network with zero hidden layers. </p> <p>This was quick :-) To add more meat to the post, I am adding the formulas and the graphs for the hyperbolic tangent and sigmoid functions. </p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_629BF7C4.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_2438C4C2.png" width="1028" height="467" /></a></p>DevWeek 2016 BI in SQL Server 2016 Workshop Setuphttp://sqlblog.com/blogs/dejan_sarka/archive/2016/04/15/devweek-2016-bi-in-sql-server-2016-workshop-setup.aspxFri, 15 Apr 2016 06:02:11 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:61066Dejan Sarka<p>I got some questions about the virtual machine / notebook setup for my <a href="http://devweek.com/agenda#business-intelligence-in-sql-server-2016">Business Intelligence in SQL Server 2016</a> DevWeek post-conference workshop. I am writing this blog because I want to spread this information as quickly as possible.</p> <p>There will be no labs during the seminar; there is no time for this. However, I will make all of the code available. Therefore, if the attendees would like to test the code, they need to prepare their own setup. 
I will use the following software:</p> <p><strong><u>Windows Server 2012 R2</u></strong></p> <p><strong><u>SQL Server 2016 components</u></strong></p> <ul> <li>Database Engine</li> <li>SSIS</li> <li>SSRS</li> <li>SSAS</li> <li>DQS</li> <li>MDS</li> <li>R Services</li> </ul> <p><strong><u>Tools</u></strong></p> <ul> <li>SQL Server Management Studio (this is not included in the SQL Server setup anymore)</li> <li>SQL Server Data Tools</li> <li>R Tools for Visual Studio</li> <li>RStudio</li> </ul> <p><strong><u>Excel 2016 Professional Plus with add-ins</u></strong></p> <ul> <li>MDS add-in</li> <li>Power Pivot</li> <li>Power Query</li> <li>Power Map</li> <li>Power View</li> <li>Azure ML add-in</li> </ul> <p><strong><u>Excel 2013 Professional Plus with add-ins</u></strong></p> <ul> <li>Data Mining add-in (this add-in does not work with Excel 2016 yet; it is announced for Excel 2016 only later this year, after the SQL Server 2016 release)</li> </ul> <p><strong><u>Power BI Apps and Services</u></strong></p> <ul> <li>Power BI Desktop</li> <li>Power BI Service (the attendees need to create a free account at PowerBI.com)</li> <li>Azure ML (the attendees need to create a free account at AzureML.com)</li> </ul> <p><strong><u>Mobile Publisher</u></strong></p> <p><strong><u>AdventureWorks demo databases, version 2016, 2014, or 2012</u></strong></p> <p>I know the list is long :-) However, nobody needs to test everything. Just pick the parts you need and want to learn about.</p> <p>See you soon!</p>Data Mining Algorithms – Neural Networkhttp://sqlblog.com/blogs/dejan_sarka/archive/2016/01/26/data-mining-algorithms-neural-network.aspxTue, 26 Jan 2016 15:26:13 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:60512Dejan Sarka<p>A neural network is a powerful data modeling tool that is able to capture and represent complex input/output relationships. 
The motivation for the development of neural network technology stemmed from the desire to develop an artificial system that could perform "intelligent" tasks similar to those performed by the human brain. Neural networks resemble the human brain in the following two ways: </p> <ul> <li>A neural network acquires knowledge through learning </li> <li>A neural network's knowledge is stored within inter-neuron connection strengths known as synaptic weights </li> </ul> <p>The Neural Network algorithm is an artificial intelligence technique that explores more possible data relationships than other algorithms. Because it is such a thorough technique, the processing of it is usually slower than the processing of other classification algorithms.</p> <p>A neural network consists of basic units modeled after biological neurons. Each unit has many inputs that it combines into a single output value. These inputs are connected together, so the outputs of some units are used as inputs into other units. The network can have one or more middle layers called hidden layers. The simplest are feed-forward networks (pictured), where there is only a one-way flow through the network from the inputs to the outputs. There are no cycles in the feed-forward networks.</p> <p>As mentioned, units combine inputs into a single output value. This combination is called the unit’s activation function. Consider this example: The human ear can function near a working jet engine. Yet, if it were only 10 times more sensitive, you would be able to hear a single molecule hitting the membrane in your ears! What does that mean? When you go from 0.01 to 0.02, the difference should be comparable with going from 100 to 200. In biology, there are many types of non-linear behavior.</p> <p>Thus, an activation function has two parts. The first part is the combination function that merges all of the inputs into a single value (weighted sum, for example). 
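</p>

<p>A unit’s computation can be sketched in a few lines of plain Python (the weights are hypothetical, and this is only the shape of the computation, not the algorithm’s actual internals): a weighted sum merges the inputs, and an S-shaped function squashes the result.</p>

```python
import math

def unit_output(inputs, weights, transfer):
    # Combination function: merge all inputs into a single value (weighted sum)
    combined = sum(x * w for x, w in zip(inputs, weights))
    # Transfer function: map the combined value to the unit's output
    return transfer(combined)

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))  # used in the output layer

# Hidden layer uses the hyperbolic tangent; output layer uses the sigmoid
out_hidden = unit_output([0.5, -1.0], [0.8, 0.3], math.tanh)
out_final = unit_output([out_hidden], [1.5], sigmoid)
```

<p>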
The second part is the transfer function, which transfers the value of the combination function to the output value of the unit. A linear transfer function would yield just linear regression; therefore, the transfer functions used are S-shaped, like the sigmoid function:</p> <p>Sigmoid(x) = 1 / (1 + e<sup>-x</sup>).</p> <p>A single hidden layer is optimal, so the Neural Network algorithm always uses a maximum of one (or zero for Logistic Regression).</p> <p>The Neural Network algorithm uses the hyperbolic tangent activation function in the hidden layer and the sigmoid function in the output layer. You can see a Neural Network with a single hidden layer in the following picture.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_0112CBC9.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_205A93FE.png" width="644" height="375" /></a> </p> <p>Training a neural network is the process of setting the best weights on the inputs of each of the units. This <em>backpropagation</em> process does the following:</p> <ul> <li>Gets a training example and calculates the outputs</li> <li>Calculates the error – the difference between the calculated and the expected (known) result</li> <li>Adjusts the weights to minimize the error</li> </ul> <p>Like the Decision Trees algorithm, you can use the Neural Network algorithm for classification and prediction. The interpretation of the Neural Network algorithm results is somewhat more complex than the interpretation of the Decision Trees algorithm results. Consequently, the Decision Trees algorithm is more popular. </p>Data Mining Algorithms – Decision Treeshttp://sqlblog.com/blogs/dejan_sarka/archive/2015/10/29/data-mining-algorithms-decision-trees.aspxThu, 29 Oct 2015 15:06:38 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:59903Dejan Sarka<p>Decision Trees is a directed technique. 
Your target variable is the one that holds information about a particular decision, divided into a few discrete and broad categories (yes / no; liked / partially liked / disliked, etc.). You are trying to explain this decision using other information stored in other variables (demographic data, purchasing habits, etc.). Based on the results of your trained model, you then predict the target variable for a new case from the known values of its input variables, with limited statistical certainty.</p> <p>Recursive partitioning is used to build the tree. Initially, the data is in one big box. The data is split into partitions using a certain value of one of the explaining variables, and the partitions are then split again and again.</p> <p>The algorithm tries all possible splits of all input (explaining) variables for the initial split. The goal is to get purer partitions considering the classes of the target variable. You know intuitively that purity is related to the percentage of the cases in each class of the target variable. There are better, but more complicated, measures of purity, for example entropy or information gain.</p> <p>The tree continues to grow using the two new partitions as separate starting points and splitting them further. You have to stop the process somewhere. Otherwise, you could get a completely fitted tree that has only one case in each leaf. Each leaf would be, of course, absolutely pure, but this would not make any sense; the results could not be used for any meaningful prediction. This phenomenon is called “over-fitting”. There are two basic approaches to solve this problem: pre-pruning (bonsai) and post-pruning techniques.</p> <p>The pre-pruning (bonsai) methods prevent growth of the tree in advance by applying tests at each node to determine whether a further split is going to be useful; the tests can be simple (number of cases) or complicated (complexity penalty). 
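</p>

<p>The purity measures mentioned earlier, such as entropy, take only a few lines to compute. A sketch in Python (an illustration of the measure, not any particular product’s exact implementation): a pure partition has entropy 0, and an even 50/50 split has entropy 1.</p>

```python
import math

def entropy(class_counts):
    # Entropy of a partition: 0 bits for a pure partition, 1 bit for a 50/50 split
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

pure = entropy([20, 0])    # all cases in one class of the target variable
mixed = entropy([10, 10])  # evenly mixed partition
```

<p>A split is good when the weighted entropy of the resulting partitions drops well below the entropy of the parent partition (that drop is the information gain). </p> <p>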
The post-pruning methods allow the tree to grow and then prune off the useless branches. The post-pruning methods tend to give more accurate results, but they require more computation than pre-pruning methods.</p> <p>Imagine the following example. You have the answers to a simple question: Did you like the famous Woodstock movie? You also have some demographic data: age (20 to 60) and education (ranked in 7 classes from the lowest to the highest). In all, 55% of the interviewees liked the movie and 45% did not like it. </p> <p>Can you discover factors that have an influence on whether they liked the movie?</p> <p>Starting point: 55% of the interviewees liked the movie and 45% did not like it.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_1A1C193A.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_3E68AFCF.png" width="644" height="436" /></a> </p> <p>After checking all possible splits, you find that the best initial split is made at the age of 35.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_69D482DC.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_1A8F3C9B.png" width="644" height="380" /></a> </p> <p>With further splitting, you finish with a full-grown tree. Note that not all branches lead to purer classes. Some of them are not useful at all and should be pruned.</p> <a href="http://sqlblog.com/blogs/dejan_sarka/image_146D1162.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_2093ACEB.png" width="644" height="359" /></a> <p></p> <p>Decision trees are used for classification and prediction. 
Typical usage scenarios include:</p> <ul> <li>Predicting which customers will leave</li> <li>Targeting the audience for mailings and promotional campaigns</li> <li>Explaining reasons for a decision</li> <li>Answering questions such as “What movies do young female customers buy?”</li> </ul> <p>Decision Trees is the most popular data mining algorithm. This is because the results are very understandable and simple to interpret, and the quality of the predictions is usually very high.</p>Data Mining Algorithms – Naive Bayeshttp://sqlblog.com/blogs/dejan_sarka/archive/2015/09/09/data-mining-algorithms-naive-bayes.aspxWed, 09 Sep 2015 13:16:56 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:59509Dejan Sarka<p>I am continuing with my data mining and machine learning algorithms series. Naive Bayes is a nice algorithm for classification and prediction. </p> <p>It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute. These probabilities can later be used to predict an outcome of the predicted attribute based on the known input attributes. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent of each other, given the target attribute. Starting with a single dependent and a single independent variable, the algorithm is not too complex to understand (I am using an example from my old statistics book – Thomas H. Wonnacott, Ronald J. Wonnacott: Introductory Statistics, Wiley 1990).</p> <p>Let’s say I am buying a used car. In an auto magazine I find that 30% of second-hand cars are faulty. I take with me a mechanic who can make a shrewd guess on the basis of a quick drive. Of course, he isn’t always right. Of the faulty cars he examined in the past, he correctly pronounced 90% faulty and wrongly pronounced 10% OK. When judging good cars, he correctly pronounced 80% of them as good, and wrongly pronounced 20% as faulty. 
In the graph, we can see that 27% (90% of 30%) of all cars are actually faulty and then correctly identified as such. 14% (20% of 70%) are judged faulty, although they are good. Altogether, 41% (27% + 14%) of cars are judged faulty. Of these cars, 66% (27% / 41%) are actually faulty. To sum up: once the car has been pronounced faulty by the mechanic, the chance that it is actually faulty rises from the original 30% up to 66%. The following figure shows this process.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_1427D68F.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_03AF6594.png" width="644" height="452" /></a> </p> <p>The calculations in the previous figure can be summarized in another tree, the reverse tree. You can start branching with the opinion of the mechanic (59% OK and 41% faulty). Moving to the right, the second branching shows the actual conditions of the cars, and this is the valuable information for you. For example, the top branch says: once the car is judged faulty, the chance that it actually turns out faulty is 66%. The third branch from the top displays clearly: once the car is judged good, the chance that it is actually faulty is just 5%. You can see the reverse tree in the following figure.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_05138866.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_10ABD656.png" width="644" height="407" /></a> </p> <p>As mentioned, Naïve Bayes treats all of the input attributes as independent of each other with respect to the target variable. While this could be a wrong assumption, it allows multiplying the probabilities to determine the likelihood of each state of the target variable based on the states of the input variables. 
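</p>

<p>The reverse-tree arithmetic is a direct application of Bayes’ theorem, and you can check it in a few lines of Python (the numbers are the ones from the used-car example above):</p>

```python
# Prior and the mechanic's accuracy, from the used-car example
p_faulty = 0.30
p_judged_faulty_given_faulty = 0.90
p_judged_faulty_given_good = 0.20

# Total probability that a car is judged faulty: 0.27 + 0.14 = 0.41
p_judged_faulty = (p_judged_faulty_given_faulty * p_faulty
                   + p_judged_faulty_given_good * (1 - p_faulty))

# Posterior: probability the car is actually faulty once judged faulty
# (27% / 41%, roughly two-thirds)
p_faulty_given_judged_faulty = (p_judged_faulty_given_faulty * p_faulty) / p_judged_faulty
```

<p>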
For example, let’s say that you need to analyze whether there is any association between NULLs in different columns of your Products table. You realize that when Weight is missing, Color is missing as well in 80% of the cases, and Class is missing as well in 60% of the cases. You can multiply these conditional probabilities. For the missing state of Weight, you can calculate the product:</p> <p><strong>0.8 (Color missing for Weight missing) * 0.6 (Class missing for Weight missing) = 0.48</strong></p> <p>You can also check what happens to the not-missing state of the target variable, the Weight:</p> <p><strong>0.2 (Color missing for Weight not missing) * 0.4 (Class missing for Weight not missing) = 0.08</strong></p> <p>You can easily see that the likelihood that Weight is missing is much higher than the likelihood that it is not missing when Color and Class are unknown. You can convert the likelihoods to probabilities by normalizing their sum to 1:</p> <p><strong>P (Weight missing if Color and Class are missing) = 0.48 / (0.48 + 0.08) = 0.857</strong></p> <p><strong>P (Weight not missing if Color and Class are missing) = 0.08 / (0.48 + 0.08) = 0.143</strong></p> <p>Now, when you know that the Color value is NULL and the Class value is NULL, there is a nearly 86% chance that the Weight value is NULL as well. This might give you some idea of where to start improving your data quality.</p> <p>In general, you use the Naive Bayes algorithm for classification. You want to extract models describing important data classes and then assign new cases to predefined classes. 
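</p>

<p>The multiply-and-normalize procedure from the Products example fits in a tiny helper (a Python illustration; the numbers are the ones used above):</p>

```python
def normalize(likelihoods):
    # Convert a dict of likelihoods into probabilities that sum to 1
    total = sum(likelihoods.values())
    return {state: value / total for state, value in likelihoods.items()}

# Likelihoods for Weight, given that Color and Class are both missing
likelihoods = {"missing": 0.8 * 0.6, "not missing": 0.2 * 0.4}
posteriors = normalize(likelihoods)
```

<p>With more input attributes, you simply multiply in one conditional probability per attribute before normalizing. </p> <p>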
Some typical usage scenarios include:</p> <ul> <li>Categorizing bank loan applications (safe or risky) based on previous experience</li> <li>Determining which home telephone lines are used for Internet access</li> <li>Assigning customers to predefined segments</li> <li>Quickly obtaining a basic comprehension of the data by checking the correlations between input variables</li> </ul>Data Mining Algorithms – Pluralsight Coursehttp://sqlblog.com/blogs/dejan_sarka/archive/2015/07/30/data-mining-algorithms-pluralsight-course.aspxThu, 30 Jul 2015 07:00:17 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:59233Dejan Sarka<p>This is a slightly different post in the series about the data mining and machine learning algorithms. This time, I am honored and humbled to announce that my fourth Pluralsight course is live. This is the <a href="http://www.pluralsight.com/courses/data-mining-algorithms-ssas-excel-r">Data Mining Algorithms in SSAS, Excel, and R</a> course. Besides explaining the algorithms, I also show demos in different products. This gives you an even better understanding than just reading the blog posts.</p> <p>Of course, I will continue describing the algorithms here as well.</p>Data Mining Algorithms – Support Vector Machineshttp://sqlblog.com/blogs/dejan_sarka/archive/2015/06/23/data-mining-algorithms-support-vector-machines.aspxTue, 23 Jun 2015 09:33:23 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:58915Dejan Sarka<p>Support vector machines are both supervised and unsupervised learning models: they serve for classification and regression analysis (supervised) and for anomaly detection (unsupervised). Given a set of training examples, each marked as belonging to one of the categories, an SVM training algorithm builds a model that assigns new examples to one of the categories. An SVM model is a representation of the cases as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. 
New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.</p> <p>A support vector machine constructs a hyper-plane or a set of hyper-planes in a high-dimensional space defined by the input variables, which can be used for classification, regression, or other tasks. A basic SVM is a non-probabilistic binary linear classifier. A good separation is achieved by the hyper-plane that has the largest distance to the nearest training data point of any class (the so-called functional margin). The larger the margin, the lower the generalization error. Let me show you this with a graphical example. Of course, I am showing a two-dimensional space defined by only two input variables, and therefore my separating hyper-plane is just a line. </p> <p>The first figure shows the two-dimensional space with cases and a possible single-dimensional hyper-plane (line). Of course, you can see that this line cannot be a separator at all, because there are some cases on the line or, said differently, on both sides of the line.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_6F2E99EC.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_7E6502B9.png" width="644" height="433" /></a> </p> <p>The next try is better. The line is separating the cases in the space. However, this is not the best possible separation. Some cases are pretty close to the line.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_57997D2A.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_601CDC74.png" width="644" height="433" /></a> </p> <p>The third picture shows the best possible separation. 
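</p>

<p>Once a separating hyper-plane is found, classifying a new case reduces to checking which side of the hyper-plane the case falls on. A toy sketch in Python (the line coefficients are made up for illustration, not produced by SVM training):</p>

```python
def classify(point, weights, bias):
    # The sign of w.x + b tells which side of the hyper-plane the point is on
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else -1

# Hypothetical separating line x + y - 5 = 0 in a two-dimensional space
w, b = [1.0, 1.0], -5.0
above = classify([4.0, 4.0], w, b)  # 4 + 4 - 5 = 3, positive side
below = classify([1.0, 1.0], w, b)  # 1 + 1 - 5 = -3, negative side
```

<p>Training is the hard part: the SVM algorithm chooses w and b so that the margin around this line is as wide as possible. </p> <p>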
The hyper-plane that separates the cases the best is found, and the model is trained.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_21B718C1.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_65FA10BE.png" width="644" height="433" /></a> </p> <p>Support Vector Machines are powerful for some specific classifications:</p> <ul> <li>Text and hypertext categorization</li> <li>Image classification</li> <li>Classification in medicine</li> <li>Hand-written character recognition</li> </ul> <p>A one-class SVM can be used for anomaly detection, like detection of dirty data, or for fraud detection. It uses a separation selected without regard to any target variable; it is therefore an unsupervised technique. Cases that are close to the separation hyper-plane are the suspicious cases. Therefore, the result is dichotomous: 1 = regular case, 0 = outlier.</p>Data Mining Algorithms – Principal Component Analysishttp://sqlblog.com/blogs/dejan_sarka/archive/2015/06/02/data-mining-algorithms-principal-component-analysis.aspxTue, 02 Jun 2015 09:15:32 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:58796Dejan Sarka<p> Principal component analysis (PCA) is a technique used to emphasize the majority of the variation and bring out the strong patterns in a dataset. It is often used to make data easy to explore and visualize. It is closely connected to eigenvectors and eigenvalues. </p> <p><a href="http://en.wikipedia.org/wiki/Principal_component_analysis">A short definition</a> of the algorithm: PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. 
This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric. </p> <p>Initially, the variables used in the analysis form a multidimensional space, or matrix, of dimensionality m, if you use m variables. The following picture shows a two-dimensional space. The values of the variables v1 and v2 define the cases in this 2-D space. The variability of the cases is spread across both source variables approximately equally. </p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_0E3B86EA.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_4FD5C336.png" width="644" height="436" /></a></p> <p>Finding principal components means finding m new axes, where m is exactly equal to the number of the source variables. However, these new axes are selected in such a way that most of the variability of the cases is spread over a single new variable, or principal component, as shown in the following picture.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_188F3BFB.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_55B2F780.png" width="644" height="436" /></a> </p> <p>We can decompose the covariance matrix of the data points into eigenvectors and eigenvalues. Every eigenvector has a corresponding eigenvalue. 
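</p>

<p>For a simple 2×2 symmetric (covariance-like) matrix, you can compute the eigenvalues directly from the characteristic polynomial. A sketch in plain Python, with a made-up matrix (not the example from the blog’s figure):</p>

```python
import math

def eigenvalues_2x2(a, b, c, d):
    # Roots of the characteristic polynomial: lambda^2 - (a + d)*lambda + (ad - bc) = 0
    trace, det = a + d, a * d - b * c
    disc = math.sqrt(trace * trace - 4.0 * det)
    return (trace + disc) / 2.0, (trace - disc) / 2.0

# Hypothetical covariance matrix [[2, 1], [1, 2]]
lam1, lam2 = eigenvalues_2x2(2.0, 1.0, 1.0, 2.0)
# The eigenvector belonging to the larger eigenvalue (3) gives the direction
# of the first principal component
```

<p>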
An eigenvector is the direction of a line, and an eigenvalue is a number telling how much variance there is in the data in that direction, or how spread out the data is along the line. The eigenvector with the highest eigenvalue is therefore the first principal component. Here is an example of the calculation of the eigenvectors and eigenvalues for a simple two-dimensional matrix.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_19F5EF7E.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_16E6223A.png" width="644" height="343" /></a> </p> <p>The interpretation of the principal components is up to you and might be pretty complex. This fact might limit the usability of PCA for business-oriented problems. PCA is used more in machine learning and statistics than in data mining, which is more end-user oriented, and the results should thus be easily understandable. You use PCA to:</p> <ul> <li>Explore the data to explain the variability;</li> <li>Reduce the dimensionality – replace the m variables with n principal components, where n < m, in such a way that preserves most of the variability;</li> <li>Use the residual variability not explained by the PCs for anomaly detection.</li> </ul>Data Mining Algorithms – EM Clusteringhttp://sqlblog.com/blogs/dejan_sarka/archive/2015/05/12/data-mining-algorithms-em-clustering.aspxTue, 12 May 2015 14:30:32 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:58637Dejan Sarka<p>With the K-Means algorithm, each object is assigned to exactly one cluster. It is assigned to this cluster with a probability equal to 1.0. It is assigned to all other clusters with a probability equal to 0.0. This is hard clustering.</p> <p>Instead of distance, you can use a probabilistic measure to determine cluster membership. For example, you can cover the objects with bell curves for each dimension with a specific mean and standard deviation. 
A case is assigned to every cluster with a certain probability. Because clusters can overlap, this is called soft clustering. The Expectation-Maximization (EM) method changes the parameters of the bell curve to improve covering in each iteration.</p> <p>The Expectation - Maximization (EM) Clustering algorithm extends the K-Means paradigm in a different way. Instead of assigning each object to a dedicated cluster, it assigns each object to a cluster according to a weight representing the probability of the membership. In other words, there are no strict boundaries between clusters. Therefore, new means are computed based on weighted measures.</p> <p>The EM algorithm iterates between two steps. In the first step—the "expectation" step—the algorithm calculates the cluster membership of each case (i.e., the probability that a case belongs to a given cluster from the initially defined k number of clusters). In the second step—the "maximization" step—the algorithm uses these cluster memberships to re-estimate the models' parameters, such as the location and scale parameters of Gaussian distribution.</p> <p>The algorithm assumes that the data is drawn from a mixture of Gaussian distributions (bell curves). Take a look at the graphics. In the first row, the algorithm initializes the mixture distribution, which is the mixture of several bell curves here. In the second and third rows, the algorithm modifies the mixture distribution based on the data. 
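</p>

<p>The “expectation” step described above can be sketched in plain Python. This uses one-dimensional Gaussians with made-up mixture parameters, for illustration only:</p>

```python
import math

def gaussian_pdf(x, mean, std):
    # Density of the bell curve with the given mean and standard deviation
    return math.exp(-((x - mean) ** 2) / (2.0 * std ** 2)) / (std * math.sqrt(2.0 * math.pi))

def expectation_step(x, clusters):
    # Soft membership: the probability of the case belonging to each cluster
    densities = [weight * gaussian_pdf(x, mean, std) for weight, mean, std in clusters]
    total = sum(densities)
    return [d / total for d in densities]

# Two hypothetical clusters: (mixture weight, mean, standard deviation)
memberships = expectation_step(1.0, [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)])
```

<p>The maximization step then recomputes each cluster’s weight, mean, and standard deviation from the cases, weighted by these memberships. </p> <p>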
The iteration stops when it meets the specified stopping criteria—for example, when it reaches a certain likelihood-of-improvement rate between iterations.</p> <p>Step 1: Initializing the mixture distribution</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_30F8E19C.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_12448862.png" width="904" height="184" /></a> </p> <p>Step 2: Modifying the mixture distribution</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_3A76C174.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_45A2DC6F.png" width="904" height="184" /></a> </p> <p>Step 3: Final modification</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_62AB8B37.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_60056037.png" width="904" height="184" /></a> </p> <p>You use the EM Clustering for the same purposes as the K-Means Clustering. </p> <p>In addition, you can search for outliers based on combinations of values of all input variables with the EM algorithm. You check the highest probability of cases over all clusters. The cases where the highest probability is still low do not fit well into any cluster. Said differently, they are not like other cases, and therefore you can assume that they are outliers. 
See the last figure in this blog post below.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_23DC2540.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_48041080.png" width="904" height="263" /></a> </p> <p>The green case belongs to cluster D with probability 0.95, to cluster C with probability 0.002, to cluster E with probability 0.0003, and so on.</p> <p>The red case belongs to cluster C with probability 0.03, to cluster B with probability 0.02, to cluster D with probability 0.003, and so on. The highest probability for the red case is still a low value; therefore, this case does not fit well into any of the clusters and thus might represent an outlier.</p> <p>Outliers can also represent potentially fraudulent transactions. EM Clustering is therefore also useful for fraud detection. Finally, you can use EM Clustering for advanced data profiling to find the rows with suspicious combinations of column values.</p>Data Mining Algorithms – K-Means Clusteringhttp://sqlblog.com/blogs/dejan_sarka/archive/2015/04/17/data-mining-algorithms-k-means-clustering.aspxFri, 17 Apr 2015 06:03:00 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:58471Dejan Sarka<p>Hierarchical clustering could be very useful because it is easy to see the optimal number of clusters in a dendrogram and because the dendrogram visualizes the clusters and the process of building those clusters. However, hierarchical methods don’t scale well. Just imagine how cluttered a dendrogram would be if 10,000 cases were shown on it.</p> <p>K-Means is a distance-based partitioning algorithm that divides a data set into a predetermined (“k”) number of clusters around the average location (“mean”). In your mind, you intuitively know how to group people or any other cases. Groups do not need to have an equal number of members. 
You can group cases according to one or more attributes.</p> <p>The algorithm comes from geometry. Imagine a record space with the attributes as dimensions. Each record (case) is uniquely located in this space by the values of its attributes (variables). </p> <p>The algorithm initially creates k fictitious members and defines them as the means of the clusters. These fictitious cases are also called centroids. The values of the input variables for these centroids can be selected randomly. Some implementations also use a bit of heuristics: they take the marginal distributions of the attributes as a starting point and randomly perturb from there.</p> <p>The algorithm then assigns each record to the nearest centroid. This way, you get the initial clusters. When the clusters are defined, the algorithm calculates the actual centroids of these clusters, which become the new centroids. After the new centroids are calculated, the algorithm reassigns each record to the nearest centroid. Some records jump from cluster to cluster.</p> <p>Now the algorithm can calculate new centroids and then new clusters. The algorithm iterates the last two steps until the cluster boundaries stop changing. Alternatively, you can stop the iterations when fewer cases than a minimum number, defined as a parameter, jump from cluster to cluster.</p> <p>Here is a graphical representation of the process. You can see the cases in a two-dimensional space. The dark brown cases are the fictitious centroids. 
The green case is the one that will jump between clusters.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_1ACFC0C6.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_37D86F8E.png" width="604" height="391" /></a> </p> <p>After the centroids are selected, the algorithm assigns each case to the nearest centroid.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_12DAAF15.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_7085AA4C.png" width="604" height="392" /></a> </p> <p>Now we have our three initial clusters. The algorithm can calculate the real centroids of those three clusters. This means that the centroids move.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_0B51D059.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_16EA1E49.png" width="604" height="392" /></a>  <p></p> <p>The algorithm has to recalculate the cluster membership. The green case jumps from the middle cluster to the bottom left cluster.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_71EC5DCF.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_73BCB396.png" width="604" height="392" /></a> </p> <p>In the next iteration, no case jumps from cluster to cluster. Therefore, the algorithm can stop.</p> <p>K-Means clustering scales much better than hierarchical methods. However, it has drawbacks as well. First of all, what is the optimal number of clusters? You can’t know it in advance. 
Therefore, you need to create several models with different numbers of clusters and then select the one that fits your data best. </p> <p>The next problem is the meaning of the clusters. There are no labels for the clusters that would be known in advance. Once the model is built, you need to check the distributions of the input variables in each cluster to understand what kind of cases constitute each one. Only after this step can you label the clusters.</p></p>
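The iterative procedure described in this post (assign each case to the nearest centroid, recompute the centroids, repeat until the boundaries stop changing) can be sketched in plain Python. The two-dimensional sample points, the initial centroids, and the choice of k = 2 are illustrative assumptions, not data taken from the figures above.

```python
import math

# Minimal K-Means sketch: assignment step, update step, repeat until stable.
# Sample points, initial centroids, and k = 2 are assumed for illustration.

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def k_means(cases, centroids):
    while True:
        # Assignment step: each case goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda i: dist(c, centroids[i]))
                  for c in cases]
        # Update step: move each centroid to the mean of its cluster members.
        new_centroids = []
        for i in range(len(centroids)):
            members = [c for c, lab in zip(cases, labels) if lab == i]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # cluster boundaries stopped changing
            return labels, centroids
        centroids = new_centroids

cases = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
labels, centroids = k_means(cases, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # the two left points share one cluster, the two right points the other
```

To pick the number of clusters, you would run this with several values of k and compare how well each model fits the data, exactly as described above.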