


A neural network is a powerful data modeling tool that can capture and represent complex input/output relationships. The motivation for the development of neural network technology stemmed from the desire to develop an artificial system that could perform "intelligent" tasks similar to those performed by the human brain. Neural networks resemble the human brain in the following two ways:
 A neural network acquires knowledge through learning
 A neural network's knowledge is stored within interneuron connection strengths known as synaptic weights
The Neural Network algorithm is an artificial intelligence technique that explores more possible data relationships than other algorithms. Because it is such a thorough technique, it usually processes more slowly than other classification algorithms. A neural network consists of basic units modeled after biological neurons. Each unit has many inputs that it combines into a single output value. These units are connected together, so the outputs of some units are used as inputs into other units. The network can have one or more middle layers called hidden layers. The simplest networks are feedforward networks (pictured), where there is only a one-way flow through the network from the inputs to the outputs; there are no cycles in feedforward networks. As mentioned, units combine inputs into a single output value. This combination is called the unit’s activation function. Consider this example: the human ear can function near a working jet engine, yet if it were only 10 times more sensitive, you would be able to hear a single molecule hitting the membrane in your ears! What does that mean? It means that going from 0.01 to 0.02 should be comparable to going from 100 to 200. In biology, there are many types of nonlinear behavior. Thus, an activation function has two parts. The first part is the combination function that merges all of the inputs into a single value (a weighted sum, for example). The second part is the transfer function, which transfers the value of the combination function to the output value of the unit. A linear transfer function would perform just a linear regression. Useful transfer functions are S-shaped, like the sigmoid function: Sigmoid(x) = 1 / (1 + e^(-x)). A single hidden layer is optimal, so the Neural Network algorithm always uses a maximum of one (or zero, which gives Logistic Regression). The Neural Network algorithm uses the hyperbolic tangent activation function in the hidden layer and the sigmoid function in the output layer.
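To make the two parts of the activation function concrete, here is a minimal Python sketch (the function names are my own, not from any particular product or library): the combination function is a weighted sum, and the transfer function is the hyperbolic tangent or the sigmoid.

```python
import math

def sigmoid(x):
    # S-shaped (logistic) transfer function: squashes any input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(inputs, weights, bias=0.0):
    # Combination function: weighted sum of the inputs
    combination = sum(i * w for i, w in zip(inputs, weights)) + bias
    # Transfer function: hyperbolic tangent, as in the hidden layer
    return math.tanh(combination)

print(sigmoid(0.0))                           # 0.5, the midpoint of the S curve
print(unit_output([1.0, 2.0], [0.5, -0.25]))  # 0.0, since the weighted sum is 0
```

Any S-shaped function would illustrate the same idea; the exact choice only changes the output range (tanh gives (-1, 1), the sigmoid gives (0, 1)).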
You can see a Neural Network with a single hidden layer in the following picture. Training a neural network is the process of setting the best weights on the inputs of each of the units. The backpropagation process does the following:
 Gets a training example and calculates the outputs
 Calculates the error – the difference between the calculated and the expected (known) result
 Adjusts the weights to minimize the error
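The three steps above can be sketched for a single sigmoid unit. This is a toy illustration of the idea, not any product's implementation; all names and numbers are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(weights, bias, inputs, target, learning_rate=0.5):
    # 1. Get a training example and calculate the output
    combination = sum(w * x for w, x in zip(weights, inputs)) + bias
    output = sigmoid(combination)
    # 2. Calculate the error: difference between expected and calculated result
    error = target - output
    # 3. Adjust the weights to minimize the error (using the sigmoid gradient)
    gradient = error * output * (1.0 - output)
    new_weights = [w + learning_rate * gradient * x
                   for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * gradient
    return new_weights, new_bias, error

weights, bias = [0.1, -0.2], 0.0
for _ in range(200):
    weights, bias, error = train_step(weights, bias, [1.0, 0.5], 1.0)
print(abs(error))  # the error shrinks toward 0 as the steps repeat
```

Repeating the three steps over many training examples is exactly the iteration that backpropagation performs across all units of the network.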
Like the Decision Trees algorithm, the Neural Network algorithm can be used for classification and prediction. However, interpreting the results of the Neural Network algorithm is somewhat more complex than interpreting the results of the Decision Trees algorithm. Consequently, the Decision Trees algorithm is more popular.


Our third SQL Saturday in Ljubljana is over. Two weeks seems to be enough time to sleep it over and see things a bit from a distance. Without any further delay, I can declare that the event was a pure success. Let me start with the numbers, comparing the total number of people, including speakers, sponsors, attendees, and organizers, with the previous two SQL Saturdays in Ljubljana:
 2013: 135 people from 12 countries
 2014: 215 people from 16 countries
 2015: 253 people from 20 countries
You can clearly see the growth. Even the keynote was full, as the following picture shows. We again experienced a very small drop rate; more than 90% of registered attendees showed up. That’s very nice, showing respect to the speakers and sponsors. So thank you, attendees, for being fair and respectful again! We had more sponsors than in previous years. This was extremely important, because this time we did not get the venue for free, and therefore we needed more money than for the previous two events. Thank you, sponsors, for enabling the event! Probably the most important part of these SQL Saturday events are the speakers. We got 125 sessions submitted by 51 speakers from 20 countries! We were really surprised. We take this as a sign of our good work in the past. Thirty great sessions with two state-of-the-art precon seminars are more than we expected, yet still not enough to accommodate all the speakers who sent submissions. Thank you all, speakers, those who were selected and those who were not! I hope we see you again in Slovenia next year. You can see some of the most beautiful speakers and volunteers in the following picture (decide for yourself whether there is somebody spoiling the picture). The next positive surprise was the volunteers. With this number of speakers and attendees, we would not have been able to handle the event without them. We realized that we have a great community, consisting of some really helpful people we can always count on. Thank you all! I think I can say for all three organizers, Mladen Prajdić, Matija Lah, and me, that we were more tired than in any year before. However, hosting a satisfied crowd is the best payback you can imagine. And the satisfaction level was high even among the youngest visitors, as you can see in the following picture. Of course, we also experienced some negative things. However, just a day before New Year’s Eve, I am not going to deal with them now. Let me finish this post in a positive way.


Decision Trees is a directed technique. Your target variable is the one that holds information about a particular decision, divided into a few discrete and broad categories (yes / no; liked / partially liked / disliked, etc.). You are trying to explain this decision using other information saved in other variables (demographic data, purchasing habits, etc.). With limited statistical significance, you are going to predict the target variable for a new case from its known values of the input variables, based on the results of your trained model. Recursive partitioning is used to build the tree. The data is split into partitions using a certain value of one of the explaining variables. The partitions are then split again and again. Initially, the data is in one big box. The algorithm tries all possible splits of all input (explaining) variables for the initial split. The goal is to get purer partitions considering the classes of the target variable. You know intuitively that purity is related to the percentage of the cases in each class of the target variable. There are better, but more complicated, measures of purity, for example entropy or information gain. The tree continues to grow using the two new partitions as separate starting points and splitting them further. You have to stop the process somewhere. Otherwise, you could get a completely fitted tree that has only one case in each leaf node. Each leaf would be, of course, absolutely pure, but this would not make any sense; the results could not be used for any meaningful prediction. This phenomenon is called “overfitting”. There are two basic approaches to solving this problem: prepruning (bonsai) and postpruning techniques. The prepruning (bonsai) methods prevent growth of the tree in advance by applying tests at each node to determine whether a further split is going to be useful; the tests can be simple (number of cases) or complicated (complexity penalty).
The postpruning methods allow the tree to grow and then prune off the useless branches. The postpruning methods tend to give more accurate results, but they require more computation than prepruning methods. Imagine the following example. You have the answers to a simple question: Did you like the famous Woodstock movie? You also have some demographic data: age (20 to 60) and education (ranked in 7 classes from the lowest to the highest). In all, 55% of the interviewees liked the movie and 45% did not. Can you discover factors that have an influence on whether they liked the movie? Starting point: 55% of the interviewees liked the movie and 45% did not. After checking all possible splits, you find the best initial split, made at the age of 35. With further splitting, you finish with a full-grown tree. Note that not all branches lead to purer classes. Some of them are not useful at all and should be pruned. Decision trees are used for classification and prediction. Typical usage scenarios include:
 Predicting which customers will leave
 Targeting the audience for mailings and promotional campaigns
 Explaining reasons for a decision
 Answering questions such as “What movies do young female customers buy?”
Decision Trees is the most popular data mining algorithm. This is because the results are very understandable and simple to interpret, and the quality of the predictions is usually very high.
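As a small sketch of one purity measure mentioned above, here is the entropy of a partition in Python, applied to the Woodstock example's starting distribution (the helper function is my own, not from any library):

```python
import math

def entropy(class_counts):
    # Purity measure for a partition: 0 means completely pure,
    # higher values mean more mixed classes of the target variable
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:
            p = count / total
            result -= p * math.log2(p)
    return result

# The Woodstock example: 55% liked the movie, 45% did not
print(entropy([55, 45]))   # close to 1.0: a very impure starting partition
print(entropy([100, 0]))   # 0.0: a completely pure partition
```

The algorithm would compute such a measure for the partitions produced by every candidate split and pick the split that lowers the overall impurity the most.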


It is alive! It was really hard to make the choices. Nevertheless, the schedule is public now. To everybody who submitted proposals – thank you very much again! We are sorry we cannot accommodate everybody. Please note that even if you were not selected, we would be happy to see you in Ljubljana.



I am continuing with my data mining and machine learning algorithms series. Naive Bayes is a nice algorithm for classification and prediction. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on the known input attributes. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent. Starting with a single dependent and a single independent variable, the algorithm is not too complex to understand (I am using an example from my old statistics book: Thomas H. Wonnacott, Ronald J. Wonnacott: Introductory Statistics, Wiley 1990). Let’s say I am buying a used car. In an auto magazine I find that 30% of second-hand cars are faulty. I take with me a mechanic who can make a shrewd guess on the basis of a quick drive. Of course, he isn’t always right. Of the faulty cars he examined in the past, he correctly pronounced 90% faulty and wrongly pronounced 10% OK. When judging good cars, he correctly pronounced 80% of them as good and wrongly pronounced 20% as faulty. In the graph, we can see that 27% (90% of 30%) of all cars are actually faulty and then correctly identified as such. 14% (20% of 70%) are judged faulty, although they are good. Altogether, 41% (27% + 14%) of cars are judged faulty. Of these cars, 67% (27% / 41%) are actually faulty. To sum up: once the car has been pronounced faulty by the mechanic, the chance that it is actually faulty rises from the original 30% up to 67%. The following figure shows this process. The calculations above can be summarized in another tree, the reverse tree. You start branching with the opinion of the mechanic (59% OK and 41% faulty). Moving to the right, the second branching shows the actual conditions of the cars, and this is the valuable information for you.
For example, the top branch says: once the car is judged faulty, the chance that it actually turns out faulty is 67%. The third branch from the top clearly shows: once the car is judged good, the chance that it is actually faulty is just 5%. You can see the reverse tree in the following figure. As mentioned, Naive Bayes treats all of the input attributes as independent of each other with respect to the target variable. While this could be a wrong assumption, it allows multiplying the probabilities to determine the likelihood of each state of the target variable based on the states of the input variables. For example, let’s say that you need to analyze whether there is any association between NULLs in different columns of your Products table. You realize that if Color is missing, 80% of Weight values are missing as well; and if Class is missing, 60% of Weight values are missing as well. You can multiply these probabilities. For the missing state of the target variable, the Weight, you can calculate the product:
0.8 (Color missing for Weight missing) * 0.6 (Class missing for Weight missing) = 0.48
You can also check what happens to the not-missing state of the target variable:
0.2 (Color missing for Weight not missing) * 0.4 (Class missing for Weight not missing) = 0.08
You can easily see that the likelihood that Weight is missing is much higher than the likelihood it is not missing when Color and Class are unknown. You can convert the likelihoods to probabilities by normalizing their sum to 1:
P (Weight missing if Color and Class are missing) = 0.48 / (0.48 + 0.08) = 0.857
P (Weight not missing if Color and Class are missing) = 0.08 / (0.48 + 0.08) = 0.143
Now, when you know that the Color value is NULL and the Class value is NULL, there is a nearly 86% chance that the Weight value is NULL as well. This might lead you to some conclusions about where to start improving your data quality. In general, you use the Naive Bayes algorithm for classification.
You want to extract models describing important data classes and then assign new cases to predefined classes. Some typical usage scenarios include:
 Categorizing bank loan applications (safe or risky) based on previous experience
 Determining which home telephone lines are used for Internet access
 Assigning customers to predefined segments
 Quickly obtaining a basic comprehension of the data by checking the correlation between input variables
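The likelihood multiplication and normalization from the missing-values example above can be sketched in a few lines of Python (the function name is my own):

```python
def naive_bayes_posterior(likelihoods_per_state):
    # Multiply the per-attribute conditional probabilities for each state
    # of the target variable, then normalize the products so they sum to 1
    products = {}
    for state, likelihoods in likelihoods_per_state.items():
        product = 1.0
        for p in likelihoods:
            product *= p
        products[state] = product
    total = sum(products.values())
    return {state: product / total for state, product in products.items()}

# The missing-values example from the text: P(Color missing | Weight state)
# and P(Class missing | Weight state)
posterior = naive_bayes_posterior({
    "Weight missing":     [0.8, 0.6],   # product 0.48
    "Weight not missing": [0.2, 0.4],   # product 0.08
})
print(posterior)  # {'Weight missing': ~0.857, 'Weight not missing': ~0.143}
```

With more input attributes, the lists of likelihoods simply get longer; the independence assumption is what makes the plain multiplication legitimate.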


I am finishing my list of conferences and seminars I am attending in the second half of the year 2015. Here is my list:
 Kulendayz 2015 – September 4th–5th. Although I will have huge problems getting there on time, I would never want to miss it. I have one talk there.
 SQL Saturday #413 Denmark – September 17th–19th. You can join me already on Friday for the Data Mining Algorithms seminar.
 SQL Saturday #434 Holland – September 25th–26th. If you miss the Denmark Data Mining Algorithms seminar, I am repeating it in Utrecht.
 SQL Server Days 2015 Belgium – September 28th–29th. This will be my first talk at SQL Server Days.
 SQL Saturday #454 Turin – October 10th. I have not been confirmed as a speaker yet, but I still plan to go there and combine the SQL part with the Expo in Milan.
 PASS Summit 2015 Seattle – October 27th–30th. I still continue to be present at every single Summit :) This year I have one presentation.
 SharePoint Days 2015 Slovenia – November 17th–18th. No, I don’t like SPS. I will just have one small BI presentation there.
 SQL Saturday #475 Belgrade – November 28th. First SQL Saturday in Serbia. I simply must be there.
 SQL Saturday #460 Slovenia – December 11th–12th. I am co-organizing this event. Everybody is welcome; this will be a fully English-speaking event. Don’t miss beautiful, relaxed, and friendly Ljubljana!
That’s it. For now:)


This is a bit different post in the series about data mining and machine learning algorithms. This time I am honored and humbled to announce that my fourth Pluralsight course is live. This is the Data Mining Algorithms in SSAS, Excel, and R course. Besides explaining the algorithms, I also show demos in different products. This gives you an even better understanding than just reading the blog posts. Of course, I will continue describing the algorithms here as well.


So we are back. PASS SQL Saturday is coming to Slovenia again on December 12th, 2015. Remember the last two years? We had two great events. According to the feedback, everybody was satisfied and happy. Let's make another outstanding event! How can you help? First of all, these events are free for attendees. Of course, this is possible only because sponsors pay the bills, and speakers come at their own expense, using their own free time. For a start, please respect these facts. We always need more sponsors. Maybe your company is interested; maybe you know a company that would be interested in sponsorship. Please spread the word about the event. This event is second to none in Slovenia. No other event in Slovenia attracts so many top speakers from all around the world as our SQL Saturday. Last year there were around two hundred attendees, with more than one third coming from abroad. Therefore, as a sponsor, you can reach quite a broad audience, even though the event is in a small country. The second pillar of the event is the speakers. Please consider speaking at the event. We are not limited to top speakers only; one of the goals of the event is also to raise new speakers. We are successful with this task; every year we get somebody new. Therefore, don't hesitate; if you would like to join the SQL Server speakers community, this is a good time and place! Organization of the event takes a lot of time as well. Maybe you can help here? When you register, you can also volunteer to help. Or you can contact Matija Lah, Mladen Prajdić, or me directly. There is never enough advertising. Again, please spread the word about the event. Maybe your coworkers and friends haven't heard about the event yet. Please tell them. If you attended SQL Saturday Slovenia before, please share your experience. Finally, please register and then attend the event. Last year we reached a less than 5% drop-off rate, which is, as far as we know, a world record. Let's keep up the good work!


Support vector machines (SVMs) are supervised learning models for classification and regression analysis; a one-class variant can also be used as an unsupervised model for anomaly detection. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the cases as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. A support vector machine constructs a hyperplane or set of hyperplanes in a high-dimensional space defined by the input variables, which can be used for classification, regression, or other tasks. An SVM is a binary linear classifier. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin). The larger the margin, the lower the generalization error. Let me show you this in a graphical example. Of course, I am showing a two-dimensional space defined by only two input variables, and therefore my separating hyperplane is just a line. The first figure shows the two-dimensional space with cases and a possible one-dimensional hyperplane (a line). You can see that this line cannot be a separator at all, because there are cases on the line or, said differently, cases of both classes on each side of the line. The next try is better. The line is separating the cases in the space. However, this is not the best possible separation; some cases are pretty close to the line. The third picture shows the best possible separation. The hyperplane that separates the cases best is found, and the model is trained. Support Vector Machines are powerful for some specific classification tasks:
 Text and hypertext categorization
 Image classification
 Classification in medicine
 Handwritten character recognition
One-class SVM can be used for anomaly detection, like detection of dirty data, or fraud detection. It finds the separating function without regard to any target variable. Cases that are close to the separation hyperplane are the suspicious cases. Therefore, the result is dichotomous: 1 = regular case, 0 = outlier.
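This is not an SVM trainer, but a small Python sketch of the margin idea described above: the distance from a candidate hyperplane (a line in 2D) to its nearest case, which the training algorithm tries to maximize. The cases and candidate lines are made up for illustration.

```python
def margin(points, w, b):
    # Distance of the nearest case to the hyperplane w . x + b = 0
    norm = sum(wi * wi for wi in w) ** 0.5
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x in points)

# Four cases in a two-dimensional space, two per class
cases = [(1.0, 1.0), (2.0, 1.0),    # class A, lower left
         (4.0, 4.0), (5.0, 5.0)]    # class B, upper right

# Two candidate separating lines (hyperplanes in 2D are lines)
print(margin(cases, (1.0, 1.0), -6.0))   # closer to class B: smaller margin
print(margin(cases, (1.0, 1.0), -5.5))   # midway between classes: larger margin
```

The second line runs midway between the nearest cases of the two classes, so its margin is larger; that is the hyperplane an SVM would prefer.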


Principal component analysis (PCA) is a technique used to emphasize the majority of the variation and bring out strong patterns in a dataset. It is often used to make data easy to explore and visualize. It is closely connected to eigenvectors and eigenvalues. A short definition of the algorithm: PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric. Initially, the variables used in the analysis form a multidimensional space, or matrix, of dimensionality m, if you use m variables. The following picture shows a two-dimensional space. Values of the variables v1 and v2 define cases in this 2D space. The variability of the cases is spread across both source variables approximately equally. Finding principal components means finding new m axes, where m is exactly equal to the number of the source variables. However, these new axes are selected in such a way that most of the variability of the cases is spread over a single new variable, or principal component, as shown in the following picture. We can deconstruct the data points matrix into eigenvectors and eigenvalues. Every eigenvector has a corresponding eigenvalue. An eigenvector is a direction of a line, and an eigenvalue is a number telling how much variance there is in the data in that direction, or how spread out the data is on the line.
The eigenvector with the highest eigenvalue is therefore the first principal component. Here is an example of the calculation of eigenvectors and eigenvalues for a simple two-dimensional matrix. The interpretation of the principal components is up to you and might be pretty complex. This fact might limit PCA usability for business-oriented problems. PCA is used more in machine learning and statistics than in data mining, which is more end-user oriented, and the results thus should be easily understandable. You use PCA to:
 Explore the data to explain the variability;
 Reduce the dimensionality – replace the m variables with n principal components, where n < m, in such a way that preserves most of the variability;
 Use the residual variability not explained by the PCs for anomaly detection.
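Here is a minimal Python sketch of the eigenvalue calculation for a simple two-dimensional covariance matrix, solved directly from the characteristic polynomial (the covariance numbers are made up for illustration):

```python
def principal_components_2d(cov):
    # Eigenvalues of a symmetric 2x2 covariance matrix [[a, b], [b, c]],
    # from the characteristic polynomial: L^2 - (a + c)L + (ac - b^2) = 0
    (a, b), (_, c) = cov
    mean = (a + c) / 2.0
    half_gap = ((a - c) ** 2 / 4.0 + b * b) ** 0.5
    return mean + half_gap, mean - half_gap   # largest eigenvalue first

# Covariance of two positively correlated variables
cov = [[2.0, 1.5], [1.5, 2.0]]
first, second = principal_components_2d(cov)
print(first, second)                 # 3.5 0.5
print(first / (first + second))      # share of variance on the first PC: 0.875
```

Although the two source variables carry equal variance, the first principal component alone explains 87.5% of the total variability, which is exactly why replacing the two variables with that single component loses so little information.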


With the K-Means algorithm, each object is assigned to exactly one cluster: it is assigned to that cluster with a probability equal to 1.0 and to all other clusters with a probability equal to 0.0. This is hard clustering. Instead of distance, you can use a probabilistic measure to determine cluster membership. For example, you can cover the objects with bell curves for each dimension, with a specific mean and standard deviation. A case is then assigned to every cluster with a certain probability. Because clusters can overlap, this is called soft clustering. The Expectation-Maximization (EM) method changes the parameters of the bell curves to improve the covering in each iteration. The Expectation-Maximization (EM) Clustering algorithm extends the K-Means paradigm in a different way. Instead of assigning each object to a dedicated cluster, it assigns each object to a cluster according to a weight representing the probability of the membership. In other words, there are no strict boundaries between clusters. Therefore, new means are computed based on weighted measures. The EM algorithm iterates between two steps. In the first step, the "expectation" step, the algorithm calculates the cluster membership of each case (i.e., the probability that a case belongs to a given cluster from the initially defined k number of clusters). In the second step, the "maximization" step, the algorithm uses these cluster memberships to re-estimate the models' parameters, such as the location and scale parameters of a Gaussian distribution. The algorithm assumes that the data is drawn from a mixture of Gaussian distributions (bell curves). Take a look at the graphics. In the first row, the algorithm initializes the mixture distribution, which here is a mixture of several bell curves. In the second and third rows, the algorithm modifies the mixture distribution based on the data.
The iteration stops when it meets the specified stopping criteria, for example, when it reaches a certain likelihood-of-improvement rate between iterations.
Step 1: Initializing the mixture distribution
Step 2: Modifying the mixture distribution
Step 3: Final modification
You use EM Clustering for the same purposes as K-Means Clustering. In addition, you can search for outliers based on combinations of values of all input variables with the EM algorithm. You check the highest probability of each case over all clusters. The cases where the highest probability is still low do not fit well into any cluster. Said differently, they are not like other cases, and therefore you can assume that they are outliers. See the last figure in this blog post below. The green case belongs to cluster D with probability 0.95, to cluster C with probability 0.002, to cluster E with probability 0.0003, and so on. The red case belongs to cluster C with probability 0.03, to cluster B with probability 0.02, to cluster D with probability 0.003, and so on. The highest probability for the red case is still a low value; therefore, this case does not fit well into any of the clusters and thus might represent an outlier. Outliers can also represent potentially fraudulent transactions; EM Clustering is therefore useful for fraud detection as well. Finally, you can use EM Clustering for advanced data profiling to find the rows with suspicious combinations of column values.
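The outlier check just described can be sketched in Python, using the green and red cases' membership probabilities quoted above (the 0.5 threshold is my own choice for illustration):

```python
def find_outliers(memberships, threshold=0.5):
    # memberships: per case, a dict of cluster -> membership probability.
    # A case whose highest probability is still low fits no cluster well.
    outliers = []
    for case, probs in memberships.items():
        best_cluster = max(probs, key=probs.get)
        if probs[best_cluster] < threshold:
            outliers.append(case)
    return outliers

memberships = {
    "green case": {"C": 0.002, "D": 0.95, "E": 0.0003},  # clearly in cluster D
    "red case":   {"B": 0.02,  "C": 0.03, "D": 0.003},   # fits no cluster well
}
print(find_outliers(memberships))  # ['red case']
```

In practice the membership probabilities would come from the trained mixture model itself; the point here is only the decision rule applied to them.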


Hierarchical clustering can be very useful, because it is easy to see the optimal number of clusters in a dendrogram and because the dendrogram visualizes the clusters and the process of building them. However, hierarchical methods don’t scale well. Just imagine how cluttered a dendrogram would be if 10,000 cases were shown on it. K-Means is a distance-based partitioning algorithm that divides a data set into a predetermined (“k”) number of clusters around the average location (“mean”). In your mind, you intuitively know how to group people or any other cases. Groups do not need to have an equal number of members. You can group according to one or more attributes. The algorithm comes from geometry. Imagine a record space with attributes as dimensions. Each record (case) is uniquely located in this space by the values of its attributes (variables). The algorithm initially creates k fictitious members and defines them at the means of the clusters. These fictitious cases are also called centroids. The values of the input variables for these centroids can be selected randomly. Some algorithms also use a bit of heuristics, taking the marginal distributions of the attributes as a starting point and randomly perturbing from there. The algorithm then assigns each record to the nearest centroid. This way, you get the initial clusters. When the clusters are defined, the algorithm can calculate the actual centroids of the clusters and get new centroids. After the new centroids are calculated, the algorithm reassigns each record to the nearest centroid. Some records jump from cluster to cluster. Now the algorithm can calculate new centroids and then new clusters. The algorithm iterates the last two steps until the cluster boundaries stop changing. You can stop the iterations when fewer cases than a minimum number, defined as a parameter, jump from cluster to cluster. Here is a graphical representation of the process. You can see the cases in a two-dimensional space.
The dark brown cases are the fictitious centroids. The green case is the one that will jump between clusters. After the centroids are selected, the algorithm assigns each case to the nearest centroid. Now we have our three initial clusters. The algorithm can calculate the real centroids of those three clusters. This means that the centroids move. The algorithm has to recalculate the cluster membership. The green case jumps from the middle cluster to the bottom left cluster. In the next iteration, no case jumps from cluster to cluster. Therefore, the algorithm can stop. K-Means clustering scales much better than hierarchical methods. However, it has drawbacks as well. First of all, what is the optimal number of clusters? You can’t know in advance. Therefore, you need to create different models with different numbers of clusters and then select the one that fits your data best. The next problem is the meaning of the clusters. There are no labels for the clusters that would be known in advance. Once the model is built, you need to check the distributions of the input variables in each cluster to understand what kind of cases constitute each cluster. Only after this step can you label the clusters.
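A toy Python sketch of the iteration just described, with made-up points and two fictitious starting centroids (it assumes no cluster ever ends up empty):

```python
def assign(points, centroids):
    # Assign each case to the nearest centroid (squared Euclidean distance)
    clusters = [[] for _ in centroids]
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    return clusters

def recompute(clusters):
    # Move each centroid to the mean of the cases assigned to it
    return [tuple(sum(coords) / len(coords) for coords in zip(*cluster))
            for cluster in clusters]

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]   # fictitious starting centroids
for _ in range(10):                      # iterate until the boundaries settle
    clusters = assign(points, centroids)
    centroids = recompute(clusters)
print(centroids)  # [(1.25, 1.5), (8.5, 8.75)]
```

Here the loop runs a fixed number of iterations for simplicity; a real implementation would instead stop when fewer than a minimum number of cases jump between clusters, as described above.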


Clustering is the process of grouping the data into classes or clusters so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects. There are a large number of clustering algorithms. The major methods can be classified into the following categories:
 Partitioning methods. A partitioning method constructs K partitions of the data, which satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Given the initial K number of partitions to construct, the method creates initial partitions. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. There are various kinds of criteria for judging the quality of the partitions. The most popular methods include the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.
 Hierarchical methods. A hierarchical method creates a hierarchical decomposition of the given set of data objects. These methods are agglomerative or divisive. The agglomerative (bottom-up) approach starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all groups are merged into one. The divisive (top-down) approach starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster or until a termination condition holds.
 Density-based methods. Methods based on the distance between objects can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes. Other methods have therefore been developed based on the notion of density. The general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the “neighborhood” exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
 Model-based methods. Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based technique might locate clusters by constructing a density function that reflects the spatial distribution of the data points. Unlike conventional clustering, which primarily identifies groups of like objects, such conceptual clustering goes one step further by also finding characteristic descriptions for each group, where each group represents a concept or a class.
Training a hierarchical clustering model typically starts by calculating a distance matrix – a matrix of the distances between data points in a multidimensional hyperspace, where each input variable defines one dimension of that hyperspace. The distance measure can be a geometric distance or some other, more complex measure. A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are also often used in computational biology to illustrate the clustering of genes or samples. The following set of pictures shows the process of building an agglomerative hierarchical clustering dendrogram. Cluster analysis segments a heterogeneous population into a number of more homogeneous subgroups or clusters. Typical usage scenarios include:  Discovering distinct groups of customers
 Identifying groups of houses in a city
 In biology, deriving animal and plant taxonomies
 Making predictions, once the clusters are built and the distribution of a target variable in the clusters has been calculated
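The distance-matrix step that begins hierarchical clustering can be sketched as follows, assuming Euclidean distance and made-up points:

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances; each input variable
    contributes one dimension of the hyperspace."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]

dm = distance_matrix([(0, 0), (3, 4), (6, 8)])
```

The matrix is symmetric with a zero diagonal; an agglomerative pass then repeatedly merges the pair of clusters with the smallest entry.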


The Association Rules algorithm is specifically designed for use in market basket analyses. This knowledge can additionally help in identifying cross-selling opportunities and in arranging attractive packages of products. This is the most popular algorithm used in web sales. You can even include additional discrete input variables and predict purchases over classes of input variables.
Association Rules Basics
The algorithm considers each attribute/value pair (such as product/bicycle) as an item. An itemset is a combination of items in a single transaction. The algorithm scans through the dataset trying to find itemsets that tend to appear in many transactions. Then it expresses the combinations of the items as rules (such as “if customers purchase potato chips, they will purchase cola as well”). Often association models work against datasets containing nested tables, such as a customer list followed by a nested purchases table. If a nested table exists in the dataset, each nested key (such as a product in the purchases table) is considered an item.
Understanding Measures
Besides the itemsets and the rules, the algorithm also returns some measures for the itemsets and the rules. Imagine the following transactions:  Transaction 1: Frozen pizza, cola, milk
 Transaction 2: Milk, potato chips
 Transaction 3: Cola, frozen pizza
 Transaction 4: Milk, pretzels
 Transaction 5: Cola, pretzels
The Association Rules measures include:  Support, or frequency, is the number of cases that contain the targeted item or combination of items; support is therefore a measure for the itemsets.
 Probability, also known as confidence, is a measure for the rules. The probability of an association rule is the support for the combination divided by the support for the condition. For example, the rule "If a customer purchases milk, then they will purchase potato chips" has a probability of 33%. The support for the combination (milk + potato chips) is 20%, occurring in one of the five transactions, while the support for the condition (milk) is 60%, occurring in three of the five transactions. This gives a confidence of 0.2 / 0.6 = 0.33, or 33%.
 Importance is a measure for both itemsets and rules. When importance is calculated for an itemset, a value of one means the items in the itemset are independent; importance greater than one means the items are positively correlated, and importance lower than one means they are negatively correlated. When importance is calculated for a rule “If {A} then {B}”, a value of zero means there is no association between the items. Positive importance means that the probability for item {B} goes up when item {A} is in the basket, and negative importance means that the probability for item {B} goes down when item {A} is in the basket.
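The measures above can be computed directly from the five example transactions. The support and confidence functions follow the definitions in the text; the rule importance shown is one common variant (log-lift), which is zero at independence and positive when the condition raises the consequent's probability — commercial tools may define importance with a different formula:

```python
import math

# the five example transactions from the text
transactions = [
    {"frozen pizza", "cola", "milk"},
    {"milk", "potato chips"},
    {"cola", "frozen pizza"},
    {"milk", "pretzels"},
    {"cola", "pretzels"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(condition, consequent):
    """Rule probability: support(condition + consequent) / support(condition)."""
    return support(condition | consequent) / support(condition)

def rule_importance(condition, consequent):
    """Log-lift variant: 0 = no association, > 0 = consequent becomes
    more likely when the condition is in the basket."""
    return math.log10(confidence(condition, consequent) / support(consequent))

s = support({"milk", "potato chips"})       # 1 of 5 transactions
c = confidence({"milk"}, {"potato chips"})  # 0.2 / 0.6
imp = rule_importance({"milk"}, {"potato chips"})
```

Here the support for the combination is 20%, the confidence of the rule is 33%, and the positive importance confirms that buying milk raises the probability of buying potato chips in this tiny dataset.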
Common Business Use Cases
You use the Association Rules algorithm for market basket analyses. You can identify cross-selling opportunities or arrange attractive packages. This is the most popular algorithm used in web sales. You can even include additional input variables and predict purchases over classes of input variables.




