THE SQL Server Blog Spot on the Web

Dejan Sarka

Data Mining Algorithms – an Introduction

Data mining is the most advanced part of business intelligence. With statistical and other mathematical algorithms, you can automatically discover patterns and rules in your data that are hard to notice with on-line analytical processing and reporting. However, you need to understand thoroughly how the data mining algorithms work in order to interpret the results correctly. In this post I introduce data mining, and in the following posts I will open the black box of data mining and explain how the most popular algorithms work.

Data Mining Definition

Data mining is a process of exploration and analysis, by automatic or semiautomatic means, of historical data in order to discover patterns and rules, which can be used later on new data for predictions and forecasting. With data mining, you deduce hidden knowledge by examining the data, which is also called training on the data. The unit of examination is called a case, which can be interpreted as one appearance of an entity, or a row in a table. The knowledge you gain takes the form of patterns and rules. In the process, you use the attributes of a case, which are called variables in data mining terminology. For better understanding, you can compare data mining to On-Line Analytical Processing (OLAP), which is a model-driven analysis where you build the model in advance. Data mining is a data-driven analysis, where you search for the model by examining the data with data mining algorithms.

There are many alternative names for data mining, such as knowledge discovery in databases (KDD) and predictive analytics. Originally, data mining differed from machine learning: data mining gives business users insights for actionable decisions, whereas machine learning determines which algorithm performs best for a specific task. Nowadays, however, data mining and machine learning are in many cases used as synonyms.

The Two Types of Data Mining

Data mining techniques are divided into two main classes:

  • The directed, or supervised approach: You use known examples and apply gleaned information to unknown examples to predict selected target variable(s).
  • The undirected, or unsupervised approach: You discover new patterns inside the dataset as a whole.

Some of the most important directed techniques include classification, estimation, and forecasting. Classification means examining a new case and assigning it to a predefined discrete class. Examples are assigning keywords to articles and assigning customers to known segments. Estimation is very similar, except that you estimate the value of a variable for a new case from a continuous range of values. You can, for example, estimate the number of children or the family income. Forecasting is somewhat similar to classification and estimation. The main difference is that you can’t check the forecasted value at the time of the forecast. Of course, you can evaluate it if you wait long enough. Examples include forecasting which customers will leave in the future, which customers will order additional services, and the sales amount in a specific region at a specific time in the future.

The most common undirected techniques are clustering and affinity grouping. An example of clustering is looking through a large number of initially undifferentiated customers and trying to see whether they fall into natural groupings. This is a pure example of "undirected data mining", where the user has no preordained agenda and hopes that the data mining tool will reveal some meaningful structure. Affinity grouping is a special kind of clustering that identifies events or transactions that occur simultaneously. A well-known example of affinity grouping is market basket analysis, which attempts to understand which items are sold together.

Common Business Use Cases

Some of the most common business questions that you can answer with data mining include:

  • What’s the credit risk of this customer?
  • Do my customers fall into any natural groups?
  • What products do customers tend to buy together?
  • How much of a specific product can I sell in the next time period?
  • What is the potential number of customers shopping in this store?
  • What are the major groups of my web-click customers?
  • Is this a spam email?

However, the actual questions you might want to answer with data mining can be far broader and are limited only by your imagination. For an unconventional example, you might use data mining to try to lower the mortality rate in a hospital.

Data mining is already widely used in many different applications. Some of the typical usages, along with the most commonly used algorithms for a specific task, include the following:

  • Cross-selling: Widely used for web sales with the Association Rules and Decision Trees algorithms.
  • Fraud detection: An important task for banks and credit card issuers, who want to limit the damage that fraud creates, including that experienced by customers and companies. The Clustering and Decision Trees algorithms are commonly used for fraud detection.
  • Churn detection: Service providers, including telecommunications, banking, and insurance companies, perform this to detect which of their subscribers are about to leave them in an attempt to prevent it. Any of the directed methods, including the Naive Bayes, Decision Trees, or Neural Network algorithm, is suitable for this task.
  • Customer Relationship Management (CRM) applications: Based on knowledge about customers, which you can extract with segmentation using, for example, the Clustering or Decision Trees algorithm.
  • Website optimization: To do this, you should know how your website is used. Microsoft developed a special algorithm, the Sequence Clustering algorithm, for this task.
  • Forecasting: Nearly any business would like to have some forecasting, in order to prepare better plans and budgets. The Time Series algorithm is specially designed for this task.

A Quick Introduction to the Most Popular Algorithms

To raise expectations for the upcoming posts, here is a brief introduction to the most popular data mining algorithms, condensed into a table.

Algorithm

Usage

Association Rules

This is the algorithm used for market basket analysis. It defines an itemset as a combination of items in a single transaction. It then scans the data and counts the number of times the items of each itemset appear together in transactions. Market basket analysis is useful for detecting cross-selling opportunities.
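The counting step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the transactions are made up, the names `count_itemsets` and `confidence` are hypothetical, and real products add support thresholds, larger itemsets, and rule pruning on top of this idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one market basket.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]


def count_itemsets(transactions, size):
    """Count how many transactions contain each itemset of the given size."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives every itemset one canonical tuple form.
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return counts


def confidence(transactions, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


pair_counts = count_itemsets(TRANSACTIONS, 2)
print(pair_counts.most_common())
print(confidence(TRANSACTIONS, "bread", "milk"))
```

Pairs with a high count and a high confidence are the candidates for cross-selling recommendations.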

Clustering

This algorithm groups cases from a dataset into clusters of cases with similar characteristics. You can use the Clustering method to find distinguishable groups of customers for your CRM application. In addition, you can use it for finding anomalies in your data. If a case does not fit well into any cluster, it is an exception of some kind. For example, it might be a fraudulent transaction.
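To make the grouping and the anomaly idea concrete, here is a minimal sketch that uses k-means, one common clustering method. It is an assumption for illustration only and not necessarily the method your product uses. The cases, the starting centroids, and the distance threshold in `is_anomaly` are all hypothetical.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to the point."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each case to its nearest centroid, then move
    each centroid to the mean of the cases assigned to it."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            index, _ = nearest(p, centroids)
            groups[index].append(p)
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


# Hypothetical cases: (age, yearly purchases in thousands).
CASES = [(25, 2), (27, 3), (26, 2), (60, 9), (62, 10), (61, 9)]
centroids = k_means(CASES, centroids=[CASES[0], CASES[3]])


def is_anomaly(case, centroids, threshold=10.0):
    """Flag a case that lies far from every cluster (threshold is arbitrary)."""
    return nearest(case, centroids)[1] > threshold


print(centroids)
print(is_anomaly((43, 40), centroids))
```

A case that is far from all centroids does not fit any cluster well, which is exactly the kind of exception worth a closer look.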

Naïve Bayes

This algorithm calculates the probability of each state of every input attribute, given each state of the predictable variable. It then uses those probabilities to predict the target attribute from the known input attributes of new cases. The Naïve Bayes algorithm is quite simple and builds models quickly. Therefore, it is very suitable as a starting point in your predictive analytics project.
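The calculation can be sketched as follows. This is a simplified illustration, assuming discrete inputs only; the training cases, the attribute names, and the add-one smoothing (with `states_per_attribute` as a hypothetical parameter) are my assumptions and not details of any specific product.

```python
from collections import Counter, defaultdict

# Hypothetical training cases: (input attributes, target state).
CASES = [
    ({"commute": "far", "owns_car": "yes"}, "no_bike"),
    ({"commute": "far", "owns_car": "yes"}, "no_bike"),
    ({"commute": "near", "owns_car": "no"}, "bike"),
    ({"commute": "near", "owns_car": "yes"}, "bike"),
    ({"commute": "far", "owns_car": "no"}, "bike"),
]


def train(cases):
    """Count target states, and input states per target state."""
    target_counts = Counter(target for _, target in cases)
    # input_counts[target][attribute][state] -> number of cases
    input_counts = defaultdict(lambda: defaultdict(Counter))
    for inputs, target in cases:
        for attribute, state in inputs.items():
            input_counts[target][attribute][state] += 1
    return target_counts, input_counts


def predict(model, inputs, states_per_attribute=2):
    """Return normalized probabilities of each target state for a new case."""
    target_counts, input_counts = model
    total = sum(target_counts.values())
    scores = {}
    for target, count in target_counts.items():
        score = count / total  # prior probability of the target state
        for attribute, state in inputs.items():
            # P(state | target), with add-one smoothing to avoid zeros.
            seen = input_counts[target][attribute][state]
            score *= (seen + 1) / (count + states_per_attribute)
        scores[target] = score
    norm = sum(scores.values())
    return {target: score / norm for target, score in scores.items()}


model = train(CASES)
probabilities = predict(model, {"commute": "near", "owns_car": "no"})
print(probabilities)
```

Training is a single pass of counting, which is why the algorithm builds models so quickly.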

Decision Trees

The most popular data mining algorithm, Decision Trees predicts both discrete and continuous variables. It uses the discrete input variables to split the tree into nodes in such a way that each node is purer in terms of the target variable; that is, each split leads to nodes where a single state of the target variable is represented better than the other states.
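The notion of purity can be illustrated with entropy, one common purity measure. Using entropy here is my assumption; products may use other scores. The cases and the function names `split_entropy` and `best_split` are hypothetical.

```python
import math
from collections import Counter


def entropy(targets):
    """Shannon entropy of target states; 0 means a perfectly pure node."""
    total = len(targets)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(targets).values()
    )


def split_entropy(cases, attribute, target):
    """Weighted entropy of the child nodes after splitting on one attribute."""
    children = {}
    for case in cases:
        children.setdefault(case[attribute], []).append(case[target])
    return sum(len(c) / len(cases) * entropy(c) for c in children.values())


def best_split(cases, attributes, target):
    """Pick the input attribute whose split yields the purest child nodes."""
    return min(attributes, key=lambda a: split_entropy(cases, a, target))


# Hypothetical cases with two discrete inputs and one discrete target.
CASES = [
    {"region": "north", "gender": "f", "buyer": "yes"},
    {"region": "north", "gender": "m", "buyer": "yes"},
    {"region": "south", "gender": "f", "buyer": "no"},
    {"region": "south", "gender": "m", "buyer": "no"},
]

print(best_split(CASES, ["region", "gender"], "buyer"))
```

In this toy data, splitting on `region` separates buyers from non-buyers completely, while splitting on `gender` leaves both child nodes as mixed as the parent. A full tree repeats this choice recursively in every child node.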

Regression Trees

For continuous predictable variables, you get a piecewise multiple linear regression formula with a separate formula in each node of a tree. Discrete input variables are used to split the tree into nodes. A tree that predicts continuous variables is a Regression Tree. Use Regression Trees to estimate a continuous variable; for example, a bank might use this technique to estimate the family income of a loan applicant.

Linear Regression

Predicts continuous variables, using a single multiple linear regression formula. The input variables must be continuous as well. Linear Regression is a simple case of a Regression Tree, a tree with no splits. Use it for the same purpose as Regression Trees.
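The relationship between the two rows above can be sketched with one continuous input. This is a simplification under assumptions: the data is made up, `fit_line` uses ordinary least squares for a single input rather than a multiple regression, and `fit_per_group` stands in for a tree with exactly one split on a discrete input.

```python
def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
        (x - mean_x) ** 2 for x, _ in points
    )
    return mean_y - slope * mean_x, slope


def fit_per_group(cases):
    """One-split 'regression tree': a separate formula for each state of
    the discrete input variable."""
    groups = {}
    for group, x, y in cases:
        groups.setdefault(group, []).append((x, y))
    return {group: fit_line(points) for group, points in groups.items()}


# Hypothetical cases: (area, years employed, income in thousands).
CASES = [
    ("urban", 1, 30), ("urban", 2, 40), ("urban", 3, 50),
    ("rural", 1, 20), ("rural", 2, 25), ("rural", 3, 30),
]

# No splits: a single formula for all cases (Linear Regression).
single_formula = fit_line([(x, y) for _, x, y in CASES])
# One split on the discrete input: one formula per node (Regression Tree).
formulas_per_node = fit_per_group(CASES)

print(single_formula)
print(formulas_per_node)
```

The single formula has to compromise between the two groups, while the per-node formulas follow each group closely.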

Neural Network

This algorithm comes from artificial intelligence, but you can use it for predictions as well. Neural networks search for nonlinear functional dependencies by performing nonlinear transformations on the data in layers, from the input layer through hidden layers to the output layer. Because of the multiple nonlinear transformations, neural networks are harder to interpret than Decision Trees.

Logistic Regression

Just as Linear Regression is a Regression Tree with no splits, Logistic Regression is a Neural Network without any hidden layers.
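A network without hidden layers reduces to a single output neuron: a weighted sum of the inputs passed through a sigmoid function. The sketch below trains such a neuron with plain stochastic gradient descent. The training cases, the learning rate, and the number of epochs are arbitrary assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Single output neuron: weighted sum of the inputs through a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(cases, epochs=2000, rate=0.5):
    """Fit weights with stochastic gradient descent on the log loss."""
    weights = [0.0] * len(cases[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in cases:
            error = predict(weights, bias, inputs) - target
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias


# Hypothetical cases: ([number of support calls], churned yes=1 / no=0).
CASES = [([0.0], 0), ([1.0], 0), ([2.0], 1), ([3.0], 1)]
weights, bias = train(CASES)

print(predict(weights, bias, [0.0]))
print(predict(weights, bias, [3.0]))
```

Adding hidden layers between the inputs and this output neuron would turn the same structure into a full neural network.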

Support Vector Machines

Support Vector Machines are supervised learning models with associated learning algorithms that analyse data and recognize patterns, and they are used for classification. A support vector machine constructs a hyperplane or a set of hyperplanes in a high-dimensional space where the input variables define the dimensions. The hyperplanes split the data points into discrete groups of the target variable. Support Vector Machines are powerful for some specific classification tasks, such as text and image classification and handwritten character recognition.

Sequence Clustering

This algorithm searches for clusters based on a model rather than on the similarity of cases, as Clustering does. The models are defined on sequences of events by using Markov Chains. A typical use of Sequence Clustering is an analysis of your company’s website usage, although you can use this algorithm on any sequential data.
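The building block mentioned above, a Markov Chain, can be estimated from sequences by counting transitions between consecutive events. The sketch below shows only this building block and not the clustering itself; the page names and the function name `transition_probabilities` are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order Markov chain transition probabilities."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair every event with the event that follows it.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for state, nexts in counts.items()
    }


# Hypothetical website visits, each a sequence of page views.
VISITS = [
    ["home", "products", "cart"],
    ["home", "products", "home"],
    ["home", "about"],
]

chain = transition_probabilities(VISITS)
print(chain["home"])
```

A sequence clustering model would keep one such set of transition probabilities per cluster and assign each visit to the cluster whose chain explains it best.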

Time Series

You can use this algorithm to forecast continuous variables. The Time Series name often denotes two different internal algorithms. For short-term forecasting, the Auto-Regression Trees (ART) algorithm is used. For long-term prediction, the Auto-Regressive Integrated Moving Average (ARIMA) algorithm is used.
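The auto-regressive idea shared by both algorithms is that a future value is predicted from previous values of the same series. The sketch below shows only the simplest form, a first-order auto-regression fitted with least squares. It is neither ART nor ARIMA, and the series and the function names `fit_ar1` and `forecast` are assumptions made for illustration.

```python
def fit_ar1(series):
    """Fit y[t] = intercept + slope * y[t-1] with ordinary least squares."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


def forecast(series, steps):
    """Forecast the next values, feeding each forecast back in as input."""
    intercept, slope = fit_ar1(series)
    last = series[-1]
    values = []
    for _ in range(steps):
        last = intercept + slope * last
        values.append(last)
    return values


# Hypothetical sales figures, one value per time period.
SALES = [0.0, 2.0, 3.0, 3.5, 3.75, 3.875]

print(forecast(SALES, steps=3))
```

Because every forecast is fed back in as input for the next one, errors accumulate with each step, which is one reason why short-term and long-term forecasting are often handled by different algorithms.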

Conclusion

This brief introduction to data mining should give you an idea of what you could use it for and an overview of which algorithms are appropriate for the business problem you are trying to solve. You have probably noticed that I am not talking about any specific technology here. These popular data mining algorithms are available in many different products. For example, you can find them in SQL Server Analysis Services, Excel with Data Mining Add-ins, R, Azure ML, and more. Please learn how to use them with your specific product by reading its documentation, by reading books that deal with the product, or by attending a course about it.

I hope you are excited enough to read the upcoming posts and to attend some of my presentations at various conferences.

Comments

 

KRK said:

Excellent Introductory article.

Looking forward to the rest of the series.

February 19, 2015 6:41 PM
 

Dejan Sarka said:

Thank you!

Dejan

February 20, 2015 1:55 AM
 

LadyE said:

Great Job here.  

Where does Affinity Grouping occur in data mining?

January 27, 2016 9:07 AM
 

Dejan Sarka said:

Association Rules and also Sequence Clustering; you can even use Decision Trees for affinity grouping.

January 27, 2016 9:48 AM

About Dejan Sarka

Dejan Sarka, MCT and SQL Server MVP, is an independent consultant, trainer, and developer focusing on database and business intelligence applications. His specialties are advanced topics like data modeling, data mining, and data quality. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of fourteen books about databases and SQL Server. He has also developed, and continues to develop, many courses and seminars for SolidQ, Microsoft and Pluralsight. He has been a regular speaker at many conferences worldwide for more than 15 years, including Microsoft TechEd, PASS Summit and others.
