THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Dejan Sarka

Data Mining Algorithms – Association Rules

The Association Rules algorithm is specifically designed for use in market basket analyses. This knowledge can additionally help in identifying cross-selling opportunities and in arranging attractive packages of products. This is the most popular algorithm used in web sales. You can even include additional discrete input variables and predict purchases over classes of input variables.

Association Rules Basics

The algorithm considers each attribute/value pair (such as product/bicycle) as an item. An itemset is a combination of items in a single transaction. The algorithm scans through the dataset trying to find itemsets that tend to appear in many transactions. Then it expresses the combinations of the items as rules (such as “if customers purchase potato chips, they will purchase cola as well”).

Often association models work against datasets containing nested tables, such as a customer list followed by a nested purchases table. If a nested table exists in the dataset, each nested key (such as a product in the purchases table) is considered an item.

Understanding Measures

Besides the itemsets and the rules, the algorithm also return some measures for the itemsets and the rules. Imagine the following transactions:

  1. Transaction 1: Frozen pizza, cola, milk
  2. Transaction 2: Milk, potato chips
  3. Transaction 3: Cola, frozen pizza
  4. Transaction 4: Milk, pretzels
  5. Transaction 5: Cola, pretzels

The Association Rules measures include:

  • Support, or frequency, means the number of cases that contain the targeted item or combination of items. Therefore, support is a measure for the itemsets.
  • Probability, also known as confidence, is a measure for the rules. The probability of an association rule is the support for the combination divided by the support for the condition. For example, the rule "If a customer purchases cola, then they will purchase potato chips" has a probability of 33%. The support for the combination (potato chips + cola) is 20%, occurring in one of each five transactions. However, the support for the condition (cola) is 60%, occurring in three out of each five transactions. This gives a confidence of 0.2 / 0.6 = 0.33 or 33%.
  • Importance is a measure for both, itemsets and rules. When importance is calculated for an itemset, then when importance equals one, the items in the itemset are independent. If importance is greater than one, then the items are positively correlated. If importance is lower than one, then the items are negatively correlated. When importance is calculated for a rule “If {A} then {B}, then the value zero means there is no association between the items. Positive importance means that the probability for the item {B} goes up when the item {A} is in the basket, and negative importance of the rule means that the probability for the item {B} goes down when the tem {A} is in the basket.

Common Business Use Cases

You use the Association Rules algorithm for market basket analyses. You can identify cross-selling opportunities or arrange attractive packages. This is the most popular algorithm used in web sales.

You can even include additional input variables and predict purchases over classes of input variables.

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

KKline said:

Thank you for sharing, Dejan! I think SQL Server people, in general, need more education on algorithms.

How would this normally be implemented using SQL?

Best regards,

-Kev

March 24, 2015 11:49 AM
 

Dejan Sarka said:

Kevin,

First of all, thank you for your kind comment.

In SQL, you typically search for distinct combinations of items in the same transaction with either join or apply operator. I prefer apply. Bellow is an example that finds itemsets of size 1, 2, and 3. However, I would not recommend doing this in SQL - why would you reinvent the wheel? You get so much more from data mining, and it is simple to use. Anyway, here is an example:

USE AdventureWorksDW2012;

GO

-- Creating a permanent table from dbo.vAssocSeqLineItems

SELECT *

INTO tAssocSeqLineItems

FROM dbo.vAssocSeqLineItems;

GO

-- Index

CREATE INDEX ncl_

ON dbo.tAssocSeqLineItems (OrderNumber);

GO

-- Query to check on order with 8 line items

SELECT t1.OrderNumber, t1.Model AS Model1

FROM dbo.tAssocSeqLineItems AS t1

WHERE OrderNumber = N'SO58845'

ORDER BY t1.OrderNumber, t1.Model;

GO

-- Frequency of itemsets with a single model

SELECT Model, COUNT(*) AS Support

FROM dbo.tAssocSeqLineItems

GROUP BY Model

ORDER BY Support DESC;

GO

-- Frequency of itemsets with two models

-- JOIN

SELECT t1.Model AS Model1,

t2.Model AS Model2,

COUNT(*) AS Support

FROM dbo.tAssocSeqLineItems AS t1

INNER JOIN dbo.tAssocSeqLineItems AS t2

 ON t1.OrderNumber = t2.OrderNumber

    AND t1.Model > t2.Model

GROUP BY t1.Model, t2.Model

ORDER BY Support DESC;

-- APPLY

WITH Pairs_CTE AS

(

SELECT t1.OrderNumber,

t1.Model AS Model1,

t2.Model2

FROM dbo.tAssocSeqLineItems AS t1

CROSS APPLY

 (SELECT Model AS Model2

  FROM dbo.tAssocSeqLineItems

  WHERE OrderNumber = t1.OrderNumber

    AND Model > t1.Model) AS t2

)

SELECT Model1, Model2, COUNT(*) AS Support

FROM Pairs_CTE

GROUP BY Model1, Model2

ORDER BY Support DESC;

GO

-- Frequency of itemsets with three models

WITH Pairs_CTE AS

(

SELECT t1.OrderNumber,

t1.Model AS Model1,

t2.Model2

FROM dbo.tAssocSeqLineItems AS t1

CROSS APPLY

 (SELECT Model AS Model2

  FROM dbo.tAssocSeqLineItems

  WHERE OrderNumber = t1.OrderNumber

    AND Model > t1.Model) AS t2

),

Triples_CTE AS

(

SELECT t2.OrderNumber,

t2.Model1,

t2.Model2,

t3.Model3

FROM Pairs_CTE AS t2

CROSS APPLY

 (SELECT Model AS Model3

  FROM dbo.tAssocSeqLineItems

  WHERE OrderNumber = t2.OrderNumber

    AND Model > t2.Model1

AND Model > t2.Model2) AS t3

)

SELECT Model1, Model2, Model3, COUNT(*) AS Support

FROM Triples_CTE

GROUP BY Model1, Model2, Model3

ORDER BY Support DESC;

GO

-- Clean up

DROP TABLE dbo.tAssocSeqLineItems;

GO

Best regards,

Dejan

March 25, 2015 2:12 PM

Leave a Comment

(required) 
(required) 
Submit

About Dejan Sarka

Dejan Sarka, MCT and SQL Server MVP, is an independent consultant, trainer, and developer focusing on database & business intelligence applications. His specialties are advanced topics like data modeling, data mining, and data quality. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of fourteen books about databases and SQL Server. Dejan Sarka also developed and is developing many courses and seminars for SolidQ, Microsoft and Pluralsight. He is a regular speaker at many conferences worldwide for more than 15 years, including conferences like Microsoft TechEd, PASS Summit and others.

This Blog

Syndication

Privacy Statement