THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Dejan Sarka

Data Mining Algorithms – Decision Trees

Decision Trees is a directed technique. Your target variable is the one that holds information about a particular decision, divided into a few discrete and broad categories (yes / no; liked / partially liked / disliked, etc.). You are trying to explain this decision using other gleaned information saved in other variables (demographic data, purchasing habits, etc.). With limited statistical significance, you are going to predict the target variable for a new case using its known values of the input variables based on results of your trained model.

Recursive partitioning is used to build the tree. The data is split into partitions using a certain value of one of the explaining variables. The partitions are then split again and again. Initially the data is in one big box.

The algorithm tries all possible breaks of both input (explaining) variables for the initial split. The goal is to get purer partitions considering the classes of the target variable. You know intuitively that purity is related to the percentage of the cases in each class of the target variable. There are many better, but more complicated measures of the purity, for example entropy or information gain.

The tree continues to grow using the two new partitions as separate starting points and splitting them more. You have to stop the process somewhere. Otherwise, you could get a completely fitted tree that has only one case in each class. The class would be, of course, absolutely pure. This would not make any sense. The results could not be used for any meaningful prediction. This phenomenon is called “over-fitting”. There are two basic approaches to solve this problem: pre-pruning (bonsai) and post-pruning techniques.

The pre-pruning (bonsai) methods prevent growth of the tree in advance by applying tests at each node to determine whether a further split is going to be useful; the tests can be simple (number of cases) or complicated (complexity penalty). The post-pruning methods allow the tree to grow and then prune off the useless branches. The post-pruning methods tend to give more accurate results, but they require more computation than pre-pruning methods.

Imagine the following example. You have the answers to a simple question: Did you like the famous Woodstock movie? You also have some demographic data: age (20 to 60) and education (ranged in 7 classes from the lowest to the highest). In all, 55% of the interviewees liked the movie and 45% did not like it.

Can you discover factors that have an influence on whether they liked the movie?

Starting point: 55% of the interviewees liked the movie and 45% did not like it.

image

After checking all possible splits, you find the best initial split made at the age of 35.

image

With further splitting, you finish with a full-grown tree. Note that not all branches lead to purer classes. Some of them are not useful at all and should be pruned.

image

Decision trees are used for classification and prediction. Typical usage scenarios include:

  • Predicting which customers will leave
  • Targeting the audience for mailings and promotional campaigns
  • Explain reasons for a decision
  • Answering questions such as “What movies do young female customers buy?”

Decision Trees is the most popular data mining algorithm. This is because the results are very understandable and simple to interpret, and the quality of the predictions is usually very high.

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

No Comments

Leave a Comment

(required) 
(required) 
Submit

About Dejan Sarka

Dejan Sarka, MCT and SQL Server MVP, is an independent consultant, trainer, and developer focusing on database & business intelligence applications. His specialties are advanced topics like data modeling, data mining, and data quality. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of fourteen books about databases and SQL Server. Dejan Sarka also developed and is developing many courses and seminars for SolidQ, Microsoft and Pluralsight. He is a regular speaker at many conferences worldwide for more than 15 years, including conferences like Microsoft TechEd, PASS Summit and others.

This Blog

Syndication

Privacy Statement