You can find many different data modeling resources. It is impossible to list all of them. I selected only the most valuable ones for me, and, of course, the ones I contributed to.
- Chris J. Date: An Introduction to Database Systems – IMO a “must” to understand the relational model correctly.
- Terry Halpin, Tony Morgan: Information Modeling and Relational Databases – meet the object-role modeling leaders.
- Chris J. Date, Nikos Lorentzos and Hugh Darwen: Time and Relational Theory, Second Edition: Temporal Databases in the Relational Model and SQL – all theory needed to manage temporal data.
- Louis Davidson, Jessica M. Moss: Pro SQL Server 2012 Relational Database Design and Implementation – the best SQL Server focused data modeling book I know by two of my friends.
- Dejan Sarka, et al.: MCITP Self-Paced Training Kit (Exam 70-441): Designing Database Solutions by Using Microsoft® SQL Server™ 2005 – SQL Server 2005 data modeling training kit. Most of the text is still valid for SQL Server 2008, 2008 R2, 2012 and 2014.
- Itzik Ben-Gan, Lubor Kollar, Dejan Sarka, Steve Kass: Inside Microsoft SQL Server 2008 T-SQL Querying – Steve wrote a chapter with mathematical background, and I added a chapter with theoretical introduction to the relational model.
- Itzik Ben-Gan, Dejan Sarka, Roger Wolter, Greg Low, Ed Katibah, Isaac Kunen: Inside Microsoft SQL Server 2008 T-SQL Programming – I added three chapters with theoretical introduction and practical solutions for the user-defined data types, dynamic schema and temporal data.
- Dejan Sarka, Matija Lah, Grega Jerkič: Training Kit (Exam 70-463): Implementing a Data Warehouse with Microsoft SQL Server 2012 – my first two chapters are about data warehouse design and implementation.
- Data Modeling Essentials – I wrote a 3-day course for SolidQ. If you are interested in this course, which I could also deliver in a shorter seminar way, you can contact your closes SolidQ subsidiary, or, of course, me directly on addresses firstname.lastname@example.org or email@example.com. This course could also complement the existing courseware portfolio of training providers, which are welcome to contact me as well.
- Logical and Physical Modeling for Analytical Applications – online course I wrote for Pluralsight.
- Working with Temporal data in SQL Server – my latest Pluralsight course, where besides theory and implementation I introduce many original ways how to optimize temporal queries.
- Forthcoming presentations
- SQL Bits 12, July 17th – 19th, Telford, UK – I have a full-day pre-conference seminar Advanced Data Modeling Topics there.
My third Pluralsight course, Working with Temporal Data in SQL Server, is published. I am really proud on the second part of the course, where I discuss optimization of temporal queries. This was a nearly impossible task for decades. First solutions appeared only lately. I present all together six solutions (and one more that is not a solution), and I invented four of them. http://pluralsight.com/training/Courses/TableOfContents/working-with-temporal-data-sql-server
My previous blog post “Truncate Table – DDL or DML Statement?” got quite a few comments. Now I am continuing with a similar discussion: is the ALTER TABLE SWITCH [PARTITION] DDL (Data Definition Language) or DML (Data Manipulation Language) statement? At least it is clear that this is not a Data Control Language (DCL) statement.
Again, I can find a lot of arguments why this would be a DDL statement. First of all, we all know the classical categorization of the statements:
· DDL includes CREATE, ALTER, and DROP statements
· DCL includes GRANT, REVOKE, and DENY statements
· DML includes SELECT, INSERT, UPDATE, DELETE, and MERGE statements.
SQL Server changes system pages when you use this statement. In addition, you need elevated permissions to use the statement. Clearly, from the syntax perspective, it is a DDL statement.
However, logically, you just move the data from one table or partition to another table or partition. You do not change the schema at all. Therefore, semantically, this is a DML statement. And again, in my opinion, the logical perspective is the most important here, because this is the main point of the Relational Model: work with it from the logical perspective, and leave the physical execution to the underlying system.
Let me show how this statement works. I start with the clean-up code, if some of the objects I am going to create in the tempdb system database already exist.
Next, let me create the partition function and the partition scheme, create a demo table dbo.FactInternetSales and populate it with the data from the dbo.FactInternetSales from the AdventureWorksDW2012 demo database. I will also create two additional tables, one for the new data load, and one for the data from the oldest partition of the dbo.FactInternetSales table.
Note that the table for the new data includes a check constraints that guarantees that all of the data can be switched to a single partition of the partitioned table. I am loading the dbo.FactInternetSalesNew table with the last data you can find in the demo database, internet sales for year 2008. Let me check the data in all three tables, and also all partitions of the partitioned table.
If you check the results, you can see that there is data in three partitions of the partitioned table, and in the table for the new data. Next step is to switch the data from the new data table to a single partition of the partitioned table.
Can I do the same thing with a single DML statement? The TRUNCATE TABLE works logically similarly to the DELETE statement without the WHERE clause if a table is without a trigger and without the identity property. Does something similar exist for the ALTER TABLE SWITCH [PARTITION] statement? The answer is, of course, yes. You can use the composable DML statements with the OUTPUT clause. With the next statement, I am moving all of the data from the oldest partition of the partitioned table to the table created for the old data.
Let’s check where the data is now, and if schema has anyhow changed.
If you execute the statements, you can clearly see that the effect of the last composable DML statement was completely the same as the effect of the ALTER TABLE SWITCH [PARTITION] statement. The schema did not change a bit. Therefore, the ALTER TABLE SWITCH [PARTITION] is clearly a DML statement.
Many times, categories of concepts and things overlap. It can be hard to categorize some items in a single category. The SQL TRUNCATE TABLE statement is an example of an item that is not so easy to categorize. Is it a DDL (Data Definition Language) or DML (Data Manipulation Language) statement?
There is an ongoing discussion about this topic. However, if you quickly bingle for this question, you get the impression that the majority is somehow leaning more toward defining the TRUNCATE TABLE statement as a DDL statement. For example, Wikipedia clearly states: “In SQL, the
TRUNCATE TABLE statement is a Data Definition Language (DDL) operation that marks the extents of a table for deallocation (empty for reuse).” Disclaimer: please note that I do not find Wikipedia as the “ultimate, trustworthy source” – I prefer sources that are signed!
Some of the reasons why many people define the statement as a DDL statement include:
- It requests schema locks in some systems
- It is not possible to rollback it in some systems
- It does not include a WHERE clause
- It does not fire triggers in some systems
- It resets the autonumbering column value in some systems
- It deallocates system pages directly, not through an internal table operation
- and more.
On the other hand, it looks like there is only one reason to treat the statement as a DML statement:
- Logically, you just get rid of the data, like with the DELETE statement.
Even the Wikipedia article that I referred to says “The
TRUNCATE TABLE mytable statement is logically (though not physically) equivalent to the
DELETE FROM mytable statement (without a
Like many times, I have to disagree with the majority. I understand that the categorization is somehow confusing, and might even be overlapping. However, the only reason for categorizing the TRUNCATE TABLE statement in the DML category is “THE” reason in my understanding. One of the most important ideas in the Relational Model is the separation between the logical and the physical level. We, users, or people, if you wish, are manipulating with data on the logical level; the physical implementation is left to the database management system. And this is the important part – logically, when you truncate table, you don’t care how this statement is implemented internally, you just want to get rid of the data. It really does not matter what kind of locks a system uses, does it allow WHERE clause or not, etc. The logical point is what matters. Therefore, I would categorize the TRUNCATE TABLE statement as a DML statement.
Of course, this is a purely theoretical question, and is really not important for your practical implementation. As long as your app is doing what it should do, you don’t care too much about these nuances. However, IMO in general there is not enough of theoretical knowledge spread around, and therefore it makes sense to try to get the correct understanding.
But there is always a “but”. Of course, I have immediately another question. What about the ALTER TABLE mytable SWITCH [PARTITION…] TO… statement? ALTER statements have been defined as DDL statements forever. however, again, logically you are just moving the data from one table to another. Therefore – what? What do you think?
Why wouldn’t you make a nice trip in Central Europe? Besides spending some nice days in beautiful cities, you have also a business-type excuse for the trip. You can join the two SQL Saturday events. The first one is going to take place in Budapest, March 1st, and the second one in Vienna, March 6th.
I am very happy to meet the SQL community again, and proud to speak at both events. Although I would prefer to have the dates turned around. The two sessions selected are Temporal Data in SQL Server, which I will preset in Vienna, and Optimizing Temporal Queries, which I will present in Budapest. The sessions are more or less independent; however, I see the Vienna session as a prequel to the Budapest one.
Come, learn and enjoy Central Europe!
It is hard to imagine searching for something on the Web without modern search engines like Bing or Google. However, most contemporary applications still limit users to exact searches only. For end users, even the standard SQL LIKE operator is not powerful enough for approximate searches. In addition, many documents are stored in modern databases; end users would probably like to get powerful search inside document contents as well. Text mining is also becoming more and more popular. Everybody would like to understand data from blogs, Web sites, and social media. Microsoft SQL Server in versions 2012 and 2014 enhances full-text search support that was available in previous editions substantially. Semantic Search, a new component in Full-Text Search, can help you understand the meaning of documents. Finally, the Term Extraction and Term Lookup components from SQL Server Integration Services also helps with text analysis.
I am proud to introduce my second course I authored for Pluralsight - Indexing, Querying and Analyzing Text with SQL Server 2012-2014. You can learn how to get the most out of your texts with SQL Server tools. Enjoy!
As a speaker, I would like to help promoting the conferences where I am proud to present. Of course, this does not mean that other conferences are any worse. Anyway, here is the list of the conferences where I am going to speak, and I would like to invite you to join me.
- January 21st, NTK Zima 2014, Ljubljana, Slovenia
- February 5th, 24 Hours of PASS: Business Analytics Edition, online event
- March 1st, SQL Saturday #278, Budapest, Hungary
- March 6th, SQL Saturday #280, Vienna, Austria
- March 31st – April 4th, DevWeek 2014, London, UK
- April 9th – April 10th, NTK Pomlad 2014, Bled, Slovenia
- April 12th, SQL Saturday #267, Lisbon, Portugal
- May 7th – May 9th, PASS Business Analytics Conference, San Jose, CA, USA
Looking forward to all of these events. I hope we will meet in at least one of them!
UNION, INTERSECT and EXCEPT operators are commonly called Set Operators. For example, in Books Online you can find a topic “Set Operators”, where these three operators are explained. They should represent set operations UNION, INTERSECT and MINUS (synonym for EXCEPT DISTINCT). Also Wikipedia has a topic called “SET OPERATIONS (SQL)”, where these three operators are introduced. And these operators are commonly represented by Venn diagrams. Logically, Venn diagrams are also called Set diagrams. Here are the three operators presented with Venn diagrams:
However, is the name “Set Operators” really correct? The first question I asked myself was very simple: why would we have 10 and more relational operator and three set operators in the relational algebra? Well, makes no sense. The relational algebra comprise relational operators only, of course.
So what exactly is a relation? A relation is a special kind of set, set of entities that
are related, i.e that are of the same kind. How do we know that the two entities are of the same kind and can thus be grouped in a single entity set, i.e. in a relation? Of course, two entities are of the same kind if they have the same attributes. Therefore, every relation is a set; however, not every set is a relation.
Set operators work on sets and produce a set. Relational operators work on relations and produce a relation. SQL operators UNION, INTERSECT and EXCEPT produce relations, i.e. special kind of sets. Set operator UNION can combine a set of differential equations and a set of hammers into a single set. Relational operator UNION can’t combine a relation of differential equations and a relation of hammers into a single relation, because elements of these two relations have nothing in common. And don’t think that if you take only keys of both relation, and both have a single-column integer key, that a UNION of this would be a relation. First of all, you can do such an union because we don’t use strong types in a relational database (each key should be of its own type – in this case, you should have a “hammer” and a “differential equation” key types, which would disallow such operations). In addition, in the case I mentioned, you would get two
a relations that has with a single attribute, the key only, which would probably be meaningless from the business perspective. A relation without a meaningful attribute is not really an entity set, as defined by Peter Chen. An entity is something we can identify and is of interest. If we don’t have any real attribute, then this “thing” (whatever) is definitely not an entity, because without attributes it can’t be of any interest.
To summarize: SQL UNION, INTERSECT and EXCEPT are simply relational operators. Talking about them as of set operators is at least imprecise. However, representing them with Venn diagrams is not just imprecise, it is wrong. Here is a better presentation of there three relational operators.
I had opportunity to read the SQL Server 2012 Reporting Services Blueprints book by Marlon Ribunal (@MarlonRibunal) and Mickey Stuewe (@SQLMickey), Packt Publishing. Here is my short review.
I find the book very practical. The authors guide you right to the point, without unnecessary obstructions. Step by step, you create more and more complex reports, and learn SQL Server Reporting Services (SSRS) 2012 features. If you are more hands-on guy, this is the right book for you. Well, to be honest, there is not much theory in reporting. Reporting is the starting point of business intelligence (BI), with on-line analytical processing as the next step, and data mining as the I in BI.
The book correctly presents all SSRS features. In some chapters and appendices, the authors also went beyond basics. I especially enjoyed advanced topics in chapter 5, “Location, Location, Locations!”, appendix A, “SSRS Best Practices”, and appendix B, “Transactional Replication for Reporting Services”. All together, the book is more than sufficient for creating nice reports without much previous knowledge in a short time, and for avoiding common pitfalls at the same time.
The only thing I am missing is a bit more of theory. Yes, as a data mining person, I like to learn things a bit more in depth. I usually don’t deal with presentation; I prefer numbers. And this is my point – I would like to see more guidelines about proper usage of report elements – when to use which graph type, when to use maps, how to properly present different kinds of data…
Anyway, all together, the book is very useful, and I would recommend it to anybody that wants to learn SSRS in a short time.
This is the fifth, the final part of the fraud detection whitepaper. You can find the first part, the second part, the third part, and the fourth part in my previous blog posts about this topic.
In my original fraud detection whitepaper I wrote for SolidQ, I was advised by my friends to include some concrete and simple numbers to calculate the return on investment (ROI) in a language that managers can understand. with some customers, we really managed to get very impressive numbers. However, I am not repeating this calculation here. However, this is my personal blog. Therefore, I am writing my personal opinion here.
I am kind of bored with this constant requests to show simple numbers that managers can understand. Personally, I don’t think that managers are that stupid that they would not understand anything beyond primary school mathematics. And even if some of them are that dumb, I don’t care, as they can’t become my customers. They would never be able to understand the value of such an advanced technique like data mining.
I am pretty sure that the vast majority of managers can calculate approximate ROI by themselves, and also better than I can do. They definitely know their business better than I can do, and already know how much money they are losing because of frauds and how many frauds they are already preventing or catching early. In addition, I am pretty sure that most of the managers do understand the value of learning, and appreciate building of the learning infrastructure.
Therefore, in short, I am leaving to you, to the reader, to evaluate what can you expect from implementing a fraud detection continuous learning cycle. And thank you for understanding my point!
Fraud detection is a very popular, albeit very complex, data mining task. I have developed my own approach to fraud detection. The most significant element of this approach is the continuous learning cycle.
Although Microsoft SQL Server is not the most popular tool for data mining, I am using it. The SQL Server suite gives us all of the tools we need, and because all of the tools come from a single suite, they work perfectly together, thus substantially lowering the time needed to bring a project from the initial meeting to a production-ready deployment.
Another advantage of my approach is the mentoring with the knowledge transfer. It is not my intention to get permanent consulting contracts; I want to progress together with my customers. Once we finish the project, or sometimes even as soon as we finish the POC project, the customer can begin using and continue improving the fraud detection system constantly with the help of the continuous learning infrastructure.
Finally, due to Microsoft’s licensing policies, the customers that already possess Microsoft SQL Server Standard Edition or higher and Microsoft Excel, do not need to purchase any additional licenses.
I could simply stop here. However, I want to mention again everybody involved in this, and also some who were unfortunately missing.
First of all, PASS is the organization that defined SQL Saturdays. And apparently the idea works
I have to thank again to all of the speakers. Coming to share your amazing knowledge is something we really appreciate. The presentations were great, from the technical and other perspectives.
Of course, we could not do the event without sponsors. I am not going to enlist all of them again; I will just mention the host, pixi* labs, the company that hosted the event and who's employees helped with all of the organization. In addition, I need to mention Vina Kukovec. Boštjan Kukovec, an old member of Slovenian SQL Server and Developers users group, organized free wine tasting after the event. And what a wine it is!
Finally, thanks to all attendees for coming. We had approximately 85% show up rate; only 15% or the registered attendees didn’t come. This is an incredible result, worldwide! And from the applause after the raffle, when we closed the vent (and started wine tasting), I conclude that the attendees were very satisfied as well.
I want to mention three people that wanted to come, but run out of luck this time. Peter Stegnar from pixi* labs was the one that immediately offered the venue, and permeated with his enthusiasm also other pixi* labs members. Due to family reasons he couldn’t join us to see the results of his help. Tobiasz Janusz Koprowski wanted to speak, organized his trip, looked forward to join us; however, just couple of days before the event he had to cancel because of some urgent work at customer’s site. And what to say about Kevin Boles? He really tried hard to come. Think of it, he was prepared to come from USA! he was already on the airport, when his flight got cancelled due to technical problems. We were in constant touch Friday evening. He managed to change the flight, went to the gate, but was not admitted to the plane. because there was only seven minutes left till take off. Catching next flights would not make any sense anymore, because he would come too late anyway. He really did the best he could do, he just didn’t have enough luck this time. Peter, Tobiasz, and Kevin, thank you for your enthusiasm, we seriously missed you, and we certainly hope we will meet on our next event!
I would like to expose all of the sponsors that enabled the event. The following companies provided invaluable help to us, and therefore we are thanking them. I think they deserve this exposure.
SQL Saturday #274 Slovenia is full (150 registered attendees) for more than a week There are still some people in the waiting list. We sent an e-mail to registered attendees asking them to unregister if they already know they can’t make it. I have to thank to all of those who followed this advice and made room for part of the people from the waiting list. Apparently, we will really have a full house, and very low no show rate.
I am positively surprised with everything so far. Fantastic support from PASS, immediate help from sponsors, and a huge number of speakers prepared to spend their own money and time and come over to present high-end topics, and also passionate attendees who registered far in advance and are fair enough to release their place for somebody who can come more surely than they can. Let me also mention that we got classrooms for free, that we got minimal price for lunch, and we even got a free vine tasting during the event. Fantastic response from PASS, SQL Server Community, from sponsors, and from my fellow countrymen from Slovenia!
See you all on Saturday. We will all enjoy, this is a promise!
This is the fourth part of the fraud detection whitepaper. You can find the first part
, the second part
, and the third part
in my previous blog posts about this topic.
We create multiple mining models by using different algorithms, different input data sets, and different algorithm parameters. Then we evaluate the models in order to find the most appropriate candidates for the actual deployment to production.
Many different algorithms can be used for fraud detection; it is difficult to say which one would generally yield the best result. In a project, the available algorithms are typically chosen, based on experience and the knowledge about the given domain. Because we use the Microsoft SQL Server suite, we use Microsoft Decision Trees, Microsoft Neural Network, and Microsoft Naïve Bayes directed algorithms, and Microsoft Clustering for the undirected one. In recent years, the Support Vector Machines methods are becoming more and more popular. SSAS does not bring this algorithm out of the box. However, it can be downloaded as a free plug-in algorithm for SSAS from the Microsoft CodePlex site at
Valkonet, J. (2008). Support Vector Machine plug-in in Analysis Services. Retrieved from Microsoft CodePlex: http://svmplugin.codeplex.com/
Of course, if there are time and software policy constraints that prevent us from using this download, we simply skip it. We do not lose much, because, according to
Sahin Y., & Duman E. (2011). Detecting Credit Card Fraud by Decision Trees and Support Vector Machines. Proceedings of the International MultiConference of Engineers and Computer Scientists 2011 Vol 1. Hong Kong: IMECS.,
the Decision Tress algorithm usually yields better results in fraud detection analysis than Support Vector Machines. For details on specific data mining algorithms, please refer to
Han J., Kamber M., & Pei J. (2011). Data Mining: Concepts and Techniques, Third Edition. Morgan Kaufmann,
or to the SolidQ course
Sarka D. (2012). Data Mining with SQL Server 2012. SolidQ. Retrieved from http://www.solidq.com/squ/courses/Pages/Data-Mining-with-SQL-Server-2012.aspx.
We evaluate the efficiency of different supervised models by using standard techniques, namely the Lift Chart, the Classification Matrix, and Cross Validation. All of these techniques are built into the SSAS data mining feature and are described in more detail in
MacLennan J., Tang Z., & Crivat B. (2009). Data Mining with Microsoft SQL Server 2008. Wiley.
To evaluate the Clustering models, we have developed a technique of our own, implementing entropy. If the individual clusters are homogenous, the entropy in any given cluster must be low. We calculate the average entropy and the standard deviation of the entropy across all clusters. In a SSAS Clustering model that has been trained (or processed), it is possible to read the model data with DMX queries. In the cluster notes we can identify the distribution of the input variables, and then use it to can calculate the entropy.
From experience, we have learned that not all algorithms are equally useful for all data sets. The Microsoft Neural Network algorithm works best when the frequency of the target state (i.e. fraud) is about 50%. Microsoft Naïve Bayes can work well when the target state is represented by approximately 10% of the population or more. However, Microsoft Decision Trees work well even if the target state frequency is only about 1%, and is thus a very suitable algorithm for small data sets and low frequency of the target state as well.
The continuous learning cycle is shown graphically in Figure 1.
Figure 1: The continuous learning cycle
We start by creating the directed models, assuming that the customer has already flagged frauds in the existing data. We evaluate the directed models and then use the best one to predict the frauds in the new data. We also create the undirected models, evaluate them, and use the best one for selection of potential frauds. We do this over time and check the difference between the number or percentage of frauds caught with the directed and the undirected model deployed. When this difference drops, it is time to refine the directed model. In addition, we store the predictions of both models and the actual, confirmed or reported frauds in a data warehouse. When the percentage of the predicted frauds in the total number of frauds drops, it is time to refine both models. We use an OLAP cube on the top of the DW to measure the efficiency of the models over time.
This is the third part of the fraud detection whitepaper. You can find the first part and the second part in my previous blog posts about this topic.
The problem of credit card fraud detection is not trivial. With every transaction processed, only a limited amount of data is available, making it difficult if not impossible to distinguish between a “good” transaction and a (potentially) fraudulent one. In addition, there are literally millions of points of sales and web sites where a single credit card can be used. Even additional properties that could be available in the card owner’s profile, such as demographical data, probably will not make things much clearer. Do we focus on the customer or on the credit card? From experience, it seems that the customers who use multiple credit cards typically use each card for a specific purpose. Although this means that we might start by profiling the card, it might also be worth checking the customer’s profile, as this might reveal different habits in different cultures.
It is also quite hard to request specific card properties and/or customer profile properties in advance. Different companies collect, maintain, and have access to different data sets. On the other hand, some common data, like geographical location, time of usage, type of a product purchased, type of a transaction (purchase, cash advance), and similar, can be available to us. A good overview of data, useful for credit card and online banking fraud detection, can be found in
Hand D.J., & Blunt G. (2001, October). Prospecting gems in credit card data. IMA Journal of Management Mathematics.
In addition to the source variables, many calculated variables can be extremely handy. If geographic data is not available, it can often be extracted from IP addresses, ZIP codes, and similar source data. Web addresses also might contain country of origin, or, alternatively, business type. Universal product codes (UPCs) contain country of origin and the company that produces it. Many variables can be calculated from the time of the transaction and other data, for example:
- A flag designating whether multiple transactions have been issued from different IPs and the same person in a particular time frame
- A flag designating whether transactions from multiple persons and the same IP have been issued in a particular time frame
- Whether there are multiple persons using the same credit card or user account
- Whether the total amount of a transaction is near the maximum amount allowed for a particular type of transaction, or whether it is nearer the minimum amount
- The time of day could be significant: is the day a holiday, a weekday, or was the transaction issued on a weekend, or a particular day during the month
- The frequency of transactions in a moving time frame
- The number of distinct transactions in a moving time frame (often, the same kind of a transaction could be repeated regularly)
- The quantity of deviations from a moving average for the type of the transaction
We could address the problem with number of detected frauds that is too low in all of the transactions by oversampling, by repeating, or copying, known fraudulent transactions, or by undersampling, by lowering the number of non-fraudulent transactions in the sample used for model training. From experience we have learned to prefer the approach of undersampling. We select non-fraudulent transactions either with simple random sampling, or when we have clearly determined groups of transactions, with stratified sampling. For example, it might be obvious that there are significant differences in the patterns across different countries; by using countries or regions we can define different strata and then randomly select appropriate numbers of cases from each stratum separately. You can read more about sampling at
Wikipedia. (n.d.). Sample size determination. Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Sample_size_determination.
Different data mining algorithms are more or less prone to giving erroneous predictions when the target state is presented with a low enough frequency. With SSAS algorithms, we usually observe the following:
- The Microsoft Neural Networks algorithm works best when you have about 50% of frauds in the total sample data set
- The Microsoft Naïve Bayes algorithm already works well with 10% frauds
- The Microsoft Decision Trees algorithm even works well with only 1% of frauds
During the data preparation, we also have to take care of missing values and outliers. Missing values can have a seriously negative effect on a data mining project. However, if the number of missing values is small enough, they can be handled by using a variety of methods:
- Do nothing (a simple, but rarely a valid approach)
- Filter out the rows containing the missing data (note that we could also filter out too many rows and lose a pattern)
- Ignore the column (note that we could also ignore too many columns)
- Predict the missing values with data mining algorithms, like Decision Trees (note that we could lose variability)
- Build separate models, for example, one model for all the data (including missing values if the algorithm can handle them), one model for known data, and one model for rows with missing values (this does represent quite a lot of additional work)
- Modify the operational systems so that the missing values can be collected later (this represents the best alternative, but unfortunately it is also the most difficult to achieve)
- Replace the missing data with a mean (this is a very popular technique, although we could lose variability again)
Whenever we make any changes to the data, we are influencing the analysis. Before making any changes, we should determine whether there are any patterns in the missing data. We use data mining for this analysis. For instance, we add a flag with a value of 1 for a row that includes missing values or the value of 0 for rows where all of the variable values for the case are known and present. Then we use a predictive algorithm like Decision Trees to explain this new flag variable with other input variables. The resulting tree should be very shallow, without any strong patterns; otherwise, we have identified a pattern in the missing values. If a pattern has been found, it should be explained, and then we should use an appropriate missing value handling technique that does not alter the data (e.g., build separate models). Otherwise, we prefer to filter the rows with missing values.
Outliers are rare and far out-of-bound values. They are so far out of bound that they can influence the results of the analyses. Similarly to handling missing values, we determine whether there is any pattern in the outliers before handling them. We can do one of the following to address outliers:
- Check if the outlier is an erroneous value, and if is, correct it (the best possibility)
- Do nothing (a simple, but rarely a valid approach)
- Filter out the rows with the outliers (note that we could end up filtering out too many rows and lose a pattern)
- Ignore the column (note that we could ignore too many columns)
- Replace outliers with common (mean) values (note that, of course, we are losing variability)
- Bin values into equal height ranges (this is a good approach, especially for algorithms that use discrete input variables)
- Normalize the data values in predefined limited ranges
Some well-known methods of normalization include:
- Range normalization.
- Z-score normalization
- The logistic (sigmoid) function normalization
- The hyperbolic tangent function normalization
Discretization (or binning, or categorization, or recoding) is also useful for other purposes, not only for dealing with outliers. For example, some algorithms, for instance the Microsoft Naïve Bayes algorithm, can accept only discrete input variables. Discretization is performed on a single column. Examples of discretization include:
- Transaction amount
Note that with proper discretization we can compare otherwise incomparable data. For example, if one country income per capita is significantly higher than in another country, then it is difficult to compare the amounts of the transactions directly. However, if we discretize the income into three groups, like low, average and high, and appropriately tailor the group boundaries per country, we get comparable data. Please refer to
Pyle D. (1999). Data Preparation for Data Mining. Morgan Kaufmann
for further reading about data preparation for data mining.
As already mentioned, data overview activities interleave with the data preparation. In order to find outliers, we must get the idea of the distribution of a variable. We can use Microsoft Office Excel Pivot Tables and Pivot Graphs for this task. However, many times it is faster to use statistical computations and interpret the results. With Transact-SQL queries, we can calculate a lot of useful statistical information.
For a quick overview of discrete variables, we can use frequency tables. In a frequency table, we can show values, the absolute frequency of those values, absolute percentages, cumulative frequency, cumulative percent, and a histogram of the absolute percentage. OLAP cubes can be used to establish an overview of the frequency distribution for tens, if not hundreds of variables, very quickly.
For continuous variables, we can use descriptive statistics and calculate the first four population moments: Mean, Standard Deviation, Skewness, and Kurtosis. This gives us a quick impression of the distribution of values of those variables.
It is also worth checking linear dependencies between pairs of variables. Some algorithms, like the Microsoft Decision Trees algorithm, tend to exclude one variable from the dependent pair in the analysis, while other algorithms, like the Microsoft Clustering algorithm, might find too good a clusters if they use pairs of dependent variables. There are multiple methods for calculating these dependencies:
- Chi-Squared test for pairs of discrete variables
- Correlation Coefficient for pairs of continuous variables
- Analysis of variance (ANOVA) for pairs where one variable is continuous per one discrete variable
Based on our experience in the field of fraud detection, we developed a much faster method to test all possible linear dependencies between multiple variables, not just pairs. We use the Microsoft Naïve Bayes data mining algorithm, where we declare all variables as input and predictable at the same time; this way, we can determine all important dependencies with a single analysis. Of course, the Naïve Bayes algorithm expects discrete inputs only; however, SSAS can discretize variables on the fly, using different discretization methods. We usually use the Equal Heights method, even though it is typically not useful for the final analysis, because it changes the shape of the distribution; however, it is extremely useful for the overview of linear dependencies, because it retains the maximum amount of information in the variables.
Variables with an insignificant amount of information are useless in analysis. We measure the amount of information in a variable by calculating the Entropy. As this calculation is quite slow, we do it only for variables, which we suspect contain a low amount of information.
For more details about the statistics mentioned, please refer to
Wonnacott T.H., & Wonnacott R.J. (1990). Introductory Statistics, 5th Edition . Wiley.
For more information on the information theory, please refer to
Kullback S. (1997). Information Theory and Statistics (Dover Books on Mathematics). Dover Publications.