This is the third part of the fraud detection whitepaper. You can find the first part and the second part in my previous blog posts about this topic.
The problem of credit card fraud detection is not trivial. With every transaction processed, only a limited amount of data is available, making it difficult if not impossible to distinguish between a “good” transaction and a (potentially) fraudulent one. In addition, there are literally millions of points of sales and web sites where a single credit card can be used. Even additional properties that could be available in the card owner’s profile, such as demographical data, probably will not make things much clearer. Do we focus on the customer or on the credit card? From experience, it seems that the customers who use multiple credit cards typically use each card for a specific purpose. Although this means that we might start by profiling the card, it might also be worth checking the customer’s profile, as this might reveal different habits in different cultures.
It is also quite hard to request specific card properties and/or customer profile properties in advance. Different companies collect, maintain, and have access to different data sets. On the other hand, some common data, like geographical location, time of usage, type of a product purchased, type of a transaction (purchase, cash advance), and similar, can be available to us. A good overview of data, useful for credit card and online banking fraud detection, can be found in
Hand D.J., & Blunt G. (2001, October). Prospecting gems in credit card data. IMA Journal of Management Mathematics.
In addition to the source variables, many calculated variables can be extremely handy. If geographic data is not available, it can often be extracted from IP addresses, ZIP codes, and similar source data. Web addresses also might contain country of origin, or, alternatively, business type. Universal product codes (UPCs) contain country of origin and the company that produces it. Many variables can be calculated from the time of the transaction and other data, for example:
- A flag designating whether multiple transactions have been issued from different IPs and the same person in a particular time frame
- A flag designating whether transactions from multiple persons and the same IP have been issued in a particular time frame
- Whether there are multiple persons using the same credit card or user account
- Whether the total amount of a transaction is near the maximum amount allowed for a particular type of transaction, or whether it is nearer the minimum amount
- The time of day could be significant: is the day a holiday, a weekday, or was the transaction issued on a weekend, or a particular day during the month
- The frequency of transactions in a moving time frame
- The number of distinct transactions in a moving time frame (often, the same kind of a transaction could be repeated regularly)
- The quantity of deviations from a moving average for the type of the transaction
We could address the problem with number of detected frauds that is too low in all of the transactions by oversampling, by repeating, or copying, known fraudulent transactions, or by undersampling, by lowering the number of non-fraudulent transactions in the sample used for model training. From experience we have learned to prefer the approach of undersampling. We select non-fraudulent transactions either with simple random sampling, or when we have clearly determined groups of transactions, with stratified sampling. For example, it might be obvious that there are significant differences in the patterns across different countries; by using countries or regions we can define different strata and then randomly select appropriate numbers of cases from each stratum separately. You can read more about sampling at
Wikipedia. (n.d.). Sample size determination. Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Sample_size_determination.
Different data mining algorithms are more or less prone to giving erroneous predictions when the target state is presented with a low enough frequency. With SSAS algorithms, we usually observe the following:
- The Microsoft Neural Networks algorithm works best when you have about 50% of frauds in the total sample data set
- The Microsoft Naïve Bayes algorithm already works well with 10% frauds
- The Microsoft Decision Trees algorithm even works well with only 1% of frauds
During the data preparation, we also have to take care of missing values and outliers. Missing values can have a seriously negative effect on a data mining project. However, if the number of missing values is small enough, they can be handled by using a variety of methods:
- Do nothing (a simple, but rarely a valid approach)
- Filter out the rows containing the missing data (note that we could also filter out too many rows and lose a pattern)
- Ignore the column (note that we could also ignore too many columns)
- Predict the missing values with data mining algorithms, like Decision Trees (note that we could lose variability)
- Build separate models, for example, one model for all the data (including missing values if the algorithm can handle them), one model for known data, and one model for rows with missing values (this does represent quite a lot of additional work)
- Modify the operational systems so that the missing values can be collected later (this represents the best alternative, but unfortunately it is also the most difficult to achieve)
- Replace the missing data with a mean (this is a very popular technique, although we could lose variability again)
Whenever we make any changes to the data, we are influencing the analysis. Before making any changes, we should determine whether there are any patterns in the missing data. We use data mining for this analysis. For instance, we add a flag with a value of 1 for a row that includes missing values or the value of 0 for rows where all of the variable values for the case are known and present. Then we use a predictive algorithm like Decision Trees to explain this new flag variable with other input variables. The resulting tree should be very shallow, without any strong patterns; otherwise, we have identified a pattern in the missing values. If a pattern has been found, it should be explained, and then we should use an appropriate missing value handling technique that does not alter the data (e.g., build separate models). Otherwise, we prefer to filter the rows with missing values.
Outliers are rare and far out-of-bound values. They are so far out of bound that they can influence the results of the analyses. Similarly to handling missing values, we determine whether there is any pattern in the outliers before handling them. We can do one of the following to address outliers:
- Check if the outlier is an erroneous value, and if is, correct it (the best possibility)
- Do nothing (a simple, but rarely a valid approach)
- Filter out the rows with the outliers (note that we could end up filtering out too many rows and lose a pattern)
- Ignore the column (note that we could ignore too many columns)
- Replace outliers with common (mean) values (note that, of course, we are losing variability)
- Bin values into equal height ranges (this is a good approach, especially for algorithms that use discrete input variables)
- Normalize the data values in predefined limited ranges
Some well-known methods of normalization include:
- Range normalization.
- Z-score normalization
- The logistic (sigmoid) function normalization
- The hyperbolic tangent function normalization
Discretization (or binning, or categorization, or recoding) is also useful for other purposes, not only for dealing with outliers. For example, some algorithms, for instance the Microsoft Naïve Bayes algorithm, can accept only discrete input variables. Discretization is performed on a single column. Examples of discretization include:
- Transaction amount
Note that with proper discretization we can compare otherwise incomparable data. For example, if one country income per capita is significantly higher than in another country, then it is difficult to compare the amounts of the transactions directly. However, if we discretize the income into three groups, like low, average and high, and appropriately tailor the group boundaries per country, we get comparable data. Please refer to
Pyle D. (1999). Data Preparation for Data Mining. Morgan Kaufmann
for further reading about data preparation for data mining.
As already mentioned, data overview activities interleave with the data preparation. In order to find outliers, we must get the idea of the distribution of a variable. We can use Microsoft Office Excel Pivot Tables and Pivot Graphs for this task. However, many times it is faster to use statistical computations and interpret the results. With Transact-SQL queries, we can calculate a lot of useful statistical information.
For a quick overview of discrete variables, we can use frequency tables. In a frequency table, we can show values, the absolute frequency of those values, absolute percentages, cumulative frequency, cumulative percent, and a histogram of the absolute percentage. OLAP cubes can be used to establish an overview of the frequency distribution for tens, if not hundreds of variables, very quickly.
For continuous variables, we can use descriptive statistics and calculate the first four population moments: Mean, Standard Deviation, Skewness, and Kurtosis. This gives us a quick impression of the distribution of values of those variables.
It is also worth checking linear dependencies between pairs of variables. Some algorithms, like the Microsoft Decision Trees algorithm, tend to exclude one variable from the dependent pair in the analysis, while other algorithms, like the Microsoft Clustering algorithm, might find too good a clusters if they use pairs of dependent variables. There are multiple methods for calculating these dependencies:
- Chi-Squared test for pairs of discrete variables
- Correlation Coefficient for pairs of continuous variables
- Analysis of variance (ANOVA) for pairs where one variable is continuous per one discrete variable
Based on our experience in the field of fraud detection, we developed a much faster method to test all possible linear dependencies between multiple variables, not just pairs. We use the Microsoft Naïve Bayes data mining algorithm, where we declare all variables as input and predictable at the same time; this way, we can determine all important dependencies with a single analysis. Of course, the Naïve Bayes algorithm expects discrete inputs only; however, SSAS can discretize variables on the fly, using different discretization methods. We usually use the Equal Heights method, even though it is typically not useful for the final analysis, because it changes the shape of the distribution; however, it is extremely useful for the overview of linear dependencies, because it retains the maximum amount of information in the variables.
Variables with an insignificant amount of information are useless in analysis. We measure the amount of information in a variable by calculating the Entropy. As this calculation is quite slow, we do it only for variables, which we suspect contain a low amount of information.
For more details about the statistics mentioned, please refer to
Wonnacott T.H., & Wonnacott R.J. (1990). Introductory Statistics, 5th Edition . Wiley.
For more information on the information theory, please refer to
Kullback S. (1997). Information Theory and Statistics (Dover Books on Mathematics). Dover Publications.
As always, there are also some limitations for the SQL Saturday #274 Slovenia event. We have limited space in the classrooms, and can cover only a limited number of lunch meals. Therefore, if you plan to attend, you should register very soon. We are positively surprised with the attendance and general acceptance of the event. It is a pleasure and a pain at the same time, but it has to be said: we are closing the registrations soon, there are only a few free places left. Seriously, hurry up! We will try to maintain a waiting list, but please don’t rely just on that.
I would also like to use this blog to kindly ask all registered attendees to really attend the event. Of course, there are always some who can’t come because of some objective reasons. However, experience tells us that the “no show” percentage is higher for free than for payable events. Please don’t register to the event just for the last exit for the Saturday December 21st, just for the case if you don’t find anything better to do. In addition, please don’t get lazy in the morning of that day, please get up and come to the event. Please understand that if you register and then don’t show up, you are preventing somebody else to attend. Also don’t forget that all expenses are covered by sponsors. They, of course, prefer bigger attendance than smaller. Big “no show” percentage therefore also lowers the probability for future events.
Thank you for understanding and see you at SQL Saturday #274 Slovenia!
Correction: here is the final schedule for the SQL Saturday #274 Slovenia event! (OK, if some of the speakers does not cancel the presentation)
08:30 – 09:00
09:00 – 09:15
Lah, Prajdič, Sarka: Keynote
09:15 – 09:30
09:30 – 10:30
Ozhorasan, Belkis: Features Coming With Microsoft SQL Server 2014
Parenzan, Marco: .NET Development for the SQL Server Developer
Stacey, Mark: Analysis Services for the DBA
10:30 – 10:45
10:45 – 11:45
Hristov, Boris: Replay your workload as it's your actual one!
Boles, Kevin: Common TSQL Mistakes
Pontoppidan, Kennie: Autogenerating a process data warehouse
11:45 – 13:15
Lunch / Vine Tasting Break
13:15 – 14:15
Vanhove, Pieter: Harder, Better, Stronger and More Secure
Radivojevic, Milos: SQL Server 2014 - New Cardinality Estimator
Kaštrun, Tomaž: Using Data Quality services on Customer related data
14:15 – 14:30
14:30 – 15:30
Koprowski, Tobiasz: 2 AM. A Disaster just began...
Szeliga, Marcin: Indexing for denormalization
Reinholdt, Rasmus: Optimizing BI with SQL2014 In-Memory tables
15:30 – 15:45
15:45 – 16:45
Durkin, William: SQL Server Consolidation - Resistance is Futile
El-Nagar, Shehap: T-SQL Performance Guidelines for better DB stress power
Kamman, André: Automate your SSIS development with BIML
16:45 – 17:00
17:00 – 17:30
Lah, Prajdič, Sarka: Raffle, Wrap-up
I am proud to announce that my first course for Pluralsight is released. The course title is Logical and Physical Modeling for Analytical Applications. Here is the description of the course.
A bad data model leads to an application that does not perform well. Therefore, when developing an application, you should create a good data model from the start. However, even the best logical model can’t help when the physical implementation is bad. It is also important to know how SQL Server stores and accesses data, and how to optimize the data access. Database optimization starts by splitting transactional and analytical applications. In this course, you learn how to support analytical applications with logical design, get understanding of the problems with data access for queries that deal with large amounts of data, and learn about SQL Server optimizations that help solving these problems.
Enjoy the course!
This is the second part of the fraud detection whitepaper. You can find the first part in my previous blog post about this topic.
It is impossible to evaluate the time and money needed for a complete fraud detection infrastructure in advance. Personally, I do not know the customer’s data in advance. I don’t know whether there is already an existing infrastructure, like a data warehouse, in place, or whether we would need to build one from scratch. Therefore, I always suggest to start with a proof-of-concept (POC) project. A POC takes something between 5 and 10 working days, and involves personnel from the customer’s site – either employees or outsourced consultants. The team should include a subject matter expert (SME) and at least one information technology (IT) expert. The SME must be familiar with both the domain in question as well as the meaning of data at hand, while the IT expert should be familiar with the structure of data, how to access it, and have some programming (preferably Transact-SQL) knowledge. With more than one IT expert the most time consuming work, namely data preparation and overview, can be completed sooner. I assume that the relevant data is already extracted and available at the very beginning of the POC project.
If a customer wants to have their people involved in the project directly and requests the transfer of knowledge, the project begins with training. I strongly advise this approach as it offers the establishment of a common background for all people involved, the understanding of how the algorithms work and the understanding of how the results should be interpreted, a way of becoming familiar with the SQL Server suite, and more.
Once the data has been extracted, the customer’s SME (i.e. the analyst), and the IT expert assigned to the project will learn how to prepare the data in an efficient manner. Together with me, knowledge and expertise allow us to focus immediately on the most interesting attributes and identify any additional, calculated, ones soon after. By employing our programming knowledge, we can, for example, prepare tens of derived variables, detect outliers, identify the relationships between pairs of input variables, and more, in only two or three days, depending on the quantity and the quality of input data. I favor the customer’s decision of assigning additional personnel to the project. For example, I actually prefer to work with two teams simultaneously. I demonstrate and explain the subject matter by applying techniques directly on the data managed by each team, and then both teams continue to work on the data overview and data preparation under our supervision. I explain to the teams what kind of results we expect, the reasons why they are needed, and how to achieve them. Afterwards we review and explain the results, and continue with new instructions, until we resolve all known problems.
Simultaneously with the data preparation the data overview is performed. The logic behind this task is the same – again I show to the teams involved the expected results, how to achieve them and what they mean. This is also done in multiple cycles as is the case with data preparation, because, quite frankly, both tasks are completely interleaved. A specific objective of the data overview is of principal importance – it is represented by a simple star schema and a simple OLAP cube that will first of all simplify data discovery and interpretation of the results, and will also prove useful in the following tasks. The presence of the customer’s SME is the key to resolving possible issues with the actual meaning of the data. We can always replace the IT part of the team with another database developer; however, we cannot conduct this kind of a project without the customer’s SME.
After the data preparation and when the data overview is available, we begin the scientific part of the project. I assist the team in developing a variety of models, and in interpreting the results. The results are presented graphically, in an intuitive way. While it is possible to interpret the results on the fly, a much more appropriate alternative is possible if the initial training was also performed, because it allows the customer’s personnel to interpret the results by themselves, with only some guidance from me. The models are evaluated immediately by using several different techniques. One of the techniques includes evaluation over time, where we use an OLAP cube.
After evaluating the models, we select the most appropriate model to be deployed for a production test; this allows the team to understand the deployment process. There are many possibilities of deploying data mining models into production; at the POC stage, we select the one that can be completed quickly. Typically, this means that we add the mining model as an additional dimension to an existing DW or OLAP cube, or to the OLAP cube developed during the data overview phase. Finally, we spend some time presenting the results of the POC project to the stakeholders and managers.
Even from a POC, the customer will receive lots of benefits, all at the sole risk of spending money and time for a single 5 to 10 day project:
- The customer learns the basic patterns of frauds and fraud detection
- The customer learns how to do the entire cycle with their own people, only relying on me for the most complex problems
- The customer’s analysts learn how to perform much more in-depth analyses than they ever thought possible
- The customer’s IT experts learn how to perform data extraction and preparation much more efficiently than they did before
- All of the attendees of this training learn how to use their own creativity to implement further improvements of the process and procedures, even after the solution has been deployed to production
- The POC output for a smaller company or for a subsidiary of a larger company can actually be considered a finished, production-ready solution
- It is possible to utilize the results of the POC project at subsidiary level, as a finished POC project for the entire enterprise
- Typically, the project results in several important “side effects”
- Improved data quality
- Improved employee job satisfaction, as they are able to proactively contribute to the central knowledge about fraud patterns in the organization
- Because eventually more minds get to be involved in the enterprise, the company should expect more and better fraud detection patterns
After the POC project is completed as described above, the actual project would not need months of engagement from my side. This is possible due to our preference to transfer the knowledge onto the customer’s employees: typically, the customer will use the results of the POC project for some time, and only engage me again to complete the project, or to ask for additional expertise if the complexity of the problem increases significantly. I usually expect to perform the following tasks:
- Establish the final infrastructure to measure the efficiency of the deployed models
- Deploy the models in additional scenarios
- Through reports
- By including Data Mining Extensions (DMX) queries in OLTP applications to support real-time early warnings
- Include data mining models as dimensions in OLAP cubes, if this was not done already during the POC project
- Create smart ETL applications that divert suspicious data for immediate or later inspection
- I would also offer to investigate how the outcome could be transferred automatically to the central system; for instance, if the POC project was performed in a subsidiary whereas a central system is available as well
- Of course, for the actual project, I would repeat the data and model preparation as needed
It is virtually impossible to tell in advance how much time the deployment would take, before we decide together with customer what exactly the deployment process should cover. Without considering the deployment part, and with the POC project conducted as suggested above (including the transfer of knowledge), the actual project should still only take additional 5 to 10 days.
The approximate timeline for the POC project is, as follows:
- 1-2 days of training
- 2-3 days for data preparation and data overview
- 2 days for creating and evaluating the models
- 1 day for initial preparation of the continuous learning infrastructure
- 1 day for presentation of the results and discussion of further actions
Quite frequently I receive the following question: are we going to find the best possible model during the POC project, or during the actual project? My answer is always quite simple: I do not know. Maybe, if we would spend just one hour more for data preparation, or create just one more model, we could get better patterns and predictions. However, we simply must stop somewhere, and the best possible way to do this, according to my experience, is to restrict the time spent on the project in advance, after an agreement with the customer. You must also never forget that, because we build the complete learning infrastructure and transfer the knowledge, the customer will be capable of doing further investigations independently and improve the models and predictions over time without the need for a constant engagement with me.
In Celebration of #Strataconf, O’Reilly is offering 50% discount on all data-related e-books and videos for the next three days, starting at 12:01AM PDT (8:00AM CET). Don’t miss the opportunity at http://oreil.ly/1gNsE28.
While working on different fraud detection projects, I developed my own approach to the solution for this problem. In my PASS Summit 2013 session I am introducing this approach. I also wrote a whitepaper on the same topic, which was generously reviewed by my friend Matija Lah. In order to spread this knowledge faster, I am starting a series of blog posts which will at the end make the whole whitepaper.
With the massive usage of credit cards and web applications for banking and payment processing, the number of fraudulent transactions is growing rapidly and on a global scale. Several fraud detection algorithms are available within a variety of different products. In this paper, we focus on using the Microsoft SQL Server suite for this purpose. In addition, we will explain our original approach to solving the problem by introducing a continuous learning procedure. Our preferred type of service is mentoring; it allows us to perform the work and consulting together with transferring the knowledge onto the customer, thus making it possible for a customer to continue to learn independently.
This paper is based on practical experience with different projects covering online banking and credit card usage.
A fraud is a criminal or deceptive activity with the intention of achieving financial or some other gain. Fraud can appear in multiple business areas. You can find a detailed overview of the business domains where fraud can take place in
Sahin Y., & Duman E. (2011), Detecting Credit Card Fraud by Decision Trees and Support Vector Machines, Proceedings of the International MultiConference of Engineers and Computer Scientists 2011 Vol 1. Hong Kong: IMECS.
Dealing with frauds includes fraud prevention and fraud detection. Fraud prevention is a proactive mechanism, which tries to disable frauds by using previous knowledge. Fraud detection is a reactive mechanism with the goal of detecting suspicious behavior when a fraudster surpasses the fraud prevention mechanism. A fraud detection mechanism checks every transaction and assigns a weight in terms of probability between 0 and 1 that represents a score for evaluating whether a transaction is fraudulent or not. A fraud detection mechanism cannot detect frauds with a probability of 100%; therefore, manual transaction checking must also be available. With fraud detection, this manual part can focus on the most suspicious transactions. This way, an unchanged number of supervisors can detect significantly more frauds than could be achieved with traditional methods of selecting which transactions to check, for example with random sampling.
There are two principal data mining techniques available both in general data mining as well as in specific fraud detection techniques: supervised or directed and unsupervised or undirected. Supervised techniques or data mining models use previous knowledge. Typically, existing transactions are marked with a flag denoting whether a particular transaction is fraudulent or not. Customers at some point in time do report frauds, and the transactional system should be capable of accepting such a flag. Supervised data mining algorithms try to explain the value of this flag by using different input variables. When the patterns and rules that lead to frauds are learned through the model training process, they can be used for prediction of the fraud flag on new incoming transactions. Unsupervised techniques analyze data without prior knowledge, without the fraud flag; they try to find transactions which do not resemble other transactions, i.e. outliers. In both cases, there should be more frauds in the data set selected for checking by using the data mining knowledge compared to selecting the data set with simpler methods; this is known as the lift of a model. Typically, we compare the lift with random sampling. The supervised methods typically give a much better lift than the unsupervised ones. However, we must use the unsupervised ones when we do not have any previous knowledge. Furthermore, unsupervised methods are useful for controlling whether the supervised models are still efficient.
Accuracy of the predictions drops over time. Patterns of credit card usage, for example, change over time. In addition, fraudsters continuously learn as well. Therefore, it is important to check the efficiency of the predictive models with the undirected ones. When the difference between the lift of the supervised models and the lift of the unsupervised models drops, it is time to refine the supervised models. However, the unsupervised models can become obsolete as well. It is also important to measure the overall efficiency of both, supervised and unsupervised models, over time. We can compare the number of predicted frauds with the total number of frauds that include predicted and reported occurrences. For measuring behavior across time, specific analytical databases called data warehouses (DW) and on-line analytical processing (OLAP) systems can be employed. By controlling the supervised models with unsupervised ones and by using an OLAP system or DW reports to control both, a continuous learning infrastructure can be established.
There are many difficulties in developing a fraud detection system. As has already been mentioned, fraudsters continuously learn, and the patterns change. The exchange of experiences and ideas can be very limited due to privacy concerns. In addition, both data sets and results might be censored, as the companies generally do not want to publically expose actual fraudulent behaviors. Therefore it can be quite difficult if not impossible to cross-evaluate the models using data from different companies and different business areas. This fact stresses the importance of continuous learning even more. Finally, the number of frauds in the total number of transactions is small, typically much less than 1% of transactions is fraudulent. Some predictive data mining algorithms do not give good results when the target state is represented with a very low frequency. Data preparation techniques like oversampling and undersampling can help overcome the shortcomings of many algorithms.
SQL Server suite includes all of the software required to create, deploy any maintain a fraud detection infrastructure. The Database Engine is the relational database management system (RDBMS), which supports all activity needed for data preparation and for data warehouses. SQL Server Analysis Services (SSAS) supports OLAP and data mining (in version 2012, you need to install SSAS in multidimensional and data mining mode; this was the only mode in previous versions of SSAS, while SSAS 2012 also supports the tabular mode, which does not include data mining). Additional products from the suite can be useful as well. SQL Server Integration Services (SSIS) is a tool for developing extract transform–load (ETL) applications. SSIS is typically used for loading a DW, and in addition, it can use SSAS data mining models for building intelligent data flows. SQL Server Reporting Services (SSRS) is useful for presenting the results in a variety of reports. Data Quality Services (DQS) mitigate the occasional data cleansing process by maintaining a knowledge base. Master Data Services is an application that helps companies maintaining a central, authoritative source of their master data, i.e. the most important data to any organization. For an overview of the SQL Server business intelligence (BI) part of the suite that includes Database Engine, SSAS and SSRS, please refer to
Veerman E., Lachev T., & Sarka D. (2009). MCTS Self-Paced Training Kit (Exam 70-448): Microsoft® SQL Server® 2008 Business Intelligence Development and Maintenance. MS Press.
For an overview of the enterprise information management (EIM) part that includes SSIS, DQS and MDS, please refer to
Sarka D., Lah M., & Jerkič G. (2012). Training Kit (Exam 70-463): Implementing a Data Warehouse with Microsoft® SQL Server® 2012. O'Reilly.
For details about SSAS data mining, please refer to
MacLennan J., Tang Z., & Crivat B. (2009). Data Mining with Microsoft SQL Server 2008. Wiley.
SQL Server Data Mining Add-ins for Office, a free download for Office versions 2007, 2010 and 2013, bring the power of data mining to Excel, enabling advanced analytics in Excel. Together with PowerPivot for Excel, which is also freely downloadable and can be used in Excel 2010, is already included in Excel 2013. It brings OLAP functionalities directly into Excel, making it possible for an advanced analyst to build a complete learning infrastructure using a familiar tool. This way, many more people, including employees in subsidiaries, can contribute to the learning process by examining local transactions and quickly identifying new patterns.
Many companies or organizations do regular data cleansing. When you cleanse the data, the data quality goes up to some higher level. The data quality level is determined by the amount of work invested in the cleansing. As time passes, the data quality deteriorates, and you need to repeat the cleansing process. If you spend an equal amount of effort as you did with the previous cleansing, you can expect the same level of data quality as you had after the previous cleansing. And then the data quality deteriorates over time again, and the cleansing process starts over and over again.
The idea of Data Quality Services is to mitigate the cleansing process. While the amount of time you need to spend on cleansing decreases, you will achieve higher and higher levels of data quality. While cleansing, you learn what types of errors to expect, discover error patterns, find domains of correct values, etc. You don’t throw away this knowledge. You store it and use it to find and correct the same issues automatically during your next cleansing process. The following figure shows this graphically.
The idea of master data management, which you can perform with Master Data Services (MDS), is to prevent data quality from deteriorating. Once you reach a particular quality level, the MDS application—together with the defined policies, people, and master data management processes—allow you to maintain this level permanently. This idea is shown in the following picture.
OK, now you know what DQS and MDS are about. You can imagine the importance on maintaining the data quality. Here are some resources that help you preparing and executing the data quality (DQ) and master data management (MDM) activities.
- Dejan Sarka and Davide Mauri: Data Quality and Master Data Management with Microsoft SQL Server 2008 R2 – a general introduction to MDM, MDS, and data profiling. Matching explained in depth.
- Dejan Sarka, Matija Lah and Grega Jerkič: MCTS Self-Paced Training Kit (Exam 70-463): Building Data Warehouses with Microsoft SQL Server 2012 – I wrote quite a few chapters about DQ and MDM, and introduced also SQL Server 2012 DQS.
- Thomas Redman: Data Quality: The Field Guide – you should start with this book. Thomas Redman is the father of DQ and MDM.
- Tyler Graham: Microsoft SQL Server 2012 Master Data Services – MDS in depth from a product team mate.
- Arkady Maydanchik: Data Quality Assessment – data profiling in depth.
- Tamraparni Dasu, Theodore Johnson: Exploratory Data Mining and Data Cleaning – advanced data profiling with data mining.
- Forthcoming presentations
- I am presenting a DQS and MDM seminar at PASS SQL Rally Amsterdam 2013:
- Wednesday, November 6th, 2013: Enterprise Information Management with SQL Server 2012 – a good kick start to your first DQ and / or MDM project.
- Data Quality and Master Data Management with SQL Server 2012 – I wrote a 2-day course for SolidQ. If you are interested in this course, which I could also deliver in a shorter seminar way, you can contact your closes SolidQ subsidiary, or, of course, me directly on addresses firstname.lastname@example.org or email@example.com. This course could also complement the existing courseware portfolio of training providers, which are welcome to contact me as well.
Start improving the quality of your data now!
Yes, here it is SQL Saturday #274 is coming to Slovenia (#sqlsatSlovenia). The event will take place on Saturday, December 21st, at company
pixi* labs, Informacijske tehnologije, d.o.o.
Poslovna cona A 2
This company generously offered to host the event. We, the whole Slovenian SQL Server community, are very grateful for this.
At this time, a call for speakers went out, and we are already getting the first proposals.
We are especially happy that we will get possibility to show the foreign speakers how beautiful Slovenia and especially the capital Ljubljana is in December. Expect a lot of partying right on the streets, no matter of weather. Be prepared, we have slightly weird customs when it comes to drinks. For example, our regular special discount offer is not three drinks for the price of two; it is six drinks for the price of five.
If you are a speaker or want to become one, consider sending a proposal. Since most of the sessions will be held in English and you don’t want to speak, consider coming as a visitor as well. Or maybe you would be interested to become a sponsor. Although we are targeting a low budgeted event, any kind of sponsorship is very welcome. Please feel free to contact the organizers if you are interested to become a sponsor:
Matija Lah – firstname.lastname@example.org, Mladen Prajdič - email@example.com, or Dejan Sarka - firstname.lastname@example.org.
Looking forward to see you all!
There are many different types of analyses, each one with its own pros and cons.
Relational reports have a predefined structure, and end users cannot change it. They are simple to use for end users. Reports can use real-time data and snapshots of data to show the state of a report at specific points in time. One of the drawbacks is that report authoring is limited to IT pros and advanced users. Any kind of dynamic restructuring is very limited. If real-time data is used for a report, the report has a negative impact on the performance of the source system. Processing of the reports might be slow because the data comes from relational database management systems, which are not optimized for reporting only.
If you create a semantic model of your data, your end users can create ad-hoc report structures. However, the development is more complex because a developer is needed to create these semantic models.
For OLAP, you typically use specialized database management systems. You get lightning speed of analyses. End users can use rich and thin clients to interactively change the structure of the report. Typically, they do it graphically. However, the development of an OLAP system is many times quite complex. It involves the preparation and maintenance of an enterprise data warehouse and OLAP cubes. In order to exploit the possibility of real-time restructuring of reports, the users must be both active and educated. The data is usually stale, as it is loaded into data warehouses and OLAP cubes with a scheduled process.
With data mining, a structure is not selected in advance; it searches for the structure. As a result, data mining can give you the most valuable results because you can discover patterns you did not expect. A data mining model structure is limited only by the attributes that you use to train the model. One of the drawbacks is that a lot of knowledge is needed for a successful data mining project. End users have to understand the results. Subject matter experts and IT professionals need to understand business problem thoroughly. The development might be sometimes even more complex than the development of OLAP cubes.
Each type of analysis has its own place in an enterprise system. SQL Server has tools for all kinds of analyses. However, data mining is the most advanced way of analyzing the data; this is the “I” in BI. In order to get the most out of it, you need to learn quite a lot. In this blog post, I am gathering together resources for learning, including forthcoming events.
- Multiple authors: SQL Server MVP Deep Dives – I wrote an introductory data mining chapter there.
- Erik Veerman, Teo Lachev and Dejan Sarka: MCTS Self-Paced Training Kit (Exam 70-448): Microsoft SQL Server 2008 - Business Intelligence Development and Maintenance – you can find a good overview of a complete BI solution, including data mining, in this book.
- Jamie MacLennan, ZhaoHui Tang, and Bogdan Crivat: Data Mining with Microsoft SQL Server 2008 – can’t miss this book if you want to mine your data with SQL Server tools.
- Michael Berry, Gordon Linoff: Mastering Data Mining: The Art and Science of Customer Relationship Management – data mining from both, business and technical perspective.
- Dorian Pyle: Data Preparation for Data Mining – an in-depth book about data preparation.
- Thomas and Ronald Wonnacott: Introductory Statistics – if you thought that you could get away without statistics, then you are not serious about data mining.
- Jiawei Han and Micheline Kamber: Data Mining Concepts and Techniques – in-depth explanation of the most popular data mining algorithms.
- Michael Berry and Gordon Linoff: Data Mining Techniques – another book that explains data mining algorithms, more fro a business perspective.
- Paolo Guidici: Applied Data Mining – very mathematical book, only if you enjoy statistics and mathematics in general.
- Forthcoming presentations
- I am presenting two data mining related sessions during the PASS Summit in Charlotte, NC:
- Wednesday, October 16th, 2013 - Fraud Detection: Notes from the Field – I am showing how to use data mining for a specific business problem. The presentation is based on real-life projects.
- Friday, October 18th: Excel 2013 Advanced Analytics – I am focusing on Excel Data Mining Add-ins, and how to use them together with Power Pivot and other add-ins. This is the most you can get out of Excel.
- Sinergija 2013, Belgrade, Serbia
- Tuesday, October 22nd: Excel 2013 Analytics to the Max – another presentation focusing on the most advanced analytics you can get in Excel.
- SQL Rally Amsterdam, Netherlands
- Thursday, November 7th: Advanced Analytics in Excel 2013 – and again I am presenting about data mining in Excel. Why three different titles for the same presentation? I don’t know, I guess I forgot the name I proposed every time right after I sent the proposal.
- Data Mining with SQL Server 2012 – I wrote a 3-day course for SolidQ. If you are interested in this course, which I could also deliver in a shorter seminar way, you can contact your closes SolidQ subsidiary, or, of course, me directly on addresses email@example.com or firstname.lastname@example.org. This course could also complement the existing courseware portfolio of training providers, which are welcome to contact me as well.
OK, now you know: no more excuses, start learning data mining, get the most out of your data
After years of blogging at SolidQ site, I decided to move my blog here, to SQLBLOG.COM. It seems that most of the SQL Server specialists like to hang around together. In addition, blogging on a specific company’s site is not too popular because of possible limitations; independent sites are preferred. Instead of competing with my friends, I decided to join them Stay tuned, my blogs are coming.