THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Dejan Sarka

  • Indexing, Querying and Analyzing Text with SQL Server 2012-2014

    It is hard to imagine searching for something on the Web without modern search engines like Bing or Google. However, most contemporary applications still limit users to exact searches only. For end users, even the standard SQL LIKE operator is not powerful enough for approximate searches. In addition, many documents are stored in modern databases; end users would probably like to get powerful search inside document contents as well. Text mining is also becoming more and more popular. Everybody would like to understand data from blogs, Web sites, and social media. Microsoft SQL Server in versions 2012 and 2014 enhances full-text search support that was available in previous editions substantially. Semantic Search, a new component in Full-Text Search, can help you understand the meaning of documents. Finally, the Term Extraction and Term Lookup components from SQL Server Integration Services also helps with text analysis.

    I am proud to introduce my second course I authored for Pluralsight - Indexing, Querying and Analyzing Text with SQL Server 2012-2014. You can learn how to get the most out of your texts with SQL Server tools. Enjoy!

  • Winter and Spring Conferences

    As a speaker, I would like to help promoting the conferences where I am proud to present. Of course, this does not mean that other conferences are any worse. Anyway, here is the list of the conferences where I am going to speak, and I would like to invite you to join me.

    Looking forward to all of these events. I hope we will meet in at least one of them!

  • SQL Set Operators. Set? Really?

    UNION, INTERSECT and EXCEPT operators are commonly called Set Operators. For example, in Books Online you can find a topic “Set Operators”, where  these three operators are explained. They should represent set operations UNION, INTERSECT and MINUS (synonym for EXCEPT DISTINCT). Also Wikipedia has a topic called “SET OPERATIONS (SQL)”, where these three operators are introduced. And these operators are commonly represented by Venn diagrams. Logically, Venn diagrams are also called Set diagrams. Here are the three operators presented with Venn diagrams:


    However, is the name “Set Operators” really correct? The first question I asked myself was very simple: why would we have 10 and more relational operator and three set operators in the relational algebra? Well, makes no sense. The relational algebra comprise relational operators only, of course.

    So what exactly is a relation? A relation is a special kind of set, set of entities that are related, i.e that are of the same kind. How do we know that the two entities are of the same kind and can thus be grouped in a single entity set, i.e. in a relation? Of course, two entities are of the same kind if they have the same attributes. Therefore, every relation is a set; however, not every set is a relation.

    Set operators work on sets and produce a set. Relational operators work on relations and produce a relation. SQL operators UNION, INTERSECT and EXCEPT produce relations, i.e. special kind of sets. Set operator UNION can combine a set of differential equations and a set of hammers into a single set. Relational operator UNION can’t combine a relation of differential equations and a relation of hammers into a single relation, because elements of these two relations have nothing in common. And don’t think that if you take only keys of both relation, and both have a single-column integer key, that a UNION of this would be a relation. First of all, you can do such an union because we don’t use strong types in a relational database (each key should be of its own type – in this case, you should have a “hammer” and a “differential equation” key types, which would disallow such operations). In addition, in the case I mentioned, you would get two a relations that has with a single attribute, the key only, which would probably be meaningless from the business perspective. A relation without a meaningful attribute is not really an entity set, as defined by Peter Chen. An entity is something we can identify and is of interest. If we don’t have any real attribute, then this “thing” (whatever) is definitely not an entity, because without attributes it can’t be of any interest.

    To summarize: SQL UNION, INTERSECT and EXCEPT are simply relational operators. Talking about them as of set operators is at least imprecise. However, representing them with Venn diagrams is not just imprecise, it is wrong. Here is a better presentation of there three relational operators.


  • SQL Server 2012 Reporting Services Blueprints Review

    I had opportunity to read the SQL Server 2012 Reporting Services Blueprints book by Marlon Ribunal (@MarlonRibunal) and Mickey Stuewe (@SQLMickey), Packt Publishing. Here is my short review.


    I find the book very practical. The authors guide you right to the point, without unnecessary obstructions. Step by step, you create more and more complex reports, and learn SQL Server Reporting Services (SSRS) 2012 features. If you are more hands-on guy, this is the right book for you. Well, to be honest, there is not much theory in reporting. Reporting is the starting point of business intelligence (BI), with on-line analytical processing as the next step, and data mining as the I in BI.

    The book correctly presents all SSRS features. In some chapters and appendices, the authors also went beyond basics. I especially enjoyed advanced topics in chapter 5, “Location, Location, Locations!”, appendix A, “SSRS Best Practices”, and appendix B, “Transactional Replication for Reporting Services”. All together, the book is more than sufficient for creating nice reports without much previous knowledge in a short time, and for avoiding common pitfalls at the same time.

    The only thing I am missing is a bit more of theory. Yes, as a data mining person, I like to learn things a bit more in depth. I usually don’t deal with presentation; I prefer numbers. And this is my point – I would like to see more guidelines about proper usage of report elements – when to use which graph type, when to use maps, how to properly present different kinds of data…

    Anyway, all together, the book is very useful, and I would recommend it to anybody that wants to learn SSRS in a short time.

  • Fraud Detection with the SQL Server Suite Part 5

    This is the fifth, the final part of the fraud detection whitepaper. You can find the first part, the second part, the third part, and the fourth part in my previous blog posts about this topic.

    The Results

    In my original fraud detection whitepaper I wrote for SolidQ, I was advised by my friends to include some concrete and simple numbers to calculate the return on investment (ROI) in a language that managers can understand. with some customers, we really managed to get very impressive numbers. However, I am not repeating this calculation here. However, this is my personal blog. Therefore, I am writing my personal opinion here.

    I am kind of bored with this constant requests to show simple numbers that managers can understand. Personally, I don’t think that managers are that stupid that they would not understand anything beyond primary school mathematics. And even if some of them are that dumb, I don’t care, as they can’t become my customers. They would never be able to understand the value of such an advanced technique like data mining.

    I am pretty sure that the vast majority of managers can calculate approximate ROI by themselves, and also better than I can do. They definitely know their business better than I can do, and already know how much money they are losing because of frauds and how many frauds they are already preventing or catching early. In addition, I am pretty sure that most of the managers do understand the value of learning, and appreciate building of the learning infrastructure.

    Therefore, in short, I am leaving to you, to the reader, to evaluate what can you expect from implementing a fraud detection continuous learning cycle. And thank you for understanding my point!


    Fraud detection is a very popular, albeit very complex, data mining task. I have developed my own approach to fraud detection. The most significant element of this approach is the continuous learning cycle.

    Although Microsoft SQL Server is not the most popular tool for data mining, I am using it. The SQL Server suite gives us all of the tools we need, and because all of the tools come from a single suite, they work perfectly together, thus substantially lowering the time needed to bring a project from the initial meeting to a production-ready deployment.

    Another advantage of my approach is the mentoring with the knowledge transfer. It is not my intention to get permanent consulting contracts; I want to progress together with my customers. Once we finish the project, or sometimes even as soon as we finish the POC project, the customer can begin using and continue improving the fraud detection system constantly with the help of the continuous learning infrastructure.

    Finally, due to Microsoft’s licensing policies, the customers that already possess Microsoft SQL Server Standard Edition or higher and Microsoft Excel, do not need to purchase any additional licenses.

  • SQL Saturday #274 Slovenia Recapitulation

    Pure success!

    I could simply stop here. However, I want to mention again everybody involved in this, and also some who were unfortunately missing.

    First of all, PASS is the organization that defined SQL Saturdays. And apparently the idea worksSmile

    I have to thank again to all of the speakers. Coming to share your amazing knowledge is something we really appreciate. The presentations were great, from the technical and other perspectives.

    Of course, we could not do the event without sponsors. I am not going to enlist all of them again; I will just mention the host, pixi* labs, the company that hosted the event and who's  employees helped with all of the organization. In addition, I need to mention Vina Kukovec. Boštjan Kukovec, an old member of Slovenian SQL Server and Developers users group, organized free wine tasting after the event. And what a wine it is!

    Finally, thanks to all attendees for coming. We had approximately 85% show up rate; only 15% or the registered attendees didn’t come. This is an incredible result, worldwide! And from the applause after the raffle, when we closed the vent (and started wine tasting), I conclude that the attendees were very satisfied as well.

    I want to mention three people that wanted to come, but run out of luck this time. Peter Stegnar from pixi* labs was the one that immediately offered the venue, and permeated with his enthusiasm also other pixi* labs members. Due to family reasons he couldn’t join us to see the results of his help. Tobiasz Janusz Koprowski wanted to speak, organized his trip, looked forward to join us; however, just couple of days before the event he had to cancel because of some urgent work at customer’s site. And what to say about Kevin Boles? He really tried hard to come. Think of it, he was prepared to come from USA! he was already on the airport, when his flight got cancelled due to technical problems. We were in constant touch Friday evening. He managed to change the flight, went to the gate, but was not admitted to the plane. because there was only seven minutes left till take off. Catching next flights would not make any sense anymore, because he would come too late anyway. He really did the best he could do, he just didn’t have enough luck this time. Peter, Tobiasz, and Kevin, thank you for your enthusiasm, we seriously missed you, and we certainly hope we will meet on our next event!

  • SQL Saturday #274 Slovenia Sponsors

    I would like to expose all of the sponsors that enabled the event. The following companies provided invaluable help to us, and therefore we are thanking them. I think they deserve this exposure.


    General sponsor


    Gold sponsors




    Silver sponsors


    Bronze sponsors


    Raffle sponsors



  • SQL Saturday #274 Slovenia Full House

    SQL Saturday #274 Slovenia is full (150 registered attendees) for more than a weekSmile There are still some people in the waiting list. We sent an e-mail to registered attendees asking them to unregister if they already know they can’t make it. I have to thank to all of those who followed this advice and made room for part of the people from the waiting list. Apparently, we will really have a full house, and very low no show rate.

    I am positively surprised with everything so far. Fantastic support from PASS, immediate help from sponsors, and a huge number of speakers prepared to spend their own money and time and come over to present high-end topics, and also passionate attendees who registered far in advance and are fair enough to release their place for somebody who can come more surely than they can. Let me also mention that we got classrooms for free, that we got minimal price for lunch, and we even got a free vine tasting during the event. Fantastic response from PASS, SQL Server Community, from sponsors, and from my fellow countrymen from Slovenia!

    See you all on Saturday. We will all enjoy, this is a promise!

  • Fraud Detection with the SQL Server Suite Part 4

    This is the fourth part of the fraud detection whitepaper. You can find the first part, the second part, and the third part in my previous blog posts about this topic.

    Data Mining Models

    We create multiple mining models by using different algorithms, different input data sets, and different algorithm parameters. Then we evaluate the models in order to find the most appropriate candidates for the actual deployment to production.

    Many different algorithms can be used for fraud detection; it is difficult to say which one would generally yield the best result. In a project, the available algorithms are typically chosen, based on experience and the knowledge about the given domain. Because we use the Microsoft SQL Server suite, we use Microsoft Decision Trees, Microsoft Neural Network, and Microsoft Naïve Bayes directed algorithms, and Microsoft Clustering for the undirected one. In recent years, the Support Vector Machines methods are becoming more and more popular. SSAS does not bring this algorithm out of the box. However, it can be downloaded as a free plug-in algorithm for SSAS from the Microsoft CodePlex site at

    Valkonet, J. (2008). Support Vector Machine plug-in in Analysis Services. Retrieved from Microsoft CodePlex:

    Of course, if there are time and software policy constraints that prevent us from using this download, we simply skip it. We do not lose much, because, according to

    Sahin Y., & Duman E. (2011). Detecting Credit Card Fraud by Decision Trees and Support Vector Machines. Proceedings of the International MultiConference of Engineers and Computer Scientists 2011 Vol 1. Hong Kong: IMECS.,

    the Decision Tress algorithm usually yields better results in fraud detection analysis than Support Vector Machines. For details on specific data mining algorithms, please refer to

    Han J., Kamber M., & Pei J. (2011). Data Mining: Concepts and Techniques, Third Edition. Morgan Kaufmann,

    or to the SolidQ course

    Sarka D. (2012). Data Mining with SQL Server 2012. SolidQ. Retrieved from

    We evaluate the efficiency of different supervised models by using standard techniques, namely the Lift Chart, the Classification Matrix, and Cross Validation. All of these techniques are built into the SSAS data mining feature and are described in more detail in

    MacLennan J., Tang Z., & Crivat B. (2009). Data Mining with Microsoft SQL Server 2008. Wiley.

    To evaluate the Clustering models, we have developed a technique of our own, implementing entropy. If the individual clusters are homogenous, the entropy in any given cluster must be low. We calculate the average entropy and the standard deviation of the entropy across all clusters. In a SSAS Clustering model that has been trained (or processed), it is possible to read the model data with DMX queries. In the cluster notes we can identify the distribution of the input variables, and then use it to can calculate the entropy.

    From experience, we have learned that not all algorithms are equally useful for all data sets. The Microsoft Neural Network algorithm works best when the frequency of the target state (i.e. fraud) is about 50%. Microsoft Naïve Bayes can work well when the target state is represented by approximately 10% of the population or more. However, Microsoft Decision Trees work well even if the target state frequency is only about 1%, and is thus a very suitable algorithm for small data sets and low frequency of the target state as well.

    The Continuous Learning Cycle

    The continuous learning cycle is shown graphically in Figure 1.


    Figure 1: The continuous learning cycle

    We start by creating the directed models, assuming that the customer has already flagged frauds in the existing data. We evaluate the directed models and then use the best one to predict the frauds in the new data. We also create the undirected models, evaluate them, and use the best one for selection of potential frauds. We do this over time and check the difference between the number or percentage of frauds caught with the directed and the undirected model deployed. When this difference drops, it is time to refine the directed model. In addition, we store the predictions of both models and the actual, confirmed or reported frauds in a data warehouse. When the percentage of the predicted frauds in the total number of frauds drops, it is time to refine both models. We use an OLAP cube on the top of the DW to measure the efficiency of the models over time.

  • Fraud Detection with the SQL Server Suite Part 3

    This is the third part of the fraud detection whitepaper. You can find the first part and the second part in my previous blog posts about this topic.

    Data Preparation

    The problem of credit card fraud detection is not trivial. With every transaction processed, only a limited amount of data is available, making it difficult if not impossible to distinguish between a “good” transaction and a (potentially) fraudulent one. In addition, there are literally millions of points of sales and web sites where a single credit card can be used. Even additional properties that could be available in the card owner’s profile, such as demographical data, probably will not make things much clearer. Do we focus on the customer or on the credit card? From experience, it seems that the customers who use multiple credit cards typically use each card for a specific purpose. Although this means that we might start by profiling the card, it might also be worth checking the customer’s profile, as this might reveal different habits in different cultures.

    It is also quite hard to request specific card properties and/or customer profile properties in advance. Different companies collect, maintain, and have access to different data sets. On the other hand, some common data, like geographical location, time of usage, type of a product purchased, type of a transaction (purchase, cash advance), and similar, can be available to us. A good overview of data, useful for credit card and online banking fraud detection, can be found in

    Hand D.J., & Blunt G. (2001, October). Prospecting gems in credit card data. IMA Journal of Management Mathematics.

    In addition to the source variables, many calculated variables can be extremely handy. If geographic data is not available, it can often be extracted from IP addresses, ZIP codes, and similar source data. Web addresses also might contain country of origin, or, alternatively, business type. Universal product codes (UPCs) contain country of origin and the company that produces it. Many variables can be calculated from the time of the transaction and other data, for example:

    • A flag designating whether multiple transactions have been issued from different IPs and the same person in a particular time frame
    • A flag designating whether transactions from multiple persons and the same IP have been issued in a particular time frame
    • Whether there are multiple persons using the same credit card or user account
    • Whether the total amount of a transaction is near the maximum amount allowed for a particular type of transaction, or whether it is nearer the minimum amount
    • The time of day could be significant: is the day a holiday, a weekday, or was the transaction issued on a weekend, or a particular day during the month
    • The frequency of transactions in a moving time frame
    • The number of distinct transactions in a moving time frame (often, the same kind of a transaction could be repeated regularly)
    • The quantity of deviations from a moving average for the type of the transaction

    We could address the problem with number of detected frauds that is too low in all of the transactions by oversampling, by repeating, or copying, known fraudulent transactions, or by undersampling, by lowering the number of non-fraudulent transactions in the sample used for model training. From experience we have learned to prefer the approach of undersampling. We select non-fraudulent transactions either with simple random sampling, or when we have clearly determined groups of transactions, with stratified sampling. For example, it might be obvious that there are significant differences in the patterns across different countries; by using countries or regions we can define different strata and then randomly select appropriate numbers of cases from each stratum separately. You can read more about sampling at

    Wikipedia. (n.d.). Sample size determination. Retrieved from Wikipedia:

    Different data mining algorithms are more or less prone to giving erroneous predictions when the target state is presented with a low enough frequency. With SSAS algorithms, we usually observe the following:

    • The Microsoft Neural Networks algorithm works best when you have about 50% of frauds in the total sample data set
    • The Microsoft Naïve Bayes algorithm already works well with 10% frauds
    • The Microsoft Decision Trees algorithm even works well with only 1% of frauds

    During the data preparation, we also have to take care of missing values and outliers. Missing values can have a seriously negative effect on a data mining project. However, if the number of missing values is small enough, they can be handled by using a variety of methods:

    • Do nothing (a simple, but rarely a valid approach)
    • Filter out the rows containing the missing data (note that we could also filter out too many rows and lose a pattern)
    • Ignore the column (note that we could also ignore too many columns)
    • Predict the missing values with data mining algorithms, like Decision Trees (note that we could lose variability)
    • Build separate models, for example, one model for all the data (including missing values if the algorithm can handle them), one model for known data, and one model for rows with missing values (this does represent quite a lot of additional work)
    • Modify the operational systems so that the missing values can be collected later (this represents the best alternative, but unfortunately it is also the most difficult to achieve)
    • Replace the missing data with a mean (this is a very popular technique, although we could lose variability again)

    Whenever we make any changes to the data, we are influencing the analysis. Before making any changes, we should determine whether there are any patterns in the missing data. We use data mining for this analysis. For instance, we add a flag with a value of 1 for a row that includes missing values or the value of 0 for rows where all of the variable values for the case are known and present. Then we use a predictive algorithm like Decision Trees to explain this new flag variable with other input variables. The resulting tree should be very shallow, without any strong patterns; otherwise, we have identified a pattern in the missing values. If a pattern has been found, it should be explained, and then we should use an appropriate missing value handling technique that does not alter the data (e.g., build separate models). Otherwise, we prefer to filter the rows with missing values.

    Outliers are rare and far out-of-bound values. They are so far out of bound that they can influence the results of the analyses. Similarly to handling missing values, we determine whether there is any pattern in the outliers before handling them. We can do one of the following to address outliers:

    • Check if the outlier is an erroneous value, and if is, correct it (the best possibility)
    • Do nothing (a simple, but rarely a valid approach)
    • Filter out the rows with the outliers (note that we could end up filtering out too many rows and lose a pattern)
    • Ignore the column (note that we could ignore too many columns)
    • Replace outliers with common (mean) values (note that, of course, we are losing variability)
    • Bin values into equal height ranges (this is a good approach, especially for algorithms that use discrete input variables)
    • Normalize the data values in predefined limited ranges

    Some well-known methods of normalization include:

    • Range normalization.
    • Z-score normalization
    • The logistic (sigmoid) function normalization
    • The hyperbolic tangent function normalization

    Discretization (or binning, or categorization, or recoding) is also useful for other purposes, not only for dealing with outliers. For example, some algorithms, for instance the Microsoft Naïve Bayes algorithm, can accept only discrete input variables. Discretization is performed on a single column. Examples of discretization include:

    • Age
    • Income
    • Transaction amount

    Note that with proper discretization we can compare otherwise incomparable data. For example, if one country income per capita is significantly higher than in another country, then it is difficult to compare the amounts of the transactions directly. However, if we discretize the income into three groups, like low, average and high, and appropriately tailor the group boundaries per country, we get comparable data. Please refer to

    Pyle D. (1999). Data Preparation for Data Mining. Morgan Kaufmann

    for further reading about data preparation for data mining.

    Data Overview

    As already mentioned, data overview activities interleave with the data preparation. In order to find outliers, we must get the idea of the distribution of a variable. We can use Microsoft Office Excel Pivot Tables and Pivot Graphs for this task. However, many times it is faster to use statistical computations and interpret the results. With Transact-SQL queries, we can calculate a lot of useful statistical information.

    For a quick overview of discrete variables, we can use frequency tables. In a frequency table, we can show values, the absolute frequency of those values, absolute percentages, cumulative frequency, cumulative percent, and a histogram of the absolute percentage. OLAP cubes can be used to establish an overview of the frequency distribution for tens, if not hundreds of variables, very quickly.

    For continuous variables, we can use descriptive statistics and calculate the first four population moments: Mean, Standard Deviation, Skewness, and Kurtosis. This gives us a quick impression of the distribution of values of those variables.

    It is also worth checking linear dependencies between pairs of variables. Some algorithms, like the Microsoft Decision Trees algorithm, tend to exclude one variable from the dependent pair in the analysis, while other algorithms, like the Microsoft Clustering algorithm, might find too good a clusters if they use pairs of dependent variables. There are multiple methods for calculating these dependencies:

    • Chi-Squared test for pairs of discrete variables
    • Correlation Coefficient for pairs of continuous variables
    • Analysis of variance (ANOVA) for pairs where one variable is continuous per one discrete variable

    Based on our experience in the field of fraud detection, we developed a much faster method to test all possible linear dependencies between multiple variables, not just pairs. We use the Microsoft Naïve Bayes data mining algorithm, where we declare all variables as input and predictable at the same time; this way, we can determine all important dependencies with a single analysis. Of course, the Naïve Bayes algorithm expects discrete inputs only; however, SSAS can discretize variables on the fly, using different discretization methods. We usually use the Equal Heights method, even though it is typically not useful for the final analysis, because it changes the shape of the distribution; however, it is extremely useful for the overview of linear dependencies, because it retains the maximum amount of information in the variables.

    Variables with an insignificant amount of information are useless in analysis. We measure the amount of information in a variable by calculating the Entropy. As this calculation is quite slow, we do it only for variables, which we suspect contain a low amount of information.

    For more details about the statistics mentioned, please refer to

    Wonnacott T.H., & Wonnacott R.J. (1990). Introductory Statistics, 5th Edition . Wiley.

    For more information on the information theory, please refer to

    Kullback S. (1997). Information Theory and Statistics (Dover Books on Mathematics). Dover Publications.

  • SQL Saturday #274 Slovenia Registrations Closing Soon!

    As always, there are also some limitations for the SQL Saturday #274 Slovenia event. We have limited space in the classrooms, and can cover only a limited number of lunch meals. Therefore, if you plan to attend, you should register very soon. We are positively surprised with the attendance and general acceptance of the event. It is a pleasure and a pain at the same time, but it has to be said: we are closing the registrations soon, there are only a few free places left. Seriously, hurry up! We will try to maintain a waiting list, but please don’t rely just on that.

    I would also like to use this blog to kindly ask all registered attendees to really attend the event. Of course, there are always some who can’t come because of some objective reasons. However, experience tells us that the “no show” percentage is higher for free than for payable events. Please don’t register to the event just for the last exit for the Saturday December 21st, just for the case if you don’t find anything better to do. In addition, please don’t get lazy in the morning of that day, please get up and come to the event. Please understand that if you register and then don’t show up, you are preventing somebody else to attend. Also don’t forget that all expenses are covered by sponsors. They, of course, prefer bigger attendance than smaller. Big “no show” percentage therefore also lowers the probability for future events.

    Thank you for understanding and see you at SQL Saturday #274 Slovenia!

  • SQL Saturday #274 Slovenia Schedule

    Correction: here is the final schedule for the SQL Saturday #274 Slovenia event! (OK, if some of the speakers does not cancel the presentation)





    08:30 – 09:00


    09:00 – 09:15

    Lah, Prajdič, Sarka: Keynote

    09:15 – 09:30


    09:30 – 10:30

    Ozhorasan, Belkis: Features Coming With Microsoft SQL Server 2014

    Parenzan, Marco: .NET Development for the SQL Server Developer

    Stacey, Mark: Analysis Services for the DBA

    10:30 – 10:45


    10:45 – 11:45

    Hristov, Boris: Replay your workload as it's your actual one!

    Boles, Kevin: Common TSQL Mistakes

    Pontoppidan, Kennie: Autogenerating a process data warehouse

    11:45 – 13:15

    Lunch / Vine Tasting Break

    13:15 – 14:15

    Vanhove, Pieter: Harder, Better, Stronger and More Secure

    Radivojevic, Milos: SQL Server 2014 - New Cardinality Estimator

    Kaštrun, Tomaž: Using Data Quality services on Customer related data

    14:15 – 14:30


    14:30 – 15:30

    Koprowski, Tobiasz: 2 AM. A Disaster just began...

    Szeliga, Marcin: Indexing for denormalization

    Reinholdt, Rasmus: Optimizing BI with SQL2014 In-Memory tables

    15:30 – 15:45


    15:45 – 16:45

    Durkin, William: SQL Server Consolidation - Resistance is Futile

    El-Nagar, Shehap: T-SQL Performance Guidelines for better DB stress power

    Kamman, André: Automate your SSIS development with BIML

    16:45 – 17:00


    17:00 – 17:30

    Lah, Prajdič, Sarka: Raffle, Wrap-up

  • Logical and Physical Modeling for Analytical Applications

    I am proud to announce that my first course for Pluralsight is released. The course title is Logical and Physical Modeling for Analytical Applications. Here is the description of the course.

    A bad data model leads to an application that does not perform well. Therefore, when developing an application, you should create a good data model from the start. However, even the best logical model can’t help when the physical implementation is bad. It is also important to know how SQL Server stores and accesses data, and how to optimize the data access. Database optimization starts by splitting transactional and analytical applications. In this course, you learn how to support analytical applications with logical design, get understanding of the problems with data access for queries that deal with large amounts of data, and learn about SQL Server optimizations that help solving these problems.

    Enjoy the course!

  • Fraud Detection with the SQL Server Suite Part 2

    This is the second part of the fraud detection whitepaper. You can find the first part in my previous blog post about this topic.

    My Approach to Data Mining Projects

    It is impossible to evaluate the time and money needed for a complete fraud detection infrastructure in advance. Personally, I do not know the customer’s data in advance. I don’t know whether there is already an existing infrastructure, like a data warehouse, in place, or whether we would need to build one from scratch. Therefore, I always suggest to start with a proof-of-concept (POC) project. A POC takes something between 5 and 10 working days, and involves personnel from the customer’s site – either employees or outsourced consultants. The team should include a subject matter expert (SME) and at least one information technology (IT) expert. The SME must be familiar with both the domain in question as well as the meaning of data at hand, while the IT expert should be familiar with the structure of data, how to access it, and have some programming (preferably Transact-SQL) knowledge. With more than one IT expert the most time consuming work, namely data preparation and overview, can be completed sooner. I assume that the relevant data is already extracted and available at the very beginning of the POC project.

    If a customer wants to have their people involved in the project directly and requests the transfer of knowledge, the project begins with training. I strongly advise this approach as it offers the establishment of a common background for all people involved, the understanding of how the algorithms work and the understanding of how the results should be interpreted, a way of becoming familiar with the SQL Server suite, and more.

    Once the data has been extracted, the customer’s SME (i.e. the analyst), and the IT expert assigned to the project will learn how to prepare the data in an efficient manner. Together with me, knowledge and expertise allow us to focus immediately on the most interesting attributes and identify any additional, calculated, ones soon after. By employing our programming knowledge, we can, for example, prepare tens of derived variables, detect outliers, identify the relationships between pairs of input variables, and more, in only two or three days, depending on the quantity and the quality of input data. I favor the customer’s decision of assigning additional personnel to the project. For example, I actually prefer to work with two teams simultaneously. I demonstrate and explain the subject matter by applying techniques directly on the data managed by each team, and then both teams continue to work on the data overview and data preparation under our supervision. I explain to the teams what kind of results we expect, the reasons why they are needed, and how to achieve them. Afterwards we review and explain the results, and continue with new instructions, until we resolve all known problems.

    Simultaneously with the data preparation the data overview is performed. The logic behind this task is the same – again I show to the teams involved the expected results, how to achieve them and what they mean. This is also done in multiple cycles as is the case with data preparation, because, quite frankly, both tasks are completely interleaved. A specific objective of the data overview is of principal importance – it is represented by a simple star schema and a simple OLAP cube that will first of all simplify data discovery and interpretation of the results, and will also prove useful in the following tasks. The presence of the customer’s SME is the key to resolving possible issues with the actual meaning of the data. We can always replace the IT part of the team with another database developer; however, we cannot conduct this kind of a project without the customer’s SME.

    After the data preparation and when the data overview is available, we begin the scientific part of the project. I assist the team in developing a variety of models, and in interpreting the results. The results are presented graphically, in an intuitive way. While it is possible to interpret the results on the fly, a much more appropriate alternative is possible if the initial training was also performed, because it allows the customer’s personnel to interpret the results by themselves, with only some guidance from me. The models are evaluated immediately by using several different techniques. One of the techniques includes evaluation over time, where we use an OLAP cube.

    After evaluating the models, we select the most appropriate model to be deployed for a production test; this allows the team to understand the deployment process. There are many possibilities of deploying data mining models into production; at the POC stage, we select the one that can be completed quickly. Typically, this means that we add the mining model as an additional dimension to an existing DW or OLAP cube, or to the OLAP cube developed during the data overview phase. Finally, we spend some time presenting the results of the POC project to the stakeholders and managers.

    Even from a POC, the customer will receive lots of benefits, all at the sole risk of spending money and time for a single 5 to 10 day project:

    • The customer learns the basic patterns of frauds and fraud detection
    • The customer learns how to do the entire cycle with their own people, only relying on me for the most complex problems
      • The customer’s analysts learn how to perform much more in-depth analyses than they ever thought possible
      • The customer’s IT experts learn how to perform data extraction and preparation much more efficiently than they did before
      • All of the attendees of this training learn how to use their own creativity to implement further improvements of the process and procedures, even after the solution has been deployed to production
    • The POC output for a smaller company or for a subsidiary of a larger company can actually be considered a finished, production-ready solution
      • It is possible to utilize the results of the POC project at subsidiary level, as a finished POC project for the entire enterprise
    • Typically, the project results in several important “side effects”
      • Improved data quality
      • Improved employee job satisfaction, as they are able to proactively contribute to the central knowledge about fraud patterns in the organization
      • Because eventually more minds get to be involved in the enterprise, the company should expect more and better fraud detection patterns

    After the POC project is completed as described above, the actual project would not need months of engagement from my side. This is possible due to our preference to transfer the knowledge onto the customer’s employees: typically, the customer will use the results of the POC project for some time, and only engage me again to complete the project, or to ask for additional expertise if the complexity of the problem increases significantly. I usually expect to perform the following tasks:

    • Establish the final infrastructure to measure the efficiency of the deployed models
    • Deploy the models in additional scenarios
      • Through reports
      • By including Data Mining Extensions (DMX) queries in OLTP applications to support real-time early warnings
      • Include data mining models as dimensions in OLAP cubes, if this was not done already during the POC project
      • Create smart ETL applications that divert suspicious data for immediate or later inspection
    • I would also offer to investigate how the outcome could be transferred automatically to the central system; for instance, if the POC project was performed in a subsidiary whereas a central system is available as well
    • Of course, for the actual project, I would repeat the data and model preparation as needed

    It is virtually impossible to tell in advance how much time the deployment would take, before we decide together with customer what exactly the deployment process should cover. Without considering the deployment part, and with the POC project conducted as suggested above (including the transfer of knowledge), the actual project should still only take additional 5 to 10 days.

    The approximate timeline for the POC project is, as follows:

    • 1-2 days of training
    • 2-3 days for data preparation and data overview
    • 2 days for creating and evaluating the models
    • 1 day for initial preparation of the continuous learning infrastructure
    • 1 day for presentation of the results and discussion of further actions

    Quite frequently I receive the following question: are we going to find the best possible model during the POC project, or during the actual project? My answer is always quite simple: I do not know. Maybe, if we would spend just one hour more for data preparation, or create just one more model, we could get better patterns and predictions. However, we simply must stop somewhere, and the best possible way to do this, according to my experience, is to restrict the time spent on the project in advance, after an agreement with the customer. You must also never forget that, because we build the complete learning infrastructure and transfer the knowledge, the customer will be capable of doing further investigations independently and improve the models and predictions over time without the need for a constant engagement with me.

  • Save 50% on *All* Data-Related E-Books and Videos @ O’Reilly

    In Celebration of #Strataconf, O’Reilly is offering 50% discount on all data-related e-books and videos for the next three days, starting at 12:01AM PDT (8:00AM CET). Don’t miss the opportunity at

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement