I have to correct the post from a month ago; I had to make some changes to my plans.
1. SQLBits 2014, July 17th – 19th: high time to register; the pre-conference seminars start in two days. I have an interesting pre-con seminar there, Advanced Data Modeling Topics, and a regular session as well. This one is now over.
2. Kulendayz 2014, September 5th – 7th: I guess I will speak there. The most important thing there, however, is to cool down:-) Besides eating Kulen and other Slavonian specialties, I am giving a Data Quality and Master Data Management pre-con and a talk during the conference.
3. SQL Saturday #344 Tirana, October 4th: I have never been to Albania and was looking forward to visiting Tirana. It seems this one is either cancelled or moved.
4. SQL Saturday #311 Bulgaria, October 11th: Visiting Sofia after many years. Nice people, good rakija.
5. Sinergija 14, October 20th – 21st, Belgrade, Serbia: Best barbecue. Besides great sessions, of course.
6. SharePoint dnevi 2014, October 21st - 22nd, Portorož, Slovenia: I think the conference party wouldn’t be the same if I didn’t appear there:-)
7. PASS Summit 2014, November 4th - 7th, Seattle, WA, USA: Can’t miss the most important community conference in the world.
8. SQL Saturday #355, November 22nd, Parma, Italy: I guess you have all heard of Parma prosciutto and Parmigiano Reggiano.
9. SQL Saturday #359, December 6th, Istanbul, Turkey: I am not 100% sure whether I will make it, but I definitely hope to get some real kebab, köfte, dolma, sarma…
10. SQL Saturday Ljubljana, December 13th, Ljubljana, Slovenia: I am co-organizing this event. The date is reserved, and the call for speakers is going out soon. Consider coming – most of the sessions will be in English. Maybe not the best parties, but definitely the longest; Ljubljana lives all night. This will be the best possible ending of the 2014 conferences.
Hope we will meet at some of these events.
You can find many different data modeling resources; it is impossible to list them all. I selected only the ones most valuable to me and, of course, the ones I contributed to.
- Chris J. Date: An Introduction to Database Systems – IMO a “must” to understand the relational model correctly.
- Terry Halpin, Tony Morgan: Information Modeling and Relational Databases – meet the object-role modeling leaders.
- Chris J. Date, Nikos Lorentzos and Hugh Darwen: Time and Relational Theory, Second Edition: Temporal Databases in the Relational Model and SQL – all theory needed to manage temporal data.
- Louis Davidson, Jessica M. Moss: Pro SQL Server 2012 Relational Database Design and Implementation – the best SQL Server focused data modeling book I know by two of my friends.
- Dejan Sarka, et al.: MCITP Self-Paced Training Kit (Exam 70-441): Designing Database Solutions by Using Microsoft® SQL Server™ 2005 – SQL Server 2005 data modeling training kit. Most of the text is still valid for SQL Server 2008, 2008 R2, 2012 and 2014.
- Itzik Ben-Gan, Lubor Kollar, Dejan Sarka, Steve Kass: Inside Microsoft SQL Server 2008 T-SQL Querying – Steve wrote a chapter with mathematical background, and I added a chapter with theoretical introduction to the relational model.
- Itzik Ben-Gan, Dejan Sarka, Roger Wolter, Greg Low, Ed Katibah, Isaac Kunen: Inside Microsoft SQL Server 2008 T-SQL Programming – I added three chapters with theoretical introduction and practical solutions for the user-defined data types, dynamic schema and temporal data.
- Dejan Sarka, Matija Lah, Grega Jerkič: Training Kit (Exam 70-463): Implementing a Data Warehouse with Microsoft SQL Server 2012 – my first two chapters are about data warehouse design and implementation.
- Data Modeling Essentials – I wrote a 3-day course for SolidQ. If you are interested in this course, which I could also deliver as a shorter seminar, you can contact your closest SolidQ subsidiary or, of course, me directly at email@example.com or firstname.lastname@example.org. This course could also complement the existing courseware portfolio of training providers, who are welcome to contact me as well.
- Logical and Physical Modeling for Analytical Applications – online course I wrote for Pluralsight.
- Working with Temporal data in SQL Server – my latest Pluralsight course, where besides theory and implementation I introduce many original ways to optimize temporal queries.
- Forthcoming presentations
- SQL Bits 12, July 17th – 19th, Telford, UK – I have a full-day pre-conference seminar Advanced Data Modeling Topics there.
My third Pluralsight course, Working with Temporal Data in SQL Server, is published. I am really proud of the second part of the course, where I discuss optimization of temporal queries. This was a nearly impossible task for decades; the first solutions appeared only lately. I present altogether six solutions (and one more that is not a solution), four of which I invented. http://pluralsight.com/training/Courses/TableOfContents/working-with-temporal-data-sql-server
My previous blog post “Truncate Table – DDL or DML Statement?” got quite a few comments. Now I am continuing with a similar discussion: is the ALTER TABLE SWITCH [PARTITION] statement DDL (Data Definition Language) or DML (Data Manipulation Language)? At least it is clear that it is not a Data Control Language (DCL) statement.
Again, I can find a lot of arguments why this would be a DDL statement. First of all, we all know the classical categorization of the statements:
· DDL includes CREATE, ALTER, and DROP statements
· DCL includes GRANT, REVOKE, and DENY statements
· DML includes SELECT, INSERT, UPDATE, DELETE, and MERGE statements.
SQL Server changes system pages when you use this statement. In addition, you need elevated permissions to use the statement. Clearly, from the syntax perspective, it is a DDL statement.
However, logically, you just move the data from one table or partition to another table or partition. You do not change the schema at all. Therefore, semantically, this is a DML statement. And again, in my opinion, the logical perspective is the most important here, because this is the main point of the Relational Model: work with it from the logical perspective, and leave the physical execution to the underlying system.
Let me show how this statement works. I start with the clean-up code, in case some of the objects I am going to create in the tempdb system database already exist.
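The clean-up could go along these lines; all of the object names (the partition function PfOrderDateKey, the scheme PsOrderDateKey, and the two helper tables) are assumptions I use throughout the following sketches:

```sql
USE tempdb;
GO
-- Drop the demo objects if they exist from a previous run.
IF OBJECT_ID(N'dbo.FactInternetSales', N'U') IS NOT NULL
  DROP TABLE dbo.FactInternetSales;
IF OBJECT_ID(N'dbo.FactInternetSalesNew', N'U') IS NOT NULL
  DROP TABLE dbo.FactInternetSalesNew;
IF OBJECT_ID(N'dbo.FactInternetSalesOld', N'U') IS NOT NULL
  DROP TABLE dbo.FactInternetSalesOld;
IF EXISTS (SELECT * FROM sys.partition_schemes WHERE name = N'PsOrderDateKey')
  DROP PARTITION SCHEME PsOrderDateKey;
IF EXISTS (SELECT * FROM sys.partition_functions WHERE name = N'PfOrderDateKey')
  DROP PARTITION FUNCTION PfOrderDateKey;
GO
```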
Next, let me create the partition function and the partition scheme, create a demo table dbo.FactInternetSales, and populate it with data from the dbo.FactInternetSales table in the AdventureWorksDW2012 demo database. I will also create two additional tables: one for the new data load, and one for the data from the oldest partition of the dbo.FactInternetSales table.
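A sketch of the setup, with a deliberately short column list and assumed yearly boundary values (the real demo table would mirror the full AdventureWorksDW2012 structure):

```sql
-- One partition per year of internet sales (2005-2007, plus one for newer data).
CREATE PARTITION FUNCTION PfOrderDateKey (INT)
AS RANGE LEFT FOR VALUES (20051231, 20061231, 20071231);
CREATE PARTITION SCHEME PsOrderDateKey
AS PARTITION PfOrderDateKey ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.FactInternetSales
(
  OrderDateKey INT   NOT NULL,
  ProductKey   INT   NOT NULL,
  SalesAmount  MONEY NOT NULL
) ON PsOrderDateKey (OrderDateKey);
GO
-- Populate with everything up to the last boundary value.
INSERT INTO dbo.FactInternetSales (OrderDateKey, ProductKey, SalesAmount)
SELECT OrderDateKey, ProductKey, SalesAmount
FROM AdventureWorksDW2012.dbo.FactInternetSales
WHERE OrderDateKey <= 20071231;
GO
```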
Note that the table for the new data includes a check constraint that guarantees that all of the data can be switched to a single partition of the partitioned table. I am loading the dbo.FactInternetSalesNew table with the latest data you can find in the demo database, the internet sales for the year 2008. Let me check the data in all three tables, and also in all partitions of the partitioned table.
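The two helper tables and the data checks might look like this; the structures must match the partitioned table exactly, and the check constraint on the new-data table limits its rows to the range of a single partition:

```sql
CREATE TABLE dbo.FactInternetSalesNew
(
  OrderDateKey INT   NOT NULL
    CONSTRAINT ckFactNew CHECK (OrderDateKey > 20071231),
  ProductKey   INT   NOT NULL,
  SalesAmount  MONEY NOT NULL
);
CREATE TABLE dbo.FactInternetSalesOld
(
  OrderDateKey INT   NOT NULL,
  ProductKey   INT   NOT NULL,
  SalesAmount  MONEY NOT NULL
);
GO
-- Load the year 2008 internet sales into the new-data table.
INSERT INTO dbo.FactInternetSalesNew (OrderDateKey, ProductKey, SalesAmount)
SELECT OrderDateKey, ProductKey, SalesAmount
FROM AdventureWorksDW2012.dbo.FactInternetSales
WHERE OrderDateKey > 20071231;
GO
-- Row counts per partition of the partitioned table.
SELECT $PARTITION.PfOrderDateKey(OrderDateKey) AS PartitionNo,
       COUNT(*) AS NumRows
FROM dbo.FactInternetSales
GROUP BY $PARTITION.PfOrderDateKey(OrderDateKey);
```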
If you check the results, you can see that there is data in three partitions of the partitioned table, and in the table for the new data. The next step is to switch the data from the new-data table to a single partition of the partitioned table.
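With the assumed RANGE LEFT boundaries, the 2008 rows belong to partition 4, so the switch is a single statement:

```sql
-- Moves all rows of dbo.FactInternetSalesNew into partition 4
-- by swapping allocation metadata - no data is physically copied.
ALTER TABLE dbo.FactInternetSalesNew
SWITCH TO dbo.FactInternetSales PARTITION 4;
```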
Can I do the same thing with a single DML statement? The TRUNCATE TABLE statement works logically like the DELETE statement without a WHERE clause, if the table has no trigger and no identity property. Does something similar exist for the ALTER TABLE SWITCH [PARTITION] statement? The answer is, of course, yes. You can use composable DML statements with the OUTPUT clause. With the next statement, I am moving all of the data from the oldest partition of the partitioned table to the table created for the old data.
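A sketch of that composable DML statement, using the same assumed names:

```sql
-- The inner DELETE empties the oldest partition; the OUTPUT clause streams
-- the deleted rows into the outer INSERT - logically the same effect as
-- ALTER TABLE dbo.FactInternetSales SWITCH PARTITION 1 TO dbo.FactInternetSalesOld.
INSERT INTO dbo.FactInternetSalesOld (OrderDateKey, ProductKey, SalesAmount)
SELECT OrderDateKey, ProductKey, SalesAmount
FROM (DELETE FROM dbo.FactInternetSales
      OUTPUT deleted.OrderDateKey, deleted.ProductKey, deleted.SalesAmount
      WHERE $PARTITION.PfOrderDateKey(OrderDateKey) = 1) AS d;
```

Note that unlike SWITCH, this physically moves and fully logs every row, so it is far more expensive; the equivalence is logical, not physical.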
Let’s check where the data is now, and whether the schema has changed in any way.
If you execute the statements, you can clearly see that the effect of the last composable DML statement was exactly the same as the effect of the ALTER TABLE SWITCH [PARTITION] statement. The schema did not change a bit. Therefore, the ALTER TABLE SWITCH [PARTITION] statement is clearly a DML statement.
Many times, categories of concepts and things overlap. It can be hard to categorize some items in a single category. The SQL TRUNCATE TABLE statement is an example of an item that is not so easy to categorize. Is it a DDL (Data Definition Language) or DML (Data Manipulation Language) statement?
There is an ongoing discussion about this topic. However, if you quickly bingle for this question, you get the impression that the majority is somehow leaning more toward defining the TRUNCATE TABLE statement as a DDL statement. For example, Wikipedia clearly states: “In SQL, the TRUNCATE TABLE statement is a Data Definition Language (DDL) operation that marks the extents of a table for deallocation (empty for reuse).” Disclaimer: please note that I do not regard Wikipedia as the “ultimate, trustworthy source” – I prefer sources that are signed!
Some of the reasons why many people define the statement as a DDL statement include:
- It requests schema locks in some systems
- It is not possible to roll it back in some systems
- It does not include a WHERE clause
- It does not fire triggers in some systems
- It resets the autonumbering column value in some systems
- It deallocates system pages directly, not through an internal table operation
- and more.
On the other hand, it looks like there is only one reason to treat the statement as a DML statement:
- Logically, you just get rid of the data, like with the DELETE statement.
Even the Wikipedia article that I referred to says: “The TRUNCATE TABLE mytable statement is logically (though not physically) equivalent to the DELETE FROM mytable statement (without a WHERE clause).”
As so many times, I have to disagree with the majority. I understand that the categorization is somewhat confusing, and might even be overlapping. However, the only reason for placing the TRUNCATE TABLE statement in the DML category is “THE” reason, in my understanding. One of the most important ideas in the Relational Model is the separation between the logical and the physical level. We, the users, manipulate data on the logical level; the physical implementation is left to the database management system. And this is the important part: logically, when you truncate a table, you don’t care how the statement is implemented internally; you just want to get rid of the data. It really does not matter what kind of locks a system uses, whether it allows a WHERE clause, and so on. The logical point is what matters. Therefore, I would categorize the TRUNCATE TABLE statement as a DML statement.
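To see the logical equivalence side by side (the table name is hypothetical):

```sql
-- Both statements leave dbo.DemoSales empty; only the physical execution differs.
DELETE FROM dbo.DemoSales;     -- row by row, fully logged, fires DELETE triggers
TRUNCATE TABLE dbo.DemoSales;  -- deallocates whole extents, resets any IDENTITY
```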
Of course, this is a purely theoretical question, and it is really not important for your practical implementation. As long as your app is doing what it should do, you don’t care too much about these nuances. However, IMO in general there is not enough theoretical knowledge spread around, and therefore it makes sense to try to get the correct understanding.
But there is always a “but”. Of course, I immediately have another question: what about the ALTER TABLE mytable SWITCH [PARTITION…] TO… statement? ALTER statements have been defined as DDL statements forever. However, again, logically you are just moving the data from one table to another. Therefore – what? What do you think?
Why wouldn’t you take a nice trip to Central Europe? Besides spending some nice days in beautiful cities, you also have a business-type excuse for the trip: you can join two SQL Saturday events. The first one is going to take place in Budapest on March 1st, and the second one in Vienna on March 6th.
I am very happy to meet the SQL community again, and proud to speak at both events, although I would prefer to have the dates turned around. The two sessions selected are Temporal Data in SQL Server, which I will present in Vienna, and Optimizing Temporal Queries, which I will present in Budapest. The sessions are more or less independent; however, I see the Vienna session as a prequel to the Budapest one.
Come, learn and enjoy Central Europe!
It is hard to imagine searching for something on the Web without modern search engines like Bing or Google. However, most contemporary applications still limit users to exact searches only. For end users, even the standard SQL LIKE operator is not powerful enough for approximate searches. In addition, many documents are stored in modern databases; end users would probably like powerful search inside document contents as well. Text mining is also becoming more and more popular; everybody would like to understand data from blogs, Web sites, and social media. Microsoft SQL Server 2012 and 2014 substantially enhance the full-text search support that was available in previous versions. Semantic Search, a new component of Full-Text Search, can help you understand the meaning of documents. Finally, the Term Extraction and Term Lookup components from SQL Server Integration Services also help with text analysis.
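For example, even a simple full-text query goes beyond what LIKE can do; a sketch assuming a hypothetical dbo.Documents table with a full-text index on the DocContent column:

```sql
-- LIKE N'%mine%' would find only the exact substring; CONTAINS with FORMSOF
-- also matches inflectional forms such as mines, mined, and mining.
SELECT DocId, DocTitle
FROM dbo.Documents
WHERE CONTAINS(DocContent, N'FORMSOF(INFLECTIONAL, mine)');
```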
I am proud to introduce the second course I have authored for Pluralsight - Indexing, Querying and Analyzing Text with SQL Server 2012-2014. You can learn how to get the most out of your texts with SQL Server tools. Enjoy!
As a speaker, I would like to help promote the conferences where I am proud to present. Of course, this does not mean that other conferences are any worse. Anyway, here is the list of the conferences where I am going to speak, and I would like to invite you to join me.
- January 21st, NTK Zima 2014, Ljubljana, Slovenia
- February 5th, 24 Hours of PASS: Business Analytics Edition, online event
- March 1st, SQL Saturday #278, Budapest, Hungary
- March 6th, SQL Saturday #280, Vienna, Austria
- March 31st – April 4th, DevWeek 2014, London, UK
- April 9th – April 10th, NTK Pomlad 2014, Bled, Slovenia
- April 12th, SQL Saturday #267, Lisbon, Portugal
- May 7th – May 9th, PASS Business Analytics Conference, San Jose, CA, USA
Looking forward to all of these events. I hope we will meet in at least one of them!
The UNION, INTERSECT and EXCEPT operators are commonly called set operators. For example, in Books Online you can find a topic “Set Operators”, where these three operators are explained. They are supposed to represent the set operations UNION, INTERSECT and MINUS (a synonym for EXCEPT DISTINCT). Wikipedia also has an article called “Set operations (SQL)”, where these three operators are introduced. And these operators are commonly represented with Venn diagrams, which are, logically, also called set diagrams. Here are the three operators presented with Venn diagrams:
However, is the name “set operators” really correct? The first question I asked myself was very simple: why would we have ten or more relational operators and three set operators in the relational algebra? That makes no sense. The relational algebra comprises relational operators only, of course.
So what exactly is a relation? A relation is a special kind of set: a set of entities that are related, i.e., that are of the same kind. How do we know that two entities are of the same kind and can thus be grouped in a single entity set, i.e., in a relation? Of course, two entities are of the same kind if they have the same attributes. Therefore, every relation is a set; however, not every set is a relation.
Set operators work on sets and produce a set. Relational operators work on relations and produce a relation. The SQL operators UNION, INTERSECT and EXCEPT produce relations, i.e., special kinds of sets. The set operator UNION can combine a set of differential equations and a set of hammers into a single set. The relational operator UNION can’t combine a relation of differential equations and a relation of hammers into a single relation, because the elements of these two relations have nothing in common. And don’t think that if you take only the keys of both relations, and both have a single-column integer key, a UNION of these would be a relation. First of all, you can perform such a union only because we don’t use strong types in a relational database (each key should be of its own type – in this case, you should have a “hammer” and a “differential equation” key type, which would disallow such operations). In addition, in the case I mentioned, you would get a relation with a single attribute, the key only, which would probably be meaningless from the business perspective. A relation without a meaningful attribute is not really an entity set, as defined by Peter Chen. An entity is something we can identify and that is of interest. If we don’t have any real attribute, then this “thing” is definitely not an entity, because without attributes it can’t be of any interest.
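To make the distinction concrete: the SQL operators require operands with the same heading, i.e., relations of the same kind. A sketch with hypothetical relations of customers:

```sql
-- Both inputs have the same heading, so the results are relations as well.
SELECT CustomerId, CustomerName FROM dbo.CustomersEU
UNION
SELECT CustomerId, CustomerName FROM dbo.CustomersUS;  -- customers in either region

SELECT CustomerId, CustomerName FROM dbo.CustomersEU
INTERSECT
SELECT CustomerId, CustomerName FROM dbo.CustomersUS;  -- customers in both regions

SELECT CustomerId, CustomerName FROM dbo.CustomersEU
EXCEPT
SELECT CustomerId, CustomerName FROM dbo.CustomersUS;  -- customers in the EU only
```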
To summarize: the SQL UNION, INTERSECT and EXCEPT operators are simply relational operators. Talking about them as set operators is at least imprecise. However, representing them with Venn diagrams is not just imprecise, it is wrong. Here is a better presentation of these three relational operators.
I had the opportunity to read the SQL Server 2012 Reporting Services Blueprints book by Marlon Ribunal (@MarlonRibunal) and Mickey Stuewe (@SQLMickey), Packt Publishing. Here is my short review.
I find the book very practical. The authors guide you right to the point, without unnecessary obstructions. Step by step, you create more and more complex reports, and learn the SQL Server Reporting Services (SSRS) 2012 features. If you are a more hands-on guy, this is the right book for you. Well, to be honest, there is not much theory in reporting. Reporting is the starting point of business intelligence (BI), with on-line analytical processing as the next step, and data mining as the I in BI.
The book correctly presents all SSRS features. In some chapters and appendices, the authors also go beyond the basics. I especially enjoyed the advanced topics in chapter 5, “Location, Location, Locations!”, appendix A, “SSRS Best Practices”, and appendix B, “Transactional Replication for Reporting Services”. Altogether, the book is more than sufficient for creating nice reports in a short time without much previous knowledge, while avoiding common pitfalls at the same time.
The only thing I miss is a bit more theory. Yes, as a data mining person, I like to learn things in more depth. I usually don’t deal with presentation; I prefer numbers. And this is my point – I would like to see more guidelines about the proper usage of report elements: when to use which graph type, when to use maps, how to properly present different kinds of data…
Anyway, altogether, the book is very useful, and I would recommend it to anybody who wants to learn SSRS in a short time.
This is the fifth, the final part of the fraud detection whitepaper. You can find the first part, the second part, the third part, and the fourth part in my previous blog posts about this topic.
In the original fraud detection whitepaper I wrote for SolidQ, I was advised by my friends to include some concrete and simple numbers for calculating the return on investment (ROI), in a language that managers can understand. With some customers, we really managed to get very impressive numbers. However, I am not repeating this calculation here. This is my personal blog, and therefore I am writing my personal opinion here.
I am kind of bored with these constant requests to show simple numbers that managers can understand. Personally, I don’t think managers are so stupid that they would not understand anything beyond primary-school mathematics. And even if some of them are that dumb, I don’t care, as they can’t become my customers. They would never be able to understand the value of such an advanced technique as data mining.
I am pretty sure that the vast majority of managers can calculate an approximate ROI by themselves, and better than I can. They definitely know their business better than I do, and they already know how much money they are losing because of fraud and how many frauds they are already preventing or catching early. In addition, I am pretty sure that most managers understand the value of learning and appreciate building the learning infrastructure.
Therefore, in short, I leave it to you, the reader, to evaluate what you can expect from implementing a fraud detection continuous learning cycle. And thank you for understanding my point!
Fraud detection is a very popular, albeit very complex, data mining task. I have developed my own approach to fraud detection. The most significant element of this approach is the continuous learning cycle.
Although Microsoft SQL Server is not the most popular tool for data mining, I am using it. The SQL Server suite gives us all of the tools we need, and because all of the tools come from a single suite, they work perfectly together, thus substantially lowering the time needed to bring a project from the initial meeting to a production-ready deployment.
Another advantage of my approach is the mentoring with knowledge transfer. It is not my intention to get permanent consulting contracts; I want to progress together with my customers. Once we finish the project, or sometimes even as soon as we finish the POC project, the customer can begin using the fraud detection system and continue improving it constantly with the help of the continuous learning infrastructure.
Finally, due to Microsoft’s licensing policies, the customers that already possess Microsoft SQL Server Standard Edition or higher and Microsoft Excel, do not need to purchase any additional licenses.
I could simply stop here. However, I want to mention again everybody involved in this, and also some who were unfortunately missing.
First of all, PASS is the organization that defined SQL Saturdays. And apparently the idea works.
I have to thank all of the speakers again. Coming to share your amazing knowledge is something we really appreciate. The presentations were great, from the technical and all other perspectives.
Of course, we could not do the event without sponsors. I am not going to list all of them again; I will just mention the host, pixi* labs, the company that hosted the event and whose employees helped with all of the organization. In addition, I need to mention Vina Kukovec: Boštjan Kukovec, a long-time member of the Slovenian SQL Server and Developers users group, organized a free wine tasting after the event. And what a wine it is!
Finally, thanks to all the attendees for coming. We had approximately an 85% show-up rate; only 15% of the registered attendees didn’t come. This is an incredible result, worldwide! And from the applause after the raffle, when we closed the event (and started the wine tasting), I conclude that the attendees were very satisfied as well.
I want to mention three people who wanted to come, but ran out of luck this time. Peter Stegnar from pixi* labs was the one who immediately offered the venue and infected the other pixi* labs members with his enthusiasm. Due to family reasons he couldn’t join us to see the results of his help. Tobiasz Janusz Koprowski wanted to speak, organized his trip, and looked forward to joining us; however, just a couple of days before the event he had to cancel because of some urgent work at a customer’s site. And what to say about Kevin Boles? He really tried hard to come. Think of it, he was prepared to come from the USA! He was already at the airport when his flight got cancelled due to technical problems. We were in constant touch on Friday evening. He managed to change the flight and went to the gate, but was not admitted to the plane, because there were only seven minutes left till take-off. Catching the next flights would not have made any sense anymore, because he would have arrived too late anyway. He really did the best he could; he just didn’t have enough luck this time. Peter, Tobiasz, and Kevin, thank you for your enthusiasm; we seriously missed you, and we certainly hope we will meet at our next event!
I would like to acknowledge all of the sponsors that enabled the event. The following companies provided invaluable help to us, and therefore we are thanking them. I think they deserve this exposure.
SQL Saturday #274 Slovenia has been full (150 registered attendees) for more than a week, and there are still some people on the waiting list. We sent an e-mail to the registered attendees asking them to unregister if they already know they can’t make it. I have to thank all those who followed this advice and made room for some of the people from the waiting list. Apparently, we will really have a full house, and a very low no-show rate.
I am positively surprised by everything so far: fantastic support from PASS, immediate help from sponsors, a huge number of speakers prepared to spend their own money and time to come over and present high-end topics, and passionate attendees who registered far in advance and are fair enough to release their place for somebody who can come more surely than they can. Let me also mention that we got the classrooms for free, we got a minimal price for the lunch, and we even got a free wine tasting during the event. A fantastic response from PASS, from the SQL Server community, from sponsors, and from my fellow countrymen in Slovenia!
See you all on Saturday. We will all enjoy it; this is a promise!
This is the fourth part of the fraud detection whitepaper. You can find the first part, the second part, and the third part in my previous blog posts about this topic.
We create multiple mining models by using different algorithms, different input data sets, and different algorithm parameters. Then we evaluate the models in order to find the most appropriate candidates for the actual deployment to production.
Many different algorithms can be used for fraud detection; it is difficult to say which one would generally yield the best result. In a project, the algorithms used are typically chosen based on experience and knowledge of the given domain. Because we use the Microsoft SQL Server suite, we use the Microsoft Decision Trees, Microsoft Neural Network, and Microsoft Naïve Bayes directed algorithms, and Microsoft Clustering for the undirected one. In recent years, Support Vector Machines methods have become more and more popular. SSAS does not bring this algorithm out of the box. However, it can be downloaded as a free plug-in algorithm for SSAS from the Microsoft CodePlex site at
Valkonet, J. (2008). Support Vector Machine plug-in in Analysis Services. Retrieved from Microsoft CodePlex: http://svmplugin.codeplex.com/
Of course, if there are time and software policy constraints that prevent us from using this download, we simply skip it. We do not lose much, because, according to
Sahin Y., & Duman E. (2011). Detecting Credit Card Fraud by Decision Trees and Support Vector Machines. Proceedings of the International MultiConference of Engineers and Computer Scientists 2011 Vol 1. Hong Kong: IMECS.,
the Decision Trees algorithm usually yields better results in fraud detection analysis than Support Vector Machines. For details on specific data mining algorithms, please refer to
Han J., Kamber M., & Pei J. (2011). Data Mining: Concepts and Techniques, Third Edition. Morgan Kaufmann,
or to the SolidQ course
Sarka D. (2012). Data Mining with SQL Server 2012. SolidQ. Retrieved from http://www.solidq.com/squ/courses/Pages/Data-Mining-with-SQL-Server-2012.aspx.
We evaluate the efficiency of different supervised models by using standard techniques, namely the Lift Chart, the Classification Matrix, and Cross Validation. All of these techniques are built into the SSAS data mining feature and are described in more detail in
MacLennan J., Tang Z., & Crivat B. (2009). Data Mining with Microsoft SQL Server 2008. Wiley.
To evaluate the Clustering models, we have developed a technique of our own, based on entropy. If the individual clusters are homogeneous, the entropy in any given cluster must be low. We calculate the average entropy and the standard deviation of the entropy across all clusters. In a trained (or processed) SSAS Clustering model, it is possible to read the model data with DMX queries. In the cluster nodes we can identify the distribution of the input variables, and then use it to calculate the entropy.
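A sketch of the calculation, with made-up probabilities standing in for the distributions read from the model content (the two-argument LOG function requires SQL Server 2012 or later):

```sql
-- Per-cluster distribution of one input variable, as it would be read
-- from the model's content with a DMX query (numbers are illustrative).
DECLARE @dist TABLE (ClusterId INT, StateValue NVARCHAR(20), p FLOAT);
INSERT INTO @dist VALUES
 (1, N'NoFraud', 0.97), (1, N'Fraud', 0.03),  -- homogeneous cluster: low entropy
 (2, N'NoFraud', 0.55), (2, N'Fraud', 0.45);  -- mixed cluster: high entropy

WITH ClusterEntropy AS
(
  SELECT ClusterId, -SUM(p * LOG(p, 2)) AS H
  FROM @dist
  WHERE p > 0
  GROUP BY ClusterId
)
SELECT AVG(H) AS AvgEntropy, STDEV(H) AS SdEntropy
FROM ClusterEntropy;
```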
From experience, we have learned that not all algorithms are equally useful for all data sets. The Microsoft Neural Network algorithm works best when the frequency of the target state (i.e., fraud) is about 50%. Microsoft Naïve Bayes can work well when the target state is represented by approximately 10% of the population or more. Microsoft Decision Trees, however, work well even if the frequency of the target state is only about 1%, and are thus very suitable for small data sets and a low frequency of the target state as well.
The continuous learning cycle is shown graphically in Figure 1.
Figure 1: The continuous learning cycle
We start by creating the directed models, assuming that the customer has already flagged frauds in the existing data. We evaluate the directed models and then use the best one to predict the frauds in the new data. We also create the undirected models, evaluate them, and use the best one for selection of potential frauds. We do this over time and check the difference between the number or percentage of frauds caught with the directed and the undirected model deployed. When this difference drops, it is time to refine the directed model. In addition, we store the predictions of both models and the actual, confirmed or reported frauds in a data warehouse. When the percentage of the predicted frauds in the total number of frauds drops, it is time to refine both models. We use an OLAP cube on the top of the DW to measure the efficiency of the models over time.