In 1994, I learned a method for data modeling that is based on three principles. I immediately knew that these principles should embraced by anyone who does any data modeling or process modeling. Or almost any other job, for that matter. I have described these principles in three previous blog posts: the Jargon Principle, the Concreteness Principle, and the Reproducibility Principle. But I have later found that there are more principles and guidelines that are important to keep in mind when modeling.
I almost hear you think: “Yes, avoid redundancy. Duh! Do you have any more open doors to kick in?”
But there is actually more to this than meets the eye. I’ll start with how I word this guideline, and then comment on it. Note that there are actually three parts to this guideline.
“No data model shall contain any unchecked redundancy.
“Process models should avoid making the processor do redundant actions.
“The modeler shall strive to avoid redundant work for the domain expert.”
As you see, the three parts apply to three different parts of the work of a modeler: the data model, the process model, and the process to create these models.
Redundancy in the data model
Many sources say that there should never be any redundancy in a data model (in an OLTP workload, that is; data warehouses are a completely different game). As you can see in the first part of my guideline to avoid redundancy, my opinion is a bit more nuanced than that – I say that no data model should contain any unchecked redundancy. And yes, that means that I believe redundancy in a data model can be okay – as long as the developer is aware that it exists, has a good reason to introduce or keep it, and makes sure that this is all thoroughly documented.
With the concreteness example in mind, I will immediately give an example of a situation where I think redundancy is not only acceptable, but even necessary. Just think of the database that a bank uses for your checking account. It will probably include a table that has a single row for each account, including information such as the date it was opened, a reference to the account holder, the currency used for the account, etc. And it will also include a table with a row for each transaction, that includes the transaction date, the amount transferred, an optional description, and of course a reference to the account.
Would you include a column for the current balance in the account table? I bet you would! And yet, this column would be redundant – after all, the current balance can easily be calculated by adding together all the values in the transaction table. So if you believe that there should never be any redundancy in a data model, you should not include this current balance column, but instead compute it whenever it’s needed. Nice in theory – but in practice, that would probably cost too much performance. Which is why every sane data modeler would (and should) include the current balance in the account table. But (s)he must also document this, and any other redundancy that (s)he decides to introduce or keep in the data model – for accountability, for review at a later time, and (maybe the most important) to ensure that the QA department will include some thorough tests to thwart the risk that this redundancy leads to inconsistent data.
Redundancy in the process model
The process model is the blueprint for the program. When the process model contains redundancy, the computer has to do the same work more than once. And that is always wrong, right? Well, no. Not always.
For example, imagine a table with information about items that have to be shipped. Three of the columns in that table would be Net weight, Tare (packaging) weight, and Gross weight. If you store all three of them as normal columns, you avoid making the computer do extra work, but at the cost of redundancy in the data model. If you don’t include the Gross weight column, or define it as a computed column, you remove the data redundancy – but now, the computer has to compute it every time you query the table. That’s redundancy in the process model. And in this case, other than in the previous example (the checking account), the cost to compute the Gross weight when the table is queried has so little effect on performance that I (and I hope all other modelers) would immediately chose to use a computed column for Gross weight, or to not include it at all.
This is just one example to illustrate that redundancy in the process model can be okay. But (just as with redundancy in the data model) it should be a conscious and well-documented decision, not an accident. So the modeler “should avoid making the processor do redundant actions” – but not at all cost.
Redundancy for the domain expert
The third part of the guideline to avoid redundancy concerns itself with redundant work for the domain expert, the dialog partner of the modeler. A very simple other way to say this, is that you shall not ask the same question more than once. Nor any different but equivalent question.
But it is no coincidence that this guideline is the last of the three, and that the wording is quite weak (“shall strive”). The correctness of the data and process model is always the most important. Sometimes, the developer already has the answer to a question, but the wording is a bit ambiguous. In such a case, (s)he should always go to the domain expert and make sure to get an unambiguous answer – even if the domain expert may feel this to be redundant.
But don’t go overboard with this! There is no faster way to unemployment than to keep bugging the domain expert with countless superfluous questions. When working without any method, always keep in mind how you expect the domain expert to experience the questions you ask. And when using a method, choose a method that ensures no redundant questions are asked.
Redundancy for the modeler
Maybe you have been wondering why there is no fourth element to the guideline to avoid redundancy, one that applies to the modeler him- or herself. Does this mean that I think that a modeler should just repeat each task two, three, five, or even twenty times?
Sure! Knock yourself out, go crazy!
Well, okay, not really. If you repeat everything twenty times, you will take so much time for your work that you will pretty fast find that you no longer have any work to do.
And yet, I do not include this in the guideline to avoid redundancy. For good reason. Two good reasons, actually. First, it’s not needed – everyone hates repeating the same work over and over again, so no guideline to avoid doing that is ever needed. And why should I include redundancy in a guideline about avoiding redundancy?
But more important – as a result of the Reproducibility Principle, there is actually a good reason to duplicate your work. If you have finished a task and then do it a second time, the result should be the same. If it’s not, you have obviously done something wrong – either the first time you did it, or the second time. If the results are the same, that’s still no guarantee that you made no mistakes, but the chance is a lot lower. So doing a task twice can be a valuable check for errors, rather than redundant work.
The bottom line of this post is pretty simple. Redundancy is pretty bad in almost all situations, but there are always exceptions. Avoid redundancy, but not at all costs.
optimizer is the part of SQL Server that takes your query and reorders and
rearranges your query to find the optimal execution plan. In theory.
practice, that doesn’t always work out well. Often, the optimizer manages to
come up with brilliant ways to execute a complex query very efficiently – but sometimes,
it misses an option that appears to be so simple that you can only stare in
utter amazement at the execution plan before going to the Connect site.
Here is an
example I recently ran into. I tested it on SQL Server 2012 and on SQL Server
2008 R2, and it reproduces on both. Execute the query below in the
AdventureWorks sample database, with the option to Include Actual Execution Plan
enabled (Ctrl+M), or request an Estimated Execution Plan (Ctrl-L).
RANK() OVER (ORDER BY SalesLastYear) AS Rank1,
RANK() OVER (ORDER BY SalesYTD) AS Rank2
ORDER BY SalesLastYear;
following the flow of data (by reading the execution plan from right to left),
we see that the data is first read (with an unordered clustered index scan,
since there is no better index available), then sorted by SalesLastYear, so
that Rank1 can be computed (using two Segment operators and a Sequence Project operator
– don’t ask). After that, the rows are sorted again, now by SalesYTD, and we
see another combination of two Segment and one Sequence Project, for the
calculation of Rank2. And then, finally, the rows are re-sorted by SalesLastYear
so that they can be returned in the requested order.
Now the big
question is: why does the plan include two sort operators that both sort the
data in the same (SalesLastYear) order? If the task of the optimizer is to find
smart ways to rearrange computation order for better performance, why doesn’t
it simply compute Rank2 first and Rank1 after that? In that case, the rows are
already in the SalesLastYear order after the last Sequence Project, so no extra
sort is needed. The execution plan of the query below confirms this suspicion:
RANK() OVER (ORDER BY SalesYTD) AS Rank2,
RANK() OVER (ORDER BY SalesLastYear) AS Rank1
ORDER BY SalesLastYear;
execution plan of this query includes only two Sort operators instead of the
three we had in the first execution plan. If you include both queries in a single
batch, you’ll see an estimated cost of 59% for the first query, and 41% for the
second. (Some people think that the percentages shown in an Actual Execution
Plan are an indication of the actual cost; that is not the case – the percentages
are computed from the Estimated Subtree Cost property of
the left-most SELECT operator). The SalesTerritory table is too small to
measure any actual performance differences, but I tried queries with a similar
pattern on the SalesOrderDetail table (121,317 rows), and on
FactProductInventory in AdventureWorksDW (776,286 rows) and I did see an actual
difference in execution time. No surprise, but now I know for sure!
So, we have
seen that simply reordering the two columns that use an OVER clause reduces the
query cost by about 30%. How is it possible that such a simple, basic
reordering strategy is not considered by the optimizer? Surely, this can only be
Fabiano Neves Amorim thought when he filed this bug
on Connect. But, as you can see, the bug has been closed as “By design”. That
probably doesn’t mean that someone wrote a design document telling the
optimizer team to make sure that OVER clauses must always be evaluated in the
order in which they appear in the query, even if a different order would be
cheaper. I rather think of it as “missing an optimization opportunity is not a
bug; the results are still correct, just a bit slower – so we’re going to close
this “bug” report”. In this case, maybe the optimization opportunity was not
identified during the design phase, or it was just too hard to implement. The
latter statement may sound ridiculous at first (how can such a basic rewrite be
too hard?), but you have to be aware that the optimizer does not operate on the
original query text, but on an internal representation that is based on
mathematics and set theory. Rewrites that may be complex in the query text may
be obvious in this representation, but the reverse can also be true – so I’m
prepared to accept the comment that Andrew Richardson made on behalf of
Microsoft to the above Connect item: that it would be fairly complicated for
the Query Optimizer.
not mean I agree with the rest of Andrew’s comment. He suggests that this is a
case where we should not rely on the optimizer, but rewrite our queries
ourselves, especially since it’s such an easy rewrite in this case. Well, I
would agree with that, except that: (a) this missed optimization opportunity is
not documented, so how are developers supposed to know that they may need to
reorder columns in a SELECT clause for optimal performance? (that is one of the
reasons for this blog post); and (b) the behavior of the optimizer in this
situation is not documented, so it can change at any time; I’d hate to rewrite
all my queries and then find that the sysadmin just installed a patch and now
the optimizer always starts with the last
instead of the first OVER clause (or, worse, I don’t find it and just get all
the bad performance right back).
Andrew is right in so far that, at this time, rewriting queries does
consistently improve performance in all my tests. So at this time, rewriting
does seem to be the right thing to do. Just keep in mind that you have to test
all your queries, not only on your test server but also on your production
hardware, and that you’ll have to repeat these tests on a regular basis (at
least after each patch, CU, service pack, or other change).
rewrite pattern is simple – for each query that uses OVER clauses with
different ORDER BY subclauses as well as an ORDER BY clause on the query that
matches one of the ORDER BY subclauses, make sure that the OVER clause that
uses the matching ORDER BY comes last in the SELECT list. If you have a client
that expects the columns in a particular order, the rewrite becomes a bit more
complex – in that case, you have to use a CTE that includes the OVER clauses in
the optimized order, and then you can reorder the columns in the query that
references the CTE – as shown in this example:
AS (SELECT TerritoryID,
RANK() OVER (ORDER BY SalesYTD) AS Rank2,
RANK() OVER (ORDER BY SalesLastYear) AS Rank1
ORDER BY SalesLastYear;
rewrites are indeed the best option for the time being – but I still think that
the optimizer should be improved to do these rewrites internally. So I decided
to file a
new item on Connect, this time labeling it as a suggestion instead of a
bug. If you agree with me that this would be a great investment of Microsoft’s
engineering time, then please add your vote to this item. (Or vote it down if
you think I’m just being ridiculous). But don’t stop there! Microsoft knows me;
they know I’m a geek who plays around with options like this and then runs into
this issue. No real production databases were hurt during the production of
this blog. And if I am the only one, then, frankly, I myself will say that they
have better things to do with their engineering time. However, if I know that this
affects real people, I can make a much stronger case to Microsoft for getting
So – go out
to your production servers and find if you use queries with this pattern (two OVER
clauses with different ORDER BY and an ORDER BY on the final query), then check
to see if you should rewrite them. And then report back – add a comment here or
on the Connect item; share if this affected you, and how much performance you
were able to gain as a result of the rewrite.
Microsoft knows that their customers would actually benefit, they’ll be much
more inclined to add this improvement to the optimizer then if it’s just about
me, a geek moaning about an edge case that no one in the real world would ever
Almost a week has passed after SQLBits X in London, so I guess it’s about time for me to share the slides and demo code of my session on columnstore indexes. After all, I promised people I would do that – especially when I found out that I had enough demos prepared to fill two sessions!
I made some changes to the demo code. I added extra comments, not only to the demos I could not explain and run during the session, but also to the rest, so that people who missed the session will also be able to benefit. I also found and fixed the error that caused one of my demos to fail. It turned out to be as embarrassing as it was unspectacular – somewhere along the way, I must have accidentally fat-fingered the backspace button while the cursor was on the name of an index. And if the index name doesn’t match, queries against index-related DMVs tend to produce no results. <sigh>
After fixing this typo, I re-ran all demos and they now worked flawlessly.
One major catch (and those who were in my session already know this). I ran my demos on a database that I got from within Microsoft, and I have no permission to redistribute this database. That means that people can only study the code, but not run it – well, okay, they can run it, on the “small” version of the database and table (change database names AdventureWorks2008DWXL and AdventureWorks2008DWBig to AdventureWorks2008DW, and change table names FactResellerSalesXL and FactResellerSalesPart to FactResellerSales), but with the size of that table, I expect the optimizer to make completely different choices for the execution plans. So while you can’t see the actual performance benefit by running the code yourself, you can still learn the patterns to use to work around the many limitations of columnstore indexes and batch mode.
I normally prefer to use demo code that any attendee can replicate on their own test databases, but in this case, I simply did not have the time to make a realistic 100+-million row table, and I did not want to demonstrate columnstore indexes on an unrealistic and heavily skewed table or on a table that is too small to show the performance benefit.
I have spent the last three days at SQLBits X in London – a truly great experience! There were lots of quality sessions, but I also enjoyed meeting new people and catching up with old friends. One of these friends (and I hope he’s still a friend after I post this) is Buck Woody. Not only a great and humorous speaker, but also a very nice fellow – for those who don’t mind being teased every now and then.
When we were chatting, he told me that he was planning to announce a special access code to allow attendees of the SQLBits X conference to buy a SQL Azure subscription at a true bargain price. He told me how he had worked with various departments to make sure everything was set up, so that attendees who got the code would be able to sign up for a limited time (expiring today, at midnight GMT), and how he even made sure to pass the truly Buck-style access code by Legal, just in case – and then, when he already was in London, he got an email from higher management telling him he could not do it, so he had to remove this slide from his deck and change his talk.
However, since Buck told me that everything was already set up for the offer, and he had told me the funny special offer code, I decided today to just try and see what happens if I put it in. Who knows, maybe someone in Microsoft HQ just forgot to disable the discount code – and lo and behold, the offer actually worked! And since Buck (probably expecting the code to be disabled) has not asked me to keep this information secret, I decided to write this blog post.
So here are the details of the offer – for a very limited time only, you can get a 6-month subscription to a 50 GB SQL Azure database for the price of £9.99/month instead of the normal price (£64.23/month) – that is a whopping 84% discount!
To use this offer, simply go to the SQL Azure pricing page, drag and draw the database bar to exactly 50 GB, change the currency to “British Pound (£)” (This is essential! The offer will not work otherwise. You can change the actual currency back to your currency later), and pick the “6-Month Plans” option. The price should now display “£64.23/mo”. Then, click “review + purchase”. A new popup window will now open, asking if you have a special offer code. In this popup window, enter the code “BuckWoodyIsTrulyAwesome!!!” (using no spaces, upper- and lowercase as indicated, and with all three exclamation marks), then click “Ok”, and the quoted price will change to “£9.99/mo”. After this, you can continue the setup process.
Final note: After sharing this information with a few of my friends, I heard that for some people, the popup window did not appear. I don’t know why, and since I don’t think Buck ever intended me to share this code that he was not allowed to mention in his sessions, I am not planning to ask. If you don’t get a popup after following the steps above, check that your popup blocker is disabled; if that still doesn’t help, cancel the subscription process, unless you are willing to pay the full price. You can always try again in a year or so; maybe you are more lucky then.
Finally, remember that this offer IS time-limited. You only have until midnight GMT to use it; on April 2nd, it should no longer work.
EDIT: To prevent confusion by people who land on this page later - this post was published on April 1st, 2012, and it was an April Fools prank. The first paragraph is true: I did go to SQLBits, I did have a great time there, I did manage to catch up with Buck Woody, and he is a really nice fellow. But everything else in this post is pure fiction. There is, unfortunately, no huge savings offer for SQLBits - at least not one I am aware of.
Almost two months have passed since my last blog post. And while it’s true that I’ve had (much) longer breaks, I do have a good reason now. All the time that I would normally at least in part spend on preparing new blog posts is now reserved for preparing presentations for a few upcoming events. I’ll give you an overview – who knows, maybe you’ll have a chance to attend one of them and meet me there? I’m looking forward to it!
On Saturday, February 25, I’ll present my session on “Advanced Indexing” at SQL Saturday 108, in Redmond, WA. This session covers included columns, filtered indexes, indexed views, tools for finding the right indexes, and the dangers of over-indexing. I see several top speakers on the agenda, so this is definitely an event you don’t want to miss out – especially considering the price! Unfortunately, I hear that the event is fully booked, but you can still get yourself added to the waiting list. If other people have to cancel, you may get a chance to get admitted.
I’ll stay in the Redmond area the whole next week, attending the MVP Summit. Not as a presenter, but to soak up all the information the product team wishes to share with us, and to give feedback on current and planned functionality. And to have lots of funs with my fellow MVPs, of course!
Shortly after that, I’ll be heading to SQL Saturday 115, in Lisbon, Portugal. I will first present a full-day training session on database design on Friday, March 16. The design method I’ll teach is one that observes all the principles of modelling I presented in some previous blog posts. The next day, I’ll present two sessions: the one on advanced indexing I already mention above; and one about the MERGE statement, that covers not only the intended use but also several other interesting possibilities of this statement – and warns about some caveats that you need to know if you want to use MERGE. This event has an awesome line-up of speakers, so if you see a chance to go to Lisbon, do so!
And finally, I’ll head to London for SQLBits X, which lasts from Thursday, March 29 until Saturday, March 31. On Thursday, I’ll present a training day on database design, so if you can’t attend in Lisbon, you get a second chance in London. I then have a day off on Friday, which gives me a chance to attend some of the superb sessions that are scheduled for this day; on Saturday, I will present a session on columnar indexes, a new feature introduced in SQL Server 2012 that has the potential to yield enormous performance benefits for queries that have to process large tables of read-only (or mainly read-only) data. In this session, I will peek below the hood and disclose a bit about how this new kind of indexes works, and why that results in such enormous performance benefits.
So, if you were waiting for my next blog post with actual technical content, I have to disappoint you. That will probably have to wait until after these busy conference weeks are over.
A year or so ago, I watched a few episodes of a Dutch television program that had an interesting format. The name of the series was (or is, I have no idea if it still runs) “Sterren op het doek” (“Stars on Canvas”). Every episode featured a Dutch celebrity, three painters, and an interviewer. For the program, the three painters each paint a picture of the celebrity (who is interviewed while posing – and because this takes a long time, the interviews were typically quite deep, and had enough material for an interesting first 15 minutes of the program). After that, the painters were given two weeks to finish their paintings. And at the end of the program, the painters each revealed their painting to the celebrity, who then got to pick one to keep for him- or herself; the remaining two portraits were auctioned off and the proceeds were given to a charity.
What made this program intriguing was the enormous differences between the finished paintings. Not only did each painter have his or her own unique style, they also all chose to capture and emphasize different aspects of the physical appearance and different personality traits of the celebrity. They all made portraits that captured the celebrity really well, and yet they were as different as one could imagine. Even though the painters all received the same “input” (watching the celebrity pose and listening to the interview), they produced radically different “output” (paintings), while still all following the same task. It was also interesting to discuss the results with my wife. We often disagreed on which painting we liked best, and the celebrity would then sometimes pick the one we both disliked! Again, three people who, with the same input, generated vastly different output. Because when it comes to art, picking the “best” is a matter of taste.
But what’s good in art, and in this television format, is not necessarily good elsewhere. When I fly somewhere, I don’t want the result of the pre-flight security check to depend on the engineer assigned to that task. For a given “input” (the current technical state of the plane), I want the “output” (‘clear to go’ or ‘needs repairs’) to be the same, regardless of the engineer who inspects the plane. And luckily, the airline companies are aware of this, so their safety inspection engineers use checklists that list exactly what details of the engine have to be inspected, and what value ranges are or are not acceptable.
Data models, like paintings, often end up as decoration on office walls. Put they serve a very different purpose. For a painting, hanging on the wall is the end goal, pleasing the eye of all beholders. For a data model, being stuck to the wall is only a further step in a longer process to the end product: a working database application that serves some business need. If a painting fails to please the beholder, it’s simply a difference in taste. If a data model fails to serve the business needs, a mistake has been made – and mistakes like this can cost companies large sums of money, or, in some cases, even cost lives.
I want data models to be like the safety inspection on planes: dependent on the input only, not on the person who carries out the task. That brings me to the third principle of modeling, after the Jargon Principle and the Concreteness Principle: The Reproducibility Principle.
The Reproducibility Principle
“The analyst shall perform each of his or her tasks in such a way that repeating the same task with the same input shall invariably yield the same result.”
Note that the wording of the Reproducibility Principle (which, again, is put in my own words because I don’t recall the original wording, only the meaning) does not state who repeats the task, and that is on purpose. This principle should still hold if someone else repeats the same task; the result of an information analysis should not depend on who does the analysis.
Easier said than done?
While writing this, I almost heard the angry outcry from all over the world: “Yeah, right! I can work in a way that almost always ensures consistent results, but some of my co-workers are blithering idiots who simply don’t get it – how can you seriously expect them to produce the same quality analysis that I make?” That’s a fair point, and maybe a weak spot in the wording of the Reproducibility Principle – but not a weak spot in the principle itself! Hang on, I’m sure it will become more clear after a few paragraphs.
But even with coworkers who are level with your experience and knowledge, you’ll still often have different opinions about a data model. I experienced this once, many years ago, when I attended a class on data modeling. The students had to draw a data model for a given scenario, and then present it in class. To my surprise, several models that were very different were all praised by the teacher. That just didn’t make sense to me. How can two models that don’t represent the same information both be correct? Clearly, the teacher of that class did not know or care about the reproducibility principle! If that is how information modeling is taught, it’s impossible to expect the modelers to suddenly all start producing the same results on a given input.
Does this mean that the Reproducibility Principle is good in theory but impossible to achieve in practice for information and process analysis? No! It just means that the currently accepted methods of doing information and process analysis are not equipped for this principle. They lack recipes. Not recipes for pie, as you’d find in a cookbook, but recipes for the steps in making a model – but these modeling recipes should be just like the cookbook recipes, in that they tell you exactly what ingredients you need, and how and in what order you process and combine these ingredients to get the required result: a delicious apple pie, or a completely correct data or process model.
The only way to implement the Reproducibility Principle is to have strict recipes that govern each and every conclusion that an analyst even infers from information given to him or her by the subject matter expert. With recipes, I know for sure that if I give the same information to my co-worker, the result will be either the same – or it will be different, and then we’ll sit together, go over the steps in the recipe until we see where one of us made a mistake.
And those blithering idiots who just don’t get it? Having recipes will not suddenly change them to superstar modelers. They’ll still be idiots, they’ll still not get it. But with recipes, we no longer have to confront their bad insights with our good insights, and still disagree. Now, we can simply point out exactly where they failed to apply the recipe in the correct way, and even our bosses will agree. The blithering idiot coworkers will still be a royal pain, but fighting them will be easier if your company agrees on implementing the Reproducibility Principle by supplying recipes for everything you have to do in your role as information or process analyst.
Putting my money where my mouth is
Look at me, all blathering in abstract terms about how information modeling does not follow the Reproducibility Principle – and all this time, I myself am not applying the Concreteness Principle that I advertised in an earlier blog post. Time to change that!
First, I’d like to start on the positive side. The job of creating a data model is not all up to the experience and insight of the analyst; some of the steps involved actually have very good recipes for them. Normalizing a data model is a fine example of this. For a given set of attributes (columns) and functional dependencies, the steps are clearly described. Simply apply them one by one, and the end result is always a nice, normalized data model. Unless you made a mistake somewhere (which does happen; I never said that the steps are easy, I just say that they are well prescribed!), your teacher (if still on course) or a coworker can tell you exactly where you made a mistake.
But where do that list of attributes and their functional dependencies come from? Now we get into the realm of the bad and the ugly – none of the mainstream analysis methods have strict recipes for this. If I ask how to find the attributes for my model, the answer is always a variation of “talk to the subject matter expert”. Yeah, sure, I get that. But what questions do I ask? How do I ensure that no important attributes are missed in our conversation? And how do I separate the wheat from the chaff, so that I don’t waste time on attributes that are not relevant at all? The only answer I have ever got to that question is “experience”. And that is not an answer that I, as a fan of the Reproducibility Principle, like.
For functional dependencies, there is a similar problem. Given a list of columns, how do I find the dependencies? Again, there is no answer that satisfies me. Most people will tell me that these dependencies are “obvious”, that you “just see” them. And indeed, in 99.9% of all cases, people will agree on the dependencies. It’s not those 99.9% that I’m concerned about; my concern is for the remaining 0.1% – those few cases where the functional dependencies are not obvious is where errors are introduced. Errors that sometimes are caught quickly, costing only a few man months of work. Or sometimes, they are not caught until it’s too late, bringing down entire companies –or, worse, killing people!– as a result.
After reading the above, you’ll understand that I only consider a modeling method to be good if it includes strict “recipe” procedures. And thinking back about my previous posts on the principles of modeling, you’ll also understand that I think those procedures should tell you not only what questions to ask the domain expert, but also tell you to use concrete examples in the jargon of the domain expert when asking them – and tell you how to do just that.
Unfortunately, most design methods fail to embrace these principles. The only method I ever encountered that does fully embrace all these principles is NIAM – a method that, as far as I know, has only been documented in a Dutch book that has never been translated. (There are some more or less closely related methods that are documented better, but they all either lack full support for the Modeling Principles, or focus on the wrong details). Uptake of this method in the Netherlands is minimal, and virtually non-existent in the rest of the world.
However, that has never stopped me from using it, extending it, and improving it. With all the changes I made the method I am now using can be said to have evolved from NIAM to a personal (though obviously still NIAM-derived) method. And I am proud to announce that this personal method will be the subject of a seminar that I will be giving on March 29 in London, on the first day of SQLBits X – a conference that anyone who can should attend. Thanks, SQLBits, for giving me a platform where I can share my experience with this modeling method with others.
In part 5 of my series on the bin packing problem, I presented a method that sits somewhere in between the true row-by-row iterative characteristics of the first three parts and the truly set-based approach of the fourth part. I did use iteration, but each pass through the loop would use a set-based statement to process a lot of rows at once. Since that statement is fairly complex, I am sure that a single execution of it is far from cheap – but the algorithm used is efficient enough that the entire input set is processed after only a few repetitions of that statement.
This approach runs rings (performance-wise) around all the other solutions I had previously tried, and it also produces a more efficient packing than most (but not all) of the slower alternatives. After adding the set-based iteration algorithm to the table of test results and removing all alternatives that are both slower and less efficient (except the baseline, of course), I now only have a few options left:
Based on the measurements in this table, your choices when you need to implement a bin packing algorithm appear pretty clear – either go for a very fast algorithm that wastes very little bins, or use an algorithm that’s over 5 times as slow but wastes even less bins, or sacrifice another 35-40% performance to save yet another 0.01% of space. I have to note, though, that for a true efficiency comparison the algorithms have to be tested with many different sets of data instead of the single set I have used in this series. If you really have to implement a bin packing algorithm and efficiency is more relevant than speed, I’d advise you to conduct those tests, using all the algorithms in all the parts I wrote, using many different sets of sample data that match the distribution of your actual data as close as possible.
On the other hand, if you can live with a fairly high (but not the highest) packing efficiency and need the process to finish fast, go for set-based iteration. The performance difference between this version and all the others is significant enough to safely conclude that even with other data distributions, it will still be the fastest of them all. The only caveat is here is how the algorithms scale. If execution time for one algorithm grows linearly with the amount of input and another scales exponentially, then there will probably be some point where another algorithm will become the winner. I will take a look at that in a later blog post – but first, I want to try if I can improve the existing algorithm a bit more.
How to abuse integer division
When I was busy trying out the variations I’ll describe later in this post, I suddenly realized that one of the queries in the original version of my algorithm uses computations with fractions instead of integers. Since integer computation is generally cheaper than computation with fractional precision, changing this should improve performance a bit. I refer to the expression that calculates the minimum amount of sessions as the total number of candidates divided by the maximum session size, rounded up. I implemented this very straightforward with the expression: CEILING(1.0 * SUM(NumCandidates) / @MaxCandidatesPerSession). I multiply by 1.0 to convert from integer to numeric (basically a trick to save the few extra keystrokes that CEILING(CAST(SUM(NumCandidates) AS numeric(10,2)) needs). Otherwise the division would be done as integer division, truncating the fractional part; the CEILING function would then have no further effect. However, it’s also possible to use a trick that effectively changes the truncation of integer division to rounding up – add the divisor minus 1 to the divided before dividing. So I changed the formula for the computation of the minimum required number of sessions to (SUM(NumCandidates) + @MaxCandidatesPerSession - 1) / @MaxCandidatesPerSession and saved this version as SetBasedIter1. Now this is not the most executed query in the procedure, but it’s still executed a few times, and this change did indeed give me a performance improvement – albeit a tiny one.
Improving the first estimate
There are scenarios where the set-based iteration algorithm starts with an amount of bins that every human being will immediately recognise as insufficient. For example, if the bin size is 10 and there are three packages of size 6, everybody will immediately see that you need three bins. But my algorithm will calculate the total size of all packages (18), divide by bin size and round up to 2, and then attempt to squeeze all packages in those bins. The result will be that the outer loop will have to be executed a second time; the entire process will thus be far less efficient then it would have been if the algorithm had started with three bins.
To profit from this observation, I created yet another version of the algorithm and called it SetBasedIter2. This one uses two methods to calculate the amount of bins needed: the already familiar one (total size divided by bin size and rounded up), but also a simple count of the number of packages that are over half the bin size. The actual amount of bins used is then the highest of those two numbers. First tests (I did more extensive tests later) indicated that this version did indeed run a bit quicker while using the exact same amount of bins.
Encouraged by this success, I went even further. After all, we don’t need to limit ourselves to just looking at the number of packages over half the maximum size. The next step is packages over one third of the bin size. Let’s for example assume that the bin size is still 10, and we now have five packages of size 4. You will immediately see that only two of those packages fit in a bin, so we need three bins. But even with the improved algorithm, we still start with one bin too little – the total size of all packages is 20, divided by 10 and rounded up yields 2. And the number of packages over size 5 is 0, so we do indeed start with two bins, and we’ll need another iteration of the outer loop to get all packages in a bin. So SetBasedIter3 expands on SetBasedIter2 by also including a count of packaged over one third of the bin size, divided by 2 and rounded up. And I then also added SetBasedIter4, that adds yet another step to get a good first estimation of the amount of bins needed: the amount of packages over a quarter of the bin size, divided by three and rounded up. I could have gone on like that, but I decided to stop here and do some more extensive testing first.
The proper way to test
In the previous posts, I always tested with the same set of data, to enable easy comparison of the efficiency of the various algorithms. But I realised that the effect of the new versions of the algorithm depends strongly on the actual data distribution. If, for example, there are no packages over half the bin size, then the extra code to count them is just overhead that adds some CPU cycles but doesn’t change the outcome in any way. But if there are a lot of those large packages, then the extra CPU invested can pay off by saving an extra execution of the outer loop. So the only way to get a fair comparison between the different versions of the algorithm would be to run it lots of times, with different data all the time.
I set up an automated test script that generates a new set of test data, using a different seed but the same distribution (so you could argue that my test method is still far from perfect – and indeed, I do recommend anyone who wants to implement any bin packing algorithm in a production environment where speed and/or packing efficiency matter to set up their own tests, including ALL versions of the various algorithms, while making sure that the amount of data and its distribution matches the expected production data as closely as possible), then executes each of the algorithms ten times in a row, inserting the elapsed time and the amount of bins used in a table. I put this in an endless loop, tested that it worked the way I wanted, and then hit the execute button and then went to enjoy myself at the snooker club, followed by a good night sleep. The next morning, a total of 215 different sets of test data had been processed. I cancelled execution and noted the average execution time of each of the algorithms.
The most surprising result was that of SetBasedIter. When I tested this same algorithm for my previous post, it took 6,741 ms. This time, the exact same code resulted in an average execution time of 2,962 ms, over 50% less. Even the highest measured execution time of this algorithm, at 4,463 ms is much faster than my previously measured average execution time. I don’t know how to explain this. Maybe the distribution of test data in my original test case just happens to be extremely unlucky for this algorithm, maybe I had forgotten to shut down some other program that was running on my laptop? Maybe even a bit of both? If you have any explanation to offer, please let me know!
Other than that, the results confirmed both my suspicions: that the integer calculation of SetBaserIter1 is a tiny bit faster than the numeric computation of the original version, and that at some point, the overhead of all the extra computations in the other versions starts to exceed the average savings due to reduced number of executions of the outer loop. It turns out that this cut-off point comes pretty quick – counting the number of packages over half the maximum bin size pays off, but also counting the number of packages over a third of the maximum bin size already makes the average execution time go up.
How about efficiency?
Though not included in the table above, I have also monitored the number of sessions used by each of the algorithms. In most test cases, all versions of the algorithm used the same amount of bins, but there were a few exception. Out of 215 sets of test data, 7 used 1 extra session with the algorithms SetBasedIter2, SetBasedIter3, and SetBasedIter4, and 3 others used 1 extra session with only the algorithms SetBasetIter3 and SetBasedIter4.
The explanation for this difference is partly simple, partly not. The simple part is that, when extra bins are introduced in the first pass, the division of packages across bins will be totally different, so that bins will have different amounts of empty space remaining. And if, for instance, the remaining empty space for two bins is divided as 2+2 in one case and as 3+1 in another case, that remaining package of size 3 will take up an extra bin in the former case, but not in the latter.
I had expected this difference to show up (that’s why I kept counting sessions in my fully automated overnight-version of the test script). I had no idea beforehand of how frequent I would see these differences, and what algorithms would benefit from the differences. So I was surprised to see how little versions of the test data were actually impacted, and equally surprised that in cases where there was a difference, it was always in the disadvantage of the versions with a better estimation for the first pass. Apparently, starting with more sessions (even if we know they will definitely be needed) is never good for the packing efficiency; for some reason the packing is less tight; more empty room remains. Again, if you have any explanation of this, please let me know!
If you have read the explanation of the set-based iteration method in my previous post on this topic, you’ll remember that at one point I wrote: “The threshold calculation, the ranking of bins by remaining capacity, and the test that a package fits the bin with the same rank may all seem pretty pointless when assigning the first bunch of packages to the bins. But the same code is reused for later iterations and then these are all important ingredients, as you will see in a bit”. But if that extra complexity is only required for the first execution of the code, then why not use separate code paths for the first and the subsequent executions? That way, the first execution can use a less complicated (and hence faster, I hope) query.
I decided to put that theory to the test. I created another version of the code, SetBasedIter2B, starting at SetBasedIter2 (the fastest version so far). I duplicated the code of the complex query that creates the correct amount of session and the even more complex query that assigns registrations to sessions (which also required me to reorder some of the code), then looked at possible optimizations.
For the query that creates new sessions, the first execution could be simplified because I don’t have to find the highest used session number for each quarter. The subsequent sessions could also be simplified by changing the join type for those highest used session numbers to an inner join (the outer join was only needed for the first execution), and by removing the extra line to count registrations over half the maximum session size – the algorithm guarantees that those will all be assigned in the first pass of the outer loop, so there’s no need to count them on subsequent passes.
For the query that assigns registrations to sessions, the first execution in the inner loop can be simplified a lot. If it’s also the first execution of the outer loop, all sessions are still completely empty. But it’s also possible that it’s a later iteration of the outer loop, in which case there may be sessions with some empty space (but not enough for any of the remaining registrations) as well as new sessions that are completely empty. So instead of filtering on SpaceLeft > 0, I now filter on SpaceLeft = @MaxCandidatesPerSession. Since all sessions now have the same amount of empty space, the ordering for the ROW_NUMBER() function is not relevant; I made this explicit in the query (it’s unfortunately not allowed to leave out the ORDER BY clause), so that the optimizer doesn’t have to do unnecessary work. But the best simplification is that I can leave out the entire threshold calculation – since all (new) sessions are still completely empty, the threshold is not relevant.
I could also have added an extra codepath to separate the first execution of the inner loop in the first execution of the outer loop from the first execution of the inner loop in a later execution of the outer loop. In that case, I could have omitted the WHERE clause in both CTE expressions for the former codepath. But since the columns used in these WHERE clauses are not indexed, I don’t expect any saving from this, whereas the logic to choose the right code path would increase complexity (and maybe even execution time).
I included this version as well in the huge test on 215 sets of test data. The amount of sessions for SetBasedIter2B was always the same as for SetBasedIter2, but the performance was now a bit better – 2,789 ms on average instead of the 2,884 for SetBasedIter2; a 3.3% saving.
Sacrificing speed for efficiency
At this point I had run out of ideas to optimize the algorithm any further. But I did want to try something else. When testing the versions that sometimes add a few extra sessions based on the number of registrations over half, one third, or one quarter of the maximum, I found that these would usually produce the same amount of sessions, but sometimes produce one extra sessions. And though I could not explain it, I did decide to try if the reverse would also hold. In other words, if I deliberately start with too few sessions, will the packing efficiency increase?
In order to test this, I created yet two new versions of my algorithm: SetBasedIter1BMin10 and SetBasedIter1BMin25. Both are based on SetBasedIter1, but with the performance enhancement of the SetBasedIter2B version (duplicating the queries and simplifying them for the first pass). The change I made was that the amount of new sessions used on each pass is now set to 10% or 25% less than the amount needed (rounded up, to prevent endless repetition of the outer loop). This guarantees that we will always need extra iterations of the outer loop, so performance will definitely suffer. But I hope this will be offset by increased efficiency.
These two versions were included in the test with 215 sets of random data as well. The speed was as expected: 4,171 ms for SetBasedIter1BMin10 (almost 50% slower than the fastest option algorithm), and 5,903 ms for SetBasedIter1BMin25 (over 111% slower). But the packing efficiency did indeed increase. For the 215 sets, the amount of sessions was decreased by 0.57% on average with the Min10 version, and by 1.55% with the Min25 version. For the original set of test data, the Min10 version required 19,182 session; the Min25 version needed 18,986. Still not as good as the most efficient (but slow!) cursor-based versions, though; it seems my algorithm cannot beat those versions for packing efficiency.
After all these measurements, it’s high time for a new overview of all the relevant versions of the bin packing algorithms. I consider a version irrelevant if there is another version that is both faster and more efficient. Leaving those versions out (except the baseline), I now have the following overview. Note that the execution time for all the “SetBasedIter” versions is based on testing with 215 different sets of random data, where each test is repeated 10 times and the fastest and slowest times are discarded; the execution time for the other algorithms and the number of sessions for all algorithms is based on a single execution with the standard set of data used in the previous parts of this series.
1 I expect the SetBasedIter1B version to be 3-3.5% faster (~2.84 ms), but I did not include this version in my tests. It’s quite easy to write this version – simply start with the SetBasedIter2B version and remove the extra code required to count the number of registrations over half the maximum session size.
2 Though the number of sessions for SetBasedIter2B is equal to that for SetBasedIter1 in this specific test case, I observed it to be one more in 3.2% of all my test cases. So this version is (slightly) less efficient.
When I finished the previous part of this series, I expected to be able to cover both the final improvements to the set-based iteration algorithm and the scaling tests in this part. But as I was working, I found many more improvement possibilities than I expected. This post is, again, much longer than what I think a blog post should be.
So I will have to postpone testing how the various algorithms scale to the next post in this series. Will that post conclude this series? Frankly, I don’t know yet. At this time, I have no ideas for further improvements or other approaches. However, if someone else has a good idea, feel free to let me know (in the comments or by mail). I’ll look at your ideas and test them – and if they are good, I will include them in yet another episode in this series (that was originally planned to span 5 posts at most – oops!)
In an earlier post, I talked about the Jargon Principle, one of three principles I learned in 1994 and that have not only helped make me a better modeler, but that I have found to be very valuable in many other situations as well. Today, I will cover the second of those principles.
The Concreteness Principle
Again, not in the exact wording I learned it but paraphrased, here is “The Concreteness Principle”:
“When communicating with the domain expert, the analyst shall avoid all abstract statements and questions and instead use concrete examples as the base of all communications; these examples shall use a notation that is familiar to the domain expert.”
Note that the last part of the Concreteness Principle is basically a repetition of the Jargon Principle. One could argue that this part of my wording is redundant, but in this case I really don’t mind driving the point home a bit.
What’s wrong with abstract questions?
I have tried to create a completely new and original example to illustrate this, but I was unable to find anything better than the one I already used in one of my chapters in the first SQL Server MVP Deep Dives book, so I’ll just reuse that here.
As a reader of my blog on SQLBlog.com, I’ll assume that you are fairly familiar with SQL Server, and that you could play the role of domain expert if some data modeler wants to model SQL Server but doesn’t know it himself. So let me assume that role of the data modeler. I’ll forget (temporarily) everything I know about SQL Server, and ask you this fairly simple question: “Can a table with a composite primary key be referenced by a foreign key constraint from several tables?”
Before reading on, please take a minute to ponder the question. Try to interpret what I’m asking and then decide what your answer would be.
Give me a concrete example, then!
The question I asked you is an abstract question. It talks about tables, primary keys, and foreign keys in general, on an abstract level. Replacing it with a concrete example is fairly simple – just substitute all these abstract terms with concrete examples. Instead of “a table” and “several tables”, I use tables TableA, TableB, and TableC; instead of “a composite primary key”, I use “Key PK_TableA on columns Foo and Bar in TableA”, etc.
Here’s the new version of the question: “Suppose I have a table TableA with a composite primary key made up of the columns Foo and Bar, a table TableB with amongst others a column Foo, and a table TableC with amongst others a column Bar. Is it allowed to define a foreign key constraint from the columns Foo in TableB and Bar in TableC that references the combination of columns Foo and Bar in TableA?”
Again, please take the time to read the question carefully and consider your answer. Was it still the same? Or did you actually misinterpret my original question?
If you did misinterpret the question, now is a good time to go back and reread the original question. I did phrase it exactly as intended. I did not ask “… referenced by foreign keys from …” (plural), but “… referenced by a foreign key from …” (singular). And yet, I am sure that a majority of readers will interpret the question as if I meant it in plural: multiple keys from multiple tables. Why? Simple. You are in this business; you work with tables and foreign keys; you are bound to interpret abstract questions in a way that fits your image of the world.
I’ll admit that I could have written the original, abstract question as “… referenced by a single foreign key from …”. That would have prevented the misinterpretation. But that would only be a logical way to phrase the question if the one asking the question assumes that without the addition of “single”, misinterpretations are likely. I would have done that in a real interview, because I already know the answer (so I actually would not have asked the question at all). But a modeler who actually does not know the answer has no way of suspecting that this misinterpretation is possible.
In my wording of the Concreteness Principle, I also explicitly say that the concrete example should be in a familiar notation – familiar to the domain expert, that is. Even though I already rephrased the original abstract question using a concrete example, and even though I did not violate the jargon principle (both the abstract question and the one using a concrete example use the jargon of SQL Server developers, which should be familiar to any SQL Server domain expert), I am still not happy with the way I asked it.
Why? Fair question. After all, most (if not all) readers will have understood the second version of my question the way I intended it. But I also assume that many readers will not have known the answer immediately. Most will probably first have drawn a mental image of the model I was describing. Some might even have taken out a scratchpad and pencil and drawn a real image. That’s a waste of time. And it introduces a new risk of misunderstanding. With the mental image, there is a risk of the domain expert misplacing one of the columns, and since the image exists in her head only, the modeler would not even see it. With the actual scratchpad, that risk is somewhat reduced if the modeler takes the time to inspect if the model on the scratchpad matches his intention. Provided the modeler understands this image, the jargon that the image is based upon, sufficiently to grasp the picture.
But if the modeler is indeed able to read this concrete example in the domain expert’s familiar notation, would he not be also able to draw it himself? Try to recall how long you took before you understood my second question sufficiently to answer the question. Now look at the picture below and watch how long it takes you to see what (if anything) is wrong with it:
I’m pretty sure that most readers will instantly see that this is not allowed. It is, again, the same question I already asked twice before. But now, you (still my domain expert) are able to see immediately what’s wrong with it, even if I don’t draw your attention to the part of the example I need you to focus on (which, in a real situation, I would do).
I hope the above example is sufficient to convince you that concrete examples are a much better way to phrase a question than abstract language. This does not end at asking questions, though; it’s the same for explaining something. Who has ever helped a kid with his or her first division exercises? I’m willing to bet that almost everyone introduced food at some point in the explanation – be it a pizza that has to be shared by three persons, a bag with 10 cookies for 5 children, or whatever. Why? Simple – because those concrete examples are much easier to understand than vague abstract explanations about division. And if you need to explain the remainder as well, you simple add another cookie to the bag (or, if you’re cheap, another kid).
And look at this blog post. I include an example about table design, and then another example about elementary school homework assistance. This whole blog post is one giant example of how to use concrete examples, rather than long series of abstract words, if you want to be understood.
(And for a counter example, analyze what politicians say during interviews. Many of them like to give long series of vague abstract quotes when they have to explain unpopular decisions. That’s because they hope that most voters won’t understand what they are actually saying. They do tend to use concrete examples when claiming a success, though!)
But he said…
As much as I believe in my principles, I also believe in not imposing them on others. Okay, you might argue that I am now trying to convince you of their value, and that’s true. But when talking to a domain expert, I would never force him or her to use my principles. But I am still very aware that abstractions can be easily misinterpreted. That poses a dilemma, because how can I thwart this risk if the domain expert chooses to use abstractions when talking to me?
The answer is surprisingly simple. I try to parse what the domain expert tells me, then verify my interpretation by presenting him or her with a concrete example (in familiar notation, of course) that illustrates my interpretation of his or her words.
The above might sound a bit vague (it’s an abstract formulation, after all), so I’ll give an example. Let’s say I am interviewing a domain expert from an insurance company. At one point, he might tell me “customers who smoke or have an occupation with hazard level F or higher and who are over 40 years of age cannot buy a life insurance unless they agree to pay a 20% surcharge on their premiums”. There is some room for misunderstanding in that statement. Mixing “or” and “and” together is one of them (does the 40 year age limit apply to dangerous professions only, or to smokers as well?). The or might be inclusive or exclusive (though the latter is quite improbable in this context). The “over 40 years” technically implies that this does not apply to people at the age of exactly 40, but many people will use these words when they mean “40 or over”. And does the age refer to the age when buying the insurance, or also to the yearly renewal? In other words, will a 38-year old smoker be confronted with a 20% premium range when he turns 41 (or 40), or does this only apply to people who are already that old when buying the policy?
At this point, I would whip out a set of blank policy forms, fill them with imaginary customers, making sure to touch all the relevant combinations of smokers vs non-smokers, dangerous vs safe occupations, and various ages. I would then present these to the domain expert and ask him or her to verify if my interpretation of the above statement is correct or not.
I have learned that applying the Concreteness Principle is a great way to prevent miscommunication. Not only in the data modeling job, but also in many other situations – yes, even in discussions with my teenage daughter (okay, there it helps a bit; it’s not the silver bullet that all parents are so desperately looking for). Almost everyone applies this principle, unconsciously, in many cases. Being aware of this principle has helped me to increase the number of times I use concrete examples instead of (or sometimes in addition to) vague abstractions.
Presenting the concrete examples in familiar notation can sometimes be a bit challenging. How do you even know what notation is familiar for the domain expert? And if that notation is not familiar to you, then how could you ever present your question in that notation? Those questions are all very justified, but answering them is beyond the scope of this blog post.
If this blog post has helped you to become more aware of the way you talk with people, the way you ask questions, or the way you make an argument; if you now try to avoid abstractions and use concrete examples instead, I have achieved what I aim for. Even if you are not (or not always) able to present those examples in a notation that’s familiar to the person you are talking to.
In one of my previous posts, I discussed whether data modeling is art or science, and I concluded that, unfortunately, the current state of affairs is that it’s closer to art than to science, whereas I would like to see the opposite. And I think that the same applies to process modeling.
Back in 1994, I learned about a methodology that managed to transition data modeling from art to science. As a result, I have become much more effective at creating successful data models. The root cause of this methodology is that it is founded on three basic principles that I have since embraced for everything I do.
The Jargon Principle
The first of the principles I learned in 1994 is called “The Jargon Principle”. I don’t recall the exact wording, but here’s how I would paraphrase it:
“For all communication that takes place between analyst and domain expert, the analyst will use the jargon of the domain expert instead of forcing the domain expert to use the jargon of the analyst.”
So, what does this mean and how to apply this principle?
Jargon is great …
For most people, the first association that comes with the word “jargon” is the nutty professor scribbling tons of unintelligible symbols on the chalkboard, or the highly qualified engineer that no one understands because he assumes that everyone not only knows exactly what an ACMF engine is, but is also extremely interested in discussing its details. And though these are definitely fine examples of jargon, I am using the term in a much broader sense here.
The jargon that the Jargon Principle references applies to all forms of communication, be it language (spoken or written words), symbolic (numbers, diagrams, tables, etc), or any other form. As soon as this communication uses elements that are familiar to people in a certain group but not (or less) familiar to people outside that group, I consider it to be jargon.
Every group and every profession has its own jargon. And for good reason. When I am in the United States to attend a conference (like, recently, the PASS Summit 2011), I sometimes switch the television to a sports channel. And then I invariably get to see people playing the game the Americans call football (even though they primarily hold the ball in their hands), and I then hear the commentator mention that a player is “three for seven”. I have no idea what that means. But I am sure it makes sense to all who regularly watch these games, and it would become very boring and longwinded if the presenter had to explain this bit of jargon every time he uses it.
And just think how hard our own job would become if you could not say to your colleague that you believe that discount should be an attribute of the Sale entity instead of a relationship. Or if you could not use Entity-Relationship Diagrams to lay out the design of a new application and discuss it with your peers. Those are exactly the situations that jargon is intended for, because it helps you communicate more efficiently and more clearly with your co-workers – the shared jargon ensures rapid communications without misunderstandings that can could result from having to describe everything in “normal” English.
… except when it isn’t.
The examples above that illustrate the greatness of jargon all have one common factor, and that is that in these examples, the jargon was used to facilitate communications between people that share knowledge of the same jargon – except the first example, that left me sitting bewildered in front of the television set, wondering what the hey the commenter meant when telling me that a player was “three for seven”.
When you, in your job role as a data modeler or process modeler, step outside the safety of your office where your coworkers are to discuss the next application to build with the domain expert, you are talking with someone who does not know the same jargon you do. If you have to create the data model for the local candy store and you ask the owner if she agrees that discount should be an attribute of the Sale entity, your only answer will probably the “thud” of her jaw hitting the floor. If she answers anything else, she either is an expert in the field of data modeling herself, in which case she doesn’t need you, or she is afraid she’ll look dumb if she admits that your words are Greek to her and gives random replies in an attempt to mask her perceived lack of knowledge. Whereas in fact, you are the dumb person, because you asked a question in a jargon that is unfamiliar to the domain expert, setting her, yourself, and the entire project up for failure.
What’s the alternative?
I hope my examples above have convinced you that you should not ever use your own jargon, the jargon of the data or process modeler, when talking to domain experts. And since not talking to them is not an option, you are now left with only two options. The first is to avoid the use of any jargon. That might look like a viable option, but it has problems. How do you even know if a word is jargon or not? When I use the word “person”, I usually think about a human being. But for a lawyer, a person is a legal entity. He will use the word in that context without even realizing he’s using jargon. And I can’t help it, because (a) I may not even be aware that he is using jargon, and (b) that would still mean forcing the domain expert to talk my (now jargon-free) language, a practice I don’t like.
And that leaves only one option – modelers should use the domain expert’s jargon when talking to him or her. That’s the Jargon Principle.
Stating a principle is easy, living up to it can be hard. But if I want to take myself seriously, if I want to ensure that I base my models on trustworthy information, I have to force myself to live up to this principle. Any other method of communicating with the domain expert simply carries too much risk of failure.
And I’m not the only one who believes so. If you check job adverts for data modeling jobs, or even for analyst and developer positions, you’ll almost always see the requirement that candidates have to have several years experience in the companies’ line of business. That is not because oil producers use different modeling languages or different dialects of T-SQL and C# than hospitals. It’s because people with several years in a line of business generally have picked up enough of the relevant jargon to be able to communicate efficiently with the domain experts. However, this interpretation of the Jargon Principle is both limiting (for the modeler) and dangerous (for the business).
The “limiting” aspect is quite obvious. If your first job as modeler happens to be with a bank and you decide to move on after six years, you’ll probably have a good chance to be hired as data modeler for another bank – but not in any other industry. You might as well update the job title on your resume from “data modeler” to “data modeler in the banking industry”, for that is what you’ll be doing the rest of your working life. Unless you are willing to slip back into the junior role again, and to accept the corresponding pay cut.
The “dangerous” aspect is less obvious. So let’s start with a quick show of hands. Everyone who has ever witnessed a data professional, be it a modeler, an analyst, a developer, or whatever other function, say something like “they don’t specify this configuration option, but I’ll put it in anyway, just in case”, or “rip this out? Hmmm, I’ll just comment it, they’ll surely want it back within a few months” (or encountered such commented code in decades old programs), or even “nah, that spec is incorrect, I’ll do it another way”, raise your hands. Yup, thought so. Now raise your other hand if the person who said that just happened to be you. I’m willing to bet that at least 80% of you are now in the universal “I surrender” position.
This comes from the fact that several years of experience in an industry will teach you way more than just the jargon. You’ll find that your knowledge starts to match that of the domain expert, or even exceed it if the domain expert has less experience than you do. And if the domain expert says something that does not match your experience and your notions of how the business works, you may even be tempted to disregard the words of the domain expert and push on with the model that you think is the more correct one. And you might even be correct. But you might be wrong as well, and that’s where the danger lies. In a well-organized company, statements made by the domain experts are under scrutiny, because they make up the specs of the system. Deviations from those specs by a smart-ass modeler (or developer) have a much higher chance to go unnoticed until it’s too late.
I believe the Jargon Principle to be of utmost importance when doing data or process modeling. In the current state of affairs in our profession (that of data professionals), this unfortunately means that one needs sufficient experience in the industry of the domain expert to pick up his jargon. The down sides and risks of this have to be accepted as the lesser of all evils.
However, I also believe that there must be a better way. I believe that it’s possible for a data modeler to speak the domain expert’s jargon without first having to master that jargon, by simply following rules that specify exactly what questions to ask and how to phrase them. How these rules work exactly is out of scope for this post, but those who own the first SQL Server MVP Deep Dives book (and those who don’t should buy it now, along with the second SQL Server MVP Deep Dives book – all author royalties go to charity, so you get two exceptional books and help children in need along the way) can read my chapter on finding functional dependencies (a small but important part of the work of a data modeler) to get some general idea how this communication method works.
One of the most common techniques authors use to keep their readers interested is to leave them with a cliff-hanger. It’s what I did when I finished part 4 of my series on the bin packing problem – never intending to leave you all hanging over a cliff for almost three years, though that is exactly what happened. My apologies to everyone who has been checking my blog on a daily basis all that time, in the idle hope of finally learning that faster method I promised.
For those of you had have forgotten what I wrote in the previous parts, or who have never read them before, here are a few quick links:
- The first post includes an explanation of the bin-packing solution, sets up some sample tables and test data, and establishes a baseline that all other solutions have to be compared against – for both performance (lowest speed) and efficiency (lowest number of bins – sessions in the case of the chosen sample).
- The second post introduces several techniques to increase packing efficiency, though at the cost of reduced performance.
- In the third post, I investigate several ways to improve performance, and find out exactly how much they decrease packing efficiency.
- The fourth post investigates how a completely set-based solution is guaranteed to find the best possible solution for bin-packing problems that are limited enough that you could do it by hand, but falls apart completely when you want to scale.
Changed hardware, changed performance
Three years have passed since I last worked on this series, and my test environment has changed considerably. My old laptop has been replaced by a newer one, with two disk drives, more memory and a faster, 64-bit processor. And I have also upgraded my DBMS to SQL Server 2008 R2, Service Pack 1. This has invalidated all my previous measurements, so I decided to first repeat the performance tests for the shortlist of relevant solutions that I included in part 4 – except for OrderDesc, since I found in part 4 that this should not have been listed as a relevant solution at all. I have executed each version ten times and calculated the average execution time from those test results. In most cases the execution times were fairly close; only the baseline had a larger variety. The table below lists the results of my tests:
If you compare the table above to the table I posted in part 4, you’ll notice that all algorithms got faster by 13 to 20 percent. The overall performance boost is a great confirmation that the hard-earned cash I invested in my new laptop was not wasted. The difference in percentage performance gain suggests that extensive testing of all algorithms might yield some surprises; some of the algorithms previously excluded from the list above might have to be added again – but I don’t expect the difference to ever be more than a few percent. I decided not to spend the extra time that would have been needed for this full investigation. You are of course free to do so yourself, all the required code is still available from my blog. But after seeing the performance of the algorithm I’ll describe in this blog post, you’ll probably understand why I am not that interested anymore in whether any of the other cursor-based algorithms now happens to be one or two percent faster than those listed above.
All at once or one at a time?
One of the more common best practices for SQL Server is to avoid using cursors and other iterative solution, and use set-based logic instead – and only choose an iterative solution if you are 100% sure that you have run into one of the very few situation where a set-based solution will not work. While I do endorse this best practice in general, it has a down side: it makes people believe that either set-based (process all at once) or iterative (process one row at a time, usually with cursors) are the only alternatives.
They are not. There are more options available. One of these alternatives that I have found to be highly useful in a few situations is what I have dubbed “set-based iteration”.
Set-based iteration: the perfect blend?
If you characterize set-based processing as “using a single query that processes all rows at once”, and iterative processing as “using a loop that processes one row per execution”, then you can characterize set-based iteration as “using a loop that processes many rows per execution”. So you have a set-based query that processes many (but not all!) rows, that is enclosed in a loop to repeat that query as often as needed. The challenge here is to find a form where the set-based query does not take too much time, yet processes as many rows as possible so that the number of iterations remains low.
For the bin packing problem, this means that instead of filling one bucket at a time (as we did in the various cursor-based solutions), we’ll fill many buckets at once. The amount of buckets to fill should be as high as possible, but not so high that we end up taking more buckets than needed, as the goal was to use as little buckets as possible. The only problem here is that we don’t know in advance how many buckets we will end up needing. But we do know the minimum amount that will be required anyway – if the maximum seating capacity of the examination room is 100 students and there are 1742 students registered, we can be absolutely sure that there is no way we will ever pack those students in 17 sessions; we know for sure that we will need at least 18 sessions. So instead of opening one session and assigning registrations one at a time to it, we can now create 18 sessions at once, assign 18 registrations to those sessions at a time, and repeat this until either all registrations are assigned or all sessions are full – and if at that point we are still left with unassigned registrations, then the distribution of registration sizes was apparently unlucky and we need one or more extra sessions to assign the remaining registrations to; this is done by simply repeating the process for only the unassigned registrations.
The algorithm in detail
The exact algorithm I use in my “set-based iteration” solution for the bin packing problem needs some explaining, so I decided to use some sample data and some pretty (ahem) pictures to illustrate the various steps. To keep it simple, I limit the bucket capacity to 10, and I pack 9 packages, three of size 6, three of size 5, and the last three of sizes 3, 2, and 1. To find the minimum number of bins required, we calculate the total size of all packages (3 * 6 + 3 * 5 + 3 + 2 + 1 = 39) and divide by bin size, rounding all fractions up (39 / 10 = 3.9 à 4 bins). So we immediately create 4 empty bins.
To assign packages to these bins, we find the threshold (the largest available capacity in the current range of bins – 10 for now, since all bins are still empty), rank the bins by descending available capacity, rank the packages that don’t exceed the threshold by descending size, and then assign packages to bins based on equal rank – but only if the package does actually fit in the bin with the same rank. This is illustrated below.
The threshold calculation, the ranking of bins by remaining capacity, and the test that a package fits the bin with the same rank may all seem pretty pointless when assigning the first bunch of packages to the bins. But the same code is reused for later iterations and then these are all important ingredients, as you will see in a bit.
After assigning these first four packages, the remaining capacity of all bins is recalculated and the process repeats – the threshold is calculated (now 5, since that is the largest remaining capacity). None of the remaining packages exceeds this threshold, so all remaining packages are ranked by size; all bins are ranked by remaining capacity, and packages are once more assigned to bins based on equal ranks, as illustrated below:
As you can see, package F and bin 1 are both ranked 2 in their respective orderings, so they are assigned to each other, but package F exceeds the remaining capacity of bin 1, so this combination is discarded, as indicated by the dotted arrow. The other packages all do fit in their assigned bins, so packages E, G, and H are assigned to bins 4, 2, and 3 respectively.
On the third iteration of this step, bin 1 has the highest remaining capacity, so the threshold is set to 4. Package F at size 5 exceeds this threshold; this package won’t any of the remaining bins, so this package is exempted from the process until we have finished filling the current batch of bins and start a new series of bins.
The bins are ranked by descending size. The packages (or rather, the single remaining package) is ranked as well, and then assigned to the bin with the same rank, as illustrated in the figure below:
A fourth iteration of this process doesn’t cause any new changes. There is only one package left to assign and it exceeds the threshold (that is now even down to 3), so the iteration stops here; all 4 bins that were assigned at the start of the process have been filled as far as possible with the available packages.
Not all packages have been assigned to a bin, though. Apparently, the sizes of the available packages were distributed such that it was not possible to distribute them to only four bins; an extra bin is needed. So the whole process starts again from scratch, using only the single remaining package: calculate total size (5), divide by bin size and round up to find the minimum required number of additional bins (1), rank both the single bin for this batch and the single package, then assign the package to the bin. The end result can be seen below:
I won’t spend much time on the T-SQL implementation of this algorithm. You can find the full code in “SetBasedIter.sql”, which is part of the ZIP file I attached to this blog post. I included comments where I thought they might be relevant. The T-SQL I used in this code uses several features that many won’t use on a regular basis, so just looking at this code and trying to understand how it works should already present a learning opportunity.
But how does it perform?
At the end of the day, the only thing we’re interested in are the results. So I executed this procedure a total of ten times, and the average execution time is only 6,741 ms. That is an improvement of almost 90% over the baseline, and over 75% faster than FillThenNext, the previous fastest solution. And unlike FillThenNext, the SetBasedIter algorithm does not pay for its increased performance by lower efficiency. With my standard set of test data, SetBasedIter packs all registrations in a total of 19,293 sessions. Better than the baseline or the two fastest cursor-based algorithms, but admittedly not as good as the two slower cursor-based algorithms FillThenSearchDesc and Order50FirstB. But given the performance difference, I expect most companies to be willing to accept that 2% efficiency loss for the 80% performance gain.
Too good to be true?
You probably know the saying “when something sounds too good to be true, it probably is”. Well, this is the exception. Saving 75% on ever the fastest solution does sound too good to be true, but all my tests show that it is actually true. Could that be a sign that this algorithm is not “too good to be true” at all? That it is, in fact, still not good enough? Well, to me it definitely isn’t. In the next part of this series, I will show how some smart changes in the code of the set-based iteration algorithm can reduce its execution time even further. And I will also investigate how the various algorithms scale, for an algorithm that is the winner for this test-set but scales exponentially could quickly become a loser when the company increases its business – and I like to know that kind of stuff before it happens!
So stay tuned for the sixth part of this series. And I promise, it won’t be another three-year wait this time!
Almost a year ago now, I started a series of blog post on the bin packing problem. But after the first three posts, various reasons caused the research I still had to do for the fourth part to be massively delayed. It’s only now that I have finally found the time to finish my research and write up the fourth installment.
After a nine-month delay, I can hardly expect you to remember what I covered in the first posts, so you may want to follow these links to re-read that:
· The first post includes an explanation of the bin-packing solution and establishes a baseline that all other solutions have to be compared against – for both performance (lowest speed) and efficiency (lowest number of sessions, or bins).
· The second post introduces several techniques to increase packing efficiency, though at the cost of reduced performance.
· In the third post, I investigate several ways to improve performance, and find out exactly how much they decrease packing efficiency.
Changed version, changed performance
All performance numbers quoted in the previous posts were based on tests on my laptop, which was at that time running SQL Server 2005, SP2. But not anymore. In the meantime, I have upgraded my laptop to SQL Server 2008 RTM, so it’s highly probable that performance of all queries tested so far has changed – hopefully for the better. Since I don’t want to revive SQL Server 2005 on my laptop, I had no choice but to repeat all tests that I ran earlier in the course of researching and writing the first three parts of this series. I won’t bore you with all the details; instead I’ll just show an updated version of the table that I posted in the conclusion of the third part, listing the performance and efficiency of the baseline and of all versions of the algorithm that show an interesting trade-off between efficiency and performance.
Execution time (ms)
Number of sessions
FillThenSearch (with index)
FillThenSearchDesc (with index)
Order50FirstB (with index)
OrderDesc (with index)
If you compare the table above to the table I posted in part 3, you’ll notice a few interesting changes:
· There are no really big performance changes – queries got just a couple of percents faster or slower, but not the big speedups one would hope for when moving to a new version.
· Two of the algorithms, the baseline and OrderDesc (with index) are now slower than before. The others all got a bit faster.
· There’s also a new algorithm in the table: Order50FirstB (with index). Without the index, this 18,932 sessions – but with the index, this improved to 18,923. Unfortunately, when I originally tested the effects of the index I was too focused on the performance to note the increased packing efficiency. So, at the presumed 18,932 sessions I concluded that another algorithm (FillThenSearchDesc) was both faster and more efficient, and thus excluded this one from the table – whereas in reality, Order50FirstB should have been in and OrderDesc should have been out (for it’s now exactly as efficient but slower). In fact, the only reason why I still included OrderDesc (with index) in the table above is because it was so that you could compare it to the table in the previous part.
And in case you are wondering how it’s possible that adding an index affected packing efficiency as well as speed, here’s why: the Order50FirstB procedure uses a SELECT TOP 1 query without ORDER BY, which is known to produce undetermined results; in this case, adding the INDEX obviously affected what sessions were returned whenever this query was executed, and this had its effects on the packing efficiency. Which only goes to prove, once more, that one should never use the TOP clause without ORDER BY. (Okay, in this case I’m pretty satisfied with the unexpected change, but it could just as well have been the other way around!)
Less sample data
As promised in the previous part, I will now focus on developing a purely set-based solution for the bin packing problem. But before I can go there, I have to produce a new set of test data that is drastically reduced in size as compared to the original test data. There are various reasons for this. One reason is that the set-based solution requires an advanced knowledge of the number of sessions required, with the size of the query expanding if the theoretic number of sessions increases. Another reason is that the size of the query also increases as the session size increases, so I wanted to test with a smaller session size first. And finally, the performance of this tiny test-set turned out to be already pretty bad; I’ll try to make an estimation of the total execution time for a set-based solution for the original problem near the end of this post – after which you’ll understand why I never actually built or tested this version.
In the attached ZIP file, you’ll find the file “Tiny Testset (Random).sql”. This is a new version of the random test data generator presented previously in the first part of this series, with the following modifications:
· The maximum number of candidates per registration that was hardcoded as 100 in the original script has been replaced by a variable, defaulting to 10.
· All quarters now use the same, simple expression to get an even distribution of registration sizes between 1 and the maximum size. The switches to include or exclude each quarter have been removed. I felt that the extra complexity of different distributions was not required in this case.
· Since I had to code the query for a fixed maximum number of possible solutions, I added an extra variable for the maximum number of candidates in all sessions for each quarter; if the random distribution of registration sizes causes this to be exceeded, the script will simply remove the largest registrations until the total size of all remaining sessions in each quarter no longer exceeds the maximum.
For my first round of testing, I decided to limit the sessions to a maximum of 20 candidates, and to assume a maximum of four sessions per quarter. I limited the amount of subjects (sessions per quarter) to 10, and set the maximum registration size to 10 as well. With these numbers, I decided to set the maximum number of candidates per quarter to 70. In theory, this still allows for impossible data combinations – for instance, there is no way that you can ever fit 10 registrations for 7 candidates each in less than 5 sessions –, but none of the tests I ran ever actually ran into this problem. But if you are going to run your own test with a different seed for the randomizer, or with different settings for other variables, you need to be aware of this possibility. Since the set-based algorithm will simply not return any results for a quarter that needs more sessions than the query caters for, this problem is easy to spot – you just need to check that the query returns results for as many quarters as your test data spans (which can be done by simply checking the amount of rows returned – easily visible when returning results in grid view in SQL Server Management Studio).
The set-based solution
Credit where credit is due. The set-based solution I present here is not mine. I have first seen this solution in a newsgroup posting by John Gilson, dating back to early 2005. His code did contain some minor errors that I corrected, and I have also added some optimizations to his code – but the logic used in this solution is entirely his.
The set-based solution presented here is not implemented in a single query but as a series of views, building on top of each other. This has several advantages. One of them is that the logic is easier to understand. Another advantage is that it prevents endless code repetition: every view can reference the previously defined views by name instead of repeating the entire definition of the view as a subquery. This does not automatically lead to performance gains though – SQL Server will replace each view by its definition before handing the result over to the query optimizer, so the resulting plan will be exactly the same as if the entire query were written out referencing only base tables.
As in the previous parts, the code is too long to reproduce here. In the attached ZIP file, you’ll find the code for this solution in the script “PureSetbasedViews.sql”, complete with extensive comments. This code also contains (commented out) statements to review the contents of each of the intermediate views, so if you wish you can uncomment them and see how the steps build upon each other to massage the data into the final solution.
Step 1: Aggregating by size
The first step of the algorithm is pretty simple. It builds on the basic idea that in any give solution, it’s not important which subjects are included in each session, but only how many of each size. And for the purpose of the bin packing algorithm, subjects with the same number of candidates are completely interchangeable. As an example, let’s say that there are five registrations for a given quarter, three for 6 candidates each and two for 9 candidates each. This will be represented by five rows in the dbo.Registrations table, but it can just as well be represented by two rows – one row representing 3 registrations for 6 candidates, and a second row representing 2 registrations for 9 candidates. This is implemented in the attached code in the dbo.RegsBySize view.
The great advantage of this is that it vastly reduces the number of potential (but not really different) solutions. For instance, there may be solutions where a session is built by combining one of the registrations for 6 candidates is with one of the registrations of 9 candidates. When building straight from the dbo.Registrations table, any possible set-based solution would find 6 variations on this session (combining each of the three 6-candidate registrations with each of the two 9-candidate ones); when building from dbo.RegsBySize, there will be just one row representing this possible combination of two registrations.
Step 2: Building sessions
Using the aggregated registrations, the second step is to combine one or more registrations into possible sessions of all sizes from one up to the maximum session size. Combinations of registrations that would exceed the maximum session size are excluded, but combinations that are smaller than the maximum are not. The view dbo.PossibleSessions describes a completely denormalized result set, holding up to five (Size, Count) pairs from the dbo.RegsBySize view. Like in the previous step, only one row is created for each possible permutation – so based on the example above, there will be one row for “1 registration size 6 and 1 registration size 9 (total size 15)”, even though two such sessions could be formed from the total registrations available. Other possible sessions constructed from the above sample data would be “1 registration size 6 (total size 6)”, “1 registration size 9 (total size 9)”, “2 registrations size 6 (total size 12)”, “3 registrations size 6 (total size 18)”, and “2 registrations size 9 (total size 18)”.
As you can see, this view also includes a column SessionSize (holding the total number of candidates in all the session’s registrations) and columns for the total number of sessions for each size available. The former can easily be computed from the Size and Cnt columns, and the latter can easily be fetched by joining to dbo.RegsBySize – but redundantly including these columns here vastly reduces the amount of code required in the next step.
The final column in this view, RowNum, uses the ROW_NUMBER() function to assign each possible session a unique number within its quarter. The numbers are assigned by ordering the possible sessions by session size and then by size and count of each of the (size, count) pairs; this could in fact have easily been replaced by any other ordering – the only goal is to have some method to assign unique row-numbers within each quarter. These numbers are used in the next step , when solutions are made by combining multiple possible sessions, to prevent duplicate solutions – for instance, if one solution combines possible sessions 1 and 2, we don’t want another solution formed by combining sessions 2 and 1. If you want to run this code on SQL Server 2000, or on any other database platform that does not support the ROW_NUMBER() function, check John Gilson’s original posting (that was written before SQL Server 2005 was released) to see how he achieved the same goal without the RowNum column – by adding lots of complicated comparisons in the next logical step).
You’ll probably note that the maximum session size of 20 is hardcoded in this view, since parameters are not supported in views. I could have used an inline table-values user-defined function instead (sacrificing support for SQL Server 2000 and before), but that would not really have helped, since the number of columns this view requires is also dependant on the maximum session size. The five (size, count) pairs supported in this version are not chosen arbitrarily; it is the minimum number required to describe all possible permutations of registrations whose total size does not exceed 20. The smallest possible session size that would require a sixth (size, count) pair would be 21 (combining one registration each of the sizes 1, 2, 3, 4, 5, and 6). Similarly, a seventh (size, count) pair (and accompanying avail column) would be required when the maximum session size is 28 or more, and so on. For the maximum session size of 100 that I used in the original test data, no less than 13 (size, count) pairs are required – so if you want to extend the set-based solution to this original size, prepare to expand the view to a total of 43 columns, and no less than a whopping 307 lines of code (if I didn’t mess up my estimations…). I’m sure this makes you start appreciating why I chose to limit myself to a much smaller set of test data for this part of the series!
Step 3: Finding solutions
Now that we know all possibilities to assemble sessions of all allowed sizes, it’s time to combine these possible sessions into complete solutions. Of course, not any combination is a solution – in order to qualify as a solution, each registration has to be used exactly once. This is implemented with a clever trick – the code checks to see that no registration is used more than once (or rather, that no registration size is used more often than the number of registrations of that size), and that the total number of candidates in all sessions combined equals the total number of candidates in all registrations. Since the only way to get at a total equal to the total registration size without using registrations twice is to use them all, this requirement is equivalent with the requirement to use each registration exactly once, but much easier to test.
When assembling solutions from the possible sessions view, it’s important to remember that the possible sessions stores permutations – so it’s allowed to use multiple instances of the same possible session as long as the total number of available registrations of each size is not exceeded. So going back to the sample data presented earlier, one possible solution would include two occurrences of the “1 registration size 6 and 1 registration size 9 (total size 15)” session, plus one occurrence of “1 registration size 6 (total size 6)” to top it off. Another solution would include one occurrence of “3 registrations size 6 (total size 18)”, and one occurrence of “2 registrations size 9 (total size 18)”.
If you look at the dbo.Solutions view in the code, you’ll see a design that is even more extremely denormalized than the dbo.PossibleSessions view already was. It includes the five (size, count) pairs already familiar from the latter, but repeats them four times – thus allowing for a maximum of four possible sessions to be combined to make up a solution. This means that the solution fails when it’s given data that requires five or more sessions to accommodate all registrations – it will simply not return any rows for that quarter. To solve that, all you have to do is add more columns for a fifth session, and add an extra LEFT JOIN to a fifth occurrence of the dbo.PossibleSessions view. You may have noted that the join condition for each join references all “previously” joined occurrences of dbo.PossibleSessions, thus increasing in size with each additional occurrence.
Apart from the columns for the denormalized description of all sessions in a solution, the dbo.Solutions view includes four more columns. Apart from the unavoidable Year and Quarter columns, there is a NumSessions column holding the number of sessions used, based on a simple CASE expression (only added for easier understanding of the output; it’s not really required). The final column calculates a ranking for each solution by using the ROW_NUMBER() function, partitioning by quarter and ordering by number of sessions in the solution (least sessions used first) using a clever trick based on the fact that NULL sorts before other values (I could also have repeated the CASE expression used for the NumSessions column, but that would have been more typing). Note that the order of solutions with the same number of sessions is irrelevant for the approach used here. This Ranking column will be used in the next step.
If you have already checked the query, you’ll probably understand that it’s impossible to scale this to the dimensions required to solve the original problem. The maximum session size of 100 candidates already forces us to use 13 (size, count) pairs in the dbo.PossibleSessions view, that all return here; the inclusion of one copy of that view per session in the solution moves us into some really insane figures. For instance, if you want to allow up to 200 sessions per quarter, you’d need to define 5,204 columns in the view (which is more than SQL Server allows), and the CREATE VIEW statement would have more than four million lines, well beyond the maximum batch size SQL Server allows. And that would still not be enough for the original test data – the tests conducted earlier show that some quarters need slightly more than 700 sessions. Maybe this query will find a way to pack them into less than 700, but then we’d first need to find a database that allows views with 18,204 columns and a definition of 51,155,633 (!) lines…
Step 4: The best solution …
After all this hardcore query magic, the fourth and final step is almost disappointingly simple. Finding the best solution is a simple matter of returning the solution with the lowest score in the Ranking column – i.e. a ranking equal to 1. And if you prefer to see all possible solutions with the lowest possible number of sessions, all it takes is to replace ROW_NUMBER() with RANK() in the definition of the dbo.Solutions view – this will assign the number 1 to all solutions with the lowest number of sessions, and higher numbers to all solutions that use more sessions.
Once more, you can check John Gilson’s posting to see that without ROW_NUMBER() the dbo.Solutions view has to be joined to (a derived table based upon) itself. In practice this means that after replacing views by their definitions, the resulting query passed internally to the query optimizer doubles in size – I don’t even want to begin to imagine what effect this will have on this solution’s performance. Especially considering that performance is already pretty bad with using the ROW_NUMBER() function.
Maybe this is also a good place to point out that the results you get when you query the dbo.BestSolutions view are not actually yet the final results. Okay, you will get a result set showing that, for some quarter, you have a session consisting of three registrations for five candidates plus one registration for three candidates, but you still have to select the four actual registrations to assign to this session. Since this blog post is already longer than any I’ve done before (and still growing), I decided to leave that part as an exercise to the reader. Just one word of warning – if you come up with a solution that uses multiple copies of the dbo.BestSolutions view, replace it with a temporary table holding the contents of the view. Otherwise, SQL Server will happily produce a query execution plan where dbo.BestSolutions is evaluated as often as you use it in your query – and trust me, once you’ve seen how it performs, you don’t want that to happen!
… for some definition of “best”
I you have already checked out the definitions of the views used, you probably don’t expect a great performance anymore. And after reading my ominous hints above, you’re really bracing yourself – and with good right. With the amount of test data reduced to just ten (or less) registrations per quarter, querying the dbo.BestSolutions view to find the best solutions for all quarters (on an empty cache) turns out to take an average execution time of 94,535 milliseconds – that’s more than one and a half minute!
I have already indicated that the original code that I based this solution on was written before SQL Server 2005 was available. That is probably the reason why separate views were used instead of a single query with CTE’s. But now that I’m already running SQL Server 2008, there is of course no reason not to try a CTE-based version. After all, I also started to make use of the ranking functions that have also been introduced in SQL Server 2005. So I changed the view-based code to the CTE-based but further equivalent code in PureSetbasedCTEs.sql (also in the attached ZIP file), and executed it a couple of times to see if this would affect performance. In theory, it shouldn’t – but in practice it did: the average execution time for the CTE-based version was 90,984 ms, almost 4% faster. Even more intriguing was my observation that the CTE-based version performed more constant, with all measurements between 89 and 94 seconds. The view-based version ranged from 80 to over 100 seconds! I have not been able to explain either the performance difference or the difference in variation.
I also tested a third version. Not as “purely relational” as the views or the CTE’s, but much better performing: by materializing intermediate results in temporary tables instead of keeping the virtual in views or CTE’s, I worked around the annoying weakness in SQL Server’s query optimizer, namely that it does not realise that for a query that includes multiple copies of the same logic, that logic needs not be duplicated in the execution plan. Even in cases such as these, where the duplication is very easy to spot because it’s just multiple copies of the same view or CTE name, the optimizer will still spit out a plan that happily repeats the same steps over and over again. So I replaced the views with temporary tables, forcing SQL Server to materialize and reuse intermediate results instead of recalculating them over and over again. This code, that you can find in PureSetbasedTemp.sql in the attached ZIP file, did indeed perform lots better, though still a far cry from the cursor-based solution I presented in the previous posts: 32,879 milliseconds for the tiny amount of test data this solution handles.
It gets worse
Even though I already know that the set-based solution will never scale to the size required to handle the original problem, I still wanted to have some idea of how it scales as the maximum session size, the maximum number of sessions, and/or the amount of registrations increase. So I started with the “best” version so far, the version with temp tables, and created two variations on it: one that allows for one more session in the solution, bringing the total to a maximum of five sessions per quarter; and another one that does the same but also allows the session size to go up to a maximum of 35 candidates by using seven (size, amount) pairs for the possible sessions instead of five. These versions are also in the attached ZIP file as PureSetbasedTempMore.sql and PureSetbasedTempBigger.sql.
I did not test these versions with data that actually required bigger or more sessions, but I did test all the versions with various amounts of data by changing the amount of registrations per quarter. In the table below, you will find the number of registrations I chose, the actual average number of registrations per quarter after trimming the data down to at most 70 candidates per quarter, and the running time of each of the three set-based versions with temporary tables. Note that in this case, I executed the longer running queries only once instead of my usual practice of averaging the execution time of five executions. And in case you intend to test these queries yourself, make sure to size your databases appropriately – to repeat the tests I executed, you’ll need at least 200 MB in the data file and 400 MB in the log file. Allocate less, and autogrow will probably kick in, reducing performance even further.
Execution time (ms)
After seeing these numbers, I didn’t even try to extrapolate them to the original problem (that needs 13 (size, amount) pairs for possible sessions up to 100 candidates, and up to 700 sessions per quarter with the full amount of test data). It would probably be several years, maybe even centuries – though of course I have already demonstrated that the query will never run on any current database anyway, because it’s too long and returns too many columns.
Back to the cursor?
By now, you might have concluded that this post shows that for the bin-packing problem, using a cursor really is the only way to go. But allow me to disagree. While there is no doubt that a truly set-based single query solution is absolutely not viable here, there are still other possibilities. As I already indicated in the first part of the series, it is possible to combine set-based queries with iteration for a solution that packs about as well as the better cursor-based variants, but runs much faster. This solution will be the subject of my next post in this series – so stay tuned!
Are you planning to attend this year’s PASS Community Summit? There’s only three weeks left before the pre-conference seminars kick off, so if you’re not registered yet, now is the time to act!
The session schedule shows two days with seven full-day pre-conference seminars each, followed by three days packed full of high-quality sessions – with four time slots per day, over three days, and twelve sessions in parallel at each time slot, you problem will not be finding sessions that are of interest to you, but rather choosing between them.
And you will be able to see and hear me speaking at two of those sessions.
- On Wednesday, I will be one of the participants in Much Ado: A Panel Discussion About “Nothing”, a debate between several MVPs and SQLBloggers about the always heated subject of NULL.
- On Thursday, I’ll present “Cursors or setbased? Why not both?”. In this session, I’ll show that using cursors or a set-based query is not an either or decision; but that there are ways to combine the strengths of both approaches into a (sometimes!) winning strategy.
(Please note that I’m not sure if the session schedule is final or still subject to change)
So if you are at PASS, I hope that you’ll decide to drop in on either or both of those sessions. But even if you don’t, I still look forward to meeting you – so if you see me passing by and want to shake hands or exchange some opinions, don’t hesitate to catch me!
When I started blogging here on sqlblog.com, I intended to write about stuff like T-SQL, performance, and such; but also about data modeling and database design. In reality, the latter has hardly happened so far – but I will try to change that in the future. Starting off with this post, in which I will pose (and attempt to answer) the rather philosophical question of the title: is data modeling an art or a science?
Before I can answer the question, or at least tell you how I think about the subject, we need to get the terms straight. So that if we disagree, we’ll at least disagree over the same things instead of actually agreeing but not noticing. In the next two paragraphs, I’ll try to define what data modeling is and how it relates to database design, and what sets art apart from science in our jobs.
Data modeling and database design
When you have to create a database, you will ideally go through two stages before you start to type your first CREATE TABLE statement. The first stage is data modeling. In this stage, the information that has to be stored is inventoried and the structure of that information is determined. This results in a logical data model. The logical data model should focus on correctness and completeness; it should be completely implementation agnostic.
The second stage is a transformation of the logical data model to a database design. This stage usually starts with a mechanical conversion from the logical data model to a first draft of the implementation model (for instance, if the data model is represented as an ERM diagram but the data has to be stored in an RDBMS, all entity and all many-to-many relationships become tables and all attributes and 1-to-many relationships become columns). After that, optimization starts. Some of the optimizations will not affect the layout of tables and columns (examples are choosing indexes, or implementing partitioning), but some other optimizations will do just that (such as adding a surrogate key, denormalizing tables, or building indexed views to pre-aggregate some data). The transformation from logical data model to database design should focus on performance for an implementation on a specific platform. As long as care is taken that none of the changes made during this phase affect the actual meaning of the model, the resultant database design will be just as correct and complete (or incorrect and incomplete) as the logical data model that the database design is based on.
Many people prefer to take a shortcut by combining these two stages. They produce a data model that is already geared towards implementation in a specific database. For instance by adding surrogate keys right into the first draft of the data model, because they already know (or think) that they will eventually be added anyway. I consider this to be bad practice for the following reasons:
- It increases the chance of errors. When you have to focus on correctness and performance at the same time, you can more easily lose track.
- It blurs the line. If a part of a data model can never be implemented at acceptable performance, an explicit decision has to be made (and hopefully documented) to either accept crappy performance or change the data model. If you model for performance, chances are you’ll choose a better performing alternative straight away and never document properly why you made a small digression from the original requirements.
- It might have negative impact on future performance. The next release of your DMBS, or maybe even the next service pack, may cause today’s performance winner to be tomorrows performance drainer. By separating the logical data model from the actual database design, you make it very easy to periodically review the choices you made for performance and assess whether they are still valid.
- It reduces portability and maintainability. If, one day, your boss informs you that you need to port an existing application to another RDBMS, you’ll bless the day you decided to separate the logical data model from the physical database design. Because you now only need to pull out the (still completely correct) logical data model, transform again, but this time apply optimization tricks for the new RDBMS. And also, as requirements change, it is (in my experience) easier to identify required changes in an implementation-independent logical data model and then move the changes over to the physical design, than to do that all at once if only the design is available.
- It may lead to improper choices. More often that I’d like, I have seen good modelers fall victim to bad habits. Such as, for instance, adding a surrogate key to every table (or entity or whatever) in the data model. But just because surrogate keys are often good for performance (on SQL Server that is – I wouldn’t know about other DBMS’s) doesn’t mean they should always be used. And the next step (that I’ve witnessed too!) is forgetting to identify the real key because there already is a key (albeit a surrogate key).
Art and science
For the sake of this discussion, “art” (work created by an artist) is the result of some creative process, usually completely new and unique in some way. Most artists apply learned skills, though not always in the regular way. Artists usually need some kind of inspiration. There is no way to say whether a work of art is “good” or “bad”, as that is often in the eye of the beholder – and even if all beholders agree that the work sucks, you still can’t pinpoint what exactly the artist has done wrong. Examples of artists include painters, composers, architects, etc. But some people not usually associated with art also fit the above definition, such as politicians, blog authors, or scientists (when breaking new grounds, such as Codd did when he invented the relational model). Or a chef in a fancy restaurant who is experimenting with ingredients and cooking processes to find great new recipes to include on the menu.
“Science” does NOT refer to the work of scientists, but to work carried out by professionals, for which not creativity but predictability is the first criterion. Faced with the same task, a professional should consistently arrive at correct results. That doesn’t imply that he or she always will get correct results, but if he or she doesn’t, then you can be sure that an error is made and that careful examination of the steps taken will reveal exactly what that error was. Examples of professionals include bakers, masons, pilots, etc. All people that you trust to deliver work of a consistent quality – you want your bread to taste the same each day, you want to be sure your home won’t collapse, and you expect to arrive safely whenever you embark a plane. And a regular restaurant cook is also supposed to cook the new meal the chef put on the menu exactly as the chef instructed.
Data modeling: art or science?
Now that all the terms have been defined, it’s time to take a look at the question I started this blog post with – is data modeling an art or a science? And should it be?
To start with the latter, I think it should be a science. If a customer pays a hefty sum of money to a professional data modeler to deliver a data model, then, assuming the customer did everything one can reasonably expect1 to answer questions from the data modeler, the customer can expect the data model to be correct2.
1 I consider it reasonable to expect that the customer ensures that all relevant questions asked by the data modeler are answered, providing the data modeler asks these questions in a language and a jargon familiar to the customer and/or the employees he interviews. I do not consider it reasonable to expect the customer (or his employees) to learn language, jargon, or diagramming style used by data modelers.
2 A data model is correct if it allows any data collection that is allowed according to the business rules, and disallows any data collection that would violate any business rule.
Unfortunately, my ideas of what data modeling should be like appear not to be on par with the current state of reality. One of the key factors I named for “science” is predictability. And to have a predictable outcome, a process must have a solid specification. As in, “if situation X arises, you need to ask the customer question Y; in case of answer Z, add this and remove that in the data model”. Unfortunately, such exactly specified process steps are absent in most (and in all commonly used!) data modeling methods. However, barring those rules, you have to rely on the inspiration of the data modeler – will he or she realize that question Y has to be asked? And if answer Z is given, will the modeler realize that this has to be added and that has to be removed? And if that doesn’t happen, then who’s to blame? The modeler, for doing everything the (incomplete) rules that do exist prescribe, but lacking the inspiration to see what was required here? Or should we blame ourselves, our industry, for allowing ourselves to have data modeling as art for several decades already, and still accepting this as “the way it is”?
Many moons ago, when I was a youngster that had just landed a job as a PL/I programmer, I was sent to a course for Jackson Structured Programming. This is a method to build programs that process one or more inputs and produce one or more outputs. Though it can be used for interactive programs as well, it’s main strength is for batch programs, accessing sequential files. The course was great – though the students would not always arrive at the exact same design, each design would definitely be either correct, or incorrect. All correct designs would yield the same end result when executed against the same data. And for all incorrect designs, the teacher was able to pinpoint where in the process an error was made. For me, this course changed programming from art into science.
A few years later, I was sent to a data modeling course. Most of the course focused on how to represent data models (some variation of ERM was used). We were taught how to represent the data model, but not how to find it. At the end, we were given a case study and asked to make a data model, which we would then present to the class. When we were done and the first model was presented, I noticed some severe differences from my model – differences that would result in different combinations of data being allowed or rejected. So when the teacher made some minor adjustments and then said that this was a good model, I expected to get a low note for my work. Then the second student had model that differed from both the first and my model – and again, the teacher mostly agreed to the choices made. This caused me to regain some of my confidence – and indeed, when it was my turn to present my, once again very different, model, I too was told that this was a very good one. So we were left with three models, all very different, and according to the instructor, they were all “correct” – and yet, data that would be allowed in the one would be rejected by the other. So this course taught me that data modeling was not science, but pure art.
This was over a decade ago. But has anything changed in between? Well, maybe it has – but if so, it must have been when I was not paying attention, for I still do not see any mainstream data modeling method that does provide the modeler with a clear set of instructions on what to do in every case, or the auditor with a similar set of instructions to check whether the modeler did a great job, or screwed up.
Who’s to blame?
Another difference between art and science is the assignment of blame. If you buy a bread, have a house built, or embark on a plane, then you know who to blame if the bread is sour, if the house collapses, or if the plane lands on the wrong airport. But if you ask a painter to create a painting for your living and you don’t like the result, you can not tell him that he screwed up – because beauty is truly a matter of taste.
Have you ever been to a shop where all colors paint can be made by combining adequate amounts of a few base colors? Suppose you go to such a shop with a small sample of dyed wood, asking them to mix you the exact same color. The shopkeeper rummages through some catalogs, compares some samples, and then scribbles on a note: “2.98 liters white (#129683), 0.15 liters of cyan (#867324), and 0.05 liters of red (#533010)”. He then tells you that you have to sign the note before he can proceed to mix the paint. So you sign. And then, once the paint has been mixed, you see that it’s the wrong color – and the shop keepers then waves the signed slip of paper, telling you it’s “exactly according to the specification you signed off”, so you can’t get your money back. Silly, right?
And yet, in almost every software project, there will be a moment when a data model is presented to the customer, usually in the form of an ERM diagram or a UML class diagram, and the customer is required to sign off for the project to continue. This is, with all respect, the same utter madness as the paint example. Let’s not forget that the customer is probably a specialist in his trade, be it banking, insurance, or selling ice cream, but not in reading ERM or UML diagrams. How is he supposed to check whether the diagram is an accurate representation of his business needs and business rules?
The reason why data modelers require the customer to sign off the data model is, because they know that data modeling is not science but art. They know that the methods they use can’t guarantee correct results, even on correct inputs. So they require a signature on the data model, so that later, when nasty stuff starts hitting the fan, they can wave the signature in the customer’s face, telling him that he himself signed for the implemented model.
In the paint shop, I’m sure that nobody would agree to sign the slip with the paint serial numbers. I wouldn’t! I would agree, though, to place my signature on the sample I brought in, as that is a specification I can understand. Translating that to paint numbers and quantities is supposed to be the shopkeepers’ job, so let him take responsibility for that.
So, I guess the real question is … why do customers still accept it when they are forced to sign for a model they are unable to understand? Why don’t they say that they will gladly sign for all the requirements they gave, and for all the answers they provided to questions that were asked in a language they understand, but that they insist on the data modeler taking responsibility for his part of the job?
Maybe the true art of data modeling is that data modelers are still able to get away with it…
Wow! Would you believe that it’s almost five months since my last blog post? How time flies.
No, I have not forgotten about you. I know you’ve all been faithfully checking the site (or your feed) each day, maybe even each hour, to get my next post. What can I say? I’m sorry for keeping you waiting so long.
The truth is, that I have been very, very busy. Lots of time has been spent on work (for without work, there will be no salary and hence no way to feed the family) and family (for why bother working hard to feed them if I can’t enjoy my time with them). And out of the remaining spare time, large chunks have been allocated to tech editing the next edition of Paul Nielsen’s SQL Server Bible. And apart from that, there are also a few hobby’s and interests that have nothing to do with SQL Server and that I will therefore not share with you J.
I hope to be able to post a bit more in the future. I plan to publish some of my thoughts on database design. But I have not forgotten about the unfinished “bin packing” series I started late last year – at least two more installments are planned there. And apart from that, I have various ideas and snippets collected. So if time permits, you can expect to see a bit more from me in the future.
I guess that many people using UPDATE … FROM on a daily basis do so without being aware that they are violating all SQL standards.
All versions of the ANSI SQL standard that I checked agree that an UPDATE statement has three clauses – the UPDATE clause, naming the table to be updated; the SET clause, specifying the columns to change and their new values; and the optional WHERE clause to filter the rows to be updated. No FROM or JOIN – if you need data from a different table, use a subquery in the SET clause. The optional FROM and JOIN clauses were added by Microsoft, as an extension to the standard syntax (and just to make out lives more interesting, they invented different variations of the syntax for SQL Server and for Access). So when you are in the habit of using them, be prepared to review all your UPDATE statements when moving to Oracle, DB2, Sybase, MySQL, or even a different Microsoft database!
Standards? Bah, who cares?
Well, some do. Me for instance – I will never use proprietary syntax if I know a standard alternative, expect if using the latter has severe negative consequences. And maybe you will, one day, when your boss comes back from the golf course with the great news that he managed to convince a colleague (who just happens to work in an Oracle shop) to buy a copy of your company’s application instead of some off-the-shelf product. Or when there’s a great job opportunity for someone with cross platform skills. Or when you are asked to help out this new colleague with 10+ years of DB2 experience. One of the lesser known side effects of Murphy’s Law is that those who least expect having to move their database to another platform, will.
But even if you really don’t care about portability, there are other reasons to be wary of using UPDATE FROM. In fact, the most important reason why I dislike UPDATE FROM is not that it’s non-standard, but that it is just too easy to make mistakes with.
Correctness? Bah, who cares?
Well, most do. That’s why we test.
If I mess up the join criteria in a SELECT query so that too many rows from the second table match, I’ll see it as soon as I test, because I get more rows back then expected. If I mess up the subquery criteria in an ANSI standard UPDATE query in a similar way, I see it even sooner, because SQL Server will return an error if the subquery returns more than a single value. But with the proprietary UPDATE FROM syntax, I can mess up the join and never notice – SQL Server will happily update the same row over and over again if it matches more than one row in the joined table, with only the result of the last of those updates sticking. And there is no way of knowing which row that will be, since that depends in the query execution plan that happens to be chosen. A worst case scenario would be one where the execution plan just happens to result in the expected outcome during all tests on the single-processor development server – and then, after deployment to the four-way dual-core production server, our precious data suddenly hits the fan…
Well, almost. There’s one more thing. Probably not something you’ll run into on a daily base, but good to know nonetheless. If the target of the update happens to be a view instead of a base table, and there is an INSTEAD OF UPDATE trigger defined for the view, the UPDATE will fail with this error message:
Msg 414, Level 16, State 1, Line 1
UPDATE is not allowed because the statement updates view "v1" which participates in a join and has an INSTEAD OF UPDATE trigger.
Of course, most people will never run into this. But I did have the misfortune of doing so once – unfortunately, I discovered this limitation after rewriting several hundred ANSI standard UPDATE statements to the equivalent UPDATE FROM, and having to convert them all back after as much as a single test…
And that’s why you want to deprecate UPDATE FROM?
Well, no. The view with INSTEAD OF UPDATE trigger won’t affect many people. And the possibility of error can be somewhat thwarted by making sure (and double-checking) to always include all columns of the primary key (or a unique constraint) of the source table. So we’re back to the more principle point of avoiding proprietary syntax if there is an ANSI standard alternative with no or limited negative consequences. And in the case of UPDATE FROM, there are some cases where the standard syntax just doesn’t cut it.
One such scenario is when a file is read in periodically with updated information that has to be pushed into the main table. The code below sets up a simplified example of this – a table Customers, with SSN as its primary key, that stores address and lots of other information, and a table Moved, which is the staging table containing the contents of a file received from a third party listing new address for people who recently moved. I have also included the code to preload the tables with some mocked up data – the Customers table has 10,000 rows, and the Moved table has 3,000 rows, 1,000 of which match an existing row in the Customers table. The others don’t – those people are apparently not our customers.
CREATE TABLE Customers
(SSN char(9) NOT NULL,
Street varchar(40) NOT NULL,
HouseNo int NOT NULL,
City varchar(40) NOT NULL,
LotsOfOtherInfo char(250) NOT NULL DEFAULT (''),
PRIMARY KEY (SSN),
CHECK (SSN NOT LIKE '%[^0-9]%')
CREATE TABLE Moved
(SSN char(9) NOT NULL,
Street varchar(40) NOT NULL,
HouseNo int NOT NULL,
City varchar(40) NOT NULL,
PRIMARY KEY (SSN),
CHECK (SSN NOT LIKE '%[^0-9]%')
INSERT INTO Customers(SSN, Street, HouseNo, City)
SELECT RIGHT(Number+1000000000,9), 'Street ' + CAST(Number AS varchar(10)),
Number, 'City ' + CAST(Number AS varchar(10))
WHERE Number BETWEEN 1 AND 30000
AND Number % 3 = 0;
INSERT INTO Moved(SSN, Street, HouseNo, City)
SELECT RIGHT(Number+1000000000,9), 'New street ' + CAST(Number AS varchar(10)),
Number * 2, 'New city ' + CAST(Number AS varchar(10))
WHERE Number BETWEEN 1 AND 30000
AND Number % 10 = 0;
Since ANSI-standard SQL does not allow a join to be used in the UPDATE statement, we’ll have to use subqueries to find the new information, and to find the rows that need to be updated, resulting in this query:
SET Street = (SELECT Street
FROM Moved AS m
WHERE m.SSN = Customers.SSN),
HouseNo = (SELECT HouseNo
FROM Moved AS m
WHERE m.SSN = Customers.SSN),
City = (SELECT City
FROM Moved AS m
WHERE m.SSN = Customers.SSN)
WHERE EXISTS (SELECT *
FROM Moved AS m
WHERE m.SSN = Customers.SSN);
There’s a lot of duplicated code in here. And if we were getting data from a complicated subquery instead of the table Moved, it would be even worse (though we can at least put all the duplicated code in a CTE since SQL Server 2005). Of course, writing the code is done quickly enough once you master the use of copy and paste, but the code has to be maintained as well.
Maybe even worse is that the performance of this query just sucks – if you run this (enclosed in a BEGIN TRAN / ROLLBACK TRAN, so you can run the variations below without having to rebuild the original data) and check out the execution plan, you’ll see that the optimizer needs no less than five table scans (one for Customers, and four for Moved) and four merge join operators. And that, too, would be much worse if the source of the data had been a complex subquery (and no, using a CTE will not help the optimizer find a better plan – it just doesn’t understand that the four subqueries are similar enough that they can be collapsed.
Now, if Microsoft had chosen to implement row-value constructors (as defined in the ANSI standard), we could have simplified this to
SET (Street, HouseNo, City)
= (SELECT Street, HouseNo, City
FROM Moved AS m
WHERE m.SSN = Customers.SSN)
WHERE EXISTS (SELECT *
FROM Moved AS m
WHERE m.SSN = Customers.SSN);
But this is invalid syntax in any version of SQL Server (including the latest CTP for SQL Server 2008), and I know of no plans to change that before SQL Server 2008 RTMs.
But with using the proprietary UPDATE FROM syntax, we can simplify this, and get a much better performance to boot. Here’s how the same update is written in non-portable code:
SET Street = m.Street,
HouseNo = m.HouseNo,
City = m.City
FROM Customers AS c
INNER JOIN Moved AS m
ON m.SSN = c.SSN;
And now, the optimizer will produce a plan that scans each table only once and has only a single merge join operator. Some quick tests (with much more rows in the tables) show that it executes two to three times quicker than the ANSI standard version. For that performance gain, I will gladly choose the proprietary syntax over the standard!
What’s with the title of this post then? Why deprecate a fine feature?
Patience, we’re getting there. Bear with me.
All the above is true for versions of SQL Server up to SQL Server 2005. But SQL Server 2008 will change the playing field. It introduces a new statement, MERGE, that is specifically designed for situations where rows from a table source either have to be inserted into a destination table, or have to be used to update existing rows in the destination table. However, there is no law that prescribes that any MERGE should always actually include both an insert and an update clause – so with this new statement, we can now rewrite the above code as follows:
MERGE INTO Customers AS c
USING Moved AS m
ON m.SSN = c.SSN
SET Street = m.Street,
HouseNo = m.HouseNo,
City = m.City;
As you can see, the source table and the join criteria are included only once, just as in the proprietary UPDATE FROM. The execution plan (tested on the February CTP, also known as CTP6) is also quite similar, including just a few extra operators that are specific to the new MERGE statement. What really surprised me, was that the plan for the MERGE statement was estimated to be about 65% cheaper (faster) than the corresponding UPDATE FROM statement. However, I think SQL Server is lying here – a quick test with more data shows only an extremely marginal advantage of MERGE over UPDATE FROM. This test was too limited to draw serious conclusions, but I am quite sure that there will not be a 65% saving by using MERGE over UPDATE FROM. (I do expect such a saving form either MERGE or UPDATE FROM over the ANSI-compliant UPDATE statement for this case).
The good news is that:
1) The MERGE statement is described in SQL:2003 and can thus be considered ANSI standard. (In fact, SQL Server implements a superset of the ANSI standard MERGE syntax: everything described in the syntax is implemented, but there are some non-standard extensions that make the command even more useful as well. However, the example above uses only the standard features and should hence run on each DBMS that conforms to the SQL:2003 version of MERGE).
2) The MERGE statement will return an error message if I mess up my join criteria so that more than a single row from the source is matched:
Msg 8672, Level 16, State 1, Line 1
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
3) The MERGE statement will gladly accept a view with an INSTEAD OF UPDATE trigger as the target of the update.
So as you see, MERGE allows me to achieve what I previously could achieve only with an ANSI standard UPDATE statement with lots of duplicated code and lousy performance, or with a UPDATE FROM statement that hinders portability, introduces a higher than normal risk of errors going unnoticed through QA right into the production database, and has some odd limitation on views with INSTEAD OF UPDATE triggers. None of these downsides and limitations apply to MERGE. And if there are any other problems with MERGE, I have yet to find them.
With this alternative available, I fail to see any reason why the proprietary UPDATE FROM syntax should be maintained. In my opinion, it can safely be marked as deprecated in SQL Server 2008. It should of course still work, as “normal” supported syntax in both SQL Server 2008 and the next version, and in at least one version more if the database is set to a lower compatibility – but it should be marked as deprecated, and it should eventually be removed from the product. Why waste resources on maintaining that functionality, when there is an alternative that is better in every conceivable way? I’d much rather see the SQL Server team spend their time and energy on more important stuff, such as full support for row-value constructors and full support for the OVER() clause. Or maybe even on releasing Service Pack 3 for SQL Server 2005!