
SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server'

  • Data modeling: art or science?

    When I started blogging here on sqlblog.com, I intended to write about stuff like T-SQL, performance, and such; but also about data modeling and database design. In reality, the latter has hardly happened so far – but I will try to change that in the future. Starting off with this post, in which I will pose (and attempt to answer) the rather philosophical question of the title: is data modeling an art or a science?

     

    Before I can answer the question, or at least tell you how I think about the subject, we need to get the terms straight – so that if we disagree, we’ll at least disagree over the same things instead of actually agreeing without noticing it. In the next two paragraphs, I’ll try to define what data modeling is and how it relates to database design, and what sets art apart from science in our jobs.

     

    Data modeling and database design

     

    When you have to create a database, you will ideally go through two stages before you start to type your first CREATE TABLE statement. The first stage is data modeling. In this stage, the information that has to be stored is inventoried and the structure of that information is determined. This results in a logical data model. The logical data model should focus on correctness and completeness; it should be completely implementation agnostic.

     

    The second stage is a transformation of the logical data model to a database design. This stage usually starts with a mechanical conversion from the logical data model to a first draft of the implementation model (for instance, if the data model is represented as an ERM diagram but the data has to be stored in an RDBMS, all entities and all many-to-many relationships become tables and all attributes and 1-to-many relationships become columns). After that, optimization starts. Some of the optimizations will not affect the layout of tables and columns (examples are choosing indexes, or implementing partitioning), but other optimizations will do just that (such as adding a surrogate key, denormalizing tables, or building indexed views to pre-aggregate some data). The transformation from logical data model to database design should focus on performance for an implementation on a specific platform. As long as care is taken that none of the changes made during this phase affect the actual meaning of the model, the resulting database design will be just as correct and complete (or incorrect and incomplete) as the logical data model it is based on.
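
    To make the difference between the two stages concrete, here is a deliberately simplified sketch (the entity and all names in it are invented for illustration, not taken from any real model): a logical Client entity, identified by its social security number, might end up as the table below after the design phase adds a surrogate key for performance – while the original key is kept as a constraint, so the meaning of the model is preserved.

    -- Hypothetical physical design of a logical "Client" entity.
    -- ClientID is a design-phase optimization (surrogate key); the logical key (SSN)
    -- is still declared and enforced, so the meaning of the model does not change.
    CREATE TABLE dbo.Client
          (ClientID int IDENTITY NOT NULL PRIMARY KEY,
           SSN      char(9)      NOT NULL UNIQUE,
           Name     varchar(40)  NOT NULL,
           City     varchar(40)  NOT NULL
          );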

     

    Many people prefer to take a shortcut by combining these two stages. They produce a data model that is already geared towards implementation in a specific database – for instance, by adding surrogate keys right into the first draft of the data model, because they already know (or think they know) that these will eventually be added anyway. I consider this to be bad practice for the following reasons:

    1. It increases the chance of errors. When you have to focus on correctness and performance at the same time, you can more easily lose track.
    2. It blurs the line. If a part of a data model can never be implemented with acceptable performance, an explicit decision has to be made (and hopefully documented) to either accept crappy performance or change the data model. If you model for performance, chances are you’ll choose a better performing alternative straight away and never properly document why you deviated from the original requirements.
    3. It might have negative impact on future performance. The next release of your DBMS, or maybe even the next service pack, may cause today’s performance winner to be tomorrow’s performance drainer. By separating the logical data model from the actual database design, you make it very easy to periodically review the choices you made for performance and assess whether they are still valid.
    4. It reduces portability and maintainability. If, one day, your boss informs you that you need to port an existing application to another RDBMS, you’ll bless the day you decided to separate the logical data model from the physical database design. You now only need to pull out the (still completely correct) logical data model and transform it again, this time applying the optimization tricks for the new RDBMS. And as requirements change, it is (in my experience) easier to identify the required changes in an implementation-independent logical data model and then move them over to the physical design, than to do it all at once when only the design is available.
    5. It may lead to improper choices. More often than I’d like, I have seen good modelers fall victim to bad habits. Such as, for instance, adding a surrogate key to every table (or entity or whatever) in the data model. But just because surrogate keys are often good for performance (on SQL Server, that is – I wouldn’t know about other DBMSs) doesn’t mean they should always be used. And the next step (that I’ve witnessed too!) is forgetting to identify the real key because there already is a key (albeit a surrogate key).

     

    Art and science

     

    For the sake of this discussion, “art” (work created by an artist) is the result of some creative process, usually completely new and unique in some way. Most artists apply learned skills, though not always in the regular way. Artists usually need some kind of inspiration. There is no way to say whether a work of art is “good” or “bad”, as that is often in the eye of the beholder – and even if all beholders agree that the work sucks, you still can’t pinpoint what exactly the artist has done wrong. Examples of artists include painters, composers, architects, etc. But some people not usually associated with art also fit the above definition, such as politicians, blog authors, or scientists (when breaking new ground, as Codd did when he invented the relational model). Or a chef in a fancy restaurant who is experimenting with ingredients and cooking processes to find great new recipes to include on the menu.

     

    “Science” does NOT refer to the work of scientists, but to work carried out by professionals, for which not creativity but predictability is the first criterion. Faced with the same task, a professional should consistently arrive at correct results. That doesn’t imply that he or she always will get correct results, but if he or she doesn’t, then you can be sure that an error was made and that careful examination of the steps taken will reveal exactly what that error was. Examples of professionals include bakers, masons, pilots, etc. All people that you trust to deliver work of a consistent quality – you want your bread to taste the same each day, you want to be sure your home won’t collapse, and you expect to arrive safely whenever you board a plane. And a regular restaurant cook is also supposed to cook the new meal the chef put on the menu exactly as the chef instructed.

     

    Data modeling: art or science?

     

    Now that all the terms have been defined, it’s time to take a look at the question I started this blog post with – is data modeling an art or a science? And should it be?

     

    To start with the latter, I think it should be a science. If a customer pays a hefty sum of money to a professional data modeler to deliver a data model, then, assuming the customer did everything one can reasonably expect1 to answer questions from the data modeler, the customer can expect the data model to be correct2.

     

    1          I consider it reasonable to expect that the customer ensures that all relevant questions asked by the data modeler are answered, provided the data modeler asks these questions in a language and a jargon familiar to the customer and/or the employees he interviews. I do not consider it reasonable to expect the customer (or his employees) to learn the language, jargon, or diagramming style used by data modelers.

    2          A data model is correct if it allows any data collection that is allowed according to the business rules, and disallows any data collection that would violate any business rule.

     

    Unfortunately, my ideas of what data modeling should be like appear not to match the current state of reality. One of the key factors I named for “science” is predictability. And to have a predictable outcome, a process must have a solid specification. As in, “if situation X arises, you need to ask the customer question Y; in case of answer Z, add this and remove that in the data model”. Unfortunately, such exactly specified process steps are absent in most (and in all commonly used!) data modeling methods. Absent those rules, you have to rely on the inspiration of the data modeler – will he or she realize that question Y has to be asked? And if answer Z is given, will the modeler realize that this has to be added and that has to be removed? And if that doesn’t happen, who’s to blame? The modeler, for doing everything the (incomplete) rules that do exist prescribe, but lacking the inspiration to see what was required here? Or should we blame ourselves, our industry, for allowing data modeling to remain an art for several decades already, and for still accepting this as “the way it is”?

     

    Many moons ago, when I was a youngster who had just landed a job as a PL/I programmer, I was sent to a course on Jackson Structured Programming. This is a method to build programs that process one or more inputs and produce one or more outputs. Though it can be used for interactive programs as well, its main strength is batch programs that access sequential files. The course was great – though the students would not always arrive at the exact same design, each design would definitely be either correct or incorrect. All correct designs would yield the same end result when executed against the same data. And for all incorrect designs, the teacher was able to pinpoint where in the process an error was made. For me, this course changed programming from art into science.

     

    A few years later, I was sent to a data modeling course. Most of the course focused on how to represent data models (some variation of ERM was used). We were taught how to represent the data model, but not how to find it. At the end, we were given a case study and asked to make a data model, which we would then present to the class. When we were done and the first model was presented, I noticed some severe differences from my model – differences that would result in different combinations of data being allowed or rejected. So when the teacher made some minor adjustments and then said that this was a good model, I expected to get a low grade for my work. Then the second student had a model that differed from both the first and my model – and again, the teacher mostly agreed with the choices made. This caused me to regain some of my confidence – and indeed, when it was my turn to present my, once again very different, model, I too was told that this was a very good one. So we were left with three models, all very different, and according to the instructor, they were all “correct” – and yet, data that would be allowed in the one would be rejected by the other. So this course taught me that data modeling was not science, but pure art.

     

    This was over a decade ago. But has anything changed in between? Well, maybe it has – but if so, it must have been when I was not paying attention, for I still do not see any mainstream data modeling method that does provide the modeler with a clear set of instructions on what to do in every case, or the auditor with a similar set of instructions to check whether the modeler did a great job, or screwed up.

     

    Who’s to blame?

     

    Another difference between art and science is the assignment of blame. If you buy a loaf of bread, have a house built, or board a plane, then you know who to blame if the bread is sour, if the house collapses, or if the plane lands at the wrong airport. But if you ask a painter to create a painting for your living room and you don’t like the result, you cannot tell him that he screwed up – because beauty is truly a matter of taste.

     

    Have you ever been to a shop where any color of paint can be made by combining appropriate amounts of a few base colors? Suppose you go to such a shop with a small sample of dyed wood, asking them to mix you the exact same color. The shopkeeper rummages through some catalogs, compares some samples, and then scribbles on a note: “2.98 liters white (#129683), 0.15 liters of cyan (#867324), and 0.05 liters of red (#533010)”. He then tells you that you have to sign the note before he can proceed to mix the paint. So you sign. And then, once the paint has been mixed, you see that it’s the wrong color – and the shopkeeper waves the signed slip of paper, telling you it’s “exactly according to the specification you signed off”, so you can’t get your money back. Silly, right?

     

    And yet, in almost every software project, there will be a moment when a data model is presented to the customer, usually in the form of an ERM diagram or a UML class diagram, and the customer is required to sign off for the project to continue. This is, with all due respect, the same utter madness as the paint example. Let’s not forget that the customer is probably a specialist in his trade, be it banking, insurance, or selling ice cream, but not in reading ERM or UML diagrams. How is he supposed to check whether the diagram is an accurate representation of his business needs and business rules?

     

    The reason data modelers require the customer to sign off on the data model is that they know data modeling is not science but art. They know that the methods they use can’t guarantee correct results, even on correct inputs. So they require a signature on the data model, so that later, when nasty stuff starts hitting the fan, they can wave the signature in the customer’s face, telling him that he himself signed for the implemented model.

     

    In the paint shop, I’m sure that nobody would agree to sign the slip with the paint serial numbers. I wouldn’t! I would agree, though, to place my signature on the sample I brought in, as that is a specification I can understand. Translating that to paint numbers and quantities is supposed to be the shopkeeper’s job, so let him take responsibility for that.

     

    So, I guess the real question is … why do customers still accept it when they are forced to sign for a model they are unable to understand? Why don’t they say that they will gladly sign for all the requirements they gave, and for all the answers they provided to questions that were asked in a language they understand, but that they insist on the data modeler taking responsibility for his part of the job?

     

    Maybe the true art of data modeling is that data modelers are still able to get away with it…

  • [OT] Alive and kicking

    Wow! Would you believe that it’s almost five months since my last blog post? How time flies.

     

    No, I have not forgotten about you. I know you’ve all been faithfully checking the site (or your feed) each day, maybe even each hour, to get my next post. What can I say? I’m sorry for keeping you waiting so long.

     

    The truth is that I have been very, very busy. Lots of time has been spent on work (for without work, there will be no salary and hence no way to feed the family) and family (for why bother working hard to feed them if I can’t enjoy my time with them). And out of the remaining spare time, large chunks have been allocated to tech editing the next edition of Paul Nielsen’s SQL Server Bible. And apart from that, there are also a few hobbies and interests that have nothing to do with SQL Server and that I will therefore not share with you :-).

     

    I hope to be able to post a bit more in the future. I plan to publish some of my thoughts on database design. But I have not forgotten about the unfinished “bin packing” series I started late last year – at least two more installments are planned there. And apart from that, I have various ideas and snippets collected. So if time permits, you can expect to see a bit more from me in the future.

  • Let's deprecate UPDATE FROM!

    I guess that many people using UPDATE … FROM on a daily basis do so without being aware that they are violating all SQL standards.

     

    All versions of the ANSI SQL standard that I checked agree that an UPDATE statement has three clauses – the UPDATE clause, naming the table to be updated; the SET clause, specifying the columns to change and their new values; and the optional WHERE clause to filter the rows to be updated. No FROM or JOIN – if you need data from a different table, use a subquery in the SET clause. The optional FROM and JOIN clauses were added by Microsoft as an extension to the standard syntax (and just to make our lives more interesting, they invented different variations of the syntax for SQL Server and for Access). So if you are in the habit of using them, be prepared to review all your UPDATE statements when moving to Oracle, DB2, Sybase, MySQL, or even a different Microsoft database!
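
    As a quick illustration of what the standard form looks like (the table and column names here are made up for illustration only; a full worked example follows later in this post), copying a value from another table without FROM or JOIN uses a correlated subquery in the SET clause:

    -- Hypothetical tables, for illustration: copy prices from a staging table into a
    -- products table using only ANSI-standard UPDATE syntax.
    UPDATE Products
    SET    Price = (SELECT s.Price
                    FROM   StagingPrices AS s
                    WHERE  s.ProductCode = Products.ProductCode)
    WHERE EXISTS   (SELECT *
                    FROM   StagingPrices AS s
                    WHERE  s.ProductCode = Products.ProductCode);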

     

    Standards? Bah, who cares?

     

    Well, some do. Me, for instance – I will never use proprietary syntax if I know a standard alternative, except when using the latter has severe negative consequences. And maybe you will, one day, when your boss comes back from the golf course with the great news that he managed to convince a colleague (who just happens to work in an Oracle shop) to buy a copy of your company’s application instead of some off-the-shelf product. Or when there’s a great job opportunity for someone with cross-platform skills. Or when you are asked to help out that new colleague with 10+ years of DB2 experience. One of the lesser known side effects of Murphy’s Law is that those who least expect having to move their database to another platform, will.

     

    But even if you really don’t care about portability, there are other reasons to be wary of using UPDATE FROM. In fact, the most important reason why I dislike UPDATE FROM is not that it’s non-standard, but that it is just too easy to make mistakes with.

     

    Correctness? Bah, who cares?

     

    Well, most do. That’s why we test.

     

    If I mess up the join criteria in a SELECT query so that too many rows from the second table match, I’ll see it as soon as I test, because I get more rows back than expected. If I mess up the subquery criteria in an ANSI standard UPDATE query in a similar way, I see it even sooner, because SQL Server will return an error if the subquery returns more than a single value. But with the proprietary UPDATE FROM syntax, I can mess up the join and never notice – SQL Server will happily update the same row over and over again if it matches more than one row in the joined table, with only the result of the last of those updates sticking. And there is no way of knowing which row that will be, since that depends on the query execution plan that happens to be chosen. A worst case scenario would be one where the execution plan just happens to result in the expected outcome during all tests on the single-processor development server – and then, after deployment to the four-way dual-core production server, our precious data suddenly hits the fan…
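
    A small repro makes the difference tangible. This is a sketch of my own (the temporary tables and values are invented for illustration, not taken from any real system); it relies on the behavior described above: the proprietary form silently picks one of the matching rows, while the standard form raises an error because the subquery returns more than one value.

    -- One target row, two matching source rows.
    CREATE TABLE #Target (ID int NOT NULL PRIMARY KEY, Val int NOT NULL);
    CREATE TABLE #Source (ID int NOT NULL, Val int NOT NULL);
    INSERT INTO #Target (ID, Val) VALUES (1, 0);
    INSERT INTO #Source (ID, Val) VALUES (1, 10);
    INSERT INTO #Source (ID, Val) VALUES (1, 20);

    -- Proprietary UPDATE FROM: succeeds without warning; #Target.Val ends up as
    -- 10 or 20, depending on the execution plan.
    UPDATE     t
    SET        Val = s.Val
    FROM       #Target AS t
    INNER JOIN #Source AS s
          ON   s.ID = t.ID;

    -- ANSI-standard UPDATE: the subquery returns two values, so SQL Server raises
    -- an error instead of silently picking one.
    UPDATE #Target
    SET    Val = (SELECT s.Val
                  FROM   #Source AS s
                  WHERE  s.ID = #Target.ID)
    WHERE EXISTS (SELECT *
                  FROM   #Source AS s
                  WHERE  s.ID = #Target.ID);

    DROP TABLE #Target, #Source;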

     

    That’s all?

     

    Well, almost. There’s one more thing. Probably not something you’ll run into on a daily basis, but good to know nonetheless. If the target of the update happens to be a view instead of a base table, and there is an INSTEAD OF UPDATE trigger defined for the view, the UPDATE will fail with this error message:

     

    Msg 414, Level 16, State 1, Line 1

    UPDATE is not allowed because the statement updates view "v1" which participates in a join and has an INSTEAD OF UPDATE trigger.

     

    Of course, most people will never run into this. But I did have the misfortune of doing so once – unfortunately, I discovered this limitation after rewriting several hundred ANSI standard UPDATE statements to the equivalent UPDATE FROM, and then having to convert them all back after as little as a single test…

     

    And that’s why you want to deprecate UPDATE FROM?

     

    Well, no. The view with an INSTEAD OF UPDATE trigger won’t affect many people. And the possibility of error can be somewhat thwarted by making sure (and double-checking) to always include all columns of the primary key (or a unique constraint) of the source table in the join. So we’re back to the more principled point of avoiding proprietary syntax if there is an ANSI standard alternative with no or limited negative consequences. And in the case of UPDATE FROM, there are some cases where the standard syntax just doesn’t cut it.

     

    One such scenario is when a file is read in periodically with updated information that has to be pushed into the main table. The code below sets up a simplified example of this – a table Customers, with SSN as its primary key, that stores address and lots of other information, and a table Moved, which is the staging table containing the contents of a file received from a third party, listing new addresses for people who recently moved. I have also included the code to preload the tables with some mocked-up data – the Customers table has 10,000 rows, and the Moved table has 3,000 rows, 1,000 of which match an existing row in the Customers table. The others don’t – those people are apparently not our customers.

     

    -- Note: the preload queries assume an auxiliary numbers table, dbo.Numbers.
    CREATE TABLE Customers
          (SSN char(9) NOT NULL,
           Street varchar(40) NOT NULL,
           HouseNo int NOT NULL,
           City varchar(40) NOT NULL,
           LotsOfOtherInfo char(250) NOT NULL DEFAULT (''),
           PRIMARY KEY (SSN),
           CHECK (SSN NOT LIKE '%[^0-9]%')
          );
    CREATE TABLE Moved
          (SSN char(9) NOT NULL,
           Street varchar(40) NOT NULL,
           HouseNo int NOT NULL,
           City varchar(40) NOT NULL,
           PRIMARY KEY (SSN),
           CHECK (SSN NOT LIKE '%[^0-9]%')
          );
    go
    -- 10,000 customers: every third number between 1 and 30,000.
    INSERT INTO Customers(SSN, Street, HouseNo, City)
    SELECT RIGHT(Number+1000000000,9), 'Street ' + CAST(Number AS varchar(10)),
           Number, 'City ' + CAST(Number AS varchar(10))
    FROM   dbo.Numbers
    WHERE  Number BETWEEN 1 AND 30000
    AND    Number % 3 = 0;
    -- 3,000 movers: every tenth number; the 1,000 multiples of 30 match a customer.
    INSERT INTO Moved(SSN, Street, HouseNo, City)
    SELECT RIGHT(Number+1000000000,9), 'New street ' + CAST(Number AS varchar(10)),
           Number * 2, 'New city ' + CAST(Number AS varchar(10))
    FROM   dbo.Numbers
    WHERE  Number BETWEEN 1 AND 30000
    AND    Number % 10 = 0;
    go

     

    Since ANSI-standard SQL does not allow a join to be used in the UPDATE statement, we’ll have to use subqueries to find the new information, and to find the rows that need to be updated, resulting in this query:

     

    UPDATE Customers
    SET    Street  = (SELECT Street
                      FROM   Moved AS m
                      WHERE  m.SSN = Customers.SSN),
           HouseNo = (SELECT HouseNo
                      FROM   Moved AS m
                      WHERE  m.SSN = Customers.SSN),
           City    = (SELECT City
                      FROM   Moved AS m
                      WHERE  m.SSN = Customers.SSN)
    WHERE EXISTS     (SELECT *
                      FROM   Moved AS m
                      WHERE  m.SSN = Customers.SSN);

     

    There’s a lot of duplicated code in here. And if we were getting data from a complicated subquery instead of the table Moved, it would be even worse (though we can at least put all the duplicated code in a CTE since SQL Server 2005). Of course, writing the code is done quickly enough once you master the use of copy and paste, but the code has to be maintained as well.
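
    For completeness, here is a sketch of what such a CTE variant could look like (in this simplified case the CTE just wraps the Moved table; in real code it would hold the complicated subquery). Note that this only removes the duplication from the source code, not from the execution plan, as discussed below.

    WITH NewAddress AS
         (SELECT SSN, Street, HouseNo, City
          FROM   Moved)   -- imagine a complex subquery here
    UPDATE Customers
    SET    Street  = (SELECT Street
                      FROM   NewAddress AS m
                      WHERE  m.SSN = Customers.SSN),
           HouseNo = (SELECT HouseNo
                      FROM   NewAddress AS m
                      WHERE  m.SSN = Customers.SSN),
           City    = (SELECT City
                      FROM   NewAddress AS m
                      WHERE  m.SSN = Customers.SSN)
    WHERE EXISTS     (SELECT *
                      FROM   NewAddress AS m
                      WHERE  m.SSN = Customers.SSN);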

     

    Maybe even worse is that the performance of this query just sucks – if you run this (enclosed in a BEGIN TRAN / ROLLBACK TRAN, so you can run the variations below without having to rebuild the original data) and check out the execution plan, you’ll see that the optimizer needs no fewer than five table scans (one for Customers, and four for Moved) and four merge join operators. And that, too, would be much worse if the source of the data had been a complex subquery (and no, using a CTE will not help the optimizer find a better plan – it just doesn’t understand that the four subqueries are similar enough that they can be collapsed).

     

    Now, if Microsoft had chosen to implement row-value constructors (as defined in the ANSI standard), we could have simplified this to

     

    UPDATE Customers
    SET   (Street, HouseNo, City)
                   = (SELECT Street, HouseNo, City
                      FROM   Moved AS m
                      WHERE  m.SSN = Customers.SSN)
    WHERE EXISTS     (SELECT *
                      FROM   Moved AS m
                      WHERE  m.SSN = Customers.SSN);

     

    But this is invalid syntax in any version of SQL Server (including the latest CTP for SQL Server 2008), and I know of no plans to change that before SQL Server 2008 RTMs.

     

    But using the proprietary UPDATE FROM syntax, we can simplify this and get much better performance to boot. Here’s how the same update is written in non-portable code:

     

    UPDATE     c
    SET        Street     = m.Street,
               HouseNo    = m.HouseNo,
               City       = m.City
    FROM       Customers AS c
    INNER JOIN Moved     AS m
          ON   m.SSN      = c.SSN;

     

    And now, the optimizer will produce a plan that scans each table only once and has only a single merge join operator. Some quick tests (with many more rows in the tables) show that it executes two to three times quicker than the ANSI standard version. For that performance gain, I will gladly choose the proprietary syntax over the standard!

     

    What’s with the title of this post then? Why deprecate a fine feature?

     

    Patience, we’re getting there. Bear with me.

     

    All the above is true for versions of SQL Server up to SQL Server 2005. But SQL Server 2008 will change the playing field. It introduces a new statement, MERGE, that is specifically designed for situations where rows from a table source either have to be inserted into a destination table or have to be used to update existing rows in the destination table. However, there is no law that says a MERGE must always include both an insert and an update clause – so with this new statement, we can now rewrite the above code as follows:

     

    MERGE INTO Customers AS c
    USING      Moved     AS m
          ON   m.SSN      = c.SSN
    WHEN MATCHED
    THEN UPDATE
    SET        Street     = m.Street,
               HouseNo    = m.HouseNo,
               City       = m.City;

     

    As you can see, the source table and the join criteria are included only once, just as in the proprietary UPDATE FROM. The execution plan (tested on the February CTP, also known as CTP6) is also quite similar, including just a few extra operators that are specific to the new MERGE statement. What really surprised me was that the plan for the MERGE statement was estimated to be about 65% cheaper (faster) than the corresponding UPDATE FROM statement. However, I think SQL Server is lying here – a quick test with more data shows only an extremely marginal advantage of MERGE over UPDATE FROM. This test was too limited to draw serious conclusions, but I am quite sure that there will not be a 65% saving by using MERGE over UPDATE FROM. (I do expect such a saving from either MERGE or UPDATE FROM over the ANSI-compliant UPDATE statement for this case).

     

    The good news is that:

    1)      The MERGE statement is described in SQL:2003 and can thus be considered ANSI standard. (In fact, SQL Server implements a superset of the ANSI standard MERGE syntax: everything described in the standard is implemented, but there are also some non-standard extensions that make the command even more useful. However, the example above uses only the standard features and should hence run on each DBMS that conforms to the SQL:2003 version of MERGE.)

    2)      The MERGE statement will return an error message if I mess up my join criteria so that more than a single row from the source is matched:

    Msg 8672, Level 16, State 1, Line 1

    The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.

    3)      The MERGE statement will gladly accept a view with an INSTEAD OF UPDATE trigger as the target of the update.
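
    For context, this is roughly what the more familiar “upsert” form of MERGE would look like for the same tables, with both a matched and a not-matched branch. This sketch is purely illustrative – in the scenario of this post we would not actually want to insert movers who are not our customers.

    MERGE INTO Customers AS c
    USING      Moved     AS m
          ON   m.SSN      = c.SSN
    WHEN MATCHED
    THEN UPDATE
    SET        Street     = m.Street,
               HouseNo    = m.HouseNo,
               City       = m.City
    WHEN NOT MATCHED
    THEN INSERT (SSN, Street, HouseNo, City)
         VALUES (m.SSN, m.Street, m.HouseNo, m.City);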

     

    So as you see, MERGE allows me to achieve what I previously could achieve only with an ANSI standard UPDATE statement with lots of duplicated code and lousy performance, or with an UPDATE FROM statement that hinders portability, introduces a higher than normal risk of errors going unnoticed through QA right into the production database, and has some odd limitation on views with INSTEAD OF UPDATE triggers. None of these downsides and limitations apply to MERGE. And if there are any other problems with MERGE, I have yet to find them.

     

    With this alternative available, I fail to see any reason why the proprietary UPDATE FROM syntax should be maintained. In my opinion, it can safely be marked as deprecated in SQL Server 2008. It should of course still work, as “normal” supported syntax in both SQL Server 2008 and the next version, and in at least one version more if the database is set to a lower compatibility – but it should be marked as deprecated, and it should eventually be removed from the product. Why waste resources on maintaining that functionality, when there is an alternative that is better in every conceivable way? I’d much rather see the SQL Server team spend their time and energy on more important stuff, such as full support for row-value constructors and full support for the OVER() clause. Or maybe even on releasing Service Pack 3 for SQL Server 2005!

  • Want a Service Pack? Ask for it!

    Service Pack 2 for SQL Server 2005 is already 11 months old. And there is still no sign of Service Pack 3 on the horizon. Why is that? Has Microsoft managed to release a perfect, completely bug-free product? No, of course not – with the size and complexity of a product such as SQL Server, that will simply never happen.

     

    In fact, enormous numbers of bugs have been uncovered and fixed since SP2 was released. And roughly once every two months, a so-called “cumulative update package” gets released. The latest one is officially called “Cumulative update package 5 for SQL Server 2005 Service Pack 2”, or simply CU5 for friends. Quite a mouthful. But if you think that name is long, check out the list of bugs that CU5 fixes!

     

    I think it’s great that Microsoft now releases these cumulative update packages at regular intervals. I see them as a good addition that fits nicely between hotfixes (quick fixes with limited testing, only for those who need them) on one side, and fully tested service packs, released once or at most twice per year, on the other.

     

    Given the long list of bugs fixed in CU5, should everyone be recommended to install it, just as if it were a service pack? Well, no. Microsoft themselves advise against this. In fact, you can’t even just download and install the package; you have to “submit a request to Microsoft Online Customer Services to obtain the cumulative update package”. This quote comes directly from the Knowledge Base article for CU5, as do these further disclaimers:

    it is intended to correct only the problems that are described in this article. Apply it only to systems that are experiencing these specific problems

    if you are not severely affected by any of these problems, we recommend that you wait for the next SQL Server 2005 service pack that contains the hotfixes in this cumulative update package

     

    These quotes, which have appeared in the same or a similar form in all cumulative updates, make it pretty clear that I should wait for the next service pack. So we waited. And waited. And waited some more. And then, some MVPs got impatient and suggested (in the MVP newsgroup) to release SP3 as soon as possible. The answer surprised me – apparently, Microsoft has no plans yet to release a new service pack, because not enough customers have asked for it. (MS: “Good, but honestly, at least our management says, we're not getting feedback requesting SP3 from enough customers to require it” – Me: “Is that reason under NDA? Because if it's not, I can post a blog entry supporting people to write MS management asking for SP3” – MS: “As far as I know that's a public response”). So, the KB article says to wait, all customers do as asked, and then Microsoft concludes that nobody wants a service pack because nobody asks for it? And I misunderstood when I thought that “any hotfix that is provided in a SQL Server service pack is included in the next SQL Server service pack” implies that there will actually be a next service pack. Apparently, my grasp of the English language is not as good as I’d like to believe…

     

    Anyway, I now understand that Microsoft will only release a new service pack if enough people ask for it. So I’ve decided to make sure that they get the message. I’ve gone to Connect and filed a suggestion to release Service Pack 3 for SQL Server 2005, including all changes up to and including CU5.

     

    If you read this, and you agree with me that Service Pack 3 for SQL Server 2005 is overdue, you now know what to do – log in to Connect and vote 5 for my suggestion. And if you think that I’m losing my marbles and that there should be no Service Pack 3, then you should log in to Connect and vote 1. In short: make yourself heard!

     

    Microsoft will not release Service Pack 3 because not enough customers are asking for it? Well, I’m asking – can we have Service Pack 3, please? Pretty please? Pretty pretty pretty pretty pretty please????

     

    With sugar on top…
  • Bin packing part 3: Need for speed

    In the first post of this series, I explained the bin-packing problem and established a baseline solution. The second post investigated ways to increase the packing efficiency. In none of these posts did I pay particular attention to performance – and frankly, it shows. Performance of all solutions presented thus far sucks eggs. Time to see what can be done about that.

     

    If you look in detail at the most efficient solution so far (dbo.OrderDesc, as described in the second post of the series), you’ll see that all 40,000 registrations are processed one by one, with a cursor. No surprise here, as I’ve already promised to postpone the set-based solution to a later moment. For each of these 40,000 registrations, the following actions are executed:

     

    ·        The dbo.Sessions table is queried to find a session for the correct year and quarter that still has enough space left for the current registration.

    ·        When needed, a new session is added to the dbo.Sessions table.

    ·        The registration is updated in the dbo.Registrations table. Well, updating all 40,000 registrations is pretty hard to avoid as the goal is to assign each registration to a session, and as long as the processing is cursor-based, the updates will inevitably come in one by one.

    ·        The chosen session is updated in the dbo.Sessions table to reduce the space left by the size of registration that was just assigned to the session.

     

    Improve indexes

     

    Indexing is one of the most common answers to performance problems. In this case, based on the above summary of actions taken in the loop, it’s clear that adding or changing an index on dbo.Registrations won’t do much. The update of this table is currently based on the clustered index key, which is the fastest possible way to perform an update. The cursor requires a table scan (to read all rows) and a sort (to order them by year, quarter, and descending number of candidates); the sort can be avoided if the clustered index is on these three columns, but at the price of 40,000 bookmark lookups for the updates – I don’t need to run a test to see that this is a bad idea!

     

    The dbo.Sessions table is less clear-cut. The clustered index is on (Year, Quarter, SessionNo), so searching for a session with enough room for a registration currently seeks the clustered index to process only the sessions of a single quarter, but it still has to scan through sessions in the quarter until it finds one with enough space. A nonclustered index on (Year, Quarter, SpaceLeft) will speed up this process, especially since it is covering for the query (the query uses the SessionNo column as well, but that column is included in the nonclustered index as part of the reference to the clustered index key). The downside to this index is that it has to be updated each time the space left in a session changes and each time a session is added. So, the question to answer is whether the performance gained when searching for a session to add a registration to outweighs the performance lost during updates. To find out, I added this index before repeating the tests of dbo.OrderDesc:

     

    CREATE NONCLUSTERED INDEX ix_Sessions
    ON dbo.Sessions (Year, Quarter, SpaceLeft);

     

    The packing efficiency didn’t change. The execution time did change, though – with the index in place, the average execution time of five test runs of dbo.OrderDesc was down to 68,300 ms, a 12.4% improvement over the execution time without the index. Clearly, the extra overhead incurred on the UPDATE statements is less than the savings on the SELECT statements.

     

    Note that I create the index before starting the test. I assume that the index can be added permanently to the database – if the database frequently updates the dbo.Sessions table at other moments, when the bin packing procedure is not executing, it might make more sense to create it at the start of this procedure and remove it when done – and in that case, the time taken for creating and dropping the index (less than 100 ms) should be added in.

     

    For completeness, I also tested the dbo.Order50FirstB version, which proved to be a good trade-off between performance and efficiency in the previous post. This version should of course see a similar performance benefit from the additional index, and indeed it does – the average execution time for dbo.Order50FirstB was down to 68,144 ms after adding the index, a saving of 9.8%.

     

    If you’re duplicating the tests on your own server, then don’t forget to remove the index – we won’t need it in the following step, and the overhead of maintaining it would just waste precious time.

     

    DROP INDEX dbo.Sessions.ix_Sessions;

     

    Do less work …

     

    Saving 12.4% execution time is great – but (of course) not enough to satisfy me! So it’s time to take another approach: let’s see if I can’t find any way to reduce the number of times the dbo.Sessions table is searched and updated. How, you might ask? Well, let’s again check how a human would operate. If I have to stow packages in boxes, I’d probably keep adding packages to the same box until I get a package that won’t fit. Only when I get a package that doesn’t fit the box anymore would I put the box away and open a new box. I have coded the T-SQL equivalent of this method, and called it dbo.FillThenNext (see FillThenNext.sql in the attached ZIP file).
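
    The sketch below shows the general shape of that idea; it is my own illustration, not the actual code from the ZIP file. The table and column names (dbo.Registrations, dbo.Sessions, NumCandidates, SessionNo, SpaceLeft) are assumed from the description in this series, the bookkeeping of dbo.Sessions is kept deliberately simple, and the assignment of each registration to its session in dbo.Registrations is omitted for brevity.

    DECLARE @MaxCandidatesPerSession int,
            @Year int, @Quarter int, @NumCandidates int,
            @PrevYear int, @PrevQuarter int,
            @SessionNo int, @SpaceLeft int;
    SET @MaxCandidatesPerSession = 100;
    SET @SpaceLeft = 0;

    DECLARE RegCursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT Year, Quarter, NumCandidates
        FROM   dbo.Registrations
        ORDER BY Year, Quarter;
    OPEN RegCursor;
    FETCH NEXT FROM RegCursor INTO @Year, @Quarter, @NumCandidates;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        IF @PrevYear IS NULL OR @PrevYear <> @Year OR @PrevQuarter <> @Quarter
        BEGIN
            -- New quarter: restart session numbering and force a new session.
            SELECT @PrevYear = @Year, @PrevQuarter = @Quarter,
                   @SessionNo = 0, @SpaceLeft = 0;
        END;
        IF @NumCandidates > @SpaceLeft
        BEGIN
            -- Registration doesn't fit: open a new session. The old session is
            -- never revisited, which is exactly the weakness discussed below.
            SELECT @SessionNo = @SessionNo + 1,
                   @SpaceLeft = @MaxCandidatesPerSession;
            INSERT INTO dbo.Sessions (Year, Quarter, SessionNo, SpaceLeft)
            VALUES (@Year, @Quarter, @SessionNo, @SpaceLeft);
        END;
        -- Add the registration to the current session; no search of dbo.Sessions needed.
        SET @SpaceLeft = @SpaceLeft - @NumCandidates;
        UPDATE dbo.Sessions
        SET    SpaceLeft = @SpaceLeft
        WHERE  Year = @Year AND Quarter = @Quarter AND SessionNo = @SessionNo;
        FETCH NEXT FROM RegCursor INTO @Year, @Quarter, @NumCandidates;
    END;
    CLOSE RegCursor;
    DEALLOCATE RegCursor;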

     

    The average execution time for this simple algorithm turned out to be just 39,110 ms, so this saves 42.6% over dbo.OrderDesc with index, or 47.0% over the baseline. But the packing efficiency turns out to be really down the drain with this one:

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       6670        75.364617        16431.800000

    2       3548        78.769447        7532.600000

    3       8066        75.025787        20144.200000

    4       5346        78.283015        11609.900000

    ALL     23630       76.420440        13929.625000

     

     

    A total of 23,630 sessions exceeds the number of sessions required by the baseline by 21.4%, and that of dbo.OrderDesc by 24.9%. What a high price to pay for a speed advantage!

     

    … but do it smart

     

    The reason for this enormous efficiency loss is that I became very wasteful. Suppose that the first registration processed is for 20 persons, and the second for 85. The second cannot be combined with the first in a single session, so the algorithm opens a new session – and never again looks at the first session, even though there are still 80 seats left! That is of course a HUGE waste of resources. So I modified the algorithm. I still have a “current” session that I keep adding registrations to as long as I can, but if I find a registration that won’t fit, I now first search the previous sessions for one with enough empty seats before closing the current session and opening a new one. The code for this version (dbo.FillThenSearch) is in FillThenSearch.sql in the attached ZIP file. And the test results are really awesome! The average execution time is now 45,506 ms (with the nonclustered index back in place – without it, execution time is 50,874 ms), which is of course slightly slower than my previous attempt but still 33.4% faster than dbo.OrderDesc with index, and 38.3% faster than the baseline. But the packing is much better:

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       5399        93.106501        3721.800000

    2       2863        97.615787        682.600000

    3       6945        87.135781        8934.200000

    4       4542        92.140246        3569.900000

    ALL     19749       91.438300        4227.125000

     

    This version still doesn’t pack as efficiently as dbo.OrderDesc (4.4% more sessions) or the baseline (1.5% more sessions) – but saving 33.4% execution time at the price of only 4.4% more sessions sounds like a proposition that deserves serious consideration, unlike the previous attempt!
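
    The essential addition in dbo.FillThenSearch, sketched below with the same assumed names (again, this is my own illustration rather than the code from the ZIP file), is a lookup for an earlier session with enough room, executed only when the registration does not fit the current session. This is also exactly the kind of query that benefits from the ix_Sessions index introduced earlier.

    -- Example values stand in for the variables of the cursor loop.
    DECLARE @Year int, @Quarter int, @NumCandidates int, @FoundSession int;
    SELECT @Year = 2008, @Quarter = 1, @NumCandidates = 37;

    SELECT TOP (1) @FoundSession = SessionNo
    FROM   dbo.Sessions
    WHERE  Year = @Year
    AND    Quarter = @Quarter
    AND    SpaceLeft >= @NumCandidates;

    -- If @FoundSession is NULL, no earlier session has room and a new session is
    -- opened; otherwise the registration is assigned to @FoundSession and that
    -- session's SpaceLeft is reduced.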

     

    Revive an old trick

     

    If you have checked the source code, you may have noticed that I have once more removed the extra ordering that I added in the previous installment. I did that on purpose, because this extra ordering

    a)      decreased performance – I want to increase performance in this part, and

    b)      is not guaranteed to have the same effects in a different packing algorithm.

     

    But I did not want to end this post without at least testing the effect of adding back in the ordering that proved most efficient in the previous episode, so I re-introduced the sorting by descending number of candidates in dbo.FillThenSearchDesc (FillThenSearchDesc.sql in the ZIP file), and I discovered that this might be the best tradeoff so far – only 2 sessions more than dbo.OrderDesc, at only 48,626 ms (28.6% less than dbo.OrderDesc – with the nonclustered index still in place, of course).

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       5087        98.816984        601.800000

    2       2799        99.847802        42.600000

    3       6771        89.374981        7194.200000

    4       4268        98.055529        829.900000

    ALL     18925       95.419550        2167.125000

     

    Best options so far

     

    After having investigated so many different options, it’s all too easy to lose track. The table below lists the versions most worth remembering – except for the baseline, I did not include any version for which there was another version that produces fewer sessions in less time. The remaining versions are the ones you’ll have to choose from, making your own trade-off between saving sessions and saving execution time.

     

    Version                          Execution time (ms)  Number of sessions
    -------------------------------  -------------------  ------------------
    Baseline                         73,759               19,457
    FillThenNext                     39,110               23,630
    FillThenSearch (with index)      45,506               19,749
    FillThenSearchDesc (with index)  48,626               18,925
    OrderDesc (with index)           68,300               18,923

     

    Note that I ordered these versions, except the baseline, fastest to slowest (or least efficient to most efficient packer).

     

    This concludes the third part of this series. Though I still have some ideas that might improve the performance of my current cursor-based approach, I’ll postpone them and switch gears – next episode, I’ll start investigating if these numbers can be beaten by a set-based approach.

  • Bin packing part 2: Packing it tighter

    In my previous post, I explained the bin packing problem, explained an example scenario, and established a baseline for both speed and efficiency of bin packing algorithms by writing a rather crude cursor-based procedure. In this part, I will look at some small modifications that can be made to this code to make it better at packing bins as full as possible, so that fewer bins will be needed.

     

    Reordering the cursor

     

    The Baseline procedure didn’t process the registrations in any particular order. Granted, there is an ORDER BY Year, Quarter in the cursor declaration, but that is only needed to enable me to generate consecutive session numbers within each quarter without having to do a SELECT MAX(SessionNo) query. Had I elected to use an IDENTITY column for the Sessions table, I could even have omitted the ORDER BY completely.

     

    Since there is no order specified for the registrations within the quarter, we can assume that they will be processed in some unspecified order. But maybe we can get a more efficient distribution of registrations if we order them before processing? I am, of course, not referring to ordering by subject code, as the actual subjects are irrelevant for the algorithm – I am referring to ordering by the number of candidates in a registration.

     

    This is very easy to test, of course. All it takes is adding one extra column to the ORDER BY clause of the cursor definition. So I created the procedures dbo.OrderAsc and dbo.OrderDesc to test the effects of ordering by ascending or descending number of candidates (see OrderAsc.sql and OrderDesc.sql in the attached ZIP file).
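
    Presumably (the actual procedures are in the ZIP file), the only difference from the baseline cursor declaration is the extra sort column; for dbo.OrderDesc, the ORDER BY clause would look like this (dbo.OrderAsc uses ASC instead):

      ORDER BY Year, Quarter, NumCandidates DESC;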

     

    Ordering by ascending number of candidates turned out to be a pretty lousy idea. Well, to be fair, I didn’t expect otherwise – after all, if you save all the biggest registrations for last, you’ll have no smaller registrations left to fill up the gaps. In fact, all registrations for 51 or more candidates will get a session of their own, and will not be combined with any other registration. So it’s not surprising at all to see that this method results in a huge number of extra sessions as compared to the baseline version – an increase of no less than 19.9%!

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       6435        78.116860        14081.800000

    2       3419        81.741444        6242.600000

    3       7796        77.624166        17444.200000

    4       5686        73.602004        15009.900000

    ALL     23336       77.383227        13194.625000

     

    Equally unsurprising is the fact that changing the ORDER BY clause to sort by descending number of candidates results in more successful packing of the registrations in fewer sessions. This version saves 2.7% as compared to the baseline, as shown in this breakdown:

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       5085        98.855850        581.800000

    2       2799        99.847802        42.600000

    3       6771        89.374981        7194.200000

    4       4268        98.055529        829.900000

    ALL     18923       95.429635        2162.125000

     

    Even though this part mainly focuses on the efficiency achieved, the time taken still remains an important factor. So I ran these two procedures five times in a row, recorded execution times, and calculated the average. For dbo.OrderAsc, the average execution time was 78,531 ms, whereas dbo.OrderDesc clocked in at 77,934 ms. As you see, both versions are slightly slower than the baseline version. For the OrderDesc version, this is caused by the rapid growth of the number of sessions as the first, biggest registrations are processed first. This means that searching for a session with enough empty space for a registration soon becomes a more time-consuming task. For OrderAsc, the reverse is true – since the smallest registrations are processed first, there will at first be only a few sessions. This means that this algorithm will be a lot faster at first – but once the bigger registrations are processed and the total number of sessions rapidly increases to way more than in the baseline version, this advantage is lost, and the time required to search for sessions with enough empty space soon gets so high that the advantage this algorithm had at first turns into a disadvantage.

     

    Sorting the registrations by ascending number of candidates within a quarter before processing them hurts both speed and packing efficiency of the algorithm; we can henceforth forget about this option. On the other hand, sorting by descending number of candidates increases the packing efficiency by 2.7%, though this comes at the price of a 5.7% increase in execution time. If I had to choose between these two, my pick would depend on the needs of the organization I’m working for – if the cost per session is high and plenty of time is available for the computation, I’d go with the ordered version, but if speed is more important than saving those few extra sessions, I’d use the unsorted one.

     

    Less obvious orderings

     

    But why choose between these two only? I have thus far only considered the obvious sort orders, ascending and descending. Why not try a few more variations?

     

    Thinking about the even distribution of data generated for the first quarter of each year in my test set, I considered that, were I tasked with manually combining the registrations as efficiently as possible, I’d probably start by making sessions by combining two registrations for 50 candidates, then combining a registration for 51 candidates with one for 49, and so on. All of these sessions would total 100 candidates, and because of the even distribution of data, I expect roughly the same number of registrations for each size, so I’d have only a few spare sessions left in the end.

     

    That technique can’t be exactly mimicked by changing the sort order of the cursor. There are other ways to mimic it, though – but I’ll leave those for a future post :-). But we can simulate this effect by ordering the registrations so that those with 50 candidates come first, then those with 49 and 51 candidates, and so on. This is done by changing the ORDER BY clause in the cursor definition to order by the “distance” between the number of candidates and the magic number 50, being half the maximum session size:

     

      ORDER BY Year, Quarter,
               ABS(NumCandidates - (@MaxCandidatesPerSession / 2.0)) ASC;

     

    I didn’t hardcode the number 50, because I wanted my stored procedures to be fit for any maximum number of candidates. I divide by 2.0 instead of just 2 so that for an odd maximum session size (e.g. 25), the fraction is retained and registrations for 12 and 13 candidates are kept together because they are the same distance from half the maximum size (12.5).

     

    It is of course also possible to use DESC instead of ASC to start with registrations for 100 candidates, then those for 99 or 1, and so on, saving the 50-candidate registration for the last. Both these versions are included in the attached ZIP file, in the files Order50First.sql and Order50Last.sql.

     

    These versions both took slightly more time than the baseline version when I tested them on my laptop: 76,807 ms for dbo.Order50First, and 77,322 ms for dbo.Order50Last. The packing efficiency of dbo.Order50First is better than the baseline, but not as good as that of dbo.OrderDesc:

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       5096        98.642464        691.800000

    2       2805        99.634224        102.600000

    3       6780        89.256342        7284.200000

    4       4269        98.032560        839.900000

    ALL     18950       95.293667        2229.625000

     

    For dbo.Order50Last, the resulting number of sessions is even more than we had in the baseline!

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       5444        92.336884        4171.800000

    2       3032        92.174802        2372.600000

    3       6928        87.349595        8764.200000

    4       4617        90.643491        4319.900000

    ALL     20021       90.196044        4907.125000

     

    The reason for the disappointing efficiency of the dbo.Order50First procedure is that there is no control over the order of the registrations that have the same distance to 50. So it is quite possible, for instance, to start with a bunch of registrations for 49 candidates that will promptly be combined into sessions for 98 candidates each – so that, when the 51-sized registrations start coming in, they have to get sessions of their own. In an attempt to fix that, I tweaked the sort order some more, making sure that, for registrations with the same distance from 50, the “bigger” registrations come before the “smaller” ones.

     

      ORDER BY Year, Quarter,
               ABS(NumCandidates - (@MaxCandidatesPerSession / 2.0)) ASC,
               NumCandidates DESC;

     

    With this ORDER BY clause, I can be certain that all 51-candidate registrations are processed first, each getting its own session. After that, the 49-candidate registrations will exactly fill out all those sessions. This version (enclosed in Order50FirstB.sql) had a slightly better packing ratio than dbo.Order50First – but still not as good as dbo.OrderDesc. Here are the results, which took 75,547 ms (on average) to achieve:

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       5089        98.778148        621.800000

    2       2804        99.669757        92.600000

    3       6771        89.374981        7194.200000

    4       4268        98.055529        829.900000

    ALL     18932       95.384270        2184.625000

     

    After these tests, there was still one thing left I wanted to try. Starting with registrations for 50 places was based on an idea for evenly distributed data. For other distributions, this might turn out to be a much worse idea (though the results don’t show as much). But what if, instead of starting at half the maximum session size, we start at the average registration size? For evenly distributed data, this should work out approximately the same. But maybe this order achieves a better packing ratio for other distributions? Let’s find out.

     

    Ordering by distance from the average session size for a quarter can be accomplished by using a correlated subquery in the ORDER BY clause (compatible with all versions of SQL Server), or by using an AVG function with the OVER clause (only SQL Server 2005 and up):

     

      ORDER BY a.Year, a.Quarter,
               ABS(a.NumCandidates - (SELECT AVG(b.NumCandidates * 1.0)
                                      FROM   dbo.Registrations AS b
                                      WHERE  b.Year = a.Year
                                      AND    b.Quarter = a.Quarter)) ASC;

    or

      ORDER BY a.Year, a.Quarter,
               ABS(a.NumCandidates - AVG(a.NumCandidates * 1.0)
                       OVER (PARTITION BY a.Year, a.Quarter)) ASC;

     

    Surprisingly, when dry-testing the query by itself, the correlated subquery turned out to be faster than the one using the OVER clause, so I didn’t have to sacrifice speed for backward compatibility. I used the correlated subquery, both with the ASC and the DESC sort option (see OrderHalfFirst.sql and OrderHalfLast.sql in the attachment), to test the two possible variations of this option. Both versions turned out to be quite inefficient packers, since they both took more sessions than the baseline. Here are the results of dbo.OrderHalfFirst, acquired in 75,056 ms:

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       5120        98.180078        931.800000

    2       3264        85.623161        4692.600000

    3       6771        89.374981        7194.200000

    4       5044        82.970063        8589.900000

    ALL     20199       89.401207        5352.125000

     

    And OrderHalfLast, after running 79,294 ms, produced these results:

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       6036        83.280649        10091.800000

    2       2941        95.026861        1462.600000

    3       7796        77.624166        17444.200000

    4       4951        84.528580        7659.900000

    ALL     21724       83.125345        9164.625000

     

    Conclusion

     

    I’ve investigated many options to increase packing efficiency. It turns out that, at least with the test data I used, just starting with the biggest registration and working down to the smallest yields the best results. This is not the fastest option, though. The baseline version discussed in the previous episode of this series is still the fastest. So the choice would appear to depend on the requirements of your application – if you have plenty of time and computer capacity but need to use as few sessions as possible, go for the dbo.OrderDesc version. If execution time is of utmost importance and a few extra sessions are no big deal, then stick to the baseline (for now).

     

    If you are in search of a solution that offers both speed and efficient packing, then the dbo.Order50FirstB version seems to be the right choice. It is only 0.05% less efficient than the best packer (dbo.OrderDesc), but over 3% faster. In the next episode I’ll be looking at ways to make the algorithm go faster. I’ll be making huge performance improvements – but packing efficiency will suffer. How much? As soon as I have completed all my tests and written the accompanying text, you’ll read it. Right here on sqlblog.com.

  • Bin packing part 1: Setting a baseline

    Some problems can only be solved by brute-forcing every possible combination. The problem with such an approach is that execution time grows exponentially as the amount of input data grows – so that even on the best possible hardware, you will get unacceptable performance once the input data goes beyond the size of a small test set. These problems are called “NP-complete” or “NP-hard”, and the most viable way to deal with them is to use an algorithm that finds a solution that, while not perfect, is at least good enough – with a performance that, while not perfect, is at least fast enough.

     

    The bin packing problem is one of those problems. It is basically a very simple problem – for example, you are given a number of packages, each with its own weight, and an unlimited number of bins with a given maximum weight capacity. Your task is to pack all packages into as few bins as possible. There are various situations in real life where some sort of bin packing is required – such as loading freight into as few trucks as possible, assigning groups of people to rooms, cutting forms from raw material (a two-dimensional variation of bin packing), etc.
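
    To get a feel for why the order of processing matters, consider bins with a capacity of 100 and packages of 40, 50, 60, and 50 units (an example of my own, not taken from the test data below). Assigning them in that order to the first bin with enough room left uses three bins (40+50, then 60, then 50), whereas the optimal grouping (40+60 and 50+50) needs only two. A good heuristic tries to get close to that optimum without testing every possible combination.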

     

    Back in 2004 and 2005, when I had just started answering questions in the SQL Server newsgroups, I replied to some questions that were essentially a variation on the bin packing problem – one involving packages (with a weight) and trucks (with a maximum load capacity); the other involving inviting families to dinner. I came up with an algorithm that combined set-based and iterative characteristics, and managed to find a very efficient distribution of packages, at a very good performance. I wanted to share this algorithm ever since I started blogging; the reason I haven’t done so yet is that there is much more to say about this category of problems, and that I never before found the time to investigate and describe it all.

     

    This is the first part of what will become a series of posts investigating all possible (and some impossible) solutions to this problem, including some new possibilities (such as SQLCLR) that have only become available since SQL Server 2005 was released.

     

    The sample scenario

     

    As a sample scenario for testing various algorithms, I decided to stray from the packages and trucks, and switch to examinations. The scenario outlined here is imaginary, though it is (very loosely) based on a Dutch examination institute that I did some work for several years ago.

     

    ImEx (Imaginary Examinations Inc.) is responsible for various certifications. It does not teach students, but it does define what candidates need to know, publish model exams, and (of course) take exams. The latter activity takes place four times per year. Candidates register for one of the many subjects available. ImEx only has a small office for its staff, so it has to rent a room in a conference centre where the exams are taken. This room has a maximum capacity of 100 seats; there are more candidates, so the room is rented for a period of time; during that time, ImEx will hold two examination sessions per day. All candidates that take an exam in the same subject have to be seated in the same session, since an expert on that subject has to be available to answer questions and settle disputes. However, candidates for different subjects can be combined in the same session, as long as the maximum capacity is not exceeded. Since the rent for this room is high, ImEx wants to seat all registered candidates in as few sessions as possible.

     

    The script file “Create DB + tables.sql” (see attached ZIP file) contains the SQL to create a database for ImEx, and to create the two tables that we will focus on in this series. The table dbo.Registrations holds the registrations – not the individual registrations (imagine that they are in a different table that is not relevant to the bin packing problem), but the aggregated number of registrations per subject for each quarter. The table dbo.Sessions holds the sessions that will be held in each quarter. One of the columns in dbo.Registrations is a foreign key to dbo.Sessions; this column is NULL at the start and has to be filled with a link to the session in which each subject will be examined. The table dbo.Sessions has a column SpaceLeft that is equal to 100 minus the SUM(NumCandidates) of the registrations assigned to that session; this is of course just a helper column, included solely for performance.
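
    For readers who don’t want to open the attachment right away, a minimal sketch of what the two tables might look like is shown below. The column names Year, Quarter, SessionNo, SpaceLeft, and NumCandidates come from this article; SubjectCode and the exact data types are assumptions of mine, so check the attached script for the real definitions.

    CREATE TABLE dbo.Sessions
        (Year        smallint NOT NULL,
         Quarter     tinyint  NOT NULL,
         SessionNo   int      NOT NULL,
         SpaceLeft   smallint NOT NULL,      -- helper column: 100 minus seats taken
         PRIMARY KEY (Year, Quarter, SessionNo));

    CREATE TABLE dbo.Registrations
        (Year          smallint    NOT NULL,
         Quarter       tinyint     NOT NULL,
         SubjectCode   varchar(10) NOT NULL, -- assumed name for the subject identifier
         NumCandidates smallint    NOT NULL,
         SessionNo     int         NULL,     -- filled in by the packing algorithm
         PRIMARY KEY (Year, Quarter, SubjectCode),
         FOREIGN KEY (Year, Quarter, SessionNo)
             REFERENCES dbo.Sessions (Year, Quarter, SessionNo));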

     

    Another script file, “Generate test data.sql” (also in the attached ZIP file), fills the table dbo.Registrations with a set of randomly generated data. I use different data distributions for each of the four quarters, so that I can test the various bin packing algorithms for evenly distributed data (Q1), for data with more small groups and fewer large groups (Q2), for data with more large groups and fewer small groups (Q3), and for data with a non-linear distribution of group size – very few small (4 – 10) and large (70 – 80) groups; many average (35 – 45) groups (Q4). I seed the random number generator with a known value at the beginning of the script to get reproducible results, so that I can make meaningful comparisons when I regenerate the data to test a new algorithm. For “truly” random data, you’ll have to remove this statement – or you can use a different seed value to see if different data makes much of a difference for the results.
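
    The seeding trick itself is a one-liner; the snippet below (with an arbitrary seed value of my own) shows the idea:

    -- Seed the random number generator once for this connection;
    -- all subsequent RAND() calls then return a reproducible sequence.
    SELECT RAND(12345);     -- arbitrary, but fixed, seed value
    SELECT RAND();          -- same value on every run of the script
    SELECT RAND();          -- and so is the next one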

     

    Setting a baseline: the first attempt

     

    I’ll conclude this first post of the series with a first attempt at solving the problem. One that is probably neither the fastest, nor the most effective. This first version will then act as a baseline to compare future attempts against. I will measure both speed (how fast does it complete on a given set of test data) and effectiveness (how many sessions does it create for a given set of test data). I hope to find an algorithm that yields the maximum effectiveness while still exceeding the speed of other algorithms, but it’s more likely that I’ll end up having to settle for a trade-off between speed and effectiveness.

     

    This first attempt is based on mimicking how a human would approach this problem – inspect each registration in turn, and assign it to a session that still has sufficient capacity if one exists, or to a new session otherwise. Implementing this algorithm in T-SQL results in the code that you find in the enclosed ZIP file in Baseline.sql. The code is pretty straightforward. Since the algorithm will inspect the registrations one by one, it is all centred around a cursor over the registrations table – of course, using the fastest cursor options available (see this blog entry for details). I also experimented with the “poor man’s cursor” (see this post), but in this case the real cursor turned out to be (slightly) faster.

     

    I want the session numbers to be sequential and starting from 1 within each quarter. There are two ways to do that – either query the Sessions table for the MAX(SessionNo) within the current quarter each time a new session is added, or use a variable that I increment for each new session, and that I reset when a new quarter starts. I chose the latter, since variable manipulation is lots cheaper than accessing the Sessions table.

     

    At the heart of the procedure is the SELECT TOP 1 query that I use to find a single session that has enough space left to accommodate the current registration. Since there is no ORDER BY in this query, the results are not deterministic – that is, I know it will return a session with enough space left if there is one, but if there is more than one session with enough space left, no one can predict which one will be returned, nor whether consecutive runs will yield the exact same results. Many details, such as the number of processors available, workload, and other factors, can influence the query execution plan that is used, so don’t be surprised if you get different results when testing this code on your machine. I could make this query return deterministic, reproducible results – but that would affect both performance and efficiency of the procedure, so I left that for the next part of this series.
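
    Stripped of all cursor plumbing, the heart of that loop looks roughly like the sketch below. The real code is in Baseline.sql in the attachment; the variable names, the SubjectCode column, and the exact bookkeeping here are assumptions of mine.

    -- Try to find any existing session in this quarter with enough room left.
    -- Deliberately no ORDER BY: any session with room will do.
    SET @SessionNo = NULL;
    SELECT TOP (1) @SessionNo = SessionNo
    FROM           dbo.Sessions
    WHERE          Year      = @Year
    AND            Quarter   = @Quarter
    AND            SpaceLeft >= @NumCandidates;

    IF @SessionNo IS NULL
    BEGIN;
      -- No session has room: start a new one. @LastSessionNo is the counter
      -- variable that is reset to 0 whenever a new quarter starts.
      SET @LastSessionNo = @LastSessionNo + 1;
      SET @SessionNo     = @LastSessionNo;
      INSERT INTO dbo.Sessions (Year, Quarter, SessionNo, SpaceLeft)
      VALUES (@Year, @Quarter, @SessionNo, @MaxCandidatesPerSession);
    END;

    -- Seat the registration and book the seats it uses.
    UPDATE dbo.Registrations
    SET    SessionNo = @SessionNo
    WHERE  Year = @Year AND Quarter = @Quarter AND SubjectCode = @SubjectCode;

    UPDATE dbo.Sessions
    SET    SpaceLeft = SpaceLeft - @NumCandidates
    WHERE  Year = @Year AND Quarter = @Quarter AND SessionNo = @SessionNo;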

     

    The test setup

     

    To test this and all future algorithms, I created a generic stored procedure for testing, called dbo.PerfTest (see PerfTest.sql in the attached ZIP file). In this stored procedure, I first wipe clean any results that may have been left behind by the previous run. Then I make sure that both the data cache and the procedure cache are empty. And then, I call the procedure I need to test (which is passed to dbo.PerfTest as a parameter and assumed to be in the dbo schema), making sure to note the time the call is made and the time the procedure returns control. The difference in milliseconds is then returned to the client, as the duration of the procedure to be tested.
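
    A stripped-down version of such a test harness could look like the sketch below; the real dbo.PerfTest is in the attachment, and the cleanup of leftover results is omitted here.

    CREATE PROCEDURE dbo.PerfTest
      @ProcName sysname               -- name of the procedure to test (dbo schema assumed)
    AS
    BEGIN;
      -- Start from a cold cache so every tested procedure competes on equal terms
      DBCC DROPCLEANBUFFERS;          -- empty the data cache
      DBCC FREEPROCCACHE;             -- empty the plan cache

      DECLARE @start datetime;
      SET @start = CURRENT_TIMESTAMP;

      -- Run the procedure under test
      EXEC ('EXEC dbo.' + @ProcName + ';');

      -- Return the elapsed time in milliseconds
      SELECT DATEDIFF(ms, @start, CURRENT_TIMESTAMP) AS Duration;
    END;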

     

    The script file RunTest.sql is the file I actually execute to do the tests. Whenever I need to test a new algorithm, I only have to change the name of the stored procedure to test in the line that calls dbo.PerfTest, and then I can hit the execute button and sit back. When the procedure finishes, it displays two result sets – one from dbo.PerfTest displaying the duration of the test procedure in milliseconds; the second generated by the code in RunTest.sql to assess the efficiency of the algorithm by comparing the number of sessions, the average session size, and the average number of free seats, for each quarter and overall.
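
    That efficiency report boils down to an aggregation over dbo.Sessions. One way to produce such a report – my own reconstruction, assuming 100 seats per session and the table layout sketched earlier, so the real query in RunTest.sql may well differ – is to aggregate per year and quarter first and then roll the figures up:

    SELECT   COALESCE(CAST(Quarter AS char(3)), 'ALL')            AS Quarter,
             SUM(NumSessions)                                     AS NumSessions,
             SUM(NumSessions * AvgSessionSize) / SUM(NumSessions) AS AvgSessionSize,
             AVG(EmptySeats * 1.0)                                AS AvgEmptySeats
    FROM    (SELECT Year, Quarter,
                    COUNT(*)                     AS NumSessions,
                    AVG((100 - SpaceLeft) * 1.0) AS AvgSessionSize,
                    SUM(SpaceLeft)               AS EmptySeats
             FROM   dbo.Sessions
             GROUP BY Year, Quarter) AS q
    GROUP BY Quarter WITH ROLLUP;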

     

    As was to be expected, the speed of my first attempt is abysmal. For the 40,000 registrations in my randomly generated test set, the average elapsed time for 5 test runs was 73,759 milliseconds. Faster than when I had to do it by hand, but that’s about all I can say in favour of this “speed”.

     

    The efficiency of this algorithm turned out to be pretty good. The tightest packing ratio is achieved with the data for quarter 2, that consists mainly of small groups. Quarter 3, with an overdose of big groups, turns out to be much more challenging for this algorithm. Even though the average group size for quarter 4 is slightly smaller than that of quarter 1, it is harder to pack because the bell curve results in a much lower number of small groups that can be used to fill those last few seats in an almost packed session. Here are the full results:

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       5271        95.367482        2441.800000

    2       2825        98.928849        302.600000

    3       6871        88.074225        8194.200000

    4       4490        93.207349        3049.900000

    ALL     19457       92.810556        3497.125000

     

    And now?

     

    As I said before, this is just a baseline. I have already investigated several other algorithms and intend to investigate even more – such as improving the cursor code (for both speed and efficiency), using a set based solution, combining set based and cursor based techniques, and employing the CLR. Watch this space for the next episode!

  • Poor men see sharp - more cursor optimization

    After making my post on cursor optimization I received some comments that triggered me to do some further investigation. Adam Machanic wrote in my blog’s comments that using SQLCLR to loop over a SqlDataReader would be much faster than any T-SQL based cursor. And Erland Sommarskog wrote in a usenet thread that he has some colleagues who think that a “poor man’s cursor” is always better than a real cursor. So I decided to give these options a try and see what comes out in a real test. I simply reused the test cases I had already used for testing the various cursor options, but with the code adapted to use SQLCLR or to use a cursor-less iteration.

     

    The poor man’s cursor

     

    I don’t think that “poor man’s cursor” is an official phrase, but what the hay – if we all start using it, we can make it official :-). In case you want to know what a term means before using it: “poor man’s cursor” refers to any method of iterating over the rows in the result set of a query, processing them one by one, without using the DECLARE CURSOR, OPEN, FETCH, CLOSE, and DEALLOCATE keywords that were added to T-SQL for the sole purpose of iterating over the rows in the result set of a query.

     

    Why would you want to do that, you may ask? Well, I think that the most common reason is that programmers have heard that cursors are generally bad for performance, but fail to understand that the performance impact is not caused by the cursor itself, but by the fact that iterating over a result set reduces the options available to the query optimizer and negates everything the development team in Redmond has done to optimize SQL Server for set-based operations. So they think that the cursor itself is to blame, and try to code around it without moving from their algorithmic, iteration-based approach to a declarative, set-based approach.

     

    Usenet newsgroups and web forums are full of simple one-liners such as “cursors are evil”, many people claim that cursors incur a heavy overhead, and even some otherwise respectable websites list WHILE loops first in a list of cursor alternatives. All of these have done their fair share to popularize the idea that you can improve cursor performance by simply replacing the cursor with a different iteration mechanism. So, let’s find out if there is any truth to this claim.

     

    Reading data

     

    I started with the fastest of all cursor options, the one using a local, forward-only, static, read-only cursor with an ORDER BY matching the clustered index. I ripped out all cursor-related commands and replaced them with the appropriate SELECT TOP (1) commands to read and process one row at a time, and ended up with this code:

     

    -- Keep track of execution time

    DECLARE @start datetime;

    SET @start = CURRENT_TIMESTAMP;

     

    -- Declare and initialize variables for loop

    DECLARE @SalesOrderID int,

            @SalesOrderDetailID int,

            @OrderQty smallint,

            @ProductID int,

            @LineTotal numeric(38,6),

            @SubTotal numeric(38,6);

    SET @SubTotal = 0;

     

    -- Read first row to start loop

    SELECT TOP (1) @SalesOrderID = SalesOrderID,

                   @SalesOrderDetailID = SalesOrderDetailID,

                   @OrderQty = OrderQty,

                   @ProductID = ProductID,

                   @LineTotal = LineTotal

    FROM           Sales.SalesOrderDetail

    ORDER BY       SalesOrderID, SalesOrderDetailID;

     

    -- Process all rows

    WHILE @@ROWCOUNT > 0

    BEGIN;

     

      -- Accumulate total

      SET @SubTotal = @SubTotal + @LineTotal;

     

      -- Read next row

      SELECT TOP (1) @SalesOrderID = SalesOrderID,

                     @SalesOrderDetailID = SalesOrderDetailID,

                     @OrderQty = OrderQty,

                     @ProductID = ProductID,

                     @LineTotal = LineTotal

      FROM           Sales.SalesOrderDetail

      WHERE          SalesOrderID > @SalesOrderID

      OR (           SalesOrderID = @SalesOrderID

          AND        SalesOrderDetailID > @SalesOrderDetailID)

      ORDER BY       SalesOrderID, SalesOrderDetailID;

     

    END;

     

    -- Display result and duration

    SELECT @SubTotal;

    SELECT DATEDIFF(ms, @start, CURRENT_TIMESTAMP);

    go


    I ran this code five times in a row and calculated the average execution time as 3166 milliseconds. I then re-ran the cursor code five times (I didn’t want to use the old measurements, as I was unsure whether I had the same applications active – and having a different load on my machine would surely influence results); this code took 3265 milliseconds. So the first round goes to the poor man’s cursor, for beating the “real” cursor by three percent. I must add that I later ran another test, as part of research for a future blog post, where the results were reversed and the real cursor beat the poor man’s cursor by a small margin.

     

    Of course, real life is not always so nice as to throw us only problems that require ordering the data by the clustered index key. So my next step was to investigate what happens to the comparison if the problem requires the data to be read in an order that can be served by a nonclustered index. Remember that I had a similar test case in the cursor option comparison, so I was again able to reuse the existing cursor code for ordering by ProductID. For the poor man’s cursor version, this involved changing the ORDER BY on both queries, but I also had to change the WHERE clause in the second – to make sure that the WHERE clause filters out all rows already processed, I have to include rows with a higher ProductID as well as rows with an equal ProductID and a higher primary key value – and in order for this to work, I also have to include the primary key columns as tie-breakers in the ORDER BY clause. I won’t post the full code, as most of it remains the same, but the “Read next row” query in the loop now reads like this:

     

      -- Read next row

      SELECT TOP (1) @SalesOrderID = SalesOrderID,

                     @SalesOrderDetailID = SalesOrderDetailID,

                     @OrderQty = OrderQty,

                     @ProductID = ProductID,

                     @LineTotal = LineTotal

      FROM           Sales.SalesOrderDetail

      WHERE          ProductID > @ProductID

      OR (           ProductID = @ProductID

          AND        SalesOrderID > @SalesOrderID)

      OR (           ProductID = @ProductID

          AND        SalesOrderID = @SalesOrderID

          AND        SalesOrderDetailID > @SalesOrderDetailID)

      ORDER BY       ProductID, SalesOrderID, SalesOrderDetailID;

     

    The impact on performance is dramatic, to say the least. With this slight modification in the order in which rows have to be processed, the average execution time for five consecutive test runs rises to 5822 ms. The cursor version gets slower as well as a result of the new sort order, but by far less – it still takes only 3377 ms, so the poor man’s cursor is now worse by over seventy percent!

     

    For the final test, I checked the effects of ordering by a column that’s not indexed at all. I did this in the original cursor test by ordering on LineTotal, so I’ll do the same here. Since LineTotal is, like ProductID in the previous test case, not constrained to be unique, the same considerations apply. That means that I can reuse the code of the version that ordered by ProductID, except of course that I have to change each occurrence of ProductID to LineTotal.

     

    This change really wrecked performance for the poor man’s cursor. I wanted to sit it out, but finally decided to kill the test after one and a half hours. I then realized that the LineTotal column I was using is a non-persisted computed column, which adds an enormous amount of overhead – for each of the 121,317 iterations, SQL Server has to recalculate the LineTotal for each of the 121,317 rows – a total of almost 15 billion calculations! So I decided to change this test case to sort on OrderQty instead, then left the computer to execute the test run overnight. The next day, the duration was listed as a whopping 27,859,593 ms (seven and three quarter hours!) – just a tad slower than the real cursor, which clocked in at an average execution time of 3430 ms when sorting on the computed LineTotal column and 3352 ms when sorting on OrderQty.

     

    Modifying data

     

    Naturally, I wanted to test the performance of a poor man’s cursor in a modifying scenario as well. I didn’t really expect any surprises. After all, I already know that the fastest cursor solution uses the exact same cursor options as when reading data. I’ll spare you the poor man’s cursor code this time, since it’s based on the cursor code published in my previous blog posts, with the same modifications as above. Since this update scenario happens to be based on ordering by the clustered index key, I expected the poor man’s cursor to be just a tad faster, just as in the reading scenario.

     

    After running the tests, I was surprised. The real cursor version took 5009 ms on average; the poor man’s cursor achieved the same task in just 4722 ms – a speed gain of over five percent. The speed gain was so much more than I expected that I actually repeated the tests – but with the same results. I must admit that I have no idea why the exact same cursor, transformed to the exact same poor man’s cursor, results in more speed gain when the rows are then updated than when they are merely used in a computation.

     

    I did not test performance of the poor man’s cursor in a scenario where the rows have to be processed in a different order than the clustered index key. Based on the results of the tests for reading data, I expect performance to go down the drain in a very similar way.

     

    Conclusion

     

    People claiming that a poor man’s cursor performs better than a real cursor are mostly wrong. When the order in which rows have to be processed does not match the clustered index key, a properly optimized cursor will run rings around the poor man’s cursor.

     

    The only exception is if the required ordering happens to coincide with the clustered index key. In those cases, a poor man’s cursor may sometimes beat a real cursor by a few percent, although there are other cases where the cursor still wins (also by a few percent). Even in the cases where the poor man’s cursor does win, the margin is so small that I’d recommend just using real cursors, with the appropriate options for maximum performance (that is, LOCAL, FORWARD_ONLY, STATIC, and READ_ONLY) in all cases.

     

    Except of course in the 99.9% of all cases where a set-based solution beats the cursor-based one :-).

     

    Using the CLR

     

    When you use CLR code to process data provided by SQL Server, iterating over rows to process them one at a time is the standard – after all, there is no support for set-based operations in C#, VB.Net, or any other .Net-enabled programming language. As such, the claim made by Adam Machanic has valid grounds: a language that has no option but to iterate over the rows and process them one at a time had better be optimized for exactly this kind of processing!

     

    Reading data

     

    The CLR version of the code to calculate the sum of all LineTotal values is about as simple as it gets:

     

    [Microsoft.SqlServer.Server.SqlProcedure]

    public static SqlInt32 ReadData([SqlFacet(Precision = 38, Scale = 6)] out SqlDecimal Total)

    {

        // Initialize subtotal

        decimal SubTotal = 0;

     

        // Set up connection and query

        SqlConnection conn = new SqlConnection("context connection=true");

        conn.Open();

        SqlCommand cmd = conn.CreateCommand();

        cmd.CommandText = "SELECT   SalesOrderID, SalesOrderDetailID, " +

                          "         OrderQty, ProductID, LineTotal " +

                          "FROM     Sales.SalesOrderDetail " +

                          "ORDER BY SalesOrderID, SalesOrderDetailID;";

                          //"ORDER BY ProductID;";

                          //"ORDER BY LineTotal;";

                          //"ORDER BY OrderQty;";

        cmd.CommandType = CommandType.Text;

     

        // Process rows from reader; accumulate total

        SqlDataReader rdr = cmd.ExecuteReader();

        while (rdr.Read() == true)

        {

            SubTotal += (decimal)rdr[4];

        }

     

        // Clean up and return result

        conn.Dispose();

        Total = new SqlDecimal(SubTotal);

        return 0;

    }

     

    Note that I did not do any dynamic SQL or fancy stuff to make the processing order variable; I just commented one line, uncommented another and recompiled. This avoids dangerous dynamic SQL and complex CASE expressions in the ORDER BY, plus it mimics much better how a real application would work – cursors and other iterative solutions are often used when developers (think they) need to process in a certain order, so the ORDER BY would usually be fixed.

     

    The results of running the tests proved that Adam got it completely right – this CLR implementation does indeed run rings around even the fastest of all cursor solutions, taking on average only 1060 milliseconds when ordering by the clustered index, 1072 milliseconds when ordering by the nonclustered index, 1132 milliseconds when ordering by the computed non-indexed column, and 1050 milliseconds when ordering by the non-computed non-indexed column. Three of these are so close together that I think the differences are within the statistical margin of error and that they should be considered the same. The 70 extra milliseconds for ordering by a computed column are obviously the time taken to compute the value for each row in order to do the sorting.

     

    I don’t understand why ordering by the clustered index key doesn’t result in some additional performance gain, as I expected this one to differ from the others by one sort step in the execution plan. This was another test case I repeated a few more times to make sure that I didn’t accidentally mess things up. If I execute the cmd.CommandText as a separate query in SQL Server Management Studio, I do get a significantly cheaper execution plan when ordering by the clustered index key, so I guess that this will just have to be filed as one of the many things about SQL Server I don’t understand.

     

    Modifying data

     

    The CLR starts showing a completely different face when you have to modify the data you read from a cursor. The main problem is that you can’t use SqlDataReader anymore, since this blocks the context connection from being used for any other queries. You could choose to open a separate connection, but that has the disadvantage that you perform the updates in a separate transaction context so that you run into a huge blocking risk, plus a rollback of the main transaction would not roll back the changes made from this procedure.

     

    So that leaves me with only one other option – use the SqlDataAdapter.Fill method to copy the entire result of the query to a DataSet, then loop over it and process its rows one by one. This means the CLR version does a lot more work and uses a significant amount of memory. The fact that we no longer update each row right after reading it, but instead read all rows first and only then update them, also increases the chance that a row is no longer in the data cache and hence has to be read from disk a second time for the update, effectively doubling the amount of physical I/O (though this did not happen in my case).

     

    [Microsoft.SqlServer.Server.SqlProcedure]

    public static SqlInt32 ModifyData()

    {

        // Open connection (context connection since we're called in-process)

        SqlConnection conn = new SqlConnection("context connection=true");

        conn.Open();

     

        // Prepare commands to fetch rows and to update

        String SelCmd = "SELECT   SalesOrderID, SalesOrderDetailID, " +

                        "         OrderQty, ProductID, LineTotal " +

                        "FROM     Sales.SalesOrderDetail " +

                        "ORDER BY SalesOrderID, SalesOrderDetailID;";

        SqlDataAdapter da = new SqlDataAdapter(SelCmd, conn);

     

        String UpdCmd = "UPDATE Sales.SalesOrderDetail " +

                        "SET    OrderQty = @OrderQty " +

                        "WHERE  SalesOrderID = @SalesOrderID " +

                        "AND    SalesOrderDetailID = @SalesOrderDetailID;";

        SqlCommand upd = new SqlCommand(UpdCmd, conn);

        upd.Parameters.Add("@SalesOrderID", SqlDbType.Int);

        upd.Parameters.Add("@SalesOrderDetailID", SqlDbType.Int);

        upd.Parameters.Add("@OrderQty", SqlDbType.SmallInt);

     

        // Read rows to process; copy to DataAdapter

        DataSet ds = new DataSet();

        da.Fill(ds);

       

        // Process rows

        foreach (DataRow dr in ds.Tables[0].Rows)

        {

            Int32 SalesOrderID = (Int32)dr[0];

            Int32 SalesOrderDetailID = (Int32)dr[1];

            Int16 OrderQty = (Int16)dr[2];

     

            // Set parameters; perform update

            upd.Parameters[0].Value = SalesOrderID;

            upd.Parameters[1].Value = SalesOrderDetailID;

            upd.Parameters[2].Value = OrderQty + 1;

            upd.ExecuteNonQuery();

        }

     

        // Cleanup and return

        conn.Dispose();

        return 0;

    }

     

    After compiling and deploying the code above, I once more ran 5 tests. The average execution time for this version was 12,215 milliseconds, almost 150% more than the cursor version. My guess is that this huge increase in time is not a result of the update command itself, but a result of the requirement to pre-load the data in a DataSet and then iterate over that. I did not test it, but I expect to see a similar problem if a cursor requires reading some additional data, based on the data read in the cursor – this, too, would require the CLR version to employ a DataSet instead of simply looping over a SqlDataReader.

     

    Conclusion

     

    Adam’s suggestion to use CLR makes sense, but only for cases where no additional data access, either reading or modifying, is required when processing the rows in the cursor. As soon as the latter becomes a requirement, the CLR version has to switch from using a SqlDataReader to using SqlDataAdapter.Fill, and performance suffers horribly.

  • Curious cursor optimization options

    The best way to optimize performance of a cursor is, of course, to rip it out and replace it with set-based logic. But there is still a small category of problems where a cursor will outperform a set-based solution. The introduction of ranking functions in SQL Server 2005 has taken a large chunk out of that category – but some remain. For those problems, it makes sense to investigate the performance effects of the various cursor options.

     

    I am currently preparing a series of blog posts on a neat set-based solution I found for a problem that screams “cursor” from all corners. But in order to level the playing field, I figured that it would be only fair to optimize the hell out of the cursor-based solution before blasting it to pieces with my set-based version. So I suddenly found myself doing something I never expected to do: finding the set of cursor options that yields the best performance.

     

    That task turned out to be rather time-consuming, as there are a lot of cursor options that can all be combined in a huge number of ways. And I had to test all those combinations in various scenarios, like reading data in a variety of orders, and updating data in two separate ways. I won’t bore you with all the numbers here; instead, I intend to point out some highlights, including some very curious finds. For your reference, I have included a spreadsheet with the results of all tests as an attachment to this post.

     

    Disclaimer: All results presented here are only valid for my test cases (as presented below) on my test data (a copy of the SalesOrderDetail table in the AdventureWorks sample database), on my machine (a desktop with 2GB of memory, a dual-core processor, running SQL Server 2005 SP2), and with my workload (just myself, and only the test scripts were active). If your situation is different, for instance if the table will not fit in cache, if the database is heavily accessed by competing processes, or if virtually any other variable changes, you really ought to perform your own test if you want to squeeze everything out of your cursor. And also consider that many options are included to achieve other goals than performance, so you may not be able to use all options without breaking something.

     

    Reading data

     

    Many cursors are used to create reports. The data read is ordered in the order required for the report, and running totals and subtotals are kept and reset as required while reading rows. Those already on SQL Server 2005 can often leverage the new ranking functions to calculate the same running totals without the overhead of a cursor, but if you are still stuck on SQL Server 2000 or if you face a problem that the ranking functions can’t solve, you may find yourself preferring a cursor over the exponentially degrading performance of the correlated subquery that the set-based alternative requires.
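
    For readers wondering what that set-based alternative looks like: a running total per sales order can be written as a correlated subquery, roughly as in the sketch below (my own illustration, not code from the attachment). Every row re-reads all preceding rows, which is exactly why this approach degrades so badly as the table grows.

    SELECT   a.SalesOrderID, a.SalesOrderDetailID, a.LineTotal,
             (SELECT SUM(b.LineTotal)
              FROM   Sales.SalesOrderDetail AS b
              WHERE  b.SalesOrderID = a.SalesOrderID
              AND    b.SalesOrderDetailID <= a.SalesOrderDetailID) AS RunningTotal
    FROM     Sales.SalesOrderDetail AS a
    ORDER BY a.SalesOrderID, a.SalesOrderDetailID;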

     

    Since the order of these cursors is dictated by the report requirements rather than the table and index layout, I decided to test the three variations you might encounter – you may be so lucky that the order of the report matches the clustered index, or you might find that a nonclustered index matches the order you need, or you may be so unlucky that you need to order by a column that is not indexed.

     

    I used the code below for my performance tests. You can run this code as is on the AdventureWorks sample database, or you can do as I did and copy the Sales.SalesOrderDetail table, with all indexes and all data, to your own testing database.

     

    -- Keep track of execution time

    DECLARE @start datetime;

    SET @start = CURRENT_TIMESTAMP;

     

    -- Declare and initialize variables for cursor loop

    DECLARE @SalesOrderID int,

            @SalesOrderDetailID int,

            @OrderQty smallint,

            @ProductID int,

            @LineTotal numeric(38,6),

            @SubTotal numeric(38,6);

    SET @SubTotal = 0;

     

    -- Declare and init cursor

    DECLARE SalesOrderDetailCursor

      CURSOR

        LOCAL           -- LOCAL or GLOBAL

        FORWARD_ONLY    -- FORWARD_ONLY or SCROLL

        STATIC          -- STATIC, KEYSET, DYNAMIC, or FAST_FORWARD

        READ_ONLY       -- READ_ONLY, SCROLL_LOCKS, or OPTIMISTIC

        TYPE_WARNING    -- Inform me of implicit conversions

    FOR SELECT   SalesOrderID, SalesOrderDetailID,

                 OrderQty, ProductID, LineTotal

        FROM     Sales.SalesOrderDetail

        ORDER BY SalesOrderID, SalesOrderDetailID; -- Match clustered index

    --    ORDER BY ProductID;                      -- Match nonclustered index

    --    ORDER BY LineTotal;                      -- Doesn’t match an index

     

    OPEN SalesOrderDetailCursor;

     

    -- Fetch first row to start loop

    FETCH NEXT FROM SalesOrderDetailCursor

          INTO @SalesOrderID, @SalesOrderDetailID,

               @OrderQty, @ProductID, @LineTotal;

     

    -- Process all rows

    WHILE @@FETCH_STATUS = 0

    BEGIN;

     

      -- Accumulate total

      SET @SubTotal = @SubTotal + @LineTotal;

     

      -- Fetch next row

      FETCH NEXT FROM SalesOrderDetailCursor

            INTO @SalesOrderID, @SalesOrderDetailID,

                 @OrderQty, @ProductID, @LineTotal;

     

    END;

     

    -- Done processing; close and deallocate to free up resources

    CLOSE SalesOrderDetailCursor;

    DEALLOCATE SalesOrderDetailCursor;

     

    -- Display result and duration

    SELECT @SubTotal;

    SELECT DATEDIFF(ms, @start, CURRENT_TIMESTAMP);

    go

     

    The first surprise came straight away, when I set my baseline by commenting out all options of the DECLARE CURSOR statement. The execution time when ordering by the clustered index was 6.9 seconds; when ordering by a nonclustered index it was 9 seconds – but when ordering by an unindexed column, the cursor with default options turned out to be faster, at only 6.4 seconds. I later found the reason for this to be that the first two defaulted to a relatively slow dynamic cursor, whereas the latter used the faster technique of a keyset cursor.

     

    Choosing LOCAL or GLOBAL had no effect on cursor performance. This was as expected, since this option only controls the scope of the cursor, nothing else. For this reason, I excluded this option from testing the variants for updating with a cursor.

     

    I didn’t see any difference between the FORWARD_ONLY and SCROLL options either. This came as a surprise, since FORWARD_ONLY exposes only a subset of the functionality of the SCROLL version. I really expected SQL Server to be able to do some clever optimization if it knew that I’d never read in any other direction than from the first to the last row. I’m really wondering why the FORWARD_ONLY option is not deprecated, seeing that there is no advantage at all in specifying it – but maybe the development team in Redmond knows something I don’t?

     

    The static, keyset, and dynamic cursors performed exactly as expected – in all cases, the static cursor was the fastest, the keyset came second, and the dynamic cursor finished last. No surprises here – until I started my tests with the cursor that orders by an unindexed column. In these tests, SQL Server informed me (thanks to the TYPE_WARNING option) that the created cursor was not of the requested type. It did not tell me what type it did create, nor why it disregarded the requested options. I failed to find anything in Books Online to explain this behavior, so I filed a bug for it. This did explain why the “hardest” sort option was the fastest when running with default options – since a dynamic cursor was not available, this one had to use a keyset cursor instead.

     

    My biggest surprise came when I tested the FAST_FORWARD option. According to Books Online, this option “specifies a FORWARD_ONLY, READ_ONLY cursor with performance optimizations enabled”, so I expected performance to be at least on par with, and probably better than that of a STATIC FORWARD_ONLY READ_ONLY cursor – but instead, the FAST_FORWARD option turned out to be consistently slower, in some cases even by 15%!

     

    The last set of options, the ones specifying the locking behavior, turned out to depend on the chosen cursor type. For a static cursor, the two available options made no difference. For other cursors, READ_ONLY was best – but SCROLL_LOCKS was second for keyset cursors and third for dynamic cursors, and OPTIMISTIC was second for dynamic and third for keyset. Go figure.

     

    Based on all tests, it turns out that the best performance is achieved by specifying a STATIC cursor. I would add the LOCAL, FORWARD_ONLY, and READ_ONLY options for documentation purposes, but they make no performance difference. With these options, execution time went down to 3.3 or 3.4 seconds, from the earlier 6.3 to 9 seconds (depending on the ORDER BY). Of course, none of those come even close to the 0.2 seconds of the set-based equivalent for this test case:

     

    -- Keep track of execution time

    DECLARE @start datetime;

    SET @start = CURRENT_TIMESTAMP;

     

    -- Calculate and display result

    SELECT SUM(LineTotal)

    FROM   Sales.SalesOrderDetail;

     

    -- Display duration

    SELECT DATEDIFF(ms, @start, CURRENT_TIMESTAMP);

    go

     

    Modifying data

     

    Another scenario in which cursors are used is when data has to be updated, and the calculation to determine the new data is thought to be too complicated for a set-based approach. In those cases, a cursor is used to process the rows one by one, calculate the new data, and update the data with the calculation results.

     

    If you specify the FOR UPDATE clause in the cursor declaration, you can use the WHERE CURRENT OF clause of the UPDATE command to update the last row fetched. Of course, you can also omit the FOR UPDATE clause and use a regular UPDATE statement, using the primary key values of the row just read to find the row to update.

     

    Since I expected a FOR UPDATE cursor to be optimized for updating the last row fetched, I first tested its performance, by using this code:

     

    -- Enclose in transaction so we can roll back changes for the next test

    BEGIN TRANSACTION;

    go

     

    -- Keep track of execution time

    DECLARE @start datetime;

    SET @start = CURRENT_TIMESTAMP;

     

    -- Declare and initialize variables for cursor loop

    DECLARE @SalesOrderID int,

            @SalesOrderDetailID int,

            @OrderQty smallint,

            @ProductID int,

            @LineTotal numeric(38,6);

     

    -- Declare and init cursor

    DECLARE SalesOrderDetailCursor

      CURSOR

        LOCAL           -- LOCAL or GLOBAL makes no difference for performance

        FORWARD_ONLY    -- FORWARD_ONLY or SCROLL

        KEYSET          -- KEYSET or DYNAMIC

                        --    (other options are incompatible with FOR UPDATE)

        SCROLL_LOCKS    -- SCROLL_LOCKS or OPTIMISTIC

                        --    (READ_ONLY is incompatible with FOR UPDATE)

        TYPE_WARNING    -- Inform me of implicit conversions

    FOR SELECT   SalesOrderID, SalesOrderDetailID,

                 OrderQty, ProductID, LineTotal

        FROM     Sales.SalesOrderDetail

        ORDER BY SalesOrderID, SalesOrderDetailID

    FOR UPDATE          -- FOR UPDATE or FOR UPDATE OF OrderQty

        ;

     

    OPEN SalesOrderDetailCursor;

     

    -- Fetch first row to start loop

    FETCH NEXT FROM SalesOrderDetailCursor

          INTO @SalesOrderID, @SalesOrderDetailID,

               @OrderQty, @ProductID, @LineTotal;

     

    -- Process all rows

    WHILE @@FETCH_STATUS = 0

    BEGIN;

     

      -- Change OrderQty of current order

      UPDATE Sales.SalesOrderDetail

      SET    OrderQty = @OrderQty + 1

      WHERE  CURRENT OF SalesOrderDetailCursor;

     

      -- Fetch next row

      FETCH NEXT FROM SalesOrderDetailCursor

            INTO @SalesOrderID, @SalesOrderDetailID,

                 @OrderQty, @ProductID, @LineTotal;

     

    END;

     

    -- Done processing; close and deallocate to free up resources

    CLOSE SalesOrderDetailCursor;

    DEALLOCATE SalesOrderDetailCursor;

     

    -- Display duration

    SELECT DATEDIFF(ms, @start, CURRENT_TIMESTAMP);

    go

     

    -- Rollback changes for the next test

    ROLLBACK TRANSACTION;

    go

     

    Just as with the tests that only read the data, there was no difference between SCROLL and FORWARD_ONLY cursors. And just as with the tests that only read the data, KEYSET cursors were consistently faster than their DYNAMIC counterparts. However, in this case the SCROLL_LOCKS locking option turned out to be consistently faster than OPTIMISTIC, though I expect that this might change if only a fraction of the rows is updated.

     

    From a performance point of view, there is absolutely no difference between a generic FOR UPDATE and a completely specified FOR UPDATE OF column, column, … For documentation purposes, I would prefer the latter.

     

    And again, just as with the tests that only read the data, the default cursor options chosen when I did not specify any turned out to select the slowest of all available options. Ugh!

     

    However, the real kicker came when I left out the FOR UPDATE clause of the DECLARE CURSOR statement and changed the UPDATE statement to use the primary key values instead of the WHERE CURRENT OF clause. One would expect this clause to be fast – since it is written especially for, and can be used exclusively in, the processing of a FOR UPDATE cursor, every trick in the book can be used to optimize it. However, the reverse turned out to be true. Even the fastest of all WHERE CURRENT OF variations I tested was easily beaten by even the slowest of all WHERE PrimaryKey = @PrimaryKey variations. Here is the code I used, in case you want to test it yourself:

     

    -- Enclose in transaction so we can roll back changes for the next test

    BEGIN TRANSACTION;

    go

     

    -- Keep track of execution time

    DECLARE @start datetime;

    SET @start = CURRENT_TIMESTAMP;

     

    -- Declare and initialize variables for cursor loop

    DECLARE @SalesOrderID int,

            @SalesOrderDetailID int,

            @OrderQty smallint,

            @ProductID int,

            @LineTotal numeric(38,6);

     

    -- Declare and init cursor

    DECLARE SalesOrderDetailCursor

      CURSOR

        LOCAL           -- LOCAL or GLOBAL makes no difference for performance

        FORWARD_ONLY    -- FORWARD_ONLY or SCROLL

        STATIC          -- STATIC, KEYSET, DYNAMIC, or FAST_FORWARD

        READ_ONLY       -- READ_ONLY, SCROLL_LOCKS, or OPTIMISTIC

        TYPE_WARNING    -- Inform me of implicit conversions

    FOR SELECT   SalesOrderID, SalesOrderDetailID,

                 OrderQty, ProductID, LineTotal

        FROM     Sales.SalesOrderDetail

        ORDER BY SalesOrderID, SalesOrderDetailID;

     

    OPEN SalesOrderDetailCursor;

     

    -- Fetch first row to start loop

    FETCH NEXT FROM SalesOrderDetailCursor

          INTO @SalesOrderID, @SalesOrderDetailID,

               @OrderQty, @ProductID, @LineTotal;

     

    -- Process all rows

    WHILE @@FETCH_STATUS = 0

    BEGIN;

     

      -- Change OrderQty of current order

      UPDATE Sales.SalesOrderDetail

      SET    OrderQty = @OrderQty + 1

      WHERE  SalesOrderID = @SalesOrderID

      AND    SalesOrderDetailID = @SalesOrderDetailID;

     

      -- Fetch next row

      FETCH NEXT FROM SalesOrderDetailCursor

            INTO @SalesOrderID, @SalesOrderDetailID,

                 @OrderQty, @ProductID, @LineTotal;

     

    END;

     

    -- Done processing; close and deallocate to free up resources

    CLOSE SalesOrderDetailCursor;

    DEALLOCATE SalesOrderDetailCursor;

     

    -- Display duration

    SELECT DATEDIFF(ms, @start, CURRENT_TIMESTAMP);

    go

     

    -- Rollback changes for the next test

    ROLLBACK TRANSACTION;

    go

     

    So from using WHERE CURRENT OF and default options, at 16.6 seconds, I’ve gotten execution time down to 5.1 seconds by using the primary key for the update and specifying a STATIC cursor (including the LOCAL, FORWARD_ONLY, and READ_ONLY options for documentation). Looks good, as long as I close my eyes to the 0.4 second execution time of the set-based version:

     

    -- Enclose in transaction so we can roll back changes for the next test

    BEGIN TRANSACTION;

    go

     

    -- Keep track of execution time

    DECLARE @start datetime;

    SET @start = CURRENT_TIMESTAMP;

     

    -- Change OrderQty of all orders

    UPDATE Sales.SalesOrderDetail

    SET    OrderQty = OrderQty + 1;

     

    -- Display duration

    SELECT DATEDIFF(ms, @start, CURRENT_TIMESTAMP);

    go

     

    -- Rollback changes for the next test

    ROLLBACK TRANSACTION;

    go

     

    Conclusion

     

    If you have to optimize a cursor for performance, keep the following considerations in mind:

     

    1. Always try to replace the cursor by a set-based equivalent first. If you fail to see how, do not hesitate to ask in one of the SQL Server newsgroups.
    2. If you are really stuck with a cursor, then do NOT rely on the default options. They will result in the slowest of all possible option combinations.
    3. If you think that the FAST_FORWARD option results in the fastest possible performance, think again. I have not found one single test case where it was faster than, or even as fast as, a STATIC cursor.
    4. Do NOT use the WHERE CURRENT OF syntax of the UPDATE command. Using a regular WHERE clause with the primary key values will speed up your performance by a factor of two to three.
    5. Do not rely blindly on my performance results. Remember, the one thing that is always true when working with SQL Server is: “it depends”.
  • So-called "exact" numerics are not at all exact!

    Attempting to dispel myths tends to make me feel like Don Quixote, riding against hordes of windmills that won’t budge. In this case, even some of my fellow MVPs and Microsoft’s own Books Online are among the windmills…

     

    Books Online says that there are two categories of numeric data types: “approximate” (float and real), and “exact” (all others, but for this discussion mainly decimal and numeric). It also says that “floating point data is approximate; therefore, not all values in the data type range can be represented exactly”, thereby suggesting that other numeric data types are capable of representing all values in the data type range. The latter is of course not true, for there is no way that values such as 1/3, π, or √2 can ever be represented exactly in any of SQL Server’s data types.
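
    A one-line illustration of my own: even a simple fraction like 1/3 is silently rounded the moment it is stored in a decimal variable.

    DECLARE @OneThird decimal(8,6);
    SET @OneThird = 1.0 / 3.0;                -- stored as 0.333333, already rounded
    SELECT @OneThird * 3 AS "Should be 1";    -- returns 0.999999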

     

    But Books Online is not the only one to blame – many respected MVPs carry part of the blame as well. For instance, Aaron Bertrand, the original author of the famous website www.aspfaq.com, wrote on a page about rounding errors when using floating point mathematics: “You should try to avoid the FLOAT datatype whenever possible, and opt for the more versatile, and precise, DECIMAL or NUMERIC datatypes instead”. And just today, I was reading this (otherwise impressive) book by Bob Beauchemin and Dan Sullivan, when I came across a passage that presented a code snippet to demonstrate rounding errors in the .Net equivalent of float; the authors did present size and speed as possible reasons to choose float over decimal, but failed to mention that decimal is not exact either.

     

    Since reading this paragraph was the final straw that caused me to blog on this, I’ll start with a SQL Server equivalent of the code presented by Bob and Dan:

     

    DECLARE @Float1 float, @Float2 float, @Float3 float, @Float4 float;

    SET @Float1 = 54;

    SET @Float2 = 0.03;

    SET @Float3 = 0 + @Float1 + @Float2;

    SELECT @Float3 - @Float1 - @Float2 AS "Should be 0";

     

    Should be 0

    ----------------------

    1.13797860024079E-15

     

    DECLARE @Fixed1 decimal(8,4), @Fixed2 decimal(8,4), @Fixed3 decimal(8,4);

    SET @Fixed1 = 54;

    SET @Fixed2 = 0.03;

    SET @Fixed3 = 0 + @Fixed1 + @Fixed2;

    SELECT @Fixed3 - @Fixed1 - @Fixed2 AS "Should be 0";

     

    Should be 0

    ---------------------------------------

    0.0000

     

    As you see, adding some numbers and then subtracting them again does indeed incur a rounding error. The result is 0.0000000000000011379786 instead of 0. But what happens if we do a similar test with multiplying and dividing? The code below should always return 1. It does for the floating point calculation, but not for the fixed point version – this one’s result is off by exactly 1E-15, approximately the same margin of error that float caused when adding and subtracting.

     

    DECLARE @Float1 float, @Float2 float, @Float3 float, @Float4 float;

    SET @Float1 = 54;

    SET @Float2 = 0.03;

    SET @Float3 = 1 * @Float1 / @Float2;

    SELECT @Float3 / @Float1 * @Float2 AS "Should be 1";

     

    Should be 1

    ----------------------

    1

     

    DECLARE @Fixed1 decimal(8,4), @Fixed2 decimal(8,4), @Fixed3 decimal(8,4);

    SET @Fixed1 = 54;

    SET @Fixed2 = 0.03;

    SET @Fixed3 = 1 * @Fixed1 / @Fixed2;

    SELECT @Fixed3 / @Fixed1 * @Fixed2 AS "Should be 1";

     

    Should be 1

    ---------------------------------------

    0.99999999999999900

     

    It even gets more interesting when you change the value of @Fixed2 from 0.03 to 0.003 – in that case, the floating point calculation still runs fine and without error, whereas the fixed point calculation bombs:

     

    Msg 8115, Level 16, State 8, Line 11

    Arithmetic overflow error converting numeric to data type numeric.

    Should be 1

    ---------------------------------------

    NULL

     

    Now I’m sure that many of you will already have experimented and found that they could “fix” this by increasing the scale and precision of the fixed point numbers. But precision can never exceed 38, and it’s not hard at all to come up with examples of rounding errors in fixed point calculations for any setting of scale and precision.

     

    Mind you, I am not saying that float is “better” than decimal. It is not – but it’s not worse either. Both “exact” and “approximate” numeric data types have their place. “Exact” numeric data types are a grand choice when dealing with numbers that have a fixed number of decimal places and represent an exact amount, such as monetary units. There’s no way that I would ever use floating point data in such an application!

     

    But if you are dealing with scientific data, which is usually derived from some measurement and hence by definition an approximation of reality (since there’s no way to measure with unlimited precision), floating point data is an excellent choice. Not only because its approximate nature mimics the act of trying to get a measurement as close as possible to reality, but also (or maybe I should say: mainly) because it can easily represent both very large and very small numbers with a large number of significant figures – try for instance to do something like this with “exact” numeric data types, if you don’t believe me!

     

    DECLARE @Float1 float, @Float2 float, @Float3 float, @Float4 float;

    SET @Float1 = 987654321.0 * 123456789.0;

    SET @Float2 = 0.123456789 / 998877665544332211.0;

    SET @Float3 = 1 * @Float1 / @Float2;

    SELECT @Float3 / @Float1 * @Float2 AS "Should be 1";

     

    Should be 1

    ----------------------

    1

  • How NOT to pass a lot of parameters

    Did you know that SQL Server allows stored procedures to have up to 2100 parameters? And more important: do you care? Well, some people do care, and Joe Celko seems to be one of them.

     

    If you are a regular reader of SQL Server newsgroups, you probably know Joe Celko from his always unfriendly and often incorrect replies. Here is a typical example, one that I have seen several times recently, in a paraphrased form:

    Question: I want to send a list of values to my stored procedure, but WHERE ColumnName IN (@ValueList) does not work – how to solve this?

    Answer: SQL Server can handle over 1000 parameters. You should use WHERE ColumnName IN (@Parm1, @Parm2, …, @ParmN).

     

    Joe Celko is the only one I have ever seen giving this advice. Many people will then jump into the discussion, challenging Joe’s advice. To which Joe will always reply that he has received a smart stored procedure that will solve a Sudoku puzzle, and that takes 81 parameters (one for each cell in the puzzle) as its input – unfortunately, Joe has so far refused to actually publish his code to let other people verify his claims.

     

    The test setup

     

    I still wanted to see for myself how passing 81 parameters into a stored procedure compares to other methods of passing in the same input, so I wrote three simple test procedures. Each of these procedures takes a Sudoku puzzle as input, but in three different forms. Each of the three then uses the input to populate a temporary table (#Problem) with the puzzle, and then performs a pretty standard pivot query to output the puzzle in the usual form.

     

    After verifying that all of the procedures worked as expected, I commented out the pivot query to reduce the output for my performance tests. I then set up the tests. I selected two real Sudoku puzzles (an easy one, with 34 cells given, and a hard one with only 27 cells given) and added two nonsensical ones of my own (one with only 5 cells given, and one with 72 cells). For each combination of a puzzle and a procedure, I coded a loop that calls the procedure a thousand times and records the elapsed time in a table. These twelve loops were then enclosed in an endless loop. Once satisfied with the code, I hit the execute button, pushed my laptop out of the way and went on to other stuff.

     

    Some 24-odd hours later, I interrupted the test script. Each of the twelve “thousand calls” tests had been executed 400 times. I dusted off a query I originally wrote for testing the performance of code for the Spatial Data chapter in “Expert SQL Server 2005 Development” to calculate the average duration per single call, disregarding the fastest and slowest 10% of the measurements to exclude the influence of other, semi-random activities on my laptop.
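     

    For those who want to do something similar, here is a rough sketch of how such a trimmed average can be computed. This is my own reconstruction, not the actual query from the book chapter, and it assumes a hypothetical dbo.TestResults table with a TestName and a Duration column.

     

    SELECT   TestName,

             AVG(1.0 * Duration) AS AvgDuration

    FROM    (SELECT TestName, Duration,

                    NTILE(10) OVER (PARTITION BY TestName ORDER BY Duration) AS Decile

             FROM   dbo.TestResults) AS d

    WHERE    Decile BETWEEN 2 AND 9     -- drop the fastest and the slowest 10%

    GROUP BY TestName;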

     

    (Note that all code to create the stored procedures and run the tests is in the attachment to this post, so you can always repeat these tests on your machine.)

     

    The contenders

     

    The first contender is of course the procedure with 81 parameters that Joe Celko is such an avid fan of. Creating this procedure involved a lot of copying and pasting, a lot of editing numbers in the copied and pasted code, and a lot of tedious debugging until I had finally found and corrected all locations where I had goofed up the copy and paste or where I had failed to edit a number after pasting. The resulting code is very long, tedious to read and maintain, and screams “Hey! You forgot to normalize these repeating groups into their own table” all the way. Manually typing the EXEC statements to call this procedure with test data was also very cumbersome and error-prone. In a real situation, the procedure would probably be called from some user-friendly front end code. I’m far from an expert in front end code, but I expect this code to be very long as well, since it has to check and optionally fill and pass 81 parameters.

     

    The second contender uses a pretty standard CSV string as input, with the additional requirement that each value in the CSV is three characters: row#, column#, value. The procedure uses a variation of one of the scripts found on Erland Sommarskog’s site to parse the CSV list into a tabular format. This code is lots shorter, and as a result easier on the eyes and easier to maintain. Typing in the EXEC statements for testing is still pretty cumbersome (though I found a way to cheat – simply copy the parameter list for the Celko version, do some global search and replace to remove various characters, and the end result was exactly the string I needed to call this procedure). The front end code will probably be lots shorter, since it can use a simple loop to process the input and build the CSV parameter.
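     

    To give an idea of what such a parse can look like, here is a simplified sketch – not the code from the attachment, which is based on Erland’s scripts – that assumes a dbo.Numbers helper table with a column Number holding the integers 1, 2, 3, and so on.

     

    DECLARE @CSV varchar(400);

    SET @CSV = '111,245,387';    -- three given cells, just as illustration

    SELECT CAST(SUBSTRING(@CSV, n.Number,     1) AS tinyint) AS SudokuRow,

           CAST(SUBSTRING(@CSV, n.Number + 1, 1) AS tinyint) AS SudokuCol,

           CAST(SUBSTRING(@CSV, n.Number + 2, 1) AS tinyint) AS CellValue

    FROM   dbo.Numbers AS n

    WHERE  n.Number <= LEN(@CSV)

    AND    n.Number % 4 = 1;     -- each 3-character element starts at position 1, 5, 9, ...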

     

    The third and last contender takes a CHAR(81) parameter as input. The first 9 characters of this parameter describe the top row, using a space to depict an empty cell; the second set of 9 characters is for the second row, and so forth. Parsing this parameter turned out to be even easier than parsing the CSV parameter. Another observation I made is that it was much easier to manually enter the parameter for the tests – just read the puzzle left to right and top to bottom and type either a number or a space for each cell. This was absolutely painless, and I didn’t make a single mistake. Of course, this is irrelevant for the expected real situation where the parameter is built by the front end – the code to do this will probably be about as complex as that for the CSV parameter.
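     

    And here is a similar sketch for unpacking the CHAR(81) parameter (again my own simplification rather than the attachment code, and again assuming the dbo.Numbers helper table):

     

    DECLARE @Puzzle char(81);

    SET @Puzzle = STUFF(REPLICATE(' ', 81), 1, 2, '53');   -- pretend only the first two cells are given

    SELECT (n.Number - 1) / 9 + 1 AS SudokuRow,

           (n.Number - 1) % 9 + 1 AS SudokuCol,

           CAST(SUBSTRING(@Puzzle, n.Number, 1) AS tinyint) AS CellValue

    FROM   dbo.Numbers AS n

    WHERE  n.Number BETWEEN 1 AND 81

    AND    SUBSTRING(@Puzzle, n.Number, 1) <> ' ';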

     

    Performance results

     

    If you’re as eager to see the test results as I was, you’ll probably have skipped the previous section. No problem, just promise to go back and read it later, m’kay?

     

    Test version             Joe Celko’s 81 parameters   Single CSV parameter   Single CHAR(81) parameter

    Almost empty (5 cells)   1.08 ms                     1.40 ms                1.05 ms

    Hard puzzle (27 cells)   1.80 ms                     1.78 ms                1.35 ms

    Easy puzzle (34 cells)   2.04 ms                     1.90 ms                1.45 ms

    Almost full (72 cells)   3.34 ms                     2.56 ms                1.99 ms

     

    As you see, using lots of parameters is faster than using a single CSV parameter only if you don’t actually pass values in these parameters. As soon as you use the parameters, performance of a procedure with lots of parameters deteriorates quickly.

     

    You can also see that the CHAR(81) parameter wins in all cases.

     

    Network bandwidth

     

    My testing was all carried out on my laptop. The results will for the most part be a result of the time needed to process the input, not of network capacity. However, it is easy to see by just looking at the EXEC statements that the CHAR(81) version uses the least network resources, Celko’s version with 81 parameters uses the most, and the CSV version sits nicely in between.

     

    Final thoughts

     

    You may have noted that I have not included a version with an XML input parameter in my tests. I probably should have done that, but I have to admit that I still have so much to learn on how to handle XML in a SQL Server database that I didn’t feel comfortable enough to sit down and write one myself. But your submissions are welcome – if you feel that you can write an efficient version of this procedure that accepts its input in XML format, do not hesitate to write me. As soon as I can spare the time to set up the laptop for another all-nighter of performance testing, I’ll rerun the script with your XML solution included and post the results back to this site.

     

    While writing this post, I found a newsgroup posting by Joe Celko where he reveals a snippet of “his” Sudoku solver. And guess what? I was wrong when I thought that I could guess how his procedure looks. It turns out that he does not use defaults for his parameters; you always have to supply them all, using 0 for an empty cell. I didn’t want to repeat all the tests at this time. I expect that this will reduce performance even more, though not by much – but it will also cause a huge increase in network usage!

    I also saw that the parameters in Joe Celko’s version were declared as integer, so that each parameter will use 4 bytes instead of just 1. This will definitely affect both the performance of the procedure and the network pressure.

     

    Conclusion

     

    If you have to pass a long list of values to a stored procedure or function, you should not use a long collection of parameters. It makes the code harder to write and maintain, more prone to subtle errors, and longer (which will affect parse and compile time, though I did not include this in my test); it uses far more network resources than any of the other alternatives (except, maybe, XML), and it gets terribly slow as more parameters are actually used.

     

    Joe Celko will probably find that he too can shorten the amount of code in his Sudoku solver *and* increase performance by using a different strategy to pass the puzzle. Of course, in the case of solving a Sudoku, those 2 milliseconds of extra execution time won’t really matter, nor will the few hundred extra bytes travelling over the network. But if you ever encounter a similar multi-parameter problem in a procedure that will be called from a web page that will get hundreds of hits per second, those 2 milliseconds and those extra bytes on the network can suddenly become a huge bottleneck!

     

    SQL Server may support up to 2100 parameters – but that does not imply that it is a good idea to actually use them!

  • What if null if null is null null null is null?

    In this fourth and final part in my series about NULL, I’ll discuss some well-known and some less well-known functions and keywords that are specifically created to deal with NULL values. And I will, of course, explain why null if null is null null null is null. In case you have not yet read them, you can click these links to read the first, second, and third part.

     

    IS NULL is not = NULL

     

    I have already explained why tests for equality with NULL will always return Unknown instead of True or False. This holds true in all circumstances – even when both sides of the equation are NULL. That’s why the query below will not help you find the people for which no birthday is on file:

     

    SELECT FirstName, LastName

    FROM   dbo.Persons

    WHERE  Birthday = NULL;

     

    Even though no birthday is on file for Hugo Kornelis, my name will not be returned – because when evaluating “my” row, SQL Server will see a comparison between a missing value (my birthday) and a missing value (the NULL) – and when asked whether two unspecified values are equal, the only guaranteed correct answer is “I ain’t the faintest, dude” (for the British readers, that would be “I haven’t got the foggiest idea, old chap”).

     

    Since there are numerous occasions where you want to find rows where values are missing, a special operator was introduced for this: IS NULL. Rewriting the query above as follows will return my name, because my Birthday is indeed NULL in this database.

     

    SELECT FirstName, LastName

    FROM   dbo.Persons

    WHERE  Birthday IS NULL;

     

    Note that the operator IS NULL will return True if the operand (Birthday in this case) is NULL, or False if it is any other value – an IS NULL test can therefore never result in Unknown. So to find values that are not missing, the expression can simply be negated to

    WHERE  NOT(Birthday IS NULL);

    but there is also a shorthand form available that is much more common – in fact so much more that I don’t think I’ve ever seen the above version used!

    WHERE  Birthday IS NOT NULL;

     

    Do not forget that the NULL keyword is an integral part of the IS [NOT] NULL operator. You can’t use IS or IS NOT as a replacement for = and <> and expect the same special treatment of NULL values that IS [NOT] NULL provides. That would require the use of the IS [NOT] DISTINCT FROM operator that is specified in SQL:1999 but not yet implemented in SQL Server (vote here if you’d like to see this changed!)

     

    SET ANSI_NULLS { ON | OFF }

     

    Some of you may have seen existing code where a predicate such as Column = NULL is used, and does actually return rows. This is probably legacy code that has not been maintained in a long time (and if it isn’t, it must have been written by a legacy developer who has not maintained his knowledge in a long time).

     

    Very early versions of SQL Server were released before agreement was reached in the ANSI committee on all features. As a result, some features were implemented in a different way than what the ANSI committee eventually described, which faced Microsoft’s developers with the challenge of upgrading to ANSI compliant behaviour without breaking existing code.

     

    Behaviour of NULL comparisons in predicates is one such example where the original version of SQL Server “got it wrong”. So when SQL Server was changed to return Unknown on a comparison with NULL, the SET ANSI_NULLS OFF option was introduced to force the “old” behaviour that would return False on NULL vs. non-NULL comparison, and True on NULL vs. NULL comparison.

     

    The SET ANSI_NULLS option has been marked as deprecated in SQL Server 2005, so if you find code that still relies on this setting, you’d better change it ASAP. This would of course be very easy if you only had to look for “= NULL” and “<> NULL”, and change them to “IS NULL” and “IS NOT NULL” – but unfortunately, comparisons to a variable that can be NULL are also affected by the ANSI_NULLS setting; these are much harder to identify or fix!
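     

    If you have never seen this deprecated behaviour in action, here is a quick sketch you can run to compare both settings (for illustration only – please don’t rely on this in real code):

     

    SET ANSI_NULLS OFF;

    go

    DECLARE @Birthday datetime;

    SET @Birthday = NULL;

    IF @Birthday = NULL

      PRINT 'ANSI_NULLS OFF: "= NULL" does match a NULL value';

    go

    SET ANSI_NULLS ON;

    go

    DECLARE @Birthday datetime;

    SET @Birthday = NULL;

    IF @Birthday = NULL

      PRINT 'This will not print'

    ELSE

      PRINT 'ANSI_NULLS ON: the comparison is Unknown, so the ELSE branch runs';

    go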

     

    ISNULL is not COALESCE

     

    Despite the name similarity, and despite the fact that in Access, ISNULL(xxx) is equivalent to xxx IS NULL, the T-SQL ISNULL function is completely different from the IS NULL predicate. ISNULL is used to return the first non-NULL from its two inputs. However, ISNULL is also a leftover from the days before the ANSI standard was finalised – the ANSI standard function that should be used to replace ISNULL is called COALESCE. Despite the similarities, there are also a couple of differences that you should be aware of before replacing all your occurrences of ISNULL with COALESCE.

     

    One reason to prefer COALESCE over ISNULL, in addition to its adherence to standards, is that COALESCE can take any number of arguments whereas ISNULL only takes two. Both will return the first non-NULL argument, or NULL if all arguments are NULL. The unlimited number of arguments makes COALESCE a far better option when tasked to find the first non-NULL value from more than two inputs – for example, to find the first non-NULL of Arg1, Arg2, Arg3, and Arg4, I’d rather use

    COALESCE(Arg1,Arg2,Arg3,Arg4)

    than

    ISNULL(Arg1,ISNULL(Arg2,ISNULL(Arg3,Arg4)))

     

    Another major problem with ISNULL is how it deals with implicit conversions. COALESCE will, like any other T-SQL operation, use the rules of data type precedence to find the data type of its result – but ISNULL will always return a value of the data type of its first argument, making this the only T-SQL keyword that does not respect the precedence rules.
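     

    A quick sketch (my own example) shows the difference – ISNULL converts the replacement value to the type of its first argument, while COALESCE follows data type precedence:

     

    DECLARE @MissingAmount int;   -- left at NULL on purpose

    SELECT ISNULL(@MissingAmount, 1.5)   AS ISNULL_Result,    -- int, so 1.5 is truncated to 1

           COALESCE(@MissingAmount, 1.5) AS COALESCE_Result;  -- numeric by precedence, so 1.5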

     

    Unfortunately, there is one situation where you cannot simply rip out ISNULL and replace it with COALESCE – and that is in the case of computed columns. If you attempt to create the tables below, you will see that Test_ISNULL is created without problems, whereas the attempt to create Test_COALESCE fails, because the COALESCE version of the computed column is considered nullable and can therefore not be used as the primary key.

     

    CREATE TABLE Test_ISNULL

         (Col1 int NOT NULL,

          Col2 int NULL,

          Col3 AS ISNULL(Col2, Col1) PRIMARY KEY);

    go

    CREATE TABLE Test_COALESCE

         (Col1 int NOT NULL,

          Col2 int NULL,

          Col3 AS COALESCE(Col2, Col1) PRIMARY KEY);

    go

     

    I consider this to be a bug. Clearly, a computed column using COALESCE should inherit its nullability from the last argument, just as is the case with ISNULL. Please vote on Connect if you agree with me that this bug should be fixed, as a first step towards deprecating and removing the superfluous and confusing ISNULL function.

     

    NULLIF, the forgotten one

     

    The last NULL-related keyword that I will cover here is also the least well-known (although some strange coincidence caused Jeff Smith to blog about it earlier this week). That may be due to the fact that it is only useful in a limited number of cases, and also not as easy to understand as the other ones.

     

    The NULLIF function takes two arguments. It returns NULL if the first argument is equal to the second one – in all other cases, the first argument is returned unchanged. According to the SQL standards, NULLIF is actually a shorthand form for a CASE expression:

    NULLIF(Arg1, Arg2)

    is defined as equivalent to

    CASE WHEN Arg1 = Arg2 THEN NULL ELSE Arg1 END

     

    The most common use of NULLIF is to prevent runtime errors such as division by zero. For instance, the query below will fail if there are any rows with (Col1 + Col2) equal to zero:

    SELECT IdCol, Col1 / (Col1 + Col2) AS Col1Ratio

    FROM   dbo.SampleTable;

     

    One way to attempt to prevent this error is to exclude these rows:

    SELECT IdCol, Col1 / (Col1 + Col2) AS Col1Ratio

    FROM   dbo.SampleTable

    WHERE  Col1 + Col2 <> 0;

    However, this might fail, because SQL Server might use an execution plan that calculates Col1Ratio before applying the WHERE clause – that is not a bug; the SQL standards explicitly allow vendors such freedom in the implementation of their products. And if other columns are displayed as well, completely omitting the rows is not a good way to tackle this problem.

     

    This is where NULLIF comes to the rescue. In the query below, the divisor will be changed to NULL if it was 0, setting the result of the entire calculation to NULL instead of causing a runtime error.

    SELECT IdCol, Col1 / NULLIF(Col1 + Col2, 0) AS Col1Ratio

    FROM   dbo.SampleTable;

    If so desired, a COALESCE function can be added to change this NULL result back to a chosen numeric value – or it can be left as NULL to make it instantly visible that for this row, a value of Col1Ratio could not be computed.
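     

    For completeness, this is what that combination would look like (0 is just an arbitrary replacement value here):

    SELECT IdCol, COALESCE(Col1 / NULLIF(Col1 + Col2, 0), 0) AS Col1Ratio

    FROM   dbo.SampleTable;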

     

    Another problem area where NULLIF can be used is to check if values are distinct. I have already explained the difference between unequal and distinct in the third part of this series, and I have also presented this method for testing for distinctness:

    WHERE OldValue <> NewValue

    OR   (OldValue IS NULL AND NewValue IS NOT NULL)

    OR   (OldValue IS NOT NULL AND NewValue IS NULL);

     

    This is of course rather clumsy; many developers prefer to replace this with a shorter expression. That is very easy to do if there is some magic value that OldValue and NewValue can never be equal to – for instance, if both are always above 0, you could replace the three lines above with this single line:

    WHERE COALESCE(OldValue, -1) <> COALESCE(NewValue, -1)

     

    But what if any value can occur in the domain? In that case, you can use NULLIF to shorten the expression a bit:

    WHERE NULLIF(OldValue, NewValue) IS NOT NULL

    OR    NULLIF(NewValue, OldValue) IS NOT NULL

    Remember that NULLIF returns NULL if both arguments are equal, but also if the first argument is NULL. So for the expression above, both NULLIF expressions will be NULL if either NewValue is equal to OldValue, or both are NULL. If one of the two is NULL and the other is not, only one of the NULLIF expressions will be NULL. If neither is NULL and they are not equal, both NULLIF expressions will return a non-NULL value.

     

    Do remember that none of the three distinction tests presented above will allow the use of an index seek strategy. Until Microsoft implements IS [NOT] DISTINCT FROM, we’ll have to live with not being able to use an index seek for this.

     

    What if null if null is null null null is null?

     

    Now, I can finally answer this question – I only have to remove some spaces and add some commas and parentheses to change it to valid SQL Server syntax:

    IF NULLIF(NULL, ISNULL(NULL, NULL)) IS NULL

     

    Note that this is not a valid statement, as the IF statement is missing its statement block. Don’t add it yet – first try to predict the results without running the code.

     

    The result of ISNULL(NULL, NULL) should of course be NULL, since both arguments are NULL. That simplifies the IF statement to:

    IF NULLIF(NULL, NULL) IS NULL

     

    The result of NULLIF(NULL, NULL) should be NULL. Not because the two operands are equal (they are not – remember, NULL = NULL evaluates to Unknown!), but because whenever the comparison does not evaluate to True, the first argument is returned unchanged. So now we have:

    IF NULL IS NULL

    which will of course evaluate to True.

     

    Now complete the IF statement in the original query and run it to check our prediction:

    IF NULLIF(NULL, ISNULL(NULL, NULL)) IS NULL

      PRINT 'Prediction was correct';

    ELSE

      PRINT 'I goofed...';

     

    Msg 8133, Level 16, State 1, Line 1

    None of the result expressions in a CASE specification can be NULL.

     

    So I didn’t predict the result correctly, nor did I goof … instead, I discovered an interesting caveat with NULLIF – or rather, with the CASE expression that this NULLIF expression expands to:

     

    CASE WHEN NULL = NULL THEN NULL ELSE NULL END

     

    The problem here is that both the THEN (or rather, each THEN, as a CASE expression allows as many WHEN … THEN clauses as you wish) and the ELSE clause return the constant expression NULL. That this is the problem is easily verified:

     

    SELECT CASE WHEN 1 = 2 THEN NULL ELSE NULL END

    Msg 8133, Level 16, State 1, Line 1

    None of the result expressions in a CASE specification can be NULL.

     

    This error message is misleading. Result expressions of a CASE specification are allowed to be NULL; I’ve used that many times. The real reason that this particular query errors is related to how SQL Server determines the data type for a constant in the query. A numeric constant, like 42 or 3.14159265, is considered to be integer or numeric(9,8); a constant enclosed in quotes like ‘this’ is considered to be varchar(4). But since any data type allows the NULL “value”, the constant NULL can be of any data type and SQL Server has no way of guessing what you mean. This is usually solved by checking what data types are used around it. So in the expression

    CASE WHEN 3 = NULL THEN NULL ELSE '3' END

    the first NULL is considered to be of data type integer (as it is compared to the constant 3, which is integer as well), and the second NULL is considered to be varchar(1) (as both this NULL and the constant ‘3’ are possible results of the CASE expression).
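     

    If you want to check this kind of type inference for yourself, SQL_VARIANT_PROPERTY comes in handy – a quick sketch of my own, not part of the original argument:

    SELECT SQL_VARIANT_PROPERTY(CASE WHEN 3 = NULL THEN NULL ELSE '3' END, 'BaseType') AS ResultType;

    -- Returns varchar, because the only typed result expression is the constant '3'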

     

    Back to the error query – if all possible result expressions of the CASE expression are the constant NULL, SQL Server has no way of working out the data type of the result expression, and that is indeed an error – so SQL Server did the right thing in returning an error, it just chose a bad message! Interestingly, even though COALESCE is also defined as shorthand for a CASE expression, there is a specific error message for COALESCE(NULL,NULL) in SQL Server 2005 (maybe as a result of this connect entry?) – but instead of fixing the real problem, Microsoft chose to fix only the specific COALESCE case. I have of course filed a bug report on Connect for this misleading error message.

     

    Anyway, now that I know the cause of the error message, I can also fix it, and run the code below to finally check if null if null is null null null is null:

     

    -- Please, NEVER use this as a variable name in real code....

    DECLARE @NULL int;

    -- This superfluous assignment makes the code self-documenting.

    SET @NULL = NULL;

    -- Now check if null if null is null null null is null:

    IF NULLIF(@NULL, ISNULL(@NULL, @NULL)) IS NULL

      PRINT 'Prediction was correct';

    ELSE

      PRINT 'I goofed...';

     

    Prediction was correct

  • Dr. Unknown, or how I learned to stop worrying and love the NULL

    Two months ago, I posted the first two parts of a series about NULL. After that, I went quiet. Much to do and little time was one excuse. But to be honest, I also lost interest. However, I felt I owed it to my readers to conclude the series, so I have now forced myself to write and publish the last parts before moving on to more interesting subjects :-).

     

    Before reading on, I suggest that you first read (or reread) the first and second part of this series, so that we’re all on the same level again.

     

    Finished reading? Okay, good to see you back. :-)

     

    How to turn logical into illogical

     

    In his comment to the second part of this series, David Portas pointed out the danger of equating the behaviour of Unknown in SQL’s three-valued logic to the behaviour of humans when confronted with unknown (or rather: missing) data. The problem is that the behaviour of NULL and Unknown is defined such that it will mostly produce the logical result – but that defined behaviour is then applied to all situations, in true computer fashion. A computer lacks the brain power to assess when the rules stop making sense.

     

    Here are two examples where the logical treatment of NULL and Unknown, when applied to a specific situation, renders completely illogical results:

     

    DECLARE @Var int;

    SET @Var = NULL;

     

    IF @Var = 8 OR @Var <> 8

      PRINT 'It either is or isn''t 8';

    ELSE PRINT 'Huh? It''s not 8, but it''s not not 8 either!';

     

    IF @Var = @Var

      PRINT 'Duh, of course it''s equal to itself';

    ELSE PRINT 'What the hey, it''s not even equal to itself!!';

     

    Both conditions in the code above are immediately recognisable as tautologies by us humans. But SQL Server doesn’t attempt to identify tautologies; it just does as instructed. So in the case of the first IF statement, it first evaluates each of the individual parts: “@Var = 8” is Unknown, since the value for @Var is missing and might or might not be 8. Likewise, the evaluation of “@Var <> 8” results in Unknown as well. And then the logic table that I explained in my last post is consulted to get a result of Unknown for “Unknown OR Unknown” – and hence the ELSE is executed (and the PRINTed output should actually have read “I’m not really sure whether it’s either equal or unequal to 8!” – which of course makes no sense either). Note also that changing “@Var <> 8” to “NOT(@Var <> 8)” will not change the outcome.

     

    The second example takes the oddity yet a step further. Surely, anything is always equal to itself? Well, yeah, sure it is – but the specifications of the ANSI standard prevent SQL Server from knowing that. What SQL Server does here is first replace the variable references with their value, so that the comparison to evaluate becomes “NULL = NULL” – and there is of course no way to know if two missing values are equal to each other or not.

     

    A different, slightly better hidden form of tautology is presented in this example:

     

    CREATE TABLE Test

       (Col1 int NOT NULL PRIMARY KEY,

        Col2 int NULL,

        CHECK(Col2 < 10));

     

    INSERT INTO Test (Col1, Col2)

    SELECT 1, 3

    UNION ALL

    SELECT 2, NULL;

     

    SELECT Col1, Col2

    FROM   Test

    WHERE  Col2 < 10;

    go

     

    For the row with Col1 = 2, the value of Col2 is missing – but because of the CHECK constraint, we know that it can never be 10 or more. (Note that the constraint does accept the NULL, since a CHECK constraint only rejects rows for which it evaluates to False.) And yet, SQL Server has to exclude this row from the query result, based on the very handling of NULL that I have been labelling as “logical” in my previous posts!

     

    An idea that pops up in newsgroups from time to time, is to have the RDBMS recognise tautologies and react accordingly, regardless of whether the input is NULL or not. I must say that I am glad that this has, so far (touches wood), not been incorporated in either the ANSI standard or SQL Server. Sure, I do appreciate that the current situation leads to weird effects, but it is at least 100% logical. But extending the database with a list of recognised tautologies would introduce another weirdness, for I consider it to be highly unlikely that a database will ever be able to recognise all possible tautologies. Would you instantly recognise the WHERE clause below as a tautology?

     

    WHERE Foo BETWEEN 0 AND Bar

    OR    SQUARE(Foo) > Foo * Bar

    OR    Bar <= 10;

     

    And remember – even if you could, and if you could make a query parser do it too, I can always come up with a more complex way of saying “True”.

     

    Unequal, but not distinct

     

    Database developers often have to deal with testing if values are unequal – but they also have to deal with testing if values are distinct. Now you may think that “distinct” is the same as “unequal”, and you may even find that Wikipedia agrees with you – but in database land, “distinct” is in fact both distinct from and unequal to “unequal”.

     

    The difference between equality and distinction, in database terms, is related to NULL values. For an equality comparison, the result is Unknown if either or both the arguments is NULL, as explained previously. For a distinction comparison, though, a NULL value is considered to be distinct from any non-NULL value, but the same as another NULL. It is easy to see this in practice – just run a query with a GROUP BY or a DISTINCT clause, and you will see that all NULL values are lumped together in a single group. (Unfortunately, Microsoft has also seen fit to use distinction rather than equality in its implementation of UNIQUE constraints, a clear violation of the ANSI standard that often forces us to implement awkward workarounds – vote here if you agree with me that this should be changed).
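     

    A tiny, made-up example shows the lumping together – the GROUP BY produces one group for all the NULLs and one for ‘Red’:

     

    SELECT   Colour, COUNT(*) AS NumRows

    FROM    (SELECT 'Red' AS Colour

             UNION ALL SELECT NULL

             UNION ALL SELECT NULL) AS d

    GROUP BY Colour;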

     

    There are various occasions where a database developer will want to know if two values are distinct. A typical example would be a trigger to track changes and store them in a separate auditing table – if only two columns in a 40-column table are actually changed, you prefer not to waste log space on the other 38 columns. But a straight comparison between the old and the new value would only catch changes from a non-NULL to a different non-NULL value; changes from NULL to non-NULL and vice versa would not be audited, since this comparison would result in Unknown. As would a non-change of a NULL value.

     

    Unfortunately, Microsoft has not yet implemented the IS [NOT] DISTINCT FROM operator that was added in the SQL:1999 version of the ANSI standard. That means that instead of being able to write

    WHERE OldValue IS DISTINCT FROM NewValue;

    we are stuck with writing this more clumsy (and less efficient) variant:

    WHERE OldValue <> NewValue

    OR   (OldValue IS NULL AND NewValue IS NOT NULL)

    OR   (OldValue IS NOT NULL AND NewValue IS NULL);

     

    Implementing the IS [NOT] DISTINCT FROM clause was suggested on Microsoft Connect two months ago by Steve Kass. There are already 49 votes in favour of this suggestion and Microsoft is considering it for SQL Server 2008 – you too can increase the chance that this will happen by adding your vote here.

     

    WHERE WeAre [NOT] IN (trouble)

     

    Another common cause of NULL-related woes is the use of [NOT] IN with a subquery. To understand the root cause of these troubles, it’s useful to first have a look at the exact meaning of [NOT] IN with a hardcoded value list.

     

    In the ANSI standard, the IN clause is defined such that

    WHERE Something IN (3, 5, 7);

    should be considered as shorthand for

    WHERE Something = 3

       OR Something = 5

       OR Something = 7;

    and should therefore return the exact same result. Likewise,

    WHERE Something NOT IN (3, 5, 7);

    should return the same result as

    WHERE NOT (Something = 3

            OR Something = 5

            OR Something = 7);

    or

    WHERE Something <> 3

      AND Something <> 5

      AND Something <> 7;

     

    This definition is to be taken very literally. Even when a NULL is inserted in the value list, the definition still holds. So

    WHERE Something IN (3, NULL, 7);

    is considered as shorthand for

    WHERE Something = 3

       OR Something = NULL

       OR Something = 7;

    and returns the same results. If Something = 3, the result will be True (True OR Unknown OR False) and the row is included. If Something = 2, the result is Unknown (False OR Unknown OR False), and the row is not included. No surprises so far. And if Something is NULL, the result ends up as Unknown (Unknown OR Unknown OR Unknown) as well, so this row is excluded as well – this may be slightly surprising, but not really a big problem for most.

     

    The nasty surprise sits in the NOT IN variant:

    WHERE Something NOT IN (3, NULL, 7);

    is shorthand for

    WHERE Something <> 3

      AND Something <> NULL

      AND Something <> 7;

    and this will never return any row at all! See, if Something is 3, the result is of course False (False AND Unknown AND True) and the row is omitted, as expected. For a NULL value of Something, the result is Unknown (Unknown AND Unknown AND Unknown), so this row is omitted as well. But the kicker is what happens if Something = 2 … in that case, the result is not True (as most would expect), but Unknown (True AND Unknown AND True) and this row is also omitted from the result. After all, the database can’t be sure that the missing value in the list isn’t accidentally equal to 2, and doesn’t want to risk returning rows that don’t qualify!

     

    Mind you, I have never ever seen anyone actually use a NULL in the hardcoded value list of a NOT IN clause, so what’s the deal anyway? Well, it’s simple – the defined behaviour of a [NOT] IN (subquery) clause is to evaluate the subquery, replace it with the list of values returned from the subquery, and then evaluate the [NOT] IN according to the rules discussed above. (Note that the query optimizer is free to, and will often, choose another method of actually executing the query as long as it doesn’t change the results). So if Foo.Bar can be NULL, the result of

    WHERE    Something NOT IN

     (SELECT Bar

      FROM   Foo

      WHERE  Yadda yadda);

    will be just as empty as the result of

    WHERE Something NOT IN (3, NULL, 7);

     

    The easy workaround is of course to explicitly exclude the NULLs from the subquery:

    WHERE    Something NOT IN

     (SELECT Bar

      FROM   Foo

      WHERE  Yadda yadda

      AND    Bar IS NOT NULL);

     

    A different solution is to rewrite the NOT IN to a subquery with NOT EXISTS:

    WHERE NOT EXISTS

     (SELECT *

      FROM   Foo

      WHERE  Yadda yadda

      AND    Bar = Something);

     

    This also provides a nice workaround for the current lack of support for row value constructors in SQL Server (also on Connect, so vote here if you deem this important). You cannot write

    WHERE (Something, Splunge) NOT IN

     (SELECT Bar, Splat

      FROM   Foo

      WHERE  Yadda yadda);

    but you can use this:

    WHERE NOT EXISTS

     (SELECT *

      FROM   Foo

      WHERE  Yadda yadda

      AND    Bar   = Something

      AND    Splat = Splunge);

     

    My advice is to strictly avoid [NOT] IN (subquery) and always use [NOT] EXISTS (subquery) instead. Not only because the latter avoids the NULL issues, but also because [NOT] EXISTS can be used in cases where [NOT] IN isn’t possible. Plus, I have never seen a case where [NOT] IN performs better than [NOT] EXISTS (but I have seen the reverse).

     

    There is of course nothing wrong with using [NOT] IN (value, value, …)

     

    What if null if null is null null null is null?

     

    I promised to answer this question – and now I see that this is already my longest blog post ever, and I’m not even near the answer yet. So I’ll just go ahead and tell you that, if null if null is null null null is null, everything is okay and the product is working as it should be, but for one caveat. Why, and what caveat, that’ll have to wait until I find the time to wrap up and post the final part of this series on NULL. (And no, I won’t keep you waiting for another two months – promise!)

  • Upcoming speaking events

    I know I’ve been neglecting my blog lately, but I at least have a valid excuse this time. Four weeks ago, I received two emails on a single day, each inviting me to speak at a different Dutch event – and both events will be held in the same week.

     

    I replied “yes” to both mails, and have since been preparing my sessions. For those readers who would like to catch me speaking, I’ll give the details below.

     

    For the “Software Developer Conference 2007”, to be held in Arnhem, the Netherlands on the 17th and 18th of September, I’ll deliver two sessions:

     

    And for the meeting of the Dutch PASS chapter, to be held in Amsterdam in the evening of the 20th of September, I’ll deliver a single session:

     

    Note that all sessions will be held in Dutch, so if you don’t speak Dutch, you’d probably better go to Denver instead.

  • A Better Developer in 6 months? Deal!

    It's all the fault of Denis Gobo. He tagged me, and now I feel obliged to publish my plans for the last half of 2007 as well. I am not quite as ambitious as Denis, though.

     

    My first commitment

    Like Denis, I will read more technical books. I have a huge stack of unread books on my desk that I will try to finish. Here are the books, in planned reading order:

    I am currently near page 20 of T-SQL Programming.

     

    My second commitment

    I'll write more blog posts. At least two per month. Non-technical blog posts, like this one, don't count.

     

    In case anyone believes that blogging won’t make me a better developer, go ahead and read some of my previous blog posts. Then try to guess how much time I have spent studying a subject before I even start writing about it.

     

    My third commitment

    I will set up a virtual machine and actually install and test at least two CTP versions of SQL Server 2008 (instead of just downloading them and forgetting which directory I put them in, as I currently do).

     

    My fourth commitment

    I’ll try to increase my number of posts in SQL Server related newsgroups to at least 200 posts per month. (As a comparison, I had 55 posts in June, and 45 posts in the first half of July).

     

    Like blogging, replying to questions in newsgroups is a great way to learn, as it confronts me with problems I don’t have in my normal work and that require me to think out of the box. Plus, I automatically see other replies when working in a newsgroup, which is also a great source of inspiration!

     

     

    My fifth commitment

    I am enrolled in a number of online courses offered by Microsoft Learning. Some are completed, some partially completed, and some not even started. I’ll finish them all before the end of the year. Here is the list:

    I have currently finished two of the eight modules of collection 2781 completely, and am 81% through a third. I also noted that a third collection I was working on has already expired, so I guess I won’t be able to finish that one.

     

    Progress

    I know myself. If I just make plans and keep them to myself, I’ll forget about them. Posting my plans here and labeling them as “commitments” will generate some extra pressure for me. To add yet more pressure, I will also post progress reports, as comments to this post. Don’t expect daily reports, but feel free to prod me if you see no progress reports for three weeks or more.

     

    Tagging

    Denis says he knows it’s cheesy, but did it anyway. I agree with his assessment; tagging is VERY cheesy indeed. So I won’t. (Do I hear some sighs of relief?)
