THE SQL Server Blog Spot on the Web
Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server'

  • Let's deprecate UPDATE FROM!

    I guess that many people using UPDATE … FROM on a daily basis do so without being aware that they are violating all SQL standards.

     

    All versions of the ANSI SQL standard that I checked agree that an UPDATE statement has three clauses – the UPDATE clause, naming the table to be updated; the SET clause, specifying the columns to change and their new values; and the optional WHERE clause to filter the rows to be updated. No FROM or JOIN – if you need data from a different table, use a subquery in the SET clause. The optional FROM and JOIN clauses were added by Microsoft, as an extension to the standard syntax (and just to make out lives more interesting, they invented different variations of the syntax for SQL Server and for Access). So when you are in the habit of using them, be prepared to review all your UPDATE statements when moving to Oracle, DB2, Sybase, MySQL, or even a different Microsoft database!

     

    Standards? Bah, who cares?

     

    Well, some do. Me for instance – I will never use proprietary syntax if I know a standard alternative, expect if using the latter has severe negative consequences. And maybe you will, one day, when your boss comes back from the golf course with the great news that he managed to convince a colleague (who just happens to work in an Oracle shop) to buy a copy of your company’s application instead of some off-the-shelf product. Or when there’s a great job opportunity for someone with cross platform skills. Or when you are asked to help out this new colleague with 10+ years of DB2 experience. One of the lesser known side effects of Murphy’s Law is that those who least expect having to move their database to another platform, will.

     

    But even if you really don’t care about portability, there are other reasons to be wary of using UPDATE FROM. In fact, the most important reason why I dislike UPDATE FROM is not that it’s non-standard, but that it is just too easy to make mistakes with.

     

    Correctness? Bah, who cares?

     

    Well, most do. That’s why we test.

     

    If I mess up the join criteria in a SELECT query so that too many rows from the second table match, I’ll see it as soon as I test, because I get more rows back then expected. If I mess up the subquery criteria in an ANSI standard UPDATE query in a similar way, I see it even sooner, because SQL Server will return an error if the subquery returns more than a single value. But with the proprietary UPDATE FROM syntax, I can mess up the join and never notice – SQL Server will happily update the same row over and over again if it matches more than one row in the joined table, with only the result of the last of those updates sticking. And there is no way of knowing which row that will be, since that depends in the query execution plan that happens to be chosen. A worst case scenario would be one where the execution plan just happens to result in the expected outcome during all tests on the single-processor development server – and then, after deployment to the four-way dual-core production server, our precious data suddenly hits the fan…

     

    That’s all?

     

    Well, almost. There’s one more thing. Probably not something you’ll run into on a daily base, but good to know nonetheless. If the target of the update happens to be a view instead of a base table, and there is an INSTEAD OF UPDATE trigger defined for the view, the UPDATE will fail with this error message:

     

    Msg 414, Level 16, State 1, Line 1

    UPDATE is not allowed because the statement updates view "v1" which participates in a join and has an INSTEAD OF UPDATE trigger.

     

    Of course, most people will never run into this. But I did have the misfortune of doing so once – unfortunately, I discovered this limitation after rewriting several hundred ANSI standard UPDATE statements to the equivalent UPDATE FROM, and having to convert them all back after as much as a single test…

     

    And that’s why you want to deprecate UPDATE FROM?

     

    Well, no. The view with INSTEAD OF UPDATE trigger won’t affect many people. And the possibility of error can be somewhat thwarted by making sure (and double-checking) to always include all columns of the primary key (or a unique constraint) of the source table. So we’re back to the more principle point of avoiding proprietary syntax if there is an ANSI standard alternative with no or limited negative consequences. And in the case of UPDATE FROM, there are some cases where the standard syntax just doesn’t cut it.

     

    One such scenario is when a file is read in periodically with updated information that has to be pushed into the main table. The code below sets up a simplified example of this – a table Customers, with SSN as its primary key, that stores address and lots of other information, and a table Moved, which is the staging table containing the contents of a file received from a third party listing new address for people who recently moved. I have also included the code to preload the tables with some mocked up data – the Customers table has 10,000 rows, and the Moved table has 3,000 rows, 1,000 of which match an existing row in the Customers table. The others don’t – those people are apparently not our customers.

     

    CREATE TABLE Customers

          (SSN char(9) NOT NULL,

           Street varchar(40) NOT NULL,

           HouseNo int NOT NULL,

           City varchar(40) NOT NULL,

           LotsOfOtherInfo char(250) NOT NULL DEFAULT (''),

           PRIMARY KEY (SSN),

           CHECK (SSN NOT LIKE '%[^0-9]%')

          );

    CREATE TABLE Moved

          (SSN char(9) NOT NULL,

           Street varchar(40) NOT NULL,

           HouseNo int NOT NULL,

           City varchar(40) NOT NULL,

           PRIMARY KEY (SSN),

           CHECK (SSN NOT LIKE '%[^0-9]%')

          );

    go

    INSERT INTO Customers(SSN, Street, HouseNo, City)

    SELECT RIGHT(Number+1000000000,9), 'Street ' + CAST(Number AS varchar(10)),

           Number, 'City ' + CAST(Number AS varchar(10))

    FROM   dbo.Numbers

    WHERE  Number BETWEEN 1 AND 30000

    AND    Number % 3 = 0;

    INSERT INTO Moved(SSN, Street, HouseNo, City)

    SELECT RIGHT(Number+1000000000,9), 'New street ' + CAST(Number AS varchar(10)),

           Number * 2, 'New city ' + CAST(Number AS varchar(10))

    FROM   dbo.Numbers

    WHERE  Number BETWEEN 1 AND 30000

    AND    Number % 10 = 0;

    go 

     

    Since ANSI-standard SQL does not allow a join to be used in the UPDATE statement, we’ll have to use subqueries to find the new information, and to find the rows that need to be updated, resulting in this query:

     

    UPDATE Customers

    SET    Street  = (SELECT Street

                      FROM   Moved AS m

                      WHERE  m.SSN = Customers.SSN),

           HouseNo = (SELECT HouseNo

                      FROM   Moved AS m

                      WHERE  m.SSN = Customers.SSN),

           City    = (SELECT City

                      FROM   Moved AS m

                      WHERE  m.SSN = Customers.SSN)

    WHERE EXISTS     (SELECT *

                      FROM   Moved AS m

                      WHERE  m.SSN = Customers.SSN);

     

    There’s a lot of duplicated code in here. And if we were getting data from a complicated subquery instead of the table Moved, it would be even worse (though we can at least put all the duplicated code in a CTE since SQL Server 2005). Of course, writing the code is done quickly enough once you master the use of copy and paste, but the code has to be maintained as well.

     

    Maybe even worse is that the performance of this query just sucks – if you run this (enclosed in a BEGIN TRAN / ROLLBACK TRAN, so you can run the variations below without having to rebuild the original data) and check out the execution plan, you’ll see that the optimizer needs no less than five table scans (one for Customers, and four for Moved) and four merge join operators. And that, too, would be much worse if the source of the data had been a complex subquery (and no, using a CTE will not help the optimizer find a better plan – it just doesn’t understand that the four subqueries are similar enough that they can be collapsed.

     

    Now, if Microsoft had chosen to implement row-value constructors (as defined in the ANSI standard), we could have simplified this to

     

    UPDATE Customers

    SET   (Street, HouseNo, City)

                   = (SELECT Street, HouseNo, City

                      FROM   Moved AS m

                      WHERE  m.SSN = Customers.SSN)

    WHERE EXISTS     (SELECT *

                      FROM   Moved AS m

                      WHERE  m.SSN = Customers.SSN);

     

    But this is invalid syntax in any version of SQL Server (including the latest CTP for SQL Server 2008), and I know of no plans to change that before SQL Server 2008 RTMs.

     

    But with using the proprietary UPDATE FROM syntax, we can simplify this, and get a much better performance to boot. Here’s how the same update is written in non-portable code:

     

    UPDATE     c

    SET        Street     = m.Street,

               HouseNo    = m.HouseNo,

               City       = m.City

    FROM       Customers AS c

    INNER JOIN Moved     AS m

          ON   m.SSN      = c.SSN;

     

    And now, the optimizer will produce a plan that scans each table only once and has only a single merge join operator. Some quick tests (with much more rows in the tables) show that it executes two to three times quicker than the ANSI standard version. For that performance gain, I will gladly choose the proprietary syntax over the standard!

     

    What’s with the title of this post then? Why deprecate a fine feature?

     

    Patience, we’re getting there. Bear with me.

     

    All the above is true for versions of SQL Server up to SQL Server 2005. But SQL Server 2008 will change the playing field. It introduces a new statement, MERGE, that is specifically designed for situations where rows from a table source either have to be inserted into a destination table, or have to be used to update existing rows in the destination table. However, there is no law that prescribes that any MERGE should always actually include both an insert and an update clause – so with this new statement, we can now rewrite the above code as follows:

     

    MERGE INTO Customers AS c

    USING      Moved     AS m

          ON   m.SSN      = c.SSN

    WHEN MATCHED

    THEN UPDATE

    SET        Street     = m.Street,

               HouseNo    = m.HouseNo,

               City       = m.City;

     

    As you can see, the source table and the join criteria are included only once, just as in the proprietary UPDATE FROM. The execution plan (tested on the February CTP, also known as CTP6) is also quite similar, including just a few extra operators that are specific to the new MERGE statement. What really surprised me, was that the plan for the MERGE statement was estimated to be about 65% cheaper (faster) than the corresponding UPDATE FROM statement. However, I think SQL Server is lying here – a quick test with more data shows only an extremely marginal advantage of MERGE over UPDATE FROM. This test was too limited to draw serious conclusions, but I am quite sure that there will not be a 65% saving by using MERGE over UPDATE FROM. (I do expect such a saving form either MERGE or UPDATE FROM over the ANSI-compliant UPDATE statement for this case).

     

    The good news is that:

    1)      The MERGE statement is described in SQL:2003 and can thus be considered ANSI standard. (In fact, SQL Server implements a superset of the ANSI standard MERGE syntax: everything described in the syntax is implemented, but there are some non-standard extensions that make the command even more useful as well. However, the example above uses only the standard features and should hence run on each DBMS that conforms to the SQL:2003 version of MERGE).

    2)      The MERGE statement will return an error message if I mess up my join criteria so that more than a single row from the source is matched:

    Msg 8672, Level 16, State 1, Line 1

    The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.

    3)      The MERGE statement will gladly accept a view with an INSTEAD OF UPDATE trigger as the target of the update.

     

    So as you see, MERGE allows me to achieve what I previously could achieve only with an ANSI standard UPDATE statement with lots of duplicated code and lousy performance, or with a UPDATE FROM statement that hinders portability, introduces a higher than normal risk of errors going unnoticed through QA right into the production database, and has some odd limitation on views with INSTEAD OF UPDATE triggers. None of these downsides and limitations apply to MERGE. And if there are any other problems with MERGE, I have yet to find them.

     

    With this alternative available, I fail to see any reason why the proprietary UPDATE FROM syntax should be maintained. In my opinion, it can safely be marked as deprecated in SQL Server 2008. It should of course still work, as “normal” supported syntax in both SQL Server 2008 and the next version, and in at least one version more if the database is set to a lower compatibility – but it should be marked as deprecated, and it should eventually be removed from the product. Why waste resources on maintaining that functionality, when there is an alternative that is better in every conceivable way? I’d much rather see the SQL Server team spend their time and energy on more important stuff, such as full support for row-value constructors and full support for the OVER() clause. Or maybe even on releasing Service Pack 3 for SQL Server 2005!


  • Want a Service Pack? Ask for it!

    Service pack 2 for SQL Server 2005 is already 11 months old. And there is still no sign of service pack 3 on the horizon. Why is that? Has Microsoft managed to release a perfect, completely bug-free product? No, of course not – with the size and complexity of a product such as SQL Server is, that will simply never happen.

     

    There have, in fact, enormous numbers of bugs been uncovered and fixed since SP2 was released. And roughly once every two months, a so-called “cumulative update package” gets released. The last one is officially called “Cumulative update package 5 for SQL Server 2005 Service Pack 2”, or simply CU5 for friends. Quite a mouthful. But if you think that name is long, check out the list of bugs that CU5 fixes!

     

    I think that it’s great that Microsoft now releases these cumulative update packages at regular intervals. I see them as a good addition that nicely fits in between hot-fixes for quick fixes with limited testing, only for those needing it, on one side, and fully tested service packs that are released once or at most twice per year on the other side.

     

    Given the long list of bugs fixed in CU5, should everyone be recommended to install it, just as if it were a service pack? Well, no. Microsoft themselves advise against this. In fact, you can’t even just download and install the package; you have to “submit a request to Microsoft Online Customer Services to obtain the cumulative update package”. This quote comes directly from the Knowledge Base article for CU5, as come these further disclaimers:

    it is intended to correct only the problems that are described in this article. Apply it only to systems that are experiencing these specific problems

    if you are not severely affected by any of these problems, we recommend that you wait for the next SQL Server 2005 service pack that contains the hotfixes in this cumulative update package

     

    These quotes, which have been in the same or a similar form in all cumulative updates, make it pretty clear that I should wait for the next service pack. So we waited. And waited. And waited some more. And then, some MVP's got impatient and suggested (in the MVP newsgroup) to release SP3 as soon as possible. The answer surprised me – apparently, Microsoft has no plans yet to release a new service pack, because not enough customers have asked for it. (MS: “Good, but honestly, at least our management says, we're not getting feedback requesting SP3 from enough customers to require it” – Me: “Is that reason under NDA? Because if it's not, I can post a blog entry supporting people to write MS management asking for SP3” – MS: “As far as I know that's a public response”). So, the KB article says to wait, all customers do as asked and then Microsoft concludes that nobody wants a service pack because nobody asks for it? And I misunderstood when I thought that “any hotfix that is provided in a SQL Server service pack is included in the next SQL Server service pack” implies that there will actually be a next service pack. Apparently, my graps of the English language is not as good as I’d like to believe…

     

    Anyway, I now understand that Microsoft will only release a new service pack if enough people ask for it. So I’ve decided to make sure that they get the message. I’ve gone to Connect and filed a suggestion to release Service Pack for SQL Server 2005, including all changes up to and including CU5.

     

    If you read this, and you agree with me that Service Pack 3 for SQL Server 2005 is overdue, you now know what to do – log in to Connect and vote 5 for my suggestion. And if you think that I’m losing my marbles and that there should be no Service Pack 3, then you should log in to Connect and vote 1. In short: make yourself heard!

     

    Microsoft will not release service pack 3 because insufficient customer are asking for it? Well, I’m asking – can we have service pack 3, please? Pretty please? Pretty pretty pretty pretty pretty please????

     

    With sugar on top…
  • Bin packing part 3: Need for speed

    In the first post of this series, I explained the bin-packing problem and established a baseline solution. The second post investigated ways to increase the packing efficiency. In none of these posts did I pay particular attention to performance – and frankly, it shows. Performance of all solutions presented thus far sucks eggs. Time to see what can be done about that.

     

    If you look in detail at the most efficient solution so far (dbo.OrderDesc, as described in the second post of the series), you’ll see that all 40,000 registrations are processed one by one, with a cursor. No surprise here, as I’ve already promised to postpone the set-based solution to a later moment. For each of these 40,000 registrations, the following actions are executed:

     

    ·        The dbo.Sessions table is queried to find a session for the correct year and quarter that still has enough space left for the current registration.

    ·        When needed, a new session is added to the dbo.Sessions table.

    ·        The registration is updated in the dbo.Registrations table. Well, updating all 40,000 registrations is pretty hard to avoid as the goal is to assign each registration to a session, and as long as the processing is cursor-based, the updates will inevitably come in one by one.

    ·        The chosen session is updated in the dbo.Sessions table to reduce the space left by the size of registration that was just assigned to the session.

     

    Improve indexes

     

    Indexing is one of the most common answers to performance problems. In this case, based on the above summary of actions taken in the loop, it’s clear that adding or changing an index on dbo.Registrations won’t do much. The update of this table is currently based on the clustered index key, which is the fastest possible way to perform an update. The cursor requires a table scan (to read all rows) and a sort (to order them by year, quarter, and descending number of candidates); the sort can be avoided if the clustered index is on these three columns, but at the price of 40,000 bookmark lookups for the updates – I don’t need to run a test to see that this is a bad idea!

     

    The dbo.Sessions table is less clear-cut. The clustered index is on (Year, Quarter, SessionNo), so searching for a session with enough room for a registration currently seeks the clustered index to process only the sessions of a single quarter, but it still has to scan through sessions in the quarter until it finds one with enough space. A nonclustered index on (Year, Quarter, SpaceLeft) will speed up this process, especially since it is covering for the query (the query uses the SessionNo column as well, but that column is included in the nonclustered index as part of the reference to the clustered index key). The downside to this index is that it has to be updated each time the space left in a session changes and each time a session is added. So, the question to answer is whether gained when searching for a session to add a registration to outweighs the performance lost during updates. To find out, I added this index before repeating the tests of dbo.OrderDesc:

     

    CREATE NONCLUSTERED INDEX ix_Sessions

    ON dbo.Sessions (Year, Quarter, SpaceLeft);

     

    The packing efficiency didn’t change. The execution time did change, though – with the index in place, the average execution time of five test runs of dbo.OrderDesc was down to 68,300 ms, a 12.4% improvement over the execution time without the index. Clearly, the extra overhead incurred on the UPDATE statements is less than the savings on the SELECT statements.

     

    Note that I create the index before starting the test. I assume that the index can be added permanently to the database – if the database frequently updates the dbo.Sessions table at other moments, when the bin packing procedure is not executing, it might make more sense to create it at the start of this procedure and remove it when done – and in that case, the time taken for creating and dropping the index (less than 100 ms) should be added in.

     

    For completeness, I also tested the dbo.Order50FirstB version, that proved to be a good trade-off between performance and efficiency in the previous post. This version should of course see a similar performance benefit from the additional index, and indeed it does – the average execution time for dbo.Order50FirstB was down to 68,144 ms after adding the index, a saving of 9.8%.

     

    If you’re duplicating the tests on your own server, then don’t forget to remove the index – we won’t need it in the following step, and the overhead of maintaining it would just waste precious time.

     

    DROP INDEX dbo.Sessions.ix_Sessions;

     

    Do less work …

     

    Saving 12.4% execution time is great – but (of course) not enough to satisfy me! So it’s time to take another approach: let’s see if I can’t find any way to reduce the number of times the dbo.Sessions table is searched and updated. How, you might ask? Well, let’s again check how a human would operate. If I have to stow packages in boxes, I’d probably keep adding packages to the same box until I get a package that won’t fit. Only when I get a package that doesn’t fit the box anymore would I put the box away and open an new box. I have coded the T-SQL equivalent of this method, and called it dbo.FillThenNext (see FillThenNext.sql in the attached ZIP file).

     

    The average execution time for this simple algorithm turned out to be just 39,110 ms, so this saves 42.6% over dbo.OrderDesc with index, or 47.0% over the baseline. But the packing efficiency turns out to be really down the drain with this one:

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       6670        75.364617        16431.800000

    2       3548        78.769447        7532.600000

    3       8066        75.025787        20144.200000

    4       5346        78.283015        11609.900000

    ALL     23630       76.420440        13929.625000

     

     

    A total of 23,630 sessions exceeds the number of sessions required by the baseline by 21.4%, and that of dbo.OrderDesc by 24.9%. What a high price to pay for a speed advantage!

     

    … but do it smart

     

    The reason for this enormous efficiency loss is that I became very wasteful. Suppose that the first registration processed is for 20 persons, and the second for 85. The second can not be combined with the first in a single session, so the algorithm opens a new session – and never again looks at the first session, even though there still are 80 seats left! That is of course a HUGE waste of resources. So I modified the algorithm. I still have a “current” session that I keep adding registrations to as long as I can, but if I find a registration that won’t fit I now first search the previous sessions for one with enough empty seats before closing the current session and opening a new one. The code for this version (dbo.FillThenSearch (is in FillThenSearch.sql in the attached ZIP file. And the test results are really awesome! The average execution time is now 45,506 ms (with the nonclustered index back in place – without it, execution time is 50,874 ms), which is of course slightly slower than my previous attempt but still 33.4% faster than dbo.OrderDesc with index, and 38.3% faster than the baseline. But the packing is much better:

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       5399        93.106501        3721.800000

    2       2863        97.615787        682.600000

    3       6945        87.135781        8934.200000

    4       4542        92.140246        3569.900000

    ALL     19749       91.438300        4227.125000

     

    This version still doesn’t pack as efficient as dbo.OrderDesc (4.4% more sessions) or the baseline (1.5% more sessions) – but saving 33.4% execution time at the price of only 4.4% more sessions sounds like a proposition that deserves serious consideration, unlike the previous attempt!

     

    Revive an old trick

     

    If you have checked the source code, you may have noticed that I have once more removed the extra ordering that I added in the previous installment. I did that on purpose, because this extra ordering

    a)      decreased performance – I want to increase performance in this part, and

    b)      is not guaranteed to have the same effects in a different packing algorithm.

     

    But I did not want to end this post without at least testing the effect of adding back in the ordering that proved most efficient in the previous episode, so I re-introduced the sorting by descending number of candidates in dbo.FillThenSearchDesc (FillThenSearchDesc.sql in the ZIP file), and I discovered that this might be the best tradeoff so far – only 2 sessions more than dbo.OrderDesc, at only 48,626 ms (28.6% less than dbo.OrderDesc – with the nonclustered index still in place, of course).

     

    Quarter NumSessions AvgSessionSize   AvgEmptySeats

    ------- ----------- ---------------- ----------------

    1       5087        98.816984        601.800000

    2       2799        99.847802        42.600000

    3       6771        89.374981        7194.200000

    4       4268        98.055529        829.900000

    ALL     18925       95.419550        2167.125000

     

    Best options so far

     

    After having investigated so many different options, it’s all too easy to lose track. The table below lists the versions most worth remembering – except for the baseline, I did not include any version for which there was another version that produces less sessions in less time. The remaining sessions are the one you’ll have to choose from, making your own trade-off between saving sessions or saving execution time.

     

    Version

    Execution time (ms)

    Number of sessions

    Baseline

    73,759

    19,457