THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Alexander Kuznetsov

  • On filtered indexes and defensive coding

    When one and the same constant is copied and pasted in more than one place, there is always a chance that we can change it in one place and fail to change in another, resulting in a discrepancy.

    For example, recently I read a very interesting post by Michelle Ufford, Filtered Indexes Work-Around.

    I will not repeat the whole post here, I encourage you to read it in full, but here are the relevant parts. There is a filtered index:

    CREATE NONCLUSTERED INDEX MyLatestData
                
    ON dbo.myTable(myDate)
                
    Include (myData)
        
    WHERE myDate >= '20100127'

    The following stored procedure does not use it, because the optimizer does not know if the parameter is before the cutoff date for the filtered index:

    CREATE PROCEDURE SelectLatestData
        
    @DateFrom DATETIME
                
    AS
    SELECT DISTINCT 
    myData
        
    FROM dbo.filteredIndexTest
        
    WHERE myDate >= @DateFrom

    This is where Michelle analyzed several approaches. The following is my suggestion, which has problems, yet Michelle was kind enough to incorporate in her post:

    ALTER PROCEDURE SelectLatestData
        
    @DateFrom DATETIME
                
    AS
    SELECT DISTINCT 
    myData
        
    FROM dbo.filteredIndexTest
        
    WHERE myDate >= @DateFrom
        
    -- this ensures that the index is used
            
    AND MyDate >= '20100127' 

    By repeating the cutoff date from the definition of the filter index in the WHERE clause of this procedure, we explicitly tell the optimizer that that filtered index can always satisfy it regardless of the value of the parameter. However, there is a problem with my suggestion: what happens when later we change the filter of the index, as follows:

    CREATE NONCLUSTERED INDEX MyLatestData
                
    ON dbo.myTable(myDate)
                
    Include (myData)
        
    WHERE myDate >= '20100220'

    Our stored procedure will still work, but it will not use that new index any more.Suppose that we still want to use the stored procedure to get the latest data, and to use the lean filtered index for that purpose.

    Clearly we should change the stored procedure at the same time when we replace the index. How can we manage that? If the system is large, it is impossible to remember all these details.

    The way I deal with such problems is this:

    I have macros, and I apply C++ preprocessor against my sql source code when I build. So, in my definitions.h file I will have this line:

    #define LATEST_DATA_CUTOFF '20100127' 

    In my sql code, I will include definitions.h, just like I would do in C++ code. Both the index definition and the stored procedure will use the macro LATEST_DATA_CUTOFF. When I change the macro definition, the preprocessor will take care of all the changes in all the places the macro is used. SQL Compare will take care of the deployment.

    How do you solve such problems?

     

     


  • Multi-statement TVFs are essentially slowish nested loops.

    Whenever we are using multi-statement TVFs, we are essentially forcing nested loops logic on the database engine. Although multi-statement TVFs are smart enough and do not always execute once per row, when they do so, they may be much slower than nested loops.

    As usual, inline UDFs shine as compared to multi-statement ones, at least in all the benchmarks in this post - let us run some tests and see for ourselves.

     

     Setting up data

     

    First of all, here is a table, rather large, 512K rows, and rather wide, less than 20 rows per page, so that reading from it via nested loops is a natural choice in many cases:


     

    CREATE TABLE dbo.Parent
        
    (
          
    ID INT NOT NULL
                 
    PRIMARY KEY ,
          
    ParentNumber INT NOT NULL,
          
    SpaceFiller CHAR(500) NOT NULL
        ) ;
    GO
    DECLARE @adder INT ;
    SET @adder ;
    INSERT  INTO dbo.Parent
            
    IDParentNumberSpaceFiller )
    VALUES  11'#' ) ;

    WHILE @adder 500000 
        
    BEGIN ;
            
    INSERT  INTO dbo.Parent
                    
    ID ,
                      
    ParentNumber ,
                      
    SpaceFiller
                    
    )
                    
    SELECT  ID @adder ,
                            
    ParentNumber ,
                            
    SpaceFiller
                    
    FROM    dbo.Parent ;
            
    SET @adder @adder ;
        
    END ;
    GO
      

    The Child table is much smaller and narrow:

    CREATE TABLE dbo.Child
        
    (
          
    ID INT NOT NULL
                 
    PRIMARY KEY ,
          
    ParentID INT NOT NULL ,
          
    ChildNumber INT NOT NULL UNIQUE
        
    ) ;   
    GO

    INSERT INTO dbo.Child
            
    IDParentIDChildNumber )
    SELECT ID-- ID - int
              
    ID-- ParentID - int
              
    ID  -- ChildNumber - int
    FROM dbo.Parent WHERE ID BETWEEN AND 1000 

     

     Comparing nested loops vs. multi-statement UDF calls

     

    Here are the two UDFs which we are  going to benchmark:

     CREATE FUNCTION dbo.GetParentNumber_Multiline @ParentID INT )
    RETURNS @parentInfo TABLE ParentNumber INT )
    AS 
        BEGIN 
    ;
            
    INSERT  @parentInfo
                    
    ParentNumber
                    
    )
                    
    SELECT  ParentNumber
                    
    FROM    dbo.Parent
                    
    WHERE   ID @ParentID ;
            
    RETURN ;                
        
    END ;
    GO

    CREATE FUNCTION GetParentNumber_Inline @ParentID INT )
    RETURNS TABLE
    AS
    RETURN
        
    SELECT    ParentNumber
          
    FROM      dbo.Parent
          
    WHERE     ID @ParentID
        
    ) ;
     

    Let us fire up the Profiler, make sure it records individual statements, and run the following script:

    SELECT IDParentIDp.ParentNumber
    INTO #t1
    FROM dbo.Child
    CROSS APPLY dbo.GetParentNumber_Multiline ParentID AS ;

    DROP TABLE #t;

    SELECT IDParentIDp.ParentNumber
    INTO #t1
    FROM dbo.Child
    CROSS APPLY dbo.GetParentNumber_Inline ParentID AS p

    DROP TABLE #t

    In the Profiler, we shall see that the multi-statement UDF has been invoked 1000 times, which is slow:

    Multiline stats:

       CPU time = 312 ms,  elapsed time = 696 ms.

    Inline stats:

       CPU time = 0 ms,  elapsed time = 7 ms.
     

    Clearly the difference in performance is dramatic, while the plan is essentially the same - nested loops. We can highlight the query, compare the execution plans, and see for ourselves.

     

    Multi-statement UDFs do not always run once per row

     

    While we are at it, let us bust a myth: multi-statement UDFs do not have to run once per row. Let us have all child rows refer to only one parent:

    UPDATE dbo.Child SET ParentID 

    Let us rerun our benchmarks. In the profiler we shall see that the multi-statement UDF has been invoked only once. Execution costs are also quite comparable this time for both scripts:

       CPU time = 15 ms,  elapsed time = 21 ms.

    As we have seen, the database engine is smart enough and may sometimes execute multi-statement UDFs once per distinct set of parameters, not once per row.

     

    Multi-statement UDFs essentially force nested loops on the engine

     

    Let us set up test data differently, so that the second benchmark (the one using inline UDF) executes as a hash join:

     

    UPDATE dbo.Child SET ParentID ID 

    TRUNCATE TABLE dbo.Parent ;
    GO
    INSERT INTO dbo.Parent
            
    IDParentNumberSpaceFiller )
    SELECT IDID'*'        
    FROM dbo.Child ;
      

    The first benchmark still executes as a nested loop, and is much slower

    Multi-statement UDF, nested loops:
       CPU time = 266 ms,  elapsed time = 674 ms.
       
    Inline UDF, hash join:
       CPU time = 0 ms,  elapsed time = 5 ms.

     

    Conclusion

     

    As we have seen, if we want to force nested loops, we can use multi-statement UDFs.Also we have seen that those multi-statement UDFs can be very slow.

    However, let us be cautious: when we want to benchmark real life solutions, we do not want to do it with the Profiler running, like I did in this post. Also, as Dave Ballantine has correctly pointed out, in some cases STATISTICS TIME may skew the results just as well.

     


  • Don't swap horses in midstream.

    We can begin a transaction under snapshot isolation, but we cannot switch to it in the middle of an outstanding transaction. For example, the following procedure looks good and passes a smoke test:

     

     CREATE PROCEDURE dbo.SelectCountry 
       
    @CountrySymbol CHAR(2)
    AS 
        BEGIN 
    ;
            
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT ;
            
    SELECT  CountrySymbol ,
                    
    Description
            
    FROM    data.Countries
            
    WHERE   CountrySymbol @CountrySymbol ;
        
    END ;
    GO

    -- Smoke test: this call completes
    EXECUTE dbo.SelectCountry @CountrySymbol 'US' ;
     

     

    However, the procedure fails if it is invoked in the middle of an outstanding transaction that did not begin under snapshot isolation level:

     

    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
    BEGIN TRANSACTION ;
    EXECUTE dbo.SelectCountry @CountrySymbol 'US' ;
    COMMIT ;
      

     Msg 3951, Level 16, State 1, Procedure SelectCountry, Line 6
    Transaction failed in database 'Test' because the statement was run under snapshot isolation but the transaction did not start in snapshot isolation. You cannot change the isolation level of the transaction to snapshot after the transaction has started unless the transaction was originally started under snapshot isolation level.

    Apparently the author of this stored procedure overlooked the possibility that it can be invoked in an outstanding transaction that did not begin under snapshot isolation level. Is is very easy to improve the robustness of this stored procedure:

    ALTER PROCEDURE dbo.SelectCountry @CountrySymbol CHAR(2)
    AS 
        BEGIN 
    ;
            
    IF @@TRANCOUNT 
                
    BEGIN ;
                    
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT ;
                
    END ;
            
    SELECT  CountrySymbol ,
                    
    Description
            
    FROM    data.Countries
            
    WHERE   CountrySymbol @CountrySymbol ;
        
    END ;

     

    As usual in defensive programming, let us not stop at this - we also need to review all other modules and proactively apply the same fix wherever appropriate.

     


  • Correlated subqueries do not execute once per row.

    Correlated subqueries do not have to execute once per row - on the contrary, they are equivalent to outer joins, and they may have the same execution plans and the same real execution costs (if we retrieve only one column via a correlated subquery, of course)

    Let me provide a script which compares the performance of a query with one correlated subquery vs. an equivalent out join. First, we need two tables for our example:

    CREATE TABLE dbo.Parent
        
    (
          
    ID INT NOT NULL
                 
    PRIMARY KEY ,
          
    ParentNumber INT NOT NULL
        ) ;
    GO
    CREATE TABLE dbo.Child
        
    (
          
    ID INT NOT NULL
                 
    PRIMARY KEY ,
          
    ParentID INT NOT NULL ,
          
    ChildNumber INT NOT NULL
        ) ;
      

    Next, let us populate them with 512K rows each:

    DECLARE @adder INT ;
    SET @adder ;
    INSERT  INTO dbo.Parent
            
    IDParentNumber )
    VALUES  1) ;

    WHILE @adder 500000 
        
    BEGIN ;
            
    INSERT  INTO dbo.Parent
                    
    ID ,
                      
    ParentNumber
                    
    )
                    
    SELECT  ID @adder ,
                            
    ParentNumber
                    
    FROM    dbo.Parent ;
            
    SET @adder @adder ;
        
    END ;
    GO

    INSERT INTO dbo.Child
            
    IDParentIDChildNumber )
    SELECT ID-- ID - int
              
    ID-- ParentID - int
              
    0  -- ChildNumber - int
    FROM dbo.Parent 

    We want to compare the performance of two queries without the overhead of retrieveing and or materializing result sets, so I will make sure the result sets are empty. The correlated subquery approach is as follows:

    SET STATISTICS IO ON ;
    SET STATISTICS TIME ON ;
    GO
                   
    SELECT  ChildNumber ,
            
    ParentNumber
    INTO    #t1
    FROM    SELECT    ChildNumber ,
                        ( 
    SELECT    ParentNumber
                          
    FROM      dbo.Parent AS p
                          
    WHERE     p.ID c.ParentID
                        
    AS ParentNumber
              
    FROM      dbo.Child AS c
            
    AS t
    -- no rows meet this criteria
    WHERE   ChildNumber ParentNumber -123 

    The execution costs are:

     Table 'Child'. Scan count 9, logical reads 2502, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Parent'. Scan count 9, logical reads 2215, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    The equivalent outer join looks like this (run it in another tab):

    SET STATISTICS IO ON ;
    SET STATISTICS TIME ON ;
    GO

    SELECT  ChildNumber ,
            
    ParentNumber
    INTO    #t2
    FROM    SELECT    ChildNumber ,
                        
    ParentNumber
              
    FROM      dbo.Child AS c
                        
    LEFT OUTER JOIN dbo.Parent AS ON p.ID c.ParentID
            
    AS t
    -- no rows meet this criteria
    WHERE   ChildNumber ParentNumber -123 ;

    Its real execution costs are the same.

    More to the point, when I hit Ctrl+L in both tabs, I see the same execution plan.

    Conclusion 

    As we have seen, correlated subqueries do not have to execute once per each row. In fact, they are just macros - the optimizer can flatten them out and rewrite the queries involving them as outer joins.

     

     

     

     

     

     


  • When acquiring locks in the same order is not possible or not feasible.

    To avoid deadlocks, one of the most common recommendations is "to acquire locks in the same order" or "access objects in the same order". Clearly this makes perfect sense, but is it always feasible? Is it always possible? I keep encountering cases when I cannot follow this advice.

    If I store an object in one parent table and one or more child ones, I cannot follow this advice at all. When inserting, I need to insert my parent row first. When deleting, I have to do it in the opposite order.

    If I use commands that touch multiple tables or multiple rows in one table, then usually I have no control in which order locks are acquired, (assuming that I am not using hints).

    So, in many cases trying to acquire locks in the same order does not prevent all deadlocks. So, we need some kind of handling deadlocks anyway - we cannot assume that we can eliminate them all. Unless, of course, we serialize all access using Service Broker or sp_getapplock.

     

     


  • T-SQL Tuesday #002: patterns that do not work as expected.

    Neither

    UPDATE … IF (@@ROWCOUNT = 0) INSERT

    nor

    IF EXISTS(...) UPDATE ELSE INSERT

    patterns work as expected under high concurrency. Both may fail. Both may fail very frequently. MERGE is the king - it holds up much better.Let us do some stress testing and see for ourselves.

    Here is the table we shall be using:

    CREATE TABLE dbo.TwoINTs
        
    (
          
    ID INT NOT NULL PRIMARY KEY,
          
    i1 INT NOT NULL ,
          
    i2 INT NOT NULL ,
          
    version ROWVERSION
        
    ) ;
    GO

    INSERT  INTO dbo.TwoINTs
            
    IDi1i2 )
    VALUES  10) ;    

     

     

    IF EXISTS(…) THEN pattern frequently fails under high concurrency.

     

    Let us insert or update rows in a loop using the following simple logic: if a row with given ID exists, update it, and otherwise insert a new one. The following loop implements this logic. Cut and paste it into two tabs, switch into text mode in both tabs, and run them simultaneously.

    -- hit Ctrl+T to execute in text mode

    SET NOCOUNT ON ;

    DECLARE @ID INT ;

    SET @ID ;
    WHILE @ID > -100000
        
    BEGIN ;
            
    SET @ID SELECT  MIN(ID)
                        
    FROM    dbo.TwoINTs
                      
    ) - ;
            BEGIN TRY ;

                
    BEGIN TRANSACTION ;
                
    IF EXISTS ( SELECT  *
                            
    FROM    dbo.TwoINTs
                            
    WHERE   ID @ID )
                    
    BEGIN ;
                        
    UPDATE  dbo.TwoINTs
                        
    SET     i1 1
                        
    WHERE   ID @ID ;
                    
    END ;
                
    ELSE
                    BEGIN 
    ;
                        
    INSERT  INTO dbo.TwoINTs
                                
    IDi1i2 )
                        
    VALUES  @ID0) ;
                    
    END ;
                
    COMMIT 
            
    END TRY
            
    BEGIN CATCH ;
                
    ROLLBACK 
                
    SELECT  error_message() ;
           
    END CATCH ;
        
    END 

    When we run this script simultaneously in two tabs, we shall immediately get a huge amount of primary key violations in both tabs. This demonstrates how unreliable the IF EXISTS pattern is when it executes under high concurrency.

    Note: this example also demonstrates that it is not safe to use SELECT MAX(ID)+1 or SELECT MIN(ID)-1  as the next available unique value if we do it under concurrency.

     

    UPDATE … IF (@@ROWCOUNT = 0) BEGIN pattern is also unreliable.

     

    Another common approach is to update a row first, and if no row was updated, insert it. It is also unreliable. Before demonstrating that, let us delete the rows inserted by the previous example:

    DELETE  FROM dbo.TwoINTs WHERE   ID < 1 ;

    Let us modify the loop which we ran in the previous example, as follows:

    -- hit Ctrl+T to execute in text mode

    SET NOCOUNT ON ;
    DECLARE @ID INT ;
    SET @ID ;
     
    WHILE @ID > -100000
        
    BEGIN ;
            
    SET @ID SELECT  MIN(ID)
                        
    FROM    dbo.TwoINTs
                      
    ) - ;
            BEGIN TRY ;
                
                
    BEGIN TRANSACTION ;
          
                
    UPDATE  dbo.TwoINTs
                
    SET     i1 1
                
    WHERE   ID @ID ;
                
    IF @@ROWCOUNT )
                    
    BEGIN ;
                        
    INSERT  INTO dbo.TwoINTs
                                
    IDi1i2 )
                        
    VALUES  @ID0) ;
                    
    END ;
                
    COMMIT 
            
    END TRY
            
    BEGIN CATCH ;
                
    SELECT  error_message() ;  
                
    ROLLBACK ;     
            
    END CATCH ;

        END ;   

     

    When we run this script simultaneously from two tabs, we are getting lots of primary key violations, just like when we ran our previous example.

    As we have seen, the UPDATE … IF (@@ROWCOUNT = 0) pattern is unreliable under high concurrency too.

     

    MERGE holds up perfectly well

     

     Again, let us delete the rows inserted by the previous example:

    DELETE  FROM dbo.TwoINTs WHERE   ID < 1 ;

     Rerun the loop using MERGE:

    -- hit Ctrl+T to execute in text mode
    SET NOCOUNT ON ;
    DECLARE @ID INT ;
    SET @ID ;
     
    WHILE @ID > -100000
        
    BEGIN ;
            
    SET @ID SELECT  MIN(ID)
                        
    FROM    dbo.TwoINTs
                      
    ) - ;
            
    BEGIN TRY ;
           
                
    MERGE dbo.TwoINTs AS target
                    USING
                        
    SELECT    @ID ,
                                    
    ,
                                    
    0
                        
    AS source IDi1i2 )
                    
    ON target.ID source.ID )
                    
    WHEN MATCHED
                        
    THEN
            UPDATE           SET
                    
    i1 1
                    
    WHEN NOT MATCHED
                        
    THEN 
              INSERT    
    IDi1i2 )
                             
    VALUES
                        
    @ID ,
                          
    ,
                          
    0
                        
    ) ;
     
            
    END TRY
            
    BEGIN CATCH ;
                
    SELECT  error_message() ;       
            
    END CATCH ;
        
    END ;
      

    In the context of our loop, MERGE always completes without a single error!

    As we have seen, in this particular case MERGE holds up under high concurrency perfectly well. Of course, this does not mean that we can always use this new command without stress testing. Yet we should at least consider using it whenever we insert or update under high concurrency, with usual precautions and after thorough testing.

    It is fun to be blogging again. I kind of neglected blogging for some time, because I am writing a book on defensive programming, and that requires a huge amount of time. Anyway, it feels good to be back. The book should be finished soon, and I'll start contributing here more.

     

     This post participates in the second SQL Tuesday


  • Retrying after deadlocks leads to lost updates

    When we retry after deadlocks, we are very likely to overwrite other processes' changes. We need to be aware that very likely someone else modified the data we intended to modify. Especially if all the readers run under snapshot isolation, then readers cannot be involved in deadlocks, which means that all the parties involved in a deadlock are writers, modified or attempted to modify the same data. If we just catch the exception and automatically retry, we can overwrite someone else's changes.

    This is called lost updates, and this is usually wrong. Typically the right thing to do after a deadlock is to retry on a much higher level - re-select the data and decide whether to save in the same way the original decision to save was made.

    For example, if a user pushed a Save button and the saving transaction was chosen as a deadlock victim, it might be a good idea to re-display the data on the screen as of after the deadlock.

     


  • To design or not to design?

    Clearly Linux is one of the most successful products ever. Let me quote a little bit from Linus Torvalds and other brilliant people involved in it:

    "A strong vision and a sure hand sound like good things on paper. It's just
    that I have never _ever_ met a technical person (including me) whom I
    would trust to know what is really the right thing to do in the long run.

    Too strong a strong vision can kill you - you'll walk right over the edge,
    firm in the knowledge of the path in front of you."

    here

    "The most successful software have born from fixing/patching an
    initial/simple implementation while the greatest software failures
    have born from deep planning and design
    . "

    here

    More specifically, the following article is a great read

    Evolutionary Database Design

     

     

     

     

     

     

     


  • T-SQL Tuesday #001: Yesterday it worked, today it's not working...

    But did it actually work yesterday?

    If the day is Friday, Noverber 13th, or the 13th day of any other month, and your query blows up for the first time, one place to search for is character strings converted to datetime values. The following script illustrates the problem:

     -- yesterday it worked, or did it?
    SET LANGUAGE US_English;
    SELECT CAST('11/12/2009' AS DATETIME);


    SET LANGUAGE Norwegian;
    SELECT CAST('11/12/2009' AS DATETIME);

    Changed language setting to us_english.

    -----------------------
    2009-11-12 00:00:00.000

    (1 row(s) affected)

    Changed language setting to Norsk.

    -----------------------
    2009-12-11 00:00:00.000

    (1 row(s) affected)

     

    -- today it's not working
    SET LANGUAGE US_English;
    SELECT CAST('11/13/2009' AS DATETIME);

    SET LANGUAGE Norwegian;
    SELECT CAST('11/13/2009' AS DATETIME);
                
     

     Changed language setting to us_english.

    -----------------------
    2009-11-13 00:00:00.000

    (1 row(s) affected)

    Changed language setting to Norsk.

    -----------------------
    Msg 242, Level 16, State 3, Line 11
    The conversion of a char data type to a datetime data type resulted in an out-of-range datetime value.

    Note that the query did not quite work yesterday either - although it did not raise errors, it was selecting the wrong data.


  • PIVOting dense data may speed up your queires

    Of course, PIVOting uses up some CPU. However, if the data is dense (all the cells in the pivoted result set are not NULL), then the size of the pivoted result set may be significantly less. As a result,  the overall time to retrieve a pivoted result set and transmit it over the network may be less. Here are my benchmarks.

    Let us create a helper table with 1M numbers:

    CREATE TABLE dbo.Numbers(INT NOT NULL PRIMARY KEY);
    GO
                
    TRUNCATE TABLE dbo.Numbers;
    INSERT INTO dbo.Numbers
        
    )
            
    VALUES  1
        
    );
    GO
    DECLARE @i INT;
        
    SET @i=0;
    WHILE @i<21 
        
    BEGIN
        INSERT INTO 
    dbo.Numbers
            
    )
            
    SELECT POWER(2@i)
            
    FROM dbo.Numbers;
        
    SET @i @i 1;
        
    END;    

    Let us create and populate an typical EAV table:

    CREATE TABLE dbo.ObjectAttributes(ObjectID INT NOT NULL, AttributeID INT NOT NULL, IntValue INT NOT NULL,
                    
    CONSTRAINT PK_ObjectAttributes PRIMARY KEY(ObjectIDAttributeID));
    GO
                
    TRUNCATE TABLE dbo.ObjectAttributes;
    INSERT INTO dbo.ObjectAttributes(ObjectIDAttributeIDIntValue)
        
    SELECT n1.nn2.nn1.n
        
    FROM dbo.Numbers AS n1
                
    CROSS JOIN
                
    dbo.Numbers AS n2
        
    WHERE n1.n <100001 
            
    AND n2.n<11;

    Let us select all the data from it as is a couple of times:

    SET STATISTICS TIME ON;
    SELECT 
        
    FROM dbo.ObjectAttributes;

    SELECT 
        
    FROM dbo.ObjectAttributes;
     

     SQL Server Execution Times:
       CPU time = 157 ms,  elapsed time = 8369 ms.
       
     SQL Server Execution Times:
       CPU time = 266 ms,  elapsed time = 5935 ms.

    Let us select pivoted data:

    SELECT ObjectID[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]
        
    FROM dbo.ObjectAttributes
                PIVOT
    MAX(IntValueFOR AttributeID IN([1],[2],[3],[4],[5],[6],[7],[8],[9],[10])) AS pvt
        
    ORDER BY pvt.ObjectID;

    Although the pivoted query used more CPU, the overall time was substantially less:

    SQL Server Execution Times:
       CPU time = 1219 ms,  elapsed time = 1794 ms.
     

    I was inspired by the following Aaron's post:

    What is so bad about EAV, anyway?

     

     

     

     

     

     


  • Speaking in Chicago on Sep 10

    I'll be speaking about Defensive Database Programming in Chicago on Sep 10 at 5:30PM.

    http://chicago.sqlpass.org/

    The seating is limited to 50, You must RSVP to attend

     


  • When you add an index and your query blows up...

    You cannot assume that the conditions in your WHERE clause will evaluate in the left-to-write order - making such assumptions leads to unsafe queries. For example, the following query is not safe:

    SELECT FROM dbo.Messages
      WHERE ISDATE(VarcharColumn)=
        
    AND CAST(
    VarcharColumn) AS DATETIME)='20090707'
     

    It can blow up at any time, and the reason is simple: the conditions in your WHERE clause can evaluate in any order, and the order can change the next time your query executes. If an invalid value casts before the check if it is valid, the query will blow up. Of course, I am not the first to point this out. Right from the top of my head I recall that you can find similar recommendations in Itzik Ben Gan's book on T-SQl Fundamentals, and in a recent Plamen Ratchev's post "Predicates in SQL". However, I would like to demonstrate how little it may take to break such code, how brittle it may be. I will provide a repro script in which a query originally succeeds, but fails after I have added an index.

     

    Do not make assumptions which you cannot guarantee

     

    Whenever you write such code as the previous sample, you clearly assume that the first condition, the call to ISDATE, will evaluate before the CAST. However, this assumption cannot be guaranteed, and when this assumption is not true, the query blows up. Even if this query always succeeds in your test runs, you cannot assume that it will always succeed in the future. The safe way to ensure that invalid values are never cast to datetime is this:

     

    SELECT FROM dbo.Messages
      
    WHERE CASE WHEN ISDATE(
    VarcharColumn)=1
        
    THEN CAST(
    VarcharColumn AS DATETIMEEND ='20090707'

     

    It is safe becasue it is documented that CASE expression evaluates the WHEN clause before evaluating the THEN clause.

     

    When order in which conditions are evaluated changes - a repro script.

     

    I am going to provide a repro script demostrating that provide that the order in which conditions are evaluated can change. However, I have to begin with a big disclaimer: there is no guarantee that this repro script will work as expected on your server. My repro script does work for me, and most likely will work for you too, but I cannot promise that.

     

    Setting up test data

     

    The following script adds 1M messages, all of them except one have invalid dates:

     

     

    -- helper table
    CREATE TABLE dbo.Numbers(INT NOT NULL PRIMARY KEY)
    GO
    DECLARE @i INT;
    SET @i 1;
    INSERT INTO dbo.Numbers(nSELECT 1;
    WHILE @i<1024000 BEGIN
      INSERT INTO 
    dbo.Numbers(n)
        
    SELECT @i FROM dbo.Numbers;
      
    SET @i @i 2;
    END;
    GO
    CREATE TABLE dbo.Messages(MessageID INT NOT NULL PRIMARY KEY,
      
    SenderID INT NOT NULL,
      
    ReceiverID INT NOT NULL,
      
    MessageDateAs
    VarcharColumn VARCHAR(30) NULL,
      
    SomeMoreData CHAR(200) NULL);
    GO
    INSERT INTO dbo.Messages(MessageID,
      
    SenderID,
      
    ReceiverID,
     
    MessageDateAsVarcharColumn,
      
    SomeMoreData)
    SELECT nn%1000n/1000'Wrong Date''SomeMoreData'
      
    FROM dbo.Numbers;
    GO
    -- only one message has a valid date
    UPDATE dbo.Messages SET
    MessageDateAsVarcharColumn='20090707' 
      
    WHERE SenderID 123 AND ReceiverID 456;
     


     

    Originally an unsafe query against test data succeeds


    The following query succeeds on my server, selecting the only one row which has a valid date:

     

    SELECT FROM dbo.Messages
      
    WHERE SenderID 123 
        
    AND ReceiverID 456
        
    AND CAST(
    MessageDateAsVarcharColumn AS DATETIME)='20090707'
     

     

    Because the query succeeds, clearly for all the rows the first two conditions are evaluated before MessageDate is cast to DATETIME, because only the row for which the first two conditions are turn has a valid datetime value. Let me repeat my disclaimer: there is no guarantee that this query will succeed on your server. Even if you reshuffle the conditions in your WHERE clause, the query still succeeds, so clearly the cast is attempted only after both other conditions are evaluated:

     

     

    SELECT FROM dbo.Messages
      
    WHERE
    CAST(MessageDateAsVarcharColumn AS DATETIME)='20090707'
        
    AND ReceiverID 456
        
    AND
    SenderID 123 
     

     

    As you have seen, conditions in WHERE clause are not evaluated in left-to-right order.  You can also play with parenthesis trying to enforce one particular order, but that will not make any difference either.

     

    After adding an index, the same query blows up

     

    Here is the index:

     

    CREATE INDEX Messages_SenderID_MessageDate
      
    ON dbo.Messages(SenderID,
    MessageDateAsVarcharColumn);

     

    After it is created, the query blows up:

    Msg 241, Level 16, State 1, Line 1
    Conversion failed when converting datetime from character string.

    My explanation is simple: the optimizer chooses to use the new non-clustered index, and it chooses to cast the MessageDate value and evaluate the third condition CAST(MessageDateAsVarcharColumn AS DATETIME)='20090707') before doing an expensive bookmark lookup. Let me repeat my disclaimer again: there is no guarantee that this query will blow upon your server.

    Also I played a little bit with parenthesis, trying to use them to enforce a particular order of evaluation - of course that did not make any difference.

    As you have seen, it is unsafe to make assumptions about the order in which conditions in your WHERE clause are evaluated. Whenever you see a query developed under such an assumption, rewrite it eliminating the unsafe assumption altogether.

     

    If you are still reading this, can you do me a favor and post if this repro script worked on your server the way it did on mine. Please also post the output of SELECT @@VERSION.  I will be on vacation for a couple of weeks. This means that I will respond to questions, but not immediately.

     

    This post continues my series on defensive database programming. Here are my previous posts  from the series: 

    Defensive database programming: fun with triggers

    Defensive database programming: fun with ROWCOUNT

    Summarizing previous posts about defensive database programming

    Defensive database programming: SET vs. SELECT.

    Stress testing UPSERTs

    Defensive database programming: fun with UPDATE.

    Defensive database programming: eliminating IF statements.

    Defensive database programming: fun with changing column widths.

    Avoid mixing old and new styles of error handling.

    Defensive database programming: adding ESCAPE clauses.

    Defensive database programming: qualifying column names.

    Defensive database programming: rewriting queries with NOT IN().


  • Saving the whole team of players or nothing

    Suppose that you need to store teams of players, and you must enforce the following business rule: each team must consist of exactly two players. I will demonstrate how you can use constraints to implement this rule. I haven't used this approach in production yet, but I wanted to share an interesting idea. Of course we typically use triggers or stored procedures to implement such rules, but you can do it with constraints too, and it is worth mentioning at least as a brainteaser. Detailed comparison of pros and contras of different approaches is beyond the scope of this post.

     

    Setting up the tables

     

    CREATE TABLE dbo.Teams(TeamID INT NOT NULL PRIMARY KEY);
    GO
    INSERT INTO dbo.Teams(TeamIDSELECT UNION ALL SELECT 2;
    GO
      

    CREATE TABLE dbo.Players(PlayerID INT NOT NULL PRIMARY KEY,
      
    TeamID INT NOT NULL FOREIGN KEY REFERENCES dbo.Teams(TeamID),
      
    NumberInTeam INT NOT NULL CHECK(NumberInTeam IN (1,2)),
      
    TeamMateID INT NOT NULL,
      
    TeamMatesNumberInTeam INT NOT NULL,
    -- if NumberInTeam=1 then TeamMatesNumberInTeam must be 2
    -- and vise versa
      
    CHECK(NumberInTeam+TeamMatesNumberInTeam 3), 
      
    UNIQUE(TeamIDNumberInTeam),
      
    UNIQUE(PlayerIDTeamIDNumberInTeam),
      
    FOREIGN KEY(TeamMateIDTeamIDTeamMatesNumberInTeam)
        
    REFERENCES dbo.Players(PlayerIDTeamIDNumberInTeam)
    );

    You can only insert one or more whole teams:

    INSERT INTO dbo.Players(PlayerIDTeamIDNumberInTeamTeamMateIDTeamMatesNumberInTeam)
    SELECT 1,1,1,2,UNION ALL
    SELECT 2,1,2,1,1;
      

     You cannot insert a single player.

     

    Nor you can DELETE a single player, or UPDATE a player transferring to another team. Once the whole team is inserted, it can only be deleted as a whole team, or both players can at once transfer to other team(s). You try out and see for yourself.

     

    More complex cases.

     

    You can easily extend this approach to have teams of different sizes. For example, you can add columns TeamSizeFrom and TeamSizeTo to your Teams table, and make sure that all teams have the required sizes (different for different teams), all only with constraints. If you are running 2008, MERGE significantly simplifies modifications. Prior to 2008 in many cases all you can do is delete the whole team and reinsert the changed one.

    P.S. I am leaving for a vacation soon, so I will not be responding to questions for a few weeks or so.

     

     


  • Optimizing yet another query that involves highly correlated columns

    In some cases some of the columns involved in a query are highly correlated. If you manage to communicate to the optimizer that valuable information, it may come up with a more efficient plan. For example, consider the following table (the script that populates it is at the end of this post):

     

    CREATE TABLE dbo.Events(EventID INT NOT NULL PRIMARY KEY,
      
    EventTime DATETIME NOT NULL,
      
    SomeMoreData CHAR(10)
    ); 

     

    Suppose that your system inserts events one by one, that EventID keeps increasing as your system keeps inserting, and that EventTime also keeps increasing most of the time, with possible minor fluctuations caused by delays between the client and the server. Clearly EventID and EventDate are highly correlated in this scenario. Consider the following queries:

     

    DECLARE @dfrom DATETIME@dto DATETIME;
    SELECT @dfrom '20050102 12:34:56'@dto '20050103 12:34:56';

    SELECT COUNT(*) FROM(
    SELECT EventIDEventTimeSomeMoreData FROM dbo.Events 
    WHERE EventTime BETWEEN @dfrom AND @dto 
    AND SomeMoreData '1'
    )AS t;


    SELECT COUNT(*) FROM(
    SELECT EventIDEventTimeSomeMoreData FROM dbo.Events 
    WHERE EventTime BETWEEN 
    '20050102 12:34:56' AND '20050103 12:34:56'
    AND SomeMoreData '1'
    )AS t;
     

     

    Table 'Events'. Scan count 9, logical reads 4088, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 107 ms,  elapsed time = 21 ms.
     

     In this case more than 8% or rows meet the criteria (the script that populates test data is at the end of this post, you can run it and see for yourself), so both queries are executed as clustered index scans. However, because of the typical way the rows are inserted into the table, the rows that actually meet the criteria are very much clustered together - most pages do not contain any rows meeting the criteria, while on some pages most or all rows are selected. Communicating this information to the optimizer is easy and the query runs much faster, because it scans just a range of the clustered index:

     

    DECLARE @dfrom DATETIME@dto DATETIME;
    SELECT @dfrom '20050102 12:34:56'@dto '20050103 12:34:56';

    SELECT COUNT(*) FROM(
    SELECT EventIDEventTimeSomeMoreData FROM dbo.Events 
    WHERE EventTime BETWEEN @dfrom AND @dto
    AND EventID BETWEEN (SELECT MIN(EventIDFROM dbo.Events AS e1 WHERE e1.EventTime @dfrom)
      AND (
    SELECT MAX(EventIDFROM dbo.Events AS e1 WHERE e1.EventTime @dto)
    AND 
    SomeMoreData '1'
    )AS t;
      

     

     Table 'Events'. Scan count 3, logical reads 848, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 31 ms,  elapsed time = 36 ms.

     

    Of course, if the assumption that EventID and EventDate are highly correlated is no longer true, this query will no longer be a better choice.

     

     Setting up test data

     

    CREATE TABLE dbo.Numbers(INT NOT NULL PRIMARY KEY)
    GO
    DECLARE @i INT;
    SET @i 1;
    INSERT INTO dbo.Numbers(nSELECT 1;
    WHILE @i<1024000 BEGIN
      INSERT INTO 
    dbo.Numbers(n)
        
    SELECT @i FROM dbo.Numbers;
      
    SET @i @i 2;
    END;
    GO 

    DROP TABLE dbo.Events
    GO
    CREATE TABLE dbo.Events(EventID INT NOT NULL PRIMARY KEY,
      
    EventTime DATETIME NOT NULL,
      
    SomeMoreData CHAR(10)
    );
    GO

    INSERT INTO dbo.Events(EventIDEventTimeSomeMoreData)
    SELECT nDATEADD(secondn'20050101'), 'some data'
    FROM dbo.Numbers;
    GO
    -- reshuffle EventTime just a little bit
    UPDATE dbo.Events SET EventTime DATEADD(secondEventID%20EventTime);
    GO
    CREATE INDEX Events_EventTime ON dbo.Events(EventTime);
      

     


  • Using CROSS APPLY to optimize joins on BETWEEN conditions

    Recently I encountered a case when I knew much more about the data than the optimizer. Originally the performance was horrible, this is why I had to have a look at the query in the first place. When I was able to share my knowledge with the optimizer, it produced a better plan, and the query ran dramatically faster.

     

    The slow query

     

    The following tables store one-munite commercials for every minut for one year, and customer calls, one call per minute, for the same year. The scripts that populate tables with test data are provided at the end of this post. Here are the tables:

     

    CREATE TABLE dbo.Commercials(
      
    StartedAt DATETIME NOT NULL 
       
    CONSTRAINT PK_Commercials PRIMARY KEY,
      
    EndedAt DATETIME NOT NULL,
      
    CommercialName VARCHAR(30) NOT NULL);
    GO
    CREATE TABLE dbo.Calls(CallID INT 
      CONSTRAINT 
    PK_Calls NOT NULL PRIMARY KEY,
      
    AirTime DATETIME NOT NULL,
      
    SomeInfo CHAR(300));
    GO
    CREATE UNIQUE INDEX Calls_AirTime
      
    ON dbo.Calls(AirTimeINCLUDE(SomeInfo);
    GO

    Every commercial in my table lasts for at most one minute, and they do not overlap. I can easily enforce both conditions with constraints (

    Storing intervals of time with no overlaps"

    ), which are omitted in this post just to keep it simple.

    The following query retrieves only 181 rows, and it runs very slowly:

     SELECT s.StartedAts.EndedAtc.AirTime
    FROM dbo.Commercials s JOIN dbo.Calls c 
      
    ON c.AirTime >= s.StartedAt AND c.AirTime s.EndedAt
    WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
      

     Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 2, logical reads 3338264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Commercials'. Scan count 2, logical reads 7166, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 71704 ms,  elapsed time = 36316 ms.

    Why is it so slow? I haven't mastered the fine art of adding images to my posts yet, so I have to explain verbally. For every call the DB engine scans all the commercials which begin before the time of the call, which is expensive. The reason is simple: the optimizer does not know that the commercials are short, and that the commercials do not overlap, so it must scan all the potential matches, which are all the commercials which begin before the time of the call.

     

    Using CROSS APPLY  to tell the optimizer that commercials do not overlap.

     

     Because commercials do not overlap, we need at most one match. Translating this information into plain SQL is easy, and the query runs dramatically faster:

     

    SELECT s.StartedAts.EndedAtc.AirTime
    FROM dbo.Calls c CROSS APPLY(
      
    SELECT TOP 1 s.StartedAts.EndedAt FROM dbo.Commercials s 
      
    WHERE c.AirTime >= s.StartedAt AND c.AirTime s.EndedAt
      
    ORDER BY s.StartedAt DESCAS s
    WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'

     

    Table 'Commercials'. Scan count 181, logical reads 1327, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 31 ms,  elapsed time = 31 ms.

     

    Note: if you needed only one column from Commercials table, you could easily use just a subquery. Because more than one column is needed, CROSS APPLY is a better, a more performant choice choice without redundant code.

    Note: If you are using the assumption that the commercials do not overlap, you have to enforce that business rule in the database. Also to make sure that you don't forget that your query relies on that assumption, use a unit test do document it.

    Also let me put it differently: If you are using the assumption that the commercials do not overlap, use a unit test do document it, so that that you don't forget that your query relies on that assumption. Also you have to enforce that business rule in the database.

     

    Using another range condition to tell the optimizer that commercials are short.

     

     Because commercials are short, there is no need to scan the commercials that start more than maximum commercial's length before the call. Again, translating this information into plain SQL is quite easy too, and again the query runs much faster, even faster than the previous one:

     

    SELECT s.StartedAts.EndedAtc.AirTime
    FROM dbo.Commercials s JOIN dbo.Calls c 
      
    ON c.AirTime >= s.StartedAt AND c.AirTime s.EndedAt
    WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
    AND s.StartedAt BETWEEN '20080630 23:45' AND '20080701 03:00'
     

    Table 'Worktable'. Scan count 1, logical reads 753, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Commercials'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 31 ms,  elapsed time = 24 ms.
     

    Note: If you are using the assumption that the commercials are short, you have to enforce that business rule in the database. Also to make sure that you don't forget that your query relies on that assumption, use a unit test do document it.

    Also let me put it differently: If you are using the assumption that the commercials are short, use a unit test do document that. Also you have to enforce that business rule in the database. 

     

     

    Setting up tables and test data

     

    CREATE TABLE dbo.Numbers(INT NOT NULL PRIMARY KEY)
    GO
    DECLARE @i INT;
    SET @i 1;
    INSERT INTO dbo.Numbers(nSELECT 1;
    WHILE @i<1024000 BEGIN
      INSERT INTO 
    dbo.Numbers(n)
        
    SELECT @i FROM dbo.Numbers;
      
    SET @i @i 2;
    END;
    GO
    INSERT INTO dbo.Commercials(StartedAtEndedAtCommercialName)
    SELECT DATEADD(minute1'20080101')
       ,
    DATEADD(minuten'20080101')
       ,
    'Show #'+CAST(AS VARCHAR(6))
      
    FROM dbo.Numbers
      
    WHERE n<=24*365*60;
    GO
    INSERT INTO dbo.Calls(CallID,
      
    AirTime,
      
    SomeInfo)
    SELECT 
       
    ,DATEADD(minute1'20080101')
       ,
    'Call during Commercial #'+CAST(AS VARCHAR(6))
      
    FROM dbo.Numbers
      
    WHERE n<=24*365*60;
    GO
      

     


More Posts Next page »
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement