|
|
|
|
-
When one and the same constant is copied and pasted in more than one place, there is always a chance that we can change it in one place and fail to change in another, resulting in a discrepancy.
For example, recently I read a very interesting post by Michelle Ufford, Filtered Indexes Work-Around.
I will not repeat the whole post here, I encourage you to read it in full, but here are the relevant parts. There is a filtered index: CREATE NONCLUSTERED INDEX MyLatestData
ON dbo.myTable(myDate)
Include (myData)
WHERE myDate >= '20100127';
The following stored procedure does not use it, because the optimizer does not know if the parameter is before the cutoff date for the filtered index: CREATE PROCEDURE SelectLatestData
@DateFrom DATETIME
AS
SELECT DISTINCT myData
FROM dbo.filteredIndexTest
WHERE myDate >= @DateFrom
;
This is where Michelle analyzed several approaches. The following is my suggestion, which has problems, yet Michelle was kind enough to incorporate in her post: ALTER PROCEDURE SelectLatestData
@DateFrom DATETIME
AS
SELECT DISTINCT myData
FROM dbo.filteredIndexTest
WHERE myDate >= @DateFrom
-- this ensures that the index is used
AND MyDate >= '20100127' ;
By repeating the cutoff date from the definition of the filter index in the WHERE clause of this procedure, we explicitly tell the optimizer that that filtered index can always satisfy it regardless of the value of the parameter. However, there is a problem with my suggestion: what happens when later we change the filter of the index, as follows: CREATE NONCLUSTERED INDEX MyLatestData
ON dbo.myTable(myDate)
Include (myData)
WHERE myDate >= '20100220';
Our stored procedure will still work, but it will not use that new index any more.Suppose that we still want to use the stored procedure to get the latest data, and to use the lean filtered index for that purpose.
Clearly we should change the stored procedure at the same time when we replace the index. How can we manage that? If the system is large, it is impossible to remember all these details. The way I deal with such problems is this: I have macros, and I apply C++ preprocessor against my sql source code when I build. So, in my definitions.h file I will have this line: #define LATEST_DATA_CUTOFF '20100127'
In my sql code, I will include definitions.h, just like I would do in C++ code. Both the index definition and the stored procedure will use the macro LATEST_DATA_CUTOFF. When I change the macro definition, the preprocessor will take care of all the changes in all the places the macro is used. SQL Compare will take care of the deployment. How do you solve such problems?
|
-
Whenever we are using multi-statement TVFs, we are essentially forcing nested loops logic on the database engine. Although multi-statement TVFs are smart enough and do not always execute once per row, when they do so, they may be much slower than nested loops. As usual, inline UDFs shine as compared to multi-statement ones, at least in all the benchmarks in this post - let us run some tests and see for ourselves. Setting up data First of all, here is a table, rather large, 512K rows, and rather wide, less than 20 rows per page, so that reading from it via nested loops is a natural choice in many cases: CREATE TABLE dbo.Parent
(
ID INT NOT NULL
PRIMARY KEY ,
ParentNumber INT NOT NULL,
SpaceFiller CHAR(500) NOT NULL
) ;
GO
DECLARE @adder INT ;
SET @adder = 1 ;
INSERT INTO dbo.Parent
( ID, ParentNumber, SpaceFiller )
VALUES ( 1, 1, '#' ) ;
WHILE @adder < 500000
BEGIN ;
INSERT INTO dbo.Parent
( ID ,
ParentNumber ,
SpaceFiller
)
SELECT ID + @adder ,
ParentNumber ,
SpaceFiller
FROM dbo.Parent ;
SET @adder = @adder * 2 ;
END ;
GO
The Child table is much smaller and narrow:
CREATE TABLE dbo.Child
(
ID INT NOT NULL
PRIMARY KEY ,
ParentID INT NOT NULL ,
ChildNumber INT NOT NULL UNIQUE
) ;
GO
INSERT INTO dbo.Child
( ID, ParentID, ChildNumber )
SELECT ID, -- ID - int
ID, -- ParentID - int
ID -- ChildNumber - int
FROM dbo.Parent WHERE ID BETWEEN 1 AND 1000 ;
Comparing nested loops vs. multi-statement UDF calls Here are the two UDFs which we are going to benchmark: CREATE FUNCTION dbo.GetParentNumber_Multiline ( @ParentID INT )
RETURNS @parentInfo TABLE ( ParentNumber INT )
AS
BEGIN ;
INSERT @parentInfo
( ParentNumber
)
SELECT ParentNumber
FROM dbo.Parent
WHERE ID = @ParentID ;
RETURN ;
END ;
GO
CREATE FUNCTION GetParentNumber_Inline ( @ParentID INT )
RETURNS TABLE
AS
RETURN
( SELECT ParentNumber
FROM dbo.Parent
WHERE ID = @ParentID
) ;
Let us fire up the Profiler, make sure it records individual statements, and run the following script: SELECT ID, ParentID, p.ParentNumber
INTO #t1
FROM dbo.Child
CROSS APPLY dbo.GetParentNumber_Multiline ( ParentID ) AS p ;
DROP TABLE #t1 ;
SELECT ID, ParentID, p.ParentNumber
INTO #t1
FROM dbo.Child
CROSS APPLY dbo.GetParentNumber_Inline ( ParentID ) AS p
DROP TABLE #t1 ;
In the Profiler, we shall see that the multi-statement UDF has been invoked 1000 times, which is slow: Multiline stats: CPU time = 312 ms, elapsed time = 696 ms.
Inline stats: CPU time = 0 ms, elapsed time = 7 ms. Clearly the difference in performance is dramatic, while the plan is essentially the same - nested loops. We can highlight the query, compare the execution plans, and see for ourselves. Multi-statement UDFs do not always run once per row
While we are at it, let us bust a myth: multi-statement UDFs do not have to run once per row. Let us have all child rows refer to only one parent: UPDATE dbo.Child SET ParentID = 1 ;
Let us rerun our benchmarks. In the profiler we shall see that the multi-statement UDF has been invoked only once. Execution costs are also quite comparable this time for both scripts: CPU time = 15 ms, elapsed time = 21 ms.
As we have seen, the database engine is smart enough and may sometimes execute multi-statement UDFs once per distinct set of parameters, not once per row.
Multi-statement UDFs essentially force nested loops on the engine Let us set up test data differently, so that the second benchmark (the one using inline UDF) executes as a hash join: UPDATE dbo.Child SET ParentID = ID ;
TRUNCATE TABLE dbo.Parent ;
GO INSERT INTO dbo.Parent
( ID, ParentNumber, SpaceFiller )
SELECT ID, ID, '*'
FROM dbo.Child ;
The first benchmark still executes as a nested loop, and is much slower Multi-statement UDF, nested loops: CPU time = 266 ms, elapsed time = 674 ms. Inline UDF, hash join: CPU time = 0 ms, elapsed time = 5 ms.
Conclusion As we have seen, if we want to force nested loops, we can use multi-statement UDFs.Also we have seen that those multi-statement UDFs can be very slow.
However, let us be cautious: when we want to benchmark real life solutions, we do not want to do it with the Profiler running, like I did in this post. Also, as Dave Ballantine has correctly pointed out, in some cases STATISTICS TIME may skew the results just as well.
|
-
We can begin a transaction under snapshot isolation, but we cannot switch to it in the middle of an outstanding transaction. For example, the following procedure looks good and passes a smoke test: CREATE PROCEDURE dbo.SelectCountry
@CountrySymbol CHAR(2)
AS
BEGIN ;
SET TRANSACTION ISOLATION LEVEL SNAPSHOT ;
SELECT CountrySymbol ,
Description
FROM data.Countries
WHERE CountrySymbol = @CountrySymbol ;
END ;
GO
-- Smoke test: this call completes
EXECUTE dbo.SelectCountry @CountrySymbol = 'US' ;
However, the procedure fails if it is invoked in the middle of an outstanding transaction that did not begin under snapshot isolation level: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
BEGIN TRANSACTION ;
EXECUTE dbo.SelectCountry @CountrySymbol = 'US' ;
COMMIT ;
Msg 3951, Level 16, State 1, Procedure SelectCountry, Line 6 Transaction failed in database 'Test' because the statement was run under snapshot isolation but the transaction did not start in snapshot isolation. You cannot change the isolation level of the transaction to snapshot after the transaction has started unless the transaction was originally started under snapshot isolation level.
Apparently the author of this stored procedure overlooked the possibility that it can be invoked in an outstanding transaction that did not begin under snapshot isolation level. Is is very easy to improve the robustness of this stored procedure:
ALTER PROCEDURE dbo.SelectCountry @CountrySymbol CHAR(2)
AS
BEGIN ;
IF @@TRANCOUNT = 0
BEGIN ;
SET TRANSACTION ISOLATION LEVEL SNAPSHOT ;
END ;
SELECT CountrySymbol ,
Description
FROM data.Countries
WHERE CountrySymbol = @CountrySymbol ;
END ;
As usual in defensive programming, let us not stop at this - we also need to review all other modules and proactively apply the same fix wherever appropriate.
|
-
Correlated subqueries do not have to execute once per row - on the contrary, they are equivalent to outer joins, and they may have the same execution plans and the same real execution costs (if we retrieve only one column via a correlated subquery, of course)
Let me provide a script which compares the performance of a query with one correlated subquery vs. an equivalent out join. First, we need two tables for our example: CREATE TABLE dbo.Parent
(
ID INT NOT NULL
PRIMARY KEY ,
ParentNumber INT NOT NULL
) ;
GO
CREATE TABLE dbo.Child
(
ID INT NOT NULL
PRIMARY KEY ,
ParentID INT NOT NULL ,
ChildNumber INT NOT NULL
) ;
Next, let us populate them with 512K rows each: DECLARE @adder INT ;
SET @adder = 1 ;
INSERT INTO dbo.Parent
( ID, ParentNumber )
VALUES ( 1, 1 ) ;
WHILE @adder < 500000
BEGIN ;
INSERT INTO dbo.Parent
( ID ,
ParentNumber
)
SELECT ID + @adder ,
ParentNumber
FROM dbo.Parent ;
SET @adder = @adder * 2 ;
END ;
GO
INSERT INTO dbo.Child
( ID, ParentID, ChildNumber )
SELECT ID, -- ID - int
ID, -- ParentID - int
0 -- ChildNumber - int
FROM dbo.Parent ;
We want to compare the performance of two queries without the overhead of retrieveing and or materializing result sets, so I will make sure the result sets are empty. The correlated subquery approach is as follows: SET STATISTICS IO ON ;
SET STATISTICS TIME ON ;
GO
SELECT ChildNumber ,
ParentNumber
INTO #t1
FROM ( SELECT ChildNumber ,
( SELECT ParentNumber
FROM dbo.Parent AS p
WHERE p.ID = c.ParentID
) AS ParentNumber
FROM dbo.Child AS c
) AS t
-- no rows meet this criteria
WHERE ChildNumber + ParentNumber = -123 ;
The execution costs are: Table 'Child'. Scan count 9, logical reads 2502, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Parent'. Scan count 9, logical reads 2215, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
The equivalent outer join looks like this (run it in another tab): SET STATISTICS IO ON ;
SET STATISTICS TIME ON ;
GO
SELECT ChildNumber ,
ParentNumber
INTO #t2
FROM ( SELECT ChildNumber ,
ParentNumber
FROM dbo.Child AS c
LEFT OUTER JOIN dbo.Parent AS p ON p.ID = c.ParentID
) AS t
-- no rows meet this criteria
WHERE ChildNumber + ParentNumber = -123 ;
Its real execution costs are the same. More to the point, when I hit Ctrl+L in both tabs, I see the same execution plan. Conclusion As we have seen, correlated subqueries do not have to execute once per each row. In fact, they are just macros - the optimizer can flatten them out and rewrite the queries involving them as outer joins.
|
-
To avoid deadlocks, one of the most common recommendations is "to acquire locks in the same order" or "access objects in the same order". Clearly this makes perfect sense, but is it always feasible? Is it always possible? I keep encountering cases when I cannot follow this advice. If I store an object in one parent table and one or more child ones, I cannot follow this advice at all. When inserting, I need to insert my parent row first. When deleting, I have to do it in the opposite order. If I use commands that touch multiple tables or multiple rows in one table, then usually I have no control in which order locks are acquired, (assuming that I am not using hints). So, in many cases trying to acquire locks in the same order does not prevent all deadlocks. So, we need some kind of handling deadlocks anyway - we cannot assume that we can eliminate them all. Unless, of course, we serialize all access using Service Broker or sp_getapplock.
|
-
Neither
UPDATE … IF (@@ROWCOUNT = 0) INSERT
nor
IF EXISTS(...) UPDATE ELSE INSERT
patterns work as expected under high concurrency. Both may fail. Both may fail very frequently. MERGE is the king - it holds up much better.Let us do some stress testing and see for ourselves. Here is the table we shall be using: CREATE TABLE dbo.TwoINTs
(
ID INT NOT NULL PRIMARY KEY,
i1 INT NOT NULL ,
i2 INT NOT NULL ,
version ROWVERSION
) ;
GO
INSERT INTO dbo.TwoINTs
( ID, i1, i2 )
VALUES ( 1, 0, 0 ) ;
IF EXISTS(…) THEN pattern frequently fails under high concurrency.
Let us insert or update
rows in a loop using the following simple logic: if a row with given
ID exists, update it, and otherwise insert a new one. The following
loop implements this logic. Cut and paste it into two tabs, switch into
text mode in both tabs, and run them simultaneously.
-- hit Ctrl+T to execute in text mode
SET NOCOUNT ON ;
DECLARE @ID INT ;
SET @ID = 0 ;
WHILE @ID > -100000
BEGIN ;
SET @ID = ( SELECT MIN(ID)
FROM dbo.TwoINTs
) - 1 ;
BEGIN TRY ;
BEGIN TRANSACTION ;
IF EXISTS ( SELECT *
FROM dbo.TwoINTs
WHERE ID = @ID )
BEGIN ;
UPDATE dbo.TwoINTs
SET i1 = 1
WHERE ID = @ID ;
END ;
ELSE
BEGIN ;
INSERT INTO dbo.TwoINTs
( ID, i1, i2 )
VALUES ( @ID, 0, 0 ) ;
END ;
COMMIT ;
END TRY
BEGIN CATCH ;
ROLLBACK ;
SELECT error_message() ;
END CATCH ;
END ;
When we run this script
simultaneously in two tabs, we shall immediately get a huge amount of
primary key violations in both tabs. This demonstrates how unreliable
the IF EXISTS pattern is when it executes under high concurrency.
Note: this
example also demonstrates that it is not safe to use SELECT
MAX(ID)+1 or SELECT
MIN(ID)-1 as the next
available unique value if we do it under concurrency.
UPDATE … IF (@@ROWCOUNT = 0) BEGIN pattern is also unreliable.
Another common approach
is to update a row first, and if no row was updated, insert it. It is
also unreliable. Before demonstrating that, let us delete the rows inserted
by the previous example:
DELETE FROM dbo.TwoINTs WHERE ID < 1 ;
Let us modify the
loop which we ran in the previous example, as follows:
-- hit Ctrl+T to execute in text mode
SET NOCOUNT ON ;
DECLARE @ID INT ;
SET @ID = 0 ;
WHILE @ID > -100000
BEGIN ;
SET @ID = ( SELECT MIN(ID)
FROM dbo.TwoINTs
) - 1 ;
BEGIN TRY ;
BEGIN TRANSACTION ;
UPDATE dbo.TwoINTs
SET i1 = 1
WHERE ID = @ID ;
IF ( @@ROWCOUNT = 0 )
BEGIN ;
INSERT INTO dbo.TwoINTs
( ID, i1, i2 )
VALUES ( @ID, 0, 0 ) ;
END ;
COMMIT ;
END TRY
BEGIN CATCH ;
SELECT error_message() ;
ROLLBACK ;
END CATCH ;
END ;
When we run this script simultaneously from two tabs, we are getting lots of primary key
violations, just like when we ran our previous example.
As we have seen, the UPDATE
… IF (@@ROWCOUNT = 0) pattern is
unreliable under high concurrency too.
MERGE holds up perfectly well
Again, let us delete the rows inserted
by the previous example:
DELETE FROM dbo.TwoINTs WHERE ID < 1 ;
Rerun the loop using MERGE:
-- hit Ctrl+T to execute in text mode
SET NOCOUNT ON ;
DECLARE @ID INT ;
SET @ID = 0 ;
WHILE @ID > -100000
BEGIN ;
SET @ID = ( SELECT MIN(ID)
FROM dbo.TwoINTs
) - 1 ;
BEGIN TRY ;
MERGE dbo.TwoINTs AS target
USING
( SELECT @ID ,
0 ,
0
) AS source ( ID, i1, i2 )
ON ( target.ID = source.ID )
WHEN MATCHED
THEN
UPDATE SET
i1 = 1
WHEN NOT MATCHED
THEN
INSERT ( ID, i1, i2 )
VALUES
( @ID ,
0 ,
0
) ;
END TRY
BEGIN CATCH ;
SELECT error_message() ;
END CATCH ;
END ;
In the
context of our loop, MERGE always completes without a single error!
As we have seen, in
this particular case MERGE holds up under high concurrency perfectly
well. Of course, this does not mean that we can always use this new
command without stress testing. Yet we should at least consider using
it whenever we insert or update under high concurrency, with usual precautions
and after thorough testing. It is fun to be blogging again. I kind of neglected blogging for some time, because I am writing a book on defensive programming, and that requires a huge amount of time. Anyway, it feels good to be back. The book should be finished soon, and I'll start contributing here more.
This post participates in the second SQL Tuesday
|
-
When we retry after deadlocks, we are very likely to overwrite other processes' changes. We need to be aware that very likely someone else modified the data we intended to modify. Especially if all the readers run under snapshot isolation, then readers cannot be involved in deadlocks, which means that all the parties involved in a deadlock are writers, modified or attempted to modify the same data. If we just catch the exception and automatically retry, we can overwrite someone else's changes. This is called lost updates, and this is usually wrong. Typically the right thing to do after a deadlock is to retry on a much higher level - re-select the data and decide whether to save in the same way the original decision to save was made.
For example, if a user pushed a Save button and the saving transaction was chosen as a deadlock victim, it might be a good idea to re-display the data on the screen as of after the deadlock.
|
-
Clearly Linux is one of the most successful products ever. Let me quote a little bit from Linus Torvalds and other brilliant people involved in it:
"A strong vision and a sure hand sound like good things on paper. It's just
that I have never _ever_ met a technical person (including me) whom I
would trust to know what is really the right thing to do in the long run.
Too strong a strong vision can kill you - you'll walk right over the edge,
firm in the knowledge of the path in front of you."
here
"The most successful software have born from fixing/patching an
initial/simple implementation while the greatest software failures
have born from deep planning and design.
"
here
More specifically, the following article is a great read
Evolutionary Database Design
|
-
But did it actually work yesterday? If the day is Friday, Noverber 13th, or the 13th day of any other month, and your query blows up for the first time, one place to search for is character strings converted to datetime values. The following script illustrates the problem:
-- yesterday it worked, or did it?
SET LANGUAGE US_English;
SELECT CAST('11/12/2009' AS DATETIME); SET LANGUAGE Norwegian; SELECT CAST('11/12/2009' AS DATETIME);
Changed language setting to us_english.
----------------------- 2009-11-12 00:00:00.000
(1 row(s) affected)
Changed language setting to Norsk.
----------------------- 2009-12-11 00:00:00.000
(1 row(s) affected)
-- today it's not working SET LANGUAGE US_English; SELECT CAST('11/13/2009' AS DATETIME);
SET LANGUAGE Norwegian;
SELECT CAST('11/13/2009' AS DATETIME);
Changed language setting to us_english.
----------------------- 2009-11-13 00:00:00.000
(1 row(s) affected)
Changed language setting to Norsk.
----------------------- Msg 242, Level 16, State 3, Line 11 The conversion of a char data type to a datetime data type resulted in an out-of-range datetime value.
Note that the query did not quite work yesterday either - although it did not raise errors, it was selecting the wrong data.
|
-
Of course, PIVOting uses up some CPU. However, if the data is dense (all the cells in the pivoted result set are not NULL), then the size of the pivoted result set may be significantly less. As a result, the overall time to retrieve a pivoted result set and transmit it over the network may be less. Here are my benchmarks. Let us create a helper table with 1M numbers: CREATE TABLE dbo.Numbers(n INT NOT NULL PRIMARY KEY);
GO
TRUNCATE TABLE dbo.Numbers;
INSERT INTO dbo.Numbers
( n )
VALUES ( 1
);
GO
DECLARE @i INT;
SET @i=0;
WHILE @i<21
BEGIN
INSERT INTO dbo.Numbers
( n )
SELECT n + POWER(2, @i)
FROM dbo.Numbers;
SET @i = @i + 1;
END;
Let us create and populate an typical EAV table: CREATE TABLE dbo.ObjectAttributes(ObjectID INT NOT NULL, AttributeID INT NOT NULL, IntValue INT NOT NULL,
CONSTRAINT PK_ObjectAttributes PRIMARY KEY(ObjectID, AttributeID));
GO
TRUNCATE TABLE dbo.ObjectAttributes;
INSERT INTO dbo.ObjectAttributes(ObjectID, AttributeID, IntValue)
SELECT n1.n, n2.n, n1.n
FROM dbo.Numbers AS n1
CROSS JOIN
dbo.Numbers AS n2
WHERE n1.n <100001
AND n2.n<11;
Let us select all the data from it as is a couple of times: SET STATISTICS TIME ON;
SELECT *
FROM dbo.ObjectAttributes;
SELECT *
FROM dbo.ObjectAttributes;
SQL Server Execution Times: CPU time = 157 ms, elapsed time = 8369 ms. SQL Server Execution Times: CPU time = 266 ms, elapsed time = 5935 ms. Let us select pivoted data: SELECT ObjectID, [1],[2],[3],[4],[5],[6],[7],[8],[9],[10]
FROM dbo.ObjectAttributes
PIVOT( MAX(IntValue) FOR AttributeID IN([1],[2],[3],[4],[5],[6],[7],[8],[9],[10])) AS pvt
ORDER BY pvt.ObjectID;
Although the pivoted query used more CPU, the overall time was substantially less:
SQL Server Execution Times: CPU time = 1219 ms, elapsed time = 1794 ms. I was inspired by the following Aaron's post:
|
-
I'll be speaking about Defensive Database Programming in Chicago on Sep 10 at 5:30PM.
http://chicago.sqlpass.org/
The seating is limited to 50, You must RSVP to attend
|
-
You
cannot assume that the conditions in your WHERE clause will evaluate in
the left-to-write order - making such assumptions leads to unsafe
queries. For example, the following query is not safe:
SELECT * FROM dbo.Messages
WHERE ISDATE(VarcharColumn)=1
AND CAST(VarcharColumn) AS DATETIME)='20090707'
It can blow up at any time, and the reason is simple: the conditions
in your WHERE clause can evaluate in any order, and the order can
change the next time your query executes. If an invalid value casts
before the check if it is valid, the query will blow up. Of course, I
am not the first to point this out. Right from the top of my head I
recall that you can find similar recommendations in Itzik Ben Gan's
book on T-SQl Fundamentals, and in a recent Plamen Ratchev's post "Predicates in SQL".
However, I would like to demonstrate how little it may take to break
such code, how brittle it may be. I will provide a repro script in
which a query originally succeeds, but fails after I have added an
index.
Do not make assumptions which you cannot guarantee
Whenever you write such code as the previous sample, you clearly
assume that the first condition, the call to ISDATE, will evaluate
before the CAST. However, this assumption cannot be guaranteed, and
when this assumption is not true, the query blows up. Even if this
query always succeeds in your test runs, you cannot assume that it will
always succeed in the future. The safe way to ensure that invalid
values are never cast to datetime is this:
SELECT * FROM dbo.Messages
WHERE CASE WHEN ISDATE(VarcharColumn)=1
THEN CAST(VarcharColumn AS DATETIME) END ='20090707'
It is safe becasue it is documented that CASE expression evaluates the WHEN clause before evaluating the THEN clause.
When order in which conditions are evaluated changes - a repro script.
I am going to provide a repro script demostrating that provide that
the order in which conditions are evaluated can change. However, I have
to begin with a big disclaimer: there is no guarantee that this repro script will work as expected on your server. My repro script does work for me, and most likely will work for you too, but I cannot promise that.
Setting up test data
The following script adds 1M messages, all of them except one have invalid dates:
-- helper table
CREATE TABLE dbo.Numbers(n INT NOT NULL PRIMARY KEY)
GO
DECLARE @i INT;
SET @i = 1;
INSERT INTO dbo.Numbers(n) SELECT 1;
WHILE @i<1024000 BEGIN
INSERT INTO dbo.Numbers(n)
SELECT n + @i FROM dbo.Numbers;
SET @i = @i * 2;
END;
GO
CREATE TABLE dbo.Messages(MessageID INT NOT NULL PRIMARY KEY,
SenderID INT NOT NULL,
ReceiverID INT NOT NULL,
MessageDateAsVarcharColumn VARCHAR(30) NULL,
SomeMoreData CHAR(200) NULL);
GO
INSERT INTO dbo.Messages(MessageID,
SenderID,
ReceiverID,
MessageDateAsVarcharColumn,
SomeMoreData)
SELECT n, n%1000, n/1000, 'Wrong Date', 'SomeMoreData'
FROM dbo.Numbers;
GO
-- only one message has a valid date
UPDATE dbo.Messages SET MessageDateAsVarcharColumn='20090707'
WHERE SenderID = 123 AND ReceiverID = 456;
Originally an unsafe query against test data succeeds
The following query succeeds on my server, selecting the only one row which has a valid date:
SELECT * FROM dbo.Messages
WHERE SenderID = 123
AND ReceiverID = 456
AND CAST(MessageDateAsVarcharColumn AS DATETIME)='20090707'
Because the query succeeds, clearly for all the rows the first two
conditions are evaluated before MessageDate is cast to DATETIME,
because only the row for which the first two conditions are turn has a
valid datetime value. Let me repeat my disclaimer: there is no guarantee that this query will succeed on your server. Even
if you reshuffle the conditions in your WHERE clause, the query still
succeeds, so clearly the cast is attempted only after both other
conditions are evaluated: SELECT * FROM dbo.Messages
WHERE CAST(MessageDateAsVarcharColumn AS DATETIME)='20090707' AND ReceiverID = 456
AND SenderID = 123
As you have seen,
conditions in WHERE clause are not evaluated in left-to-right order.
You can also play with parenthesis trying to enforce one particular
order, but that will not make any difference either.
After adding an index, the same query blows up
Here is the index:
CREATE INDEX Messages_SenderID_MessageDate
ON dbo.Messages(SenderID, MessageDateAsVarcharColumn);
After it is created, the query blows up:
Msg 241, Level 16, State 1, Line 1 Conversion failed when converting datetime from character string.
My explanation is simple: the optimizer chooses to use the new
non-clustered index, and it chooses to cast the MessageDate value and
evaluate the third condition CAST(MessageDateAsVarcharColumn AS DATETIME)='20090707') before doing an expensive bookmark lookup. Let me repeat my disclaimer again: there is no guarantee that this query will blow upon your server.
Also I played a little bit with parenthesis, trying to use them to
enforce a particular order of evaluation - of course that did not make
any difference.
As you have seen, it is unsafe to make assumptions about the order
in which conditions in your WHERE clause are evaluated. Whenever you
see a query developed under such an assumption, rewrite it eliminating
the unsafe assumption altogether.
If you are still reading this, can you do me a favor and
post if this repro script worked on your server the way it did on mine.
Please also post the output of SELECT @@VERSION. I will be on vacation
for a couple of weeks. This means that I will respond to questions, but not
immediately.
This post continues my series on defensive database programming. Here are
my previous posts from the series:
Defensive database programming: fun with triggers
Summarizing previous posts about defensive database programming
Defensive database programming: SET vs. SELECT.
Stress
testing UPSERTs
Defensive
database programming: fun with UPDATE.
Defensive
database programming: eliminating IF statements.
Defensive
database programming: fun with changing column widths.
Avoid
mixing old and new styles of error handling.
Defensive
database programming: adding ESCAPE clauses.
Defensive
database programming: qualifying column names.
Defensive
database programming: rewriting queries with NOT IN().
|
-
Suppose that you need to store teams of players, and you must enforce the following business rule: each team must consist of exactly two players. I will demonstrate how you can use constraints to implement this rule. I haven't used this approach in production yet, but I wanted to share an interesting idea. Of course we typically use triggers or stored procedures to implement such rules, but you can do it with constraints too, and it is worth mentioning at least as a brainteaser. Detailed comparison of pros and contras of different approaches is beyond the scope of this post.
Setting up the tables
CREATE TABLE dbo.Teams(TeamID INT NOT NULL PRIMARY KEY);
GO
INSERT INTO dbo.Teams(TeamID) SELECT 1 UNION ALL SELECT 2;
GO
CREATE TABLE dbo.Players(PlayerID INT NOT NULL PRIMARY KEY,
TeamID INT NOT NULL FOREIGN KEY REFERENCES dbo.Teams(TeamID),
NumberInTeam INT NOT NULL CHECK(NumberInTeam IN (1,2)),
TeamMateID INT NOT NULL,
TeamMatesNumberInTeam INT NOT NULL,
-- if NumberInTeam=1 then TeamMatesNumberInTeam must be 2
-- and vise versa
CHECK(NumberInTeam+TeamMatesNumberInTeam = 3),
UNIQUE(TeamID, NumberInTeam),
UNIQUE(PlayerID, TeamID, NumberInTeam),
FOREIGN KEY(TeamMateID, TeamID, TeamMatesNumberInTeam)
REFERENCES dbo.Players(PlayerID, TeamID, NumberInTeam)
);
You can only insert one or more whole teams: INSERT INTO dbo.Players(PlayerID, TeamID, NumberInTeam, TeamMateID, TeamMatesNumberInTeam)
SELECT 1,1,1,2,2 UNION ALL
SELECT 2,1,2,1,1;
You cannot insert a single player. Nor you can DELETE a single player, or UPDATE a player transferring to another team. Once the whole team is inserted, it can only be deleted as a whole team, or both players can at once transfer to other team(s). You try out and see for yourself. More complex cases. You can easily extend this approach to have teams of different sizes. For example, you can add columns TeamSizeFrom and TeamSizeTo to your Teams table, and make sure that all teams have the required sizes (different for different teams), all only with constraints. If you are running 2008, MERGE significantly simplifies modifications. Prior to 2008 in many cases all you can do is delete the whole team and reinsert the changed one. P.S. I am leaving for a vacation soon, so I will not be responding to questions for a few weeks or so.
|
-
In some cases some of the columns involved in a query are highly correlated. If you manage to communicate to the optimizer that valuable information, it may come up with a more efficient plan. For example, consider the following table (the script that populates it is at the end of this post): CREATE TABLE dbo.Events(EventID INT NOT NULL PRIMARY KEY,
EventTime DATETIME NOT NULL,
SomeMoreData CHAR(10)
);
Suppose that your system inserts events one by one, that EventID keeps increasing as your system keeps inserting, and that EventTime also keeps increasing most of the time, with possible minor fluctuations caused by delays between the client and the server. Clearly EventID and EventDate are highly correlated in this scenario. Consider the following queries: DECLARE @dfrom DATETIME, @dto DATETIME;
SELECT @dfrom = '20050102 12:34:56', @dto = '20050103 12:34:56';
SELECT COUNT(*) FROM(
SELECT EventID, EventTime, SomeMoreData FROM dbo.Events
WHERE EventTime BETWEEN @dfrom AND @dto
AND SomeMoreData > '1'
)AS t;
SELECT COUNT(*) FROM(
SELECT EventID, EventTime, SomeMoreData FROM dbo.Events
WHERE EventTime BETWEEN '20050102 12:34:56' AND '20050103 12:34:56'
AND SomeMoreData > '1'
)AS t;
Table 'Events'. Scan count 9, logical reads 4088, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 107 ms, elapsed time = 21 ms.
In this case more than 8% or rows meet the criteria (the script that populates test data is at the end of this post, you can run it and see for yourself), so both queries are executed as clustered index scans. However, because of the typical way the rows are inserted into the table, the rows that actually meet the criteria are very much clustered together - most pages do not contain any rows meeting the criteria, while on some pages most or all rows are selected. Communicating this information to the optimizer is easy and the query runs much faster, because it scans just a range of the clustered index: DECLARE @dfrom DATETIME, @dto DATETIME;
SELECT @dfrom = '20050102 12:34:56', @dto = '20050103 12:34:56';
SELECT COUNT(*) FROM(
SELECT EventID, EventTime, SomeMoreData FROM dbo.Events
WHERE EventTime BETWEEN @dfrom AND @dto
AND EventID BETWEEN (SELECT MIN(EventID) FROM dbo.Events AS e1 WHERE e1.EventTime = @dfrom)
AND (SELECT MAX(EventID) FROM dbo.Events AS e1 WHERE e1.EventTime = @dto)
AND SomeMoreData > '1'
)AS t;
Table 'Events'. Scan count 3, logical reads 848, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 31 ms, elapsed time = 36 ms.
Of course, if the assumption that EventID and EventDate are highly correlated is no longer true, this query will no longer be a better choice. Setting up test data CREATE TABLE dbo.Numbers(n INT NOT NULL PRIMARY KEY)
GO
DECLARE @i INT;
SET @i = 1;
INSERT INTO dbo.Numbers(n) SELECT 1;
WHILE @i<1024000 BEGIN
INSERT INTO dbo.Numbers(n)
SELECT n + @i FROM dbo.Numbers;
SET @i = @i * 2;
END;
GO
DROP TABLE dbo.Events
GO
CREATE TABLE dbo.Events(EventID INT NOT NULL PRIMARY KEY,
EventTime DATETIME NOT NULL,
SomeMoreData CHAR(10)
);
GO
INSERT INTO dbo.Events(EventID, EventTime, SomeMoreData)
SELECT n, DATEADD(second, n, '20050101'), 'some data'
FROM dbo.Numbers;
GO
-- reshuffle EventTime just a little bit
UPDATE dbo.Events SET EventTime = DATEADD(second, EventID%20, EventTime);
GO
CREATE INDEX Events_EventTime ON dbo.Events(EventTime);
|
-
Recently I encountered a case when I knew much more about the data
than the optimizer. Originally the performance was horrible, this is
why I had to have a look at the query in the first place. When I was
able to share my knowledge with the optimizer, it produced a better
plan, and the query ran dramatically faster.
The slow query
The following tables store one-munite commercials for every minut
for one year, and customer calls, one call per minute, for the same
year. The scripts that populate tables with test data are provided at
the end of this post. Here are the tables:
CREATE TABLE dbo.Commercials(
StartedAt DATETIME NOT NULL
CONSTRAINT PK_Commercials PRIMARY KEY,
EndedAt DATETIME NOT NULL,
CommercialName VARCHAR(30) NOT NULL);
GO
CREATE TABLE dbo.Calls(CallID INT
CONSTRAINT PK_Calls NOT NULL PRIMARY KEY,
AirTime DATETIME NOT NULL,
SomeInfo CHAR(300));
GO
CREATE UNIQUE INDEX Calls_AirTime
ON dbo.Calls(AirTime) INCLUDE(SomeInfo);
GO
Every commercial in my table lasts for at most one minute, and they
do not overlap. I can easily enforce both conditions with constraints (
Storing intervals of time with no overlaps"
), which are omitted in this post just to keep it simple.
The following query retrieves only 181 rows, and it runs very slowly:
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c
ON c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0,
read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob
read-ahead reads 0. Table 'Worktable'. Scan count 2, logical reads 3338264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table
'Commercials'. Scan count 2, logical reads 7166, physical reads 0,
read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob
read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads
0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob
physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 71704 ms, elapsed time = 36316 ms.
Why is it so slow? I haven't mastered the fine art of adding images
to my posts yet, so I have to explain verbally. For every call the DB
engine scans all the commercials which begin before the time of the
call, which is expensive. The reason is simple: the optimizer does not
know that the commercials are short, and that the commercials do not
overlap, so it must scan all the potential matches, which are all the
commercials which begin before the time of the call.
Using CROSS APPLY to tell the optimizer that commercials do not overlap.
Because commercials do not overlap, we need at most one match.
Translating this information into plain SQL is easy, and the query runs
dramatically faster:
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Calls c CROSS APPLY(
SELECT TOP 1 s.StartedAt, s.EndedAt FROM dbo.Commercials s
WHERE c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
ORDER BY s.StartedAt DESC) AS s
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
Table 'Commercials'. Scan count 181, logical reads 1327, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table
'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead
reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead
reads 0.
SQL Server Execution Times: CPU time = 31 ms, elapsed time = 31 ms.
Note: if you needed only one column from Commercials
table, you could easily use just a subquery. Because more than one
column is needed, CROSS APPLY is a better, a more performant choice choice without redundant code.
Note: If you are using the assumption that the commercials do not overlap, you have to enforce that business rule in the database. Also to make sure that you don't forget that your query relies on that assumption, use a unit test do document it.
Also let me put it differently: If you are using the assumption that the commercials do not overlap, use a unit test do document it, so that that you don't forget that your query relies on that assumption. Also you have to enforce that business rule in the database.
Using another range condition to tell the optimizer that commercials are short.
Because commercials are short, there is no need to scan the
commercials that start more than maximum commercial's length before the
call. Again, translating this information into plain SQL is quite easy
too, and again the query runs much faster, even faster than the
previous one:
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c
ON c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
AND s.StartedAt BETWEEN '20080630 23:45' AND '20080701 03:00'
Table 'Worktable'. Scan count 1, logical reads 753, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table
'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead
reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead
reads 0. Table 'Commercials'. Scan count 1, logical reads 4,
physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical
reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 31 ms, elapsed time = 24 ms.
Note: If you are using the assumption that the commercials are short, you have to enforce that business rule in the database.
Also to make sure that you don't forget that your query relies on that
assumption, use a unit test do document it. Also let me put it differently: If you are using the assumption that the commercials are short, use a unit test do document that.
Also you have to enforce that business rule in the database.
Setting up tables and test data
CREATE TABLE dbo.Numbers(n INT NOT NULL PRIMARY KEY)
GO
DECLARE @i INT;
SET @i = 1;
INSERT INTO dbo.Numbers(n) SELECT 1;
WHILE @i<1024000 BEGIN
INSERT INTO dbo.Numbers(n)
SELECT n + @i FROM dbo.Numbers;
SET @i = @i * 2;
END;
GO
INSERT INTO dbo.Commercials(StartedAt, EndedAt, CommercialName)
SELECT DATEADD(minute, n - 1, '20080101')
,DATEADD(minute, n, '20080101')
,'Show #'+CAST(n AS VARCHAR(6))
FROM dbo.Numbers
WHERE n<=24*365*60;
GO
INSERT INTO dbo.Calls(CallID,
AirTime,
SomeInfo)
SELECT n
,DATEADD(minute, n - 1, '20080101')
,'Call during Commercial #'+CAST(n AS VARCHAR(6))
FROM dbo.Numbers
WHERE n<=24*365*60;
GO
|
|
|
|
|
|