
Linchi Shea

Checking out SQL Server via empirical data points

Nondeterministic UPDATE

Under the description for UPDATE in SQL Server 2000 and 2005 Books Online, you can find the following statement (thanks to SQL Server MVP Steve Kass for pointing me to this passage):

The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is if the UPDATE statement is not deterministic.

BOL goes on to give a simple example of such an UPDATE. However, the example doesn't give you a real feel for the nondeterministic behavior. Although the result is supposed to be undefined, and indeed you can't know in advance which value UPDATE will end up assigning, in practice the result comes out the same no matter how many times you execute it. At least in my limited test runs, UPDATE always seemed to pick the first value. But of course you can't count on that, because a different execution plan may pick a different value.

Still, it would be nice to see a nondeterministic UPDATE statement actually pick a different value, to drive home that you should not use it in your app, unless of course your app logic wants to take advantage of this particular non-determinism (which is difficult to imagine).
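To make the ambiguity concrete, here is a minimal sketch (my own toy tables, not the example from BOL): two rows in the source match a single row in the target, so whichever value ends up in the target is anybody's guess.

CREATE TABLE tgt (id int, val int NULL)
CREATE TABLE src (id int, val int NOT NULL)
go
INSERT tgt VALUES (1, NULL)
INSERT src VALUES (1, 10)
INSERT src VALUES (1, 20)
go
UPDATE tgt
   SET val = src.val  -- two candidate values (10 and 20); the result is undefined
  FROM tgt, src
 WHERE tgt.id = src.id
go
SELECT val FROM tgt   -- 10 or 20, depending on how the plan reads src
go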

Recently, however, I happened to run into an interesting case where the nondeterministic behavior of UPDATE can be easily reproduced on a multi-core server when query parallelism is used. Here's the script to see the behavior.

First, run the following script to create two tables and a stored procedure in any user database on a SQL Server 2005 instance:

CREATE TABLE t1 (
	c1 int         NULL,
	c2 varchar(50) NULL,
	rate int       NULL
)
go
CREATE INDEX c1_ind ON t1 (c1)
go

CREATE TABLE t2 (
	c1 int           NULL,
	c2 varchar(50)   NULL,
	rate int         NOT NULL
)
go

-- drop the proc only if it already exists (avoids an error on the first run)
if object_id('testProc') is not null
	drop proc testProc
go
create proc testProc
as
set nocount on
begin tran
  UPDATE t1
     SET rate = t2.rate
    FROM t1 With (Index (c1_ind)), t2
   WHERE t1.c1 = t2.c1
     and t1.c2 = t2.c2

  SELECT 'count_0' = count(*) FROM t1 WHERE rate = 0;
  SELECT 'count_1' = count(*) FROM t1 WHERE rate = 1;
  SELECT 'rate_sum' = sum(rate) FROM t1
rollback tran
go

Second, populate the two tables with data using the following script:

WITH tmp (c1, c2,c3) as (
    SELECT 1,1,1
    UNION ALL
    SELECT cast(rand(c3)*1000000 as int)%200, 
           cast(rand(c3)*1000000 as int)%200, 
           c3 + 1
      FROM tmp
     WHERE c3 < 100000
)
INSERT t1
SELECT c1, CAST(c2 as varchar(50)), NULL
  FROM tmp
OPTION (maxrecursion 0);

WITH tmp (c1, c2, c3) as (
    SELECT 0,0,0
    UNION ALL
    SELECT c1 + 1, c2 + 1, c3 + 1
      FROM tmp
     WHERE c1 < 200
)
INSERT t2
SELECT cast(c1 as char(6)), 
       cast(c2 as varchar(50)), 
       0                         -- c3 value = 0
  FROM tmp
OPTION (maxrecursion 0);

-- give each c1 value a second c3 value = 1
INSERT t2
SELECT c1, c2, 1
  FROM t2;
go
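As an optional sanity check of my own (not part of the original scripts), you can verify that every (c1, c2) pair in t2 carries both rate values, 0 and 1, which is exactly the ambiguity the UPDATE in testProc has to resolve:

-- expect no rows back: each (c1, c2) pair should appear twice, with rate 0 and rate 1
SELECT c1, c2, count(*) AS cnt, min(rate) AS min_rate, max(rate) AS max_rate
  FROM t2
 GROUP BY c1, c2
HAVING count(*) <> 2 OR min(rate) <> 0 OR max(rate) <> 1;
go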

Finally, run the stored procedure multiple times to see that different results are produced:

EXEC testProc;
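For instance, from SSMS or sqlcmd you can use the client-side batch repeat count to run it several times in one go and compare the three result sets:

EXEC testProc;
go 5  -- client-side shorthand: run the batch five times

When the UPDATE gets a parallel plan, the split between count_0 and count_1 (and therefore rate_sum) should vary from one execution to the next, while count_0 + count_1 stays constant, since every matched row receives either 0 or 1.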

Apparently, nondeterminism in query processing gives rise to the different results. A few observations are noteworthy:

  • The nondeterministic behavior is not the result of a different query execution plan. At least, the reported execution plan remained the same for all my test runs.
  • If you disable query parallelism by setting the max degree of parallelism to 1, the stored procedure produces the same results every time (see the sketch after this list).
  • On a SQL Server 2000 instance, you probably won't get a parallel plan, and thus won't see different results.
  • The size of the table t1 is significant. On my test server, populating table t1 with 70,000 rows still leads to a parallel query plan, and thus different results across multiple executions of the stored procedure. But if I populate the table with only 50,000 rows, I don't get query parallelism, and I get consistent results. So the SQL Server query optimizer is sensitive to the table size in deciding whether to use a parallel plan.
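To check the parallelism point without touching the server-wide max degree of parallelism setting, a MAXDOP hint on the UPDATE inside the procedure works just as well; here is a sketch of the relevant change (not the original code):

  UPDATE t1
     SET rate = t2.rate
    FROM t1 With (Index (c1_ind)), t2
   WHERE t1.c1 = t2.c1
     and t1.c2 = t2.c2
  OPTION (MAXDOP 1)  -- force a serial plan; the results become repeatable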

 

Published Tuesday, November 13, 2007 3:15 PM by Linchi Shea


Comments

 

Uri Dimant said:

Hi Linchi

This is another script written by David Portas.  

CREATE TABLE Countries (
    countryname VARCHAR(20) NOT NULL PRIMARY KEY,
    capitalcity VARCHAR(20)
);

CREATE TABLE Cities (
    cityname    VARCHAR(20) NOT NULL,
    countryname VARCHAR(20) NOT NULL
                REFERENCES Countries (countryname),
    CONSTRAINT PK_Cities PRIMARY KEY (cityname, countryname)
);

INSERT INTO Countries (countryname, capitalcity) VALUES ('USA', NULL);
INSERT INTO Countries (countryname, capitalcity) VALUES ('UK', NULL);

INSERT INTO Cities VALUES ('Washington', 'USA');
INSERT INTO Cities VALUES ('London', 'UK');
INSERT INTO Cities VALUES ('Manchester', 'UK');

The MS syntax makes it all too easy for the developer to slip up by writing ambiguous UPDATE...FROM statements where the JOIN criteria are not unique on the right side of the join.

Try these two identical UPDATE statements with a small change to the primary key in between.

UPDATE Countries
   SET capitalcity = cityname
  FROM Countries JOIN Cities /* evil UPDATE... FROM syntax */
    ON Countries.countryname = Cities.countryname;

SELECT * FROM Countries;

ALTER TABLE Cities DROP CONSTRAINT PK_Cities;
ALTER TABLE Cities ADD CONSTRAINT PK_Cities PRIMARY KEY (countryname, cityname);

UPDATE Countries
   SET capitalcity = cityname
  FROM Countries JOIN Cities /* don't do this! */
    ON Countries.countryname = Cities.countryname;

SELECT * FROM Countries;

You get this from the first SELECT statement:

countryname          capitalcity
-------------------- --------------------
UK                   London
USA                  Washington

and this from the second:

countryname          capitalcity
-------------------- --------------------
UK                   Manchester
USA                  Washington

(though these results aren't guaranteed - that's part of the problem).

Why did the result change? The physical implementation has affected the meaning of the code, with serious, potentially disastrous consequences. How can you even test your code if its results are subject to change due to the vagaries of storage, indexing and caching?

With the ANSI syntax there is no ambiguity. The UPDATE statement compels the programmer to design an unambiguous assignment subquery that returns no more than a single value.

UPDATE Countries
   SET capitalcity = (SELECT MIN(cityname)
                        FROM Cities
                       WHERE Countries.countryname = Cities.countryname);

At the very least this forces the developer to reconsider whether the UPDATE statement makes logical sense. You might want to make an effort to learn Standard SQL instead of a dialect that can change at any time, which will not port, cannot be understood by other programmers, etc.

November 13, 2007 11:36 PM
 

Linchi Shea said:

Thanks Uri for posting David's script. It's a good example. One notable difference is that in David's example, the query plan may have changed with the PK change, whereas in my script, you don't even need to make any change to get a different result!

November 14, 2007 12:43 AM

