THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Paul White: Page Free Space

A technical SQL Server blog from New Zealand. See also my articles on SQLperformance.com

Undocumented Query Plans: Equality Comparisons

The diagram below shows two data sets, with differences highlighted:

image

To find changed rows using TSQL, we might write a query like this:

image

The logic is clear: join rows from the two sets together on the primary key column, and return rows where a change has occurred in one or more data columns.  Unfortunately, this query only finds one of the expected four rows:

image

The problem, of course, is that our query does not correctly handle NULLs.  The ‘not equal to’ operators <> and != do not evaluate to TRUE if one or both of the expressions concerned is NULL.  In this example, that statement is always true because we are comparing column references.  In other circumstances, the behaviour might depend on the ANSI_NULLS setting.  I am not going to go into that here, partly because new code should be written to assume ANSI_NULLS ON:

image

One obvious way to handle the NULLs in our working example is to write all the conditions out in full:

image

That produces the correct result (only the row with PK = 4 is identical in both sets):

image

Even with just three columns to check, the query is already getting quite long.  It is also quite easy to miss one of the combinations or to misplace a bracket.  In an attempt to reduce the size of the query, we might be tempted to use ISNULL or COALESCE:

image

The idea there is to replace any NULLs in the data with a particular value that allows the comparison to work as we expect.  This produces the correct result (with the test data) but there are a number of reasons to dislike this approach.  For one thing, we have to be careful to choose a special replacement value that can never appear in the real data, now or at any time in the future.  Taking this approach is no different, in principle, from choosing not to store NULLs in the first place, and using the chosen special value as a default instead.

Leaving the ‘to NULL or not to NULL’ debate to one side, the other issue is that COALESCE returns a value typed according to the input expression that has the highest data type precedence.  Many times, this will not matter, but it is possible to introduce subtle bugs this way.  Using ISNULL instead avoids this issue by ensuring that the data type of the first expression is used, but the problem of choosing appropriate special values remains.  The final objection I have to this method is a bit more subjective: although the query looks simpler than before, COALESCE is just shorthand for a CASE expression.  Let’s compare the query plans for the two queries:

image

That’s the plan for the query with all the conditions written out in full.  The Filter contains this mess:

SNAGHTML7ce46f7

The COALESCE form query plan looks like this:

image

Now, the complexity is split between two operators.  The Compute Scalar contains the definitions shown below on the left, and the Clustered Index Seek contains the residual predicate shown on the right (click the image to enlarge it):

SNAGHTML7d383a9

Using ISNULL instead makes some difference – the graphical plan is visually identical to that obtained by using COALESCE, but the defined values and residual predicate are somewhat more concise:

SNAGHTML7d89056

An Alternative Syntax

We are looking for rows that join based on the value of the PK column, but which contain a difference in at least one column.  Another way to state that is to say: for each row that joins, the intersection of the two rows should be empty.  If the two rows are identical the intersection of the two rows will be a single row; conversely if the two rows are different, the intersection will be empty.  Writing that logic in TSQL results in this query form:

image

The query accounts for NULLs correctly, and produces the correct result.  The query plan looks like this:

image

There are no surprises in the Clustered Index Scan, Clustered Index Seek, or Inner Join.  In particular, none of these operators define any expressions or apply any residual predicates.  The seek is the simple one we would expect: it seeks to the row with the matching PK column value.  Looking at the Constant Scan reveals nothing, literally.  This operator produces a single row with no columns, using an in-memory query processor construct.  There really is nothing to see there, so we will move on to the last remaining operator: the Left Anti Semi Join.  If you were expecting to see complex logic similar to the CASE expressions seen earlier, prepare for a disappointment.  The anti semi join contains the following:

SNAGHTML7f1af99

Aside from a (redundant) check that the two PK columns match, this predicate just checks that all three data column values match, using a simple equality (=) comparison.  (As a side note, we can avoid the redundant check on PK values by specifying just the data columns in the INTERSECT sub-query, rather than using the star syntax).

To see how this works, consider that a left anti semi join passes rows through where no row is found on the inner input.  In this case, a row (with no columns) is always provided by the Constant Scan, but the predicate shown above is applied to it before the anti semi join decides whether to pass the source row on or not.  If all the conditions evaluate to TRUE, the no-column row from the Constant Scan is retained, the anti semi join finds that row on its inner input, and the source row does not pass on to the query output.

The net effect is that if the two rows match in all columns, no row appears on the output.  In all other cases (where at least one difference exists) the current row is returned by the query plan.  This is the correct semantic, so the query returns the correct result. 

NULL handling trickery

At this point, you might be wondering how this query plan manages to handle NULLs correctly.  Consider the rows in the source tables with PK = 4.  Both rows are identical, but only if the NULLs present in the ival column compare equal.  The relevant part of the predicate shown above is:

image

In the case where both columns contain NULL, we would expect this simple equality comparison to return UNKNOWN, not the TRUE needed to ensure that the anti semi join does not pass the source row to the output.  In other words, unless this equality comparison is doing something unusual, we would expect the query to incorrectly return a row for PK = 4 because the NULLs in the ival column should not compare equal.  The reason it works lies in the way INTERSECT handles NULLs.  According to the documentation:

image

That explains why the INTERSECT query form produces correct results, but it does not say how the query plan achieves this.  Before we see the details of that, let’s look at what happens if we try to write the query using the same logic as the INTERSECT query plan:

image

That query produces the exact same graphical query plan as the INTERSECT form:

image

Even the predicate on the Anti Semi Join is identical:

SNAGHTML811b17d

But, despite the identical plan, this new query produces the wrong results!  It includes the row with PK = 4 in the output, due to the problem comparing the NULLs in those rows:

image

The Answer

Although the graphical query plan (and even the extra detail available in the Properties window) shows no difference between the INTERSECT and NOT EXISTS query forms, there is a difference – one that implements the different comparison semantics involved.  In the INTERSECT form, the equality comparison must compare two NULLs as equal.  In the NOT EXISTS form, we are using a regular = comparison, one that should return UNKNOWN when two NULLs are compared.

To see this difference, we have to look deeper into the query plan than the graphical form or properties window can take us.  Inspecting the XML behind the graphical plan, we see the following logic in both cases for the test on the PK column values:

image

Notice the compare operation is EQ – a test for equality between the two column references.  The EQ test does not return true for NULLs.  In the NOT EXISTS form of the query, the other columns are compared in exactly the same way, using EQ comparisons.  For example, this is the test on the ival column:

image

Now look at the XML for the ival column comparison in the INTERSECT query:

image

Now the compare operation is shown as IS instead of EQ.  This is the reason that NULLs compare equal in the INTERSECT test – it is using the comparison semantic familiar to TSQL users from expressions like WHERE x IS NULL.  This is the SQL language IS DISTINCT FROM feature – implemented by the query processor, but not yet available in the TSQL language.  If you agree that IS DISTINCT FROM would be a useful addition to TSQL, you can vote for Steve Kass’ Connect item here.

For my part, I would like to see the internal language of the query processor exposed as an alternative to TSQL.  The query processor’s query specification language is much richer and more expressive than TSQL, and would not be bound by some of the bizarre behaviours maintained in TSQL for backward compatibility.  I live in hope, but I’m not holding my breath…

© Paul White 2011
email: SQLkiwi@gmail.com
twitter: @SQL_Kiwi

Test script:

DECLARE @Set1 TABLE
(
    pk        BIGINT PRIMARY KEY,
    ival    INTEGER NULL,
    cval    CHAR(1) NULL,
    mval        MONEY NULL
);
 
DECLARE @Set2 TABLE
(
    pk        BIGINT PRIMARY KEY,
    ival    INTEGER NULL,
    cval    CHAR(1) NULL,
    mval        MONEY NULL
);
 
INSERT @Set1
    (pk, ival, cval, mval)
VALUES 
    (1, 1000, 'a', $1),
    (2, NULL, 'b', $2),
    (3, 3000, 'c', NULL),
    (4, NULL, 'd', $4),
    (5, 5000, 'e', $5);
    
INSERT @Set2
    (pk, ival, cval, mval)
VALUES
    (1, 1000, 'a', NULL),
    (2, 2000, 'b', $2),
    (3, NULL, 'c', $3),
    (4, NULL, 'd', $4),
    (5, 5999, 'z', $5);
 
-- Incorrect results, doesn't account for NULLs
SELECT
    *
FROM @Set1 AS t
JOIN @Set2 AS s ON
    s.pk = t.pk
WHERE
    s.ival <> t.ival
    OR s.cval <> t.cval
    OR s.mval <> t.mval;
 
-- Correct, but verbose and error-prone    
SELECT
    *
FROM @Set1 AS t
JOIN @Set2 AS s ON
    s.pk = t.pk
WHERE
    s.ival <> t.ival
    OR (s.ival IS NULL AND t.ival IS NOT NULL)
    OR (s.ival IS NOT NULL AND t.ival IS NULL)
    OR s.cval <> t.cval
    OR (s.cval IS NULL AND t.cval IS NOT NULL)
    OR (s.cval IS NOT NULL AND t.cval IS NULL)
    OR s.mval <> t.mval
    OR (s.mval IS NULL AND t.mval IS NOT NULL)
    OR (s.mval IS NOT NULL AND t.mval IS NULL);
 
-- COALESCE: Correct results, but problematic
SELECT
    *
FROM @Set1 AS t
JOIN @Set2 AS s ON
    s.pk = t.pk
WHERE
    COALESCE(s.ival, -2147483648) <> COALESCE(t.ival, -2147483648)
    OR COALESCE(s.cval, '¥') <> COALESCE(t.cval, '¥')
    OR COALESCE(s.mval, $-922337203685477.5808 ) <> 
        COALESCE(t.mval, $-922337203685477.5808)
 
-- ISNULL: Correct results, but problematic
SELECT
    *
FROM @Set1 AS t
JOIN @Set2 AS s ON
    s.pk = t.pk
WHERE
    ISNULL(s.ival, -2147483648) <> ISNULL(t.ival, -2147483648)
    OR ISNULL(s.cval, '¥') <> ISNULL(t.cval, '¥')
    OR ISNULL(s.mval, $-922337203685477.5808 ) <> 
        ISNULL(t.mval, $-922337203685477.5808)
 
-- INTERSECT:
-- Correct results in a compact form
SELECT
    *
FROM @Set1 AS t
JOIN @Set2 AS s ON
    s.pk = t.pk
WHERE
    NOT EXISTS (SELECT s.* INTERSECT SELECT t.*)
 
-- NOT EXISTS:
-- Same query plan, but different results!
SELECT 
    *
FROM @Set2 AS s
JOIN @Set1 AS t ON 
    t.pk = s.pk
WHERE 
    NOT EXISTS 
    (
        SELECT 1
        WHERE 
            t.pk = s.pk
            AND t.ival = s.ival
            AND t.cval = s.cval
            AND t.mval = s.mval
    )
Published Wednesday, June 22, 2011 5:58 AM by Paul White

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Justin Kong said:

Awesome post Paul. You write some of the top post out there!

June 21, 2011 12:55 PM
 

Chris Wood said:

Paul,

I am looking forward to your PASS session in October.

Chris

June 21, 2011 1:52 PM
 

calin oprea said:

brilliant, as usual

June 21, 2011 2:00 PM
 

Arthur said:

Paul, thank you for a ton of useful information, especially on that the new version (implies "Denali" I guess) of the SQL Server will have the ANSI_NULLs turned ON by default, this is significant!

June 21, 2011 2:52 PM
 

Paul White said:

Hello Arthur,

The documentation just says that a 'future' version of SQL Server will not support ANSI_NULLS OFF.  I don't know whether that is planned to happen in Denali or not, but it does make sense to follow the advice and not use it for any current development work!

Thanks for the comments everyone :)

Paul

June 21, 2011 3:33 PM
 

Brad Schulz said:

Incredible post, as always... Thanks, Paul!

June 21, 2011 7:19 PM
 

Chuck Rummel said:

Excellent post, hadn't used INTERSECT much before, but this is a great addition to one's bag of tricks.

June 21, 2011 8:25 PM
 

Martin said:

On a kind of related note I've noticed before for an index seek with

the pattern "WHERE C=@C  OR (@C IS NULL AND C IS NULL)" that this just appears in the execution plan as a straight forward equality seek and the plan is indistinguishable from the plan "WHERE C=@C" except by looking at the query text.

I guess to be consistent the first one should really show

<SeekKeys><Prefix ScanType="IS">

Instead of both of them just showing

<SeekKeys><Prefix ScanType="EQ">

June 22, 2011 2:53 AM
 

Paul White said:

Hi Martin,

Yes, index seeks are 'special' and always use EQ.  This might be related to the fact that indexes originally could not seek to a NULL, but were later extended (I think it was SQL 2000?)

Your example query without a seek (no index present or FORCESCAN specified) produces a predicate on C = @C with comparison type IS (assuming ANSI_NULLS ON).  The transformation to a seek, e.g. by the rule SelIdxToRng changes that to an EQ seek.

It would be nice if show plan output showed seeks using IS rather than EQ when appropriate, or if whatever hidden extra attribute is present was surfaced.

Very nice point, thanks for commenting!

Paul

June 22, 2011 8:51 AM
 

Meher said:

Great post Paul. Thank you.

Meher

June 22, 2011 2:52 PM
 

Alejandro Mesa said:

Great post, Paul!

I hope Microsoft implements IS [NOT] DISTINCT FROM. It comes very handy when comparing nullable columns to other columns, variables, etc.

--

AMB

June 22, 2011 9:08 PM
 

Alan said:

You can also use the checksum with <> which will result in a NE in the query plan.

June 28, 2011 7:57 PM
 

Paul White said:

Hi Alan,

There are many ways to detect changes, but CHECKSUM isn't one of the better ones.  It's great for building hash indexes where you are looking for potential equality matches, but hash collisions mean that it is not reliable for detecting changes.  It's pretty easy to generate a collision, for example:

SELECT CHECKSUM('7D3EF1CB-6900-4065-B798-5E74D4BE3455' COLLATE LATIN1_GENERAL_CI_AI)

SELECT CHECKSUM('DDA51F25-B430-4E10-813A-9D88FC5A8CBC' COLLATE LATIN1_GENERAL_CI_AI)

...both produce the value 92232112, so that change would go undetected.  A function like HashBytes is generally preferred, though it can be difficult to handle empty strings and NULLs correctly.

Paul

June 28, 2011 10:45 PM
 

Alan said:

Didn't know that before. Luckily I have never tried it in ptoduction.

Thank you!

Alan

June 29, 2011 10:33 PM
 

Alan said:

Had another read on the BOL about checksum and I think it is neither good for checking change nor unchange. Looks like the best use of it is to generate the hashed index.

Also, I am a bit confused about the use of collate in your example above. If I remove the collate part, the statements return different checksum. Sure you don't need to specify the collation when you running the statement on the same query window and both guid will be in the same data type?

Thanks,

Alan

June 29, 2011 11:49 PM
 

Paul White said:

The semantic of CHECKSUM (for strings) is that it produces the same hash value for strings that compare equal.  Notice that the values given are string representations of GUIDs, not actually GUIDs.

Since 'compare equal' depends on collation, I included it in my example to be sure the checksum values match whatever the default collation happens to be in your database.  For example:

SELECT CHECKSUM('7D3EF1CB-6900-4065-B798-5E74D4BE3455' COLLATE SQL_Latin1_General_CP1_CI_AI)

SELECT CHECKSUM('DDA51F25-B430-4E10-813A-9D88FC5A8CBC' COLLATE SQL_Latin1_General_CP1_CI_AI)

...produces different values.  There will still be collisions whatever the collation, of course, just for different inputs :)

Paul

June 30, 2011 1:52 AM
 

Tom Thomson said:

An extrely useful tip - as always, Paul.

July 4, 2011 11:51 PM
 

Naomi said:

Very useful, thanks a lot.

August 24, 2011 11:27 AM
 

gbn said:

Hi

Changing the WHERE to "EXISTS (SELECT s.* EXCEPT SELECT t.*)" gives the same result and same execution plan

Do you foresee any problems with this?

Cheers

gbn

January 16, 2012 5:46 AM
 

Paul White said:

Hi gbn,

In my experience the INTERSECT form tends to give produce 'better' plans, because the optimizer more readily simplifies the logic into a simple anti-semi-join.

Using the test script, I do get a slightly different plan for the EXCEPT form (an extra anti-semi-join, constant scan, and a start-up filter).  This may be version-dependent (I used R2 build 2789).

Finally, I personally find the INTERSECT form to be more intuitive, but I can quite see that others might prefer to think of the problem the other way around.

Paul

January 16, 2012 6:20 AM

Leave a Comment

(required) 
(required) 
Submit
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement