THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

John Paul Cook

How Orphans Affect Semi Joins and Anti Semi Joins

When writing queries, we often can think of more than one way to solve a problem. There is danger in assuming that two different queries are logically equivalent when they return the same results. For example, just because a database has referential integrity doesn't mean that it was always present or always enforced over the life of the database. Sometimes a query implicitly assumes that all data is good - that no constraints have ever been violated.

The sample code creates three tables that logically have referential integrity. The child table is a child of the parent table and the parent table is a child of the grandparent table. To help you navigate from table to table, data values show the data lineage as described below.

  • parentId values 101 and 201 have a grandparentId of 1
  • childId values 10203 and 10203 have a parentId of 203 and a grandparentId of 3

Logically, although not actually enforced with foreign key constraints in the sample code, these foreign key relationships exist:

  • parent.grandparentId = grandparent.grandparentId
  • parent.parentId = child.parentId

Run the following three statements to create the tables:

create table grandparent (grandparentId int);
create table parent (parentId int, grandparentId int);
create table child (childId int, parentId int);

As the comments in the code indicate, you must choose the correct syntax for the inserts appropriate for your version of SQL Server. Of the 20 rows of child table rows inserted, 6 rows ultimately relate back to a grandparentId of 2.

/* Do only one set of inserts using the syntax appropriate for your server. */

/* SQL Server 2008 syntax */
insert into grandparent values (1),(2),(3);
insert into parent values (101,1),(201,1);
insert into parent values (102,2),(202,2),(302,2);
insert into parent values (103,3),(203,3),(303,3);
insert into child values (10101,101),(20101,101),(30101,101);
insert into child values (10201,201),(20201,201),(30201,201);
insert into child values (10102,102),(20102,102),(30102,102);
insert into child values (10302,302),(20302,302),(30302,302);
insert into child values (10103,103),(20103,103);
insert into child values (10203,203),(20203,203),(30203,203);
insert into child values (10303,303),(20303,303),(30303,303);
insert into child values (10301,301);
insert into child values (10401,401);

/* pre-SQL Server 2008 syntax */
--insert into grandparent values (1);
--insert into grandparent values (2);
--insert into grandparent values (3);
--insert into parent values (101,1);
--insert into parent values (201,1);
--insert into parent values (102,2);
--insert into parent values (202,2);
--insert into parent values (302,2);
--insert into parent values (103,3);
--insert into parent values (203,3);
--insert into parent values (303,3);
--insert into child values (10101,101);
--insert into child values (20101,101);
--insert into child values (30101,101);
--insert into child values (10201,201);
--insert into child values (20201,201);
--insert into child values (30201,201);
--insert into child values (10102,102);
--insert into child values (20102,102);
--insert into child values (30102,102);
--insert into child values (10302,302);
--insert into child values (20302,302);
--insert into child values (30302,302);
--insert into child values (10103,103);
--insert into child values (20103,103);
--insert into child values (10203,203);
--insert into child values (20203,203);
--insert into child values (30203,203);
--insert into child values (10303,303);
--insert into child values (20303,303);
--insert into child values (30303,303);

A Semi Join query is executed to find all of the child table rows that do not have a grandparentId of 2.

select * from child
where exists
(
    select *
    from parent
    inner join grandparent
    on parent.grandparentId = grandparent.grandparentId
    where grandparent.grandparentId <> 2
    and parent.parentId = child.parentId
);

An Anti Semi Join is used to return the exact same 14 row result set:

select * from child
where not exists
(
    select *
    from parent
    inner join grandparent
    on parent.grandparentId = grandparent.grandparentId
    where grandparent.grandparentId = 2
    and parent.parentId = child.parentId
);

If both queries return the same results and the results are correct (which they are), are both queries correct?

Foreign key constraints can be created and enabled as defined previously because there are no orphans. Since no foreign key constraints have been defined, we are free to insert orphans. Run these two statements to create two orphaned child rows:

insert into child values (10301,301);
insert into child values (10401,401);

Rerunning the Semi Join query returns the same original 14 row result set. Rerunning the Anti Semi Join query returns a 16 row result set that includes the two child rows. As a consultant, sometimes I encounter databases that have referential integrity and always intended to have referential integrity and thus shouldn't have any orphans, but they do. Failing to consider orphaned data can lead to unanticipated and even undesirable results. Certainly a query that yields undesirable results is incorrect.

As you've seen, when orphans are present, superficially equivalent queries can produce different results. Which result is correct is determined by your requirements. Now consider which query pattern is correct if the requirement is to delete all data that doesn't have a grandparentId of 2. The Anti Semi Join pattern is the correct choice for that requirement because it deletes the orphans, which the Semi Join does not. Here is the code to do this:

delete from child
where parentId not in
(
    select parentId
    from parent
    inner join grandparent
    on parent.grandparentId = grandparent.grandparentId
    where grandparent.grandparentId = 2
    and parent.parentId = child.parentId
);

The differences between the Semi Join and the Anti Semi Join can be taken advantage of to find orphans. By combining the queries with EXCEPT, the orphans are easily identified. Notice that the Anti Semi Join is first and EXCEPT is used to subtract the Semi Join's result set from the Anti Semi Join's result set. 

select * from child
where not exists
(
    select *
    from parent
    inner join grandparent
    on parent.grandparentId = grandparent.grandparentId
    where grandparent.grandparentId = 2
    and parent.parentId = child.parentId
)
EXCEPT
select * from child
where exists
(
    select *
    from parent
    inner join grandparent
    on parent.grandparentId = grandparent.grandparentId
    where grandparent.grandparentId <> 2
    and parent.parentId = child.parentId
);

When searching for orphans using this technique, I have a habit of doing the EXCEPT twice. I do this to check my work. I know which query to subtract from which, but sometimes I'll make a copy/paste mistake. By running EXCEPT both ways, I'll catch the careless error.

In case you're wondering about my style choice of Anti Semi Join instead of something like antisemijoin, it came from Microsoft's style choice used in the graphical query plan in SSMS.

Published Saturday, April 18, 2009 6:54 PM by John Paul Cook

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

No Comments

Leave a Comment

(required) 
(required) 
Submit

About John Paul Cook

John Paul Cook is a Technology Solutions Professional for Microsoft's data platform and works out of Microsoft's Houston office. Prior to joining Microsoft, he was a Microsoft SQL Server MVP. He is experienced in Microsoft SQL Server and Oracle database application design, development, and implementation. He has spoken at many conferences including Microsoft TechEd and the SQL PASS Summit. He has worked in oil and gas, financial, manufacturing, and healthcare industries. John is also a Registered Nurse who graduated from Vanderbilt University with a Master of Science in Nursing Informatics and is an active member of the Sigma Theta Tau nursing honor society. He volunteers as a nurse at safety net clinics. Contributing author to SQL Server MVP Deep Dives and SQL Server MVP Deep Dives Volume 2. Opinions expressed in John's blog are strictly his own and do not represent Microsoft in any way.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement