THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Kalen Delaney

Geek City: Join With Me!

I remember one of the most surprising changes in SQL Server 2000 2005 was how the graphical plans showed the use of a nonclustered index to seek, and the plan included something that looked like a JOIN to find rows in the base table.  Here’s an example. Although I used SQL Server 2008R2, the graphical plan will be pretty similar to what it looked like back in SQL Server 2000 2005. My code will make a copy of a table in the AdventureWorks2008 database, and then build an index on one of the columns.

USE AdventureWorks2008;
GO
IF object_id('dbo.Sales') IS NOT NULL
    DROP TABLE dbo.Sales;
GO
SELECT * INTO dbo.Sales FROM Sales.SalesOrderHeader;
GO
CREATE INDEX Sales_SalesPersonID_index on dbo.Sales(SalesPersonID);
GO

Now look at the estimated graphical execution plan for this query, that searches for a particular value for the SalesPersonID column:

SELECT * FROM dbo.Sales
WHERE SalesPersonID = 280;
GO

You should see something like this:

image

It shows a JOIN even though there is only one table involved! What’s with that? It took me a few minutes to figure out what I was seeing, and I must admit that over the years since then I have come to appreciate this graphical representation. You’ll have a problem if you think of a JOIN as only finding matching rows between two tables, but can be solved if you realize that a JOIN can be used when finding matches between any two ROWSETS, i.e. any two sets of rows. There is a set of rows in the nonclustered index on SalesPersonID, and a set of rows in the table. The index seek finds all the index rows with a SalesPersonID value of 280, but those index rows only contain an index key value and a row pointer (RowID or RID). Because the query is requesting all the columns in the table, not just the SalesPersonID value,  SQL Server must find the rows in the dbo.Sales table that have a matching value for the RowID. Internally, it is doing a join, even though there is only one table. It is joining index rows with table rows.

If the table had a clustered index, the plan would look almost exactly the same, with one difference. In SQL Server, if a table has a clustered index, nonclustered index rows do not contain RowIDs, they contain the clustered key value to be used a pointer to the row. SQL Server can take this clustered key from the nonclustered index row and look up in the clustered index to find the desired row. So I create a clustered index on SalesOrderNumber:

CREATE UNIQUE CLUSTERED INDEX Sales_ident on dbo.Sales(SalesOrderNumber);
GO

Now when I run the same query as above, I get the following plan:

image

The only difference in the two plans above is the icon for the lookup into the base table.  One is looking up into a heap using a RowID, which is an address containing the File ID, the page number and the slot or row number on the page. The other icon represents taking a key value from the nonclustered index row and looking it up by seeking through the clustered index. If you look again at these two icons, which one looks like a picture of directly addressing a row, and which looks like a picture of following another index into the table itself?

image

The icon on the left is used for a RID lookup, and the icon on the right is used for a clustered index key lookup. I know that nobody asked me, but I think they got these backwards.

But wait, there’s more…

Prior to SQL Server 7, there was basically a rule of thumb with query plans that SQL Server would use one index (at most) per table per query. It could never (well, hardly ever) use two indexes on the same table in a single query. That changed in SQL Server 7 when a new algorithm called Index Intersection was introduced.

I’m going to build another index on SalesOrderDate:

CREATE INDEX Sales_OrderDate_index on dbo.Sales(OrderDate);
GO

This index will be used if I look for all orders place on July 1, 2002. The query plan for the following query is shown:


SELECT * FROM dbo.Sales
WHERE OrderDate = '2002-07-01';

image

We saw that the index on SalesPersonID is useful when looking for the value 280, and the index on SalesOrderDate is useful when looking for '2002-07-01'. But what if one query looks for rows that satisfy both conditions?

SELECT * FROM dbo.Sales
WHERE OrderDate = '2002-07-01'
AND SalesPersonID = 280

Here’s the plan:

image

We see two JOINs ! First, SQL Server finds the sets of index rows meeting each of the WHERE filter conditions and those two index row sets are joined together, using the clustered key value as the join column. Then only after SQL Server finds rows that meets both conditions, it takes the common clustered index key and uses it to find the row in the underlying table.

But wait, there’s more!

Both of these queries use a table scan, because the filter conditions are not selective enough. There are way more rows with a SalesPersonID value of 289 than there were for a value of 280, and there are more rows with an OrderDate of March 1, 2004 than there are with an OrderDate of July 1, 2002.

SELECT * FROM dbo.Sales
WHERE OrderDate = '2004-03-01' ;
GO

SELECT * FROM dbo.Sales
WHERE SalesPersonID = 289;
GO

Both of the above queries show the same plan. Because the table now has a clustered index, the plan shows a clustered index scan, but it is exactly the same work as a table scan:

image

However, if you look at the plan for a query that uses both filter conditions, we don’t get a clustered index scan.

SELECT * FROM dbo.Sales
WHERE OrderDate = '2004-03-01'
AND SalesPersonID = 289;

image

Even though each filter alone is not very restrictive, they can both be used for index intersection. The reason SQL Server usually doesn’t choose index seeks on non-restrictive filters is because the cost of then accessing the base table for all the qualifying rows can be very high. But even though each individual index returns a lot of rows, the optimizer’s estimate of the number of rows that meet BOTH conditions is very small, and therefore the number of lookups into the base table is quite small and affordable.

So, what can you learn? Just because you think an column might not be selective enough to build a nonclustered index on, don’t discount it.  If that column will frequently be used in queries with multiple conditions, it just might turn out to be more useful than you thought!

So make sure you test many possible choices of indexes and combinations of indexes before you know what the best choices are.

Have fun!

~Kalen

Published Thursday, February 14, 2013 3:16 PM by Kalen Delaney

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Rob Farley said:

Hi Kalen,

Nice post. Reminded me a little of one of mine from December.

http://sqlblog.com/blogs/rob_farley/archive/2012/12/11/joins-in-single-table-queries.aspx

Great to know that we get our interest piqued by similar things. :)

Rob

February 14, 2013 7:18 PM
 

Adam Machanic said:

Hi Kalen,

Just to set the record straight, the graphical plan change came in 2005. In 2000 it still looked like this:

http://blogs.msdn.com/cfs-filesystemfile.ashx/__key/communityserver-components-postattachments/00-00-65-26-24/bookmark_5F00_lookup_5F00_2000.PNG

(Image from Craig Freedman's blog post in which he described the 2005 change -- which is one of the best plan changes ever made, in my humble opinion: http://blogs.msdn.com/b/craigfr/archive/2006/06/30/652639.aspx )

February 15, 2013 9:53 AM
 

Kalen Delaney said:

Aack! I thought about this long and hard, but I admit, I didn't try looking it up. I thought the change happened before XML plans were introduced, which was in 2005. And I think it's an awesome change too... when I said in the post that I have come to appreciate it, I meant that I REALLY love it!  

Thanks Adam....

February 15, 2013 12:09 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Favorite Non-technical Sites or Blogs

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement