THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Tibor Karaszi

How selective do we need to be for an index to be used?

You know the answer already: It depends. But I often see some percentage value quoted and the point of this post is to show that there is no such percentage value.

To get the most out of this blog post, you should understand the basic structure for an index, i.e. how the b+ tree look like. You should also understand the difference between a clustered and a non-clustered index. In essence, you should be able to visualize these structures and searches through them as you read the text. If you find that difficult, draw a few versions on a piece of paper and "navigate" through them by tracing through them with a pen or your finger. After a while, you will do this in your mind. For instance, check out the sections under this.

I'm not saying that we shouldn't consider selectivity when designing indexes - of course we should! I'm not saying that one shouldn't have some vague "feeling" about how much data to be return when making such decisions. What I will prove is that there is in reality no set percentage that the optimizer uses. The comment we usually see is something like:

"If we return more than 5% of the rows, then an index will not be used."

Where did that 5% number came from? I can assure you that this is not some hard-wired number in the optimizer (except for an edge-case, see below). The optimizer aims at running the query with as low cost as possible. Considering the data access part (think WHERE clause and the condition), this is pretty much about reading as few pages as possible (few page-accesses).

Just to cut down a bit on the thread that might follow these types of blogs ("Hey, when I do this, your observations doesn't match, your blog post is incorrect!"), let us first consider some special cases:

Clustered index
The clustered index *is* the data. If the search condition (SARG) is SEEKable, then SQL Server will obviously seek through a clustered index instead of scan it. Anything else would be stupid. There can be *other* search conditions that are more efficient, but we are considering one search condition at-a-time.

Non-clustered index that covers the query
This is pretty much the same argument as for above. Since all data is in the index ("covers the query"), not seeking it would be stupid. Again, there can be cheaper alternatives for any of the other search conditions, but we are considering one condition at-a-time.

The value is not known to the optimizer
This is what happens when you have a TSQL variable you compare against. Something like "colname = @v". The optimizer has no knowledge of the contents of this variable. Either it uses density (where applicable, like "="), as stored in the statistics information of the index. Where not applicable (like ">", "<", "BETWEEN" etc), then the optimizer actually do use some hard-wired percentage value. This value can change between versions so give it a spin of you want to know what value you have for your version/build number. Note that a variable is not the same thing as a parameter. SQL Server sniffs parameters (parameter sniffing). Read this for elaboration: http://msdn.microsoft.com/en-us/library/ee343986.aspx.

The search expression is not seek-able
I hope you know this already, but just to point it out. In most cases, having some calculation at the column side will void the ability to seek through the index. This should ideally be known to all T-SQL developers: Never do calculations at the column side! So, things to avoid are like "colname * 23 > 45" or "SOMEFUNCTION(colname) = 44".

Hopefully by now we all understand that there are always special cases and exceptions. The more of Kalen's books you have read, the more you understand this. What we are discussing here is the typical situation. OK? Fine. So, "Why is there no percentage value that the optimizer uses?", you ask. Because the value will differ. In short, SQL Server wants to read as few pages as possible. In the most simple example, the alternative to an index seek is a table scan. So we will use this as basis for your discussion. There can be other alternatives to the table scan (using some other index for some other condition), but that doesn't change the principal "it depends" concept.

In essence, it is all about the alternative. As I said, our example wil use a table scan as alternative. A table scan (or clustered index scan if it is a clustered table) means that SQL Server will look at every page and see what rows satisfies the search condition on each page.

My example has two different tables, both with 100,000 rows. These tables both have an integer column with consecutive increasing unique values, which also has a non-clustered index. I will see how selective I need to be when searching on this column in order for an index search to be done, compared to a table scan. I.e, find this percentage cut-off value.

The fewrows table only fit one row per data page. This means 100,000 data pages. My tests show that the cutoff for fewrows is about 31,000 rows. I.e., we need to be more selective than 31% for the index to be used.

The manyrows table fit 384 rows per page. This means 260 pages. My tests show that the cutoff for fewrows is about 155 rows. I.e., we need to be more selective than 0.16% for the index to be used.

You might end up with different exact numbers, depending on what you have in the statistics, the build number of your SQL Server etc. But what you will see that a similar pattern. A huge difference between the two.

It is all about the alternative
If I look at my indexes using sys.dm_db_index_physical_stats, I will see that the non-clustered index on the int column for the two tables are exactly the same (same number of pages in the index, etc). So, two indexes with the same characteristics have very different cut-off values. How can that be? It is because the alternative differs. The alternative for this example is a table scan. For the bigrows table, the alternative means reading 100,000 pages. But for the smallrows table, the alternative means reading only 260 pages. There can of course be other alternatives, like using some other index for some other search condition. This is, in the end, why we don't have a set percentage value: it is all about the alternative!

Conclusion
The typical cases will of course fall somewhere between my more extreme examples. But my examples show that there is no set percentage value used by the optimizer. I showed that for my test, the percentage value can be as low as 0.15% or as high as 31%. What matter is the alternative!

T-SQL

USE tempdb
GO

IF OBJECT_ID('manyrows'IS NOT NULL DROP TABLE manyrows
IF OBJECT_ID('fewrows'IS NOT NULL DROP TABLE fewrows
GO

CREATE TABLE manyrows(c1 INT IDENTITY PRIMARY KEYc2 INTc3 INT)
CREATE TABLE fewrows(c1 INT IDENTITY PRIMARY KEYc2 INTc3 CHAR(4500))
GO

INSERT INTO manyrows
SELECT TOP 100000 ROW_NUMBER() OVER (ORDER BY a.OBJECT_IDAS c2AS c3
FROM sys.columns AS asys.columns AS b

INSERT INTO fewrows
SELECT TOP 100000 ROW_NUMBER() OVER (ORDER BY a.OBJECT_IDAS c2'hi' AS c3
FROM sys.columns AS asys.columns AS b

CREATE INDEX ON manyrows (c2)
CREATE INDEX ON fewrows (c2)

--Number of pages etc:
EXEC sp_indexinfo 'manyrows'
-- Data: 265 pages (2.07 MB)
-- Index x: 187 pages (1.46 MB)

EXEC sp_indexinfo 'fewrows'
-- Data: 100385 pages (784 MB)
-- Index x: 187 pages (1.46 MB)

SELECT OBJECT_NAME(OBJECT_ID), *, OBJECT_NAME(OBJECT_ID)
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED')

--Run below with showplan:
SELECT FROM manyrows 
WHERE c2 BETWEEN AND 155
--155 = ix search, 156 = ts
--Cut-off is 0.16%

SELECT FROM fewrows 
WHERE c2 BETWEEN AND 31000
--31000 = ix search, 32000 = ts
--Cut-off is 31%




Published Thursday, April 01, 2010 10:39 AM by TiborKaraszi
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

rudy said:

you forgot to 'splain the difference between seek, scan, and just plain have-a-look

April 1, 2010 8:10 AM
 

TiborKaraszi said:

Rudy,

No, I didn't forget that. I assumed that knowledge. All blog posts need to assume a certain level of knowledge, otherwise it wouldn't be possible to blog in the first place. For instamce, I used below for search in my search engine and came out with plenty of hits:

index seek table scan

April 1, 2010 8:39 AM
 

rudy said:

you know what happens when you assume...  ;o)

it's called begging the question, also known as "petitio principii"

if someone is already familiar with b-trees, seeks, and scans, how useful is a blog post on clustered and covering indexes going to be?

April 1, 2010 10:48 AM
 

Adam Machanic said:

Rudy,

No one is forcing you to read Tibor's material. If you feel it's too advanced for you, move along and read something else. Neither Tibor nor any other blogger can be expected to explain every basic point in every single post. Go take a course or something if you want to get up to speed.

April 1, 2010 11:11 PM
 

Klaus said:

Does he also need to explain what % means?

April 2, 2010 11:15 PM
 

AaronBertrand said:

Wow, what a ridiculous complaint about a thorough and insightful post.  From a Canadian, no less; shameful.

April 3, 2010 11:56 AM
 

David Walker said:

I just did some testing on an indexed bit field, with the distribution of 1's in the column at about half a percent.  It's a large table with half a million rows, and 8 GB of data in this table.

I would *not* have expected that an Index Seek is done regardless of whether I search for 0 or 1 in the table where the 1's make up half a percent of the values -- but that's what I saw (SQL 2008 SP1 (not R2)).

At least, the actual execution plan claims that it's doing an index seek on the bit column, regardless of whether I am searching for 1's or 0's.  Is this expected?

December 10, 2010 2:21 PM
 

David Walker said:

Followup: I meant to say 5 percent, not half a percent.  Arithmetic is complicated...

December 10, 2010 5:11 PM
 

TiborKaraszi said:

David, So we have two issues here.

1: The fact the index seek is selected when you return as much as 5%. Did you check execution plan and estimated rows vs actual rows? Also, did you compare number of pages (or some other cost measurement) when you compare to a table scan?

2: The fact that the index is used also when you search for the other value, of which you have 95%. This could either be explained by execution plan re-use or the index cover the query. Anything else would be weird.

Had to say anything more without actually playing with it. Of course, we also have things such as statistics, possible miscalculation etc.

December 11, 2010 5:18 AM
 

David Walker said:

I was looking at the actual execution plan, not the estimated erxecution plan.  From what I could tell, based on running tests with and without the index on the bitmap column, SQL was actually doing an index seek.  Very few actual reads were performed when the bitmap column was indexed.  This just seems different than all of the guidance available on the Web for index selectivity.

December 13, 2010 12:15 PM
 

TiborKaraszi said:

Looking at the actual plan is good, but you don't give us any hard numnbers to go on. Actual number of rows vs. estimated number of rows? Does the index cover the query? Etc. Basically, if you have few page accesses, and return many rows then either the index covers the query, or the index is a clustered index. If not, then you will have at least one page access per row.

December 13, 2010 12:50 PM
 

David Walker said:

I'll post some numbers here.  

December 13, 2010 4:00 PM
 

Kognjen said:

Can someone tell me why border (line) is exactly between 155-156 rows? Can this be calculate as ratio between number of pages, rows per page and number of rows which satisfy search condition...?

Thanks.

January 16, 2012 7:27 AM
 

TiborKaraszi said:

Kognjen,

Consider the execution plan, using the index. SQL Server will navigate the index, and for each row found, it will fetch that row. Each "fetch that row" in this case cost two pages (tables has a clutered index, this has depth 2, for each row navigate the clustered index = 2 pages). 155 rows means 155*2 pages = 310 pages. Plus traversing the index cost, which is about 20 pages. Sum is 330 pages. Remember that the root page for the clustered index will likely be in cache. So this will not be a physical I/O. So should we ignore the root page cost and instead of 330 pages calculate 175 pages? Remember that these are random I/O if physical I/O.

Now, scanning the data means looking at every data page which in this case is 262 pages. These are requential I/O when physical, meaning more efficient than random I/O.

As you probably understand, there is no one exact formula to determine which is more effective, and based on that where the cut-off should be. It has to be *somewhere* and this is where it is in current version. This has changed before and might change again. The principal is the important part here, "visualize the execution plan...". :-)

January 16, 2012 11:36 AM
 

Kognjen said:

To TiborKaraszi - thank you very much.

January 18, 2012 9:30 AM
 

rule30 said:

This was my take on this very subject:

http://rule30.wordpress.com/2014/03/30/sql-server-are-your-non-clustered-indexes-useless/

Thanks for a fantastic blog.

Best Regards

March 30, 2014 4:48 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement