
Elisabeth Redei

Lies, Damned Lies and Statistics – Part III (SQL Server 2008)

In previous posts (http://sqlblog.com/blogs/elisabeth_redei/archive/2009/08/10/lies-damned-lies-and-statistics-part-i.aspx and http://sqlblog.com/blogs/elisabeth_redei/archive/2009/03/01/lies-damned-lies-and-statistics-part-ii.aspx) I have talked about the performance problems that may surface because of a low statistics sample rate on large tables.

One of the things I wrote about is the issue of naturally skewed data in an indexed column. In this context, skewed data means that some values are represented more than others. One example is a "Country" column in an Address table, where an international company is likely to have more customers or employees in the U.K. than in, let's say, Finland. A "Lastname" column in a Customers table would typically hold skewed data as well – there are a lot more Smiths around than Rédeis, for example. Your SalesOrder table is bound to have more entries for some customers than for others, too.
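To make the skew easy to spot, a quick aggregate over the column in question shows the distribution. The Address table and Country column below are just the hypothetical examples from above:

-- a quick look at how skewed a column is (hypothetical Address/Country example)
SELECT Country, COUNT(*) AS NumRows
FROM Address
GROUP BY Country
ORDER BY NumRows DESC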

The solutions I listed – solutions that have been listed by many before me – were never really appealing to me, and consequently I was never good at persuading anyone to use them. This has always annoyed me.

SQL Server 2008 offers a great solution to this problem - filtered statistics. As you probably know, SQL Server 2008 offers you the ability to create filtered indexes, i.e. indexes created on a horizontal partition of the table (only the rows that match a filter predicate). But did you know you can create filtered statistics on an unfiltered index?
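To illustrate the difference with a quick sketch (the table and object names below are made up): a filtered index physically contains only the rows that match the predicate, whereas filtered statistics only restrict the statistics object – the column can still be covered by an ordinary, unfiltered index:

-- a filtered index: the index itself contains only the matching rows
CREATE NONCLUSTERED INDEX idxCountryFinland
    ON Address (Country)
WHERE Country = 'Finland'
GO

-- filtered statistics: only the statistics object is restricted; any index
-- on the column remains unfiltered
CREATE STATISTICS statCountryFinland
    ON Address (Country)
WHERE Country = 'Finland'
GO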

This solution is much more appealing simply because you implement it at the index level rather than at the individual query level. In addition, your data will change, and this solution can easily adapt to those changes. Another attractive perk is that it would be quite difficult to mess things up with this approach – unless you make a serious effort to!

Jason Massie has touched on the subject on SQLServerpedia, http://sqlserverpedia.com/blog/sql-server-bloggers/filtered-stats-to-counter-data-skew-issues/, but I wanted to see how it worked if you filtered on ranges of data covering all values in the index.

Again, I have a table, test1, with values in the c2 column that range from 1 to 500. I have created a non-clustered index, idx1, on column c2.
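The exact setup is not important for the discussion, but roughly, the table could have been built along the lines below; the column types and the way the data is generated are assumptions on my part, not the original script:

-- a rough sketch of the test setup: c2 gets values 1 to 500, roughly
-- 2 400 rows per value, and the value 45 is deliberately left out
CREATE TABLE test1 (c1 int IDENTITY(1,1) PRIMARY KEY, c2 int)
GO

;WITH Numbers AS
(
    SELECT TOP (1200000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b
)
INSERT INTO test1 (c2)
SELECT CASE WHEN (n % 500) + 1 = 45 THEN 46 ELSE (n % 500) + 1 END
FROM Numbers
GO

CREATE NONCLUSTERED INDEX idx1 ON test1 (c2)
GO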

For each value of c2, I have about 2 400 rows, but c2 = 45 returns 0 rows:

SELECT c2, COUNT(*) AS NumRows
FROM test1
GROUP BY c2
ORDER BY c2

[Figure: Fig1_NumRows2400]

DBCC SHOW_STATISTICS ('test1','idx1') WITH HISTOGRAM reveals that the value 45 has fallen through the cracks, even after an UPDATE STATISTICS test1 WITH FULLSCAN:

[Figure: Fig3_ShowStatistics2400]
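If you want to check programmatically whether a given value is a step of its own, one way is to capture the DBCC output into a temporary table. This is just a sketch; the column list mirrors the HISTOGRAM output and the types are my assumptions:

-- capture the histogram so a specific value can be checked against the
-- step boundaries (RANGE_HI_KEY)
CREATE TABLE #hist
(
    RANGE_HI_KEY int,
    RANGE_ROWS float,
    EQ_ROWS float,
    DISTINCT_RANGE_ROWS float,
    AVG_RANGE_ROWS float
)
INSERT INTO #hist
EXEC ('DBCC SHOW_STATISTICS (''test1'',''idx1'') WITH HISTOGRAM')

-- no row back means 45 is not a step boundary in the histogram
SELECT * FROM #hist WHERE RANGE_HI_KEY = 45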

DBCC SHOW_STATISTICS ('test1','idx1') WITH STAT_HEADER shows that all 1 190 402 rows are represented by 200 “buckets” or ranges (the Steps column):

[Figure: Fig3a_ShowStatisticsStatHeader2400]

Consequently, the following query gives me a table scan, because the statistics the optimizer has at hand imply that about 2 400 rows will be returned:

SELECT c1, c2 FROM test1 WHERE c2 = 45

[Figure: Fig4_ExecPlan2400]
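If you want to see the misestimate in numbers rather than in the graphical plan, something like the following works (a sketch; the estimate of roughly 2 400 rows shows up in the EstimateRows column while Rows is 0):

SET STATISTICS PROFILE ON
GO
SELECT c1, c2 FROM test1 WHERE c2 = 45
GO
SET STATISTICS PROFILE OFF
GO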


So the problem is not the sample rate; SQL Server looked at all values in the index when it built the statistics – it's just that the histogram in this case is too coarse for the data it represents.

How can I solve this problem?

The trick is to increase the number of ranges that represent the data, so that rather than having 200 steps to represent 1 200 000 rows (roughly 6 000 rows per step), you can have 600 or 1 000.

To start with, I need to know my domain of values:

SELECT MIN(c2) AS minVal, AVG(c2) avgVal, MAX(c2) maxVal FROM test1

[Figure: Fig4_Ranges]

I have only some 1 200 000 rows in this table, so I am going to go ahead and try with 3 ranges (ideally you should have an idea of how many rows fall into each range – see the quick check right after the CREATE STATISTICS statements below):

CREATE STATISTICS testStatLow
    ON test1 (c2)
WHERE c2 < 200
GO

-- note the inclusive lower bounds, so that the boundary values 200 and 400
-- are also covered by a filtered statistic
CREATE STATISTICS testStatMed
    ON test1 (c2)
WHERE c2 >= 200 AND c2 < 400
GO

CREATE STATISTICS testStatHi
    ON test1 (c2)
WHERE c2 >= 400
GO
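As a quick sanity check of how many rows end up under each statistics object, something like this (using the same boundaries as the filter predicates above) will do:

-- row counts per filter range
SELECT
    SUM(CASE WHEN c2 < 200 THEN 1 ELSE 0 END) AS LowRangeRows,
    SUM(CASE WHEN c2 >= 200 AND c2 < 400 THEN 1 ELSE 0 END) AS MedRangeRows,
    SUM(CASE WHEN c2 >= 400 THEN 1 ELSE 0 END) AS HiRangeRows
FROM test1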

I am then going to update the statistics with some sample rate, to see if – in addition to getting better query plans – I can save time on my maintenance job that updates the statistics for all tables:

UPDATE STATISTICS test1 WITH SAMPLE 20 PERCENT

And then look at the statistics:


DBCC SHOW_STATISTICS ('test1','testStatLow') WITH STAT_HEADER

[Figure: Fig7_Stat_header2400]

Rows Sampled vs. Rows shows that a little more than half of the rows were scanned (remember that SQL Server can increase the sample rate if it decides that is necessary to get a good representation of the data). The data in the column has now been divided into another 194 buckets/ranges, drastically reducing the risk of problems with skewed data.
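If you want to keep track of which statistics on the table are filtered and how, sys.stats exposes the filter definition (a quick sketch):

-- list the statistics on test1 together with their filter predicates
SELECT s.name, s.has_filter, s.filter_definition
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('test1')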

Let's run the query again to see if my new filtered statistics make a difference:

SELECT c1, c2 FROM test1 WHERE c2 = 45

[Figure: Fig6_ExecPlan2400]

There we go – the optimizer chose to do an index seek to find my non-existent row. Much better!

What about maintenance for the statistics?

My idea was that perhaps I could get away with a lower sample rate now that the index has been logically divided into several histograms. And as you just saw, I can for this particular scenario. However, after an update, I ran:

UPDATE STATISTICS test1 WITH SAMPLE 20 PERCENT

… which took about 6 seconds whereas

UPDATE STATISTICS test1 WITH FULLSCAN

… took about 1 second.

So in my somewhat contrived scenario, the gain is in the performance of the query, but I am not saving any of the precious time I have available for index maintenance.
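If the maintenance window is the concern, one option – not something I have measured here – is to update the filtered statistics objects individually by name, and leave the index statistics to the regular maintenance job:

-- refresh just the filtered statistics
UPDATE STATISTICS test1 (testStatLow) WITH FULLSCAN
UPDATE STATISTICS test1 (testStatMed) WITH FULLSCAN
UPDATE STATISTICS test1 (testStatHi) WITH FULLSCAN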

Hm… I can sense an upcoming blog on the subject of the Long and Winding Shortcut!

Published Thursday, December 17, 2009 3:14 PM by Elisabeth Redei


Comments

 

John Cougar said:

Thanks for the informative article! Good stuff. Question: determining the bucket ratio empirically is one thing, but in a real-world scenario, how do we go about determining which columns in a living dataset will require such attention? My guess is we wait until we start to see table scans, letting us know that we're out of tune, but can we do something more proactive than that?

July 4, 2012 11:10 PM
 

Understanding SQL Server Statistics | sqlbelle's musings said:

November 8, 2014 11:02 AM
