THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Joe Chang

Statistics Blunders

One of the major blunders in SQL Server 2000 is in the calculation of data distribution statistics. In theory, it should be possible to take just a small true random sample to get a reasonable estimate of the true distribution. Because of the page organization of storage in SQL Server (and other relational database engines), it is much easier sample all rows in a set of randomly selected pages, than to randomly samples rows which encompasses a much larger set of pages. A simple test shows that SQL Server 2000 in fact does sample entire pages for column statistics. So if there is any correlation between a column value and storage location, a very common occurence in tables with a clustered index, then the distribution statitistics is grossly distorted.

When SQL Server 2005 came out, it did seem that this very serious error was corrected. But just recently I was looking a SQL Server 2005 (build 3186) database. Looking into unexpected behavior, the index statistics with the default sampling were significantly off. Statistics at higher sampling percentages were progressively more accurate. Now index and column statistics should different strategies.

For columns statistics, supposed one sampled 1% of the rows, find M occurences of N distinct values. Should the interpretation be that in the full distribution, there are 100xM occurences of the N distinct values or M occurences of 100xN distinct values or something in between? 

For index statistics, there is an alternative strategy. Start a number random points including the first and last. At each point, go backward and forward to find the number of occurences of the value at the starting point. The number of occurrences is exactly the same for the full distribution.,So the reasonable assumption is that there are proportionaly more distinct values with the average sampled occurence.

Still one should always check for key indexes that the estimated row count from the initial index access with an equality SARG, specifically for one of the Range Hi Keys, is reasonably accurate (within 2X or so).

Published Friday, November 14, 2008 5:21 PM by jchang
Filed under:

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS


No Comments

Leave a Comment


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine

This Blog


Privacy Statement