Joe Chang : Statisticshttp://sqlblog.com/blogs/joe_chang/archive/tags/Statistics/default.aspxTags: StatisticsenCommunityServer 2.1 SP2 (Build: 61129.1)Decoding STATS_STREAMhttp://sqlblog.com/blogs/joe_chang/archive/2012/05/05/decoding-stats-stream.aspxSat, 05 May 2012 10:16:00 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:43206jchang6http://sqlblog.com/blogs/joe_chang/comments/43206.aspxhttp://sqlblog.com/blogs/joe_chang/commentrss.aspx?PostID=43206Data distribution statistics is one of the foundations of the cost-based query optimizer in all modern database engines including SQL Server. From SQL Server 2005 on, most of the information displayed by DBCC SHOW_STATISTICS is kept in a binary field...(<a href="http://sqlblog.com/blogs/joe_chang/archive/2012/05/05/decoding-stats-stream.aspx">read more</a>)<img src="http://sqlblog.com/aggbug.aspx?PostID=43206" width="1" height="1">StatisticsStatistics Blundershttp://sqlblog.com/blogs/joe_chang/archive/2008/11/14/statistics-blunders.aspxFri, 14 Nov 2008 21:21:00 GMT21093a07-8b3d-42db-8cbf-3350fcbf5496:9831jchang0http://sqlblog.com/blogs/joe_chang/comments/9831.aspxhttp://sqlblog.com/blogs/joe_chang/commentrss.aspx?PostID=9831<P>One of the major blunders in SQL Server 2000 is in the calculation of data distribution statistics. In theory, it should be possible to take just a small true random sample to get a reasonable estimate of the true distribution. Because of the page organization of storage in SQL Server (and other relational database engines), it is much easier sample all rows in a set of randomly selected pages, than to randomly samples rows which encompasses a much larger set of pages. A simple test shows that SQL Server 2000 in fact does sample entire pages for column statistics. So if there is any correlation between a column value and storage location, a very common occurence in tables with a clustered index, then the distribution statitistics is grossly distorted.</P>
<P>When SQL Server 2005 came out, it did seem that this very serious error was corrected. But just recently I was looking a SQL Server 2005 (build 3186) database. Looking into unexpected behavior, the index statistics with the default sampling were significantly off. Statistics at higher sampling percentages were progressively more accurate. Now index and column statistics should different strategies. </P>
<P>For columns statistics, supposed one sampled 1% of the rows, find M occurences of N distinct values. Should the interpretation be that in the full distribution, there are 100xM occurences of the N distinct values or M occurences of 100xN distinct values or something in between? </P>
<P>For index statistics, there is an alternative strategy. Start a number random points including the first and last. At each point, go backward and forward to find the number of occurences of the value at the starting point. The number of occurrences is exactly the same for the full distribution.,So the reasonable assumption is that there are proportionaly more distinct values with the average sampled occurence. </P>
<P>Still one should always check for key indexes that the estimated row count from the initial index access with an equality SARG, specifically for one of the Range Hi Keys, is reasonably accurate (within 2X or so).</P><img src="http://sqlblog.com/aggbug.aspx?PostID=9831" width="1" height="1">Statistics