
Linchi Shea

Checking out SQL Server via empirical data points

Performance impact: tempdb and update statistics

In SQL Server, one of the most significant architectural legacies from Sybase is the use of a single globally shared tempdb database. Any time you have something globally shared in a highly concurrent system, the shared resource can become a huge impediment to the scalability of the whole system. Tempdb is no exception.

 

Consequently, understanding tempdb usage is essential to maintaining a highly scalable SQL Server environment. Unfortunately, except for the obvious cases, how SQL Server uses tempdb is not fully documented or widely known. The fact is that SQL Server uses tempdb internally more than you may realize. It may be justifiable that, precisely because the usage is internal, it's not documented.

 

And it can be a totally unproductive exercise to probe how such internal usage behaves.

 

However, some of these internals may bubble up to the surface and force a SQL Server user to make an effort to understand the behavior. Such was the case when a colleague of mine told me that he saw heavy usage of tempdb when the statistics of a table were updated. To be honest, I had not seen UPDATE STATISTICS making use of tempdb, and was a bit surprised initially. But I was able to confirm the observation later.

 

That prompted me to probe this a bit more because I didn’t think—and still don’t think—tempdb usage by UPDATE STATISTICS is a generally known fact in the SQL Server community. Even less is known about exactly how UPDATE STATISTICS may use tempdb.

 

Seeing is believing

 

Here is how you can run a quick test in your own environment to see tempdb being used by UPDATE STATISTICS.

 

For this test, you need to open two SQL Server connections.

 

First, run the following script—Script I—in any user database to create a test table, populate it with ~1.6GB of data, and create a statistics object on column c2.

 

-- Script I
-- on connection 1
drop table test
go
create table test(c1 int, c2 varchar(500))
go

set nocount on
go
declare @i int
set @i = 1
begin tran
while @i <= 5000000
begin
      insert test(c1, c2)
      select @i,
             cast(@i as varchar(30)) + replicate('a', 300)

      if @i % 200000 = 0
      begin
         commit tran
         begin tran
      end
      set @i = @i + 1
end
commit tran
go
create statistics c2_stats on test(c2)
go

 

 

Note that the statistics object is created with the default sampling rate. Now, run the following script—Script II—to record the tempdb usage during the subsequent update statistics activity.

 

-- Script II
-- On connection 2
use tempdb
go

drop table #tmp
drop table #stats
go

create table #tmp(Fileid int, FileGroup int, TotalExtents int, UsedExtents int, Name sysname, FileName varchar(500))
create table #stats(loggedTime datetime, Allocated int, Used int)
go

set nocount on
go
truncate table #stats
while (1=1)
begin
    truncate table #tmp
    insert #tmp
    EXECUTE ('dbcc showfilestats WITH NO_INFOMSGS')

    insert #stats
    select getdate(), sum(TotalExtents*64/1024) as Allocated, sum(UsedExtents*64/1024) as Used
      from #tmp
    waitfor delay '00:00:05'
end

 

Note that the above script must be run in the tempdb database. While Script II is running, run the following script—Script III—to update the statistics on column test.c2 with fullscan:

 

-- Script III
-- on connection 1
UPDATE STATISTICS test WITH FULLSCAN

 

Note the fullscan option in the UPDATE STATISTICS statement. When Script III is done executing, cancel the execution of Script II and run the following script—Script IV—to see how the tempdb data file(s) were used:

 

-- Script IV
-- on connection 2
select * from #stats

 

Here is what I observed running the above scripts.

 

On an x64 SQL Server 2008 instance with 2GB allocated to the buffer pool (max/min server memory both set to 2GB), Script IV produced the following result:

 

LoggedTime     tempdb Data Allocated (MB)   tempdb Data Used (MB)
12:42:16 AM    13500                        1
12:42:21 AM    13500                        1
12:42:26 AM    13500                        1
12:42:31 AM    13500                        229
12:42:36 AM    13500                        1083
12:42:41 AM    13500                        1561
12:42:46 AM    13500                        1413
12:42:51 AM    13500                        1170
12:42:56 AM    13500                        732
12:43:01 AM    13500                        120
12:43:06 AM    13500                        1
12:43:11 AM    13500                        1
12:43:16 AM    13500                        1
12:43:21 AM    13500                        1

 

Clearly, while the statistics were being updated, tempdb data usage went up from 1MB to as high as 1561MB, and when the statistics update completed, tempdb data usage dropped back to 1MB. Repeated runs of the test confirmed that tempdb data space was used during the statistics update in Script III.
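If you want to attribute the tempdb allocations to the specific session running UPDATE STATISTICS, rather than just watching the file totals, the tempdb allocation DMVs can be polled as well. Here is a minimal sketch (not part of the original test harness; the session id 55 is a placeholder for whatever @@SPID connection 1 reports):

-- Sketch: per-task tempdb page allocations for the UPDATE STATISTICS session
-- Run on connection 2 while Script III is executing; replace 55 with the
-- @@SPID of connection 1
select session_id,
       sum(internal_objects_alloc_page_count) * 8 / 1024 as internal_alloc_MB,
       sum(user_objects_alloc_page_count) * 8 / 1024 as user_alloc_MB
  from sys.dm_db_task_space_usage
 where session_id = 55
 group by session_id

The sorts and worktables that a statistics update may generate are reported under the internal object counters of this DMV.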

 

Factors influencing tempdb usage

 

What are the factors that do or do not influence the use of tempdb by UPDATE STATISTICS with fullscan?

 

Buffer pool size

On a different x64 SQL Server 2008 instance with 23GB allocated to the buffer pool, Script IV reported the following result:

 

LoggedTime     tempdb Data Allocated (MB)   tempdb Data Used (MB)
12:31:46 AM    4500                         1
12:31:51 AM    4500                         1
12:31:56 AM    4500                         1
12:32:01 AM    4500                         1
12:32:06 AM    4500                         1118
12:32:11 AM    4500                         1559
12:32:16 AM    4500                         1211
12:32:21 AM    4500                         784
12:32:26 AM    4500                         331
12:32:31 AM    4500                         1
12:32:36 AM    4500                         1
12:32:41 AM    4500                         1
12:32:46 AM    4500                         1

 

On both instances, there was no other activity while the test was running. Since tempdb was used even with a 23GB buffer pool, the decision to use tempdb during the statistics update in the 2GB case was evidently not made because there was insufficient free buffer pool space.

 

The size of the buffer pool did not seem to be a major factor.
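For reference, the 2GB instance above had max and min server memory both fixed at 2GB. Reproducing that configuration can be done with sp_configure; a minimal sketch (the 2048 value is the only assumption):

-- Sketch: pin the buffer pool at 2GB by fixing min/max server memory
exec sp_configure 'show advanced options', 1
reconfigure
exec sp_configure 'min server memory (MB)', 2048
exec sp_configure 'max server memory (MB)', 2048
reconfigure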

 

Sampling rate

I modified the UPDATE STATISTICS statement to not use the fullscan option, which means that UPDATE STATISTICS would use the default sampling rate. In this case, tempdb usage was not observed.
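The statement for this case is simply the plain form without the FULLSCAN clause, along the lines of:

-- Default sampling rate: no FULLSCAN clause
UPDATE STATISTICS test
-- or, targeting only the statistics object created by Script I
UPDATE STATISTICS test c2_stats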

 

Key column width

It should also be noted that the key width plays a significant role. If I created the test table as follows:

 

  create table test(c1 int, c2 varchar(30), c3 char(300))

 

And the table was populated with the same amount of data using Script I, except that the INSERT was modified as follows:

 

  insert test(c1, c2, c3)
  select @i,
         cast(@i as varchar(30)),
         replicate('a', 300)

 

I then created the statistics on column c2. In this case, the key width of c2 was rather small. Almost no tempdb usage was observed while UPDATE STATISTICS WITH FULLSCAN was being executed.
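A quick way to see the key-width difference between the two table variants is the STAT_HEADER output of DBCC SHOW_STATISTICS, which reports the average key length (this is not part of the original test, just a convenient cross-check, and it assumes the same c2_stats name as in Script I):

-- Sketch: inspect the average key length of the c2 statistics
DBCC SHOW_STATISTICS ('dbo.test', c2_stats) WITH STAT_HEADER
-- "Average key length" should read 300+ bytes for the wide varchar(500)
-- variant and only a few bytes for the narrow varchar(30) variant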

 

The data type of the key column

Initially, when my colleague mentioned that he observed significant tempdb usage, the first thing I noticed was that the table whose statistics were being updated had a text column and an image column. I thought that perhaps the tempdb usage could be a result of the text/image columns. However, that assumption turned out not to be true. You can substitute text for the data type of the c2 column in Script I, and the change has little impact on the tempdb usage.

 

Data selectivity

Does the selectivity of the column matter? In Script I, the c2 column is given very high selectivity. In fact, all values are unique. Is the tempdb usage the result of this high selectivity? The answer is no. When I tested with the c2 column having only two distinct values, the same amount of tempdb usage was observed. Changing the data selectivity to a few other values did not significantly change the tempdb usage.
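One simple way to set up the two-distinct-values case is to fold the loop counter in Script I's INSERT down to a parity value, keeping the row width the same (a sketch of the idea; the exact values used are not important):

-- Sketch: replace the INSERT inside Script I's loop so c2 has only two
-- distinct values ('0aaa...' and '1aaa...') of roughly the same width
      insert test(c1, c2)
      select @i,
             cast(@i % 2 as varchar(30)) + replicate('a', 300)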

 

Table size

Does the table size influence the tempdb usage by UPDATE STATISTICS? Intuitively, the size of the table should matter. Tests showed that to indeed be the case. Here is a representative test result:

 

LoggedTime    tempdb Data Allocated (MB)   tempdb Data Used (MB)
8:13:56 PM    13500                        1
8:14:01 PM    13500                        1
8:14:06 PM    13500                        231
8:14:11 PM    13500                        932
8:14:16 PM    13500                        1632
8:14:21 PM    13500                        2331
8:14:26 PM    13500                        2563
8:14:31 PM    13500                        3124
8:14:36 PM    13500                        3065
8:14:41 PM    13500                        2945
8:14:46 PM    13500                        2565
8:14:51 PM    13500                        2057
8:14:56 PM    13500                        1731
8:15:01 PM    13500                        1308
8:15:06 PM    13500                        940
8:15:11 PM    13500                        585
8:15:16 PM    13500                        1
8:15:21 PM    13500                        1
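To relate these numbers to the size of the table being scanned, a quick size check with sp_spaceused is handy when repeating the test at different table sizes (this is not part of the original scripts):

-- Sketch: check the size of the test table before updating statistics
exec sp_spaceused 'dbo.test'
-- Compare the reported data size with the peak tempdb usage observed while
-- UPDATE STATISTICS ... WITH FULLSCAN runs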

 

The behavior needs to be documented

 

Since I'm not privy to the implementation decisions inside SQL Server, I don't want to guess what good reasons there may be to justify the observed tempdb usage during UPDATE STATISTICS with fullscan. But given the importance of tempdb performance in many SQL Server applications, the fact that UPDATE STATISTICS may use tempdb should at least be documented in SQL Server Books Online.

 

I’ve filed a Connect item for this: https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=483247

 

Key takeaway

 

The central point of this post is that UPDATE STATISTICS with FULLSCAN may incur significant usage of your tempdb.

Published Sunday, August 16, 2009 10:10 PM by Linchi Shea


Comments

 

Sankar Reddy said:

Linchi,

One reason why UPDATE STATISTICS uses tempdb is the SELECT TOP 100 ... ORDER BY within a subquery.

Something like below.

SELECT StatMan([SC0], [SB0000]) FROM
(SELECT TOP 100 PERCENT [SC0], step_direction([SC0]) over (order by NULL) AS [SB0000] FROM
(SELECT [cEndDateTime] AS [SC0] FROM [dbo].[Tablename]
   WITH (READUNCOMMITTED,SAMPLE 5.949706e+001 PERCENT) ) AS _MS_UPDSTATS_TBL_HELPER ORDER BY [SC0], [SB0000] )
    AS _MS_UPDSTATS_TBL OPTION (MAXDOP 1)

I created a Connect item on this, and per the MSFT team, this will be fixed sometime after SQL Server 2008 R2.

Here are the Connect item and my blog post.

https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=457024

http://sankarreddy.spaces.live.com/blog/cns!1F1B61765691B5CD!367.entry

August 17, 2009 2:35 AM
 

Linchi Shea said:

Thanks Sankar for the info! That's interesting. Do you know under what conditions UPDATE STATISTICS would lead to the referenced SELECT TOP query? I did a quick check and did not see the query, but will check again when I have time in the evening.

Also the OPTION(MAXDOP 1) clause is worrisome.

August 17, 2009 8:33 AM
 

Joe Chang said:

the StatMan shows up in Trace Stored Procedures SP:StmtCompleted

unfortunately, neither the step_direction nor the StatMan functions can be executed by the user(?)

Also, in S2K, the statistics only kept header, vector and histogram information, of which the histogram was limited to 200 rows, containing the range_hi_key, eq_rows, range_rows, and distinct_range_rows, plus some housekeeping info, meaning each histogram row in your above example consumes 300+ bytes, but for int and bigint types, the statblob was typically 8K bytes

in S2K5 and on, additional info is kept for substring distribution, and I have seen the Stats_Stream upward of 190K

try DBCC SHOW_STATISTICS (test,c2) WITH STATS_STREAM

and see how long the field is

August 17, 2009 9:35 AM
 

Linchi Shea said:

> the StatMan shows up in Trace Stored Procedures SP:StmtCompleted

Yes, of course, Joe! I was thrown off by the fact that UPDATE STATISTICS showed up in SQL:BatchStarting, but not in RPC:BatchStarting, and never bothered to check SP:StmtStarting. Anyway, for UPDATE STATISTICS WITH FULLSCAN, the traced call is as follows:

SELECT StatMan([SC0]) FROM (SELECT TOP 100 PERCENT [c2] AS [SC0] FROM [dbo].[test] WITH (READUNCOMMITTED)  ORDER BY [SC0] ) AS _MS_UPDSTATS_TBL

This explains the use of tempdb. And it confirms that the FULLSCAN option is not limited to a single thread, i.e. OPTION(MAXDOP 1) is not present.

And if I run UPDATE STATISTICS without the FULLSCAN option, the SP statement call is as follows:

SELECT StatMan([SC0], [SB0000]) FROM (SELECT TOP 100 PERCENT [SC0], step_direction([SC0]) over (order by NULL) AS [SB0000]  FROM (SELECT [c2] AS [SC0] FROM [dbo].[test] TABLESAMPLE SYSTEM (1.307525e+000 PERCENT) WITH (READUNCOMMITTED) ) AS _MS_UPDSTATS_TBL_HELPER ORDER BY [SC0], [SB0000] ) AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 1)

So the default sampling rate does limit the operation to a single thread.

I take the trace result to mean that UPDATE STATISTICS actually runs a SQL statement through the 'normal' query processing mechanisms. I had thought that UPDATE STATISTICS might internally bypass some of the regular query processing, and therefore could have less of an impact on shared resources.

August 17, 2009 9:58 AM
 

AaronBertrand said:

I just don't comprehend why the TOP / ORDER BY is in the subquery in the first place. It seems like a sort for nothing. Granted, I do not know the code behind StatMan(), but still, how could this code behave differently than just saying:

SELECT StatMan([SC0]) FROM (SELECT [c2] AS [SC0] FROM [dbo].[test] WITH (READUNCOMMITTED)) AS _MS_UPDSTATS_TBL

?

August 17, 2009 10:31 AM
 

Linchi Shea said:

Aaron;

Maybe SQL Server has internal mechanisms to treat these queries differently than your regular user queries. Just a WAG on my part. But then, for statistics, SQL Server does need to sort the values, doesn't it?

August 17, 2009 10:46 AM
 

Linchi Shea said:

In addition, even UPDATE STATISTICS test WITH SAMPLE 99 PERCENT results in OPTION(MAXDOP 1), but FULLSCAN does not.

August 17, 2009 11:18 AM
 

gvphubli said:

For stats calculation and data evaluation, it has to use some space somewhere; what better place is there than tempdb?

August 24, 2009 12:09 AM
 

Linchi Shea said:

gvphubli;

I didn't mean to suggest it's necessarily wrong to use tempdb for updating stats. But this needs to be clearly documented because tempdb is a shared resource, and is often a bottleneck, especially when many SQL Server apps are tempdb-heavy. Ideally, for some operations, it would be nice for each database to have its own scratch space.

August 24, 2009 12:34 PM
 

Allen Zhang said:

Sybase has been able to have multiple tempdbs for a few years now. It would be good to see SQL Server support multiple tempdbs as well.

July 17, 2013 10:42 PM
