In SQL Server, one of the most significant architectural legacies from Sybase is the use of a single globally shared tempdb database. Any time you have something globally shared in a highly concurrent system, the shared resource can become a huge impediment to the scalability of the whole system. Tempdb is no exception.
Consequently, understanding tmepdb usage is essential to maintaining a highly scalable SQL Server environment. Unfortunately, except for the obvious cases, how tempdb is used by SQL Server is not all documented or well known. The fact is that SQL Server uses tempdb internally more than you may know. It may be justifiable that precisely because the usage is internal, it’s not documented.
And it can be a totally unproductive exercise to probe how such an internal usage may behave.
However, some of these internals may bubble up to the surface, and force them on a SQL Server user to make an effort to understand the behavior. Such is the case when a colleague of mine told me that he saw heavy usage of tempdb when the statistics of a table was updated. To be honest, I had not seen UPDATE STATISTICS making use of tempb, and was a bit surprised initially. But I was able to confirm the observation later.
That prompted me to probe this a bit more because I didn’t think—and still don’t think—tempdb usage by UPDATE STATISTICS is a generally known fact in the SQL Server community. Even less is known about exactly how UPDATE STATISTICS may use tempdb.
Seeing is believing
Here is how you can run a quick test in your own environment to see tempdb being used by UPDATE STATISTICS.
For this test, you need to open two SQL Server connections for this exercise.
First, run the following script—Script I—in any user database to create a test table, populate the table with ~1.6GB of data, and create a statistics on column c2.
-- Script I
-- on connection 1
drop table test
go
create table test(c1 int, c2 varchar(500))
go
set nocount on
go
declare @i int
set @i = 1
begin tran
while @i <= 5000000
begin
insert test(c1, c2)
select @i,
cast(@i as varchar(30)) + replicate('a', 300)
if @i % 200000 = 0
begin
commit tran
begin tran
end
set @i = @i + 1
end
commit tran
go
create statistics c2_stats on test(c2)
go
Note that the statistics is created with the default sampling rate. Now, run the following script—Script II—to record the tempdb usage for the subsequent update stats activities.
-- Script II
-- On connection 2
use tempdb
go
drop table #tmp
drop table #stats
go
create table #tmp(Fileid int, FileGroup int, TotalExtents int, UsedExtents int, Name sysname, FileName varchar(500))
create table #stats(loggedTime datetime, Allocated int, Used int)
go
set nocount on
go
truncate table #stats
while (1=1)
begin
truncate table #tmp
insert #tmp
EXECUTE ('dbcc showfilestats WITH NO_INFOMSGS')
insert #stats
select getdate(), sum(TotalExtents*64/1024) as Allocated, sum(UsedExtents*64/1024) as Used
from #tmp
waitfor delay '00:00:05'
end
Note that the above script must be run in the tempdb database. While Script II is running, run the following script—Script III—to update the statistics on column test.c2 with fullscan:
-- Script III
-- on connection 1
UPDATE STATISTICS test WITH FULLSCAN
Note the fullscan option in the UPDATE STATISTICS statement. When Script III is done executing, cancel the execution of Script II and run the following script—Script IV—to see how the tempdb data file(s) is used:
-- Script IV
-- on connection 2
select * from #stats
Here is what I observed running the above scripts.
On an x64 SQL Server 2008 instance with 2GB allocated to the buffer pool (max/min server memory both set to 2GB), Script IV produced the following result:
|
LoggedTime |
tempdb Data Allocated (MB) |
tempdb Data Used (MB) |
|
12:42:16 AM |
13500 |
1 |
|
12:42:21 AM |
13500 |
1 |
|
12:42:26 AM |
13500 |
1 |
|
12:42:31 AM |
13500 |
229 |
|
12:42:36 AM |
13500 |
1083 |
|
12:42:41 AM |
13500 |
1561 |
|
12:42:46 AM |
13500 |
1413 |
|
12:42:51 AM |
13500 |
1170 |
|
12:42:56 AM |
13500 |
732 |
|
12:43:01 AM |
13500 |
120 |
|
12:43:06 AM |
13500 |
1 |
|
12:43:11 AM |
13500 |
1 |
|
12:43:16 AM |
13500 |
1 |
|
12:43:21 AM |
13500 |
1 |
Clearly, during the time when the statistics was being updated, tempdb data usage went up from 1MB to as high as 1561MB, and when updating the statistics was completed, the tempdb data usage went back to 1MB. Repeated runs of the test confirmed that the tempdb data was used during the update of the statistics in Script III.
Factors influencing tempdb usage
What are the factors that do or do not influence the use of tempdb by UPDATE STATISTICS with fullscan?
Buffer pool size
On a different x64 SQL Serve 2008 instance with 23GB allocated to the buffer pool, Script IV reported the following result:
|
LoggedTime |
tempdb Data Allocated (MB) |
tempdb Data Used (MB) |
|
12:31:46 AM |
4500 |
1 |
|
12:31:51 AM |
4500 |
1 |
|
12:31:56 AM |
4500 |
1 |
|
12:32:01 AM |
4500 |
1 |
|
12:32:06 AM |
4500 |
1118 |
|
12:32:11 AM |
4500 |
1559 |
|
12:32:16 AM |
4500 |
1211 |
|
12:32:21 AM |
4500 |
784 |
|
12:32:26 AM |
4500 |
331 |
|
12:32:31 AM |
4500 |
1 |
|
12:32:36 AM |
4500 |
1 |
|
12:32:41 AM |
4500 |
1 |
|
12:32:46 AM |
4500 |
1 |
On both instances, there were no other activities while the test was running. Apparently, the decision to use tempdb during a statistics update in the case of a 2GB buffer pool was not made because there was insufficient free buffer pool space.
The size of the buffer pool did not seem to be a major factor.
Sampling rate
I modified the UPDATE STATISTICS statement to not use the fullscan option, which means that UPDATE STATISTICS would use the default sampling rate. In this case, tempdb usage was not observed.
Key column width
It should also be noted that the key width plays a significant role. If I created the test table as follows:
create table test(c1 int, c2 varchar(30), c3 char(300))
And the table was populated with the same amount of data with Script I except that the INSERT was modified as follows:
insert test(c1, c2, c3)
select @i,
cast(@i as varchar(30)),
replicate('a', 300)
I then created the statistics on column c2. In this case, the key width of c2 was rather small. Almost no tempdb usage was observed while UPDATE STATISTICS WITH FULLSCAN was being executed.
The data type of the key column
Initially, when my colleague mentioned that he observed significant tempdb usage, the first thing I noticed was that the table whose statistics were being updated had a text column and an image column. I thought that perhaps the usage of tempdb could be a result of the text/image data column. However, that assumption turned out not to be true. You can substitute the data type of the c2 column in Script I with text, and the change would have little impact on the tempdb usage.
Data selectivity
Does the selectivity of the column matter? In Script I, the c2 column is given very high selectivity. In fact, all values are unique. Is the tempdb usage the result of this high selectivity? The answer is no. When I tested with the c2 column having only two distinct values, the same amount of tempdb usage was observed. Changing the data selectivity to a few other values did not significantly change the tempdb usage.
Table size
Does the table size influence the tempdb usage by UPDATE STATISTICS? Intuitively, the size of the table should matter. Tests show that indeed to be the case. Here is a representative test result:
|
LoggedTime |
tempdb Data Allocated (MB) |
tempdb Data Used (MB) |
|
8:13:56 PM |
13500 |
1 |
|
8:14:01 PM |
13500 |
1 |
|
8:14:06 PM |
13500 |
231 |
|
8:14:11 PM |
13500 |
932 |
|
8:14:16 PM |
13500 |
1632 |
|
8:14:21 PM |
13500 |
2331 |
|
8:14:26 PM |
13500 |
2563 |
|
8:14:31 PM |
13500 |
3124 |
|
8:14:36 PM |
13500 |
3065 |
|
8:14:41 PM |
13500 |
2945 |
|
8:14:46 PM |
13500 |
2565 |
|
8:14:51 PM |
13500 |
2057 |
|
8:14:56 PM |
13500 |
1731 |
|
8:15:01 PM |
13500 |
1308 |
|
8:15:06 PM |
13500 |
940 |
|
8:15:11 PM |
13500 |
585 |
|
8:15:16 PM |
13500 |
1 |
|
8:15:21 PM |
13500 |
1 |
The behavior needs to be documented
Since I’m not privy to the implementation decisions inside SQL Server, I don’t want to guess what good reasons may be there to justify the observed tempdb usage behavior during UPDATE STATISTICS with fullscan. But given the importance of tempdb performance in many SQL Server applications, at least, the fact that UPDATE STATISTICS may use tempdb should be documented in SQL Server Books Online.
I’ve filed a Connect item for this: https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=483247
Key takeaway
The central point of this post is that UPDATE STATISYICS with FULLSCAN may incur a significant usage of your tempdb.