THE SQL Server Blog Spot on the Web


Linchi Shea

Checking out SQL Server via empirical data points

Best and Worst Checkpoint Performance

The best documentation on the I/O behavior of SQL Server checkpoints is found in SQL Server 2000 I/O Basics by Bob Dorr. In particular, you should read the following carefully:

SQL Server uses the following steps to set up another page for flushing and repeats for up to 16 total pages inclusive of the first page.
  1. Do a hash lookup for the next contiguous page. For example, if the page to be flushed were page 100, SQL Server searches for page 101 in the buffer hash array.
  2. If the page is not located, then the end of contiguous I/O block is established and the I/O is posted.
  3. If the page is located, acquire latch to prevent further modifications if the page may be dirty.
  4. Check to ensure that the page is dirty and needs to be written. If not, release the latch and consider the end of contiguous I/O block as established and submit the asynchronous I/O request.
  5. If dirty, follow the steps preceding these that tell you how to flush a single page.

After the set of pages to be flushed is determined, the WriteFileGather function is invoked to post (Async / OVERLAPPED) the I/O request with the associated callback function to complete the I/O operation.
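As a rough illustration (a hypothetical model, not SQL Server's actual code), the gather logic described above can be sketched in a few lines of Python, with the buffer hash array as a dict and the dirty pages as a set:

```python
def build_gather_write(start_page, buffer_hash, dirty_pages, max_pages=16):
    """Collect up to max_pages contiguous pages starting at start_page,
    stopping at the first page that is not cached or not dirty.
    The finished batch would then be posted as one WriteFileGather call."""
    batch = [start_page]
    next_page = start_page + 1
    while len(batch) < max_pages:
        if next_page not in buffer_hash:    # hash miss: end of contiguous block
            break
        if next_page not in dirty_pages:    # clean page: release it and stop
            break
        batch.append(next_page)             # dirty and cached: extend the write
        next_page += 1
    return batch

# All pages cached and dirty: the batch tops out at 16 pages (128 KB).
cache = {p: None for p in range(100, 200)}
print(build_gather_write(100, cache, set(cache)))  # pages 100..115
```

With the SQL Server 2005 limit, `max_pages` would be 32 instead of 16, as the next quote explains.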

In SQL Server I/O Basics, Chapter 2, Bob Dorr further states that, "SQL Server 7.0 and 2000 could issue a WriteMultiple operation for up to 16 pages (128 KB). SQL Server 2005 extends the capability of WriteMultiple up to 32 pages (256 KB)."

For most disk I/O subsystems, large I/Os--such as 256K writes--are much more efficient than small I/Os such as 8K writes. It thus follows that if SQL Server checkpoints are forced to issue single-page I/Os, checkpoint performance would seriously suffer.
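The arithmetic behind that claim is simple: to flush the same number of dirty bytes, single-page writes take 32 times as many I/O operations as fully gathered 256K writes (a back-of-the-envelope sketch using the page and batch sizes quoted above):

```python
PAGE = 8 * 1024                   # 8 KB data page
GATHER = 32 * PAGE                # 256 KB WriteMultiple limit in SQL Server 2005

dirty_bytes = 500_000 * PAGE      # 500,000 dirty pages, ~3.8 GB
print(dirty_bytes // PAGE)        # 500000 single-page writes
print(dirty_bytes // GATHER)      # 15625 gathered writes
```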

To see this for myself, I conducted a little test on SQL Server 2005. First, I created a heap table with one row per page, and populated it with 1,000,000 rows, resulting in ~8GB of used space. The table schema is as follows:

CREATE TABLE tbTest (
    i      int NOT NULL, 
    filler char(8000) NOT NULL
)

And the table was populated with the following script:


DECLARE @batch_size int
SET @batch_size = 100000

DECLARE @i int
SET @i = 1

BEGIN TRAN
WHILE @i <= 1000000
BEGIN
    INSERT tbTest(i, filler) VALUES (@i, ' ')

    -- Commit every @batch_size rows to keep the transaction log in check
    IF (@i % @batch_size = 0)
    BEGIN
        COMMIT TRAN
        BEGIN TRAN
    END

    SET @i = @i + 1
END
IF @@TRANCOUNT > 0
   COMMIT TRAN


Since the test server had 12GB physical memory allocated to the SQL Server buffer pool, all the pages of this table could be cached in memory. I then disabled automatic checkpoints with trace flag 3505 so that I could control when I wanted to checkpoint the database, and study the checkpoint performance.

I examined two extreme SQL Server 2005 database checkpoint scenarios:

  • The Best Scenario, where all the dirty pages were contiguous and all the checkpoint I/Os should be 256K writes, and
  • The Worst Scenario, where no two dirty pages were contiguous and all the checkpoint I/Os should therefore be 8K writes.

The best scenario was accomplished with the following code:

update tbTest
   set filler = 'xyz'
 where i < 500000

declare @dt datetime
set @dt = getdate()
checkpoint
select datediff(second, @dt, getdate())

And the worst scenario was produced with this code:

update tbTest
   set filler = '123'
 where i % 2 = 0

declare @dt datetime
set @dt = getdate()
checkpoint
select datediff(second, @dt, getdate())

There was no difference in execution duration between the two UPDATE statements. After all, they both performed a full table scan and updated the same number of rows (i.e. 500,000). However, note that the first UPDATE statement modified the first 500,000 consecutive rows, thus 500,000 contiguous pages, whereas the second UPDATE statement modified every other row, leaving no two dirty pages contiguous.
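Running these two dirty-page patterns through the contiguous-gather rules quoted earlier predicts the contrast directly. A small sketch (again a hypothetical model of the coalescing, using the 32-page SQL Server 2005 gather limit):

```python
def gathered_writes(dirty_pages, max_pages=32):
    """Count the gather-writes needed to flush a sorted list of dirty page ids,
    coalescing each run of contiguous pages into one write of up to max_pages."""
    writes, i = 0, 0
    while i < len(dirty_pages):
        run = 1
        while (run < max_pages and i + run < len(dirty_pages)
               and dirty_pages[i + run] == dirty_pages[i + run - 1] + 1):
            run += 1
        writes += 1
        i += run
    return writes

contiguous = list(range(500_000))            # best case: one long dirty run
alternating = list(range(0, 1_000_000, 2))   # worst case: every other page dirty
print(gathered_writes(contiguous))           # 15625 large (256K) writes
print(gathered_writes(alternating))          # 500000 single-page (8K) writes
```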

The following chart shows a significant contrast between the checkpoint durations of these two rather extreme scenarios:

In this test, it took about four times as long to checkpoint the same number of dirty pages in the worst scenario (duration =~ 80 seconds) as it did in the best scenario (duration =~ 20 seconds). The performance difference was staggering.
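In throughput terms, the gap is just as stark (back-of-the-envelope, using the ~500,000 dirty 8K pages and the measured durations above):

```python
dirty_mb = 500_000 * 8 / 1024    # ~3906 MB of dirty pages to flush
print(round(dirty_mb / 20))      # best scenario:  ~195 MB/s with 256K writes
print(round(dirty_mb / 80))      # worst scenario: ~49 MB/s with 8K writes
```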

How did I know that SQL Server checkpoints were doing large or small I/Os? That was through observing the disk performance counters while the checkpoints were in progress. In the worst scenario, the Avg. Disk Bytes/Write counter remained nearly constant at 8192 bytes = 8KB, while in the best scenario, the same disk counter hovered around 260,000 bytes =~ 254KB. [Thanks, Andrew Kelly, for pointing out a mistake here--the words "best" and "worst" were incorrectly transposed. Updated 2007/11/01.]

It is also interesting to note that in the best scenario, the Current Disk Queue Length counter stayed around 20 for most of the time, whereas in the worst scenario it spiked to 100 or even higher. Apparently, SQL Server was trying to push the I/Os harder, with more outstanding async write requests, when it detected that a checkpoint was issuing small-block I/Os. This didn't quite do the trick: it pushed the I/O latency--as measured by Avg. Disk sec/Write--above 48ms per write in the worst scenario, compared with ~20ms per write in the best scenario, even though the former was doing smaller I/Os.

So besides confirming what Bob Dorr described with a concrete example, are there any practical implications? Of course, but I'll leave that as an exercise for you to ponder.

Published Thursday, November 1, 2007 1:05 AM by Linchi Shea

Attachment(s): Best_checkpoints.gif



Andrew Kelly said:


You had this statement:


In the best scenario, the Avg. Disk Bytes/Write counter remained almost a constant of 8192 bytes = 8KB, while in the worst scenario, the same disk counter hovered around 260,000 bytes =~ 254KB.


Shouldn't that be the other way around? In the best-case scenario it was able to do contiguous writes with large blocks, so I would expect the avg bytes per write to be higher than in the worst case.

November 1, 2007 12:19 PM

Linchi Shea said:

Good catch, Andy! I didn't expect anybody to read it so carefully :-)

I'll correct it. Thanks!

November 1, 2007 1:04 PM

Greg Linwood said:

Excellent information Linchi, thx for posting

November 1, 2007 2:14 PM

Steve said:


sorry to be dumb - but does this mean checkpoint processing is better on a table with low fragmentation ? ie contiguous on disk - or do you mean checkpoint processing is affected by the order data is loaded into cache ?  

November 14, 2007 10:14 AM

Rizzo said:

Trace flags 3502 and 3504 only output checkpoint information to the errorlog.

Automatic checkpoints are disabled by trace flag 3505.

June 16, 2011 10:33 AM