
Argenis Fernandez

DBCC CHECKDB on VVLDB and latches (Or: My Pain is Your Gain)

 

Does your CHECKDB hurt, Argenis?

There is a classic blog series by Paul Randal [blog|twitter] called “CHECKDB From Every Angle” which is pretty much mandatory reading for anybody who’s even remotely considering going for the MCM certification, or its replacement (the Microsoft Certified Solutions Master: Data Platform – makes my fingers hurt just from typing it). Of particular interest is the post “Consistency Options for a VLDB” – in it, Paul provides solid, timeless advice (I use the word “timeless” because it was written in 2007, and it all applies today!) on how to perform checks on very large databases.

Well, here I was trying to figure out how to make CHECKDB run faster on a restored copy of one of our databases, which happens to exceed 7TB in size. The whole thing was taking several days on multiple systems, regardless of the storage used – SAS, SATA or even SSD…and I actually didn’t pay much attention to how long it was taking, or even bother to look at the reasons why – as long as it was finishing okay and found no consistency errors.

Yes – I know. That was a huge mistake, as finding corruption in a database only several days after it happens leaves that much more room for it to spread – and potentially for large data loss.

In the last two weeks I increased my attention towards this problem, as we noticed that CHECKDB was taking EVEN LONGER on brand new all-flash storage in the SAN! I couldn’t really explain it, and was almost ready to blame the storage vendor. The vendor told us that they could initially see the server driving decent I/O – around 450MB/sec – and then it would settle at a very slow rate of 10MB/sec or so. “Hmm”, I thought – “CHECKDB is just not pushing the I/O subsystem hard enough”. Perfmon confirmed the vendor’s observations.


Dreaded @BlobEater

What was CHECKDB doing all the time while doing so little I/O? Eating Blobs.

It turns out that CHECKDB was taking an extremely long time on one of our frankentables, which happens to have 35 billion rows (yup, with a b) and sucks up several terabytes of space in the database. We do have a project ongoing to purge/split/partition this table, so it’s just a matter of time before we deal with it.

But the reality today is that CHECKDB is coming to a screeching halt in performance when dealing with this particular table.

Checking sys.dm_os_waiting_tasks and sys.dm_os_latch_stats showed that LATCH_EX (DBCC_OBJECT_METADATA) was by far the top wait type. I remembered hearing recently about that wait from another post that Paul Randal made, but that was related to computed-column indexes – and in fact, Paul himself reminded me of his article via Twitter. But alas, our pathological table had no non-clustered indexes on computed columns.
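(For the curious, the check itself is nothing fancy – here’s a minimal sketch of the kind of query involved, showing the top latch classes by accumulated wait time since the last restart or stats clear:)

-- Top latch classes by accumulated wait time, server-wide
SELECT latch_class,
       waiting_requests_count,
       wait_time_ms,
       max_wait_time_ms
FROM sys.dm_os_latch_stats
ORDER BY wait_time_ms DESC;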

I knew that latches are used by the database engine to do internal synchronization – but how could I help speed this up? After all, this is stuff that doesn’t have a lot of knobs to tweak.

(There’s a fantastic level 500 talk by Bob Ward from Microsoft CSS [blog|twitter] called “Inside SQL Server Latches” given at PASS 2010 – and you can check it out here. DISCLAIMER: I assume no responsibility for any brain melting that might ensue from watching Bob’s talk!)

Failed Hypotheses

Earlier this week I flew down to Palo Alto, CA, to visit our Headquarters – and after having a great time with my Monkey peers, I was relaxing on the plane back to Seattle watching a great talk by SQL Server MVP and fellow MCM Maciej Pilecki [twitter] called “Masterclass: A Day in the Life of a Database Transaction”, where he discusses many different topics related to transaction management inside SQL Server. Very good stuff, and when I got home it was a little late – that slow DBCC CHECKDB that I had been dealing with was way in the back of my head.

As I was looking at the problem at hand earlier this week, I thought “How about I set the database to read-only?” I remembered one of the things Maciej had (jokingly) said in his talk: “if you don’t want locking and blocking, set the database to read-only” (or something to that effect – pardon my loose memory). I immediately killed the CHECKDB that had been running painfully for days, and set the database to read-only mode. Then I ran DBCC CHECKDB against it. It started going really fast (even a bit faster than before), and then throttled down again to around 10MB/sec. All sorts of expletives went through my head at the time. Sure enough, the same latching scenario was present. Oh well.
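(For reference, the experiment boiled down to something like the following sketch – the database name here is made up:)

ALTER DATABASE [BigRestoredCopy] SET READ_ONLY WITH ROLLBACK IMMEDIATE;
GO

DBCC CHECKDB('BigRestoredCopy') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

-- Back to read-write once done
ALTER DATABASE [BigRestoredCopy] SET READ_WRITE WITH ROLLBACK IMMEDIATE;
GO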

I even spent some time trying to figure out if NUMA was hurting performance. Folks on Twitter made suggestions in this regard (thanks, Lonny! [twitter])

…Eureka?

This past Friday I was still scratching my head about the whole thing; I was ready to start profiling with XPERF to see if I could figure out which part of the engine was to blame and then get Microsoft to look at the evidence.

After getting a bunch of good news I’ll blog about separately, I sat down for a figurative smack down with CHECKDB before the weekend. And then the light bulb went on.

A sparse column. I thought that I couldn’t possibly be experiencing the same scenario that Paul blogged about back in March showing extreme latching with non-clustered indexes on computed columns. Did I even have a non-clustered index on my sparse column?

As it turns out, I did. I had one filtered non-clustered index – with the sparse column as the index key (and only column).

To prove that this was the problem, I went and set up a test.

Yup, that'll do it

The repro for this issue is very simple; I tested it on the latest public builds of SQL Server 2008 R2 SP2 (CU6) and SQL Server 2012 SP1 (CU4).

First, create a test database and a test table, which only needs to contain a sparse column:

CREATE DATABASE SparseColTest;
GO

USE SparseColTest;
GO

CREATE TABLE testTable (testCol smalldatetime SPARSE NULL);
GO

INSERT INTO testTable (testCol) 
    VALUES (NULL);
GO 1000000

That’s 1 million rows, and even though you’re only inserting NULLs, it’s going to take a while. On my laptop, it took 3 minutes and 31 seconds.
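(If you don’t feel like waiting on the row-by-row loop, here’s an alternative sketch – not part of the repro above – that loads the same million NULL rows with one set-based insert:)

-- Set-based load of 1,000,000 NULL rows
INSERT INTO testTable (testCol)
SELECT TOP (1000000) CAST(NULL AS smalldatetime)
FROM sys.all_columns AS a
CROSS JOIN sys.all_columns AS b;
GO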

Next, we run DBCC CHECKDB against the database:

DBCC CHECKDB('SparseColTest') WITH NO_INFOMSGS, ALL_ERRORMSGS;

This runs extremely fast, at least on my test rig – 198 milliseconds.

Now let’s create a filtered non-clustered index on the sparse column:

CREATE NONCLUSTERED INDEX [badBadIndex]
    ON testTable (testCol)
    WHERE testCol IS NOT NULL;

With the index in place now, let’s run DBCC CHECKDB one more time:

DBCC CHECKDB('SparseColTest') WITH NO_INFOMSGS, ALL_ERRORMSGS;

On my test system this statement completed in 11,433 milliseconds – 11.43 full seconds. Quite the jump from 198 milliseconds.
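(In the repro, getting back to the fast check is just a matter of dropping the offending index and running CHECKDB again:)

DROP INDEX [badBadIndex] ON testTable;
GO

DBCC CHECKDB('SparseColTest') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO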

I went ahead and dropped the filtered non-clustered indexes on the restored copy of our production database, and ran CHECKDB against that. We went down from 7+ days to 19 hours and 20 minutes.

Cue the “Argenis is not impressed” meme, please, Mr. LaRock.

My pain is your gain, folks. Go check to see whether you have any such indexes – they’re likely causing your consistency checks to run very, very slowly.
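(Here’s a rough sketch of a query to find them – filtered indexes that include a sparse column; BradC posts a broader version in the comments below that also covers computed columns:)

-- Filtered indexes whose columns include a sparse column
SELECT s.name AS SchemaName,
       o.name AS TableName,
       i.name AS IndexName,
       c.name AS ColumnName
FROM sys.indexes AS i
JOIN sys.objects AS o
    ON o.object_id = i.object_id
JOIN sys.schemas AS s
    ON s.schema_id = o.schema_id
JOIN sys.index_columns AS ic
    ON ic.object_id = i.object_id
   AND ic.index_id = i.index_id
JOIN sys.columns AS c
    ON c.object_id = ic.object_id
   AND c.column_id = ic.column_id
WHERE i.has_filter = 1
  AND c.is_sparse = 1;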

Happy CHECKDBing,

-Argenis

ps: I plan to file a Connect item for this issue – I consider it a pretty serious bug in the engine. After all, filtered indexes were invented BECAUSE of the sparse column feature – and it makes a lot of sense to use them together. Watch this space and my twitter timeline for a link.

Published Sunday, June 16, 2013 11:32 PM by Argenis


Comments

 

Dave said:

Your GO 1000000 is 7 digits so it is making 1 million rows, not 10 million

June 17, 2013 11:53 AM
 

Argenis said:

@Dave - you're absolutely correct. 10 was a typo - I only inserted 1 million rows.

June 17, 2013 12:10 PM
 

Joseph said:

Nice write-up...Thanks!

So, Argenis, what do you recommend?  Dropping this NCI on the offline database, before running the check?

Is the worst-case scenario (on the prod copy), if corruption does turn up one day, to drop and recreate it?

-Joseph

June 17, 2013 3:20 PM
 

Argenis said:

@Joseph - I would probably take a snapshot of the database, remove the NCI, run DBCC CHECKDB on that, and then revert to the snap and run another DBCC CHECKDB on the whole thing, and not care how long that takes.

June 17, 2013 3:24 PM
 

Joseph said:

Very nice, thanks!

-J

June 17, 2013 3:54 PM
 

Argenis said:

@Joseph indeed, if corruption is found, the easiest way to fix it would be to drop the index and recreate it. But this might not be a good strategy for your business. Relying on automatic page repair with mirroring or availability groups is a better idea IMO.

June 17, 2013 4:48 PM
 

Chris Adkin said:

My understanding is that filtered indexes and sparse columns were 'invented' to a large degree to support SharePoint. Specifically, SharePoint partly uses the entity-attribute model, which, despite its evils, works well with filtered indexes. Microsoft can be picky from a support/de-support perspective if you play around with system objects in SharePoint databases; I don't know if dropping a filtered system index would trigger de-support. However, this is something you may wish to bring up in your Connect item, as it ** might ** help your cause and that of the wider SQL community.

June 20, 2013 9:07 AM
 

Argenis said:

@Chris thanks, that's a great point.

June 20, 2013 11:17 AM
 

James said:

Nice post. How did you work out which table CHECKDB was stalling on? Was it just obvious in this case due to the table's size?

October 13, 2013 3:47 PM
 

Argenis said:

@James - indeed it was pretty obvious which table was the problem given its size. I had confirmed earlier that this was the case by dropping that table on a test environment and comparing DBCC CHECKDB execution times.

October 21, 2013 4:34 PM
 

BradC said:

Here is a query (just a slightly modified version of Paul's) that shows all indexes containing a calculated or sparse field:

SELECT s.name AS SchemaName,
       o.name AS TableName,
       i.name AS IndexName,
       c.name AS ColumnName,
       ic.*
FROM sys.columns c
JOIN sys.index_columns ic
    ON ic.object_id = c.object_id
   AND ic.column_id = c.column_id
JOIN sys.indexes i
    ON i.object_id = ic.object_id
   AND i.index_id = ic.index_id
JOIN sys.objects o
    ON i.object_id = o.object_id
JOIN sys.schemas s
    ON o.schema_id = s.schema_id
WHERE c.is_computed = 1
   OR c.is_sparse = 1;

January 30, 2014 5:57 PM
 

Manuj Bahl said:

The sparse column does not seem to play a role here; it's the filtered index that is causing the problem on my server.

USE [master]
SET NOCOUNT ON
DROP DATABASE [Playground]
GO

CREATE DATABASE [Playground];
GO

USE [Playground];
GO

CREATE TABLE [dbo].[FilteredIndexTest]
    (
      IsProcessed BIT DEFAULT ( 0 )
    )
GO

INSERT INTO [dbo].[FilteredIndexTest] ( IsProcessed )
VALUES ( 0 );
GO 10000

INSERT INTO [dbo].[FilteredIndexTest] ( IsProcessed )
SELECT [FilteredIndexTest].[IsProcessed]
FROM [dbo].[FilteredIndexTest]
GO 7

DECLARE @StartDateTime DATETIME
DECLARE @EndDateTime DATETIME

SELECT @StartDateTime = GETDATE()
DBCC CHECKDB('Playground') WITH ALL_ERRORMSGS, NO_INFOMSGS;
SELECT @EndDateTime = GETDATE()
SELECT DATEDIFF(ms, @StartDateTime, @EndDateTime) TimeTaken

CREATE NONCLUSTERED INDEX [IXF_FilteredIndexTest_IsProcessed]
    ON [dbo].[FilteredIndexTest] (IsProcessed)
    WHERE IsProcessed = 1;

SELECT @StartDateTime = GETDATE()
DBCC CHECKDB('Playground') WITH ALL_ERRORMSGS, NO_INFOMSGS;
SELECT @EndDateTime = GETDATE()
SELECT DATEDIFF(ms, @StartDateTime, @EndDateTime) TimeTakenWithFilteredIndex

DROP INDEX [FilteredIndexTest].[IXF_FilteredIndexTest_IsProcessed]

CREATE NONCLUSTERED INDEX [IX_FilteredIndexTest_IsProcessed]
    ON [dbo].[FilteredIndexTest] (IsProcessed)

SELECT @StartDateTime = GETDATE()
DBCC CHECKDB('Playground') WITH ALL_ERRORMSGS, NO_INFOMSGS;
SELECT @EndDateTime = GETDATE()
SELECT DATEDIFF(ms, @StartDateTime, @EndDateTime) TimeTakenWithNCIndex

May 29, 2014 8:50 AM
 
