If you follow me on Twitter (@SQLSarg) you probably saw tweets about a bug I encountered in SQL Server 2008 recently. What was hoped to be a normal Monday morning started off with alerts from a new production server that DBCC CHECKDB had failed for our new Sharepoint 2010 primary content database. Thankfully, the Sharepoint 2010 portal was still in pre-production configuration and validation, so while I had a problem it wasn’t necessarily a major one. When I started looking at the issue, one item that struck me as odd was, the CHECKDB job didn’t output error information when it was running against this database, it just failed, and terminated the connection, so in order to determine what exactly was wrong I had to rerun DBCC CHECKDB WITH ALL_ERRORMSGS, NO_INFOMSGS on the database again. Since we moved the content database from our Dev/Test server where it was initially configured to the production server on Friday, I decided to run CHECKDB against the Dev/Test server while CHECKDB ran again on the production server so I could help determine if the problem pre-existed the migration, or was new to the production server.
Of course it being a Monday, both of the CHECKDB runs failed exactly the same way, without providing information as to where the corruption in the database existed. So on the test server, I placed the database into SINGLE_USER mode and reran CHECKDB. The output was similar to the following:
Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:691341; actual 0:0). It occurred during a read of page (1:691341) in database ID 25 at offset 0x0000015191a000 in file 'D:\SQLData\MSSQL10.MSSQLSERVER\MSSQL\DATA\Test.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
The same week that we moved this content database from our development to or anticipated production environment, we had a disk failure in our RAID 4 array, so I immediately suspected that we might have a problem with one of our other disks, so I ran disk diagnostics on the development server but found nothing wrong. I was about to toss in the towel and admit failure as a DBA when I tried rebuilding indexes as a last hope to see if possible the corruption was index related and I ran into a Numbers table in the Sharepoint content Database. Knowing that this isn't supposed to be there, my memory kicked in and I realized I had created this in model as a part of my previous blog posts on Tempdb, Does tempdb Get Recreated From model at Startup and Does the tempdb Log file get Zero Initialized at Startup? As a followup to those posts I decided to get really stupid and create a massive Numbers table in model to see if I could significantly impact instance startup by doing something stupid, only I never got around to playing further since it took a while to create the Numbers table seeded with the maximum value for an interger column.
So I dropped it from the content database and immediately it became corruption free. So I ran CHECKDB on model and it was fine, ran a SELECT * FROM Numbers and it returned from model. Try it in another Sharepoint database and I got the exact same error. I ran a reindex of the other Sharepoint database on the development server and it failed once again on the Numbers table, and once again, dropping this table from the database cleared the issue, so it is related to this oversized Numbers table that I created in model.
At this point I decided to create a connect feedback for the issue and solicit other members of the community for verification that this is in fact a problem inside of SQL Server 2008. The details are available on the following link:
I have been able to repro this on Windows Server 2008 R2 x64 and Windows Server 2008 x64with SQL Server 2008 x64 SP1 and SP1 + various CU's installed, as well as on Windows Server 2008 x86 with SQL Server 2008 SP1 x86 installed. However, I couldn't repro it SQL Server 2005 on any platform.
Here is the entire repro script from the Connect Feedback (Be forewarned, this will break your SQL Instance until you drop the Numbers table from Model):
ALTER DATABASE model SET RECOVERY FULL
-- Suppress data loading messages
SET NOCOUNT ON
-- Create Sample Data using a Table Varable
SELECT TOP 2147483647 IDENTITY(INT,0,1) AS N
FROM sysobjects a, sysobjects b, sysobjects c, sysobjects d, sysobjects e
-- Create a Primary Key and Clustered Index to control data order
ALTER TABLE dbo.Numbers ADD CONSTRAINT
PK_Numbers PRIMARY KEY CLUSTERED (N)
ALTER DATABASE [model] SET RECOVERY SIMPLE
DBCC SHRINKFILE (N'modellog' , 0, TRUNCATEONLY)
DBCC SHRINKFILE (N'modeldev' , 6144)
CREATE DATABASE Test
SELECT * FROM Numbers
I also sent this to a good friend of mine at Microsoft, Bob Ward, who graciously took the time to run the repro script and validate that this is in fact a form of bug in SQL Server 2008. He is still looking into it, but I would suspect that since it requires a 4GB+ table to be created in the model database, and the corruption problem has a valid workaround of dropping the table from model and any user database created based on the bloated model database, that this doesn’t necessarily warrant a Critical fix priority. What is really interesting with this bug is that smaller tables, 1-2GB inside of model don’t trigger the same error. I seems to be size based, with really large objects in model actually being the root cause of the problem.
I have been keeping in touch with Bob since he first replied to my email regarding this, and he is still looking into the reported bug. One certainly has to question what kind of person would actually create a 4GB table in the model database of SQL Server, knowing that this object would be copied to every database created on the server after it was created (wait a minute, I actually did this). As a note, the tempdb database on the server also reported corruption issues when it was restarted with this large object in the model database, and there was an impact to startup time by restarting the instance with this object in the model database (potentially a topic for another blog post as originally intended).