
Paul White: Page Free Space

A technical SQL Server blog from New Zealand.

Deletes that Split Pages and Forwarded Ghosts

Can DELETE operations cause pages to split?  Yes.  It sounds counter-intuitive on the face of it; deleting rows frees up space on a page, and page splitting occurs when a page needs additional space.  Nevertheless, there are circumstances when deleting rows causes them to expand before they can be deleted.  The mechanism at work here is row versioning (extract from Books Online below):

Row versioning space usage

Isolation Levels

The relationship between row-versioning isolation levels (the first bullet point) and page splits is reasonably clear.  Any data that existed before either of the isolation levels was enabled will need to have the 14 bytes added by future data modifications, perhaps causing pages to split.  In this scenario, tables will likely start out with a mix of versioned and unversioned records, but over time (particularly as index maintenance is performed) the database will end up with row versions on most records, reducing the chances of a page split for that particular reason.

There is nevertheless a window of opportunity where adding the 14 bytes to an existing record could cause a page split.  No doubt there’s a recommendation out there somewhere to rebuild all tables and indexes when enabling or disabling a row-versioning isolation level on a database.  This is not all that interesting though, so let’s look at the second bullet point instead:

Triggers

The documentation says that versioning information is added if the table has a trigger.  What it doesn’t say is:

  • The extra bytes for row versioning can be added even where both READ_COMMITTED_SNAPSHOT and SNAPSHOT isolation are OFF.
  • This only applies to AFTER triggers, not INSTEAD OF triggers.
  • The AFTER trigger also needs to be enabled for row versions to be generated; the mere existence of a trigger is not enough.
  • There is a very important exception to all the above…

SQL Server can still avoid adding the row-versioning information even where an enabled AFTER TRIGGER exists (the remainder of this post assumes that both row-versioning isolation levels are OFF, by the way).
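
Before going further, a quick way to check whether a table has any enabled AFTER triggers in the first place is the sys.triggers catalog view.  A minimal sketch (dbo.Test stands in for whichever table you are interested in):

-- Enabled AFTER triggers on a table (sketch)
SELECT
    trg.[name]
FROM sys.triggers AS trg
WHERE
    trg.parent_id = OBJECT_ID(N'dbo.Test', N'U')
    AND trg.is_disabled = 0
    AND trg.is_instead_of_trigger = 0;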

Avoiding Row Versioning with Triggers

To explore this behaviour in a bit of detail, we’ll need a test rig:

USE tempdb;
GO
IF  OBJECT_ID(N'dbo.Test', N'U') IS NOT NULL
    DROP TABLE dbo.Test;
GO
-- Test table
CREATE TABLE dbo.Test
(
    ID          integer IDENTITY PRIMARY KEY,
    Column01    nvarchar(20) NULL,
    Column02    nvarchar(4000) NULL
);
GO
-- Add some rows
INSERT dbo.Test WITH (TABLOCKX)
    (Column01)
SELECT TOP (100000)
    CONVERT(nvarchar(20), N'X')
FROM sys.columns AS c
CROSS JOIN sys.columns AS c2
CROSS JOIN sys.columns AS c3
OPTION (MAXDOP 1);
GO
-- A trigger that does nothing
CREATE TRIGGER trg
ON dbo.Test
AFTER DELETE
AS RETURN;
GO
-- Write any dirty pages to disk
CHECKPOINT;
GO
-- Physical storage before any changes
SELECT
    ddips.index_type_desc,
    ddips.alloc_unit_type_desc,
    ddips.page_count,
    ddips.record_count,
    ddips.max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
WHERE
    ddips.index_level = 0;
GO
-- Buffer pool pages before any changes
SELECT 
    dobd.[file_id],
    dobd.page_id,
    dobd.page_type,
    dobd.row_count,
    dobd.free_space_in_bytes,
    dobd.is_modified
FROM sys.partitions AS p
JOIN sys.allocation_units AS au 
    ON au.container_id = p.hobt_id
JOIN sys.dm_os_buffer_descriptors AS dobd ON 
    dobd.allocation_unit_id = au.allocation_unit_id
WHERE
    p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
ORDER BY
    dobd.page_id;
GO
SET STATISTICS IO ON;
 
-- Delete 1 in 10 rows
DELETE dbo.Test
WHERE ID % 10 = 0;
 
SET STATISTICS IO OFF;
GO
SELECT
    [Page Splits] = COUNT_BIG(*)
FROM sys.fn_dblog(NULL,NULL) AS fd 
WHERE 
    fd.[Transaction Name] = N'SplitPage';
GO
-- Ensure ghosted records are processed so
--  we see accurate per-page row counts
DBCC FORCEGHOSTCLEANUP;
GO
-- Physical storage after the delete operation
SELECT
    ddips.index_type_desc,
    ddips.alloc_unit_type_desc,
    ddips.page_count,
    ddips.record_count,
    ddips.max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
WHERE
    ddips.index_level = 0;
GO
-- Buffer pool pages after the delete operation
SELECT 
    dobd.[file_id],
    dobd.page_id,
    dobd.page_type,
    dobd.row_count,
    dobd.free_space_in_bytes,
    dobd.is_modified
FROM sys.partitions AS p
JOIN sys.allocation_units AS au 
    ON au.container_id = p.hobt_id
JOIN sys.dm_os_buffer_descriptors AS dobd ON 
    dobd.allocation_unit_id = au.allocation_unit_id
WHERE
    p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
ORDER BY
    dobd.page_id;
GO
 
--DBCC TRACEON (3604);
--DBCC PAGE (tempdb, 1, 4720, 3);

The script performs the following actions:

  1. Creates a clustered table with an ID and two data columns
  2. Adds 100,000 rows each with a single ‘X’ character in the first column, and NULL in the second
  3. Creates an AFTER DELETE trigger that does nothing at all except exist and be enabled
  4. Displays physical storage and buffer pool information using DMVs
  5. Deletes every tenth row in the table
  6. Shows the number of leaf-level page splits that occurred
  7. Displays the physical storage and buffer pool information again

Test One – Clustered Table

Typical output:

Clustered Table Delete Test 1

There are 235 data pages with a maximum physical record size of 17 bytes before and after the delete.  Before the delete, each data page contains 426 rows with 2 bytes of free space.  After the DELETE:

  • A total of 10,000 records have been deleted
  • The data page count remains at 235
  • The maximum record size is still 17 bytes
  • Each data page has lost 42 or 43 rows
  • The free space on each page has risen to 800 or 819 bytes
  • All data pages are marked as being modified in memory
  • A total of 237 logical reads are reported

No surprises there.
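
As a quick sanity check on those numbers: a data page offers 8,096 bytes after the 96-byte page header, and each record needs a 2-byte slot array entry on top of its 17 bytes, so 426 × (17 + 2) = 8,094 bytes, which accounts for the 2 bytes of free space reported before the delete.  Likewise, removing 42 or 43 rows releases 42 × 19 = 798 or 43 × 19 = 817 bytes which, added to the original 2 bytes, gives the 800 or 819 bytes of free space seen afterward.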

Test Two – Clustered Table

Now, run the script again with the only change being that Column01 is defined as nvarchar(22) instead of nvarchar(20).  The ‘before’ picture is unchanged, but the situation after the DELETE is very different:

Clustered Table Delete Test 2

There have been 234 page splits, increasing the data page count from 235 to 469 pages and halving the number of rows on each data page.  The number of logical reads reported has also blown out from 237 to 2,342 in this run (roughly a factor of ten worse).
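
(Those figures tie up neatly: each of the 234 splits allocates one new data page, so 235 + 234 = 469 pages, with roughly half the rows on each split page moving to its new neighbour.)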

Explanation

The cause of the page splitting is that the deleted records must be versioned.  SQL Server 2005 and later uses the version store to build the inserted and deleted pseudo-tables used by AFTER triggers.  Where a record has no pre-existing versioning information, adding the 14 bytes will cause a clustered index page split if the page does not contain enough free space to accommodate the expansion.

Temporarily turning off ghost record clean-up using global trace flag 661 and examining an affected data page using DBCC PAGE shows the following (remember to turn the trace flag off afterward if you try this):

Row versioning ghost data record

Slots 8 and 10 on this page hold records that were unaffected by the DELETE; the physical row length is 17 bytes as displayed previously.  The record that was in slot 9 has been deleted.  It is a ghost record with versioning information added.  The record size is 17 + 14 = 31 bytes, and this expansion with only 2 bytes of free space on the page caused it to split.

This explains why the nvarchar(22) DELETE test caused page splitting, but why didn’t the original nvarchar(20) script behave the same?

There is a performance optimization that can avoid adding row versioning information, but only if the table cannot generate ROW_OVERFLOW or LOB allocation units.  This means that the definition of the table must not allow for LOBs or for the possibility of variable length columns moving off row.  The actual size of the data stored is immaterial – it is the potential size that matters.
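
A quick way to eyeball whether a table definition allows LOB or ROW_OVERFLOW allocations is to look at the declared maximum column sizes.  The sketch below is just a helper of mine, not part of the test rig; it lists the columns of dbo.Test with their declared maximum in-row lengths, where a max_length of -1 marks a LOB-capable type such as nvarchar(max):

-- Declared column sizes (sketch)
SELECT
    c.[name],
    TYPE_NAME(c.user_type_id) AS type_name,
    c.max_length    -- bytes; -1 = LOB-capable type
FROM sys.columns AS c
WHERE
    c.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
ORDER BY
    c.column_id;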

In our test, the nvarchar(22) column definition caused the maximum possible INROW size to just exceed the 8060 byte limit.  (The exact INROW limit also depends on the table definition; marking one of the data columns SPARSE would reduce the limit to 8015-8019 bytes – whatever the right number is).
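
For the curious, my back-of-the-envelope arithmetic for this table (the exact internal accounting is SQL Server's business, so treat this as approximate): the record overhead with both variable-length columns populated is 2 bytes of status bits + 2 bytes for the fixed-length size + 4 bytes for the integer ID + 2 bytes for the column count + 1 byte of NULL bitmap + 2 bytes for the variable-column count + 4 bytes of variable-column offsets = 17 bytes.  Adding the maximum variable data of 44 + 8,000 bytes gives a potential record size of 8,061 bytes with nvarchar(22), against 8,057 bytes with nvarchar(20).  One lands just over the 8,060-byte in-row limit and the other just under, which matches the behaviour observed above.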

Heaps of Forwarded Ghosts

“Page splitting only occurs in index structures, so are heap-structured tables affected by this issue too?”

It is true that heap pages do not split, but when a heap row needs to expand, the engine will move the row to another page if there is insufficient free space on the current page.  When the storage engine does this, it leaves a forwarding pointer behind to avoid having to update all non-clustered indexes with the new physical row locator.

For a heap table with an enabled AFTER trigger and a LOB column (or the possibility of row-overflow), the row has to be versioned and ghosted.  If the page contains insufficient free space to accommodate the versioning, the row moves to another page, leaving a forwarding stub behind.  The result is a forwarded ghost record.  Ghost clean-up will normally remove this record pretty quickly, so we will need to disable that process temporarily.  The following script creates the very special circumstances necessary to produce a forwarded ghost record this way (note: this script is for test systems only and should not be run in tempdb):

USE Sandpit;
GO
IF  OBJECT_ID(N'dbo.Test', N'U') IS NOT NULL
    DROP TABLE dbo.Test;
GO
-- Heap test
CREATE TABLE dbo.Test
(
    ID          integer IDENTITY PRIMARY KEY NONCLUSTERED,
    Column01    nvarchar(16) NULL,
    Column02    nvarchar(4000) NULL
);
GO
-- Add some all-NULL rows
INSERT dbo.Test WITH (TABLOCKX)
    (Column01)
SELECT TOP (100000)
    NULL
FROM sys.columns AS c
CROSS JOIN sys.columns AS c2
CROSS JOIN sys.columns AS c3
OPTION (MAXDOP 1);
GO
-- Ensure rows are tightly packed
ALTER TABLE dbo.Test REBUILD;
GO
-- A trigger that does nothing
CREATE TRIGGER trg
ON dbo.Test
AFTER DELETE
AS RETURN;
GO
-- Write any dirty pages to disk
CHECKPOINT;
GO
-- Physical storage before any changes
SELECT
    ddips.index_type_desc,
    ddips.page_count,
    ddips.record_count,
    ddips.max_record_size_in_bytes,
    ddips.ghost_record_count,
    ddips.forwarded_record_count
FROM sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
WHERE
    ddips.alloc_unit_type_desc = N'IN_ROW_DATA'
    AND ddips.index_level = 0;
GO
-- Buffer pool pages before any changes
SELECT 
    dobd.[file_id],
    dobd.page_id,
    dobd.page_type,
    dobd.row_count,
    dobd.free_space_in_bytes,
    dobd.is_modified
FROM sys.partitions AS p
JOIN sys.allocation_units AS au 
    ON au.container_id = p.hobt_id
JOIN sys.dm_os_buffer_descriptors AS dobd ON 
    dobd.allocation_unit_id = au.allocation_unit_id
WHERE
    p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
    AND dobd.page_type = N'DATA_PAGE'
ORDER BY
    dobd.page_id;
GO
-- Disable ghost clean-up
DBCC TRACEON (661, -1);
GO
SET STATISTICS IO ON;
 
-- Delete three records on the same page
DELETE dbo.Test
WHERE ID BETWEEN 1 AND 3;
 
SET STATISTICS IO OFF;
GO
-- Physical storage after the delete operation
SELECT
    ddips.index_type_desc,
    ddips.page_count,
    ddips.record_count,
    ddips.max_record_size_in_bytes,
    ddips.ghost_record_count,
    ddips.forwarded_record_count
FROM sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
WHERE
    ddips.alloc_unit_type_desc = N'IN_ROW_DATA'
    AND ddips.index_level = 0;
GO
-- Buffer pool pages after the delete operation
SELECT 
    dobd.[file_id],
    dobd.page_id,
    dobd.page_type,
    dobd.row_count,
    dobd.free_space_in_bytes,
    dobd.is_modified
FROM sys.partitions AS p
JOIN sys.allocation_units AS au 
    ON au.container_id = p.hobt_id
JOIN sys.dm_os_buffer_descriptors AS dobd ON 
    dobd.allocation_unit_id = au.allocation_unit_id
WHERE
    p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
    AND dobd.page_type = N'DATA_PAGE'
    AND dobd.is_modified = 1
ORDER BY
    dobd.page_id;
GO
-- View the appropriate page
--DBCC TRACEON (3604);
--DBCC PAGE (0, 1, 339408, 3);
 
-- Enable ghost clean-up
DBCC TRACEOFF (661, -1);

Example output, showing page 453648 receiving a versioned forwarded ghost record:

Versioned Forwarded Ghost Record

Partial DBCC PAGE output for the highlighted page:

DBCC PAGE forward ghost record

Summary

If you ever wonder why your deletes are so slow, it is worth checking to see if you are suffering from page splitting due to an enabled trigger and a table definition that allows for LOB allocations or ROW_OVERFLOW.  Any table with a LOB column (including the max data types) qualifies, as does one with even a surprisingly small number of variable-length columns, as shown in the examples in this post.  This is a great reason to avoid using ‘max’ or old-style LOB data types unnecessarily, and to be careful about the maximum length of ‘normal’ variable-length data types too.  Remember, it is the potential maximum row size that is important, not the actual row size.

On a related note, remember that deletes on a heap can only deallocate empty pages if a table lock is acquired?  A table definition that allows for LOB or ROW_OVERFLOW prevents that optimization too.  So, if your heaps are growing despite DELETE WITH (TABLOCKX), check the maximum possible row length!  You could convert them to clustered tables instead, of course, but that is quite a different debate.
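
If you want to see whether a heap is hanging on to empty pages after deletes, comparing its row count with its allocated pages is a quick test.  A minimal sketch (dbo.Test here is the heap from the script above; any heap works the same way):

-- Rows versus allocated pages for a heap (sketch)
SELECT
    p.[rows],
    au.type_desc,
    au.total_pages,
    au.used_pages,
    au.data_pages
FROM sys.partitions AS p
JOIN sys.allocation_units AS au
    ON au.container_id = p.hobt_id
WHERE
    p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
    AND p.index_id = 0;    -- index_id 0 = heap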

I would like to acknowledge and thank SQL Server MVP Dmitri V. Korotkevitch who first brought the basic issue to my attention with UPDATE queries.  I would strongly encourage you to also read his blog entry showing how this behaviour also affects UPDATE queries, resulting in slow performance and excessive fragmentation.

Thanks for reading.

© 2012 Paul White
Twitter: @SQL_Kiwi
Email: SQLkiwi@gmail.com

Published Friday, August 31, 2012 6:09 AM by Paul White


Comments

 

AlexK said:

Paul,

Have you checked if the presence of an OUTPUT clause in a DML command has the same effect as a trigger?

That's a really interesting read, I enjoyed it, thanks!

August 30, 2012 10:26 PM
 

Paul White said:

Hi Alex,

No, the OUTPUT clause doesn't have the same effect.  Columns from the inserted and deleted tables used with the OUTPUT clause are provided by the execution plan; it doesn't use the version store.  Great question, though!

Paul

August 31, 2012 12:44 AM
 

Niko Neugebauer said:

Great info, Paul!

As always, there are so many cases when some "insignificant" changes of the internal structure affect the DB performance in such a big way.

August 31, 2012 12:14 PM
 

Patrick Ge said:

Thanks Paul. As always, great stuff. Very detailed explanation and excellent demonstration!

Just in case you or other people are not aware of it, Paul Randal also touched on this topic a little bit in this post of his:

http://www.sqlskills.com/BLOGS/PAUL/post/Ghost-cleanup-redux.aspx

September 13, 2012 7:15 PM
 

Mark Cleary said:

Minor typo - I think you meant to write the sparse row limit is 8016, not 8616. I think the actual value is 8018 though BOL also mentions 8019.

September 14, 2012 6:55 PM
 

Paul White said:

Thanks Mark - 8616 was indeed a fat-finger issue.  Updated the post with the vague range of numbers (still can't be bothered to check it properly!)

Paul

September 14, 2012 7:10 PM
 

Mark Cleary said:

I just tested this with nvarchar(max) and the results seem to be worse. I'd assumed that the algorithm would take the max in-row size for column02 as 4000 and the results would be the same.

Instead when using the 20 character column01, I finished up with 469 pages and the free space was mostly 4428 to 4467 bytes.

This is on SQL Server 2012 (32 bit).

We have a table where we added an audit column where we record changes made by data cleaning processes. This uses varchar(max) and would appear to be a bad bad thing.

September 14, 2012 10:44 PM
 

Paul White said:

Hi Mark,

The maximum in-row size for nvarchar(max) is 8000 bytes not 4000 (Unicode characters use two bytes each).  So your results are pretty much as expected.

If your table with the audit column has a trigger (and I assume it does - for auditing!) then yes you will be experiencing this phenomenon.

Using a suitable FILLFACTOR on your table will help avoid splitting and fragmentation.  To see this, add the following statement to the test rig just after the INSERT that loads the data:

ALTER INDEX ALL ON dbo.Test REBUILD WITH (FILLFACTOR = 90);

Paul

September 15, 2012 2:45 AM
 

Mark Cleary said:

I don't think you understood what I said.

The maximum in-row size for nvarchar(max) is 4000 characters, which is the same as your sample where column02 is nvarchar(4000). Both reserve 8000 bytes. I assumed nvarchar(max) would produce similar results, where column01 being nvarchar(20) doesn't page split but nvarchar(22) does.

I've been playing with fill factor and it does eliminate most page splits. We aren't using triggers.

September 15, 2012 2:53 AM
 

Paul White said:

Hi Mark,

Sorry if I misunderstood you.  Anyway, nvarchar(max) is a LOB type.  As I say in the main text:

"There is a performance optimization that can avoid adding row versioning information, but only if the table cannot generate ROW_OVERFLOW or LOB allocation units.  This means that the definition of the table must not allow for LOBs or for the possibility of variable length columns moving off row."

So the presence of a type that might generate LOB allocation units ensures row versioning information is added, resulting in splits (regardless of potential row size issues).

Paul

September 15, 2012 8:44 AM
 

Mark Cleary said:

Ah, yes, I did read the bit about the optimisation and promptly forgot it.

We've moved to a design where the audit columns are stored in a separate table so we avoid page splitting, all inserts are sequential and we effectively force off row storage which means we don't retrieve the audit data unless we actually need it. Oh dear, sounds like a column store.

September 16, 2012 7:30 PM
