From time to time, I encounter a system design that always issues an UPDATE against the database after a user has finished working with a record – without checking to see if any of the data was in fact altered. The prevailing wisdom seems to be that “the database will sort it out”. This raises an interesting question: how smart is SQL Server in these circumstances? In this post, I’ll look at a generalisation of this problem: what is the impact of updating a column to the value it already contains?
The specific questions I want to answer are:
- Does this kind of UPDATE generate any log activity?
- Do any data pages get marked as dirty (and so eventually get written out to disk)?
- Does SQL Server bother doing the update at all?
To answer each of these questions, we will need a test database, and way to measure each of the potential effects.
The Test Database
The following code creates a table with a single row of data, in a brand new database:
CREATE DATABASE [5B4F379C-7B88-45C8-9CF2-662BC4CCCDE7];
ALTER DATABASE [5B4F379C-7B88-45C8-9CF2-662BC4CCCDE7] SET ALLOW_SNAPSHOT_ISOLATION OFF;
ALTER DATABASE [5B4F379C-7B88-45C8-9CF2-662BC4CCCDE7] SET READ_COMMITTED_SNAPSHOT OFF;
CREATE TABLE dbo.SomeTable
some_column INTEGER NOT NULL,
some_data INTEGER NOT NULL,
VALUES (1, 999);
Measuring log activity
To assess the impact on the transaction log, we can use the undocumented (but widely known and used) system table-valued function sys.fn_dblog – a handy function that provides a way to read the active part of the transaction log. To make it easy to see just the log records generated by our test UPDATE statement, we’ll run a CHECKPOINT immediately before the test itself. Since the database we created is operating in log auto-truncate mode, this will ‘clear out’ the log for us.
SELECT L.[Current LSN],
FROM sys.fn_dblog (NULL, NULL) L;
As you can see, the transaction log just contains records concerning the CHECKPOINT operation itself. We can now run test UPDATE queries and examine the log again afterwards to see the changes. By the way, if you were wondering if the new database’s recovery model makes a difference here, the answer is ‘no’. All newly-created databases run in log auto-truncate mode. If the recovery model is full or bulk-logged, this continues until the first full backup is taken. A database running the simple recovery model always operates in log auto-truncate mode.
Measuring Dirty Pages
The dynamic management view sys.dm_os_buffer_descriptors shows the pages held in SQL Server’s buffer pool. One of the columns returned from this view (is_modified) shows if a page has been changed since it was read from disk. We can write a simple query to show the state of all pages associated with our test table:
FROM sys.dm_os_buffer_descriptors BUF
JOIN sys.allocation_units AU
ON AU.allocation_unit_id = BUF.allocation_unit_id
JOIN sys.partitions PAR
ON AU.container_id =
WHEN AU.type = 2 THEN PAR.partition_id
WHERE BUF.database_id = DB_ID()
AND PAR.object_id = OBJECT_ID(N'dbo.SomeTable', N'U');
The CHECKPOINT we need to run to clear the log will also ensure that all dirty pages are written to disk. After running a test UPDATE, we can run the above script to see if any pages were dirtied (and so will need to be flushed to disk at some point).
Measuring Update Activity
Finally, we need a way to determine if SQL Server performed the update, or skipped it entirely. We can use the sys.dm_db_index_operational_stats dynamic management view for this. Despite the name, it works with heaps as well as clustered tables:
This will show the number of index (or heap) operations performed at the leaf level.
A Simple Update
There are two basic ways to write an UPDATE that does not change the column values. First, we can explicitly set a column to the value we know is stored there. Alternatively, we can write an statement that simply sets a column equal to itself:
UPDATE dbo.SomeTable SET some_data = 999;
UPDATE dbo.SomeTable SET some_data = some_data;
In this test, both syntaxes produce the same results:
The logging test results show that two log records were written, to log the start and end of the implicit transaction wrapping the UPDATE statement. There are no data modification log records. The buffer pool test shows that neither of the two pages associated with the table were marked dirty by the UPDATE. No buffer pages will need to be flushed to disk as a result of the update. The index operations test shows one insert (caused by the initial data load) and one update operation against index zero (the heap). This confirms that the update was physically performed (rather than skipped altogether) when using either UPDATE syntax. The query plans for the two syntax variations are not quite identical, though they do look the same at first glance:
When we specify a constant value (999), the Table Scan iterator does not pass any column references up the plan, it just passes a bookmark so the Table Update knows which row to modify. When we use the column name syntax, the Table Scan passes the same bookmark plus the value currently stored in the some_data column.
If we perform the same tests, but with a clustered index on some_column, we get the same results – except the index_id in the index operations test is one instead of zero. The logging and buffer pool results are identical. The query plan is slightly different of course:
In addition to the different icons, the Clustered Index Scan passes a reference to some_column to serve as the bookmark. Essentially, though, it’s the same plan.
Updating Both Columns
Things get more interesting if we update both the some_column and some_data columns:
SET some_column = 1,
some_data = 999;
SET some_column = some_column,
some_data = some_data;
When our test table is configured as a heap table, we get the same results as before: just the implicit transaction is logged, the buffer pool pages are not marked dirty, and the index operations test shows that a physical update was performed. This remains the case even if we create a non-clustered primary key on some_column.
Performing the same tests on a clustered table produces a rather different output:
The logging test shows that SQL Server now logs marking the existing row as a ghost and inserting a new row, along with adjustments to the Page Free Space (PFS) record (to record the fact that the page now contains at least one ghost record). The buffer pool test shows that the data page associated with the clustered table is now marked as dirty. The 8KB page will be written to disk shortly (by the checkpoint process, the lazy writer, or an about-to-suspend worker thread). The index operations test results show one row-ghosting operation and an insert, as indicated in the transaction log records.
The important observation is that a clustered table will always produce full logging and dirty buffer pages if (any part of) the clustering key is updated.
Effects of a LOB Column
For our third test scenario, we’ll add an NVARCHAR(MAX) column to the table:
CREATE TABLE dbo.SomeTable
some_column INTEGER NOT NULL,
some_data INTEGER NOT NULL,
big_data NVARCHAR(MAX) NOT NULL,
The presence of the LOB column does not affect the previous results (whether we ‘update’ the LOB column or not) with one important exception: If the LOB column contains more than 8000 bytes of data the effects seen depend on the syntax we use. If the SET column_name = column_name syntax is used, we can still see minimal transaction logging and clean buffer pool pages. Let’s modify the test rig to use the explicit-value syntax instead:
(some_column, some_data, big_data)
VALUES (1, 999, REPLICATE(CONVERT(NVARCHAR(MAX), N'X'), 4001));
SET big_data = REPLICATE(CONVERT(NVARCHAR(MAX), N'X'), 4001);
That produces the fully-logged, dirty-page-creating behaviour, regardless of whether the table is a heap or clustered, but only if the LOB data exceeds 8000 bytes (4000 Unicode characters in the above example). The only visible difference in the query plan is an extra Compute Scalar iterator, used to add in the result of the REPLICATE expression.
If either of the database options ALLOW_SNAPSHOT_ISOLATION or READ_COMMITTED_SNAPSHOT are ON even if neither are being used we always get the fully-logged, page-dirtying behaviour, regardless of anything that has been said so far. This is because the engine starts generating and storing row versions for all data modification operations. Adding or updating the row versioning information on a page inevitably results in page changes which must be logged and flushed to disk at some stage.
SQL Server contains a number of optimisations to avoid unnecessary logging or page flushing when processing an UPDATE operation that will not result in any change to the persistent database.
Non-updating updates to a clustered table generally avoid extra logging and page flushing, unless a column that forms (part of) the cluster key is affected by the update operation.
If any part of the cluster key is ‘updated’ to the same value, the operation is logged as if data had changed, and the affected pages are marked as dirty in the buffer pool. This is a consequence of the conversion of the UPDATE to a delete-then-insert operation.
Heap tables behave the same as clustered tables, except they do not have a cluster key to cause any extra logging or page flushing. This remains the case even where a non-clustered primary key exists on the heap. Non-updating updates to a heap therefore generally avoid the extra logging and flushing (but see below).
Both heaps and clustered tables will suffer the extra logging and flushing for any row where a LOB column containing more than 8000 bytes of data is updated to the same value using any syntax other than ‘SET column_name = column_name’.
Simply enabling either type of row versioning isolation level on a database always causes the extra logging and flushing. This occurs regardless of the isolation level in effect for the update transaction.
SQL Server 2005 and later include a further important optimisation that can eliminate the overhead of updating non-clustered indexes where column values have not changed. See this blog entry by the SQL Server Query Optimiser Team.
More information on the sys.dm_db_index_operational_stats DMV by Louis Davidson (twitter | blog).
More information on the system function fn_dblog by Kevin Kline (twitter | blog)
My thanks to Josef Richberg (twitter | blog) and Mladen Prajdic (twitter | blog) for their contributions to the #sqlhelp twitter discussion that prompted this blog entry.