THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Joe Chang

LOB Pointer Indexing Proposal

My observations are that IO to lob pages (and row overflow pages as well?) is restricted to synchronous IO, which can result in serious problems when these reside on disk drive storage. Even if the storage system is comprised of hundreds of HDDs, the realizable IO performance to lob pages is that of a single disk, with some improvement in parallel execution plans.

The reason for this appears to be that each thread must work its way through a page to find the lob pointer information, and then generates a synchronous IO if the page is not already in memory, continuing to the next row only when the IO is complete. I believe this issue could be addressed if we could build a non-clustered index where the index row header contains a pointer to the lob page instead of back to the cluster key (or heap page). Presumably this structure would be able to use the asynchronous IO capability of the SQL Server engine used for normal in-row pages. Per Mark Rasmussen’s PASS 2013 session, the index should perhaps point to all lob pages for very large structures requiring multi-level lob pointers.

Another thought is for the nonclustered index included columns could have the lob page pointer, so that we can jump directly to it instead of first going back to the table.

In the last two years, I have worked on databases that made extensive use of large text fields, prominently kCura (see Relativity), the legal document discovery application. Typically the fields are declared as varchar(max) but it could be any of the data types that are stored outside of the row, i.e., in either the lob_data pages or row_overflow_data as opposed to the normal in_row_data pages.

It is quickly apparent that accesses to the lob fields are horribly slow when the data is not in the buffer cache. This is on a storage system comprised of over one hundred 10K HDDs distributed over four RAID controllers.

SQL Server has an outstanding storage engine for driving disk IO involving normal in-row data (of course, I had previously complained on the lack of direct IO queue depth controls.) In the key lookup and loop join operations (that generate pseudo random IO), both high volume and single large query, SQL Server can issue IO asynchronously and at high queue depth. For a database distributed over many disk volumes, many IO channels, and very many HDDs, SQL Server can simultaneously drive IO to all disks, leveraging the full IO capability of the entire storage system, in this case 25-25K IOPS at low queue depth.

Apparently, this is not true for lob (and row-overflow) pages. A quick investigation shows that IO for lob pages is issued synchronously at queue depth one per thread (or DOP). This means each thread issues one IO. When the IO completes, it does the work on the data, then proceeds to the next IO. Given the data was distributed over limited portion of the disk, the actual access times was in the 2.5-3ms range corresponding to 300-400 IOPS. This is slightly less than 10K HDD theoretical random access comprised of 3ms rotational latency plus 3.5ms average seek for data distributed over the entire disk for 150 IOPS.

SET STATISTICS IO shows “lob physical reads 14083, lob read-ahead reads 39509”, indicating read-ahead reads, but this might be for the addition pages of a field larger than 8KB? There is some scaling with degree of parallelism in the IO access to LOB pages, as it appears that the synchronous IO is per thread. On this particular system, the decision was made to set MAXDOP at 8 out of 20 physical cores, 40 logical to better support multiple concurrent search queries. A query specifying LOB data can generate on order of 1500-2000 IOPS at DOP 8, perhaps 4000-6000 at DOP 20.

It was noticed that the time to generate a Full-Text index on a table with 20M rows and 500GB of LOB data took several hours. This would be consistent with a single thread running at 400 IOPS (not all rows have LOB pages). I do not recall any single CPU being pegged high during this period, but a proper observation is necessary. It could be that Full Text indexes creation will be much faster with better IO to the LOB pages (and parallelism?).

It so happens that kCura Relativity seems to run with most normal in-row pages in memory, generating little disk activity. In this case, one possible solution is to use the TEXTIMAGE_ON specification to place LOB in a file group on an SSD array. In principle the in-row data is more important, but these will mostly like be in-memory, hence not need the fast disk IO. The LOB pages that cannot benefit from proper asynchronous IO at high queue depth operation is placed on the low latency SSD. This is the reverse of putting important data on the more expensive storage media, but it suits the actual mode operation?

Even better is for Microsoft to fix this problem. My suggestion (and I know how the SQL Server team loves to hear my advice) is to use the index Included Columns feature to support pointers to LOB pages, instead of the column value itself. Naturally it does not make sense to include the LOB column because it would be stored out of page, as in the original table. The reason disk IO to lob pages is synchronous is that each thread must work its way into the row to find the file-page pointer? Nowhere is there list of pointers similar to the nonclustered index header. The advantages would: 1) having the list of pointers available in a compact structure that can be scan quickly, and 2) support asynchronous IO. The index row header would point to the lob pages, instead of back to the table? In order of the cluster key? Or some other column? Naturally it is necessary to understand how to SQL Server engine handle LOB IO to come up with the right solution.

At PASS 2013, there were two sessions on row and page organization: DBA-501 Understanding Data Files at the Byte Level by Mark S. Rasmussen, of iPaper A/S ( and DBA-502 Data Internals Deep Dive by Bradley Ball, . The Rasmussen session also details LOB organization.

Published Tuesday, October 29, 2013 4:29 PM by jchang

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS



tobi said:

My idea was to remove the idea of having lob pages at all. Instead, for each partition with lob columns, a 2nd b-tree would be generated of the form (<clusteringKey>, columnId INT, offset BIGINT, dataChunk varbinary(8000)). The LOB contents would be split over multiple rows with in-row binary data. This could even surface in the query plan as a join when LOB columns are requested. An internal aggregate could reconstruct the full LOB value in memory when requested by the query.

October 30, 2013 12:08 PM

Stephen O'Leary said:

You mention "quick investigation shows that IO for lob pages is issued synchronously at queue depth one per thread (or DOP)" can you please give details of your investigation and how you found this out, I'd like to be able to demonstrate this to some of our developers who store LOB data extensively (excessively in my opinion) and it would be great to be able to demonstrate the impact this has on performance.

November 14, 2013 12:22 PM

jchang said:

My tests were done on an active production system. There was other activity, and only a limited number of times I could issue DBCC DROPCLEANBUFFERS, which would have sent the regular IT staff into panic, so I did not tell them.

To do a proper test, create a table with one or more varchar max fields. I suggest several larges, then populate each row with the lob fields at 40KB (also try 4MB, corresponding to direct and 1 level indirection).

The reason for having several lob fields is so that the last lob field is large, hence in accessing the first lob field in each subsequent rows is beyond the storage system read-ahead (64K-4M???)

Then run queries looking for a substring at varying DOP.

On a clean system, this should demonstrate that queue depth corresponds to (just under) DOP.

November 14, 2013 12:45 PM

Paul Randal said:

We considered that during 2005 development, plus having a back-pointer to the owning data record to allow maintenance operations to be faster. It was going to be very complicated and so wasn't done. I'd be surprised to see this change any time soon.

November 15, 2013 12:40 PM

jchang said:

thanks Paul. I have been out of touch, so was slow to acknowledge. I am willing to accept that there was a heavy burden on the SQL Server 2005 development, as there were so many critical items, first being just getting a 64-bit X86 port. Technically a traditional OLTP or even DW should not have lob fields, so it may have been regarded as non-critical. But now SQL Server is being used far outside of traditional DB applications. And so many auxiliary features have been added to SQL Server. There should be enough people of the SQL Server team? to handle this, even if it is difficult?

November 22, 2013 11:27 AM

Leave a Comment


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine

This Blog


Privacy Statement