My observations are that IO to lob pages (and row overflow pages as well?) is restricted to synchronous IO, which can result in serious problems when these reside on disk drive storage. Even if the storage system is comprised of hundreds of HDDs, the realizable IO performance to lob pages is that of a single disk, with some improvement in parallel execution plans.
The reason for this appears to be that each thread must work its way through a page to find the lob pointer information, and then generates a synchronous IO if the page is not already in memory, continuing to the next row only when the IO is complete.
I believe this issue could be addressed if we could build a non-clustered index where the index row header contains a pointer to the lob page instead of back to the cluster key (or heap page). Presumably this structure would be able to use the asynchronous IO capability of the SQL Server engine used for normal in-row pages. Per Mark Rasmussen’s PASS 2013 session, the index should perhaps point to all lob pages for very large structures requiring multi-level lob pointers.
Another thought is for the nonclustered index included columns could have the lob page pointer, so that we can jump directly to it instead of first going back to the table.
In the last two years, I have worked on databases that made extensive use of large text fields, prominently kCura (see
Relativity), the legal document discovery application. Typically the fields are declared as varchar(max) but it could be any of the data types that are stored outside of the row, i.e., in either the lob_data pages or row_overflow_data as opposed to the normal in_row_data pages.
It is quickly apparent that accesses to the lob fields are horribly slow when the data is not in the buffer cache. This is on a storage system comprised of over one hundred 10K HDDs distributed over four RAID controllers.
SQL Server has an outstanding storage engine for driving disk IO involving normal in-row data (of course, I had previously complained on the lack of direct IO queue depth controls.) In the key lookup and loop join operations (that generate pseudo random IO), both high volume and single large query, SQL Server can issue IO asynchronously and at high queue depth. For a database distributed over many disk volumes, many IO channels, and very many HDDs, SQL Server can simultaneously drive IO to all disks, leveraging the full IO capability of the entire storage system, in this case 25-25K IOPS at low queue depth.
Apparently, this is not true for lob (and row-overflow) pages. A quick investigation shows that IO for lob pages is issued synchronously at queue depth one per thread (or DOP). This means each thread issues one IO. When the IO completes, it does the work on the data, then proceeds to the next IO. Given the data was distributed over limited portion of the disk, the actual access times was in the 2.5-3ms range corresponding to 300-400 IOPS. This is slightly less than 10K HDD theoretical random access comprised of 3ms rotational latency plus 3.5ms average seek for data distributed over the entire disk for 150 IOPS.
SET STATISTICS IO shows “lob physical reads 14083, lob read-ahead reads 39509”, indicating read-ahead reads, but this might be for the addition pages of a field larger than 8KB? There is some scaling with degree of parallelism in the IO access to LOB pages, as it appears that the synchronous IO is per thread. On this particular system, the decision was made to set MAXDOP at 8 out of 20 physical cores, 40 logical to better support multiple concurrent search queries. A query specifying LOB data can generate on order of 1500-2000 IOPS at DOP 8, perhaps 4000-6000 at DOP 20.
It was noticed that the time to generate a Full-Text index on a table with 20M rows and 500GB of LOB data took several hours. This would be consistent with a single thread running at 400 IOPS (not all rows have LOB pages). I do not recall any single CPU being pegged high during this period, but a proper observation is necessary. It could be that Full Text indexes creation will be much faster with better IO to the LOB pages (and parallelism?).
It so happens that kCura Relativity seems to run with most normal in-row pages in memory, generating little disk activity. In this case, one possible solution is to use the TEXTIMAGE_ON specification to place LOB in a file group on an SSD array. In principle the in-row data is more important, but these will mostly like be in-memory, hence not need the fast disk IO. The LOB pages that cannot benefit from proper asynchronous IO at high queue depth operation is placed on the low latency SSD. This is the reverse of putting important data on the more expensive storage media, but it suits the actual mode operation?
Even better is for Microsoft to fix this problem. My suggestion (and I know how the SQL Server team loves to hear my advice) is to use the index Included Columns feature to support pointers to LOB pages, instead of the column value itself. Naturally it does not make sense to include the LOB column because it would be stored out of page, as in the original table.
The reason disk IO to lob pages is synchronous is that each thread must work its way into the row to find the file-page pointer? Nowhere is there list of pointers similar to the nonclustered index header. The advantages would: 1) having the list of pointers available in a compact structure that can be scan quickly, and 2) support asynchronous IO. The index row header would point to the lob pages, instead of back to the table? In order of the cluster key? Or some other column? Naturally it is necessary to understand how to SQL Server engine handle LOB IO to come up with the right solution.
At PASS 2013, there were two sessions on row and page organization: DBA-501 Understanding Data Files at the Byte Level by Mark S. Rasmussen, of iPaper A/S (improve.dk)
and DBA-502 Data Internals Deep Dive by Bradley Ball, .
The Rasmussen session also details LOB organization.