THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server'

Why does SQL Server not compress data on LOB pages?

Enabling compression on your database can save you a lot of space – but when you have a lot of varchar(max) or nvarchar(max) data, you may find the savings to be limited. This is because only data stored on the data and index pages is compressed, and data for the (max) data types is generally stored on other, special-purpose pages – either text/image pages, or row overflow data pages. (See Understanding Pages and Extents in Books Online). This is from the SQL Server 2008R2 Books Online, but it is still valid in SQL Server 2012 – but apparently, this page has been removed from newer Books Online editions).

So why does SQL Server not compress the data that, perhaps, would benefit most from compression? Here’s the answer.

SQL Server currently supports two compression methods for data in the database (backup compression is out of scope for this post).

* Row compression: This is a simple algorithm to save storage space for individual rows. It has two elements. The first is a more efficient way to store the per-row metadata, saving a few bytes per row regardless of layout and content. The second element is storing almost all data types, even those that have a fixed length, as variable length. This mainly has benefits for the larger numerical types (e.g a bigint with a value of 1,000 is stored in two bytes instead of eight – only values that actually need all eight bytes do not gain from this, and will instead take up more space because the actual length has to be stored somewhere) and for fixed-length string types with lots of trailing spaces. For Unicode data, the SCSU algorithm is used, which saves 15% to 50% depending on the actual content of the column. (According to Wikipedia, the SCSU standard has gained very little adoption because it is not as effective as other compression schemes).

See Row Compression Implementation and Unicode Compression Implementation in Books Online.

* Page compression: When enabled, page compression is done *after* row compression. As the name implies, it's done on a per-page basis. It consists of two steps:

1. Prefix compression. Within each column, the longest common prefix is used to build the "anchor record". All columns than only indicate how many characters of the anchor value they use as prefix. So for example, if we have a first name column with the values Roger / Hugo / Hugh, the anchor value could be Hugh, and the data values would be stored as {0}Roger / {3}o / {4}. (Here, {3} is stored as a single byte, and {3}o means: first three characters of Hugh, followed by an o).

2. Dictionary compression. Accross the entire page, columns that are now stored with the same bit pattern are replaced with a single value that points to the dictionary entry. Let's assume that the same page I use above also has a Lastname column, with values Plowman / Kornelis / Ploo. Here, Plowman would be the anchor value, and the data after prefix compression would be {7} / {0}Kornelis / {3}o. The dictionary encoding would then see that there is a {3}o in the population of the Firstname columnm and a {3}o in the population of the Lastname column. It would place {3}o as the first entry in the dictionary and replace both {3}o values with the reference [1].

See Page Compression Implementation in Books Online.

All elements of page compression save space by eliminating repeated data between different column values, so they will only work when multiple values are stored on a page. For all LOB pages, the reverse is the case: a single value spans multiple pages. So by definition, page compression can never yield any benefits.

For row compression, the more efficient storage of per-row metadata naturally only affects pages that have per-row metadata stored – data and index pages, but not LOB pages. And the conversion of fixed length to variable length data types also doesn’t affect LOB pages, since these can only be used for varying length data.

Based on the above, it is obvious why SQL Server does not compress varchar and varbinary data stored on LOB pages – there would be zero benefit from any of the implemented compression methods. But how about Unicode compression for nvarchar(max) and overflowing nvarchar(nnn) data? Wouldn’t that save some space?

To answer that, I now have to go into speculation mode. And I see two possible theories:

1. Because the SCSU standard saves less spacing than other algorithms, the SQL Server team deliberately made this choice in order to encourage people to compress these large values in the client before sending them to the server, thereby reducing not only storage space (by more than SCSU would have yielded), but also network traffic. The down side of this is that cool features such as Full-Text Search and Semantic Search don’t understand data that was compressed at the client – at least not without a lot of extra effort.

2. Since all compression algorithms work on a per-page basis, they had a choice between either first breaking the LOB data into pages and then compressing (which makes no sense, as the rest of the page would remain empty and the amount of space actually used remains the same) or creating a separate algorithm for LOB data to first compress it and then split it over multiple pages. That would of course have cost a lot of extra engineering hours, and if my understanding of SCSU is correct, it would also have a big adverse side effect on operations that affect only a part of an nvarchar(max) value (like SUBSTRING or the .WRITE method of the UPDATE statement). That is because SCSU works by traversing the entire string from left to right and can’t handle operating on only a subset of the string.

Bottom line: When you have to store large values and you want to save on storage size, your best course of action is probably to compress and decompress the values on the client side. But do beware the consequences this has for Full Text Search and Semantic Search!

Final note: I didn’t spend as much time on this blog post as I normally do. That’s because this actually started as a reply to a question on an internet forum, but when I was busy I realized that the reply was long enough to be promoted to a blog post.

Published Thursday, January 31, 2013 11:08 AM by Hugo Kornelis
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Roger Plowman said:

Thanks!

January 31, 2013 8:53 AM
 

Adam Machanic said:

I think a much simpler answer is: "because the SQL Server team didn't implement the feature due to it not meeting some internal priority bar." It absolutely *could* be done -- there's no major technical barrier -- and it probably would by very useful for a lot of people. But they just didn't do it.

January 31, 2013 10:31 AM
 

Hugo Kornelis said:

@Adam: Of course it can be done. But can it be done without having a big adverse effect on performance?

All the algorithms I know for effective compression of a single stream of bytes work on the basis of either a dictionary that is explcitly stored as part of the compressed data (e.g. Lempel–Ziv–WelchHuffman), or that is kept in memory only and can be reconstructed from the compressed data (e.g. Lempel–Ziv–Welch). Both categories have problems with SUBSTRING and chunked updates (the .WRITE method of the UPDATE statement).

The first category always needs to read the dictionary in addition to the pages where the required data for a SUBSTRING lies, so more IO is needed for a SUBSTRING. It might even be necessary to read and decompress everything to the left of the substring, because I don't think it's possible to predict where exactly the 80,000th character of a compressed varchar(max) starts. For .WRITE, the same problem exists. And after too many .WRITE updates, the compression ratio will drop unless the dictionary is recalculated and the entire string is de- and then recompressed. Plus, what happens if I use .WRITE to replace a few character, but the replacement data doesn't compress as well as the original data?

In the second category, it is even completely impossible to decompress a substring without first processing all the preceding bytes. And changing a chunk changes the dictionary, so any change would force the engine to de- and recompress everything that follows the changed part. Using .WRITE to replace a few characters near the start of a 750MB string can become very costly that way!

January 31, 2013 12:26 PM
 

Peter said:

Rather than compressing a 750MB string as a single stream, it would probably better to compress it in separate 1MB chunks which would then mean that SUBSTRING & .WRITE would only have to decompress/recompress a smaller section of the data.

February 1, 2013 4:32 AM
 

Andrew said:

@Peter excellent idea. Combined with fast algorithm, e.g. LZ4 (http://code.google.com/p/lz4/), can yield substantial I/O savings.

@Hugo I think compression is not for everybody. If you have an app that often updates 750MB row of data, then of course it's best to avoid compression. But SQL should support compression in scenarios where .WRITEs are used rarely or not used at all.

So, we have cloud backup in SP1 CU2, may I expect LOB compression in SP1 CU3? :-)

February 2, 2013 1:12 PM
 

Adam Machanic said:

Agreed with Andrew. The .write method is not especially common; I'm not sure I've seen it used more than one or two times. Make it a limitation of the feature and 95+% of people storing LOB data will be perfectly happy.

February 3, 2013 12:09 PM
 

tobi said:

The fact is that some scenarios would be conveniently addressed by being able to turn on a much stronger compression scheme that also works for LOB data. The disk space, buffer pool efficiency and IO savings can be worth the (enormous) CPU cost.

SQL Server doesn't have this feature today but I hope they someday get around doing it. Maybe they are waiting until they have variable page size implemented so they can compress 1MB chunks at once.

February 3, 2013 4:15 PM
 

tobi said:

With a page size of >1MB one could use the same compression algorithm that is already used for backups. Backups would then not even need to compress again.

February 3, 2013 4:32 PM
 

TheSQLGuru said:

Small-chunk compression is the foundation of Column Store Indexing, which we know is very efficient and provides dramatic IO reduction and thus performance increases (even without Batch Mode Execution).  And like Adam I too have hardly EVER seen user apps doing LOB updating in the wild.

February 4, 2013 8:46 AM

Leave a Comment

(required) 
(required) 
Submit

About Hugo Kornelis

Hugo is co-founder and R&D lead of perFact BV, a Dutch company that strives to improve analysis methods and to develop computer-aided tools that will generate completely functional applications from the analysis deliverable. The chosen platform for this development is SQL Server. In his spare time, Hugo likes to visit the SQL Server newsgroups, in order to share and enhance his knowledge of SQL Server.
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement