THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Merrill Aldrich

Wrinkly Entropy

So, found this incredible freeware program on Codeplex with a really simple UI to handle performance optimization for SQL Server:

UltimateSlider

I was gonna post the link here but I lost it, and now the site seems to have been taken down...

OK, I am obviously lying. Wouldn't that be sweet, though? I do have a hunch there are developers working on, or at least philosophically aiming for, that level of automation around index tuning and partitioning, etc, but it's a long way off.

It's worth considering this dichotomy, though, when doing any kind of physical database design: call it the "Pay me Now or Pay me Later" principle. Basic information theory has, among other implications, the notion that it takes more effort to put a system into a high degree of organization (low entropy) than a lower degree of organization. And, of course, there is the related principle that it's possible to trade computation (time) for storage, and vice versa. In very concrete terms, this means that when designing a database at the physical storage level, perhaps the most important theme we work with is, "How organized will the data be in its stored state, to provide optimal efficiency."

At one end of the slider above, we have a theoretical limit of the most highly organized, most read-optimal state, where we have traded maximum storage for minimum compute time. A database way at that end of the spectrum would have, hypothetically, a covering index for every possible query, all joins pre-resolved, and aggregates pre-built and stored for every possible aggregation. The time taken to query the data would essentially be only the time required to fetch the data and send it back, with zero computation. The cost, then, is that any time information is added to such a database, a huge amount of computation is required to "file" it into that very detailed model. This is the essence of "Pay me Now." Inserts are expensive, in order to store the data in a highly organized state, thereby reducing the computation necessary to read results out later.

At the other end, all the data we are attempting to store is in a uniform, undifferentiated "heap." Inserts are really fast and straightforward, rather like slinging laundry onto a pile, because no computation is required to add data to the store. This is "Pay me Later." Laundry goes onto the pile easily, but when it's time to find all the matching socks, you have some work ahead of you.

So, "Duh," you may be saying. Of course.

The place where things get sticky is that the in-between states in this dichotomy don't fall on a nice, smooth gradient. In fact, the specific methods used and the technologies employed in SQL Server - all the details of table organization, keys, indexes, normalization, partitioning - make that in-between space all wrinkly and convoluted. Some specific techniques help speed some types of queries but not other types; some have a large impact and some a small impact. Some choices (which clustered index?) are inherently limited, where you have to trade one thing for another. Some introduce data integrity or redundancy questions that require complicated code to resolve.

Because of that complexity, or "wrinkly landscape," we often get into conversations like Aaron Bertrand's recent, excellent discussion on EAV, or my rant about TPH related to ORM and polymorphism.

At a high level, if I can use the EAV example, I find that it's useful to apply the measure above, at least in an abstract way. If you have a performance problem of some kind with a database, it might be because you're in the wrong place on that slider. And when evaluating some design pattern for how to arrange data (Kimball star schema, or TPH, or conventional normalization, or EAV, partitioning, etc.), taking into consideration data integrity, the main question is, "will the design pattern get you to the right position on that slider?" What's wrong with EAV? Well, it depends. Part of the famous It Depends comes from the wrinkly landscape between Pay me Now and Pay me Later. It depends on the granular details of your application, and where EAV would put you on the pay-me-now or pay-me-later meter. It might be just the thing, or it could turn into a real performance or data integrity challenge, depending on how it applies to the problem at hand.

 

Published Friday, November 20, 2009 4:44 PM by merrillaldrich
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Brooks Howell said:

M-

This sounds nothing like Architecture. Congrats on the escape.

Brooks

November 23, 2009 9:11 AM
 

merrillaldrich said:

Ack! A ghost from my past!

November 23, 2009 12:45 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement