
Paul Nielsen

www.SQLServerBible.com

Denormalize for Performance?

Conventional wisdom says to “normalize to 3rd normal form, then denormalize for performance.” Poppycock! I say that for an OLTP operational database (not a reporting or BI database), a well-normalized design will outperform a denormalized design, for three good reasons:

1) If the denormalization duplicates data, then the DML operations have to write to multiple tables, which means extra code and extra work.
 
2) If the denormalization duplicates data, the duplicated columns will typically be keys, or at least candidate keys, so they'll be indexed, which contributes to the index-bloat problem.
 
3) If the denormalization repeats columns (item1, item2, item3, etc.), then the data has to be massaged before being inserted or updated, and when reading the data it typically needs to be unpivoted for set-based code; these operations mean extra work.
Those who promote denormalization argue that it trades write performance for read performance, and that is true for a reporting database, but with issue 3 above even read performance suffers.
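
To make point 3 concrete, here is a minimal sketch of what set-based reads against repeating columns typically require. The table and column names (dbo.Orders with Item1..Item3, dbo.OrderItem) are hypothetical, invented purely for illustration:

-- Repeating-column design: dbo.Orders carries Item1, Item2, Item3.
-- Set-based reads usually have to unpivot those columns first:
SELECT OrderID, ItemNumber, ItemValue
FROM dbo.Orders
UNPIVOT (ItemValue FOR ItemNumber IN (Item1, Item2, Item3)) AS u;

-- The normalized design reads the same data straight from a child table:
SELECT OrderID, ItemNumber, ItemValue
FROM dbo.OrderItem;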

Since denormalization tends to go hand in hand with SQL code in the app layer (no database abstraction in the form of stored procedures), denormalization also hurts extensibility, because it becomes even more expensive to modify the database schema.
 
In production databases I’ve built, I’ve tested a purely normalized design vs. the best denormalized design I could come up with. Having a clearly defined database abstraction layer meant that I could run one script to modify the table structure, correct the data, and alter the stored procedures, and the app could continue to make the same stored procedure calls without breaking. In every case, the normalized design outperformed the denormalized design by about 15%, due to the normalized design’s more efficient code.
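
As a sketch of what such an abstraction layer looks like (the procedure and table names below are invented for illustration, not taken from Paul's actual databases): the application always calls the same procedure with the same parameters, so a normalization change only requires altering the procedure body in the same deployment script that changes the tables.

-- The app's contract: same procedure name, same parameters, same result shape.
CREATE PROCEDURE dbo.Order_Fetch
    @OrderID int
AS
SET NOCOUNT ON;
-- This body can be rewritten whenever the underlying schema is
-- normalized further; callers never notice the change.
SELECT o.OrderID, o.OrderDate, c.CustomerName
FROM dbo.[Order] AS o
JOIN dbo.Customer AS c ON c.CustomerID = o.CustomerID
WHERE o.OrderID = @OrderID;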
Published Friday, October 03, 2008 7:00 AM by Paul Nielsen


Comments

 

Denis Gobo said:

You know what they say: normalize till it hurts, then denormalize till it works  :-)

October 3, 2008 8:54 AM
 

xero said:

I'd be interested in seeing real numbers and case studies with this. Conventional wisdom does say to denormalize for read performance; however, you must also tweak your indexing. Not every table in an OLTP DB has the same traffic; sometimes it helps to denormalize a few of them because they're weighted much more toward reads than writes. The best performance will come out of tweaking each area as needed, but I'd be interested to see the numbers for some stereotypical apps written fully normalized and fully denormalized. It would be interesting to see if the numbers would support erring one way or the other.

October 3, 2008 10:29 AM
 

noeldr said:

There is no black-or-white in this kind of problem. Some denormalization techniques outperform normalized schemas BY FAR. There are occasions in which, taken too far, denormalization gets in the way. So my approach is "use it sparingly".

By the way there is no such thing as "pure" OLTP in real life.  

October 3, 2008 10:42 AM
 

Mladen said:

I do agree. Also, this is my take on it from a more developer-oriented perspective:

http://weblogs.sqlteam.com/mladenp/archive/2008/09/17/Normalization-for-databases-is-like-Dependency-Injection-for-code.aspx

October 3, 2008 11:33 AM
 

Armando Prato said:

Funny, I just had this situation today!

In the past, I've denormalized heavily trafficked (SELECTed) OLTP tables with great results.  

October 3, 2008 12:10 PM
 

James Luetkehoelter said:

I agree with the above comments, and say poppycock to your poppycock :) Denis has it right: normalize till it hurts, then denormalize till it works. Yes, there can be databases that fit 3rd normal form just fine, but often practical aspects of implementation trump flexible design for expediency. Is this good from a database design perspective? Of course not. But it is reality.

I've seen databases normalized to the point of insanity, where even the most basic query to return data to a screen (not a report, just a single "record" for review or edit) involves 15-20 joins. The bottom line is that you often have to do what makes the most sense for your situation.

That being said, I'm a vehement advocate of starting normalized, at least at a logical design layer before even considering de-normalization.

October 3, 2008 12:24 PM
 

Paul Nielsen said:

Hi James,

>>I've seen databases normalized to the point of insanity

this is a good chance to clarify. If normalization created insanity, then it's a poorly designed data model, and probably NOT normalized correctly.

more tables <> more normalization

There's another art that's required when data modeling - the art of generalization - deciding how broad or narrow a definition should apply to any entity. If the entity definitions are too narrow, then the data model explodes with tables. It takes experience, a repertoire of past database designs, and a fear of complexity / love of elegance & simplicity to find the sweet spot of normalization.

I've worked with many a system that was "purely normalized" but insane, and when I redesigned the schema, I typically reduced the number of tables by 60-80%. The result is data that's more obvious, easier to query, faster, and more flexible. Part of the solution is looking for ways to make the database nimble using data-driven designs.

October 3, 2008 1:12 PM
 

John Paul Cook said:

With ORM and now the entity framework, sometimes performance problems are the result of abstracting away the database structure. The application developers may have no idea if the database is normalized or not. When abstraction becomes ignorance, performance will suffer regardless of the normalization.

October 3, 2008 1:16 PM
 

Paul Nielsen said:

Xero,

I've done it. There have been times I've been tempted to denormalize, so I built out the database both ways for the client. In *every* case, I saw that the extra code to handle the denormalized OLTP operations cost more than the tempting performance gains.

I'm not talking about reporting databases, or portions of a database that exist for reporting; for OLTP operations, denormalizing for performance is a myth.

October 3, 2008 1:16 PM
 

Kalen Delaney said:

When I used to teach basic SQL courses, I would always stress that there is a big difference between UNnormalized and DEnormalized. So James' comment about starting normalized should be an absolute requirement. If you're going to denormalize, you have to know how you're moving away from the normalized design, one step at a time.

I like Denis' comment... and it could support Paul's ideas too. If it 'works' as soon as it is normalized, then there is no need to denormalize. Paul probably has such terrific data models that he never needs to denormalize. ;-)

~Kalen

October 3, 2008 1:21 PM
 

Adam Machanic said:

I think noeldr hits the nail on the head here with his comment that "there is no such thing as 'pure' OLTP in real life."  The applications I've worked with over the past ten years have been either 100% OLAP/BI or mixed-workload OLTP.  I've never seen anything approximating what I thought OLTP should look like when I was first studying databases.  So I agree with others who mention that there is no black and white here, and oftentimes some denormalization is necessary.  Luckily we have some great tools in SQL Server and don't need to actually denormalize anything most of the time; indexed views are a hugely underused tool for achieving limited denormalization, in my experience.
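
For anyone who hasn't tried them, here is a minimal indexed-view sketch along the lines Adam describes (hypothetical schema; it assumes OrderTotal is declared NOT NULL, since indexed views require that for SUM). SQL Server maintains the aggregate as the base table changes, so reads get the denormalized total without any hand-written synchronization code:

CREATE VIEW dbo.vOrderTotals
WITH SCHEMABINDING
AS
SELECT CustomerID,
       COUNT_BIG(*)    AS OrderCount,   -- COUNT_BIG(*) is required with GROUP BY
       SUM(OrderTotal) AS TotalSpent
FROM dbo.[Order]
GROUP BY CustomerID;
GO

-- The unique clustered index is what materializes (and maintains) the view:
CREATE UNIQUE CLUSTERED INDEX IX_vOrderTotals ON dbo.vOrderTotals (CustomerID);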

October 3, 2008 1:36 PM
 

jchang said:

In the cases I have seen, there were somewhat appropriately normalized databases that had bad performance, which is not the same as normalization causing bad performance.

The bad performance was caused by one or more of the following: 1) excessive compiles, usually from dynamic SQL, 2) excessive recompiles, usually from temp table writes, 3) poor indexing, 4) poorly written SQL.

Elaborating on 3) indexing: most people do not understand the cost of key lookups, particularly how the SQL optimizer treats them, which is a little rigid and silly.

4) Many people like the one-SQL-query-for-all approach, like:

WHERE (@P1 IS NULL OR col1 = @P1)

  AND (@P2 IS NULL OR col2 = @P2)

With properly written SQL, a proper clustered key, and nonclustered indexes, I would say most performance problems could be solved in a normalized database. By not doing a proper root-cause analysis, people identify a poorly performing query and denormalize, which alleviates the compile/recompile costs because the query is simpler, or simplifies the indexing; hence problem solved, and they conclude normalization was the cause.
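
One common alternative to that catch-all pattern (a sketch only, not taken from Joe's systems) is to build just the predicates that were actually supplied and run the statement through sp_executesql, so each combination of parameters gets its own, properly parameterized plan:

-- @P1 and @P2 are the same optional search parameters as in the pattern above.
DECLARE @sql nvarchar(max);
SET @sql = N'SELECT col1, col2 FROM dbo.SomeTable WHERE 1 = 1';

IF @P1 IS NOT NULL SET @sql = @sql + N' AND col1 = @P1';
IF @P2 IS NOT NULL SET @sql = @sql + N' AND col2 = @P2';

EXEC sys.sp_executesql @sql,
     N'@P1 int, @P2 int',      -- parameters not referenced are simply ignored
     @P1 = @P1, @P2 = @P2;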

October 3, 2008 2:34 PM
 

jchang said:

More on 3) indexing above. Because the SQL Server cost-based optimizer makes the fixed assumption of scans at 1350 pages/sec and key lookups at 320 IOPS (with variations by version, memory, etc.), there are many times the plan will show a scan + hash join (another discussion topic) instead of a loop join with an index seek. All of this leads people to think SQL Server (and other RDBMSs) cannot join many tables without problems. Poppycock^3!
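
One practical way to take that scan-versus-lookup guess out of play (again with hypothetical table and column names) is a covering nonclustered index, so the seek alone satisfies the query and the optimizer never has to cost key lookups at all:

-- Frequent pattern: seek on CustomerID, return a handful of columns.
CREATE NONCLUSTERED INDEX IX_Order_CustomerID
    ON dbo.[Order] (CustomerID)
    INCLUDE (OrderDate, OrderTotal);  -- covers the SELECT list, so no key lookups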

October 3, 2008 2:50 PM
 

mosha said:

Coming with zero experience in building OLTP systems and with some experience in BI/DW, I say "Always normalize." I don't see why denormalized schemas should have better performance; quite the opposite, I think. Analysis Services internally stores everything in fully normalized (snowflake) form, and it is wicked fast, so there is no reason to associate slow performance with normalized schemas.

October 3, 2008 4:42 PM
 

Sergey Shelukhin said:

Denormalization makes a lot of sense for data that is (1) huge, (2) read far more than it is written, and (3) old and/or rarely changed.

The simplest example is storing summaries in addition to the main data for some transactions, when we know we will very rarely get a transaction earlier than the past month, so updating two places is not a big deal; or storing a concatenated string together with the user ids to output the "to:" list for a message system where the quantity of messages is considerable, the hit rate on them is very high, and the "to:" lines of messages will not change unless hell breaks loose... etc.
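
A sketch of that summary idea (table names invented for illustration): the duplicate value is written in the same transaction as the detail row, so the occasional write pays a small, known cost and the hot read path never has to aggregate the history:

BEGIN TRAN;

INSERT dbo.AccountTransaction (AccountID, Amount, TranDate)
VALUES (@AccountID, @Amount, @TranDate);

-- Denormalized running total, kept in step inside the same transaction
UPDATE dbo.AccountSummary
SET CurrentBalance = CurrentBalance + @Amount
WHERE AccountID = @AccountID;

COMMIT;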

October 4, 2008 1:53 AM
 

Richard Fryar said:

In my experience, whether or not a database is normalized is the least of my worries. Other aspects of the schema, and poorly written application code, are much more likely to be the cause of poor performance.

Recent examples:

1) A table with so many wide columns that there are an average of 1 or 2 rows per page, yet only 4 narrow columns are regularly used! They should have split this into 2 or more tables;

2) Too many round trips to the server. A simple search screen on surname should return data in one hit, but this application retrieved just the ids, then a separate (ad-hoc) select for each row returned!!  (This is a big, well-known vendor).

3) Generic SQL.  I often see the exact scenario mentioned by Joe, and have usually managed to get the code changed.  Sometimes, though, it is generated dynamically by the application so this is not possible.

My point is that while we have horrendous problems like these to solve, the existence (or not) of a few denormalized columns will have negligible effect.
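
For what it's worth, the usual fix for example 1 above is a 1:1 vertical split (the columns shown here are hypothetical): the few hot, narrow columns stay in a lean table and the wide, rarely read ones move to a companion table that shares the same key:

CREATE TABLE dbo.Customer (           -- hot, narrow columns
    CustomerID   int NOT NULL PRIMARY KEY,
    CustomerName nvarchar(100) NOT NULL,
    Status       tinyint NOT NULL,
    Region       char(2) NOT NULL
);

CREATE TABLE dbo.CustomerDetail (     -- wide, rarely read columns
    CustomerID   int NOT NULL PRIMARY KEY
                 REFERENCES dbo.Customer (CustomerID),
    Notes        nvarchar(max) NULL,
    ScannedDoc   varbinary(max) NULL
);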

October 5, 2008 6:46 PM
 

Louis Davidson said:

>>1) A table with so many wide columns that there are an average of 1 or 2 rows per page, yet only 4 narrow columns are regularly used! They should have split this into 2 or more tables;<<

Um, well, this is very likely a normalization problem :)  A table with too many columns is a good sign that you are not doing something exactly right.

>>In my experience, whether or not a database is normalized is the least of my worries<<

Then you are most likely the luckiest human in the world.

The key here is that a design should not (initially) be optimized for performance. This is true of the database, where correctness and usefulness of the data are key, while the UI should be optimized for the humans using the code (or the machines, perhaps). Performance should be considered mostly after the data is safe and secure.

The problem we generally have is overthinking the problem. The relational language SQL likes data normalized for the most part, so normalization is almost always the best thing to do, and in the past two (or three) versions of SQL Server it has become much easier to get it right and normal, with fewer and fewer concessions made (de-normalizations).

The fact is, of course, that performance IS a major concern. If you have 100 people doing data entry, and the normalized design's need to recalculate values over and over is costing each person 2 seconds a minute in productivity, you are costing roughly 16 minutes per person per day (2 seconds x 480 working minutes = 960 seconds). If you can find a good point where you can denormalize and store some calculated values, there is a win for the user, a win for the company, and probably only a small hit to the maintainability of the code.

Too often, though, people denormalize BEFORE testing, generally out of habit. I have argued with quite a few people who wanted to denormalize some count or such just because they needed to do it in early versions of SQL Server, or even FoxPro. My personal feeling is that the interface you provide should look the way the clients need it. In your initial coding, you try no denormalization, go through performance testing, and see what happens. If you can't tune it with indexes and code rewrites, denormalize. You will probably do it a lot less than you would guess.

All denormalizations have to be maintained, so you are hampering future you's development by having to deal with the baggage (as well as having to do quick coding during tuning!), but you will find that you do a lot less denormalizing... and that is a good thing.

October 5, 2008 7:32 PM
 

Alex said:

I find I only ever need to denormalize in 1 situation:

- History/versioning

Regarding history/versioning. The only denormalization would be to store the current version of an object in a main table, and all versions in another table.

You could achieve this with a view, selecting the most current version of each object from the big table with all versions. But this means that you are always forcing the query engine to do a top 1 per group. This is not a cheap operation, even with proper indexing.
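
For comparison, the view-based alternative would look something like this (hypothetical tables): a ROW_NUMBER() top-1-per-group that the engine must evaluate on every read, which is exactly the work the separate current-version table avoids:

CREATE VIEW dbo.vCurrentDocument
AS
SELECT DocumentID, VersionNo, Title, Body
FROM (
    SELECT DocumentID, VersionNo, Title, Body,
           ROW_NUMBER() OVER (PARTITION BY DocumentID
                              ORDER BY VersionNo DESC) AS rn
    FROM dbo.DocumentVersion
) AS v
WHERE rn = 1;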

I would venture that the other 2 situations commonly labelled denormalization are not denormalization at all:

- Reporting

- Archiving

Reporting is not denormalization, because it's saved output. It's not part of the data. It may even take into account business processes from the application.

Archiving is not denormalization, because it may need to be accessed without the benefit of SQL engine or application to help decipher the data. It too, is not part of the data.

No?

August 27, 2009 9:28 AM


About Paul Nielsen

Paul Nielsen believes SQL is the romance language of data. As such he’s a hands-on database developer, Microsoft SQL Server MVP, trainer, and author of the SQL Server Bible series (Wiley). As a data architect, he developed the concepts of Smart Database Design and Nordic – an open source O/R dbms for SQL Server. He lives in Colorado Springs.
