THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Joe Chang

Brute Force Hardware versus Tuning

Every now and then, the question is asked “When will servers be powerful enough that performance tuning will not be necessary.” Sometimes I get the impression this was posed not on technical grounds, but rather that ISVs and CIOs don’t like performance specialists. Fortunately (for me) it does not appear that this will ever happen for two principal reasons: 1) hardware and tuning contribute in such completely different aspects that neither can be ignored, and 2) multi-core processors actually introduce new tuning requirements and strategies.

On the hardware side, performance improves over time with 1) processor compute capability, in terms of both single core and system aggregate performance, 2) memory capacity and 3) storage IO performance. Tuning largely involves the three categories: 1) improving the execution plans of top resource consuming queries, 2) balancing avoidable compiles and recompiles of high volume queries with having the right execution plan, 3) managing deadlocks, blocking and various resource bottlenecks.


Processors
In the past, processor performance increased roughly 40% per year at the single core level. Since the 2005 time frame, processor performance has increased mostly with increasing number of cores per socket than at the single core level. Over the last six years, single core performance has roughly doubled. However, the total aggregate performance of recent generation microprocessors is truly impressive. In the past, mainstay database servers had 2-4 processor cores. Today, processors with six or eight core are standard and the server system itself might have between 12 to 32 cores. The current AMD Opteron processors have 12 cores on 2 die in one socket, but the aggregate performance at the socket level seems to be comparable to the Intel Xeon with 6 cores.

Database performance will depend on whether an application can effectively use all of the available processor cores in a scalable manner. If so, then it might be possible to handle performance issues with brute force. More than likely, only a few elements of a complete application will naturally scale exceptionally well. Many elements may have moderate or marginal scaling. Key operations, particularly writes (Insert, Update and Delete) may be restricted to a single core. A complete database application with a rich functionality set cannot be completely handled with brute force compute power comprised of many processor cores.

Memory
Memory can dramatically reduce disk IO to manageable levels. This usually requires proper tuning to bring the active database into a limited set of pages that fit in the memory available for the buffer cache. Once upon a time memory was to extremely critical. When system memory was below 1GB, a major effort was required to bring disk IO to manageable levels by ruthlessly restricting queries to basic transactions only. A report run concurrently with transactions could seriously overload the disk system. By the year 2000 or so, system memory capacity with reasonably priced memory modules (usually half the maximum support memory module capacity) crossed the 3-4GB range. Around this time frame, the memory capacity was sufficient to support both transactions and some reporting activity.

This was followed by the tortured path to full 64-bit for unrestricted access to all system memory. Itanium, the first Intel microprocessor designed for 64-bit architecture, was repeatedly delayed and seriously disappointed on performance when it finally did arrive (2001?). The Intel Pentium Pro and later processors did have a band-aid capability with the Physical Address Extension (PAE) feature which let a 32-bit OS manage more than 4GB physical memory. Microsoft reluctantly (perhaps it was a cold day in hell) added Address Windowing Extensions so that applications like SQL Server could map memory outside the normal address range with functionality sufficient for the buffer cache.

The Intel Itanium II finally provided good performance on a full 64-bit operating system supporting a 64-bit version of SQL Server 2000. But acceptance of the Itanium platform was poor. A 64-bit OS did not become pervasive until after the AMD Opteron, and later Intel 64-bit Xeons (based on the Prescott core) in 2004. SQL Server with 64-bit on the more popular X86-64 architecture did not arrive until late 2005.

By this time system memory capacity was in the 32-64GB range, with 16-32GB being very affordable. This seemed to be sufficient to handle properly tuned databases even with rich functionality. Today, server systems can reach 1TB memory at a reasonable price. A 16GB memory module with ECC is around $800. The price for 64 modules (1TB) is just over $50,000. This is relatively low compared with the overall cost of an important database application. Few applications should actually require so much memory. The reality is that most environments have already reached the memory brute force realm.

IO Performance
Interestingly, IO performance, while weak at the single disk level, was the one component that could be scaled with massive numbers of disk drives. The key to implementing this strategy is low cost per disk amortized over the complete storage system. However, storage vendors were interested in selling extremely high value-add features, i.e., extraordinarily expensive disks, which directly conflicts with the performance strategy.

Now with solid-state storage technology sufficiently mature and competitive in price, IO performance has also reached the brute-force realm. Of course, this came only after system memory configuration was so massive that a well designed transaction processing database should not generate heavy IO anymore. Still, it is good to have capability to handle IO surges.

Tuning for Execution Plan Efficiency
On the tuning side, a combination of rewriting the SQL (or using plan hints when this is not possible) and building the right set of indexes allows the query optimizer to find efficient execution plans for the most important queries. This effort targets the top SQL statements, and it may be possible to achieve a very large performance gain on a handful of queries, with essentially no change on the majority of queries.

Note that the range of performance on the hardware side between contemporary mid-range (2 sockets, 12 cores total) and high-end systems (4-8 sockets, 32-64 cores total) is not more than 2-4X. In the old days, there were systems with 32-64 processor sockets, scaling was limited and could only be achieved with complete re-architecture of the application. The difference between a good execution plan and poor one might be one, two or even three orders of magnitude, which cannot be spanned with hardware. Only tuning can close this range, even if this is feasible for just a handful of SQL.

Tuning for Parallelism
As discussed earlier, server system today can have 12, 32 or even 64 processor cores. Achieving good scaling to all processor cores is not a simple matter. The Microsoft SQL CAT team has published strategies for reducing contention for key metadata structures and index root level access. Examples are creating multiple identical indexes on small tables that frequently updated to avoid contention for the index root level page, or partitioning to achieve the same effect.

Summary
However powerful modern server systems are in terms of compute, memory and IO, brute force capability is helpful, but does not replace good database architecture, and performance tuning. The impact characteristics from each are very different. Hardware alone cannot solve some performance issues, nor does endless performance tuning effort compensate for obsolete hardware.

ps
Its always difficult to tell when something is an unbiased assessment versus an self-serving opinion. If only life were easy!

Published Thursday, January 27, 2011 11:20 AM by jchang

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Mike Sexton said:

Could not agree more

January 27, 2011 10:58 AM
 

Stuki Moi said:

From a wider perspective, the boost in hardware performance is not exclsive to the database. Instead, the systems relying on, and making demands on, the database has gotten faster as well. I doubt that will ever change.

In fact, a case can be made that non relational database applications have gotten "faster", at a quicker pace than RDBMS'; hence all the hoopla over NoSql type backends, whose salient feature is more linear scaling across cores and nodes. Never mind the close to impossible task of ensuring ACID properties for any halfway complex and heterogenous application running on them, at least with anything short of Google like staffing.

January 27, 2011 5:39 PM
 

Johan Åhlén said:

Great summary Joe!

January 28, 2011 2:37 AM
 

RichB said:

Incidentally I have a pair of servers at 128GB Ram, 4x6cores and 6xFusionIO Duo SSD cards each.

This allows me to backup and restore at over 1.4GB/S locally, little slower over the 10GB net between them though.

Biggest performance problems I have found now is picking apart the way SQL seems to be handling stuff.

Say I want to do a big reorg of a 3TB database - shuffle data out of filegroups to new ones, I can really see differences between how well various techniques work.  

Multithreading becomes really obvious with that rate of data movement, the differences between insert select and select into become staggering - can be the difference between around 60MB/S and 1.4GB/s!

How we handle tasks really changes based on these.  

In the end, there is always going to be a need to 'tune' even if that is only figuring out the 'right' way to do something.

As data retention expands, consumer expectations rise, we are still going to struggle to keep up.  

For now.

January 28, 2011 5:54 AM
 

jchang said:

the fusion-io duo's should be able to read at 1.5GB/s each, and write 1GB/s, so I am thinking local backup w/o compression at 3GB/s should be possible (simultaneous 3GB/s read, 3GB/s write). With compression, say 3:1, I am thinking 6GB/sec should be possible (6GB/sec read, 2GB/s write). It might that SQL Server backup compression cannot do this. Quest LightSpeed should be able to do this assuming Quest did not tamper with the file IO and threading I set in during version 4.5 days. For max network transmission, backup locally with compression, restore over the network (not file copy, followed by local restore). I suggest using 2 10GbE pipes in parallel, and explicitly set the IP address (also try backup to 2-4 files, restore 1-2 files simultaneously over each network channel)

btw, thanks for pointing out the difference between INSERT/SELECT and SELECT INTO, it was big with HDD, but not that big, I had not looked into this for SSD yet.

January 28, 2011 10:06 AM
 

anonymous said:

Select into is faster?

Thomas Keyser said the opposite in SQLBITS videos...

January 28, 2011 10:50 PM
 

Adam Machanic said:

Anonymous: Which video? Yes, SELECT INTO is often much faster than INSERT/SELECT. This is due to the fact that in most cases INSERT/SELECT is fully logged and SELECT INTO is minimally logged. Not always in either case, however.

January 31, 2011 2:50 PM
 

Jesse said:

perf tuning will be necessary until we build AI that write code better than the best human.

February 1, 2011 10:30 AM
 

RichB said:

With the Fusions I think the biggest problem is leveraging the CPU.  They gobble CPU rapidly and it becomes a significant bottleneck - occasionally flooded.  Eg a big join select with parallelism seems to be able to push data back so quickly the system is unable to timeslice and do anything else with any cores for considerable amounts of time - ie effectively hangs until 10s later!

I only have the disks at the moment pushed in a R0 array, once I can break out the data a bit I will be able to attempt other options, which as you say should be able to massively increase the already huge throughput.

Incidentally we successfully mirrored a 3TB db between the 2 - which usually generates around 120MB of TLogs a minute...

February 1, 2011 12:26 PM
 

merrillaldrich said:

Excellent post, Joe. Right on the money.

February 11, 2011 12:20 PM
 

John J said:

I love your blog

One comment surprises me "The current AMD Opteron processors have 12 cores on 2 die in one socket, but the aggregate performance at the socket level seems to be comparable to the Intel Xeon with 6 cores. "

We are swapping out a 48 core DL785-G6 for a 64 core DL980 and we don't see that much performance improvement.(certainly nothing that matches that statement)  The 980 is better at higher volumes of small queries, but we have had to spend a bunch of time tuning to get large queries up to the performance of the DL785.(HT turned off etc.)

Also I should note, the Intel Enhanced Speed Step Technology implementation is flawed in my view.  To get decent performance you just have to force performance to high.

April 19, 2011 11:06 AM
 

jchang said:

In looking at published TPC-H reports, the HP DL580G7 4 x 8-core Xeon does better than the DL585G7 4 x 12-core Opteron by 13% overall, which in turn did better than the DL785G6 8 x 6-core Opteron by about 16% (some of which might be attributed to S2K8R2 over 2k8rtm). However, in the individual queries, the Opteron could be better than the Xeon. There is a wide range in variation from query to query.

I also suggest you look for my slidedeck at SQLBits 7. I looked at parallel execution plan scaling from DOP 1, 2, 4, 8, 16, 32. A few execution plans have perfect scaling, many have good scaling, and some have poor or even negative scaling. If you want me to look at your case, try to get the actual execution plan on both 785 and 980, along with execution statistics.

April 19, 2011 12:39 PM

Leave a Comment

(required) 
(required) 
Submit

About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement