
Joe Chang

Microsoft's "Thinking Oracle RAC, Think Again" Whitepaper

Now that I have actually (sort of) read this paper, I will comment on it. I say sort of because I have trouble focusing on non-technical material, the same way I get really frustrated trying to explain something to a person who cannot tell the difference between an argument substantiated by hard analysis and an argument that merely sounds logical but is not the predominant underlying effect.

First, Microsoft competes with Oracle, and MS gets hit with this all the time, so of course they need to collect their best arguments in one place. On a side note, many Microsoft whitepapers list the author(s). When an artist paints a masterpiece, he or she signs it out of pride in the work. Not all talented artists can paint what they want; some must accept commercial work (having food to eat is not overrated). In such cases, the artist does not want his peers to know he had to stoop to such work and prefers it be anonymous. There is no listed author for this paper.

 

The first argument in the paper is valid: only a small percentage of customers have actually deployed RAC. Back in the OPS days, one prominent expert said that he had never personally seen an OPS deployment that achieved positive scaling, though he did hear from someone he considered competent on the matter that one customer did. RAC is much better than OPS (to get something right, you do have to screw it up once, twice, or thrice). The Oracle people I talk with (more hurling insults and ridicule back and forth than actually talking) do say that the technical skills to deploy RAC are not common among Oracle DBAs.

 

But the fact that most people do not really need Oracle RAC does not stop sales and marketing from making a big deal out of it. Funny how CIOs are influenced by the scaling argument (men are most susceptible to the size thing) even though it is of no consequence to their specific environment.

 

Of course RAC is expensive. The alternative is buying big iron, which is also expensive. If a solution is not painfully expensive, is it any good? If one project manager turns what should have been a $1M project into a $10M project while another manager delivers his project for $1M, who will know it should have been a $1M project? And who will get the higher ranking come evaluation time, that is, the next promotion and the big raise?

 

When a project is so hideously expensive, the CIO must go to the CEO, who in turn may need to go to the board for approval. Well, if the project gets really messed up, the big bosses are not inclined to declare it a failure, because they would look bad as well, that is, they endorsed the project. When a small project runs into difficulty, no matter who is at fault, it is not hard for the CEO/CIO to pin the blame on a lowly project manager, i.e., fire him.

 

If what you need is SQL Server cluster-style fail-over redundancy, then the expense of RAC licensing does not make sense. So RAC only makes sense if scaling performance is needed. The paper talks about scale-out OLTP [by which I mean with distributed partitioned views]. Scale-out on any DBMS is not simple, and not just for the reasons described. If you understand in detail how the [SQL Server] cost-based optimizer (CBO) works with respect to local and remote data, you will understand the horrible implications. [In the execution plan with remote data, the row count estimates are mostly the defaults of 1,000 or 10,000 rows. When two remote sources estimated at 1,000 rows each are joined, the estimated output is 1M rows. When that is then joined to a large local table, the plan is frequently a table scan.] [Technically, RAC scales out on hardware, but does not involve a DPV-style database design.]
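
For anyone who has not worked with DPVs, here is a minimal sketch of the pattern; the linked server, database, table, and column names are all hypothetical, and only the local member table is shown:

-- Minimal sketch of a two-node distributed partitioned view (DPV).
-- NODE2, SalesDB, and the table/column names are hypothetical.
-- Each member table holds one range of OrderID, enforced by a CHECK
-- constraint so the optimizer can eliminate member tables on the
-- partitioning column.

-- Local member table (OrderID 1 to 4,999,999):
CREATE TABLE dbo.Orders_1 (
    OrderID    int      NOT NULL
        CONSTRAINT ck_Orders_1 CHECK (OrderID BETWEEN 1 AND 4999999),
    CustomerID int      NOT NULL,
    OrderDate  datetime NOT NULL,
    CONSTRAINT pk_Orders_1 PRIMARY KEY (OrderID)
);
GO
-- The second member table, dbo.Orders_2 (OrderID 5,000,000 and up),
-- lives on the linked server NODE2 in database SalesDB.

CREATE VIEW dbo.Orders AS
SELECT OrderID, CustomerID, OrderDate FROM dbo.Orders_1
UNION ALL
SELECT OrderID, CustomerID, OrderDate FROM NODE2.SalesDB.dbo.Orders_2;
GO

-- A join that is not restricted on the partitioning column forces the
-- optimizer to cost the remote branch, where row count estimates fall
-- back to fixed guesses instead of actual statistics.
-- (dbo.Customers is another local table, assumed to exist.)
SELECT o.OrderID, o.OrderDate
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
WHERE c.Region = N'West';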

 

If one wanted to point out an issue with Oracle RAC for OLTP, the most apparent is the near complete absence of published benchmark results. There is one RAC TPC-C result (there are no Oracle TPC-E results, period), and it was done long ago (12/2003). The RAC publication used 16 HP Integrity rx5670 servers, each with 4 Itanium 2 1.5GHz processors for a total of 64 cores, scoring 1,184,893 tpm-C at a total cost of $6.5M, versus the contemporary result for an HP Superdome with the same 64 Itanium 2 processors, which scored 1,008,144 tpm-C at a cost of $8.4M. The RAC system had 17% better performance at 22% lower cost. Looking closely at the price detail, the cost of memory for the rx5670 cluster was $1.4M versus $5M on the Superdome. It is a little difficult to compare pricing because the Superdome discount was about 45% versus 25% for the clustered rx5670. The major software licensing difference was $640K for RAC and $320K for partitioning. So a big chunk of the price advantage comes from a memory pricing anomaly.

                      Superdome (1 node)           rx5670 cluster (16 nodes)
Database              Oracle 10g                   Oracle 10g + RAC + Partitioning
Report Date           11/04/2003                   12/08/2003
tpm-C                 1,008,144                    1,184,893
Total System Cost     $8,397,262                   $6,541,770
Price/Performance     $8.33 per tpm-C              $5.52 per tpm-C
Processors            64 Itanium 2 1.5GHz 6M       16 x 4 Itanium 2 1.5GHz 6M
Memory                1024GB                       768GB (16 x 48GB)
Disks                 2100 + 120                   672 + 1344 + 224
HBA                   28 FC 2Gb/s                  64 FC
Costs
  Processors          $1,280,000 ($40K each)       $528,000 ($8.25K each)
  Memory              $4,998,400 ($39K per 8GB)    $1,440,000 ($7,500 per 4GB)
  Server Subtotal     $7,085,433                   -
  Storage             $5,032,188                   -
  Server + Storage    -                            $4,694,618
  Oracle 10g          $1,280,000                   $1,280,000
  RAC + Partitioning  -                            $960,000
  Discounts           $7,000,000                   $1,900,000
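
As a quick sanity check of the headline comparison, here is a hypothetical T-SQL calculation using only the published figures in the table above:

-- Arithmetic check of the headline ratios from the TPC-C table above.
DECLARE @tpmC_smp float = 1008144,  -- Superdome, single node
        @tpmC_rac float = 1184893,  -- 16-node rx5670 RAC cluster
        @cost_smp float = 8397262,
        @cost_rac float = 6541770;

SELECT perf_gain_pct   = 100 * (@tpmC_rac / @tpmC_smp - 1),  -- ~17.5% more throughput
       cost_saving_pct = 100 * (1 - @cost_rac / @cost_smp),  -- ~22.1% lower cost
       price_perf_smp  = @cost_smp / @tpmC_smp,              -- ~$8.33 per tpm-C
       price_perf_rac  = @cost_rac / @tpmC_rac;              -- ~$5.52 per tpm-C
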

Since then, there have been no Oracle RAC TPC-C publications. This usually means there is no good news. If MS wants to criticize RAC for OLTP, I am ok with it. In other posts, I argued that going forward, the new Intel QPI and existing AMD HT interconnects should allow building big-iron (scale-up) systems with better scaling than RAC, on account of the higher bandwidth and lower latency that can be achieved versus RAC going over Infiniband. This is a theoretical argument that needs actual measurements to assess its validity.

 

On the Data Warehouse side, there are many Oracle RAC TPC-H publications. From the results, I think RAC has decent scalability, and I do like that a cluster lets me bring a better balance of processor power, memory, and storage than I can with big iron, or rather, a maxed-out server system (i.e., expensive big-capacity DIMMs and high-priced storage). So apparently my arguments above on interconnect bandwidth and latency are not as important for DW. MS does mention they will soon have their own MPP solution, so that is good, because I am too old to learn Oracle, and I am relatively happy (meaning I am bitching a lot) doing big SQL Server projects.

 

I will expand on what I mean by balance. Let's compare two recent TPC-H 1000GB results. A 64-node BL460c cluster, each node with 2 quad-core Xeon X5450 3.0GHz processors and 32GB of memory (128 sockets, 512 cores, 2TB of memory in total), scored 1,166,976 QphH at a total cost of $6.3M, compared with a 32-socket, 64-core Superdome with 384GB of memory, which scored 123,323 QphH at a cost of $2.5M. Put aside for this discussion the fact that Itanium is still dual-core on a 90nm process while Xeon is quad-core on 45nm. For 8X the number of cores, the performance gain is 9.5X. Keeping in mind that the Xeon X5450 core is about 50% faster than the Itanium 1.6GHz based on SPEC CPU int 2006, this is about right. Also, the RAC system used Exadata storage, which offloads some processing, but probably not too much.
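
A rough check of that scaling claim, again just arithmetic over the published figures; the 1.5X per-core factor below is simply the approximate SPEC CPU int 2006 ratio mentioned above:

-- Rough check of the TPC-H scaling figures quoted above.
DECLARE @cores_smp float = 64,       -- 32-socket Superdome, dual-core Itanium
        @cores_rac float = 512,      -- 64 nodes x 2 sockets x 4 cores
        @qphh_smp  float = 123323,
        @qphh_rac  float = 1166976,
        @per_core  float = 1.5;      -- approx. Xeon X5450 vs Itanium 1.6GHz per-core ratio

SELECT core_ratio   = @cores_rac / @cores_smp,   -- 8x the cores
       qphh_ratio   = @qphh_rac / @qphh_smp,     -- ~9.5x the composite QphH
       -- implied scaling after normalizing for the faster Xeon cores
       norm_scaling = (@qphh_rac / @qphh_smp) / (@cores_rac / @cores_smp * @per_core);  -- ~0.79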

 

The cost breakdown for the Itanium system is $736K for processors, $800K for memory, $1.3M for storage (EVA SAN), and $470K for Oracle (with a 54% discount on all of it). In the RAC Xeon system, it is $160K for processors, $85K for memory, $200K for InfiniBand, $540K for storage, and $6M for Oracle (30% discount on Oracle only). OK, this is not what I was really getting at. If you max out a system, it means buying the 8GB DIMMs, which cost 4X more than the 4GB DIMMs but do not contribute proportionately higher performance. The big gain in the RAC system is that it is possible to configure enough memory to fit the database and make do with a less powerful storage system. This argument goes out the window if the RAC memory cannot encompass the entire DB plus working space. Still, it is good to have this avenue.

 

                         Superdome                     BL460c cluster
Database                 Oracle 11g + Partitioning     Oracle 11gR2 + RAC
QphH@1000GB              123,323                       1,166,976
TPC-H Power              118,577                       782,608
TPC-H Throughput         128,259                       1,740,122
Total System Cost        $2,532,527                    $6,320,001
Processors               32 Itanium 9140 1.6GHz        128 Xeon X5450 3GHz
Cores                    64                            512
Memory                   384GB                         2080GB
Disks                    768                           128 + 6x12
HBA                      64 dual-port FC 4Gb/s         64 x 2 InfiniBand
Costs
  Processors             $736,000 ($23K each)          $159,872 ($1.2K each)
  Memory                 $850,000 ($35,611 per 16GB)   $92,000 ($339 per 8GB)
  HBA                    $233,680 ($3.5K each)         $199,000 (HCA + switch)
  Server Subtotal        $2,397,614                    $696,419
  Storage                $1,390,861                    $536,516
  Software               $624,245                      $6,425,600
  List Price (ex maint.) $4,412,720                    $
  Discounts              $2,210,000                    $1,965,000
  Maintenance (net)      $330,067                      $626,054

Published Friday, June 05, 2009 10:36 AM by jchang


Comments

 

Greg Rahn said:

"If you understand in detail how the cost-based optimizer (CBO) works with respect to local and remote data, you will understand the horrible implications"

I understand in fair detail how the Oracle CBO works, but guess what, it knows nothing about remote (buffer cache) data, and since Oracle RAC uses shared storage, all disk data is the same distance from any given node. Perhaps you could further explain exactly what the "horrible implications" are that you mention.

June 10, 2009 11:53 PM
 

jchang said:

My fault for not being clear. On scale-out, I was talking about the SQL Server distributed partitioned view design. In a SQL Server execution plan with remote data, the row count estimates are 1,000, 3,000 or 10,000 rows, which are the default estimates used when no estimation is possible. I looked into this once for someone. The ODBC protocol (for remote data sources) has a provision for passing data distribution statistics for this purpose, but when we examined the contents of the field it was null, so a remote data source kicks the SQL Server optimizer into default mode, which can be really, really bad.

BTW, I was on your site earlier,

so when will Oracle publish a proper TPC-H that actually has to use the Exadata storage, instead of running in memory? We already know RAC can scale TPC-H.

My thinking is that the 64-node, 128-socket, 512-core cluster can consume close to 100GB/s, which might be beyond what Exadata can scale to.

An 8-node, 16-socket, 64-core cluster should be a good match for one Exadata full rack (14 storage cells), but go for 16 nodes if Exadata can scale to 2 full racks.

June 11, 2009 10:48 AM
 

Riyaj Shamsudeen said:

Joe

"The big gain in the RAC system is that it is possible to configure enough memory to fit the database and make do with a less powerful storage system."

First, RAC is mostly needed to scale out throughput, not necessarily performance. They are two different things. It is not uncommon to lose 5-10% in performance when migrating from non-RAC to a RAC multiple-active-instance cluster while nearly doubling the throughput. I can give you many examples of this.

Second, let me address the notion in your comment that somehow keeping the whole database in memory is the ideal way to tune the database. That is far from the truth. If that were indeed the case, how does one tune TB-scale databases? It has been proven time and again that tuning execution plans and resolving bottlenecks is the ideal way to tune an application or process. An in-memory database by itself doesn't fly.

Let us think through this with a real-life example. I had a customer issue where a short-running transaction processor (with 30 threads) was not scaling. The server was one of the big-iron servers and could not take many more processors. The only way to improve throughput was to use 2 nodes of the cluster and run transaction processors on both nodes. Average run time went from 15 seconds to 16 seconds, but the throughput of those jobs nearly doubled. You see where RAC is much more powerful.

"If you understand in detail how the [SQL Server] cost-based optimizer (CBO) works with respect to local and remote data, you will understand the horrible implications."

Precisely why you can't compare a SQL Server cluster with RAC. A SQL Server cluster is a shared-nothing architecture: one node accesses a table/partition/data exclusively, whereas in RAC every block is shared and accessible from any node. Losing one RAC node is not a big concern since the other nodes can continue the service, but in a SQL Server cluster, if one node is lost, the application has lost access to part of the data. The point is that RAC is in its own category and provides sophisticated cluster communication mechanisms.

The specific problem you talk about used to happen in early Oracle 7 releases with queries joining local and remote tables, but later versions query the remote dictionary before parsing. I guess Oracle software is way ahead of the curve in this area.

It is obvious that your knowledge domain does not run parallel to the Oracle Database knowledge domain. Why not stay out of it?

Cheers

Riyaj

June 12, 2009 2:38 PM
 

jchang said:

The first quote you cite specifically refers to the two HP/Oracle TPC-H benchmark reports. The TPC-H 1000GB scale factor DB is around 1.7TB; the Superdome was configured with 384GB, the RAC system with 2080GB.

So why do you think Oracle publishes RAC in-memory versus single-node disk bound configurations?

If RAC is mostly needed to scale throughput, not performance (parallel execution plans), why has Oracle published several TPC-H results with RAC, but only one TPC-C on RAC, and no TPC-E, period?

On the second item you cite, you start on SQL Server clustering, but then shift to distributed partitioned views,

so if you people could be coherent, I would stay out of Oracle.

Have a nice weekend.

June 12, 2009 2:57 PM
 

jchang said:

Anyway, per the title, this blog post was a commentary on the Microsoft whitepaper concerning RAC,

and the paper makes arguments but neglects to cite published performance reports on RAC. There are several TPC-H results for Oracle RAC, so the arguments made in the MS paper concerning DW on RAC are not substantiated. Concerning OLTP on RAC, there is only one TPC-C report, which shows a 64-processor single node roughly comparable to 16 nodes with 4 processors per node.

If the argument is for multi-node RAC instead of single-node big iron, then Oracle can claim their point. If the argument is that RAC can continue to scale beyond the biggest single-node system, then we need a RAC result bigger than the 64-socket (dual-core Itanium) result of 4M tpm-C (not counting DB2 results). I am not expecting HP to do a 2-4 node Superdome with 128-256 sockets, mostly because even HP does not have 14,000 disk drives sitting around (there were 7,000+ drives in the 64-socket Superdome 4M tpm-C result), not because it can't be done.

June 12, 2009 3:33 PM


