Joe Chang

Big-Iron Revival III

Revenge, Return of Big Iron.

In the old days, standard server systems did not have the power to run large enterprises, hence there were vendors that built really big servers. However, it became apparent, if not widely publicized, that there were serious technical challenges in scaling up on big-iron systems. (Many of these difficulties have since been or will be resolved.) Furthermore, the press could not distinguish between there being big systems on the market and actually scaling effectively on big systems. But IT departments did figure out that it was often better to buy the standard 4-way SMP server offered by almost all system vendors than a proprietary big NUMA system, even if this meant scaling back on features. More recently, microprocessors have become so powerful that the default system choice should now be a 2-way, meaning it is usually safe to pick this system without any technical sizing effort. Still, there was a niche demand for truly immense compute power, if such capability could be harnessed effectively in a production environment.

Years ago, Oracle recognized the limited realizable scaling and the serious technical anomalies that occurred in big systems, and elected to pursue the distributed computing solution. The first iteration was Oracle Parallel Server, followed by Real Application Clusters (RAC). A full RAC system is very complicated, and it was impossible to provide good support because of all the variations in each specific customer implementation. Furthermore, Oracle probably got tired of hearing customer complaints that were repeatedly traced to really expensive storage systems with absolutely pathetic performance with regard to the special characteristics and requirements of database IO.

Two years ago, Oracle came out with the Oracle Database Machine (ODM), composed of several pre-built RAC nodes (2-way quad-core Xeon 5400 systems) coupled with their Exadata storage system. Each Exadata storage unit was itself a 2-way quad-core Xeon system with 8GB of memory (24GB in gen2) and 12 SATA or SAS disks, running a special version of the Oracle database capable of off-loading certain processing tasks from the main database engine.

The first generation ODM/Exadata was targeted only at data warehouse environments, as the storage system had excellent sequential IO bandwidth, which SAN systems are really bad at, but only marginal random IO. (In fact, SAN systems often have special features meant to prevent one host from drawing too much load, to better support many hosts simultaneously.) The second iteration last year broadened the scope to also support random IO from OLTP environments, with 384GB of flash storage to supplement the hard disks and extensive use of compression.

The first generation ODM in 2008 employed a cluster of 2-way systems with Xeon 5400 processors (8 database servers + 14 storage servers in a full rack). This was the more correct choice at the time: the contemporary Opteron had the better interconnect, but the Core 2 architecture had significantly greater compute capability. In 2009, Oracle updated ODM to 2-way systems with Xeon 5500 processors on both the RAC nodes and the Exadata units. Again this was the best choice at the time, both Opteron versus Xeon and 2-way versus 4-way, as the 4-way Xeon 7400 processor series was limited in memory bandwidth for a balanced system.

For the 2010 ODM refresh, the RAC base unit is now not a 2-way or 4-way, but an 8-way system with the Xeon 7500 processor. There are good reasons for choosing the Xeon 7500 series. These include 1) four duplex memory channels supporting 16 DIMMs per processor versus 9 for the Xeon 5600, and 2) the Machine Check Architecture for enhanced reliability. Also, the maximum memory for a 2-way Xeon 5600 is 192GB (12x16GB DIMMs) versus 512GB (32x16GB) for each pair of Xeon 7500s. If these were the reasons for including the Xeon 7560 as an option, then Oracle could have offered 2-way or even 4-way Xeon 7560 nodes. Instead Oracle is offering the 8-way Xeon 7560 as a node option, for which the only(?) reason is that there are certain applications not suitable for scale-out even with the fast and fat InfiniBand interconnect. The Exadata units remain 2-way, but move from the Xeon 5500 to the 5600.
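For reference, the quick arithmetic behind those maximum memory figures (a back-of-envelope sketch in Python, assuming 16GB DIMMs as quoted above):

    dimm_gb = 16
    print(12 * dimm_gb)   # 192GB for a 2-way Xeon 5600 (12 x 16GB DIMMs)
    print(32 * dimm_gb)   # 512GB for each pair of Xeon 7500s (32 x 16GB DIMMs)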

For years, Larry has ridiculed IBM and their POWERx system architecture, saying big iron belonged in a museum; the future was clustered small systems. Last year I commented that scale-out clustering with small systems (not a Microsoft feature until PDW is released) was a mistake. The propagation latency between systems, even over an InfiniBand connection, is far higher than the inter-node latency of most NUMA systems. The main deficiencies of the NUMA systems of years ago were limited interconnect bandwidth (which was still better than IB) and the lack of an integrated directory for cache coherence (I did not discuss this at the time). But all of this was on Intel's public roadmaps. AMD already had the interconnect technology and only needed HT-Assist (introduced with Istanbul), but AMD has decided to withdraw from >4-way, probably due to severe financial pressure.

So the Oracle decision to move the ODM RAC nodes not just to the Xeon 7500, but also to 8-way nodes, is essentially a concession that scale-up first is the best strategy if a very high degree of locality cannot be achieved. The fact that ODM now has only 2 nodes, versus previously 8 nodes with 2-way systems, may be hinting that the two-node strategy is more for availability than scale-out. There is good evidence that Oracle RAC can effectively scale out highly partitionable workloads, including with 2-way nodes. I am inclined to think that RAC can also scale out partitionable workloads with 8-way nodes, but then why not just stay with the finer-grain increments of the 2-way nodes? A non-localizable workload will not scale out as well as it will scale up (on a good system architecture, which is now available). This leads me to think that a 2-node 8-way ODM is more for high availability, but I will withdraw this assertion if there are performance reports to support the scale-out characteristics.

Side note on Exadata Storage Systems
In general, I think this is a good concept in:
1) putting the database engine on the storage system to off-load processing when practical,
2) having substantial compute power in the storage system to do this with 2 quad-core processors over relatively few disks,
3) using the compute power to also handle compression,
4) the choice of InfiniBand is based on the best technology, not fear of leaving Ethernet.
The goal of 2GB/sec sequential per unit is reasonable: if each Nehalem core can handle roughly 250MB/sec of compressed data, then 8 cores matched to 2GB/sec works out.
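A rough sketch of that arithmetic (the 250MB/sec per-core figure is my estimate, not an Oracle number):

    per_core_mb_s = 250            # assumed compressed scan rate per Nehalem core
    cores = 2 * 4                  # two quad-core Xeons per Exadata storage unit
    print(per_core_mb_s * cores)   # 2000 MB/s, i.e. the ~2GB/sec target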

Where I do disagree is in having only 12 3.5in SAS disks: supplying the 2GB/s means 167MB/sec per disk, which is possible, but requires pure sequential IO. This is not always what happens in a DW, hence I think 24 SFF (2.5in) disks would be better, potentially 20 x 73GB 15K disks plus 4 x 600GB 10K SAS disks (rather than 7200RPM SATA).
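The back-of-envelope math behind that preference (my assumptions, not an Oracle spec):

    target_mb_s = 2000                 # 2GB/sec sequential per storage unit
    for disks in (12, 24):
        print(disks, "disks:", round(target_mb_s / disks), "MB/sec per disk")
    # 12 disks: 167 MB/sec per disk - only achievable with pure sequential IO
    # 24 disks:  83 MB/sec per disk - a much easier target for real DW access patterns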

Per Kevin's comment below on the ODM options:
The X2-8 has two 8-way nodes with 8-core Xeon 7560 processors (presumably the X7560) and 1TB memory each.
The X2-2 has either 2, 4, or 8 (quarter, half, and full rack) 2-way nodes with the six-core Xeon 5670 and 96GB memory per node.
The Exadata Storage Server in the X2-2 uses the six-core Xeon L5640 2.26GHz (there was initially some confusion as to whether this was the quad-core E5640 or one of the six-core X-model processors), with 24GB memory per node, 4 x 96GB flash cards for cache, and 12 disks, either 600GB 15K or 2TB 7200RPM. Bandwidth to flash is 3.6GB/s, 1.8GB/s to the 15K disks, and 1GB/s to the 7200RPM disks.
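Dividing those quoted bandwidths across the devices in each storage server gives a per-device view (my arithmetic, not an Oracle figure):

    print(3600 / 4)     # ~900 MB/s per flash card (3.6GB/s over 4 x 96GB cards)
    print(1800 / 12)    #  150 MB/s per 15K disk (1.8GB/s over 12 disks)
    print(1000 / 12)    #  ~83 MB/s per 7200RPM disk (1GB/s over 12 disks)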
Front-end network on the database servers is 2x10GbE + 2x1GbE on the X2-2 and 8x10GbE + 8x1GbE on the X2-8. My understanding is that MS PDW intends to have 1GbE, but 10GbE can be installed.

PS: the HP PREMA whitepaper on their DL980 G7 system architecture says future Xeon systems up to 32-way(?) are possible. I am thinking this should be next year.

UPDATE 2010-10-12 The Exadata X2-2 has the Xeon L5640 6-core 2.26GHz 60W (the E5640 is a quad-core 2.66GHz 80W).

Published Friday, September 24, 2010 10:24 PM by jchang

Comments

 

Kevin Closson said:

"The fact that ODM now has only 2-nodes with 8-way versus previously 8 nodes with 2-way sysem may be hinting that the two nodes strategy is more for availability than scale-out."

  One small clarification regarding options for the database grid of the Exadata Database Machine. The latest hardware refresh consists of either 2 8-socket Nehalem EX servers in a 2-node RAC cluster or quarter/half/full rack configurations based on 2-socket Westmere EP.

  The Nehalem EX option is called X2-8 and the Westmere EP option is X2-2. The X2-8 is only available in a full-rack configuration. The X2-2 is available in quarter/half/full rack configurations.

  The storage grid in all cases is based on 2-socket Westmere EP Exadata Storage Servers with either 12 15K SAS drives or 12 7200 RPM 2TB SAS drives (Seagate Constellation).

  All this and more is covered in:

http://kevinclosson.wordpress.com/2010/09/20/intel-xeon-7500-nehalem-ex-finds-its-way-into-exadata-database-machine-so-does-solaris/

(edit by jc, per kc correction below)

September 27, 2010 5:11 PM
 

jim jones said:

stay on yer own blog kevvie

unless you are here to study...

September 27, 2010 8:30 PM
 

jchang said:

Civility please: Kevin provided valid information, and I made reference to Oracle on this site. The main Oracle information only showed the X2-8 with the 8-way Xeon 7500. After Kevin's reminder, I searched for the X2-2 and found it in the Oracle press release. I do think maintaining the 2-way option makes sense; the EP line should be out a year ahead of the EX line on a consistent basis. I am surprised the storage server is moving to the 5600, and the upper-end 6-core at that. Of course, Oracle should have sufficient information to assess whether the additional compute capability is required. I would have guessed that there was some evidence that the quad-core was sufficient to support 1.5GB/sec and 75K IOPS.

I am curious as to which 5600 Oracle elected for each, as the 6-core is the X-models only. The previous generation Xeon 5500 was the E-model, which was a good choice for thermal and density reasons.

Next, while RAC is not perfect by any means, nor as capable as it is sometimes made out to be, it is useful. So we should admire that Oracle has the guts to trail-blaze on the bleeding edge, while somehow not suffering the consequences that have befallen others who have done so.

Now the curious matter of benchmarks. Oracle has not published a TPC-C for ODM even though there is every reason to believe it should do well. There is a result for SPARC RAC. There is an older TPC-H(?) result on RAC. I am inclined to think ODM can do some of the TPC-H queries well, but not others. This might depend on the rules and partitioning strategies. Oracle has not published any TPC-E results. There is every reason to believe Oracle does fine for TPC-E on a single system, perhaps better than SQL Server on big systems. Is the reason Oracle is not publishing TPC-E only because TPC-E does not scale well on RAC?

September 27, 2010 9:15 PM
 

Kevin Closson said:

..and shame on me for speed-typing. Of course I meant to say 2 TB 7,200 RPM SAS drives, not 2 GB.

 Regarding which Westmere EPs, the database hosts are fitted with 5670s and the storage servers have 5640s.

September 28, 2010 12:05 AM
 

jchang said:

Thanks Kevin, I think the X5670 is a good choice for the database engine, being the top frequency at 95W, with the X5680 at 130W. But the E5640 (80W) is a 2.66GHz 4-core part, and the X5650 (95W), also 2.66GHz, is the lowest 6-core. The Oracle press release clearly said that the full rack of 14 storage servers had 168 (= 14 x 2 x 6) cores. I could see moving to the 5600 but staying with the lower-power 80W quad-core, as there should be good data on how much compute power is required to support compression (and SQL offload) at that IO load. I expect that hyper-threading helps too.

Oh yeah, the point of the blog is not whether ODM/RAC/Exa is good or bad, or whether SQL Server needs to follow via PDW. The point is that another company with deep database expertise started out promoting scale-out on small boxes, but then returned to big iron (when suitable big iron became available). I am thinking very sound technical analysis went into this decision (or Larry reads my blog).

Each of the major database products is very complex, so DBAs tend to specialize in just one. Sure, there are people who do two, but experts tend to do one. So while it is seriously impractical to be expert in two, one should at least pay attention to what is happening on the other side of the pond.

September 28, 2010 8:25 AM

About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
