Revenge, Return of Big Iron.
In the old days, standard server systems did not have the power to run large enterprises, hence there were vendors that built really big servers. However, it became apparent, if not widely publicized, that there were serious technical challenges in scaling up on big iron systems. (Many of these difficulties have since been, or will be, resolved.) Furthermore, the press could not distinguish between there being big systems on the market and actually scaling effectively on big systems. But IT departments did figure out that it was often better to buy the standard 4-way SMP server offered by almost all system vendors than a proprietary big NUMA system, even if this meant scaling back on features. More recently, microprocessors have become so powerful that the default system choice should now be a 2-way, meaning it is usually safe to pick this system without any technical sizing effort. Still, there was niche demand for truly immense compute power, if such capability could be harnessed effectively in a production environment.
Years ago, Oracle recognized the limited realizable scaling and serious technical anomalies that occurred in big systems, and elected to pursue the distributed computing solution. The first iteration was Oracle Parallel Server, followed by Real Application Clusters (RAC). A full RAC system is very complicated, and it was impossible to provide good support because of all the variations in specific customer implementations. Furthermore, Oracle probably got tired of hearing customer complaints that were repeatedly traced to really expensive storage systems with absolutely pathetic performance with regard to the special characteristics and requirements of database IO.
Two years ago, Oracle came out with the Oracle Database Machine (ODM), comprising several pre-built RAC nodes (2-way quad-core Xeon 5400 systems) coupled with their Exadata storage system. Each Exadata storage unit was itself a 2-way quad-core Xeon system with 8GB (24GB in gen 2) memory and 12 SATA or SAS disks, running a special version of the Oracle database capable of off-loading certain processing tasks from the main database engine.
The first generation ODM/Exadata was only targeted at data warehouse environments, as the storage system had excellent sequential IO bandwidth, which SAN systems are really bad at, but only marginal random IO. (In fact, SAN systems often have special features meant to prevent one host from drawing too much load, to better support many hosts simultaneously.) The second iteration last year broadened the scope to also support random IO from OLTP environments, adding 384GB of flash storage to supplement the hard disks and making extensive use of compression.
The first generation ODM in 2008 employed a cluster of 2-way systems with Xeon 5400 processors (8 database servers + 14 storage servers in a full rack). This was the more correct choice at the time: the contemporary Opteron had the better interconnect, but the Core 2 architecture had significantly greater compute capability. In 2009, Oracle updated ODM to 2-way systems with Xeon 5500 processors for both the RAC nodes and the Exadata units. Again this was the best choice at the time, among both Opteron versus Xeon and 2-way versus 4-way, as the 4-way Xeon 7400 processor series was limited in memory bandwidth for a balanced system.
For the 2010 ODM refresh, the RAC base unit is now not a 2-way or 4-way, but an 8-way system with the Xeon 7500 processor. There are good reasons for choosing the Xeon 7500 series, including 1) 4 duplex memory channels supporting 16 DIMMs per processor, versus 9 for the Xeon 5600, and 2) the Machine Check Architecture for enhanced reliability. Also, the max memory for a 2-way Xeon 5600 is 192GB (12 x 16GB DIMMs) versus 512GB (32 x 16GB) for each pair of Xeon 7500s. If these were the reasons for including the Xeon 7560 as an option, then Oracle could have offered 2-way or even 4-way Xeon 7560 nodes. Instead Oracle is offering the 8-way Xeon 7560 as a node option, for which the only(?) reason is that there are certain applications not suitable for scale-out even with the fast and fat InfiniBand interconnect. The Exadata units remain 2-way, moving from Xeon 5500 to 5600.
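The memory-capacity comparison above can be sketched with a few lines of arithmetic. This is just the figures from the text (16GB DIMMs assumed throughout); the 8-way extrapolation is my own, and note the shipping X2-8 configures 1TB per node rather than the 2TB ceiling this implies.

```python
# Max-memory arithmetic from the figures quoted above (16GB DIMMs assumed).
dimm_gb = 16
xeon_5600_2way = 12 * dimm_gb        # 192GB for a 2-way Xeon 5600
xeon_7500_pair = 32 * dimm_gb        # 512GB per pair of Xeon 7500s (16 DIMMs/socket)
xeon_7500_8way = 4 * xeon_7500_pair  # implied ceiling for an 8-way node (my extrapolation)
print(xeon_5600_2way, xeon_7500_pair, xeon_7500_8way)  # 192 512 2048
```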
For years, Larry has ridiculed IBM and their POWERx system architecture, saying big iron belonged in a museum; the future was clustered small systems. Last year I commented that scale-out clustering with small systems (not a Microsoft feature until PDW is released) was a mistake. The propagation latency between systems, even over an InfiniBand connection, is far higher than the inter-node latency of most NUMA systems. The main deficiencies of the NUMA systems of years ago were limited interconnect bandwidth (still better than IB) and the lack of an integrated directory for cache coherence (I did not discuss this at the time). But all of this was on Intel's public roadmaps. AMD already had the interconnect technology and only needed HT-Assist (introduced with Istanbul), but AMD has decided to withdraw from >4-way, probably due to severe financial pressure.
So the Oracle decision to move ODM RAC nodes to not just Xeon 7500, but also to 8-way nodes is essentially a concession that scale-up first is the best strategy if a very high degree of locality cannot be achieved.
The fact that ODM now has only 2 nodes, versus previously 8 nodes with 2-way systems, may be hinting that the two-node strategy is more for availability than scale-out. There is good evidence that Oracle RAC can effectively scale-out highly partitionable workloads, including with 2-way nodes. I am inclined to think that RAC can also scale-out partitionable workloads with 8-way nodes, but then why not just stay with the finer-grain increments of the 2-way nodes? A non-localizable workload will not scale out as well as it will scale up (on a good system architecture, which is now available). This leads me to think that a 2-node 8-way ODM is more for high availability, but I will withdraw this assertion if there are performance reports to support the scale-out characteristics.
Side note on Exadata Storage Systems
In general, I think this is a good concept in:
1) putting the database engine on the storage system to off-load processing when practical,
2) having substantial compute power in the storage system to do this with 2 quad-core processors over relatively few disks,
3) using the compute power to also handle compression,
4) the choice of InfiniBand is based on the best technology, not fear of leaving Ethernet.
The goal of 2GB/sec sequential per unit is reasonable: if each Nehalem core can probably handle 250MB/sec of compressed data, then 8 cores matched to 2GB/sec works out.
I do disagree with having only 12 3.5in SAS disks: supplying the 2GB/s means 167MB/sec per disk, which is possible but requires pure sequential IO. That is not always what happens in DW, hence I think 24 SFF (2.5in) disks is better.
Potentially this could be 20 x 73GB 15K disks plus 4 x 600GB 10K SAS disks (better than the 4 x 640GB 7200RPM SATA disks I first considered).
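The per-disk arithmetic behind the point above can be sketched as follows; the per-core decompression rate is my assumed figure from the text, not a measured number.

```python
# Bandwidth arithmetic for the Exadata unit discussed above.
target_mb_s = 2000        # per-unit sequential goal, 2GB/sec
core_decomp_mb_s = 250    # assumed per-core decompression rate (from the text)
cores = 8                 # 2 quad-core processors
compute_mb_s = cores * core_decomp_mb_s
print(compute_mb_s)       # 2000, matches the target

per_disk_35 = target_mb_s / 12   # 12 x 3.5in disks
print(per_disk_35)               # ~167 MB/s per disk: needs pure sequential IO

per_disk_sff = target_mb_s / 24  # 24 x 2.5in SFF disks instead
print(per_disk_sff)              # ~83 MB/s per disk: a much more comfortable margin
```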
Per Kevin (below) on ODM options:
The X2-8 has two 8-way 8-core Xeon 7560 nodes (presumably the X7560) with 1TB memory each.
The X2-2 has either 2, 4, or 8 (quarter, half, and full rack) 2-way nodes with the six-core Xeon 5670 and 96GB memory per node.
The Exadata Storage X2-2 servers use the Xeon L5640 six-core 2.26GHz (though there was some confusion as to whether this is the 4-core E5640 or one of the 6-core X-model processors), with 24GB memory per node, 4 x 96GB flash cards for cache, and 12 x 600GB 15K or 2TB 7200RPM disks. Bandwidth is 3.6GB/s to flash, 1.8GB/s to the 15K disks, and 1GB/s to the 7200RPM disks.
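The implied per-device rates behind those X2-2 figures are worth a quick sanity check; this is just division on the numbers quoted above, not vendor-stated per-device specs.

```python
# Implied per-device rates from the X2-2 Exadata figures quoted above.
flash_total, flash_cards = 3600, 4   # MB/s total, 4 x 96GB flash cards
disk15k_total, disks = 1800, 12      # MB/s total, 12 x 600GB 15K disks
sata_total = 1000                    # MB/s total, 12 x 2TB 7200RPM disks

per_flash = flash_total / flash_cards
per_15k = disk15k_total / disks
per_sata = sata_total / disks
print(per_flash)   # 900 MB/s per flash card
print(per_15k)     # 150 MB/s per 15K disk
print(per_sata)    # ~83 MB/s per 7200RPM disk
```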
Front-end network on the database servers is 2x10GbE + 2x1GbE on the X2-2 and 8x10GbE + 8x1GbE on the X2-8. My understanding is that MS PDW intends to ship with 1GbE, but 10GbE can be installed.
P.S. The HP PREMA whitepaper on their DL980 G7 system architecture says future Xeon systems up to 32-way(?) are possible. I am thinking this should be next year.
UPDATE 2010-10-12 The Exadata X2-2 has the Xeon L5640 6-core 2.26GHz 60W (the E5640 is a quad-core 2.66GHz 80W).