THE SQL Server Blog Spot on the Web


Joe Chang

Is anyone seriously looking at the HP ProLiant DL785G5?

Or the IBM x3950M2 8-way Quad-Core Xeon?

Consider the following specs:

8-way Quad-Core Opteron (32 cores total)

Max Memory: 512GB (64 DIMM sockets)

11 PCI-E slots: 3x16, 3x8, and 5x4 or option for 7 PCI-E and 2HTx

 

Compare this with the HP ProLiant DL585G5:

4-way Quad-Core Opteron (16 cores total)

Max Memory: 256GB (32 DIMM sockets)

7 PCI-E slots: 3x8, and 4x4

 

Aside from 8 Quad-Core processor sockets, the significant differences are 64 DIMM sockets, doubling the maximum memory of the 4-way, and 11 PCI-E slots (depending on the actual architecture, this could be 92 PCI-E lanes!)
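The 92-lane figure follows directly from the slot mix listed above. A minimal sketch of that arithmetic, using only the slot counts quoted in this post (the dictionary and function names are made up for illustration):

```python
# Slot mix quoted above: {lane width: number of slots}.
DL785_SLOTS = {16: 3, 8: 3, 4: 5}
DL585_SLOTS = {8: 3, 4: 4}


def total_slots(slots):
    """Count the physical PCI-E slots."""
    return sum(slots.values())


def total_lanes(slots):
    """Sum lane width times slot count over every slot type."""
    return sum(width * count for width, count in slots.items())


print("DL785:", total_slots(DL785_SLOTS), "slots,", total_lanes(DL785_SLOTS), "lanes")
print("DL585:", total_slots(DL585_SLOTS), "slots,", total_lanes(DL585_SLOTS), "lanes")
```

This only counts lanes at the slots; whether the chipset can feed all of them at once is the "depending on the actual architecture" caveat.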

 

Some pricing from the HP web site below.

ProLiant DL585G5

w/4x2.2GHz $10,389, 4x2.3GHz $12,189, 4x2.5GHz $16,189

64GB 32x2GB +$4,495, 128GB 32x4GB +$13,455, 256GB 32x8GB +$55,863

 

ProLiant DL785G5

w/4x2.2GHz $16,973, 8x2.3GHz $27,291, 8x2.5GHz $46,891

128GB 64x2GB +$7,612, 256GB 64x4GB +$26,492, 512GB 64x8GB +$75,316

 

8-way systems were somewhat popular in the Pentium III Xeon/ProFusion days, but Intel did not follow it with a respectable (read on when you stop laughing) 8-way chipset for the NetBurst based Xeon processors. HP did their own chipset for the DL740(?) which they considered moderately successful (in that it was profitable but did not warrant continuation for the next generation dual core processors).

 

The ProLiant DL785 posts a respectable TPC-H benchmark result of 52,860 QphH@300GB (SQL Server 2008), compared with 46,034 for the 8-way IBM x3950 with 8 Xeon 7350 (SQL Server 2005). The ProLiant DL585G5 also posted a top 4-way SQL Server TPC-C result of 471,883 tpm-C on the AMD Opteron 8360 at 2.5GHz. I had really expected that the AMD quad-core would need to be at 2.8GHz to reach this performance level.

 

Now getting an actual database (SQL Server or any other) to scale to 16 or 32 cores is not a simple matter. I do suggest conducting a proper quantitative scaling analysis, i.e., measuring maximum throughput with 1, 2, 4, 8, 16, and 32 cores.
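One way to summarize such a scaling test is to turn the measured throughput at each core count into speedup and parallel efficiency. The sketch below assumes you already have the throughput numbers; the sample values are hypothetical placeholders, not benchmark results, and the function name is illustrative.

```python
def scaling_table(throughput_by_cores):
    """Return (cores, throughput, speedup, efficiency) rows.

    Speedup and efficiency are relative to the smallest core count measured.
    """
    cores_sorted = sorted(throughput_by_cores)
    base_cores = cores_sorted[0]
    base_tput = throughput_by_cores[base_cores]
    rows = []
    for cores in cores_sorted:
        tput = throughput_by_cores[cores]
        speedup = tput / base_tput
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, tput, speedup, efficiency))
    return rows


# Hypothetical measurements (transactions/sec), NOT real benchmark data.
measured = {1: 1000.0, 2: 1900.0, 4: 3600.0, 8: 6400.0, 16: 10400.0, 32: 14400.0}

for cores, tput, speedup, eff in scaling_table(measured):
    print(f"{cores:>2} cores  {tput:>8.0f}/s  speedup {speedup:5.2f}  efficiency {eff:4.0%}")
```

A sharp drop in efficiency between two core counts is the place to start looking for the operations that do not scale.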

 

The other reason for going with the 8-way is the extra DIMM sockets. The 4-way QC 2.3GHz DL585 with 256GB memory (32x8GB DIMMs) is $68K, versus $54K for the 8-way DL785 with 64x4GB DIMMs. (It is necessary to populate all 8 processor sockets to get 64 DIMM sockets, but one could restrict SQL Server to 4 sockets if per processor licensing is involved; see Andy's comment below.) Even at 128GB, it is $26K for the 4-way DL585 with 32x4GB DIMMs versus $35K for the 8-way with 64x2GB. The extra 16 cores for a $9K delta is highly attractive in CAL situations.
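The rounded totals in this comparison can be reproduced from the HP list prices quoted earlier in the post. This is just a sketch of the addition; the table layout and function name are illustrative.

```python
# Base system prices with 2.3GHz processors, from the pricing section above.
BASE_PRICE = {"DL585": 12_189, "DL785": 27_291}

# Memory option prices, keyed by (model, total GB).
MEMORY_PRICE = {
    ("DL585", 128): 13_455,  # 32 x 4GB
    ("DL585", 256): 55_863,  # 32 x 8GB
    ("DL785", 128): 7_612,   # 64 x 2GB
    ("DL785", 256): 26_492,  # 64 x 4GB
}


def configured_price(model, memory_gb):
    """Base system plus the memory option, in dollars."""
    return BASE_PRICE[model] + MEMORY_PRICE[(model, memory_gb)]


for gb in (256, 128):
    p585 = configured_price("DL585", gb)
    p785 = configured_price("DL785", gb)
    print(f"{gb}GB: DL585 ${p585:,}  DL785 ${p785:,}  difference ${p785 - p585:,}")
```

At 256GB the 8-way comes out cheaper because it reaches that capacity with 4GB DIMMs instead of 8GB DIMMs.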

 

One would also think that the 11 PCI-E slots could support phenomenally high sequential disk transfer rates: 800MB/sec per first generation PCI-E SAS RAID controller, 1,100MB/sec+ for second generation (in a x8 slot). But there is no published data on this for the AMD systems with PCI-E. The HP Itanium systems can do over 15GB/sec in SQL Server table scans.
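As a back-of-envelope estimate, the per-controller rates above can be multiplied out across the slots. This sketch assumes one controller per slot and that nothing upstream (chipset, HyperTransport links, memory) becomes the bottleneck, which is exactly what the missing published data would need to confirm. The mixed case assumes second generation controllers only in the six x8/x16 slots.

```python
def aggregate_gb_s(groups):
    """Sum controller count times MB/sec per controller, returned in GB/sec."""
    return sum(count * mb_s for count, mb_s in groups) / 1000


# All 11 slots with first generation controllers at 800MB/sec each.
all_first_gen = aggregate_gb_s([(11, 800)])

# Assumed mix: 6 wide slots at 1,100MB/sec, 5 x4 slots at 800MB/sec.
mixed = aggregate_gb_s([(6, 1100), (5, 800)])

print(f"all first generation: {all_first_gen:.1f} GB/sec")
print(f"mixed generations:    {mixed:.1f} GB/sec")
```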

 

Published Sunday, July 20, 2008 1:22 PM by jchang


Comments

 

jchang said:

silly me, let me rephrase:

is any one seriously looking at the 8-way in addition to Linchi?

July 20, 2008 12:45 PM
 

Andrew Kelly said:

Joe,

This comment is incorrect:  

" but one could restrict SQL Server to 4 sockets if per processor licensing is involved".

This is not true with the current licensing. If the server has 8 procs, then regardless of whether you set affinity or intend to use all of them, you must pay a per proc license for each processor in the box.

July 20, 2008 3:33 PM
 

DonRWatters said:

I'm about to begin testing a DL785 as compared to a SuperDome, with the same number of cores and amount of RAM (it would be difficult to match the amount of IO or HBAs, so we're not considering that). We have one app that is CPU bound, and we're trying to determine if the cost of the SuperDome that we already have will be substantiated when compared to a DL785. My guess is probably not, given the performance that I've seen out of the Itanium CPUs in the SuperDome. We'll see though. So, the answer is yes, we're definitely considering it.

July 20, 2008 4:56 PM
 

jchang said:

Andy, wow, that sucks, too bad AMD does not sell a socket module with the processors disabled, leaving only the memory controller and HT ports functional.

Don: good of you to look into this. Even though the Itanium 1.6GHz and Opteron 2.3GHz are about the same on SPEC CPU 2006 integer (13-14 base), I suspect the Opteron will do better on most SQL code, because SQL code is heavy on logic and cannot easily benefit from the wide Itanium architecture. It would be nice if someone from MS could show us the assembly output on some relevant SQL engine code. It's also too bad that Intel is so far behind on getting Itanium to the current generation process; 90nm does not cut the mustard today. The DL785 will also have an advantage at 8 sockets, 32 cores versus the 16 socket, 32 core (32 or 64 threads) Itanium, because the AMD architecture is soft NUMA, while the Itanium will have hard NUMA overhead. The Itanium could potentially have an advantage in high call volume apps, because that is the one characteristic that benefits from HT and large cache.

BTW, do you have even CPU distribution, or unbalanced?

July 20, 2008 10:05 PM
 

DonRWatters said:

The cell layout has a balanced CPU distribution.  

Much of the workload that we do has to be limited using MAXDOP=8 to stay within the hard numa limits on the SuperDome...and we're hoping to not have to do this within the DL785.  Although, I've heard from some HP folks that the DL785 soft numa might not work as expected in some situations.  That's exactly what I'm hoping to test.

So far, I've seen a decent amount of discrepancy between a DL585 and a SuperDome regarding performance, in favor of the DL585, because of the AMD's ability to rip through the data faster than the Itaniums can. Of course, the DL585 couldn't load up as many processors, so the SuperDome won out in terms of scale. But now that we have a DL785 in the mix, we'll see what kind of watermarks we can reach, in terms of throughput and performance as compared to the big dog. :)

July 21, 2008 5:57 PM
 

Linchi Shea said:

Don;

We should compare notes on the DL785. The competitors in this space include the IBM 3950M2, the Sun x4600M2 (I haven't had a chance to check if there is an update to Barcelona), and the DL785.

July 22, 2008 1:34 PM
 

jchang said:

Don, where did the MAXDOP=8 come from? Are you on SQL 2000 or 2005? When I did parallel execution plan testing just prior to 2005 RTM, I found SQL 2000 scaling beyond the NUMA cell was highly problematic, while SQL 2005 showed excellent scaling to 16 procs, 4 NUMA cells (prior to dual core). Later, 2k5 SP2 made additional improvements.

Linchi: I am surprised HP didn't send you their very first DL785 off the assembly line! Any more takers?

July 22, 2008 5:03 PM
 

Linchi Shea said:

Joe;

I'm no longer where I used to be, and the priorities are different now (for better or for worse).

Anyway, it's difficult enough to make the results meaningfully comparable for the same test across a long time span. It's even more difficult to compare results from different tests by different people (I'm not talking about TPC-C, etc). And it becomes impossible when you are basically forbidden by the hardware vendors from talking publicly.

July 22, 2008 5:46 PM
 

mrdenny said:

As our application continues to grow its user base we will be looking toward the DL785 platform, as we are CPU bound. The boss doesn't like the idea of the SQL licensing for the system, but if we can scale the application to 10-20 times our current load I'm sure he'll be OK with it.

We are CPU bound on our OLTP application, and are in the process of moving from a DL385 to a DL585.  The DL 785 probably won't be for a couple of years as we are going from 4 cores to 16 as it is, but the DL 785 is in the 3-4 year road map (unless HP comes out with a DL 985 before then).

July 22, 2008 6:49 PM
 

sql_noob said:

Until last year we were using 2 DL 760's in a clustered configuration. Replaced them with Proliant DL 380's last year.

We bought a DL 580 G5 this year. Personally I think the 780 is a joke. I keep reading that you won't get much performance improvement with that many CPU cores. Today's software is not optimized to run nicely on anything more than 8 cores. The cheaper DL 580 has the same amount of RAM. Will anyone need that many expansion slots?

July 23, 2008 1:52 PM
 

jchang said:

mrdenny: if you are on a 385, then moving to a 585 is the right thing for now. I do not like buying hardware (specifically the processor complex) for more than the immediate needs within 1-2 years, on the expectation that newer, faster (in throughput anyway) systems will be available next. For fast growing environments, it is better to buy one system now and replace it in 2 years, for a 4-year life, than to buy a system with headroom for the full 4 years. I am inclined to think that AMD will not implement 1-hop capability for more than 8 sockets, or at least not 16, and the same for Intel on the Xeon side. If we project out 8 cores per socket (and 16 logical in Nehalem) in the 2010 time frame, this makes 64 cores in an 8-way system. What is the MS plan for supporting more than 64-way? Will we have Affinity128 Mask?

sql_noob: I don't consider the 785 a joke; it should be a serious platform for people doing highly intensive work. But as with all expensive things, it gets sold to people with big budgets regardless of whether they need it or not. Anyone who writes "that you won't get much performance improvement with that many CPU cores" without qualification is an idiot out of his depth. It is correct to say that one should not assume that the big box will yield improvement on a drop in basis. As I said above and in other places: for any database application, the execution plans for the complete application will consist of many individual component operations. Some component ops have excellent scaling beyond 16-32 cores, some have horrible scaling beyond 2-4 (and sometimes 1). What do you think the chances are that your particular application, designed blind to these considerations, will magically avoid the bad ops? Especially considering that many of the poor scaling ops tend to be the ones that support really bad SQL coding techniques favored by people who should not be allowed near a database.

But I digress. Usually, finding the bad ops is not too difficult; sometimes correcting or bypassing the bad ops is easy, other times difficult. Such is life. It is more accurate to say today's apps do not run well on (hard) NUMA systems, instead of 8 cores. I am not sure the original programmer really designed and tested for more than 2; it's only luck that it seems to do OK. And then what I said about the high mem costs still applies.

As to the IO slots, if you have a large DW, 16-32 CPUs can consume a lot of data, which does not fit into memory. For this, you might want to feed the engine with a fire hose, not a garden hose; this means 3-10GB/sec, which big iron systems can do. Too often I have seen people buy big budget systems to the sales rep/SE recommended configuration, which ends up doing only 200-300MB/sec. One quarter-million-dollar DMX did 14MB/sec.

July 23, 2008 2:44 PM
 

GrumpyOldDBA said:

This box would be excellent for consolidation and/or virtualisation.

July 24, 2008 2:56 AM
 

kudabird said:

What you forget of course with a comparison of the ProLiant to the Integrity line is the vastly superior RAS of Integrity. Yes, the DL785 eats into the Integrity product line space, but for the enterprise, RAS should matter.

July 29, 2008 2:31 AM
 

jchang said:

I don't forget, but the Integrity is not advertised as "Great RAS, somewhat slower performance", so it's up to Intel to get Itanium up to par with other processors, especially Core2 now, and Nehalem later.

July 29, 2008 9:57 AM
 

GunterZ said:

Hi Joe (and Don),

We're actually getting up to 1.6GB/s out of a PCIe RAID card.

This system is good for close to 9GB/s, with the Superdome we're doing over 24GB/s.

Regarding Don's DOP=8, SQL2005 will pick 8 cores on one NUMA node (i.e. cell board), and thus the synchronization of threads (cxpacket) is much more efficient than synchronizing across nodes.

July 29, 2008 12:23 PM
 

Linchi Shea said:

Hi kudabird;

I guess every vendor would advertise every machine as great on RAS. I'm not sure that Integrity has demonstrated the 'vastly superior RAS'. I'd like to see some data on that.

August 3, 2008 8:44 PM
 

jchang said:

Thanks Gunter, I was really hoping to get a highly authoritative source on what the newest components can do, as I do not want to spend my money without knowing whether I'll get any improvements. If we can get 1.6GB/s per card, that could kind of change the IO load strategy. Previously, I said spread load across all x4 or x8 PCI-E slots in a balanced manner. Now with 1.6 on the x8 (or x16), maybe the best strategy is to just use the 6 (3 x8 and 3 x16) wide PCI-E slots on the 785, or an asymmetric layout: more on the wide slots, less on the x4.

OK, so the MAXDOP 8 is specific to Itanium Dual-Core, which has 8 cores on 1 NUMA node, and this will change with each major architecture generation. Still, my preference for DW would be to limit smaller queries to MAXDOP 2 or 4, but allow all out parallelism on the really big queries, at least for some people, as SQL 2K5 can scale across NUMA nodes on Itanium/Superdome, but you know best on this.

August 4, 2008 12:44 PM
 

David Trounce said:

Hey Joe,

You've helped me in the past over at sql-server-performance.com. We're considering an HP DL785 G5 for a new large client, specs something like:

Server:

HP ProLiant DL785 G5

8 x Quad core Opteron 8360 SE (2.5GHz / 105w) processors

64GB fully buffered PC2-5300 DIMM

16 x 146GB 2.5” 10K SAS disks (internal)

2 x HP P600 controllers for internal arrays

DVD drive (likely needs to be external USB given 16 internal disks)

Hotplug AC Power Supplies

Integrated Lights Out 2 (iLO 2) Standard Management

24x7 on-site hardware support

External disk system:

16 x HP MSA60 external disks, each containing 12 x 146GB 15K 3.5” SAS disks

8 x HP P800 controllers (2 racks per controller)

This seems like it would be the logical upgrade to a server we currently have on another engagement, that is more CPU-bound than disk-bound on large OLAP queries (billions of records of retail data, 10-15 business analysts running very large one-off queries in SQL that are not worth hand-optimizing):

Server:

HP ProLiant DL580 G5

4 x Quad core Intel Xeon X7350 (2.93GHz / 2x4M / 130w) processors

32GB fully buffered PC2-5300 DIMM

16 x 146GB 2.5” 10K SAS disks (internal)

2 x HP P400i controllers for internal arrays, each with 512MB RAM cache and battery backup

HP slimline DVD+RW 8X drive

Hotplug AC Power Supplies

Integrated Lights Out 2 (iLO 2) Standard Management

24x7 on-site hardware support

External disk system:

12 x HP MSA60 external disks, each containing 12 x 146GB 15K 3.5” SAS disks

6 x HP P800 controllers (2 racks per controller)

We're running SQL 2005 Enterprise x64 on this one, but will go for SQL 2008 Enterprise x64 on the new one.

August 31, 2008 7:23 PM
 

jchang said:

Without knowing the actual app characteristics, then purely from cost considerations, I would advocate 128 or 256GB mem. Figure the base system plus 8 procs will cost $46K; then $3-4K for memory is a drop in the bucket, and even 128GB at +$7.6K is kind of low. I would say that 256GB at $26K is a reasonable balance for this system. Of course, if you already know your app will not use more than 64-128GB, then that's fine. Please pay attention to the system: it has 6 PCI-E slots at x8 or higher, and 5 slots are x4. I used to say x4, but Gunter says the P800 controller can drive 1.6GB/s from a x8 slot. If this is a DW/OLAP, then try to balance according to the PCI-E slot, perhaps 2 MSA60 per x8 or x16 slot, and 1 MSA60 per controller on the x4 slots. Maybe 6 P800s for the x8 and x16 slots, plus 4 controllers for the x4 slots, for a total of 16 MSA60. The controllers in the x4 slots need not be P800, but maybe for interchangeability, just do P800 across the board. I do not see why you would need the P600 for internal; the onboard E/P400 might be fine.

September 1, 2008 4:50 PM
 

David Trounce said:

Hey Joe - thanks for the advice. We are now ordering the following:

HP ProLiant DL785 G5 (7U)

8 x Quad core Opteron 8356 (2.3GHz / 75w) processors - this is all that HP had in stock, no 2.5GHz currently available here

128GB fully buffered PC2-5300 DIMM

16 x 146GB 2.5” 10K SAS disks (internal)

2 x HP P400/512/BBWC controllers for internal arrays (one built-in, one PCIe x4)

MS Windows 2008 Server Enterprise x64

MS SQL Server 2008 Enterprise x64

Windows terminal services licenses to host SQL client tools in RDP sessions

Redgate SQL Backup for first-level compressed backups (in our experience, just as effective as Quest Litespeed, and a fraction of the cost)

External disk system:

17 x HP MSA60 external disks, each containing 12 x 146GB 15K 3.5” dual-port SAS disks (34U)

10 x HP P800 controllers, each with Dual Domain I/O for redundancy

This will take up 41U of a 42U rack

My suggestion is that we connect the Dual Domain MSA60s like the following (any advice?)

- Slot 5: P800 in an x16 PCIe:

 -- mini-SAS a): MSA60 #1, which is daisy-chained to MSA60 #2

 -- mini-SAS b): MSA60 #4, which is daisy-chained to MSA60 #3

- Slot 8: P800 in an x16 PCIe:

 -- mini-SAS a): MSA60 #3, which is daisy-chained to MSA60 #4

 -- mini-SAS b): MSA60 #6, which is daisy-chained to MSA60 #5

- Slot 10: P800 in an x16 PCIe

 -- mini-SAS a): MSA60 #5, which is daisy-chained to MSA60 #6

 -- mini-SAS b): MSA60 #2, which is daisy-chained to MSA60 #1

- Slot 1: P800 in an x8 PCIe:

 -- mini-SAS a): MSA60 #7, which is daisy-chained to MSA60 #8

 -- mini-SAS b): MSA60 #10, which is daisy-chained to MSA60 #9

- Slot 9: P800 in an x8 PCIe:

 -- mini-SAS a): MSA60 #9, which is daisy-chained to MSA60 #10

 -- mini-SAS b): MSA60 #12, which is daisy-chained to MSA60 #11

- Slot 11: P800 in an x8 PCIe

 -- mini-SAS a): MSA60 #11, which is daisy-chained to MSA60 #12

 -- mini-SAS b): MSA60 #8, which is daisy-chained to MSA60 #7

- Slot 3: P800 in an x4 PCIe:

 -- mini-SAS a): MSA60 #13, which is daisy-chained to MSA60 #14

 -- mini-SAS b): MSA60 #15

- Slot 4: P800 in an x4 PCIe:

 -- mini-SAS a): MSA60 #15

 -- mini-SAS b): MSA60 #16

- Slot 6: P800 in an x4 PCIe

 -- mini-SAS a): MSA60 #16

 -- mini-SAS b): MSA60 #17

- Slot 7: P800 in an x4 PCIe

 -- mini-SAS a): MSA60 #17

 -- mini-SAS b): MSA60 #14, which is daisy-chained to MSA60 #13

- Slot 2: P400/512 in an x4 PCIe

 -- connected to internal disks 9-16 in optional internal drive cage

One thing: given that a P800 has just 2x external 3Gb/sec SAS ports, I'm wondering how it can drive 1.6GB/sec? Wouldn't it top out at 600MB/sec? (2x 300MB/sec limit of the SAS ports).

Gunter above said he was getting 1.6GB/sec out of a PCIe RAID card, but I'm not sure how he was doing this. Also it wasn't clear that this was an HP P800 vs something else.

We don't have an application per se, we have 10 or so business analysts simultaneously writing ad-hoc SQL against a 10TB data mart. We've tried to minimize row byte count, add indices where appropriate, but we're still running custom calculations and aggregations across billions of records.

September 13, 2008 6:59 PM
 

jchang said:

The best initial advice is to follow the TPC-H report for the DL785; Gunter would be referring to a very (or even exactly) comparable configuration. The external connectors on SAS are x4. Each SAS lane is 3Gbit/s, or approx 300MB/s after signaling overhead, so technically each x4 can do 1.2GB/s. In a x8 PCI-E slot, the realistic achievable rate is 1.6GB/s. (I will try to confirm what a Dell PERC6 can do.) Even in a x16 slot, the internal controller of the P800 might only support x8 paths. Once you have the equipment set up, I can give you some SQL scripts for random and sequential IO. If the results are far off from what Gunter discusses, I will bug him on it. He might have been talking about a special config with the port driver (default is mini-port) and with performance counters disabled (-x from sqlservr.exe), possibly at the OS level as well.
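The bandwidth reasoning in this reply can be laid out as a short calculation. The SAS figures come from the comment itself (x4 external connectors, roughly 300MB/s per lane) and the two external connectors from David's description of the P800. The 250MB/s per PCI-E lane figure is an added assumption: it is the first generation PCI-E rate per lane per direction, and the thread does not state which generation the card uses.

```python
SAS_LANE_MB_S = 300          # approx usable rate of one 3Gbit/s SAS lane
SAS_LANES_PER_CONNECTOR = 4  # external SAS connectors are x4
PCIE_GEN1_LANE_MB_S = 250    # assumption: first generation PCI-E, per lane per direction


def sas_mb_s(connectors):
    """Raw SAS bandwidth across the given number of x4 connectors."""
    return connectors * SAS_LANES_PER_CONNECTOR * SAS_LANE_MB_S


def pcie_mb_s(slot_width):
    """Raw one-direction PCI-E bandwidth for a slot of the given lane width."""
    return slot_width * PCIE_GEN1_LANE_MB_S


one_connector = sas_mb_s(1)    # 1200 MB/s
two_connectors = sas_mb_s(2)   # 2400 MB/s
x8_slot = pcie_mb_s(8)         # 2000 MB/s raw
reported = 1600                # MB/s, the figure Gunter reported

print(f"one x4 SAS connector:  {one_connector} MB/s")
print(f"two x4 SAS connectors: {two_connectors} MB/s")
print(f"x8 PCI-E slot (raw):   {x8_slot} MB/s")
print(f"reported / raw PCI-E:  {reported / x8_slot:.0%}")
```

Under these assumptions the SAS side is not the limit at 1.6GB/s; the x8 PCI-E link is the narrower path, and the reported rate is about 80% of its raw bandwidth.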

September 15, 2008 8:51 AM
 

DonRWatters said:

Hey Joe,

So, we're pretty much done with the DL785 comparison. Unfortunately, the DL785 that we got has the HE chips, so it's not as fast as it would be if it were the higher end chips. HOWEVER, given our workload, it was at least equivalent to the SuperDome in most aspects. There were tweaks here and there for each, but for the most part, moving a workload off of the SuperDome and onto the DL785 resulted in no loss of performance...in some cases things worked better on the DL785, where we could remove the MAXDOP setting and get more CPU cycles working on the same memory. In our workload on the SuperDome, we were CPU constrained and not as much I/O constrained, so getting the DL785 actually helped our cause. Now, I'm not emphatically stating this is the case for all situations, but I've found the DL785 to be a very worthy adversary to the SuperDome. I'd love to post numbers on the comparison, but it really wouldn't be fair, since the SuperDome and DL785 were using different I/O subsystems, HBAs, NICs, etc.

September 23, 2008 11:24 PM
 

jchang said:

If you can find two or more large queries (>20 CPU-sec) that run mostly in memory, then the HBAs & NICs don't matter. Be sure to run at the same MAXDOP; try 1, 2, 4, 8, 16, 32 (or 8, 16, 32). Use SET STATISTICS TIME ON to get both CPU and elapsed time (Duration). My recollection was that the Opteron around 2.2GHz has better in-memory performance than the Itanium 1.6GHz for large queries (the SPEC CPU int for both are about the same). Small queries at high volume benefit from the big Itanium cache; large queries don't care. Both the DL785 and SuperDome have ultra-high bandwidth IO if utilized. Let's see what Intel can do GHz-wise with the quad-core Tukwila. Itanium reached 1.6GHz on 130nm; at 90 and 65nm it should be faster, except there is not sufficient thermal headroom for dual and quad cores.
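A small helper can generate the T-SQL for such a MAXDOP sweep so that the same query text is run at every setting. This is only a sketch: the function name is made up, the table in the usage example is a hypothetical placeholder, and it assumes the query is a single statement with no OPTION clause of its own.

```python
def maxdop_test_script(query, dops=(1, 2, 4, 8, 16, 32)):
    """Build a T-SQL script that runs `query` once per MAXDOP value.

    SET STATISTICS TIME ON makes SQL Server report CPU time and elapsed
    time for each statement, so the runs can be compared afterwards.
    """
    body = query.strip().rstrip(";")
    lines = ["SET STATISTICS TIME ON;"]
    for dop in dops:
        lines.append(f"-- MAXDOP {dop}")
        lines.append(f"{body} OPTION (MAXDOP {dop});")
    lines.append("SET STATISTICS TIME OFF;")
    return "\n".join(lines)


# dbo.BigTable is a hypothetical table name used only for illustration.
print(maxdop_test_script("SELECT COUNT(*) FROM dbo.BigTable;", dops=(8, 16, 32)))
```

Run the generated script on both systems with the data already in memory, then compare CPU and elapsed time at matching MAXDOP values.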

September 24, 2008 9:32 AM
 

KWADJ said:

JOE

How much would it cost to have a DL785 G5 with 8x2.8GHz 6-core and 64x4GB RAM?

Thanks

September 12, 2012 6:06 PM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
