
Joe Chang

Storage Performance

Storage has changed dramatically over the last three years, driven by SSD developments. Most of the key components necessary for a powerful storage system are available, and the cost is highly favorable for placing data files directly on SSD. Some additional infrastructure elements could greatly enhance the flexibility of storage systems with SSDs. There is still some discussion on whether SSDs should interface directly to PCI-E or continue using the SAS/SATA interfaces originally designed for hard disks. New products coming this year include Express Bay, an ecosystem of connectors allowing both PCI-E and SAS/SATA to co-exist until a clear direction is established. Also expected in the coming year are PCI-E SSDs based on the NVM Express interface.

System Level

The Intel Xeon E5 processors, codename Sandy Bridge-EP, have 40 PCI-E gen 3 lanes on each processor socket. Even though PCI-E gen 3 signals at 8GT/s versus 5GT/s for gen 2, the change from 8b/10b to 128b/130b encoding means the usable bandwidth is roughly double that of PCI-E gen 2. The net realizable bandwidth of a PCI-E gen 3 x8 slot is 6.4GB/s versus 3.2GB/s for gen 2.
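As a rough sanity check of those numbers (a sketch; the ~80% protocol efficiency factor is my assumption, not a published figure), the arithmetic can be run as a trivial query:

-- gen 2 uses 8b/10b encoding, gen 3 uses 128b/130b; 0.8 is an assumed protocol efficiency factor
SELECT 5e0 * 8/10    * 8 * 0.8 / 8 AS [PCI-E gen2 x8 GB/s],   -- ~3.2
       8e0 * 128/130 * 8 * 0.8 / 8 AS [PCI-E gen3 x8 GB/s]    -- ~6.3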

The unfortunate aspect is that the major server vendors all implement a mix of x16 and x4 slots, while the HBA vendors seem to be concentrating on products for PCI-E x8. Only Supermicro has a system with 10 PCI-E gen 3 x8 slots. Could a vendor put 2 HBA/RAID controllers designed for x8 onto a single card for a x16 slot? Perhaps the new Express Bay form factor will provide some means to use x16 slots.

[Diagram: Sandy Bridge EP 2-socket system PCI-E layout]

Another disappointment is that the 4-socket Xeon E5-46xx systems only connect half of the PCI-E lanes. This might be because the base system configuration is 2-socket populated. If a full set of slots were provided, there would be no connection to half of the slots unless all four sockets are populated. But this is also an issue on the 2-socket systems if only 1 socket is populated.

Direct-Attach

For the most part, I will discuss direct-attach storage configuration, as we can pick and choose among the latest components available. Technically, direct-attach with SAS can support a 2-node cluster, but few system vendors promote this configuration. Dell sells the MD3200 as direct-attach storage supporting 4 hosts in a cluster (or not), but technically it is a SAN that just happens to use SAS interfaces on both the front-end and back-end.

The objective in the baseline storage configuration below is to achieve very high IO bandwidth even in the low capacity configuration. Of course it will also have very high IOPS capability because the main elements are SSD. My recommended storage system has both SSD and HDD in each IO channel.

[Diagram: baseline direct-attach storage configuration with SSD and HDD on each IO channel]

The intent is to place the main databases on SSD and use the HDD for backups and for restore verification. For an Inmon style data warehouse, the HDD might also be used for older data. The reason for having both SSD and HDD on each IO channel is to take advantage of simultaneous bi-directional IO. On a database backup, the IO system reads from SSD and simultaneously writes to HDD.

4 RAID Controllers, 4GB/s per controller, 16GB/s total IO bandwidth
4 Disk enclosures (yes, I am showing 8 enclosures in the diagram above)
4 x 16 = 64 SSD
4 x 8 = 32 (10K) HDD

The standard 2U enclosure has 24 x 15mm bays. My preference is a mix of 16 SSDs and 8 HDDs. With small capacity 100GB SSDs, there will be 1.6TB per enclosure and 6.4TB over 4 enclosures before RAID. In 7+1 RAID 5 groups, net capacity is 5.6TB across the 4 enclosures; in 3+1 RAID groups, it is 4.8TB. The goal is 4GB/s per controller because the SAS infrastructure is still 6Gbps, supporting 2.2GB/s on each x4 port. With 16 SSDs per controller, each SSD needs to support 250MB/s. Most recent enterprise class SSDs are rated for well over 300MB/s per unit, allowing a large margin. Another option is to configure 12 SSDs per controller, expecting each SSD to support 333MB/s.

The cost structure for the above is as follows:
  RAID controller $1K
  2U Enclosure $3K
  Intel SSD DC S3700 100GB SSD $235 x 16 = $3760 (1.6TB raw, 1.4TB in 7+1 RAID 5)
  Seagate Savvio 600GB 10K HDD $400 x 8 = $3200.

This works out to $11K per unit, or $44K for the set of 4; the 16 x 100GB SSDs contribute $3760 of each unit's cost. With 800GB SSDs, the 7+1 RAID 5 capacity is 44.8TB at a cost of about $148K.

At the maximum expansion of 4 enclosures per RAID controller, capacity is 170TB at a cost of $590K. Of course at this level, I would elect a system with more PCI-E slots for greater IO bandwidth. Another option is a RAID controller with 4 x4 SAS ports; unfortunately none of these currently has 4 external ports.

While the Intel SSD DC S3700 drew strong reviews for pioneering consistency of IO performance rather than peak performance, it is only available with a SATA interface. Micron has announced the P410m with similar specifications but with a SAS interface. It is listed on the Micron website as in production, though probably only for OEM customers. There are other enterprise grade, high endurance MLC SSDs with SAS interfaces as well.

Note: I do not recommend anything less than 10K HDD even to support database backups. The 10K HDDs are not particularly expensive as direct-attach components ($400 for the 600GB model). Only SAN vendors sell $400 HDDs for $2K or more.

SAS 12Gbps Enclosures

Disk enclosures supporting SAS at 12Gbps might become available as early as this year. Each 12Gbps SAS x4 uplink and downlink port would then support 4GB/s. The RAID controller (HBA) can support 6GB/s+ in a PCI-E gen 3 x8 slot. The system with 4 RAID controllers could then deliver 24GB/s instead of 16GB/s. At 16 SSDs per controller, this would require 400MB/s per SSD. While SSDs are rated as high as 550MB/s, achieving the full aggregate bandwidth in an array is not necessarily practical, so 400MB/s per SSD in an array is a more reasonable expectation. Also, enterprise SAS SSDs may only be rated to 400MB/s.

We should not need 12Gbps SAS SSDs or HDDs in the near future (but 8 NAND channels is a good match for a 1.1GB/s interface). The internal wires in the enclosure connect through a SAS expander. The IO from each device bay can signal at 6Gbps, then uplink to the HBA at 12Gbps, assuming that packets are buffered on the expander.


The standard 2U disk enclosure today supports 24 or 25 2.5in (SFF) bays of 15mm thickness. This is the dimension of an enterprise class 10K or 15K HDD with up to 3 platters. The older full size notebook used a 9.5mm HDD supporting 2 platters; thin notebooks used a 7mm HDD restricted to a single platter. There is no particular reason for an SSD to be more than 7mm.

It would be better if the new 12Gbps SAS enclosures support more than 24 bays. My preference is for 16 x 15mm and 16 x 7mm bays. The key is to match the practically realizable aggregate bandwidth of SSDs to the uplink with some degree of excess. Personally, I would like to discard the SSD case to further reduce thickness.


Another option is to employ the NGFF form factor, perhaps a 1U stick at 5mm thick or less. There could then be 2 rows of 24 NGFF SSDs in addition to the 16 x 15mm bays.

I believe that the all-SSD idea is misguided. SSDs are wonderful, but HDDs still have an important role. One example is having HDDs available for backups and restores. I want local HDDs for backups because very few people know how to configure multiple parallel 10GbE network transmission, not to mention adequate IO bandwidth on the backup target system.

A database backup that has not actually been verified to restore (with recovery) is a potentially useless backup. Having HDDs for backup and restore verification preserves the write endurance of the SSDs, which allows the use of high-endurance MLC instead of SLC. In some cases, it might even be possible to use consumer grade MLC, but only if the database maintenance and data organization strategy is architected to minimize wear on the SSDs.
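As a sketch of what that looks like in practice (database name, drive letters, and the BUFFERCOUNT/MAXTRANSFERSIZE values are hypothetical), the backup can be striped across several of the HDD volumes and at least checked with RESTORE VERIFYONLY, though as noted above only an actual restore proves the backup:

-- Stripe the backup over files on the HDD volumes; reads come from SSD, writes go to HDD
BACKUP DATABASE MyDB
TO DISK = 'H:\Backup\MyDB_1.bak', DISK = 'I:\Backup\MyDB_2.bak',
   DISK = 'J:\Backup\MyDB_3.bak', DISK = 'K:\Backup\MyDB_4.bak'
WITH CHECKSUM, COMPRESSION, BUFFERCOUNT = 64, MAXTRANSFERSIZE = 4194304
GO
-- Media and page-checksum verification only; schedule a real restore (with recovery) to be sure
RESTORE VERIFYONLY
FROM DISK = 'H:\Backup\MyDB_1.bak', DISK = 'I:\Backup\MyDB_2.bak',
     DISK = 'J:\Backup\MyDB_3.bak', DISK = 'K:\Backup\MyDB_4.bak'
WITH CHECKSUM
GO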

PCI-E SSD

Some of the discussion on PCI-E versus SATA/SAS interfaces for the NAND/flash controller incorrectly focuses on the bandwidth of a single 6Gbps lane versus 4 or 8 lanes of PCI-E. It is correct that PCI-E was designed to distribute traffic over multiple lanes and that hard drives were never expected to exceed the bandwidth of a single lane at the contemporary SATA/SAS signaling rate. The transmission delay of the extra silicon hop, from a NAND controller with a SATA interface through a SATA-to-PCI-E bridge chip, is on the order of 50ns, which is inconsequential compared with the 25-50µsec access time of NAND.

The more relevant matter is matching NAND bandwidth to the upstream bandwidth. All (or almost all?) SATA interface flash controllers have 8 NAND channels. Back when SATA was 3Gbps and NAND was 40-50MB/s, 8 channels feeding the 280MB/s net bandwidth of SATA 3G was a good match. About the time SATA moved to 6Gbps, NAND at 100-133MB/s became available, so 8 channels was still a good choice.

NAND is now at 200 and 333MB/s, while SATA is still 6Gbps. The nature of silicon cost structure is such that there is only a minor cost reduction in building a 4 channel flash controller, and an 8 channel controller only requires a 256-pin package.

The PCI-E flash controllers have been designed with 32 NAND channels. The IDT 32-channel controller has 1517 pins, which is not excessively difficult or expensive for a high-end server product; note that the Intel Xeon E5 processors use a 2011-pin socket. As noted earlier, a PCI-E gen 3 x8 port supports 6.4GB/s, so over 32 channels each channel needs to provide 200MB/s. The new 333MB/s NAND is probably a better fit to sustain the full PCI-E gen 3 x8 bandwidth after RAID (now RAIN, because disks are replaced by NAND).


Based on a 64Gbit die and 8 die per package, a package has 64GB raw capacity. A 32-channel PCI-E SSD with 1 package per channel would have 2TB raw capacity (net capacity, with a factor of 0.78 for over-provisioning and 0.875 for RAIN, would be about 1400GB) versus 512GB on an 8-channel SATA/SAS SSD. The IDT document states capacity as 4TB raw for their 32-channel controller, so perhaps it allows 2 packages per channel? The Micron datasheet mentions 32 channels and 64 placements.

As it stands today, a PCI-E SSD can achieve maximum bandwidth at lower NAND capacity and in a more compact form factor than SAS SSDs. On the other hand, SAS infrastructure provides flexible expansion: capacity can be increased without replacing existing devices. Some systems support hot-swap PCI-E slots, but the orientation of the connector in the system chassis makes this a complicated matter. The implication is that PCI-E slot SSDs are highly suitable for high density requirements with limited expansion needs. One database server example is placing tempdb on SSD.

NVM Express

The new generation of PCI-E SSDs may employ the NVMe interface standard. There is a standard driver for Windows and other operating systems, which will later be incorporated into the OS distribution media, allowing boot from an NVMe device just as with SATA devices today. This is mostly a client side feature.

For the server side, the NVMe driver is designed for massive bandwidth and IOPS. There can be up to 64K queues, each with up to 64K commands (outstanding IO requests). The driver is designed for IO to be both highly efficient in CPU cycles and scalable on NUMA systems with very many processor cores.


Express Bay

To promote the growth of SSD without betting on which interface wins, the Express Bay standard defines a connector that can support both PCI-E and SATA or SAS. Some Dell servers today support PCI-E SSDs in the 2.5in HDD form factor (SFF), but I am not sure if this is Express Bay. This form factor will allow PCI-E devices to be hot-swapped with the same ease as SAS devices today.


PCI-E Switches

As mentioned earlier, the PCI-E slot arrangement in server systems does not facilitate hot-add, even if it is supported. Existing PCI-E SSDs also do not provide a mechanism for capacity expansion, aside from adding a card to an empty slot or replacing an existing card.

Of course, there are PCI-E switches, just like SAS expanders. A 64 lane PCI-E switch could connect 5 x8 PCI-E devices over a x8 upstream link. Other possibilities are a x16 upstream link supporting 12.8GB/s to the host with 4 x8 ports for SSDs, or 8 x4 ports for finer grained expansion. It may also be possible to support multiple hosts, as in a clustered storage arrangement.


SAN Configuration

Below is a representation of a typical configuration sold to customers by SAN vendors. I am not joking: it is common to find only 2 FC or FCoE ports on each host. The most astounding case was a SAN with 240 x 15K disks and 2 single port FC HBAs in the server. Even though the storage system service processors had 4 FC ports each (and the FC switches had 48 ports), only 1 port on each SP was connected. Obviously the storage engineer understood single component and path failure fault-tolerant design. It was just too bad he built a fault-tolerant garden hose when a fire hose was needed.

[Diagram: typical SAN configuration with only 2 FC ports per host]

As I understand it, what happened was that the SAN engineer asked how much space was needed for the databases, accounting for growth, and then created one volume for it. The Windows OS does support multi-path IO; originally the storage vendor provided the MPIO driver, but now it is managed by Microsoft. Apparently it was not considered that even with MPIO, all IO for a single volume has a primary path, and the secondary path is only used when the primary is not available.

High-bandwidth SAN Configuration

A proper SAN configuration for both OLTP and DW database servers is shown below. Traditionally, a transaction processing database generates small block random IO (2KB in the old days, 8KB since SQL Server 7). As it was difficult to get 10K IOPS (x8KB = 80MB/s), it was thought that IO bandwidth was not a requirement. This was 15 years ago. Apparently the SAN vendors read documents from this period, but not more recently, hence their tunnel vision on IOPS, ignoring bandwidth.

For the last 10 or more years, people have been running large queries on the OLTP system. I have noticed that report queries which saturate the storage IO channels can essentially shut down transaction processing. This is because the report query generates asynchronous IO at high queue depth, while the transaction queries issue synchronous IO at queue depth 1. And the report may escalate to a table lock (or it may use nolock). Furthermore, it is desirable to be able to back up and restore the transaction database quickly. That means bandwidth.
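One way to see this effect while a large report query is running is to look at the outstanding IO requests per file (a sketch using the standard DMVs; the join on the file handle is the documented pattern):

-- Pending IOs per database file; a saturating scan shows deep queues, OLTP IO runs at depth ~1
SELECT DB_NAME(vfs.database_id) AS db, vfs.file_id,
       pio.io_type, pio.io_pending, pio.io_pending_ms_ticks
FROM sys.dm_io_pending_io_requests AS pio
JOIN sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
  ON pio.io_handle = vfs.file_handle
ORDER BY db, vfs.file_id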

Note that the system below shows 8Gbps FC, not 10Gbps FCoE. A single 10Gbps FCoE port may have more bandwidth than a single 8Gbps FC port, but no serious storage system will have fewer than 4 or even 8 ports. Apparently FCoE currently does not scale well over multiple ports, due to the overhead of handling Ethernet packets; an Intel IDF 2012 session mentions that this will be solved in the next generation.

[Diagram: high-bandwidth SAN configuration with 8 x 8Gbps FC ports between host and storage]

The above diagram shows 8 x 8Gbps FC ports between host and storage system. Each 8Gbps FC port can support 700MB/s for a system IO bandwidth target of 5.6GB/s. An OLTP system that handles very high transaction volume may benefit from a dedicated HBA and FC ports for log traffic. This would allow the log HBA to be configured for low latency, and the data HBA to be configured with interrupt moderation and high throughput.

An alternate SAN configuration for SQL Server 2012 is shown below with local SSD for tempdb.

[Diagram: SAN configuration for SQL Server 2012 with local SSD for tempdb]

The write cache on a SAN must be mirrored for fault tolerance. There is very little detail on the bandwidth capability of the link between controllers (or SP) on SAN systems, beyond what can be deduced from the fact that the sustained write bandwidth is much lower than the read bandwidth. So keeping tempdb off the SAN should preserve IO write bandwidth for traffic that actually needs protection.

The number of volumes for data and temp should be some multiple of 8. It would be nice to have 1 volume for each FC path. However, we do need to consider how SQL Server places extents over multiple files. This favors RAID groups of 4 disks.
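A sketch of what this looks like at the database level, with one data file per volume so that round-robin allocation spreads IO across the FC paths (database name, logical names, sizes, and mount points are hypothetical; repeat the pattern out to 8 files to match 8 paths):

-- Equal sizes and growth so proportional fill keeps the files evenly loaded
ALTER DATABASE MyDB ADD FILEGROUP DataFG
GO
ALTER DATABASE MyDB ADD FILE
  (NAME = Data1, FILENAME = 'D:\Data1\MyDB_1.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
  (NAME = Data2, FILENAME = 'E:\Data2\MyDB_2.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
  (NAME = Data3, FILENAME = 'F:\Data3\MyDB_3.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
  (NAME = Data4, FILENAME = 'G:\Data4\MyDB_4.ndf', SIZE = 100GB, FILEGROWTH = 10GB)
TO FILEGROUP DataFG
GO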

File Layout

In an HDD storage system, the objective for bandwidth is to simultaneously issue large block IO to all data disks across all IO channels. A 256K block is sufficiently large to generate 100MB/s per disk (400 IOPS, not random). If this is issued at low queue depth (2, perhaps), then the storage system can generate high IO bandwidth and still be perceptually responsive to other requests for small block IO.

For small block random IO, it is only necessary to distribute IO over all hard disks with reasonable uniformity.

The file layout strategy has two objectives. The first is to not overwhelm any single IO channel. In direct-attach storage this is not a problem, as the smallest pipe is a x4 SAS port at 2GB/s. In a SAN, even using 8Gbps FC, it is a concern, as an 8Gb FC port can support only 700-760MB/s. Although 10Gb FCoE seems to have higher bandwidth, it may not scale with the number of channels as well as straight FC. The new Intel Xeon E5 (Sandy Bridge-EP) processors may be able to scale 10Gb FCoE with Data Direct IO (DDIO), but this needs to be verified.

The second is to ensure IO goes to every disk in the RAID Group (or volume). By default, SQL Server allocates a single 64K extent from each file before round-robin allocating from the next file. This might be the reason that many SAN systems generate only 10MB/s per disk (150 IOPS at 64K), along with no read-ahead.

[Diagram: default allocation of a single 64K extent per file in round-robin]

The -E startup flag instructs SQL Server to allocate up to 4 consecutive extents from a file before proceeding to the next file. See James Rowland-Jones' post Focus on Fast Track: Understanding the -E Startup Parameter for more on this. In a 4-disk RAID group with a 64K stripe size, a 256K IO to the file would then generate a 64K IO to each disk.

[Diagram: allocation with the -E startup flag, 4 consecutive extents per file]

It would be necessary to rebuild indexes before this scheme takes effect. Somewhere it was mentioned that it is important to build indexes with max degree of parallelism limited to either 4 or 8. It might be in the Microsoft Fast Track Data Warehouse Reference Architecture. Start with version 4 for SQL Server 2012, and work backwards?
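A sketch of the rebuild itself, with the table name being a placeholder (dbo.TableA, as in the addendum below):

-- Limiting parallelism gives fewer, larger contiguous index fragments per file
ALTER INDEX ALL ON dbo.TableA
REBUILD WITH (MAXDOP = 4, SORT_IN_TEMPDB = ON)
GO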

Trace Flag 1117 (-T1117) causes all files in a filegroup to grow together. See SQL Server 2008 Trace Flag -T 1117.
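To check whether the flag is active (the flag itself is best set as a startup parameter alongside -E so that it survives restarts):

DBCC TRACESTATUS (1117, -1)
-- Or enable it globally for the current instance lifetime:
-- DBCC TRACEON (1117, -1)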

With SSD, the second objective may not be important, as the SQL Server read-ahead strategy (1024 pages?) should generate IO to all units. On hard disks, generating close-to-sequential IO was important; on SSD, it is sufficient just to generate large block IO, with 64K counting as large.

Summary

The old concept of distributing IO over both devices and channels still applies. The recent pricing of SSD is sufficiently low to warrant serious consideration ($2-3K/TB for eMLC). While there is more flexibility in SSD configuration, it is still necessary to validate performance characteristics with real queries against an actual database; SQLIO or other synthetic tests are not sufficient. If the SAN vendor advised on the configuration, then chances are IO bandwidth will not be good.

Addendum
If anyone thinks I am being unfair to or overly critical of SAN vendors, do the following test.
Find the biggest table in your database, excluding LOB fields.
Run this:
DBCC DROPCLEANBUFFERS
GO
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
SELECT COUNT(*) FROM TableA WITH (INDEX(0))
GO

Then compute 8 (KB/page) * (physical reads + read-ahead reads) / (elapsed time in ms); since KB per ms is MB per second, the result is in MB/s.
Is this closer to 700 MB/s or 4GB/s? What did your SAN vendor tell you?
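For a per-file view of the same test, the virtual file stats DMV can be sampled before and after the scan; the deltas of bytes read divided by the elapsed time give per-file and total throughput (a sketch, scoped to the current database):

-- Snapshot before and after the COUNT(*) scan; the difference is what the scan read
SELECT DB_NAME(database_id) AS db, file_id,
       num_of_bytes_read / 1048576.0 AS mb_read,
       num_of_reads, io_stall_read_ms
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL)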

I am also not a fan of SSD caching or auto-tiering on critical databases, meaning the database that runs your business and is managed by one or more full-time DBAs. In other applications, there may not be a way to segregate the placement of hot data from inactive data, but in SQL Server there are filegroups and partitioning; we have all the necessary means of isolating and placing hot data wherever we want it. SSD caching or auto-tiering will probably require SLC NAND. With active management using database controls, we should be able to use HET MLC or even ordinary MLC.

I stress the importance of analyzing the complete system and how it will be used instead of over-focusing on the components. There are criteria that might be of interest when there is only a single device or even single HBA. Today it is possible to over-configure the storage performance without unwarranted expense, and this is best accomplished by watching the big picture.

Adaptec reports that their Series 7 SAS RAID controller (72405: PCI-E gen 3 x8 on the upstream side and 6 x4 SAS 6Gbps ports), using the PMC PM8015 controller, can do 500K IOPS and 6.6GB/s.

I will keep this topic up to date on www.qdpma.com Storage 2013

related posts on storage:
io-queue-depth-strategy (2010-08)
data-log-and-temp-file-placement (2010-03)
io-cost-structure-preparing-for-ssd-arrays (2008-09)
storage-performance-for-sql-server (2008-03)

ps,
If you are using SQL Server 2012 clustering on a SAN, I do suggest placing tempdb on local SSD, making use of the new 2012 feature that does not require tempdb to be on shared storage. Keep in mind that on the SAN, writes must be mirrored between two storage processors for fault recovery, and this is not a cheap thing to do. We should plan to redo whatever was using tempdb at the time of a failover.
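A sketch of the move itself (default tempdb logical file names, hypothetical local SSD path; the change takes effect at the next service restart, and on a cluster the same local path must exist on every node):

ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'S:\TempDB\tempdb.mdf')
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'S:\TempDB\templog.ldf')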

Published Monday, March 25, 2013 12:52 AM by jchang


Comments

 

Manson said:

Greatest DB storage analysis ever seen!

March 25, 2013 4:58 AM
 

tobi said:

"Somewhere it was mentioned that it is important to build indexes with max degree of parallelism limited to either 4 or 8."

Project Madison used MAXDOP 1 OFFLINE and one file/filegroup per partition. This seems to be the only reasonable way on SQL Server to ascertain truly sequential layout.

March 25, 2013 9:10 AM
 

jchang said:

MAXDOP 1 was used on index builds in TPC-H benchmark reports because they are fanatical about getting the absolute best performance. Second place by even 0.1% is for losers! So project Madison probably just copied that.

But somebody probably realized that to make such a recommendation in a production environment completely defeats the validity of claiming SQL Server is an enterprise database system. There are times we have to settle for good enough. This is why I suggested looking to more recent sources.

I believe that some of the TPC-H reports completely abandoned RAID groups altogether, going JBOD to get the 256K IO to individual disks.

So another great thing about SSD: no more low MAXDOP index builds. Of course, building compressed indexes is still dog slow, with the final write being single threaded.

1 File & FG per partition??? then someone was very confused about table scan and IO

March 25, 2013 9:43 AM
 

Pei said:

Joe, just curious: how do you draw these diagrams and charts?

March 25, 2013 12:15 PM
 

jchang said:

I do freely copy diagrams from other slidedecks, no sense reinventing the wheel. As much as practical, I try to attribute the source. If it is obvious the material is from a vendor, then I may not bother, especially if it is in their interest. Of course vendors do not always appreciate that when I say their product is not obviously stupid, that's actually a compliment.

My own diagrams are done in PowerPoint, not best for heavy duty detail, but what else is there? I have thought about just using Visual Studio and C# to do detail and iterations, so all positions are calculated. I did this type of thing way back in the old days. In the 80's I probably had the best quality computer generated Smith chart.

March 25, 2013 12:52 PM
 

2B or not 2b said:

4GBps != 4Gbps

Are you mixing your B's with your b's in some places? It is easy to do. You are not alone, hence I love it when authors spell out Bytes and bits rather than risking mixing B's with b's.

March 25, 2013 4:47 PM
 

jchang said:

I never cross my B and b like some rookie or clueless (marketing) twit, so I do not feel the need to spell it out often. And I know when G means 10^9 versus 2^30, so I don't use that other silly annotation.

FC was once 4Gbps, but I do not talk about that. The only other place 4 comes up is with the net bandwidth of 2 x4 SAS 6Gbps or 1 x4 12Gbps SAS, which is 4.4GB/s, but I use 4GB/s for brevity.

March 25, 2013 5:17 PM
 

inawe said:

Joe,

This is fantastic!

Thank you.

March 25, 2013 5:17 PM
 

Ajith@Dell said:

With regards to your comment on benchmarking, "Is this closer to 700 MB/s or 4GB/s? What did your SAN vendor tell you?" Are you saying you have seen closer to 4GB/s on optimal configurations?

By benchmarking within SQL you seem to be automatically trusting MS SQL as not the performance bottleneck. Isn't something like IOMeter or SQLIOSim a better judge of the SAN performance?

March 25, 2013 5:39 PM
 

jchang said:

I only ask whether the simple test produces 0.7 or 4GB/s. I never said 4GB/s was optimal. But if the SAN vendor was involved, then chances are your config will be less than 1GB/s. The last storage system that I spec'd was 4 Dell MD3200, each with 2 MD1200, for a total of 36 disks, 32 for data, on each MD3200. Because the database to be migrated already had a single 3TB file, there was no way we could build 8 x 4 disk RAID groups on each MD3200 to get near 100MB/s per disk, for 2.8GB/s per MD3200. But with 2 x 16 disk RGs on each MD3200, we could get a combined 5GB/s over 4 units. A table scan of the 2TB table that would have sunk the old iSCSI storage barely caused a blip on the new system - in terms of response time and throughput of concurrently running transactions.

The reason I cited running SQL oriented tests is that I already know what the pure IO capability should be for a given configuration. The issue is not whether SQL Server is the bottleneck. The point is that SQL Server generates specific IO patterns, 512K IO per file to heap, 64K-256K to clustered index, etc. some at low queue, others at high queue. So if the IO system cannot deliver IOPS and BW in the manner that SQL Server generates, then it does not matter what the pure IO capability is.

I am also assuming that you know how to generate a disk IO bound query, but that is another 1 hour slidedeck, ok, maybe 2 or 3 hr. (I did say in another post that assumption is the mother of all f@ck#ps, but that is from a Seagal movie?)

One more thing. Some of the fancy storage system do extensive compression. Most of our IO tools were created before this, hence may generate IO using uniform data - all zeros. One vendor claimed outstanding results, but refused to test on an actual database with data of normal compressibility, especially after the implication of what I just said sank in.

Does not SQLIOSim use a 3GB test file? so all the IO goes to the SAN cache? This is why I insist on the full size database. For my customers, I look at their tables to figure out a set of queries that will generate specific IO patterns. So perhaps I should post on that later?

March 25, 2013 6:10 PM
 

David @ Dell said:

Fascinating blog, and interesting analyses, especially the comment about auto-tiering when looking at a small number of DB instances. What's your view on scaling that approach to 2,000 DB instances or more without linearly scaling the DBA and infrastructure teams?

March 26, 2013 5:56 AM
 

KKline said:

Outstanding analysis, as always, Joe.

Just curious if you've looked at some of the new inexpensive iSCSI SSD appliances like ViSX from AstuteNetworks?

I'm seeing a lot more of the SSD and hybrid HDD/SSD vendors exhibiting at SQL Saturdays and PASS events. So there's evidently a strong need in the community for simple, plug-and-play high speed storage.

Thanks,

-Kev

March 26, 2013 7:53 AM
 

jchang said:

Somewhere above, I should have restricted the discussion to the main line-of-business systems that are actively managed by one or more dedicated DBAs, hence the DBA should be able to isolate hot data.

Too often, organizations have separate teams of DBAs and SAN admins. The SAN admins treat all hosts alike and want to make no additional effort for the main DB, hence resorting to SAN management features, per their training/indoctrination.

If we have a small number of instances with very many databases, we could still depend on a large buffer cache to reduce IO. If we have very many instances, then that is out of the scope of my discussion. If there are many instances on one system, then we must carefully allocate memory between the instances and other processes because a free-for-all memory fight does not work.

We might ask how there got to be thousands of instances. I have seen 2 situations. One is a hosting company. Another is one where corporate IT was so locked into their own mission, making rules on what you cannot do, throwing up so many obstacles to getting access to corporate IT resources, that individual departments said the heck with that and set up their own departmental apps, with their own SQL instances. These were outside corporate IT and hence not backed up. At some point the data was lost.

So now corporate IT must own all SQL instances, which they cannot actively manage, so they must rely on SAN management tools like auto-tiering. Of course, if auto-tiering works, there must be a few highly active DBs and many other inactive ones.

Essentially this is not one integrated company but many fiefdoms, meaning they cannot leverage the economies of scale that are normally one of the advantages of a large company. I would watch for signs that the company is in trouble, and bail out before it all comes crashing down.

Kevin, like I said above, I focus on large DBs. The problem with iSCSI as used is that most people do not want to install 10GbE, meaning we have the tiny 1GbE pipes. Then people do not know how to do parallel networking. I saw one company that bought 2 sets of 2 EqualLogic systems fully populated with the expensive SLC SSDs, all going through 2 x 1GbE pipes, with all data going through 1 of the 2. Simple plug-and-play is not high BW. If you want serious BW, it has to be PCI-E to SAS. Do my IO test. Vendors like to talk about how wonderful everything will be when you buy their product, then they disappear.

I should add that on Windows Server 2012, it is a whole different ball game with SMB 3. I would like to look at a Windows Server 2012 system with the storage I described above, with DB servers connected to it using InfiniBand.

March 26, 2013 9:44 AM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
