THE SQL Server Blog Spot on the Web


Joe Chang

Solid-State Storage

After years of anticipation and false starts, the SSD is finally ready to take a featured role in database server storage. There were false starts because NAND flash is very different from hard disks and cannot simply be dropped into a storage device and infrastructure built around hard disk characteristics. Too many simple(ton) people became entranced by the headline specifications of NAND-based SSD, usually random IOPS and latency. It is always the details in small print (or outright omitted) that are critical. Now, enough of the supporting technologies to use NAND-based SSD in database storage systems are in place, and more are coming soon. (qdpma storage will be the collection point for my storage writings.)

Random IO performance has long been the laggard in computer system performance. Processor performance has improved along the 40% per year rate of Moore's law. Memory capacity has grown at around 27% per year (memory bandwidth has kept pace, but not memory latency). Hard disk drive capacity for a while grew at 50%-plus per year. Even HDD sequential transfer rates have increased at a healthy pace, from around 5MB/s to 200MB/s over the last 15 years. However, random IOPS have only tripled over the same 15-year period, as rotational speeds went from 5400RPM to 15K. The wait for SSD to finally break the random IOPS stranglehold has been long, but it is finally taking place.

We should expect three broad lines of progress in the next few years. One is the use of SSD to supplement or replace HDD in key functions. Second is a complete redesign of storage system architecture around SSD capabilities, with consideration that high-capacity HDD is still useful. Third, it is time to completely rethink the role of memory and storage in server system architecture.

A quick survey of SSD products is helpful to database professionals because of the critical dependency on storage performance. However, it quickly becomes apparent that it is also necessary to provide at minimum a brief explanation of the underlying NAND flash, including the proliferation of SLC, MLC and eMLC variants. Next are the technologies necessary to implement high-performance storage from NAND flash. The Open NAND Flash Interface (ONFI) industry workgroup is important in this regard. This progresses to the integration of SSD in storage systems, including form factor and interface strategies. From here we can form a picture of the SSD products available, and develop a plan to implement SSD where appropriate.

Non-Volatile Memory

To take the place of hard drives in a computer system, a storage technology should be non-volatile, so that information is retained on power shutdown. Of the NV-memory technologies, NAND flash is the most prevalent in hard-disk alternative/replacement storage devices. NOR flash has special characteristics, suitable for executable code. Other non-volatile memories include Magneto-resistive RAM, Spin-Torque Transfer, and Memristor. Phase-Change Memory shows promise with finer granularity and lower read latency.

NAND Flash

The Micron NAND website is a good source of information on NAND. Wikipedia has a description of Flash Memory, explaining the fundamentals and the difference between NAND and NOR. The diagrams below from Cyferz show NOR wiring,

and NAND wiring.

A key difference is that NAND has fewer bit (signal) and ground lines, allowing for higher density, hence lower cost per bit (well, today it does not make sense to talk about price per bit, so price per Gbit helps eliminate the leading zeros).

Multi-Level Cell

Sometime in 1997(?), Intel published a paper on multi-level cell for NOR flash, called StrataFlash. At some point, MLC made its way to NAND, supporting 2 bits per cell. A 3-bit cell is currently in development, but this may be more for low-performance applications. MLC has significantly longer program (write) time than SLC.

Intel's 3rd generation SSD, with 25nm NAND from the Intel-Micron Flash Technologies (IMFT) joint venture, will be out soon. Below are the IMFT 34nm 2-bit-per-cell 4GB 172mm² die,

the 25nm 2-bit-per-cell 8GB 167mm² die (from AnandTech),

and the 34nm 3-bit-per-cell 4GB 126mm² die.


A significant portion of the die is for logic?

Numonyx SLC and MLC NAND Specifications

Numonyx (now Micron) has public specification sheets for their NAND chips.

Organization                        x8                                    x16
Type                 Density       Page      Spare  Block  Spare         Page       Spare     Block      Spare
Small page SLC       128M-1G       512 byte  16b    16K    512           256 words  8 words   8K words   256 words
Large page SLC       1G-16G        2 Kbyte   64b    128K   4K            1K words   32 words  64K words  2K words
Very large page SLC  8G-32G        4 Kbyte   128b   256K   8K(?)         -          -         -          -
Very large page MLC  16G-64G       4 Kbyte   224b   512K   28K           -          -         -          -


Type Density Random Access Page Program Block erase ONFI
SLC 128M-1G 12μs 200μs 2ms ?
SLC 2-16G 25μs 200μs 1.5ms 1.0
SLC 8-64G 25μs 500μs 1.5ms ?
MLC 16-64G 60μs 800μs 2.5ms ?

The time for each subsequent byte/word is cited as 25ns, for a 40MHz rate. SLC is typically rated for 100K cycles, and MLC for 5,000 cycles. The (older) lower-capacity SLC chips have 512-byte pages.

NAND Organization

I am not sure about this, but I understand the NAND chip itself can be referred to as a target. The chip is divided into planes; the die in the pictures above have 4 or 8 planes. Is a plane also a logical unit, or is a logical unit below a plane? Below a logical unit is the block, and below that the page. So the NAND organization appears to be: target (one or more logical units), logical unit (the chip?, with 2 planes that may support interleaved addressing), block, page.


Block Erase, Garbage Collection and Write Amplification

After NAND became the solid-state component of choice, the industry started to learn the many quirks and nuances of NAND SSD behavior. NAND must be erased an entire block at a time (2,000μs?). A write (or program) must be to an erased block.


The Wikipedia Write Amplification article explains in detail the additional write overhead due to garbage collection. Write Amplification = Flash Writes / Host Writes. Small random writes increase WA. Write amplification can be kept to a minimum with over-provisioning.


The block erase requirement has a significant impact on write performance. Writes to SLC were not fast to begin with, writes to MLC are much slower still (800 versus 200-500μs), and on top of this, the block erase requirement can result in erratic write performance depending on the availability of free blocks. The write performance issues caused by the block erase requirement can be mitigated with over-provisioning.
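The over-provisioning effect can be sketched with a simple model. This is an illustrative textbook-style estimate (my own simplification, not any vendor's formula): under uniform random writes, the garbage-collected victim block is still partially full of valid pages, each of which must be copied forward before the erase.

```python
# Write amplification (WA) = flash writes / host writes, as defined above.

def write_amplification(host_pages, gc_pages_rewritten):
    """Measured WA: total pages programmed divided by pages the host wrote."""
    return (host_pages + gc_pages_rewritten) / host_pages

def wa_estimate(over_provisioning):
    """Crude steady-state estimate for uniform random writes (an assumed
    model, not a vendor spec): blocks are reclaimed while still holding
    a fraction u of valid pages, each of which must be copied forward."""
    u = 1.0 / (1.0 + over_provisioning)  # valid fraction of a victim block
    return 1.0 / (1.0 - u)               # 1 host write + u/(1-u) GC copies

for op in (0.07, 0.28, 1.00):  # ~7% consumer, ~28% enterprise, 100%
    print(f"over-provisioning {op:4.0%}: WA ~= {wa_estimate(op):4.1f}")
```

The direction matters more than the exact numbers: doubling the spare area sharply cuts GC copies, which is one reason enterprise drives reserve far more spare capacity than consumer drives.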

Below are slides from the Intel Developer Forum 2010 "Enterprise Solid State Drive (SSD) Endurance", Scott Doyle and Ashok Narayanan.



NAND SSD may exhibit a "bathtub" effect in read-after-write performance. The intuitive expectation is that mixed read-write performance should be close to a linear interpolation between the read and write performance specifications. Without precautions, the mixed performance may be sharply lower than both the pure read and pure write performance specifications.

This example is cited in a STEC Benchmarking Enterprise SSDs report.


Wear and MTBF

Flash NAND also has wear limits. Originally this was 100,000 cycles for SLC and 5-10K for MLC. The write longevity issues of MLC seem to be sufficiently solved with wear leveling. SLC SSD may become relegated to the specialty market.

The fact that NAND SSD has a write-cycle limit suggests that database administration could be adjusted to accommodate this characteristic. If there were some means of determining that an SSD is near the write-cycle limit, active data could be migrated off, and the SSD could be assigned to static data. In an OLTP database, tables could be partitioned, splitting active and archival data. In data warehouses, the historical data should be static.
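The cycle ratings above lend themselves to a back-of-envelope lifetime estimate. A rough sketch, assuming perfect wear leveling, with write amplification as a free parameter (the drive size and daily write rate here are hypothetical, not vendor endurance specs):

```python
def years_to_wearout(capacity_gb, pe_cycles, host_gb_per_day, wa=1.5):
    """Years until the program/erase budget is exhausted, assuming perfect
    wear leveling. wa = write amplification (flash writes / host writes)."""
    pe_budget_gb = capacity_gb * pe_cycles      # total flash writes allowed
    flash_gb_per_day = host_gb_per_day * wa     # GC inflates host writes
    return pe_budget_gb / flash_gb_per_day / 365.0

# A hypothetical 100GB drive absorbing 100GB of host writes per day:
print(f"SLC, 100K cycles: {years_to_wearout(100, 100_000, 100):6.1f} years")
print(f"MLC,   5K cycles: {years_to_wearout(100,   5_000, 100):6.1f} years")
```

Even with pessimistic write amplification, SLC endurance is effectively a non-issue at these rates, while MLC is the case where migrating aging drives to static data, as suggested above, would pay off.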

Flash Translation Layer

Because of the characteristics of NAND flash, such as block erasure and wear limits, a simple direct mapping of logical to physical pages is not feasible. Instead there is a Flash Translation Layer (FTL) in between. Numonyx provides a brief description here. The FTL is implemented in the SSD controller(?), and determines the characteristics of the SSD. Below is a block diagram of the FTL between the file system and NAND.


Another diagram is from the Micron/Numonyx NAND Flash Translation Layer (NFTL) 4.5.0 document. This document has a detailed description of the Flash Abstraction Layer, or Translation Module, which incorporates functionality for bad block management, wear leveling and garbage collection.


The strategy for writing to NAND somewhat resembles the database log, and the NetApp Write Anywhere File Layout (WAFL), which is an indication that perhaps a complete re-design of the database data and log architecture could be better suited to solid-state storage.

Error Detection and Correction

NAND density is currently at 128 or 256Gbit per die for 2-bit cells, meaning 64G cells, or 16GB on one die! Is SLC now at 128Gbit? (Never mind: apparently the Numonyx SLC 64Gbit product is 8 x 8Gbit die stacked. Still very impressive at both the die and package level.) One aspect of such high densities is that bit error rates are high. All (high-density?) NAND storage requires sophisticated error detection and correction. The degree of EDC varies between the enterprise and consumer markets.

High Endurance Enterprise NAND

The Micron website describes High-Endurance NAND as

"Enterprise NAND is a high-endurance NAND product family optimized for intensive enterprise applications. Breakthrough endurance, coupled with high capacity and high reliability (through low defect and high cycle rates), make Enterprise NAND an ideal storage solution for transaction-intensive data servers and enterprise SSDs.

Our MLC Enterprise NAND offers an endurance rate of 30,000 WRITE/ERASE cycles, or six times the rate of standard MLC, and SLC Enterprise NAND offers 300,000 cycles, or three times the rate of standard SLC. These parts also support the ONFI 2.1 synchronous interface, which improves data transfer rates by four to five times compared to legacy NAND interfaces."

Enterprise MLC is available up to 256Gbit, and SLC to 128Gbit. I will try to get more information on this.


Below is an interesting combination of SLC and MLC.


Enterprise versus Consumer

This is not just about SLC versus MLC. It is important to start with a high-quality NAND chip with a low bit error rate and longevity. Intel slides from IDF 2010 discuss this, as does Micron for their enterprise-class NAND, SLC and MLC. The NAND chip should have sufficient ECC capability, but it is not clear if any NAND vendors offer chips with additional ECC capability.


Open NAND Flash Interface

ONFI "define standardized component-level interface specifications as well as connector and module form factor specifications for NAND Flash."

ONFI 1.0

The Micron presentation by Michael Abraham, ONFI 2 Source Synchronous Interface Break the I/O Bottleneck, explains both the ONFI 1.0 (2006) and 2.x versions (if the above link does not work, try ONFI presentations). Below is a summary of the Abraham presentation.

In the original ONFI specification, the NAND array had a parallel read that could support 330MB/s bandwidth (8KB in 25μs) with SLC(?), but the interface bandwidth was 40MB/s (the slide deck mentions a 25ns clock, corresponding to 40MHz, but the ONFI website says 1.0 is 50MB/s). Accounting for Array Read plus Data Output, the times are 25 + 211μs for SLC and 50 + 211μs for MLC, for net read bandwidths of 34 and 30MB/s. Net write bandwidth is 17MB/s and 7MB/s respectively. Below is the single-channel IO.

                      Read                             Write
Device        Planes  Data   Array   Output  Net BW    Output  Program  Net BW
SLC 4KB page  2       8KB    25μs    211μs   34MB/s    211μs   250μs    17MB/s
MLC 4KB page  2       8KB    50μs    211μs   30MB/s    211μs   900μs    7MB/s

Note that the write latency is very high relative to hard disk sequential writes, such as transaction log writes. I believe the purpose of the DRAM cache on the SSD controller is to hide this latency. While the bandwidth and latency for NAND at the chip level are not spectacular, both can be substantially improved at the device level with more die per channel, more channels, or both, as illustrated below.


SLC 2-Plane Performance (MB/s): channels vs. die per channel

                 Read                     Write
Die per channel  1    2    4    8         1    2    4    8
1 channel        34   40   40   40        19   38   40   40
2 channels       68   80   80   80        38   76   80   80
4 channels       136  160  160  160       76   152  160  160

MLC 2-Plane Performance (MB/s): channels vs. die per channel

                 Read                     Write
Die per channel  1    2    4    8         1    2    4    8
1 channel        30   40   40   40        7    14   28   40
2 channels       60   80   80   80        14   28   56   80
4 channels       120  160  160  160       28   56   112  160

SLC could achieve near peak performance with 4 channels and 2 die per channel. MLC could also achieve peak read performance with 4 channels and 2 die per channel, but peak write performance required 8 die per channel.
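The table values can be reproduced from the page timings: net bandwidth per die is the page size divided by the array time plus the data-output time, and a device scales with die per channel until the channel interface saturates. A sketch of that arithmetic (the scaling function is my own simplification of the slide figures):

```python
PAGE_KB = 8.0  # 4KB page x 2 planes

def die_bw(array_us, transfer_us):
    """Net MB/s for one die: page size / (array time + data I/O time)."""
    return PAGE_KB * 1000.0 / (array_us + transfer_us)

def device_bw(per_die_mbs, die_per_channel, channels, channel_limit_mbs):
    """Die interleave on a channel until its interface saturates;
    channels then scale independently."""
    return min(per_die_mbs * die_per_channel, channel_limit_mbs) * channels

# ONFI 1.0 async interface: ~211us to move 8KB (~40MB/s channel)
print(f"SLC read  per die: {die_bw(25, 211):5.1f} MB/s")   # ~34
print(f"MLC read  per die: {die_bw(50, 211):5.1f} MB/s")   # ~31
print(f"SLC write per die: {die_bw(250, 211):5.1f} MB/s")  # ~17
print(f"MLC write per die: {die_bw(900, 211):5.1f} MB/s")  # ~7

# 4 channels x 2 die per channel saturates the SLC read case at 160MB/s:
print(f"SLC read, 4ch x 2 die: {device_bw(die_bw(25, 211), 2, 4, 40):.0f} MB/s")
```

The same functions reproduce the ONFI 2.x figures by substituting the ~43μs synchronous transfer time and a 200MB/s channel limit.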

ONFI 2.x Specification

ONFI 2.0 defines a synchronous interface, improving the IO channel to 200MB/s and allowing 16 die per channel. Version 2.0 (2008) allowed speeds greater than 133MB/s. Version 2.1 (2009) increased this to 166 and 200MB/s, plus other enhancements, including in ECC. (The current Micron NAND parts catalog lists 166MT/s as available.) Read performance is improved for a single die and for multiple die. Write performance did not improve much for a single die, but did for multiple die on the same channel. Version 2.2 added other features. ONFI 2.3 adds EZ-NAND to offload ECC responsibility from the host controller.

Below are the net bandwidth calculations for ONFI 2.x.

                      Read                             Write
Device        Planes  Data   Array   Output  Net BW    Output  Program  Net BW
SLC 4KB page  2       8KB    25μs    43μs    120MB/s   43μs    250μs    28MB/s
MLC 4KB page  2       8KB    50μs    43μs    88MB/s    43μs    900μs    8MB/s

Synchronous SLC 2-Plane Performance (MB/s): channels vs. die per channel

                 Read                       Write
Die per channel  1    2    4    8           1    2    4    8
1 channel        120  200  200  200         28   56   112  200
2 channels       240  400  400  400         56   112  224  400
4 channels       480  800  800  800         112  224  448  800

Synchronous MLC 2-Plane Performance (MB/s): channels vs. die per channel

                 Read                       Write
Die per channel  1    2    4    8           1    2    4    8
1 channel        88   176  200  200         8    16   32   64
2 channels       176  352  400  400         16   32   64   128
4 channels       352  704  800  800         32   64   128  256

Almost all SSDs on the market in 2010 are ONFI 1.0. SSDs using ONFI 2.0 are expected soon(?) with >500MB/s capability?

ONFI 3.0 Specification

The future ONFI 3.0 will increase the interface to 400MT/s.

Non-Volatile Memory Host Controller Interface

The existing interfaces to the storage system were all designed around the characteristics of disk drives, naturally, because the storage system was comprised of disk drives. As expected, this is not the best match for the requirements and features of non-volatile memory storage. The Non-Volatile Memory Host Controller Interface (NVMHCI) specification will define "a register interface for communication with a non-volatile memory subsystem" and "also defines a standard command set for use with the NVM device." The NVMHCI specification should be complete this year, with product in 2012.

A joint Intel and IDT presentation by Amber Huffman and Peter Onufryk at Flash Memory Summit 2010 discusses Enterprise NVMHCI. In storage systems today, there is a controller on the hard drive (the chip on the hard drive PCB), with a SAS or SATA interface to the HBA.


The argument is that the HBA and controller should be integrated into a single controller on the SSD, with a PCI-E interface upstream. Curiously, IDT mentions nothing about building a native PCI-E flash controller, considering that they are a specialty silicon controller vendor.


Below is the Enterprise NVMHCI view. The RAID controller now has PCI-E interfaces on both the upstream and downstream sides. I had previously proposed that RAID functionality should be pushed into the SSD itself.


SSD with PCI-E Interface

Kam Eshghi, also of Integrated Device Technology, has a FMS 2010 presentation, "Enterprise SSDs with Unrivaled Performance: A Case for PCIe SSDs", endorsing the PCI-E interface. The diagrams below are useful to illustrate the form factor. Below is a RAID PCI-E implementation using a standard RAID controller with PCI-E on the front-end and SATA or SAS on the back-end, a Flash controller with a SATA interface, and NAND chips.


In the next example, the host provides management services, consuming resources,


and finally, a Flash controller with native PCI-E interface (and RAID capability?).


The desire to connect solid-state storage directly to the PCI-E interface is understandable. My issue is that the current standard PCI-E form factor is not suitable for easy access. There is the CompactPCI form factor (not yet defined for PCI-E?) where the external and PCI connections are at opposite ends of the card instead of at two adjacent sides. This would be much more suitable for storage devices. Some provision should also be made for greater flexibility in storage capacity expansion with the available PCI-E ports.

SSD SATA/SAS Form Factor

Intel, IDT and others support native PCI-E interface for SSD controllers. LSI and Seagate had a joint presentation at FMS 2010 supporting SAS, countering that the PCI-E to SAS bridge does not add much latency. SAS/SATA also has excellent infrastructure for module expansion and ease of access. I suppose a protocol revision for SAS might better support SSD capabilities and features.

The current trend with SSD with SATA/SAS interfaces is the 2.5in HDD form factor. The standard 3.5in HDD form factor is far too large for SSD. For that matter, the 3.5in form factor has become too big for HDD as well. The standard defined heights for 2.5in drives are 14.8mm, 9.5mm, and 7mm. Only enterprise drives now use the 14.8mm height, as notebook drives are all 9.5mm or thinner.

The 7mm-height drives used in thin notebooks have limited capacity (250GB?), but might be ideal for SSD. The standard 2U rack enclosure holds 24 x 14.8mm drives, but could perhaps hold 50 x 7mm SSD units?
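The bay-count speculation is just faceplate arithmetic. A rough sketch (the per-bay pitch values are my assumptions, not enclosure specifications):

```python
# A 24-bay 2U enclosure at ~15mm pitch per 14.8mm drive implies roughly
# 360mm of usable faceplate width; refitting that same width for 7mm
# drives at an assumed ~7.5mm pitch roughly doubles the bay count.
usable_width_mm = 24 * 15.0
bays_7mm = int(usable_width_mm // 7.5)
print(f"7mm bays in the same width: {bays_7mm}")
```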

(Update) Apparently Oracle/Sun has already implemented the high-density strategy. The F5100 implements up to 80 Flash Modules (2.5in, 7mm form factor, SATA interface) in a 1U enclosure for 1.92TB capacity. I suppose the Flash Modules are two deep. A hard drive enclosure is already heavy enough with 1 rank of disks, but 2 deep for a flash enclosure is very practical. And to think there are still storage vendors peddling 3U 3.5in enclosures!

Gary Tressler of IBM proposes that SSD should actually adopt the 1.8in form factor. Presumably there would be only a single SSD capacity. The storage enclosure would have very many slots, and we could just plug in however many we need.


SSD Controllers Today

I believe STEC is one of the component suppliers for Enterprise-grade SSD, especially with SAS interface, while most SSDs are SATA. EMC just announced Samsung as a second source. SandForce seems to be a popular SSD controller source for many SSD suppliers.

The Storage Review SSD Reference Guide provides a helpful list of SSD controller vendors. These include the Intel PC29AS21BA0, JMicron, Marvell, Samsung, SandForce and Toshiba.


The Intel SSD controller below.


SandForce SSD Processor

SandForce makes SSD processors used by multiple SSD vendors. The client SSD processor is the SF-1200. Random write IOPS is 30K for bursts, 10K sustained, both at 4K blocks. The SF-1500 is the enterprise controller. The performance numbers are similar. Both support ONFI at 50MT/s and SATA 3Gbps, and can correct 24 bytes (bits?) per 512-byte sector. The SF-1500 is listed as also supporting eMLC, has unrecoverable read errors of less than 1 in 10^17, a reliability MTTF of 10M operating hours, and supports a 5-year enterprise life cycle (100% duty). The SF-1200 has unrecoverable read errors of less than 1 in 10^16, a reliability MTTF of 2M operating hours, and supports a 5-year consumer life cycle with 3-5K cycles.

The new SandForce 2000 processor line should be available in early 2011. The SF-2000 series supports ONFI 2 at 166MT/s. The enterprise processors are the SF-2500 and 2600 lines. SATA 6Gbps and below is supported. The SF-2500 is SATA, supporting only 512B sectors? The SF-2600 also supports 4K sectors; it has a SATA interface but can work behind a SAS/SATA bridge.

            Sequential (128K)      Random (4K) IOPS
Processor   Read      Write        Read   Write burst  Write sustained
SF-1200     260MB/s   260MB/s      30K    30K          10K
SF-1500     260MB/s   260MB/s      30K    30K          30K?
SF-2500     500MB/s   500MB/s      60K    60K          60K?

The SandForce 2000 controller diagram.


SSD vendors with the SandForce processor include Corsair and OCZ.

SSD and other Solid-State Storage Today

Below is a quick survey of SSD either currently available or expected in the near-term.


The Crucial RealSSD C300 MLC models are currently available and the Micron P300 SLC models are sampling (but I was left off the eval list). They support 6Gb/s SATA, with sequential read up to 355MB/s on the 6Gbps interface (265MB/s on 3Gbps). The form factors are 2.5in and 1.8in. There are differences between the 1.8in and 2.5in models, and I may not have the table below entirely correct.

  Sequential (128K) Random (4K)
Capacity Read Write Read Write
C300 64G 355MB/s 70MB/s 50K 15K
C300 128G 355MB/s 140MB/s 60K 30K
C300 256G 355MB/s 215MB/s 60K 45K
P300 256G 360MB/s 275MB/s 60K 45K

See the Storage Review of the Crucial Real SSD C300. "The heart of the Crucial RealSSD C300 is a Marvell 88SS9174-BJP2 controller with a 128MB Micron 0AD12-D9LGQ RAM buffer. The storage section is made up of sixteen Micron 00B12 MLC NAND flash modules." Eight of the NAND chips are on the other side.


Micron also has the RealSSD P300 drive based on SLC for the enterprise market, currently in the sampling stage. The interface is 6Gbps SATA, 2.5in form factor, in 50, 100 and 200GB capacities, with 360MB/s read and 275MB/s write on all models. Fresh (empty) IOPS are 60K read, 45K write; sustained IOPS 44K read, 16K write?

Intel SSD

In 2008(?), Intel made a big splash with their SSD flash controllers having outstanding performance. The Intel SSDs achieved a big lead over other vendors. Since then, Intel has been stationary and other flash controller vendors have caught up. It is almost as if there were now many more layers of management involved in decision making, with many making no contribution but exercising veto authority? Or was that another situation?

The original Intel X25-M SSD implemented 10 parallel channels with ONFI 1.0. Below are specifications for the Intel SSDs (smaller capacity models have lower write specs).

  Sequential Random (4K) Latency
  Read Write Read Write Read Write
X25-E (50nm) 250MB/s 170MB/s 35K 3.3K 75μs ?μs
X25-M (50nm) 250MB/s 70MB/s 35K 3.3K 85μs 115μs
X25-M (34nm) 250MB/s 100MB/s 35K 8.6K 65μs 85μs

The next generation Intel MLC SSD is the Postville Refresh at 25nm, expected toward the end of 2010 in 160, 300 and 600GB capacities. The X25-E successor, Lyndonville, is expected in early 2011 with 25nm enterprise-grade MLC at 100, 200 and 400GB capacities. The Intel X25-M G2 is below.



OCZ Technology

The OCZ product lineup is not easy to decipher for anyone not familiar with the actual product introduction timeline. OCZ has a very broad range of SSD products with PCI-E, SATA, HSDL, and USB 3.0 interfaces. The SATA products are available in 3.5, 2.5 and 1.8in form factors, of which only the 2.5in FF will be discussed here. The product line includes both SLC and MLC NAND, for different market segments.


First, the PCI-E form factor. It seems the original product was the Z-Drive, of which the most recent is the Z-Drive R2. There are several models, including the e88 with 512GB SLC and the p88 with 512GB to 2TB MLC. The Z-Drives have an onboard RAID controller, with PCI-E gen 1 x8 on the front interface and SATA on the back interface. The 88 models have 512MB cache and 8 SATA ports (SandForce controllers). The 84 models have 256MB cache and 4 SATA ports (SandForce controllers).


There appear to be 2 NAND boards, with 8 modules per board? The SLC e88 capacity is 512GB, so perhaps each module is 32GB? The product specifications say 8 SATA controllers, so perhaps 2 modules per SATA device?

The RevoDrive line has a PCI-E x4 gen 1 interface with 2 SandForce 1222 SATA controllers, and is designed to consumer requirements and price points. The two chips immediately to the left of the NAND chips are the SandForce controllers. (Update) OCZ just announced the new RevoDrive X2 with 4 SandForce 1200 controllers.


Below are specifications for some of the Z-Drive R2 and Revo Drive models.

Model           NAND  Capacity    Max Read   Max Write  Sustained Write  Random IOPS
Z-Drive R2 e88  SLC   512GB       1400MB/s   1400MB/s   up to 950MB/s    29,000   7,200
Z-Drive R2 p88  MLC   512GB-2TB   1400MB/s   1400MB/s   up to 950MB/s    29,000   14,500
Z-Drive R2 p84  MLC   256GB-1TB   850MB/s    800MB/s    up to 500MB/s    29,000   7,500
RevoDrive       MLC   120-480GB   540MB/s    480MB/s    up to 400MB/s    ?        75,000
RevoDrive       MLC   50-80GB     540MB/s    450MB/s    up to 350MB/s    ?        70,000
RevoDrive X2    MLC   100-160GB   740MB/s    690MB/s    up to 550MB/s    ?        100,000
RevoDrive X2    MLC   240-960GB   740MB/s    720MB/s    up to 600MB/s    ?        120,000

The 88 models' sequential IO bandwidth matches well with PCI-E gen1 x8, as expected with 8 SATA ports on the RAID controller. The random IOPS are low by today's standards. The p84 256GB model has different specifications.

I am inclined to think the Z-Drive R2 is older technology (even 2009 is a lifetime ago in SSD) and is not being updated, perhaps due to weak interest. The RevoDrive is based on newer technology with spectacular 4K IOPS, but bandwidth is limited by the 2 SATA interfaces.

Below are the prices I found, mostly from PC Connection or NewEgg?

Capacity and Price 512GB 1TB 2TB
Z-Drive R2 e88 $9,299 n/a n/a
Z-Drive R2 p88 $2,359 $4,400 $8,400
Z-Drive R2 p84 $2,000 $4,330 n/a

RevoDrive pricing (2010-10)

Capacity and Price 50GB 80GB 120GB 240GB 360GB 480GB
RevoDrive $219 $280 $340 $699 $1029 $1249

Below for the new RevoDrive X2 (2010-10)

Capacity and Price 100GB 160GB 240GB 360GB 480GB 960GB
RevoDrive X2   $687 $1183 $1468

I do not know what OCZ uses for the PCI-E to SATA controller in their PCI-E form factor SSD products. I hope it is the LSI controller, which is capable of aggregating the bandwidth of several SATA SSD units effectively. Marvell makes an adequate consumer PCI-E SATA controller, but is probably not suited for multiple devices.

OCZ SATA 2.5in

The more active OCZ SSD product lines are in the 2.5in SATA form factor. All are still on the SATA 3Gb/s interface, but the next-generation SandForce controllers will support 6Gbps in early 2011? The SATA 2.5in SSD product groups at the high end are comprised of the Vertex 2 EX series with SLC, and, in MLC, the Vertex 2 Pro series, the Vertex 2, and the Agility 2. The Vertex 2 series seems to be comprised of regular and extended models, the latter with an E in the part number. I am not sure why there are two separate sub-lines; perhaps it denotes different generations of NAND. There are also the mainstream Onyx II and Value Onyx series, which are not discussed here.

Below is a summary of the high-end SATA 2.5in FF SSD specifications. The Vertex 2 40GB, 400GB, and 480GB models have different specifications: the 40GB model is slightly lower, and the 400 and 480GB models are lower still for some reason. (Is this because there are actually two devices sharing the same uplink?) Refer to the vendor website for actual model-specific specifications.

Model         Controller      NAND  Capacities                     Max Read  Max Write  Sustained Write  IOPS (read / write)
Vertex 2 EX   SandForce 1500  SLC   50, 100,                       285MB/s   275MB/s    250MB/s          50,000   50,000
Vertex 2 Pro  SandForce 1500  MLC   50, 100, 200, 400GB            285MB/s   275MB/s    250MB/s          50,000   50,000
Vertex 2 E    SandForce 1222  MLC   60, 90, 120, 180, 240, 480GB   285MB/s   275MB/s    250MB/s          50,000   50,000
Vertex 2      SandForce 1222  MLC   40, 50, 100, 200, 400GB        285MB/s   275MB/s    250MB/s          50,000   50,000
Agility 2 E   SandForce 1222  MLC   60, 90, 120, 180, 240, 480GB   285MB/s   275MB/s    250MB/s          ?        10,000
Agility 2     SandForce 1222  MLC   40, 50, 100, 200, 400GB        285MB/s   275MB/s    250MB/s          ?        10,000

The OCZ specifications qualify the transfer bandwidth and IOPS as "up to" that number. The test results are based on IOMeter 2008 at queue depth 32. The sequential IO is 128K, and the IOPS is 4KB aligned. Because the IOPS at 4K is already close to the 3Gbps SATA interface limit, we should expect the SQL Server-oriented 8KB IOPS to be slightly more than half the 4K IOPS. Also, the OCZ specifications only cite the random write IOPS, so the read is presumed.
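The interface-limit arithmetic here is simple. A sketch, assuming the usual ~300MB/s payload ceiling for 3Gbps SATA after 8b/10b encoding (protocol overhead shaves this down further):

```python
SATA_3G_MBS = 3000 / 10  # 3Gb/s line rate; 10 bits per byte after 8b/10b

def iops_ceiling(block_kb, interface_mbs=SATA_3G_MBS):
    """Maximum IOPS the interface can carry at a given block size."""
    return interface_mbs * 1000.0 / block_kb

print(f"4KB ceiling: {iops_ceiling(4):8,.0f} IOPS")  # 75,000
print(f"8KB ceiling: {iops_ceiling(8):8,.0f} IOPS")  # 37,500
```

With the drives already rated at 50K 4KB IOPS against a ~75K interface ceiling, doubling the block size to 8KB roughly halves the achievable IOPS, as noted above.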

I am not sure what the purpose of the Agility 2 line is, given that the prices are about equivalent.

Seek time is 0.1ms for all of the above models. Both the EX and Pro series are rated at 10M-hour MTBF. I am not sure how this is assessed. Hard drives are tested at elevated temperature, and MTBF is extrapolated based on known temperature-failure rate scaling, but does this apply to SSD? A built-in "super-capacitor" is listed for the EX and Pro series for power-loss protection. The Vertex 2 series is listed with a 2M-hour MTBF.

The prices below are from PC Connection for the EX and higher-end, and NewEgg for most of the others.

Capacity and Price 50GB 100 200 400
Vertex 2 EX $931 $1,732 $3,946 n/a
Vertex 2 Pro $399 $630 $1,059 $2,458
Vertex 2 E $145
Vertex 2 $145 $249 $569 $1,899
Agility 2 E $150

The Vertex 2 models with the E in the part number appear to be newer, with 20% higher capacity at the same or lower price point than the original Vertex 2. The prices may have been reduced recently. There does not appear to be a price difference between the Vertex and Agility series, so I do not know why there are two separate lines.

Previously, I had commented that the Dell 50GB SSD at $1199 and 100GB at $2199 seemed expensive, but if they are SLC, then the price is reasonable.



Fusion-IO

The general idea behind the Fusion-IO architecture is that existing storage interfaces were not really intended for the capabilities of an SSD. A storage interface, like SAS, was designed for many drives to be connected to a single system IO port. Since Fusion-IO could build the SSD unit to match the IO capability of a PCI-E slot, it is natural to interface directly to PCI-E. All the major server system vendors (Dell, HP, IBM) OEM the Fusion-IO cards.

Below is the original Fusion-IO ioDrive card with a PCI-E gen 1 x4 upstream interface.


Below are the Fusion-IO ioDrive specifications. The prices are from the Dell website, (HP website prices in parenthesis).

ioDrive Capacity 160GB 320GB 640GB
NAND Type SLC (Single Level Cell) MLC (Multi Level Cell) MLC (Multi Level Cell)
Read Bandwidth (64kB) 770 MB/s 735 MB/s 750 MB/s
Write Bandwidth (64kB) 750 MB/s 510 MB/s 550 MB/s
Read IOPS (512 Byte) 140,000 100,000 93,000
Write IOPS (512 Byte) 135,000 141,000 145,000
Mixed IOPS (75/25 r/w) 123,000 67,000 74,000
Access Latency (512 Byte) 26 µs 29 µs 30 µs
Bus Interface PCI-Express x4 PCI-Express x4 PCI-Express x4
Price $8,173 ($7K) $7,719 ($7.5K) $11,832 (?)

The specifications are somewhat improved over those listed in 2009, which cited 50-80μs per read. So presumably there were firmware enhancements, as the NAND chips should be the same. Or the old latency spec could have been for 4K IO? The latency specification is a little misleading, being cited for 512 bytes. Other vendors cite 100μs for 4K IO, which incorporates 26μs for SLC random access plus the transfer time. I am assuming that Fusion-IO implements very wide parallelism, in both channels and die per channel.

Below is the Fusion-IO Duo, matched to PCI-E gen1 x4 or PCI-E gen2 x4 bandwidth, supporting both interfaces.


Below are the Fusion-IO ioDrive Duo specifications.

ioDrive Duo Capacity 320GB 640GB 1.28TB
NAND Type SLC (Single Level Cell) MLC (Multi Level Cell) MLC (Multi Level Cell)
Read Bandwidth (64kB) 1.5 GB/s 1.5 GB/s 1.5 GB/s
Write Bandwidth (64kB) 1.5 GB/s 1.0 GB/s 1.1 GB/s
Read IOPS (512 Byte) 261,000 196,000 185,000
Write IOPS (512 Byte) 262,000 285,000 278,000
Mixed IOPS (75/25 r/w) 238,000 138,000 150,000
Access Latency (512 Byte) 26 µs 29 µs 30 µs
Bus Interface PCI-Express x4/x8 or PCI-Express 2.0 x4 (all capacities)
Price $17,487 ($14K) $15,431 ($15K) ?

The reliability specs are ECC with 11-bit correction per 240 bytes, uncorrected error rate of 1 in 10^20 bits, and undetected error rate of 1 in 10^30 bits.
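To put the uncorrected error rate in perspective, a quick sketch of the mean time between uncorrected errors; the assumption of continuous full-bandwidth reads at 1.5GB/s (the Duo read spec) is mine, for illustration:

```python
# What does "1 uncorrected error in 10^20 bits" mean in practice?
# Expected years between uncorrected errors if the card is read
# continuously at full bandwidth (assumed 1.5 GB/s).

SECONDS_PER_YEAR = 365 * 24 * 3600

def years_per_error(read_gb_s, error_rate_bits=1e20):
    bits_per_year = read_gb_s * 1e9 * 8 * SECONDS_PER_YEAR
    return error_rate_bits / bits_per_year

print(round(years_per_error(1.5)))  # ~264 years of nonstop full-rate reads
```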

What I would like from Fusion-IO is a range of cards that can match the IO bandwidth of PCI-E gen2 x4, x8 and x16 slots, delivering 2, 4 and 8GB/s respectively. Even better would be the ability to simultaneously read 2GB/s and write 500MB/s or so from a x4 port, and so on for x8 and x16. I do not think it is really necessary for the write bandwidth to be more than 30-50% of the read bandwidth in proper database applications. One way to do this is to have a card with a x16 PCI-E interface, where the onboard SSD connects to only a x4 slice, and the main card accepts daughter cards, each connecting to its own x4 slice, or something to this effect.
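As a cross-check on those slot targets, a quick sketch of PCI-E gen2 usable bandwidth: 5GT/s per lane with 8b/10b encoding leaves 500MB/s per lane per direction, before protocol overhead.

```python
# PCI-E gen2 usable bandwidth per direction: 5 GT/s per lane, 8b/10b
# encoding (8 data bits carried per 10 bits on the wire).

def pcie_gen2_gb_s(lanes, gt_s=5.0, encoding=8/10):
    return lanes * gt_s * encoding / 8  # GB/s per direction

for lanes in (4, 8, 16):
    print(lanes, pcie_gen2_gb_s(lanes))  # 2.0, 4.0, 8.0 GB/s
```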

Fusion-IO has an Octal card that is matched to PCI-E gen2(?) x16 bandwidth. This card draws power from the dedicated power connector used by high-end graphics cards. Apparently even a x8 card draws more power than is available from the PCI-E slot, and normal server systems do not have the special graphics power connector.

One more thought on the Fusion-IO concerns the PCI-E to PCI-E bridge chips. In my other blog on System Architecture, I mentioned that 4-way systems such as the Dell PowerEdge R900 and HP ProLiant DL580G5 for the Xeon 7400 series with the 7300MCH use bridge chips that let two PCI-E ports share one upstream port. Could the Fusion-IO reside in an external enclosure, attached to one port of the bridge chip? The other two ports would connect to the host system(s). On each host would be a simple pass-through adapter that sends the signals from the host PCI-E port to the bridge chip in the Fusion-IO external enclosure. This means the SSD would be connected to two hosts. So now we can have a cluster? Sure, it would probably involve a lot of software to make this work, but who said life was easy?



HP is hinting at their upcoming SSD, now with a SAS 6Gb interface: read 560MB/s, 100K IOPS, write TBD. See the HP Solid-State-Storage-Overview and the Disk technology overview brief. HP currently offers 60 and 120GB SSDs with a SATA 3Gb interface: Read/Write 230/180MB/s, 20K/5K IOPS.

             NAND  Capacities  Max Read  Max Write  Sustained Write  Read IOPS  Write IOPS
Current      SLC?  60, 120GB   230MB/s   180MB/s    ?                20,000     5,000
Future/Soon? ?     ?           560MB/s   ?          ?                100K       TBD

The HP specification sheet for their PCIe IO Accelerators also provides information on the Fusion-IO driver memory usage.

The amount of free RAM required by the driver depends on the size of the blocks used when writing to the drive. The smaller the blocks, the more RAM required. Here are the guidelines for each 80GB of storage:

Block Size (bytes)   RAM usage (MB)
8,192                   400
4,096                   800
2,048                 1,500
1,024                 2,900
  512                 5,600
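Scaling the table to a full card is straightforward; a small sketch using the HP figures (units assumed to be MB, per the "each 80GB of storage" guideline):

```python
# Fusion-IO driver host-RAM usage scales with capacity and inversely
# with write block size. Figures from the HP table, per 80GB of
# storage; units assumed to be MB.

RAM_PER_80GB = {8192: 400, 4096: 800, 2048: 1500, 1024: 2900, 512: 5600}

def driver_ram(capacity_gb, block_size):
    """Estimated driver RAM (MB) for a card of the given capacity."""
    return RAM_PER_80GB[block_size] * capacity_gb / 80

# A 640GB ioDrive written with 4KB blocks:
print(driver_ram(640, 4096))  # 6400.0 MB -- a substantial chunk of host RAM
```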

Give credit to the HP technical people, who really know what key information should be documented. This is too important to leave to the simpletons in !@#$%^&*. (IBM also puts out great Redbooks; give credit to IBM for spending money on important material, not just marketing rubbish.)



Samsung has SLC and MLC SSDs, both with SATA 3Gbps interfaces. The 3.5in SLC models have 100 and 200GB capacities. The 2.5in SLC models have 50, 60, 100 and 120GB capacities.

     Sequential Read  Sequential Write  Random 4K Read  Random 4K Write
SLC  260MB/s          245MB/s           47K             29K
MLC  250MB/s          200MB/s           31K             21K



STEC makes the SSDs for the EMC DMX line, possibly other models, and for several other storage vendors as well. The Zeus IOPS has FC 4Gb, SATA 3Gb and SAS 3Gb interface options, in 3.5in and 2.5in form factors. The specifications for the STEC Zeus IOPS (2008 whitepaper) SSD are 52K random read IOPS, 17K random write IOPS, 250MB/s sequential read, and 200MB/s sequential write. The STEC 2008 Zeus IOPS brochure cites 220/115MB/s and 45K/16K IOPS.


(The STEC website as of 2010-10 lists the following for the Zeus IOPS.) The STEC Zeus IOPS is listed with FC 4Gb (3.5in FF only) and SAS 3Gb and 6Gb (2.5in and 3.5in FF) options. Capacity is up to 800GB for 3.5in and 400GB for 2.5in, with sequential read/write at 350/300MB/s and IOPS at 80K/40K.

                   Sequential Read  Sequential Write  Random Read  Random Write
Whitepaper (2008)  250MB/s          200MB/s           52K          17K
Brochure (2008)    220MB/s          115MB/s           45K          16K
New? (2010)        350MB/s          300MB/s           80K          40K

It is presumed that there have been multiple generations of the Zeus IOPS. Perhaps STEC will eventually distinguish each generation and model more clearly. STEC also supports 512-528 byte sector sizes.

Zeus RAM

This is a RAM-like device(?) with a dual-port SAS interface and up to 8GB capacity. Average latency is 23μs.


Toshiba has an SSD line for which the sequential read is 250MB/s and write is 180MB/s. The interface is SATA 3Gb/s. NAND and NOR flash were invented by Dr. Fujio Masuoka around 1980, then working at Toshiba.


Oracle (Sun)

Oracle-Sun has a comprehensive line of storage products, even without counting the Exadata Storage System. The main Oracle Flash storage products comprise the F5100 Flash Array, the F20 PCIe Card, an SSD with SATA interface, and Flash Modules (these go in the F5100?). If someone could look into the Oracle Sun Flash Resource Kit, I would appreciate it.

Oracle F5100 Flash Array

The Oracle Sun F5100 Flash Array is a very progressive storage system. The interface has 16 x4 SAS channels, presumably more for multi-host connectivity. The storage unit board is shown below. The modules to the left are Energy Storage Modules (ESM).


A Flash Module below.


       NAND  Capacity  Max Read  Max Write  Random 4K Read IOPS  Random 4K Write IOPS  Price
F5100  SLC   480GB     3.2GB/s   2.4GB/s    397K                 304K                  $46K
F5100  SLC   960GB     6.4GB/s   4.8GB/s    795K                 610K                  $87K
F5100  SLC   1920GB    12.8GB/s  9.7GB/s    1.6M                 1.2M                  $160K

The capacity is net, after 25% was reserved for wear leveling (over-provisioning). For anyone (excluding young people) who has ever lifted a hard disk storage unit: the weight of the F5100 is 35lb or 16kg.
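A quick check on what the 25% reserve implies for the raw NAND installed (assuming the reserve is taken off the raw capacity):

```python
# F5100 capacities are quoted net of a 25% over-provisioning reserve,
# so the raw NAND installed is larger. Sketch; assumes the 25% is a
# fraction of raw capacity.

def raw_capacity_gb(net_gb, reserve=0.25):
    return net_gb / (1 - reserve)

for net in (480, 960, 1920):
    print(net, raw_capacity_gb(net))  # 640.0, 1280.0, 2560.0 raw GB
```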

Oracle F20 PCIe Card and SSD

The PCI-E card Flash unit has 96GB capacity, 100K IOPS, $4,695. The SSD is 32GB SLC, 2.5in, 7mm, rated for sequential Read/Write of 250/170MB/s and 35K/3.3K IOPS, with a SATA interface. (I think this is the Intel X25-E?) Price around $1,299? Sun also has Flash Modules in SO-DIMM form factor with a SATA interface, 24GB capacity, and 64MB DRAM. I think these go in the F5100. Do Sun systems have SATA SO-DIMM connectors?


LSI has a new PCI-E SSD product, the LSI SSS6200, with SLC NAND. I am not sure if the product is actually in production, or just sampling.



The specifications are very impressive, and it is fully current with PCI-E gen2. Latency is 50μs. The reliability specifications are 2M hours (MTBF) and a BER of 1 in 10^17.

         NAND  Capacity  Max Read  Max Write  Random 4K Read IOPS  Random 4K Write IOPS  Price?
SSS6200  SLC   100GB     ?         ?          150K                 190K                  ?
SSS6200  SLC   200GB     ?         ?          150K                 190K                  ?
SSS6200  SLC   300GB     1.4GB/s   1.2GB/s    240K                 200K                  ?


Other Solid-State Storage

Texas Memory Systems (RamSan)

Texas Memory Systems (RamSan) has been in the solid-state storage business for a long time. Their products include SAN systems with various solid-state options. The RamSan-440 is a SAN system with 512GB of DRAM as storage, 4.5GB/s, 600K IOPS at 15μs latency, and an FC interface (4Gbps was listed; I expect 8Gb is now available?). The RamSan-630 has 10TB SLC Flash, 10GB/s, 1M IOPS, 250μs read latency, 80μs write, and FC or InfiniBand interfaces. There are also PCI-E SLC NAND products.


Violin Memory makes SAN-type solutions(?). Their website describes a technique for RAID on flash (vRAID), as disk-type RAID is not suited to flash. Spike-free latency is discussed.



DDR DRAM Storage

DRAM-based SSD vendors include DDR Drive and
HyperOS Systems (HyperDrive).
The advantage of DRAM is that lead-off latency is far lower than with NAND (26-50μs at the chip level). The disadvantages are that DRAM is volatile and more expensive per GB than NAND, so some means is necessary to ensure data protection during a power loss. (A SuperCap can provide sufficient power to sweep data to NAND.)
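Sizing the supercap hold-up follows directly from capacity and flush bandwidth; a sketch with assumed figures (8GB is the Zeus RAM class capacity, the 200MB/s flush rate is my assumption):

```python
# How long must a supercap hold the device up to sweep DRAM to NAND
# on power loss? hold-up time = DRAM capacity / NAND flush bandwidth.
# Figures assumed for illustration.

def holdup_seconds(dram_gb, nand_write_mb_s):
    return dram_gb * 1024 / nand_write_mb_s

# 8GB of DRAM flushed at an assumed 200MB/s:
print(round(holdup_seconds(8, 200), 1))  # ~41.0 seconds of backup power
```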

DDR Drive currently makes a DDR DRAM SSD with a PCI-E x1 interface. While this is a desktop product, the choice of DDR DRAM allows for really fast access times, far better than the ~100μs of NAND SSD.

Updates - More Vendors

More vendors: Smart Modular Technologies, Bit Micro.


Complete Rethinking of Memory and Storage


I have also thought that it was well past time to completely rethink the role of DRAM in computer system architecture. For years now, the vast majority of DRAM has been used as data cache, if not by the application, then by the operating system. The concept of a page file on HDD is totally obsolete. Is this not the signal that a smaller "true" main memory should be moved closer to the processor for lower latency? A completely separate DRAM system should then be designed to function as data cache, perhaps with the page file moved there. At the time, I was thinking SRAM for true main memory, but the Intel paper describing a DRAM chip mounted directly on the processor die with through connections would achieve the goal of much reduced latency along with massive bandwidth.

(Apparently there is a startup company RethinkDB that just raised $1.2M to do this on MySQL, "the database is the log". I hope our favorite database team is also rethinking the database?)

IBM Storage Class Memory

The FAST'10 tutorial by Dr. Richard Freitas and Lawrence Chiu of IBM, Solid-State Storage: Technology, Design and Applications introduces(?) the term Storage Class Memory (SCM).

Fusion-IO NAND Flash as a High Density Server Memory

The FMS 2010 presentation "NAND Flash as a High Density Server Memory" by David Flynn, CTO (now CEO) of Fusion-IO, proposes NAND flash in the form of ioMemory modules as high-density server memory. The DRAM main memory holds the operating system, application, and ioMemory metadata, with the NAND flash ioMemory holding the data previously buffered in DRAM. My thinking is that for this to work in a database, there should be some logic capability in the flash memory.



Below are links to websites that cover non-volatile memory and solid-state storage.
As always, Wikipedia is a good starting point: non-volatile memory, Flash Memory, Solid-state drive.
The Open NAND Flash Interface ONFI.
The Flash Memory Summit conference (FMS 2010 conference proceedings requires an email).
Micron, Intel IDF, and Microsoft WinHEC
WinHEC 2008 Design Tradeoffs for Solid-State Disk Performance (ent-t539_wh08).
Micron WinHEC 2007 NAND Flash Memory Direction Presentation.
USENIX HotStorage '10, and FAST '10,
including the tutorial by Dr. Richard Freitas and Lawrence Chiu of IBM, Solid-State Storage: Technology, Design and Applications
The 26th IEEE Symposium on Massive Storage Systems and Technologies MSST2010
SNIA Technical Activities will try to establish a standard Solid State Storage Performance Test Specification. Currently there is no common standard; be alert for questionable measurement and reporting.
See Tom's Hardware SSD 102: The Ins And Outs Of Solid State Storage for SSD.
Marc Bevand's Zorinaq blog seems to have useful information on SSD.

This post is still rough; I will polish it over time, with more frequent updates at


Below is the Toshiba blade SSD for the MacBook Air. I like this form factor. A standard SSD storage enclosure could be a 1U rack holding about 80 of these devices. This should probably be supported by 16-24 SAS 6Gbps lanes, for 8-12GB/s of bandwidth. Each device can support 200MB/s, so the 80 slots are really for capacity expansion.
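A quick balance check on that proposed enclosure, assuming ~500MB/s usable per SAS 6Gbps lane (my figure, after encoding and protocol overhead):

```python
# Sanity check on the proposed 1U blade-SSD enclosure: aggregate
# device-side bandwidth vs. the SAS uplink.

def enclosure_balance(devices, mb_per_device, sas_lanes, mb_per_lane=500):
    device_bw = devices * mb_per_device / 1000  # GB/s from the devices
    uplink_bw = sas_lanes * mb_per_lane / 1000  # GB/s out the SAS lanes
    return device_bw, uplink_bw

dev, up = enclosure_balance(80, 200, 24)
print(dev, up)  # 16.0 GB/s of device bandwidth vs 12.0 GB/s uplink
```

The devices can out-run the uplink, which supports the point above: beyond a certain count, the slots add capacity, not bandwidth.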


Published Monday, October 25, 2010 12:19 PM by jchang




Greg Linwood said:

Nice article Joe.

One point on a comment you made about MLC:

>Write performance issues of MLC can be solved with over provisioning.<

Over-provisioning can't improve the write performance per se - it improves consistency of write performance & extends the life of the device. It mainly allows a device to avoid the "write cliff" effect (falling behind in garbage collection to the point where there's no capacity left to perform block erasure / write amplification) but doesn't change the way the core write cycle works.

MLC core write cycle is more complex for two main reasons - first, the redundancy implementation is more complex than SLC. SLC works on basic Hamming Codes + extra parity where MLC gets into more complex algorithms & the essential process for charging multi level cells is also more complex.

Over-provisioning is about achieving long term stability under heavy workloads & is extremely important for write intensive systems, but these are rare with SQL Server data files due to the nature of SQL Server's data caching framework. If you were putting TLogs onto SSDs, you'd definitely want to heavily over-provision though..

October 26, 2010 8:19 AM

jchang said:

see Wesley Brown, SQL Man of Mystery
solid state storage basics for more on Flash. It was not my intent to rewrite everything that was already out there. But it was important to assemble enough information to form the whole picture. An explanation of Trim might help. More on the Flash Translation Layer later.

October 26, 2010 12:07 PM

Nasser said:

Hi Joe,

thanks a lot for all the effort you put in your articles, which are really helpful for us all.

I wanted to ask you regarding the status of the Enterprise SSD and how it stands against the HDDs when it comes to stand alone servers, in simple words, for small DW project, would you pick 1 SSD or a RAID implementation of 8x73GB 15K?


October 28, 2010 12:40 AM

jchang said:

Without the entire scope, any answer is limited. Still, it is not practical to only have a single SSD. So the choice is really between HDD only and HDD + SSD. From 8 15K 2.5in HDDs, it is possible to get 150MB/s per disk if everything is done a certain way, but more practically, it will be good to get 100MB/s per disk, for 800MB/s total. Based on 400 IOPS per disk at high queue depth, expect 4K IOPS total. With SSDs today, expect 260+ MB/s and 20-25K 8K IOPS from SQL Server. In a few months, we should be able to get SSDs at 500MB/s and perhaps 40K IOPS. The mistake many people make, including the MS FTDW and PDW concepts, is assuming DW is all sequential. It is not, unless you put WITH(INDEX(0)) on all your queries.

For small DW, I would recommend skipping the SLC SSD; go with MLC like the OCZ Vertex 2 Pro. For a small business with flexible turn-on-a-dime decision making, I would go with 2 x 7200 HDD to boot, 4 15K HDD for flat files and backup, and 4 consumer SATA SSDs or 2 consumer PCI-E SSDs. The DW is not write-intense like OLTP, and you can always reload from HDD if you encounter unrecoverable write errors. Plus, I would then chuck the SSDs next year, buying the newer models.

In a large environment, the effort to fill out the paperwork outweighs everything else; go with Enterprise models, and buy extra so you never have to fill out more paperwork later, explaining why your first try did not work and that you have to ask again.

October 28, 2010 9:27 AM

Glenn Berry said:

Very nice work, Joe!

October 29, 2010 11:35 PM

jchang said:

thanks glenn

if Coach bags were a US company, I might be inclined to file a sex discrimination complaint. What makes them assume only women would be interested in a Coach/Chanel bag? Perhaps I might be interested in one for myself? But here they go assuming I wouldn't be interested!

October 30, 2010 7:42 PM

Nasser said:

Hi Joe,

you misunderstood Coach bags, they think there are a lot of women who enjoy reading your articles!  

November 2, 2010 12:13 AM

jchang said:

if there is a lot of female traffic on my blog, somehow I think they were actually looking for the Taiwan actor/model with a similar name, as search engines include alternate spellings

for years, me and some professor at Yale were the top results for Joe Chang, but then this guy born in 1982 takes over the top spot. I am tempted to substitute his picture for mine though.

November 2, 2010 12:13 PM



About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
