THE SQL Server Blog Spot on the Web


Joe Chang

Computers without ECC memory are crap - no exceptions

In my previous post, Hardware rant 2015, some readers reacted to my suggestion that vendors start offering the Intel Xeon E3 v5 in laptop and desktop PCs as if this were an outlandish or impractical idea.

First, doing so requires almost no additional work. Simply substitute 1) the Xeon E3 v5 for the Core i7 gen 6, 2) the server PCH (C236) in place of the desktop PCH (Z170), the two being really the same thing, as are the two processors, and 3) ECC memory for non-ECC, which has 8 extra bits over the normal 64 bits. The cost of this might be one hundred dollars, mostly driven by the premium Intel charges, and only slightly by the 12% higher cost of the memory. (It is the Xeon E5 line that would be seriously impractical in a laptop that an old person could easily carry. A young, fit person might claim not to feel the difference between 4 and 15 lb, or 2 and 6 kg.)

Second, I should explain why ECC memory is so important, far outweighing the extra cost. This is true for user systems, not just servers with 24/7 requirements. As the title states, a PC without ECC-protected memory is total crap, no exceptions, unless what you do on the PC is totally worthless, which could be the case for a significant market segment.

Basically, without any check on memory integrity, we may have no idea when and where a soft error has occurred. Perhaps the only hint is that the OS or application crashes for no traceable reason, or serious data corruption has already occurred. Let it be clear that soft errors do occur unless you are deep underground.

Up until the early 1990s, many if not most PCs sold as desktops and laptops had parity-protected memory. Then, in the Windows 3.x time frame, (almost?) all PC vendors switched to memory with no data integrity protection for their entire lineup of desktop and mobile PCs (with perhaps the exception of dual-processor systems based on the Pentium Pro and later, which were subsequently classified as workstations). This was done to reduce cost, eliminating the 1/9th of memory used for parity.

All server systems retained parity, and later switched to ECC memory, even though entry-level servers use the same processor as desktops (either with the same product name, or a different one). The implementation of memory protection is done in the memory controller, which was in the north-bridge in the past and, more recently, is integrated into the processor itself (starting with Opteron on the AMD side, and Nehalem on the Intel side).

I recall that the pathetic (but valid?) excuse given to justify abandoning parity memory protection was that DOS and Windows were so unreliable as to be responsible for more system crashes than an unprotected memory system. However, since 2003 or so, new PCs have been sold with the operating system shifted to the Windows NT code base, imaginatively called Windows XP.

(In theory) Windows NT is supposed to be a hugely more reliable operating system than Windows 3.1/95, depending on the actual third-party kernel-mode drivers used. (Let's not sidetrack on this item, and pretend what I just said is really true.) By this time, the cost of sufficient DRAM, unprotected or ECC, was no longer as serious a matter, even though the base memory configuration had grown from 4MB for Windows 3.1 to 512MB if not 1GB for Windows XP or later. And yet, there was not a peep from PC system vendors on restoring memory protection, now with ECC as the standard. (I did hear IBM engineers* propose this, but nothing from PC vendors without real engineers. We don't need to discuss what the gutless wonders in product management thought.)

Presumably soft errors are now the most common source of faults in systems from Windows NT/XP on. Apple Mac OS (from version?) and Linux are also protected-mode operating systems. So this is pretty much the vast majority of systems in use today. It is possible that there are bugs in third-party drivers that have not been tested under the vast range of possible system configurations (more so for performance-oriented graphics drivers?). Still, the fact that vendors do not regard correcting the most serious source of errors in PCs today as worth doing is an indication that they consider the work we do on PCs to be worthless crap, which is the same regard we should have for their products.

Let me stress again that putting out PCs with ECC memory does not require any technical innovation. ECC capability has been in entry server systems built from identical or comparable components all along. By this time, Intel memory controllers had ECC capability which could be (factory) enabled or disabled depending on the targeted market segment. (Intel does have dumbed-down chipsets for the low-end PCs, but it is unclear if ECC was actually removed from the silicon.)

Additional notes:
A. The Wikipedia article ECC memory cites references that mention actual soft-error rates. There is a wide range of values cited, so I suggest not getting hung up on the exact rate, and treating this as order-of-magnitude(s). There is a separate entry, soft error, for anyone interested in the underlying physics. Of course there are other Wiki entries on the implementation of ECC.

Briefly, the prevalent source of soft errors today originates with cosmic rays striking the upper atmosphere, creating a shower of secondary particles, of which neutrons can reach down to the habitable areas of Earth. Unless the environment is a cave deep underground, there will be soft errors caused by background radiation. The probability of errors also depends on the surface area of memory silicon, so a system with a single DIMM will experience fewer soft errors than a system with many DIMMs.
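To put that scaling in rough numbers (the error rate below is a made-up placeholder, precisely because published rates vary by orders of magnitude), soft errors can be modeled as a Poisson process, so the chance of seeing at least one error grows with both DIMM count and uptime:

```python
import math

# Placeholder rate for illustration only: real published figures vary by
# orders of magnitude with density, altitude, and measurement methodology.
ERRORS_PER_DIMM_PER_MONTH = 1e-2

def p_at_least_one_error(n_dimms, months, rate=ERRORS_PER_DIMM_PER_MONTH):
    # Poisson model: P(at least one error) = 1 - exp(-lambda)
    lam = n_dimms * months * rate
    return 1.0 - math.exp(-lam)

# A one-DIMM desktop vs a 16-DIMM server, both up for a year:
p_desktop = p_at_least_one_error(1, 12)
p_server = p_at_least_one_error(16, 12)
```

Whatever the true rate turns out to be, the 16-DIMM machine is far more exposed, which is one reason servers never dropped memory protection.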

B. Early memory modules were organized as 8 data bits plus 1 parity bit in a 30-pin x9 SIMM. Sometime in the early 1990s, around the 80486 to Pentium time, 72-pin x36 SIMMs (32 data bits, 4 parity bits) were popular. In both the x9 and x36 modules, 1 parity bit protects 8 bits of data. Parity-protected memory had the ability to detect, but not correct, single-bit errors in an 8-bit "line".
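A software sketch of that parity scheme (even parity is assumed here; some implementations used odd parity):

```python
def parity_bit(data_byte):
    # Even parity: the stored bit makes the total count of 1s across the
    # 8 data bits plus the parity bit an even number.
    return bin(data_byte & 0xFF).count("1") % 2

def parity_ok(data_byte, stored_parity):
    # Any odd number of bit flips is detected, but the scheme cannot say
    # WHICH bit flipped, so nothing can be corrected; an even number of
    # flips (e.g. 2) passes undetected.
    return parity_bit(data_byte) == stored_parity
```

A single flipped bit fails the check; flip two bits and the corruption is invisible, which is exactly the limitation ECC was introduced to address.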

A few high-end servers in this era had ECC memory, which may have been implemented with 2 x36 memory modules forming a 64-bit line with 8 bits for parity, or perhaps a custom memory module? Later on, memory modules progressed to DIMMs, having 64 bits of data with an allowance for 8 additional bits for ECC. The base implementation of ECC is a 72-bit line with 64 bits for data and 8 bits for ECC. This allows the ability to detect and correct single-bit errors, and to detect but not correct 2-bit errors (SECDED). More than 2 bits in error could potentially constitute an undetected error (dependent on the actual ECC implementation). There are also other ECC strategies, such as grouping 4 x72 DIMMs into a line, allowing the ability to detect and correct the failure of an entire x4 (or x8?) DRAM chip, when each DIMM is comprised of 18 x4 chips, each chip providing 4 bits of data.
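For the curious, here is a minimal software sketch of the SECDED idea using an extended Hamming code. Real controllers do this in silicon and the exact codes they use are not public, so treat this as an illustration of the principle, not any vendor's implementation:

```python
def secded_encode(data_bits):
    # Place data bits at non-power-of-two positions 1..n, Hamming check
    # bits at power-of-two positions, plus an overall parity bit at index 0.
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:
        r += 1
    code = [0] * (m + r + 1)
    it = iter(data_bits)
    for i in range(1, m + r + 1):
        if i & (i - 1):              # not a power of two: a data position
            code[i] = next(it)
    for j in range(r):               # check bit 2^j covers every position
        p = 1 << j                   # whose binary index has bit j set
        code[p] = sum(code[i] for i in range(1, m + r + 1) if i & p) % 2
    code[0] = sum(code[1:]) % 2      # overall parity enables double detect
    return code

def secded_check(code):
    # Returns "ok", "corrected" (fixing code in place), or "uncorrectable".
    n = len(code) - 1
    syndrome = 0
    for j in range(n.bit_length()):
        p = 1 << j
        if sum(code[i] for i in range(1, n + 1) if i & p) % 2:
            syndrome |= p
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "ok"
    if overall == 1:                 # odd number of flips: assume one error
        if syndrome <= n:
            code[syndrome] ^= 1      # syndrome 0 means the parity bit itself
        return "corrected"
    return "uncorrectable"           # syndrome set but even parity: 2 errors
```

With 64 data bits this yields 7 check bits plus the overall parity bit, i.e. the 72-bit line described above. Note that three or more flipped bits can alias to a valid or "correctable" word, which is the undetected-error case mentioned in the text.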

C. At the hardware level, if an error is detected and corrected, the operating system and applications continue to function. The event can be logged at the system level. On a detected but uncorrectable error, the hardware should cause a blue-screen OS crash.

An undetected error is just that. It is undetected. The system continues running with incorrect memory content. Depending on the nature of the memory corruption, anything can happen. It could be executable code, in which case the instruction changes. It could be critical operating system data, causing a subsequent memory access to read or write the wrong location, which could have serious corruption consequences. It could also be end data, whether numeric, character, or control, which may or may not be critical.
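To make "anything can happen" concrete, here is what one flipped bit does to an ordinary double-precision number (IEEE 754 representation, little-endian byte order assumed):

```python
import struct

raw = bytearray(struct.pack("<d", 1.0))  # 1.0 as 8 little-endian bytes
raw[7] ^= 0x10                           # flip a single bit in the exponent
corrupted = struct.unpack("<d", bytes(raw))[0]
# One flipped bit silently turns 1.0 into 2**-256 (about 8.6e-78).
```

If that value were an account balance or a WHERE-clause boundary, nothing would crash; the answer would simply be wrong.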

Edit
* It is probably more correct to say that soft errors are the province of scientists/physicists, not engineers. Sun had perfectly good engineers, but in the late 1990's, they had an UltraSPARC II processor with 4M L2 cache in their high-end enterprise system. I believe the L2 data had ECC (SECDED), but the tags were only parity protected (SED). Some systems started to experience mysterious failures (the ones located at high altitude?). This was ultimately traced to soft errors. It was not a simple thing to change the L2 cache tags from parity to ECC (logic in the processor itself?), so the temporary solution was to mirror the memory used for tags? (If someone knows the details, please step forward.)

Edit 2015-11-10
The Wikipedia topic ECC Memory states "ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing."
It is more correct to say ECC is used when it is necessary to correct the more common (typically single-bit) errors, and to detect certain errors involving more than 1 bit, which cannot be corrected. However, it is possible that some multi-bit errors cannot even be detected.

Edit 2015-12-10
Donaldvc pointed to this new article on IEEE Spectrum: drams-damning-defects-and-how-they-cripple-computers
Much of my knowledge is very old, from back in the days when memory chips were 1-4 bits wide. Back then, a soft error might affect many memory cells, but only one bit in any given word. Then, as memory became more dense, a soft error could affect multiple bits in a word? So some processors did ECC on a bank of 4 DIMMs = 256 bits of data, 288 bits of memory, which allowed more sophisticated algorithms. I am not sure what the Xeon E3 or E5 has. The Xeon E7 is supposed to be very sophisticated. If someone has free time, please look into this.

Published Monday, November 02, 2015 12:02 PM by jchang


Comments

 

jchang said:

Sometimes I wonder if the rudeness of my rants is too much.

but then I see this, so I think I must be reasonably even tempered.

http://www.theregister.co.uk/2015/11/01/linus_torvalds_fires_off_angry_compilermasturbation_rant/

November 2, 2015 4:01 PM
 

mark said:

It's actually a fun part of the blog :)

November 3, 2015 4:01 AM
 

GW said:

One of the very few not doing vendor speak Joe (Truth)

November 3, 2015 6:45 AM
 

Greg Low said:

Hi Joe, the lack of decent error handling in memory on many systems today amazes me. In the early 1980's I was working on large minicomputers that had 37 bit memory for storing 32 bit values. They could correct single bit errors on the fly and detect double bit errors. You'd have hoped by now that we'd be way past this.

Regards, Greg

November 11, 2015 5:07 AM
 

jchang said:

Greg, are you sure it was 37, not 39? The formula for single correct is: the smallest r check bits such that 2^r >= n + r + 1 for n data bits, so 32 data bits need 6 check bits, 38 bits total. Double detect (SECDED) needs one more, so 39 bits, leaving 32 bits for data.

You were obviously working on an advanced mini then, as a PC would have had parity protection (4 x9 SIMMs), 36 bits per 32-bit dword.

Parity went away because 1) Intel incorporated ECC into their memory controllers, then disabled it for non-server SKUs, and 2) at 64-bit width, the standard 9 bits per 8 data bits means there are sufficient bits to do ECC.
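The arithmetic can be sketched directly from the textbook Hamming bound (this is the generic formula, not any particular memory controller's code):

```python
def sec_check_bits(m):
    # Smallest r with 2**r >= m + r + 1: enough syndromes to name every
    # possible single-bit error position (plus "no error").
    r = 0
    while (1 << r) < m + r + 1:
        r += 1
    return r

def secded_width(m):
    # Double-error detection costs one extra overall-parity bit.
    return m + sec_check_bits(m) + 1

# 32 data bits: 6 check bits for SEC (38 total), 39 total for SECDED.
# 64 data bits: 7 check bits, 72 total for SECDED, the standard ECC DIMM line.
```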

The sad thing is ECC is built in, just disabled unless you happen to buy the Xeon E3 equivalent of the Core i7.

I do this for my brother's business desktops, but we don't have a choice in (non-Eurocom) laptops until the Lenovo P50/70.

Too bad I had already ordered a Dell XPS just before learning about this.

November 11, 2015 12:49 PM
 

Long time Register reader said:

There's already enough political correctness around; please keep your rants as they are, content is king anyway :)

November 13, 2015 3:22 PM
 

bhtooefr said:

It's worth noting that for mobile platforms, Intel has Xeon E3-1200M v5 SKUs (same thing as i7-6000HQ, but with ECC), and the CM236 chipset (same thing as QM170, but with ECC), too.

Granted, this is for heavier laptops (6+ pounds, not 2-4 pounds), but it is an option.

November 28, 2015 7:52 PM
 

jchang said:

I don't believe there is anything in the Xeon E3 v5 + CM236 that would make it significantly heavier than the i7 + 170. Perhaps it is that vendors are targeting the Xeon E3 at workstations with high-power graphics that result in 6 lb+, while the i7 goes into ultra-lights.

Dell shipped my XPS 16 9550; it looks like a good system. The touchpad is too large, in that when I scroll, I cannot reach the button part.

I will probably get a Xeon notebook next year.

PC vendors should stop trying to copy Apple. I want F@CKING buttons on the touchpad.

My view: Apple is successful not because of any one feature that, if copied, would make a crap system successful.

It is that the overall package reflects an obsession with being designed for a mission, and that they are not afraid to blaze new directions.

It seems that PC vendors have decision makers who got there on ass kissing skills.

I am venting again?

November 29, 2015 11:18 AM
 

donaldvc said:

Not sure if this validates your 'rant' but it's an interesting read: http://spectrum.ieee.org/computing/hardware/drams-damning-defects-and-how-they-cripple-computers

December 10, 2015 10:44 AM
 

jchang said:

thanks donaldvc, I will read this when I get a chance. That ECC for single/double bit errors is necessary has been known for a long time. From a quick skim, I think this article is stressing the importance of correlated multi-bit errors, given the organization and high density of DRAM. My comments will be at the bottom of the main post.

December 10, 2015 11:11 AM
 

independent un said:

ROWHAMMER via javascript. Most independents are

the key problem is that ECC only handles small (1-2 bit) errors.

Rowhammer will produce larger bit errors, and ECC is not a solution.

Some GENIUSES are opinionated and even use words like CRAP.

This is not a surprise. McDONALDS, making hamburgers, is featured in the movie Supersize Me, about destroying your health quickly.

Solution for independents:

browser runs on DDR2 RAM on a separate machine.

machine downloads are only via a 2nd machine connected via serial port for keyboard commands.

Obviously, never use tar (tape archive) or gzip compression on linux.

Only use star to replace tar. Only use lzip to replace gzip.

Or modify lzip for better safety against soft errors.

LZIP is the only one that passes the test. ALL others, including xz, fail!

IMHO.

Thank you.

PS. simple stupid items for independent smart consultants.

1.) strip the machine down to bare metal.

2.) check grounding straps. That has caused car engine failure.

3.) remove dust.

4.) loosen and re-seat the memory chips. IT IS MECHANICAL AND IT GETS LOOSE.

Thank you! Reminder: there is a retrofit intelligent memory chip for DDR2 and DDR3 memory, ECC with correct pinouts.

Not an approval of any vendor.

January 7, 2016 4:21 PM
 

Leonard Mitts said:

With respect to missing ECC memory no rudeness is enough. The same goes for the usage of Javascript and Java.

February 8, 2016 12:08 PM
 

Hojoung Thompson said:

Not rudeness but passion. And passion for precision can't be soft and fuzzy in its expectations. This is a great blog. I was going to buy a gaming laptop for business /DB apps because they were powerful and I was lured by the i7 threaded 4 quads and 32Gb of RAM.  I wasn't sure about the power being allocated to the graphics. Thanks for saving me a lot of headaches and money.  Still not sure if I should go with a MacBook or PC.

February 12, 2016 5:44 PM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
