In my previous post, Hardware rant 2015, some readers reacted to my suggestion that vendors start offering the Intel Xeon E3 v5 in laptop and desktop PCs as if this were an outlandish or impractical idea.
First, doing so requires almost no additional work. Simply substitute 1) the Xeon E3 v5 for the Core i7 gen 6, 2) the server PCH (C236) in place of the desktop PCH (Z170), which are really the same thing, as are the two processors, and 3) ECC memory for non-ECC, which carries 8 extra bits alongside the normal 64. The cost of this might be one hundred dollars, driven mostly by the premium Intel charges and only slightly by the higher cost of ECC memory (8 extra bits on 64, or 12.5% more). (It is the Xeon E5 line that would be seriously impractical in a laptop that an old person could easily carry. A young, fit person might claim not to feel the difference between 4 and 15 lbs, or 2 and 6 kg.)
Second, I should explain why ECC memory is so important, far outweighing the extra cost. This is true for user systems, not just servers with 24/7 requirements. As the title states, a PC without ECC-protected memory is total crap, no exceptions, unless what you do on the PC is totally worthless, which could be the case for a significant market segment.
Basically, without any check on memory integrity, we may have no idea when and where a soft error has occurred. Perhaps the only hint is that the OS or an application crashes for no traceable reason, or serious data corruption has already occurred. Let it be clear that soft errors do occur unless you are deep underground.
Up until the early 1990s, many if not most PCs sold as desktops and laptops had parity-protected memory. Then, in the time frame of Windows 3.x, (almost?) all PC vendors switched to memory with no data integrity protection for their entire lineup of desktop and mobile PCs (with perhaps the exception of dual-processor systems based on the Pentium Pro and later, which were subsequently classified as workstations). This was done to reduce cost, eliminating the 1/9th of memory devoted to parity.
All server systems retained parity, and later switched to ECC memory, even though entry-level servers use the same processor as desktops (either with the same product name or a different one). Memory protection is implemented in the memory controller, which was in the north bridge in the past and, more recently, is integrated into the processor itself (starting with Opteron on the AMD side and Nehalem on the Intel side).
I recall that the pathetic (but valid?) excuse given to justify abandoning parity memory protection was that DOS and Windows were so unreliable as to be responsible for more system crashes than an unprotected memory system. However, since 2003 or so, new PCs have been sold with the operating system shifted to the Windows NT code base, imaginatively called Windows XP.
(In theory) Windows NT is supposed to be a hugely more reliable operating system than Windows 3.1/95, depending on the actual third-party kernel-mode drivers used. (Let's not sidetrack on this item, and pretend what I just said is really true.) By this time, the cost of sufficient DRAM, unprotected or ECC, was no longer as serious a matter, even though the base memory configuration had grown from 4MB for Windows 3.1 to 512MB if not 1GB for Windows XP or later. And yet, there was not a peep from PC system vendors on restoring memory protection, with ECC now being the standard mechanism. (I did hear IBM engineers* propose this, but nothing from PC vendors without real engineers. We don't need to discuss what the gutless wonders in product management thought.)
Presumably, soft errors are now the most common source of faults in systems from Windows NT/XP on. Apple Mac OS (from which version?) and Linux are also protected-mode operating systems, so this covers the vast majority of systems in use today. It is possible that bugs remain in third-party drivers that have not been tested under the vast range of possible system configurations (more so for performance-oriented graphics drivers?). Still, the fact that vendors do not see fit to correct the most serious source of errors in PCs today is an indication that they consider the work we do on PCs to be worthless crap, which is the same regard we should have for their products.
Let me stress again that putting out PCs with ECC memory does not require any technical innovation. ECC capability has been in entry-level server systems built from identical or comparable components all along. For some time now, Intel memory controllers have had ECC capability that can be (factory) enabled or disabled depending on the targeted market segment. (Intel does have dumbed-down chipsets for low-end PCs, but it is unclear whether ECC was actually removed from the silicon.)
A. The Wikipedia article on soft errors cites references that mention actual soft-error rates. There is a wide range of values cited, so I suggest not getting hung up on the exact rate, and treating this as an order-of-magnitude(s) figure. There is a separate entry for anyone interested in the underlying physics. Of course, there are other Wikipedia entries on the implementation of ECC.
Briefly, the prevalent source of soft errors today originates with cosmic rays striking the upper atmosphere, creating a shower of secondary particles, of which neutrons can reach down to habitable areas of Earth. Unless the environment is a cave deep underground, there will be soft errors caused by background radiation. The probability of errors also depends on the surface area of memory silicon, so a system with a single DIMM will experience fewer soft errors than a system with many DIMMs.
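To make the order-of-magnitude point concrete, here is a back-of-envelope calculation in Python. The rate assumed below (100 FIT per megabit, where 1 FIT is one failure per billion device-hours) is purely a placeholder of my choosing; the published figures span orders of magnitude.

    # Back-of-envelope soft-error arithmetic, purely illustrative.
    # The assumed FIT rate is a placeholder; treat the result as
    # order-of-magnitude only.
    FIT_PER_MBIT = 100                  # assumed: failures per 1e9 device-hours per Mbit
    MEM_GB = 16                         # system memory size
    HOURS_PER_YEAR = 24 * 365

    mbits = MEM_GB * 8 * 1024           # memory size in megabits
    fit_total = FIT_PER_MBIT * mbits    # failures per 1e9 hours, whole system
    errors_per_year = fit_total * HOURS_PER_YEAR / 1e9
    print(f"~{errors_per_year:.0f} soft errors/year at {FIT_PER_MBIT} FIT/Mbit")
    # prints ~115 under these assumptions; scale linearly for other rates

Note that the result scales linearly with memory size, which is the DIMM-count point above.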
B. Early memory modules were organized as 8 data bits plus 1 parity bit in a 30-pin x9 SIMM. Sometime in the early 1990s, around the 80486 to Pentium time, 72-pin x36 SIMMs (32 data bits, 4 parity bits) were popular. In both the x9 and x36 modules, 1 parity bit protects 8 bits of data. Parity-protected memory had the ability to detect, but not correct, single-bit errors in an 8-bit "line".
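A minimal sketch of that 8+1 parity scheme in Python (the function names are mine, for illustration only):

    # Even parity over 8 data bits: detects any single-bit error,
    # but a 2-bit error cancels out and goes undetected.
    def parity_bit(byte):
        p = 0
        for i in range(8):
            p ^= (byte >> i) & 1
        return p

    def check(byte, stored_parity):
        # False means a parity error; old PCs raised an NMI and halted
        return parity_bit(byte) == stored_parity

    b = 0b10110010
    p = parity_bit(b)
    corrupted = b ^ (1 << 3)                  # simulate a single-bit soft error
    print(check(b, p), check(corrupted, p))   # True False

Note that detection is all parity gives you: the bad bit cannot be identified, let alone corrected.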
A few high-end servers in this era had ECC memory, which may have been implemented with 2 x36 memory modules forming a 64-bit line with the 8 parity bits used for ECC, or perhaps a custom memory module? Later on, memory modules progressed to DIMMs, having 64 bits of data with an allowance for 8 additional bits for ECC. The base implementation of ECC is a 72-bit line with 64 bits for data and 8 bits for ECC. This allows the ability to detect and correct single-bit errors and to detect, but not correct, 2-bit errors (SECDED). More than 2 bits in error could potentially constitute an undetected error (depending on the actual ECC implementation). There are also other ECC strategies, such as grouping 4 x72 DIMMs into a line, allowing the ability to detect and correct the failure of an entire x4 (or x8?) DRAM chip, when each DIMM comprises 18 x4 chips, each chip providing 4 bits of data.
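To make SECDED concrete, here is a toy extended Hamming(8,4) code in Python: 4 data bits, 3 Hamming check bits, and 1 overall parity bit. The real 72/64 code on DIMMs is wider but built on the same principle; this sketch is mine and does not reflect any vendor's actual implementation.

    # Extended Hamming(8,4): single-error-correct, double-error-detect.
    def encode(d):                      # d: a 4-bit integer
        bits = [(d >> i) & 1 for i in range(4)]
        c = [0] * 8                     # c[1..7] Hamming positions, c[0] overall parity
        c[3], c[5], c[6], c[7] = bits
        c[1] = c[3] ^ c[5] ^ c[7]       # covers positions with bit 0 set
        c[2] = c[3] ^ c[6] ^ c[7]       # covers positions with bit 1 set
        c[4] = c[5] ^ c[6] ^ c[7]       # covers positions with bit 2 set
        c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
        return c

    def decode(c):
        syndrome = 0
        for pos in range(1, 8):         # XOR the positions holding a 1 bit
            if c[pos]:
                syndrome ^= pos
        overall = 0
        for pos in range(8):
            overall ^= c[pos]
        if syndrome == 0 and overall == 0:
            return "no error", c
        if overall == 1:                # odd overall parity: a single-bit error
            c = c[:]
            c[syndrome if syndrome else 0] ^= 1   # syndrome 0 means c[0] itself flipped
            return "corrected single-bit error", c
        # syndrome nonzero with even overall parity: two flipped bits
        return "uncorrectable double-bit error", c
        # 3 or more flipped bits can alias to one of the cases above,
        # i.e. an undetected or miscorrected error, as noted in the text

    w = encode(0b1011)
    w[6] ^= 1                           # one flipped bit
    print(decode(w)[0])                 # corrected single-bit error
    w[3] ^= 1                           # a second flipped bit
    print(decode(w)[0])                 # uncorrectable double-bit error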
C. At the hardware level, if an error is detected and corrected, the operating system and applications continue to function; the event can be logged at the system level. For a detected but uncorrectable error, the hardware should cause a blue screen OS crash.
An undetected error is just that. It is undetected. The system continues running with incorrect memory content.
Depending on the nature of the memory corruption, anything can happen.
It could be executable code, in which case an instruction changes.
It could be critical operating system data, causing a subsequent memory access to read or write the wrong location, which could have serious corruption consequences. It could also be end data, whether a number, character, or control value, which may or may not be critical.
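As a small illustration of the end-data case, flipping a single bit in an IEEE 754 double can produce anything from a negligible change to a catastrophic one, depending on which bit is hit (illustrative Python, not modeling any particular failure):

    import struct

    # Flip one bit in the binary representation of a double.
    def flip_bit(x, bit):
        (raw,) = struct.unpack("<Q", struct.pack("<d", x))
        (y,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
        return y

    print(flip_bit(1000.0, 0))          # low mantissa bit: 1000.0000000000001
    print(flip_bit(1000.0, 62))         # high exponent bit: roughly 5.6e-306

A flip in a pointer or an instruction is the same mechanism, just with crashier consequences.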
* It is probably more correct to say that soft errors are the province of scientists/physicists, not engineers. Sun had perfectly good engineers, but in the late 1990s, they had an UltraSPARC II processor with a 4M L2 cache in their high-end enterprise systems. I believe the L2 data had ECC (SECDED), but the tags were only parity-protected (SED). Some of these systems started to experience mysterious failures (the ones located at high altitude?). This was ultimately traced to soft errors. It was not a simple thing to change the L2 cache tags from parity to ECC (logic in the processor itself?), so the temporary solution was to mirror the memory used for the tags? (If someone knows the details, please step forward.)
The Wikipedia topic ECC Memory states:
"ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing."
It is more correct to say that ECC is used when it is necessary to correct the more common (typically single-bit) errors, and to detect certain errors involving more than 1 bit, which cannot be corrected. However, it is possible that some multi-bit errors cannot even be detected.
Donaldvc pointed to this new article on IEEE Spectrum.
Much of my knowledge is very old, from back in the days when memory chips were 1-4 bits wide.
Back then, a soft error might affect many memory cells, but since each chip supplied only one (or a few) bits to a word, it would hit only one bit in any given word.
Then as memory became more dense, a soft error could affect multiple bits in a word?
So processors did ECC on a bank of 4 DIMMs (256 bits of data, 288 bits of memory), which allowed more sophisticated algorithms.
I am not sure what the Xeon E3 or E5 has; the Xeon E7 is supposed to be very sophisticated.
If someone has free time, please look into this.