Pull out a microscope and peer at the surface of a hard disk and you’ll see a bumpy landscape of exotic metals arrayed in reasonably neat patterns.
The metals need to be neat because a disk drive applies a very precise magnetic pulse to a very small region of the disk, changing its magnetic orientation to denote stored data.
Sometimes, those regions spontaneously lose or change their magnetisation, a phenomenon known as “flipping.” When a region on a disk flips, the data it contains is erased, corrupted or rendered unreadable. To capture the mysterious nature of this degradation, the industry has adopted the organic-sounding term “bit rot.”
Storage array vendors are aware of bit rot and build their products to identify flaws in disks before they place them in arrays, and then monitor disks in production to detect rot before it becomes a problem.
“EMC only purchases, and then sells, drives that have a low percentage of ‘manufacturing’ sector failures,” explains Clive Gold, Marketing Chief Technology Officer for EMC Australia New Zealand.
The company also scans drives to make sure bit rot is not destroying data.
“All data that is received by the front end is ‘tagged’ and this allows the backend to check the data that is stored on the disk to ensure it hasn’t changed as it has gone through the storage system,” Gold explains. “In fact, where an application like Oracle databases has a checksum, we use that to ensure end-to-end integrity, from application to the rust on the disk! These technologies do detection as well as correction.”
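The tagging Gold describes can be pictured with a short Python sketch. It is illustrative only and does not represent EMC's implementation: the function names `tag_block` and `verify_block` are hypothetical, and CRC-32 is assumed as the checksum simply because it ships with Python's standard library.

```python
import zlib

def tag_block(data: bytes) -> tuple[bytes, int]:
    """Front end: attach a checksum 'tag' to the data as it arrives."""
    return data, zlib.crc32(data)

def verify_block(data: bytes, tag: int) -> bool:
    """Back end: recompute the checksum and compare it with the stored tag."""
    return zlib.crc32(data) == tag

# Tag a block on the way in.
block, tag = tag_block(b"customer record 42")

# Simulate bit rot: flip a single bit in a copy of the block.
rotted = bytearray(block)
rotted[0] ^= 0x01

print(verify_block(block, tag))          # intact data matches its tag
print(verify_block(bytes(rotted), tag))  # the flipped bit is detected
```

A checksum on its own only detects a change. Correcting it requires a redundant copy or parity data to rebuild from, which is where RAID comes in.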
Adrian De Luca, Hitachi Data Systems’ Director of Pre-Sales and Solutions for Australia and New Zealand, says his company also takes care to ensure that damaged drives don’t destroy data, through connectivity precautions as well as corruption checks.
“HDS ensures all physical disk drives are dual-ported into the backplane, controllers and cache to ensure there is no physical single point of failure as data comes in through the front end controllers and out to the physical disks,” he says. “We also support Oracle H.A.R.D (Hardware Assisted Resilient Data) to prevent corrupted data blocks generated in the database-to-storage system infrastructure from being written onto the disk storage.”
How dangerous is bit rot?
While bit rot is something most storage vendors work to counter, NetApp has recently sponsored studies that play down the risk it poses.
“While ‘bit rot’ has received a reasonable amount of attention recently, two NetApp-sponsored studies show that bit rot is far less of a problem for storage array reliability than many other factors,” says John Martin, Principal Technologist for NetApp Australia New Zealand.
One of the papers Martin refers to, “A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID)” by Jon G. Elerath and Michael Pecht, appeared in IEEE Transactions on Computers, vol. 58, no. 3, March 2009.
Martin summarises the paper by saying that bit rot is a risk, as it “raises the spectre, not just of a lost or corrupted file, but of the potential to completely lose an entire RAID group after the failure of a single drive due to the ‘Media Error on Data Reconstruct’ problem.”
But Martin adds that “The less catastrophic issue on an enterprise class array is far less because the additional error detection and correction available through the use of RAID and block level checksums means the chances of bit rot causing the loss or corruption of a file is vanishingly remote.”
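For a sense of scale, the “Media Error on Data Reconstruct” scenario can be put in rough numbers. The Python sketch below is not the model from Elerath and Pecht’s paper. It assumes that read errors are independent, that a drive suffers one unrecoverable read error per 10^14 bits read, and that a rebuild must read seven surviving 1TB drives in full. All three figures are illustrative assumptions, not vendor data.

```python
import math

def rebuild_error_probability(surviving_drives, drive_bytes, ure_per_bit):
    """Chance of at least one unrecoverable read error during a full rebuild.

    Assumes every bit on every surviving drive is read once and that
    errors are independent: P = 1 - (1 - ure_per_bit) ** bits_read.
    """
    bits_read = surviving_drives * drive_bytes * 8
    # log1p/expm1 keep the arithmetic accurate for very small error rates.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

# Assumed figures: 7 surviving drives, 1TB each, 1 error per 10^14 bits.
p = rebuild_error_probability(7, 1e12, 1e-14)
print(f"Chance of a media error during rebuild: {p:.1%}")
```

Under those assumptions a single-parity group with no further protection has a little over a 40 per cent chance of hitting an unreadable block mid-rebuild, which is the exposure that the extra checksums and parity Martin describes are meant to close.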
Drawing on Elerath and Pecht’s paper, Martin therefore offers four other phenomena as more likely sources of data loss, namely:
• “Thermal asperities” - Instances of high heat for a short duration caused by head-disk contact. This is usually the result of heads hitting small “bumps” created by particles embedded in the media surface during the manufacturing process. The heat generated on a single contact may not be sufficient to thermally erase data but may be sufficient after many contacts;
• Disk head issues - Disk heads are designed to push particles away, but contaminants can still become lodged between the head and disk. Hard particles used in the manufacture of an HDD can cause surface scratches and data erasure any time the disk is rotating;
• Soft particle corruption - Other “soft” materials such as stainless steel can come from assembly tooling. Soft particles tend to smear across the surface of the media, rendering the data unreadable;
• Corrosion - Although carefully controlled, corrosion can also cause data erasure and may be accelerated by the heat that thermal asperities generate.
Whatever the cause of lost data, storage administrators need a way to combat it, and NetApp’s Martin recommends “disk scrubs,” the practice of periodically reading every block on a disk to find and repair problem sectors before the data is needed. Another alternative is to “Use additional levels of RAID protection such as RAID-6 which allows for higher levels of resiliency and error correction in the event of hitting a latent block error when reconstructing a RAID set. NetApp uses both approaches as studies have shown that the risk of losing data through these kinds of events is thousands of times higher than predicted by most simple ‘MTBF’ failure models.”
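How a scrub and parity work together can be sketched in a few lines of Python. This is a toy model rather than NetApp’s implementation: it assumes a scrub reads and verifies blocks rather than erasing them, it uses a single XOR parity block per stripe in the style of RAID-4 or RAID-5, and the names `xor_blocks` and `scrub` are hypothetical.

```python
import zlib
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def scrub(stripe, checksums, parity):
    """Verify every block in a stripe and rebuild any that fail their checksum.

    Returns the indices of repaired blocks. Raises ValueError if a rebuilt
    block still fails its checksum, which is what happens when more blocks
    are damaged than single parity can cover.
    """
    repaired = []
    for i, block in enumerate(stripe):
        if zlib.crc32(block) != checksums[i]:
            others = [b for j, b in enumerate(stripe) if j != i]
            rebuilt = xor_blocks(others + [parity])
            if zlib.crc32(rebuilt) != checksums[i]:
                raise ValueError(f"block {i} cannot be rebuilt from single parity")
            stripe[i] = rebuilt
            repaired.append(i)
    return repaired

# Build a three-block stripe with per-block checksums and one parity block.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
checksums = [zlib.crc32(b) for b in stripe]
parity = xor_blocks(stripe)

# Simulate a latent error on the second block, then scrub the stripe.
stripe[1] = b"BBBC"
repaired = scrub(stripe, checksums, parity)
print(repaired, stripe[1])
```

With one parity block the scrub can repair only one bad block per stripe. A second damaged block in the same stripe raises the error, and that gap is what a second parity block in RAID-6 is designed to cover.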
Keith Busson, Quantum’s Country Manager for Australia and New Zealand, has more prosaic advice for ameliorating bit rot.
“Quantum recommends that IT organisations stage practice data recoveries on a regular basis,” he says. “It is important to demonstrate the ability of fast, comprehensive data recovery before it is required in an emergency situation. Such testing is a test not only of hardware and software but of people and processes.”