With hard disk drive (HDD) capacities edging upwards – 6TB HDDs are now available – Raid is becoming increasingly problematic as a method of data protection against hardware failure.
In response, erasure coding has emerged as an alternative to Raid for protecting against drive failure.
Raid just does not cut it in the age of high-capacity HDDs. The larger a disk's capacity, the greater the chance of bit error.
And when a disk fails, the Raid rebuild process begins, during which there is no protection against a second (or third) mechanism failure. So not only has the risk of failure during normal operation grown with capacity, it is much higher during Raid rebuild, too.
Also, rebuild times were once measured in minutes or hours, but disk transfer rates have not kept pace with the rate of disk capacity expansion, so large Raid rebuilds can now take days or even longer.
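The arithmetic behind those rebuild times can be sketched roughly as follows. The transfer rates are illustrative assumptions rather than measured figures, and the function name `rebuild_hours` is hypothetical. The calculation gives only a lower bound, because a rebuild on a busy array competes with production I/O and rarely runs at the drive's full sequential speed.

```python
def rebuild_hours(capacity_tb: float, rate_mb_s: float) -> float:
    """Lower bound on rebuild time: the whole replacement drive must be written once."""
    capacity_mb = capacity_tb * 1_000_000  # decimal units, as drive suppliers quote them
    return capacity_mb / rate_mb_s / 3600


# Assumed effective rebuild rates in MB/s: full speed, then throttled by production load.
for rate in (150, 50, 25):
    hours = rebuild_hours(6, rate)
    print(f"6TB at {rate} MB/s: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Under these assumptions, a 6TB drive needs about 11 hours even at an uninterrupted 150 MB/s, and close to three days if the effective rate drops to 25 MB/s.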
Consequently, many argue that alternatives to Raid are now needed, and one alternative is erasure coding.
Erasure coding explained
Erasure coding is a method of data protection in which data is broken into fragments that are expanded and encoded with a configurable number of redundant pieces of data and stored across different locations, such as disks, storage nodes or geographical locations.
The goal of erasure coding is to enable data that becomes corrupted to be reconstructed by using information about the data that is stored elsewhere in the array – or even in another location.
It works by creating a mathematical function to describe a set of numbers so that they can be checked for accuracy and recovered if one is lost. Otherwise known as polynomial interpolation or oversampling, this is the key concept behind erasure coding methods that are implemented most often using Reed-Solomon codes.
Developed in 1960, Reed-Solomon is found most widely on CDs and DVDs, where error correction allows a player to calculate the correct information even though part of the disc's surface may be obscured. It is also used by space agencies to pick up signals from far-flung spacecraft, such as the Voyager probes.
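The polynomial interpolation idea can be sketched in a few lines of Python. This is a toy illustration, not a production Reed-Solomon implementation: it assumes arithmetic modulo the prime 257 instead of the binary fields real codecs use, and the names `encode` and `decode` are hypothetical. The k data values define a unique polynomial of degree below k, extra evaluations of that polynomial become the redundant fragments, and any k surviving fragments pin the polynomial down again.

```python
P = 257  # smallest prime above 255, so every byte value is a valid field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse of den (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, n):
    """Place the k data values at x = 0..k-1 and return n fragments as (x, y) pairs."""
    points = list(enumerate(data))
    return [(x, _interpolate(points, x)) for x in range(n)]


def decode(fragments, k):
    """Recover the k original values from any k surviving fragments."""
    survivors = fragments[:k]
    return [_interpolate(survivors, x) for x in range(k)]


data = [ord(c) for c in "Raid"]   # four data values
fragments = encode(data, 6)       # four data fragments plus two redundant ones
survivors = [fragments[1], fragments[3], fragments[4], fragments[5]]  # fragments 0 and 2 lost
assert decode(survivors, 4) == data
```

Because the first k fragments are the data itself, reads need no decoding while all elements are healthy. One limitation of the toy field is that a redundant value can be 256, which does not fit in a byte.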
Erasure coding benefits and drawbacks
Erasure coding allows for the failure of two or more elements of a storage array and so offers more protection than Raid as commonly deployed.
Marc Staimer, president of Dragon Slayer Consulting, says: "If a copy turns up bad during a checksum… it will pull from another copy of the data. When [a copy] comes up unhealthy, it will just call from a good copy and delete the one that's not healthy."
Staimer describes erasure coding as up to 10,000 times more resilient than Raid 6.
It can also rebuild from fewer elements, says Ethan Miller, computer science professor at the University of California.
"If you had, say, 12 data elements and four erasure code elements, any 12 elements from that group of 16 would be enough to rebuild the missing ones," he says. "Any 12 – it doesn't matter which four fail; you can always rebuild."
Erasure coding also consumes less storage than mirroring, which effectively doubles the volume of storage required. Erasure coding typically requires only 25% more capacity.
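Both the capacity comparison and Miller's 12-plus-four example can be put in numbers. The sketch below is a rough model only: it assumes every element fails independently with the same probability during some window, the 1% figure is an arbitrary illustration, and the function names are hypothetical. Data is lost when more elements fail than there are redundant pieces.

```python
from math import comb


def capacity_overhead(k: int, m: int) -> float:
    """Extra capacity, as a fraction of the data, for k data plus m redundant elements."""
    return m / k


def loss_probability(k: int, m: int, p: float) -> float:
    """Chance that more than m of the k + m elements fail, each independently with chance p."""
    n = k + m
    return sum(comb(n, f) * p**f * (1 - p) ** (n - f) for f in range(m + 1, n + 1))


P_FAIL = 0.01  # assumed per-element failure probability, for illustration only

schemes = {
    "mirroring (1+1)": (1, 1),
    "Raid 6-style group (12+2)": (12, 2),
    "erasure coding (12+4)": (12, 4),
    "erasure coding (16+4)": (16, 4),
}
for name, (k, m) in schemes.items():
    print(f"{name}: {capacity_overhead(k, m):.0%} extra capacity, "
          f"loss probability {loss_probability(k, m, P_FAIL):.2e}")
```

Mirroring costs 100% extra capacity, while Miller's 12-plus-four layout costs about 33%. A 25% overhead corresponds to a four-to-one ratio of data to redundant elements, such as 16 plus four. The loss probabilities depend entirely on the assumed failure rate and on the independence assumption, so they indicate only the direction of the comparison.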
However, the drawback of erasure coding is that it can be more CPU-intensive, and that can translate into increased latency.
Staimer explains: "Any time you're adding processing – which is what you're doing, because you've got to process a lot of different chunks versus just read it all as one sequential data chunk or datagram – you're going to add latency.
"And, when you have latency, that affects response time, and you can especially get high latency if you distribute this geographically or over a lot of different systems."
The more chunks that are distributed, the more resilience the technique provides, but the greater the latency. How that trade-off is made depends on the value of the data.
Erasure coding use cases
Erasure coding's high CPU utilisation and latency mean it is best suited to archiving applications, where response time matters less and where, given the long-term nature of the storage, a number of storage elements can be expected to fail over time. It is also suited to those with large datasets and a correspondingly large number of storage elements.
Staimer adds: "You could start thinking about erasure coding in the hundreds of terabytes, but once you get to a petabyte, you should definitely be thinking about it. And, once you get into the exascale range, you have to have something with erasure coding."
Erasure coding is also found in the context of object storage, with very large-volume cloud operators the most likely users now.
So, erasure coding is less suited to primary data and, like Raid, it cannot protect against threats to data integrity that are not a result of hardware failure.
For applications where latency is not an issue, such as archiving, erasure coding works by allowing data to outlive the individual storage media it sits on, none of which can offer a 100% guarantee for all time.
Erasure coding prospects
Use of erasure coding by organisations is limited at the moment, says Tony Lock, an analyst with Freeform Dynamics, who describes it as "very niche".
But this is set to change, according to Staimer. "Long term, I see it getting better," he says. "I see the latency being managed in silicon or FPGAs [field-programmable gate arrays]. I see the latency algorithms getting faster, so you're not going to have as much latency, and it can replace, for active data, Raid 5 and Raid 6.
"But, for passive data, it's going to replace Raid, period. You don't need Raid if you're using erasure coding because it provides better resilience and better durability than Raid."
Suppliers already offering erasure coding include NEC, whose Hydrastor is a scale-out global data deduplication system for long-term storage, and Cleversafe, which uses erasure coding for its large-scale dispersed storage systems.
So, erasure coding can save capacity compared with mirroring, offers higher – and configurable – levels of protection against hardware failure, is suited to very large scale and archival storage but, for the moment, less so for production data.
Lack of education among storage managers and buyers, compared with their familiarity with well-understood Raid techniques, is probably erasure coding's biggest barrier, although, as capacities grow, this could well change.