But as hard drive capacities have grown to multiple terabytes, the overheads of the Raid protection system are starting to make it creak. In some circumstances, for example where Raid rebuild times become too lengthy, Raid is no longer an appropriate protection method.
To fix this problem, suppliers have turned to erasure coding, which promises to fill the gap left by Raid when deployed at scale.
In the era of the mainframe, disk drives (particularly those provided by IBM) were delivered with no built-in data protection at the hardware or operating system level. Failed drives were recovered manually by reclaiming data file-by-file from each track of the platter.
Raid was invented as a solution to the drive failure problem, especially that encountered in minicomputer and distributed systems where recovering data from a failed drive was difficult or impossible to achieve.
Raid protects data using a variety of techniques that either replicate the data itself or use parity information to rebuild data lost from a failed device.
At a basic level, Raid 1 simply replicates all data between two drives. This uses 100% additional capacity but incurs no compute overhead in storing the second copy, other than the work of writing the data to another drive.
This scheme is expensive in terms of capacity, so additional protection algorithms were developed that do not double required capacity. Instead they incur a “compute” cost to create what is known as parity data for each block of data stored.
Typical configurations today include Raid 5, which uses two or more data disks and a parity disk (although in practice the data and parity are spread across all the disks to even out I/O performance), and Raid 6, which uses two parity disks.
A typical Raid 5 configuration (or group) of three data disks and one parity disk (written as 3+1) has a 33% overhead (one parity disk for every three data disks) or 75% usable space (three of the four disks hold data), depending on how the calculation is expressed.
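The two ways of expressing that figure can be sketched in a few lines. This is an illustrative calculation only; the function name `raid_overhead` is an assumption made for the example, not part of any Raid tool.

```python
def raid_overhead(data_disks, parity_disks):
    """Return (overhead, usable) for a Raid group, both as fractions."""
    total_disks = data_disks + parity_disks
    overhead = parity_disks / data_disks   # extra capacity relative to the data stored
    usable = data_disks / total_disks      # share of raw capacity that holds data
    return overhead, usable


overhead, usable = raid_overhead(3, 1)     # the 3+1 group described above
print(f"3+1: {overhead:.0%} overhead, {usable:.0%} usable")
```

The same function applies to any group size, so it can be used to compare wider groups or groups with a second parity disk.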
If any drive fails, the lost data can be calculated from the remaining data and parity information. Parity is calculated as a binary XOR across the data blocks in a stripe as they are written to disk. Recovery simply XORs the parity with the data that is still available to recover the lost information.
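A minimal sketch of the XOR idea, assuming three equal-sized data blocks and a single parity block. The helper `xor_blocks` and the sample data are invented for illustration; real Raid controllers work on fixed-size stripes in hardware or in the operating system.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"disk", b"raid", b"code"]   # three data "disks" (illustrative contents)
parity = xor_blocks(data)            # the parity "disk"

# Simulate losing the second data disk, then rebuild it from what is left.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

Because XOR is its own inverse, the same function creates the parity and rebuilds a lost block. It can only recover one missing block per group, which is why a second failure during a rebuild is fatal for single-parity Raid.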
Unfortunately, Raid has a number of shortcomings. As disk drives have increased significantly in capacity over the years, the time required to rebuild a failed disk has also increased, to the point where rebuilds can take days or weeks to complete. Estimating recovery time can be tricky because it depends on the host I/O workload already on the system.
During the rebuild process data is not protected against a second disk failure in the same group – known as a double disk failure – and if this occurs before the rebuild is completed, data is lost.
The reaction from suppliers has been to increase the number of parity disks, hence the evolution of Raid 6, which provides two parity disks in each group, but also increases the overhead of the protection mechanism.
One answer to the overhead issue is to increase the number of disks in the Raid group. Theoretically, there is no limit to the number of data and parity disks in a Raid group. Some suppliers deploy their arrays with up to 28+2 configurations – 28 data disks with two parity – which represents only a 7% additional cost.
But, here lies the second problem with Raid as disk capacities have increased – the problem of an unrecoverable read error.
An unrecoverable read error occurs when a request to read data from disk fails for some reason. Today’s hard drives are remarkably reliable, so the risk of an unrecoverable read error is extremely small, but it is not zero.
Sata drives can have an error rate of one bit in every 10TB to 12TB read, which, with large Raid stripes and high-capacity drives (now 8TB), makes a failure during a rebuild a very real possibility.
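To get a rough feel for that risk, the sketch below estimates the chance of hitting at least one unrecoverable read error while rebuilding. It rests on several assumptions that are not from any supplier's specification: one error per 12TB read (the upper end of the figure quoted above), errors that occur independently (a Poisson model), and a rebuild that reads every surviving disk in full. The function name is hypothetical.

```python
import math


def rebuild_error_probability(surviving_disks, disk_tb, tb_per_error=12.0):
    """Estimate the chance of at least one unrecoverable read error in a rebuild."""
    tb_read = surviving_disks * disk_tb        # assumes every surviving disk is read in full
    expected_errors = tb_read / tb_per_error   # mean of the assumed Poisson process
    return -math.expm1(-expected_errors)       # equals 1 - e^(-expected_errors)


# A 3+1 Raid 5 group of 8TB drives: three surviving disks are read to rebuild the fourth.
print(f"{rebuild_error_probability(3, 8):.0%}")
```

Under these deliberately pessimistic assumptions the estimate comes out at well over half. Treat it as an upper-bound illustration of why large drives and wide groups are a concern, not as a prediction for any real array.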
Some suppliers mitigate this risk by performing “predictive” failures – logically failing a disk in a Raid group and copying the data off without a rebuild, before the drive actually “hard fails”.
Note that in practice, unrecoverable read error rates have been seen to be much lower than quoted by the manufacturers, but this is no guarantee of data security as high unrecoverable read error rates have also been seen.
Using Raid to protect large volumes of data, such as that found in object storage, is largely impractical. The reasons are the sheer number of drives involved, the number of Raid groups required and the amount of rebuilding needed when each recovery task involves recreating the contents of an entire disk drive.
The solution adopted by suppliers is known as erasure coding or forward error correction, and is similar to the technology used in data transmission across radio networks.
Erasure coding works in the same way as Raid in that extra information is created from the actual data being stored that is used in the recovery process. The difference with erasure coding lies in the process of calculating that additional data.
In a system that uses erasure coding, source data is divided into blocks. These are typically multiple megabytes in size, much larger than the blocks in a Raid-based system, because the erasure coding process is more computationally expensive and larger blocks spread that cost over more data.
Each block goes through a transformation process that produces a number of slices or shards of data from the original block. To recover the original data, a minimum number of shards are required from the total initially created. For example, a block of data could produce 16 shards, of which any 12 are needed to reconstitute the original information.
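The "any 12 of 16" property can be illustrated with a toy Reed-Solomon-style code. This is a sketch under stated assumptions, not how any particular product implements erasure coding: it works on individual byte values using arithmetic modulo the prime 257 (so a shard value may be 256 and not fit in a byte), whereas production systems typically use GF(2^8) arithmetic on large blocks. The names `encode` and `decode` are invented for the example, and `pow(x, -1, P)` needs Python 3.8 or later.

```python
P = 257  # smallest prime above 255, so every byte value is a valid field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial through `points`, modulo P (Lagrange form)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, n):
    """Turn k data symbols into n shards (x, y); the first k shards carry the data itself."""
    points = list(enumerate(data, start=1))
    return [(x, _interpolate(points, x)) for x in range(1, n + 1)]


def decode(shards, k):
    """Rebuild the k original symbols from any k surviving shards."""
    if len(shards) < k:
        raise ValueError("not enough shards to recover the data")
    points = shards[:k]
    return [_interpolate(points, x) for x in range(1, k + 1)]


data = list(b"erasure code")     # 12 symbols, so k = 12
shards = encode(data, 16)        # 16 shards in total
survivors = shards[4:]           # lose any four shards; here, the first four
print(bytes(decode(survivors, 12)))
```

The k data symbols define a polynomial of degree below k, and every shard is one point on it. Any k points pin the polynomial down again, which is why it does not matter which four shards are lost.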
The sharding process has a number of benefits. First, assuming all shards are on separate disks, a single disk failure doesn’t result in a loss of data protection, and second, recovery doesn’t require reading all the other data components – only enough to rebuild a missing shard.
However, reading any data block requires accessing the minimum number of shards (12 in our example) and performing a calculation on the pieces, which results in more I/O requests and compute time to reconstitute the data.
Erasure coding provides one additional advantage that Raid can’t achieve. If I want to provide a backup copy of my data, traditional Raid systems require the use of an entire replica of the data in a secondary location.
But with erasure coding the system can simply distribute the shards, ensuring that enough shards survive at the remaining sites if one site fails. For example, in a 16/12 scheme no single site can hold more than four shards if the data is to survive that site's loss, so a four-site system with four shards in each location can tolerate the loss of a single site without data loss.
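Whether a given shard placement survives the loss of a site comes down to simple arithmetic: after removing any one site, the shards that remain must still number at least the minimum needed for recovery. A minimal sketch, with a hypothetical function name and illustrative placements:

```python
def survives_site_loss(shards_per_site, shards_needed):
    """True if the data is still recoverable after losing any one site."""
    total = sum(shards_per_site)
    return all(total - lost >= shards_needed for lost in shards_per_site)


print(survives_site_loss([4, 4, 4, 4], 12))   # 16 shards spread evenly over four sites
print(survives_site_loss([8, 8], 12))         # the same 16 shards split over two sites
```

Put another way, with 16 shards and 12 needed, a placement is safe only if no single site holds more than four shards.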
Raid vs erasure coding
Although erasure coding seems to be a good replacement for Raid, in fact each has merits in different circumstances.
Erasure coding is not good for small-block I/O like that of block-based storage arrays, as the overhead of the coding calculation significantly affects performance.
Instead, erasure coding works best with large data found in object stores and file systems. Raid continues to be best for small-block data.
Erasure coding is best for large archives of data where Raid simply can’t scale due to the overheads of managing failure scenarios. Typically, these types of systems aren’t built for performance, but rather capacity.
It’s worth noting that some suppliers are introducing the ability to mix Raid and erasure coding in their systems, depending on the type of data being stored. In this way they are optimising the protection process for the customer.