Mean time between failure vs. load/unload cycles

Mean time between failure, the traditional metric used to predict the life of a disk drive, is being challenged by counts of load/unload cycles and annualized failure rates.

Disk drives are wonderfully reliable machines, with many offering a “mean time between failure” (MTBF) of a million hours or more, a figure that suggests they can be expected to operate for that long before breaking. That lovely large number has meant disk drive vendors have historically pushed MTBF to the fore as they promote the longevity of their products, but in recent months the industry has quietly started to use other measures to describe how long you can expect a drive to deliver the goods.

One of the new metrics – the number of load/unload cycles – is currently raising a few eyebrows.

In a single “load/unload” cycle, a disk drive readies itself for operation by spinning up its spindle and moving its read/write heads off their parking ramp and over the rotating platter – the “load” – so they can go about their business reading or writing data. When the operation is finished, the heads return to the safety of the ramp and the disk may spin down – the “unload”.

The ability to measure load/unload cycles, also known as “ramp load/unload” or LUL technology, is built into almost all disk drives, but has largely been ignored in manufacturers’ descriptions of their products.

Recently, however, load/unload has made a comeback as manufacturers include it as an indicator of reliability for mobile drives aimed at consumers. Mobile drives spin down more than their desktop or enterprise cousins, in order to save battery life. It therefore makes sense to rate them according to load/unload cycles.

But vendors of disks destined for servers or storage arrays are also starting to disclose load/unload cycles. Western Digital’s new Velociraptor drives, for example, use load/unload cycles as their indicator of reliability on this “Disti Spec Sheet”, which lists the metric as the first item among its “reliability/data integrity” qualities and does not mention MTBF at all.

SearchStorage ANZ asked Western Digital why it has mentioned load/unload cycles for this drive. Intriguingly, Western Digital responded with an emailed one-liner: “WD uses MTBF as the metric for Enterprise drives.” Another disk vendor, Hitachi GST, also offered MTBF as its preferred metric for disk longevity.

Seagate pointed us to this article offering annualized failure rate (AFR) as its preferred metric, explaining that it prefers AFR because it predicts reliability across a population of drives instead of applying a single MTBF figure to every disk.
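The two metrics are mathematically related: under the exponential-failure assumption that underpins quoted MTBF figures, an annualized failure rate can be derived from MTBF. A minimal sketch (the one-million-hour figure is the example quoted earlier in this article, not any particular vendor’s rating):

```python
# Sketch: the textbook relation between MTBF and annualized failure
# rate (AFR), assuming an exponential failure distribution -- the
# usual assumption behind quoted MTBF figures.
import math

HOURS_PER_YEAR = 8760

def afr_from_mtbf(mtbf_hours: float) -> float:
    """Probability that a drive fails within one year of continuous use."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# A "one million hour" drive:
print(f"{afr_from_mtbf(1_000_000):.2%}")  # prints 0.87%
```

In other words, a million-hour MTBF does not promise any one drive 114 years of life; it implies that in a large population of such drives, a bit under one per cent will fail in any given year.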

Churn and burn

Why should you care about this battle of the acronyms?

The main reason is that under some circumstances, operating systems or applications can initiate a lot of load/unload cycles. Linux distribution Ubuntu has a flaw that cycles a drive through so many load/unload cycles that a disk could reach the end of its rated life within a few months’ use. Backup vendor Acronis has also noticed users reporting unusually high numbers of load/unload cycles. Applications that use the S.M.A.R.T. standard to measure disk status can therefore report that a disk is near the end of its working life, even though it may be well short of the lifetime its MTBF figure suggests.
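The check such S.M.A.R.T.-aware tools perform can be sketched in a few lines of Python. Attribute 193 (Load_Cycle_Count) is a standard S.M.A.R.T. attribute, but the smartctl-style sample line and the 600,000-cycle rating below are illustrative assumptions, not vendor figures:

```python
# Sketch: reading a drive's accumulated load/unload cycles from
# smartctl-style output and comparing them against a rated limit.
# The sample line and the 600,000-cycle rating are assumptions
# for illustration only.

RATED_CYCLES = 600_000  # hypothetical rating for a mobile drive

# Typical line shape from `smartctl -A /dev/sda` (format assumed):
SAMPLE_OUTPUT = """\
193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       123456
"""

def load_cycle_count(smart_output: str) -> int:
    """Extract the raw value of S.M.A.R.T. attribute 193."""
    for line in smart_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "193":
            return int(fields[-1])  # raw value is the last column
    raise ValueError("attribute 193 not reported")

cycles = load_cycle_count(SAMPLE_OUTPUT)
used = cycles / RATED_CYCLES
print(f"{cycles} cycles used, {used:.1%} of rated life")
# prints: 123456 cycles used, 20.6% of rated life
```

A drive hit by the kind of bug described above could burn through tens of cycles per hour while otherwise idle, which is how a multi-year rating can be exhausted in months.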

That variance makes it harder to compare apples with apples when buying disks.

A second reason for concern, according to David Deakin, National Practice Manager for Data Centre Solutions at storage and data centre consultancy Thomas Duryea, is that modern storage arrays are generating more load/unload cycles because they spin down disks to save power. Ignoring load/unload cycles is therefore an unsound practice.

“MAID vendors like Copan who spun down disks reduced their overall lifecycle,” Deakin told SearchStorage ANZ. Products like virtual tape libraries (VTLs) also see disks asked to perform more load/unload cycles. “SANs keep disks spinning but VTLs are more cyclical. When you are loading virtual tape, it represents disk operation,” he said.

Tiering – the practice of sending old data to older, slower arrays so that faster, newer arrays are spared for more important jobs – can also see a disk get through more cycles: when data is sent to a lower-tier disk, that disk may not be spinning, so it must perform a load/unload cycle. Arrays and disks designed to be “green” may also generate more cycles, Deakin said, as they spin down to conserve power.

But not everyone agrees that load/unload cycles are of concern. Clive Gold, EMC’s Marketing CTO for Australia, says the company designs its arrays to avoid unnecessary cycles. “The only time we park the heads is upon power down or going into idle mode,” he said, adding that he sees a further metric – contact start-stop (CSS) – as something else to watch, as it typically has a reliability threshold of around 30,000 cycles.

CSS is hardly ever mentioned in manufacturers’ descriptions of disk drives, so we presume it is factored into MTBF and AFR predictions.

Deakin, however, tries to factor all of these metrics into the designs Thomas Duryea creates for its clients.

“MTBF does not cut it anymore,” he said. “Moving forward, it [load/unload cycles] is something that needs to be considered.”