Everything Google knows about disk failure

Google has released a paper analysing the performance of more than 100,000 disk drives, which concludes that drives may not fail for the reasons most people assume.

Google has conducted a study of the reasons hard drives fail, using information gathered from more than 100,000 of its own disks.

The study, which Google says is "...unprecedented in that it uses a much larger population size than has been previously reported", presents "...a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime."

The study's key finding was "...the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment."

Google's data was collected by tapping into the drives' self-monitoring facility (SMART). The results "confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities.

"We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities."

"Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."
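The two findings quoted above can look contradictory: an error signal that multiplies failure risk many times over, yet a model that still misses most failures. The short Python sketch below shows how both can hold at once. The fleet sizes and failure counts are invented purely for illustration and are not figures from Google's paper, and the helper names `relative_risk` and `recall` are hypothetical.

```python
# All counts below are hypothetical, chosen only to illustrate the arithmetic.
# They are NOT taken from Google's paper.

def relative_risk(failed_flagged, total_flagged, failed_clean, total_clean):
    """Failure rate of drives that showed a SMART error, divided by the
    failure rate of drives that showed none."""
    return (failed_flagged / total_flagged) / (failed_clean / total_clean)


def recall(failed_flagged, failed_clean):
    """Share of all failed drives that showed a SMART error beforehand."""
    return failed_flagged / (failed_flagged + failed_clean)


# A made-up fleet: 2,000 drives reported an error, 98,000 reported none.
failed_flagged, total_flagged = 600, 2_000
failed_clean, total_clean = 980, 98_000

rr = relative_risk(failed_flagged, total_flagged, failed_clean, total_clean)
rc = recall(failed_flagged, failed_clean)

print(f"relative risk: {rr:.0f}x")
print(f"failures preceded by an error: {rc:.0%}")
```

With these made-up numbers, drives that reported an error are about 30 times more likely to fail, yet only about 38% of all failures were preceded by an error, because the error-free drives vastly outnumber the flagged ones. A predictor that relies on the error signal alone could never catch the remaining failures, which is the limitation the study describes.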

The full report is available for download at http://labs.google.com/papers/disk_failures.pdf
