Fault tolerance: A SMART move

The SMART system allows for a high degree of fault tolerance when incorporated in hard disk drives

Is there such a thing as enough reliability? In industries providing capabilities that are crucial to the day-to-day productivity and survival of their clients, the answer is clearly no. And while many companies are investing significant resources in elevating the reliability of individual components and devices, overall system reliability is not always addressed.

By broadening the scope of reliability enhancements from the device level to the system level, system designers and integrators can leverage the full capabilities of the system, not just its components, to devise more complete, intelligent solutions that benefit overall system reliability.

The S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) system was designed precisely with this approach in mind. By using S.M.A.R.T. technology, virtually any intelligent component or device within a computer can communicate its predicted reliability status to its user and system administrator to provide comprehensive protection that can prevent system downtime, productivity loss, and even the loss of valuable data.

Since user data is not always as easy to replace as hardware devices, most computer users will argue that their data is the most valuable element of their computer system. In recognising the need to elevate the protection of user data, hard drive manufacturers have pioneered the implementation of the S.M.A.R.T. System for hard disk drives.

Through the S.M.A.R.T. System, disk drives incorporate a suite of advanced diagnostics that monitor the internal operations of a drive and provide an early warning for many types of potential problems. If a problem is detected, the drive can be scheduled for replacement before any data is lost. The result? Higher productivity and increased data security.

The S.M.A.R.T. system for disk drives is designed to revolutionise overall system and data reliability. The system comprises software that resides both on the disk drive and on the host computer. The disk drive software monitors the internal performance of the motors, media, heads, and electronics of the drive, while the host software monitors the overall reliability status. The reliability status is determined through the analysis of the drive's internal performance level and by comparing internal performance levels with predetermined threshold limits.
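The threshold comparison described above can be sketched in a few lines of Python. This is only an illustration, not any manufacturer's firmware logic: the attribute names, the sample values, and the convention that a lower normalised value means worse health are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str        # hypothetical attribute label
    value: int       # current normalised level (assumed: higher is healthier)
    threshold: int   # predetermined limit for this drive model (made up here)


def failing_attributes(attributes):
    """Return the names of attributes at or below their threshold."""
    return [a.name for a in attributes if a.value <= a.threshold]


def reliability_status(attributes):
    """Summarise the drive as 'OK' or 'FAILURE PREDICTED'."""
    return "FAILURE PREDICTED" if failing_attributes(attributes) else "OK"


# Illustrative readings only; real drives choose their own measures.
drive = [
    Attribute("spin_up_time", value=92, threshold=25),
    Attribute("seek_error_rate", value=30, threshold=30),
    Attribute("read_error_rate", value=100, threshold=51),
]

print(reliability_status(drive))   # FAILURE PREDICTED
print(failing_attributes(drive))   # ['seek_error_rate']
```

In this sketch a single attribute reaching its limit is enough to flag the drive, which matches the idea of a suite of independent diagnostics rather than one combined score.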

In some configurations the host may play a more active role and direct the drive to perform diagnostic tests and report the results. The host may then make comparisons with previous results, or against values set for that particular drive model, and take appropriate action.

Upon determination of a critical reliability condition where a device failure has been predicted, the host software warns the user of the impending condition and advises the user to take appropriate action to protect the data being stored. In more advanced implementations, the host could notify a network administrator and automatically reduce the workload on the device, relocate key files, and even begin a backup of the data to tape or other disk drives.
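The escalation described above might be organised on the host roughly as follows. The status names, the action labels, and the `advanced_host` flag are hypothetical and chosen for illustration; the S.M.A.R.T. specification does not prescribe them.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    FAILURE_PREDICTED = "failure predicted"


def plan_response(status, advanced_host=False):
    """Return the ordered list of actions a host might take (labels are illustrative)."""
    if status is Status.OK:
        return []
    # Every compliant host at least warns the user.
    actions = ["warn_user"]
    if advanced_host:
        # More advanced implementations can act without waiting for the user.
        actions += [
            "notify_administrator",
            "reduce_workload",
            "relocate_key_files",
            "start_backup",
        ]
    return actions


print(plan_response(Status.FAILURE_PREDICTED))
# ['warn_user']
print(plan_response(Status.FAILURE_PREDICTED, advanced_host=True))
# ['warn_user', 'notify_administrator', 'reduce_workload', 'relocate_key_files', 'start_backup']
```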

In recent years, disk drives have gone from being reliable to being supremely reliable. With product quality levels reaching 99.96 per cent and actual field failure levels as low as 0.27 per cent, hard disk drives are among the most reliable man-made products in the world. One measure of this increasing reliability is that the MTBF (Mean Time Between Failures) rating for a typical drive has climbed to 300,000 hours, and MTBFs of 800,000 hours or more are often found in high-end drives. By comparison, the MTBF ratings for hard disk drives are 10 to 30 times higher than those of typical floppy disk drives and CD-ROM drives.

Yet, despite significant gains in MTBF ratings, system managers and end users have continued to press for improvements in reliability. The increase in MTBF ratings alone has not been enough to meet this need for three main reasons:

The number of drives per system continues to increase.

The capacity of the typical drive continues to grow.

MTBF is a statistical rating that applies to a large population of devices, not to a specific device in a particular system.

Today, a computer installation may include multiple servers attached to several RAID cabinets. The number of drives installed may have increased from five or 10 a decade ago to more than 100 today.

With this increase in the number of drives has come a dilution of overall system reliability. An installation with 100 or so drives, each rated at a 300,000-hour MTBF, can be expected to have a failure every 3,000 hours, or about three times a year in continuous operation.
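The arithmetic behind that estimate is easy to check. The sketch below assumes that drives fail independently at a constant rate and run around the clock; neither assumption is stated in the text, and the function names are invented for the example.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def hours_between_failures(mtbf_hours, drive_count):
    """Expected interval between failures anywhere in the installation."""
    return mtbf_hours / drive_count


def failures_per_year(mtbf_hours, drive_count):
    """Expected failures per year, assuming continuous operation."""
    return HOURS_PER_YEAR / hours_between_failures(mtbf_hours, drive_count)


print(hours_between_failures(300_000, 100))       # 3000.0
print(round(failures_per_year(300_000, 100), 2))  # 2.92
```

Drives that are powered for only part of the day would accumulate those 3,000 hours more slowly, so the failure count per calendar year would be lower.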

Today's drives are considerably larger than those of just a few years ago. Each potential drive failure now puts more data at risk. Therefore, ensuring the reliability of each drive has become more important.

Similarly, the MTBF values calculated by manufacturers apply to a large population of drives of a particular model. This average across all drives can only broadly reflect the reliability to be expected of any one drive.

With the S.M.A.R.T. system, data reliability will evolve from the general and statistical to the specific and individual. Part of what makes the S.M.A.R.T. system possible is that disk drive reliability has been intensely studied for many years. Though difficulties remain and new technologies are continually being introduced, the key vital areas have been well explored. By analysing the vital functions of disk drive components and understanding their typical failure mechanisms, disk drive designers can not only develop more reliable products, but also apply their knowledge to the prediction of device failures.

Through research and monitoring of vital functions, performance thresholds which correlate to imminent failure can be determined. By applying these thresholds to the monitoring of each individual device, the S.M.A.R.T. system achieves the goal of effective failure prediction.

In addition to handling damage, assembly defects or material defects, environmental conditions can also contribute to device failure or the loss of data. The mobile computing environment, for example, exposes systems to a broad range of extreme adverse conditions such as shock, vibration, temperature, and humidity. Exposure to these types of conditions can promote a variety of failure mechanisms.

The specific measures and techniques used in hard drives are selected individually for each design. They can vary by model and change over time as the drive architecture changes and diagnostic techniques improve.

It is important to note that no one measure is effective for all problem areas. S.M.A.R.T. is truly a suite of diagnostics. Regardless of the failure area and the components involved, failures can be identified as one of two broad types: predictable or non-predictable. The predictable failures show a gradual and detectable decline in performance, and there is a known threshold for acceptable performance. The challenge of designing an early warning algorithm is to identify the threshold and detect the decline.
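One simple way to picture an early warning algorithm for a predictable failure is to fit a straight line to recent readings and estimate when it will reach the threshold. This is a sketch of the general idea only: the linear model, the readings, and the threshold are assumptions, and real drives use whatever analysis their designers judge best.

```python
def slope_per_sample(readings):
    """Least-squares slope of readings taken at equal intervals (needs 2+ readings)."""
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def samples_until_threshold(readings, threshold):
    """Estimate how many more samples remain before the trend hits the threshold.

    Returns 0 if the latest reading is already at or below the threshold,
    and None if the readings show no decline.
    """
    latest = readings[-1]
    if latest <= threshold:
        return 0
    slope = slope_per_sample(readings)
    if slope >= 0:
        return None
    return (latest - threshold) / -slope


history = [100, 98, 96, 94, 92]  # made-up performance readings
print(samples_until_threshold(history, threshold=80))  # 6.0
```

A measure that never trends downward returns `None` here, which corresponds to the non-predictable case: there is no decline for the algorithm to detect.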

Some measures, like power-on hours or number of contact start/stops, are easy to measure, but have no clear limit. They are somewhat like the number of miles on a car's milometer - 100,000 is a high amount, but does not mean that the particular car will fail anytime soon.

Non-predictable failures either show no gradual decline in performance, or the measures needed to detect them cannot be accomplished by the drive. Some failures of integrated circuits could perhaps be predicted by monitoring microscopic cracks in the substrate and circuits, but an electron microscope is needed.

As drive technologies advance, some non-predictable failures may become predictable. But for now it is important to acknowledge the limits of the S.M.A.R.T. system - advanced diagnostics can provide an early warning for many failures, but not all.

With the S.M.A.R.T. system, drive manufacturers are pioneering an open standard that can bring a new level of data security to the industry. It incorporates a number of key features:

It is an open industry standard, developed and endorsed by Compaq Computer and other industry leaders. The specification has been published by the disk drive industry's Small Form Factor Committee under document number SFF-8035.

The technology can be extended to include a variety of devices: tapes, CD-ROM drives, communications devices and so forth.

S.M.A.R.T. drives can be monitored automatically for impending failure conditions.

Drive manufacturers can use the specific internal diagnostics best suited to their drive.

S.M.A.R.T.-compliant hosts can take early action to protect the data on a drive: automated backup, load reduction, and so forth.

S.M.A.R.T. is designed to detect up to 70 per cent of all predictable device failures to extend data and overall system reliability.

S.M.A.R.T. can detect and report failure conditions that originate in the field due to shock, vibration, temperature, and voltage extremes.

In many ways the S.M.A.R.T. system can be thought of as a suite of diagnostic software that is built into the drive. Mainframe computers and minicomputers have used disk drive diagnostic routines for many years. In the PC world, many users are familiar with CHKDSK, ScanDisk, and other utilities that can provide early detection of some disk drive problems.

The S.M.A.R.T. system extends this technology by designing the diagnostics into the drive. There, the diagnostic routines can be more precise because they are designed for a specific drive design. They can also be more effective because they have access to the internal performance and calibration measurements collected by the drive's controller. Through its use of internal performance indicators and real-time monitoring and analysis, the S.M.A.R.T. system is designed to extend its data protection capability beyond that of traditional diagnostic software.

To ensure compatibility of software and hardware implementations of the S.M.A.R.T. system, hard drive manufacturers are actively pursuing standardisation for both ATA and SCSI interfaces. For both interfaces, the implementation of S.M.A.R.T. requires the introduction of new firmware that specifies the drive activities to be monitored, the diagnostics to be performed, and the values to be used as thresholds or for an analysis.

The key feature of the open S.M.A.R.T. specification is that any S.M.A.R.T.-compliant drive can communicate with any host that is also compliant. Though the specific parameters measured and the limits and analysis used may vary, the communications framework is usable across vendors.

Compiled by Ajith Ram
