The consequence of IT failure on a space mission
If everything goes to plan, an HPE edge server, Spaceborne Computer-2, will soon be running on the International Space Station, proving that commodity x86 hardware can be used to run mission-critical systems on manned space missions.
It may seem like a so-what moment. Computer Weekly has written extensively on the development of computers that put man on the Moon. But this is very different: it represents a stepping stone on a technology path that will eventually lead to a manned Mars mission.
As every datacentre operator knows, no matter how high the manufacturer’s mean time between failures figure is, hardware can and ultimately will fail. That’s on Earth. In space, there is the added issue that radiation, such as cosmic rays, leads to increased error rates in computer memory chips.
With Nasa’s rover Perseverance finally arriving at the Red Planet after a seven-month trip across the solar system, there is a very real risk that, on any future manned mission, astronauts will need to deal with computer hardware failures.
Back on the ISS, HPE has provided an inventory of spare parts. The redundancy of the system means that parts do not have to be replaced immediately. Clearly, unless it is safety-critical, fixing computer hardware may not be a top priority for an astronaut. Any failures need to be repaired during routine maintenance windows. And these should, ideally, be kept to a minimum, because an astronaut is not a full-time datacentre administrator.
Systems management on the ISS
Typically, systems management is focused on preventative design. This aims to anticipate failures and apply fixes before they actually occur.
Taking a leaf out of Nasa’s operating philosophy, managing the Spaceborne Computer-2 is based on the principle of consequential design. “Our consequential design treats all sensors and sensor data equally from a systems management perspective; and does not affect processing performance. Only when a standard reading falls out of range is action potentially taken,” explains Mark Fernandez, principal investigator for Spaceborne Computer-2.
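The principle Fernandez describes can be illustrated with a minimal sketch. This is not HPE's or Nasa's software: the sensor names, limits and the safety-critical flag are all hypothetical, chosen only to show the idea that every sensor is checked in the same way and that nothing happens unless a reading falls out of range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sensor:
    """A hypothetical sensor with an acceptable operating range."""
    name: str
    low: float
    high: float
    safety_critical: bool = False  # assumption: flag is illustrative only


def out_of_range(sensor: Sensor, reading: float) -> bool:
    """True when the reading is outside the sensor's normal range."""
    return not (sensor.low <= reading <= sensor.high)


def review(sensors, readings):
    """Check every sensor the same way; return (immediate, deferred) names."""
    immediate, deferred = [], []
    for sensor in sensors:
        if not out_of_range(sensor, readings[sensor.name]):
            continue  # in range: no action is taken
        if sensor.safety_critical:
            immediate.append(sensor.name)  # needs attention now
        else:
            deferred.append(sensor.name)  # wait for a maintenance window
    return immediate, deferred


# Hypothetical sensors and readings, for illustration only.
SENSORS = [
    Sensor("cpu_temp_c", 10.0, 85.0),
    Sensor("fan_rpm", 1000.0, 6000.0),
    Sensor("power_bus_v", 11.4, 12.6, safety_critical=True),
]
readings = {"cpu_temp_c": 91.0, "fan_rpm": 3200.0, "power_bus_v": 12.1}

immediate, deferred = review(SENSORS, readings)
print("immediate:", immediate)
print("deferred:", deferred)
```

In this example only the processor temperature is out of range and, because it is not marked as safety-critical, it is queued for the next maintenance window rather than interrupting the crew.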
While the environment on the ISS is very different to an Earth-bound datacentre, consequential design raises some interesting questions for datacentre operations. Is predicting IT failures the best way to manage complex technology? Or should IT administrators work within a framework where they can assess the extent to which any failure will damage the overall system, and then have an informed conversation with decision-makers about the actions that can be taken?