Single point failures

The recent two-hour outage of Google’s Gmail, affecting the majority of its 150 million users, reflects the growing risks associated with the inevitable drift towards centralised system management.

At least Google was honest enough to issue an apology, explaining that the incident was caused by an engineer’s miscalculation and that they were investigating ways to ensure it did not happen again. (Mind you, it’s not the first of these incidents.) That’s a big improvement over O2, whose service was down for many customers during most of Saturday without any explanation.

Expect more of these crashes. Information technology is spectacularly vulnerable to tiny errors, and we are building massive single-point failure scenarios based on cloud computing, centralised management and technology monoculture. In response, we must all raise our game in business continuity and crisis response.