Single point failures

The recent two-hour outage of Google’s Gmail, which affected the majority of its 150 million users, reflects the growing risks associated with the inevitable drift towards centralised system management.

At least Google was honest enough to issue an apology explaining that the incident was caused by an engineer’s miscalculation and that they were investigating ways to ensure it did not happen again. (Mind you, it’s not the first of these incidents.) That’s a big improvement over O2, whose service was down for many customers during most of Saturday without any explanation.

Expect more of these crashes. Information technology is spectacularly vulnerable to tiny errors, and we are building massive single point failure scenarios based on cloud computing, centralised management and technology monoculture. In response, we must all raise our game in business continuity and crisis response.

Join the conversation



I couldn't agree more; we really do need to raise our game in terms of redundancy. A couple of years ago centralised management was a key selling point; now it's kind of lost in the wind. A single place to update apps is also a single place to fail! At VESK virtual desktop we have 3 datacentres for DR and we're finding that our power charges are being hiked up! It just means that the business is less profitable in the normal hosting sense. We have many other revenue streams, so I think, like you say, everyone needs to step up their game, which will hopefully have a reciprocal effect on the rest of the industry.

This article asserts that cloud computing and the current direction of software architecture make global system outages inevitable. Surely the point behind this sort of technology is to reduce such single points of failure, and hence the system outages that go with them. Is the assertion that this is just hype, or that the implementation is poor?

System outages are inevitable and always have been, but the effects can be reduced by multiple redundant sites. We find the implementation is pretty good, but a number of factors can cause an outage, from power and Internet connection to faulty hardware or software, so multiple sites are the only option for DR.