Why Azure failed: Procedural failure amplifies Azure bug

Microsoft has admitted that a procedural failure in deploying updates led to last month’s catastrophic downtime of Azure storage

Microsoft has admitted that last month’s catastrophic failure of Azure storage occurred because a process to deploy updates progressively, in small, manageable chunks, was not followed.

The process, known as "flighting", is used to run health checks as the update is deployed.

But in a blog post on the final root cause analysis of the failure of the Azure Storage Table, Microsoft corporate vice president Jason Zander said: "The standard flighting deployment policy of incrementally deploying changes across small slices was not followed."

He admitted that Microsoft did not have tools and procedures in place to prevent the accidental deployment of the update. "The engineer fixing the Azure Table storage performance issue believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling this across the infrastructure was low risk. Unfortunately, the configuration tooling did not have adequate enforcement of this policy of incrementally deploying the change across the infrastructure."

While the update had been tested against Azure Table storage, Zander said the configuration switch was incorrectly enabled for the Azure Blob storage Front-End, which exposed a bug that caused Azure Blob storage to lock up in an infinite loop. Because the service was stuck in that loop, Microsoft engineers were unable to fix the problem without restarting it. Zander said: "Microsoft Azure had clear operating guidelines but there was a gap in the deployment tooling that relied on human decisions and protocol. With the tooling updates the policy is now enforced by the deployment platform itself."
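To illustrate what it means for a deployment platform, rather than an engineer's judgment, to enforce incremental rollout, here is a minimal Python sketch. It is not Microsoft's tooling: the class `FlightedRollout`, the `PolicyViolation` exception, the slice names and the health check are all hypothetical, invented purely for illustration. The idea is that the tool only ever offers a "next slice" operation with a health check attached, and flatly refuses a request to enable a change everywhere at once.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class PolicyViolation(Exception):
    """Raised when a request would bypass the incremental rollout policy."""


@dataclass
class FlightedRollout:
    """Hypothetical rollout tool that enables a change one slice at a time."""
    slices: List[str]                    # ordered deployment slices (illustrative names)
    health_check: Callable[[str], bool]  # True if the slice is healthy after the change
    enabled: List[str] = field(default_factory=list)

    def enable_next(self) -> str:
        """Enable the change on the next slice only, then verify its health."""
        if len(self.enabled) == len(self.slices):
            raise PolicyViolation("change is already enabled on every slice")
        target = self.slices[len(self.enabled)]
        self.enabled.append(target)
        if not self.health_check(target):
            # Undo the change on the failing slice and stop the rollout here.
            self.enabled.pop()
            raise RuntimeError(f"health check failed on {target}; rollout halted")
        return target

    def enable_all(self) -> None:
        """Blanket enablement is never allowed, however safe it may seem."""
        raise PolicyViolation("use enable_next(); changes must be flighted slice by slice")


if __name__ == "__main__":
    # Assumed scenario: the third slice exposes a bug that the first two did not.
    rollout = FlightedRollout(
        slices=["slice-1", "slice-2", "slice-3"],
        health_check=lambda name: name != "slice-3",
    )
    rollout.enable_next()
    rollout.enable_next()
    try:
        rollout.enable_next()
    except RuntimeError as err:
        print(err)
    print(rollout.enabled)
```

In this sketch a faulty change reaches at most one unhealthy slice before the rollout stops, and an engineer who believes a change is low risk still has no way to push it across the whole infrastructure in a single step.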

While the automated recovery mechanism in Azure Compute enabled many Virtual Machines to recover from the outage, Zander said: "We identified a subset of VMs that required manual recovery because they did not start successfully."

Windows VMs created during the outage needed to be recreated, but Linux VMs were not impacted, Zander said.
