Microsoft had to roll back an update and reroute access to alternate IT infrastructure last night, after an outage affected its cloud service.
At about 11.25pm BST on 28 September, the company tweeted that it was investigating an issue affecting access to multiple Microsoft 365 services. “We’re working to identify the full impact and will provide more information shortly,” it wrote on its MSFT365Status Twitter account.
Microsoft confirmed it was investigating the availability of the Azure AD service, which organisations use to authenticate to Microsoft 365. “Customers using Azure Active Directory may experience HTTP 503 errors when accessing the Azure portal,” it said.
However, some people on Twitter observed that the outage had implications that extended beyond the AD portal. One IT admin tweeted: “None of my clients with Azure AD-backed applications can log in or authenticate right now.”
Another noted that their corporate application was down because Azure AD authentication was unavailable. Some people complained they could not use Teams or other Microsoft online services such as Outlook 365.
The company initially said it had identified a recent change that appeared to be the source of the issue and announced it had rolled back the change to mitigate the impact. While monitoring the IT environment, Microsoft admitted it had not observed an increase in successful connections after rolling back the recent change. “We’re working to evaluate additional mitigation solutions while we investigate the root cause,” it said.
This mitigation involved rerouting network traffic to alternate IT infrastructure, which Microsoft said would improve the user experience while it continued to investigate the issue.
In 2019, Mark Russinovich, chief technology officer at Microsoft Azure, described how Azure AD had been architected so that it had no single point of failure (SPOF). He said Azure AD was a global service with multiple levels of internal redundancy and automatic recoverability and was deployed in over 30 datacentres around the world using Azure Availability Zones.
He wrote: “Given the criticality of our services, we don’t accept SPOFs in critical external systems like Distributed Name Service (DNS), content delivery networks (CDN) or telco providers that transport our multi-factor authentication (MFA), including SMS and Voice. For each of these systems, we use multiple redundant systems configured in a full active-active configuration.
“Much of that work on this principle has come to completion over the last calendar year and, to illustrate, when a large DNS provider recently had an outage, Azure AD was entirely unaffected because we had an active/active path to an alternate provider.”
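As a rough illustration of the active-active idea Russinovich describes, the Python sketch below spreads requests across two providers and moves on to the other when one fails. It is not Microsoft's implementation: the `Provider` and `ActiveActivePool` classes, the provider names and the hostname are all hypothetical, chosen only to show the pattern.

```python
from itertools import cycle


class ProviderDown(Exception):
    """Raised when a (hypothetical) provider cannot serve a request."""


class Provider:
    """Stand-in for an external dependency such as a DNS provider."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def resolve(self, hostname):
        if not self.healthy:
            raise ProviderDown(self.name)
        return f"{hostname} resolved via {self.name}"


class ActiveActivePool:
    """All providers take live traffic; a failed one is skipped per request."""

    def __init__(self, providers):
        self.providers = list(providers)
        # Round-robin starting point so every provider is in active use.
        self._next_start = cycle(range(len(self.providers)))

    def resolve(self, hostname):
        start = next(self._next_start)
        count = len(self.providers)
        for offset in range(count):
            provider = self.providers[(start + offset) % count]
            try:
                return provider.resolve(hostname)
            except ProviderDown:
                continue  # try the next provider in the pool
        raise ProviderDown("all providers unavailable")


# Hypothetical names, for illustration only.
pool = ActiveActivePool([Provider("dns-a"), Provider("dns-b")])
first = pool.resolve("login.example.com")   # served by dns-a
second = pool.resolve("login.example.com")  # served by dns-b

pool.providers[0].healthy = False           # simulate an outage at dns-a
third = pool.resolve("login.example.com")   # dns-a is skipped, dns-b answers
```

Because both providers carry traffic all the time, losing one does not require a separate failover step. The sketch also shows the limit of the pattern: it protects against the loss of a provider, not against a faulty change that reaches every path at once.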
The fact that an update made Azure AD inaccessible to many organisations shows that, even with the resilience and replication Microsoft has built in, the service remains a SPOF for the many businesses that rely on it to authenticate users of their line-of-business applications. One IT admin tweeted: “It’s like the front door to the house was locked, but hey, everything is in the house. You just can’t get in.”
The issues with Azure AD have now been resolved. Microsoft said: “We’ve fixed the service interruption that some customers may have experienced while performing authentication operations.”