Date: 2024-08-01

Just over a week ago, I wrote a piece about what appeared to be a global failure of Microsoft services, asking what enterprises should do when the infrastructure they rely on fails.

At that point, the world was experiencing major impacts in transportation, finance, retail and other systems, although the UK appears to have escaped that incident fairly well - notwithstanding issues for anyone trying to get a GP appointment.

It quickly became clear that the problem was not with Microsoft’s Azure service, as it first appeared, but with a single software provider, CrowdStrike, which released a faulty update to its software that was then distributed rapidly around the world via the Azure global networks.

As reported by Computer Weekly, that “bad patch” was available online for 78 minutes, and in that time was distributed to 8.5 million Windows machines, which were locked into a boot cycle and became unusable.

Once it became clear the source of the problems was not an organised cyber-attack from persons unknown, things settled into resolution mode.

The impact on affected businesses and the general public was in some cases major, but - when it comes to hyperscaler outages - the world has a short memory, and things quickly fell back into “business as usual” mode.

Not another outage

Except, on 30 July 2024, Microsoft’s cloud services suffered another outage, affecting businesses globally and – again - without any warning. This outage, however, was nothing like the CrowdStrike debacle in terms of cause, impact, or even implication.

What this latest outage demonstrates is that we have one single problem: our level of reliance on cloud services which might not be all that reliable. But first we need to dig a bit deeper into why these two outages were not the same.

IT security folks try to determine and manage risks to data and IT systems and in doing so tend to consider three key characteristics: confidentiality, integrity and availability. Maintaining these characteristics and keeping them within defined and acceptable ranges is what cyber-security is all about. It is impractical in nearly every case to maintain perfect equilibrium of confidentiality, integrity and availability. And, in any event, different organisations need different blends of these three things to function optimally.

It is common for IT security folks to focus on confidentiality as the biggest concern, and indeed the UK Government Security Classification Scheme is principally about assigning classifications to data confidentiality. But, in some cases, confidentiality is the least important factor, whilst integrity and availability are of very high importance.

Think of the fire brigade, as an example. When a fire is reported, the fire’s location needs to be as accurate as possible, and the firefighters on the ground need to communicate as accurately as possible to ensure they get the resources needed to fight the fire. In this example, integrity and availability are high priorities, but keeping the fire a secret is unlikely to be.

What we do need, if IT security is to be achieved, is all of those three things in some form. And when the balance is not right, that’s a problem.

Outage versus breach

The media use two different words to describe these problems, depending on the characteristic that is compromised. A loss of confidentiality is usually referred to as a breach, while a loss of integrity or availability is often called an outage. These describe the visible effects of the compromise, but not always the cause of the problem.

And that’s why the two reports of Microsoft outages in a little over a week need to be taken separately. They might look the same to the public’s eye and might be referred to in the same way in the press, but they’re different things, and understanding that is both important and necessary for lessons to be learned from each.

The CrowdStrike incident was a loss of integrity of a single file in its software, which resulted in a loss of overall service availability. The 30 July incident does not appear to be the same at all. And whilst it was shorter lived at just a couple of hours, after which most services came back online largely unscathed, it might actually be a lot more serious in nature.

The latest ‘outage’ was a general and widespread loss of availability of Microsoft networking services for its global Azure service, reportedly caused by a “usage spike”, which could be a Microsoft euphemism for a denial-of-service (DoS) attack by an unknown bad actor.

A DoS attack occurs when a (usually malicious) user consumes all of the available service resources and leaves nothing for anyone else. For as long as the attacker retains those resources, the service will remain unavailable to its legitimate users. And during that time the affected business or user will typically be unable to operate or function. Denial-of-service attacks are major threats that can result in serious financial losses and threat-to-life situations, and a lot of money and resources are put into preventing their occurrence, which to be fair Microsoft is usually pretty good at.

This time, however, it looks like something went wrong, and that might be a failure of the security countermeasure to stop these attacks. Or it might simply be that the bad guys found a way to throw more resources into the attack.
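For readers who prefer to see the mechanism rather than read about it, the resource-exhaustion idea behind a DoS attack can be sketched in a few lines of Python. This is a toy model only: the `Service` class, its fixed pool of 100 "slots" and the client names are all hypothetical, and nothing here reflects how Azure or any real network actually allocates capacity.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """A toy service with a fixed number of slots (hypothetical model)."""
    capacity: int
    held: dict = field(default_factory=dict)  # client name -> slots held

    def free_slots(self) -> int:
        return self.capacity - sum(self.held.values())

    def acquire(self, client: str, slots: int = 1) -> bool:
        """Grant the request only if enough slots are still free."""
        if slots <= 0 or slots > self.free_slots():
            return False
        self.held[client] = self.held.get(client, 0) + slots
        return True

    def release(self, client: str) -> None:
        """Return every slot held by the given client."""
        self.held.pop(client, None)


service = Service(capacity=100)

# Normal operation: a legitimate user gets a slot.
assert service.acquire("legitimate-user-1")

# The attacker grabs everything that is left and holds on to it.
service.acquire("attacker", service.free_slots())

# While the attacker holds the slots, legitimate requests are refused.
blocked = not service.acquire("legitimate-user-2")

# Once the attacker's slots are reclaimed, service resumes.
service.release("attacker")
recovered = service.acquire("legitimate-user-2")

print(f"blocked during attack: {blocked}, recovered afterwards: {recovered}")
```

The point of the sketch is that nothing in the service is broken or breached: availability is lost purely because one party holds all of the capacity, and it returns as soon as that capacity is released.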