
How breaking things builds resilient systems

To prevent and recover from outages in today’s complex, cloud-native world, enterprises must proactively and deliberately inject failure into their systems through chaos engineering practices

Enterprises must proactively break their own systems in a controlled way to prevent catastrophic failures. This practice, known as chaos engineering, is becoming essential in the complex world of cloud-native applications.

That was according to Sayan Mondal, a senior software engineer at software delivery platform Harness, who noted that developers, site reliability engineers (SREs) and IT operations teams have to get comfortable with chaos.

Speaking at the KubeCon + CloudNativeCon China 2025 conference in Hong Kong this week, Mondal, who is also a maintainer and community manager for Cloud Native Computing Foundation (CNCF) incubating project LitmusChaos, explained how deliberately injecting failure into systems can help to harden them against outages.

“There are cloud-native and distributed systems everywhere these days, with a lot of interdependent failures,” he said. “Cloud providers are not really 100% reliable, because you can get device failures, power supply outages and memory leaks.”

The financial and reputational damage from such outages can be costly. Mondal cited the case of a financial company that lost more than $55m when a single infrastructure issue prevented transactions from being processed, as well as a Slack outage, caused by “a lot of logging”, that took down the collaboration service for thousands of businesses.

Traditional testing only scratches the surface and typically involves just the application layer, he said.

“Rarely do we ever test the infrastructure or the underlying platform services,” said Mondal. “Chaos engineering specifically focuses on touching the bottom layers.”

A fire drill for your application

Mondal described chaos engineering as “a fire drill that you do at the beginning of the delivery cycle, so that when the actual event happens, you are much more prepared”. It involves planning and simulating failures; using different types of faults to identify vulnerabilities; and understanding how systems fail so organisations can build more resilient systems.

For DevOps and IT teams wondering where to begin, Mondal said they don’t have to start by breaking production systems. The journey can begin safely in a local environment using tools like K3s and Minikube, moving to staging and pre-production environments, and only then graduating to production once the team is comfortable.

He introduced LitmusChaos as an open-source, cross-cloud framework that provides the “barebones minimum open-source chaos tooling”, which can be paired with other plugins for monitoring and multi-tenancy. With over two million installations, it provides a rich set of features for fault injection, experiment management and observability.

The core of LitmusChaos is built on three custom resource definitions (CRDs), illustrated in the example manifest after this list:

  • ChaosExperiment: The blueprint of the fault, defining what you are injecting.
  • ChaosEngine: The user-defined tuning for the experiment, defining how you are running it, including duration and specific parameters.
  • ChaosResult: The output of the experiment, detailing what happened and what was learned.
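
To make these three pieces concrete, a minimal ChaosEngine manifest for a pod-delete fault might look roughly like the sketch below. The names, namespace, label selector and service account are placeholders for illustration, and the field layout follows the LitmusChaos v1alpha1 API; the project’s documentation remains the reference for the version in use.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: catalogue-chaos            # placeholder engine name
  namespace: ecommerce             # placeholder namespace
spec:
  engineState: active              # set to 'stop' to halt the experiment
  appinfo:
    appns: ecommerce               # namespace of the target application
    applabel: app=catalogue        # label selector for the target workload
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # service account with permissions to run the fault
  experiments:
    - name: pod-delete             # references the ChaosExperiment blueprint
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # how long the fault runs, in seconds
              value: "30"
            - name: CHAOS_INTERVAL         # gap between successive pod deletions
              value: "10"

Applying such a manifest asks the Litmus operator to run the pod-delete blueprint against pods matching the given label for the configured duration, and to record the outcome in a ChaosResult.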

During a demonstration, Mondal used LitmusChaos to target the product catalogue microservice of a sample e-commerce application. By running a “pod-delete” fault, he showed how the application’s product listings vanished and then how Kubernetes, through its own resilience mechanisms, brought the service back online. This simple experiment demonstrated how teams can verify the resilience of their services and understand their behaviour under stress.
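
The outcome of a run like this surfaces in the corresponding ChaosResult resource. The snippet below is an illustrative sketch only, using the same placeholder names, of the kind of status teams would inspect after an experiment; exact fields can differ between LitmusChaos versions.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: catalogue-chaos-pod-delete   # typically <engine-name>-<experiment-name>
  namespace: ecommerce               # placeholder namespace
spec:
  engine: catalogue-chaos
  experiment: pod-delete
status:
  experimentStatus:
    phase: Completed
    verdict: Pass                    # Pass or Fail, based on the configured checks
    probeSuccessPercentage: "100"
  history:
    passedRuns: 1
    failedRuns: 0

A Pass verdict here would indicate the service recovered as expected while the fault was active.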

Organising for chaos

As for where chaos engineering teams typically reside in an organisation, Mondal told Computer Weekly it’s often a shared responsibility, led by those closest to system reliability.

“Primarily, what we have seen is SREs are the ones who are doing it, mostly at the principal level,” he noted, adding that his team also includes operations and principal engineers.

However, the goal is to expand the practice to developers. “In my team, we put it in the CI [continuous integration] pipeline. Whenever we do a release, we automatically inject a chaos step as part of the release process,” said Mondal, adding that this shift-left approach empowers developers to test resilience as they ship code.
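
Mondal did not spell out his pipeline configuration, but a chaos step of this kind could, for example, apply a ChaosEngine after deployment and fail the build if the resulting verdict is not a pass. The fragment below is a hypothetical sketch in a GitHub Actions-style workflow; the job name, manifest path and resource names are invented for illustration, and it assumes the runner already has kubectl access to the target cluster.

chaos-check:                         # hypothetical job that gates the release on chaos results
  runs-on: ubuntu-latest
  needs: deploy-to-staging           # assumes an earlier job has deployed the build
  steps:
    - uses: actions/checkout@v4      # fetches the repository containing the chaos manifests
    - name: Inject pod-delete fault
      run: kubectl apply -f chaos/catalogue-chaos-engine.yaml   # hypothetical manifest path
    - name: Wait for the experiment to complete
      run: sleep 90                  # crude wait; a real pipeline would poll the experiment phase
    - name: Gate on the chaos verdict
      run: |
        verdict=$(kubectl get chaosresult catalogue-chaos-pod-delete \
          -n ecommerce -o jsonpath='{.status.experimentStatus.verdict}')
        echo "Chaos verdict: $verdict"
        [ "$verdict" = "Pass" ]      # a non-Pass verdict fails the step, blocking the release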

This developer-led testing is complemented by larger, more structured events. “The ops team and the SREs also do ‘game-day’ events,” he said, adding that this can take place on a quarterly basis to tackle more complex failure scenarios.

Mondal added that combining continuous chaos testing with periodic game days can help create a culture of reliability.
