Enterprises must proactively break their own systems in a controlled way to prevent catastrophic failures. This practice, known as chaos engineering, is becoming essential in the complex world of cloud-native applications.

That was according to Sayan Mondal, a senior software engineer at software delivery platform Harness, who noted that developers, site reliability engineers (SREs) and IT operations teams have to get comfortable with chaos.

Speaking at the KubeCon + CloudNativeCon China 2025 conference in Hong Kong this week, Mondal, who is also a maintainer and community manager for Cloud Native Computing Foundation (CNCF) incubating project LitmusChaos, explained how deliberately injecting failure into systems can help to harden them against outages.

“There are cloud-native and distributed systems everywhere these days, with a lot of interdependent failures,” he said. “Cloud providers are not really 100% reliable, because you can get device failures, power supply outages and memory leaks.”

The financial and reputational damage from such outages can be severe. Mondal cited examples where a financial company lost over $55m due to a single infrastructure issue that prevented transactions from being processed, as well as a Slack outage caused by “a lot of logging” that took down the collaboration service for thousands of businesses.

Traditional testing only scratches the surface and typically involves just the application layer, he said.

“Rarely do we ever test the infrastructure or the underlying platform services,” said Mondal. “Chaos engineering specifically focuses on touching the bottom layers.”

A fire drill for your application

Mondal described chaos engineering as “a fire drill that you do at the beginning of the delivery cycle, so that when the actual event happens, you are much more prepared”. It involves planning and simulating failures; using different types of faults to identify vulnerabilities; and understanding how systems fail so organisations can build more resilient systems.

For DevOps and IT teams wondering where to begin, Mondal said they don’t have to start by breaking production systems. The journey can begin safely in a local environment using tools like K3s and Minikube, moving to staging and pre-production environments, and only then graduating to production once the team is comfortable.

He introduced LitmusChaos as an open-source, cross-cloud framework that provides the “barebones minimum open-source chaos tooling”, which can be paired with other plugins for monitoring and multi-tenancy. With over two million installations, it provides a rich set of features for fault injection, experiment management and observability.

The core of LitmusChaos is built on three custom resource definitions (CRDs):

ChaosExperiment: The blueprint of the fault, defining what you are injecting.

ChaosEngine: The user-defined tuning for the experiment, defining how you are running it, including duration and specific parameters.

ChaosResult: The output of the experiment, detailing what happened and what was learned.

During a demonstration, Mondal used LitmusChaos to target the product catalogue microservice of a sample e-commerce application. By running a “pod-delete” fault, he showed how the application’s product listings vanished and then how Kubernetes, through its own resilience mechanisms, brought the service back online. This simple experiment demonstrated how teams can verify the resilience of their services and understand their behaviour under stress.
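To make the ChaosEngine idea more concrete, the Python sketch below builds the kind of manifest that might tune a “pod-delete” run against a catalogue service. It is an illustration only and is not taken from Mondal’s demonstration. The field names (`appinfo`, `engineState`, `chaosServiceAccount`, `TOTAL_CHAOS_DURATION`, `CHAOS_INTERVAL`) follow the ChaosEngine schema as commonly documented for LitmusChaos and should be checked against the installed version. The engine name, namespace, label and service account are hypothetical.

```python
import json


def build_pod_delete_engine(name, namespace, app_label,
                            duration_s=30, interval_s=10):
    """Return a ChaosEngine manifest (as a dict) for a pod-delete fault.

    Assumption: the schema below mirrors the litmuschaos.io/v1alpha1
    ChaosEngine CRD; verify field names against your LitmusChaos version.
    """
    if interval_s <= 0 or duration_s < interval_s:
        raise ValueError("duration_s must be >= interval_s > 0")
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # "active" asks the operator to run the experiment.
            "engineState": "active",
            # Which workload to target (the "where").
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Hypothetical service account with permission to delete pods.
            "chaosServiceAccount": "pod-delete-sa",
            # Which fault to inject and how to tune it (the "how").
            "experiments": [{
                "name": "pod-delete",
                "spec": {"components": {"env": [
                    {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                    {"name": "CHAOS_INTERVAL", "value": str(interval_s)},
                ]}},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical names for a sample e-commerce catalogue service.
    engine = build_pod_delete_engine(
        "catalogue-chaos", "demo", "app=product-catalogue")
    print(json.dumps(engine, indent=2))
```

Because Kubernetes accepts JSON as well as YAML manifests, output like this could in principle be saved to a file and applied with `kubectl apply -f`, ideally in a local K3s or Minikube cluster first, in line with the staged approach Mondal recommended.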