Chaos Engineering explained: preparing for the worst in production

We rarely test a distributed system where it hurts the most: at the moment a dependency gives out, when the network freezes, when an instance vanishes without warning. In 2021, a study estimated that an hour of downtime cost more than $100,000 for 98% of the companies surveyed. The problem is that these outages almost never happen under comfortable conditions: they strike in production, under load, at the worst possible time. Rather than passively waiting for that scenario, one discipline proposes triggering it yourself, within a controlled setting: Chaos Engineering.

Trigger the failure before it finds you

Popularized by Netflix in the 2010s, Chaos Engineering consists of deliberately injecting disturbances into a distributed system to uncover its weaknesses before users run into them. The idea may sound counterintuitive (deliberately breaking what works), but it flows directly from the Design for Failure principle: in a distributed architecture, failure isn’t an exception to avoid; it’s a certainty to anticipate. The goal, then, isn’t to break things for the sake of breaking them, but to increase the confidence we can place in the system’s actual resilience.

What sets Chaos Engineering apart from plain sabotage is its rigor: it’s a genuine scientific approach, structured in four phases.

Define the steady state: characterize the system’s normal behavior through measurable metrics (latency, error rate, throughput).
Form a hypothesis: describe what you expect to observe once the disturbance is introduced, ideally “nothing changes for the user.”
Inject the disturbance: apply the targeted failure under controlled conditions.
Measure and learn: compare the observed behavior to the hypothesis, then fix the gaps it reveals.
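
To make the four phases concrete, here is a minimal sketch of one experiment run against a simulated system. Everything in it is an assumption made for illustration: the SteadyState thresholds, the 30% failure rate, the latencies, and the function names do not come from any particular tool.

```python
import random
import statistics
from dataclasses import dataclass
from typing import Callable


@dataclass
class SteadyState:
    """Thresholds describing normal behavior (illustrative values only)."""
    max_error_rate: float         # tolerated fraction of failed calls
    max_median_latency_ms: float  # tolerated median latency


def measure(call: Callable[[], float], samples: int = 200) -> tuple[float, float]:
    """Call the system repeatedly and return (error rate, median latency in ms).

    `call` returns the latency of one request in ms, or raises on failure.
    """
    latencies: list[float] = []
    errors = 0
    for _ in range(samples):
        try:
            latencies.append(call())
        except Exception:
            errors += 1
    median = statistics.median(latencies) if latencies else float("inf")
    return errors / samples, median


def within(steady: SteadyState, error_rate: float, median_ms: float) -> bool:
    """True when the measured values respect the steady-state thresholds."""
    return (error_rate <= steady.max_error_rate
            and median_ms <= steady.max_median_latency_ms)


def run_experiment(normal: Callable[[], float],
                   disturbed: Callable[[], float],
                   steady: SteadyState) -> bool:
    # Phase 1: define the steady state, and refuse to start if it is not met.
    if not within(steady, *measure(normal)):
        raise RuntimeError("system is not in its steady state; aborting")
    # Phase 2: the hypothesis is that the same thresholds hold under disturbance.
    # Phase 3: inject the disturbance (here, by calling the disturbed variant).
    observed = measure(disturbed)
    # Phase 4: compare the observation with the hypothesis.
    return within(steady, *observed)


# Simulated system: a dependency fails 30% of the time once disturbed.
rng = random.Random(42)
steady = SteadyState(max_error_rate=0.01, max_median_latency_ms=50.0)


def normal_call() -> float:
    return 20.0


def disturbed_with_fallback() -> float:
    # The fallback path is slower but still answers.
    return 35.0 if rng.random() < 0.3 else 20.0


def disturbed_without_fallback() -> float:
    if rng.random() < 0.3:
        raise ConnectionError("dependency unavailable")
    return 20.0


hypothesis_holds_with_fallback = run_experiment(
    normal_call, disturbed_with_fallback, steady)
hypothesis_holds_without_fallback = run_experiment(
    normal_call, disturbed_without_fallback, steady)
```

In this toy setup the hypothesis survives when a fallback exists and is refuted when it doesn’t, and that refutation is exactly the gap the last phase asks you to fix. Checking the steady state first also matters: an experiment started on an already degraded system tells you nothing.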

This loop then plays out at different levels of the system, from the broadest down to the most local.

At the infrastructure level

This is the discipline’s historical home turf. Chaos Monkey, one of the first tools Netflix built, is deliberately indifferent to what runs on an instance: it terminates one at random, whatever technology it hosts, so you can observe whether the system absorbs the disappearance. More complete solutions have since emerged, such as Gremlin, Azure Chaos Studio, and Chaos Toolkit.

These tools let you simulate a whole range of realistic disturbances:

latency injected into the communication layers;
network outages targeted at a specific region or cluster;
specific instances taken out of service;
operations that saturate CPUs to reproduce an overload.

The value is in validating, at the scale of the whole system, that the redundancy and failover mechanisms actually hold up when a component disappears for real.
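
The principle of random instance termination can be illustrated with a small in-process simulation. This is only a sketch of the idea, not how Chaos Monkey or any of the tools above are implemented: Instance, Router, and terminate_random_instance are hypothetical names, and the failover logic is deliberately naive.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A simulated server instance (hypothetical, for illustration)."""
    name: str
    alive: bool = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


@dataclass
class Router:
    """Naive failover: try each instance in turn and skip the ones that fail."""
    instances: list[Instance]

    def route(self, request: str) -> str:
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy instance left")


def terminate_random_instance(instances: list[Instance],
                              rng: random.Random) -> Instance:
    """Pick one live instance at random and take it down."""
    victim = rng.choice([i for i in instances if i.alive])
    victim.alive = False
    return victim


pool = [Instance("node-1"), Instance("node-2"), Instance("node-3")]
router = Router(pool)
victim = terminate_random_instance(pool, random.Random())
response = router.route("GET /health")  # still answered by a surviving instance
```

With three instances, losing one is absorbed; terminating all three makes route raise, which is the kind of limit such an experiment is meant to reveal before production does.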

At the application level

Not everything plays out at the infrastructure level. It’s often just as relevant to introduce chaos more locally, inside the application process itself, through dedicated libraries. The C# ecosystem is a good example here, with the Polly and Simmy duo.

Polly is a library that implements the resilience patterns I’ve already detailed in my dedicated resilience series: circuit breaker, retries, timeouts. Simmy is its natural complement: it integrates directly with Polly and does the opposite job, injecting chaos to verify that those protections actually work. Concretely, Simmy can introduce:

exceptions into the system (timeouts, HTTP errors);
artificial latency;
altered behaviors and results.
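
The pairing of a protection with a chaos injector can be sketched in a few lines. To be clear, this is not the Polly or Simmy API (both are C# libraries); it is a Python analogue under stated assumptions, where with_retry, with_chaos, and fetch_price are hypothetical names chosen for the example.

```python
import random
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def with_chaos(func: Callable[..., T], *, injection_rate: float,
               fault: Type[Exception], rng: random.Random) -> Callable[..., T]:
    """Chaos side: make a fraction of calls raise `fault` instead of running."""
    def wrapper(*args, **kwargs):
        if rng.random() < injection_rate:
            raise fault("injected by chaos wrapper")
        return func(*args, **kwargs)
    return wrapper


def with_retry(func: Callable[..., T], *, attempts: int,
               retry_on: Type[Exception] = Exception) -> Callable[..., T]:
    """Protection side: call `func` up to `attempts` times before giving up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def wrapper(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return func(*args, **kwargs)
            except retry_on:
                if attempt == attempts:
                    raise  # retries exhausted: let the caller see the failure
    return wrapper


def fetch_price(sku: str) -> float:
    """Hypothetical downstream call that the application depends on."""
    return 9.99


# The chaos wrapper sits inside the retry, so the retry is what gets tested.
flaky_fetch = with_chaos(fetch_price, injection_rate=0.3,
                         fault=TimeoutError, rng=random.Random())
resilient_fetch = with_retry(flaky_fetch, attempts=5, retry_on=TimeoutError)
```

The order of composition is the design choice worth noting: the fault is injected below the protection, so a call to resilient_fetch succeeds only if the retry policy really absorbs the injected timeouts.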

When the ecosystem doesn’t offer a mature tool

Not every language has a library as polished as Polly. That’s no reason to give up: you can introduce a minimal level of disturbance with a few basic building blocks.

A substitute proxy: slot in an alternative version of your dependencies that injects errors and latency. This does assume a dependency injection system is in place, so you can easily swap a regular dependency for its disruptive variant.
Random crashes: simulate a process that exits unexpectedly, to check the ability to restart and recover.
Event loop saturation: in the Node.js ecosystem, run CPU-intensive tasks on the main thread to cause event loop lag and drive up Event Loop Utilization (ELU), reproducing saturation conditions.
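
The substitute proxy is the easiest of these building blocks to sketch. The example below is one possible shape, assuming constructor-based dependency injection; ChaosProxy, InventoryClient, and OrderService are hypothetical names, and the sleep function is injectable only so that tests don’t have to wait.

```python
import random
import time
from typing import Any, Callable, Optional


class ChaosProxy:
    """Stands in for a dependency and disturbs calls before forwarding them."""

    def __init__(self, target: Any, *, error_rate: float = 0.0,
                 latency_s: float = 0.0,
                 rng: Optional[random.Random] = None,
                 sleep: Callable[[float], Any] = time.sleep) -> None:
        self._target = target
        self._error_rate = error_rate
        self._latency_s = latency_s
        self._rng = rng or random.Random()
        self._sleep = sleep

    def __getattr__(self, name: str) -> Any:
        # Only reached for attributes the proxy itself does not define.
        attribute = getattr(self._target, name)
        if not callable(attribute):
            return attribute

        def disturbed(*args, **kwargs):
            if self._latency_s > 0:
                self._sleep(self._latency_s)  # injected latency
            if self._rng.random() < self._error_rate:
                raise ConnectionError(f"chaos: injected failure in {name}()")
            return attribute(*args, **kwargs)

        return disturbed


class InventoryClient:
    """Hypothetical regular dependency."""

    def stock(self, sku: str) -> int:
        return 7


class OrderService:
    """Receives its dependency from outside, so it can be swapped."""

    def __init__(self, inventory: Any) -> None:
        self._inventory = inventory

    def can_order(self, sku: str) -> bool:
        try:
            return self._inventory.stock(sku) > 0
        except ConnectionError:
            return False  # degrade gracefully instead of crashing


regular = OrderService(InventoryClient())
disrupted = OrderService(ChaosProxy(InventoryClient(), error_rate=1.0))
```

Because OrderService never knows which variant it was given, the same code path runs in both cases; the only thing that changes is the wiring, which is why the dependency injection prerequisite matters.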

Conclusion

Chaos Engineering flips the logic around: instead of crossing our fingers and hoping failures never happen, we choose to trigger them while we’re still able to observe, understand, and fix them. Whether it’s killing an instance with Chaos Monkey or injecting an exception with Simmy, the approach stays the same: turn uncertainty into a testable hypothesis.

And that may be the real payoff: an hour of unplanned downtime is expensive; an hour of deliberate chaos, within a controlled setting, is one of the most profitable investments you can make in a system’s robustness.