Illustration: Chaos Engineering, preparing for chaos, with the four steps of the scientific method (steady state, hypothesis, injection, measurement).

Chaos Engineering explained: preparing for the worst in production

Chaos Engineering, popularized by Netflix, means deliberately injecting failures to uncover the weaknesses of a distributed system, at both the infrastructure and application levels.

✍️ Antoine Coulon
Tags: chaos-engineering, resilience, distributed-systems, nodejs, fault-tolerance

We rarely test a distributed system where it hurts the most: at the moment a dependency gives out, when the network freezes, when an instance vanishes without warning. In 2021, a study estimated that an hour of downtime cost more than $100,000 for 98% of the companies surveyed. The problem is that these outages almost never happen under comfortable conditions: they strike in production, under load, at the worst possible time. Rather than passively waiting for that scenario, one discipline proposes triggering it yourself, within a controlled setting: Chaos Engineering.

Trigger the failure before it finds you

Popularized by Netflix in the 2010s, Chaos Engineering consists of deliberately injecting disturbances into a distributed system to uncover its weaknesses before users run into them. The idea may sound counterintuitive (deliberately breaking what works) but it flows directly from the Design For Failure principle: in a distributed architecture, failure isn’t an exception to avoid, it’s a certainty to anticipate. The ultimate goal, then, isn’t to break for the sake of breaking, but to increase the confidence we can place in the system’s actual resilience.

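What sets Chaos Engineering apart from plain sabotage is its rigor: it’s a genuine scientific approach, structured in four phases. You define the steady state (a measurable picture of normal behavior), formulate a hypothesis about how the system will react to a failure, inject the disturbance, then measure the outcome against the hypothesis.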
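To make the four phases (steady state, hypothesis, injection, measurement) concrete, here is a minimal TypeScript sketch of an experiment runner. It is not the API of any chaos tool: `Experiment`, `runExperiment`, and the toy two-replica system are hypothetical names chosen for illustration, and the steady state is reduced to a single success-rate metric.

```typescript
// Hypothetical model of a chaos experiment; none of these names come from a real tool.
interface Experiment {
  // Steady state and measurement: a numeric indicator of normal behavior.
  measure: () => number;
  // Hypothesis: what "the system absorbed the failure" means for that indicator.
  hypothesis: (baseline: number, observed: number) => boolean;
  // Injection: introduce the disturbance and return a function that rolls it back.
  inject: () => () => void;
}

interface Report {
  baseline: number;
  observed: number;
  hypothesisHolds: boolean;
}

function runExperiment(experiment: Experiment): Report {
  const baseline = experiment.measure(); // 1. steady state
  const rollback = experiment.inject(); // 3. injection
  try {
    const observed = experiment.measure(); // 4. measurement
    // 2. the hypothesis, stated up front, is checked against the measurement
    return { baseline, observed, hypothesisHolds: experiment.hypothesis(baseline, observed) };
  } finally {
    rollback(); // always restore the system, even if measuring throws
  }
}

// Toy system (assumption): two replicas behind a client that fails over.
const replicaA = { up: true };
const replicaB = { up: true };
const replicas = [replicaA, replicaB];

function successRate(requests = 100): number {
  let served = 0;
  for (let i = 0; i < requests; i++) {
    // A request succeeds if at least one replica can take it.
    if (replicas.some((replica) => replica.up)) served++;
  }
  return served / requests;
}

const report = runExperiment({
  measure: () => successRate(),
  // Hypothesis: losing one replica costs at most one point of success rate.
  hypothesis: (baseline, observed) => observed >= baseline - 0.01,
  inject: () => {
    replicaA.up = false;
    return () => {
      replicaA.up = true;
    };
  },
});

console.log(report);
```

The rollback sits in a `finally` block on purpose: an experiment that cannot be undone is closer to sabotage than to a controlled test.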

This loop then plays out at different levels of the system, from the broadest down to the most local.

At the infrastructure level

This is the discipline’s historical home turf. Chaos Monkey, one of the first tools Netflix built, works at the instance level, regardless of the technology running on it: it terminates an instance at random and observes whether the system absorbs the loss. More complete solutions have since emerged, such as Gremlin, Azure Chaos Studio, and Chaos Toolkit.

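These tools let you simulate a whole range of realistic disturbances, such as an instance that vanishes, a network that slows down or freezes, or a dependency that stops responding.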

The value is in validating, at the scale of the whole system, that the redundancy and failover mechanisms actually hold up when a component disappears for real.

At the application level

Not everything plays out at the infrastructure level. It’s often just as relevant to introduce chaos more locally, at the very heart of an instance, through dedicated libraries. The C# ecosystem is exemplary here with the Polly and Simmy duo.

Polly is a library that implements the resilience patterns I’ve already detailed in my dedicated resilience series: circuit breaker, retries, timeouts. Simmy is its natural complement: a direct integration with Polly that, conversely, lets you inject chaos to verify that those protections work. Concretely, Simmy can introduce faults (exceptions thrown in place of the real call), artificial latency, substitute results, and arbitrary custom behaviors.
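The idea behind the pairing can be sketched outside C#. The snippet below does not use Polly’s or Simmy’s API. It is a hypothetical TypeScript analog in which `retry` plays the role of the resilience policy and `injectFault` plays the role of the chaos policy placed in front of the real call. The random source is passed in as a parameter (an assumption made for the demo) so that the run is deterministic.

```typescript
type Operation<T> = () => T;

// Resilience side (the role Polly plays): retry up to `attempts` times.
function retry<T>(operation: Operation<T>, attempts: number): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return operation();
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  throw lastError;
}

// Chaos side (the role Simmy plays): for a fraction `injectionRate` of calls,
// throw instead of invoking the real operation.
function injectFault<T>(
  operation: Operation<T>,
  injectionRate: number,
  random: () => number = Math.random,
): Operation<T> {
  return () => {
    if (random() < injectionRate) {
      throw new Error("chaos: injected fault");
    }
    return operation();
  };
}

// Scripted random source for the demo: the first two draws trigger a fault.
const draws = [0.1, 0.2, 0.9];
const scriptedRandom = () => draws.shift() ?? 1;

let realCalls = 0;
const fetchPrice = () => {
  realCalls++;
  return 42;
};

// The chaos wrapper sits inside the retry, so the retry is what gets verified.
const chaotic = injectFault(fetchPrice, 0.5, scriptedRandom);
const result = retry(chaotic, 3);

console.log(result, realCalls); // logs: 42 1
```

Two injected faults are absorbed by the three attempts, and the real operation runs exactly once. With fewer attempts than injected faults, `retry` rethrows the last error, which is precisely the kind of gap this exercise is meant to surface.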

When the ecosystem doesn’t offer a mature tool

Not every language has a library as polished as Polly. That’s no reason to give up: you can introduce a minimal level of disturbance with a few basic building blocks.
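In Node.js, for instance, those building blocks could look like the sketch below: a kill switch, an injection rate, and two disturbances (added latency or a thrown error) wrapped around any async function. Every name here (`ChaosOptions`, `pickDisturbance`, `withChaos`, `getUser`) is hypothetical, and the values are arbitrary. Treat it as a starting point under those assumptions rather than a production-ready library.

```typescript
interface ChaosOptions {
  enabled: boolean; // kill switch, typically driven by configuration
  injectionRate: number; // fraction of calls to disturb, between 0 and 1
  failureRate: number; // among disturbed calls, fraction that fail outright
  latencyMs: number; // delay added to the other disturbed calls
  random?: () => number; // injectable for deterministic tests
}

type Disturbance =
  | { kind: "none" }
  | { kind: "latency"; ms: number }
  | { kind: "fault" };

// Pure decision step: which disturbance, if any, applies to this call.
function pickDisturbance(options: ChaosOptions): Disturbance {
  const random = options.random ?? Math.random;
  if (!options.enabled || random() >= options.injectionRate) {
    return { kind: "none" };
  }
  return random() < options.failureRate
    ? { kind: "fault" }
    : { kind: "latency", ms: options.latencyMs };
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Wraps an async function so that some calls are delayed or fail before running.
function withChaos<Args extends unknown[], R>(
  operation: (...args: Args) => Promise<R>,
  options: ChaosOptions,
): (...args: Args) => Promise<R> {
  return async (...args: Args): Promise<R> => {
    const disturbance = pickDisturbance(options);
    if (disturbance.kind === "fault") {
      throw new Error("chaos: injected fault");
    }
    if (disturbance.kind === "latency") {
      await sleep(disturbance.ms);
    }
    return operation(...args);
  };
}

// Usage: disturb 10% of calls, half of them with an error, half with 200 ms of latency.
const getUser = async (id: string) => ({ id, name: "Ada" });
const chaoticGetUser = withChaos(getUser, {
  enabled: true,
  injectionRate: 0.1,
  failureRate: 0.5,
  latencyMs: 200,
});

chaoticGetUser("42")
  .then((user) => console.log("served", user))
  .catch((error) => console.error("caller must handle", error));
```

Keeping the decision (`pickDisturbance`) separate from the side effects makes the chaos itself easy to test, and the `enabled` flag gives you an immediate way to stop the experiment.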

Conclusion

Chaos Engineering flips the logic around: instead of crossing our fingers and hoping failures never happen, we choose to trigger them while we’re still able to observe, understand, and fix them. Whether it’s killing an instance with Chaos Monkey or injecting an exception with Simmy, the approach stays the same: turn uncertainty into a testable hypothesis.

And that may be the real payoff: an hour of unplanned downtime is expensive; an hour of deliberate chaos, within a controlled setting, is one of the most profitable investments you can make in a system’s robustness.