Retry pattern: don't stop at the first error

Shipping to production is good. Surviving it is better. Murphy’s law sums up the mindset to adopt: “Anything that can go wrong, will go wrong.” Sooner or later, an interaction with an external service is going to fail, and the real question isn’t whether it will happen, but how your application will take the hit.

This first installment opens a series devoted to resilience: five concrete patterns to make your applications more robust against the inevitable failures of production. We start with the most fundamental one, the one all the others build upon: the Retry.

The principle: retry rather than give up

The idea behind the Retry is as simple as its name suggests: don’t stop at the first error. An interaction between two services can fail for a thousand reasons: an unstable network, a service that’s momentarily unavailable, or one that’s simply too slow because it’s absorbing a load spike. None of these errors is necessarily permanent.

Trying only once is betting that everything will go right on the first attempt. That’s exactly the kind of assumption Murphy’s law makes a point of disproving. The Retry pattern therefore consists of putting in place a strategy to relaunch a request that has just failed, on the premise that a new attempt has a good chance of succeeding.
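To make the principle concrete, here is a minimal sketch in plain TypeScript, without any library. The names `attemptUpTo` and `flakyOperation` are hypothetical, and the helper is kept synchronous for brevity. It deliberately leaves out any waiting between attempts, which is exactly the part the rest of this article refines.

```typescript
// Runs `operation` up to `maxAttempts` times and returns the first successful result.
// If every attempt throws, the last error is rethrown.
function attemptUpTo<T>(maxAttempts: number, operation: () => T): T {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return operation();
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  throw lastError;
}

// Hypothetical flaky operation: fails twice, then succeeds.
let flakyCalls = 0;
function flakyOperation(): string {
  flakyCalls++;
  if (flakyCalls < 3) throw new Error(`transient failure #${flakyCalls}`);
  return "ok";
}

const flakyResult = attemptUpTo(5, flakyOperation);
console.log(flakyResult, flakyCalls); // ok 3
```

Capping the number of attempts keeps a permanent failure from looping forever: once the budget is spent, the caller gets the last error back.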

Retry, yes, but not just any way

A word of caution, though: retrying doesn’t mean hounding. If a service is already underwater, bombarding it with requests on a loop will only make its situation worse, until it goes down completely. That’s the classic trap of a poorly thought-out Retry: turning your own resilience strategy into a denial-of-service (DoS) attack against a dependency that’s already struggling.

The whole challenge therefore comes down to the delay you leave between two attempts, what we call the backoff. That’s where the real design decisions are hiding. Let’s review the three most common strategies, from the most naive to the most robust.

To illustrate them, we lean on the Effect library, which makes defining and composing these various retry strategies considerably easier.

Fixed Backoff

The simplest strategy: the delay between two attempts is constant. With a 1s delay, the new attempts fire roughly 1s, 2s, 3s, 4s after the first failure, and so on.

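// Effect models a retry policy as a Schedule that can be composed and reused.
import { Duration, Effect, pipe, Schedule } from "effect";

// fetch only rejects on network-level failures, so HTTP error statuses
// (a 503, for instance) are turned into failures here to make them retryable.
const networkCall = Effect.tryPromise(async () => {
  const response = await fetch("https://jsonplaceholder.typicode.com/todos/1");
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response;
});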

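// FIXED BACKOFF (constant delay: attempt n fires about base * n after the first failure)
// Note: this schedule has no stopping condition, so the call is retried until it succeeds.
const fixedBackoff = Schedule.fixed(Duration.seconds(1));

const networkCallWithRetry = networkCall.pipe(Effect.retry(fixedBackoff));

/**
 * Elapsed time since the first failure when each new attempt fires
 * (always 1s after the previous one):
 * Attempt #1: 1s
 * Attempt #2: 2s
 * Attempt #3: 3s
 * Attempt #4: 4s
 * Attempt #5: 5s
 */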

Its advantage is obvious: it’s the most straightforward strategy to implement. A new attempt is made at a regular interval, full stop.

Its flaw is just as obvious: this constant rhythm keeps significant pressure on the service. Yet the fact that an attempt keeps failing is precisely the signal that the service still hasn’t recovered, so you might as well take advantage of that to wait longer, rather than coming back to knock at the same pace.

Exponential Backoff

This strategy directly addresses the previous limitation: instead of keeping a fixed delay, you multiply the wait time by a set factor after each failed attempt. With a 1s base and a factor of 2, the successive delays are 1s, 2s, 4s, 8s, 16s.

// EXPONENTIAL BACKOFF (base * factor ^ n)
const factor = 2;
const exponentialBackoff = Schedule.exponential(Duration.seconds(1), factor);

const networkCallWithExponentialRetry = networkCall.pipe(
  Effect.retry(exponentialBackoff)
);

/**
 * Delay before each new attempt (not the elapsed time):
 * Attempt #1: 1s
 * Attempt #2: 2s
 * Attempt #3: 4s
 * Attempt #4: 8s
 * Attempt #5: 16s
 */

The benefit is clear: the more failures pile up, the more you space out the attempts, giving the service on the other end a real chance to catch its breath.
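The arithmetic behind those delays can be sketched in plain TypeScript. The helper name `exponentialDelayMs` is hypothetical, and the values (1s base, factor 2, five new attempts) are the ones used in this article; this illustrates the formula, not how Effect implements it.

```typescript
// Delay before the new attempt at 0-based index `retryIndex`:
// baseMs * factor ^ retryIndex.
function exponentialDelayMs(baseMs: number, factor: number, retryIndex: number): number {
  return baseMs * Math.pow(factor, retryIndex);
}

// Delays for the first five new attempts with a 1s base and a factor of 2.
const backoffDelaysMs = Array.from({ length: 5 }, (_, retryIndex) =>
  exponentialDelayMs(1000, 2, retryIndex)
);

// Total time spent waiting if all five attempts are needed.
const totalWaitMs = backoffDelaysMs.reduce((sum, delay) => sum + delay, 0);

console.log(backoffDelaysMs); // [ 1000, 2000, 4000, 8000, 16000 ]
console.log(totalWaitMs); // 31000
```

Over the same five attempts, a fixed 1s delay waits 5s in total, against 31s here: the struggling service gets far more room to recover, at the cost of a longer worst-case wait for the caller.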

But this strategy keeps an Achilles’ heel: its rhythm stays perfectly deterministic, paced by an exponential calculation. If several clients adopt the same strategy at the same time, they’ll all retry within the same time windows. The third-party service then finds itself absorbing waves of synchronized requests, exactly the recurring load spikes we were trying to avoid.
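A small simulation makes the synchronization visible. It assumes three hypothetical clients whose first call failed at the same instant and ignores the duration of the calls themselves; `retryTimestampsMs` and `requestsPerInstant` are illustrative names.

```typescript
// Instants (ms) at which a client fires its new attempts, given when its
// first call failed, under a deterministic exponential backoff.
function retryTimestampsMs(
  failedAtMs: number,
  baseMs: number,
  factor: number,
  retries: number
): number[] {
  const timestamps: number[] = [];
  let current = failedAtMs;
  for (let retryIndex = 0; retryIndex < retries; retryIndex++) {
    current += baseMs * Math.pow(factor, retryIndex);
    timestamps.push(current);
  }
  return timestamps;
}

// Counts how many requests land on each instant across all clients.
function requestsPerInstant(timelines: number[][]): Map<number, number> {
  const load = new Map<number, number>();
  for (const timeline of timelines) {
    for (const instant of timeline) {
      load.set(instant, (load.get(instant) || 0) + 1);
    }
  }
  return load;
}

// Three clients that all failed at t = 0, each with the same policy.
const clientTimelines = [0, 0, 0].map((failedAtMs) =>
  retryTimestampsMs(failedAtMs, 1000, 2, 4)
);
const synchronizedLoad = requestsPerInstant(clientTimelines);

// Every wave (t = 1000, 3000, 7000, 15000) carries all 3 clients at once.
console.log(synchronizedLoad);
```

The waves grow further apart, but each one still carries every client at once, so the peak load on the service never shrinks.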

Jitter Exponential Backoff

The fix for this synchronization problem comes down to one word: jitter. We take the Exponential Backoff again, but add a bit of randomness to each attempt’s delay: for example 1s, 2.7s, 4.5s, 8.3s, 15.2s.

// JITTERED EXPONENTIAL BACKOFF (base * factor ^ n * jitterInInterval)
const jitteredExponentialBackoff = pipe(
  exponentialBackoff,
  Schedule.jitteredWith({
    min: 0.5,
    max: 1.5,
  })
);

const networkCallWithJitterExponentialRetry = networkCall.pipe(
  Effect.retry(jitteredExponentialBackoff)
);

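/**
 * Sample delays before each new attempt (each one is drawn at random
 * between 0.5x and 1.5x the exponential delay, so values differ per run):
 * Attempt #1: 1s
 * Attempt #2: 2.7s
 * Attempt #3: 4.5s
 * Attempt #4: 8.3s
 * Attempt #5: 15.2s
 */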

This random variation desynchronizes the attempts of the different clients: even if they all failed at the same time, they won’t come back to knock at the same second. You thus smooth out the load instead of concentrating it into spikes aligned on the same cadence.
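The mechanism can be sketched in plain TypeScript as a random multiplier applied to the exponential delay. This shows the idea rather than Effect’s actual implementation; `applyJitter`, `jitteredExponentialDelayMs` and the injectable `RandomSource` are hypothetical names, the last one being there only to make the example deterministic.

```typescript
// A random source returning a value in [0, 1), like Math.random.
type RandomSource = () => number;

// Scales `delayMs` by a multiplier drawn uniformly between `min` and `max`.
function applyJitter(
  delayMs: number,
  min: number,
  max: number,
  random: RandomSource = Math.random
): number {
  const multiplier = min + (max - min) * random();
  return delayMs * multiplier;
}

// Jittered exponential delay for the new attempt at 0-based index `retryIndex`,
// using the same 0.5 to 1.5 range as in this article.
function jitteredExponentialDelayMs(
  baseMs: number,
  factor: number,
  retryIndex: number,
  random: RandomSource = Math.random
): number {
  return applyJitter(baseMs * Math.pow(factor, retryIndex), 0.5, 1.5, random);
}

// Two clients on their third new attempt (exponential delay: 4000 ms),
// with fixed "random" draws so the result is reproducible.
const clientADelayMs = jitteredExponentialDelayMs(1000, 2, 2, () => 0.25);
const clientBDelayMs = jitteredExponentialDelayMs(1000, 2, 2, () => 0.75);

console.log(clientADelayMs, clientBDelayMs); // 3000 5000
```

With the same base delay of 4s, the two clients come back 2s apart instead of at the same instant. In real use the default `Math.random` source would replace the fixed draws.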

The trade-off is minor but real: the behavior becomes less predictable, and the jitter can add a slight delay when the service has, in fact, just become available again. It’s an overwhelmingly favorable trade-off as soon as several clients are in play.

Conclusion

The Retry is the baseline reflex of any resilient application: accepting that a transient error doesn’t have to doom an operation, and giving it a second chance. But as is often the case, the devil is in the details: here, in the backoff strategy. A fixed delay maintains useless pressure, an exponential delay waits intelligently, and adding jitter prevents all your clients from synchronizing their assaults on an already fragile service.

In the vast majority of cases, the Jitter Exponential Backoff is the best starting point. That said, a Retry alone isn’t enough: retrying without bounding each attempt in time amounts to stacking up requests that never complete. Bounding them is precisely the role of the next pattern in this series, the Timeout, covered in the next installment.