System Design

The Circuit Breaker Pattern: Failing Fast to Stay Alive

When a dependency goes down, the worst thing your service can do is keep calling it. The circuit breaker pattern stops cascading failures before they take down your entire system.

10 min read

The 3 AM Incident You'll Eventually Have

It's 3 AM. Your payment service is timing out. Every request to the payment provider hangs for 30 seconds, then fails. But here's the real problem: your checkout service keeps retrying. Every user hitting "Place Order" spawns requests that pile up, consuming threads and connections. Within minutes, your checkout service is down too. Then the product service can't check inventory through checkout. Then the homepage fails because it can't load product data.

One slow dependency just took down your entire system.

This is a cascading failure, and it's the most common way distributed systems die. The fix? Stop calling the broken dependency. That's exactly what a circuit breaker does.


How a Circuit Breaker Works

A circuit breaker sits between your service and a dependency. It monitors failures and, when things go wrong enough, stops forwarding requests entirely -- failing fast instead of hanging.

It has three states:

[Diagram: circuit breaker state machine -- CLOSED (requests pass through) → on failures hitting threshold → OPEN (requests blocked) → after timeout → HALF-OPEN (probe request)]
1. CLOSED (Normal Operation)

Requests flow through to the dependency normally. The circuit breaker counts consecutive failures. As long as failures stay below the threshold, nothing changes. Your service doesn't even know the circuit breaker exists.

2. OPEN (Failing Fast)

Once failures hit the threshold (say, 5 failures in a row), the circuit trips open. Now every request is immediately rejected with a fast error -- no waiting, no hanging, no consuming resources. A timer starts counting down.

3. HALF-OPEN (Testing Recovery)

When the timer expires, the circuit breaker lets one probe request through to test if the dependency has recovered. If it succeeds, the circuit closes and normal traffic resumes. If it fails, the circuit opens again and the timer restarts.
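The three states and transitions above fit in a few dozen lines. Here is a minimal sketch (class and parameter names are my own, not from any particular library), counting consecutive failures as described:

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: CLOSED -> OPEN -> HALF-OPEN."""
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.recovery_timeout = recovery_timeout    # seconds open before probing
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == self.OPEN:
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = self.HALF_OPEN           # timer expired: allow one probe
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0
        self.state = self.CLOSED                      # probe (or normal call) succeeded

    def _on_failure(self):
        self.failures += 1
        if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = self.OPEN                    # trip, or re-trip after a failed probe
            self.opened_at = time.monotonic()
```

Note the asymmetry: in half-open, a single failure re-opens the circuit immediately, without waiting for the threshold again.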


Why Failing Fast Matters

It sounds counterintuitive: returning an error immediately is better than waiting for one. Here's why.

Factor | Without Circuit Breaker | With Circuit Breaker
Response time on failure | 30s timeout per request | ~1ms instant rejection
Thread/connection usage | Threads pile up waiting | Threads freed immediately
Cascade risk | High -- downstream failure propagates up | Low -- failure is contained
Recovery behavior | Thundering herd when dependency recovers | Gradual -- half-open probes first
User experience | Page hangs, then errors | Fast fallback (cached data, default, error message)
Thread pool impact, 100 concurrent requests with the dependency down: no circuit breaker (30s timeout) -- 100 threads blocked; circuit breaker open -- 0 threads blocked.

Without a circuit breaker, a 30-second timeout on 100 concurrent requests means 100 threads stuck waiting. Your thread pool is exhausted. New requests to any endpoint -- including healthy ones -- start queuing. That's how one bad dependency takes down everything.

With a circuit breaker, those 100 requests fail in under a millisecond. Your threads are free. Healthy endpoints keep working. The blast radius is contained.


Try It Yourself

Use the simulator below to experience the state machine in action. Send successful and failed requests, watch the circuit trip, and observe the recovery cycle.

[Interactive Circuit Breaker Simulator]

Try this sequence:

  1. Send 3 failures to trip the circuit open
  2. Watch the countdown timer
  3. When it moves to half-open, send a success to close it
  4. Or send another failure to see it reopen

The Configuration Decisions

A circuit breaker's behavior depends on how you configure it. Get these wrong and you'll either trip too eagerly (blocking valid requests) or too slowly (not protecting anything).

📌 Failure Threshold

How many consecutive failures before the circuit opens. Too low (1-2) and transient errors trip it unnecessarily. Too high (50+) and your service absorbs too much damage before protection kicks in. Start with 5, adjust based on your dependency's normal error rate.

📌 Timeout Duration

How long the circuit stays open before testing recovery. Too short and you hammer a recovering service with probes. Too long and you stay in degraded mode unnecessarily. Start with 30 seconds, increase for dependencies that take longer to recover (databases, third-party APIs).

📌 Half-Open Strategy

How many probe requests to allow in half-open state. One probe is the safest -- you minimize load on a recovering service. Multiple probes give you more confidence before closing. Most implementations use a single probe.

📌 Failure Definition

Not every error should count as a failure. A 400 Bad Request is a client bug, not a dependency failure. A 503 or a timeout? That's a real failure. Only count errors that indicate the dependency itself is unhealthy -- 5xx codes, timeouts, and connection refused.
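The failure-definition rule above can be captured in a small classifier. This is a sketch; the status codes and exception types are illustrative assumptions, not tied to any specific framework:

```python
# Only these signals suggest the dependency itself is unhealthy.
DEPENDENCY_FAILURE_CODES = {500, 502, 503, 504}

def counts_as_failure(status_code=None, exc=None):
    """Return True only for dependency-health failures, never for client bugs."""
    if exc is not None:
        # Timeouts and refused connections mean the dependency is unreachable.
        return isinstance(exc, (TimeoutError, ConnectionRefusedError))
    # 4xx responses are the caller's fault and should not trip the breaker.
    return status_code in DEPENDENCY_FAILURE_CODES
```

A breaker would consult this classifier before incrementing its failure counter, so a burst of 400s from one buggy client can't open the circuit for everyone.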


What Happens When the Circuit Is Open?

This is the design question most developers skip. When requests are being rejected, what does the user see?

Option 1: Return a cached response

If you have a recent cached version of the data, return it. The data might be slightly stale, but it's better than an error. This works well for read-heavy endpoints.

Option 2: Return a default/fallback

For non-critical features, return a sensible default. Can't load personalized recommendations? Show trending items instead. Can't reach the rating service? Hide ratings temporarily.

Option 3: Return a clear error

Sometimes there's no fallback. The payment service is down and the user wants to pay. Return a clear error: "Payment processing is temporarily unavailable. Please try again in a few minutes." Don't pretend everything is fine.

A quick heuristic for choosing: is the data critical to the user's current action? If yes, return a clear error; if not, fall back to cached data or a sensible default.
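For a non-critical read, the three options chain naturally: try the dependency, fall back to cache, then to a default. A sketch, where `fetch_ratings` is a hypothetical dependency call, stubbed here to behave as if the circuit were open:

```python
def fetch_ratings(product_id):
    """Stand-in for a breaker-wrapped call; simulates an open circuit."""
    raise RuntimeError("circuit open: failing fast")

def get_product_ratings(product_id, cache):
    """Fallback chain for a non-critical read (names are illustrative)."""
    try:
        return fetch_ratings(product_id)          # normal path through the breaker
    except RuntimeError:                          # circuit open: rejected instantly
        cached = cache.get(product_id)
        if cached is not None:
            return cached                         # Option 1: slightly stale beats an error
        return {"average": None, "count": 0}      # Option 2: sensible default (hide ratings)
```

For a critical write like payment, there is no equivalent of the cache branch; the chain ends at Option 3, a clear error.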


Where Circuit Breakers Live in Architecture

Circuit breakers aren't a single global switch. You typically have one per dependency, per service. If your checkout service talks to payments, inventory, and notifications, that's three circuit breakers.

[Diagram: Circuit Breakers Per Dependency -- Checkout holds three breakers: Payment (CLOSED, passing, healthy), Inventory (OPEN, blocked, down), Notify (CLOSED, passing, healthy)]

When inventory goes down, only the inventory circuit breaker opens. Payments and notifications keep working. Checkout can still process orders -- it just skips the inventory check or uses a cached stock count.

This is the power of per-dependency circuit breakers: partial degradation instead of total failure.
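One common way to wire this up is a small registry that lazily creates one breaker per dependency name. A sketch, assuming any breaker object produced by a caller-supplied factory:

```python
class BreakerRegistry:
    """One independent circuit breaker per dependency, created on first use."""
    def __init__(self, factory):
        self._factory = factory   # callable that builds a fresh breaker
        self._breakers = {}       # dependency name -> breaker

    def for_dependency(self, name):
        if name not in self._breakers:
            self._breakers[name] = self._factory()
        return self._breakers[name]
```

Checkout would call `registry.for_dependency("inventory")` for the inventory check and `registry.for_dependency("payment")` for charging, so tripping one has no effect on the other.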


Circuit Breaker vs. Retry vs. Timeout

These three patterns work together, not as alternatives. Think of them as layers of defense:

Pattern | Purpose | When it acts
Timeout | Cap how long you wait for a response | Per individual request
Retry | Handle transient failures (network blips) | After a single failure
Circuit Breaker | Stop calling a broken dependency entirely | After repeated failures

✅ The layering pattern

The typical setup: Timeout (5s) wraps each request. Retry (2-3 attempts with exponential backoff) handles transient errors. Circuit Breaker (trips after N consecutive failures) stops the bleeding when the dependency is genuinely down. Each layer catches what the previous one can't.
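The layering can be sketched as a retry loop wrapped around a breaker-guarded call. Assumptions here: the per-call timeout is enforced inside `func` itself (e.g. an HTTP client's timeout parameter), and `breaker` is any object whose `call()` raises `RuntimeError` when the circuit is open:

```python
import time

def call_with_defenses(func, breaker, retries=2, base_backoff=0.1):
    """Retry with exponential backoff, but stop retrying once the breaker trips."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return breaker.call(func)               # breaker may fail fast instead
        except RuntimeError:
            raise                                   # circuit open: retrying is pointless
        except Exception as exc:
            last_exc = exc                          # transient failure: back off and retry
            time.sleep(base_backoff * (2 ** attempt))
    raise last_exc
```

The ordering matters: the breaker sits inside the retry loop so each attempt is counted, but an open circuit short-circuits the loop entirely, preventing the 3x multiplier effect described below.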

The dangerous combination to avoid

Retries without a circuit breaker create a multiplier effect. If every request retries 3 times and you have 100 concurrent requests, that's 300 calls hitting an already-struggling dependency. This accelerates the cascade instead of preventing it. Always pair retries with a circuit breaker.


Monitoring Your Circuit Breakers

A circuit breaker that trips is giving you a signal. Make sure you're listening.

What to alert on:

  • Circuit breaker state changes (closed to open) -- something is wrong
  • Time spent in open state -- how long was the dependency down?
  • Half-open probe success/failure rate -- is the dependency actually recovering?
  • Requests rejected while open -- the impact of the outage on your users

What to dashboard:

  • Circuit state per dependency over time
  • Failure rate trending (catch problems before the circuit trips)
  • Recovery time patterns (is this dependency getting flakier over time?)

⚠️ The silent circuit breaker

The worst circuit breaker is one that trips and nobody notices. If your monitoring doesn't alert on state changes, you might be serving degraded responses for hours without knowing it. Treat every circuit-open event as an incident worth investigating.
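The alert and dashboard signals listed above can be captured with a small metrics object attached to each breaker. A sketch with illustrative field names; in production the transition hook would emit to your metrics and paging systems:

```python
import time

class BreakerMetrics:
    """Records the signals worth alerting and dashboarding on."""
    def __init__(self):
        self.transitions = []          # (timestamp, old_state, new_state)
        self.rejected_while_open = 0   # user-facing impact of the outage
        self.opened_at = None

    def record_transition(self, old_state, new_state):
        now = time.monotonic()
        self.transitions.append((now, old_state, new_state))
        if new_state == "open":
            self.opened_at = now       # alert here: a trip is an incident signal
        elif old_state == "open":
            self.opened_at = None

    def record_rejection(self):
        self.rejected_while_open += 1  # count each fast-failed request

    def time_in_open(self):
        """How long the current outage has lasted (0 if the circuit is not open)."""
        return 0.0 if self.opened_at is None else time.monotonic() - self.opened_at
```

The breaker calls `record_transition` on every state change and `record_rejection` on every fast failure, so the dashboard questions (state over time, outage duration, rejection volume) fall straight out of this data.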


Key Takeaways

✅ The mental model

A circuit breaker is a fail-fast mechanism. Its job isn't to fix the problem -- it's to stop making it worse. When a dependency is down, the best thing you can do is stop calling it, free up your resources, and serve whatever you can without it.

The decisions that matter:

  • Threshold: How many failures before tripping? (Start with 5)
  • Timeout: How long before probing recovery? (Start with 30s)
  • Fallback: What does the user see when the circuit is open?
  • Scope: One circuit breaker per dependency, not one global switch
  • Monitoring: Alert on state changes. A tripped breaker is an incident signal.

Don't wait for the 3 AM cascading failure to add circuit breakers. Add them the moment your service has an external dependency. They're cheap insurance against the most common way distributed systems fail.
