The Circuit Breaker Pattern: Failing Fast to Stay Alive
When a dependency goes down, the worst thing your service can do is keep calling it. The circuit breaker pattern stops cascading failures before they take down your entire system.
The 3 AM Incident You'll Eventually Have
It's 3 AM. Your payment service is timing out. Every request to the payment provider hangs for 30 seconds, then fails. But here's the real problem: your checkout service keeps retrying. Every user hitting "Place Order" spawns requests that pile up, consuming threads and connections. Within minutes, your checkout service is down too. Then the product service can't check inventory through checkout. Then the homepage fails because it can't load product data.
One slow dependency just took down your entire system.
This is a cascading failure, and it's the most common way distributed systems die. The fix? Stop calling the broken dependency. That's exactly what a circuit breaker does.
How a Circuit Breaker Works
A circuit breaker sits between your service and a dependency. It monitors failures and, when things go wrong enough, stops forwarding requests entirely -- failing fast instead of hanging.
It has three states:
CLOSED (Normal Operation)
Requests flow through to the dependency normally. The circuit breaker counts consecutive failures. As long as failures stay below the threshold, nothing changes. Your service doesn't even know the circuit breaker exists.
OPEN (Failing Fast)
Once failures hit the threshold (say, 5 failures in a row), the circuit trips open. Now every request is immediately rejected with a fast error -- no waiting, no hanging, no consuming resources. A timer starts counting down.
HALF-OPEN (Testing Recovery)
When the timer expires, the circuit breaker lets one probe request through to test if the dependency has recovered. If it succeeds, the circuit closes and normal traffic resumes. If it fails, the circuit opens again and the timer restarts.
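The three states above can be sketched as a small state machine. This is a minimal, single-threaded illustration rather than a production implementation; the class and method names (`CircuitBreaker`, `call`) are invented for the example, and the clock is injectable so the open-timer can be exercised in tests.

```python
import time


class CircuitBreaker:
    """Minimal three-state circuit breaker (illustrative, single-threaded)."""

    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.recovery_timeout = recovery_timeout    # seconds to stay open before probing
        self.clock = clock                          # injectable for testing
        self.state = self.CLOSED
        self.failure_count = 0
        self.opened_at = None

    def call(self, fn):
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = self.HALF_OPEN  # timer expired: let one probe through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.state = self.CLOSED  # a half-open probe success closes the circuit
        self.failure_count = 0

    def _on_failure(self):
        if self.state == self.HALF_OPEN:
            self._trip()  # probe failed: reopen and restart the timer
        else:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self._trip()

    def _trip(self):
        self.state = self.OPEN
        self.opened_at = self.clock()
```

With a fake clock, the full cycle (trip, fast-fail, probe, close) can be walked through deterministically.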
Why Failing Fast Matters
Failing fast is counterintuitive: returning an error immediately is better than waiting for one. Here's why.
| Factor | Without Circuit Breaker | With Circuit Breaker |
|---|---|---|
| Response time on failure | 30s timeout per request | ~1ms instant rejection |
| Thread/connection usage | Threads pile up waiting | Threads freed immediately |
| Cascade risk | High -- downstream failure propagates up | Low -- failure is contained |
| Recovery behavior | Thundering herd when dependency recovers | Gradual -- half-open probes first |
| User experience | Page hangs, then errors | Fast fallback (cached data, default, error message) |
Without a circuit breaker, a 30-second timeout on 100 concurrent requests means 100 threads stuck waiting. Your thread pool is exhausted. New requests to any endpoint -- including healthy ones -- start queuing. That's how one bad dependency takes down everything.
With a circuit breaker, those 100 requests fail in under a millisecond. Your threads are free. Healthy endpoints keep working. The blast radius is contained.
Try It Yourself
Use the simulator below to experience the state machine in action. Send successful and failed requests, watch the circuit trip, and observe the recovery cycle.
Try this sequence:
- Send 3 failures to trip the circuit open
- Watch the countdown timer
- When it moves to half-open, send a success to close it
- Or send another failure to see it reopen
The Configuration Decisions
A circuit breaker's behavior depends on how you configure it. Get these wrong and you'll either trip too eagerly (blocking valid requests) or too slowly (not protecting anything).
Failure Threshold
How many consecutive failures before the circuit opens. Too low (1-2) and transient errors trip it unnecessarily. Too high (50+) and your service absorbs too much damage before protection kicks in. Start with 5, adjust based on your dependency's normal error rate.
Open Timeout
How long the circuit stays open before testing recovery. Too short and you hammer a recovering service with probes. Too long and you stay in degraded mode unnecessarily. Start with 30 seconds, increase for dependencies that take longer to recover (databases, third-party APIs).
Half-Open Probe Count
How many probe requests to allow in half-open state. One probe is the safest -- you minimize load on a recovering service. Multiple probes give you more confidence before closing. Most implementations use a single probe.
Failure Criteria
Not every error should count as a failure. A 400 Bad Request is a client bug, not a dependency failure. A 503 or a timeout? That's a real failure. Only count errors that indicate the dependency itself is unhealthy -- 5xx codes, timeouts, and connection refused.
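These decisions translate directly into a config object plus an error-classification predicate. A sketch with invented names (`BreakerConfig`, `counts_as_failure`), not any specific library's API:

```python
from dataclasses import dataclass


@dataclass
class BreakerConfig:
    failure_threshold: int = 5    # consecutive failures before the circuit opens
    open_timeout_s: float = 30.0  # how long to stay open before probing recovery
    half_open_probes: int = 1     # probe requests allowed in half-open state


def counts_as_failure(status_code=None, exc=None):
    """Only count errors that indicate the dependency itself is unhealthy."""
    if exc is not None:
        # timeouts and refused connections are dependency failures
        return isinstance(exc, (TimeoutError, ConnectionRefusedError))
    if status_code is not None:
        # 5xx is a server-side failure; 4xx is a client bug and must not trip the circuit
        return status_code >= 500
    return False
```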
What Happens When the Circuit Is Open?
This is the design question most developers skip. When requests are being rejected, what does the user see?
Option 1: Return a cached response
If you have a recent cached version of the data, return it. The data might be slightly stale, but it's better than an error. This works well for read-heavy endpoints.
Option 2: Return a default/fallback
For non-critical features, return a sensible default. Can't load personalized recommendations? Show trending items instead. Can't reach the rating service? Hide ratings temporarily.
Option 3: Return a clear error
Sometimes there's no fallback. The payment service is down and the user wants to pay. Return a clear error: "Payment processing is temporarily unavailable. Please try again in a few minutes." Don't pretend everything is fine.
The deciding question is simple: is the data critical to the user's current action? If not, fall back to a cache or a default; if it is, fail with a clear error.
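The three options can be collapsed into one small dispatch helper. `CircuitOpenError`, `ServiceUnavailable`, and `call_with_fallback` are illustrative names invented for this sketch, not a real library's API:

```python
class CircuitOpenError(Exception):
    """Raised by the breaker when it rejects a call (illustrative)."""


class ServiceUnavailable(Exception):
    """Surfaced to the user when no fallback exists."""


def call_with_fallback(protected_call, cached=None, default=None):
    """Try the breaker-protected call; fall back in preference order when it's open."""
    try:
        return protected_call()
    except CircuitOpenError:
        if cached is not None:
            return cached    # option 1: slightly stale data beats an error
        if default is not None:
            return default   # option 2: sensible default for non-critical features
        # option 3: no fallback exists, so be honest with the user
        raise ServiceUnavailable(
            "Payment processing is temporarily unavailable. Please try again in a few minutes."
        )
```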
Where Circuit Breakers Live in Architecture
Circuit breakers aren't a single global switch. You typically have one per dependency, per service. If your checkout service talks to payments, inventory, and notifications, that's three circuit breakers.
When inventory goes down, only the inventory circuit breaker opens. Payments and notifications keep working. Checkout can still process orders -- it just skips the inventory check or uses a cached stock count.
This is the power of per-dependency circuit breakers: partial degradation instead of total failure.
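One common way to get a breaker per dependency is a small registry that creates them lazily on first use. `BreakerRegistry` is an invented name; the factory stands in for whatever breaker implementation you actually use:

```python
class BreakerRegistry:
    """One breaker per named dependency, created on first use (illustrative)."""

    def __init__(self, factory):
        self.factory = factory  # callable that builds a fresh breaker
        self._breakers = {}

    def get(self, dependency):
        # each dependency gets its own independent breaker, never a global one
        if dependency not in self._breakers:
            self._breakers[dependency] = self.factory()
        return self._breakers[dependency]
```

With this, the checkout service would hold separate breakers for `"payments"`, `"inventory"`, and `"notifications"`, so an inventory outage trips only the inventory breaker.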
Circuit Breaker vs. Retry vs. Timeout
These three patterns work together, not as alternatives. Think of them as layers of defense:
| Pattern | Purpose | When it acts |
|---|---|---|
| Timeout | Cap how long you wait for a response | Per individual request |
| Retry | Handle transient failures (network blips) | After a single failure |
| Circuit Breaker | Stop calling a broken dependency entirely | After repeated failures |
The layering pattern
The typical setup: Timeout (5s) wraps each request. Retry (2-3 attempts with exponential backoff) handles transient errors. Circuit Breaker (trips after N consecutive failures) stops the bleeding when the dependency is genuinely down. Each layer catches what the previous one can't.
The dangerous combination to avoid
Retries without a circuit breaker create a multiplier effect. If every request retries 3 times and you have 100 concurrent requests, that's 300 calls hitting an already-struggling dependency. This accelerates the cascade instead of preventing it. Always pair retries with a circuit breaker.
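One possible layering puts the breaker outermost so an entire retry batch counts as a single failure against the threshold, and the breaker caps the multiplier effect once it trips. `with_layers` is an invented helper; the per-request timeout is assumed to live inside `request_fn` itself (e.g. an HTTP client setting):

```python
import random
import time


def with_layers(request_fn, breaker, retries=3, base_delay=0.1):
    """Breaker outermost, retry in the middle, timeout innermost (sketch).

    `breaker.call` is assumed to raise when the circuit is open, so no
    retries happen at all against a tripped dependency.
    """
    def attempt_with_retries():
        for attempt in range(retries):
            try:
                return request_fn()
            except Exception:
                if attempt == retries - 1:
                    raise  # transient-error budget exhausted; let the breaker count it
                # exponential backoff with jitter between attempts
                time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

    return breaker.call(attempt_with_retries)
```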
Monitoring Your Circuit Breakers
A circuit breaker that trips is giving you a signal. Make sure you're listening.
What to alert on:
- Circuit breaker state changes (closed to open) -- something is wrong
- Time spent in open state -- how long was the dependency down?
- Half-open probe success/failure rate -- is the dependency actually recovering?
- Requests rejected while open -- the impact of the outage on your users
What to dashboard:
- Circuit state per dependency over time
- Failure rate trending (catch problems before the circuit trips)
- Recovery time patterns (is this dependency getting flakier over time?)
⚠️ The silent circuit breaker
The worst circuit breaker is one that trips and nobody notices. If your monitoring doesn't alert on state changes, you might be serving degraded responses for hours without knowing it. Treat every circuit-open event as an incident worth investigating.
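A state-change hook is often all the plumbing the alerts above need. The hook and the metrics interface below are assumptions for illustration, not a specific monitoring library's API:

```python
import logging

logger = logging.getLogger("circuit_breaker")


def on_state_change(dependency, old_state, new_state, metrics):
    """Emit a metric on every transition and an alert-worthy log when a circuit opens."""
    metrics.increment(f"breaker.{dependency}.transition.{old_state}_to_{new_state}")
    if new_state == "open":
        # a circuit opening is an incident signal, not just a log line
        logger.error("circuit for %s opened (was %s)", dependency, old_state)
```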
Key Takeaways
The mental model
A circuit breaker is a fail-fast mechanism. Its job isn't to fix the problem -- it's to stop making it worse. When a dependency is down, the best thing you can do is stop calling it, free up your resources, and serve whatever you can without it.
The decisions that matter:
- Threshold: How many failures before tripping? (Start with 5)
- Timeout: How long before probing recovery? (Start with 30s)
- Fallback: What does the user see when the circuit is open?
- Scope: One circuit breaker per dependency, not one global switch
- Monitoring: Alert on state changes. A tripped breaker is an incident signal.
Don't wait for the 3 AM cascading failure to add circuit breakers. Add them the moment your service has an external dependency. They're cheap insurance against the most common way distributed systems fail.
References
- Michael Nygard, Release It! -- The book that popularized the circuit breaker pattern for software
- Martin Fowler: Circuit Breaker -- Clear conceptual overview
- Netflix Hystrix (archived) -- The library that brought circuit breakers to mainstream microservices
- Resilience4j -- Modern Java circuit breaker library