
Rate Limiting Strategies: Protecting Your API from Yourself

Rate limiting isn't just about stopping bad actors. It's about protecting your infrastructure from traffic spikes, runaway clients, and your own frontend making too many calls. Here's how the four main algorithms actually work -- and when to use each one.

12 min read

When Your Own Client Is the DDoS

You've deployed your API. Everything's running smoothly. Then your mobile team ships an update with a bug: the retry logic has no backoff, and on failure it retries every 100ms. Your 50,000 active users become a 500,000-request-per-second firehose aimed directly at your servers.

No external attacker. No bot. Just your own app, politely destroying your infrastructure.

Rate limiting isn't just about security. It's infrastructure self-defense. Without it, a single misbehaving client -- whether malicious, buggy, or just enthusiastic -- can take down services that every other user depends on.


The Four Algorithms

There are four main approaches to rate limiting, each with different trade-offs around fairness, burst tolerance, and implementation complexity.

Rate Limiting Algorithm Landscape

  • Fixed Window -- simplest
  • Sliding Window -- smoother; fixes the boundary spike
  • Token Bucket -- burst-friendly
  • Leaky Bucket -- smoothest output

Fixed Window: The Simple Counter

The most straightforward approach: divide time into fixed windows (say, 1 minute each), and count requests per window. When the count hits the limit, reject until the window resets.

📌 How it works

Keep a counter per client per time window. On each request, increment the counter. If counter exceeds the limit, reject. When the window ends, reset to zero. Simple, fast, uses almost no memory.

Window: [12:00:00 - 12:01:00]  Limit: 100 req/min

12:00:05  Request #1   -> Allowed  (1/100)
12:00:06  Request #2   -> Allowed  (2/100)
...
12:00:55  Request #100 -> Allowed  (100/100)
12:00:56  Request #101 -> REJECTED (limit reached)
12:01:00  Counter resets to 0
12:01:01  Request #1   -> Allowed  (1/100)
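The counter logic fits in a few lines. Here is a minimal in-memory sketch (class and parameter names are illustrative, not from any particular library):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Minimal in-memory fixed window counter (illustrative sketch)."""
    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (client, window index) -> count

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        key = (client_id, int(now // self.window))  # which window we are in
        self.counts[key] += 1
        return self.counts[key] <= self.limit
```

Passing `now` explicitly makes the limiter easy to test; a production version would also evict counters for past windows.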

The problem: The boundary spike. If a client sends 100 requests at 12:00:59 and another 100 at 12:01:01, they've sent 200 requests in 2 seconds -- double the intended rate. The window boundary creates a loophole.

Fixed Window: Boundary Spike Problem

  • Intended max rate: 100 req/min
  • Actual possible rate at the boundary: 200 req in 2 seconds

Despite this flaw, fixed window is the right choice for many use cases. It's simple to implement, easy to understand, and the boundary spike is often acceptable. If your limit is 1000/min and someone briefly hits 2000 at the boundary, most systems can absorb that.


Sliding Window: Fixing the Boundary

Sliding window eliminates the boundary spike by looking at the last N seconds of actual time, not a fixed time block. Instead of resetting a counter, it tracks individual request timestamps and counts how many fall within the sliding window.

📌 How it works

For each incoming request, count how many requests this client sent in the last N seconds. If the count exceeds the limit, reject. There are no boundaries to exploit because the window moves continuously with real time.

Limit: 5 requests per 10-second sliding window

t=0s   Request #1  -> Allowed  (1/5 in last 10s)
t=2s   Request #2  -> Allowed  (2/5 in last 10s)
t=4s   Request #3  -> Allowed  (3/5 in last 10s)
t=6s   Request #4  -> Allowed  (4/5 in last 10s)
t=8s   Request #5  -> Allowed  (5/5 in last 10s)
t=9s   Request #6  -> REJECTED (5/5 in last 10s)
t=11s  Request #7  -> Allowed  (4/5 -- req #1 expired)

The trade-off: You need to store each request's timestamp, not just a counter. For high-volume APIs, this means more memory. In practice, most implementations use a hybrid approach: keep the precision of sliding windows but approximate with counter interpolation to reduce memory.
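The timestamp-per-request version can be sketched with a deque per client (names are illustrative; the hybrid counter approach trades this precision for less memory):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Sliding window storing one timestamp per allowed request."""
    def __init__(self, limit=5, window_seconds=10):
        self.limit = limit
        self.window = window_seconds
        self.history = defaultdict(deque)  # client -> timestamps of allowed requests

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        q = self.history[client_id]
        while q and q[0] <= now - self.window:  # expire requests older than the window
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

One design choice to note: in this variant, rejected requests are not recorded, so they don't consume window capacity.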

Factor               | Fixed Window         | Sliding Window
Memory per client    | One counter          | One timestamp per request (or hybrid)
Boundary spike       | Possible (2x burst)  | Eliminated
Implementation       | Trivial              | Moderate
Redis implementation | INCR + EXPIRE        | Sorted sets or hybrid counters
Precision            | Window-level         | Request-level

Token Bucket: Controlled Bursts

Token bucket is the first algorithm that explicitly allows bursts -- but in a controlled way. The idea: imagine a bucket that holds tokens. Tokens are added at a steady rate. Each request consumes a token. If the bucket is empty, requests are rejected.

📌 How it works

A bucket holds up to N tokens (the burst capacity). Tokens are added at a fixed rate (e.g., 10 per second). Each request removes one token. If no tokens remain, the request is rejected. The bucket can accumulate tokens during quiet periods, allowing short bursts of traffic when things get busy.

This is the algorithm behind AWS API Gateway, Stripe, and most major API providers. Why? Because real traffic is bursty. Users don't send exactly 10 requests per second -- they send 0 for a while, then 30 in a burst, then 0 again. Token bucket accommodates this natural pattern.

Bucket: max 5 tokens, refill 1 token per second

t=0s   Bucket: 5 tokens
t=0s   3 requests arrive -> All allowed (2 tokens left)
t=1s   Bucket: 3 tokens (1 refilled)
t=1s   No requests
t=2s   Bucket: 4 tokens (1 refilled)
t=2s   6 requests arrive -> 4 allowed, 2 rejected (0 tokens)
t=3s   Bucket: 1 token (1 refilled)

The key insight: The bucket size controls burst tolerance. The refill rate controls sustained throughput. You can tune these independently. A large bucket with a slow refill allows big bursts followed by cooldown. A small bucket with a fast refill enforces a steadier rate.
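A sketch with lazy refill -- tokens are topped up on each call based on elapsed time, rather than by a background timer (class and parameter names are illustrative):

```python
class TokenBucket:
    """capacity sets burst tolerance; refill_rate sets sustained throughput."""
    def __init__(self, capacity=5, refill_rate=1.0, now=0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # start with a full bucket
        self.last = now

    def allow(self, now):
        # Lazy refill: add tokens for the elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Because the state is just a token count and a timestamp, this needs no per-request history -- one reason the algorithm is popular at scale.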


Leaky Bucket: Steady Output

Leaky bucket takes the opposite approach: instead of allowing bursts, it smooths them out. Requests enter a queue (the bucket). The queue is processed at a fixed rate, like water leaking through a hole. If the queue is full, new requests are dropped.

📌 How it works

Incoming requests are added to a FIFO queue with a fixed capacity. The queue is drained at a constant rate (e.g., 1 request per 100ms). If the queue is full, new requests are rejected. The output rate is perfectly smooth regardless of how bursty the input is.

Queue capacity: 5, drain rate: 1 per second

t=0s   5 requests arrive -> All queued (5/5)
t=0s   2 more requests arrive -> REJECTED (queue full)
t=1s   Processing request #1 (4/5 in queue)
t=2s   Processing request #2 (3/5 in queue)
t=3s   Processing request #3 (2/5 in queue)

When leaky bucket shines: When your downstream system can only handle a fixed throughput. If your database can process 100 queries per second, a leaky bucket ensures exactly that rate hits it, regardless of traffic spikes. The bucket absorbs bursts; the drain provides smooth output.
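The queue can be modeled without a real worker thread by draining a virtual level on each arrival (a sketch only; a production version would actually dequeue and process requests at the drain rate):

```python
class LeakyBucket:
    """Virtual queue: level rises on arrival, leaks at drain_rate per second."""
    def __init__(self, capacity=5, drain_rate=1.0, now=0.0):
        self.capacity = capacity
        self.drain_rate = drain_rate  # requests processed per second
        self.level = 0.0              # current queue depth
        self.last = now

    def offer(self, now):
        # Leak whatever has drained since the last arrival
        self.level = max(0.0, self.level - (now - self.last) * self.drain_rate)
        self.last = now
        if self.level + 1.0 <= self.capacity:
            self.level += 1.0
            return True   # queued
        return False      # queue full, rejected
```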


Try It Yourself

If you implement all four algorithms, experiment with them: send individual requests to see normal behavior, then fire 8 requests at once to see how each algorithm handles the spike.


Things to notice:

  1. Fixed Window: Send requests slowly, then watch the counter reset. Try timing a burst right as it resets.
  2. Sliding Window: Send a few requests, wait a few seconds, and see old ones expire from the window.
  3. Token Bucket: Let tokens accumulate during idle time, then send a burst -- it absorbs them.
  4. Leaky Bucket: Send a burst and watch the queue fill up. Notice the steady drain rate.

Distributed Rate Limiting: The Hard Part

Everything above works great on a single server. But when you have 10 API servers behind a load balancer, each one has its own counter. A client sending 10 requests per second might hit a different server each time, and no individual server sees enough to trigger the limit.

The Distributed Problem: Per-Server Counters

A client sending 100 req/s through a round-robin load balancer reaches three servers, each seeing only ~33 req/s -- so no single server's counter ever reaches the 100 req/s limit.

The solution: a centralized counter, usually in Redis. Every API server checks and increments the same counter for each client. This adds a network round-trip per request, but Redis is fast enough (sub-millisecond) that this is rarely the bottleneck.

# Fixed window in Redis (Python sketch, assuming redis-py)
import time
import redis

r = redis.Redis()

def allow(client_id, limit=100, window=60):
    key = f"ratelimit:{client_id}:{int(time.time() // window)}"
    count = r.incr(key)        # atomic, shared across all API servers
    if count == 1:
        r.expire(key, window)  # auto-cleanup after the window passes
    return count <= limit      # False -> respond 429 Too Many Requests

โš ๏ธ Redis becomes a single point of failure

If Redis goes down, your rate limiter goes down. The common approach: fail open. If you can't check the limit, allow the request. A brief period without rate limiting is better than rejecting all traffic. But monitor for this -- you need to know when your rate limiter is offline.
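Fail-open can be a thin wrapper around the limiter check. A sketch, assuming connection failures surface as `ConnectionError` (the actual exception type depends on your Redis client):

```python
def allow_fail_open(check_limit, client_id):
    """Allow the request if the rate limiter backend is unreachable."""
    try:
        return check_limit(client_id)
    except ConnectionError:
        # Emit a metric/alert here -- you need to know the limiter is offline
        return True
```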


What Your 429 Response Should Look Like

When you reject a request, don't just return a bare 429 Too Many Requests. Give the client enough information to back off intelligently:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1708531200

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Please retry after 30 seconds.",
  "retry_after": 30
}
Header                | Purpose
Retry-After           | Seconds until the client can retry (standard HTTP header)
X-RateLimit-Limit     | Total allowed requests per window
X-RateLimit-Remaining | Requests left in the current window
X-RateLimit-Reset     | Unix timestamp when the window resets

Good clients will read these headers and back off automatically. Bad clients will keep hammering you -- but that's what the rate limiter is for.
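On the client side, honoring these headers takes only a few lines. A sketch where `send` is any callable returning a `(status, headers)` pair (both the callable and the retry count are illustrative):

```python
import time

def request_with_backoff(send, max_attempts=5):
    """Retry on 429, sleeping for the server-advertised Retry-After."""
    for _ in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        # Fall back to 1 second if the header is missing
        time.sleep(float(headers.get("Retry-After", 1)))
    return 429  # still rate-limited after all attempts
```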


Choosing Your Algorithm

Which rate limiting algorithm should you use? Start from one question: does your system need to handle bursty traffic gracefully?

  • Yes, and the bursts come from clients: token bucket.
  • Yes, and you need to protect a downstream system with a fixed processing rate: leaky bucket.
  • No: fixed window for simplicity, or sliding window if the boundary spike matters.


Layered Rate Limiting

In practice, you don't use just one rate limiter. You layer them:

Rate Limiting Layers

A request passes through the layers in order: Global Limit (10K req/s total), then Per-User Limit (100 req/min), then Per-Endpoint (20 req/min for writes).
  • Global limit: Protect the entire system from traffic floods (DDoS, viral events)
  • Per-user limit: Prevent any single user from hogging resources
  • Per-endpoint limit: Stricter limits on expensive operations (writes, searches, file uploads)

A user might have 100 requests/minute overall, but only 10 writes/minute and 5 file uploads/minute. The global limit might be 10,000 requests/second across all users. Each layer catches different kinds of abuse.
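Layering can be as simple as requiring every limiter to pass. A sketch (the callables are illustrative), with one subtlety worth flagging: a short-circuiting check still consumes quota in earlier layers even when a later layer rejects, which may or may not be what you want:

```python
def allow_layered(checks):
    """checks: zero-arg callables in order (global, per-user, per-endpoint).
    Returns True only if every layer allows the request.
    Note: all() short-circuits on the first rejection."""
    return all(check() for check in checks)
```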

✅ Authenticated vs. unauthenticated limits

Unauthenticated endpoints (login, signup, password reset) need per-IP rate limits -- and they should be aggressive. A legitimate user doesn't try to log in 100 times per minute. Rate limit login attempts by IP at something like 10/minute, and you've blocked most brute-force attacks without building anything complex.


Key Takeaways

✅ The mental model

Rate limiting is a resource allocation problem, not a security problem (though it helps with security). You have finite capacity. Rate limiting ensures it's distributed fairly and no single client can exhaust it. Think of it as a traffic light for your API -- it keeps things flowing by sometimes making people stop.

The decisions that matter:

  • Fixed Window for simplicity. Accept the boundary spike if your system can handle 2x burst.
  • Sliding Window when the boundary spike matters. Slightly more complex but no loopholes.
  • Token Bucket for user-facing APIs where burst tolerance matters. This is what most major APIs use.
  • Leaky Bucket when you need to protect downstream systems with a fixed processing rate.
  • Use Redis for distributed rate limiting. Fail open if Redis is unavailable.
  • Return useful 429 responses with Retry-After headers so clients can back off intelligently.
  • Layer your limits: global, per-user, and per-endpoint serve different purposes.

Start with a simple per-user fixed window counter. You can always upgrade to token bucket later when you understand your traffic patterns better. The worst rate limiter is the one you didn't deploy because the "right" algorithm felt too complex to implement.

