Rate Limiting Strategies: Protecting Your API from Yourself
Rate limiting isn't just about stopping bad actors. It's about protecting your infrastructure from traffic spikes, runaway clients, and your own frontend making too many calls. Here's how the four main algorithms actually work -- and when to use each one.
When Your Own Client Is the DDoS
You've deployed your API. Everything's running smoothly. Then your mobile team ships an update with a bug: the retry logic has no backoff, and on failure it retries every 100ms. Your 50,000 active users become a 500,000-request-per-second firehose aimed directly at your servers.
No external attacker. No bot. Just your own app, politely destroying your infrastructure.
Rate limiting isn't just about security. It's infrastructure self-defense. Without it, a single misbehaving client -- whether malicious, buggy, or just enthusiastic -- can take down services that every other user depends on.
The Four Algorithms
There are four main approaches to rate limiting, each with different trade-offs around fairness, burst tolerance, and implementation complexity.
Fixed Window: The Simple Counter
The most straightforward approach: divide time into fixed windows (say, 1 minute each), and count requests per window. When the count hits the limit, reject until the window resets.
Keep a counter per client per time window. On each request, increment the counter. If counter exceeds the limit, reject. When the window ends, reset to zero. Simple, fast, uses almost no memory.
Window: [12:00:00 - 12:01:00] Limit: 100 req/min
12:00:05 Request #1 -> Allowed (1/100)
12:00:06 Request #2 -> Allowed (2/100)
...
12:00:55 Request #100 -> Allowed (100/100)
12:00:56 Request #101 -> REJECTED (limit reached)
12:01:00 Counter resets to 0
12:01:01 Request #1 -> Allowed (1/100)
The problem: The boundary spike. If a client sends 100 requests at 12:00:59 and another 100 at 12:01:01, they've sent 200 requests in 2 seconds -- double the intended rate. The window boundary creates a loophole.
Despite this flaw, fixed window is the right choice for many use cases. It's simple to implement, easy to understand, and the boundary spike is often acceptable. If your limit is 1000/min and someone briefly hits 2000 at the boundary, most systems can absorb that.
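A minimal in-memory sketch of the counter-per-window idea (the class and method names here are illustrative, not from any particular library). Each client gets one counter tagged with the start of its current window; when a request arrives in a new window, the counter resets.

```python
import time

class FixedWindowLimiter:
    """One counter per (client, window); the counter resets when a new window starts."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # client_id -> (window_start, count)

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        window_start = int(now // self.window) * self.window
        start, count = self.counters.get(client_id, (window_start, 0))
        if start != window_start:      # a new window has begun: reset
            start, count = window_start, 0
        if count >= self.limit:
            return False               # limit reached for this window
        self.counters[client_id] = (start, count + 1)
        return True
```

Note how the boundary spike falls out of the design: a client who exhausts the limit at the end of one window gets a fresh counter the instant the next window starts.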
Sliding Window: Fixing the Boundary
Sliding window eliminates the boundary spike by looking at the last N seconds of actual time, not a fixed time block. Instead of resetting a counter, it tracks individual request timestamps and counts how many fall within the sliding window.
For each incoming request, count how many requests this client sent in the last N seconds. If the count exceeds the limit, reject. There are no boundaries to exploit because the window moves continuously with real time.
Limit: 5 requests per 10-second sliding window
t=0s Request #1 -> Allowed (1/5 in last 10s)
t=2s Request #2 -> Allowed (2/5 in last 10s)
t=4s Request #3 -> Allowed (3/5 in last 10s)
t=6s Request #4 -> Allowed (4/5 in last 10s)
t=8s Request #5 -> Allowed (5/5 in last 10s)
t=9s Request #6 -> REJECTED (5/5 in last 10s)
t=11s Request #7 -> Allowed (4/5 -- req #1 expired)
The trade-off: You need to store each request's timestamp, not just a counter. For high-volume APIs, this means more memory. In practice, most implementations use a hybrid approach: keep the precision of sliding windows but approximate with counter interpolation to reduce memory.
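The exact (timestamp-tracking) variant can be sketched in a few lines; the names below are illustrative. Each client keeps a queue of request times, and stale entries are dropped before counting.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Tracks per-client request timestamps; counts those inside the sliding window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = {}  # client_id -> deque of request times

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        q = self.timestamps.setdefault(client_id, deque())
        while q and q[0] <= now - self.window:  # expire requests older than the window
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

This stores one timestamp per allowed request, which is exactly the memory cost the hybrid counter-interpolation variants are designed to avoid.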
| Factor | Fixed Window | Sliding Window |
|---|---|---|
| Memory per client | One counter | One timestamp per request (or hybrid counters) |
| Boundary spike | Possible (2x burst) | Eliminated |
| Implementation | Trivial | Moderate |
| Redis implementation | INCR + EXPIRE | Sorted sets or hybrid counters |
| Precision | Window-level | Request-level |
Token Bucket: Controlled Bursts
Token bucket is the first algorithm that explicitly allows bursts -- but in a controlled way. The idea: imagine a bucket that holds tokens. Tokens are added at a steady rate. Each request consumes a token. If the bucket is empty, requests are rejected.
A bucket holds up to N tokens (the burst capacity). Tokens are added at a fixed rate (e.g., 10 per second). Each request removes one token. If no tokens remain, the request is rejected. The bucket can accumulate tokens during quiet periods, allowing short bursts of traffic when things get busy.
This is the algorithm behind AWS API Gateway, Stripe, and most major API providers. Why? Because real traffic is bursty. Users don't send exactly 10 requests per second -- they send 0 for a while, then 30 in a burst, then 0 again. Token bucket accommodates this natural pattern.
Bucket: max 5 tokens, refill 1 token per second
t=0s Bucket: 5 tokens
t=0s 3 requests arrive -> All allowed (2 tokens left)
t=1s Bucket: 3 tokens (1 refilled)
t=1s No requests
t=2s Bucket: 4 tokens (1 refilled)
t=2s 6 requests arrive -> 4 allowed, 2 rejected (0 tokens)
t=3s Bucket: 1 token (1 refilled)
The key insight: The bucket size controls burst tolerance. The refill rate controls sustained throughput. You can tune these independently. A large bucket with a slow refill allows big bursts followed by cooldown. A small bucket with a fast refill enforces a steadier rate.
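A common implementation trick is lazy refill: instead of a background timer adding tokens, each request computes how many tokens have accrued since the last check. A sketch, with illustrative names:

```python
import time

class TokenBucket:
    """Bucket of up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity  # start full: allows an initial burst
        self.last = 0.0

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Lazy refill: add tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Tuning is exactly as described above: `capacity` bounds the burst, `rate` bounds the sustained throughput, and the two knobs are independent.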
Leaky Bucket: Steady Output
Leaky bucket takes the opposite approach: instead of allowing bursts, it smooths them out. Requests enter a queue (the bucket). The queue is processed at a fixed rate, like water leaking through a hole. If the queue is full, new requests are dropped.
Incoming requests are added to a FIFO queue with a fixed capacity. The queue is drained at a constant rate (e.g., 1 request per 100ms). If the queue is full, new requests are rejected. The output rate is perfectly smooth regardless of how bursty the input is.
Queue capacity: 5, drain rate: 1 per second
t=0s 5 requests arrive -> All queued (5/5)
t=0s Processing request #1 (4/5 in queue)
t=0s 2 more requests arrive -> REJECTED (queue full)
t=1s Processing request #2 (3/5 in queue)
t=2s Processing request #3 (2/5 in queue)
When leaky bucket shines: When your downstream system can only handle a fixed throughput. If your database can process 100 queries per second, a leaky bucket ensures exactly that rate hits it, regardless of traffic spikes. The bucket absorbs bursts; the drain provides smooth output.
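A sketch of the "leaky bucket as meter" variant, with illustrative names: rather than holding actual queued requests, it tracks the queue depth as a number and drains it lazily based on elapsed time. (A real implementation protecting a downstream system would hold the requests themselves and process them from the queue at the drain rate.)

```python
import time

class LeakyBucket:
    """Models a FIFO queue drained at a constant rate; rejects when the queue is full."""

    def __init__(self, capacity, drain_rate):
        self.capacity = capacity
        self.drain_rate = drain_rate  # requests processed per second
        self.level = 0.0              # current queue depth
        self.last = 0.0

    def offer(self, now=None):
        now = time.time() if now is None else now
        # Drain the queue for the time elapsed since the last call.
        self.level = max(0.0, self.level - (now - self.last) * self.drain_rate)
        self.last = now
        if self.level + 1 > self.capacity:
            return False              # queue full: drop the request
        self.level += 1
        return True
```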
Try It Yourself
If you implement these algorithms yourself, experiment with each one: send individual requests to see steady-state behavior, then fire 8 requests at once and watch how each algorithm handles the spike.
Things to notice:
- Fixed Window: Send requests slowly, then watch the counter reset. Try timing a burst right as it resets.
- Sliding Window: Send a few requests, wait a few seconds, and see old ones expire from the window.
- Token Bucket: Let tokens accumulate during idle time, then send a burst -- it absorbs them.
- Leaky Bucket: Send a burst and watch the queue fill up. Notice the steady drain rate.
Distributed Rate Limiting: The Hard Part
Everything above works great on a single server. But when you have 10 API servers behind a load balancer, each one has its own counter. A client sending 10 requests per second might hit a different server each time, and no individual server sees enough to trigger the limit.
The solution: a centralized counter, usually in Redis. Every API server checks and increments the same counter for each client. This adds a network round-trip per request, but Redis is fast enough (sub-millisecond) that this is rarely the bottleneck.
-- Fixed window in Redis (pseudocode)
key = "ratelimit:{client_id}:{current_minute}"
count = INCR(key)
if count == 1:
    EXPIRE(key, 60)  -- auto-cleanup after window
if count > limit:
    return 429 Too Many Requests
⚠️ Redis becomes a single point of failure
If Redis goes down, your rate limiter goes down. The common approach: fail open. If you can't check the limit, allow the request. A brief period without rate limiting is better than rejecting all traffic. But monitor for this -- you need to know when your rate limiter is offline.
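The fail-open pattern can be sketched as a thin wrapper around the counter store. Everything here is illustrative: `FailOpenLimiter` is a made-up name, and `store` stands in for any client exposing Redis-style `incr`/`expire` operations (the in-memory stub below substitutes for a real Redis connection).

```python
class FailOpenLimiter:
    """Fixed-window check against a shared counter; allows requests if the store is down."""

    def __init__(self, store, limit, window_seconds=60):
        self.store = store  # anything with incr(key) and expire(key, ttl)
        self.limit = limit
        self.window = window_seconds

    def allow(self, client_id, current_window):
        key = f"ratelimit:{client_id}:{current_window}"
        try:
            count = self.store.incr(key)             # atomic increment
            if count == 1:
                self.store.expire(key, self.window)  # auto-cleanup after the window
            return count <= self.limit
        except ConnectionError:
            # Fail open: a brief gap in limiting beats rejecting all traffic.
            # Emit a metric or alert here so you know the limiter is offline.
            return True


class InMemoryStore:
    """Stand-in for a Redis client, for demonstration only."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def expire(self, key, ttl):
        pass  # a real store would schedule key deletion
```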
What Your 429 Response Should Look Like
When you reject a request, don't just return a bare 429 Too Many Requests. Give the client enough information to back off intelligently:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1708531200
{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Please retry after 30 seconds.",
  "retry_after": 30
}
| Header | Purpose |
|---|---|
| Retry-After | Seconds until the client can retry (standard HTTP header) |
| X-RateLimit-Limit | The total allowed requests per window |
| X-RateLimit-Remaining | How many requests are left in the current window |
| X-RateLimit-Reset | Unix timestamp when the window resets |
Good clients will read these headers and back off automatically. Bad clients will keep hammering you -- but that's what the rate limiter is for.
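From the client side, "reading these headers and backing off" can be as simple as the sketch below (a hypothetical helper, not any particular HTTP library's API): honor `Retry-After` when the server provides it, and fall back to capped exponential backoff when it doesn't.

```python
def next_retry_delay(status, headers, attempt, base=1.0, cap=60.0):
    """Seconds to wait before retrying, honoring Retry-After when present."""
    if status != 429:
        return 0.0  # not rate limited: no forced wait
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        return float(retry_after)        # the server told us exactly how long
    return min(cap, base * (2 ** attempt))  # otherwise: capped exponential backoff
```

The cap matters: without it, a client stuck in a long outage would compute absurd multi-hour waits after a dozen attempts.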
Choosing Your Algorithm
Does your system need to handle bursty traffic gracefully? If yes, and bursts are acceptable as long as the average rate holds, reach for token bucket. If bursts must be smoothed into a fixed output rate for a downstream system, use leaky bucket. If burst handling isn't a priority, start with fixed window for simplicity, and move to sliding window only when the boundary spike actually matters.
Layered Rate Limiting
In practice, you don't use just one rate limiter. You layer them:
- Global limit: Protect the entire system from traffic floods (DDoS, viral events)
- Per-user limit: Prevent any single user from hogging resources
- Per-endpoint limit: Stricter limits on expensive operations (writes, searches, file uploads)
A user might have 100 requests/minute overall, but only 10 writes/minute and 5 file uploads/minute. The global limit might be 10,000 requests/second across all users. Each layer catches different kinds of abuse.
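Layering can be sketched as running the checks in sequence: a request passes only if every layer allows it. The names here are illustrative; each layer's check could be any of the limiters described above.

```python
class LayeredLimiter:
    """Runs several rate-limit layers in order; all must allow the request."""

    def __init__(self, layers):
        self.layers = layers  # list of (name, check) where check(request) -> bool

    def allow(self, request):
        for name, check in self.layers:
            if not check(request):
                return False, name  # report which layer rejected, for the 429 body
        return True, None
```

One design caveat with sequential checks: a request rejected by a later layer has already consumed quota in the earlier ones. If that matters, check all layers first and only commit (consume tokens) once every layer has said yes.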
Authenticated vs. unauthenticated limits
Unauthenticated endpoints (login, signup, password reset) need per-IP rate limits -- and they should be aggressive. A legitimate user doesn't try to log in 100 times per minute. Rate limit login attempts by IP at something like 10/minute, and you've blocked most brute-force attacks without building anything complex.
Key Takeaways
The mental model
Rate limiting is a resource allocation problem, not a security problem (though it helps with security). You have finite capacity. Rate limiting ensures it's distributed fairly and no single client can exhaust it. Think of it as a traffic light for your API -- it keeps things flowing by sometimes making people stop.
The decisions that matter:
- Fixed Window for simplicity. Accept the boundary spike if your system can handle 2x burst.
- Sliding Window when the boundary spike matters. Slightly more complex but no loopholes.
- Token Bucket for user-facing APIs where burst tolerance matters. This is what most major APIs use.
- Leaky Bucket when you need to protect downstream systems with a fixed processing rate.
- Use Redis for distributed rate limiting. Fail open if Redis is unavailable.
- Return useful 429 responses with Retry-After headers so clients can back off intelligently.
- Layer your limits: global, per-user, and per-endpoint serve different purposes.
Start with a simple per-user fixed window counter. You can always upgrade to token bucket later when you understand your traffic patterns better. The worst rate limiter is the one you didn't deploy because the "right" algorithm felt too complex to implement.
References
- Stripe: Rate Limiting -- How Stripe implements rate limiting at scale
- Google Cloud: Rate Limiting Strategies -- Comprehensive overview of algorithms
- RFC 6585 -- Defines the 429 Too Many Requests status code
- Redis Rate Limiting Patterns -- Practical Redis implementations