Every production API does it. Stripe does it. Twilio does it. Your own data pipeline endpoints probably should too.
Rate limiting.
You hit an endpoint too fast and you get a 429. Retry-After says 60 seconds. You wait. You move on. Simple, right?
Then one day you're building something that has to handle spiky traffic and you realize you have no idea how that 429 actually gets calculated. Is it per second? Per minute? Sliding window? Fixed bucket? Why does Stripe let you burst a little before cutting you off?
I built a token bucket from scratch in pure Python to find out.
Why Token Buckets and Not the Simpler Alternatives?
There are three common approaches to rate limiting:
Fixed window: Count requests in a 60-second window. Reset at the boundary. Simple. But if someone sends 100 requests at 11:59:59 and 100 more at 12:00:00, you just let 200 through in two seconds. The boundary is the loophole.
Sliding window log: Keep a log of request timestamps. Count the ones that fall inside the last N seconds. Accurate. But storing a timestamp per request is expensive at scale.
Token bucket: Tokens accumulate at a steady refill rate up to a max capacity. Each request costs one token. If you run out, you wait. Bursting is allowed up to the bucket capacity.
Token buckets show up in a lot of real-world rate limiters because they handle burst traffic naturally. You save up tokens when traffic is low and spend them when it spikes.
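To make the boundary problem concrete, here is a minimal sketch of the first two approaches run against the same traffic. The class names FixedWindow and SlidingLog and the injected timestamps are my own illustration, not taken from any particular library.

```python
from collections import deque


class FixedWindow:
    """Allow at most `limit` requests per aligned window of `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # Crossed a boundary: the counter resets to zero.
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingLog:
    """Allow at most `limit` requests in any trailing `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = deque()  # one timestamp per accepted request

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the trailing window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False


def count_allowed(limiter, timestamps):
    return sum(limiter.allow(t) for t in timestamps)


# 100 requests in the last second before the boundary at t = 60,
# then 100 more in the first second after it.
burst = [59.0 + i / 100 for i in range(100)] + [60.0 + i / 100 for i in range(100)]

fixed_allowed = count_allowed(FixedWindow(limit=100, window=60), burst)
sliding_allowed = count_allowed(SlidingLog(limit=100, window=60), burst)
print(fixed_allowed, sliding_allowed)  # 200 100
```

The fixed window lets all 200 through. The sliding log holds the line at 100, but it is carrying 100 timestamps in memory to do it.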
The Mental Model
Imagine a physical bucket. It holds up to N tokens. Tokens drip in at a rate of R per second. Each API call takes one token out.
If you send requests slowly, you accumulate tokens. When you suddenly need to send a burst, you spend them. If the bucket empties, you wait.
That is it. That is the whole idea.
The math is simple:
tokens_now = min(capacity, tokens_last + rate * (now - last_refill_time))
You do not run a background thread dripping tokens. You just calculate how many should have accumulated based on elapsed time whenever a request comes in.
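Here is that formula as a tiny function with a few numbers plugged in. The name tokens_after is just for illustration.

```python
def tokens_after(tokens_last: float, capacity: float, rate: float, elapsed: float) -> float:
    """Lazy refill: how many tokens the bucket holds after `elapsed` seconds."""
    return min(capacity, tokens_last + rate * elapsed)


# Bucket with capacity 10 that refills at 2 tokens per second.
print(tokens_after(0.0, 10.0, 2.0, 0.5))   # 1.0  (empty bucket, half a second later)
print(tokens_after(3.0, 10.0, 2.0, 2.0))   # 7.0  (3 tokens plus 2 seconds of refill)
print(tokens_after(3.0, 10.0, 2.0, 60.0))  # 10.0 (a long idle period is capped at capacity)
```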
Building It in Pure Python
Let me show you a minimal but real implementation.
import time
import threading
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    capacity: float     # max tokens the bucket can hold
    refill_rate: float  # tokens added per second
    _tokens: float = field(init=False)
    _last_refill: float = field(init=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False)

    def __post_init__(self):
        self._tokens = self.capacity  # start full
        self._last_refill = time.monotonic()

    def _refill(self):
        # Caller must hold self._lock.
        now = time.monotonic()
        elapsed = now - self._last_refill
        gained = elapsed * self.refill_rate
        self._tokens = min(self.capacity, self._tokens + gained)
        self._last_refill = now

    def consume(self, tokens: float = 1.0) -> bool:
        with self._lock:
            self._refill()
            if self._tokens >= tokens:
                self._tokens -= tokens
                return True
            return False

    def wait_and_consume(self, tokens: float = 1.0) -> float:
        """Block until tokens are available. Returns wait time in seconds."""
        if tokens > self.capacity:
            # The bucket can never hold this many, so we would loop forever.
            raise ValueError("requested tokens exceed bucket capacity")
        start = time.monotonic()  # measure the wait from the first attempt
        while True:
            with self._lock:
                self._refill()
                if self._tokens >= tokens:
                    self._tokens -= tokens
                    return time.monotonic() - start
            time.sleep(0.01)  # sleep outside the lock so other threads can proceed
A few things to notice here.
The lock is per-bucket. In a real distributed system you'd use Redis with WATCH/MULTI/EXEC or a Lua script for atomicity. For a single process, threading.Lock is enough.
The _refill method is lazy. Tokens accumulate based on elapsed clock time, not from a background thread. This is cheaper and avoids drift.
Start full (_tokens = capacity). This lets a brand-new client burst immediately rather than waiting for the bucket to fill.
Testing It
import time


def test_burst_then_throttle():
    bucket = TokenBucket(capacity=10, refill_rate=2)  # 2 req/sec, burst up to 10

    # Burst: should burn through the full bucket
    burst_results = [bucket.consume() for _ in range(10)]
    assert all(burst_results), "Should allow full burst"

    # Immediate next request should fail (bucket empty)
    assert not bucket.consume(), "Should reject when empty"

    # Wait 0.6 seconds: at 2 tokens/sec that is ~1.2 tokens, enough for one request
    time.sleep(0.6)
    assert bucket.consume(), "Should allow after partial refill"

    print("All burst tests passed.")


test_burst_then_throttle()
Output:
All burst tests passed.
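Tests that call time.sleep are slow and can get flaky on a busy machine. One way around that, sketched below, is to let the bucket take its time source as a parameter and drive it with a fake clock. ClockedBucket and FakeClock are hypothetical names for this example. It is a stripped-down, single-threaded variant (no lock), not the TokenBucket class above.

```python
import time
from typing import Callable


class ClockedBucket:
    """Token bucket whose time source is injectable (single-threaded sketch)."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock
        self._tokens = capacity  # start full
        self._last_refill = clock()

    def consume(self, tokens: float = 1.0) -> bool:
        # Same lazy refill as before, but "now" comes from the injected clock.
        now = self._clock()
        gained = (now - self._last_refill) * self.refill_rate
        self._tokens = min(self.capacity, self._tokens + gained)
        self._last_refill = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False


class FakeClock:
    """A clock that only moves when the test tells it to."""

    def __init__(self):
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def test_with_fake_clock() -> bool:
    clock = FakeClock()
    bucket = ClockedBucket(capacity=10, refill_rate=2, clock=clock)

    assert all(bucket.consume() for _ in range(10))  # full burst
    assert not bucket.consume()                      # bucket is empty

    clock.advance(0.25)          # 0.5 tokens: still not enough
    assert not bucket.consume()
    clock.advance(0.25)          # 1.0 token accumulated in total
    assert bucket.consume()
    assert not bucket.consume()

    clock.advance(60)            # long idle: refill is capped at capacity
    assert sum(bucket.consume() for _ in range(20)) == 10
    return True


test_with_fake_clock()
```

The whole test runs without sleeping, and the partial-refill cases are exact instead of approximate.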
Wiring It to an HTTP Endpoint
Here is how you'd use this with a simple HTTP server. No framework needed.
from http.server import HTTPServer, BaseHTTPRequestHandler

# One bucket per API key (in a real system, keyed by user/org)
buckets: dict[str, TokenBucket] = {}


def get_bucket(api_key: str) -> TokenBucket:
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(capacity=20, refill_rate=5)
    return buckets[api_key]


class RateLimitedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        api_key = self.headers.get("X-API-Key", "anonymous")
        bucket = get_bucket(api_key)
        if bucket.consume():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK\n")
        else:
            self.send_response(429)
            self.send_header("Retry-After", "1")
            self.end_headers()
            self.wfile.write(b"Too Many Requests\n")

    def log_message(self, fmt, *args):
        pass  # silence default access log


if __name__ == "__main__":
    server = HTTPServer(("localhost", 8080), RateLimitedHandler)
    print("Server on :8080")
    server.serve_forever()
Run this and then hammer it:
for i in $(seq 1 25); do curl -s -o /dev/null -w "%{http_code}\n" -H "X-API-Key: test123" http://localhost:8080/; done
You should see roughly 20 responses with 200, then mostly 429s. The loop takes a moment to run, so a few extra requests can succeed as tokens refill at 5 per second.
The Retry-After Header Matters
When you return a 429, you should tell the client how long to wait. Here is a small helper:
def seconds_until_token(bucket: TokenBucket) -> float:
    with bucket._lock:
        bucket._refill()  # bring the count up to date before reading it
        deficit = 1.0 - bucket._tokens
    if deficit <= 0:
        return 0.0
    return deficit / bucket.refill_rate
Add it to the 429 response:
wait = seconds_until_token(bucket)
self.send_header("Retry-After", str(int(wait) + 1))
This is the same idea behind the Retry-After values you get from public APIs. The client reads it, backs off the right amount, and retries. No guessing.
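On the client side, reading that header can be sketched like this. The function name retry_delay and the 1-second fallback are my own assumptions. The HTTP spec allows Retry-After to be either a number of seconds or an HTTP date, so the sketch handles both.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(header_value: Optional[str], default: float = 1.0,
                now: Optional[datetime] = None) -> float:
    """Turn a Retry-After header value into a number of seconds to sleep."""
    if header_value is None:
        return default  # server gave no hint: fall back to an assumed default
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form, e.g. "12"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable value: fall back rather than crash
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_delay("12"))   # 12.0
print(retry_delay(None))   # 1.0
print(retry_delay(
    "Wed, 21 Oct 2015 07:28:00 GMT",
    now=datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc),
))                         # 60.0
```

A real client would sleep for that long and then retry, ideally with a cap on the number of attempts.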
Per-User vs Global Buckets
One bucket per API key gives per-user isolation. But you often want both:
class TwoTierBucket:
    def __init__(self, global_bucket: TokenBucket, user_bucket: TokenBucket):
        self.global_bucket = global_bucket
        self.user_bucket = user_bucket

    def consume(self) -> bool:
        # Both must succeed. Check user first (cheaper to fail fast).
        if not self.user_bucket.consume():
            return False
        if not self.global_bucket.consume():
            # User had tokens but global is empty. Give the user token back.
            # Hold the user bucket's lock so the refund cannot race with consume().
            with self.user_bucket._lock:
                self.user_bucket._tokens = min(
                    self.user_bucket.capacity,
                    self.user_bucket._tokens + 1
                )
            return False
        return True
This is closer to what production systems do. A single abusive user cannot exhaust the global budget. But even a well-behaved user gets throttled during a traffic spike.
What I Learned That Surprised Me
The bucket starts full on purpose. New integrations should be able to burst. If your bucket starts empty, new clients immediately hit 429 and think your API is broken.
Burst capacity is a product decision. High capacity means more tolerance for spiky clients. Low capacity means tighter SLAs. The rate (tokens/second) controls the steady-state. The capacity controls how much breathing room you give.
Distributed token buckets are the hard part. In a single process, this is about 30 lines. In a distributed system with multiple API servers, you need a shared store. Redis is the standard answer. The well-known INCR-then-EXPIRE trick only gives you a fixed-window counter. A token bucket has to read the stored token count and timestamp, compute the refill, and write both back as one atomic step, which is why it usually ends up in a Lua script.
429 is kinder than 500. If you let traffic flood through unchecked and your downstream database falls over, everyone gets 500s. A well-calibrated rate limiter returns 429 to a minority of callers and keeps the rest of the system healthy. That is the real value.
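For the distributed case, it helps to see the bucket as a pure state transition: read two numbers, compute, write two numbers back. The sketch below is plain Python with a dict standing in for the shared store. BucketState, take, and handle are hypothetical names, and nothing here talks to Redis. It only shows the logic that would have to run atomically on the server side.

```python
from typing import NamedTuple


class BucketState(NamedTuple):
    tokens: float
    last_refill: float


def take(state: BucketState, now: float, capacity: float, rate: float,
         cost: float = 1.0) -> tuple[bool, BucketState]:
    """One read-compute-write step. In a shared store this must be atomic."""
    # Different servers may disagree slightly about the time, so never go backwards.
    elapsed = max(0.0, now - state.last_refill)
    new_last = max(now, state.last_refill)
    tokens = min(capacity, state.tokens + elapsed * rate)
    if tokens >= cost:
        return True, BucketState(tokens - cost, new_last)
    return False, BucketState(tokens, new_last)


# A dict stands in for the shared store (an assumption for this sketch).
store = {"test123": BucketState(tokens=20.0, last_refill=0.0)}


def handle(key: str, now: float) -> bool:
    allowed, store[key] = take(store[key], now, capacity=20.0, rate=5.0)
    return allowed


results = [handle("test123", now=0.0) for _ in range(25)]
print(results.count(True))         # 20
print(handle("test123", now=1.0))  # True (5 tokens refilled after one second)
```

If two servers ran the body of take at the same time against the same stored state, both could spend the same token. That is the race the atomic script exists to prevent.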
Build the Toy Version of What You Depend On
Rate limiters are behind every API you call. Stripe, Twilio, OpenAI, AWS, your own team's internal services. Most engineers just treat 429 as a thing that happens to them.
Building this taught me how the numbers actually connect. A capacity of 20 and a rate of 5 means at steady state you can do 5 requests per second forever, but you can absorb a burst of 20. When I see Retry-After: 12 now, I know someone computed a deficit.
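A quick way to play with those numbers: if a client sends faster than the refill rate, a full bucket drains at (demand - rate) tokens per second. The helper below is a back-of-the-envelope sketch that treats traffic as a smooth flow, and burst_seconds is a name I made up for it.

```python
import math


def burst_seconds(capacity: float, rate: float, demand: float) -> float:
    """How long a full bucket can sustain `demand` requests/sec before emptying."""
    if demand <= rate:
        return math.inf  # refill keeps up, so the bucket never runs dry
    return capacity / (demand - rate)


# capacity 20, refill rate 5 tokens/sec
print(burst_seconds(20, 5, 25))  # 1.0  (25 req/sec lasts one second)
print(burst_seconds(20, 5, 10))  # 4.0  (10 req/sec lasts four seconds)
print(burst_seconds(20, 5, 5))   # inf  (steady state: sustainable forever)
```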
The token bucket is 30 lines. The mental model is one sentence: tokens drip in, requests drain them, full bucket means you can burst.
Build the toy version of the thing you depend on. You will understand it better than any documentation can teach you.
If this helped, follow me on dev.to. I publish one of these every week. The series so far covers WAL, Bloom filters, LSM-trees, columnar storage, feature stores, and streaming window aggregators. Each one is under 200 lines of pure Python and no external dependencies.
What system should I build next? Token bucket is done. B-tree and consistent hashing ring are both on the list.












