Most systems don't break at 1 million users.
They break at 50,000 — because the architecture was never designed to go beyond the first 10,000. The decisions that felt fine at launch become the constraints that define your ceiling.
This isn't a post about theory. It's about the specific, practical decisions that separate systems that scale from systems that get rewritten under pressure.
The Fundamental Shift Nobody Warns You About
At 1,000 users, your biggest problem is building fast enough.
At 1,000,000 users, your biggest problem is failing gracefully.
That shift in mindset — from "how do we ship features" to "how do we contain blast radius" — is what scaling actually requires. Every architectural decision at scale is really a decision about how your system behaves when something goes wrong. Because at a million users, something is always going wrong somewhere.
1. Stop Treating Your Database as a General-Purpose Tool
The database is the first thing that breaks at scale. Not because databases are weak — because engineers ask them to do too many things at once.
At 1M+ users, one database handling transactional writes, analytical queries, full-text search, and reporting simultaneously is a liability. Each workload has different access patterns. A long-running analytics query can hold locks and consume I/O that your transactional writes need. A full-text search without a suitable index can fall back to sequential scans that compete with your indexed reads.
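The separation that works gives each workload its own store: transactional writes and simple indexed reads stay in the primary database, while analytics, full-text search, and reporting move to systems suited to them.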
You don't need all of these on day one. But by the time you're approaching 1M users, your transactional database should be doing exactly one thing: handling writes and simple indexed reads.
Anything else is borrowed time.
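One way to make that separation explicit in code is to route every query by workload type. The sketch below is only an illustration: the `Workload` enum, the `store_for` helper, and all hostnames are hypothetical, and the choice of replicas and a search cluster is an assumption rather than a prescription.

```python
from enum import Enum


class Workload(Enum):
    TRANSACTIONAL = "transactional"
    ANALYTICS = "analytics"
    SEARCH = "search"
    REPORTING = "reporting"


# Hypothetical endpoints: each workload is served by its own store.
DATA_STORES = {
    Workload.TRANSACTIONAL: "postgresql://primary.internal/app",
    Workload.ANALYTICS: "postgresql://analytics.internal/app",
    Workload.SEARCH: "http://search.internal:9200",
    Workload.REPORTING: "postgresql://reporting.internal/app",
}


def store_for(workload: Workload) -> str:
    """Return the endpoint that should serve this kind of query."""
    return DATA_STORES[workload]


print(store_for(Workload.ANALYTICS))
```

Forcing callers to name the workload makes it harder for an analytics query to land on the transactional database by accident.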
2. Cache Aggressively — But Cache the Right Things
Caching solves a specific problem: you're computing or fetching the same data repeatedly when you don't need to.
At scale, the wrong caching strategy is often worse than no caching at all. Cached stale data causes support tickets. Cache stampedes — where a cache key expires and 10,000 concurrent requests all hit the database simultaneously — cause outages.
What to cache:
# Good cache candidates
- User session data (changes rarely, read constantly)
- Computed aggregates (total order count, dashboard metrics)
- Reference data (pricing plans, feature flags, config)
- API responses for public, non-personalized endpoints
# Bad cache candidates
- Anything that must be real-time accurate (inventory, balances)
- Data that's unique per request
- Anything you'd regret serving stale during an incident
Handle cache stampedes with probabilistic early expiration:
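import random

import redis

# Reuse one client instead of opening a connection on every call.
r = redis.Redis()

def get_with_stampede_protection(key, ttl, fetch_fn):
    cached = r.get(key)
    if cached is not None:
        remaining_ttl = r.ttl(key)
        # Probabilistically refresh before expiry. ttl() is negative when the
        # key has no expiry or has just disappeared, so require >= 0.
        if 0 <= remaining_ttl < 30 and random.random() < 0.1:
            value = fetch_fn()
            r.setex(key, ttl, value)
            return value
        return cached  # note: redis-py returns bytes by default
    # Cache miss: fetch the value and store it with a fresh TTL.
    value = fetch_fn()
    r.setex(key, ttl, value)
    return value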
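Once the TTL drops below 30 seconds, roughly 10% of requests refresh the key, so a hot key is usually repopulated before it expires and its readers don't all miss at once. This does not cap how many requests refresh concurrently; for very hot keys, pair it with a lock so only one caller recomputes.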
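A fixed 10% chance inside a fixed 30-second window is a simplification. A commonly cited refinement (often called XFetch) makes the refresh probability rise as expiry approaches and scales it by how long the value takes to recompute. The sketch below shows only the decision rule; the function name and parameters are illustrative, and it assumes you store the expiry timestamp and last recompute duration alongside the cached value.

```python
import math
import random
import time


def should_refresh_early(expiry_ts, recompute_seconds, beta=1.0, now=None, rand=None):
    """Decide whether this request should recompute the value before it expires.

    expiry_ts: absolute time (in seconds) at which the cached value expires.
    recompute_seconds: how long the last recomputation took.
    beta: values above 1 refresh earlier, values below 1 refresh later.
    now, rand: injectable for testing; default to the clock and a random draw.
    """
    now = time.time() if now is None else now
    # 1 - random.random() lies in (0, 1], so log() never receives zero.
    rand = (1.0 - random.random()) if rand is None else rand
    # -log(rand) is >= 0, so this pushes "now" forward by a random amount.
    return now - recompute_seconds * beta * math.log(rand) >= expiry_ts
```

Slow-to-compute values get refreshed earlier, and requests far from expiry almost never trigger a refresh.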
3. Design for Horizontal Scale From the Start
Vertical scaling — bigger server, more RAM, faster CPU — has a ceiling and an invoice.
Horizontal scaling (more servers sharing the same load) has a far higher ceiling and a bill that tracks demand, provided your application is stateless.
Stateless means: any request can be handled by any server, because no server holds state that another doesn't have.
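What breaks stateless architecture: state that lives on a single server, such as local session data or files written to local disk.
What enables it: keeping that state in shared storage that every server can reach.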
Once your application is stateless, scaling is an infrastructure decision — add servers behind a load balancer. Without it, scaling is an engineering rewrite.
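As a minimal sketch of what stateless means in practice, the example below keeps session data in a shared store so that any server can handle any request. `SharedSessionStore` is a hypothetical in-memory stand-in for an external store that all servers can reach (Redis would be one option); the class and method names are illustrative.

```python
import uuid


class SharedSessionStore:
    """Stand-in for an external store that every server can reach."""

    def __init__(self):
        self._data = {}

    def save(self, session_id, session):
        self._data[session_id] = dict(session)

    def load(self, session_id):
        return self._data.get(session_id)


class AppServer:
    """A server that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user_id):
        session_id = uuid.uuid4().hex
        self.store.save(session_id, {"user_id": user_id})
        return session_id

    def whoami(self, session_id):
        session = self.store.load(session_id)
        return session["user_id"] if session else None


store = SharedSessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

session_id = server_a.login("usr-456")
# The next request lands on a different server and still resolves the user.
print(server_b.whoami(session_id))
```

If `AppServer` kept sessions in its own dictionary instead, the second call would fail whenever the load balancer picked a different server.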
4. Async Everything That Doesn't Need to Be Synchronous
At 1M users, synchronous processing is a throughput killer.
The pattern that kills most systems: user hits an endpoint, endpoint does 14 things (sends email, updates analytics, triggers webhook, logs to 3 services, recalculates user score), user waits 4 seconds for a response.
The response time is the sum of all operations. At scale, that becomes unacceptable — and fragile. One downstream service being slow makes your entire endpoint slow.
The rule: If the user doesn't need the result of an operation to continue, it should be async.
# Synchronous: the user waits for all of this
def create_order(user_id, items):
    order = db.create_order(user_id, items)
    email.send_confirmation(user_id, order)      # 300ms
    analytics.track_purchase(user_id, order)     # 150ms
    webhook.notify_integrations(order)           # 200ms
    inventory.update_stock(items)                # 100ms
    return order                                 # Total: 750ms+

# Async: the user gets a response in <50ms
def create_order(user_id, items):
    order = db.create_order(user_id, items)
    queue.enqueue('send_confirmation', user_id, order.id)
    queue.enqueue('track_purchase', user_id, order.id)
    queue.enqueue('notify_integrations', order.id)
    queue.enqueue('update_stock', [i.id for i in items])
    return order                                 # Total: ~40ms
The user gets a response almost immediately. The confirmation email and everything else happen in the background, where the queue's workers can retry failed jobs.
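Retries are a property of the worker side, not of the enqueue call. The sketch below shows the idea with an in-process queue from the standard library standing in for a real job queue; `enqueue`, `run_worker`, and the handler are hypothetical names, and a production queue library would normally provide this behavior for you.

```python
import queue
import time

jobs = queue.Queue()


def enqueue(task_name, *args, attempt=1):
    jobs.put((task_name, args, attempt))


def run_worker(handlers, max_attempts=3, base_delay=0.0):
    """Drain the queue, re-enqueueing failed jobs until max_attempts is reached."""
    dead_letter = []
    while not jobs.empty():
        task_name, args, attempt = jobs.get()
        try:
            handlers[task_name](*args)
        except Exception as exc:
            if attempt < max_attempts:
                # Exponential backoff between attempts.
                time.sleep(base_delay * 2 ** (attempt - 1))
                enqueue(task_name, *args, attempt=attempt + 1)
            else:
                # Give up and keep the job for manual inspection.
                dead_letter.append((task_name, args, repr(exc)))
    return dead_letter


calls = {"send_confirmation": 0}
sent = []


def send_confirmation(user_id, order_id):
    # Simulated flaky dependency: fails twice, then succeeds.
    calls["send_confirmation"] += 1
    if calls["send_confirmation"] < 3:
        raise ConnectionError("mail provider timeout")
    sent.append((user_id, order_id))


enqueue("send_confirmation", "usr-456", "ord-1")
failed = run_worker({"send_confirmation": send_confirmation})
```

Because a job may run more than once, handlers should be idempotent: sending the same confirmation twice or decrementing stock twice is a bug the queue will not catch for you.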
5. Rate Limiting Is Not Optional
At 1M users, a small percentage of them will accidentally or intentionally hammer your API.
One user running a misconfigured sync job making 10,000 requests per minute can degrade your service for everyone else. Without rate limiting, you have no defense against this.
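Implement rate limiting at multiple layers: at the edge, before traffic reaches your servers, and again in the application, where you know which tenant and endpoint a request belongs to.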
A simple Redis-based rate limiter:
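import redis

r = redis.Redis()  # shared client, created once

def is_rate_limited(tenant_id: str, endpoint: str, limit: int, window: int) -> bool:
    key = f"ratelimit:{tenant_id}:{endpoint}"
    pipe = r.pipeline()
    # Create the counter with a TTL only if it does not exist yet. Calling
    # expire() on every request would keep extending the window.
    pipe.set(key, 0, ex=window, nx=True)
    pipe.incr(key)
    results = pipe.execute()
    request_count = results[1]  # result of INCR
    return request_count > limit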
Always return a Retry-After header on 429 responses. Clients that don't get a retry hint often retry immediately, which makes the problem worse.
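To show where the Retry-After value can come from, here is a minimal sketch of a fixed-window limiter that reports how many seconds remain until the window resets. It uses an in-memory dictionary purely for illustration; `FixedWindowLimiter` and `respond` are hypothetical names, and a multi-server deployment would keep the counters in a shared store.

```python
import math
import time


class FixedWindowLimiter:
    """In-memory fixed-window counter (old buckets are never evicted here)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._counts = {}  # (key, window_index) -> request count

    def check(self, key, now=None):
        """Return (allowed, retry_after_seconds)."""
        now = time.time() if now is None else now
        window_index = int(now // self.window)
        bucket = (key, window_index)
        self._counts[bucket] = self._counts.get(bucket, 0) + 1
        if self._counts[bucket] <= self.limit:
            return True, 0
        window_end = (window_index + 1) * self.window
        return False, max(1, math.ceil(window_end - now))


def respond(limiter, tenant_id, now=None):
    """Return (status_code, headers) for a request from this tenant."""
    allowed, retry_after = limiter.check(tenant_id, now)
    if allowed:
        return 200, {}
    return 429, {"Retry-After": str(retry_after)}
```

A well-behaved client can sleep for the advertised number of seconds instead of guessing.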
6. Observability Before You Need It
At small scale, debugging means reproducing the issue locally.
At 1M users, you cannot reproduce production. You can only observe it.
Teams that scale well have three things in place before they hit serious traffic — not after:
Structured logging:
{
  "timestamp": "2026-06-17T10:23:44Z",
  "level": "error",
  "service": "order-service",
  "tenant_id": "abc-123",
  "user_id": "usr-456",
  "request_id": "req-789",
  "message": "Payment gateway timeout",
  "duration_ms": 5043,
  "endpoint": "POST /orders"
}
Unstructured logs are unsearchable at scale. Every log line should be JSON with consistent fields.
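One way to get log lines in that shape from Python's standard `logging` module is a custom formatter. This is a sketch under assumptions: the `JsonFormatter` class and its list of context fields are illustrative, and many teams use an existing JSON logging library instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a consistent set of fields."""

    CONTEXT_FIELDS = ("tenant_id", "user_id", "request_id", "duration_ms", "endpoint")

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Copy over any context passed through logging's `extra` argument.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="order-service"))
logger.addHandler(handler)

logger.error(
    "Payment gateway timeout",
    extra={"tenant_id": "abc-123", "request_id": "req-789", "duration_ms": 5043},
)
```

Passing context through `extra` keeps call sites short while the formatter guarantees every line carries the same field names.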
Metrics that matter: track P95 and P99 latency, not just averages.
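Tail percentiles are a useful example, because an average can look acceptable while a slice of users has a terrible experience. The sketch below uses made-up latencies (98 fast requests and 2 very slow ones) and the standard library's `statistics.quantiles`.

```python
import statistics

# Hypothetical latencies in milliseconds: 98 fast requests, 2 very slow ones.
latencies_ms = [40] * 98 + [5000] * 2

mean = statistics.mean(latencies_ms)

# quantiles(n=100) returns 99 cut points; index 94 is P95, index 98 is P99.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p95, p99 = cuts[94], cuts[98]

print(f"mean={mean:.1f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

Here the mean is about 139 ms, which matches no real request, while P99 exposes the 5-second requests that the mean smooths over.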
Distributed tracing:
When a request touches 6 services before returning, knowing that "something was slow" is useless. A trace ID that follows the request through every service tells you exactly which hop took 3 seconds.
Use OpenTelemetry. Instrument once, export to whatever backend you use (Jaeger, Datadog, Honeycomb).
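The core idea is small enough to sketch with the standard library. The example below is not the OpenTelemetry API; it is a hypothetical illustration in which one trace ID is created at the edge, passed to every hop, and each hop records its own duration. The service names and the `span` helper are made up.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # stand-in for a tracing backend


@contextmanager
def span(trace_id, service):
    """Record how long the wrapped block took, tagged with the trace ID."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        spans.append({"trace_id": trace_id, "service": service, "duration_ms": duration_ms})


def handle_request():
    trace_id = uuid.uuid4().hex  # created once at the edge, then passed along
    with span(trace_id, "api-gateway"):
        with span(trace_id, "order-service"):
            time.sleep(0.01)
        with span(trace_id, "payment-service"):
            time.sleep(0.05)
    return trace_id


trace_id = handle_request()
downstream = [s for s in spans if s["service"] != "api-gateway"]
slowest = max(downstream, key=lambda s: s["duration_ms"])
print(slowest["service"])
```

In a real deployment the trace ID travels between services in request headers, and the instrumentation library handles propagation and export.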
7. Design for Partial Failure
At 1M users, the question is not whether something will fail. It's whether a failure in one part of your system takes down everything else.
Circuit breakers:
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # seconds to wait before allowing a trial call
        self.last_failure_time = None
        self.state = "closed"  # closed = normal, open = blocking calls

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = "half-open"  # let one trial call through
            else:
                raise RuntimeError("Circuit open: downstream service unavailable")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed trial call in half-open re-opens the circuit immediately.
            if self.state == "half-open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise  # re-raise with the original traceback
        # Success: reset the breaker.
        self.failure_count = 0
        self.state = "closed"
        return result
When your payment provider goes down, a circuit breaker stops your order service from waiting 30 seconds per request — instead failing fast and letting the user know immediately.
Graceful degradation:
Define in advance what your system looks like with each dependency missing.
Not every dependency failure should be a user-facing error.
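A minimal sketch of that idea: wrap calls to non-critical dependencies so a failure produces a fallback value instead of an error page. The `with_fallback` helper, the recommendation service, and the page structure are all hypothetical.

```python
def with_fallback(fn, fallback, *args, **kwargs):
    """Call fn; if it raises, return the fallback and flag the result as degraded."""
    try:
        return fn(*args, **kwargs), False
    except Exception:
        return fallback, True


def fetch_recommendations(user_id):
    # Hypothetical non-critical dependency that is currently down.
    raise TimeoutError("recommendation service unavailable")


def render_product_page(user_id):
    recommendations, degraded = with_fallback(fetch_recommendations, [], user_id)
    # The page still renders; it just omits the recommendations section.
    return {"product": "widget", "recommendations": recommendations, "degraded": degraded}


print(render_product_page("usr-456"))
```

Returning a `degraded` flag lets you log or count how often users see the reduced experience, so a silent fallback does not hide a long-running outage.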
The Scaling Readiness Checklist
Before you need to handle 1M users — not after:
- [ ] Is your application stateless? (No local session or file storage)
- [ ] Are reads and writes separated at the database layer?
- [ ] Is cache stampede protection in place on critical keys?
- [ ] Are all non-critical operations processed asynchronously via a queue?
- [ ] Is rate limiting implemented at the edge AND application layer?
- [ ] Are logs structured JSON with consistent fields including tenant and request ID?
- [ ] Are you tracking P95/P99 latency, not just averages?
- [ ] Do you have distributed tracing across service boundaries?
- [ ] Are circuit breakers in place for all external service dependencies?
- [ ] Is graceful degradation defined for each critical dependency failure?
The Real Lesson
Scaling is not a feature you add later. It's a series of small architectural decisions made early that either compound in your favor or against you.
The teams that handle 1M users without drama didn't build something magical. They built something boring — stateless services, async queues, proper caching, real observability, and defined failure modes. Nothing on this list is novel. All of it requires discipline to implement before you feel the pressure.
By the time you feel the pressure, you're already behind.
This post is part of OutworkTech's backend engineering series. Related reading: Database Indexing Mistakes That Kill SaaS Performance at Scale and Designing High-Performance APIs That Scale.
OutworkTech builds and scales backend systems, APIs, and SaaS infrastructure for companies that need engineering depth without the overhead. If you're approaching scale and need the architecture to match — let's talk.












