- 3h 27m outage — 12:18 UTC to 15:45 UTC, April 16 2025
- 675M monthly active users affected globally
- 48,000+ peak Downdetector reports
- 0 regions with staged rollout — applied globally simultaneously
- Root cause: Envoy max heap configured higher than K8s memory limit
- Fix: capacity increase reduced per-instance memory below the kill threshold
On April 16, 2025, Spotify's engineering team made a change they deemed low risk: reordering the custom filters on their Envoy Proxy perimeter. Envoy is an open-source edge proxy that receives all incoming user traffic before distributing it to backend services. The change was applied to all regions simultaneously. Every Envoy instance crashed at once, and alarms fired two minutes later. Then the restart loop began, powered by Kubernetes itself, which killed each replacement instance as fast as it came back up. Only Asia Pacific kept serving traffic, and the reason why told the engineers exactly what was broken.
The Story
This crash happened simultaneously on all Envoy instances.
— Spotify Engineering, Incident Report: Spotify Outage on April 16, 2025
There is a specific kind of engineering failure that hurts more than the others: the change that was reviewed, discussed, and approved — the change the team looked at together and agreed was fine. Spotify's perimeter is the first layer of software that receives traffic from every user worldwide — every stream request, every search, every login. To extend Envoy's capabilities, Spotify develops its own custom filters — plugins that handle rate limiting, authentication, and other cross-cutting concerns. These filters execute in a defined order. The April 16 change altered that order. The new sequence triggered a latent bug in one of the custom filters: a code path that had existed harmlessly, triggered only when the filter received control at that specific position. Envoy crashed. Not one instance, not one region. All of them.
Problem
12:18 UTC — Filter Reorder Applied Globally, All Envoy Instances Crash
The change to Envoy filter execution order was applied simultaneously to all cloud regions worldwide. The new order activated a latent bug in a custom Spotify filter. Every Envoy instance on Spotify's networking perimeter crashed at the same moment. Alarms fired two minutes later as the traffic drop became measurable.
Cause
The Hidden Misconfiguration: Heap Larger Than the K8s Memory Limit
When the perimeter went down, client apps automatically retried their failed requests. That retry flood exposed a misconfiguration that had existed undetected: Envoy's max heap size was configured higher than the Kubernetes memory limit for the pod. Under normal traffic, Envoy never approached its heap limit and the misconfiguration was invisible. Under the retry flood, each new instance immediately exceeded the K8s limit and was killed. This turned a recoverable crash into an infinite restart loop.
Solution
Asia Pacific Stayed Up — and Explained Everything
Asia Pacific was the only region that kept serving traffic, and engineers investigated why. The answer: traffic volume was lower in APAC at that time of day, so its Envoy instances never received enough retry traffic to exceed the K8s memory limit. The asymmetry confirmed the hypothesis: the death loop was driven by the memory limit, not by the filter bug. Fix the memory headroom, break the loop.
Result
15:45 UTC — Death Loop Broken, Full Recovery
Increasing total perimeter server capacity gave each new Envoy instance enough headroom to stay under the K8s memory limit even while absorbing the retry traffic flood. The death loop broke. EU recovered at 14:20 UTC, US at 15:10 UTC, full normalisation at 15:40 UTC. Total duration: 3 hours 27 minutes.
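A toy model helps show why adding capacity breaks the loop. The sketch below assumes per-instance memory grows linearly with the request rate each instance absorbs. That formula, the 3 GiB limit, the 1 GiB baseline, and the traffic figures are all illustrative assumptions, not numbers from Spotify's report.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2


def per_instance_memory(total_rps, instances, base_bytes, bytes_per_rps):
    """Assumed linear model: a fixed baseline plus a cost per request/second."""
    return base_bytes + bytes_per_rps * (total_rps / instances)


def instances_needed(total_rps, limit_bytes, base_bytes, bytes_per_rps):
    """Smallest instance count that keeps every instance strictly under the limit."""
    budget = limit_bytes - base_bytes
    if budget <= 0:
        raise ValueError("baseline memory already exceeds the limit")
    return math.floor(bytes_per_rps * total_rps / budget) + 1


# Hypothetical numbers, chosen only to make the arithmetic visible.
LIMIT, BASE, COST = 3 * GIB, 1 * GIB, 1 * MIB

flooded = per_instance_memory(400_000, 100, BASE, COST)   # busy region under retry flood
quiet = per_instance_memory(100_000, 100, BASE, COST)     # off-peak region
needed = instances_needed(400_000, LIMIT, BASE, COST)

print(f"flooded: {flooded / GIB:.2f} GiB, quiet: {quiet / GIB:.2f} GiB, needed: {needed}")
```

In this model the flooded region exceeds the limit at 100 instances while the quiet region stays under it, and spreading the same flood over more instances brings each one back below the kill threshold.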
The Fix
The Misconfiguration Nobody Noticed — Until the Crash
The root problem was that Envoy's max heap size was set higher than the Kubernetes memory limit for the pod. In normal operation, Envoy memory usage never approached its heap maximum — the misconfiguration was invisible. The retry flood was the first event extreme enough to push instances over the K8s limit and trigger the kill cycle.
- 3h 27m — Total outage duration, 12:18 to 15:45 UTC
- 675M — Users affected; 263M paying Premium subscribers — no perimeter differentiation by tier
- 48,000+ — Peak Downdetector reports (active reporters only; actual affected users in the hundreds of millions)
- 0 — Regions with staged rollout before full deployment
```yaml
# THE MISCONFIGURATION: Envoy heap limit higher than K8s memory limit

# Kubernetes pod resource specification (simplified)
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: envoy
      resources:
        requests:
          memory: "2Gi"
        limits:
          memory: "3Gi"  # the container is OOMKilled above this
---
# Envoy overload manager configuration (simplified)
overload_manager:
  resource_monitors:
    - name: envoy.resource_monitors.fixed_heap
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 4294967296  # 4 GiB, HIGHER than the 3 GiB K8s limit!

# Why this is catastrophic:
# - K8s kills at 3 GiB memory usage
# - Envoy's own safety valve triggers at 95% of 4 GiB = 3.8 GiB
# - K8s limit is hit BEFORE Envoy's graceful degradation kicks in
# - Under normal load: Envoy peaks at ~1.5 GiB, so the misconfiguration is invisible
# - Under retry flood: Envoy climbs past 3 GiB -> OOMKill -> restart -> repeat

# IMMEDIATE FIX: Increase perimeter server count
# More servers = retry traffic spread across more instances
# = each instance stays under 3 GiB = K8s doesn't kill = loop breaks

# PERMANENT FIX: Align heap config with K8s memory limit
# max_heap_size_bytes: 2684354560  # 2.5 GiB, safely below the 3 GiB K8s limit
```
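The mismatch above is the kind of thing a small pre-deploy check could flag. The following is a minimal sketch, not Spotify's tooling: `parse_quantity` handles only a subset of Kubernetes quantity suffixes, and `check_heap` with its 15% headroom default is a hypothetical helper. The headroom reflects the assumption that a process uses some memory beyond its tracked heap.

```python
import re

# Subset of Kubernetes quantity suffixes (binary and decimal); others are rejected.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "": 1,
}


def parse_quantity(quantity: str) -> int:
    """Convert a memory quantity such as '3Gi' or '512Mi' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?", quantity.strip())
    if not match:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(float(match.group(1)) * _SUFFIXES[match.group(2) or ""])


def check_heap(max_heap_bytes: int, k8s_limit: str, headroom: float = 0.15) -> dict:
    """Compare a proxy heap cap against the pod memory limit minus headroom."""
    limit_bytes = parse_quantity(k8s_limit)
    ceiling = int(limit_bytes * (1 - headroom))
    return {
        "ok": max_heap_bytes <= ceiling,
        "limit_bytes": limit_bytes,
        "suggested_max_heap_bytes": ceiling,
    }


print(check_heap(4294967296, "3Gi"))   # the misconfigured pairing from the example
print(check_heap(2684354560, "3Gi"))   # the aligned pairing from the example
```

Run in CI against rendered manifests, a check like this would fail the first pairing and pass the second.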
Spotify's four post-incident commitments:
- Fix the filter bug that caused the initial crash on filter reorder
- Fix the heap/K8s limit mismatch — align Envoy config with pod resource limits
- Staged perimeter rollouts — regional validation before global deployment
- Improved monitoring — detect configuration issues earlier in the failure chain
Incident timeline:
| Time (UTC) | Event | Status |
|---|---|---|
| 12:18 | Filter reorder applied; all Envoy instances crash | 🔴 Global failure |
| 12:20 | Alarms fire on traffic drop; death loop running | 🔴 Engineers paged |
| 12:28 | Escalated; only APAC serving traffic | 🔴 Incident declared |
| ~13:xx | Root cause identified via APAC asymmetry | 🟡 Diagnosis complete |
| 14:20 | EU fully recovered | 🟡 Partial recovery |
| 15:10 | US fully recovered | 🟡 Partial recovery |
| 15:40 | All regions normalised | 🟢 Full recovery |
Architecture
Spotify's networking perimeter places Envoy Proxy as the outermost layer — the first software that receives every user request, regardless of what backend it is destined for. When every Envoy instance crashes simultaneously, no user request can reach any backend service. The entire platform goes dark regardless of whether individual backend services remain healthy. This is the shared fate property of perimeter architecture: a perimeter failure has a blast radius of every service, every user, every region simultaneously.
Diagram: Spotify's Perimeter Architecture: Envoy as the Universal Traffic Gateway (interactive version on TechLogStack)
Diagram: The Three-Layer Failure Cascade: From Filter Bug to Death Loop (interactive version on TechLogStack)
Configuration drift: why this existed undetected for months
The Envoy heap/K8s limit misconfiguration almost certainly existed long before April 16. It was never caught because Envoy memory usage never reached the dangerous threshold under normal traffic. This is a common pattern: configuration mismatches that are only dangerous under abnormal load go undetected indefinitely in systems where abnormal load doesn't occur. The misconfiguration didn't cause the outage — the filter bug did. But it was what turned a recoverable crash into a multi-hour global outage. Auditing resource limit configurations against actual peak usage, including synthetic stress tests, is the practice that catches these before they detonate.
Lessons
'Low risk' is not a substitute for staged rollout at the perimeter. A change's risk profile determines what validation it needs — it doesn't override the need for validation. The filter reorder was simple; the blast radius of failure was total. Stage perimeter changes by region and monitor before expanding.
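A region-by-region rollout can be sketched in a few lines. This is an illustration under stated assumptions, not Spotify's deployment system: `apply_change`, `is_healthy`, and `rollback` are hypothetical callbacks, and a real health check would include a soak period and crash-rate metrics.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def staged_rollout(
    regions: Iterable[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[List[str], Optional[str]]:
    """Apply a change one region at a time, halting at the first unhealthy region.

    Returns (regions completed, region that failed or None).
    """
    completed: List[str] = []
    for region in regions:
        apply_change(region)
        if not is_healthy(region):
            rollback(region)          # undo only the region that just failed
            return completed, region  # later regions are never touched
        completed.append(region)
    return completed, None


# Usage with stand-in callbacks: the change breaks the first region it reaches.
applied, rolled_back = [], []
result = staged_rollout(
    ["region-a", "region-b", "region-c"],
    apply_change=applied.append,
    is_healthy=lambda region: False,
    rollback=rolled_back.append,
)
print(result, applied, rolled_back)
```

With a change that crashes every instance it touches, the blast radius in this sketch is one region instead of the whole fleet.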
Latent bugs (code defects harmless until a specific triggering condition occurs) that depend on execution context cannot be caught by tests that don't vary that context. A filter test suite that exercises filters in their original order will never discover a bug that only manifests in a different order. When making ordering or sequencing changes, test explicitly in the new order.
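One way to vary execution context in tests is to run the chain in more than one order. The sketch below uses three made-up filters (`auth`, `rate_limit`, `add_headers`) modelled as plain functions. They stand in for real Envoy filters, which are not written this way, and the order-dependent defect in `rate_limit` is invented for illustration.

```python
from itertools import permutations


def auth(request):
    request["user"] = "user-for-" + request["token"]
    return request


def rate_limit(request):
    # Hypothetical latent bug: assumes auth has already populated "user".
    request["bucket"] = "rl:" + request["user"]
    return request


def add_headers(request):
    request["x-edge"] = "1"
    return request


def run_chain(filters, request):
    """Pass the request through each filter in order."""
    for step in filters:
        request = step(request)
    return request


def failing_orders(filters, sample_request):
    """Run every ordering of the filters and collect the ones that raise."""
    failures = []
    for order in permutations(filters):
        try:
            run_chain(order, dict(sample_request))  # fresh copy per ordering
        except Exception as exc:
            failures.append(([step.__name__ for step in order], repr(exc)))
    return failures


failures = failing_orders([auth, rate_limit, add_headers], {"token": "abc"})
for names, error in failures:
    print(" -> ".join(names), "raised", error)
```

Exhaustive permutation grows factorially, so for a long chain the practical minimum is to run the existing suite against the exact proposed order before it ships.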
Audit resource limit configurations against actual and stress-test peak usage regularly. Mismatches between Envoy heap size and Kubernetes memory limits are invisible until a load event forces memory beyond the limit. A misconfiguration harmless for months can become catastrophic under the right load spike.
Client-side retry logic turns total simultaneous failures into traffic amplification events. Design retry logic with awareness of this: exponential backoff with jitter spreads retries over time; circuit breakers prevent retries when failure rate exceeds a threshold; retry budgets limit total retry volume per client.
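Two of these mechanisms fit in a short sketch: exponential backoff with full jitter, and a retry budget. The defaults (`base`, `cap`, `ratio`) and the `RetryBudget` class are illustrative assumptions, not values or code from any Spotify client.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: the n-th delay is drawn uniformly from [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


class RetryBudget:
    """Permit retries only while they stay within a fraction of requests sent."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        """Consume one unit of budget if available; otherwise refuse the retry."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


# Usage: jittered delays for 5 attempts, and a budget that caps retries at 10%.
print([round(delay, 2) for delay in backoff_delays(5)])

budget = RetryBudget(ratio=0.1)
for _ in range(20):
    budget.record_request()
print([budget.can_retry() for _ in range(3)])
```

Jitter spreads a synchronized wave of retries over time, while the budget bounds how much extra load a client can add regardless of how many requests fail.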
When one region survives an outage that hits all others, that region is your fastest path to root cause. APAC's survival was a natural experiment running in production. Its configuration was identical; its traffic was lower. The asymmetry confirmed the diagnosis. Systematically compare surviving regions against failed ones, because it shortens MTTR.
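The comparison itself can be mechanical. The sketch below diffs two flat snapshots of configuration and metrics and reports only what differs. The snapshot keys and values are hypothetical, chosen to mirror the situation described here, and the 10% tolerance for numeric values is an assumption.

```python
def diff_regions(surviving, failed, rel_tol=0.10):
    """Return {key: (surviving_value, failed_value)} for keys that differ.

    Numeric values are treated as equal when within rel_tol of each other.
    """
    diffs = {}
    for key in sorted(set(surviving) | set(failed)):
        a, b = surviving.get(key), failed.get(key)
        both_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in (a, b)
        )
        if both_numeric:
            if abs(a - b) > rel_tol * max(abs(a), abs(b)):
                diffs[key] = (a, b)
        elif a != b:
            diffs[key] = (a, b)
    return diffs


# Hypothetical snapshots: same config everywhere, different load.
apac = {"filter_order": "new", "max_heap_bytes": 4294967296,
        "memory_limit": "3Gi", "rps_per_instance": 1000}
eu = {"filter_order": "new", "max_heap_bytes": 4294967296,
      "memory_limit": "3Gi", "rps_per_instance": 4000}

print(diff_regions(apac, eu))
```

When the only surviving difference is load, attention shifts from what changed in the deploy to what the load is hitting, which is the step that pointed at the memory limit.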
Engineering Glossary
Client-side retry logic — application behaviour where the client automatically retries failed requests after a brief delay. Designed to handle transient failures, but capable of amplifying load during sustained simultaneous failures by converting each failed request into one or more retry requests.
Death loop — an informal term for an infinite restart cycle where a pod crashes, Kubernetes restarts it, and the replacement crashes for the same reason. Powered by K8s restart behaviour combined with a condition (here: retry flood + heap misconfiguration) that guarantees each replacement fails.
Envoy Proxy — an open-source, high-performance edge proxy originally built at Lyft, widely used as the networking perimeter layer in distributed systems. Receives all incoming user traffic before distributing it to backend services.
Filter chain — the ordered sequence of processing modules (filters) that each request passes through in an Envoy proxy instance. Each filter can inspect, modify, or reject the request before passing it to the next filter. Order is semantically meaningful.
Latent bug — a code defect that exists in production but is harmless until a specific triggering condition occurs. Undetectable by standard testing if the triggering condition is rare or contextual.
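OOMKill — Out-Of-Memory Kill. The termination of a container that exceeds its configured memory limit. The limit is enforced by the Linux kernel on the node, and Kubernetes reports the container as OOMKilled and restarts it according to the pod's restart policy. The limit protects other workloads on the node from memory starvation.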
Shared fate system — an architecture where all dependent services rise and fall with a shared component. Spotify's Envoy perimeter is a shared fate system: if it fails, every backend service becomes unreachable regardless of whether those services are healthy.
Staged rollout — deploying a change to a subset of infrastructure (one region, one cluster) and validating behaviour before expanding to the full fleet. The safety mechanism absent from the April 16 deployment.
This case is a plain-English retelling of publicly available engineering material.
TechLogStack — built at scale, broken in public, rebuilt by engineers.













