Your checkout service calls a 3rd-party fraud-check API on every order.
That API just started timing out at 30s instead of its usual 200ms.
Your Node.js checkout pods have a 50-connection pool.
Within 90 seconds, every connection is parked waiting on the fraud API.
New checkout requests pile up in the queue. P99 latency on /checkout goes from 300ms to 28s.
Customers retry. Pods run out of memory and get OOM-killed.
The fraud API is only degraded, but your entire checkout is down.
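A rough back-of-the-envelope for why the pool fills so fast, using Little's law (in-flight requests = arrival rate x time each request holds a connection). The 50-connection pool, 200ms, and 30s figures come from the scenario; the 20 orders/sec traffic figure is a made-up assumption for illustration only.

```typescript
// Little's law: average in-flight = arrivals per second * seconds each call holds a connection.
function connectionsNeeded(arrivalsPerSec: number, holdMs: number): number {
  return (arrivalsPerSec * holdMs) / 1000;
}

const POOL_SIZE = 50;                 // from the scenario
const ASSUMED_ARRIVALS_PER_SEC = 20;  // hypothetical traffic, NOT from the scenario

// Healthy fraud API (200ms): only a handful of connections are busy at once.
const healthyInFlight = connectionsNeeded(ASSUMED_ARRIVALS_PER_SEC, 200);

// Degraded fraud API (30s): the same traffic wants far more connections than exist.
const degradedInFlight = connectionsNeeded(ASSUMED_ARRIVALS_PER_SEC, 30_000);

// Highest arrival rate the pool can absorb while every call holds a connection for 30s.
const maxSustainableRate = POOL_SIZE / 30;

console.log({ healthyInFlight, degradedInFlight, maxSustainableRate });
```

Under that assumed traffic, the healthy case needs 4 connections and the degraded case would need 600, so anything above roughly 1.7 orders/sec saturates a 50-connection pool once calls take 30s.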
Here's the setup:
• Checkout (NestJS) → Fraud API (3rd party), currently timing out at 30s
• Same pods also handle /cart, /orders, /health — all healthy dependencies
• Fraud API's own dashboard says it'll be back in ~10 minutes
• Your SLO budget for the quarter is about to evaporate
You need to stop the bleeding without losing the rest of checkout. What do you do?
A) Drop the timeout to 2s and add 3 retries with exponential backoff.
B) Add a circuit breaker that opens after N failures, then half-opens with a single probe request before fully closing.
C) Bulkhead the fraud API calls into a separate connection pool / thread pool so they can't starve the rest of checkout.
D) Both B and C — circuit breaker for the failing dependency, bulkhead to isolate the blast radius.
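If B and C are new to you, here is a minimal sketch of what each one means in code, plus a wrapper that composes them the way option D describes. It is an illustration of the patterns, not a recommendation and not any library's API: every name here (`CircuitBreaker`, `Bulkhead`, `guarded`, the thresholds) is hypothetical, and a real setup would also put a timeout on the call itself.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Option B: stop calling a failing dependency, then let one probe through after a cooldown.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, handy for tests
  }

  getState(): BreakerState {
    return this.state;
  }

  // True if the caller may attempt the downstream call right now.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open"; // cooldown elapsed: allow a single probe
      this.probeInFlight = false;
    }
    if (this.state === "closed") return true;
    if (this.state === "half-open" && !this.probeInFlight) {
      this.probeInFlight = true;
      return true;
    }
    return false; // open, or half-open with a probe already running
  }

  recordSuccess(): void {
    this.state = "closed";
    this.failures = 0;
    this.probeInFlight = false;
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe reopens immediately; otherwise open once the threshold is hit.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
      this.probeInFlight = false;
    }
  }
}

// Option C: cap how many calls to one dependency can be in flight at once.
class Bulkhead {
  private inFlight = 0;
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  tryAcquire(): boolean {
    if (this.inFlight >= this.maxConcurrent) return false;
    this.inFlight += 1;
    return true;
  }

  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }
}

// Option D: both together. Rejected calls get the fallback instead of queueing.
async function guarded<T>(
  breaker: CircuitBreaker,
  bulkhead: Bulkhead,
  call: () => Promise<T>,
  fallback: () => T,
): Promise<T> {
  if (!bulkhead.tryAcquire()) return fallback(); // compartment full: shed load
  try {
    if (!breaker.allowRequest()) return fallback(); // breaker open: fail fast
    const result = await call();
    breaker.recordSuccess();
    return result;
  } catch {
    breaker.recordFailure();
    return fallback();
  } finally {
    bulkhead.release();
  }
}
```

What the fallback should be for a fraud check (queue for manual review, approve low-value orders, reject) is a business decision the sketch deliberately leaves open.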
Three of these are patterns senior engineers genuinely debate in postmortems. One of them is the answer most staff engineers actually ship. One is the answer that makes the outage worse.
Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.
If your team has ever had one slow downstream take down a healthy service, repost this. That conversation needs to happen before the outage, not after.
Drop your answer 👇