The week we crossed 500 active users, three things broke in the same 48 hours: Postgres connections, OpenAI's rate limiter, and my assumption that "it works in staging" means anything at all.
Here's what actually happened, in the order it happened, with the code that fixed it.
The week in one sentence
We onboarded a batch of new accounts running bulk generation jobs (50-200 articles each), and within a day our connection pool was maxed out, half our OpenAI calls were getting 429'd, and our average job latency had tripled.
None of it was one bug. It was three separate assumptions that held at 50 users and quietly broke at 500.
What broke first: the connection pool
Our generation worker looked roughly like this. Each job — one article — opened its own Postgres client to write status updates as it progressed (queued → generating → formatting → done).
// The original version. Worked fine at low concurrency.
const { Client } = require('pg');

async function processArticleJob(job) {
  // One brand-new connection per job
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  await client.query(
    'UPDATE jobs SET status = $1 WHERE id = $2',
    ['generating', job.id]
  );
  const content = await generateContent(job.prompt);
  await client.query(
    'UPDATE jobs SET status = $1, content = $2 WHERE id = $3',
    ['done', content, job.id]
  );
  await client.end();
}
At 50 concurrent jobs, this is 50 connections. Postgres on our RDS tier defaults to a max of 100. Fine, with room to spare.
At 500 users, a single bulk-generation batch could spin up 300+ jobs at once. Each one opened its own client. We hit "remaining connection slots are reserved for non-replication superuser connections" in production, which is Postgres's polite way of saying "absolutely not."
The fix wasn't more connections — RDS lets you bump the limit, but that just delays the problem. The fix was not opening a new connection per job in the first place.
// Shared pool, reused across all jobs in the worker process
const { Pool } = require('pg');

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20, // hard cap, tuned to our RDS instance size
  idleTimeoutMillis: 30000,
});

async function processArticleJob(job) {
  await pool.query(
    'UPDATE jobs SET status = $1 WHERE id = $2',
    ['generating', job.id]
  );
  const content = await generateContent(job.prompt);
  await pool.query(
    'UPDATE jobs SET status = $1, content = $2 WHERE id = $3',
    ['done', content, job.id]
  );
}
20 pooled connections handled the same 300-job batch that previously tried to open 300 individual ones. The jobs just wait their turn for a connection slot instead of each grabbing their own.
The part I didn't expect: this also made the failures cleaner. Before, a job that crashed between connect and end left its connection hanging open until idle timeout. With a shared pool, pool.query checks a connection out only for the duration of each query and hands it straight back, so a job that crashes during generation isn't holding a connection at all.
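To get a feel for why a small pool copes with a big batch, here is a toy simulation. It is not how pg implements its pool; `TinyPool`, `withSlot`, and `simulateBatch` are made-up names, and the "query" is just a tick of the event loop. It shows the two behaviours described above: callers beyond the cap wait their turn, and a slot is released even when the job throws.

```javascript
// Toy stand-in for a connection pool: at most `max` slots are checked out,
// and extra callers wait in FIFO order. Illustration only, not pg's internals.
class TinyPool {
  constructor(max) {
    this.max = max;
    this.inUse = 0;
    this.peak = 0;
    this.waiters = [];
  }

  async acquire() {
    if (this.inUse >= this.max) {
      // Pool is full: wait until a releasing caller hands its slot to us.
      await new Promise((resolve) => this.waiters.push(resolve));
    } else {
      this.inUse += 1;
    }
    this.peak = Math.max(this.peak, this.inUse);
  }

  release() {
    const next = this.waiters.shift();
    if (next) next(); // slot passes directly to the next waiter
    else this.inUse -= 1; // nobody waiting, slot becomes idle
  }

  // Run `work` while holding a slot, releasing it even if `work` throws.
  async withSlot(work) {
    await this.acquire();
    try {
      return await work();
    } finally {
      this.release();
    }
  }
}

async function simulateBatch(jobCount, poolSize) {
  const pool = new TinyPool(poolSize);
  const jobs = Array.from({ length: jobCount }, (_, i) =>
    pool
      .withSlot(async () => {
        await new Promise((r) => setImmediate(r)); // pretend to run a query
        if (i % 50 === 0) throw new Error(`job ${i} failed`);
      })
      .catch(() => 'failed')
  );
  const results = await Promise.all(jobs);
  return {
    peak: pool.peak, // most slots ever checked out at once
    inUseAfter: pool.inUse, // slots still held once the batch is over
    failed: results.filter((r) => r === 'failed').length,
  };
}

simulateBatch(300, 20).then((summary) => console.log(summary));
```

With 300 jobs and a cap of 20, the peak never goes above 20 and nothing is left checked out afterwards, including for the jobs that threw.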
What broke second: OpenAI's rate limiter
With the pool fixed, jobs started flowing again — and immediately started failing for a different reason. Our tier's rate limit is requests-per-minute, and a 300-job batch firing near-simultaneously blew through it in seconds.
Error: 429 Rate limit reached for gpt-4o in organization
org-xxxx on requests per min (RPM): Limit 500, Used 500, Requested 1.
The naive fix — just retry on 429 — works until your retries also get rate limited, and you end up hammering the API with retries of retries. We needed actual backoff, plus something that controlled how many requests left our system in the first place, not just how we reacted after they failed.
const OpenAI = require('openai');
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function generateContent(prompt, attempt = 1) {
  const MAX_ATTEMPTS = 5;
  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: prompt }],
      max_tokens: 1500,
    });
    return response.choices[0].message.content;
  } catch (err) {
    if (err.status === 429 && attempt < MAX_ATTEMPTS) {
      // Exponential backoff with jitter. Without jitter, every
      // throttled job retries at the exact same moment and you
      // just recreate the spike one tick later.
      const baseDelay = Math.min(1000 * 2 ** attempt, 30000);
      const jitter = Math.random() * 500;
      await new Promise((r) => setTimeout(r, baseDelay + jitter));
      return generateContent(prompt, attempt + 1);
    }
    throw err;
  }
}
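It helps to see what that formula actually produces. The sketch below pulls the delay calculation out into a pure function (the names `backoffBaseDelay` and `retrySchedule` are mine, not from the worker) and lists the base delays for the retries that can happen with MAX_ATTEMPTS = 5. Jitter is left out of the schedule and accounted for separately, since it is random.

```javascript
// Same formula as the retry code above, isolated so the schedule is visible.
function backoffBaseDelay(attempt) {
  return Math.min(1000 * 2 ** attempt, 30000);
}

// Base delays for the sleeps that can actually occur: a retry is only
// scheduled while attempt < maxAttempts, so attempts 1..maxAttempts-1.
function retrySchedule(maxAttempts) {
  const delays = [];
  for (let attempt = 1; attempt < maxAttempts; attempt += 1) {
    delays.push(backoffBaseDelay(attempt));
  }
  return delays;
}

const schedule = retrySchedule(5); // [2000, 4000, 8000, 16000]
const worstCaseBaseMs = schedule.reduce((sum, ms) => sum + ms, 0); // 30000
const maxJitterMs = 500 * schedule.length; // jitter adds under 500 ms per retry

console.log(schedule, worstCaseBaseMs, maxJitterMs);
```

So a job that gets throttled on every attempt spends about 30 to 32 seconds sleeping before it finally throws. One side effect worth noticing: with five attempts the 30-second cap never triggers, because the largest delay actually used is 16 seconds.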
That stopped the cascading failures, but it was still treating the symptom. The real fix was upstream: cap how many generation requests run concurrently, so we stop sending requests we already know will get throttled.
// require() works with p-queue v6; v7 and later are ESM-only and need import.
const PQueue = require('p-queue').default;

// Tuned against our actual OpenAI tier limit, not a guess.
// At 500 RPM, 8 concurrent keeps us comfortably under burst limits
// while still processing a 300-job batch in a few minutes, not hours.
const generationQueue = new PQueue({ concurrency: 8 });

async function enqueueArticleJob(job) {
  return generationQueue.add(() => processArticleJob(job));
}
This single change cut our 429 rate from roughly 1 in 6 requests down to near zero in load testing. Concurrency 8 was not a number I picked from a blog post — it came from watching our actual rate limit headers under test load and backing into a number that stayed under them.
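For anyone backing into their own number, the arithmetic is simple: with a fixed number of requests in flight, steady-state throughput is concurrency divided by average request latency. The sketch below is a rough estimate only. The latencies in it are made-up inputs rather than measurements from our system, and `requestsPerMinute` and `maxSafeConcurrency` are hypothetical helper names.

```javascript
// Steady-state requests per minute for a fixed number of in-flight requests.
function requestsPerMinute(concurrency, avgLatencySeconds) {
  return (concurrency * 60) / avgLatencySeconds;
}

// Largest concurrency that stays under `rpmLimit`, keeping some headroom
// (default: use at most 80% of the limit). Never returns less than 1.
function maxSafeConcurrency(rpmLimit, avgLatencySeconds, headroom = 0.8) {
  const budget = rpmLimit * headroom;
  return Math.max(1, Math.floor((budget * avgLatencySeconds) / 60));
}

// Hypothetical latencies, NOT measurements from this post.
console.log(requestsPerMinute(8, 10)); // 48 requests per minute
console.log(requestsPerMinute(8, 2)); // 240 requests per minute
console.log(maxSafeConcurrency(500, 2)); // 13
```

This only models the steady state. It says nothing about token-based limits or about bursts at the start of a batch, which is why watching the real rate limit headers under load is still the better source of truth.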
What broke third: my own assumptions about "done"
This one wasn't a crash. It was worse — it was silent. Some jobs were marked done in the database with empty content. No error thrown, no alert fired.
It took me embarrassingly long to find this, because I was looking at the OpenAI call, not the code around it. The actual bug: our backoff retry on 429 returned undefined on the 5th failed attempt instead of throwing, because of a leftover return; from an earlier version of the function that I never deleted.
// The bug, hiding in plain sight
if (err.status === 429 && attempt < MAX_ATTEMPTS) {
  // ... backoff logic
  return generateContent(prompt, attempt + 1);
}
return; // <-- this. silently swallowed every other error type
        // and fell through after max retries.
Jobs that exhausted their retries didn't fail loudly. They just returned undefined, which got written to the content column as an empty value (node-postgres sends undefined as NULL), which got marked done. From the dashboard, it looked like success.
Fixed by actually re-throwing instead of swallowing:
if (err.status === 429 && attempt < MAX_ATTEMPTS) {
  const baseDelay = Math.min(1000 * 2 ** attempt, 30000);
  const jitter = Math.random() * 500;
  await new Promise((r) => setTimeout(r, baseDelay + jitter));
  return generateContent(prompt, attempt + 1);
}
throw err; // exhausted retries or non-429 error: surface it, don't eat it
I'm still a little annoyed at myself for this one. It's not a scaling problem in the architecture sense — it's a stale line of code that only became visible once we had enough volume for retries to actually exhaust.
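A cheap second line of defence against this class of bug is to refuse to mark a job done unless there is real content to write. The following is a minimal sketch of that idea, not what our worker does today. `assertUsableContent` and `finalStatusFor` are hypothetical helpers, and the 'failed' status is an assumption, since the pipeline described above only has queued, generating, formatting, and done.

```javascript
// Hypothetical guard: treat missing or blank output as a failure, not success.
function assertUsableContent(content, jobId) {
  if (typeof content !== 'string' || content.trim().length === 0) {
    throw new Error(`Job ${jobId}: generation returned no content`);
  }
  return content;
}

// Decide what the job row should say, based on how generation went.
// `generate` is any async function that resolves to the article text.
async function finalStatusFor(jobId, generate) {
  try {
    const content = assertUsableContent(await generate(), jobId);
    return { status: 'done', content };
  } catch (err) {
    // 'failed' is an assumed status value, not one from the original schema.
    return { status: 'failed', content: null, error: err.message };
  }
}

finalStatusFor(42, async () => undefined).then((row) => console.log(row));
```

With a check like this in place, the stale return would have produced a visible failed job on the first exhausted retry instead of a quiet empty success.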
Where Redis came in
Once the pool and the queue were stable, we noticed a chunk of generation requests were near-duplicates — same keyword, same tone, same options, submitted minutes apart by the same account re-running a batch after tweaking one field.
We added Redis purely as a result cache, keyed on a hash of the prompt + options, with a short TTL:
const crypto = require('crypto');
const redis = require('redis').createClient({ url: process.env.REDIS_URL });
// Connect once at startup; CommonJS modules have no top-level await.
const redisReady = redis.connect();

async function generateWithCache(prompt, options) {
  await redisReady;
  const cacheKey = `gen:${crypto
    .createHash('sha256')
    .update(prompt + JSON.stringify(options))
    .digest('hex')}`;
  const cached = await redis.get(cacheKey);
  if (cached !== null) return cached;
  // Queue the generation call so cache misses still respect the concurrency cap.
  const content = await generationQueue.add(() => generateContent(prompt));
  // 1 hour TTL: long enough to catch re-runs in the same editing
  // session, short enough that nobody gets stale content days later
  await redis.set(cacheKey, content, { EX: 3600 });
  return content;
}
This wasn't the architecturally interesting fix of the week. It was the cheap one. It knocked out roughly 10-15% of our OpenAI calls during a typical bulk batch, which mattered more for cost than for the rate limit problem: by the time Redis was in, the queue was already keeping us under the limit on its own.
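One thing that can quietly lower the hit rate of a cache like this: JSON.stringify preserves key order, so two option objects with the same fields in a different order hash to different keys. Below is a sketch of one way around that, assuming options are plain JSON-like data. `stableStringify` and `cacheKey` are hypothetical names, and this is not the key format we run in production.

```javascript
const crypto = require('node:crypto');

// Serialize with object keys sorted at every level, so key order cannot
// change the result. The output is only used as hash input.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
    return `{${entries.join(',')}}`;
  }
  // JSON.stringify(undefined) is undefined, so fall back to 'null'.
  return JSON.stringify(value) ?? 'null';
}

// Hash prompt and options together as one structured payload.
function cacheKey(prompt, options = {}) {
  const payload = stableStringify({ prompt, options });
  return `gen:${crypto.createHash('sha256').update(payload).digest('hex')}`;
}

const a = cacheKey('best hiking boots', { tone: 'casual', length: 800 });
const b = cacheKey('best hiking boots', { length: 800, tone: 'casual' });
console.log(a === b); // true
```

Hashing the prompt and options as one structured payload also avoids the edge case where plain string concatenation makes two different prompt and options pairs produce the same input.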
What's next
The pool, the queue, and Redis got us stable at 500. I don't think they get us to 5,000 — the next bottleneck I'm watching is Postgres write contention on the jobs table itself, since every status update is a write, and a big enough batch means a lot of writes to one table at once.
I haven't decided whether that's a partitioning problem or a "stop writing intermediate status updates so often" problem. Leaning toward the second since it's simpler, but I want to see where it actually breaks before I build for a problem I don't have yet.
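For the second option, one shape it could take is a small in-memory buffer that remembers only the latest status per job and writes them out in one go. This is purely a sketch of the idea and nothing I've built or tested under load. `StatusBuffer` and `writeBatch` are hypothetical names, and the actual database write is left to whatever function gets passed in.

```javascript
// Hypothetical coalescing buffer: keep only the newest status per job and
// hand the whole set to `writeBatch` on flush.
class StatusBuffer {
  constructor(writeBatch) {
    this.writeBatch = writeBatch; // async (Array<{ id, status }>) => void
    this.pending = new Map();
  }

  set(jobId, status) {
    this.pending.set(jobId, status); // later statuses overwrite earlier ones
  }

  // Write everything buffered so far; resolves to the number of rows sent.
  async flush() {
    if (this.pending.size === 0) return 0;
    const batch = [...this.pending].map(([id, status]) => ({ id, status }));
    this.pending = new Map(); // swap first so updates during the write are kept
    await this.writeBatch(batch);
    return batch.length;
  }
}

// Usage: three updates across two jobs collapse into two rows.
const written = [];
const buffer = new StatusBuffer(async (batch) => written.push(batch));
buffer.set(1, 'generating');
buffer.set(1, 'formatting');
buffer.set(2, 'generating');
const flushed = buffer.flush();
flushed.then((count) => console.log(count, written));
```

In a worker, flush would run on a timer. The trade-offs are that the dashboard lags by up to one interval, and a batch is lost if the write fails, so terminal states like done probably still deserve an immediate write.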
Open question for the comments
If you've scaled a job queue past the point where per-job DB connections work, did you go with a shared pool like this, or move status tracking out of Postgres entirely (Redis, in-memory, something else)? I went with the boring option here and I'm wondering if I'm going to regret it at the next order of magnitude.












