A spend baseline and a 3σ alarm: catching a runaway AI agent the same day

Every team that hands an AI agent a budget eventually meets the same surprise: a bill that seemed quiet only because nobody was watching it. A retry loop with no backoff. An agent stuck in a tool-call cycle, re-reading the same file forty times. Someone who pointed the test suite at the most expensive model "just to see." None of these announce themselves. They show up at the end of the month as a number nobody can explain.

The fix isn't a spending cap — caps are blunt, and the legitimate heavy user hits them as often as the runaway does. The fix is a baseline: learn what normal looks like for each developer, then alert the same day when a person blows past their own normal. Here's how to build that with a trailing window and a bit of arithmetic you already know.

Step 1: get a per-developer, per-day spend table

You can't baseline what you don't record. The goal is the most boring table imaginable:

(day, developer, model, cost_usd)
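As a minimal sketch of that shape, here is the row as a Python dataclass plus a rollup that sums across models to get one total per developer per day, which is the granularity the baseline query later reads from daily_spend. The class name, function name, model names, and sample figures are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SpendRow:
    """One recorded charge: who spent, on which model, on which day."""
    day: date
    developer: str
    model: str
    cost_usd: float


def daily_totals(rows):
    """Sum cost across models so each (developer, day) pair has one total."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row.developer, row.day)] += row.cost_usd
    return dict(totals)


# Illustrative data only; "model-a" and "model-b" are placeholder names.
rows = [
    SpendRow(date(2025, 3, 3), "priya", "model-a", 4.20),
    SpendRow(date(2025, 3, 3), "priya", "model-b", 1.30),
    SpendRow(date(2025, 3, 3), "sam", "model-a", 0.75),
]
totals = daily_totals(rows)
```

Keeping the model column in the raw rows and collapsing it only for the baseline means you can still break an alert down by model afterwards.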

Where this data comes from depends on your provider, but the principle is constant: capture attribution at call time, where you actually know who the caller is, rather than reverse-engineering it from a billing PDF later. Two low-friction ways to get there without standing up a proxy:

A key per developer. Issue each engineer (or each service) their own provider API key. Spend groups by key for free, and revoking a leaked key is one click.
Usage accounting in the response. Most providers will hand you the cost if you ask. On OpenRouter, for example, add "usage": {"include": true} to the request and the response carries token counts, the resolved cost, and — importantly if you use openrouter/auto — the actual model that served the request rather than the router alias. Log that alongside your internal user id and you're done.

The reason the model field matters: a spend spike on a cheap model is a volume problem; the same dollar spike on an expensive model is a routing or config problem. You want to tell those apart on day one, which means recording the real model, not auto.
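A rough sketch of the logging step, assuming the response body is already parsed into a dict with a top-level "model" and a "usage" object carrying a "cost" field. Those key names and the sample values are assumptions for illustration; check them against what your provider actually returns. The function name usage_row is hypothetical.

```python
from datetime import date


def usage_row(response, developer, day):
    """Build a (day, developer, model, cost_usd) row from one response body.

    Only usage metadata is read; the message content is never touched.
    """
    usage = response.get("usage") or {}
    return {
        "day": day.isoformat(),
        "developer": developer,
        # Assumed to be the model that actually served the call, not the alias.
        "model": response.get("model", "unknown"),
        # Assumed to be denominated in USD; a missing cost is logged as 0.0.
        "cost_usd": float(usage.get("cost", 0.0)),
    }


# Illustrative response body; the model name and numbers are placeholders.
sample = {
    "model": "example/model-large",
    "usage": {"prompt_tokens": 1200, "completion_tokens": 300, "cost": 0.0123},
}
row = usage_row(sample, developer="priya", day=date(2025, 3, 3))
```

Whatever writes this row to storage should run in the same code path as the call itself, since that is the one place the caller's identity is known for certain.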

Step 2: compute each developer's own baseline

Here is the part people overcomplicate. You do not need a forecasting model. You need a trailing mean and standard deviation per developer, and the classic mean + 3σ threshold. Three sigma is a conventional cutoff, not a law: if daily spend were normally distributed, only about 1 day in 740 would land above it by chance, and although real spend is more skewed than that, it is still roughly the line past which "busy Tuesday" becomes "something is wrong." Because it's computed per person, your heaviest user and your lightest user get appropriately different ceilings automatically. Nobody hand-tunes a magic number.

Use a trailing window — say 14 days — and crucially, exclude today from the baseline so a spike can't inflate the very threshold it should trip:

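-- Assumes daily_spend has one row per developer per day (zero-spend days included),
-- so 14 preceding rows means 14 preceding days.
SELECT developer, day, spend, mu, sigma
FROM (
  SELECT developer, day, spend,
         AVG(spend)    OVER w AS mu,     -- trailing mean
         STDDEV(spend) OVER w AS sigma   -- trailing standard deviation
  FROM daily_spend
  WINDOW w AS (
    PARTITION BY developer ORDER BY day
    ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING  -- previous 14 rows, today excluded
  )
) t
WHERE day = CURRENT_DATE           -- report today only, not every past spike
  AND spend > mu + 3 * sigma;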

Any row that comes back is a developer who spent more today than their last two weeks would predict, by a wide margin.
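To make the arithmetic concrete, here is the same check as a small Python sketch using only the standard library. The function names and the 14 sample values are invented for illustration; the history list is assumed to hold daily totals, oldest first, with today left out.

```python
from statistics import mean, stdev


def baseline(history, window=14):
    """Mean and sample standard deviation of the last `window` days.

    `history` must exclude today so a spike cannot raise its own threshold.
    """
    recent = history[-window:]
    if len(recent) < 2:
        return None  # a standard deviation needs at least two points
    return mean(recent), stdev(recent)


def is_anomalous(today, history, window=14, k=3.0):
    """True when today's spend exceeds mean + k * sigma of the trailing window."""
    stats = baseline(history, window)
    if stats is None:
        return False
    mu, sigma = stats
    return today > mu + k * sigma


# Fourteen illustrative daily totals in USD, hovering around nine dollars.
history = [8.0, 9.5, 10.0, 7.5, 9.0, 11.0, 8.5, 9.0, 10.5, 8.0, 9.5, 9.0, 10.0, 8.5]

spike = is_anomalous(58.40, history)      # far outside the usual range
busy_day = is_anomalous(11.0, history)    # high, but within three sigma
```

With this history the threshold lands a little above twelve dollars, so an eleven-dollar day passes quietly while the $58.40 day is flagged.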

Step 3: handle the awkward edges before they embarrass you

A naive 3σ alert fires constantly in week one and then cries wolf. Three guards earn their keep:

Minimum history. Don't alert until a developer has, say, 7 days of data. With two data points, σ is meaningless and everything looks anomalous.
A dollar floor. mean + 3σ on someone who normally spends 12 cents will fire at 40 cents — true, and useless. Require the day to also clear an absolute floor (a few dollars) before it counts. Anomaly and materiality.
New-developer ramp. A first week of heavy use is onboarding, not a runaway. Suppress alerts during a grace period or widen the threshold while history is thin.

These three turn a noisy statistical curiosity into something an on-call engineer will actually keep enabled.
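The three guards can be sketched as a single gate in front of the statistical test. The constant values below are illustrative defaults rather than recommendations, the function name is hypothetical, and this version takes the simpler of the two ramp options by suppressing alerts during the grace period instead of widening the threshold.

```python
from statistics import mean, stdev

MIN_HISTORY_DAYS = 7   # illustrative: days of data required before alerting
DOLLAR_FLOOR = 5.00    # illustrative: today's spend must be at least this
GRACE_DAYS = 7         # illustrative: quiet period for a new developer
WINDOW = 14


def should_alert(today, history, days_since_first_seen, k=3.0):
    """Apply the three guards, then the mean + k * sigma test.

    `history` is a list of daily totals, oldest first, excluding today.
    """
    if days_since_first_seen < GRACE_DAYS:
        return False  # new-developer ramp: onboarding, not a runaway
    if len(history) < MIN_HISTORY_DAYS:
        return False  # minimum history: sigma is not meaningful yet
    if today < DOLLAR_FLOOR:
        return False  # dollar floor: anomalous but not material
    recent = history[-WINDOW:]
    return today > mean(recent) + k * stdev(recent)


light_user = [0.10, 0.12, 0.14, 0.12, 0.11, 0.13, 0.12, 0.12]
heavy_user = [9.0, 8.5, 10.0, 9.5, 8.0, 9.0, 10.5, 9.0]

cents_spike = should_alert(0.40, light_user, days_since_first_seen=30)
real_spike = should_alert(58.40, heavy_user, days_since_first_seen=30)
onboarding = should_alert(58.40, heavy_user, days_since_first_seen=3)
```

Checking the cheap guards first also means the standard deviation is only computed once there are enough points for it to be defined.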

Step 4: make it land where people look

A query that runs in a notebook nobody opens is not a control. Wire the job to a daily cron and push hits to wherever your team already lives — a Slack message naming the developer, the model, today's spend, and their baseline is enough to start a conversation in seconds:

⚠️ priya — $58.40 today vs. baseline $9.10 (+5.4σ), 90% on claude-3.7-sonnet

That one line tells you who, how much, how unusual, and where to look. Same-day, while the loop is still running and you can actually kill it, instead of next month when the money's gone.
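A sketch of how that line might be assembled and posted, assuming a Slack incoming webhook that accepts a JSON body with a "text" field. The function names are hypothetical, the webhook URL is something you would supply, and the sigma value in the example is chosen only so the numbers match the message above.

```python
import json
import urllib.request


def format_alert(developer, spend, mu, sigma, top_model, top_share):
    """Render one alert line: who, how much, how unusual, and where to look."""
    if sigma > 0:
        deviation = f"{(spend - mu) / sigma:+.1f}σ"
    else:
        deviation = "no variance in baseline"
    return (
        f"⚠️ {developer}: ${spend:.2f} today vs. baseline ${mu:.2f} "
        f"({deviation}), {top_share:.0%} on {top_model}"
    )


def post_to_slack(webhook_url, text):
    """POST the message to a Slack incoming webhook and return the HTTP status."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Illustrative values; sigma is assumed, not taken from real data.
message = format_alert("priya", 58.40, 9.10, 9.13, "claude-3.7-sonnet", 0.90)
```

The daily cron job would call post_to_slack once per flagged developer, after the guards have already filtered out the noise.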

Worth saying: this never reads a prompt

Notice what this system needs and what it doesn't. It needs timestamps, identities, models, and costs. It does not need the contents of a single request. A cost monitor built on usage data is read-only by construction — it sees what was spent and by whom, never what was asked. That's the right privacy posture for something the whole engineering org has to trust, and it's a good constraint to design toward even if you build it yourself.

If you'd rather not maintain this

The above is an honest afternoon of work plus the ongoing tax of owning a polling worker, a spend table, an alerting job, and the three edge-case guards that keep it from crying wolf. If you'd rather not maintain that yourself, it's roughly what Reckon does — read-only usage-API polling (no proxy, no SDK, never sees your prompts), KMS-encrypted keys, per-developer baselines with same-day anomaly alerts and Slack digests, a /spend command, and a Linear integration. It now tracks OpenRouter usage by model in realtime too, so teams on openrouter/auto — OpenClaw users included — get per-model, per-developer visibility without changing how they call anything.

Build it or buy it; the discipline is the same. Learn each person's normal, watch for the day they leave it, and find out while you can still do something about it.