The Prometheus label that blew our monitoring bill out 6x

TL;DR: Our metrics bill went 6x in a single month. Traffic was flat. One Prometheus label carrying per-build IDs spawned millions of time series, and the backend charges by active series. Here's how we caught it and the label rules we run now so it doesn't happen again.

The bill, not the traffic

I'm on the infra team at Buildkite. We run a fairly chunky Prometheus setup feeding a managed backend, and one Monday the monthly estimate had quietly gone from about $1,800 to a touch over $11k. Nobody shipped more traffic. Build volume was the same 40k-ish builds a day it'd been for weeks.

So it wasn't load. It was series count. Active series had climbed from roughly 1.2 million to nearly 9 million, and the backend prices on active series, not on request volume. That's the trap most people miss the first time.

What cardinality actually is

Think of every unique combination of metric name plus label values as its own drawer in a filing cabinet. http_requests_total{status="200"} is one drawer. Add a region label, say region="ap-southeast-2", and now you've got a drawer for every status in every region. Add a label whose values are unbounded and you've got a cabinet the size of a warehouse.

Cardinality is the count of those drawers. Each one is a separate time series that has to be stored and indexed. Low-cardinality labels (status, region) are fine. High-cardinality ones are where the money leaks.
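
To make the drawer count concrete, here's a minimal Python sketch. The sample data and the function name count_series are made up for illustration; all it does is count distinct (metric name, label set) pairs, which is what a series is.

```python
def count_series(samples):
    """Count distinct time series in a list of (metric_name, labels) samples.

    A series is identified by its metric name plus its full set of
    label name/value pairs, so repeated samples collapse into one entry.
    """
    return len({(name, frozenset(labels.items())) for name, labels in samples})


# Hypothetical samples, not real data.
samples = [
    ("http_requests_total", {"status": "200", "region": "ap-southeast-2"}),
    ("http_requests_total", {"status": "200", "region": "us-east-1"}),
    ("http_requests_total", {"status": "500", "region": "us-east-1"}),
    ("http_requests_total", {"status": "200", "region": "us-east-1"}),  # repeat
]

print(count_series(samples))  # 3 distinct series
```

Swap one of those labels for something with a fresh value on every sample and the count tracks the number of samples instead of staying small.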

The one label that did it

A teammate had added build_id to a counter so they could debug a flaky deploy. Fair enough in the moment. Problem is every build has a unique ID, we do ~40k a day, and those IDs hang around for the full retention window.

40k unique values a day, multiplied across a handful of other labels, multiplied across retention. That's your several-million-series jump right there. One label.
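
A back-of-envelope version of that multiplication, as a sketch. Only the 40k builds a day comes from our numbers; the count of other label combinations and the number of days a series keeps being counted are placeholders, so plug in your own.

```python
def worst_case_series(new_ids_per_day, other_label_combos, days_counted):
    """Upper bound on series minted by an unbounded label.

    Assumes every new ID shows up with every combination of the other
    labels, which overstates things but shows the shape of the growth.
    """
    return new_ids_per_day * other_label_combos * days_counted


builds_per_day = 40_000   # from our build volume
other_label_combos = 10   # placeholder, not a measured figure
days_counted = 7          # placeholder, depends on the backend

print(worst_case_series(builds_per_day, other_label_combos, days_counted))
```

Even with modest placeholder values the result lands in the millions, which is the point: the unbounded label sets the scale and everything else just multiplies it.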

Catching it

The fastest way to find the offender is to ask Prometheus which metric has the most series:

topk(10, count by (__name__)({__name__=~".+"}))

Then drill into the worst metric and see which label is doing the damage:

count(count by (build_id)(deploy_attempts_total))

When that second query came back with a number in the tens of thousands, we had our culprit.
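
If you'd rather run the same check outside PromQL, it's easy to sketch in Python over a dump of label sets. The input here is invented; in practice you'd feed it label sets pulled from your own Prometheus.

```python
from collections import defaultdict


def distinct_values_per_label(label_sets):
    """Return {label_name: number of distinct values}, biggest first."""
    seen = defaultdict(set)
    for labels in label_sets:
        for name, value in labels.items():
            seen[name].add(value)
    counts = {name: len(values) for name, values in seen.items()}
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical label sets for deploy_attempts_total.
label_sets = [
    {"status": "ok", "region": "us-east-1", "build_id": f"b{i}"}
    for i in range(500)
] + [{"status": "failed", "region": "ap-southeast-2", "build_id": "b500"}]

print(distinct_values_per_label(label_sets))
```

The label at the front of the result is the one to look at first.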

The fix

You drop the label before it ever hits storage. metric_relabel_configs is applied to scraped samples before they're ingested, so you can strip a label without touching the app code:

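scrape_configs:
  - job_name: "build-agents"
    # Applied to scraped samples before ingestion.
    metric_relabel_configs:
      # labeldrop matches the regex against label names, not values.
      - regex: "build_id"
        action: labeldrop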
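
To see what that rule does to the series count, here's a rough Python model of labeldrop. It's a sketch of the behaviour, not Prometheus's code: the function name and the sample label sets are invented, and the one real detail it leans on is that Prometheus anchors relabel regexes, hence fullmatch.

```python
import re


def labeldrop(label_sets, pattern):
    """Remove every label whose name fully matches pattern.

    Models only the label removal; it does no aggregation.
    """
    rx = re.compile(pattern)
    return [
        {k: v for k, v in labels.items() if not rx.fullmatch(k)}
        for labels in label_sets
    ]


# Hypothetical series scraped from one agent.
before = [
    {"status": "ok", "build_id": "84213"},
    {"status": "ok", "build_id": "84214"},
    {"status": "failed", "build_id": "84215"},
]
after = labeldrop(before, "build_id")

unique_after = {frozenset(labels.items()) for labels in after}
print(len(before), "series before,", len(unique_after), "label sets after")
```

Two of the three series now share a label set, which is worth knowing before you rely on the values that come out the other side.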

Per-build detail didn't vanish. We moved it to where unbounded identifiers belong: traces and logs. If you genuinely need to get from a metric to a specific build, use exemplars, which attach a trace ID to individual samples so the high-cardinality bit lives in the trace store, not the series index.

Here's how we now reason about labels before adding one:

Label            Unique values    Safe to add?
status           ~5               Yes
region           ~6               Yes
instance_type    ~15              Yes
agent_queue      ~200             Usually fine
build_id         ~40k/day         No, use a trace
user_email       unbounded        No, never

The rule of thumb we work to: if you can't name the upper bound of a label's values on a whiteboard, it doesn't go on a metric.
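
One way to turn that rule into something a review can check is a tiny allowlist, sketched below. The bounds come from the table; the 1,000 ceiling and the function name are assumptions for the sketch, not a Prometheus feature.

```python
# Declared upper bounds per label, taken from the table. None means
# nobody could name a bound (build_id grows by ~40k a day, so it counts).
LABEL_BOUNDS = {
    "status": 5,
    "region": 6,
    "instance_type": 15,
    "agent_queue": 200,
    "build_id": None,
    "user_email": None,
}

MAX_VALUES = 1_000  # assumed ceiling, tune to taste


def label_allowed(name, bounds=LABEL_BOUNDS, max_values=MAX_VALUES):
    """A label is allowed only if it has a declared bound under the ceiling."""
    bound = bounds.get(name)
    return bound is not None and bound <= max_values


for name in LABEL_BOUNDS:
    print(name, "ok" if label_allowed(name) else "rejected")
```

A label nobody has declared is rejected by default, which matches the whiteboard test: no named bound, no label.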

Same trap, different service

This isn't only a Prometheus-the-app thing. Any service that emits Prometheus metrics can sink you the same way. We run a small internal feature that summarises failed build logs through an LLM, and those calls go through Bifrost, an open-source AI gateway that ships native Prometheus metrics out of the box. Handy. But the instinct to tag those metrics with a per-request ID or per-virtual-key label is exactly the same footgun.

We keep its labels down to provider and model. That gives us cost-per-provider and latency-per-model without minting a new series for every call. The discipline travels with the metric, not the tool.

Trade-offs and limitations

Dropping build_id means you can't slice a single build inside Prometheus anymore. For ad-hoc "what did build 84213 do" questions, you're now in the trace or log tooling, which is a context switch some folks grumbled about for a week.

Recording rules that pre-aggregate the label away, the other common fix, aren't free either. They add evaluation load on the Prometheus side, they don't stop the raw series being ingested unless you also drop them, and if you write a sloppy one you can quietly recreate the cardinality you were trying to kill. Test the output series count before you ship the rule.
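
A quick way to reason about a rule's output before shipping it is to model the aggregation and count what comes out. This is a Python sketch of what a sum-by style rule does to label sets, with invented data and names; the real check is still running the expression against Prometheus and counting the result.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Aggregate (labels, value) samples, keeping only the labels in keep."""
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        out[key] += value
    return dict(out)


# Hypothetical raw samples: 1,000 builds across two statuses.
raw = [
    ({"status": "ok" if i % 10 else "failed", "build_id": str(i)}, 1.0)
    for i in range(1000)
]

good = sum_by(raw, keep={"status"})
sloppy = sum_by(raw, keep={"status", "build_id"})  # build_id sneaks back in

print(len(good), "series from the tight rule")
print(len(sloppy), "series from the sloppy rule")
```

The sloppy version produces one output series per build, so the rule has bought nothing.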

Exemplars need backend support and a tracing system wired up. If you haven't got distributed tracing yet, that path's a bigger project than a one-line labeldrop. Be honest about where you are.

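And labeldrop is a blunt instrument. Once it's gone at scrape, it's gone. It also doesn't aggregate anything: series that differed only by the dropped label end up with identical label sets, and the Prometheus docs warn that metrics must still be uniquely labelled after a labeldrop. If you later decide you wanted that dimension bounded rather than dropped, you're re-instrumenting.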