Power analysis for LLM evals: how big does your eval set need to be to catch a 5% regression?

TL;DR: Most eval sets are sized by "what we had lying around", not by what they can actually detect. If your eval set is 50 traces and you are trying to catch a 5-point drop in pass rate, you are underpowered: the regression hides inside sampling noise more often than not, and you ship it green. A two-line power calculation tells you the size you actually need, and ours said roughly 4x what we were running.

The number nobody computes

We argue about which metric to use and skip the prior question: how big a change can this eval set even see. An eval set has a detection floor, like any experiment. Below it, a real regression and an unlucky sample look identical, so a green run means nothing.
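
To make the detection floor concrete, here is a small simulation sketch. It assumes traces are independent pass/fail draws, borrows the 90% baseline and 85% regressed pass rates from the example later in this post, and uses only the standard library. The function names are made up for illustration.

```python
import random

def simulate_pass_rate(true_rate, n_traces, rng):
    """Observed pass rate from n_traces independent pass/fail draws."""
    passes = sum(rng.random() < true_rate for _ in range(n_traces))
    return passes / n_traces

def fraction_regression_hidden(n_traces, base=0.90, regressed=0.85,
                               trials=2000, seed=0):
    """Share of simulated run pairs where the regressed build scores
    at least as well as the baseline, i.e. the regression is invisible."""
    rng = random.Random(seed)
    hidden = 0
    for _ in range(trials):
        regressed_score = simulate_pass_rate(regressed, n_traces, rng)
        baseline_score = simulate_pass_rate(base, n_traces, rng)
        if regressed_score >= baseline_score:
            hidden += 1
    return hidden / trials

for n in (50, 200, 800):
    print(f"n={n:4d}  regression hidden in {fraction_regression_hidden(n):.1%} of run pairs")
```

The hidden fraction shrinks as the eval set grows. That shrinking is what the power calculation quantifies without needing a simulation.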

A two-line power check

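For a pass/fail eval, the number of traces per run needed to detect a drop from p1 to p2 at 80% power is a standard two-proportion calculation (one-sided here, comparing a baseline run against a candidate run of the same size):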

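from math import ceil

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# detect a drop from 0.90 to 0.85 (5 points), 80% power, alpha 0.05
# es is positive here (baseline minus regressed, on the arcsine scale)
es = proportion_effectsize(0.90, 0.85)
# one-sided test that baseline > regressed; with a positive es this has to be
# "larger" ("smaller" can never reach 80% power, so the solver fails)
n = NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.8, alternative="larger")
print(ceil(n))   # traces per run: several hundred (about 540), not 50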

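At 50 traces per run we could only reliably catch a swing of 15-plus points (closer to 19 when comparing two 50-trace runs from a 90% baseline under the same one-sided test). That is a disaster you would notice anyway, not the slow drift you actually care about.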
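
The same calculation can be run backwards: fix the eval set size and ask what the smallest detectable drop is. The sketch below uses the standard library only and the same arcsine effect size and one-sided normal approximation as the statsmodels call above. The function name and the list of sizes are illustrative, not part of any library.

```python
from math import asin, sin, sqrt
from statistics import NormalDist

def min_detectable_rate(baseline, n_per_run, alpha=0.05, power=0.80):
    """Lowest regressed pass rate that a one-sided comparison of two runs
    of n_per_run traces each detects with the given power."""
    z = NormalDist()
    # Smallest arcsine-scale effect (Cohen's h) that reaches the target power.
    h = (z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sqrt(2 / n_per_run)
    angle = 2 * asin(sqrt(baseline)) - h
    if angle <= 0:
        return 0.0  # even a drop to 0% would not be reliably detected
    return sin(angle / 2) ** 2

for n in (50, 200, 500, 1000):
    floor = min_detectable_rate(0.90, n)
    print(f"n={n:5d}  smallest detectable drop ~ {100 * (0.90 - floor):.1f} points")
```

Running it over a few candidate sizes gives the detection floor for each, which is usually a more useful number to show stakeholders than the power itself.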

What we changed

Three things. We sized the eval set to the smallest regression we cared about (a 5-point drop), which fixed the minimum number of traces. We stratified it so rare-but-important slices were not drowned out by the common ones. And we reported each eval result with its uncertainty, so a 1-point move stopped triggering investigations.
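
As a sketch of what "reported with its uncertainty" can look like, the block below prints a Wilson score interval per slice. The Wilson interval is one reasonable choice for pass rates near 0% or 100%, not necessarily the one we used, and the slice names and counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(passes, total, confidence=0.95):
    """Wilson score interval for a pass rate of passes/total."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = passes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half

# Hypothetical per-slice results: (slice name, passes, traces)
results = [
    ("common_queries", 276, 300),
    ("tool_calls", 130, 150),
    ("refusals", 41, 50),
]

for name, passes, total in results:
    lo, hi = wilson_interval(passes, total)
    print(f"{name:15s} {passes / total:.1%}  [{lo:.1%}, {hi:.1%}]  n={total}")
```

Small slices get visibly wider intervals, which makes it obvious when a slice is too thin to support any conclusion on its own.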

The honest caveat

Bigger eval sets cost more (every trace is judge tokens), so there is a real tension between detection power and eval cost. The answer is not "make it huge", it is "size it to the smallest regression that would actually hurt, and no smaller." For us that was a few hundred; for a safety-critical check it might be thousands.
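
To put rough numbers on that tension, the sketch below tabulates traces per run against the regression size you want to catch, using the same one-sided two-proportion approximation as the power check. TOKENS_PER_TRACE is a made-up placeholder; substitute your own judge cost.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

def traces_needed(baseline, drop, alpha=0.05, power=0.80):
    """Traces per run to detect baseline -> baseline - drop (one-sided)."""
    z = NormalDist()
    # Cohen's h: difference of the two rates on the arcsine scale.
    h = 2 * asin(sqrt(baseline)) - 2 * asin(sqrt(baseline - drop))
    return ceil(2 * ((z.inv_cdf(1 - alpha) + z.inv_cdf(power)) / h) ** 2)

TOKENS_PER_TRACE = 2_000  # hypothetical judge cost per trace

for drop in (0.10, 0.05, 0.02):
    n = traces_needed(0.90, drop)
    print(f"{drop:.0%} drop: {n} traces per run, ~{n * TOKENS_PER_TRACE:,} judge tokens")
```

The required size grows roughly with the inverse square of the drop, so halving the regression you want to catch costs about four times as much.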

Open question

The power calc assumes i.i.d. traces, and production traffic is bursty, correlated, and drifting. I do not have a clean way to compute effective sample size for a correlated eval set, so I treat the "few hundred" as a floor and pad it. If you have done power analysis on correlated eval traffic properly, I would like to read how.
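
For what it is worth, one crude way to choose the padding is the design effect from survey sampling, 1 + (m - 1) * rho. This is a starting point under strong assumptions, not an answer to the question above: it assumes traces fall into clusters of equal size m (sessions or bursts, say) with a single intra-cluster correlation rho, and it ignores drift entirely. The cluster size and correlation below are invented.

```python
from math import ceil

def design_effect(cluster_size, icc):
    """Kish design effect for equal-sized clusters with intra-cluster
    correlation icc; 1.0 means the traces behave as independent."""
    return 1 + (cluster_size - 1) * icc

def padded_sample_size(n_iid, cluster_size, icc):
    """Inflate an i.i.d. sample size so the effective sample size matches it."""
    return ceil(n_iid * design_effect(cluster_size, icc))

# Hypothetical: the i.i.d. calculation asks for 536 traces, traces arrive
# in sessions of about 5, and outcomes within a session correlate at 0.2.
print(padded_sample_size(536, cluster_size=5, icc=0.2))
```

Estimating rho from real traffic is the hard part, and this sketch does nothing to help with it.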