I recently calibrated a recovery-rate model that had only two weak features. Its point accuracy was almost nothing — R² basically zero. I expected its uncertainty estimates to be junk too. They weren't: the 90% conformal prediction intervals covered ~89% of held-out outcomes. Valid, just wide.
That surprised me enough to nail it down, because it contradicts a belief a lot of us carry around: "my model isn't accurate, so I can't trust its uncertainty." For split conformal prediction, that's backwards. Here's the precise statement, a runnable demo, and the one caveat that actually bites.
Coverage is a property of the procedure, not the model
Split conformal prediction gives a distribution-free, finite-sample marginal coverage guarantee:
P( Y ∈ Ĉ(X) ) ≥ 1 − α
and it holds for any point model, as long as the calibration and test data are exchangeable. The model is a black box. You fit it however you like on the training split; then, on a held-out calibration set of n points, you take the ⌈(n+1)(1−α)⌉-th smallest absolute residual (the (1−α) quantile with a small finite-sample correction), and that value becomes the half-width of your intervals.
Nowhere does that construction require the model to be good. A bad model just has large residuals, so the calibration quantile is large, so the intervals are wide — wide enough to still cover at the stated rate. Accuracy doesn't buy you validity; it buys you efficiency (narrower intervals at the same coverage).
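A quick way to convince yourself is to take the worst possible model and repeat the split many times. What follows is a minimal sketch, not the code behind the table: `conformal_halfwidth` is a helper name I made up for illustration, the "model" always predicts 0, and the heavy-tailed Student-t data (3 degrees of freedom) is an arbitrary choice to stress the distribution-free part.

```python
import numpy as np

def conformal_halfwidth(residuals, alpha):
    """k-th smallest absolute residual, with k = ceil((n + 1) * (1 - alpha))."""
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points: only an infinite interval is valid.
        return np.inf
    return np.sort(residuals)[k - 1]

rng = np.random.default_rng(0)
alpha, n_cal, n_test, n_trials = 0.10, 200, 200, 2000

coverages = []
for _ in range(n_trials):
    # Fresh exchangeable draw; heavy tails on purpose.
    y = rng.standard_t(3, size=n_cal + n_test)
    pred = np.zeros_like(y)              # useless model: always predicts 0
    res = np.abs(y - pred)
    q = conformal_halfwidth(res[:n_cal], alpha)
    coverages.append(np.mean(res[n_cal:] <= q))

mean_coverage = float(np.mean(coverages))
print(f"average coverage over {n_trials} splits: {mean_coverage:.3f}")
```

Coverage on any single split wobbles around the target, because the guarantee is an average over the random calibration and test draw. Averaged over many splits it should sit at or just above 1 − α, even though the predictor knows nothing.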
The demo (numbers are reproducible, seed fixed)
Same dataset and target, three models from strong to useless, target coverage 90%:
| model | R² | marginal coverage | mean interval width |
|---|---|---|---|
| gradient boosting | 0.741 | 0.895 | 5.39 |
| weak linear (1 noisy feature) | 0.061 | 0.905 | 10.39 |
| predict-the-mean | −0.000 | 0.907 | 10.83 |
All three land at ~90% coverage. The only thing that changes is width: the good model's intervals are about half as wide (5.39 versus 10.39 and 10.83). That's the whole story in one table: validity is constant, efficiency tracks accuracy.
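import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(20260617)
n = 6000
X = rng.normal(size=(n, 5))
group = rng.integers(0, 3, size=n)  # shifts y but is never shown to the models
y = X @ np.array([2.0, -1.5, 1.0, 0.5, -0.8]) + 1.5 * group + rng.normal(size=n)

def split(a):
    """Train / calibration / test split: 3000 / 1500 / 1500 rows."""
    return a[:3000], a[3000:4500], a[4500:]

Xtr, Xcal, Xte = split(X)
ytr, ycal, yte = split(y)
_, _, gte = split(group)
ALPHA = 0.10

def conformal(model, label):
    model.fit(Xtr, ytr)
    res = np.abs(ycal - model.predict(Xcal))  # calibration residuals
    # Finite-sample rank: k-th smallest residual, k = ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((len(res) + 1) * (1 - ALPHA)))
    # The clamp is harmless here (k <= n); if k > n the valid interval is infinite.
    q = np.sort(res)[min(k, len(res)) - 1]  # calibration quantile = half-width
    pred = model.predict(Xte)
    covered = (yte >= pred - q) & (yte <= pred + q)
    r2 = 1 - np.sum((yte - pred) ** 2) / np.sum((yte - yte.mean()) ** 2)
    gcov = {int(g): round(float(covered[gte == g].mean()), 3) for g in np.unique(gte)}
    print(f"{label:<6}: R2={r2:6.3f} cov={covered.mean():.3f} width={2 * q:5.2f} group={gcov}")

class Weak(LinearRegression):
    """Linear model restricted to a single weak feature (column 4)."""
    def fit(self, X, y):
        return super().fit(X[:, 4:5], y)
    def predict(self, X):
        return super().predict(X[:, 4:5])

conformal(GradientBoostingRegressor(random_state=0), "strong")
conformal(Weak(), "weak")
# Predict-the-mean baseline (third table row); DummyRegressor is one way to get it.
conformal(DummyRegressor(strategy="mean"), "mean")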
The catch: marginal ≠ conditional
Here's the part you can't skip. The guarantee is marginal — averaged over the whole distribution. It says nothing about coverage within a subgroup. Watch what the same run reports per subgroup:
| model | marginal | group 0 | group 1 | group 2 |
|---|---|---|---|---|
| strong GBM | 0.895 | 0.835 | 0.985 | 0.857 |
| predict-the-mean | 0.907 | 0.889 | 0.933 | 0.897 |
The strong model has the worse conditional coverage: groups 0 and 2 sit at 83–86% while group 1 is over-covered at 98%. The models never see `group`, so each group's residuals carry a systematic offset (roughly −1.5, 0, and +1.5 for groups 0, 1, and 2). A single global residual quantile produces constant-width intervals that can't adapt to residuals that vary by group, so it robs the hard groups to pay the easy one. (The mean-only model looks more uniform here only because that same offset is small next to its much wider intervals: luck, not virtue.)
If your decisions are made per-subgroup — per region, per asset class, per customer segment — marginal coverage is not enough, and a high overall number can hide silent under-coverage where it matters. The fixes are Mondrian / group-conditional conformal (calibrate a separate quantile per group) or a normalized/locally-weighted nonconformity score so interval width adapts.
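Here's a minimal sketch of the first fix, Mondrian (group-conditional) calibration, next to the global quantile. It is a standalone toy rather than the run behind the tables: the data mimics the demo's group shift with only two features, the point model is plain least squares that never sees `group`, the sample is larger so the per-group numbers are less noisy, and `halfwidth` is a hypothetical helper name.

```python
import numpy as np

def halfwidth(residuals, alpha):
    """k-th smallest absolute residual, with k = ceil((n + 1) * (1 - alpha))."""
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(residuals)[k - 1]

rng = np.random.default_rng(0)
n, alpha = 30_000, 0.10
X = rng.normal(size=(n, 2))
group = rng.integers(0, 3, size=n)
y = X @ np.array([2.0, -1.5]) + 1.5 * group + rng.normal(size=n)

tr, cal, te = slice(0, 10_000), slice(10_000, 20_000), slice(20_000, n)

# Point model: least squares on X plus an intercept, blind to `group`.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A[tr], y[tr], rcond=None)
res = np.abs(y - A @ coef)

groups = np.unique(group)
q_global = halfwidth(res[cal], alpha)
# Mondrian: one calibration quantile per group instead of a single global one.
q_by_group = {int(g): halfwidth(res[cal][group[cal] == g], alpha) for g in groups}

global_cov, mondrian_cov = {}, {}
for g in groups:
    in_g = group[te] == g
    global_cov[int(g)] = float(np.mean(res[te][in_g] <= q_global))
    mondrian_cov[int(g)] = float(np.mean(res[te][in_g] <= q_by_group[int(g)]))

print("global  :", {g: round(c, 3) for g, c in global_cov.items()})
print("mondrian:", {g: round(c, 3) for g, c in mondrian_cov.items()})
```

The trade-off is sample size: each group is calibrated on its own residuals only, so small groups get noisier quantiles. The symmetric interval also stays centered on a biased prediction, which means the shifted groups pay for validity with extra width.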
What to take away
- A weak model gives you wide but honest intervals, not invalid ones. "The model is bad so the uncertainty is meaningless" is the wrong instinct — wide intervals are the correct signal that the model doesn't know much.
- The genuinely dangerous case is the opposite: a confident-looking narrow interval whose coverage is a lie. That happens not from low accuracy but from a broken exchangeability assumption — distribution drift between calibration and deployment. (That failure mode, and adaptive conformal as the fix, is a separate write-up.)
- Always check conditional coverage on the groups you actually act on. The marginal number is necessary, not sufficient.
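The drift point in the second bullet is easy to see in a toy. This is a sketch under loud assumptions: the model always predicts 0, the noise is Gaussian, and "drift" means the noise scale doubles between calibration and deployment. None of this comes from the recovery-rate model.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_cal, n_test = 0.10, 5000, 5000

# Assumed toy setup: predictions are always 0, so |y| is the residual.
res_cal = np.abs(rng.normal(scale=1.0, size=n_cal))
k = int(np.ceil((n_cal + 1) * (1 - alpha)))
q = np.sort(res_cal)[k - 1]  # half-width is frozen at calibration time

res_same = np.abs(rng.normal(scale=1.0, size=n_test))   # exchangeable with calibration
res_drift = np.abs(rng.normal(scale=2.0, size=n_test))  # noise scale has doubled

cov_same = float(np.mean(res_same <= q))
cov_drift = float(np.mean(res_drift <= q))
print(f"half-width={q:.2f}  no drift={cov_same:.3f}  drift={cov_drift:.3f}")
```

The interval width is identical in both cases, which is exactly why this failure is silent: nothing about the interval itself tells you the coverage has collapsed.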
Conformal prediction is one of the few tools that gives you a real guarantee with almost no assumptions. Just remember which guarantee it gives — coverage over the whole distribution — and verify the rest yourself.











