Most multi-armed bandit / A-B allocation systems add a minimum exploration weight: every arm should get at least, say, 5% of traffic, so no variant is ever fully starved and you keep collecting data on all of them. The guarantee sounds simple — p_i >= f for every arm — and the implementation looks even simpler:
import numpy as np

def clip_renorm(w, f):
    p = np.maximum(w, f)  # raise anything below the floor up to it
    return p / p.sum()    # renormalize so probabilities sum to 1
This is wrong, and it fails silently. The renormalize step pushes the floored arms back below the floor.
Why clip-then-renormalize breaks
Clipping raises the small weights up to f, which makes the total exceed 1. Dividing by that total then scales everything down — including the arms you just clipped to f. So they land below f again, and the floor you advertised is not the floor you enforce.
Concrete case — 4 arms, a confident winner, floor f = 0.10:
w = [0.94, 0.02, 0.02, 0.02] floor = 0.10
clip-renorm -> [0.7581, 0.0806, 0.0806, 0.0806] min = 0.0806 ❌ (< 0.10)
The three starved arms each get 8.06%, not the 10% you promised. And it isn't an edge case. Over 100,000 random peaky weight vectors (Dirichlet, α=0.3, n=4, f=0.10):
clip-and-renormalize violated the floor 97.2% of the time — worst arm seen: 7.69% against a 10% floor.
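The floor leaks whenever any arm's weight sits below f, because the clipped total then exceeds 1. That is exactly the situation when one arm dominates, which is when a bandit is exploiting.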
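The size of the leak has a closed form, which a short sketch can confirm. After clipping, every floored arm ends at f / S, where S = sum(max(w_i, f)), and S is largest when one arm holds all the weight, giving S = 1 + (n - 1) * f. The helper name `clip_renorm_lower_bound` is hypothetical, and the sketch assumes w is non-negative, sums to 1, and that n * f <= 1.

```python
import numpy as np

def clip_renorm(w, f):
    """Clip-then-renormalize, as in the text (the version that leaks)."""
    p = np.maximum(w, f)
    return p / p.sum()

def clip_renorm_lower_bound(n, f):
    """Hypothetical helper: smallest probability clip_renorm can emit.

    Assumes w >= 0, sum(w) == 1 and n * f <= 1. The clipped total is at most
    1 + (n - 1) * f, approached when one arm holds all the weight.
    """
    return f / (1.0 + (n - 1) * f)

w = np.array([0.94, 0.02, 0.02, 0.02])
p = clip_renorm(w, 0.10)
print(np.round(p, 4))                              # [0.7581 0.0806 0.0806 0.0806]
print(round(clip_renorm_lower_bound(4, 0.10), 4))  # 0.0769
```

The 7.69% worst case reported above is this bound for n = 4 and f = 0.10, that is, 0.10 / 1.30.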
The fix: one affine map onto the simplex
Instead of clipping, mix the learned weights with the uniform floor. Put the weights on the simplex (sum(w) = 1), then:
def additive_simplex(w, f):
    w = np.asarray(w, dtype=float)  # accept lists as well as arrays
    w = w / w.sum()                 # put the weights on the simplex
    return f + (1.0 - len(w) * f) * w
Each output is f + (non-negative), so p_i >= f holds exactly, and the total is n*f + (1 - n*f)*1 = 1 by construction — no renormalization needed, so nothing gets dragged back under the floor. It also preserves the ordering and relative spacing of w (it's affine), so you don't distort the policy you learned. Same run:
additive-simplex -> [0.664, 0.112, 0.112, 0.112] min = 0.112 ✅
Over the same 100,000 vectors it violated the floor 0.00% of the time.
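A vectorized sketch of that check, assuming the same setup as the clip-renorm experiment (Dirichlet with α = 0.3, n = 4, f = 0.10). The seed and the 1e-12 tolerance are arbitrary choices, and `additive_simplex` is restated here, extended to handle a batch of weight vectors, so the block runs on its own.

```python
import numpy as np

def additive_simplex(w, f):
    """Affine mix of normalized weights with a uniform floor f.

    Accepts one weight vector or a 2-D array with one vector per row.
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum(axis=-1, keepdims=True)   # put each vector on the simplex
    n = w.shape[-1]
    return f + (1.0 - n * f) * w

rng = np.random.default_rng(0)              # seed chosen arbitrarily
f, n, trials = 0.10, 4, 100_000
W = rng.dirichlet(np.full(n, 0.3), size=trials)   # shape (trials, n)
P = additive_simplex(W, f)

violations = int((P.min(axis=1) < f - 1e-12).sum())
max_sum_error = float(np.abs(P.sum(axis=1) - 1.0).max())
print(f"additive-simplex floor violations: {violations / trials:.2%}")
print(f"largest deviation of a row sum from 1: {max_sum_error:.1e}")
```

Because each output is f plus a non-negative term, the violation count is zero by construction; the run only confirms that floating-point rounding does not change that.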
The one guard you do need
The map needs n * f <= 1 — you can't promise four arms a 30% floor each (that's 120%). Handle it explicitly instead of producing negative weights:
def exploration_floor(w, f):
    n = len(w)
    if f < 0:
        raise ValueError("floor must be non-negative")
    if n * f >= 1.0:
        return np.full(n, 1.0 / n)  # floor is infeasible -> uniform
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return f + (1.0 - n * f) * w
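That's the whole correct primitive: a non-negativity check on the floor, an infeasible-floor fallback to uniform, and the affine mix. It assumes w itself is non-negative with a positive sum; the function does not check that.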
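A few calls make the branches concrete. This sketch restates `exploration_floor` from above so it runs on its own; the inputs are illustrative, and it assumes w is non-negative with a positive sum.

```python
import numpy as np

def exploration_floor(w, f):
    n = len(w)
    if f < 0:
        raise ValueError("floor must be non-negative")
    if n * f >= 1.0:
        return np.full(n, 1.0 / n)       # floor is infeasible -> uniform
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return f + (1.0 - n * f) * w

w = [0.94, 0.02, 0.02, 0.02]

feasible = exploration_floor(w, 0.10)    # normal path: affine mix
infeasible = exploration_floor(w, 0.30)  # 4 * 0.30 > 1 -> uniform fallback
boundary = exploration_floor(w, 0.25)    # 4 * 0.25 == 1 -> also uniform

print(np.round(feasible, 3))   # [0.664 0.112 0.112 0.112]
print(infeasible)              # [0.25 0.25 0.25 0.25]

# A negative floor is rejected rather than silently accepted.
try:
    exploration_floor(w, -0.05)
    negative_floor_rejected = False
except ValueError:
    negative_floor_rejected = True
print(negative_floor_rejected)  # True
```

At the boundary n * f == 1 the affine map would return the uniform distribution anyway, so routing that case through the fallback changes nothing.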
Why it actually matters
The exploration floor isn't cosmetic. It's what bounds worst-case regret and guarantees you keep collecting data on every arm — the property a lot of bandit regret arguments lean on, and often a fairness/SLA requirement too ("no variant ever drops below X%"). A floor that's silently 7.7% instead of 10% means the guarantee you reported to stakeholders, and any bound that depends on it, doesn't hold. The bug is invisible because the output still sums to 1 and still looks floored — the smallest number is just quietly too small.
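The clip-renorm violation rate quoted above can be reproduced with a few lines:

import numpy as np

rng = np.random.default_rng(0)
f, n, viol = 0.10, 4, 0
for _ in range(100_000):
    w = rng.dirichlet(np.ones(n) * 0.3)  # peaky weights on the simplex
    p = np.maximum(w, f)                 # clip to the floor ...
    p = p / p.sum()                      # ... then renormalize
    if p.min() < f - 1e-12:
        viol += 1
print(f"clip-renorm floor violations: {viol/100_000:.1%}")  # ~97%
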
I ran into this reviewing a Thompson-sampling weighting routine and proposed the additive-simplex version (plus the two guards) as a fix upstream. If your bandit or weighted-experiment layer clips-then-renormalizes to enforce a minimum, it's worth a one-line check: does the smallest probability it emits actually clear the floor?
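One way to wire that check in, as a minimal sketch: the helper name `assert_floor_holds` and its tolerances are hypothetical, not part of any library, and would need adapting to whatever allocation routine is in use.

```python
import numpy as np

def assert_floor_holds(p, f, tol=1e-12):
    """Hypothetical guard: fail loudly if an allocation breaks its promises.

    p   -- probabilities emitted by the allocation layer
    f   -- the advertised minimum per-arm probability
    tol -- slack for floating-point rounding (an arbitrary choice)
    """
    p = np.asarray(p, dtype=float)
    if abs(p.sum() - 1.0) > 1e-9:
        raise AssertionError(f"probabilities sum to {p.sum():.12f}, not 1")
    if p.min() < f - tol:
        raise AssertionError(f"min probability {p.min():.4f} is below floor {f}")
    return p

# Output of the additive map from the text: passes.
assert_floor_holds([0.664, 0.112, 0.112, 0.112], 0.10)

# Output of clip-then-renormalize on the same weights: caught.
leaky = np.maximum([0.94, 0.02, 0.02, 0.02], 0.10)
leaky = leaky / leaky.sum()
try:
    assert_floor_holds(leaky, 0.10)
    caught = False
except AssertionError as err:
    caught = True
    print(err)   # min probability 0.0806 is below floor 0.1
```

Checking the sum alone would not catch the bug, since the clip-renorm output still sums to 1; the minimum has to be compared against the floor explicitly.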











