TL;DR: We turned on vLLM continuous batching for a throughput win and watched p99 latency go 8x in the wrong direction. Long prefills were stalling the decodes that shared a forward pass with them. Chunked prefill and a tuned max_num_batched_tokens got the SLO back at the cost of ~11% of the batched throughput.
We run Llama 3.3 70B as the routing brain for our agent platform at Nexus Labs. ~14 internal services hit it. SLO is 2s p99 for the single-turn routing call.
Last month we flipped on vLLM 0.7's continuous batching to push more requests through our 4xH100 box. p50 dropped from 340ms to 190ms. We were happy for about 36 hours.
Then the latency dashboard turned red.
What we actually saw
p99 went from 1.2s to 9.8s on the routing endpoint. p50 was still good. p99.9 was unprintable.
The first alert came off our routing service's p99 panel. We checked the upstream load balancer. Healthy. Then the model server CPU and GPU. Healthy by every coarse metric. GPU utilization was 81%, not saturated. KV cache hit rate held at 67%. The Prometheus exporter from vLLM showed something stranger: vllm:time_per_output_token_seconds had widened from 32ms to 380ms during peak. The model itself wasn't slow. The scheduler was making everyone wait.
Long requests with 4k+ token prefills were eating decode slots. Short single-turn routing calls were starving behind them. The forward pass would dedicate ~60ms to a prefill chunk for one user's request, and 23 in-flight decode streams would block on it.
That's the contract of naive continuous batching. Prefill and decode share one forward pass. A big prefill stops everyone.
The fix
vLLM ships chunked prefill. It splits a large prefill into ~512-token chunks and interleaves them with decode steps. The tradeoff: a long request's prefill now spans several steps, so that request finishes later. In exchange, decode never stalls for more than one chunk's worth of time.
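A toy model of that tradeoff, not vLLM's actual scheduler. It assumes step time grows linearly with the prefill tokens in the step, and both cost constants are made-up placeholders rather than measurements from our box.

```python
def max_decode_stall_ms(prefill_tokens, chunk_tokens=None,
                        ms_per_prefill_token=0.1, ms_per_decode_step=30.0):
    """Worst single-step latency an in-flight decode stream sees.

    chunk_tokens=None models an unchunked prefill that runs in one step.
    Both ms_* constants are illustrative assumptions.
    """
    if chunk_tokens is None:
        chunk_tokens = prefill_tokens
    largest_chunk = min(chunk_tokens, prefill_tokens)
    return ms_per_decode_step + largest_chunk * ms_per_prefill_token


def prefill_steps(prefill_tokens, chunk_tokens):
    """Number of forward passes the prefill is spread over (ceiling division)."""
    return -(-prefill_tokens // chunk_tokens)


if __name__ == "__main__":
    for chunk in (None, 2048, 512):
        steps = 1 if chunk is None else prefill_steps(4096, chunk)
        stall = max_decode_stall_ms(4096, chunk)
        print(f"chunk={chunk}: {steps} step(s), worst decode step {stall:.1f} ms")
```

Smaller chunks shrink the worst decode step and raise the number of steps the long request has to wait through. That is the whole trade in two numbers.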
The other knob is max_num_batched_tokens. Set too high and you reintroduce the stall. Set too low and you starve throughput. We landed at 4096 for our workload after a sweep.
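A minimal sketch of how a sweep result can be reduced to one number, assuming the selection rule is "highest throughput that still meets the p99 SLO". The `SweepPoint` type and `pick_budget` helper are hypothetical names for illustration. The measurements themselves have to come from your own load test.

```python
from typing import NamedTuple, Optional, Sequence


class SweepPoint(NamedTuple):
    """One measured run of the sweep (hypothetical record shape)."""
    max_num_batched_tokens: int
    p99_seconds: float
    tokens_per_second: float


def pick_budget(points: Sequence[SweepPoint],
                slo_p99_seconds: float = 2.0) -> Optional[SweepPoint]:
    """Return the highest-throughput point whose p99 is within the SLO.

    Returns None when no candidate meets the SLO, which is a signal to
    widen the sweep rather than to pick the least-bad value.
    """
    within_slo = [p for p in points if p.p99_seconds <= slo_p99_seconds]
    if not within_slo:
        return None
    return max(within_slo, key=lambda p: p.tokens_per_second)
```

Filtering on the SLO first and maximizing throughput second keeps the tail latency as a hard constraint instead of one term in a blended score.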
# vllm config that ended up in prod
model: meta-llama/Llama-3.3-70B-Instruct
tensor_parallel_size: 4
max_model_len: 8192
enable_chunked_prefill: true
max_num_batched_tokens: 4096
max_num_seqs: 96
gpu_memory_utilization: 0.92
swap_space: 16
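For anyone launching from a script instead of a config file, the same settings can be turned into `vllm serve` arguments. This is a sketch: the `to_serve_args` helper is hypothetical, and it assumes each key maps to a CLI flag of the same name with hyphens in place of underscores. That holds for these options as far as I know, but check `vllm serve --help` for your version.

```python
# Settings copied from the config above.
PROD_CONFIG = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "tensor_parallel_size": 4,
    "max_model_len": 8192,
    "enable_chunked_prefill": True,
    "max_num_batched_tokens": 4096,
    "max_num_seqs": 96,
    "gpu_memory_utilization": 0.92,
    "swap_space": 16,
}


def to_serve_args(config):
    """Build an argv list for `vllm serve` from a flat config dict.

    Assumption: flag names are the config keys with '_' replaced by '-'.
    Booleans become bare flags when true and are omitted when false.
    """
    args = ["vllm", "serve", config["model"]]
    for key, value in config.items():
        if key == "model":
            continue
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


if __name__ == "__main__":
    print(" ".join(to_serve_args(PROD_CONFIG)))
```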
Before and after
| Metric | No batching | Naive CB | + chunked prefill |
|---|---|---|---|
| p50 latency | 340ms | 190ms | 215ms |
| p99 latency | 1.2s | 9.8s | 1.4s |
| p99.9 latency | 2.1s | 27s | 3.1s |
| Tokens/sec (cluster) | 2,650 | 4,820 | 4,310 |
| Cost/1M output | $0.74 | $0.41 | $0.46 |
We gave up ~11% of the naive-CB throughput, roughly a quarter of the gain over no batching. We bought back the SLO. Cheap trade.
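The arithmetic behind the trade, using only the throughput row of the table. The ~11% is measured against naive-CB throughput. Measured against the gain over no batching, the same loss is closer to a quarter.

```python
# Tokens/sec from the table above.
no_batching, naive_cb, chunked = 2650, 4820, 4310

given_up = naive_cb - chunked                         # 510 tokens/sec
share_of_naive = given_up / naive_cb                  # about 0.106
share_of_gain = given_up / (naive_cb - no_batching)   # about 0.235
retained_speedup = chunked / no_batching              # about 1.63x over no batching

if __name__ == "__main__":
    print(f"gave up {given_up} tok/s: {share_of_naive:.1%} of naive CB, "
          f"{share_of_gain:.1%} of the gain, still {retained_speedup:.2f}x baseline")
```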
Things that didn't help
We tried priority lanes where small requests jump the queue. It cut p99 to 5.2s but cratered p99 for the long requests, moving the scheduling problem instead of solving it. Routing long requests to separate replicas would have worked, but it would have doubled our GPU footprint. Not worth it for our traffic mix.
We tried bumping max_num_seqs to 256 thinking more concurrent decodes would amortize prefills. It made things worse. KV pressure spiked, eviction churn ate compute.
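A back-of-envelope for why 256 sequences spiked KV pressure. The model shape is an assumption taken from the published Llama 3 70B architecture (80 layers, 8 KV heads, head dim 128), and the 2-byte KV dtype is also assumed. The result is a worst case where every sequence fills max_model_len, so real usage sits below it.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers (K plus V).

    Defaults are assumptions: Llama 3 70B shape with a 16-bit KV cache.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def worst_case_kv_gib(max_num_seqs, max_model_len=8192):
    """KV cache demand in GiB if every sequence reached max_model_len."""
    return max_num_seqs * max_model_len * kv_bytes_per_token() / 2**30


if __name__ == "__main__":
    for seqs in (96, 256):
        print(f"max_num_seqs={seqs}: up to {worst_case_kv_gib(seqs):.0f} GiB of KV cache")
```

Under these assumptions each full-length sequence costs 2.5 GiB, so raising max_num_seqs raises the ceiling on demand without adding any memory to serve it.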
We tried separating ingress by content length at the gateway layer. Under 1k tokens to one pool, the rest to another. Worked on paper. In practice the small pool got 92% of traffic and we ran out of headroom there. Bin packing prompts isn't free either.
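A sketch of the split and of the check worth running before committing to it. The pool names and the helpers are hypothetical, and the threshold assumes "1k" means 1000 tokens. Feed `pool_shares` a sample of real prompt lengths and the skew shows up before the capacity plan does.

```python
def pick_pool(prompt_tokens, threshold=1000):
    """Route by prompt length: under the threshold goes to the short pool."""
    return "short" if prompt_tokens < threshold else "long"


def pool_shares(prompt_lengths, threshold=1000):
    """Fraction of requests each pool would receive for a sample of prompts."""
    counts = {"short": 0, "long": 0}
    for tokens in prompt_lengths:
        counts[pick_pool(tokens, threshold)] += 1
    total = len(prompt_lengths)
    if total == 0:
        return {"short": 0.0, "long": 0.0}
    return {pool: count / total for pool, count in counts.items()}
```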
One thing we kept: a circuit breaker upstream that sheds to a hosted provider when our internal p99 crosses 3s. We pipe everything through Bifrost so the failover is one config change instead of a deploy. It catches the edge cases when prefill-heavy traffic spikes faster than autoscaling reacts.
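The breaker's decision logic can be sketched as a sliding-window p99 check. This is an illustration under assumptions, not our production code: the class name, window size, and minimum sample count are made up, and the actual failover call is left out because it depends on the gateway.

```python
from collections import deque


class P99Breaker:
    """Decide whether to shed traffic based on p99 over recent requests."""

    def __init__(self, threshold_s=3.0, window=1000, min_samples=100):
        self.threshold_s = threshold_s
        self.min_samples = min_samples          # avoid tripping on a cold window
        self.latencies = deque(maxlen=window)   # oldest samples fall off

    def record(self, latency_s):
        self.latencies.append(latency_s)

    def p99(self):
        """Nearest-rank 99th percentile of the window, 0.0 when empty."""
        if not self.latencies:
            return 0.0
        ordered = sorted(self.latencies)
        rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
        return ordered[rank - 1]

    def should_shed(self):
        enough = len(self.latencies) >= self.min_samples
        return enough and self.p99() > self.threshold_s
```

The caller records each request's latency and checks `should_shed()` before dispatch. The minimum sample count keeps a handful of slow requests after a restart from tripping the breaker.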
Trade-offs and limitations
Chunked prefill is not free. For workloads with very long prompts and short decodes (think doc-QA over 32k context), per-request latency goes up by 15-25%. If that's your hot path, you'd want to split traffic by class and run two pools with different configs.
max_num_batched_tokens is workload-specific. The number we landed on is wrong for someone with a different prompt distribution. There's no shortcut. You run the sweep.
Continuous batching also makes p99 noisier across deployments. A neighbor service pushing a new feature with 8k prompts can hurt yours. The isolation story at the vLLM layer is real but not airtight. We file this under "things our k8s admission controller now checks."
What the eval suite said
The boring point. None of this showed up in offline eval. Eval measured correctness on a fixed batch size. Production measures tail latency under realistic prompt mix. If you only have the first one, you'll ship the dashboard regression we shipped.
We added a load-shape replay step to our deployment pipeline two weeks ago. It replays a sampled 5-minute window of real traffic shape against the candidate. Catches this class of regression before it touches real users.
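The two pure pieces of that step can be sketched as below. It assumes the trace is a list of (timestamp in seconds, prompt token count) pairs and that the gate is the same 2s p99 SLO. Both function names are hypothetical, and the part that actually sends requests to the candidate is omitted.

```python
def sample_window(trace, start_s, length_s=300.0):
    """Cut a window out of a trace, keeping arrival offsets and prompt sizes.

    trace: iterable of (timestamp_s, prompt_tokens) pairs (assumed shape).
    Returns (offset_s, prompt_tokens) pairs relative to the window start,
    so the replayer can reproduce both arrival timing and prompt mix.
    """
    end_s = start_s + length_s
    return [(t - start_s, tokens) for t, tokens in trace if start_s <= t < end_s]


def passes_gate(latencies_s, slo_p99_s=2.0):
    """True when the replay's nearest-rank p99 latency is within the SLO."""
    if not latencies_s:
        return False  # an empty replay proves nothing
    ordered = sorted(latencies_s)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1] <= slo_p99_s
```

Keeping the offsets rather than just the prompts is the point: the regression came from which requests overlapped in time, not from any single request.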