Qwen2.5-Coder-7B Q4 vs Q8 scored the same on my agent test, then I read *how* they failed

I ran Qwen2.5-Coder-7B at Q8 and Q4 through the same multi-step agent test. Same pass rate at every tier. But on the hardest tier they failed in two completely different ways — and that difference says more than the score does.

If you run local models for agents, you've made this trade a hundred times: do I keep the big Q8, or drop to Q4 to save VRAM and get more speed?
The usual way to decide is the benchmark score. Run both, compare the numbers, pick the one that scores higher. If they score the same, shrug and take the smaller one.
I did exactly that this week with Qwen2.5-Coder-7B. And the scores told me almost nothing. The failures told me everything.

Same model, two quants: q8_0 (7.6 GB) and q4_k_m (4.6 GB). Same machine (16 GB Apple Silicon). Same tasks. I didn't test a single chat prompt. I ran each quant through a real agent loop: call a tool, read the result, decide the next step, repeat, with traps and failures thrown in. Each task runs multiple times, and it only counts as passed if it passes every run.
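
The "passes every run" rule is stricter than an average pass rate, so here is a minimal sketch of it. This is an illustration, not the actual harness code: `passes_every_run` and `flaky_task` are hypothetical names, and the scripted outcomes stand in for real agent runs.

```python
from typing import Callable


def passes_every_run(run_once: Callable[[], bool], n_runs: int) -> bool:
    """Return True only if every one of n_runs attempts succeeds."""
    # all() stops at the first failing run, so one bad run fails the task.
    return all(run_once() for _ in range(n_runs))


# Hypothetical stand-in for a real agent run: succeeds 9 times, then fails once.
outcomes = iter([True] * 9 + [False])


def flaky_task() -> bool:
    return next(outcomes)


# A 90% per-run success rate still counts as a failed task under this rule.
print(passes_every_run(flaky_task, n_runs=10))
```

Under this rule a task that works most of the time still fails, which is the point: an agent that breaks one run in ten is not one you can leave unattended.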

Here's what came back.

Tier   | Q8 (7.6 GB)  | Q4 (4.6 GB)
Easy   | Pass (5/5)   | Pass (5/5)
Medium | Pass (40/40) | Pass (40/40)
Hard   | 1 of 4 tasks | 1 of 4 tasks

On the score alone, these two are twins. Easy and Medium: both clean. Hard: both fall apart, passing only 1 of 4 tasks. If you stopped at the number, you'd say "no real difference, take the Q4, it's smaller and twice as fast."

That conclusion would be wrong. Here's why.

Same score, two different failures

On the Hard tier, the two quants broke in opposite ways.

Q8's top failure was a forbidden call. One of the hard tasks plants a trap: a tool the model is explicitly told not to use on a decision boundary — think "delete the record" or "deploy to prod," an action it's supposed to gate, not take. Q8 walked right up to the problem, acted decisively… and pulled the trigger on the thing it wasn't allowed to touch. The moment it did, the run failed.

Q4's top failure was a loop cap. Same kind of task. But Q4 never got far enough to trip a trap. It got stuck — repeating the same call, never resolving the task, spinning until it ran out of steps. It didn't do the wrong thing. It couldn't do anything.

Read those two side by side:

Q8 — competent but reckless. It can plan and act. It just blew past a guardrail.
Q4 — can't hold the plan together. It looped in place and made no progress.

That's not the same model with a slightly lower score. That's a different kind of model.
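
To make the two failure types concrete, here is a rough sketch of how a harness could tell them apart. Everything in it is an assumption for illustration: `run_agent`, the `RunResult` labels, the tool names, and the two scripted "models" are made up, and a real harness would call an actual model instead of these stubs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class RunResult:
    passed: bool
    failure: Optional[str] = None  # "forbidden_call", "loop_cap", or None


def run_agent(next_action: Callable[[List[str]], str],
              forbidden: Set[str],
              max_steps: int = 8) -> RunResult:
    """Drive a toy agent loop and label how the run ended.

    next_action is a hypothetical stand-in for the model: it receives the
    tool calls made so far and returns the next tool name, or "done".
    """
    history: List[str] = []
    for _ in range(max_steps):
        action = next_action(history)
        if action in forbidden:
            # The run fails the moment a gated tool is called.
            return RunResult(False, "forbidden_call")
        if action == "done":
            return RunResult(True)
        history.append(action)
    # Ran out of steps without resolving the task.
    return RunResult(False, "loop_cap")


def reckless(history: List[str]) -> str:
    # Reads once, then goes straight for the tool it was told not to use.
    return "read_record" if not history else "delete_record"


def stuck(history: List[str]) -> str:
    # Repeats the same call forever and never declares the task done.
    return "read_record"


forbidden_tools = {"delete_record"}
print(run_agent(reckless, forbidden_tools).failure)
print(run_agent(stuck, forbidden_tools).failure)
```

Both runs come back as failed, but the label on each is different, and that label is the part a bare pass rate throws away.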

Why the failure mode matters more than the number

Looping is the tell. When a model repeats actions and never advances, it's usually because it's lost the thread of the multi-step plan — it can't keep track of what it's done and what's left. And that planning/state-tracking ability is exactly what lower-bit quantization tends to erode first on long, hard tasks.

So dropping Q8 → Q4 didn't just shave a few points off a score (both landed at 1 of 4). It changed how the model fails — from "acts, but trips a guardrail" to "can't make progress at all." One of those is a discipline problem you might fix with better prompting or tighter tool permissions. The other is a capability problem you mostly can't prompt your way out of.
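
As one example of what "tighter tool permissions" could look like, here is a small sketch that enforces the gate in the harness instead of trusting the model to respect it. The names (`make_dispatcher`, `GatedToolError`, the two tools, the `approve` callback) are hypothetical, and this is not how any particular agent framework does it.

```python
from typing import Callable, Dict, Set


class GatedToolError(Exception):
    """Raised when a gated tool is called without explicit approval."""


def make_dispatcher(tools: Dict[str, Callable[..., str]],
                    gated: Set[str],
                    approve: Callable[[str], bool]) -> Callable[..., str]:
    """Wrap tool calls so gated tools only run when approve() says yes."""
    def dispatch(name: str, **kwargs) -> str:
        if name in gated and not approve(name):
            raise GatedToolError(f"{name} requires approval")
        return tools[name](**kwargs)
    return dispatch


# Hypothetical tools for illustration.
tools = {
    "read_record": lambda record_id: f"record {record_id}",
    "delete_record": lambda record_id: f"deleted {record_id}",
}

# Deny every gated call; a real setup might ask a human here instead.
dispatch = make_dispatcher(tools, gated={"delete_record"},
                           approve=lambda name: False)

print(dispatch("read_record", record_id=7))
try:
    dispatch("delete_record", record_id=7)
except GatedToolError as err:
    print("blocked:", err)
```

A wrapper like this can stop a reckless model from doing damage. It does nothing for a model that loops, which is why the two failure types call for different fixes.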

If you only looked at the pass rate, you'd never see this. Two models tied at 1/4 look identical on a leaderboard. In production they'd fail your users in two completely different ways — and you'd debug them completely differently.
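
Here is a small sketch of reporting failures by type next to the pass rate. The outcome lists are invented for illustration and are not my measured per-task results; `summarize` and the labels are hypothetical names.

```python
from collections import Counter
from typing import Dict, List

# Made-up per-task outcomes, for illustration only (not measured data).
results: Dict[str, List[str]] = {
    "quant_a": ["pass", "forbidden_call", "forbidden_call", "loop_cap"],
    "quant_b": ["pass", "loop_cap", "loop_cap", "loop_cap"],
}


def summarize(outcomes: List[str]) -> dict:
    """Return the pass rate plus a count of each failure type."""
    counts = Counter(outcomes)
    passed = counts.pop("pass", 0)  # what remains are failure labels
    return {"pass_rate": passed / len(outcomes), "failures": dict(counts)}


for name, outcomes in results.items():
    print(name, summarize(outcomes))
```

Both entries report the same pass rate, and only the `failures` breakdown shows that they are not the same model in practice.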

The honest caveats

I'm not going to oversell this, because the whole point is measuring honestly:

Neither quant is "ready" for a real coding agent here. Both failed the context-cliff probe right at the baseline — a tool-call failure, not a context-length limit. The headline is the failure-mode difference at Hard, not "Q8 is production-ready."
This is one model on one machine: Qwen2.5-Coder-7B on a 16 GB Apple Silicon machine. Your model, your hardware, your tasks may land differently, which is the entire reason to test your own combo instead of trusting someone else's.
Small sample at Hard (4 tasks). The divergence is a strong signal, not a statistical proof.
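
To put a number on how little 4 tasks can prove, here is a quick sketch using the Wilson score interval, a standard confidence interval for a proportion. It treats the 4 tasks as independent trials, which is a simplifying assumption, and the function name is mine.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


low, high = wilson_interval(1, 4)
print(f"1 of 4 passed: plausible pass rate from {low:.0%} to {high:.0%}")
```

For 1 of 4, the interval runs from roughly 5% to 70%, so the pass rate by itself pins down very little. That is one more reason to read the failure types instead of the score.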

The takeaway

"Which quant should I use" is the wrong question if you only check the score. Two quants can tie and still be different models underneath. The number tells you whether it failed. The failure mode tells you what's actually broken, and that's what you need to know before you ship.

I measured this with QuantaMind, the open-source tool I'm building to test exactly this: local models, in a real agent loop, on your own hardware, with the failure broken down by type instead of hidden behind a score. It's free and fully offline.