I Blamed the Model for Months. The Bug Was My Sampler.
40GB In, Word Salad Out
Running local LLMs on M1 Max hardware is one of those setups that looks great on paper — unified memory, no PCIe bottleneck, offline and private. For about a year I ran mlx-community/Qwen3.6-35B-A3B-8bit, a 35B Mixture-of-Experts model that consumed ~40GB of the machine's 64GB pool. Generation speed was fine. Quality, past about 150 words, was not.
The output turned to garbage — run-on sentences that looped without structure, synonyms piling on synonyms, no paragraph ever landing a conclusion. I assumed the MoE architecture was the problem. That assumption sent me down months of tooling: sectional generation scaffolding, retry logic, truncation heuristics, all built to work around what I thought was a hard architectural limit.
The model was fine. My code was broken.
The Diagnostic: Four Configs, One Metric
I stopped guessing and wrote a small diagnostic script. Same prompt, four sampler configurations, one measurement: the longest continuous stretch of text without a sentence terminator. A coherent paragraph rarely runs beyond 40 words between periods. A collapse runs into the hundreds.
| Config | Words out | Longest run (no period) |
|---|---|---|
| temp=0.6 + rep-penalty ON | 581 | 472 ← collapse |
| temp=0.6, no processors | 479 | 27 ✓ |
| temp=0.6 + top_p=0.9, no processors | 510 | 31 ✓ |
| greedy temp=0, no processors | 490 | 27 ✓ |
Row one is what production was running. The rest are controls. Row two settled it: remove the processor, the collapse disappears. The model was never the problem.
The Root Cause: A Logits Processor That Fought Itself
My codebase had a custom _repetition_penalty_processor. On every decoding step, it divided each recently-generated token’s logit by 1.15, across a sliding 128-token window. The intent was to discourage repetition. The effect was the opposite.
By penalizing recently-used tokens, the processor forced the model to reach for synonyms. Those synonyms got penalized next step. The model kept reaching further — abstract nouns, tangential phrases, concepts that drifted further from the original topic — until coherence collapsed entirely. It’s a feedback loop. The harder you penalize recent tokens, the more the model is pushed toward low-probability continuations that spiral into incoherence.
Turn the processor off, and the same model generates cleanly past 500 words. The architecture wasn’t the problem. The sampler config was.
The Fix
Three changes in mlx_inference.py:
- Swapped
MLX_MODEL_PRIMARYto"mlx-community/Qwen3.6-27B-4bit"— the dense 4-bit variant, ~18GB RSS instead of ~40GB. - Changed
make_sampler(temp=0.6)tomake_sampler(temp=0.6, top_p=0.9). Top-p sampling limits generation to the smallest set of tokens whose cumulative probability reaches 90%, pruning the low-probability tail without introducing a penalty feedback loop. - Removed
logits_processors=[_repetition_penalty_processor]from the maingenerate()call entirely.
Before shipping, I validated the smaller model against a set of already-scored drafts: Pearson r=0.89 correlation against the old model’s scores, with 7 of 8 gate decisions (the ≥70 publish threshold) matching. That’s 8 samples — enough to justify a swap for an evaluation-only workload, not enough to make architectural claims about generation quality at scale.
What 22GB Back Feels Like on Apple Silicon
There’s no separate VRAM bucket on unified memory. The GPU and CPU share the same 64GB pool, which means a 40GB model isn’t just using "graphics memory" — it’s consuming RAM that macOS, the browser, the terminal, and everything else also depends on.
When the model shrank from ~40GB to ~18GB, system free memory jumped from 65% to 93%. Compressed swap dropped. Background processes stopped stuttering. The machine just felt different. Before, keeping Chrome open alongside Xcode and a running LLM was a tradeoff I had to think about. With ~46GB of headroom instead of ~14GB, it stopped being a question.
On a unified memory machine, the RAM number isn’t just a spec. It’s the entire working experience. Every gigabyte you save on the model is a gigabyte that macOS can use to keep your browser alive, your terminal responsive, your compiler from stalling. The 22GB reclaim from switching models was worth more than any hardware upgrade I could have bought.
How I measured
Everything here is first-party, measured on my own M1 Max (64 GB) — no external benchmarks. The four-config table comes from a single diagnostic script: same prompt each run, scoring the longest stretch of text with no sentence terminator. The r=0.89 figure is from re-scoring eight already-graded drafts with both Qwen3.6-35B-A3B-8bit and Qwen3.6-27B-4bit and comparing the two sets of scores. The memory numbers (65% → 93% free, ~22 GB reclaimed) are system readings taken before and after the model swap. The whole point is that you can reproduce any of this on your own hardware.
Don’t Size for Prestige
The 40GB model was chosen for generation quality — the reasoning being that bigger MoE models produce better long-form text. But the actual job the model was doing was evaluation: scoring draft content on a 0–100 scale. For that job, the 27B-4bit is sufficient, the correlation numbers confirm it, and it costs 22GB less.
I spent months building scaffolding for a wrong assumption. The bug was three lines of sampler code I wrote myself. Know what a model actually does before deciding which one to run.







