Don't trust large context windows

Points

Comments

computersuck

Author

Top Comments

bob1029Jun 14

I've been able to avoid context size issues by applying one simple constraint to my agent loop. What I do is prevent all tool calling in the user's top-level conversation thread. Anything that needs to tool call must happen in a recursive invoke of the agent, which returns whatever results to caller.

I can keep the same high level conversation going for an entire day over a million LOC+ codebase without ever hitting meaningful token limits. No compaction or summarization tricks needed. I can burn 50 million tokens in recursive calls and still not touch 100k tokens in my root conversation thread.

There is some rework needed to "bootstrap" the agent each time it has to descend back into Narnia, but this is still far more efficient than carrying around one big flat context that tries to cover everything all the time.

Recursion is very effective at controlling token use, but it can only go so far. I've not observed any uplift for recursive depth beyond 1. I have seen the agent attempt it a few times, but the practical performance is simply not there. External symbolic recursion does not appear to be something the frontier models have been trained for. They are fantastic at emulating recursion in context, but we don't want that if we are trying to achieve a reduction in token use.

kelnosJun 14

This has not been my experience with Opus since Anthropic released the 1M token context window for use under the subscription plans. I routinely push past 500k tokens, even sometimes up to around 800k tokens, and don't see this problem. I've seen it to some extent when getting truly near the limit, up around and above 900k tokens, though what I see isn't as severe as the author seems to see.

(And I rarely fill the context window that far anyway when working on a single task, or a series of tasks that are related enough to warrant the same context; more typical is anywhere between 200k and 600k or so.)

I'm not saying that no one ever has this experience, but it's odd to me that some people see it so often that it warrants giving it a name.

wood_spiritJun 14

Yes context management is key.

I do my own framework and spend a lot of time trying to debug this and it’s not so much the context size in hard numbers but rather the probability that there is debris or wrong directions in the window that are drowning out the things the user thinks are important.

This manifests in the llm that keeps going back to doing the thing that failed when they tried it just before the last approach etc. The frequency of things in the context window give weight even if they are the wrong things.

I have a lot of tricks like not giving the llm lots of tools but rather giving it a tool it can use to search for tools etc.

But the bigger solution is in process where you use something like superpowers to force the llm through stages and you control the context that carries forward.

SwellJoeJun 14

Opus in recent versions is fine beyond 100k, but I usually do try to keep it under 200k.

But, this is also why so-called "memory" systems are usually a mistake that make the models dumber. They don't have memory, they only have context, and every irrelevant fact you shove into the context is less context for the problem. Less distractions, better results.

The way to have the agent remember things is to have it document its work, like a human developer would do if they wanted their project to be friendly to other developers working on it. Good developer docs with an index page and a good plan with checklists, in concise Markdown files, checked in to the repo is the ideal memory for models and the ideal docs you need to figure out WTF the model has been up to. Helps with code review, too, whether by humans or another model. There's no down side.

kristiancJun 14

I'm getting a lot of mileage out of basically acting like the AI's Product Manager, and insisting that it writes up short PRDs for every feature we propose to build. That gives it a reference over time of everything that has been built, but also makes it less liable to drift with each one. Each one gets its own conversation. For me this is a happy medium between stopping it going off the rails but also making sure it can reference past decisions when it needs to. The one thing I dislike about Pocock's method (not to use PRDs so much but to have an in depth discussion to get alignment) first is it wastes a lot of the best window on that initial back and forth.

mgJun 14

Considerations about what goes on in agents internally will probably not be part of software development for long.

Personally, I already see LLMs and agents as blackboxes. I give each feature request to multiple LLMs and then compare the results. I don't manually use "sessions" at all. I just look at the outcome. When I dislike it, I "git reset --hard", change my prompts and restart the feature request.

To have an ongoing sense of which agents perform best, I keep a log and calculate an ELO score of which agents meet my demands best. This score is imporant to me, not so much how the agent achieves it.

steveridoutJun 14

I wonder how much this depends on the quality and consistency of the context?

For example, it may be the case that a long context full of useful information relevant to the task is completely fine, perhaps even beneficial. And if the context contains a bunch of unrelated tangents and conflicting instructions, then it will be detrimental.

Have there been studies on what makes models get dumber? To what extent is context length to blame vs context quality?

cowangJun 14

Evaluating the Sensitivity of LLMs to Prior Context

https://arxiv.org/abs/2506.00069

Visit the Original Link

Read the full content on garrit.xyz

Visit garrit.xyz View on Hacker News

Source

garrit.xyz

Author

computersuck

Posted

June 14, 2026 at 06:07 AM

Visit Original Hacker News Thread

Don't trust large context windows

Top Comments

Visit the Original Link

Source

Author

Posted

More Top Stories

Phoenix LiveView 1.2 Released

Honda Civics and the Evil Valet

Free SQL→ER diagram tool, runs in the browser, nothing uploaded

GLM 5.2 Is Out

Noise infusion banned from statistical products published by Census Bureau

Every Frame Perfect