Most engineers who adopted Claude Code or Codex are still using them like a faster autocomplete: one prompt, one answer, repeat. The real productivity unlock is somewhere else entirely — in treating these tools as an orchestra of specialized agents you direct, rather than a single assistant you chat with. This is a practical guide to building that multi-agent workflow into your day-to-day, and to doing it without fooling yourself about the gains.
From autocomplete to orchestration
If you have used an AI coding agent for more than a week, you already know the basic loop: describe a change, watch it edit files, run the tests, fix what broke. That loop is genuinely useful. But it is also the floor of what these tools can do, not the ceiling.
The engineers getting the largest gains are not writing better single prompts. They are running several agents at once, each with a narrow job, coordinated through a plan that a human reviewed before any code was written. One agent explores the codebase and writes a spec. Another implements against that spec. A third reviews the diff with fresh eyes. A fourth keeps the documentation in sync. Some of this happens in parallel across isolated branches; some of it happens in a strict sequence because step three genuinely depends on step two.
This is the difference between using an agent and building an agentic workflow. The first is a tool you reach for. The second is a system you design. This guide is about the second — what an ideal multi-agent workflow looks like on a normal development task, regardless of whether your tool of choice is Claude Code, OpenAI Codex, Cursor, or something that did not exist when this was written.
What a multi-agent workflow actually means
The vocabulary here is muddier than it should be, so it is worth being precise. The clearest definitions come from Anthropic's engineering writing, and they have largely become the industry's shared language.
An agent, in the practical sense, is an LLM autonomously using tools in a loop. It reads a file, decides to run a test, reads the output, decides to edit another file, and continues until it judges the task done. That is one agent: one model, one context window, one continuous train of thought.
A workflow is a system where LLMs and tools are orchestrated through predefined code paths. The steps are known in advance. You chain them together because you have decided, as the engineer, that this sequence produces a reliable result.
A multi-agent system introduces a lead or orchestrator agent that breaks a task into pieces and delegates them to specialized sub-agents — often running in parallel, each with its own context window, its own tools, and its own instructions. The orchestrator does not do the work itself; it decomposes, delegates, and synthesizes.
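The "tools in a loop" definition is easier to hold onto once you have seen it as code. What follows is a minimal sketch, not any vendor's SDK: `Action`, `run_agent`, `scripted_model`, and the two fake tools are all hypothetical names invented for illustration, and the scripted function stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    """One decision from the model: call a tool, or finish with an answer."""
    tool: Optional[str] = None          # name of the tool to call, if any
    argument: str = ""                  # a single string argument, kept simple on purpose
    final_answer: Optional[str] = None  # set when the model judges the task done


# A "model" here is any callable mapping the transcript so far to the next action.
Model = Callable[[List[str]], Action]


def run_agent(model: Model, tools: Dict[str, Callable[[str], str]],
              task: str, max_steps: int = 10) -> str:
    """Ask the model for an action, run the tool, feed the result back, repeat."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        action = model(transcript)
        if action.final_answer is not None:
            return action.final_answer
        if action.tool not in tools:
            transcript.append(f"error: unknown tool {action.tool!r}")
            continue
        result = tools[action.tool](action.argument)
        transcript.append(f"{action.tool}({action.argument}) -> {result}")
    return "stopped: step budget exhausted"


def scripted_model(transcript: List[str]) -> Action:
    """Stand-in for a real LLM: read a file, run the tests, then stop."""
    if len(transcript) == 1:
        return Action(tool="read_file", argument="app.py")
    if len(transcript) == 2:
        return Action(tool="run_tests")
    return Action(final_answer=f"done after {len(transcript) - 1} tool calls")


# Hypothetical tools that return canned strings instead of touching disk.
fake_tools = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda _: "3 passed",
}
```

Calling `run_agent(scripted_model, fake_tools, "fix the bug")` walks the loop twice and then returns. The step budget is the part worth copying: a loop with no upper bound is how an agent burns tokens on a task it cannot finish.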
The distinction that matters most in practice is between workflows and agents. Workflows offer predictability and consistency for tasks you can define up front. Agents are the better choice when you need flexibility and model-driven decision-making across a path you cannot map in advance. Anthropic's own guidance is refreshingly conservative on this point: find the simplest solution possible, and only increase complexity when it demonstrably improves outcomes. Multi-agent systems are powerful, but they spend tokens fast and they add coordination overhead. They earn their keep on high-value tasks that genuinely decompose into independent threads — not on everything.
The five patterns worth knowing by name
Before wiring up a crew of agents, it helps to have a vocabulary for the shapes these systems take. Anthropic's "Building Effective Agents" lays out five composable patterns that have become the industry's reference set. You will recognize most of them from systems you have already built by accident.
Prompt chaining decomposes a task into a fixed sequence of steps, where each step operates on the output of the last. You trade a little latency for a lot of accuracy, because each call has a narrower, easier job. Generating a spec, then generating code from that spec, then generating tests from that code is a prompt chain.
Routing classifies an input and sends it to a specialized handler. A triage agent that reads an incoming bug report and decides whether it is a frontend issue, a database issue, or a flaky test — then hands it to the right specialist — is routing.
Parallelization runs subtasks simultaneously. This comes in two flavors: sectioning, where you split work into independent chunks that run at once, and voting, where you run the same task several times to get multiple opinions and take the consensus. Asking three agents to independently find security issues in a diff and pooling their findings is voting.
Orchestrator-workers is the pattern most real coding agents use. A central agent dynamically breaks down a task, delegates the pieces to worker agents, and synthesizes their results. Unlike a fixed chain, the subtasks are not predetermined — the orchestrator decides what they are based on what it finds. This is what lets an agent handle a GitHub issue that touches eleven files it has never seen.
Evaluator-optimizer puts two agents in a loop: one generates a solution, the other evaluates it against explicit criteria and sends back feedback, and the cycle repeats until the work passes. This is the reviewer-critic pattern, and it is one of the highest-leverage things you can add to your workflow.
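Four of these patterns are small enough to sketch as plain functions. The version below is an illustration under assumptions, not Anthropic's reference code: every function name is hypothetical, and each "agent" is just a callable you would replace with a real model call. Orchestrator-workers is left out because its subtasks are decided at run time by the model rather than by a fixed code path.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Step = Callable[[str], str]


def prompt_chain(steps: List[Step], initial: str) -> str:
    """Prompt chaining: each step consumes the previous step's output."""
    output = initial
    for step in steps:
        output = step(output)
    return output


def route(classify: Callable[[str], str], handlers: Dict[str, Step], item: str) -> str:
    """Routing: classify the input, then hand it to the matching specialist."""
    return handlers[classify(item)](item)


def vote(workers: List[Step], item: str) -> str:
    """Parallelization, voting flavor: ask several workers, keep the most common answer.

    Runs sequentially for clarity; a real system would run the workers concurrently.
    """
    answers = [worker(item) for worker in workers]
    return Counter(answers).most_common(1)[0][0]


def evaluator_optimizer(generate: Callable[[str, str], str],
                        evaluate: Callable[[str], Tuple[bool, str]],
                        task: str, max_rounds: int = 3) -> Tuple[str, bool]:
    """Evaluator-optimizer: regenerate with feedback until the evaluator passes the draft.

    Returns the last draft and whether it passed within the round budget.
    """
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        passed, feedback = evaluate(draft)
        if passed:
            return draft, True
    return draft, False
```

The `max_rounds` cap matters for the same reason the plan matters: an evaluator that never passes a draft should hand the problem back to you, not loop forever.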
The core loop: explore, plan, code, commit
Underneath all the orchestration, there is a single loop that the best agentic workflows follow on almost every task. It is worth internalizing because it maps cleanly onto how a thoughtful senior engineer already works, and because skipping any one phase is where most agent failures come from.
1. Explore
Before touching a line of code, the agent reads. It opens the relevant files, the existing tests, the surrounding modules, and any specs or tickets. Crucially, in this phase it is not allowed to edit anything. Claude Code calls this plan mode; in other tools you enforce it by simply telling the agent to investigate and report back before proposing changes. The point is to load the right context before any decisions get made. When you are dropped into an unfamiliar codebase, this is also how you onboard: ask the agent the same questions you would ask a senior engineer on the team — where does authentication live, how is state managed, what is the test setup.
2. Plan
Next, the agent produces a written plan: what it intends to change, in which files, in what order, and how it will verify the result. This plan is an artifact you read. It is the single most important human-in-the-loop checkpoint in the entire workflow, because correcting a flawed plan costs a sentence, while correcting a flawed implementation costs a review cycle. A widely shared rule of thumb: the only time you can safely skip the planning phase is when you could describe the entire diff in a single sentence. For anything larger, make the agent plan first.
There is a subtle technique here that separates people who are good at this from people who are great at it: run planning and implementation in separate sessions. The exploration phase fills the context window with file contents, dead ends, and reasoning that is useful for producing the plan but becomes noise during implementation. Start the coding phase fresh, with the clean plan as its input.
3. Code
Now the agent implements against the plan, and — this is the part that makes the whole thing work — it runs a check after each meaningful step. Tests, a build, a type-checker, a linter, a screenshot diff: anything that returns an unambiguous pass or fail. The highest-leverage thing you can give an agent is a way to tell whether its own work is correct. When that feedback loop exists, the agent self-corrects and you stay out of the way. When it does not, you become the feedback loop, and your throughput collapses to the speed of your own attention.
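An unambiguous pass or fail can be as simple as a function that runs your project's commands and reports the first failure. This is a hedged sketch: `run_checks` is a hypothetical helper, and the two example check lists use the current Python interpreter as a stand-in so the code runs anywhere. In a real repository you would substitute your own type-check, lint, and test commands.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_checks(checks: Sequence[Tuple[str, List[str]]]) -> Tuple[bool, str]:
    """Run each check in order and stop at the first failure.

    Returns (True, "") when everything passed. Otherwise returns (False, report),
    where the report names the failing check and carries its output so the agent
    has something concrete to act on.
    """
    for name, command in checks:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            report = f"{name} failed (exit {result.returncode})\n{result.stdout}{result.stderr}"
            return False, report
    return True, ""


# Stand-in checks (assumption): trivial Python one-liners instead of real tooling.
passing = [("typecheck", [sys.executable, "-c", "pass"])]
failing = passing + [("unit tests", [sys.executable, "-c", "import sys; sys.exit(1)"])]
```

Stopping at the first failure keeps the report short, which is the point: the agent gets one clear signal to fix rather than a wall of cascading errors.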
This is also where test-driven development quietly becomes a superpower. Have one agent write the tests for a behavior first, confirm they fail, then have a second agent (or a fresh session) write the implementation until they pass. The tests become an objective target that keeps the implementing agent honest.
4. Commit
Finally, the agent commits with a clear message and opens a pull request, ideally summarizing what changed and why. Frequent, small commits give you rollback points and keep each change reviewable. If something went sideways three steps back, you want to return to a known-good state without unpicking an hour of tangled edits.
The specialist crew: which agents to actually run
Once the core loop is second nature, the multi-agent layer is about assigning narrow roles. The mistake most people make is creating broad, generic agents (a "backend engineer" agent, a "QA" agent) and wondering why the results are mediocre. The guidance from the people who build these tools is the opposite: make your sub-agents specific, and give each one a single job. A focused agent with a tight brief and the right tools will usually outperform a generalist.
A practical crew for everyday development looks like this:
The coding agent does the implementation. It has write access to the source, can run the build and the tests, and works against an approved plan.
The reviewer agent reads diffs with fresh context and flags problems. This is your evaluator-optimizer in human-readable form. One important caveat: an agent told to "find issues" will always find some, inventing nitpicks if it has to. Instruct it to flag only issues that affect correctness or the stated requirements, or you will drown in over-engineering suggestions.
The testing agent writes and maintains tests. Test coverage gaps are an ideal thing to delegate, because the success criterion is objective and the work is tedious for a human.
The research agent explores unfamiliar territory — a new library, an undocumented internal service, an error nobody recognizes — and returns a compact summary rather than dumping everything it read into the main context.
The documentation agent keeps READMEs, changelogs, and inline docs in sync with what actually shipped.
You do not need all five on every task. A one-line bug fix needs none of them. A multi-file feature might use four. The skill is matching the size of the crew to the size of the problem.
Context engineering: the skill that separates good from great
If there is one discipline that determines whether your agentic workflow soars or stalls, it is context engineering — and it has quietly replaced prompt engineering as the thing worth getting good at.
The core insight is that a context window is a finite resource with diminishing returns. As it fills, models get measurably worse at using what is in it — a phenomenon documented under the name context rot. The goal is not to cram in as much as possible; it is to find the smallest set of high-signal tokens that make the right outcome most likely. More context is not better. Better context is better.
Three techniques do most of the heavy lifting:
Compaction. When a session approaches the limits of its window, summarize it into a fresh one — carrying forward the decisions and state that matter, dropping the exploratory noise. Long-running agents survive precisely because they compress their own history instead of drowning in it.
Structured note-taking. Have the agent write its progress to an external file — a running list of what is done, what is left, and what it learned. That file persists across context resets, so a fresh agent can pick up exactly where the last one stopped. This is, in effect, giving your agents a memory that outlives any single session.
Sub-agent isolation. This is the deepest reason multi-agent systems work at all. When you delegate exploration to a sub-agent, that agent burns through its own context window reading files and chasing references — and returns only a tidy summary. The orchestrator's context stays clean. You have effectively parallelized not just the work but the attention, keeping the main thread focused while the messy investigation happens elsewhere.
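The first two techniques fit in a few lines each. Treat the following as a sketch under stated assumptions: the notes file layout, the function names, and the four-characters-per-token estimate are all invented for illustration, and `summarize` is whatever model call you would use to condense older turns.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def save_notes(path: Path, done: List[str], todo: List[str], learned: List[str]) -> None:
    """Structured note-taking: persist progress outside the context window."""
    path.write_text(json.dumps({"done": done, "todo": todo, "learned": learned}, indent=2))


def load_notes(path: Path) -> Dict[str, List[str]]:
    """A fresh session starts by reading what the previous one left behind."""
    return json.loads(path.read_text())


def estimate_tokens(text: str) -> int:
    """Crude assumption: roughly four characters per token. Real tokenizers differ."""
    return len(text) // 4


def compact(history: List[str], summarize: Callable[[List[str]], str],
            budget_tokens: int, keep_recent: int = 2) -> List[str]:
    """Compaction: when history exceeds the budget, replace older turns with a summary.

    The most recent turns are kept verbatim; everything before them is condensed.
    """
    if sum(estimate_tokens(turn) for turn in history) <= budget_tokens:
        return history
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    if not older:
        return history  # nothing old enough to summarize
    return [summarize(older)] + recent
```

Sub-agent isolation needs no code of its own here: it is the same idea applied across agents, where the sub-agent's return value plays the role of the summary.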
Give your agents a memory: CLAUDE.md and AGENTS.md
The single highest-return piece of configuration in any agentic setup is the project memory file — CLAUDE.md for Claude Code, AGENTS.md for Codex and a growing list of other tools (it is now an open standard). This file is loaded into context automatically, so it is where you encode the things you would otherwise have to repeat in every prompt.
A good memory file tells the agent three things: the what (the stack, and a map of the project — essential in a monorepo), the why (what the major components are for), and the how (how to run the tests, the type-check, the build — the exact commands the agent uses to verify its own work). That last category is the most valuable and the most often forgotten.
# AGENTS.md
## Stack
- Next.js 15 (App Router), TypeScript, PostgreSQL via Prisma
- Tests: Vitest (unit), Playwright (e2e)
## Commands the agent should use
- Install: `pnpm install`
- Typecheck: `pnpm typecheck` # run after every change
- Unit tests: `pnpm test` # must pass before commit
- E2e tests: `pnpm test:e2e` # run for any routing/auth change
- Lint: `pnpm lint --fix`
## Conventions
- Server components by default; mark client components explicitly.
- Never edit files in `/generated`. They are build artifacts.
- Database changes require a Prisma migration, never a manual schema edit.
## Where things live
- Auth: `src/lib/auth/`
- API route handlers: `src/app/api/`
- Shared UI: `src/components/ui/`
The discipline that matters most: keep it lean. Even the strongest models have a limited budget of attention for instructions; a bloated memory file does not make the agent more careful, it makes the agent start ignoring instructions. A useful test is to delete any line whose removal would not cause a mistake. And because this file shapes everything the agent does, check it into version control and review changes to it like you would review code — it is a shared asset for the whole team, not a personal scratchpad.
Running AI agents in parallel with git worktrees
Here is where the throughput math changes. A single agent session, however well configured, makes you perhaps one and a half to two times faster on a given task. The teams reporting dramatically larger gains are not getting them from single-session speed — they are getting them from concurrency, running many agent sessions at once on independent pieces of work.
The enabling technology is humble: git worktrees. A worktree lets you check out multiple branches of the same repository into separate directories simultaneously, so several agents can each work on their own branch without stepping on each other. Modern tools have built this in — some IDE-based agents will run up to eight parallel sessions in worktree isolation out of the box, and the cloud-based coding agents spin up a fresh sandboxed environment per task by default.
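If your tool does not manage worktrees for you, the setup is one git command per task. The sketch below wraps `git worktree add -b <branch> <path>` in Python; the `agent/<task>` branch naming and the sibling-directory layout are assumptions made for illustration, not a convention any tool requires.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_add_command(repo: Path, task: str) -> List[str]:
    """Build the git command that creates a new branch and a sibling worktree for a task."""
    worktree_dir = repo.parent / f"{repo.name}-{task}"  # e.g. ../api-fix-login (assumed layout)
    return ["git", "-C", str(repo), "worktree", "add", "-b", f"agent/{task}", str(worktree_dir)]


def create_worktrees(repo: Path, tasks: List[str]) -> List[Path]:
    """Create one isolated worktree per task; raises CalledProcessError if git fails."""
    created = []
    for task in tasks:
        command = worktree_add_command(repo, task)
        subprocess.run(command, check=True)
        created.append(Path(command[-1]))
    return created
```

Each returned directory is a full checkout on its own branch, so you can point one agent session at each. `git worktree list` shows what exists, and `git worktree remove <path>` cleans up once a branch has merged.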
There are two distinct ways to use parallelism, and they are worth separating:
Breadth: run several different tasks at once — fix three unrelated bugs, each in its own worktree, each with its own agent. This is the everyday multiplier. While one agent grinds through a tedious refactor, two others are closing tickets.
Competition: point several agents at the same hard problem with different approaches, then keep the best result and discard the rest. When you genuinely do not know the right design, racing a few attempts in parallel is often faster than agonizing over one.
A caution that the enthusiasts sometimes skip: there is a hard limit on how many parallel sessions a human can actually supervise. Each one produces work you are responsible for reviewing. Concurrency moves the bottleneck from writing code to reviewing it — which means review capacity, not generation speed, becomes the thing that governs your real throughput. We will come back to this.
A day in the life: one task, end to end
Theory is cheap. Here is what the whole system looks like applied to a single, ordinary task: "Add rate limiting to our public API endpoints."
You start in plan mode. The research agent explores the codebase — it finds the existing middleware stack, notes that there is a Redis instance already used for sessions, and reports back that there is no current rate-limiting layer and three places where new middleware could hook in. It does not write anything; it returns a paragraph of findings. Your main context stays clean.
You ask for a plan. The agent proposes: add a sliding-window limiter backed by the existing Redis, wire it into the middleware chain, make the limits configurable per route, and cover it with unit tests plus one end-to-end test. The plan names the four files it will touch. You read it and notice it forgot about the health-check endpoint, which must never be rate-limited. You add one sentence. That correction just saved you a production incident, and it cost you ten seconds.
You start a fresh session for implementation, handing it the approved plan. The coding agent writes the limiter, runs the type-checker after each file, and writes the tests. The first run of the e2e test fails — the limiter is counting health-check requests. Because the agent has a passing/failing signal, it sees the failure, recalls the constraint from the plan, adds the exclusion, and the suite goes green. You did nothing during this.
Now the reviewer agent reads the diff with fresh context, instructed to flag only correctness and requirements issues. It catches that the limiter fails open — if Redis is unreachable, all requests are allowed through — and asks whether that is intentional. It is a real question, so you make a real decision: fail closed for write endpoints, open for reads. The coding agent applies it.
Finally the agent commits with a descriptive message and opens a PR, and the documentation agent adds the new per-route configuration to the API docs. Elapsed human effort: reading one plan, adding one sentence to it, and answering one design question. Everything else ran on its own, and the two things you contributed, the health-check exemption and the failure-mode decision, were exactly the two things a machine should not have decided for you.
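For a sense of what the agent ended up building, here is a rough sketch of a limiter with the two human decisions baked in. It is an illustration under assumptions, not the walkthrough's actual code: an in-memory dictionary stands in for Redis, the `available` flag simulates an outage, and every class and parameter name is hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Iterable


class StoreUnavailable(Exception):
    """Raised when the backing store cannot be reached."""


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_seconds: float,
                 exempt_paths: Iterable[str] = ("/health",),
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.exempt_paths = set(exempt_paths)
        self.clock = clock
        self.available = True  # set to False to simulate the store being down
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def _record(self, key: str) -> int:
        """Drop timestamps that fell out of the window, record this hit, return the count."""
        if not self.available:
            raise StoreUnavailable(key)
        now = self.clock()
        hits = self._hits[key]
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        hits.append(now)  # note: rejected requests still count in this sketch
        return len(hits)

    def allow(self, client: str, method: str, path: str) -> bool:
        if path in self.exempt_paths:
            return True  # decision 1: health checks are never rate-limited
        try:
            return self._record(f"{client}:{path}") <= self.limit
        except StoreUnavailable:
            # Decision 2: reads fail open, writes fail closed.
            return method in ("GET", "HEAD", "OPTIONS")
```

Both human decisions show up as a single branch each, which is typical: the judgement call is cheap to encode and expensive to get wrong.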
The maturity ladder: how to actually adopt this
You do not get here in a day, and you should not try. The reliable path is a ladder, where each rung pays off on its own and sets up the next. Climb it only as fast as the value justifies the added complexity and token cost.
1. Single-agent assistant. Use one agent for questions, exploration, and small, well-scoped edits. Always start from a clean git state so you can see exactly what changed. Get fluent in the explore-plan-code-commit loop.
2. A memory file. Write a lean CLAUDE.md or AGENTS.md covering your stack, your commands, and your conventions. Commit it. This one file removes most of the repetitive instruction you have been typing.
3. Custom commands. Capture the workflows you repeat ("review this PR for security issues," "write tests for this module") as reusable slash commands or skills, so they are one keystroke instead of a paragraph.
4. Specialist sub-agents. Introduce focused, single-job agents, such as a reviewer and a tester, each with its own context window and a restricted set of tools. Tell the orchestrator explicitly when to use them.
5. Parallel and orchestrated workflows. Run multiple agents across git worktrees for independent tasks. Bring in orchestrator-worker and evaluator-optimizer patterns for the complex jobs. Reach for a dedicated orchestration framework only when a task genuinely decomposes into parallel threads and the value clearly justifies the overhead.
The temptation is to jump straight to rung five because it is the most exciting. Resist it. A team fluent at rungs two and three ships more reliably than a team fumbling a multi-agent framework it does not yet understand.
The honest scorecard: does this actually make you faster?
This is the section most write-ups skip, and it is the most important one. The productivity story for AI agents is real, but it is genuinely mixed, and an engineer deciding how to invest their time deserves the unvarnished version.
The optimistic data is striking. Internal reports from teams that have leaned in describe large jumps in throughput — engineers merging substantially more pull requests per day after adopting an agentic workflow, and individual tasks completing far faster than before. Vendor case studies cite feature timelines collapsing from weeks to days. Controlled studies of older, completion-style assistance found meaningful increases in successful builds and pull request volume. There is clearly something real here.
But the most rigorous independent study points the other way, and it deserves your attention precisely because it is inconvenient. In a randomized controlled trial conducted in early 2025, experienced open-source developers working on their own mature repositories were measured completing real tasks with and without AI assistance. They predicted the tools would make them about 24% faster. Afterward, they believed they had been about 20% faster. They were actually 19% slower with the AI. The gap between perceived and actual performance was the whole story: the tools felt fast while quietly costing time, because steering, reviewing, and correcting the agent on a codebase they already knew intimately outweighed the typing it saved. The tooling has improved since that study ran, and the result may not hold for today's agents — but the perception gap it exposed is the durable lesson, and you should assume you are subject to it.
Industry-wide surveys land in the same ambivalent place. The largest annual developer survey found that while the overwhelming majority of developers now use or plan to use these tools, trust is actually falling — a minority trust the accuracy of AI output, and the single most common complaint is code that is "almost right, but not quite," which is precisely the failure mode that eats time, because nearly-correct code is harder to debug than obviously-broken code. The major DevOps research report reached a conclusion worth taping to your monitor: AI does not fix a struggling team, it amplifies what is already there. Strong teams with good tests and tight feedback loops get faster. Weak teams just generate their dysfunction more quickly — and, notably, the same report found that rising AI-driven throughput correlated with worse delivery stability unless the engineering fundamentals were already in place.
Measurement specialists have a name for the trap: false velocity. More pull requests merged is not the same as more value delivered. If generation speeds up but review, testing, and integration do not, you have not gotten faster — you have just moved the bottleneck downstream and made it harder to see.
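The arithmetic behind false velocity is short enough to write down. This is a toy model with made-up numbers, not a measurement: it assumes delivery is capped by the slower of two stages, and the function names are hypothetical.

```python
def delivered_per_day(generated_per_day: float, reviewed_per_day: float) -> float:
    """Delivery is capped by the slower stage: a change ships only once it is reviewed."""
    return min(generated_per_day, reviewed_per_day)


def review_backlog(generated_per_day: float, reviewed_per_day: float, days: int) -> float:
    """Unreviewed changes that pile up when generation outpaces review."""
    return max(generated_per_day - reviewed_per_day, 0) * days


# Illustrative numbers only: generation triples from 4 to 12 changes a day,
# while review capacity stays at 5 a day.
before = delivered_per_day(generated_per_day=4, reviewed_per_day=5)
after = delivered_per_day(generated_per_day=12, reviewed_per_day=5)
backlog = review_backlog(generated_per_day=12, reviewed_per_day=5, days=10)
```

In this toy case generation triples while delivery moves from 4 to 5 a day, and ten days later 70 changes are waiting for review. The real dynamics are messier, but the shape is the same.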
How do you reconcile the striking gains with the sobering studies? The honest synthesis is that agents help most where the work is broad, unfamiliar, or boilerplate-heavy — scaffolding a new service, exploring a codebase you have never seen, generating the tedious tests — and help least, or actively hurt, when a domain expert is working on a mature, tightly-coupled system with a high quality bar, where the cost of explaining the task to the agent exceeds the cost of just doing it. The leverage is real. It is just not uniform, and believing it is uniform is how you end up slower while feeling faster.
Anti-patterns and how to avoid them
Most of the ways agentic workflows fail are predictable, which means they are avoidable.
The kitchen-sink session. Pouring three unrelated tasks into one long-running context pollutes it and degrades every answer. Clear the context between tasks. A good rule: after two failed attempts at the same fix, stop, clear, and rewrite the prompt from scratch rather than digging the hole deeper.
Skipping the plan. Letting the agent code immediately on anything non-trivial is how you get confidently-wrong implementations that are expensive to unwind. Plan first; the plan is your cheapest correction point.
Over-engineering. Left unprompted, agents add abstractions, helpers, and options nobody asked for. Tell them explicitly to use the simplest approach that works, and have your reviewer agent flag unnecessary complexity rather than reward it.
Review pile-up. If you scale generation without scaling review, you create a downstream traffic jam and your cycle time gets worse even as your commit count climbs. Invest in review capacity — including automated reviewer agents — before you crank up output.
Trusting unverifiable work. Never fully delegate a task whose result you cannot check. If there is no test, no build, no observable behavior to confirm correctness, you are not delegating — you are gambling.
The new attack surface: AI agent security
Agentic workflows introduce a class of risk that did not exist when your AI tool was just suggesting completions: the agent now reads untrusted content and takes actions. The dominant new threat is indirect prompt injection — malicious instructions hidden in the data an agent consumes. A poisoned dependency, a booby-trapped issue comment, a web page the agent fetches during research, even a crafted string in a file it reads can all carry instructions that hijack the agent's behavior. Because agents chain tools together, a single injection can escalate into a sequence of unintended actions.
The defenses are practical and worth adopting before you scale autonomy, not after an incident:
Sandbox aggressively. Run agents in isolated environments — containers or dedicated VMs — with no standing access to anything they do not need. The cloud coding agents do this by default; for local agents, you have to set it up.
Restrict the blast radius. Limit network egress to an allowlist, block writes outside the workspace, and protect configuration files from agent edits. An agent that cannot reach the open internet or modify its own permissions is dramatically harder to weaponize.
Gate the irreversible. Require explicit human approval before anything you cannot undo — deleting data, deploying, moving money, force-pushing. Keep a human in the loop precisely at the points where a mistake is permanent.
Treat the dangerous flags as dangerous. The options that bypass approvals and sandboxing entirely have their place — inside a hardened container, for a trusted task. Running them on your own machine against a real codebase is how a bad afternoon becomes a very bad afternoon.
Own the output. Whoever's name is on the pull request owns the code, regardless of how much of it an agent wrote. AI assistance does not transfer responsibility, and the survey data is clear that nearly-right insecure code is a real and common failure mode.
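Three of these defenses reduce to checks you can run before an agent's action goes through. The sketch below is illustrative only: the allowlist, the protected file names, and the command prefixes are hypothetical examples, and the prefix match is deliberately naive (it would miss `git push -f`, for instance). Real enforcement belongs in the sandbox or the tool's permission system, not in a helper the agent could route around.

```python
from pathlib import Path
from typing import List
from urllib.parse import urlparse

ALLOWED_HOSTS = {"registry.npmjs.org", "github.com"}   # hypothetical egress allowlist
PROTECTED_NAMES = {"AGENTS.md", "CLAUDE.md", ".env"}   # hypothetical protected files
IRREVERSIBLE_PREFIXES = (                               # illustrative, far from complete
    ("git", "push", "--force"),
    ("rm", "-rf"),
)


def write_allowed(workspace: Path, target: Path) -> bool:
    """Allow writes only inside the workspace, and never to protected files."""
    root = workspace.resolve()
    resolved = (root / target).resolve()  # resolves ".." so traversal cannot escape
    inside = resolved == root or root in resolved.parents
    return inside and resolved.name not in PROTECTED_NAMES


def egress_allowed(url: str) -> bool:
    """Allow network requests only to hosts on the allowlist."""
    return urlparse(url).hostname in ALLOWED_HOSTS


def needs_human_approval(command: List[str]) -> bool:
    """Flag commands that start with a known irreversible prefix."""
    return any(tuple(command[:len(prefix)]) == prefix for prefix in IRREVERSIBLE_PREFIXES)
```

The design choice worth noting is that each check defaults to denial: a path outside the workspace, a host not on the list, and a command on the irreversible list all stop and wait for you.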
Start tomorrow
The gap between using an AI agent and running an agentic workflow is not about access to a better model. It is about method — and method is what actually moves your turnaround time, not the raw speed of the model underneath. If you take only a handful of things from this guide, make them these.
Adopt the explore-plan-code-commit loop on your very next task, and refuse to skip the plan on anything you could not describe in one sentence. Write a lean memory file for your main repository this week and commit it. Make every task you delegate verifiable — if the agent cannot run a check that tells it whether it succeeded, build that check before you hand over the work. Add a reviewer agent and instruct it to care only about correctness. And when you are ready, start running two independent tasks in parallel worktrees, then three, until you hit the edge of what you can actually review — because that edge, not the model's speed, is your real limit.
Above all, stay honest with yourself about the gains. These tools can make a strong engineer on the right kind of problem genuinely, dramatically faster. They can also make you feel fast while quietly slowing you down. The engineers who win with agents are not the ones who trust them the most — they are the ones who built the verification, the structure, and the judgement to know the difference.
Further reading
Building Effective Agents — Anthropic's reference taxonomy of agent workflow patterns.
Effective context engineering for AI agents — the discipline behind keeping agents sharp.
Measuring the impact of AI on experienced developers — the RCT behind the perception-gap finding.
DORA — research on what actually makes software delivery faster and more stable.
Stack Overflow Developer Survey — adoption and trust trends across the profession.
AGENTS.md — the open, cross-tool standard for agent instruction files, supported by Codex, Cursor, Copilot, and more.
Quantifying GitHub Copilot's impact with Accenture — a controlled enterprise study behind the build and pull-request figures.










