One night my AI agent Hermes, which I run 24/7 on my own hardware, spent 47 turns trying to "fix" a script. Every turn it ran the same broken command, got the same error, apologized and tried again. I sat there watching the token counter climb like a Tatkal queue at 10 AM, and I genuinely could not tell what had gone wrong. Was the prompt bad? Was it missing context? Was a tool broken? Was the loop just never going to stop? I am an AI Observability Architect, and I was staring at my own agent unable to name which layer had failed.
That night is what this post is about. A production AI agent is not one skill, it is five, and when something breaks at 2 AM you need to know exactly which one to blame.
Everyone talks about prompt engineering. Fewer people talk about context engineering. Almost nobody talks about harness, loop or evaluation engineering, even though they're the difference between a demo that works once and a system that survives a night alone on my server.
An AI observability architect who doesn't understand the full stack of agent engineering is just a person who reads dashboards. So I sat down and split it into the five disciplines that actually make up a production agent: what each one is, how they differ and how they fit together.
The Five Disciplines
Before the definitions, one metaphor ties all five together: a worker doing a job at a workbench. I reuse it for every discipline so the boundaries stay sharp.
- Prompt is the job description pinned to the wall: what to do and how to do it.
- Context is the briefing packet on the desk: the facts for this one task, not the whole library.
- Harness is the workbench and the tools bolted to it: what the worker can physically reach and operate, plus the safety guards.
- Loop is the work rhythm: do a step, check the result, decide the next step and know when to down tools.
- Evaluation is the QA inspector who checks finished pieces and reports trends.
Keep that worker in your head. The whole post hangs off it.
1. Prompt Engineering
What it is: Writing the instructions that tell the model what to do, how to do it, and what format to return.
The job: Craft system prompts, few-shot examples, output format constraints, and chain-of-thought scaffolding that produce reliable outputs from the model. There are different styles of prompting (zero-shot, few-shot, chain-of-thought, tree-of-thought) and different techniques (role prompting, constraint prompting, format prompting) that can be combined in various ways.
What you're optimizing for:
- Output quality,
- Instruction adherence,
- Format consistency,
- Token efficiency of the prompt itself.
Failure mode: Over-prompting. A 2000-token system prompt that tries to cover every edge case but degrades model performance because the model loses focus on what matters. Hermes' system prompt is 200+ lines, and I have learned the hard way that every rule I add to squash one bug quietly makes three others more likely.
Example: "You are a Python debugging assistant. Given a traceback and the relevant code, identify the root cause and suggest a fix. Always output: 1) Root cause, 2) Fix (as a code block), 3) Explanation. Do not hallucinate functions that don't exist in the code."
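A prompt that demands a fixed format is only useful if something checks the format. Here is a minimal sketch of such a check for the example prompt above. The `check_format` helper and its regexes are my own illustration, and the assumption that the model literally writes "1) Root cause", "2) Fix" and "3) Explanation" as section labels is just that, an assumption.

```python
import re

SYSTEM_PROMPT = (
    "You are a Python debugging assistant. Given a traceback and the relevant "
    "code, identify the root cause and suggest a fix. Always output: "
    "1) Root cause, 2) Fix (as a code block), 3) Explanation. "
    "Do not hallucinate functions that don't exist in the code."
)

FENCE = "`" * 3  # built this way so the example nests cleanly inside markdown

# Assumed section labels; adjust if your model formats headings differently.
SECTION_PATTERNS = [
    ("root cause", re.compile(r"1\)\s*root cause", re.IGNORECASE)),
    ("fix", re.compile(r"2\)\s*fix", re.IGNORECASE)),
    ("explanation", re.compile(r"3\)\s*explanation", re.IGNORECASE)),
]


def check_format(reply: str) -> list[str]:
    """Return a list of format violations; an empty list means compliant."""
    problems = []
    starts = []
    for name, pattern in SECTION_PATTERNS:
        match = pattern.search(reply)
        if match is None:
            problems.append(f"missing section: {name}")
        else:
            starts.append(match.start())
    if not problems and starts != sorted(starts):
        problems.append("sections out of order")
    if not problems:
        # The fix must contain an opening and a closing code fence.
        fix_body = reply[starts[1]:starts[2]]
        if fix_body.count(FENCE) < 2:
            problems.append("fix is not in a code block")
    return problems
```

A check like this is cheap enough to run on every reply, which turns "format consistency" from a feeling into a number you can track.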
2. Context Engineering
What it is: Deciding what information goes into the model's context window and in what order.
In the metaphor: the briefing packet you hand the worker for this task. Not the whole library, just the pages that matter right now.
The job: Build retrieval pipelines (RAG), manage conversation history, compress old context, inject relevant documents, and structure the information so the model can use it effectively.
What you're optimizing for:
- Retrieval precision,
- Context window utilization,
- Signal-to-noise ratio,
- Cost per query (fewer irrelevant tokens = lower cost).
Failure mode: Context stuffing. Dumping 100K tokens of "relevant" documents into the context window when the model only needs 5K. The model gets distracted, costs spike, and quality drops. My first version of Hermes' memory shoved everything it had ever seen into context, and the bill at the end of the month taught me restraint faster than any blog post could.
Example: Instead of passing an entire 500-page PDF, you chunk it into sections, embed each chunk, retrieve the top-5 most relevant sections based on the query, and inject only those, with a summary of what was excluded so the model knows the full document exists.
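The chunk, embed, retrieve, inject flow can be sketched in a few lines. This is a toy under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, chunking is by word count rather than by section, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 50) -> list[str]:
    """Split text into fixed-size word windows (a stand-in for real sectioning)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real pipeline would call a model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_context(document: str, query: str, top_k: int = 5, max_words: int = 50) -> str:
    """Keep the top_k most similar chunks and tell the model what was left out."""
    chunks = chunk(document, max_words)
    query_vec = embed(query)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: cosine(embed(chunks[i]), query_vec),
        reverse=True,
    )
    kept = sorted(ranked[:top_k])  # restore document order for readability
    parts = [f"[section {i + 1}] {chunks[i]}" for i in kept]
    omitted = len(chunks) - len(kept)
    parts.append(f"[note] {omitted} of {len(chunks)} sections were omitted; ask if you need them.")
    return "\n".join(parts)
```

The final `[note]` line is the "summary of what was excluded" from the example: it costs a handful of tokens and stops the model from assuming the retrieved sections are the whole document.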
3. Harness Engineering
What it is: Building the infrastructure that wraps the model: tool definitions, API integrations, execution environments, and safety rails.
In the metaphor: the workbench, the power tools and the safety guards. It is what the worker can reach and operate, nothing more. This is the static part: it exists before the agent runs a single step.
The job: Define the tool schemas the model can call. Wire up the terminal, file system, web browser, and external APIs. Enforce permissions, rate limits, and sandboxing. Handle retries, timeouts, and error recovery.
What you're optimizing for:
- Tool call reliability,
- Execution safety,
- Latency,
- Breadth of capabilities available to the agent.
Failure mode: The agent calls a tool that doesn't exist, or a tool that exists but returns an error the model can't interpret. The agent spirals into retry loops or hallucinates a successful outcome. Half of Hermes' early "successes" were it cheerfully telling me a task was done while the tool had actually thrown a stack trace it never bothered to read.
Example: Defining a read_file tool with a clear schema (path: string, offset: int, limit: int), a check_fn that verifies the file exists before the tool is exposed to the model, and an error handler that returns structured JSON the model can act on instead of a raw stack trace.
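Here is roughly what that read_file tool could look like. Treat it as a sketch, not Hermes' actual code: the schema layout follows the common JSON Schema style for tool definitions, the `root` sandbox directory is something I am assuming for illustration, and `check_fn` here verifies that this root is readable, which is one possible reading of "requirements".

```python
import json
import os

# Tool schema in JSON Schema style; field names beyond path/offset/limit are assumptions.
READ_FILE_SCHEMA = {
    "name": "read_file",
    "description": "Read a slice of lines from a text file inside the sandbox.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "offset": {"type": "integer", "minimum": 0},
            "limit": {"type": "integer", "minimum": 1},
        },
        "required": ["path"],
    },
}


def check_fn(root: str) -> bool:
    """Gate: only expose the tool if the sandbox root exists and is readable."""
    return os.path.isdir(root) and os.access(root, os.R_OK)


def exposed_tools(root: str) -> list[dict]:
    """The model never sees a tool whose requirements are not met."""
    return [READ_FILE_SCHEMA] if check_fn(root) else []


def read_file(root: str, path: str, offset: int = 0, limit: int = 200) -> str:
    """Always return a JSON string, on success or failure, never a raw traceback."""
    try:
        real_root = os.path.realpath(root)
        full = os.path.realpath(os.path.join(root, path))
        if os.path.commonpath([full, real_root]) != real_root:
            raise PermissionError("path escapes the sandbox root")
        with open(full, encoding="utf-8") as handle:
            lines = handle.read().splitlines()
        return json.dumps(
            {"ok": True, "lines": lines[offset:offset + limit], "total_lines": len(lines)}
        )
    except (OSError, ValueError) as exc:
        return json.dumps(
            {
                "ok": False,
                "error_type": type(exc).__name__,
                "message": str(exc),
                "hint": "Check the path with a directory listing before retrying.",
            }
        )
```

The important design choice is the `ok` field: the model can branch on it directly instead of having to recognise a stack trace, which is exactly the failure described above.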
4. Loop Engineering
What it is: Designing the agent's reasoning loop: how it decides what to do next, when to stop, and when to ask for help.
In the metaphor: the work rhythm. If the harness is the machine on the bench, the loop is the worker pressing "go" again and again, checking the result each time, until the piece is finished or the foreman gets called. This is the dynamic part: it only exists while the agent is running.
Peter Steinberger, the creator of OpenClaw, put the same idea more provocatively in June 2026: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents." His framing adds a sharp point I love: a good loop corrects itself against objective signals (tests, type checkers, linters, runtime errors) rather than against your patience. That is the difference between an agent that runs while you sleep and one that needs you babysitting every turn.
The job: Choose the loop pattern (ReAct, Plan-and-Execute, Tree of Thoughts). Set the max turns. Implement context compression when the conversation gets long. Decide when the agent should pause for human approval vs. proceed autonomously.
What you're optimizing for:
- Task completion rate,
- Loop efficiency (fewer turns for the same outcome),
- Autonomous depth,
- Graceful degradation when the agent gets stuck.
Failure mode: Infinite loops. The agent calls a tool, gets an unexpected result, retries with the same approach, gets the same result, and burns through 50 turns without progress. This is the exact 47-turn night from the top of this post: pure Loop Engineering failure. Or the opposite: the agent stops too early, declaring success when the task is half-done.
Example: A ReAct loop that caps at 90 turns, compresses context at 50% utilization, and falls back to asking the user for clarification after 3 consecutive failed tool calls on the same sub-task.
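Those three guard rails fit in a small control loop. The sketch below is deliberately model-free: `step` is a hypothetical callable that stands in for "one model call plus one tool call", `context_limit_chars` is an invented stand-in for the real context window, `compress` is a placeholder rather than a real summariser, and failures are counted across the whole run rather than per sub-task to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    observation: str  # what came back from the tool
    ok: bool          # did the tool call succeed?
    done: bool        # has the goal been met?


@dataclass
class LoopConfig:
    max_turns: int = 90
    compress_at: float = 0.5            # fraction of the context limit
    max_consecutive_failures: int = 3
    context_limit_chars: int = 8000     # assumed budget, in characters


def compress(history: list[str]) -> list[str]:
    """Placeholder compression: a one-line summary plus the most recent half."""
    keep = history[len(history) // 2:]
    return [f"[summary of {len(history) - len(keep)} earlier entries]"] + keep


def run_loop(
    step: Callable[[list[str]], StepResult],
    config: Optional[LoopConfig] = None,
) -> tuple[str, int]:
    """Return (outcome, turns_used); outcome is 'done', 'ask_user' or 'max_turns'."""
    config = config or LoopConfig()
    history: list[str] = []
    failures = 0
    for turn in range(1, config.max_turns + 1):
        result = step(history)
        history.append(result.observation)
        if result.done:
            return "done", turn
        failures = 0 if result.ok else failures + 1
        if failures >= config.max_consecutive_failures:
            return "ask_user", turn  # stop retrying and escalate to a human
        used = sum(len(entry) for entry in history)
        if used >= config.compress_at * config.context_limit_chars:
            history = compress(history)
    return "max_turns", config.max_turns
```

With these defaults, a step that keeps returning the same failure ends on turn 3 with `ask_user` instead of grinding on for 47 turns.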
5. Evaluation Engineering
What it is: Designing the systems that tell you whether the other four are working.
The job: Build eval datasets. Configure LLM-as-a-Judge pipelines. Run regression tests on every model change. Aggregate scores, track trends, and set alerting thresholds for quality degradation.
What you're optimizing for:
- Evaluation accuracy (does the eval correlate with real quality?),
- Evaluation cost (are you spending more on eval than on the agent itself?),
- Signal timeliness (how fast do you learn about a regression?).
Failure mode: Evaluating the wrong thing. A judge model that scores "helpfulness" but actually rewards verbosity. Or a regression suite that passes every time because the test cases are too easy. You feel confident while quality silently degrades. This is the failure that scares me the most, because unlike the 47-turn night, it makes no noise at all.
Example: A Langfuse evaluator that classifies every agent trace by query type (coding, research, ops), scores output quality on a 1-5 rubric, and alerts when the research query quality score drops below 3.5 for two consecutive days.
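The alerting rule in that example is simple enough to sketch without any Langfuse calls at all. The code below assumes the scores have already been exported as `(day, query_type, score)` tuples; that shape, and both function names, are my own invention for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def daily_means(traces: list[tuple[date, str, float]], query_type: str) -> dict[date, float]:
    """Average the 1-5 rubric score per day for one query type."""
    by_day: dict[date, list[float]] = defaultdict(list)
    for day, kind, score in traces:
        if kind == query_type:
            by_day[day].append(score)
    return {day: mean(scores) for day, scores in by_day.items()}


def should_alert(daily: dict[date, float], threshold: float = 3.5, days: int = 2) -> bool:
    """True when each of the latest `days` consecutive calendar days is below threshold.

    A day with no traces counts as "no evidence", so it never triggers an alert.
    """
    if not daily:
        return False
    latest = max(daily)
    window = [latest - timedelta(days=i) for i in range(days)]
    return all(day in daily and daily[day] < threshold for day in window)
```

Requiring two consecutive days is a noise filter: one bad day on a small sample is common, two in a row is usually a real regression.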
How are Context, Harness, and Loop different?
This is the part that trips everyone up, including me. Prompt and Evaluation are easy to tell apart. The blur is between Context, Harness and Loop, because all three feel like "stuff around the model." Here is the cleanest way I have found to separate them.
One word each
| Discipline | One word | The question it answers |
|---|---|---|
| Context | KNOW | What does the model get to see? |
| Harness | DO | What can the model actually touch in the real world? |
| Loop | DECIDE | How many times, in what order, and when does it stop? |
Or, borrowing a line I like from LangChain: prompt shapes behavior, context shapes reasoning, harness shapes execution. The loop is the slice of execution that repeats.
The same bug, three different owners
Here is the trick that finally made it click. Take one task, "fix the failing test," and one symptom, "the agent didn't fix it." Who gets paged depends entirely on why:
- It never saw the test file or the error output. Context bug: the knowledge was missing.
- It had no run_tests tool, or the tool crashed. Harness bug: the capability was missing.
- It saw the file, ran the test once, ignored the red and quit (or looped 50 times getting nowhere). Loop bug: the decision was missing.
Same task. Same surface failure. Three different engineers fix it. Once you can sort a failure into KNOW, DO or DECIDE, the blur is gone.
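The KNOW, DO, DECIDE sort can be written down as a tiny triage function. This is only a sketch: the `TraceSummary` fields are hypothetical flags you would have to derive from your own traces, and the order of the checks (context first, then harness, then loop) is my assumption about which layer to rule out first.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    saw_required_inputs: bool  # KNOW: test file and error output were in context
    tool_available: bool       # DO: a test-running tool was exposed to the model
    tool_crashed: bool         # DO: the tool itself failed
    goal_met: bool             # the failing test now passes


def triage(trace: TraceSummary) -> str:
    """Name the discipline that owns the failure, or 'none' if the task succeeded."""
    if trace.goal_met:
        return "none"
    if not trace.saw_required_inputs:
        return "context"  # the knowledge was missing
    if not trace.tool_available or trace.tool_crashed:
        return "harness"  # the capability was missing
    return "loop"         # it could know and do, so the decision was missing
```

Loop is the fall-through case on purpose: if the agent had the knowledge and the capability and still failed, what remains is how it decided to act or stop.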
Harness vs Loop: the machine and the "go" button
Harness and Loop blur the most, so be precise. Honestly, the industry itself is not fully settled here: LangChain treats the loop as one component inside the harness, while others treat Loop Engineering as a separate layer on top. The split I use:
- Harness is the static part. Everything that exists before the agent runs: tool schemas, sandbox, file system, permissions, guard rails. The machine on the bench.
- Loop is the dynamic part. The act, observe, decide, repeat cycle plus the stopping condition. The worker pressing "go" on that machine until the goal is met.
The one-liner that sticks: the harness runs the agent once, the loop runs the harness until the job is done.
One agent turn, three zones
Forget the big architecture diagram for a second. Here is a single turn of the agent, with each discipline owning exactly one segment:
They do not overlap. Context is the arrow going in. Harness is the arrow going out to the world and back. Loop is the circle plus the exit gate. That is the whole difference in one picture.
Side-by-Side Comparison
| Dimension | Prompt Engineering | Context Engineering | Harness Engineering | Loop Engineering | Evaluation Engineering |
|---|---|---|---|---|---|
| What you build | Instructions for the model | Information pipeline into the model | Tool infrastructure around the model | Reasoning cycle for the model | Measurement system over the model |
| Core artifact | System prompt, few-shot examples | RAG pipeline, context compressor, chunking strategy | Tool schemas, API adapters, sandbox | Agent loop (ReAct, Plan-Execute, ToT) | Eval datasets, judge prompts, score dashboards |
| Key question | "What should the model do?" | "What does the model need to know?" | "What can the model do?" | "How does the model decide what to do next?" | "Is the model actually doing it well?" |
| Optimizes for | Output quality, instruction adherence | Signal-to-noise ratio, retrieval precision | Tool reliability, execution safety | Task completion, loop efficiency | Evaluation accuracy, regression detection |
| Failure mode | Over-prompting, instruction dilution | Context stuffing, retrieval noise | Tool errors, missing safety rails | Infinite loops, premature stopping | False confidence, measuring the wrong thing |
| Measured by | Eval scores, format compliance | Retrieval precision/recall, token cost | Tool call success rate, latency | Turn count, completion rate | Judge-human agreement, alert precision |
| When it breaks | Outputs are wrong format, off-topic | Model hallucinates from missing context | Agent can't act on its environment | Agent gets stuck or loops forever | You don't know quality is degrading |
| Who typically owns it | Prompt engineer, ML engineer | RAG engineer, data engineer | Platform engineer, infra engineer | Agent engineer, ML engineer | AI observability engineer, QA engineer |
| Maturity in industry | High (everyone does it) | Medium (RAG is mainstream) | Growing (agent frameworks emerging) | Early (most teams have simple loops) | Lowest (most teams skip it entirely) |
How They Fit Together
Each discipline feeds into the next, and Evaluation wraps around all of them. Here's what a production AI agent system looks like when all five are in place, from the user's question to the final answer:
The dotted arrows are the key insight. Evaluation Engineering doesn't just sit at the end; it feeds back into every other discipline. When eval scores drop, you don't know which layer is broken until you instrument all five.
What Happens When You Skip One?
I've seen each of these failure modes in production. They're not hypothetical.
| Skip This | What Happens |
|---|---|
| Prompt Engineering | Model outputs inconsistent formats, ignores constraints, hallucinates expected behavior |
| Context Engineering | Model answers from stale or irrelevant information. Token costs 5-10x higher than necessary |
| Harness Engineering | Agent can't interact with the real world. Every tool call is a potential crash |
| Loop Engineering | Agent runs forever on simple tasks, or stops too early on complex ones |
| Evaluation Engineering | You ship a model update and quality drops 20% and nobody notices for a week |
Evaluation Engineering is the most skipped and the most dangerous to skip. Every other discipline has a visible failure mode: the output is wrong, the agent crashes, costs spike. Evaluation failure is invisible. Everything looks fine until a user complains.
Where Is the Industry?
Most teams today are at different maturity levels for each discipline: Prompt Engineering is high (everyone does it), Context Engineering is medium (RAG is mainstream), Harness Engineering is growing, Loop Engineering is early, and Evaluation Engineering is the lowest, often skipped entirely.
The gap between Prompt Engineering (everyone does it) and Evaluation Engineering (almost nobody does it well) is where most production AI systems fail silently. This is also where the most interesting work is happening right now: LLM-as-a-Judge, automated regression pipelines, and observability-driven development.
How I Apply This
I run an AI agent (Hermes) 24/7 on my own infrastructure. Here's how the five disciplines map to real decisions I make:
- Prompt Engineering: My agent's system prompt is 200+ lines defining its role, constraints, and output format. I tweak it when eval scores drop, not when I "feel like" the output is off.
- Context Engineering: My agent has persistent memory (user preferences, environment facts, session history). I manage a memory budget: when it hits 8000 chars, older entries are compressed or pruned. Context window management is a real engineering problem.
- Harness Engineering: My agent has access to terminal, file system, web browser, and 50+ tools. Each tool has a check_fn that verifies requirements before the tool is exposed to the model. Destructive commands require approval.
- Loop Engineering: Max 90 turns per session. Auto-compress context at 50% utilization. Fallback to user clarification after consecutive failures. I've tuned these numbers over months of watching the agent get stuck and unstuck.
- Evaluation Engineering: I'm building an LLM-as-a-Judge pipeline in Langfuse that scores every agent trace. This is how I know whether my prompt changes actually improve output quality or just shift the failure modes.
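To make the memory budget from the Context Engineering bullet concrete, here is one way "compress, then prune" could work. It is a sketch under stated assumptions, not Hermes' implementation: entries are plain strings ordered oldest first, and "compression" is simple truncation standing in for a real summary.

```python
def total_chars(entries: list[str]) -> int:
    return sum(len(entry) for entry in entries)


def enforce_budget(entries: list[str], budget: int = 8000, summary_len: int = 80) -> list[str]:
    """Shrink memory to fit the budget, touching the oldest entries first.

    Pass 1 compresses old entries (here: truncation as a stand-in for a summary).
    Pass 2 prunes the oldest entries outright if compression was not enough.
    """
    result = list(entries)  # never mutate the caller's list
    for i in range(len(result)):
        if total_chars(result) <= budget:
            return result
        if len(result[i]) > summary_len:
            result[i] = result[i][:summary_len - 3] + "..."
    while result and total_chars(result) > budget:
        result.pop(0)
    return result
```

Compressing before pruning keeps at least a trace of old facts around for as long as the budget allows, and the newest entries are the last to be touched.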
The Bottom Line
If you're building AI systems, you're doing all five of these whether you call them "engineering" or not. The question is whether you're doing them deliberately.
Prompt Engineering is table stakes. Context Engineering is where most teams spend their RAG effort. Harness Engineering is where agent frameworks add value. Loop Engineering is where most teams have the most room to grow. Evaluation Engineering is where most teams are flying blind.
If you can't measure it, you can't improve it. Start with Evaluation Engineering: even a simple LLM-as-a-Judge over your existing traces will tell you more about your system than months of intuition.
















