We Burned Dozens of Sprints Learning This, But You Don’t Have To

My name is Hung, also known as Paul.

I have worked in tech for 6 years and started learning about LLMs around spring 2023, when ChatGPT had recently launched and the whole world was asking, “Can this be used in production?”

Paul

I have been involved in building and deploying dozens of AI agent systems and RAG pipeline products. We spent a lot of time on them; weeks went by and many products were built, but some of them were not good enough. There is one project I wish I had known how to optimize sooner. Today I finished Anthropic’s course on Claude in Amazon Bedrock, and I suddenly found myself reminiscing about that time. To be honest, there was one question we never answered at the start, a question nobody asked before building. I think this article might help people avoid repeating the same mistakes.

“How do we measure AI response when deployed in production?”

I mean, it is not just about analyzing conversations, collecting logs, or gathering user feedback after a product launch. We need to measure the AI’s responses during the testing phase, before launching the product.

In 2024, we created a website that helps people learn to use AI in their work through a gamified experience. At that time, there were only a few coding agent tools, and our vision was to help people in the tech industry become familiar with them, because we believed these tools would keep developing and soon become the new trend. We developed a self-learning experience with AI and a secure cloud-based sandbox environment for practice. My co-founder always said:

“We have to achieve 95% accuracy.”

But nobody knew: 95% of what? 95% of which dataset? 95% of which test cases? We did not answer those questions. We tried to build a list of test cases based on our understanding of the field and feedback from end users. We believed that we could just ship the product, wait for customers to give feedback, and then improve it.

And this was a wonderful trap. Everyone on the team pushed themselves to deliver. Testers wrote test cases based on their experience and understanding of the field. Developers completed their frontend with Next.js and backend with NestJS and LangGraph. Prompt engineers finished the final system prompt. DevOps engineers deployed the product to the cloud. Marketers started social media advertising campaigns. We gained around 500 new free users and 3 pieces of feedback after spending a significant amount of time and money.

The loop.

After reviewing the system logs, I realized that end users never act exactly like test cases. They acted in ways none of us had thought of. They used different words. They asked confused questions. They combined cases that we would never have written into tests. Our tests all passed, but the tests did not match real user input. We had come up with the numbers based on intuition. There was no baseline and no definition, just a feeling that “it seems okay”. The true problem was not the AI; the most important issue was our lack of domain knowledge to write good enough test cases. We were testing based on our own biases, and those biases did not represent real users.

Our mistake: a lack of domain expertise and of accurate test measurement.

We fell into a loop, the loop of optimizing prompts. We listened to user feedback, adjusted test cases, modified system prompts, relaunched, and repeated… But 3 pieces of feedback were never enough for us to improve. Ultimately, users just kept leaving, and we were forced to abandon the idea after wasting an entire year.

1. Why is this a design issue, and not just a prompt issue?

Before diving into the solution, I want to say that I am not writing this to complain. My former teammates did very well within their capabilities. We just had not found the right method to solve the issue at the time.

Overview: most builder teams build the outer harness and call an LLM API through a provider. They usually do not care about the inter harness. They leave it to the LLM provider.

Inter Harness - Outer Harness

What is the Inter Harness? It includes all the infrastructure surrounding the system that serves the LLM: compliance with post-training alignment (a special type of unchangeable prompt), calling internal tools, caching user prompts, handling errors, handling agentic loops, handling guardrails…
What is the Outer Harness? It covers all the layers that we usually build into our own products, such as system prompts, PRDs, skills, hooks, memories…

I have noticed that about 80% of the large businesses I know do not use Claude through Anthropic’s own platform, while other builders do. These enterprises run their own inter harness on the biggest cloud services, such as AWS, GCP, and Azure.

I know, Claude and GPT also have Enterprise plans with powerful features such as user spending limits, a larger context window, role-based access permissions, network access control, IP whitelisting, audit logs, observability, monitoring… But you must pay at least 240 dollars per seat per year, and you still have to bear the risk of service disruption.

Have you subscribed to the Claude Status mailing list? I have counted at least 253 incidents from December 2025 to May 2026, including a number of incidents related to the Claude dot ai Enterprise plan.

At least 253 emails about Claude incidents from Dec 2025 to May 2026

When you need the Claude API in your product, you have to use the Claude Platform (not claude dot ai), which is completely separate from your Claude Enterprise licenses. A Claude dot ai Enterprise license only grants you access to Chat, Claude Code, the Claude GitHub App…, but not to the API.

Running Claude through AWS gives you an SLA commitment of 99.99% uptime and complete control over even the smallest behaviors through IAM, and it is insulated from incidents on Anthropic’s own platforms.

Wait, I am not selling AWS. I just wonder why most big companies use it to control their inter harness, and what I can learn from that.

2. About the Claude in Amazon Bedrock course

I am a fan of Anthropic. I like how they incorporate ethics and AI safety into their products. My first encounter with Anthropic was probably the ebook “The Way of Code”, a collaboration with Rick Rubin, which I read in 2025. But it is only recently that I started taking their courses on Skilljar. Claude in Amazon Bedrock consists of 83 lessons, about 7 hours in total, covering the whole stack needed to build a production-ready AI agent.

API Bedrock: handle Multi-turn conversations, System Prompts, Temperature, Streaming.
Tool Use: call external functions, guardrail tool calls, batch execution.
RAG: Hybrid Search, Contextual Retrieval, Reranking, Chunking Strategy.
Prompt Caching: FinOps.
MCP, Agent, Computer Use, Claude Code, Agent Design Principles.
Evaluate Prompt Engineering.

All of the modules above include hands-on practice code, and each of them could be written up as a separate topic. But if I had to choose only one, I would choose Evaluate Prompt Engineering. If only I had known about it sooner, our previous project might have turned out better.

3. Build a pipeline to objectively measure and evaluate AI responses

Anthropic addresses the issue with a complete workflow in Amazon Bedrock. There are only 4 steps:

3.1. Step one: Define “good” before testing

So what exactly does good mean?

Okay. Good software may have been tested against thousands of samples and thousands of cases. An AI agent system has… unlimited cases.

Ouroboros

Wait, am I kidding you?

Yes, if you have chosen to ask ChatGPT, to download a dataset from somewhere on the internet, or to ask someone and pray that they will give you some samples, you will always fall into one of two situations: too few test cases, or many test cases that are still insufficient.

However, testing AI agents is different from testing conventional software. We need to build a golden dataset that defines about 200 scenarios and their expected outputs, and to judge each output on four criteria:

Accuracy: was the meaning of the output, or the action the AI took, accurate?
Format compliance: was the output format correct?
Tone: was the chat tone correct? Was its behavior acceptable? Did it ever overstep its permissions?
Completeness: was the information in the output sufficient?

A good test dataset consists of clear, complete scenarios and their expected outputs.

No definition = no measurement = the 95% figure you gave is just subjective.
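To make that definition concrete, one golden-dataset record could be modeled as below. This is only a minimal sketch: the `GoldenRecord` class and its field names (`scenario`, `expected_output`, `criteria`) are assumptions made for illustration, not a schema from the course or from Bedrock. The example scenario is the flight-booking case used later in this article.

```python
from dataclasses import dataclass, field

# The four criteria named in this section.
CRITERIA = ("accuracy", "format_compliance", "tone", "completeness")


@dataclass
class GoldenRecord:
    """One scenario plus the output we expect (hypothetical schema)."""
    scenario: str
    expected_output: str
    # What "good" means for this record: one short note per criterion.
    criteria: dict = field(default_factory=dict)

    def missing_criteria(self) -> list:
        """Return the criteria that have no definition yet."""
        return [c for c in CRITERIA if not self.criteria.get(c)]


record = GoldenRecord(
    scenario="User asks for the cheapest flight from Hanoi to HCMC tomorrow afternoon.",
    expected_output="Agent proposes the cheapest afternoon flight and asks for confirmation.",
    criteria={
        "accuracy": "Chosen flight is the cheapest one departing tomorrow afternoon.",
        "format_compliance": "Reply lists flight number, time, and price.",
        "tone": "Polite; does not book without confirmation.",
    },
)

# "completeness" was never defined, so this record is not ready to be measured.
print(record.missing_criteria())  # ['completeness']
```

A record that still has missing criteria is exactly the “no definition = no measurement” situation, so a check like this can run before any test is executed.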

3.2. Step two: Generate evaluation datasets

So, how do we build a golden dataset?

Automated generation vs Manual generation

A tester can write around 60 to 80 manual test cases a day for a software project, and run about 50 to 70 cases per day. Testing an AI agent is much harder because its output is non-deterministic. In traditional software, there is a place to input data, a place to press an interactive button, and another place to filter the output data.

I mean, for example, if I have 2 simple checkboxes, then I have a total of 2^2 = 4 scenarios:

Do not check any checkboxes
Check all checkboxes
Check only the first one
Check only the last one

You can always calculate how many cases are enough for covering the requirements.
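The checkbox example can be reproduced in a few lines. This is a minimal sketch; the function name `enumerate_checkbox_states` is made up for illustration.

```python
from itertools import product


def enumerate_checkbox_states(n_checkboxes: int) -> list:
    """Return every checked/unchecked combination for n independent checkboxes."""
    return list(product([False, True], repeat=n_checkboxes))


states = enumerate_checkbox_states(2)
for state in states:
    print(state)

# The input space is finite, so the number of cases is known up front.
print(len(states))  # 4, which is 2 ** 2
```

With deterministic software like this, the list of states is the full test plan. That is the property an AI agent does not have.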

But an AI agent does not work like that. It is probabilistic prediction software. Each input goes through multiple intermediate steps and ultimately returns outputs that are never exactly the same, so you can’t calculate the number of test cases. In other words, counting test cases is pointless.

The input cases that your AI agent system will receive can be broadly divided into three types:

Typical cases: happy cases
Edge cases: boundary cases
Adversarial inputs: ways users intentionally “disrupt” the system

At the time, adversarial inputs were what we lacked. We had a few clients who were willing to share their opinions, and we consulted several domain experts, so I thought we could cover typical cases and some edge cases. But adversarial inputs were our blind spot. We could only see them when analyzing the logs, after users had already gone the wrong way.
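A blind spot like ours can be caught mechanically if every record in the dataset is tagged with its input type. The sketch below assumes each record is a dict with an `input_type` key; that key, the type labels, and the sample inputs are all hypothetical.

```python
from collections import Counter

# The three input types listed above.
REQUIRED_TYPES = ("typical", "edge", "adversarial")


def coverage_report(records: list) -> dict:
    """Count how many records exist for each required input type."""
    counts = Counter(r["input_type"] for r in records)
    return {t: counts.get(t, 0) for t in REQUIRED_TYPES}


def uncovered_types(records: list) -> list:
    """Return the input types that have no record at all."""
    return [t for t, n in coverage_report(records).items() if n == 0]


dataset = [
    {"input_type": "typical", "input": "Book a flight from Hanoi to HCMC tomorrow afternoon."},
    {"input_type": "typical", "input": "Find the cheapest flight to Da Nang next Monday."},
    {"input_type": "edge", "input": "Book a flight for tomorrow."},  # destination is missing
]

print(coverage_report(dataset))  # {'typical': 2, 'edge': 1, 'adversarial': 0}
print(uncovered_types(dataset))  # ['adversarial']
```

A count of zero does not tell you which adversarial cases to write, but it does make the gap visible before launch instead of in the production logs.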

AWS Bedrock provides a built-in feature that lets you upload a PDF document from your domain and use it as a RAG source, then plan with Opus and use Haiku to write each record of the dataset so that it covers all three input types above. Automating this can save a lot of the time otherwise spent thinking up and writing test cases.

Or you can simply write your records by hand, without generating them with Claude.

After collecting the test dataset, our job is to review these records, because they will be used as samples in automated testing and in every future run.

3.3. Step three: Grading

The judges

First of all, I need to write lambda functions that check the format, use regex to extract numbers from the output, feed those numbers into a calculator to verify the results, validate code block format, validate code quality (by running the code in a sandbox environment), measure response speed, count output tokens (cost), check which tools were called… Overall, the code-based graders are lambda functions that I write myself and can tailor easily. The pipeline then combines the scores of these functions and averages them.

If the final score is lower than the allowed threshold, go back and improve the agent without running the model-based grader.
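The code-based stage can be sketched as a handful of small grader functions whose scores are averaged and compared with a threshold. Everything here is an assumption for illustration: the three graders, the 5-second latency budget, the 0.8 threshold, and the sample output are not taken from the course.

```python
import json
import re


def grade_json_format(output: str) -> float:
    """1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def grade_contains_price(output: str) -> float:
    """1.0 if the output mentions a number followed by a currency code."""
    return 1.0 if re.search(r"\d[\d,.]*\s?(USD|VND)", output) else 0.0


def grade_latency(seconds: float, budget: float = 5.0) -> float:
    """1.0 if the response arrived within the time budget (hypothetical budget)."""
    return 1.0 if seconds <= budget else 0.0


def code_based_score(output: str, latency_seconds: float) -> float:
    """Average the scores of all code-based graders."""
    scores = [
        grade_json_format(output),
        grade_contains_price(output),
        grade_latency(latency_seconds),
    ]
    return sum(scores) / len(scores)


THRESHOLD = 0.8  # hypothetical; pick the value your team agrees on

output = '{"flight": "VN123", "price": "1,200,000 VND"}'
score = code_based_score(output, latency_seconds=2.1)

# Only spend money on the model-based grader when the cheap checks pass.
run_model_grader = score >= THRESHOLD
print(score, run_model_grader)  # 1.0 True
```

Putting the cheap deterministic checks first means a badly formatted or slow response never reaches the LLM judge.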

Second stage: in this phase, we use a teacher LLM to judge the outputs. You can choose between Amazon Bedrock Model Evaluation and Amazon Bedrock AgentCore Evaluations. The first one is used to compare and evaluate foundation models, helping you choose the right model for your task, but it is too simple for our purpose. In this article, I will talk about AgentCore Evaluations, which provides comprehensive evaluation of a complex AI agent system. This service lets you evaluate every stage of an agent in each trace: its reasoning process, which tools it calls, multi-turn conversations…

Detailed analysis

Okay, we can roughly divide the huge amount of output data we get from running the test dataset into 3 levels:

Session level (Goal Success Rate): a full session, from the user’s first message to the end.
Trace level (per response / step): a sub-task or a single turn.
Tool level (Tool Selection / Tool Parameter Accuracy): a span logged via OpenTelemetry/OpenInference every time a tool is called.

There are 13 built-in evaluators working across these 3 levels, and you can define your own custom evaluators with your own rubrics.

How does the AgentCore Evaluation process work?

AgentCore Observability Pipeline

At the beginning of this phase, the agent or RAG pipeline that you run on Bedrock is instrumented with OpenTelemetry/OpenInference, and its traces and spans are logged and sent to AgentCore Observability, a service built on top of CloudWatch log groups. You then set up an evaluation configuration, which includes the data source (for example, a log group), the evaluator (our judge: a model, a prompt, and a scoring schema and/or threshold), sampling, an IAM role, filters… After the AgentCore Evaluations job runs, you get a score, a reason, and metadata for each record in CloudWatch Logs, and you can visualize the scores in CloudWatch metrics.

Example of CloudWatch Metrics

That is what will happen, but now I want to dive deeper into the evaluators. For example, in a testing dataset generated by Claude, we will choose the following scenario:

User asked: “Book a flight from Hanoi to HCMC tomorrow afternoon at the lowest price”.
Agent did: it called tools to fetch flight prices, chose the lowest-priced flight at 5pm tomorrow, and then asked the user to confirm the action.

Session level

At this level, the judge tries to answer the question: “Does this conversation achieve the user’s goals?”. The judge reads the traces, uses the Builtin.GoalSuccess evaluator, and then returns a GoalSuccess score of 1.0 (PASS) with the reason “agent found the correct flight, no error”.

Trace level

Each “evaluator” graded the “product”

The judge reads a trace and uses Builtin.Helpfulness, Builtin.Correctness, Builtin.Coherence, Builtin.Faithfulness, Builtin.Harmfulness, and Builtin.InstructionFollowing to analyze the context and the answer, and then returns a score (from 0.0 to 1.0) and a reason for each evaluator. Something like this:

{
  "Helpfulness": {
    "score": 0.9,
    "reason": "The response provides useful information, addresses the user's needs, and has practical value."
  },
  "Correctness": {
    "score": 1.0,
    "reason": "The content is factually accurate with no clear errors or misunderstandings."
  },
  "Coherence": {
    "score": 0.9,
    "reason": "The answer is clear, logical, and well-structured."
  },
  "Faithfulness": {
    "score": 0.9,
    "reason": "The response stays grounded in the given context and does not introduce unsupported claims."
  },
  "Harmfulness": {
    "score": 0.0,
    "reason": "The content is safe and does not include harmful or policy-violating elements."
  },
  "InstructionFollowing": {
    "score": 1.0,
    "reason": "The response closely follows the user's instructions and requirements."
  }
}

Tool level

This level is easier to understand. The judge uses two evaluators: Builtin.ToolSelectionAccuracy checks whether the agent selected the correct tool for the request, and Builtin.ToolParameterAccuracy checks whether the parameters passed to the tool are correct, complete, and appropriate. For example, did the agent call the fetch_flight tool and send the parameters “afternoon” and “tomorrow”? If the model called get_weather instead, then ToolSelectionAccuracy = 0.0.
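To show the idea behind the two tool-level scores, here is a deterministic stand-in written in plain Python. It is not how the built-in evaluators work internally (those are judged by a model); it only compares a logged tool call against an expected one. The parameter names `date` and `time_of_day` and the dict layout are assumptions for illustration.

```python
def tool_selection_score(called_tool: str, expected_tool: str) -> float:
    """1.0 if the agent picked the expected tool, else 0.0."""
    return 1.0 if called_tool == expected_tool else 0.0


def tool_parameter_score(called_params: dict, expected_params: dict) -> float:
    """Fraction of expected parameters that were sent with the expected value."""
    if not expected_params:
        return 1.0
    matched = sum(
        1 for key, value in expected_params.items()
        if called_params.get(key) == value
    )
    return matched / len(expected_params)


expected = {"tool": "fetch_flight",
            "params": {"date": "tomorrow", "time_of_day": "afternoon"}}

good_call = {"tool": "fetch_flight",
             "params": {"date": "tomorrow", "time_of_day": "afternoon"}}
partial_call = {"tool": "fetch_flight", "params": {"date": "tomorrow"}}
wrong_tool = {"tool": "get_weather", "params": {"city": "Hanoi"}}

for call in (good_call, partial_call, wrong_tool):
    print(
        call["tool"],
        tool_selection_score(call["tool"], expected["tool"]),
        tool_parameter_score(call["params"], expected["params"]),
    )
```

An exact-match check like this is strict: it would mark “tomorrow afternoon” sent as a single string as wrong, which is one reason a model-based judge is used for this level.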

3.4. Step four: Improving loop

Okay, in the end, you can make a decision once you see the final average scores. Then you can build a post-processing step to complete the workflow. Example: IF GoalSuccess >= X% AND Helpfulness >= a AND Harmfulness <= b AND ToolSelectionAccuracy >= c THEN deploy the agent to the staging environment and send a message to the team on Slack.
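The IF … THEN rule above can be written as a small gate function. This is a sketch only: the threshold values stand in for X, a, b, and c and are placeholders, and the deploy and Slack steps are reduced to a print because they depend on your own tooling.

```python
# Placeholder thresholds standing in for X, a, b, and c (hypothetical values).
THRESHOLDS = {
    "GoalSuccess": (">=", 0.90),
    "Helpfulness": (">=", 0.80),
    "Harmfulness": ("<=", 0.05),
    "ToolSelectionAccuracy": (">=", 0.90),
}


def failed_checks(scores: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return the metrics that miss their threshold; a missing metric counts as a failure."""
    failures = []
    for metric, (op, limit) in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(metric)
            continue
        passed = value >= limit if op == ">=" else value <= limit
        if not passed:
            failures.append(metric)
    return failures


average_scores = {
    "GoalSuccess": 0.93,
    "Helpfulness": 0.88,
    "Harmfulness": 0.01,
    "ToolSelectionAccuracy": 0.95,
}

failures = failed_checks(average_scores)
if not failures:
    # Replace with your real deploy step and Slack notification.
    print("All gates passed: deploy to staging and notify the team.")
else:
    print("Blocked by:", failures)
```

Returning the list of failing metrics, rather than a bare yes/no, tells the team which part of the agent to improve in the next iteration.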

The loop… with the measurable metrics

Once you have this pipeline, whenever you make a change, you will know exactly how well your agent is performing based on measurable metrics, not on gut feeling.

4. Ship with Confidence, Not with Prayers

On AWS Bedrock, the following AgentCore evaluation modes are available, listed from most expensive to cheapest:

Online Evaluation samples production traffic, scores it continuously, and pushes metrics to CloudWatch in real time for live production monitoring. It is the most expensive mode; you can use it for A/B tests.
On-demand Evaluation is charged at the standard rate. Most of us are in this position.
Batch Evaluation follows the same pattern as the Bedrock Batch API (model), and you pay only 50% of the regular price. It is very useful for nightly builds.

Because Batch Evaluation is still in public preview, it may not be available in some regions. Instead, you can combine other FinOps strategies, such as Savings Plans (save up to 72%) for Lambda functions, EC2, and Fargate. You can also leverage Prompt Caching, which saved about 60% of my LLM costs. I will cover Prompt Caching in more detail in another post because it deserves a deeper dive of its own.

I have a question: I have just started blogging (yes, I used to write short stories, but that was a long time ago) and I am wondering if my use of images or words is appropriate. I mean, I like manga so I use that style, but what about you? Do you like colorful pictures? Let me know in the comments, I am listening.
