The Project I Started But Never Finished
Earlier this year I started building ai-qe-agent —
a multi-agent system that auto-generates QA test cases
using Claude (Anthropic's AI).
8 specialized agents. TypeScript. Direct Anthropic SDK.
It worked. But it had a critical problem:
No visibility into whether the outputs were actually correct.
Agents were generating test cases, reviewing them,
converting them to Playwright scripts — and I had
no idea if Claude was hallucinating, truncating,
or silently failing between agents.
That's what I set out to finish.
The Before
How Claude Helped Me Finish It
I used Claude (via Claude Code) as my primary
AI coding assistant throughout this project.
Claude helped me:
- Design the LLM-as-Judge eval architecture
- Generate eval_suite.py from scratch
- Debug LangSmith tracing integration
- Build the TruLens monitoring setup
- Create the Fintech AI Agent Gradio app
The meta-irony: I used Claude to build a system
that evaluates Claude's own outputs.
What I Finished
1. Custom LLM Eval Suite
Built eval_suite.py using the LLM-as-Judge pattern,
with Claude evaluating Claude's own outputs across 4 dimensions:
- Completeness — did the agent complete the full task?
- Specificity — were outputs precise and detailed?
- Faithfulness — did the agent follow all instructions?
- Hallucination detection — did it invent facts not in context?
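
To make the pattern concrete, here is a minimal sketch of an LLM-as-Judge scorer. It is not the actual eval_suite.py: the prompt wording, the function names, and the `hallucination_free` key are assumptions for illustration. The model call is passed in as a plain callable, so the same code works with any client wrapper.

```python
import json
from typing import Callable

# The four dimensions from the list above; key names are illustrative.
DIMENSIONS = ("completeness", "specificity", "faithfulness", "hallucination_free")


def build_judge_prompt(task: str, context: str, output: str) -> str:
    """Ask the judge model for one 0.0-1.0 score per dimension, as JSON."""
    keys = ", ".join(f'"{d}"' for d in DIMENSIONS)
    return (
        "You are grading another model's output.\n"
        f"TASK:\n{task}\n\nCONTEXT:\n{context}\n\nOUTPUT:\n{output}\n\n"
        f"Reply with only a JSON object with keys {keys}, "
        "each a number between 0.0 and 1.0."
    )


def parse_scores(reply: str) -> dict[str, float]:
    """Pull the JSON object out of the reply and validate every score."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("judge reply contains no JSON object")
    raw = json.loads(reply[start : end + 1])
    scores = {}
    for dim in DIMENSIONS:
        value = float(raw[dim])  # KeyError if the judge skipped a dimension
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{dim} score out of range: {value}")
        scores[dim] = value
    return scores


def judge(task: str, context: str, output: str,
          call_model: Callable[[str], str]) -> dict[str, float]:
    """Score one agent output; call_model sends a prompt and returns text."""
    return parse_scores(call_model(build_judge_prompt(task, context, output)))
```

Validating the reply matters as much as the prompt: a judge that returns prose, or drops a dimension, should fail loudly instead of producing a partial score.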
2. TruLens Monitoring Dashboard
Real-time quality metrics across the 4 evaluated agents:
- Faithfulness scores
- Hallucination flags
- Chain compatibility checks
- Quality score trends
3. LangSmith Production Tracing
Every Claude API call is now traced:
- Input prompt
- Output response
- Latency per agent
- Token usage
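
The four fields above can be pictured as one record per call. The sketch below is a library-free stand-in, not LangSmith's API: `TraceRecord`, `traced`, and the in-memory `TRACES` list are hypothetical names, and the wrapped function is assumed to return its own token counts.

```python
import time
from dataclasses import dataclass
from functools import wraps


@dataclass
class TraceRecord:
    agent: str
    prompt: str
    response: str
    latency_s: float
    input_tokens: int
    output_tokens: int


# In-memory sink for the sketch; a real tracer would ship records to a backend.
TRACES: list[TraceRecord] = []


def traced(agent: str):
    """Record one TraceRecord per call of the wrapped function.

    The wrapped function must return (response_text, input_tokens, output_tokens).
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(prompt: str) -> str:
            start = time.perf_counter()
            response, tokens_in, tokens_out = fn(prompt)
            latency = time.perf_counter() - start
            TRACES.append(
                TraceRecord(agent, prompt, response, latency, tokens_in, tokens_out)
            )
            return response
        return wrapper
    return decorator
```

Wrapping each agent's model call this way gives a per-agent latency and token trail without touching the agent logic itself.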
4. Pinecone Vector Store
Semantic deduplication for test cases:
- Prevents duplicate test generation
- 0.85+ cosine similarity = HIGH OVERLAP flag
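
The 0.85 threshold is easy to reason about in plain Python. This sketch assumes the test cases have already been embedded as vectors and skips the vector store entirely; the function names and the "OK" label are illustrative, and only the 0.85 cutoff and the HIGH OVERLAP flag come from the setup above.

```python
import math

HIGH_OVERLAP_THRESHOLD = 0.85  # cutoff described above


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / norm


def overlap_flag(candidate: list[float], existing: list[list[float]]) -> str:
    """Flag the candidate if it is too close to any stored test-case vector."""
    best = max((cosine_similarity(candidate, v) for v in existing), default=0.0)
    return "HIGH OVERLAP" if best >= HIGH_OVERLAP_THRESHOLD else "OK"
```

A vector store does the same comparison, but with an index so that each new test case is not compared against every stored one.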
5. Fintech AI Agent (New HF Space)
Live demo combining everything:
- Fraud detection with risk scoring (0-10)
- Compliance Q&A (KYC/AML/GDPR/SOX/PCI-DSS)
- AML risk report generation (6-section formal reports)
- Real-time eval dashboard
The Findings — What Claude Found About Itself
Running the eval suite on my own pipeline revealed:
🔴 2 hallucinations caught
AutomationScriptGenerator invented 'Invalid credentials'
as error text — never specified in the input context.
SelfHealingAgent fabricated DOM selectors without a DOM.
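
A cheap deterministic check can back up the judge for this kind of failure. The sketch below is a crude heuristic and an assumption on my part, not how the eval suite detects hallucinations: it lists every quoted string in a generated script that never appears in the input context. It will also flag legitimate literals such as selectors, so treat its output as review candidates.

```python
import re

# Matches '...' or "..." on one line; apostrophes inside words will confuse it.
QUOTED = re.compile(r"""(['"])(.+?)\1""")


def ungrounded_literals(generated: str, context: str) -> list[str]:
    """Return quoted strings in `generated` that never appear in `context`."""
    context_lower = context.lower()
    literals = [match.group(2) for match in QUOTED.finditer(generated)]
    return [lit for lit in literals if lit.lower() not in context_lower]
```

Run against a script that asserts on 'Invalid credentials' with a context that never mentions that text, the string comes back as ungrounded.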
🔴 2 pipeline breaks found
ManualTestGenerator output = bare array.
QAReviewAgent expected a wrapped ManualTestSuite object.
chain_compatibility = 0. Would silently fail in production.
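
The shape mismatch can be checked before the handoff. This is a sketch under assumptions: the `testCases` and `suiteName` field names are hypothetical, since the real ManualTestSuite schema is not shown here.

```python
def chain_compatibility(output) -> int:
    """Return 1 if the output is a wrapped suite the reviewer can consume, else 0."""
    if not isinstance(output, dict):
        return 0  # a bare array (or anything else) breaks the handoff
    cases = output.get("testCases")  # hypothetical field name
    return 1 if isinstance(cases, list) else 0


def wrap_if_bare(output, suite_name: str = "untitled"):
    """Adapter: wrap a bare list so the downstream agent gets the expected shape."""
    if isinstance(output, list):
        return {"suiteName": suite_name, "testCases": output}
    return output
```

Scoring the handoff as 0 or 1 keeps the check blunt on purpose: either the next agent can parse the payload or it cannot.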
🔴 2 faithfulness failures
ManualTestGenerator generated 2 of 8 required test cases.
Stopped with no error. No warning. Just silent truncation.
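
Silent truncation is the easiest of these failures to turn into a loud one. A minimal sketch, with a hypothetical exception name and helper:

```python
class TruncatedOutputError(Exception):
    """Raised when an agent returns fewer items than the task required."""


def require_count(test_cases: list, required: int) -> list:
    """Pass the list through unchanged, or fail loudly if it is short."""
    if len(test_cases) < required:
        raise TruncatedOutputError(
            f"expected {required} test cases, got {len(test_cases)}"
        )
    return test_cases
```

With 2 of 8 test cases returned, this raises instead of letting the short list flow on to the next agent.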
🟢 0.902 avg quality score
AutomationScriptGenerator: 0.94
SelfHealingAgent: 1.0 quality — but 0.0 faithfulness.
Good output. Wrong process. Only eval catches this.
The After
The Key Insight
AI systems fail silently.
No errors. No warnings. No crashes.
Just wrong outputs — shipped with confidence.
This is why LLM Evaluation Engineering exists.
And why finishing this project mattered.
Demo
🤗 Fintech AI Agent (live):
https://huggingface.co/spaces/Vijayarv07/fintech-ai-agent
🤗 ai-qe-agent (live):
https://huggingface.co/spaces/Vijayarv07/ai-qe-agent
⭐ GitHub:
https://github.com/vijayarjun7/ai-qe-agent
Tech Stack
- Claude (claude-sonnet-4-20250514) — Anthropic
- Python + TypeScript
- TruLens (eval monitoring)
- LangSmith (production tracing)
- Pinecone (vector store)
- Gradio (HF Space UI)
- Playwright (automation)
Built in public. Follow my journey: #BuildInPublic














