Most teams shipping agentic AI systems are operating on vibes. "It seems to work" is the quality bar. When things break in production, they scramble through logs trying to understand how the agent behaved. Mix in a multi-agent architecture and it suddenly gets even harder.
The industry has converged on LLM-as-judge evaluations as the solution: another LLM rates your agent's responses on metrics such as quality, helpfulness and correctness. This approach, however, has serious limitations that nobody really talks about.
What if 80% of your AI system's behavior could be tested deterministically? What if you could run assertions on tool calls, token consumption, database mutations and conversation flow with the same precision as traditional unit tests?
This is the story of how I built Agent QA - an agent testing framework that brings determinism to the world of non-deterministic AI systems.
The Two Types of Tests in Agentic Systems
Before we dive into tooling, let's establish a mental model. When testing AI systems, you're dealing with two fundamentally different categories of assertions.
Non-Deterministic Tests (LLM Evaluations)
These use another LLM to judge the quality of your AI's responses. You might ask: "On a scale of 1-5, how helpful was this response?" or "Did this answer correctly address the user's question?"
When they're useful:
- Evaluating tone and style
- Assessing creative output quality
- Checking for harmful or inappropriate content
- Subjective quality metrics
The problems:
- Non-reproducible: Run the same eval twice, get different scores
- Expensive: Every eval costs tokens
- Slow: You're waiting on another LLM call
- Hard to debug: When an eval fails, why did it fail?
- Turtles all the way down: Who evaluates the evaluator?
Deterministic Tests (Behavioral Assertions)
These are concrete assertions on observable behavior - things that either happened or didn't and values that either match or don't.
What you can test deterministically:
- Tool calls: Which tools were called, how many times, with what arguments
- Tool outputs: What the tools returned, what fields were present
- Token consumption: Input tokens, output tokens, cache hits, total cost
- Entity mutations: What was created, updated, or deleted in your database
- Conversation flow: Number of LLM turns, which agents were invoked
- Response content: Does it mention expected entities, contain required strings
Here's what this looks like in practice with my Agent QA testing framework:
- chat: "Create a task called 'Review quarterly report' with high priority"
tools:
manageTasks: { min: 1, max: 3 }
created:
- entity: tasks
fields:
title: { containsAny: ["quarterly", "report"] }
priority: high
usage:
inputTokens: { gt: 0, lt: 50000 }
cacheReadTokens: { gt: 0 }This scenario asserts that:
- The manageTasks tool was called 1-3 times
- A task was created with a title containing "quarterly" or "report"
- The task has high priority
- Token consumption was under 50,000
- Prompt caching is working (cache read tokens > 0)
Every single one of these assertions is deterministic. Pass or fail, no ambiguity. And, much like Cucumber test definitions, this scenario is easy for anyone on your team to understand at a glance.
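To make "deterministic" concrete, here's roughly what evaluating these matchers could look like under the hood. The types and matcher set below are a simplified, hypothetical sketch, not Agent QA's actual engine:

```typescript
// Minimal sketch of a deterministic matcher evaluator.
// Hypothetical types - the real matcher vocabulary is richer than this.
type Matcher =
  | { gt?: number; lt?: number }
  | { contains?: string; containsAny?: string[] };

function matches(actual: number | string, matcher: Matcher): boolean {
  if (typeof actual === 'number') {
    const m = matcher as { gt?: number; lt?: number };
    if (m.gt !== undefined && !(actual > m.gt)) return false;
    if (m.lt !== undefined && !(actual < m.lt)) return false;
    return true;
  }
  const m = matcher as { contains?: string; containsAny?: string[] };
  const text = actual.toLowerCase();
  if (m.contains !== undefined && !text.includes(m.contains.toLowerCase())) return false;
  if (m.containsAny !== undefined && !m.containsAny.some((s) => text.includes(s.toLowerCase()))) return false;
  return true;
}

// matches(42_000, { gt: 0, lt: 50000 })                                      // true
// matches('Review quarterly report', { containsAny: ['quarterly', 'report'] }) // true
```

There is no model in the loop here: given the same recorded behavior, the result is always the same.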
The Key Difference
Be careful not to conflate "response sounds good" with "system behaved correctly."
When your agent says "I've created a high-priority task for reviewing the quarterly report" - that's a claim. The deterministic test verifies the reality: Did it actually call the right tool? Did a task actually get created in the database? Is the priority actually set to high?
LLM evaluations might tell you the response was polite and well-structured. Deterministic tests tell you whether your agent actually did its job.
The split should be roughly 80/20. 80% of your test suite can (and should) be deterministic. Save LLM evaluations for the genuinely subjective 20%.
So what is Agent QA?
Agent QA is a YAML-based scenario runner for testing AI agents. It follows a config-first, convention-over-configuration design inspired by tools like Vitest.
Note: Agent QA may be open-sourced in the future. For now, this post shares the patterns and principles that guided its development.
You're more than welcome to reverse-engineer all of this, but I'd greatly appreciate a shoutout if you find it helpful on your journey!
Core Capabilities
Multi-step conversations: Test flows that span multiple user turns, or even multiple separate conversations:
steps:
  - chat: "Create a task to review the report"
    conversation: main
    created:
      - entity: tasks
        as: $reportTask
  - chat: "What tasks do I have?"
    conversation: main
    response:
      mentionsAny: ["report"]
  # Start a brand new conversation
  - chat: "Delete all my tasks"
    conversation: different
Tool assertions: Validate not just that tools were called, but how they were called:
- chat: "Create a high priority task to call mom"
tools:
manageTasks:
count: 1
input:
creates:
title: { contains: "mom" }
priority: highEntity assertions with cross-step references: Capture entities in one step, verify them in another:
steps:
  - chat: "Create a task called 'Review Q4'"
    created:
      - entity: tasks
        as: $myTask
        fields:
          title: { contains: "Q4" }
  - chat: "Mark my task as complete"
  # ...later...
  - verify:
      tasks:
        - id: { ref: $myTask.id }
          fields:
            status: completed
Token assertions with compound matchers: Handle the inherent non-determinism of caching:
usage:
  inputTokens: { gt: 0, lt: 50000 }
  outputTokens: { gt: 10, lt: 5000 }
  # Cache might be cold OR warm - either is acceptable
  anyOf:
    - cacheCreationTokens: { gt: 0 } # Cold cache: tokens being cached
    - cacheReadTokens: { gt: 0 }     # Warm cache: tokens read from cache
Design Philosophy
1. Scenarios should be readable by non-engineers. Product managers should be able to understand what's being tested.
2. Express intent, not implementation. The YAML describes what should happen, not how to make it happen.
3. Infrastructure is isolated and self-managed. Agent QA spins up its own infra, including a Postgres container, an API server and a reverse proxy for testing AWS SQS end-to-end. After each test scenario or suite, Agent QA tears everything down - no shared state between tests that can pollute validation.
Token Economics
Tokens are your marginal cost. Every conversation has a price and small inefficiencies compound into real money at scale.
Why Token Consumption Matters
Consider Anthropic's prompt caching: cached tokens cost ~90% less than uncached tokens. If you're sending 100,000 tokens per request and only 10% are being cached, you're leaving massive savings on the table.
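To make that concrete, here's a back-of-the-envelope calculation. The prices are illustrative (roughly in line with published Claude Sonnet input rates at the time of writing); plug in your provider's actual numbers:

```typescript
// Back-of-the-envelope cost per request at different cache hit rates.
// Illustrative prices: ~$3 per million uncached input tokens,
// ~$0.30 per million cache-read tokens (the ~90% discount).
const UNCACHED_PER_TOKEN = 3 / 1_000_000;
const CACHED_PER_TOKEN = 0.3 / 1_000_000;

function costPerRequest(inputTokens: number, cacheHitRate: number): number {
  const cached = inputTokens * cacheHitRate;
  const uncached = inputTokens - cached;
  return uncached * UNCACHED_PER_TOKEN + cached * CACHED_PER_TOKEN;
}

console.log(costPerRequest(100_000, 0.1).toFixed(3)); // ~$0.273 per request
console.log(costPerRequest(100_000, 0.9).toFixed(3)); // ~$0.057 per request - nearly 5x cheaper
```

At thousands of requests per day, the gap between a 10% and a 90% cache hit rate is a line item on your bill, not a rounding error.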
But here's the problem - LLM providers only give you totals. You see aggregate input tokens, output tokens and cache hit/read metrics but that's not always actionable.
What You Actually Need to Measure
Per-turn breakdown:
- How many tokens did this specific user message consume?
- How did token consumption grow across a 6-turn conversation?
- Which turn caused the token budget to spike?
Per-agent breakdown (for multi-agent systems):
- Is the router agent consuming half your tokens just to decide which specialist to route to?
- Are some agents dramatically more expensive than others?
Token attribution to code:
- How many tokens does your system prompt consume?
- How expensive are your tool definitions?
- What's the token cost of the error messaging you added?
The Missing Layer: Token Attribution
LLM providers won't tell you that your TaskManageSchema Zod definition costs 847 tokens. They won't tell you that your system prompt is over 12,000 tokens. They won't tell you that the error messages you're stuffing into sub-agent tool results are doubling your conversation cost.
You have to build this instrumentation yourself.
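If you want to roll your own version of this, a rough approach is to serialize each Zod schema to the JSON Schema text the provider actually receives, then count tokens on that text. The sketch below uses zod-to-json-schema and js-tiktoken as stand-ins (an OpenAI tokenizer only approximates Anthropic's tokenization), and the schema itself is a made-up example, not Agent QA's real TaskManageSchema:

```typescript
// Rough estimate of a tool schema's token cost: serialize the Zod schema to
// JSON Schema (roughly what the provider sees) and tokenize that text.
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
import { getEncoding } from 'js-tiktoken';

// Hypothetical example schema for illustration only.
const TaskManageSchema = z.object({
  action: z.enum(['create', 'update', 'complete', 'delete']),
  title: z.string().describe('Task title shown to the user'),
  priority: z.enum(['low', 'medium', 'high']).optional(),
  dueDate: z.string().datetime().optional().describe('ISO 8601 due date'),
});

const jsonSchema = JSON.stringify(zodToJsonSchema(TaskManageSchema), null, 2);
const encoder = getEncoding('cl100k_base'); // proxy tokenizer; exact counts vary by model
console.log(
  `TaskManageSchema ≈ ${encoder.encode(jsonSchema).length} tokens, ${jsonSchema.length} bytes`,
);
```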
Agent QA includes a schema-tokens command that analyzes the token cost of your Zod schemas:
agentqa schema-tokens ./src/tools/schemas.ts --sort tokens
Schema Token Analysis (claude-haiku-4-5)
| Schema Name | Tokens | Size |
|---|---|---|
| TaskManageSchema | 847 | 3.2 KB |
| TaskInputSchema | 523 | 2.1 KB |
| ReminderSchema | 412 | 1.8 KB |
| Total | 1,782 | 7.1 KB |
Now you and your preferred coding agent (ideally Claude Code with an agent-optimization skill) know exactly where to optimize.
The Optimization Workflow
1. Run a scenario with diagnostics:
   agentqa run suite.yaml --id test-001 --save-diagnostics
2. Analyze consumption:
   agentqa analyze-tokens ./diagnostics-output/test-001/*/http-responses.json --per-agent

   Token Consumption Analysis (6 turns)

   | Metric | Value |
   |---|---|
   | Input Tokens | 124,582 |
   | Output Tokens | 1,440 |
   | Total Tokens | 126,022 |
   | Cache Hit Rate | 85.7% |

   Per-Agent Breakdown

   | Agent | Input | Output | Calls | % |
   |---|---|---|---|---|
   | router-agent | 68,040 | 804 | 18 | 54.6% |
   | tasks-agent | 57,582 | 606 | 12 | 46.2% |

3. Identify the problem: Router agent is consuming 54% of tokens just to route. That's a red flag.
4. Analyze tool definitions:
   agentqa schema-tokens ./src/agents/router/tools.ts
5. Make targeted changes: Trim the router's schema, reduce context, optimize.
6. Re-run to validate: Same scenario, measure the delta.
A/B Testing Across Models and Configurations
Different models have different cost/latency/quality tradeoffs. The same model with different prompts produces different behavior. You need reproducible scenarios to compare fairly.
What to Measure in A/B Tests
| Metric | Why It Matters |
|---|---|
| Token consumption | Direct cost comparison |
| Latency | User experience |
| Tool call patterns | Does one model call more tools? |
| Cache efficiency | Different models may cache differently |
| Pass/fail rate | Do deterministic assertions pass? |
How Agent QA Enables This
YAML scenarios are model-agnostic. The same test-001-create-task.yaml runs against Claude, GPT-4, or any model your API supports. Switch models via config or environment variables:
// agentqa.config.ts
export default defineConfig({
  agent: {
    baseUrl: '$API_URL',
    model: process.env.TEST_MODEL || 'claude-sonnet-4-20250514',
  },
  // ...
});
Run your entire suite against two models:
TEST_MODEL=claude-sonnet-4-20250514 agentqa run suite.yaml --tag smoke --save-diagnostics
TEST_MODEL=claude-haiku-4-5 agentqa run suite.yaml --tag smoke --save-diagnostics
Now you have comparable diagnostics for both. Same scenarios, same user inputs, different models—apples to apples.
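From there, a throwaway comparison script is enough to quantify the difference. The diagnostics shape below ({ steps: [{ usage }] }) is a simplified, hypothetical stand-in; adapt the field access to whatever your http-responses.json actually contains:

```typescript
// Compare token totals from two diagnostics files, one per model.
// The file shape here is hypothetical - adjust to your own diagnostics format.
import { readFileSync } from 'node:fs';

interface StepUsage { inputTokens: number; outputTokens: number }
interface Diagnostics { steps: { usage: StepUsage }[] }

function totals(path: string) {
  const diag: Diagnostics = JSON.parse(readFileSync(path, 'utf8'));
  return diag.steps.reduce(
    (acc, s) => ({
      input: acc.input + s.usage.inputTokens,
      output: acc.output + s.usage.outputTokens,
    }),
    { input: 0, output: 0 },
  );
}

// Usage (hypothetical paths): node compare.ts sonnet/http-responses.json haiku/http-responses.json
const [sonnet, haiku] = process.argv.slice(2).map(totals);
console.table({ sonnet, haiku });
console.log(`Input token delta: ${(((haiku.input - sonnet.input) / sonnet.input) * 100).toFixed(1)}%`);
```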
Hallucination Testing via Multi-Run Analysis
Hallucinations are a pain and single-run tests can easily miss them.
Your agent says "Done! I've deleted the task." but it never actually called the delete tool. The response sounds confident, helpful and even cheerful - but the system is lying to you. This happens more often than you'd think, especially with smaller models or complex multi-step operations.
The worst part is that a single test run might pass. The agent usually calls the tool. But one time in five? One in ten? It hallucinates the action without actually calling the right tool or sub-agent. Users end up thinking their task was deleted when it's still sitting in the database.
Multi-Run Testing
The solution is statistical: run the same scenario multiple times and look for variance.
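How many runs is enough? If a hallucination occurs on a given run with probability p, the chance that at least one of n runs surfaces a failure is 1 - (1 - p)^n. A quick sanity check:

```typescript
// Probability of observing at least one failure in n runs,
// given a per-run hallucination rate p.
const detectionProbability = (p: number, n: number) => 1 - Math.pow(1 - p, n);

console.log(detectionProbability(0.2, 5).toFixed(2));  // ~0.67 - 5 runs catch a 1-in-5 bug two-thirds of the time
console.log(detectionProbability(0.2, 10).toFixed(2)); // ~0.89
console.log(detectionProbability(0.1, 10).toFixed(2)); // ~0.65 - rarer bugs need even more runs
```

Five runs only catch a 1-in-5 bug about two-thirds of the time, which is why critical paths deserve 10 or more.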
agentqa run suite.yaml --id test-001 --runs 5
Agent QA tracks pass/fail rates per step across all runs:
Running scenario 5 times: Delete task flow
────────────────────────────────────────────────────────────
[Run 1/5]
✓ Run 1: PASSED (56040ms)
[Run 2/5]
✗ Run 2: FAILED (52890ms)
[Run 3/5]
✓ Run 3: PASSED (54200ms)
...
════════════════════════════════════════════════════════════
Multi-Run Summary
════════════════════════════════════════════════════════════
Scenario: Delete task flow
Total Runs: 5
Pass Rate: 80.0%
✓ Passed: 4
✗ Failed: 1
⚠ FLAKY: This scenario passes sometimes and fails sometimes
Hallucination Detection Logic
Agent QA specifically detects hallucinations by correlating two signals:
- Response text contains action keywords: "deleted", "created", "updated", "completed", etc.
- Tool assertion failed: The expected tool call didn't happen
When both conditions are true, it's flagged as a hallucination—the agent claimed to do something it didn't actually do.
Hallucination Detection
────────────────────────────────────────────────────────────
⚠ turn-5-delete: 20.0% hallucination rate
Occurred in 1 of 5 runs
Missing tools: manageTasks
Response snippet: "bye bye presentation slides task 👋 deleted!"
That response snippet is damning evidence. The agent said "deleted!" with a cheerful emoji, but manageTasks was never called. The task is still in the database. The user has been deceived.
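A stripped-down version of that correlation check might look like the following. This is a hypothetical sketch, not Agent QA's actual detector, which also tracks which step the claim appeared in and which assertion failed:

```typescript
// Flag a likely hallucination when the response claims an action
// but the corresponding tool assertion found no matching tool call.
// Intentionally naive - keyword lists and matching are simplified.
const ACTION_KEYWORDS = ['deleted', 'created', 'updated', 'completed', 'done'];

interface StepResult {
  responseText: string;
  missingTools: string[]; // tools the assertions expected but never saw called
}

function isLikelyHallucination(step: StepResult): boolean {
  const text = step.responseText.toLowerCase();
  const claimsAction = ACTION_KEYWORDS.some((kw) => text.includes(kw));
  return claimsAction && step.missingTools.length > 0;
}

// isLikelyHallucination({
//   responseText: 'bye bye presentation slides task 👋 deleted!',
//   missingTools: ['manageTasks'],
// }) // => true: the response claims "deleted" but manageTasks was never called
```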
Why This Matters
Hallucinations are fundamentally different from other failures:
| Failure Type | User Experience | Detection Difficulty |
|---|---|---|
| Tool error | User sees error message | Easy - explicit failure |
| Wrong tool called | Unexpected behavior | Medium - assertion catches it |
| Hallucination | User thinks action succeeded | Hard - response looks correct |
Hallucinations are the most dangerous because they're invisible to the user. Everything looks fine. The agent confirmed success. Only deterministic assertions on actual tool calls reveal the lie.
Building Hallucination-Resistant Systems
Once you can detect hallucinations, you can reduce them:
- Identify flaky scenarios: Run your critical paths with --runs 5 or --runs 10
- Pinpoint problematic steps: The multi-run summary shows exactly which steps are unreliable
- Analyze patterns: Do hallucinations correlate with conversation length? Token consumption? Specific tools?
- Strengthen prompts: Add explicit instructions like "You MUST call the tool to perform this action - never claim success without calling tools"
- Validate and iterate: Re-run with multiple iterations to confirm hallucination rate dropped
The goal isn't 100% pass rate on a single run - it's consistent behavior across many runs. A scenario that passes 100% of the time over 10 runs is far more trustworthy than one you only ran once.
Observability: The Foundation of Everything
You can't build any of this without deep observability. Token analysis, failure debugging, A/B testing, AI self-improvement—all of it depends on being able to see exactly what happened.
Three Key Components
1. Traces (OpenTelemetry)
Hierarchical view of what happened:
chat-request (45.2s)
├── router-agent (12.1s)
│ ├── llm-call (8.2s) → tokens: 24,000 in, 200 out
│ └── llm-call (3.9s) → tokens: 24,200 in, 150 out
└── tasks-agent (33.1s)
├── llm-call (10.4s) → tokens: 18,000 in, 400 out
├── tool-call: manageTasks (0.2s)
    └── llm-call (22.5s) → tokens: 19,500 in, 350 out
This tells you where time and tokens are being spent. The tasks-agent's second LLM call took 22.5 seconds—why? You can drill into that span.
2. Logs
Structured API logs, tmux session capture, error aggregation. When something fails, you need the server's perspective, not just the test runner's.
3. Diagnostics
On every failure (or when explicitly requested), Agent QA saves:
diagnostics-output/test-001/2026-01-06T10-30-00/
├── http-responses.json # Per-step token breakdowns, tool calls
├── tempo-traces.json # Raw OpenTelemetry spans
├── tmux-logs.txt # Server logs around the failure
└── failure.json          # Error details, stack traces
Everything you need to understand what happened, in one directory, ready to be analyzed.
What To Instrument
Every LLM call gets a span with:
- Model name and parameters
- Full request (messages, tools, system prompt)
- Full response
- Token counts (input, output, cache creation, cache read)
- Duration
- Correlation ID linking to the conversation
Every tool invocation gets a span with:
- Tool name
- Input arguments
- Output result
- Duration
- Errors if any
This instrumentation isn't optional overhead—it's what makes everything else possible.
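In practice, that means wrapping the provider call itself. Here's a minimal sketch using the OpenTelemetry JS API and the Anthropic SDK; the ai.* attribute names are my own convention rather than a standard, and error handling is reduced to the essentials:

```typescript
// Sketch: wrap every LLM call in a span that records token usage.
import { trace, SpanStatusCode } from '@opentelemetry/api';
import Anthropic from '@anthropic-ai/sdk';

const tracer = trace.getTracer('my-agent');
const anthropic = new Anthropic();

async function llmCall(conversationId: string, messages: Anthropic.Messages.MessageParam[]) {
  return tracer.startActiveSpan('llm-call', async (span) => {
    span.setAttribute('ai.correlation_id', conversationId);
    span.setAttribute('ai.model', 'claude-sonnet-4-20250514');
    try {
      const response = await anthropic.messages.create({
        model: 'claude-sonnet-4-20250514',
        max_tokens: 1024,
        messages,
      });
      // Token counts are what make per-turn and per-agent analysis possible later.
      span.setAttribute('ai.tokens.input', response.usage.input_tokens);
      span.setAttribute('ai.tokens.output', response.usage.output_tokens);
      span.setAttribute('ai.tokens.cache_read', response.usage.cache_read_input_tokens ?? 0);
      span.setAttribute('ai.tokens.cache_creation', response.usage.cache_creation_input_tokens ?? 0);
      return response;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```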
Custom Telemetry Tools
For more fine-grained analysis, you should build custom tools for yourself, your team and your coding agents. Here are a few that I found quite helpful:
traces CLI: Query OpenTelemetry data from Tempo:
# Find all traces for a conversation
pnpm traces search --correlation conv_abc123 --fetch
# Get a specific trace
pnpm traces get abc123def456 --format tree
# Recent traces from a service
pnpm traces recent --service pocketcoach-api --since 6h
Diagnostics writer: Auto-saves on failure, aggregates HTTP responses with Tempo traces and server logs.
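Under the hood, a CLI like this is mostly a thin wrapper over Tempo's HTTP API. A minimal sketch, assuming the standard /api/search and /api/traces/{traceId} endpoints and that spans carry a searchable correlation-ID tag (as in the earlier instrumentation sketch):

```typescript
// Minimal Tempo client: find traces by correlation ID, then fetch one by ID.
const TEMPO_URL = process.env.TEMPO_URL ?? 'http://localhost:3200';

async function searchByCorrelationId(correlationId: string) {
  const params = new URLSearchParams({
    tags: `ai.correlation_id=${correlationId}`, // assumes this attribute is indexed as a tag
    limit: '20',
  });
  const res = await fetch(`${TEMPO_URL}/api/search?${params}`);
  if (!res.ok) throw new Error(`Tempo search failed: ${res.status}`);
  return (await res.json()).traces ?? [];
}

async function getTrace(traceId: string) {
  const res = await fetch(`${TEMPO_URL}/api/traces/${traceId}`);
  if (!res.ok) throw new Error(`Tempo fetch failed: ${res.status}`);
  return res.json();
}

// const traces = await searchByCorrelationId('conv_abc123');
// const full = await getTrace(traces[0].traceID);
```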
Closing the Loop: Enabling AI to Test Itself
With all of this infrastructure in place (deterministic scenarios, rich diagnostics, queryable traces and token attribution), you can enable Claude Code (or any AI coding assistant) to test and improve the system autonomously.
What Claude Code Should Be Able to Do
- Run tests: Execute specific scenarios and get structured pass/fail results
- Analyze token consumption: Parse diagnostics to identify expensive agents or tools
- Query traces: Search for specific conversation patterns or errors
- Identify root causes: Correlate high token usage with specific code paths
- Propose fixes: Suggest prompt optimizations, schema reductions, or architectural changes
- Validate fixes: Re-run scenarios to confirm improvements
- Build a custom Claude Code skill: Create a specialized skill with the proper workflow, context engineering principles, and standard operating procedures for using Agent QA—essentially encoding the entire optimization methodology into a reusable directive
To really optimize this process you absolutely need a Claude Code skill, which is effectively a markdown file that provides Claude with domain-specific context and instructions. For Agent QA, this skill should include:
- How to run specific scenarios (never run full suites—always filter)
- How to interpret diagnostics output
- The optimization workflow (run → analyze → identify → fix → validate)
- Key principles: favor deterministic assertions, attribute tokens to code, check cache efficiency
- Common pitfalls and how to avoid them
With this skill, Claude Code doesn't just have tools - it has a workflow to follow. The skill should empower your coding agent to know all the right questions to ask, where to find your telemetry, how to simulate agents, how to detect hallucinations and how to effectively optimize every layer of your agent stack without ambiguity.
The Self-Improvement Loop
1. Claude Code runs a scenario: agentqa run suite.yaml --id test-001 --save-diagnostics
2. It fails or exceeds the token budget: Maybe the assertion inputTokens: { lt: 50000 } fails with an actual value of 78,000.
3. Claude reads diagnostics: Parses http-responses.json to see per-turn and per-agent breakdowns.
4. Identifies the expensive agent: Router agent is consuming 54% of tokens, which seems excessive for routing logic.
5. Analyzes tool definitions: agentqa schema-tokens ./src/agents/router/tools.ts finds a 2,000 token schema that could be trimmed.
6. Proposes a change: Suggests removing verbose descriptions from the schema, or splitting into smaller schemas loaded conditionally.
7. Re-runs to validate: Same scenario, checks that tokens dropped below 50,000.
8. Repeats until passing: The loop continues until all assertions pass.
And just like that you have complete self-play through pure engineering - no vibe testing or vibe coding. You're providing the AI with the same tools, information, and methodology that a human engineer would use.
The main difference now is speed and reliability. Because you've created a skill that emulates how you would think about agent testing and optimization, your coding agent can fly through everything at the speed of thought: it can detect a failing test, fix it, and add new tests to prevent regressions, all far faster than you could.
Practical Implementation Guide
If you want to build something similar, here's where to start.
1. Start with Your Test User
Create an isolated test environment:
// agentqa.config.ts
export default defineConfig({
  database: {
    url: '$TEST_DATABASE_URL', // Separate from dev/prod
    entities: [
      { table: schema.tasks, name: 'tasks', titleColumn: 'title' },
    ],
  },
  hooks: {
    beforeEach: async (scenario) => {
      // Clean slate for every scenario
      await db.delete(tasks).where(eq(tasks.userId, TEST_USER_ID));
    },
  },
});
2. Write Your First Scenario
Start with the simplest CRUD operation:
id: test-001-create-task
name: "Create a simple task"
tags: [smoke, tasks]
steps:
  - chat: "Create a task to buy groceries"
    tools:
      manageTasks: 1
    created:
      - entity: tasks
        fields:
          title: { contains: "groceries" }
3. Add Token Assertions
Once basic scenarios pass, add usage assertions:
- chat: "Create a task to buy groceries"
tools:
manageTasks: 1
usage:
inputTokens: { lt: 30000 }
outputTokens: { lt: 1000 }4. Layer in Diagnostics
Configure automatic capture:
diagnostics: {
  tmux: { sessionName: 'api-server' },
  tempo: { url: '$TEMPO_URL' },
  outputDir: './diagnostics-output',
},
5. Build the Observability Stack
Docker Compose makes this straightforward:
services:
  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200" # API
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3031:3000"
Instrument your API to send traces:
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('my-agent');

async function handleChat(conversationId: string, message: string) {
  return tracer.startActiveSpan('chat-request', async (span) => {
    span.setAttribute('ai.correlation_id', conversationId);
    try {
      // ... your agent logic
    } finally {
      span.end(); // end the span even if the agent logic throws
    }
  });
}
6. Create the Claude Code Skill
Document the workflow (mine is much longer - just sharing as a reference point):
# Agent QA Testing Skill
## When to Use
Use this skill when testing, debugging, or optimizing AI agents.
## Key Commands
- `agentqa run suite.yaml --id <id>` - Run a specific scenario
- `agentqa run suite.yaml --id <id> --save-diagnostics` - Run with diagnostics
- `agentqa analyze-tokens ./diagnostics-output/<id>/*/http-responses.json` - Analyze consumption
## Workflow
1. Run the scenario with `--save-diagnostics`
2. If it fails, read `failure.json` and `http-responses.json`
3. Identify high token consumers using `--per-agent` flag
4. Analyze tool definitions with `schema-tokens`
5. Make targeted changes
6. Re-run to validate
## Principles
- Always filter scenarios (never run full suites)
- Favor deterministic assertions over LLM evaluations
- Attribute token costs to specific code
- Validate caching is working (cache read tokens > 0)
Lessons Learned
What Worked
YAML scenarios are surprisingly expressive. Non-engineers can read them, engineers can write them quickly, and they version control cleanly.
Compound matchers handle non-determinism. The anyOf/allOf pattern for cache assertions was a breakthrough—it acknowledges that some aspects are non-deterministic while still providing meaningful assertions.
Per-agent token tracking revealed optimization opportunities. We discovered our router agent was consuming 50%+ of tokens just to make routing decisions. That's a concrete, actionable insight you'd never get from aggregate totals.
Isolated test environments prevent flakiness. Every scenario gets a clean database. No shared state means no mysterious failures.
What Was Hard
Cache behavior is inherently non-deterministic. The same scenario might have a cold cache on first run and warm cache on second run. You have to design assertions that accept either.
Multi-agent systems need correlation IDs. Without a consistent ID linking all spans from a single conversation, trace analysis becomes impossible.
Token attribution requires custom instrumentation. LLM providers don't give you this. You have to build it yourself.
Trace latency means deferred collection. Tempo takes a few seconds to ingest spans. We defer trace collection until the end of a suite run to ensure spans are available.
The Future of AI Testing
LLM-as-judge evaluations are a crutch, not a solution. They're useful for subjective quality, but they've become a catch-all that skips over the fact that we can test AI systems deterministically - we just haven't been doing it.
Most AI behavior can be tested deterministically. Tool calls, entity mutations, token consumption, conversation flow—these are facts, not opinions.
Observability is the foundation. You can't optimize what you can't measure. You can't debug what you can't see. Invest in tracing, logging, and diagnostics before you need them.
Token economics matter. Tokens are your marginal cost. Know where they're going. Attribute them to code. Optimize ruthlessly.
Enable your AI tools to test themselves. With the right infrastructure, Claude Code can run tests, analyze results, identify root causes, and propose fixes - all by itself. The key is giving it the same tools and methodology you'd use yourself.
What's Next
Agent QA may be open-sourced in the future - I'm still working on making it better and more generic! In the meantime, hopefully these patterns and principles are useful for anyone building AI systems.
The vision is to create tight feedback loops where AI can improve AI. Deterministic scenarios give you reproducible baselines. Diagnostics give you the data. Queryable traces give you the context. And Claude Code skills give you the means to do all this at the speed of thought.
The foundation, however, is good ol' engineering work: observability. The future is an AI system that can leverage this foundation to continuously improve itself.
Build the foundation. Close the loop. And let AI test AI.
If you're building something similar or have thoughts on these patterns, reach out on Twitter/X or via email.