Most teams shipping agentic AI systems are operating on vibes. "It seems to work" is the quality bar. When things break in production, they scramble through logs trying to understand how the agent behaved. Mix in a multi-agent architecture and it suddenly gets even harder.
The industry has converged on LLM-as-judge evaluations as the solution: another LLM rates your agent's responses on metrics such as quality, helpfulness and correctness. This approach, however, has serious limitations that nobody really talks about.
What if 80% of your AI system's behavior could be tested deterministically? What if you could run assertions on tool calls, token consumption, database mutations and conversation flow with the same precision as traditional unit tests?
This is the story of how I built Agent QA - an agent testing framework that brings determinism to the world of non-deterministic AI systems.
The Two Types of Tests in Agentic Systems
Before we dive into tooling, let's establish a mental model. When testing AI systems, you're dealing with two fundamentally different categories of assertions.
Non-Deterministic Tests (LLM Evaluations)
These use another LLM to judge the quality of your AI's responses. You might ask: "On a scale of 1-5, how helpful was this response?" or "Did this answer correctly address the user's question?"
When they're useful:
- Evaluating tone and style
- Assessing creative output quality
- Checking for harmful or inappropriate content
- Subjective quality metrics
The problems:
- Non-reproducible: Run the same eval twice, get different scores
- Expensive: Every eval costs tokens
- Slow: You're waiting on another LLM call
- Hard to debug: When an eval fails, why did it fail?
- Turtles all the way down: Who evaluates the evaluator?
Deterministic Tests (Behavioral Assertions)
These are concrete assertions on observable behavior - things that either happened or didn't and values that either match or don't.
What you can test deterministically:
- Tool calls: Which tools were called, how many times, with what arguments
- Tool outputs: What the tools returned, what fields were present
- Token consumption: Input tokens, output tokens, cache hits, total cost
- Entity mutations: What was created, updated, or deleted in your database
- Conversation flow: Number of LLM turns, which agents were invoked
- Response content: Does it mention expected entities, contain required strings
Here's what this looks like in practice with my Agent QA testing framework:
- chat: "Create a task called 'Review quarterly report' with high priority"
tools:
manageTasks: { min: 1, max: 3 }
created:
- entity: tasks
fields:
title: { containsAny: ["quarterly", "report"] }
priority: high
usage:
inputTokens: { gt: 0, lt: 50000 }
cacheReadTokens: { gt: 0 }This scenario asserts that:
- The manageTasks tool was called 1-3 times
- A task was created with a title containing "quarterly" or "report"
- The task has high priority
- Token consumption was under 50,000
- Prompt caching is working (cache read tokens > 0)
Every single one of these assertions is deterministic. Pass or fail, no ambiguity. And, much like Cucumber test definitions, this scenario is easy for anyone on your team to understand at a glance.
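To make "deterministic" concrete, here's roughly what evaluating these matchers could look like under the hood. The types and matcher set below are a simplified, hypothetical sketch, not Agent QA's actual engine:

```typescript
// Minimal sketch of a deterministic matcher evaluator.
// Hypothetical types - the real matcher vocabulary is richer than this.
type Matcher =
  | { gt?: number; lt?: number }
  | { contains?: string; containsAny?: string[] };

function matches(actual: number | string, matcher: Matcher): boolean {
  if (typeof actual === 'number') {
    const m = matcher as { gt?: number; lt?: number };
    if (m.gt !== undefined && !(actual > m.gt)) return false;
    if (m.lt !== undefined && !(actual < m.lt)) return false;
    return true;
  }
  const m = matcher as { contains?: string; containsAny?: string[] };
  const text = actual.toLowerCase();
  if (m.contains !== undefined && !text.includes(m.contains.toLowerCase())) return false;
  if (m.containsAny !== undefined && !m.containsAny.some((s) => text.includes(s.toLowerCase()))) return false;
  return true;
}

// matches(42_000, { gt: 0, lt: 50000 })                                      // true
// matches('Review quarterly report', { containsAny: ['quarterly', 'report'] }) // true
```

There is no model in the loop here: given the same recorded behavior, the result is always the same.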
The Key Difference
Be careful not to conflate "response sounds good" with "system behaved correctly."
When your agent says "I've created a high-priority task for reviewing the quarterly report" - that's a claim. The deterministic test verifies the reality: Did it actually call the right tool? Did a task actually get created in the database? Is the priority actually set to high?
LLM evaluations might tell you the response was polite and well-structured. Deterministic tests tell you whether your agent actually did its job.
The split should be roughly 80/20. 80% of your test suite can (and should) be deterministic. Save LLM evaluations for the genuinely subjective 20%.
So what is Agent QA?
Agent QA is a YAML-based scenario runner for testing AI agents. It follows a config-first, convention-over-configuration design inspired by tools like Vitest.
Note: Agent QA may be open-sourced in the future. For now, this post shares the patterns and principles that guided its development.
You're more than welcome to reverse-engineer all of this, but I'd greatly appreciate a shoutout if you find it helpful on your journey!
Core Capabilities
Multi-step conversations: Test flows that span multiple user turns, or even multiple separate conversations:
steps:
  - chat: "Create a task to review the report"
    conversation: main
    created:
      - entity: tasks
        as: $reportTask
  - chat: "What tasks do I have?"
    conversation: main
    response:
      mentionsAny: ["report"]
  # Start a brand new conversation
  - chat: "Delete all my tasks"
    conversation: different
Tool assertions: Validate not just that tools were called, but how they were called:
- chat: "Create a high priority task to call mom"
tools:
manageTasks:
count: 1
input:
creates:
title: { contains: "mom" }
priority: highEntity assertions with cross-step references: Capture entities in one step, verify them in another:
steps:
  - chat: "Create a task called 'Review Q4'"
    created:
      - entity: tasks
        as: $myTask
        fields:
          title: { contains: "Q4" }
  - chat: "Mark my task as complete"
  # ...later...
  - verify:
      tasks:
        - id: { ref: $myTask.id }
          fields:
            status: completed
Token assertions with compound matchers: Handle the inherent non-determinism of caching:
usage:
  inputTokens: { gt: 0, lt: 50000 }
  outputTokens: { gt: 10, lt: 5000 }
  # Cache might be cold OR warm - either is acceptable
  anyOf:
    - cacheCreationTokens: { gt: 0 } # Cold cache: tokens being cached
    - cacheReadTokens: { gt: 0 }     # Warm cache: tokens read from cache
Design Philosophy
1. Scenarios should be readable by non-engineers. Product managers should be able to understand what's being tested.
2. Express intent, not implementation. The YAML describes what should happen, not how to make it happen.
3. Infrastructure is isolated and self-managed. Agent QA spins up its own infra, including a Postgres container, an API server and a reverse proxy for testing AWS SQS end-to-end. After each test scenario or suite, Agent QA tears everything down - no shared state between tests that can pollute validation.
Token Economics
Tokens are your marginal cost. Every conversation has a price and small inefficiencies compound into real money at scale.
Why Token Consumption Matters
Consider Anthropic's prompt caching: cached tokens cost ~90% less than uncached tokens. If you're sending 100,000 tokens per request and only 10% are being cached, you're leaving massive savings on the table.
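To make that concrete, here's a back-of-the-envelope calculation. The prices are illustrative (roughly in line with published Claude Sonnet input rates at the time of writing); plug in your provider's actual numbers:

```typescript
// Back-of-the-envelope cost per request at different cache hit rates.
// Illustrative prices: ~$3 per million uncached input tokens,
// ~$0.30 per million cache-read tokens (the ~90% discount).
const UNCACHED_PER_TOKEN = 3 / 1_000_000;
const CACHED_PER_TOKEN = 0.3 / 1_000_000;

function costPerRequest(inputTokens: number, cacheHitRate: number): number {
  const cached = inputTokens * cacheHitRate;
  const uncached = inputTokens - cached;
  return uncached * UNCACHED_PER_TOKEN + cached * CACHED_PER_TOKEN;
}

console.log(costPerRequest(100_000, 0.1).toFixed(3)); // ~$0.273 per request
console.log(costPerRequest(100_000, 0.9).toFixed(3)); // ~$0.057 per request - nearly 5x cheaper
```

At thousands of requests per day, the gap between a 10% and a 90% cache hit rate is a line item on your bill, not a rounding error.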
But here's the problem - LLM providers only give you totals. You see aggregate input tokens, output tokens and cache hit/read metrics but that's not always actionable.
What You Actually Need to Measure
Per-turn breakdown:
- How many tokens did this specific user message consume?
- How did token consumption grow across a 6-turn conversation?
- Which turn caused the token budget to spike?
Per-agent breakdown (for multi-agent systems):
- Is the router agent consuming half your tokens just to decide which specialist to route to?
- Are some agents dramatically more expensive than others?
Token attribution to code:
- How many tokens does your system prompt consume?
- How expensive are your tool definitions?
- What's the token cost of the error messaging you added?
The Missing Layer: Token Attribution
LLM providers won't tell you that your TaskManageSchema Zod definition costs 847 tokens. They won't tell you that your system prompt is over 12,000 tokens. They won't tell you that the error messages you're stuffing into sub-agent tool results are doubling your conversation cost.
You have to build this instrumentation yourself.
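If you want to roll your own version of this, a rough approach is to serialize each Zod schema to the JSON Schema text the provider actually receives, then count tokens on that text. The sketch below uses zod-to-json-schema and js-tiktoken as stand-ins (an OpenAI tokenizer only approximates Anthropic's tokenization), and the schema itself is a made-up example, not Agent QA's real TaskManageSchema:

```typescript
// Rough estimate of a tool schema's token cost: serialize the Zod schema to
// JSON Schema (roughly what the provider sees) and tokenize that text.
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
import { getEncoding } from 'js-tiktoken';

// Hypothetical example schema for illustration only.
const TaskManageSchema = z.object({
  action: z.enum(['create', 'update', 'complete', 'delete']),
  title: z.string().describe('Task title shown to the user'),
  priority: z.enum(['low', 'medium', 'high']).optional(),
  dueDate: z.string().datetime().optional().describe('ISO 8601 due date'),
});

const jsonSchema = JSON.stringify(zodToJsonSchema(TaskManageSchema), null, 2);
const encoder = getEncoding('cl100k_base'); // proxy tokenizer; exact counts vary by model
console.log(
  `TaskManageSchema ≈ ${encoder.encode(jsonSchema).length} tokens, ${jsonSchema.length} bytes`,
);
```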
Agent QA includes a schema-tokens command that analyzes the token cost of your Zod schemas:
agentqa schema-tokens ./src/tools/schemas.ts --sort tokens
Schema Token Analysis (claude-haiku-4-5)
| Schema Name | Tokens | Size |
|---|---|---|
| TaskManageSchema | 847 | 3.2 KB |
| TaskInputSchema | 523 | 2.1 KB |
| ReminderSchema | 412 | 1.8 KB |
| Total | 1,782 | 7.1 KB |
Now you and your preferred coding agent (ideally Claude Code with an agent-optimization skill) know exactly where to optimize.
The Optimization Workflow
1. Run a scenario with diagnostics:
   agentqa run suite.yaml --id test-001 --save-diagnostics
2. Analyze consumption:
   agentqa analyze-tokens ./diagnostics-output/test-001/*/http-responses.json --per-agent

   Token Consumption Analysis (6 turns)

   | Metric | Value |
   |---|---|
   | Input Tokens | 124,582 |
   | Output Tokens | 1,440 |
   | Total Tokens | 126,022 |
   | Cache Hit Rate | 85.7% |

   Per-Agent Breakdown

   | Agent | Input | Output | Calls | % |
   |---|---|---|---|---|
   | router-agent | 68,040 | 804 | 18 | 54.6% |
   | tasks-agent | 57,582 | 606 | 12 | 46.2% |

3. Identify the problem: Router agent is consuming 54% of tokens just to route. That's a red flag.
4. Analyze tool definitions:
   agentqa schema-tokens ./src/agents/router/tools.ts
5. Make targeted changes: Trim the router's schema, reduce context, optimize.
6. Re-run to validate: Same scenario, measure the delta.
A/B Testing Across Models and Configurations
Different models have different cost/latency/quality tradeoffs. The same model with different prompts produces different behavior. You need reproducible scenarios to compare fairly.
What to Measure in A/B Tests
| Metric | Why It Matters |
|---|---|
| Token consumption | Direct cost comparison |
| Latency | User experience |
| Tool call patterns | Does one model call more tools? |
| Cache efficiency | Different models may cache differently |
| Pass/fail rate | Do deterministic assertions pass? |
How Agent QA Enables This
YAML scenarios are model-agnostic. The same test-001-create-task.yaml runs against Claude, GPT-4, or any model your API supports. Switch models via config or environment variables:
// agentqa.config.ts
export default defineConfig({
  agent: {
    baseUrl: '$API_URL',
    model: process.env.TEST_MODEL || 'claude-sonnet-4-20250514',
  },
  // ...
});
Run your entire suite against two models:
TEST_MODEL=claude-sonnet-4-20250514 agentqa run suite.yaml --tag smoke --save-diagnostics
TEST_MODEL=claude-haiku-4-5 agentqa run suite.yaml --tag smoke --save-diagnostics
Now you have comparable diagnostics for both. Same scenarios, same user inputs, different models—apples to apples.
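From there, a throwaway comparison script is enough to quantify the difference. The diagnostics shape below ({ steps: [{ usage }] }) is a simplified, hypothetical stand-in; adapt the field access to whatever your http-responses.json actually contains:

```typescript
// Compare token totals from two diagnostics files, one per model.
// The file shape here is hypothetical - adjust to your own diagnostics format.
import { readFileSync } from 'node:fs';

interface StepUsage { inputTokens: number; outputTokens: number }
interface Diagnostics { steps: { usage: StepUsage }[] }

function totals(path: string) {
  const diag: Diagnostics = JSON.parse(readFileSync(path, 'utf8'));
  return diag.steps.reduce(
    (acc, s) => ({
      input: acc.input + s.usage.inputTokens,
      output: acc.output + s.usage.outputTokens,
    }),
    { input: 0, output: 0 },
  );
}

// Usage (hypothetical paths): node compare.ts sonnet/http-responses.json haiku/http-responses.json
const [sonnet, haiku] = process.argv.slice(2).map(totals);
console.table({ sonnet, haiku });
console.log(`Input token delta: ${(((haiku.input - sonnet.input) / sonnet.input) * 100).toFixed(1)}%`);
```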
Hallucination Testing via Multi-Run Analysis
Hallucinations are a pain and single-run tests can easily miss them.
Your agent says "Done! I've deleted the task." but it never actually called the delete tool. The response sounds confident, helpful and even cheerful - but the system is lying to you. This happens more often than you'd think, especially with smaller models or complex multi-step operations.
The worst part is that a single test run might pass. The agent usually calls the tool. But one time in five? One in ten? It hallucinates the action without actually calling the right tool or sub-agent. Users end up thinking their task was deleted when it's still sitting in the database.
Multi-Run Testing
The solution is statistical: run the same scenario multiple times and look for variance.
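How many runs is enough? If a hallucination occurs on a given run with probability p, the chance that at least one of n runs surfaces a failure is 1 - (1 - p)^n. A quick sanity check:

```typescript
// Probability of observing at least one failure in n runs,
// given a per-run hallucination rate p.
const detectionProbability = (p: number, n: number) => 1 - Math.pow(1 - p, n);

console.log(detectionProbability(0.2, 5).toFixed(2));  // ~0.67 - 5 runs catch a 1-in-5 bug two-thirds of the time
console.log(detectionProbability(0.2, 10).toFixed(2)); // ~0.89
console.log(detectionProbability(0.1, 10).toFixed(2)); // ~0.65 - rarer bugs need even more runs
```

Five runs only catch a 1-in-5 bug about two-thirds of the time, which is why critical paths deserve 10 or more.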
agentqa run suite.yaml --id test-001 --runs 5
Agent QA tracks pass/fail rates per step across all runs:
Running scenario 5 times: Delete task flow
────────────────────────────────────────────────────────────
[Run 1/5]
✓ Run 1: PASSED (56040ms)
[Run 2/5]
✗ Run 2: FAILED (52890ms)
[Run 3/5]
✓ Run 3: PASSED (54200ms)
...
════════════════════════════════════════════════════════════
Multi-Run Summary
════════════════════════════════════════════════════════════
Scenario: Delete task flow
Total Runs: 5
Pass Rate: 80.0%
✓ Passed: 4
✗ Failed: 1
⚠ FLAKY: This scenario passes sometimes and fails sometimes
Hallucination Detection Logic
Agent QA specifically detects hallucinations by correlating two signals:
- Response text contains action keywords: "deleted", "created", "updated", "completed", etc.
- Tool assertion failed: The expected tool call didn't happen
When both conditions are true, it's flagged as a hallucination—the agent claimed to do something it didn't actually do.
Hallucination Detection
────────────────────────────────────────────────────────────
⚠ turn-5-delete: 20.0% hallucination rate
Occurred in 1 of 5 runs
Missing tools: manageTasks
Response snippet: "bye bye presentation slides task 👋 deleted!"
That response snippet is damning evidence. The agent said "deleted!" with a cheerful emoji, but manageTasks was never called. The task is still in the database. The user has been deceived.
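A stripped-down version of that correlation check might look like the following. This is a hypothetical sketch, not Agent QA's actual detector, which also tracks which step the claim appeared in and which assertion failed:

```typescript
// Flag a likely hallucination when the response claims an action
// but the corresponding tool assertion found no matching tool call.
// Intentionally naive - keyword lists and matching are simplified.
const ACTION_KEYWORDS = ['deleted', 'created', 'updated', 'completed', 'done'];

interface StepResult {
  responseText: string;
  missingTools: string[]; // tools the assertions expected but never saw called
}

function isLikelyHallucination(step: StepResult): boolean {
  const text = step.responseText.toLowerCase();
  const claimsAction = ACTION_KEYWORDS.some((kw) => text.includes(kw));
  return claimsAction && step.missingTools.length > 0;
}

// isLikelyHallucination({
//   responseText: 'bye bye presentation slides task 👋 deleted!',
//   missingTools: ['manageTasks'],
// }) // => true: the response claims "deleted" but manageTasks was never called
```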
Why This Matters
Hallucinations are fundamentally different from other failures:
| Failure Type | User Experience | Detection Difficulty |
|---|---|---|
| Tool error | User sees error message | Easy - explicit failure |
| Wrong tool called | Unexpected behavior | Medium - assertion catches it |
| Hallucination | User thinks action succeeded | Hard - response looks correct |
Hallucinations are the most dangerous because they're invisible to the user. Everything looks fine. The agent confirmed success. Only deterministic assertions on actual tool calls reveal the lie.
Building Hallucination-Resistant Systems
Once you can detect hallucinations, you can reduce them:
- Identify flaky scenarios: Run your critical paths with --runs 5 or --runs 10
- Pinpoint problematic steps: The multi-run summary shows exactly which steps are unreliable
- Analyze patterns: Do hallucinations correlate with conversation length? Token consumption? Specific tools?
- Strengthen prompts: Add explicit instructions like "You MUST call the tool to perform this action - never claim success without calling tools"
- Validate and iterate: Re-run with multiple iterations to confirm hallucination rate dropped
The goal isn't 100% pass rate on a single run - it's consistent behavior across many runs. A scenario that passes 100% of the time over 10 runs is far more trustworthy than one you only ran once.
Observability: The Foundation of Everything
You can't build any of this without deep observability. Token analysis, failure debugging, A/B testing, AI self-improvement—all of it depends on being able to see exactly what happened.
Three Key Components
1. Traces (OpenTelemetry)
Hierarchical view of what happened:
chat-request (45.2s)
├── router-agent (12.1s)
│ ├── llm-call (8.2s) → tokens: 24,000 in, 200 out
│ └── llm-call (3.9s) → tokens: 24,200 in, 150 out
└── tasks-agent (33.1s)
├── llm-call (10.4s) → tokens: 18,000 in, 400 out
├── tool-call: manageTasks (0.2s)
    └── llm-call (22.5s) → tokens: 19,500 in, 350 out
This tells you where time and tokens are being spent. The tasks-agent's second LLM call took 22.5 seconds—why? You can drill into that span.
2. Logs
Structured API logs, tmux session capture, error aggregation. When something fails, you need the server's perspective, not just the test runner's.
3. Diagnostics
On every failure (or when explicitly requested), Agent QA saves:
diagnostics-output/test-001/2026-01-06T10-30-00/
├── http-responses.json # Per-step token breakdowns, tool calls
├── tempo-traces.json # Raw OpenTelemetry spans
├── tmux-logs.txt # Server logs around the failure
└── failure.json          # Error details, stack traces
Everything you need to understand what happened, in one directory, ready to be analyzed.
What To Instrument
Every LLM call gets a span with:
- Model name and parameters
- Full request (messages, tools, system prompt)
- Full response
- Token counts (input, output, cache creation, cache read)
- Duration
- Correlation ID linking to the conversation
Every tool invocation gets a span with:
- Tool name
- Input arguments
- Output result
- Duration
- Errors if any
This instrumentation isn't optional overhead—it's what makes everything else possible.
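In practice, that means wrapping the provider call itself. Here's a minimal sketch using the OpenTelemetry JS API and the Anthropic SDK; the ai.* attribute names are my own convention rather than a standard, and error handling is reduced to the essentials:

```typescript
// Sketch: wrap every LLM call in a span that records token usage.
import { trace, SpanStatusCode } from '@opentelemetry/api';
import Anthropic from '@anthropic-ai/sdk';

const tracer = trace.getTracer('my-agent');
const anthropic = new Anthropic();

async function llmCall(conversationId: string, messages: Anthropic.Messages.MessageParam[]) {
  return tracer.startActiveSpan('llm-call', async (span) => {
    span.setAttribute('ai.correlation_id', conversationId);
    span.setAttribute('ai.model', 'claude-sonnet-4-20250514');
    try {
      const response = await anthropic.messages.create({
        model: 'claude-sonnet-4-20250514',
        max_tokens: 1024,
        messages,
      });
      // Token counts are what make per-turn and per-agent analysis possible later.
      span.setAttribute('ai.tokens.input', response.usage.input_tokens);
      span.setAttribute('ai.tokens.output', response.usage.output_tokens);
      span.setAttribute('ai.tokens.cache_read', response.usage.cache_read_input_tokens ?? 0);
      span.setAttribute('ai.tokens.cache_creation', response.usage.cache_creation_input_tokens ?? 0);
      return response;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```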
Custom Telemetry Tools
For more fine-grained analysis, you should build custom tools for yourself, your team and your coding agents. Here are a few that I found quite helpful:
traces CLI: Query OpenTelemetry data from Tempo:
# Find all traces for a conversation
pnpm traces search --correlation conv_abc123 --fetch
# Get a specific trace
pnpm traces get abc123def456 --format tree
# Recent traces from a service
pnpm traces recent --service pocketcoach-api --since 6h
Diagnostics writer: Auto-saves on failure, aggregates HTTP responses with Tempo traces and server logs.
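Under the hood, a CLI like this is mostly a thin wrapper over Tempo's HTTP API. A minimal sketch, assuming the standard /api/search and /api/traces/{traceId} endpoints and that spans carry a searchable correlation-ID tag (as in the earlier instrumentation sketch):

```typescript
// Minimal Tempo client: find traces by correlation ID, then fetch one by ID.
const TEMPO_URL = process.env.TEMPO_URL ?? 'http://localhost:3200';

async function searchByCorrelationId(correlationId: string) {
  const params = new URLSearchParams({
    tags: `ai.correlation_id=${correlationId}`, // assumes this attribute is indexed as a tag
    limit: '20',
  });
  const res = await fetch(`${TEMPO_URL}/api/search?${params}`);
  if (!res.ok) throw new Error(`Tempo search failed: ${res.status}`);
  return (await res.json()).traces ?? [];
}

async function getTrace(traceId: string) {
  const res = await fetch(`${TEMPO_URL}/api/traces/${traceId}`);
  if (!res.ok) throw new Error(`Tempo fetch failed: ${res.status}`);
  return res.json();
}

// const traces = await searchByCorrelationId('conv_abc123');
// const full = await getTrace(traces[0].traceID);
```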
Closing the Loop: Enabling AI to Test Itself
With all of this infrastructure in place (deterministic scenarios, rich diagnostics, queryable traces and token attribution), you can enable Claude Code (or any AI coding assistant) to test and improve the system autonomously.
What Claude Code Should Be Able to Do
- Run tests: Execute specific scenarios and get structured pass/fail results
- Analyze token consumption: Parse diagnostics to identify expensive agents or tools
- Query traces: Search for specific conversation patterns or errors
- Identify root causes: Correlate high token usage with specific code paths
- Propose fixes: Suggest prompt optimizations, schema reductions, or architectural changes
- Validate fixes: Re-run scenarios to confirm improvements
- Build a custom Claude Code skill: Create a specialized skill with the proper workflow, context engineering principles, and standard operating procedures for using Agent QA—essentially encoding the entire optimization methodology into a reusable directive
To really optimize this process you absolutely need a Claude Code skill, which is effectively a markdown file that provides Claude with domain-specific context and instructions. For Agent QA, this skill should include:
- How to run specific scenarios (never run full suites—always filter)
- How to interpret diagnostics output
- The optimization workflow (run → analyze → identify → fix → validate)
- Key principles: favor deterministic assertions, attribute tokens to code, check cache efficiency
- Common pitfalls and how to avoid them
With this skill, Claude Code doesn't just have tools - it has a workflow to follow. The skill should empower your coding agent to know all the right questions to ask, where to find your telemetry, how to simulate agents, how to detect hallucinations and how to effectively optimize every layer of your agent stack without ambiguity.
The Self-Improvement Loop
1. Claude Code runs a scenario: agentqa run suite.yaml --id test-001 --save-diagnostics
2. It fails or exceeds the token budget: Maybe the assertion inputTokens: { lt: 50000 } fails with an actual value of 78,000.
3. Claude reads diagnostics: Parses http-responses.json to see per-turn and per-agent breakdowns.
4. Identifies the expensive agent: Router agent is consuming 54% of tokens, which seems excessive for routing logic.
5. Analyzes tool definitions: agentqa schema-tokens ./src/agents/router/tools.ts finds a 2,000 token schema that could be trimmed.
6. Proposes a change: Suggests removing verbose descriptions from the schema, or splitting into smaller schemas loaded conditionally.
7. Re-runs to validate: Same scenario, checks that tokens dropped below 50,000.
8. Repeats until passing: The loop continues until all assertions pass.
And just like that you have complete self-play through pure engineering - no vibe testing or vibe coding. You're providing the AI with the same tools, information, and methodology that a human engineer would use.
The main difference now is speed and reliability. Because you've created a skill that emulates how you would think about agent testing and optimization, your coding agent can fly through everything at the speed of thought: it can detect a failing test, fix it, and add new tests to prevent regressions, all far faster than you could.
Practical Implementation Guide
If you want to build something similar, here's where to start.
1. Start with Your Test User
Create an isolated test environment:
// agentqa.config.ts
export default defineConfig({
  database: {
    url: '$TEST_DATABASE_URL', // Separate from dev/prod
    entities: [
      { table: schema.tasks, name: 'tasks', titleColumn: 'title' },
    ],
  },
  hooks: {
    beforeEach: async (scenario) => {
      // Clean slate for every scenario
      await db.delete(tasks).where(eq(tasks.userId, TEST_USER_ID));
    },
  },
});
2. Write Your First Scenario
Start with the simplest CRUD operation:
id: test-001-create-task
name: "Create a simple task"
tags: [smoke, tasks]
steps:
  - chat: "Create a task to buy groceries"
    tools:
      manageTasks: 1
    created:
      - entity: tasks
        fields:
          title: { contains: "groceries" }
3. Add Token Assertions
Once basic scenarios pass, add usage assertions:
- chat: "Create a task to buy groceries"
tools:
manageTasks: 1
usage:
inputTokens: { lt: 30000 }
outputTokens: { lt: 1000 }4. Layer in Diagnostics
Configure automatic capture:
diagnostics: {
  tmux: { sessionName: 'api-server' },
  tempo: { url: '$TEMPO_URL' },
  outputDir: './diagnostics-output',
},
5. Build the Observability Stack
Docker Compose makes this straightforward:
services:
  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200" # API
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3031:3000"
Instrument your API to send traces:
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('my-agent');

async function handleChat(conversationId: string, message: string) {
  return tracer.startActiveSpan('chat-request', async (span) => {
    span.setAttribute('ai.correlation_id', conversationId);
    try {
      // ... your agent logic
    } finally {
      span.end(); // end the span even if the agent logic throws
    }
  });
}
6. Create the Claude Code Skill
Document the workflow (mine is much longer - just sharing as a reference point):
# Agent QA Testing Skill
## When to Use
Use this skill when testing, debugging, or optimizing AI agents.
## Key Commands
- `agentqa run suite.yaml --id <id>` - Run a specific scenario
- `agentqa run suite.yaml --id <id> --save-diagnostics` - Run with diagnostics
- `agentqa analyze-tokens ./diagnostics-output/<id>/*/http-responses.json` - Analyze consumption
## Workflow
1. Run the scenario with `--save-diagnostics`
2. If it fails, read `failure.json` and `http-responses.json`
3. Identify high token consumers using `--per-agent` flag
4. Analyze tool definitions with `schema-tokens`
5. Make targeted changes
6. Re-run to validate
## Principles
- Always filter scenarios (never run full suites)
- Favor deterministic assertions over LLM evaluations
- Attribute token costs to specific code
- Validate caching is working (cache read tokens > 0)
Lessons Learned
What Worked
YAML scenarios are surprisingly expressive. Non-engineers can read them, engineers can write them quickly, and they version control cleanly.
Compound matchers handle non-determinism. The anyOf/allOf pattern for cache assertions was a breakthrough—it acknowledges that some aspects are non-deterministic while still providing meaningful assertions.
Per-agent token tracking revealed optimization opportunities. We discovered our router agent was consuming 50%+ of tokens just to make routing decisions. That's a concrete, actionable insight you'd never get from aggregate totals.
Isolated test environments prevent flakiness. Every scenario gets a clean database. No shared state means no mysterious failures.
What Was Hard
Cache behavior is inherently non-deterministic. The same scenario might have a cold cache on first run and warm cache on second run. You have to design assertions that accept either.
Multi-agent systems need correlation IDs. Without a consistent ID linking all spans from a single conversation, trace analysis becomes impossible.
Token attribution requires custom instrumentation. LLM providers don't give you this. You have to build it yourself.
Trace latency means deferred collection. Tempo takes a few seconds to ingest spans. We defer trace collection until the end of a suite run to ensure spans are available.
The Future of AI Testing
LLM-as-judge evaluations are a crutch, not a solution. They're useful for subjective quality, but they've become a catch-all that skips over the fact that we can test AI systems deterministically - we just haven't been doing it.
Most AI behavior can be tested deterministically. Tool calls, entity mutations, token consumption, conversation flow—these are facts, not opinions.
Observability is the foundation. You can't optimize what you can't measure. You can't debug what you can't see. Invest in tracing, logging, and diagnostics before you need them.
Token economics matter. Tokens are your marginal cost. Know where they're going. Attribute them to code. Optimize ruthlessly.
Enable your AI tools to test themselves. With the right infrastructure, Claude Code can run tests, analyze results, identify root causes, and propose fixes - all by itself. The key is giving it the same tools and methodology you'd use yourself.
What's Next
Agent QA may be open-sourced in the future - I'm still working on making it better and more generic! In the meantime, hopefully these patterns and principles are useful for anyone building AI systems.
The vision is to create tight feedback loops where AI can improve AI. Deterministic scenarios give you reproducible baselines. Diagnostics give you the data. Queryable traces give you the context. And Claude Code skills give you the means to do all this at the speed of thought.
The foundation, however, is good ol' engineering work: observability. The future is an AI system that can leverage this foundation to continuously improve itself.
Build the foundation. Close the loop. And let AI test AI.
If you're building something similar or have thoughts on these patterns, reach out on Twitter/X or via email.