
Core Concepts

This page explains the foundational ideas behind AgentEval and how evaluations are structured.

Why Evaluate AI Agents?

AI agents are inherently non-deterministic. The same prompt can produce different outputs across runs, and quality can vary significantly depending on model parameters, prompt phrasing, and external API behavior. Manual spot-checking does not scale and misses reliability problems that only surface over many runs.

Systematic evaluation addresses these challenges:

  • Quantify reliability -- measure how often an agent produces correct results, not just whether it can
  • Track regressions -- detect quality drops when prompts, models, or configurations change
  • Compare configurations -- make data-driven decisions about model selection, temperature settings, and prompt design
  • Enforce quality gates -- block deployments that fall below minimum pass rates

Key Terminology

Suite

An evaluation suite is a complete configuration that defines which agent to test, what tasks to run, and how to grade results. A suite is defined in a single YAML file (typically eval.yaml) and references one or more task files.

Task

A task represents a single evaluation case with an input prompt and expected output. Tasks are defined in YAML files and referenced from the suite via glob patterns. Each task has a unique ID, a prompt to send to the agent, and expected output for grading.

Trial

Each task is run multiple times (configurable via trials_per_task). Each run is called a trial. Multiple trials enable statistical reliability measurements -- a single pass tells you the agent can answer correctly, but multiple trials tell you how often it does.

Grader

A grader evaluates an agent's output against expected results. Each task can have one or more graders, each producing a score (0.0-1.0) and a pass/fail status. The 8 built-in graders are:

Grader        Purpose
exact_match   Exact string comparison (with optional case/whitespace normalization)
contains      Checks for required keywords in the output
regex         Matches output against a regular expression
json_match    Compares JSON structures with configurable path matching
command       Runs an external command to grade output
llm           Uses an LLM as a judge with a rubric
pairwise      Compares two outputs using an LLM to pick the better one
constraint    Validates structural constraints (word count, length, format)
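To make the grader contract concrete, here is a minimal sketch of a contains-style grader in Python. The `GradeResult` type, function name, and parameters are illustrative assumptions, not AgentEval's actual API; the point is the shape of the contract: every grader yields a score in [0.0, 1.0] plus a pass/fail status.

```python
from dataclasses import dataclass

@dataclass
class GradeResult:
    score: float  # 0.0 to 1.0
    passed: bool

def contains_grader(output: str, keywords: list[str],
                    case_sensitive: bool = False) -> GradeResult:
    """Score = fraction of required keywords found; pass only if all are found."""
    haystack = output if case_sensitive else output.lower()
    hits = sum(
        1 for kw in keywords
        if (kw if case_sensitive else kw.lower()) in haystack
    )
    score = hits / len(keywords) if keywords else 1.0
    return GradeResult(score=score, passed=hits == len(keywords))
```

Note that a partial keyword match still produces a fractional score even though the trial fails, which is exactly the distinction the scoring logic below relies on.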

Evaluation Flow

The evaluation follows a well-defined pipeline:

  1. Load configuration -- Parse the YAML suite file, expand environment variables, apply defaults
  2. Create agent -- Instantiate the agent adapter (OpenAI, Anthropic, HTTP, or Command)
  3. Build work items -- Combine tasks and trial counts into a flat list of work items
  4. Load checkpoint (optional) -- Skip already-completed trials if resuming an interrupted run
  5. Run trials concurrently -- Execute work items using a concurrent scheduler with rate limiting
  6. Grade each trial -- Apply all configured graders to each trial's output
  7. Aggregate results -- Compute per-task and suite-level statistics (pass rates, scores, pass@k, pass^k)
  8. Generate reports -- Output results as terminal table, JSON, and/or HTML
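Step 3 of the pipeline can be sketched as follows. The `WorkItem` type and function name are hypothetical, but they illustrate the idea: crossing each task with its trial count produces a flat list of independent units that the concurrent scheduler in step 5 can pick up in any order.

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkItem:
    task_id: str
    trial_index: int

def build_work_items(task_ids: list[str], trials_per_task: int) -> list[WorkItem]:
    """Flatten tasks x trials into independently schedulable work items."""
    return [
        WorkItem(task_id=t, trial_index=i)
        for t, i in itertools.product(task_ids, range(trials_per_task))
    ]
```

Because each item carries its own (task, trial) identity, a checkpoint file from step 4 only needs to record completed pairs to make resumption straightforward.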

Scoring Logic

The scoring system uses a multi-grader model with AND-based pass logic:

  • Each grader produces a score (0.0 to 1.0) and a pass/fail status
  • A trial's final score is the weighted average across all graders
  • A trial passes only if ALL graders pass (AND logic)
  • This means a single failing grader causes the whole trial to fail, regardless of scores

This design ensures that evaluation criteria are strictly enforced. For example, if a response is factually correct but exceeds a word limit, it still fails.
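The scoring rules above can be sketched as a small combining function. This is an illustrative reimplementation, not AgentEval's internal code; grades are assumed here to arrive as (score, passed, weight) triples.

```python
def combine_grades(grades: list[tuple[float, bool, float]]) -> tuple[float, bool]:
    """Combine per-grader results for one trial.

    Each grade is (score, passed, weight). The trial's score is the
    weighted average of grader scores; the trial passes only if
    every grader passed (AND logic).
    """
    total_weight = sum(w for _, _, w in grades)
    score = sum(s * w for s, _, w in grades) / total_weight if total_weight else 0.0
    passed = all(p for _, p, _ in grades)
    return score, passed
```

This makes the word-limit example concrete: a factually correct answer might average a high score across graders, yet a single failing constraint grader flips the trial to fail.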

pass@k and pass^k

These two complementary metrics capture different aspects of agent performance.

pass@k (Capability Ceiling)

The probability of getting at least one correct answer in k attempts. This is computed using the formula:

pass@k = 1 - C(n-c, k) / C(n, k)

where n is the total number of trials and c is the number of passing trials. This measures what the agent can do at its best -- given enough tries, how likely is it to succeed at least once?

pass^k (Reliability)

The probability of getting all k attempts correct. This is computed using the formula:

pass^k = C(c, k) / C(n, k)

This measures how dependably the agent performs. High pass^k indicates consistent, production-ready behavior.

Example

If 7 out of 10 trials pass:

  • pass@1 = 0.70 -- 70% chance a single attempt succeeds
  • pass@3 = 0.99 -- 99% chance at least one of three attempts succeeds
  • pass^3 = 0.29 -- 29% chance all three attempts succeed
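The two formulas above translate directly into code; a minimal sketch (function names are mine, not AgentEval's):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled trials passes), given n trials, c passing."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k sampled trials pass), given n trials, c passing."""
    if c < k:
        return 0.0  # not enough passing trials to fill k slots
    return comb(c, k) / comb(n, k)
```

Running these with n = 10 and c = 7 reproduces the example figures: pass@1 = 0.70, pass@3 ≈ 0.99, and pass^3 ≈ 0.29.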

The gap between pass@k and pass^k reveals reliability issues. A high pass@3 combined with a low pass^3 means the agent is capable but inconsistent -- it usually gets it right, but not always. For production systems that need every request to succeed, pass^k is the metric that matters.

TIP

AgentEval uses log-space arithmetic for pass@k and pass^k computation to avoid numerical overflow when working with large trial counts.
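One way such a log-space computation can be sketched (this is a generic technique, not AgentEval's actual implementation): compute ln C(n, k) with the log-gamma function and exponentiate only the final ratio, so no huge intermediate binomial coefficients are ever materialized.

```python
from math import lgamma, exp

def log_comb(n: int, k: int) -> float:
    """ln C(n, k) via lgamma; stable even for very large n."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def pass_at_k_logspace(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k), evaluated as exp of a log-ratio."""
    if n - c < k:
        return 1.0
    return 1.0 - exp(log_comb(n - c, k) - log_comb(n, k))
```

Python's arbitrary-precision integers make exact `math.comb` feasible, but for very large trial counts the log-space form stays in fixed-size floats and remains fast.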

Next Steps

  • Follow the Quick Start to run your first evaluation
  • See Examples for practical evaluation configurations
  • Read Advanced Usage for CI/CD integration and custom extensions