AgentEval

Define evaluations in YAML and make decisions with data. Featuring pass@k reliability metrics, 8 built-in graders, and 4 agent adapters for simple, reliable, and reproducible AI agent evaluation.

$ curl -fsSL https://raw.githubusercontent.com/wallezhang/agent-eval/main/install.sh | bash
$ agent-eval run -c eval.yaml

Core Features

A complete toolkit designed for AI agent evaluation

pass@k / pass^k Metrics

Measure capability ceiling and reliability with statistically rigorous metrics. Log-space arithmetic prevents overflow for large sample sizes.
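These two metrics have standard definitions. With n trials per task, of which c pass, pass@k is the usual unbiased estimator of "at least one of k samples passes", and pass^k is commonly estimated as the per-trial pass rate raised to the k-th power (the probability that all k independent trials pass):

```latex
\text{pass@}k \;=\; 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}},
\qquad
\text{pass}^k \;\approx\; \left(\frac{c}{n}\right)^{k}
```

The binomial coefficients overflow quickly for large n, which is where the log-space arithmetic comes in: each coefficient can be evaluated as ln C(n,k) = ln Γ(n+1) − ln Γ(k+1) − ln Γ(n−k+1) and the ratio computed as a difference of logs before exponentiating.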

8 Built-in Graders

exact_match, contains, regex, json_match, command, llm, pairwise, constraint — from simple string checks to LLM-as-judge.
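Graders are declared as a type plus a config block, as the defaults section of the example config on this page shows for contains. Extrapolating from that shape, a regex grader might be configured like this (the pattern key is an assumption based on the grader's name, not confirmed by the docs):

```yaml
graders:
  - type: regex
    config:
      pattern: "^-?\\d+$"   # illustrative: pass if the output is a bare integer
```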

4 Agent Adapters

Native OpenAI, Anthropic, HTTP, and Command adapters. Registry pattern makes adding custom adapters a single-file change.
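Switching adapters is a matter of changing agent.type. A command adapter wrapping a local script might look like this sketch (the command key is assumed from the adapter's name, not taken from the project docs):

```yaml
agent:
  type: command
  config:
    command: ./run-my-agent.sh   # hypothetical key: script invoked once per trial
```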

Token / Cost Tracking

Automatic token extraction with cost estimation. P50/P90/P99 latency percentiles for SLA assessment.

Cache & Checkpoints

File-based response caching avoids redundant API calls. Checkpoint resume picks up interrupted evaluations seamlessly.

CI/CD Integration

Use --fail-under to gate merges on pass rate. JSON output and summary files for automated pipeline processing.
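In a pipeline, the gate is a single flag on the run command. A minimal GitHub Actions step might look like this (only --fail-under itself comes from the feature list above; the 0.90 threshold and workflow context are illustrative):

```yaml
# Illustrative GitHub Actions step; assumes agent-eval is already installed.
- name: Gate merge on pass rate
  run: agent-eval run -c eval.yaml --fail-under 0.90
```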

Evaluate in Three Steps

Define config → Write tasks → Run evaluation — it's that simple

01
Configure
Define your agent, graders, and execution settings in YAML
02
Define Tasks
Write evaluation tasks with expected outputs and custom graders
03
Run & Analyze
Execute evaluations and get detailed reports with reliability metrics
name: "coding-agent-eval"

agent:
  type: openai
  config:
    model: gpt-4
    api_key: ${OPENAI_API_KEY}
    temperature: 0.0

defaults:
  trials_per_task: 3
  graders:
    - type: contains
      config:
        ignore_case: true

execution:
  concurrency: 4
  rate_limit_rps: 5
  timeout: 120s

output:
  format: all
  dir: ./results
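The config above covers step 01. A task file for step 02 might look like the following sketch (the task schema shown here, including the id, prompt, and expected fields, is an assumption, not taken from the project docs):

```yaml
# Hypothetical task file; field names are illustrative.
tasks:
  - id: reverse-string
    prompt: "Write a Python function that reverses a string."
    expected: "def"          # graded with the default `contains` grader
  - id: iso-date
    prompt: "Print today's date in ISO 8601 format."
    graders:                 # per-task grader override
      - type: regex
        config:
          pattern: "\\d{4}-\\d{2}-\\d{2}"
```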
8
Built-in Graders
4
Agent Adapters
3
Report Formats
0
CGO Dependencies

How It Works

From YAML config to visual reports — a fully automated evaluation pipeline

Load Config: Parse YAML, expand env vars, apply defaults
Create Agent: Initialize the agent adapter from config
Run Trials: Concurrent execution with rate limiting
Grade Results: Apply graders, compute weighted scores
Generate Reports: Table, JSON, HTML with pass@k metrics

Start Evaluating Your AI Agents

Get comprehensive evaluation reports, reliability metrics, and cost tracking with just a single YAML file.

$ curl -fsSL https://raw.githubusercontent.com/wallezhang/agent-eval/main/install.sh | bash