Define evaluations in YAML, make decisions with data. Featuring pass@k reliability metrics, 8 graders, and 4 agent adapters for simple, reliable, and reproducible AI agent evaluation.
$ curl -fsSL https://raw.githubusercontent.com/wallezhang/agent-eval/main/install.sh | bash

A complete toolkit designed for AI agent evaluation
Measure capability ceiling and reliability with statistically rigorous metrics. Log-space arithmetic prevents overflow for large sample sizes.
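The standard unbiased pass@k estimator can be computed as 1 − C(n−c, k)/C(n, k) for n trials with c passes; moving the product into log space keeps it numerically stable for large n. A minimal sketch (the function name and signature are illustrative, not agent-eval's actual API):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k: the probability that at least one of k trials
    drawn without replacement from n total trials is correct, given
    that c of the n trials passed.

    The failure probability is a product of k ratios; summing their
    logs instead of multiplying avoids underflow for large n.
    """
    if n - c < k:
        return 1.0  # fewer than k failures: every size-k draw contains a pass
    # log of prod_{i=0}^{k-1} (n - c - i) / (n - i)
    log_fail = sum(math.log(n - c - i) - math.log(n - i) for i in range(k))
    return 1.0 - math.exp(log_fail)
```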
exact_match, contains, regex, json_match, command, llm, pairwise, constraint — from simple string checks to LLM-as-judge.
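A task can stack several graders, for example a cheap deterministic check alongside an LLM judge. A hypothetical fragment (field names under `config` are illustrative assumptions, not documented options):

```yaml
graders:
  - type: regex
    config:
      pattern: "^PASS\\b"
  - type: llm
    config:
      model: gpt-4
      rubric: "Does the answer fully solve the task?"
```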
Native OpenAI, Anthropic, HTTP, and Command adapters. Registry pattern makes adding custom adapters a single-file change.
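The registry pattern typically means a module-level mapping from adapter name to class, populated by a decorator, so a new adapter registers itself from its own file. A sketch of the idea (names are illustrative, not agent-eval's internals):

```python
from typing import Callable, Dict

# Module-level registry: adapter type name -> adapter class.
_ADAPTERS: Dict[str, type] = {}

def register_adapter(name: str) -> Callable[[type], type]:
    """Class decorator that records an adapter under a config name."""
    def decorator(cls: type) -> type:
        _ADAPTERS[name] = cls
        return cls
    return decorator

@register_adapter("command")
class CommandAdapter:
    """Example adapter; a real one would shell out to a local command."""
    def __init__(self, config: dict):
        self.config = config

def make_adapter(name: str, config: dict):
    """Look up the `type:` field from config and instantiate the class."""
    try:
        return _ADAPTERS[name](config)
    except KeyError:
        raise ValueError(f"unknown adapter type: {name}") from None
```

Adding a custom adapter is then a single new file containing a decorated class; no central dispatch table needs editing.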
Automatic token extraction with cost estimation. P50/P90/P99 latency percentiles for SLA assessment.
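Latency percentiles over recorded trial durations can be computed with a simple nearest-rank rule; this sketch is illustrative, not agent-eval's implementation:

```python
import math
from typing import Sequence

def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least
    p percent of samples are less than or equal to it."""
    xs = sorted(samples)
    idx = max(0, math.ceil(p * len(xs) / 100) - 1)
    return xs[idx]
```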
File-based response caching avoids redundant API calls. Checkpoint resume picks up interrupted evaluations seamlessly.
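File-based response caching usually keys each entry on a hash of the canonicalized request, so identical calls hit disk instead of the API. A minimal sketch under that assumption (not agent-eval's actual cache layout):

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

def cached_call(cache_dir: Path, request: dict, call: Callable[[dict], dict]) -> dict:
    """Return a cached JSON response for `request`, invoking `call`
    only on a cache miss. The key is a SHA-256 of the request with
    sorted keys, so field order does not affect the cache hit."""
    key = hashlib.sha256(
        json.dumps(request, sort_keys=True).encode()
    ).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    response = call(request)
    path.write_text(json.dumps(response))
    return response
```

Checkpoint resume can be built on the same idea: completed trials already have cache entries, so re-running an interrupted evaluation only executes the remainder.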
Use --fail-under to gate merges on pass rate. JSON output and summary files for automated pipeline processing.
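In CI, that gate might look like the following hypothetical GitHub Actions step; only `--fail-under` appears in the docs above, so the rest of the invocation is an assumption:

```yaml
- name: Run agent evaluation
  run: |
    curl -fsSL https://raw.githubusercontent.com/wallezhang/agent-eval/main/install.sh | bash
    # Exits non-zero if the pass rate drops below 0.80, failing the job.
    agent-eval --config eval.yaml --fail-under 0.80
```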
Define config → Write tasks → Run evaluation — it's that simple
name: "coding-agent-eval"

agent:
  type: openai
  config:
    model: gpt-4
    api_key: ${OPENAI_API_KEY}
    temperature: 0.0

defaults:
  trials_per_task: 3
  graders:
    - type: contains
      config:
        ignore_case: true

execution:
  concurrency: 4
  rate_limit_rps: 5
  timeout: 120s

output:
  format: all
  dir: ./results

From YAML config to visual reports — a fully automated evaluation pipeline
Get comprehensive evaluation reports, reliability metrics, and cost tracking with just a single YAML file.
$ curl -fsSL https://raw.githubusercontent.com/wallezhang/agent-eval/main/install.sh | bash