Define evaluations in YAML, make decisions with data. Featuring pass@k reliability metrics, 8 graders, and 4 agent adapters for simple, reliable, and reproducible AI agent evaluation.
$ curl -fsSL https://raw.githubusercontent.com/wallezhang/agent-eval/main/install.sh | bash

A complete toolkit designed for AI agent evaluation
Measure capability ceiling and reliability with statistically rigorous metrics. Log-space arithmetic prevents overflow for large sample sizes.
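The standard unbiased pass@k estimator can be computed as 1 − C(n−c, k)/C(n, k) for n trials with c passes; moving the product into log space keeps it numerically stable for large n. A minimal sketch (the function name and signature are illustrative, not agent-eval's actual API):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k: the probability that at least one of k trials
    drawn without replacement from n total trials is correct, given
    that c of the n trials passed.

    The failure probability is a product of k ratios; summing their
    logs instead of multiplying avoids underflow for large n.
    """
    if n - c < k:
        return 1.0  # fewer than k failures: every size-k draw contains a pass
    # log of prod_{i=0}^{k-1} (n - c - i) / (n - i)
    log_fail = sum(math.log(n - c - i) - math.log(n - i) for i in range(k))
    return 1.0 - math.exp(log_fail)
```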
exact_match, contains, regex, json_match, command, llm, pairwise, constraint — from simple string checks to LLM-as-judge.
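A task can stack several graders, for example a cheap deterministic check alongside an LLM judge. A hypothetical fragment (field names under `config` are illustrative assumptions, not documented options):

```yaml
graders:
  - type: regex
    config:
      pattern: "^PASS\\b"
  - type: llm
    config:
      model: gpt-4
      rubric: "Does the answer fully solve the task?"
```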
Native OpenAI, Anthropic, HTTP, and Command adapters. Registry pattern makes adding custom adapters a single-file change.
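The registry pattern typically means a module-level mapping from adapter name to class, populated by a decorator, so a new adapter registers itself from its own file. A sketch of the idea (names are illustrative, not agent-eval's internals):

```python
from typing import Callable, Dict

# Module-level registry: adapter type name -> adapter class.
_ADAPTERS: Dict[str, type] = {}

def register_adapter(name: str) -> Callable[[type], type]:
    """Class decorator that records an adapter under a config name."""
    def decorator(cls: type) -> type:
        _ADAPTERS[name] = cls
        return cls
    return decorator

@register_adapter("command")
class CommandAdapter:
    """Example adapter; a real one would shell out to a local command."""
    def __init__(self, config: dict):
        self.config = config

def make_adapter(name: str, config: dict):
    """Look up the `type:` field from config and instantiate the class."""
    try:
        return _ADAPTERS[name](config)
    except KeyError:
        raise ValueError(f"unknown adapter type: {name}") from None
```

Adding a custom adapter is then a single new file containing a decorated class; no central dispatch table needs editing.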
Automatic token extraction with cost estimation. P50/P90/P99 latency percentiles for SLA assessment.
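Latency percentiles over recorded trial durations can be computed with a simple nearest-rank rule; this sketch is illustrative, not agent-eval's implementation:

```python
import math
from typing import Sequence

def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least
    p percent of samples are less than or equal to it."""
    xs = sorted(samples)
    idx = max(0, math.ceil(p * len(xs) / 100) - 1)
    return xs[idx]
```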
File-based response caching avoids redundant API calls. Checkpoint resume picks up interrupted evaluations seamlessly.
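File-based response caching usually keys each entry on a hash of the canonicalized request, so identical calls hit disk instead of the API. A minimal sketch under that assumption (not agent-eval's actual cache layout):

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

def cached_call(cache_dir: Path, request: dict, call: Callable[[dict], dict]) -> dict:
    """Return a cached JSON response for `request`, invoking `call`
    only on a cache miss. The key is a SHA-256 of the request with
    sorted keys, so field order does not affect the cache hit."""
    key = hashlib.sha256(
        json.dumps(request, sort_keys=True).encode()
    ).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    response = call(request)
    path.write_text(json.dumps(response))
    return response
```

Checkpoint resume can be built on the same idea: completed trials already have cache entries, so re-running an interrupted evaluation only executes the remainder.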
Use --fail-under to gate merges on pass rate. JSON output and summary files for automated pipeline processing.
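In CI, that gate might look like the following hypothetical GitHub Actions step; only `--fail-under` appears in the docs above, so the rest of the invocation is an assumption:

```yaml
- name: Run agent evaluation
  run: |
    curl -fsSL https://raw.githubusercontent.com/wallezhang/agent-eval/main/install.sh | bash
    # Exits non-zero if the pass rate drops below 0.80, failing the job.
    agent-eval --config eval.yaml --fail-under 0.80
```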
Define config → Write tasks → Run evaluation — it's that simple
name: "coding-agent-eval"

agent:
  type: openai
  config:
    model: gpt-4
    api_key: ${OPENAI_API_KEY}
    temperature: 0.0

defaults:
  trials_per_task: 3
  graders:
    - type: contains
      config:
        ignore_case: true

execution:
  concurrency: 4
  rate_limit_rps: 5
  timeout: 120s

output:
  format: all
  dir: ./results

From YAML config to visual reports — a fully automated evaluation pipeline
Get comprehensive evaluation reports, reliability metrics, and cost tracking with just a single YAML file.
$ curl -fsSL https://raw.githubusercontent.com/wallezhang/agent-eval/main/install.sh | bash