Quick Start
This guide walks you through installing AgentEval, creating your first evaluation, and viewing results.
Installation
One-line Install (Recommended)
```bash
curl -fsSL https://raw.githubusercontent.com/wallezhang/agent-eval/main/install.sh | bash
```

This automatically detects your OS and architecture, downloads the latest binary, verifies the checksum, and installs to /usr/local/bin. You can customize the install directory with INSTALL_DIR:
```bash
INSTALL_DIR=~/.local/bin curl -fsSL https://raw.githubusercontent.com/wallezhang/agent-eval/main/install.sh | bash
```

Using go install
```bash
go install github.com/wallezhang/agent-eval@latest
```

Download Binary
Download pre-built binaries from the GitHub Releases page. Binaries are available for Linux, macOS, and Windows across multiple architectures.
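If you fetch a binary manually, you can verify its integrity yourself. The sketch below computes a SHA-256 digest the way a published checksums file would list it; it assumes the release ships such checksums, and the byte slice stands in for the downloaded file:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sha256Hex returns the lowercase hex SHA-256 digest of data, the same
// value a release checksums file would list for the artifact.
func sha256Hex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	// Stand-in for the downloaded binary's bytes; in practice, read the file.
	data := []byte("agent-eval binary contents")
	fmt.Println(sha256Hex(data))
}
```

Compare the printed digest against the value published for your platform before moving the binary into your PATH.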
Initialize a Project
```bash
agent-eval init my-eval
cd my-eval
```

This creates:
- eval.yaml -- evaluation configuration
- tasks/ -- directory for task definitions
- tasks/sample.yaml -- a sample task file
- results/ -- output directory
Write Your First Evaluation
Edit eval.yaml:
```yaml
name: "my-first-eval"
description: "A simple evaluation"

agent:
  type: command
  config:
    command: echo
    args: ["Hello, World!"]

defaults:
  trials_per_task: 3
  graders:
    - type: exact_match
      config:
        ignore_case: true

execution:
  concurrency: 2
  timeout: 30s

task_files:
  - tasks/*.yaml

output:
  format: all
  dir: ./results
```

Edit tasks/sample.yaml:
```yaml
- id: greeting
  name: "Greeting Test"
  input:
    prompt: "Say hello"
  expected:
    text: "Hello, World!"
```

Run the Evaluation
```bash
agent-eval run -c eval.yaml
```

The runner will execute 3 trials for the greeting task, grade each one with the exact_match grader, and print a summary table to the terminal.
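Conceptually, exact_match with ignore_case enabled behaves like a case-insensitive string comparison. The following is a minimal sketch of that semantics, not AgentEval's actual grader implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// exactMatch mirrors the semantics suggested by the exact_match grader:
// the agent's output passes when it equals the expected text, optionally
// ignoring case.
func exactMatch(output, expected string, ignoreCase bool) bool {
	if ignoreCase {
		return strings.EqualFold(output, expected)
	}
	return output == expected
}

func main() {
	fmt.Println(exactMatch("hello, world!", "Hello, World!", true))  // true
	fmt.Println(exactMatch("hello, world!", "Hello, World!", false)) // false
}
```

With the sample config above, the echo agent emits "Hello, World!" verbatim, so all 3 trials should pass even if the casing were to differ.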
View Results
Results are displayed as a table in the terminal. You can also find detailed reports in the results/ directory:
- results/summary.json -- machine-readable summary
- results/report.html -- interactive HTML report
List Past Runs
```bash
agent-eval list
```

This shows all previous evaluation runs stored in the local SQLite database, including timestamps, pass rates, and run IDs.
Compare Two Runs
```bash
agent-eval compare <runA> <runB>
```

This generates a diff report highlighting changes in pass rates, scores, and latency between two runs. Useful for tracking regressions or improvements after configuration changes.
Next Steps
- Read Core Concepts to understand the evaluation model
- Explore Examples for real-world configurations
- See the Advanced Usage guide for CI/CD integration, caching, and custom extensions