CLI Reference

agent-eval run

Run an evaluation suite.

```bash
agent-eval run [flags]
```

Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `-c, --config` | string | `eval.yaml` | Path to evaluation config file |
| `--db` | string | | SQLite database path |
| `--verbose` | bool | `false` | Enable verbose logging |
| `--fail-under` | float | `0.0` | Minimum pass rate (0.0-1.0). Exit code 1 if below threshold |
| `--tags` | string | | Only run tasks matching these tags (comma-separated) |
| `--exclude-tags` | string | | Exclude tasks matching these tags (comma-separated) |
| `--no-cache` | bool | `false` | Bypass response cache for this run |
| `--resume` | string | | Resume a previous run by run ID |

Examples

```bash
# Run with default config
agent-eval run

# Run with specific config
agent-eval run -c my-eval.yaml

# CI mode: fail if pass rate below 80%
agent-eval run -c eval.yaml --fail-under 0.8

# Run only math tasks
agent-eval run --tags math

# Resume interrupted run
agent-eval run --resume abc123
```
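
In CI, the exit code from `--fail-under` can decide whether the job passes. The following is a minimal sketch rather than an official integration: the `ci_gate` function and the `AGENT_EVAL` override variable are hypothetical names introduced here. It relies only on what the flag table states, namely that `agent-eval run` exits with code 1 when the pass rate is below the threshold.

```bash
#!/usr/bin/env bash
# Hypothetical CI wrapper around `agent-eval run --fail-under`.
# AGENT_EVAL is an assumed override so the binary can be swapped or stubbed.
# Usage: ci_gate eval.yaml 0.8 || exit 1

ci_gate() {
  local config="${1:-eval.yaml}"   # default config name from the flag table
  local threshold="${2:-0.8}"      # minimum pass rate, 0.0-1.0
  local status=0

  # Capture the exit code without aborting under `set -e`.
  "${AGENT_EVAL:-agent-eval}" run -c "$config" --fail-under "$threshold" || status=$?

  if [ "$status" -eq 0 ]; then
    echo "eval gate passed (threshold ${threshold})"
  else
    echo "eval gate failed with exit code ${status} (threshold ${threshold})" >&2
  fi
  return "$status"
}
```

Returning the original exit code, instead of collapsing it to 0 or 1, keeps other failures (for example a missing config file) distinguishable in the CI log.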

agent-eval list

List historical evaluation runs.

```bash
agent-eval list [flags]
```

Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--db` | string | `./results/agent-eval.db` | Path to SQLite database |

Output

Displays a table with columns:

  • ID -- Run identifier
  • SUITE -- Suite name
  • AGENT -- Agent type
  • TASKS -- Number of tasks
  • PASS RATE -- Overall pass rate
  • DURATION -- Total run duration
  • DATE -- Run timestamp

Example

```bash
agent-eval list
agent-eval list --db ./my-results/agent-eval.db
```

agent-eval compare

Compare two evaluation runs side by side.

```bash
agent-eval compare <runA> <runB> [flags]
```

Arguments

  • runA -- First run ID (supports prefix matching)
  • runB -- Second run ID (supports prefix matching)

Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--db` | string | `./results/agent-eval.db` | Path to SQLite database |

Example

```bash
# Full run IDs
agent-eval compare abc123 def456
# Prefix matching
agent-eval compare abc def
```
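
Prefix matching means a run can be named by the first few characters of its ID. How the tool resolves a prefix internally is not documented here, so the sketch below only illustrates the general idea under stated assumptions: `resolve_run_id` is a hypothetical helper, candidate IDs are passed in as arguments, and an unmatched or ambiguous prefix is treated as an error. The real CLI may behave differently.

```bash
#!/usr/bin/env bash
# Hypothetical illustration of prefix matching; not the tool's actual code.
# Usage: resolve_run_id <prefix> <id> [<id> ...]

resolve_run_id() {
  local prefix="$1"; shift
  local match="" count=0 id

  for id in "$@"; do
    case "$id" in
      "$prefix"*)                 # ID starts with the given prefix
        match="$id"
        count=$((count + 1))
        ;;
    esac
  done

  if [ "$count" -eq 1 ]; then
    echo "$match"                 # exactly one candidate: resolved
  elif [ "$count" -eq 0 ]; then
    echo "no run matches prefix '$prefix'" >&2
    return 1
  else
    echo "prefix '$prefix' is ambiguous ($count matches)" >&2
    return 2
  fi
}
```

Under these assumptions, a longer prefix is the way to disambiguate when two run IDs share their leading characters.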

agent-eval init

Initialize a new evaluation project.

```bash
agent-eval init [directory]
```

Arguments

  • directory -- Target directory name (optional, defaults to current directory)

Created Files

```
<directory>/
  eval.yaml          # Evaluation configuration template
  tasks/
    sample.yaml      # Sample task file
  results/           # Output directory
```

Example

```bash
agent-eval init my-eval
cd my-eval
# Edit eval.yaml and tasks/sample.yaml, then:
agent-eval run
```
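
After `agent-eval init`, a quick check can confirm the layout listed under Created Files before any editing. This is a sketch: `check_scaffold` is a hypothetical helper, not part of the CLI, and it only tests for the three paths that listing names.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify the files that `agent-eval init` is documented to create.
# Usage: check_scaffold my-eval

check_scaffold() {
  local dir="${1:-.}"   # defaults to the current directory, like `init`
  local missing=0

  [ -f "$dir/eval.yaml" ]         || { echo "missing: $dir/eval.yaml" >&2; missing=1; }
  [ -f "$dir/tasks/sample.yaml" ] || { echo "missing: $dir/tasks/sample.yaml" >&2; missing=1; }
  [ -d "$dir/results" ]           || { echo "missing: $dir/results/" >&2; missing=1; }

  return "$missing"     # 0 if everything is present, 1 otherwise
}
```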