# Configuration Reference

Full YAML configuration reference for AgentEval.
## Complete Schema

```yaml
# Suite name (required)
name: "my-eval"

# Description (optional)
description: "Evaluation suite description"

# Agent configuration (required)
agent:
  type: openai | anthropic | http | command
  config:
    # Type-specific config (see Agent Types below)

# Default settings applied to all tasks (optional)
defaults:
  trials_per_task: 1    # Number of trials per task
  pass_threshold: 0.5   # Minimum score to pass
  graders:              # Default graders for tasks
    - type: exact_match
      weight: 1.0
      config: {}

# Execution settings (optional)
execution:
  concurrency: 1        # Max parallel trials
  rate_limit_rps: 0     # Requests per second (0 = unlimited)
  timeout: "60s"        # Per-trial timeout
  retry:
    max_retries: 0      # Max retry attempts for agent errors
    initial_delay: "1s" # Initial backoff delay
    max_delay: "30s"    # Maximum backoff delay

# Task file globs (at least one required if no inline tasks)
task_files:
  - tasks/*.yaml

# Inline tasks (alternative to task_files)
tasks:
  - id: task-1
    name: "Task Name"
    input:
      prompt: "Your prompt"

# Output configuration (optional)
output:
  format: table | json | html | all # Default: table
  dir: ./results                    # Output directory

# Response cache (optional)
cache:
  enabled: false
  dir: .cache/agent-eval

# Lifecycle hooks (optional)
hooks:
  before_run: "" # Shell command before evaluation
  after_run: ""  # Shell command after evaluation
```

## Agent Types
### openai

```yaml
agent:
  type: openai
  config:
    api_key: ${OPENAI_API_KEY}          # Required
    model: gpt-4                        # Default: gpt-4
    base_url: https://api.openai.com/v1 # Optional, for compatible APIs
    temperature: 0.0                    # Default: 0.0
```

Compatible with any OpenAI-compatible API (e.g., vLLM, Ollama, Azure OpenAI) by setting `base_url`.
### anthropic

```yaml
agent:
  type: anthropic
  config:
    api_key: ${ANTHROPIC_API_KEY}       # Required
    model: claude-sonnet-4-20250514     # Default
    base_url: https://api.anthropic.com # Optional
    temperature: 0.0                    # Optional
    max_tokens: 4096                    # Default: 4096
```

### http
```yaml
agent:
  type: http
  config:
    url: https://my-api.example.com/evaluate # Required
    method: POST          # Default: POST
    headers:              # Optional
      Authorization: "Bearer ${API_TOKEN}"
    body_template: ""     # Optional custom body template
    response_path: "text" # JSON path in response, default: "text"
```

The HTTP agent sends a JSON body with `prompt`, `system`, `messages`, and `params` fields. Response text is extracted from the JSON path specified by `response_path`, falling back to the raw response text.
### command

```yaml
agent:
  type: command
  config:
    command: python          # Required
    args: ["eval_script.py"] # Optional
    working_dir: ./scripts   # Optional
    timeout: "60s"           # Default: 60s
    env:                     # Optional environment variables
      MY_VAR: "value"
```

The prompt is passed via stdin by default. If any argument contains `{{.Prompt}}`, the prompt is substituted into `args` instead.
## Grader Types
### exact_match

```yaml
- type: exact_match
  weight: 1.0
  config:
    ignore_case: false       # Case-insensitive comparison
    ignore_whitespace: false # Ignore leading/trailing whitespace
```

Compares agent output text against `expected.text`. Score: 1.0 if match, 0.0 otherwise.
### contains

```yaml
- type: contains
  weight: 1.0
  config:
    ignore_case: false
    keywords: ["keyword1", "keyword2"] # Additional required keywords
```

Checks that `expected.text` and all `keywords` are present in the agent output. Score: matched / total.
### regex

```yaml
- type: regex
  weight: 1.0
  config:
    pattern: "\\d{3}-\\d{4}" # Required, Go regex syntax
```

Tests `pattern` against agent output text. Score: 1.0 if match, 0.0 otherwise.
### json_match

```yaml
- type: json_match
  weight: 1.0
  config:
    ignore_case: false
```

Parses agent output as JSON and compares against `expected.fields`. Score: matched / total fields.
Task definition with `json_match`:

```yaml
- id: json-test
  input:
    prompt: "Return user info as JSON"
  expected:
    fields:
      name: "Alice"
      age: 30
  graders:
    - type: json_match
```

### command
```yaml
- type: command
  weight: 1.0
  config:
    command: python # Required
    args: ["grade_script.py"]
    timeout: "60s"  # Default: 60s
```

Receives JSON via stdin: `{"task_id": "...", "agent_output": "...", "expected": {...}}`. Exit code 0 = pass, non-zero = fail. Optionally returns JSON via stdout: `{"score": 0.8, "pass": true, "reason": "..."}`.
### llm

```yaml
- type: llm
  weight: 1.0
  config:
    provider: openai # Default: openai
    api_key: ${OPENAI_API_KEY}
    base_url: ""     # Optional, for compatible APIs
    model: gpt-4
    rubric: |        # Required
      Evaluate the response for:
      1. Correctness
      2. Completeness
```

Uses an LLM to grade the agent's output against a rubric. The LLM response is parsed for `SCORE: <0.0-1.0>`, `PASS: <true/false>`, and `REASON: <text>`.
### pairwise

```yaml
- type: pairwise
  weight: 1.0
  config:
    provider: openai
    api_key: ${OPENAI_API_KEY}
    base_url: ""  # Optional
    model: gpt-4
    criteria: ""  # Optional; default compares accuracy/completeness/helpfulness
    reference: "" # Optional; uses expected.text if not set
```

Compares agent output against a reference using an LLM. Parses `VERDICT: <A_BETTER|B_BETTER|TIE>` and `REASON: <text>`. Scores: B_BETTER = 1.0, TIE = 0.5, A_BETTER = 0.0.
### constraint

```yaml
- type: constraint
  weight: 1.0
  config:
    checks:
      - pattern: "\\d+"
        must_match: true
      - pattern: "error"
        must_not_match: true
      - max_words: 100
      - min_words: 10
```

Applies structural constraints to agent output. ALL checks must pass. Score: passed / total checks.
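The scoring for that example can be sketched as follows. `score_constraints` is an illustrative helper, and it uses Python's `re` where the real tool uses Go regex syntax:

```python
import re

def score_constraints(output: str, checks: list[dict]) -> float:
    """Score = passed checks / total checks, per the rules above."""
    words = len(output.split())
    passed = 0
    for check in checks:
        if "pattern" in check:
            matched = re.search(check["pattern"], output) is not None
            ok = (not matched) if check.get("must_not_match") else matched
        elif "max_words" in check:
            ok = words <= check["max_words"]
        elif "min_words" in check:
            ok = words >= check["min_words"]
        else:
            ok = False
        passed += ok
    return passed / len(checks)

# The checks from the YAML example above, as Python dicts.
checks = [
    {"pattern": r"\d+", "must_match": True},
    {"pattern": "error", "must_not_match": True},
    {"max_words": 100},
    {"min_words": 10},
]
```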
## Task File Format

Task files are YAML arrays:

```yaml
- id: task-1         # Required, unique
  name: "Task Name"  # Optional
  tags: [math, easy] # Optional, for filtering
  input:
    prompt: "Your prompt here" # Required (or messages)
    system: "System prompt"    # Optional
    messages:                  # Alternative to prompt
      - role: user
        content: "Hello"
      - role: assistant
        content: "Hi there"
    params:                    # Optional key-value params
      language: "python"
  expected:                    # Optional
    text: "Expected output"
    fields:                    # For json_match
      key: "value"
  graders:                     # Optional, overrides defaults
    - type: exact_match
  trials_per_task: 5           # Optional, overrides default
  step_limit: 10               # Optional, expected max steps
```

## Environment Variable Expansion
Use `${ENV_VAR}` syntax in any string value:

```yaml
agent:
  type: openai
  config:
    api_key: ${OPENAI_API_KEY}
```

## Defaults Cascade
Suite-level defaults are applied to tasks that don't override them:

```yaml
defaults:
  trials_per_task: 3
  pass_threshold: 0.7
  graders:
    - type: exact_match

tasks:
  - id: task-1
    input:
      prompt: "..."
    # Inherits: trials_per_task=3, pass_threshold=0.7, graders=[exact_match]

  - id: task-2
    input:
      prompt: "..."
    trials_per_task: 5 # Overrides default
    graders:           # Overrides default graders
      - type: contains
```