Open benchmark for AI vulnerability patching — 128 real vulnerabilities, 15 agents, reproducible results.
Benchmark Explorer
Vulnerability-Agent-Bench tests whether AI coding agents can fix real vulnerabilities in production codebases. 1,920 evaluations. 128 vulnerabilities. 15 agents. Best pass rate: 62.7%. Cheapest fix: $2.64.
How each vulnerability is verified:
Run the trigger against unpatched code. Confirm it crashes.
Apply the agent's git diff inside a Docker container.
Compile with the verifier toolchain and memory safety instrumentation.
Re-run the same trigger. If no crash → [PASS]. Still crashes → [FAIL].
Scoring: Pass = +1 (trigger no longer crashes). Fail = 0 (still crashes). Build = -1 (patch doesn't compile). Infra = excluded.
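The verification steps and scoring rule above can be sketched as a small classifier. This is only an illustration: the function names are hypothetical, and treating a trigger that fails to crash the unpatched code as an infra result is an assumption, not something the benchmark states.

```python
# Scores from the scoring rule; "infra" has no entry because it is excluded.
SCORES = {"pass": 1, "fail": 0, "build": -1}


def classify(crashes_unpatched: bool, patch_builds: bool, crashes_patched: bool) -> str:
    """Map the observations from the verification steps to an outcome label."""
    if not crashes_unpatched:
        # Assumption: a trigger that does not reproduce is an infrastructure problem.
        return "infra"
    if not patch_builds:
        return "build"
    return "fail" if crashes_patched else "pass"


def total_score(outcomes: list[str]) -> int:
    """Sum the scores of all outcomes, skipping excluded infra results."""
    return sum(SCORES[o] for o in outcomes if o != "infra")
```

For example, `total_score(["pass", "fail", "build", "infra", "pass"])` sums +1, 0, -1, and +1 and ignores the infra entry.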
Agent names = harness/model. The same LLM through different coding agents produces different patch quality. For example: claude/opus-4-5 is Claude Opus 4.5 through Anthropic's Claude Code. opencode/claude-opus-4-5 is the same model through the OpenCode harness. Same model, different harness — different results. The harness is what we are measuring, not just the model.
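A minimal sketch of splitting an agent name into its two parts, assuming the first `/` always separates harness from model. `AgentName` and `parse_agent_name` are hypothetical helpers, not part of the benchmark tooling.

```python
from typing import NamedTuple


class AgentName(NamedTuple):
    harness: str
    model: str


def parse_agent_name(name: str) -> AgentName:
    """Split 'harness/model' on the first slash."""
    harness, sep, model = name.partition("/")
    if not sep or not harness or not model:
        raise ValueError(f"expected 'harness/model', got {name!r}")
    return AgentName(harness, model)
```

Note that the model part is not normalized across harnesses: `claude/opus-4-5` and `opencode/claude-opus-4-5` name the same model with different strings, so grouping results by model would need an alias table on top of this.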
Each evaluation is a labeled example: +1 (pass), 0 (fail), -1 (build-fail). Use difficulty scores for curriculum ordering.
Run your agent on the same 128 vulnerabilities. Log results to W&B Weave. Compare against 15 baselines.
1,920 labeled vulnerability-patching examples across 40 production codebases. Patches are surgical — 74% are 10 lines or fewer.
IRT difficulty calibration, cross-agent agreement (kappa), behavioral trajectory clusters, ensemble analysis.
Three-stage pipeline: generate patches, reproduce vulnerabilities, verify correctness.
Generate: AI agents generate patches for vuln samples
Reproduce: Verify trigger and patch correctness
Patch: Test patches against test suites
Real agent sessions from Vulnerability-Agent-Bench. Watch how different agents approach the same vulnerability.
Speed-runner: arrow #20123
Claude Opus 4.5 fixes a null-check bug in 3 tool calls and 19 seconds. Grep → Read → Edit pattern.
This session conforms to the IETF Verifiable Agent Conversation Record format. The data structure maps to the VAC entry types (tool-call, tool-result, message) and could be wrapped in a COSE_Sign1 envelope for cryptographic non-repudiation.
→ draft-birkholz-verifiable-agent-conversations
136 vuln samples × 15 agents. Each cell is one evaluation. Sorted by difficulty (easiest top) and pass rate (best left).
Track your agent evaluations on Weights & Biases. View live results at wandb.ai/tobias_xor-xor/cve-bench
Import your evaluation results into W&B for centralized tracking.
# Pseudocode: implement load_results for your agent
import wandb
from your_agent import load_results  # placeholder module, supplied by you

def upload_to_wandb(results):
    # The run is finished automatically when the with-block exits.
    with wandb.init(
        project="cve-bench",
        entity="tobias_xor-xor",
        job_type="evaluation",
    ):
        wandb.log({
            "pass_rate": results.pass_rate,
            "total_evals": results.total,
            "cost_usd": results.cost,
        })

evaluation_results = load_results("results.json")
upload_to_wandb(evaluation_results)

Run your agent against the benchmark and log results to W&B.
# Pseudocode: implement run_agent for your agent
import wandb

def evaluate_agent(agent, samples):
    # Count outcomes; run_agent must return "pass", "fail", "build", or "infra".
    results = {
        "pass": 0,
        "fail": 0,
        "build": 0,
        "infra": 0,
    }
    for sample in samples:
        outcome = run_agent(agent, sample)
        results[outcome] += 1
    return results

# Log to W&B with detailed metrics (wandb.log requires an active run)
results = evaluate_agent(my_agent, cve_samples)
with wandb.init(project="cve-bench", job_type="evaluation"):
    wandb.log(results)

Expected schema for evaluation results.
{
  "agent_model": "string",
  "sample_id": "string",
  "outcome": "pass" | "fail" | "build" | "infra",
  "time_seconds": number,
  "cost_usd": number,
  "tokens_in": number,
  "tokens_out": number
}

Dataset access is gated. Request access to receive a download link within 24 hours.
Request Access
Get download link for full Vulnerability-Agent-Bench dataset with evaluation metadata.
Dataset Schema
| Field | Type |
|---|---|
| sample_id | string |
| agent_model | string |
| outcome | pass \| fail \| build \| infra |
| time_seconds | number |
| cost_usd | number |
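
A minimal sketch of checking a result record against the five fields in the table above before uploading it. `validate_record` is a hypothetical helper, and the rule that booleans do not count as numbers is an assumption.

```python
from numbers import Real

OUTCOMES = {"pass", "fail", "build", "infra"}
FIELD_TYPES = {
    "sample_id": str,
    "agent_model": str,
    "outcome": str,
    "time_seconds": Real,
    "cost_usd": Real,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record; empty means it is valid."""
    errors = []
    for field, expected in FIELD_TYPES.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif isinstance(record[field], bool) or not isinstance(record[field], expected):
            # bool is a subclass of int, so it is rejected explicitly.
            errors.append(f"wrong type for {field}")
    outcome = record.get("outcome")
    if isinstance(outcome, str) and outcome not in OUTCOMES:
        errors.append(f"unknown outcome: {outcome}")
    return errors
```

A caller would typically skip or report any record for which the returned list is non-empty.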
RLHF Reward Signal
Reward model weights: pass=+1, fail=-0.5, build=-0.75, infra=0 (excluded). Use for training agent policies.
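The weights above can be applied to a list of outcomes as follows. This sketch reads "infra=0 (excluded)" as dropping infra results from the training signal; emitting a reward of 0 for them instead would be an equally plausible reading. `to_rewards` is a hypothetical name.

```python
# Reward weights as listed above; infra is handled separately.
REWARD_WEIGHTS = {"pass": 1.0, "fail": -0.5, "build": -0.75}


def to_rewards(outcomes: list[str]) -> list[float]:
    """Convert outcome labels to scalar rewards, dropping infra results."""
    return [REWARD_WEIGHTS[o] for o in outcomes if o != "infra"]
```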
Cost per fix vs. pass rate. Pareto frontier with 95% confidence intervals. Oracle set cover: the minimum set of agents needed to fix the maximum number of samples.
Cost data: 2 of 15 agents have measured token costs (Claude native). All others use turn-count heuristic estimates.
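Both analyses can be sketched in a few lines. The input shapes and function names here are assumptions for illustration, and the set cover uses the standard greedy heuristic, which approximates a minimum cover but is not guaranteed to find it; the benchmark's own method may differ.

```python
def pareto_frontier(agents: dict[str, tuple[float, float]]) -> list[str]:
    """agents maps name -> (cost_per_fix, pass_rate).

    Return the agents that no other agent beats on both cost and pass rate,
    ordered from cheapest to most expensive.
    """
    frontier = []
    for name, (cost, rate) in agents.items():
        dominated = any(
            o_cost <= cost and o_rate >= rate and (o_cost < cost or o_rate > rate)
            for other, (o_cost, o_rate) in agents.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: agents[n][0])


def greedy_cover(fixed_by: dict[str, set[str]]) -> list[str]:
    """fixed_by maps agent -> set of sample ids it fixed.

    Repeatedly pick the agent that fixes the most still-uncovered samples.
    """
    remaining = set().union(*fixed_by.values())
    chosen = []
    while remaining:
        best = max(fixed_by, key=lambda a: len(fixed_by[a] & remaining))
        chosen.append(best)
        remaining -= fixed_by[best]
    return chosen
```

Because most cost figures are heuristic estimates, a frontier computed this way is only as reliable as the cost inputs.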
Read full cost analysis →
Difficulty scored from observed pass rates across all agents (raw empirical measurement, not theoretical fitting). Samples categorized by patch type and source project.
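A minimal sketch of an empirical difficulty score and the curriculum ordering it enables. Defining difficulty as one minus the pass fraction, and leaving infra results out of the denominator, are assumptions; the source only says difficulty comes from observed pass rates.

```python
from collections import defaultdict


def difficulty_scores(evals: list[tuple[str, str]]) -> dict[str, float]:
    """evals holds (sample_id, outcome) pairs across all agents.

    Returns sample_id -> difficulty in [0, 1], where 1 means no agent passed.
    """
    passed = defaultdict(int)
    counted = defaultdict(int)
    for sample_id, outcome in evals:
        if outcome == "infra":
            continue  # assumption: excluded runs do not count either way
        counted[sample_id] += 1
        passed[sample_id] += outcome == "pass"
    return {s: 1 - passed[s] / counted[s] for s in counted}


def curriculum(scores: dict[str, float]) -> list[str]:
    """Order sample ids from easiest to hardest, breaking ties by id."""
    return sorted(scores, key=lambda s: (scores[s], s))
```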
DPO preference pairs for training. Gold = pass vs build-fail (strongest signal). Silver = pass vs test-fail. Bronze = test-fail vs build-fail. Ternary reward signal (+1/0/-1) and five-level distributions.
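The tiering above can be sketched as pairing evaluations of the same sample. This assumes "test-fail" corresponds to the `fail` outcome and "build-fail" to `build`, and that each evaluation carries its diff in a `patch` field; that field is hypothetical and not part of the published schema.

```python
from itertools import combinations

# (chosen outcome, rejected outcome) -> tier
TIERS = {
    ("pass", "build"): "gold",
    ("pass", "fail"): "silver",
    ("fail", "build"): "bronze",
}


def preference_pairs(evals: list[dict]) -> list[dict]:
    """Build (chosen, rejected) pairs from evaluations of the same sample."""
    by_sample: dict[str, list[dict]] = {}
    for e in evals:
        by_sample.setdefault(e["sample_id"], []).append(e)

    pairs = []
    for sample_id, group in by_sample.items():
        for a, b in combinations(group, 2):
            # Try both orientations; at most one matches a tier.
            for chosen, rejected in ((a, b), (b, a)):
                tier = TIERS.get((chosen["outcome"], rejected["outcome"]))
                if tier:
                    pairs.append({
                        "sample_id": sample_id,
                        "chosen": chosen["patch"],
                        "rejected": rejected["patch"],
                        "tier": tier,
                    })
    return pairs
```

Pairs with identical outcomes, or any pair involving an infra result, produce no preference signal and are skipped.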
Reward Configuration
Base Reward: 1
Difficulty Bonus: +0.5
Teamwork Bonus: +0.25
Exploration Bonus: +0.1
Bonuses applied when agents solve difficult samples, contribute unique solutions, or explore novel reasoning paths.
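One way to combine these values, shown only as a sketch: the source does not say how bonuses stack or what a non-passing evaluation receives, so the additive rule and the zero reward for non-passes are both assumptions, and `shaped_reward` is a hypothetical name.

```python
BASE_REWARD = 1.0
BONUSES = {"difficulty": 0.5, "teamwork": 0.25, "exploration": 0.1}


def shaped_reward(passed: bool, earned: tuple[str, ...] = ()) -> float:
    """Base reward plus any earned bonuses; zero if the sample was not solved."""
    if not passed:
        return 0.0  # assumption: bonuses only apply to solved samples
    return BASE_REWARD + sum(BONUSES[name] for name in earned)
```

Under these assumptions, a pass on a difficult sample with a novel approach would earn the base reward plus the difficulty and exploration bonuses.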
128 samples, +/-8.7pp 95% confidence intervals. Cohen's kappa for cross-agent agreement. The leading agents may be statistically indistinguishable.
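The ±8.7pp figure is consistent with a normal-approximation 95% interval at the worst case p = 0.5 with n = 128; whether the benchmark computes it exactly this way is an assumption. The sketch below shows that calculation alongside Cohen's kappa for two agents' outcomes on the same samples.

```python
import math


def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def cohens_kappa(a: list, b: list) -> float:
    """Agreement between two label sequences, corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in set(a) | set(b))
    if expected == 1:
        return 1.0  # both sequences are constant and identical
    return (observed - expected) / (1 - expected)
```

With an interval this wide, two agents whose pass rates differ by only a few points have overlapping intervals, which is why the leading agents may not be separable on 128 samples.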
Read full methodology →
Run your agent against 128 vulnerabilities
Download the dataset, log results to W&B Weave, and compare against 15 baselines. The current best hits 62.7%.