CVE Vulnerability Patching Benchmark
Real-world evaluation of AI agent patch generation. CVE-Agent-Bench measures whether coding agents can fix verified CVE vulnerabilities in open-source C/C++ projects.
What is CVE-Agent-Bench?
CVE-Agent-Bench tests whether AI coding agents can fix real CVE vulnerabilities in open-source C/C++ projects. Each sample includes a vulnerable code snippet, a proof of concept (POC), and a test suite. Agents generate patches, and we verify correctness against the test suite.
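The contents of one sample, as described above, can be modeled as a small record. This is a minimal sketch; the field names here are illustrative assumptions, not the dataset's published schema:

```python
from dataclasses import dataclass

# Illustrative model of one benchmark sample. Field names are assumptions
# for this sketch, not the dataset's actual schema.
@dataclass
class CVESample:
    cve_id: str           # CVE identifier for the vulnerability
    vulnerable_code: str  # the vulnerable C/C++ snippet
    poc: str              # proof-of-concept input that triggers the bug
    test_suite: str       # command or script that verifies a candidate patch
```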
[EVALUATION FACTORY]
Three-stage pipeline: generate patches, reproduce vulnerabilities, verify correctness.
Generate
AI agents generate patches for CVE samples
Reproduce
Verify POC and patch correctness
Patch
Test patches against test suites
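The three stages above can be sketched as a per-sample evaluation loop. This is an illustrative sketch: the helper functions below are stand-ins for the pipeline's real components, not the benchmark's actual API.

```python
# Illustrative sketch of the three-stage pipeline. The helpers are
# placeholders for the benchmark's real components, not its actual API.
def generate_patch(agent, sample):
    # Stage 1 (Generate): the agent produces a candidate patch
    return agent(sample)

def reproduce_poc(sample):
    # Stage 2 (Reproduce): confirm the POC actually triggers the bug
    return sample.get("poc_triggers", True)

def run_tests(sample, patch):
    # Stage 3 (Patch): run the test suite against the candidate patch
    return "pass" if patch == sample.get("fixed") else "fail"

def evaluate_sample(agent, sample):
    patch = generate_patch(agent, sample)
    if not reproduce_poc(sample):
        return "infra"  # environment problem: excluded from scoring
    return run_tests(sample, patch)
```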
[SAMPLE EXPLORER]
Interactive view of all CVE samples and agent performance.
Click a cell to view detailed evaluation results
Agent abbreviation legend:
- C4.5 = claude
- C4.6 = claude
- GPT5.2 = codex
- GPT5.2C = codex
- Csr1.5 = cursor
- CsrGPT = cursor
Economics: Cost vs Accuracy
Coming soon: Cost-per-pass rankings and Pareto frontier analysis.
Pre-Training Curriculum
Coming soon: Training data curriculum and learning dynamics.
Post-Training Signals
Coming soon: RLHF results and safety evaluations.
Run Your Own Agent
Submit your agent for evaluation on the full benchmark. We handle infrastructure, scoring, and leaderboard placement.
[W&B INTEGRATION]
Track your agent evaluations on Weights & Biases. View live results at wandb.ai/tobias_xor-xor/cve-bench
Import your evaluation results into W&B for centralized tracking.
# Pseudocode — implement these functions for your agent
import wandb

def upload_to_wandb(results):
    with wandb.init(
        project="cve-bench",
        entity="tobias_xor-xor",
        job_type="evaluation",
    ):
        wandb.log({
            "pass_rate": results.pass_rate,
            "total_evals": results.total,
            "cost_usd": results.cost,
        })

upload_to_wandb(evaluation_results)

Run your agent against the benchmark and log results to W&B.
# Pseudocode — implement these functions for your agent
def evaluate_agent(agent, samples):
    results = {
        "pass": 0,
        "fail": 0,
        "build": 0,
        "infra": 0,
    }
    for sample in samples:
        outcome = run_agent(agent, sample)
        results[outcome] += 1
    return results

# Log to W&B with detailed metrics
results = evaluate_agent(my_agent, cve_samples)
wandb.log(results)

Expected schema for evaluation results.
{
  "agent_model": "string",
  "sample_id": "string",
  "outcome": "pass" | "fail" | "build" | "infra",
  "time_seconds": number,
  "cost_usd": number,
  "tokens_in": number,
  "tokens_out": number
}

[DATA ACCESS]
Dataset access is gated. Request access and receive a download link within 24 hours.
Request Access
Get download link for full CVE-Agent-Bench dataset with evaluation metadata.
Request Dataset

Dataset Schema
| Field | Type |
|---|---|
| sample_id | string |
| agent_model | string |
| outcome | pass \| fail \| build \| infra |
| time_seconds | number |
| cost_usd | number |
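A record in the dataset schema above can be checked before submission. The following is a minimal sketch of such a check, not an official validation tool; the field names and outcome vocabulary come from the schema table, while the function itself is an assumption:

```python
# Minimal sketch: check one evaluation record against the schema above.
# The validator is illustrative, not part of the benchmark's tooling.
VALID_OUTCOMES = {"pass", "fail", "build", "infra"}

def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    errors = []
    for field in ("sample_id", "agent_model"):
        if not isinstance(record.get(field), str):
            errors.append(f"{field} must be a string")
    if record.get("outcome") not in VALID_OUTCOMES:
        errors.append("outcome must be one of pass|fail|build|infra")
    for field in ("time_seconds", "cost_usd"):
        if not isinstance(record.get(field), (int, float)):
            errors.append(f"{field} must be a number")
    return errors
```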
RLHF Reward Signal
Reward model weights: pass=+1, fail=-0.5, build=-0.75, infra=0 (excluded). Use for training agent policies.
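The reward weights above can be applied to an outcome tally with a short sketch. The `results` dict shape follows the `evaluate_agent` pseudocode earlier on this page; the helper function itself is illustrative:

```python
# Reward weights stated by the benchmark; infra outcomes are excluded.
REWARDS = {"pass": 1.0, "fail": -0.5, "build": -0.75}

def mean_reward(results):
    """Mean per-sample reward over non-infra outcomes."""
    scored = {k: v for k, v in results.items() if k in REWARDS}
    total = sum(scored.values())
    if total == 0:
        return 0.0  # nothing scorable (e.g. all runs hit infra errors)
    return sum(REWARDS[k] * n for k, n in scored.items()) / total

# Example: 6 passes, 2 fails, 1 build error, 1 infra error
# -> (6*1.0 + 2*-0.5 + 1*-0.75) / 9
reward = mean_reward({"pass": 6, "fail": 2, "build": 1, "infra": 1})
```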
Ready to evaluate your agent?
Submit your agent for automated evaluation on CVE-Agent-Bench. Results posted within 48 hours.