[CVE-AGENT-BENCH]

Benchmark Results

62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.


15 agents benchmarked on 128 real vulnerabilities

Outcome: Pick the right agent before you deploy. See which ones produce fixes that pass.

Mechanism: 1,920 evaluations. Pass rates, cost per fix, and difficulty scores for every agent.

Proof: Best pass rate: 62.7%. Cheapest verified fix: $2.64.

Agent Rankings

  1. Pass rate (primary metric)
  2. Cost per verified fix
  3. Difficulty-weighted performance
  4. Trend over time
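These ranking metrics can be computed directly from raw evaluation records. A minimal sketch, assuming hypothetical field names (`agent`, `passed`, `cost`, `difficulty`), not the benchmark's actual schema:

```python
# Sketch: computing ranking metrics from raw evaluation records.
# Field names are illustrative, not the benchmark's actual schema.

def rank_agents(evals):
    """evals: list of dicts with keys agent, passed (bool), cost, difficulty (0..1)."""
    agents = {}
    for e in evals:
        a = agents.setdefault(e["agent"], {"passed": 0, "total": 0,
                                           "cost": 0.0, "weighted": 0.0, "weight": 0.0})
        a["total"] += 1
        a["cost"] += e["cost"]
        a["weight"] += e["difficulty"]
        if e["passed"]:
            a["passed"] += 1
            a["weighted"] += e["difficulty"]
    rows = []
    for name, a in agents.items():
        rows.append({
            "agent": name,
            "pass_rate": a["passed"] / a["total"],
            # Cost per *verified* fix: total spend divided by passing fixes only.
            "cost_per_fix": a["cost"] / a["passed"] if a["passed"] else float("inf"),
            # Difficulty weighting: hard bugs count for more than trivial ones.
            "difficulty_weighted": a["weighted"] / a["weight"],
        })
    # Pass rate is the primary metric; cost per verified fix breaks ties.
    return sorted(rows, key=lambda r: (-r["pass_rate"], r["cost_per_fix"]))
```

Note that cost per verified fix divides by passing fixes only, so an agent that spends little but rarely passes still ranks poorly on cost.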

Cost Economics

Fix a CVE with an agent for $2.64–$52. Incident response costs thousands. Scale pre-production agent fixes instead.

Difficulty Scoring

Vulnerabilities range from trivial (syntax errors) to hard (architectural refactors). The difficulty score shows how each agent's capability holds up across different threat classes.

128
CVE samples
1,920
Verified evaluations
15
Agent configurations
62.7%
Best agent pass rate

Current test dataset

128 real bugs tested, 1,920 test runs across 15 agent configurations. Growing to 6,138+ vulnerabilities across 250+ open source projects.

CVE-Agent-Bench evaluates how well AI agents generate security patches for real vulnerability samples. This is not a toy benchmark: the bugs come from a curated dataset of real vulnerabilities collected from actively maintained open source projects. Agents run in isolated containers, and every patch is verified with automated tests.

How it works

  1. Reproduce each bug with a known way to trigger it and a known-good fix.
  2. Run each agent in an isolated environment.
  3. Apply the agent's fix and check if the bug is gone.
  4. Record pass/fail results and categorize failures.
  5. Adjust scores for bug difficulty so results are fair.

The process is deterministic: every agent receives the same environment, the same inputs, and the same constraints. We measure what agents actually do, not what they claim to do. A missing fix, a broken patch, and a timeout all count as failures.

What's in the report

  • Agent leaderboard by pass rate and cost
  • Failure categories and why fixes fail
  • Guide for choosing the right agent and model
  • Fix examples and test results

The report goes beyond rankings. We document every failure mode: patches that compile but do not fix the bug, agents that refuse to patch, and infrastructure failures. We analyze patch semantics to understand different fixing approaches, and we map which agents agree and which disagree on the same bugs.

Why this matters

Engineering leaders need proof before scaling AI fixes to hundreds of developers. Security leaders need audit-ready evidence. XOR delivers independent, tested results that both teams can trust.

Most agent benchmarks measure generic coding tasks. This one measures security patching specifically. The skills are different. A high-pass-rate general coding agent might fail on security context. We test what matters for your pipeline.

How to use this data

For RLHF / DPO training

Each evaluation is a labeled example (+1 pass, 0 fail, -1 build-fail). Use difficulty scores for curriculum ordering.
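Turning evaluations into training examples is a direct mapping. A minimal sketch with illustrative record fields (`prompt`, `patch`, `status`, `difficulty`), not the dataset's actual schema:

```python
# Sketch: mapping evaluation records to reward labels and ordering them
# as a curriculum (easy to hard). Field names are illustrative.

def to_labeled_examples(evals):
    label = {"pass": 1, "fail": 0, "build_fail": -1}
    examples = [
        {"prompt": e["prompt"], "patch": e["patch"],
         "reward": label[e["status"]], "difficulty": e["difficulty"]}
        for e in evals
    ]
    # Curriculum ordering: train on trivial bugs first, hard refactors last.
    return sorted(examples, key=lambda x: x["difficulty"])
```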

For benchmarking your agent

Run your agent on the same 128 CVEs. Log results to W&B Weave. Compare against 15 baselines.
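W&B Weave handles the logging side; the comparison itself is simple arithmetic. A sketch of scoring your own run against published baselines (the baseline dict here uses placeholder values, not real leaderboard numbers):

```python
# Sketch: comparing your agent's run against baseline pass rates.
# Baseline values here are placeholders, not real leaderboard data.

def compare(my_results, baselines):
    """my_results: list of bools, one per CVE; baselines: {agent_name: pass_rate}."""
    my_rate = sum(my_results) / len(my_results)
    beaten = [name for name, rate in baselines.items() if my_rate > rate]
    return {"pass_rate": my_rate, "agents_beaten": sorted(beaten)}
```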

For pre-training data

1,920 labeled vulnerability-patching examples across 40 C/C++ projects. Patches are surgical: 74% are 10 lines or fewer.

For research

Empirical difficulty calibration, cross-agent agreement (kappa), behavioral trajectory clusters, ensemble analysis.
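Cross-agent agreement can be measured with Cohen's kappa over two agents' pass/fail outcomes on the same bugs. A minimal sketch (for agents that always agree by chance, the denominator degenerates; production code would guard that case):

```python
# Cohen's kappa between two agents' pass/fail outcomes on the same CVEs.

def cohens_kappa(a, b):
    """a, b: equal-length lists of booleans (passed / did not pass)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Chance agreement: both pass or both fail under independent rates.
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)
```

Kappa near 0 means the agents agree no more than chance; near 1 means they pass and fail on the same bugs, which matters for ensemble analysis.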

Browse the data

Full benchmark report

Enter your email below to access agent configurations, patch examples, failure analysis, and full methodology.


FAQ

Which agent has the highest pass rate?

Codex GPT-5.2 at 62.7% on the CVE benchmark dataset. See the full rankings with cost breakdowns.

How much does it cost to fix a vulnerability?

Costs range from $2.64 to $52 per verified fix, depending on agent and model. Pre-production fixing via agents is 100x cheaper than incident response.

Are these costs real or estimates?

Real. Calculated from actual API costs across 1,920 verified evaluations (no rounding, no statistical assumptions).

Do pass rates change?

Yes. As new models ship, benchmarks update. We re-run tests regularly so rankings stay current.


See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.