[CVE-AGENT-BENCH]

CVE Vulnerability Patching Benchmark

Real-world evaluation of AI agent patch generation. CVE-Agent-Bench measures whether coding agents can fix verified CVE vulnerabilities in open-source C/C++ projects.

128
CVE samples
1,920
Verified evaluations
15
AI agents benchmarked
62.7%
Best agent pass rate
[CVE-AGENT-BENCH LEADERBOARD]
gpt-5.2
62.7%
cursor-opus-4.6
62.5%
claude-opus-4-6
61.6%
oc/gpt-5.2
51.6%
cursor-gpt-5.2
50.8%
cursor-gpt-5.3-codex
50.4%
gpt-5.2-codex
49.2%
oc/claude-opus-4-6
47.5%
claude-opus-4-5
45.7%
cursor-composer-1.5
44.9%
gemini-3-pro-preview
43.0%
oc/gpt-5.2-codex
37.8%
oc/claude-opus-4-5
36.8%
Current verified dataset: 1,920 evaluations · 128 CVE samples · 15 agents · Target: 6,138+ vulnerabilities

What is CVE-Agent-Bench?

CVE-Agent-Bench tests whether AI coding agents can fix real CVE vulnerabilities in open-source C/C++ projects. Each sample includes a vulnerable code snippet, a proof of concept (POC), and a test suite. Agents generate patches, and we verify correctness against the test suite.
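As a rough illustration of what a sample bundles together and how a patch is judged, here is a minimal sketch. The field names and the `verify` helper are assumptions for illustration; the page does not publish the actual schema.

```python
# Hypothetical sketch of a CVE-Agent-Bench sample record and its
# pass criterion; names and fields are assumptions, not the real schema.
from dataclasses import dataclass, field

@dataclass
class CveSample:
    cve_id: str            # e.g. "CVE-2023-XXXX"
    project: str           # open-source C/C++ project name
    vulnerable_file: str   # path to the vulnerable code snippet
    poc: str               # proof-of-concept input that triggers the bug
    tests: list = field(default_factory=list)  # regression test suite

def verify(tests_pass: bool, poc_still_triggers: bool) -> bool:
    """A patch counts as correct only if the test suite passes AND the
    POC no longer triggers the vulnerability on the patched build."""
    return tests_pass and not poc_still_triggers

# A patch that keeps the tests green but leaves the bug reachable fails:
assert verify(True, True) is False
assert verify(True, False) is True
```

The two-sided check matters: passing tests alone would reward patches that silence symptoms without closing the vulnerability.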

[EVALUATION FACTORY]

Three-stage pipeline: generate patches, reproduce vulnerabilities, verify correctness.

Generate

AI agents generate patches for CVE samples

128 samples

Reproduce

Verify POC and patch correctness

1,920 evals

Patch

Test patches against test suites

50.5% pass rate
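The three stages above can be sketched as a single evaluation loop. The function interfaces (`generate`, `reproduce`, `verify`) are hypothetical stand-ins for the real harness, assumed here only to show how the stages compose into a pass rate.

```python
def evaluate(samples, generate, reproduce, verify):
    """Sketch of the three-stage pipeline (interfaces are assumptions):
    generate() returns a candidate patch, reproduce() confirms the POC
    fires on the unpatched build, verify() tests the patched build."""
    passed = 0
    for sample in samples:
        patch = generate(sample)           # Stage 1: Generate
        if not reproduce(sample):          # Stage 2: Reproduce
            continue  # a non-reproducing sample can't fairly score the agent
        passed += verify(sample, patch)    # Stage 3: Patch + test
    return passed / len(samples)

# Toy run: 4 samples, a stub agent whose patch fixes the even-numbered ones.
rate = evaluate(
    samples=[1, 2, 3, 4],
    generate=lambda s: f"patch-{s}",
    reproduce=lambda s: True,
    verify=lambda s, p: s % 2 == 0,
)
assert rate == 0.5
```

The reproduce step guards attribution: if the POC never triggered on the vulnerable code, a "fix" would be unverifiable, so such samples are excluded rather than counted against the agent.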

[SAMPLE EXPLORER]

Interactive view of all 128 CVE samples and agent performance.

Showing 128 samples
Click a cell to view detailed evaluation results

Agent abbrev legend:

  • C4.5 = claude-opus-4-5
  • C4.6 = claude-opus-4-6
  • GPT5.2 = gpt-5.2
  • GPT5.2C = gpt-5.2-codex
  • Csr1.5 = cursor-composer-1.5
  • CsrGPT = cursor-gpt-5.2

Ready to evaluate your agent?

Submit your agent for automated evaluation on CVE-Agent-Bench. Results posted within 48 hours.