[RESULTS]

Benchmark Results

62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.

Agent Rankings

  1. Pass rate (primary metric)
  2. Cost per verified fix
  3. Difficulty-weighted performance
  4. Trend over time

Cost Economics

Fix a CVE with an agent for $2.64–$52. Incident response costs thousands. Scale pre-production agent fixes instead.

Difficulty Scoring

Vulnerabilities range from trivial (syntax errors) to hard (architectural refactors). The difficulty score helps you understand agent capability across different threat classes.

128 real bugs tested
15 agents compared
62.7% top pass rate
$2.64 cheapest per fix


Agent Leaderboard

15 agents ranked by pass rate across 128 real bugs. Each test runs in an isolated container with automated safety checks. The dataset is growing to 6,138+ vulnerabilities across 250+ projects.

The leaderboard below ranks agents by the percentage of bugs they fix correctly. Pass rate is simple: out of 136 CVE samples, how many did the agent patch with a fix that passes verification? These numbers matter for product engineering decisions: a 5-point difference in pass rate between two agents adds up to dozens of fixes per year across a large codebase.

Cost efficiency is equally important. Some agents achieve 50%+ pass rates while others land well below that. But a hypothetical cheap agent at 30% might cost $0.50 per fix while an expensive agent at 50% costs $5. Context matters: your security budget, team size, and bug volume all change the calculation.
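To make the trade-off concrete, here is a minimal Python sketch that uses the illustrative figures from the paragraph above (a 30% agent at $0.50 per fix and a $5 agent). The 50% pass rate for the expensive agent, the `AgentProfile` type, the `yearly_outcome` helper, and the bug volume of 1,000 per year are all assumptions made for illustration, not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class AgentProfile:
    name: str
    pass_rate: float     # fraction of bugs fixed and verified, between 0 and 1
    cost_per_fix: float  # dollars per verified fix (hypothetical)


def yearly_outcome(agent: AgentProfile, bugs_per_year: int) -> dict:
    """Estimate verified fixes, spend, and leftover bugs for one year."""
    fixes = agent.pass_rate * bugs_per_year
    return {
        "fixes": fixes,
        "spend": fixes * agent.cost_per_fix,
        "unfixed": bugs_per_year - fixes,
    }


# Hypothetical agents taken from the paragraph's example, not from the leaderboard.
cheap = AgentProfile("cheap", pass_rate=0.30, cost_per_fix=0.50)
expensive = AgentProfile("expensive", pass_rate=0.50, cost_per_fix=5.00)

for agent in (cheap, expensive):
    result = yearly_outcome(agent, bugs_per_year=1000)
    print(f"{agent.name}: {result['fixes']:.0f} fixes, "
          f"${result['spend']:.2f} spend, {result['unfixed']:.0f} unfixed")
```

Under these assumptions the cheaper agent spends far less but leaves more bugs for humans to handle, which is why bug volume and team size belong in the comparison.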

Rank  Agent                                   Pass Rate  Pass  Fail  Build  Infra
1     codex-gpt-5.2                           62.7%      79    12    35     10
2     cursor-opus-4.6                         62.5%      80    24    24     0
3     claude-claude-opus-4-6                  61.6%      77    28    20     11
4     gemini31-gemini-3.1-pro-preview         58.7%      64    18    27     19
5     opencode-gemini-gemini-3.1-pro-preview  54.9%      67    25    30     6
6     cursor-gpt-5.2                          51.6%      63    34    25     6
7     opencode-gpt-5.2                        51.6%      63    11    48     14
8     cursor-gpt-5.3-codex                    50.4%      64    40    23     1
9     codex-gpt-5.2-codex                     49.2%      63    27    38     8
10    opencode-claude-opus-4-6                47.5%      58    15    49     14
11    claude-claude-opus-4-5                  45.7%      58    43    26     9
12    cursor-composer-1.5                     45.2%      57    39    30     2
13    gemini-gemini-3-pro-preview             43.0%      55    36    37     8
14    opencode-gpt-5.2-codex                  37.8%      48    32    47     9
15    opencode-claude-opus-4-5                36.8%      46    29    50     11
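The published percentages appear to be reproducible from the count columns. The sketch below assumes the pass rate is Pass divided by Pass + Fail + Build, with Infra left out of the denominator. That formula is inferred from the numbers in the table, not stated on this page, and the `pass_rate` helper is a hypothetical name.

```python
def pass_rate(passed: int, failed: int, build: int) -> float:
    """Pass rate as a fraction, assuming Infra runs are excluded from the total."""
    return passed / (passed + failed + build)


# (agent, Pass, Fail, Build, published Pass Rate) for the top three leaderboard rows.
rows = [
    ("codex-gpt-5.2", 79, 12, 35, 62.7),
    ("cursor-opus-4.6", 80, 24, 24, 62.5),
    ("claude-claude-opus-4-6", 77, 28, 20, 61.6),
]

for agent, passed, failed, build, published in rows:
    computed = round(pass_rate(passed, failed, build) * 100, 1)
    print(f"{agent}: computed {computed}%, published {published}%")
```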

[REJECTIONS]

  • Reject unverifiable patches.
  • Reject "pass" without bug reproduction.
  • Reject black-box runs without trace evidence.
  • Reject build or infra failures during verification.

These rejection criteria keep the benchmark honest. We do not accept patches that merely "look right" or "compile"; they must actually fix the bug and pass automated tests. Infrastructure errors during agent execution count as failures, not as data collection problems.
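The four rejection rules can be read as a gate that every run must clear before it counts as a pass. The following is a rough sketch of that idea only: the `RunResult` fields and both function names are hypothetical and do not describe the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    patch_verified: bool  # the patch passed automated verification
    bug_reproduced: bool  # the bug was reproduced before the fix was applied
    has_trace: bool       # trace evidence of the agent run is available
    build_ok: bool        # no build failure during verification
    infra_ok: bool        # no infrastructure failure during verification


def rejection_reasons(run: RunResult) -> list[str]:
    """Return every rejection rule the run violates (an empty list means accepted)."""
    reasons = []
    if not run.patch_verified:
        reasons.append("unverifiable patch")
    if not run.bug_reproduced:
        reasons.append("pass without bug reproduction")
    if not run.has_trace:
        reasons.append("black-box run without trace evidence")
    if not (run.build_ok and run.infra_ok):
        reasons.append("build or infra failure during verification")
    return reasons


def counts_as_pass(run: RunResult) -> bool:
    """A run counts as a pass only if no rejection rule applies."""
    return not rejection_reasons(run)
```

Collecting all reasons, rather than stopping at the first, keeps a rejected run explainable when it fails more than one rule.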

Confidence Intervals

95% Wald confidence intervals for each agent's pass rate. Wider intervals indicate greater statistical uncertainty. Confidence intervals matter because they show whether a gap between two agents is larger than sampling noise.

An agent with a 60% pass rate and a tight 58-62% confidence interval gives a more reliable estimate than an agent with 60% but a wide 52-68% range. Tighter intervals come from larger sample sizes. Use these intervals to account for sampling variability when comparing agents.
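For reference, a 95% Wald interval is the observed rate plus or minus 1.96 standard errors, where the standard error is sqrt(p(1 - p) / n). The sketch below applies that standard formula to the top agent's counts (79 passes out of n=126). The `wald_interval` function name and the clamping to the 0-1 range are choices made here for illustration, not details confirmed by the benchmark.

```python
import math


def wald_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wald confidence interval for a proportion; z=1.96 gives roughly 95% coverage."""
    p = passes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1], since the raw Wald bounds can overshoot near the extremes.
    return max(0.0, p - half_width), min(1.0, p + half_width)


low, high = wald_interval(passes=79, n=126)
print(f"pass rate {79 / 126:.1%}, 95% Wald interval {low:.1%} to {high:.1%}")
```

Because the half-width shrinks with the square root of n, halving an interval's width takes roughly four times as many samples.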

[Figure: Confidence Intervals Forest Plot. Horizontal axis: pass rate, 0% to 100%. One row per agent, listed below with pass rate and sample size.]

Agent                       Pass Rate  Samples
codex/gpt/5.2               62.7%      n=126
cursor/opus-4.6             62.5%      n=128
claude/opus-4-6             61.6%      n=125
gemini/3.1-pro-preview      58.7%      n=109
opencode/gemini/3.1-pr...   54.9%      n=122
cursor/gpt-5.2              51.6%      n=122
opencode/gpt/5.2            51.6%      n=122
cursor/gpt-5.3-codex        50.4%      n=127
codex/gpt/5.2-codex         49.2%      n=128
opencode/claude/opus-4-6    47.5%      n=122
claude/opus-4-5             45.7%      n=127
cursor/composer-1.5         45.2%      n=126
gemini/3-pro-preview        43.0%      n=128
opencode/gpt/5.2-codex      37.8%      n=127
opencode/claude/opus-4-5    36.8%      n=125

FAQ

Which agent has the highest pass rate?

Codex GPT-5.2 at 62.7% on the CVE benchmark dataset. See the full rankings with cost breakdowns.

How much does it cost to fix a vulnerability?

Costs range from $2.64 to $52 per verified fix, depending on agent and model. Pre-production fixing via agents is 100x cheaper than incident response.

Are these costs real or estimates?

Real. Calculated from 1,920 evaluations at actual API costs (no rounding, no statistical assumptions).

Do pass rates change?

Yes. As new models ship, benchmarks update. We re-run tests regularly so rankings stay current. Data updated as of today.

[RELATED TOPICS]

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.