Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Rankings
- Pass rate (primary metric)
- Cost per verified fix
- Difficulty-weighted performance
- Trend over time
Cost Economics
Fix a CVE with an agent for $2.64–$52. Incident response costs thousands. Scale pre-production agent fixes instead.
Difficulty Scoring
Vulnerabilities range from trivial (syntax errors) to hard (architectural refactors). The difficulty score shows how agent capability varies across different threat classes.
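For illustration only, here is a minimal sketch of one way a difficulty-weighted pass rate could be computed; the weights and scale are hypothetical, not the benchmark's actual scoring:

```python
def weighted_pass_rate(results):
    """Difficulty-weighted pass rate: hard fixes count for more.

    `results` holds (passed, difficulty) pairs, with difficulty on a
    hypothetical scale where trivial bugs score low and hard ones high.
    """
    total = sum(difficulty for _, difficulty in results)
    earned = sum(difficulty for passed, difficulty in results if passed)
    return earned / total if total else 0.0

# An agent that clears two easy bugs but misses the hard refactor
runs = [(True, 1.0), (True, 2.0), (False, 9.0)]
print(f"{weighted_pass_rate(runs):.1%}")  # 25.0%, vs an unweighted 66.7%
```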
Prefer interactive charts? Open the Benchmark Explorer →
Agent Leaderboard
15 agents ranked by pass rate across 128 real bugs. Each test runs in an isolated container with automated safety checks. The dataset is growing to 6,138+ vulnerabilities across 250+ projects. How verification works →
The leaderboard below ranks agents by the percentage of bugs they fix correctly. Pass rate is simple: out of 128 CVE samples, how many did the agent patch without breaking anything? The numbers matter for product engineering decisions: a 5-point pass-rate difference between two agents adds up to dozens of extra fixes per year when running across a large codebase.
Cost efficiency is equally important. Some agents reach 50%+ pass rates while others sit near 30%, but the cheaper agent at 30% might cost $0.50 per fix while the stronger one costs $5. Context matters: your security budget, team size, and bug volume all change the calculation.
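One way to put pass rate and price on the same axis is expected cost per verified fix: per-attempt cost divided by pass rate, since failed attempts are sunk cost. A back-of-envelope sketch in Python (the dollar figures are illustrative, not the benchmark's measured costs):

```python
def cost_per_verified_fix(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend per bug actually fixed; failed attempts are sunk cost."""
    return cost_per_attempt / pass_rate

# Illustrative: a cheap agent at a 30% pass rate vs. a pricier one at 50%
print(f"${cost_per_verified_fix(0.50, 0.30):.2f}")  # $1.67 per verified fix
print(f"${cost_per_verified_fix(5.00, 0.50):.2f}")  # $10.00 per verified fix
```

Under this lens the cheap agent wins here, but the ordering can flip once the stronger agent's higher pass rate covers bugs the cheap one never fixes.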
| Rank | Agent | Pass Rate | Passed | Failed | Build Errors | Infra Errors |
|---|---|---|---|---|---|---|
| 1 | codex-gpt-5.2 | 62.7% | 79 | 12 | 35 | 10 |
| 2 | cursor-opus-4.6 | 62.5% | 80 | 24 | 24 | 0 |
| 3 | claude-claude-opus-4-6 | 61.6% | 77 | 28 | 20 | 11 |
| 4 | gemini31-gemini-3.1-pro-preview | 58.7% | 64 | 18 | 27 | 19 |
| 5 | opencode-gemini-gemini-3.1-pro-preview | 54.9% | 67 | 25 | 30 | 6 |
| 6 | cursor-gpt-5.2 | 51.6% | 63 | 34 | 25 | 6 |
| 7 | opencode-gpt-5.2 | 51.6% | 63 | 11 | 48 | 14 |
| 8 | cursor-gpt-5.3-codex | 50.4% | 64 | 40 | 23 | 1 |
| 9 | codex-gpt-5.2-codex | 49.2% | 63 | 27 | 38 | 8 |
| 10 | opencode-claude-opus-4-6 | 47.5% | 58 | 15 | 49 | 14 |
| 11 | claude-claude-opus-4-5 | 45.7% | 58 | 43 | 26 | 9 |
| 12 | cursor-composer-1.5 | 45.2% | 57 | 39 | 30 | 2 |
| 13 | gemini-gemini-3-pro-preview | 43.0% | 55 | 36 | 37 | 8 |
| 14 | opencode-gpt-5.2-codex | 37.8% | 48 | 32 | 47 | 9 |
| 15 | opencode-claude-opus-4-5 | 36.8% | 46 | 29 | 50 | 11 |
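The published pass rates are reproducible from the outcome counts: they match Passed / (Passed + Failed + Build Errors), which suggests runs lost to infrastructure errors are excluded from the denominator. A quick check in Python against three rows:

```python
# (agent, passed, failed, build_errors) copied from the leaderboard
rows = [
    ("codex-gpt-5.2", 79, 12, 35),
    ("cursor-opus-4.6", 80, 24, 24),
    ("opencode-claude-opus-4-5", 46, 29, 50),
]
for agent, passed, failed, build in rows:
    rate = passed / (passed + failed + build)  # infra-error runs excluded
    print(f"{agent}: {rate:.1%}")
# codex-gpt-5.2: 62.7%, cursor-opus-4.6: 62.5%, opencode-claude-opus-4-5: 36.8%
```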
Rejections
- Reject unverifiable patches.
- Reject "pass" without bug reproduction.
- Reject black-box runs without trace evidence.
- Reject build or infra failures during verification.
These rejection criteria keep the benchmark honest. We do not accept patches that merely "look right" or happen to compile; a patch must actually fix the bug and pass the automated tests. Build and infrastructure failures during verification are reported in their own columns rather than quietly dropped from the data.
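As a hypothetical sketch (field names are illustrative, not the benchmark's actual schema), the rejection rules map onto the four outcome columns in the leaderboard like this:

```python
from dataclasses import dataclass

@dataclass
class Run:
    harness_ok: bool       # did the evaluation infrastructure itself work?
    build_ok: bool         # does the patched project still build?
    bug_reproduced: bool   # was the bug reproduced before patching?
    trace_available: bool  # do we have trace evidence, not a black box?
    tests_pass: bool       # does the patch pass the automated checks?

def classify(run: Run) -> str:
    """Map a run to one of the leaderboard's outcome columns."""
    if not run.harness_ok:
        return "infra"  # infrastructure failure during verification
    if not run.build_ok:
        return "build"  # patch does not build
    if run.bug_reproduced and run.trace_available and run.tests_pass:
        return "pass"
    return "fail"       # unverifiable or incorrect patches are rejected

print(classify(Run(True, True, True, True, True)))   # pass
print(classify(Run(True, True, False, True, True)))  # fail: no reproduction
```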
Confidence Intervals
95% Wald confidence intervals for each agent's pass rate. Wider intervals indicate greater statistical uncertainty: they separate the agents whose rankings you can rely on from those with wide performance bands.
An agent with a 60% pass rate and a tight 58-62% interval is more dependable than one at 60% with a 52-68% range. Tighter intervals come from larger sample counts. Use the intervals to account for sampling variability when comparing agents.
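For a pass rate p over n scored runs, the 95% Wald interval is p ± 1.96·sqrt(p(1−p)/n). A minimal sketch using the top agent's counts from the table (the published intervals may differ if computed over a different denominator):

```python
import math

def wald_ci(passed: int, n: int, z: float = 1.96):
    """95% Wald confidence interval for a binomial pass rate."""
    p = passed / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# codex-gpt-5.2: 79 passes out of 126 scored runs (infra runs excluded)
low, high = wald_ci(79, 126)
print(f"62.7% CI: [{low:.1%}, {high:.1%}]")  # about [54.3%, 71.1%]
```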
Next Steps
What to do with this data
Explore more
- Economics: cost per fix, best trade-offs, running multiple agents
- Agent profiles: how agents differ, where they agree and disagree
- Methodology: how we test, validity checks, difficulty scoring
- Bug complexity: which bugs are easy, which are impossible
- Agent strategies: behavioral clusters and session patterns
FAQ
Which agent has the highest pass rate?
Codex GPT-5.2 at 62.7% on the CVE benchmark dataset. See the full rankings with cost breakdowns.
How much does it cost to fix a vulnerability?
Costs range from $2.64 to $52 per verified fix, depending on agent and model. Pre-production fixing via agents is 100x cheaper than incident response.
Are these costs real or estimates?
Real. Calculated from 1,920 evaluations at actual API costs (no rounding, no statistical assumptions).
Do pass rates change?
Yes. As new models ship, the benchmark updates: we re-run tests regularly so rankings stay current.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.
Cost Analysis
10 findings on what AI patching costs and whether it pays off. 1,920 evaluations analyzed.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.