Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Rankings
- Pass rate (primary metric)
- Cost per verified fix
- Difficulty-weighted performance
- Trend over time
Cost Economics
Fix a CVE with an agent for $2.64–$52. Incident response costs thousands. Scale pre-production agent fixes instead.
Difficulty Scoring
Vulnerabilities range from trivial (syntax errors) to hard (architectural refactors). The difficulty score shows how agent capability varies across different threat classes.
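For illustration only, here is a minimal sketch of one way a difficulty-weighted pass rate could be computed; the weights and scale are hypothetical, not the benchmark's actual scoring:

```python
def weighted_pass_rate(results):
    """Difficulty-weighted pass rate: hard fixes count for more.

    `results` holds (passed, difficulty) pairs, with difficulty on a
    hypothetical scale where trivial bugs score low and hard ones high.
    """
    total = sum(difficulty for _, difficulty in results)
    earned = sum(difficulty for passed, difficulty in results if passed)
    return earned / total if total else 0.0

# An agent that clears two easy bugs but misses the hard refactor
runs = [(True, 1.0), (True, 2.0), (False, 9.0)]
print(f"{weighted_pass_rate(runs):.1%}")  # 25.0%, vs an unweighted 66.7%
```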
Prefer interactive charts? Open the Benchmark Explorer →
Agent Leaderboard
15 agents ranked by pass rate across 128 real bugs. Each test runs in an isolated container with automated safety checks. The dataset is growing to 6,138+ vulnerabilities across 250+ projects. How verification works →
The leaderboard below ranks agents by the percentage of bugs they fix correctly. Pass rate is simple: out of 128 CVE samples, how many did the agent patch without breaking anything? The numbers matter for product engineering decisions: a 5-point pass-rate difference between two agents adds up to dozens of extra fixes per year when running across a large codebase.
Cost efficiency is equally important. Some agents reach 50%+ pass rates while others sit near 30%, but the cheaper agent at 30% might cost $0.50 per fix while the stronger one costs $5. Context matters: your security budget, team size, and bug volume all change the calculation.
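One way to put pass rate and price on the same axis is expected cost per verified fix: per-attempt cost divided by pass rate, since failed attempts are sunk cost. A back-of-envelope sketch in Python (the dollar figures are illustrative, not the benchmark's measured costs):

```python
def cost_per_verified_fix(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend per bug actually fixed; failed attempts are sunk cost."""
    return cost_per_attempt / pass_rate

# Illustrative: a cheap agent at a 30% pass rate vs. a pricier one at 50%
print(f"${cost_per_verified_fix(0.50, 0.30):.2f}")  # $1.67 per verified fix
print(f"${cost_per_verified_fix(5.00, 0.50):.2f}")  # $10.00 per verified fix
```

Under this lens the cheap agent wins here, but the ordering can flip once the stronger agent's higher pass rate covers bugs the cheap one never fixes.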
| Rank | Agent | Pass Rate | Passed | Failed | Build Errors | Infra Errors |
|---|---|---|---|---|---|---|
| 1 | codex-gpt-5.2 | 62.7% | 79 | 12 | 35 | 10 |
| 2 | cursor-opus-4.6 | 62.5% | 80 | 24 | 24 | 0 |
| 3 | claude-claude-opus-4-6 | 61.6% | 77 | 28 | 20 | 11 |
| 4 | gemini31-gemini-3.1-pro-preview | 58.7% | 64 | 18 | 27 | 19 |
| 5 | opencode-gemini-gemini-3.1-pro-preview | 54.9% | 67 | 25 | 30 | 6 |
| 6 | cursor-gpt-5.2 | 51.6% | 63 | 34 | 25 | 6 |
| 7 | opencode-gpt-5.2 | 51.6% | 63 | 11 | 48 | 14 |
| 8 | cursor-gpt-5.3-codex | 50.4% | 64 | 40 | 23 | 1 |
| 9 | codex-gpt-5.2-codex | 49.2% | 63 | 27 | 38 | 8 |
| 10 | opencode-claude-opus-4-6 | 47.5% | 58 | 15 | 49 | 14 |
| 11 | claude-claude-opus-4-5 | 45.7% | 58 | 43 | 26 | 9 |
| 12 | cursor-composer-1.5 | 45.2% | 57 | 39 | 30 | 2 |
| 13 | gemini-gemini-3-pro-preview | 43.0% | 55 | 36 | 37 | 8 |
| 14 | opencode-gpt-5.2-codex | 37.8% | 48 | 32 | 47 | 9 |
| 15 | opencode-claude-opus-4-5 | 36.8% | 46 | 29 | 50 | 11 |
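The published pass rates are reproducible from the outcome counts: they match Passed / (Passed + Failed + Build Errors), which suggests runs lost to infrastructure errors are excluded from the denominator. A quick check in Python against three rows:

```python
# (agent, passed, failed, build_errors) copied from the leaderboard
rows = [
    ("codex-gpt-5.2", 79, 12, 35),
    ("cursor-opus-4.6", 80, 24, 24),
    ("opencode-claude-opus-4-5", 46, 29, 50),
]
for agent, passed, failed, build in rows:
    rate = passed / (passed + failed + build)  # infra-error runs excluded
    print(f"{agent}: {rate:.1%}")
# codex-gpt-5.2: 62.7%, cursor-opus-4.6: 62.5%, opencode-claude-opus-4-5: 36.8%
```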
Rejections
- Reject unverifiable patches.
- Reject "pass" without bug reproduction.
- Reject black-box runs without trace evidence.
- Reject build or infra failures during verification.
These rejection criteria keep the benchmark honest. We do not accept patches that merely "look right" or happen to compile; a patch must actually fix the bug and pass the automated tests. Build and infrastructure failures during verification are reported in their own columns rather than quietly dropped from the data.
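As a hypothetical sketch (field names are illustrative, not the benchmark's actual schema), the rejection rules map onto the four outcome columns in the leaderboard like this:

```python
from dataclasses import dataclass

@dataclass
class Run:
    harness_ok: bool       # did the evaluation infrastructure itself work?
    build_ok: bool         # does the patched project still build?
    bug_reproduced: bool   # was the bug reproduced before patching?
    trace_available: bool  # do we have trace evidence, not a black box?
    tests_pass: bool       # does the patch pass the automated checks?

def classify(run: Run) -> str:
    """Map a run to one of the leaderboard's outcome columns."""
    if not run.harness_ok:
        return "infra"  # infrastructure failure during verification
    if not run.build_ok:
        return "build"  # patch does not build
    if run.bug_reproduced and run.trace_available and run.tests_pass:
        return "pass"
    return "fail"       # unverifiable or incorrect patches are rejected

print(classify(Run(True, True, True, True, True)))   # pass
print(classify(Run(True, True, False, True, True)))  # fail: no reproduction
```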
Confidence Intervals
95% Wald confidence intervals for each agent's pass rate. Wider intervals indicate greater statistical uncertainty: they separate the agents whose rankings you can rely on from those with wide performance bands.
An agent with a 60% pass rate and a tight 58-62% interval is more dependable than one at 60% with a 52-68% range. Tighter intervals come from larger sample counts. Use the intervals to account for sampling variability when comparing agents.
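For a pass rate p over n scored runs, the 95% Wald interval is p ± 1.96·sqrt(p(1−p)/n). A minimal sketch using the top agent's counts from the table (the published intervals may differ if computed over a different denominator):

```python
import math

def wald_ci(passed: int, n: int, z: float = 1.96):
    """95% Wald confidence interval for a binomial pass rate."""
    p = passed / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# codex-gpt-5.2: 79 passes out of 126 scored runs (infra runs excluded)
low, high = wald_ci(79, 126)
print(f"62.7% CI: [{low:.1%}, {high:.1%}]")  # about [54.3%, 71.1%]
```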
Next Steps
What to do with this data
Explore more
- Economics: cost per fix, best trade-offs, running multiple agents
- Agent profiles: how agents differ, where they agree and disagree
- Methodology: how we test, validity checks, difficulty scoring
- Bug complexity: which bugs are easy, which are impossible
- Agent strategies: behavioral clusters and session patterns
FAQ
Which agent has the highest pass rate?
Codex GPT-5.2 at 62.7% on the CVE benchmark dataset. See the full rankings with cost breakdowns.
How much does it cost to fix a vulnerability?
Costs range from $2.64 to $52 per verified fix, depending on agent and model. Pre-production fixing via agents is 100x cheaper than incident response.
Are these costs real or estimates?
Real. Calculated from 1,920 evaluations at actual API costs (no rounding, no statistical assumptions).
Do pass rates change?
Yes. As new models ship, the benchmark updates: we re-run tests regularly so rankings stay current.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.
Cost Analysis
10 findings on what AI patching costs and whether it pays off. 1,920 evaluations analyzed.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.