# Cost vs performance: where agents sit on the Pareto frontier

15 agents plotted on cost vs. accuracy; 4 sit on the Pareto frontier. Best value: Claude Opus 4.6 at $2.93 per pass and 61.6% accuracy.
## Pareto frontier: 4 agents dominate the rest

When plotting cost against accuracy, only 4 of the 15 agents sit on the Pareto frontier. Every other agent is dominated by at least one frontier agent: either lower cost at equal or better accuracy, or better accuracy at equal or lower cost.

The frontier agents represent the efficient trade-offs:
| Agent | Model | Pass rate | Cost per fix | Frontier? |
|---|---|---|---|---|
| Claude Opus 4.5 | Opus 4.5 | 45.7% | $2.64 | Yes |
| Claude Opus 4.6 | Opus 4.6 | 61.6% | $2.93 | Yes |
| Gemini 3.1 Pro | 3.1 Pro | 58.7% | $3.92 | Yes |
| Codex GPT-5.2 | GPT-5.2 | 62.7% | $5.30 | Yes |
Everything else is economically dominated. OpenCode with Opus 4.6 costs $51.88 per fix at 47.5% accuracy; Claude Opus 4.6 achieves 61.6% at $2.93. That's 14.1 percentage points higher accuracy at roughly 6% of the cost.
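The dominance check itself is mechanical. Here's a minimal Python sketch over a subset of the numbers quoted in this post; the `dominates` helper is an illustration, not the benchmark's actual tooling:

```python
# Minimal Pareto-frontier check over (cost per fix, pass rate) pairs.
# An agent is dominated if some other agent is no worse on both axes
# and strictly better on at least one.

agents = {
    # name: (cost_per_fix_usd, pass_rate_pct), figures quoted in this post
    "Claude Opus 4.5":     (2.64, 45.7),
    "Claude Opus 4.6":     (2.93, 61.6),
    "Codex GPT-5.2":       (5.30, 62.7),
    "Cursor Opus 4.6":     (35.40, 62.5),
    "Cursor Composer 1.5": (3.93, 45.2),
    "OpenCode Opus 4.6":   (51.88, 47.5),
}

def dominates(a, b):
    """a dominates b: no worse on both axes, strictly better on at least one."""
    (cost_a, acc_a), (cost_b, acc_b) = a, b
    return cost_a <= cost_b and acc_a >= acc_b and (cost_a < cost_b or acc_a > acc_b)

frontier = [
    name for name, point in agents.items()
    if not any(dominates(p, point) for other, p in agents.items() if other != name)
]
print(frontier)  # ['Claude Opus 4.5', 'Claude Opus 4.6', 'Codex GPT-5.2']
```

Running this on the subset above reproduces the pattern described here: the cheap native CLIs survive, and the expensive wrappers fall off the frontier.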
## Below the frontier

The 11 non-frontier agents fail in one of two ways: they match a frontier agent's accuracy at higher cost, or they match its cost at lower accuracy.
Cursor with Opus 4.6 (62.5% at $35.40) nearly matches Codex's accuracy (62.7% at $5.30) but costs 6.7 times more. The accuracy is there. The economics aren't.
OpenCode with GPT-5.2 (51.6% at $6.65) and Cursor with GPT-5.2 (51.6% at $6.26) both underperform Codex by 11.1 percentage points. If you need GPT-5.2, use the native Codex CLI, not the wrappers.
Gemini 3.0 (43.0% at $4.85) is dominated by Gemini 3.1 Pro (58.7% at $3.92). The 3.1 upgrade is 15.7 percentage points better at lower cost, the single largest model improvement in the benchmark.
Cursor with Composer 1.5 (45.2% at $3.93) is beaten by Claude Opus 4.5 (45.7% at $2.64) on accuracy with lower cost, and by Claude Opus 4.6 (61.6% at $2.93) by a wide margin.
## The value zone

Everything on the frontier costs between $2.64 and $5.30 per fix. Anything above $6 per fix is economically dominated by a frontier agent.
With a $100 budget:
- Claude Opus 4.5: fixes 37.9 CVEs
- Claude Opus 4.6: fixes 34.1 CVEs
- Gemini 3.1 Pro: fixes 25.5 CVEs
- Codex GPT-5.2: fixes 18.9 CVEs
- OpenCode Opus 4.6: fixes 1.9 CVEs
The cost differences determine throughput more than the accuracy differences. Claude Opus 4.6 (61.6% accurate) and Codex GPT-5.2 (62.7% accurate) are nearly identical on accuracy, but Claude CLI fixes more CVEs per dollar because the model is cheaper to call.
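The arithmetic behind that list is just budget divided by cost per successful fix. A quick sketch reproducing the figures above, using the per-fix costs quoted in this post:

```python
# Expected CVE fixes for a fixed budget: budget / cost_per_successful_fix.
cost_per_fix = {
    "Claude Opus 4.5":   2.64,
    "Claude Opus 4.6":   2.93,
    "Gemini 3.1 Pro":    3.92,
    "Codex GPT-5.2":     5.30,
    "OpenCode Opus 4.6": 51.88,
}

budget = 100.0
for agent, cost in cost_per_fix.items():
    print(f"{agent}: {budget / cost:.1f} expected fixes")
# Claude Opus 4.5: 37.9, Claude Opus 4.6: 34.1, Gemini 3.1 Pro: 25.5,
# Codex GPT-5.2: 18.9, OpenCode Opus 4.6: 1.9
```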
## Beyond the frontier
No agent sits above the frontier. The highest-accuracy agent (Codex GPT-5.2 at 62.7%) is also on the frontier at $5.30 per fix. You cannot buy higher accuracy by spending more money.
This means the benchmark has hit a capability wall around 62.7%: none of the 15 configurations exceed it. Two come close (Cursor Opus 4.6 at 62.5%, Claude Opus 4.6 at 61.6%), but neither surpasses the leader.
Model upgrades matter. The Gemini 3.0 to 3.1 jump shows a 15.7pp improvement. The Claude Opus 4.5 to 4.6 improvement is harder to measure directly, but Claude Opus 4.6 (61.6%) clearly outperforms similar-era alternatives.
## Decision framework

- **If accuracy is paramount:** use Codex GPT-5.2 (62.7% pass rate). It's the highest point on the frontier.
- **If cost matters equally:** use Claude Opus 4.6 (61.6% pass rate at $2.93 per fix). Only 1.1 percentage points behind the leader, but 1.8 times cheaper.
- **If budget is tight:** use Claude Opus 4.5 (45.7% pass rate at $2.64 per fix). It's the cheapest frontier option with the lowest pass rate, but it still beats every non-frontier agent on value.
- **If you're evaluating a new model:** check whether it sits on the frontier or below it. If below, it's dominated, and no amount of tuning will change that until the underlying model improves.
- **If you need multi-model deployment:** run each model's native CLI rather than a wrapper. The 10-16 percentage point gap is too large to justify for flexibility.
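In executable form, the framework reduces to "highest accuracy under your cost cap". A hypothetical sketch, where `pick_agent` and its cost-cap interface are inventions for illustration and the figures come from the frontier table above:

```python
# Hypothetical selector: given the maximum cost per fix you can tolerate,
# pick the highest-accuracy frontier agent under that cap.
FRONTIER = [
    # (name, cost_per_fix_usd, pass_rate_pct), figures from this post
    ("Claude Opus 4.5", 2.64, 45.7),
    ("Claude Opus 4.6", 2.93, 61.6),
    ("Gemini 3.1 Pro",  3.92, 58.7),
    ("Codex GPT-5.2",   5.30, 62.7),
]

def pick_agent(max_cost_per_fix: float):
    candidates = [a for a in FRONTIER if a[1] <= max_cost_per_fix]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a[2])  # highest pass rate under the cap

print(pick_agent(3.00))   # ('Claude Opus 4.6', 2.93, 61.6)
print(pick_agent(10.00))  # ('Codex GPT-5.2', 5.30, 62.7)
```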
See also: Benchmark results | Native vs wrapper analysis | Economics analysis | Codex GPT-5.2 profile
## FAQ

### Which agent has the best cost-accuracy tradeoff?
Claude Opus 4.6 at $2.93/pass and 61.6% pass rate sits on the Pareto frontier. Gemini 3.1 Pro at $3.92/pass and 58.7% is the next-best option.