# Cost vs performance: where agents sit on the Pareto frontier

15 agents plotted on cost vs. accuracy; 4 sit on the Pareto frontier. Best value: Claude Opus 4.6 at $2.93 per pass and 61.6% accuracy.
## Pareto frontier: 4 agents dominate the rest

When plotting cost against accuracy, only 4 of the 15 agents sit on the Pareto frontier. Every other agent is dominated by at least one frontier agent: either lower cost at equal or better accuracy, or better accuracy at equal or lower cost.

The frontier agents represent the efficient trade-offs:
| Agent | Model | Pass rate | Cost per fix | Frontier? |
|---|---|---|---|---|
| Claude Opus 4.5 | Opus 4.5 | 45.7% | $2.64 | Yes |
| Claude Opus 4.6 | Opus 4.6 | 61.6% | $2.93 | Yes |
| Gemini 3.1 Pro | 3.1 Pro | 58.7% | $3.92 | Yes |
| Codex GPT-5.2 | GPT-5.2 | 62.7% | $5.30 | Yes |
Everything else is economically dominated. OpenCode with Opus 4.6 costs $51.88 per fix at 47.5% accuracy; Claude Opus 4.6 achieves 61.6% at $2.93. That's 14.1 percentage points higher accuracy at roughly 6% of the cost.
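The dominance check itself is mechanical. Here's a minimal Python sketch over a subset of the numbers quoted in this post; the `dominates` helper is an illustration, not the benchmark's actual tooling:

```python
# Minimal Pareto-frontier check over (cost per fix, pass rate) pairs.
# An agent is dominated if some other agent is no worse on both axes
# and strictly better on at least one.

agents = {
    # name: (cost_per_fix_usd, pass_rate_pct), figures quoted in this post
    "Claude Opus 4.5":     (2.64, 45.7),
    "Claude Opus 4.6":     (2.93, 61.6),
    "Codex GPT-5.2":       (5.30, 62.7),
    "Cursor Opus 4.6":     (35.40, 62.5),
    "Cursor Composer 1.5": (3.93, 45.2),
    "OpenCode Opus 4.6":   (51.88, 47.5),
}

def dominates(a, b):
    """a dominates b: no worse on both axes, strictly better on at least one."""
    (cost_a, acc_a), (cost_b, acc_b) = a, b
    return cost_a <= cost_b and acc_a >= acc_b and (cost_a < cost_b or acc_a > acc_b)

frontier = [
    name for name, point in agents.items()
    if not any(dominates(p, point) for other, p in agents.items() if other != name)
]
print(frontier)  # ['Claude Opus 4.5', 'Claude Opus 4.6', 'Codex GPT-5.2']
```

Running this on the subset above reproduces the pattern described here: the cheap native CLIs survive, and the expensive wrappers fall off the frontier.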
## Below the frontier

The 11 non-frontier agents fail in one of two ways: they match a frontier agent's accuracy at higher cost, or they match its cost at lower accuracy.
Cursor with Opus 4.6 (62.5% at $35.40) nearly matches Codex's accuracy (62.7% at $5.30) but costs 6.7 times more. The accuracy is there. The economics aren't.
OpenCode with GPT-5.2 (51.6% at $6.65) and Cursor with GPT-5.2 (51.6% at $6.26) both underperform Codex by 11.1 percentage points. If you need GPT-5.2, use the native Codex CLI, not the wrappers.
Gemini 3.0 (43.0% at $4.85) is dominated by Gemini 3.1 Pro (58.7% at $3.92). The 3.1 upgrade is 15.7 percentage points better at lower cost, the single largest model improvement in the benchmark.
Cursor with Composer 1.5 (45.2% at $3.93) is beaten by Claude Opus 4.5 (45.7% at $2.64) on accuracy with lower cost, and by Claude Opus 4.6 (61.6% at $2.93) by a wide margin.
## The value zone

Everything on the frontier costs between $2.64 and $5.30 per fix. Anything above $6 per fix is economically dominated by a frontier agent.
With a $100 budget:
- Claude Opus 4.5: fixes 37.9 CVEs
- Claude Opus 4.6: fixes 34.1 CVEs
- Gemini 3.1 Pro: fixes 25.5 CVEs
- Codex GPT-5.2: fixes 18.9 CVEs
- OpenCode Opus 4.6: fixes 1.9 CVEs
The cost differences determine throughput more than the accuracy differences. Claude Opus 4.6 (61.6% accurate) and Codex GPT-5.2 (62.7% accurate) are nearly identical on accuracy, but Claude CLI fixes more CVEs per dollar because the model is cheaper to call.
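The arithmetic behind that list is just budget divided by cost per successful fix. A quick sketch reproducing the figures above, using the per-fix costs quoted in this post:

```python
# Expected CVE fixes for a fixed budget: budget / cost_per_successful_fix.
cost_per_fix = {
    "Claude Opus 4.5":   2.64,
    "Claude Opus 4.6":   2.93,
    "Gemini 3.1 Pro":    3.92,
    "Codex GPT-5.2":     5.30,
    "OpenCode Opus 4.6": 51.88,
}

budget = 100.0
for agent, cost in cost_per_fix.items():
    print(f"{agent}: {budget / cost:.1f} expected fixes")
# Claude Opus 4.5: 37.9, Claude Opus 4.6: 34.1, Gemini 3.1 Pro: 25.5,
# Codex GPT-5.2: 18.9, OpenCode Opus 4.6: 1.9
```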
## Beyond the frontier
No agent sits above the frontier. The highest-accuracy agent (Codex GPT-5.2 at 62.7%) is also on the frontier at $5.30 per fix. You cannot buy higher accuracy by spending more money.
This means the benchmark has hit a capability wall around 62.7%: none of the 15 configurations exceed it. Two come close (Cursor Opus 4.6 at 62.5%, Claude Opus 4.6 at 61.6%), but neither surpasses the leader.
Model upgrades matter. The Gemini 3.0 to 3.1 jump shows a 15.7pp improvement. The Claude Opus 4.5 to 4.6 improvement is harder to measure directly, but Claude Opus 4.6 (61.6%) clearly outperforms similar-era alternatives.
## Decision framework

- **If accuracy is paramount:** use Codex GPT-5.2 (62.7% pass rate). It's the highest point on the frontier.
- **If cost matters equally:** use Claude Opus 4.6 (61.6% pass rate at $2.93 per fix). Only 1.1 percentage points behind the leader, but 1.8 times cheaper.
- **If budget is tight:** use Claude Opus 4.5 (45.7% pass rate at $2.64 per fix). It's the cheapest frontier option with the lowest pass rate, but it still beats every non-frontier agent on value.
- **If you're evaluating a new model:** check whether it sits on the frontier or below it. If below, it's dominated, and no amount of tuning will change that until the underlying model improves.
- **If you need multi-model deployment:** run each model's native CLI rather than a wrapper. The 10-16 percentage point gap is too large to justify for flexibility.
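In executable form, the framework reduces to "highest accuracy under your cost cap". A hypothetical sketch, where `pick_agent` and its cost-cap interface are inventions for illustration and the figures come from the frontier table above:

```python
# Hypothetical selector: given the maximum cost per fix you can tolerate,
# pick the highest-accuracy frontier agent under that cap.
FRONTIER = [
    # (name, cost_per_fix_usd, pass_rate_pct), figures from this post
    ("Claude Opus 4.5", 2.64, 45.7),
    ("Claude Opus 4.6", 2.93, 61.6),
    ("Gemini 3.1 Pro",  3.92, 58.7),
    ("Codex GPT-5.2",   5.30, 62.7),
]

def pick_agent(max_cost_per_fix: float):
    candidates = [a for a in FRONTIER if a[1] <= max_cost_per_fix]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a[2])  # highest pass rate under the cap

print(pick_agent(3.00))   # ('Claude Opus 4.6', 2.93, 61.6)
print(pick_agent(10.00))  # ('Codex GPT-5.2', 5.30, 62.7)
```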
See also: Benchmark results | Native vs wrapper analysis | Economics analysis | Codex GPT-5.2 profile
## FAQ

### Which agent has the best cost-accuracy tradeoff?
Claude Opus 4.6 at $2.93/pass and 61.6% pass rate sits on the Pareto frontier. Gemini 3.1 Pro at $3.92/pass and 58.7% is the next-best option.