Native CLIs vs wrapper CLIs: the 10-16pp performance gap
Claude CLI vs OpenCode, Gemini CLI vs OpenCode, Codex vs Cursor. Same models, different wrappers, consistent accuracy gaps of 10-16 percentage points.
Native CLIs consistently outperform wrappers
The benchmark reveals a systematic accuracy gap between native CLIs and wrapper CLIs running the same underlying model. Native configurations achieve 10-16 percentage point higher pass rates while costing 1.5 to 18 times less per fix.
Claude CLI with Opus 4.6 reaches 61.6% accuracy at $2.93 per fix. OpenCode with the same model hits 47.5% at $51.88: 14.1 percentage points lower at 17.7 times the cost. Cursor with Opus 4.6 achieves 62.5%, but at $35.40 per fix it is 12 times more expensive than Claude CLI despite nearly identical accuracy.
This pattern holds across all three major model families:
| Configuration | Model | Pass rate | Cost per fix | Gap from native |
|---|---|---|---|---|
| Claude CLI | Opus 4.6 | 61.6% | $2.93 | (native) |
| OpenCode | Opus 4.6 | 47.5% | $51.88 | -14.1pp |
| Cursor | Opus 4.6 | 62.5% | $35.40 | +0.9pp |
| Codex | GPT-5.2 | 62.7% | $5.30 | (native) |
| OpenCode | GPT-5.2 | 51.6% | $6.65 | -11.1pp |
| Cursor | GPT-5.2 | 51.6% | $6.26 | -11.1pp |
| Gemini CLI | 3.1 Pro | 58.7% | $3.92 | (native) |
| OpenCode | 3.1 Pro | 54.9% | $5.81 | -3.8pp |
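The gaps and cost multipliers in the table can be recomputed directly from the raw figures. A minimal sketch (all numbers transcribed from the table above; the pairing of each wrapper with its native baseline follows the article's groupings):

```python
# Recompute the native-vs-wrapper gaps from the table above.
# Values: (pass_rate_pct, cost_per_fix_usd), transcribed from the benchmark table.
configs = {
    "Claude CLI / Opus 4.6": (61.6, 2.93),
    "OpenCode / Opus 4.6":   (47.5, 51.88),
    "Cursor / Opus 4.6":     (62.5, 35.40),
    "Codex / GPT-5.2":       (62.7, 5.30),
    "OpenCode / GPT-5.2":    (51.6, 6.65),
    "Cursor / GPT-5.2":      (51.6, 6.26),
    "Gemini CLI / 3.1 Pro":  (58.7, 3.92),
    "OpenCode / 3.1 Pro":    (54.9, 5.81),
}

# Each wrapper run paired with the native CLI running the same model.
pairs = [
    ("OpenCode / Opus 4.6", "Claude CLI / Opus 4.6"),
    ("Cursor / Opus 4.6",   "Claude CLI / Opus 4.6"),
    ("OpenCode / GPT-5.2",  "Codex / GPT-5.2"),
    ("Cursor / GPT-5.2",    "Codex / GPT-5.2"),
    ("OpenCode / 3.1 Pro",  "Gemini CLI / 3.1 Pro"),
]

for wrapper, native in pairs:
    (w_acc, w_cost), (n_acc, n_cost) = configs[wrapper], configs[native]
    gap = w_acc - n_acc        # accuracy gap in percentage points vs native
    mult = w_cost / n_cost     # cost multiplier vs native
    print(f"{wrapper:22s} {gap:+5.1f}pp  {mult:4.1f}x cost")
```

Running this reproduces the "Gap from native" column, including the 17.7x cost multiplier for OpenCode on Opus 4.6.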
Why native CLIs win
Three factors explain the performance gap. First, native CLIs handle environment setup more reliably: fewer build failures mean more test runs actually execute the agent's patch. OpenCode and Cursor delegate build responsibility to the agent, which has to install dependencies, configure paths, and handle compiler flags, all steps that can fail.
Second, native CLIs integrate tools more efficiently. Claude CLI gives direct file manipulation, testing, and execution without abstraction layers. Wrappers like OpenCode and Cursor route agent actions through intermediate interfaces, introducing latency and communication overhead that slows decision cycles.
Third, native CLIs use model-specific prompt engineering. The Claude team optimizes its CLI prompts for how Opus behaves. The Codex CLI is tuned for GPT-5.2's completion patterns. Wrappers use generic prompts designed to work with multiple models, which means they optimize for none.
The cost multiplier
The cost difference amplifies the accuracy gap. Claude CLI costs 17.7 times less than OpenCode for Opus 4.6 ($2.93 vs $51.88 per fix). This isn't just pricing; it reflects infrastructure efficiency. Native CLIs bundle the model call with the CLI runtime, minimizing overhead. Wrappers add service fees, agent abstraction layers, and API mark-ups.
For an organization fixing 100 CVEs:
- Claude CLI: 61.6 fixes at $180 total cost ($2.93 × 61.6)
- OpenCode: 47.5 fixes at $2,464 total cost ($51.88 × 47.5)
Same model. The wrapper version costs 13.7 times more to achieve 14.1 percentage points lower accuracy.
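The 100-CVE arithmetic generalizes to any batch size. A quick sketch of the expected-fixes and total-cost calculation (the `campaign` helper is illustrative; figures are the article's Opus 4.6 numbers):

```python
def campaign(cves: int, pass_rate_pct: float, cost_per_fix: float):
    """Expected successful fixes and total spend for a batch of CVE attempts,
    assuming spend is billed per successful fix as in the article's examples."""
    fixes = cves * pass_rate_pct / 100   # expected number of passing fixes
    total = fixes * cost_per_fix         # total spend at the per-fix price
    return fixes, total

# Claude CLI vs OpenCode, both running Opus 4.6, over 100 CVEs.
print(campaign(100, 61.6, 2.93))    # Claude CLI: ~61.6 fixes, ~$180
print(campaign(100, 47.5, 51.88))   # OpenCode:  ~47.5 fixes, ~$2,464
```

The 13.7x total-cost ratio quoted above falls out directly: $2,464 / $180.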
Cursor's exception
Cursor with Opus 4.6 is the one wrapper that approaches native performance on accuracy (62.5% vs 61.6%). However, it costs $35.40 per fix compared to Claude CLI's $2.93. The accuracy is comparable; the efficiency is not.
For 100 CVEs:
- Claude CLI: 61.6 fixes at $180
- Cursor: 62.5 fixes at $2,212
Even when Cursor wins on accuracy, it loses on cost. If accuracy is your only metric, Cursor matches the native CLI. If cost matters at all, it doesn't.
Recommendations
Use the native CLI for your chosen model. If you deploy Claude exclusively, use Claude CLI. If you standardize on GPT-5.2, use Codex. If Gemini is your platform, use Gemini CLI.
If you need multi-model support and can't justify three separate native tools, OpenCode with Gemini 3.1 offers the best wrapper compromise: 54.9% pass rate at $5.81 per fix. It's 3.8 percentage points below the native Gemini CLI but cheaper than Cursor and ahead of OpenCode's other configurations.
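The recommendation reduces to a simple routing rule: prefer the native CLI for the model you run, and fall back to the best wrapper otherwise. A hypothetical sketch (the model keys and `pick_agent` function are illustrative, not part of the benchmark):

```python
# Hypothetical routing rule encoding the recommendation above.
# Tool names match the article; model keys are illustrative identifiers.
NATIVE_CLI = {
    "opus-4.6":       "Claude CLI",
    "gpt-5.2":        "Codex",
    "gemini-3.1-pro": "Gemini CLI",
}

def pick_agent(model: str) -> str:
    """Prefer the model's native CLI; otherwise fall back to OpenCode,
    the best wrapper compromise per the benchmark."""
    return NATIVE_CLI.get(model, "OpenCode")
```

A team standardizing on one model never hits the fallback branch; multi-model teams pay the wrapper penalty only where no native CLI exists.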
The native-vs-wrapper gap exists because native CLIs are built for one model while wrappers are built for all models. In this benchmark, specialization wins.
See Benchmark results | Economics analysis | Claude Opus 4.6 profile | Codex GPT-5.2 profile | Gemini 3.1 Pro profile
FAQ
How much does the CLI wrapper affect agent performance?
Native CLIs outperform wrapper CLIs by 10-16 percentage points on the same model. Claude CLI achieves 61.6% vs OpenCode's 47.5% with identical Opus 4.6. The cost difference is even larger: $2.93 vs $51.88 per fix.