Native CLIs vs wrapper CLIs: the 10-16pp performance gap
Claude CLI vs OpenCode, Gemini CLI vs OpenCode, Codex vs Cursor. Same models, different wrappers, consistent accuracy gaps of 10-16 percentage points.
Native CLIs consistently outperform wrappers
The benchmark reveals a systematic accuracy gap between native CLIs and wrapper CLIs running the same underlying model. Native configurations achieve 10-16 percentage point higher pass rates while costing 1.5 to 18 times less per fix.
Claude CLI with Opus 4.6 reaches 61.6% accuracy at $2.93 per fix. OpenCode with the same model hits 47.5% at $51.88: 14.1 percentage points lower at 17.7 times the cost. Cursor with Opus 4.6 achieves 62.5%, but at $35.40 per fix it is 12 times more expensive than Claude CLI despite nearly identical accuracy.
This pattern holds across all three major model families:
| Configuration | Model | Pass rate | Cost per fix | Gap from native |
|---|---|---|---|---|
| Claude CLI | Opus 4.6 | 61.6% | $2.93 | (native) |
| OpenCode | Opus 4.6 | 47.5% | $51.88 | -14.1pp |
| Cursor | Opus 4.6 | 62.5% | $35.40 | +0.9pp |
| Codex | GPT-5.2 | 62.7% | $5.30 | (native) |
| OpenCode | GPT-5.2 | 51.6% | $6.65 | -11.1pp |
| Cursor | GPT-5.2 | 51.6% | $6.26 | -11.1pp |
| Gemini CLI | 3.1 Pro | 58.7% | $3.92 | (native) |
| OpenCode | 3.1 Pro | 54.9% | $5.81 | -3.8pp |
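The gaps and cost multipliers in the table can be recomputed directly from the raw figures. A minimal sketch (all numbers transcribed from the table above; the pairing of each wrapper with its native baseline follows the article's groupings):

```python
# Recompute the native-vs-wrapper gaps from the table above.
# Values: (pass_rate_pct, cost_per_fix_usd), transcribed from the benchmark table.
configs = {
    "Claude CLI / Opus 4.6": (61.6, 2.93),
    "OpenCode / Opus 4.6":   (47.5, 51.88),
    "Cursor / Opus 4.6":     (62.5, 35.40),
    "Codex / GPT-5.2":       (62.7, 5.30),
    "OpenCode / GPT-5.2":    (51.6, 6.65),
    "Cursor / GPT-5.2":      (51.6, 6.26),
    "Gemini CLI / 3.1 Pro":  (58.7, 3.92),
    "OpenCode / 3.1 Pro":    (54.9, 5.81),
}

# Each wrapper run paired with the native CLI running the same model.
pairs = [
    ("OpenCode / Opus 4.6", "Claude CLI / Opus 4.6"),
    ("Cursor / Opus 4.6",   "Claude CLI / Opus 4.6"),
    ("OpenCode / GPT-5.2",  "Codex / GPT-5.2"),
    ("Cursor / GPT-5.2",    "Codex / GPT-5.2"),
    ("OpenCode / 3.1 Pro",  "Gemini CLI / 3.1 Pro"),
]

for wrapper, native in pairs:
    (w_acc, w_cost), (n_acc, n_cost) = configs[wrapper], configs[native]
    gap = w_acc - n_acc        # accuracy gap in percentage points vs native
    mult = w_cost / n_cost     # cost multiplier vs native
    print(f"{wrapper:22s} {gap:+5.1f}pp  {mult:4.1f}x cost")
```

Running this reproduces the "Gap from native" column, including the 17.7x cost multiplier for OpenCode on Opus 4.6.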
Why native CLIs win
Three factors explain the performance gap. First, native CLIs handle environment setup more reliably: fewer build failures mean more test runs actually execute the agent's patch. OpenCode and Cursor delegate build responsibility to the agent, which has to install dependencies, configure paths, and handle compiler flags, all steps that can fail.
Second, native CLIs integrate tools more efficiently. Claude CLI gives direct file manipulation, testing, and execution without abstraction layers. Wrappers like OpenCode and Cursor route agent actions through intermediate interfaces, introducing latency and communication overhead that slows decision cycles.
Third, native CLIs use model-specific prompt engineering. The Claude team optimizes its CLI prompts for how Opus behaves. The Codex CLI is tuned for GPT-5.2's completion patterns. Wrappers use generic prompts designed to work with multiple models, which means they optimize for none.
The cost multiplier
The cost difference amplifies the accuracy gap. Claude CLI costs 17.7 times less than OpenCode for Opus 4.6 ($2.93 vs $51.88 per fix). This isn't just pricing; it reflects infrastructure efficiency. Native CLIs bundle the model call with the CLI runtime, minimizing overhead. Wrappers add service fees, agent abstraction layers, and API mark-ups.
For an organization fixing 100 CVEs:
- Claude CLI: 61.6 fixes at $180 total cost ($2.93 × 61.6)
- OpenCode: 47.5 fixes at $2,464 total cost ($51.88 × 47.5)
Same model. The wrapper version costs 13.7 times more to achieve 14.1 percentage points lower accuracy.
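The 100-CVE arithmetic generalizes to any batch size. A quick sketch of the expected-fixes and total-cost calculation (the `campaign` helper is illustrative; figures are the article's Opus 4.6 numbers):

```python
def campaign(cves: int, pass_rate_pct: float, cost_per_fix: float):
    """Expected successful fixes and total spend for a batch of CVE attempts,
    assuming spend is billed per successful fix as in the article's examples."""
    fixes = cves * pass_rate_pct / 100   # expected number of passing fixes
    total = fixes * cost_per_fix         # total spend at the per-fix price
    return fixes, total

# Claude CLI vs OpenCode, both running Opus 4.6, over 100 CVEs.
print(campaign(100, 61.6, 2.93))    # Claude CLI: ~61.6 fixes, ~$180
print(campaign(100, 47.5, 51.88))   # OpenCode:  ~47.5 fixes, ~$2,464
```

The 13.7x total-cost ratio quoted above falls out directly: $2,464 / $180.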
Cursor's exception
Cursor with Opus 4.6 is the one wrapper that approaches native performance on accuracy (62.5% vs 61.6%). However, it costs $35.40 per fix compared to Claude CLI's $2.93. The accuracy is comparable; the efficiency is not.
For 100 CVEs:
- Claude CLI: 61.6 fixes at $180
- Cursor: 62.5 fixes at $2,212
Even when Cursor wins on accuracy, it loses on cost. If accuracy is your only metric, Cursor matches the native CLI. If cost matters at all, it doesn't.
Recommendations
Use the native CLI for your chosen model. If you deploy Claude exclusively, use Claude CLI. If you standardize on GPT-5.2, use Codex. If Gemini is your platform, use Gemini CLI.
If you need multi-model support and can't justify three separate native tools, OpenCode with Gemini 3.1 offers the best wrapper compromise: 54.9% pass rate at $5.81 per fix. It's 3.8 percentage points below the native Gemini CLI but cheaper than Cursor and ahead of OpenCode's other configurations.
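The recommendation reduces to a simple routing rule: prefer the native CLI for the model you run, and fall back to the best wrapper otherwise. A hypothetical sketch (the model keys and `pick_agent` function are illustrative, not part of the benchmark):

```python
# Hypothetical routing rule encoding the recommendation above.
# Tool names match the article; model keys are illustrative identifiers.
NATIVE_CLI = {
    "opus-4.6":       "Claude CLI",
    "gpt-5.2":        "Codex",
    "gemini-3.1-pro": "Gemini CLI",
}

def pick_agent(model: str) -> str:
    """Prefer the model's native CLI; otherwise fall back to OpenCode,
    the best wrapper compromise per the benchmark."""
    return NATIVE_CLI.get(model, "OpenCode")
```

A team standardizing on one model never hits the fallback branch; multi-model teams pay the wrapper penalty only where no native CLI exists.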
The native-vs-wrapper gap exists because native CLIs are built for one model while wrappers are built for all models. In this benchmark, specialization wins.
See Benchmark results | Economics analysis | Claude Opus 4.6 profile | Codex GPT-5.2 profile | Gemini 3.1 Pro profile
FAQ
How much does the CLI wrapper affect agent performance?
Native CLIs outperform wrapper CLIs by 10-16 percentage points on the same model. Claude CLI achieves 61.6% vs OpenCode's 47.5% with identical Opus 4.6. The cost difference is even larger: $2.93 vs $51.88 per fix.