OpenCode Claude Opus 4.6 — Vulnerability-Agent-Bench profile

47.5% pass rate at $51.88 per fix, the most expensive per fix in the benchmark. Opus 4.6 via OpenCode, evaluated on 128 vulnerabilities.

Lab: Anthropic
Pass Rate: 47.5%
Cost per Pass: $51.88
Total Evals: 136
CLI: opencode

Outcome Distribution

Pass (58)

Fail (15)

Build (49)

Infra (14)

Claude Opus 4.6, deployed via the OpenCode wrapper, achieves a 47.5% pass rate on the 128-vulnerability benchmark. Across 136 recorded evaluations, the agent produced 58 successful patches, 15 failures, 49 build errors, and 14 infrastructure timeouts. The pass rate is consistent with 58 passes out of the 122 evaluations that did not end in an infrastructure timeout. At $51.88 per successful patch, it is the most expensive agent in the benchmark.

This agent achieves the second-lowest actual failure rate (15 failures) among all benchmarked agents, but it also has the highest per-patch cost and a high build-failure count (49 of 136 evaluations).
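The headline percentages can be re-derived from the four outcome counts. The sketch below is a reader's reconstruction, not the benchmark's published scoring code. It assumes that the pass rate excludes infrastructure timeouts from the denominator while the failure and build-failure rates use all recorded outcomes, because those are the denominators that reproduce the 47.5%, 11.0%, and 36.0% figures quoted in this profile. The names `outcomes` and `pct` are illustrative.

```python
# Outcome counts as reported in this profile.
outcomes = {"pass": 58, "fail": 15, "build": 49, "infra": 14}


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


total = sum(outcomes.values())       # all recorded outcomes
scored = total - outcomes["infra"]   # assumption: infra timeouts are not scored

pass_rate = pct(outcomes["pass"], scored)    # share of scored runs that passed
fail_rate = pct(outcomes["fail"], total)     # share of all runs with a wrong patch
build_rate = pct(outcomes["build"], total)   # share of all runs that failed to build

print(f"total={total} scored={scored}")
print(f"pass={pass_rate}% fail={fail_rate}% build={build_rate}%")
```

If the benchmark uses different denominators, only the two lines that compute `total` and `scored` need to change.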

Behavioral profile

A Speed score of 100 indicates the agent generates patches with minimal conversation turns. An Efficiency score of 0 reflects the highest cost per patch in the benchmark. A Precision score of 100 means every patch contains substantive code changes, with no hallucinated or padded output.

The speed-focused approach may contribute to the high build-failure count: the agent moves fast but skips environment verification steps.

Model upgrade effect

Upgrading from Opus 4.5 (36.8%) to 4.6 (47.5%) via OpenCode yields a 10.7 percentage point improvement. Build failures drop minimally (50 to 49). Actual failures drop substantially (29 to 15).

The model upgrade strengthens analytical reasoning but does not fix the wrapper's build-failure issue. The OpenCode environment setup remains problematic regardless of model version.
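The contrast between the two drops is easier to see in relative terms. The following is a small illustrative calculation using only the Opus 4.5 and 4.6 figures quoted above; the dictionary layout and the `relative_drop` helper are assumptions made for this example.

```python
# Figures for the two OpenCode runs, as quoted in this section.
opus_45 = {"pass_rate": 36.8, "build": 50, "fail": 29}
opus_46 = {"pass_rate": 47.5, "build": 49, "fail": 15}


def relative_drop(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`, one decimal place."""
    return round(100 * (before - after) / before, 1)


pass_gain_pp = round(opus_46["pass_rate"] - opus_45["pass_rate"], 1)
build_drop = relative_drop(opus_45["build"], opus_46["build"])
fail_drop = relative_drop(opus_45["fail"], opus_46["fail"])

print(f"pass rate: +{pass_gain_pp} pp")
print(f"build failures: -{build_drop}%")
print(f"actual failures: -{fail_drop}%")
```

Actual failures fall by roughly half while build failures barely move, which is the pattern the section attributes to the wrapper rather than the model.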

Comparison with CLI deployment

The same Opus 4.6 model achieves dramatically different results by deployment method:

Claude CLI: 61.6% pass rate at $2.93 per patch
Cursor wrapper: 62.5% pass rate at $35.40 per patch
OpenCode wrapper: 47.5% pass rate at $51.88 per patch

The OpenCode wrapper loses 14.1 percentage points compared to CLI while costing 17.7 times more per patch. The CLI deployment is strictly superior to OpenCode on both accuracy and cost.
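The gap and the cost multiple can be checked directly from the three deployment figures. This is a minimal sketch, assuming "strictly superior" means better on one axis and no worse on the other; the `deployments` table and both helper functions are hypothetical names introduced for illustration.

```python
# (pass rate in %, cost per successful patch in USD) for Opus 4.6, from this profile.
deployments = {
    "Claude CLI": (61.6, 2.93),
    "Cursor": (62.5, 35.40),
    "OpenCode": (47.5, 51.88),
}


def compare(name: str, baseline: str = "OpenCode") -> tuple[float, float]:
    """Return (percentage-point advantage of `name`, cost multiple of `baseline`)."""
    rate, cost = deployments[name]
    base_rate, base_cost = deployments[baseline]
    return round(rate - base_rate, 1), round(base_cost / cost, 1)


def dominates(a: str, b: str) -> bool:
    """True if `a` is no worse than `b` on both axes and better on at least one."""
    (rate_a, cost_a), (rate_b, cost_b) = deployments[a], deployments[b]
    return rate_a >= rate_b and cost_a <= cost_b and (rate_a > rate_b or cost_a < cost_b)


for name in ("Claude CLI", "Cursor"):
    gap, multiple = compare(name)
    print(f"{name}: +{gap} pp, OpenCode costs {multiple}x as much per patch")
```

Under this definition both Claude CLI and Cursor dominate OpenCode, while neither dominates the other: Cursor has the slightly higher pass rate and Claude CLI the much lower cost.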

Second-lowest failure rate

Only 15 of 136 evaluations (11.0%) ended in an actual failure, the second-lowest failure rate in the benchmark. Of the 73 patches that built and reached a verdict, 58 passed (79.5%), so a patch that compiles is usually correct.

This agent either succeeds or hits build or infrastructure issues; it seldom produces an incorrect patch. The bottleneck is execution environment compatibility, not reasoning quality.

Precision and reliability trade-off

Precision is maxed at 100 (all patches are substantive). Reliability is 55 (moderate environmental issues). This combination indicates a capable model hampered by wrapper infrastructure.

Build failure analysis

49 build failures across 136 recorded evaluations is a 36.0% build-failure rate. The minimal improvement over Opus 4.5 (50 build failures) suggests the root cause is OpenCode environment setup rather than model reasoning.

Possible causes:

Missing dependencies in the container environment
Incomplete tool support for repository build systems
Time limits cutting patch generation short
Weaker error recovery from compilation failures

Cost analysis

At $51.88 per successful patch, this agent costs more than any alternative in the benchmark. For the cost of one successful OpenCode Opus 4.6 patch, Claude CLI Opus 4.6 (61.6% pass rate) delivers about 17.7 successful patches.

For large-scale operations, this cost is prohibitive unless multi-model flexibility is essential.
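To see what this means at scale, the sketch below projects the spend needed for a target number of successful fixes. It assumes the benchmark's average cost per pass and pass rate carry over unchanged to your own workload, which may not hold in practice; `budget_for` and `expected_attempts` are hypothetical helpers.

```python
import math


def budget_for(n_fixes: int, cost_per_pass: float) -> float:
    """Projected USD spend to obtain `n_fixes` successful patches."""
    return round(n_fixes * cost_per_pass, 2)


def expected_attempts(n_fixes: int, pass_rate_pct: float) -> int:
    """Expected number of attempts needed, rounded up, at a fixed pass rate."""
    return math.ceil(n_fixes / (pass_rate_pct / 100))


target = 100  # illustrative target, not a figure from the benchmark
for name, rate, cost in (("OpenCode", 47.5, 51.88), ("Claude CLI", 61.6, 2.93)):
    print(name, budget_for(target, cost), expected_attempts(target, rate))
```

For a target of 100 fixes this projects roughly $5,188 via OpenCode against roughly $293 via Claude CLI.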

When OpenCode Opus 4.6 is appropriate

Use this agent only if all of the following are true:

Multi-model support is a hard requirement
Build failures are acceptable or recoverable in your workflow
Per-patch cost is not a constraint
You are not already using Claude CLI or Cursor

For any single-model strategy, the CLI is superior.
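The checklist above is an all-or-nothing rule, which can be written down as a single predicate. This is only an illustrative encoding of the four conditions; the `Workflow` type and its field names are invented for this example and are not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    needs_multi_model: bool         # multi-model support is a hard requirement
    tolerates_build_failures: bool  # build failures are acceptable or recoverable
    cost_unconstrained: bool        # per-patch cost is not a constraint
    uses_cli_or_cursor: bool        # already using Claude CLI or Cursor


def opencode_opus_fits(w: Workflow) -> bool:
    """True only when every condition in the checklist holds."""
    return (
        w.needs_multi_model
        and w.tolerates_build_failures
        and w.cost_unconstrained
        and not w.uses_cli_or_cursor
    )


print(opencode_opus_fits(Workflow(True, True, True, False)))   # every condition met
print(opencode_opus_fits(Workflow(True, True, False, False)))  # cost is a constraint
```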

Upgrade paths

If you are using OpenCode Opus 4.6:

Switch to Claude CLI Opus 4.6 (61.6% vs 47.5%, 17.7x cheaper): recommended for single-model workloads
Switch to Cursor Opus 4.6 (62.5% accuracy, $35.40 per patch): if you need a different CLI
Stay with OpenCode and accept the cost: only if multi-model flexibility justifies the premium

Learn more

View the full benchmark results and agent leaderboard to see all 15 agents ranked by pass rate. Read the Anthropic lab profile for context on all Anthropic models. Compare wrapper impacts on cost and accuracy across Claude CLI, Cursor, and OpenCode.

FAQ

What is the cost/accuracy tradeoff for OpenCode Opus 4.6?

47.5% pass rate on 128 vulnerabilities at $51.88 per fix, the highest cost in the benchmark. Outcomes: 58 passes, 15 failures, 49 build failures, and 14 infrastructure timeouts.

What is the cost per successful patch for OpenCode Claude Opus 4.6?

$51.88 per successful patch, the highest cost in the entire benchmark. The same model costs $2.93 via Claude CLI (17.7x cheaper) and $35.40 via Cursor (1.5x cheaper). The OpenCode wrapper is the most expensive way to run any model.

How does OpenCode Claude Opus 4.6 compare to other agents?

The 47.5% pass rate is near the 47.3% benchmark average, but the cost is far above average. With only 15 actual failures across 136 recorded evaluations, it has the second-lowest failure rate. The bottleneck is the 49 build failures, not model reasoning. Claude CLI Opus 4.6, at 61.6% and $2.93 per patch, is strictly better for single-model deployments.

