Claude Opus 4.6 — CVE-Agent-Bench profile
61.6% pass rate at $2.93 per fix. Anthropic model via Claude Code CLI. Second-highest accuracy overall.
Claude Opus 4.6, deployed via the Claude CLI, achieves a 61.6% pass rate across 136 CVE patch evaluations, the second-highest pass rate in the full benchmark. The agent produced 77 successful patches, 28 failures, 20 build errors, and 11 infrastructure timeouts; the pass rate is computed over the 125 evaluations that completed (77/125 = 61.6%), with infrastructure timeouts excluded. At $2.93 per successful patch, it is also the second-cheapest effective agent for generating working CVE patches.
Only one agent in the benchmark (GPT-5.2) exceeds this pass rate, and the $2.93 per-patch cost makes Opus 4.6 one of two agents occupying the Pareto frontier: the highest accuracy without a proportional cost sacrifice.
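The headline numbers can be reproduced from the raw outcome counts. A minimal sketch in Python, assuming (as the 77/125 ratio suggests) that infrastructure timeouts are excluded from the pass-rate denominator:

```python
# Outcome counts for Claude Opus 4.6 via the Claude CLI, from this profile.
passes = 77
failures = 28
build_errors = 20
infra_timeouts = 11

total = passes + failures + build_errors + infra_timeouts  # 136 evaluations
scored = total - infra_timeouts                            # timeouts excluded

pass_rate = passes / scored
print(f"{pass_rate:.1%}")  # → 61.6%
```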
Behavioral profile
This profile reveals an agent that explores thoroughly before committing. Its speed score is 0 because it uses the most conversation turns of any agent, while its breadth score is 100 because it invokes the full range of available tools. The agent does not rush toward a patch: it gathers context, considers multiple approaches, and delivers a well-reasoned solution.
Model upgrade impact
Opus 4.6 improves on Opus 4.5 (same CLI, same wrapper) by 15.9 percentage points (45.7% to 61.6%). Build failures drop from 26 to 20. Actual failures decrease from 43 to 28. This single model upgrade is the largest performance jump observed between sequential Anthropic releases in the benchmark.
The accuracy dimension jumps from 34 to 96, indicating the model upgrade strengthens analytical reasoning. Efficiency stays near-maximum, and precision remains at 100. The upgrade preserves cost-efficiency while transforming pass rate.
Wrapper comparison
The same Opus 4.6 model achieves dramatically different results depending on deployment wrapper:
- Claude CLI: 61.6% pass rate at $2.93 per patch
- Cursor wrapper: 62.5% pass rate at $35.40 per patch
- OpenCode wrapper: 47.5% pass rate at $51.88 per patch
The Claude CLI has the lowest cost and nearly the best accuracy of the three. The Cursor wrapper adds only 0.9 percentage points of accuracy while costing 12x more per patch, and the OpenCode wrapper loses 14.1 percentage points relative to the CLI while costing 17.7x more.
For Opus 4.6, the CLI deployment is the clear choice.
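The cost multiples and accuracy deltas quoted above fall straight out of the per-patch figures; a quick check in Python:

```python
# Per-patch cost (USD) and pass rate (%) for each Opus 4.6 deployment,
# taken from this profile's wrapper comparison.
deployments = {
    "claude-cli": (2.93, 61.6),
    "cursor":     (35.40, 62.5),
    "opencode":   (51.88, 47.5),
}

cli_cost, cli_acc = deployments["claude-cli"]
for name, (cost, acc) in deployments.items():
    # Cost relative to the CLI, and accuracy delta in percentage points.
    print(f"{name}: {cost / cli_cost:.1f}x CLI cost, {acc - cli_acc:+.1f}pp")
```

This prints 12.1x / +0.9pp for Cursor and 17.7x / -14.1pp for OpenCode, matching the prose above.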
Pareto frontier positioning
An agent sits on the Pareto frontier when you cannot improve accuracy without increasing cost, or reduce cost without sacrificing accuracy. Opus 4.6 sits on this frontier alongside GPT-5.2.
This positioning makes it the standard recommendation. If you choose any other agent, you are accepting either higher cost for lower accuracy, or lower accuracy for lower cost. Neither trade-off is ideal for most workloads.
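The dominance test behind the frontier is mechanical: a configuration is dominated when some other configuration is at least as cheap and at least as accurate, and strictly better on one axis. A sketch in Python using the three Opus 4.6 deployments from this benchmark (the function itself is illustrative, not part of the benchmark tooling):

```python
def pareto_frontier(points):
    """Return the names of (cost, accuracy) points not dominated by any other.

    A point is dominated when another point has cost <= its cost and
    accuracy >= its accuracy, with at least one strict inequality.
    """
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for n, c, a in points
            if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Opus 4.6 deployments from this profile: (name, $/patch, pass rate %).
configs = [
    ("claude-cli", 2.93, 61.6),
    ("cursor", 35.40, 62.5),
    ("opencode", 51.88, 47.5),
]
print(pareto_frontier(configs))  # → ['claude-cli', 'cursor']
```

Within this three-point set, the CLI stays on the frontier because nothing is both cheaper and more accurate, Cursor stays because its accuracy is marginally higher (at 12x the cost), and OpenCode is dominated outright.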
Breadth and tool usage
The breadth score of 100 means this agent invokes every available tool across the benchmark. Rather than specializing in a narrow approach, it reads files, runs commands, edits code, checks dependencies, and explores the repository structure methodically.
This breadth correlates with the high accuracy score. The agent gathers sufficient information to make informed patch decisions rather than guessing from limited context.
Reliability dimension
A reliability score of 63 (high but not maximal) reflects occasional infrastructure issues and environmental timeouts. With 11 infrastructure timeouts across 136 evaluations, the agent hits timeout constraints that keep it from a perfect reliability score. This is a system limitation, not a model limitation.
When Opus 4.6 is appropriate
Choose Claude Opus 4.6 via CLI as your default agent. The 61.6% pass rate handles the majority of CVE types. The $2.93 per-patch cost is reasonable at scale. The Pareto frontier positioning means you are not overpaying for accuracy.
Upgrade to this model if you are currently using Opus 4.5. The 15.9 percentage point improvement justifies the 11% cost increase.
Learn more
View the complete benchmark results and agent rankings to see how Opus 4.6 compares with 15 agents. Read the Anthropic lab profile for context on Anthropic's full model portfolio. Compare economic impact across all agents to understand cost-per-patch across the suite.
FAQ
How does Claude Opus 4.6 compare to Opus 4.5?
Claude Opus 4.6 achieves a 61.6% pass rate, up 15.9 percentage points from Opus 4.5's 45.7%. Across 136 CVEs it recorded 77 passes, 28 failures, 20 build failures, and 11 infrastructure failures.
What is the cost per successful patch for Claude Opus 4.6?
$2.93 per successful patch across 136 evaluations, the second-cheapest effective agent in the benchmark. At a 61.6% pass rate, fixing 100 CVEs requires roughly 162 evaluations and costs about $293, since the per-fix figure already amortizes failed attempts into successful patches.
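The scale-up arithmetic in that answer is a simple expected-attempts calculation. A sketch, assuming independent attempts at the published 61.6% pass rate:

```python
pass_rate = 0.616       # published pass rate for Opus 4.6 via the Claude CLI
cost_per_fix = 2.93     # dollars per *successful* patch

target_fixes = 100
expected_evaluations = target_fixes / pass_rate  # ≈ 162.3 attempts
# The per-fix figure already folds failed attempts into each success,
# so total cost is simply fixes × cost-per-fix.
expected_cost = target_fixes * cost_per_fix

print(round(expected_evaluations), round(expected_cost))  # → 162 293
```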
How does Claude Opus 4.6 compare to other agents?
61.6% pass rate is the second-highest in the benchmark, behind only Codex GPT-5.2 at 62.7%. Opus 4.6 sits on the Pareto frontier: no other agent offers better accuracy at lower cost. The benchmark average is 47.3%, so Opus 4.6 exceeds it by 14.3 percentage points.