Claude Opus 4.6 — CVE-Agent-Bench profile
61.6% pass rate at $2.93 per fix. Anthropic model via Claude Code CLI. Second-highest accuracy overall.
Claude Opus 4.6, deployed via the Claude CLI, achieves a 61.6% pass rate across 136 CVE patch evaluations, the second-highest pass rate in the full benchmark. The agent produced 77 successful patches, 28 failures, 20 build errors, and 11 infrastructure timeouts; the pass rate is computed over the 125 evaluations that completed (77/125 = 61.6%), with infrastructure timeouts excluded. At $2.93 per successful patch, it is also the second-cheapest effective agent for generating working CVE patches.
Only one agent in the benchmark (GPT-5.2) exceeds this pass rate, and the $2.93 per-patch cost makes Opus 4.6 one of two agents occupying the Pareto frontier: the highest accuracy without a proportional cost sacrifice.
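The headline numbers can be reproduced from the raw outcome counts. A minimal sketch in Python, assuming (as the 77/125 ratio suggests) that infrastructure timeouts are excluded from the pass-rate denominator:

```python
# Outcome counts for Claude Opus 4.6 via the Claude CLI, from this profile.
passes = 77
failures = 28
build_errors = 20
infra_timeouts = 11

total = passes + failures + build_errors + infra_timeouts  # 136 evaluations
scored = total - infra_timeouts                            # timeouts excluded

pass_rate = passes / scored
print(f"{pass_rate:.1%}")  # → 61.6%
```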
Behavioral profile
This profile reveals an agent that explores thoroughly before committing. Its speed score is 0 because it uses the most conversation turns of any agent, while its breadth score is 100 because it invokes the full range of available tools. The agent does not rush toward a patch: it gathers context, considers multiple approaches, and delivers a well-reasoned solution.
Model upgrade impact
Opus 4.6 improves on Opus 4.5 (same CLI, same wrapper) by 15.9 percentage points (45.7% to 61.6%). Build failures drop from 26 to 20. Actual failures decrease from 43 to 28. This single model upgrade is the largest performance jump observed between sequential Anthropic releases in the benchmark.
The accuracy dimension jumps from 34 to 96, indicating the model upgrade strengthens analytical reasoning. Efficiency stays near-maximum, and precision remains at 100. The upgrade preserves cost-efficiency while transforming pass rate.
Wrapper comparison
The same Opus 4.6 model achieves dramatically different results depending on deployment wrapper:
- Claude CLI: 61.6% pass rate at $2.93 per patch
- Cursor wrapper: 62.5% pass rate at $35.40 per patch
- OpenCode wrapper: 47.5% pass rate at $51.88 per patch
The Claude CLI has the lowest cost and nearly the best accuracy of the three. The Cursor wrapper adds only 0.9 percentage points of accuracy while costing 12x more per patch, and the OpenCode wrapper loses 14.1 percentage points relative to the CLI while costing 17.7x more.
For Opus 4.6, the CLI deployment is the clear choice.
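The cost multiples and accuracy deltas quoted above fall straight out of the per-patch figures; a quick check in Python:

```python
# Per-patch cost (USD) and pass rate (%) for each Opus 4.6 deployment,
# taken from this profile's wrapper comparison.
deployments = {
    "claude-cli": (2.93, 61.6),
    "cursor":     (35.40, 62.5),
    "opencode":   (51.88, 47.5),
}

cli_cost, cli_acc = deployments["claude-cli"]
for name, (cost, acc) in deployments.items():
    # Cost relative to the CLI, and accuracy delta in percentage points.
    print(f"{name}: {cost / cli_cost:.1f}x CLI cost, {acc - cli_acc:+.1f}pp")
```

This prints 12.1x / +0.9pp for Cursor and 17.7x / -14.1pp for OpenCode, matching the prose above.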
Pareto frontier positioning
An agent sits on the Pareto frontier when you cannot improve accuracy without increasing cost, or reduce cost without sacrificing accuracy. Opus 4.6 sits on this frontier alongside GPT-5.2.
This positioning makes it the standard recommendation. If you choose any other agent, you are accepting either higher cost for lower accuracy, or lower accuracy for lower cost. Neither trade-off is ideal for most workloads.
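The dominance test behind the frontier is mechanical: a configuration is dominated when some other configuration is at least as cheap and at least as accurate, and strictly better on one axis. A sketch in Python using the three Opus 4.6 deployments from this benchmark (the function itself is illustrative, not part of the benchmark tooling):

```python
def pareto_frontier(points):
    """Return the names of (cost, accuracy) points not dominated by any other.

    A point is dominated when another point has cost <= its cost and
    accuracy >= its accuracy, with at least one strict inequality.
    """
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for n, c, a in points
            if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Opus 4.6 deployments from this profile: (name, $/patch, pass rate %).
configs = [
    ("claude-cli", 2.93, 61.6),
    ("cursor", 35.40, 62.5),
    ("opencode", 51.88, 47.5),
]
print(pareto_frontier(configs))  # → ['claude-cli', 'cursor']
```

Within this three-point set, the CLI stays on the frontier because nothing is both cheaper and more accurate, Cursor stays because its accuracy is marginally higher (at 12x the cost), and OpenCode is dominated outright.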
Breadth and tool usage
The breadth score of 100 means this agent invokes every available tool across the benchmark. Rather than specializing in a narrow approach, it reads files, runs commands, edits code, checks dependencies, and explores the repository structure methodically.
This breadth correlates with the high accuracy score. The agent gathers sufficient information to make informed patch decisions rather than guessing from limited context.
Reliability dimension
A reliability score of 63 (high but not maximal) reflects occasional infrastructure issues and environmental timeouts. With 11 infrastructure timeouts across 136 evaluations, the agent hits timeout constraints that keep it from a perfect reliability score. This is a system limitation, not a model limitation.
When Opus 4.6 is appropriate
Choose Claude Opus 4.6 via CLI as your default agent. The 61.6% pass rate handles the majority of CVE types. The $2.93 per-patch cost is reasonable at scale. The Pareto frontier positioning means you are not overpaying for accuracy.
Upgrade to this model if you are currently using Opus 4.5. The 15.9 percentage point improvement justifies the 11% cost increase.
Learn more
View the complete benchmark results and agent rankings to see how Opus 4.6 compares with 15 agents. Read the Anthropic lab profile for context on Anthropic's full model portfolio. Compare economic impact across all agents to understand cost-per-patch across the suite.
FAQ
How does Claude Opus 4.6 compare to Opus 4.5?
Claude Opus 4.6 achieves a 61.6% pass rate, up 15.9 percentage points from Opus 4.5's 45.7%. Across 136 CVEs it recorded 77 passes, 28 failures, 20 build failures, and 11 infrastructure failures.
What is the cost per successful patch for Claude Opus 4.6?
$2.93 per successful patch across 136 evaluations, the second-cheapest effective agent in the benchmark. At a 61.6% pass rate, fixing 100 CVEs requires roughly 162 evaluations and costs about $293, since the per-fix figure already amortizes failed attempts into successful patches.
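The scale-up arithmetic in that answer is a simple expected-attempts calculation. A sketch, assuming independent attempts at the published 61.6% pass rate:

```python
pass_rate = 0.616       # published pass rate for Opus 4.6 via the Claude CLI
cost_per_fix = 2.93     # dollars per *successful* patch

target_fixes = 100
expected_evaluations = target_fixes / pass_rate  # ≈ 162.3 attempts
# The per-fix figure already folds failed attempts into each success,
# so total cost is simply fixes × cost-per-fix.
expected_cost = target_fixes * cost_per_fix

print(round(expected_evaluations), round(expected_cost))  # → 162 293
```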
How does Claude Opus 4.6 compare to other agents?
61.6% pass rate is the second-highest in the benchmark, behind only Codex GPT-5.2 at 62.7%. Opus 4.6 sits on the Pareto frontier: no other agent offers better accuracy at lower cost. The benchmark average is 47.3%, so Opus 4.6 exceeds it by 14.3 percentage points.