Claude Opus 4.5 — CVE-Agent-Bench profile
45.7% pass rate at $2.64 per fix. Anthropic model via Claude Code CLI. 136 real CVEs evaluated.
Claude Opus 4.5, deployed via the Claude CLI, achieves a 45.7% pass rate across 136 CVE patch evaluations. The agent produced 58 successful patches, 43 failures, 26 build errors, and 9 infrastructure timeouts. At $2.64 per successful patch, it is one of the most cost-effective agents in the benchmark.
This cost-per-patch figure places Claude Opus 4.5 in the lowest quartile of the benchmark's expense range. The agent generates working patches at volume without a proportional increase in spend.
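The headline pass rate follows from the raw outcome counts. Note that 58/136 is 42.6%, while 58/127 is 45.7%, which suggests the reported rate is computed over runs that completed without infrastructure failure; the sketch below assumes that reading:

```python
# Outcome counts reported for Claude Opus 4.5 on CVE-Agent-Bench.
passes, fails, build_errors, infra_timeouts = 58, 43, 26, 9

total = passes + fails + build_errors + infra_timeouts  # 136 evaluations
completed = total - infra_timeouts                      # 127 runs the agent could affect

pass_rate = passes / completed
print(f"pass rate: {pass_rate:.1%}")  # 45.7%, matching the reported figure
```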
Behavioral profile
The pattern shows an agent that prioritizes efficiency and precision over raw speed. It takes more conversation turns to reach decisions but produces patches that rarely contain false content. When Opus 4.5 generates a patch, the patch is substantive and intentional, never empty or padded.
Comparison with Opus 4.6
The newer Opus 4.6 model, running through the same Claude CLI, achieves a 61.6% pass rate, a 15.9 percentage point improvement. Cost rises modestly from $2.64 to $2.93 per patch, and build failures drop from 26 to 20.
The accuracy dimension jumps from 34 to 96, indicating the model upgrade primarily strengthens analytical ability. Speed remains low, and precision stays maxed out. The 4.6 upgrade preserves cost-efficiency while dramatically improving pass rate.
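The upgrade trade-off reduces to simple deltas over the two reported figures:

```python
# Reported figures for the two Anthropic configurations (same Claude CLI).
opus_4_5 = {"pass_rate": 45.7, "cost_per_pass": 2.64}
opus_4_6 = {"pass_rate": 61.6, "cost_per_pass": 2.93}

delta_pp = opus_4_6["pass_rate"] - opus_4_5["pass_rate"]
delta_cost = opus_4_6["cost_per_pass"] - opus_4_5["cost_per_pass"]
print(f"+{delta_pp:.1f} pp for +${delta_cost:.2f} per successful patch")
# +15.9 pp for +$0.29 per successful patch
```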
CLI wrapper effect
The Claude CLI wrapper produces vastly different results from alternative deployment methods for the same Opus 4.5 model. When the same model runs through the OpenCode wrapper, it achieves only a 36.8% pass rate at $40.13 per patch: 8.9 percentage points lower on pass rate at 15.2 times the cost.
This gap shows how much the wrapper matters. The Claude CLI has optimized tool calling, environment setup, and error handling that the OpenCode wrapper does not replicate. Same model, different infrastructure, dramatically different outcomes.
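The wrapper gap quoted above can be reproduced directly from the two reported data points:

```python
# Same Opus 4.5 model through two different CLI wrappers.
claude_cli = {"pass_rate": 45.7, "cost_per_pass": 2.64}
opencode   = {"pass_rate": 36.8, "cost_per_pass": 40.13}

gap_pp = claude_cli["pass_rate"] - opencode["pass_rate"]
cost_multiple = opencode["cost_per_pass"] / claude_cli["cost_per_pass"]
print(f"{gap_pp:.1f} pp gap, {cost_multiple:.1f}x cost")  # 8.9 pp gap, 15.2x cost
```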
Build failure analysis
26 build failures out of 136 evaluations amount to a 19% build-failure rate. These failures occur when the agent generates syntactically correct patches that the repository's build system rejects. Root causes include environment setup issues, missing dependencies, or repository-specific build constraints.
The build failures are not agent hallucinations. They are legitimate environmental mismatches. In an integration scenario where you control the build environment, these failures might be preventable.
Precision dimension at maximum
The 100 precision score means every successful patch output contains actual code changes. The agent does not pad with explanations or generic fixes. This precision is valuable in automated patch selection. You can trust the patch output to contain meaningful changes.
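A patch-selection pipeline can enforce this property cheaply with a screen for non-empty diff hunks. A minimal sketch, assuming patches arrive as unified diffs; the function and heuristic here are illustrative, not part of the benchmark harness:

```python
def has_substantive_changes(patch_text: str) -> bool:
    """Return True if a unified diff contains at least one added or removed
    line that is not pure whitespace (a simple screen, not full diff parsing)."""
    changed = [
        line for line in patch_text.splitlines()
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---"))  # skip the file-header lines
    ]
    return any(line[1:].strip() for line in changed)

empty_patch = ""
real_patch = """--- a/src/parse.c
+++ b/src/parse.c
@@ -10,7 +10,7 @@
-    char buf[8];
+    char buf[64];   /* widen buffer to avoid overflow */
"""
print(has_substantive_changes(empty_patch))  # False
print(has_substantive_changes(real_patch))   # True
```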
When Opus 4.5 is appropriate
Choose Claude Opus 4.5 via CLI when cost per successful patch is a primary constraint. At $2.64 per patch, it enables high-volume evaluations. The 45.7% pass rate is competitive for many CVE types, particularly memory-safety and logic issues, where an accuracy score of 34 provides sufficient problem-solving ability.
If your workload requires higher accuracy or faster turn times, Opus 4.6 offers better performance at modest cost increase.
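For capacity planning, the two reported figures combine into a rough budget model. A sketch, with the workload size a hypothetical input:

```python
# Rough budget estimate for a high-volume run, using the reported figures.
pass_rate = 0.457      # Claude Opus 4.5 via Claude CLI
cost_per_fix = 2.64    # dollars per successful patch

n_cves = 1000          # hypothetical workload size
expected_fixes = n_cves * pass_rate
expected_cost = expected_fixes * cost_per_fix
print(f"~{expected_fixes:.0f} fixes for ~${expected_cost:,.0f}")
# ~457 fixes for ~$1,206
```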
Learn more
Explore the full benchmark results and leaderboard to see how Opus 4.5 ranks among 15 evaluated agents. Read the Anthropic lab profile for context on all Anthropic models in the benchmark. Compare economics across all agents and cost models.
FAQ
How does Claude Opus 4.5 perform on real CVEs?
45.7% pass rate across 136 CVEs at $2.64 per fixed vulnerability. 58 passes, 43 fails, 26 build failures, 9 infrastructure failures.
What is the cost per successful patch for Claude Opus 4.5?
$2.64 per successful patch, the lowest cost in the benchmark. Across 136 evaluations, Opus 4.5 generated 58 working fixes. The low cost makes it suitable for high-volume CVE evaluation where budget matters more than peak accuracy.
How does Claude Opus 4.5 compare to other agents?
At 45.7% pass rate, Opus 4.5 sits slightly below the benchmark average of 47.3%. It ranks mid-tier on accuracy but first on cost efficiency. Upgrading to Opus 4.6 adds 15.9 percentage points at only $0.29 more per fix.
Anthropic security research and patch equivalence validation
Claude Code 500+ zero-days, CyberGym 28.9% SOTA at $2/vuln, BaxBench 62% insecure patches, 1,992 independent evaluations.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Native CLIs vs wrapper CLIs: the 10-16pp performance gap
Claude CLI vs OpenCode, Gemini CLI vs OpenCode, Codex vs Cursor. Same models, different wrappers, consistent accuracy gaps of 10-16 percentage points.
Cost vs performance: where agents sit on the Pareto frontier
15 agents plotted on cost-accuracy. 4 on the Pareto frontier. Best value: claude-opus-4-6 at $2.93/pass, 61.6%.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.