Cursor Opus 4.6 — CVE-Agent-Bench profile
62.5% pass rate at $35.40 per fix. Anthropic Opus 4.6 via Cursor. High accuracy, highest cost.
Cursor Opus 4.6
Cursor Opus 4.6 is Claude Opus 4.6 accessed through Cursor's proprietary CLI wrapper. It represents the accuracy-first strategy: maximum vulnerability reasoning at the expense of token efficiency. Across 128 evaluations, it achieved a 62.5% pass rate at a cost of $35.40 per successful fix. The model passed 80 samples, failed 24 on semantics, hit 24 build failures, and encountered 0 infrastructure failures.
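The headline figures follow directly from the outcome counts. A minimal Python sketch of that arithmetic is below. The dictionary keys and variable names are illustrative, and the implied total spend rests on an assumption: that cost per fix is total spend divided by successful fixes, which the profile suggests but does not state outright.

```python
# Outcome counts reported for Cursor Opus 4.6 (key names are illustrative).
outcomes = {"pass": 80, "fail": 24, "build_failure": 24, "infra_failure": 0}

total_evaluations = sum(outcomes.values())
pass_rate = outcomes["pass"] / total_evaluations

# Assumption: cost per fix = total spend / number of successful fixes.
cost_per_fix = 35.40
implied_total_spend = round(outcomes["pass"] * cost_per_fix, 2)

print(f"evaluations: {total_evaluations}")
print(f"pass rate: {pass_rate:.1%}")
print(f"implied total spend: ${implied_total_spend:,.2f}")
```

Under that assumption, the spend on the 48 failed runs is already folded into the per-fix figure.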
Performance tier
At a 62.5% pass rate, Cursor Opus 4.6 ranks second-highest in the entire benchmark. Native Claude Opus 4.6 via the official Claude CLI achieves 61.6%. Because the underlying model is the same, the 0.9 percentage point difference has to come from the wrapper, plausibly Cursor's prompt engineering for IDE users, although a gap this small may also be noise.
Eighty passes out of 128 evaluations is the highest pass count in the benchmark. By that measure, this is the most reliable agent for actual fix generation. Of the 104 patches that built, 80 (about 77%) passed.
Cost reality
At $35.40 per fix, Cursor Opus 4.6 is the most expensive agent in the benchmark by a wide margin. Claude Opus 4.6 via the native CLI costs $2.93 per fix, so going through Cursor's wrapper costs roughly 12x as much.
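The 12x figure is a rounded ratio of the two per-fix costs. A short sketch of the calculation, using only the two prices quoted above (variable names are illustrative):

```python
# Per-fix costs quoted in this profile.
cursor_cost_per_fix = 35.40
native_cost_per_fix = 2.93

# Ratio and absolute premium of Cursor over the native Claude CLI.
cost_ratio = cursor_cost_per_fix / native_cost_per_fix
premium_per_fix = cursor_cost_per_fix - native_cost_per_fix

print(f"cost ratio: {cost_ratio:.1f}x")
print(f"premium per fix: ${premium_per_fix:.2f}")
```

The absolute premium is the more useful number for budgeting, since it is what each successful fix costs beyond the native route.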
This is not a mistake. The wrapper is Cursor's abstraction layer for IDE integration, prompt optimization, and context management. The cost premium reflects that infrastructure. For teams using Cursor as their primary development environment, this cost may be acceptable as part of IDE licensing. For teams evaluating models independently, it is prohibitively expensive.
Infrastructure reliability
Zero infrastructure failures across 128 evaluations. Cursor Opus 4.6 is the only agent in the benchmark to achieve zero infra failures. This matters. On this evidence, operational interruptions from test environment issues should be rare when you deploy this agent.
The 48 failures split evenly: 24 build failures and 24 semantic failures. The model never loses a patch to environmental chaos. Every failure is a genuine model limitation or vulnerability reasoning gap, not infrastructure noise.
Behavioral signature
Accuracy is 99, the highest in the benchmark. Reliability is 67, also the highest. Precision is 100. Speed is 64 (mid-range), and efficiency is 33 (low, reflecting high token usage). The profile reads as: this agent reasons thoroughly, produces reliable output, and costs whatever it takes.
This is the model you use when accuracy is non-negotiable. Security teams evaluating tools for production deployment should seriously consider Cursor Opus 4.6 if they can absorb the cost. The 62.5% pass rate and 0% infrastructure failure rate are compelling.
Comparison to native Claude Opus 4.6
The native Claude CLI achieves 61.6% pass rate at $2.93 per fix. Cursor adds 0.9pp accuracy (likely noise) at 12x cost. On a pure numbers basis, native Claude is the rational choice.
However, if you are already using Cursor for development and your CVE patching system is integrated into Cursor, the roughly $32 premium per fix might be worth the convenience of unified tooling. This is a business decision more than a technical one.
Build failure parity
Twenty-four build failures, 24 semantic fails. This clean 1:1 split is unusual. Most agents show either build-failure dominance (OpenCode variants) or semantic-failure dominance (Composer 1.5).
Cursor Opus 4.6's balance suggests the model reasons well (low semantic fails) but sometimes generates environment-incompatible code (build failures). This is expected for a model trained on broad internet code: it understands CVEs but doesn't always optimize for containerized test harnesses.
Token efficiency cost
Efficiency is 33, the lowest in the benchmark. Opus 4.6 is expensive per token (Anthropic's pricing) and Cursor's wrapper adds API overhead. The model uses tokens generously to reason through complexity.
If your budget is token-constrained, Cursor Opus 4.6 is not viable at scale. If your constraint is coverage instead (you need to fix as many CVEs as possible and cost is secondary), it is cost-justified.
Deployment scenario
Cursor Opus 4.6 makes sense in two scenarios:
- Small-scale, high-confidence vulnerability patching: You have a small number of critical CVEs (5-20) and need the highest confidence patch for each. Cost is acceptable because volume is low.
- IDE-integrated development: You use Cursor as your primary IDE, and CVE patching is a bolt-on feature. The $35 per fix is part of your Cursor licensing cost, not an isolated tool expense.
For high-volume evaluation (100+ CVEs) or standalone CVE patching systems, the cost is prohibitive.
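To see how the cost scales with volume, the sketch below estimates expected fixes and spend for a few batch sizes. It is a rough model, not benchmark output: the function name `expected_batch_cost` is hypothetical, and it assumes both that the benchmark pass rate carries over to your CVEs and that cost per fix already amortizes the spend on failed attempts.

```python
def expected_batch_cost(num_cves, pass_rate, cost_per_fix):
    """Return (expected fixes, expected spend) for a batch of CVEs.

    Assumption: cost_per_fix is total spend divided by successful fixes,
    so multiplying by expected fixes approximates total spend.
    """
    expected_fixes = num_cves * pass_rate
    return expected_fixes, expected_fixes * cost_per_fix


# Batch sizes taken from the scenarios above: 5-20 critical CVEs vs. 100+.
for num_cves in (5, 20, 100):
    fixes, spend = expected_batch_cost(num_cves, pass_rate=0.625, cost_per_fix=35.40)
    print(f"{num_cves:>3} CVEs: ~{fixes:.1f} fixes, ~${spend:,.2f}")
```

Swapping in the native Claude CLI figures (0.616 and 2.93) gives the comparison point for the same batch sizes.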
Accuracy vs. cost trade-off
Claude Opus 4.6 via native CLI offers 61.6% accuracy at $2.93. Cursor Opus 4.6 offers 62.5% at $35.40. The 0.9pp improvement costs 12x more. This is a poor trade-off on metrics alone.
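One way to quantify the trade-off is the marginal cost of each additional fix that Cursor delivers over the native CLI. The sketch below is an illustration under a stated assumption, not a benchmark result: it treats cost per fix as total spend divided by successful fixes, so expected spend per attempted CVE is pass rate times cost per fix. The helper `cost_per_attempt` is hypothetical.

```python
def cost_per_attempt(pass_rate, cost_per_fix):
    # Assumption: cost per fix already amortizes spend on failed attempts,
    # so expected spend per attempted CVE = pass rate * cost per fix.
    return pass_rate * cost_per_fix


cursor = {"pass_rate": 0.625, "cost_per_fix": 35.40}
native = {"pass_rate": 0.616, "cost_per_fix": 2.93}

# Extra spend and extra expected fixes per attempted CVE when using Cursor.
extra_cost_per_attempt = cost_per_attempt(**cursor) - cost_per_attempt(**native)
extra_fixes_per_attempt = cursor["pass_rate"] - native["pass_rate"]

marginal_cost_per_extra_fix = extra_cost_per_attempt / extra_fixes_per_attempt
print(f"marginal cost per additional fix: ${marginal_cost_per_extra_fix:,.2f}")
```

Because the pass-rate gap is so small, the result is very sensitive to it; if the 0.9pp difference is noise, the marginal cost is effectively unbounded.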
However, the 0% infrastructure failure rate is valuable. If your test environment is flaky and you are losing time to infra noise, Cursor's stability might justify the cost. For most teams, native Claude is the better choice.
Model capability unchanged
Opus 4.6 is Anthropic's model. Whether you access it via Claude CLI, Cursor, or OpenCode, the underlying reasoning engine is the same. Cursor's advantage is not the model but the wrapper's prompt engineering and infrastructure. That advantage is worth 0.9pp at most.
Summary
Cursor Opus 4.6 is the second-most accurate agent in CVE-Agent-Bench with perfect infrastructure reliability. The 62.5% pass rate at $35.40 per fix positions it as a premium option for teams that prioritize accuracy and reliability over cost. Zero infra failures and an accuracy score of 99 make it suitable for high-stakes vulnerability patching. For most teams, the cost is prohibitive. For small-scale critical CVE patching, it is justified.
Explore CVE-Agent-Bench results to see how Cursor Opus 4.6 compares across all agents. Review benchmark economics to decide if the premium cost fits your evaluation budget. Compare native Claude Opus 4.6 to understand the wrapper overhead. Read Cursor Composer 1.5 to see a lower-cost Cursor option.
FAQ
Is Cursor Opus 4.6 worth the cost?
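It depends on volume. Cursor Opus 4.6 delivers a 62.5% pass rate on 128 CVEs at $35.40 per fix, the highest cost per fix in the benchmark, with accuracy close to Codex GPT-5.2 (62.7%). The breakdown is 80 passes, 24 semantic failures, 24 build failures, and 0 infrastructure failures.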
What is the cost per successful patch for Cursor Opus 4.6?
$35.40 per successful patch, the highest in the benchmark. The same Opus 4.6 model costs $2.93 via native Claude CLI, a 12x difference. The premium buys zero infrastructure failures and Cursor IDE integration, not better accuracy.
How does Cursor Opus 4.6 compare to other agents?
62.5% pass rate is the second-highest in the benchmark (behind Codex GPT-5.2 at 62.7%). It is the only agent with zero infrastructure failures across all evaluations. Native Claude Opus 4.6 achieves 61.6% at $2.93, so the 0.9 percentage point accuracy gain costs 12x more per fix.