Cursor Opus 4.6 — Vulnerability-Agent-Bench profile

62.5% pass rate at $35.40 per fix. Anthropic Opus 4.6 via Cursor. High accuracy, high cost.

Provider: Anthropic
Agent: Cursor Opus 4.6
CLI: cursor
Pass Rate: 62.5%
Cost per Pass: $35.40
Total Evals: 128
Outcome Distribution: Pass (80), Fail (24), Build (24)

Cursor Opus 4.6 is Claude Opus 4.6 accessed through Cursor's proprietary CLI wrapper. It represents the accuracy-first strategy: maximum vulnerability reasoning at the cost of token efficiency. Across 128 evaluations, it achieved 62.5% pass rate with a cost of $35.40 per successful fix. The model passed 80 samples, failed 24, experienced 24 build failures, and encountered 0 infrastructure failures.
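The outcome split can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the counts quoted above; the dictionary keys are illustrative names, not identifiers from the benchmark.

```python
# Outcome counts as reported in this profile.
outcomes = {"pass": 80, "fail": 24, "build": 24, "infra": 0}

# Total evaluations is the sum of all outcome buckets.
total = sum(outcomes.values())

# Share of each outcome as a fraction of all evaluations.
shares = {name: count / total for name, count in outcomes.items()}

for name, count in outcomes.items():
    print(f"{name:>5}: {count:3d} of {total} ({shares[name]:.1%})")
```

The pass share reproduces the headline 62.5%, and the two failure buckets each account for the same fraction of runs.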

Performance tier

At 62.5% pass rate, Cursor Opus 4.6 ranks second-highest in the entire benchmark, behind Codex GPT-5.2 at 62.7%. Native Claude Opus 4.6 via the official Claude CLI achieves 61.6%. The 0.9 percentage point gain may come from Cursor's prompt engineering for its IDE users, but at 128 evaluations it is small enough to be noise.

Eighty passes out of 128 evaluations is the highest pass count in the benchmark. By raw count, this is the most productive agent for actual fix generation. Its patches are not guaranteed to work, though: 48 of the 128 attempts (37.5%) ended in a semantic or build failure.

Cost reality

At $35.40 per fix, Cursor Opus 4.6 is among the most expensive agents in the benchmark; the OpenCode Claude Opus 4.6 configuration is listed higher, at $51.88 per fix. Claude Opus 4.6 via native CLI costs $2.93 per fix, so Cursor's wrapper adds roughly 12x cost.

This is not a mistake. The wrapper is Cursor's abstraction layer for IDE integration, prompt optimization, and context management. The cost premium reflects that infrastructure. For teams using Cursor as their primary development environment, this cost may be acceptable as part of IDE licensing. For teams evaluating models independently, it is prohibitively expensive.
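To make the premium concrete, the sketch below derives the cost ratio and the absolute premium per fix from the two quoted prices. It also estimates total spend on this agent, under the assumption (not stated in the profile) that cost per pass means total spend divided by the number of passes. All variable names are illustrative.

```python
# Figures quoted in this profile (USD per successful fix).
cursor_cost_per_pass = 35.40
native_cost_per_pass = 2.93
cursor_passes = 80

# How many times more expensive a Cursor fix is than a native CLI fix.
cost_ratio = cursor_cost_per_pass / native_cost_per_pass

# Absolute premium paid per successful fix.
premium_per_fix = cursor_cost_per_pass - native_cost_per_pass

# Assumption: cost per pass = total spend / passes, so total = cost * passes.
implied_total_spend = cursor_cost_per_pass * cursor_passes

print(f"cost ratio: {cost_ratio:.1f}x")
print(f"premium per fix: ${premium_per_fix:.2f}")
print(f"implied total spend: ${implied_total_spend:,.2f}")
```

Under that assumption, the 12x figure corresponds to a premium of a little over $32 on every successful fix.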

Infrastructure reliability

Zero infrastructure failures across 128 evaluations. Cursor Opus 4.6 is the only agent in the benchmark to achieve zero infra failures. This matters: no run in this benchmark was lost to test environment issues, although 128 evaluations cannot guarantee the same result in your own deployment.

The failures split evenly: 24 build failures and 24 semantic fails. No patch was lost to environmental chaos. Every failure is a genuine model limitation or vulnerability reasoning gap, not infrastructure noise.

Behavioral signature

Accuracy is 99, the highest in the benchmark. Reliability is 67, also the highest. Precision is 100. Speed is 64 (mid-range), and efficiency is 33 (low, reflecting high token usage). The profile reads as: this agent reasons thoroughly, produces reliable output, and costs whatever it takes.

This is the model you use when accuracy is non-negotiable. Security teams evaluating tools for production deployment should seriously consider Cursor Opus 4.6 if they can absorb the cost. The 62.5% pass rate and 0% infra failure rate are compelling.

Comparison to native Claude Opus 4.6

The native Claude CLI achieves 61.6% pass rate at $2.93 per fix. Cursor adds 0.9pp accuracy (likely noise) at 12x cost. On a pure numbers basis, native Claude is the rational choice.
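Whether 0.9pp is noise can be sanity-checked with a two-proportion z-score. The sketch below rests on two assumptions that the profile does not confirm: both agents are scored over 128 evaluations, and the runs are treated as independent samples even though both agents saw the same vulnerabilities. It is a rough check, not the benchmark's own methodology.

```python
import math

n = 128                 # assumed number of evaluations for each agent
p_cursor = 0.625        # Cursor Opus 4.6 pass rate
p_native = 0.616        # native Claude Opus 4.6 pass rate

# Standard error of a single pass rate (binomial proportion).
se_cursor = math.sqrt(p_cursor * (1 - p_cursor) / n)

# Unpooled standard error of the difference between the two pass rates.
se_diff = math.sqrt(
    p_cursor * (1 - p_cursor) / n + p_native * (1 - p_native) / n
)

# z-score of the observed gap; |z| < 1.96 is not significant at the 5% level.
z = (p_cursor - p_native) / se_diff

print(f"standard error of Cursor pass rate: {se_cursor:.3f}")
print(f"z-score of the 0.9pp gap: {z:.2f}")
```

Under these assumptions the gap is well under one standard error of the difference, which is consistent with reading it as noise.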

However, if you are already using Cursor for development and your vulnerability patching system is integrated into Cursor, the additional $32.47 per fix might be worth the convenience of unified tooling. This is a business decision more than a technical one.

Build failure parity

Twenty-four build failures, 24 semantic fails. This clean 1:1 split is unusual. Most agents show either build-failure dominance (OpenCode variants) or semantic-failure dominance (Composer 1.5).

Cursor Opus 4.6's balance suggests the model reasons well (low semantic fails) but sometimes generates environment-incompatible code (build failures). This is expected for a model trained on broad internet code: it understands vulnerabilities but doesn't always optimize for containerized test harnesses.

Token efficiency cost

Efficiency is 33, the lowest in the benchmark. Opus 4.6 is expensive per token (Anthropic's pricing) and Cursor's wrapper adds API overhead. The model uses tokens generously to reason through complexity.

If your budget is token-constrained, Cursor Opus 4.6 is not viable at scale. If your constraint is coverage rather than spend (you need to fix as many vulnerabilities as possible), the cost can be justified.

Deployment scenario

Cursor Opus 4.6 makes sense in two scenarios:

Small-scale, high-confidence vulnerability patching: You have a small number of critical vulnerabilities (5-20) and need the highest confidence patch for each. Cost is acceptable because volume is low.
IDE-integrated development: You use Cursor as your primary IDE, and vulnerability patching is a bolt-on feature. The $35 per fix is part of your Cursor licensing cost, not an isolated tool expense.

For high-volume evaluation (100+ vulnerabilities) or standalone vulnerability patching systems, the cost is prohibitive.
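The volume argument can be put into numbers. This is a rough sketch, assuming the benchmark pass rate and cost per pass carry over to your own vulnerabilities and that cost per pass already amortizes the spend on failed attempts. The function `projected_spend` is a hypothetical helper, not part of any tool.

```python
def projected_spend(n_vulns, pass_rate, cost_per_pass):
    """Return (expected fixes, expected total spend in USD) for a batch."""
    expected_fixes = n_vulns * pass_rate
    return expected_fixes, expected_fixes * cost_per_pass

# Cursor Opus 4.6 figures from this profile.
PASS_RATE = 0.625
COST_PER_PASS = 35.40

for batch in (5, 20, 100):
    fixes, spend = projected_spend(batch, PASS_RATE, COST_PER_PASS)
    print(f"{batch:>3} vulnerabilities: ~{fixes:.1f} fixes, ~${spend:,.2f}")
```

A 20-vulnerability batch stays in the hundreds of dollars, while a 100-vulnerability batch runs past two thousand, which is where the cost stops being easy to absorb.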

Accuracy vs. cost trade-off

Claude Opus 4.6 via native CLI offers 61.6% accuracy at $2.93 per fix. Cursor Opus 4.6 offers 62.5% at $35.40 per fix. The 0.9pp improvement costs about 12x more per fix. This is a poor trade-off on metrics alone.
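One way to see how poor the trade-off is: compute what each additional fix costs when you switch from native Claude to Cursor. The sketch below assumes cost per pass is total spend divided by passes, so expected spend per attempt is pass rate times cost per pass. The helper name is hypothetical, and the result is only as reliable as the 0.9pp gap itself.

```python
def spend_per_attempt(pass_rate, cost_per_pass):
    """Expected USD spent per attempted vulnerability (assumed definition)."""
    return pass_rate * cost_per_pass

cursor_spend = spend_per_attempt(0.625, 35.40)
native_spend = spend_per_attempt(0.616, 2.93)

# Extra fixes gained per attempted vulnerability by choosing Cursor.
extra_fix_rate = 0.625 - 0.616

# Extra spend divided by extra fixes: the marginal cost of one more fix.
marginal_cost_per_extra_fix = (cursor_spend - native_spend) / extra_fix_rate

print(f"Cursor spend per attempt: ${cursor_spend:.2f}")
print(f"Native spend per attempt: ${native_spend:.2f}")
print(f"Marginal cost per additional fix: ${marginal_cost_per_extra_fix:,.0f}")
```

On these numbers, each extra fix gained from the wrapper costs thousands of dollars, far more than the $35.40 average suggests.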

However, the 0% infrastructure failure rate is valuable. If your test environment is flaky and you are losing time to infra noise, Cursor's stability might justify the cost. For most teams, native Claude is the better choice.

Model capability unchanged

Opus 4.6 is Anthropic's model. Whether you access it via Claude CLI, Cursor, or OpenCode, the underlying reasoning engine is the same. Cursor's advantage is not the model but the wrapper's prompt engineering and infrastructure. Measured against the native CLI, that advantage is worth 0.9pp at most.

Summary

Cursor Opus 4.6 is the second-most accurate agent in Vulnerability-Agent-Bench with perfect infrastructure reliability. The 62.5% pass rate at $35.40 per fix positions it as a premium option for teams that prioritize accuracy and reliability over cost. Zero infra failures and an accuracy score of 99 make it suitable for high-stakes vulnerability patching. For most teams, the cost is prohibitive. For small-scale critical vulnerability patching, it is justified.

Explore Vulnerability-Agent-Bench results to see how Cursor Opus 4.6 compares across all agents. Review benchmark economics to decide if the premium cost fits your evaluation budget. Compare native Claude Opus 4.6 to understand the wrapper overhead. Read Cursor Composer 1.5 to see a lower-cost Cursor option.

FAQ

Is Cursor Opus 4.6 worth the cost?

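It depends on volume. Cursor Opus 4.6 achieved a 62.5% pass rate on 128 vulnerabilities at $35.40 per fix: 80 passes, 24 fails, 24 build failures, 0 infra failures. That nearly matches Codex GPT-5.2's 62.7% accuracy, but at one of the highest costs per fix in the benchmark.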

What is the cost per successful patch for Cursor Opus 4.6?

$35.40 per successful patch, among the highest in the benchmark. The same Opus 4.6 model costs $2.93 via native Claude CLI, a 12x difference. The premium buys zero infrastructure failures and Cursor IDE integration, not better accuracy.

How does Cursor Opus 4.6 compare to other agents?

62.5% pass rate is the second-highest in the benchmark (behind Codex GPT-5.2 at 62.7%). It is the only agent with zero infrastructure failures across all evaluations. Native Claude Opus 4.6 achieves 61.6% at $2.93, so the 0.9 percentage point accuracy gain costs 12x more per fix.

Benchmark Results

62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.

Benchmark Methodology

How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.

Agent Configurations

15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.

Cost vs performance: where agents sit on the Pareto frontier

15 agents plotted on cost-accuracy. 4 on the Pareto frontier. Best value: claude-opus-4-6 at $2.93/pass, 61.6%.

Claude Opus 4.6 — Vulnerability-Agent-Bench profile

61.6% pass rate at $2.93 per fix. Anthropic model via Claude Code CLI. Second-highest accuracy overall.

OpenCode Claude Opus 4.6 — Vulnerability-Agent-Bench profile

47.5% pass rate at $51.88 per fix. Most expensive per fix. Opus 4.6 via OpenCode. 128 evaluations.

See which agents produce fixes that work

128 vulnerabilities. 15 agents. 1,920 evaluations. Agents learn from every run.