Cursor GPT-5.2 — CVE-Agent-Bench profile
51.6% pass rate at $6.26 per fix. OpenAI GPT-5.2 via Cursor IDE. 128 evaluations.
GPT-5.2 running through the Cursor IDE achieves a 51.6% pass rate at $6.26 per successful patch. Across 128 evaluations, it recorded 63 passes, 34 actual failures, 25 build failures, and 6 infrastructure failures. The Cursor environment adds codebase-aware context that the Codex CLI wrapper and OpenCode lack, resulting in better build outcomes but a lower overall pass rate than the native Codex CLI.
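The headline 51.6% can be reproduced from the raw counts above, assuming (as the arithmetic suggests) that infrastructure failures are excluded from the denominator. A minimal sketch:

```python
# Hedged sketch: recompute the quoted pass rate from the raw outcome counts,
# assuming infrastructure failures are dropped from the denominator.
passes, actual_fails, build_fails, infra_fails = 63, 34, 25, 6

total = passes + actual_fails + build_fails + infra_fails  # 128 evaluations
pass_rate = passes / (total - infra_fails)                 # 63 / 122

print(f"{pass_rate:.1%}")  # 51.6%
```

Note that 63 / 128 would give 49.2%; only the infra-excluded denominator matches the quoted figure.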
Performance overview
51.6% pass rate places Cursor GPT-5.2 in the middle tier of the benchmark. The model produces fewer actual failures than weaker configurations (34 vs 40-50) but far more than the native Codex CLI's 12. The 128 evaluations (8 fewer than some agents) suggest slightly faster or more reliable test execution in the Cursor environment, though the smaller sample makes direct comparison tricky.
At $6.26 per pass, the cost is higher than the native Codex CLI ($5.30) but roughly equal to the OpenCode wrapper ($6.65). For 100 successful CVE fixes, expect roughly 193 evaluations (100 ÷ 0.516) and about $1,213 in total at the average per-evaluation cost. This is a mid-range investment for mid-range results.
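The $1,213 projection can be checked with a back-of-envelope calculation, assuming the quoted $6.26 is treated as an average per-evaluation cost (the interpretation under which the article's numbers line up):

```python
# Hedged back-of-envelope: reproduce the ~$1,213 projection for 100 fixes.
# Assumption (not stated outright in the text): $6.26 is the average cost
# per evaluation, and failed evaluations still cost money.
pass_rate = 0.516
per_eval_cost = 6.26

evals_needed = 100 / pass_rate           # ~193.8 evaluations for 100 passes
total_cost = evals_needed * per_eval_cost

print(f"{evals_needed:.0f} evals, ${total_cost:,.0f}")
```

If $6.26 were instead a fully loaded cost per successful patch, 100 fixes would simply cost $626, so the projection only makes sense under the per-evaluation reading.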
Behavioral profile
Cursor GPT-5.2 exhibits a balanced personality across dimensions. Accuracy 57 is moderate: not as high as the Codex CLI's 100, but better than lower-scoring variants. Speed 83 indicates steady, deliberate work rather than rushing or extreme caution. Efficiency 93 is high, suggesting the agent uses resources well despite taking slightly longer.
Precision 100 means every patch that successfully builds is correct. When something compiles in Cursor, it fixes the bug. The gap between 63 passes and precision 100 comes from patches that fail to build and never reach the correctness check. Breadth 0 confirms Cursor relies on a single tool or approach, not experimental multi-method attempts.
Cross-wrapper comparison
The same GPT-5.2 model performs differently across three wrapper environments:
- Codex CLI (native): 62.7% pass rate
- Cursor IDE: 51.6% pass rate
- OpenCode: 51.6% pass rate
Cursor and OpenCode extract identical pass rates from GPT-5.2, both 11.1 percentage points below the native CLI. However, their failure modes differ. Cursor has 25 build failures and 34 actual fails. OpenCode has 48 build failures and only 11 actual fails. The same pass rate masks opposite weaknesses.
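One way to see the "same pass rate, opposite weaknesses" point is to total each wrapper's non-passing outcomes, using only the counts quoted above ("semantic" here is shorthand for the article's "actual failures"):

```python
# Hedged sketch: two wrappers with identical 51.6% pass rates fail in
# opposite ways. Counts are those quoted in this section.
wrappers = {
    "Cursor":   {"build": 25, "semantic": 34},
    "OpenCode": {"build": 48, "semantic": 11},
}

for name, f in wrappers.items():
    total = f["build"] + f["semantic"]
    print(f"{name}: {total} non-passes "
          f"({f['build']} build, {f['semantic']} semantic)")
```

Both wrappers land on 59 non-passing evaluations, which is exactly what identical pass rates require; only the split between build and semantic failures differs.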
Build failure analysis
Only 25 build failures out of 128 evals (19.5%) is the second-lowest build failure rate among GPT-5.2 configurations. The Codex CLI has 35 (25.7%), and OpenCode jumps to 48 (35.3%). Why does Cursor perform better at builds? The IDE provides codebase-aware context: file trees, import structures, and existing dependencies that help the model generate patches compatible with the actual project structure.
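The quoted percentages imply different evaluation counts per configuration. A quick check, assuming (inferred from the percentages, not stated in this section) that the Codex CLI and OpenCode runs used 136 evaluations:

```python
# Hedged check of the build-failure rates quoted above. The 136-eval
# denominators for Codex CLI and OpenCode are inferred from the quoted
# percentages; only Cursor's 128 is stated directly in this profile.
configs = [
    ("Cursor",    25, 128),
    ("Codex CLI", 35, 136),
    ("OpenCode",  48, 136),
]

for name, build_fails, evals in configs:
    print(f"{name}: {build_fails}/{evals} = {build_fails / evals:.1%}")
```

The 136 figure is also consistent with the earlier remark that Cursor ran "8 fewer" evaluations than some agents.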
The trade-off is visible: Cursor builds patches better but produces more incorrect ones (34 fails vs 12 for Codex CLI). The codebase context helps with syntax and structure but may limit the model's ability to reason about the actual vulnerability fix.
IDE integration advantage
Cursor's codebase-aware context is its defining feature. Rather than starting from patch templates or generic vulnerability patterns, the model sees the actual code structure, existing imports, and build configuration. This context should theoretically improve accuracy, but the benchmark shows it improves build success more than patch correctness.
This gap suggests the IDE context helps with implementation details (where to place the patch, what imports to use) but gives less insight into whether the patch actually fixes the underlying vulnerability. The model can generate syntactically correct patches that compile, but fewer of them achieve the functional fix.
Infrastructure reliability
6 infrastructure failures out of 128 (4.7%) is good reliability. Cursor's test environment is stable and rarely fails due to system issues. Most non-passing outcomes are true model failures or build incompatibilities, not environmental noise.
Recommendation
Cursor GPT-5.2 is a sensible middle option if you're already using Cursor for development. The IDE integration delivers real build benefits, reducing environment-related issues. But the 11.1pp accuracy loss compared to the native Codex CLI is large and consistent. For pure CVE-fixing performance, the Codex CLI is superior. For developer experience and codebase integration, Cursor has advantages worth the accuracy trade-off.
Explore the full benchmark results on the leaderboard. Compare all GPT-5.2 configurations and understand methodology. Review the OpenAI lab profile for context on all evaluated agents. See how agent behavior patterns vary across configurations.
FAQ
What is the performance of GPT-5.2 via Cursor?
51.6% pass rate on 128 CVEs at $6.26 per fix: 63 passes, 34 actual failures, 25 build failures, and 6 infrastructure failures.
What is the cost per successful patch for Cursor GPT-5.2?
$6.26 per successful patch. Higher than native Codex CLI ($5.30) but includes Cursor's IDE codebase context. For 100 CVE fixes, roughly 193 evaluations are needed at about $1,213 total.
How does Cursor GPT-5.2 compare to other agents?
51.6% pass rate is above the 47.3% benchmark average. The same GPT-5.2 model achieves 62.7% via native Codex CLI, an 11.1 percentage point gap. Cursor's advantage is lower build failures (25 vs 35) from IDE-aware code context, but more semantic failures (34 vs 12).