Cursor GPT-5.2 — CVE-Agent-Bench profile
51.6% pass rate at $6.26 per fix. OpenAI GPT-5.2 via Cursor IDE. 128 evaluations.
GPT-5.2 running through the Cursor IDE achieves a 51.6% pass rate at $6.26 per successful patch. Across 128 evaluations, it recorded 63 passes, 34 actual failures, 25 build failures, and 6 infrastructure failures. The Cursor environment adds codebase-aware context that the Codex CLI wrapper and OpenCode lack, resulting in better build outcomes but a lower overall pass rate than the native Codex CLI.
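The headline 51.6% can be reproduced from the raw counts above, assuming (as the arithmetic suggests) that infrastructure failures are excluded from the denominator. A minimal sketch:

```python
# Hedged sketch: recompute the quoted pass rate from the raw outcome counts,
# assuming infrastructure failures are dropped from the denominator.
passes, actual_fails, build_fails, infra_fails = 63, 34, 25, 6

total = passes + actual_fails + build_fails + infra_fails  # 128 evaluations
pass_rate = passes / (total - infra_fails)                 # 63 / 122

print(f"{pass_rate:.1%}")  # 51.6%
```

Note that 63 / 128 would give 49.2%; only the infra-excluded denominator matches the quoted figure.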
Performance overview
51.6% pass rate places Cursor GPT-5.2 in the middle tier of the benchmark. The model produces fewer actual failures than weaker configurations (34 vs 40-50) but far more than the native Codex CLI's 12. The 128 evaluations (8 fewer than some agents) suggest slightly faster or more reliable test execution in the Cursor environment, though the smaller sample makes direct comparison tricky.
At $6.26 per pass, the cost is higher than the native Codex CLI ($5.30) but roughly equal to the OpenCode wrapper ($6.65). For 100 successful CVE fixes, expect roughly 193 evaluations (100 ÷ 0.516) and about $1,213 in total at the average per-evaluation cost. This is a mid-range investment for mid-range results.
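The $1,213 projection can be checked with a back-of-envelope calculation, assuming the quoted $6.26 is treated as an average per-evaluation cost (the interpretation under which the article's numbers line up):

```python
# Hedged back-of-envelope: reproduce the ~$1,213 projection for 100 fixes.
# Assumption (not stated outright in the text): $6.26 is the average cost
# per evaluation, and failed evaluations still cost money.
pass_rate = 0.516
per_eval_cost = 6.26

evals_needed = 100 / pass_rate           # ~193.8 evaluations for 100 passes
total_cost = evals_needed * per_eval_cost

print(f"{evals_needed:.0f} evals, ${total_cost:,.0f}")
```

If $6.26 were instead a fully loaded cost per successful patch, 100 fixes would simply cost $626, so the projection only makes sense under the per-evaluation reading.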
Behavioral profile
Cursor GPT-5.2 exhibits a balanced personality across dimensions. Accuracy 57 is moderate: not as high as the Codex CLI's 100, but better than lower-scoring variants. Speed 83 indicates steady, deliberate work rather than rushing or extreme caution. Efficiency 93 is high, suggesting the agent uses resources well despite taking slightly longer.
Precision 100 means every patch that successfully builds is correct. When something compiles in Cursor, it fixes the bug. The gap between 63 passes and precision 100 comes from patches that fail to build and never reach the correctness check. Breadth 0 confirms Cursor relies on a single tool or approach, not experimental multi-method attempts.
Cross-wrapper comparison
The same GPT-5.2 model performs differently across three wrapper environments:
- Codex CLI (native): 62.7% pass rate
- Cursor IDE: 51.6% pass rate
- OpenCode: 51.6% pass rate
Cursor and OpenCode extract identical pass rates from GPT-5.2, both 11.1 percentage points below the native CLI. However, their failure modes differ. Cursor has 25 build failures and 34 actual fails. OpenCode has 48 build failures and only 11 actual fails. The same pass rate masks opposite weaknesses.
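One way to see the "same pass rate, opposite weaknesses" point is to total each wrapper's non-passing outcomes, using only the counts quoted above ("semantic" here is shorthand for the article's "actual failures"):

```python
# Hedged sketch: two wrappers with identical 51.6% pass rates fail in
# opposite ways. Counts are those quoted in this section.
wrappers = {
    "Cursor":   {"build": 25, "semantic": 34},
    "OpenCode": {"build": 48, "semantic": 11},
}

for name, f in wrappers.items():
    total = f["build"] + f["semantic"]
    print(f"{name}: {total} non-passes "
          f"({f['build']} build, {f['semantic']} semantic)")
```

Both wrappers land on 59 non-passing evaluations, which is exactly what identical pass rates require; only the split between build and semantic failures differs.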
Build failure analysis
Only 25 build failures out of 128 evals (19.5%) is the second-lowest build failure rate among GPT-5.2 configurations. The Codex CLI has 35 (25.7%), and OpenCode jumps to 48 (35.3%). Why does Cursor perform better at builds? The IDE provides codebase-aware context: file trees, import structures, and existing dependencies that help the model generate patches compatible with the actual project structure.
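The quoted percentages imply different evaluation counts per configuration. A quick check, assuming (inferred from the percentages, not stated in this section) that the Codex CLI and OpenCode runs used 136 evaluations:

```python
# Hedged check of the build-failure rates quoted above. The 136-eval
# denominators for Codex CLI and OpenCode are inferred from the quoted
# percentages; only Cursor's 128 is stated directly in this profile.
configs = [
    ("Cursor",    25, 128),
    ("Codex CLI", 35, 136),
    ("OpenCode",  48, 136),
]

for name, build_fails, evals in configs:
    print(f"{name}: {build_fails}/{evals} = {build_fails / evals:.1%}")
```

The 136 figure is also consistent with the earlier remark that Cursor ran "8 fewer" evaluations than some agents.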
The trade-off is visible: Cursor builds patches better but produces more incorrect ones (34 fails vs 12 for Codex CLI). The codebase context helps with syntax and structure but may limit the model's ability to reason about the actual vulnerability fix.
IDE integration advantage
Cursor's codebase-aware context is its defining feature. Rather than starting from patch templates or generic vulnerability patterns, the model sees the actual code structure, existing imports, and build configuration. This context should theoretically improve accuracy, but the benchmark shows it improves build success more than patch correctness.
This gap suggests the IDE context helps with implementation details (where to place the patch, what imports to use) but gives less insight into whether the patch actually fixes the underlying vulnerability. The model can generate syntactically correct patches that compile, but fewer of them achieve the functional fix.
Infrastructure reliability
6 infrastructure failures out of 128 (4.7%) is good reliability. Cursor's test environment is stable and rarely fails due to system issues. Most non-passing outcomes are true model failures or build incompatibilities, not environmental noise.
Recommendation
Cursor GPT-5.2 is a sensible middle option if you're already using Cursor for development. The IDE integration delivers real build benefits, reducing environment-related issues. But the 11.1pp accuracy loss compared to the native Codex CLI is large and consistent. For pure CVE-fixing performance, the Codex CLI is superior. For developer experience and codebase integration, Cursor has advantages worth the accuracy trade-off.
Explore the full benchmark results on the leaderboard. Compare all GPT-5.2 configurations and understand methodology. Review the OpenAI lab profile for context on all evaluated agents. See how agent behavior patterns vary across configurations.
FAQ
What is the performance of GPT-5.2 via Cursor?
51.6% pass rate on 128 CVEs at $6.26 per fix: 63 passes, 34 actual failures, 25 build failures, and 6 infrastructure failures.
What is the cost per successful patch for Cursor GPT-5.2?
$6.26 per successful patch. Higher than native Codex CLI ($5.30) but includes Cursor's IDE codebase context. For 100 CVE fixes, roughly 193 evaluations are needed at about $1,213 total.
How does Cursor GPT-5.2 compare to other agents?
51.6% pass rate is above the 47.3% benchmark average. The same GPT-5.2 model achieves 62.7% via native Codex CLI, an 11.1 percentage point gap. Cursor's advantage is lower build failures (25 vs 35) from IDE-aware code context, but more semantic failures (34 vs 12).