Cursor GPT-5.3 Codex — CVE-Agent-Bench profile

50.4% pass rate at $6.16 per fix. GPT-5.3 via Cursor with Codex runtime. 128 evaluations.

Pass rate: 50.4% | Cost per pass: $6.16 | Total evals: 128
Agent: Cursor (CLI) | Model: OpenAI GPT-5.3 Codex

Outcome distribution: Pass (64), Fail (40), Build (23), Infra (1)

Codex GPT-5.3 running through the Cursor IDE with Codex runtime sandboxing achieves a 50.4% pass rate at $6.16 per successful patch. Across 128 evaluations, it recorded 64 passes, 40 actual failures, 23 build failures, and only 1 infrastructure failure. This configuration has the best infrastructure reliability in the benchmark: no other agent comes close to its 0.8% infra failure rate. It does, however, show a higher model failure rate than the Cursor GPT-5.2 variant.
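
The headline rates can be sanity-checked straight from the outcome counts. A minimal Python sketch follows; note that the stated 50.4% matches 64 ÷ 127 rather than 64 ÷ 128, so the pass rate appears to be computed over non-infrastructure runs. That denominator is an inference from the published figures, not something the benchmark states.

    # Recompute the headline rates from the outcome distribution above.
    # Assumption: pass rate is taken over runs that actually scored
    # (total minus infra failures), since 64 / 127 ≈ 50.4% matches the
    # published figure while 64 / 128 would round to 50.0%.
    outcomes = {"pass": 64, "fail": 40, "build": 23, "infra": 1}

    total = sum(outcomes.values())          # 128 evaluations
    scored = total - outcomes["infra"]      # 127 non-infra runs
    pass_rate = outcomes["pass"] / scored   # 64 / 127 ≈ 0.504
    infra_rate = outcomes["infra"] / total  # 1 / 128 ≈ 0.008

    print(f"pass rate:  {pass_rate:.1%}")   # 50.4%
    print(f"infra rate: {infra_rate:.1%}")  # 0.8%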

Performance overview

The 50.4% pass rate is nearly identical to Cursor GPT-5.2's 51.6%, suggesting the upgrade from GPT-5.2 to GPT-5.3 doesn't improve pass rate on this task. However, the failure distribution tells a different story. Infrastructure failures drop dramatically: 1 versus 6 for Cursor GPT-5.2. Build failures also improve: 23 versus 25. The trade-off is in actual fails: 40 versus 34.

At $6.16 per pass, it's cheaper than Cursor GPT-5.2 ($6.26) despite the lower pass rate. For 100 successful fixes, you'd need roughly 198 evaluation runs (100 ÷ 0.504), costing approximately $1,220 in total. That is comparable to Cursor GPT-5.2, but with far better infrastructure outcomes.
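
The projection is simple enough to script. A minimal sketch, assuming (as the totals in this section imply) that every evaluation run costs roughly the stated $6.16 whether or not it passes:

    # Project runs and total spend for a target number of fixes.
    # Assumption: each evaluation run costs about $6.16, pass or fail,
    # which is what makes 198 runs land near the $1,220 total above.
    pass_rate = 0.504
    cost_per_run = 6.16   # USD, assumed flat across outcomes
    target_fixes = 100

    runs_needed = target_fixes / pass_rate   # ≈ 198 evaluation runs
    total_cost = runs_needed * cost_per_run  # ≈ $1,220

    print(f"runs needed: {runs_needed:.0f}")   # 198
    print(f"total cost:  ${total_cost:,.0f}")  # $1,222 (≈ $1,220)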

Behavioral profile

Agent personality radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability.

Reliability (64) is the highest among all Cursor configurations; this dimension correlates directly with infrastructure stability. Accuracy (53) is moderate, down from GPT-5.2's 57, which explains the higher fail count. Speed (78) indicates slightly slower, more deliberate work. Efficiency (93) matches Cursor GPT-5.2.

Precision (100) holds across all variants. The issues lie elsewhere: more patches that don't build (23 build failures) and more patches that build but don't fix the vulnerability (40 fails). The model runs reliably without crashing the system, but it produces less accurate patches.

Model comparison

GPT-5.3 versus GPT-5.2 in the Cursor environment:

  • GPT-5.2: 51.6% pass, 6 infra fails, 25 build fails, 34 actual fails
  • GPT-5.3: 50.4% pass, 1 infra fail, 23 build fails, 40 actual fails

The newer model improves environmental robustness (5 fewer infra failures, 2 fewer build failures) but degrades model accuracy (6 additional actual fails). For reliable, stable execution without crashes, GPT-5.3 is superior. For correct patches, GPT-5.2 is better.
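
The trade-off reads directly off the counts. A small sketch computing the deltas cited above (pass counts are inferred as the remainder of 128 runs, since they aren't listed directly):

    # Failure-count deltas between the two Cursor configurations,
    # from the per-outcome counts listed above (128 evaluations each).
    gpt52 = {"infra": 6, "build": 25, "fail": 34}
    gpt53 = {"infra": 1, "build": 23, "fail": 40}

    for kind in ("infra", "build", "fail"):
        delta = gpt53[kind] - gpt52[kind]
        print(f"{kind:>5}: {delta:+d}")  # infra: -5, build: -2, fail: +6

    # Inferred passes: GPT-5.2 -> 128 - 6 - 25 - 34 = 63; GPT-5.3 -> 64.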

Infrastructure reliability standout

1 infrastructure failure out of 128 evals (0.8%) is exceptional. No other agent in the entire benchmark achieves this rate. The Cursor GPT-5.3 Codex combination produces the most stable test environment across all configurations. For deployments where uptime and reliability matter more than peak accuracy, this is the configuration to choose.

This reliability advantage likely comes from GPT-5.3's improved system stability and better resource management. The model avoids edge cases that cause the test environment to crash or timeout. For production deployments where human attention is expensive, fewer infrastructure-related reruns reduce operational burden.

Build failure insight

23 build failures (18%) is low; the Cursor IDE context continues to help the model produce syntactically correct patches. But the 40 actual fails suggest GPT-5.3 generates more patches that compile yet don't fix the underlying vulnerability. The model is more likely to hit a local maximum: finding code that compiles without finding code that works.

Cost efficiency

At $6.16 per pass and a 50.4% success rate, fixing 100 CVEs takes roughly 198 evaluation runs and roughly $1,220 in total spend. Compare this to:

  • Codex GPT-5.2 (native): $844 for 100 fixes
  • Cursor GPT-5.2: $1,213 for 100 fixes
  • Cursor GPT-5.3 Codex: $1,220 for 100 fixes

Cursor GPT-5.3 Codex and Cursor GPT-5.2 are economically equivalent. The choice between them is driven by reliability requirements, not cost.

When to choose this configuration

Select Cursor GPT-5.3 Codex if your deployment is ops-constrained. A 0.8% infrastructure failure rate means your monitoring, alerting, and incident-response overhead drops dramatically, and you can run unattended evaluation batches without babysitting the system. The 1.2-point drop in pass rate (51.6% → 50.4%) is a minor cost for that operational simplicity.

For development teams using Cursor who prioritize system stability over peak accuracy, this is the best option. You get IDE codebase context, sandboxed runtime security, and exceptional infrastructure reliability in one configuration.

Recommendation

Use Cursor GPT-5.3 Codex for stable, production-like evaluation pipelines. Use Cursor GPT-5.2 if you want slightly better patch accuracy and can tolerate occasional infrastructure issues. Use the native Codex CLI if you want the best overall accuracy and cost, accepting that you lose IDE context and sandboxed runtime benefits.

Explore the benchmark results to see all 15 agents. Read about the methodology and how we measure reliability. Compare economics across configurations. Review session-level analysis to dig deeper into failure patterns.

FAQ

How does GPT-5.3 Codex perform in Cursor?

50.4% pass rate on 128 CVEs at $6.16 per fix: 64 passes, 40 actual fails, 23 build failures, and 1 infra failure.

What is the cost per successful patch for Cursor GPT-5.3 Codex?

$6.16 per successful patch. Nearly identical to Cursor GPT-5.2 ($6.26) in cost and accuracy (50.4% vs 51.6%). The GPT-5.3 model upgrade improves infrastructure reliability (1 infra failure vs 6) but adds more semantic failures (40 vs 34).

How does Cursor GPT-5.3 Codex compare to other agents?

50.4% pass rate is above the 47.3% benchmark average. Its standout feature is 0.8% infrastructure failure rate, the best of any agent. For production pipelines that need stable, unattended batch evaluation, this is the most reliable configuration available.
