
Codex GPT-5.2 — CVE-Agent-Bench profile

62.7% pass rate at $5.30 per fix. OpenAI model via Codex CLI. Highest accuracy across all agents.

  • Lab: OpenAI
  • Pass rate: 62.7%
  • Cost per pass: $5.30
  • Total evaluations: 136
  • Runtime: Codex CLI
  • Outcome distribution: 79 pass, 12 fail, 35 build, 10 infra

Codex GPT-5.2 running through the native Codex CLI is the highest-performing agent configuration in CVE-Agent-Bench. Across 136 evaluations, it achieved a 62.7% pass rate with only 12 actual failures, the second-lowest failure count in the benchmark. The cost per successful patch is $5.30, placing it in the mid-range for efficiency while maintaining top-tier accuracy.

Performance overview

The raw numbers tell a clear story: 79 passes out of 136 evaluations, with just 12 fails, 35 build failures, and 10 infrastructure failures. This distribution matters. The low failure count suggests the model produces substantive patch attempts, but environmental factors (build setup, infrastructure) prevent some evaluations from completing. When Codex GPT-5.2 generates a patch, it usually works.

At $5.30 per pass, the agent sits in the middle of the cost spectrum. It's cheaper than sandboxed runtimes but more expensive than some lower-accuracy alternatives. With 79 passes from 126 scored evaluations (excluding infra failures), the 62.7% pass rate is the highest in the benchmark.
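The headline metrics can be reproduced from the raw outcome counts above. The scoring convention here, excluding infrastructure failures from the denominator, is inferred from the text ("79 passes from 126 scored evaluations"):

```python
# Derive the headline metrics from the outcome distribution above.
passes, fails, build_fails, infra_fails = 79, 12, 35, 10
total_evals = passes + fails + build_fails + infra_fails  # 136

scored = total_evals - infra_fails  # 126 scored evaluations (infra excluded)
pass_rate = passes / scored         # ~0.627

print(f"Pass rate: {pass_rate:.1%}")  # Pass rate: 62.7%
```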

Behavioral profile

The agent scores consistently high across speed and efficiency dimensions. An Accuracy score of 100 indicates every attempted patch is substantive, with no hallucinated code or nonsensical attempts. Speed 90 means it works quickly, rarely exceeding time limits. Efficiency 95 shows it uses compute resources well, avoiding redundant retries and excessive token consumption.

Breadth 0 is notable. This agent doesn't rotate between multiple tools or approaches. It settles on a narrow strategy and executes it. For vulnerability patching, that's exactly the behavior you want: a focused method that works rather than an exploratory one that tries everything. Reliability 64 is the weakest dimension, suggesting some evaluations hit infrastructure or environmental walls despite high-quality patches.

[Agent personality radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]

Runtime comparison

The Codex CLI runtime is the reference implementation. Compare it to the Codex runtime variant (which runs GPT-5.2 inside a sandboxed container):

  • Codex GPT-5.2 (standard): 62.7% pass rate at $5.30/pass
  • Codex GPT-5.2-codex (sandbox): 49.2% pass rate at $6.65/pass

The sandbox adds a 13.5 percentage point penalty. Build failures increase from 35 to 38, actual fails double from 12 to 27, and cost per pass rises by $1.35. The isolation provides security but extracts a steep accuracy price.
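The sandbox penalty quoted above follows directly from the two variants' figures; a minimal check, using only the numbers in this section:

```python
# Compare the two runtime variants of GPT-5.2 reported above.
standard = {"pass_rate": 62.7, "cost_per_pass": 5.30}  # native Codex CLI
sandbox = {"pass_rate": 49.2, "cost_per_pass": 6.65}   # sandboxed container

penalty_pp = round(standard["pass_rate"] - sandbox["pass_rate"], 1)      # 13.5 points
cost_delta = round(sandbox["cost_per_pass"] - standard["cost_per_pass"], 2)  # $1.35

print(f"Sandbox penalty: {penalty_pp} pp, extra cost: ${cost_delta}/pass")
```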

Cross-wrapper comparison

The same GPT-5.2 model achieves different results across three different wrapper environments:

  • Native Codex CLI: 62.7%
  • Cursor IDE: 51.6%
  • OpenCode: 51.6%

The native Codex CLI extracts 11+ percentage points more than either Cursor or OpenCode. This gap persists even though all three use the same underlying model. The difference points to environment setup: how each wrapper handles dependency installation, build configuration, and patch application. The Codex CLI's lean setup appears optimal for this task.

Pareto frontier

Codex GPT-5.2 occupies a position on the Pareto frontier of the benchmark. No agent has both higher pass rate and lower cost. For pure accuracy, it's the best available option. For cost-conscious teams willing to accept slightly lower pass rates, other agents become relevant. But if your priority is fixing CVEs correctly, this is the configuration to choose.

The 79 successful patches, the 12 failures (not 50+), and the $5.30 cost create a profile that doesn't trade much away. The 35 build failures are infrastructure noise, not model weakness. When Codex GPT-5.2 attempts a patch, the model itself almost always gets it right.

Reliability and infrastructure

Ten infrastructure failures out of 136 evals (7.4%) show that the Codex CLI environment is stable; failures due to system issues are rare. Most non-passing outcomes come from build failures (35), which are often recoverable with better dependency specs. The 12 actual fails, where the patch doesn't fix the bug despite building, are the real limit of the model's capability.

For deployment, expect Codex GPT-5.2 to succeed roughly 63% of the time and require human inspection or retry logic for the remaining 37%. Of all scored evaluations, roughly 10% end in actual failure (wrong patch), while about 28% hit build issues you may be able to resolve.
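The retry budget implied by these rates can be sketched as follows. This assumes attempts are independent (a geometric-distribution model), which is an assumption: in practice repeated failures on the same CVE are likely correlated.

```python
# Rough capacity planning from the rates quoted above.
pass_rate = 0.627       # per-attempt success rate
cost_per_pass = 5.30    # dollars per successful patch

n_fixes = 100
# Expected attempts per fix under independent retries is 1 / pass_rate.
expected_evals = n_fixes / pass_rate      # ~159 evaluations
total_cost = n_fixes * cost_per_pass      # $530 for 100 successful patches

print(f"~{expected_evals:.0f} evaluations, ${total_cost:.0f} in pass costs")
```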

Learn more about benchmark methodology in the CVE-Agent-Bench documentation. Compare all 15 agents on the leaderboard. Explore how we measure economics and cost efficiency. Visit the OpenAI lab profile for context on all agents evaluated from this organization.

FAQ

What makes Codex GPT-5.2 the highest-accuracy agent?

62.7% pass rate on 136 CVEs, the highest across all tested agents. 79 passes, 12 fails, 35 build failures, 10 infrastructure failures.

What is the cost per successful patch for Codex GPT-5.2?

$5.30 per successful patch. Mid-range cost but highest accuracy. Only 12 actual failures from 136 evaluations means nearly every patch that compiles is correct. For 100 CVE fixes, expect roughly 159 evaluations and about $530 in pass costs (100 × $5.30).

What types of vulnerabilities does Codex GPT-5.2 handle?

The benchmark tests 136 real CVE samples covering memory safety bugs, bounds checking errors, use-after-free issues, integer overflows, and logic vulnerabilities across C/C++ open source projects. Codex GPT-5.2 handles the broadest range with only 12 outright failures.

The full benchmark spans 128 CVEs, 15 agents, and 1,920 evaluations; see the leaderboard to compare which agents produce fixes that work.