Codex GPT-5.2 — CVE-Agent-Bench profile
62.7% pass rate at $5.30 per fix. OpenAI model via Codex CLI. Highest accuracy across all agents.
Codex GPT-5.2 running through the native Codex CLI is the highest-performing agent configuration in CVE-Agent-Bench. Across 136 evaluations, it achieved a 62.7% pass rate with only 12 actual failures, the second-lowest failure count in the benchmark. The cost per successful patch is $5.30, placing it in the mid-range for efficiency while maintaining top-tier accuracy.
Performance overview
The raw numbers tell a clear story: 79 passes out of 136 evaluations, with just 12 fails, 35 build failures, and 10 infrastructure failures. This distribution matters. The low failure count suggests the model produces substantive patch attempts, but environmental factors (build setup, infrastructure) prevent some evaluations from completing. When Codex GPT-5.2 generates a patch, it usually works.
At $5.30 per pass, the agent sits in the middle of the cost spectrum. It's cheaper than sandboxed runtimes but more expensive than some lower-accuracy alternatives. With 79 passes from 126 scored evaluations (excluding infra failures), the 62.7% pass rate is the highest in the benchmark.
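The pass-rate arithmetic above is simple enough to verify directly. A minimal sketch, using the counts from this profile (the function name is ours, not part of the benchmark tooling):

```python
def pass_rate(passes: int, total_evals: int, infra_failures: int) -> float:
    """Pass rate over scored evaluations, excluding infrastructure failures."""
    scored = total_evals - infra_failures
    return passes / scored

# Codex GPT-5.2: 79 passes out of 136 evaluations, 10 infra failures
rate = pass_rate(79, 136, 10)  # 79 / 126 ≈ 0.627
```

Excluding infrastructure failures from the denominator is what lifts the headline figure from 58% (79/136) to 62.7% (79/126), so the scoring convention matters when comparing agents.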
Behavioral profile
The agent scores consistently high across speed and efficiency dimensions. Accuracy 100 indicates every attempted patch is substantive, with no hallucinated code or nonsensical attempts. Speed 90 means it works quickly, rarely exceeding time limits. Efficiency 95 shows it uses compute well, avoiding redundant retries and excessive token consumption.
Breadth 0 is notable. This agent doesn't rotate between multiple tools or approaches. It settles on a narrow strategy and executes it. For vulnerability patching, that's exactly the behavior you want: a focused method that works rather than an exploratory one that tries everything. Reliability 64 is the weakest dimension, suggesting some evaluations hit infrastructure or environmental walls despite high-quality patches.
Runtime comparison
The Codex CLI runtime is the reference implementation. Compare it to the Codex runtime variant (which runs GPT-5.2 inside a sandboxed container):
- Codex GPT-5.2 (standard): 62.7% pass rate at $5.30/pass
- Codex GPT-5.2-codex (sandbox): 49.2% pass rate at $6.65/pass
The sandbox adds a 13.5 percentage point penalty. Build failures increase from 35 to 38, actual fails more than double from 12 to 27, and cost per pass rises by $1.35. The isolation provides security but extracts a steep accuracy price.
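The deltas quoted above can be checked in a few lines. A quick sketch using this page's figures (the dictionary keys are our own naming, not benchmark fields):

```python
# Runtime comparison figures from the profile above.
standard = {"pass_rate": 62.7, "cost_per_pass": 5.30, "fails": 12, "build_failures": 35}
sandbox = {"pass_rate": 49.2, "cost_per_pass": 6.65, "fails": 27, "build_failures": 38}

# Sandbox penalty in percentage points and extra dollars per pass
pp_penalty = standard["pass_rate"] - sandbox["pass_rate"]          # 13.5 points
cost_delta = sandbox["cost_per_pass"] - standard["cost_per_pass"]  # $1.35 more per pass
```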
Cross-wrapper comparison
The same GPT-5.2 model achieves different results across three different wrapper environments:
- Native Codex CLI: 62.7%
- Cursor IDE: 51.6%
- OpenCode: 51.6%
The native Codex CLI outperforms both Cursor and OpenCode by more than 11 percentage points. The gap persists even though all three use the same underlying model, which points to environment setup: how each wrapper handles dependency installation, build configuration, and patch application. The Codex CLI's lean setup appears optimal for this task.
Pareto frontier
Codex GPT-5.2 occupies a position on the Pareto frontier of the benchmark. No agent has both higher pass rate and lower cost. For pure accuracy, it's the best available option. For cost-conscious teams willing to accept slightly lower pass rates, other agents become relevant. But if your priority is fixing CVEs correctly, this is the configuration to choose.
The 79 successful patches, the low failure count of 12, and the $5.30 cost per pass create a profile that doesn't trade much away. The 35 build failures are infrastructure noise, not model weakness. When a Codex GPT-5.2 patch builds, the model usually gets it right.
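Pareto dominance here has the standard meaning: an agent is off the frontier only if some other agent is at least as good on both axes and strictly better on one. A minimal sketch of that test, mixing this page's real numbers with one made-up agent (labeled hypothetical) to show a non-trivial frontier:

```python
from typing import NamedTuple

class Agent(NamedTuple):
    name: str
    pass_rate: float      # percent
    cost_per_pass: float  # dollars

def pareto_frontier(agents: list[Agent]) -> list[Agent]:
    """Keep agents not dominated by any other on (pass_rate up, cost down)."""
    frontier = []
    for a in agents:
        dominated = any(
            b.pass_rate >= a.pass_rate
            and b.cost_per_pass <= a.cost_per_pass
            and (b.pass_rate > a.pass_rate or b.cost_per_pass < a.cost_per_pass)
            for b in agents
        )
        if not dominated:
            frontier.append(a)
    return frontier

agents = [
    Agent("codex-gpt-5-2", 62.7, 5.30),        # from this profile
    Agent("codex-gpt-5-2-codex", 49.2, 6.65),  # sandbox variant: worse on both axes
    Agent("hypothetical-budget", 55.0, 3.50),  # made-up agent for illustration
]
frontier = pareto_frontier(agents)
```

The sandbox variant drops out because codex-gpt-5-2 beats it on both accuracy and cost; the cheap hypothetical survives because nothing beats it on price.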
Reliability and infrastructure
Ten infrastructure failures out of 136 evaluations (7.4%) show the Codex CLI environment is stable; failures due to system issues are rare. Most non-passing outcomes come from build failures (35), which are often recoverable with better dependency specs. The 12 actual fails, where the patch builds but doesn't fix the bug, are the real limit of the model's capability.
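The outcome shares follow directly from the raw counts. A sketch using this profile's numbers (the dictionary layout is ours):

```python
# Raw outcome counts for Codex GPT-5.2, from the profile above.
outcomes = {"pass": 79, "fail": 12, "build_failure": 35, "infra_failure": 10}
total = sum(outcomes.values())  # 136 evaluations

# Shares of all 136 evaluations
wrong_patch_share = outcomes["fail"] / total           # ~9%: the model's real miss rate
build_issue_share = outcomes["build_failure"] / total  # ~26%: often recoverable
infra_share = outcomes["infra_failure"] / total        # ~7%: environment instability
```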
For deployment, expect Codex GPT-5.2 to succeed on roughly 63% of scored evaluations and to need human inspection or retry logic for the remaining 37%. Across all 136 evaluations, only about 9% end in actual failure (a wrong patch), while about 26% are build issues you may be able to resolve.
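If you budget for retries, the expected number of evaluations per fixed CVE follows from the pass rate. A minimal sketch, under the (optimistic) assumption that retries are independent, which is a modeling choice on our part and not something the benchmark measures:

```python
def expected_evals_per_fix(pass_rate: float) -> float:
    """Expected evaluations per successful fix under independent retries
    (geometric-distribution assumption; real retries may be correlated)."""
    return 1.0 / pass_rate

per_fix = expected_evals_per_fix(0.627)  # ≈ 1.6 evaluations per fixed CVE
```

Correlated failures (e.g. a CVE the model simply cannot patch) would push the real number higher, so treat 1.6 as a floor.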
Learn more about benchmark methodology in the CVE-Agent-Bench documentation. Compare all 15 agents on the leaderboard. Explore how we measure economics and cost efficiency. Visit the OpenAI lab profile for context on all agents evaluated from this organization.
FAQ
What makes Codex GPT-5.2 the highest-accuracy agent?
62.7% pass rate on 136 CVEs, the highest across all tested agents. 79 passes, 12 fails, 35 build failures, 10 infrastructure failures.
What is the cost per successful patch for Codex GPT-5.2?
$5.30 per successful patch: mid-range cost but the highest accuracy. Only 12 actual failures from 136 evaluations means nearly every patch that compiles is correct. For 100 CVE fixes, expect roughly 160 evaluations at about $530 total (100 successful patches at $5.30 each).
What types of vulnerabilities does Codex GPT-5.2 handle?
The benchmark tests 136 real CVE samples covering memory safety bugs, bounds checking errors, use-after-free issues, integer overflows, and logic vulnerabilities across C/C++ open source projects. Codex GPT-5.2 handles the broadest range with only 12 outright failures.