Codex GPT-5.2 — CVE-Agent-Bench profile
62.7% pass rate at $5.30 per fix. OpenAI model via Codex CLI. Highest accuracy across all agents.
Codex GPT-5.2 running through the native Codex CLI is the highest-performing agent configuration in CVE-Agent-Bench. Across 136 evaluations, it achieved a 62.7% pass rate with only 12 actual failures, the second-lowest failure count in the benchmark. The cost per successful patch is $5.30, placing it in the mid-range for efficiency while maintaining top-tier accuracy.
Performance overview
The raw numbers tell a clear story: 79 passes out of 136 evaluations, with just 12 fails, 35 build failures, and 10 infrastructure failures. This distribution matters. The low failure count suggests the model produces substantive patch attempts, but environmental factors (build setup, infrastructure) prevent some evaluations from completing. When Codex GPT-5.2 generates a patch, it usually works.
At $5.30 per pass, the agent sits in the middle of the cost spectrum. It's cheaper than sandboxed runtimes but more expensive than some lower-accuracy alternatives. With 79 passes from 126 scored evaluations (excluding infra failures), the 62.7% pass rate is the highest in the benchmark.
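The pass-rate arithmetic above is simple enough to verify directly. A minimal sketch, using the counts from this profile (the function name is ours, not part of the benchmark tooling):

```python
def pass_rate(passes: int, total_evals: int, infra_failures: int) -> float:
    """Pass rate over scored evaluations, excluding infrastructure failures."""
    scored = total_evals - infra_failures
    return passes / scored

# Codex GPT-5.2: 79 passes out of 136 evaluations, 10 infra failures
rate = pass_rate(79, 136, 10)  # 79 / 126 ≈ 0.627
```

Excluding infrastructure failures from the denominator is what lifts the headline figure from 58% (79/136) to 62.7% (79/126), so the scoring convention matters when comparing agents.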
Behavioral profile
The agent scores consistently high across speed and efficiency dimensions. Accuracy 100 indicates every attempted patch is substantive, with no hallucinated code or nonsensical attempts. Speed 90 means it works quickly, rarely exceeding time limits. Efficiency 95 shows it uses compute well, avoiding redundant retries and excessive token consumption.
Breadth 0 is notable. This agent doesn't rotate between multiple tools or approaches. It settles on a narrow strategy and executes it. For vulnerability patching, that's exactly the behavior you want: a focused method that works rather than an exploratory one that tries everything. Reliability 64 is the weakest dimension, suggesting some evaluations hit infrastructure or environmental walls despite high-quality patches.
Runtime comparison
The Codex CLI runtime is the reference implementation. Compare it to the Codex runtime variant (which runs GPT-5.2 inside a sandboxed container):
- Codex GPT-5.2 (standard): 62.7% pass rate at $5.30/pass
- Codex GPT-5.2-codex (sandbox): 49.2% pass rate at $6.65/pass
The sandbox adds a 13.5 percentage point penalty. Build failures increase from 35 to 38, actual fails more than double from 12 to 27, and cost per pass rises by $1.35. The isolation provides security but extracts a steep accuracy price.
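The deltas quoted above can be checked in a few lines. A quick sketch using this page's figures (the dictionary keys are our own naming, not benchmark fields):

```python
# Runtime comparison figures from the profile above.
standard = {"pass_rate": 62.7, "cost_per_pass": 5.30, "fails": 12, "build_failures": 35}
sandbox = {"pass_rate": 49.2, "cost_per_pass": 6.65, "fails": 27, "build_failures": 38}

# Sandbox penalty in percentage points and extra dollars per pass
pp_penalty = standard["pass_rate"] - sandbox["pass_rate"]          # 13.5 points
cost_delta = sandbox["cost_per_pass"] - standard["cost_per_pass"]  # $1.35 more per pass
```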
Cross-wrapper comparison
The same GPT-5.2 model achieves different results across three different wrapper environments:
- Native Codex CLI: 62.7%
- Cursor IDE: 51.6%
- OpenCode: 51.6%
The native Codex CLI outperforms both Cursor and OpenCode by more than 11 percentage points. The gap persists even though all three use the same underlying model, which points to environment setup: how each wrapper handles dependency installation, build configuration, and patch application. The Codex CLI's lean setup appears optimal for this task.
Pareto frontier
Codex GPT-5.2 occupies a position on the Pareto frontier of the benchmark. No agent has both higher pass rate and lower cost. For pure accuracy, it's the best available option. For cost-conscious teams willing to accept slightly lower pass rates, other agents become relevant. But if your priority is fixing CVEs correctly, this is the configuration to choose.
The 79 successful patches, the low failure count of 12, and the $5.30 cost per pass create a profile that doesn't trade much away. The 35 build failures are infrastructure noise, not model weakness. When a Codex GPT-5.2 patch builds, the model usually gets it right.
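Pareto dominance here has the standard meaning: an agent is off the frontier only if some other agent is at least as good on both axes and strictly better on one. A minimal sketch of that test, mixing this page's real numbers with one made-up agent (labeled hypothetical) to show a non-trivial frontier:

```python
from typing import NamedTuple

class Agent(NamedTuple):
    name: str
    pass_rate: float      # percent
    cost_per_pass: float  # dollars

def pareto_frontier(agents: list[Agent]) -> list[Agent]:
    """Keep agents not dominated by any other on (pass_rate up, cost down)."""
    frontier = []
    for a in agents:
        dominated = any(
            b.pass_rate >= a.pass_rate
            and b.cost_per_pass <= a.cost_per_pass
            and (b.pass_rate > a.pass_rate or b.cost_per_pass < a.cost_per_pass)
            for b in agents
        )
        if not dominated:
            frontier.append(a)
    return frontier

agents = [
    Agent("codex-gpt-5-2", 62.7, 5.30),        # from this profile
    Agent("codex-gpt-5-2-codex", 49.2, 6.65),  # sandbox variant: worse on both axes
    Agent("hypothetical-budget", 55.0, 3.50),  # made-up agent for illustration
]
frontier = pareto_frontier(agents)
```

The sandbox variant drops out because codex-gpt-5-2 beats it on both accuracy and cost; the cheap hypothetical survives because nothing beats it on price.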
Reliability and infrastructure
Ten infrastructure failures out of 136 evaluations (7.4%) show the Codex CLI environment is stable; failures due to system issues are rare. Most non-passing outcomes come from build failures (35), which are often recoverable with better dependency specs. The 12 actual fails, where the patch builds but doesn't fix the bug, are the real limit of the model's capability.
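The outcome shares follow directly from the raw counts. A sketch using this profile's numbers (the dictionary layout is ours):

```python
# Raw outcome counts for Codex GPT-5.2, from the profile above.
outcomes = {"pass": 79, "fail": 12, "build_failure": 35, "infra_failure": 10}
total = sum(outcomes.values())  # 136 evaluations

# Shares of all 136 evaluations
wrong_patch_share = outcomes["fail"] / total           # ~9%: the model's real miss rate
build_issue_share = outcomes["build_failure"] / total  # ~26%: often recoverable
infra_share = outcomes["infra_failure"] / total        # ~7%: environment instability
```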
For deployment, expect Codex GPT-5.2 to succeed on roughly 63% of scored evaluations and to need human inspection or retry logic for the remaining 37%. Across all 136 evaluations, only about 9% end in actual failure (a wrong patch), while about 26% are build issues you may be able to resolve.
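If you budget for retries, the expected number of evaluations per fixed CVE follows from the pass rate. A minimal sketch, under the (optimistic) assumption that retries are independent, which is a modeling choice on our part and not something the benchmark measures:

```python
def expected_evals_per_fix(pass_rate: float) -> float:
    """Expected evaluations per successful fix under independent retries
    (geometric-distribution assumption; real retries may be correlated)."""
    return 1.0 / pass_rate

per_fix = expected_evals_per_fix(0.627)  # ≈ 1.6 evaluations per fixed CVE
```

Correlated failures (e.g. a CVE the model simply cannot patch) would push the real number higher, so treat 1.6 as a floor.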
Learn more about benchmark methodology in the CVE-Agent-Bench documentation. Compare all 15 agents on the leaderboard. Explore how we measure economics and cost efficiency. Visit the OpenAI lab profile for context on all agents evaluated from this organization.
FAQ
What makes Codex GPT-5.2 the highest-accuracy agent?
62.7% pass rate on 136 CVEs, the highest across all tested agents. 79 passes, 12 fails, 35 build failures, 10 infrastructure failures.
What is the cost per successful patch for Codex GPT-5.2?
$5.30 per successful patch: mid-range cost but the highest accuracy. Only 12 actual failures from 136 evaluations means nearly every patch that compiles is correct. For 100 CVE fixes, expect roughly 160 evaluations at about $530 total (100 successful patches at $5.30 each).
What types of vulnerabilities does Codex GPT-5.2 handle?
The benchmark tests 136 real CVE samples covering memory safety bugs, bounds checking errors, use-after-free issues, integer overflows, and logic vulnerabilities across C/C++ open source projects. Codex GPT-5.2 handles the broadest range with only 12 outright failures.