Codex GPT-5.2 (Codex runtime) — CVE-Agent-Bench profile
49.2% pass rate at $6.65 per fix. Same GPT-5.2 model, different runtime environment. 136 evaluations.
Codex GPT-5.2 running inside the Codex sandboxed runtime provides network isolation and container-level security but at a measurable cost to accuracy. Across 136 evaluations, this configuration achieved a 49.2% pass rate with 27 actual failures and 38 build failures. The cost per successful patch rises to $6.65, a $1.35 penalty compared to the standard Codex CLI runtime.
Performance overview
The numbers show a consistent accuracy penalty across the board. 63 passes out of 128 scored evaluations (136 total minus the 8 infrastructure failures) works out to 49.2%, a 13.5 percentage point drop from the native Codex CLI (62.7%). The failure count more than doubles: 27 actual fails versus 12 for the standard runtime. Build failures increase from 35 to 38, and infrastructure failures drop slightly to 8. Overall, the sandboxed environment both reduces the success rate and increases costs.
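The arithmetic behind the headline number can be sketched in a few lines. This is an illustrative reconstruction, not benchmark tooling; the one assumption (flagged in the comments) is that infrastructure failures are excluded from the denominator, which is the only way the stated counts yield 49.2%.

```python
# Reconstructing the reported pass rate from this profile's raw counts.
# Assumption: infrastructure failures are excluded from the denominator.
passes, actual_fails, build_fails, infra_fails = 63, 27, 38, 8

total = passes + actual_fails + build_fails + infra_fails  # 136 evaluations
scored = total - infra_fails                               # 128 scored runs

pass_rate = passes / scored
print(f"{pass_rate:.1%}")  # 49.2%
```

Dividing by all 136 evaluations would give 46.3%, so the 49.2% figure only reconciles with the counts under this exclusion.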
The same GPT-5.2 model is running inside both configurations. The difference in performance comes entirely from the sandbox overhead and isolation constraints. This tells you something important: the sandbox is the bottleneck, not the model. When network isolation and container sandboxing are critical requirements, you're purchasing that feature at a 13.5pp accuracy cost.
Behavioral profile
Within the sandbox, Codex GPT-5.2 behaves differently than its standard variant. Speed remains high at 90, but accuracy drops from 100 to 48, the most dramatic change in the profile. Precision holds at 100, yet the model generates more incorrect patches: ones that compile but fail to fix the underlying bug.
Efficiency holds at 92, and breadth remains 0. Reliability drops to 60. The sandbox introduces latency and constraint friction that interferes with the model's ability to accurately assess whether a patch has fixed the vulnerability. The model is faster but less accurate, suggesting the isolation layer creates blind spots.
Runtime effect
The Codex runtime wrapper (sandbox) adds these concrete costs:
- Pass rate: 62.7% → 49.2% (13.5pp loss)
- Actual fails: 12 → 27 (2.25x)
- Build failures: 35 → 38 (3 additional failures)
- Cost per pass: $5.30 → $6.65 ($1.35 increase)
The efficiency loss compounds. You spend more per evaluation and succeed less often. For every 100 evaluations, the sandbox causes you to lose 13 successful patches while adding $1.35 to each pass cost.
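The per-100-evaluation penalty described above can be sketched as a back-of-envelope calculation using only the figures quoted in the list (62.7% / $5.30 native versus 49.2% / $6.65 sandboxed):

```python
# Back-of-envelope sketch of the sandbox penalty, using only the pass rates
# and per-pass costs quoted in this profile.
native_rate, sandbox_rate = 0.627, 0.492
native_cost, sandbox_cost = 5.30, 6.65

patches_lost_per_100 = round((native_rate - sandbox_rate) * 100, 1)  # 13.5
extra_cost_per_pass = round(sandbox_cost - native_cost, 2)           # 1.35

print(patches_lost_per_100, extra_cost_per_pass)
```

Both deltas fall straight out of the headline numbers; no additional data is assumed.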
Cost analysis
At $6.65 per successful patch, the sandbox configuration ranks third in cost per pass among all 15 agents. Only the OpenCode variants cost more. For a deployment needing to fix 100 CVEs, the sandbox would cost $665 total versus $530 for the standard Codex CLI. The $135 premium purchases network isolation and container sandboxing.
Note that the per-pass figure already amortizes failed attempts. Fixing 100 CVEs with the sandbox variant means running roughly 203 evaluations (100 ÷ 0.492) versus 159 (100 ÷ 0.627) with the standard CLI, but the totals remain $665 and $530, because cost per pass divides total spend, failed runs included, by the number of successes. The sandbox premium is therefore 44 extra evaluations and $135 per 100 CVEs fixed.
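The evaluation counts above follow directly from the pass rates. A minimal sketch, where `evals_needed` is an illustrative helper (not part of any benchmark tooling) that rounds the expected attempt count to the nearest whole run:

```python
# Expected evaluations to reach a target number of successful fixes,
# given a pass rate. Illustrative helper, rounded to the nearest run.
def evals_needed(fixes: int, pass_rate: float) -> int:
    return round(fixes / pass_rate)

print(evals_needed(100, 0.492))  # 203 (sandboxed runtime)
print(evals_needed(100, 0.627))  # 159 (standard Codex CLI)
```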
When the sandbox matters
The Codex runtime has genuine security value. It isolates patched code from the network, runs in a restricted container, and prevents malicious patches from exfiltrating data. For environments where that isolation is mandatory (air-gapped systems, high-security deployments, regulated industries), the pass rate penalty is simply the price of a non-negotiable requirement.
But for standard cloud deployments, CI/CD pipelines, and development environments where the host system is already trusted, the standard Codex CLI achieves 13.5pp better results without the sandbox overhead. The choice depends on your security model, not on the model's capability.
Build and fail dynamics
The increase in build failures (35 → 38) suggests the sandbox environment has stricter constraints on dependency resolution or environment setup. Patches that compile in the standard runtime may fail to build in the sandbox due to missing libraries or restricted system calls. The actual fails (27 vs 12) indicate the sandbox's isolation interferes with the model's ability to validate patch correctness within the constrained environment.
Recommendation
Use the standard Codex GPT-5.2 (non-sandboxed) unless your deployment requires network isolation. If you need the sandbox, budget for 13.5pp lower accuracy and plan for roughly 2x the failure rate on incorrect patches. The same model performs far better in an unconstrained environment.
Compare all configurations of GPT-5.2 on the benchmark leaderboard. Learn about runtime variations and their impact. Explore the OpenAI lab profile for context on all evaluated agents. See how economics scale across different configurations.
FAQ
How does Codex runtime affect GPT-5.2 performance?
49.2% pass rate versus 62.7% in the standard Codex CLI, a 13.5pp penalty attributable to the sandbox. The raw counts across 136 evaluations: 63 passes, 27 actual fails, 38 build failures, and 8 infrastructure failures.
What is the cost per successful patch for Codex GPT-5.2 Codex?
$6.65 per successful patch, a $1.35 premium over the standard Codex CLI ($5.30). For 100 CVE fixes, expect roughly 203 evaluations totaling about $665, versus 159 evaluations and $530 with the standard Codex CLI.
How does Codex GPT-5.2 Codex compare to other agents?
At 49.2%, it sits slightly above the 47.3% benchmark average but 13.5 percentage points below the same model in standard Codex CLI. The sandbox provides network isolation and container security at a measurable accuracy and cost penalty.