OpenCode GPT-5.2 Codex — CVE-Agent-Bench profile
37.8% pass rate at $8.73 per fix. GPT-5.2 with Codex runtime via OpenCode. 136 evaluations.
OpenCode GPT-5.2 Codex is the most constrained configuration in CVE-Agent-Bench: GPT-5.2 runs inside the Codex sandboxed runtime, which itself runs inside the OpenCode wrapper, compounding multiple layers of environmental friction. Across 136 evaluations it achieved just a 37.8% pass rate at $8.73 per successful patch. With 48 passes, 32 fails, 47 build failures, and 9 infrastructure failures, this is the least efficient configuration available.
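The headline figures can be reconciled with the raw counts in a few lines. Note that 48 passes out of all 136 evaluations is 35.3%, while 48 of the 127 non-infrastructure runs is exactly 37.8%, which suggests infrastructure failures are excluded from the pass-rate denominator (an inference from the numbers, not stated explicitly in the benchmark):

```python
# Reconcile the reported 37.8% pass rate with the raw counts.
# Inference: infrastructure failures appear excluded from the denominator.
passes, fails, build_fails, infra_fails = 48, 32, 47, 9

total = passes + fails + build_fails + infra_fails   # 136 evaluations
scored = total - infra_fails                         # 127 scored runs

print(round(100 * passes / total, 1))    # 35.3 (all evaluations)
print(round(100 * passes / scored, 1))   # 37.8 (excluding infra failures)
```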
Performance overview
A 37.8% pass rate is the second-lowest in the entire benchmark (only one configuration scores lower), and $8.73 per pass is the third-highest cost, creating a double penalty: fewer successes and higher expense per success. To fix 100 CVEs with this configuration, you'd need roughly 264 evaluations (100 ÷ 0.378), costing approximately $2,304 total.
Compare this to native Codex CLI at $844 for 100 fixes. The OpenCode Codex combination costs 2.7x more for the same result. For large-scale CVE remediation, this multiplier becomes prohibitive.
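The projection above can be reproduced directly. One caveat worth flagging: the quoted $2,304 total equals 264 evaluations × $8.73, which treats $8.73 as an all-in per-evaluation cost rather than a per-fix cost; this sketch follows the section's own arithmetic:

```python
import math

# Reproduce the 100-CVE projection, following the section's arithmetic
# (264 evaluations x $8.73 each, i.e. $8.73 treated as a per-evaluation cost).
def remediation_projection(pass_rate: float, cost_per_eval: float, fixes_needed: int):
    evals = math.floor(fixes_needed / pass_rate)  # expected attempts
    return evals, evals * cost_per_eval           # attempts, total spend

evals, total = remediation_projection(0.378, 8.73, 100)
print(evals, round(total))   # 264 attempts, ~$2,305
```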
Behavioral profile
An accuracy score of 4 is the lowest in the benchmark, barely above zero. A speed score of 100 (fastest) combined with accuracy 4 creates a dangerous profile: the agent moves quickly but in the wrong direction. Reliability 47 is the lowest of any configuration, and efficiency 88 is also relatively low, suggesting the constrained environment causes wasted computation.
Precision 100 persists across all variants: patches this configuration reports as passing are genuinely correct fixes. The gap between 48 passes and an accuracy score of 4 comes from build failures (47) and infrastructure issues (9); only 32 actual fails represent true model errors. But those 56 environmental failures (47 + 9) obscure the underlying problem: the model isn't learning or improving within the constrained sandbox.
Double penalty analysis
This configuration stacks two penalty layers:
- OpenCode wrapper penalty: 47 build failures and increased infrastructure issues
- Codex sandbox penalty: isolation restrictions and reduced model visibility
- OpenCode GPT-5.2 (non-sandboxed): 51.6% pass rate
- OpenCode GPT-5.2 Codex: 37.8% pass rate
- Combined penalty: 13.8 percentage points
Compare to the standalone sandbox:
- Codex GPT-5.2 (native): 62.7% pass
- Codex GPT-5.2 Codex (sandbox only): 49.2% pass
- Sandbox penalty: 13.5 percentage points
The wrapper and sandbox penalties are roughly additive, suggesting each degrades the model independently rather than compounding multiplicatively.
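The additivity claim can be checked against the pass rates quoted above: the sum of the two individual penalties (24.6 pp) lands close to the observed combined penalty (24.9 pp):

```python
# Verify that the wrapper and sandbox penalties are roughly additive,
# using the pass rates quoted in this section (all relative to native Codex CLI).
native    = 62.7  # Codex CLI GPT-5.2
wrapper   = 51.6  # OpenCode GPT-5.2 (wrapper only)
sandboxed = 49.2  # Codex GPT-5.2 Codex (sandbox only)
combined  = 37.8  # OpenCode GPT-5.2 Codex (both)

wrapper_penalty  = native - wrapper     # ~11.1 pp
sandbox_penalty  = native - sandboxed   # ~13.5 pp
combined_penalty = native - combined    # ~24.9 pp

# Additive prediction vs observed combined penalty:
print(round(wrapper_penalty + sandbox_penalty, 1), round(combined_penalty, 1))  # 24.6 24.9
```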
Build and infrastructure failures
47 build failures plus 9 infrastructure failures account for 41.2% of all evaluations. These are environment issues, not model failures. The sandbox inside the wrapper creates such restrictive constraints that even syntactically correct patches fail to build.
The 32 actual fails (where code compiles but doesn't fix the bug) suggest the model struggles to reason effectively within the sandboxed environment. It can't see enough of the codebase or system state to understand whether a patch truly resolves the vulnerability.
Cost breakdown
At $8.73 per pass, breaking down what you're paying for:
- Base model cost: ~$0.50 (per evaluation, shared across all configs)
- Environment overhead (OpenCode): ~$1.20
- Sandbox overhead (Codex): ~$0.90
- Retry penalty (lower success rate): ~$6.13 (implicit)
The explicit costs (model + environment) are roughly $2.60 per evaluation. The implicit cost comes from needing 264 evaluations to fix 100 CVEs instead of 159 with Codex CLI.
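The decomposition above can be sketched as arithmetic on the section's own estimates, splitting the $8.73 per-pass figure into explicit per-evaluation spend and the implicit retry penalty:

```python
# Decompose the $8.73 per-pass figure into explicit per-evaluation spend
# and the implicit retry penalty, using this section's estimates.
base_model = 0.50   # per evaluation, shared across configs
wrapper_oh = 1.20   # OpenCode environment overhead
sandbox_oh = 0.90   # Codex sandbox overhead

explicit = base_model + wrapper_oh + sandbox_oh   # $2.60 per evaluation
implicit = 8.73 - explicit                        # retry penalty per pass
print(round(explicit, 2), round(implicit, 2))     # 2.6 6.13
```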
Failure distribution
- Actual fails: 32 (24%). Model produces wrong patch
- Build failures: 47 (35%). Environment can't execute patch
- Infra failures: 9 (7%). System crashes or times out
- Passes: 48 (35%). All systems work, patch correct
This distribution reveals the real bottleneck: environment friction (41% of evals) exceeds model failure (24% of evals). You're not hitting the limits of the model; you're hitting the limits of the sandbox.
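The bottleneck split follows directly from the raw counts:

```python
# Failure shares of the 136 evaluations, confirming the bottleneck split.
counts = {"pass": 48, "fail": 32, "build_failure": 47, "infra_failure": 9}
total = sum(counts.values())            # 136
pct = lambda n: round(100 * n / total, 1)

env_friction = pct(counts["build_failure"] + counts["infra_failure"])  # 41.2%
model_error  = pct(counts["fail"])                                     # 23.5%
print(env_friction, model_error)
```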
Recommendation
Do not use OpenCode GPT-5.2 Codex for production CVE remediation; the cost is prohibitive and the pass rate is unacceptable. If you need the OpenCode wrapper, use OpenCode GPT-5.2 (non-sandboxed), which achieves 51.6% at $6.65. If you need Codex sandboxing, use the native Codex CLI at $5.30. Never combine both constraints.
The only reason to evaluate this configuration is to understand how constraint stacking degrades performance. For any practical deployment, alternatives are superior.
Comparative context
To understand why this configuration fails, compare it to its component parts:
- Codex CLI GPT-5.2 (native): 62.7% at $5.30
- OpenCode GPT-5.2 (wrapper only): 51.6% at $6.65
- Codex GPT-5.2 Codex (sandbox only): 49.2% at $6.65
- OpenCode GPT-5.2 Codex (wrapper + sandbox): 37.8% at $8.73
Each additional constraint reduces pass rate and raises cost. Combining multiple constraints creates disproportionate penalties because the model can't adapt its strategy within increasingly restrictive environments.
What this teaches about agent design
This configuration demonstrates that agent success depends heavily on environment fit. A strong model (GPT-5.2) performs worst when the environment is most constrained. Conversely, the same model performs best (62.7%) in the simplest environment (native Codex CLI).
For CVE patching specifically, the agent needs: visibility into the codebase, access to build tools, reasonable execution timeouts, and network access to package managers. When any of these are restricted, success rate drops sharply. The sandbox restrictions on all four dimensions create the worst-case scenario.
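The four prerequisites above lend themselves to a simple pre-flight check. This is a hypothetical sketch with illustrative capability names, not part of any real benchmark or agent API:

```python
# Hypothetical pre-flight check for the four prerequisites named above.
# Capability names are illustrative, not part of any real API.
REQUIRED = {"codebase_visibility", "build_tools", "adequate_timeouts", "network_access"}

def environment_fit(capabilities: set) -> float:
    """Fraction of the required capabilities an environment provides."""
    return len(REQUIRED & capabilities) / len(REQUIRED)

print(environment_fit(REQUIRED))         # 1.0  (native CLI, all four present)
print(environment_fit({"build_tools"}))  # 0.25 (heavily sandboxed)
```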
When you might encounter this
You'd use OpenCode Codex only if both of the following were non-negotiable in your deployment:
- High-security isolation (the Codex sandbox)
- Specific container/orchestration capabilities (the OpenCode wrapper)
In practice, these constraints are rarely both critical. Most high-security environments use native Codex + network policies. Most OpenCode users skip the sandbox for better results.
FAQ
Why does OpenCode GPT-5.2 Codex perform so poorly?
It passes 37.8% of 136 CVE evaluations at $8.73 per fix. The Codex sandbox adds overhead and reduces accuracy: 48 passes, 32 fails, 47 build failures, 9 infrastructure failures.
What is the cost per successful patch for OpenCode GPT-5.2 Codex?
$8.73 per successful patch, third-highest in the benchmark. Stacking the OpenCode wrapper and Codex sandbox creates a combined 24.9 percentage point penalty compared to native Codex CLI (62.7%). For 100 CVE fixes, roughly 264 evaluations are needed at $2,304 total vs $844 with Codex CLI.
How does OpenCode GPT-5.2 Codex compare to other agents?
37.8% pass rate is the second-lowest in the benchmark, well below the 47.3% average. 47 build failures and 9 infra failures account for 41.2% of evaluations. The double constraint of OpenCode wrapper plus Codex sandbox creates disproportionate penalties. Not recommended for production use.