OpenCode GPT-5.2 Codex — CVE-Agent-Bench profile
37.8% pass rate at $8.73 per fix. GPT-5.2 with Codex runtime via OpenCode. 136 evaluations.
OpenCode GPT-5.2 Codex is the most constrained configuration in CVE-Agent-Bench: GPT-5.2 runs inside the Codex sandboxed runtime, which itself runs inside the OpenCode wrapper, compounding multiple layers of environmental friction. Across 136 evaluations it achieved just a 37.8% pass rate at $8.73 per successful patch. With 48 passes, 32 fails, 47 build failures, and 9 infrastructure failures, this is the least efficient configuration available.
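The headline figures can be reconciled with the raw counts in a few lines. Note that 48 passes out of all 136 evaluations is 35.3%, while 48 of the 127 non-infrastructure runs is exactly 37.8%, which suggests infrastructure failures are excluded from the pass-rate denominator (an inference from the numbers, not stated explicitly in the benchmark):

```python
# Reconcile the reported 37.8% pass rate with the raw counts.
# Inference: infrastructure failures appear excluded from the denominator.
passes, fails, build_fails, infra_fails = 48, 32, 47, 9

total = passes + fails + build_fails + infra_fails   # 136 evaluations
scored = total - infra_fails                         # 127 scored runs

print(round(100 * passes / total, 1))    # 35.3 (all evaluations)
print(round(100 * passes / scored, 1))   # 37.8 (excluding infra failures)
```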
Performance overview
A 37.8% pass rate is the second-lowest in the entire benchmark (only one configuration scores lower), and $8.73 per pass is the third-highest cost, creating a double penalty: fewer successes and higher expense per success. To fix 100 CVEs with this configuration, you'd need roughly 264 evaluations (100 ÷ 0.378), costing approximately $2,304 total.
Compare this to native Codex CLI at $844 for 100 fixes. The OpenCode Codex combination costs 2.7x more for the same result. For large-scale CVE remediation, this multiplier becomes prohibitive.
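The projection above can be reproduced directly. One caveat worth flagging: the quoted $2,304 total equals 264 evaluations × $8.73, which treats $8.73 as an all-in per-evaluation cost rather than a per-fix cost; this sketch follows the section's own arithmetic:

```python
import math

# Reproduce the 100-CVE projection, following the section's arithmetic
# (264 evaluations x $8.73 each, i.e. $8.73 treated as a per-evaluation cost).
def remediation_projection(pass_rate: float, cost_per_eval: float, fixes_needed: int):
    evals = math.floor(fixes_needed / pass_rate)  # expected attempts
    return evals, evals * cost_per_eval           # attempts, total spend

evals, total = remediation_projection(0.378, 8.73, 100)
print(evals, round(total))   # 264 attempts, ~$2,305
```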
Behavioral profile
An accuracy score of 4 is the lowest in the benchmark, barely above zero. A speed score of 100 (fastest) combined with accuracy 4 creates a dangerous profile: the agent moves quickly but in the wrong direction. Reliability 47 is the lowest of any configuration, and efficiency 88 is also relatively low, suggesting the constrained environment causes wasted computation.
Precision 100 persists across all variants: patches this configuration reports as passing are genuinely correct fixes. The gap between 48 passes and an accuracy score of 4 comes from build failures (47) and infrastructure issues (9); only 32 actual fails represent true model errors. But those 56 environmental failures (47 + 9) obscure the underlying problem: the model isn't learning or improving within the constrained sandbox.
Double penalty analysis
This configuration stacks two penalty layers:
- OpenCode wrapper penalty: 47 build failures and increased infrastructure issues
- Codex sandbox penalty: isolation restrictions and reduced model visibility
- OpenCode GPT-5.2 (non-sandboxed): 51.6% pass rate
- OpenCode GPT-5.2 Codex: 37.8% pass rate
- Combined penalty: 13.8 percentage points
Compare to the standalone sandbox:
- Codex GPT-5.2 (native): 62.7% pass
- Codex GPT-5.2 Codex (sandbox only): 49.2% pass
- Sandbox penalty: 13.5 percentage points
The wrapper and sandbox penalties are roughly additive, suggesting each degrades the model independently rather than compounding multiplicatively.
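The additivity claim can be checked against the pass rates quoted above: the sum of the two individual penalties (24.6 pp) lands close to the observed combined penalty (24.9 pp):

```python
# Verify that the wrapper and sandbox penalties are roughly additive,
# using the pass rates quoted in this section (all relative to native Codex CLI).
native    = 62.7  # Codex CLI GPT-5.2
wrapper   = 51.6  # OpenCode GPT-5.2 (wrapper only)
sandboxed = 49.2  # Codex GPT-5.2 Codex (sandbox only)
combined  = 37.8  # OpenCode GPT-5.2 Codex (both)

wrapper_penalty  = native - wrapper     # ~11.1 pp
sandbox_penalty  = native - sandboxed   # ~13.5 pp
combined_penalty = native - combined    # ~24.9 pp

# Additive prediction vs observed combined penalty:
print(round(wrapper_penalty + sandbox_penalty, 1), round(combined_penalty, 1))  # 24.6 24.9
```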
Build and infrastructure failures
47 build failures plus 9 infrastructure failures account for 41.2% of all evaluations. These are environment issues, not model failures. The sandbox inside the wrapper creates such restrictive constraints that even syntactically correct patches fail to build.
The 32 actual fails (where code compiles but doesn't fix the bug) suggest the model struggles to reason effectively within the sandboxed environment. It can't see enough of the codebase or system state to understand whether a patch truly resolves the vulnerability.
Cost breakdown
At $8.73 per pass, breaking down what you're paying for:
- Base model cost: ~$0.50 (per evaluation, shared across all configs)
- Environment overhead (OpenCode): ~$1.20
- Sandbox overhead (Codex): ~$0.90
- Retry penalty (lower success rate): ~$6.13 (implicit)
The explicit costs (model + environment) are roughly $2.60 per evaluation. The implicit cost comes from needing 264 evaluations to fix 100 CVEs instead of 159 with Codex CLI.
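The decomposition above can be sketched as arithmetic on the section's own estimates, splitting the $8.73 per-pass figure into explicit per-evaluation spend and the implicit retry penalty:

```python
# Decompose the $8.73 per-pass figure into explicit per-evaluation spend
# and the implicit retry penalty, using this section's estimates.
base_model = 0.50   # per evaluation, shared across configs
wrapper_oh = 1.20   # OpenCode environment overhead
sandbox_oh = 0.90   # Codex sandbox overhead

explicit = base_model + wrapper_oh + sandbox_oh   # $2.60 per evaluation
implicit = 8.73 - explicit                        # retry penalty per pass
print(round(explicit, 2), round(implicit, 2))     # 2.6 6.13
```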
Failure distribution
- Actual fails: 32 (24%). Model produces wrong patch
- Build failures: 47 (35%). Environment can't execute patch
- Infra failures: 9 (7%). System crashes or times out
- Passes: 48 (35%). All systems work, patch correct
This distribution reveals the real bottleneck: environment friction (41% of evals) exceeds model failure (24% of evals). You're not hitting the limits of the model; you're hitting the limits of the sandbox.
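The bottleneck split follows directly from the raw counts:

```python
# Failure shares of the 136 evaluations, confirming the bottleneck split.
counts = {"pass": 48, "fail": 32, "build_failure": 47, "infra_failure": 9}
total = sum(counts.values())            # 136
pct = lambda n: round(100 * n / total, 1)

env_friction = pct(counts["build_failure"] + counts["infra_failure"])  # 41.2%
model_error  = pct(counts["fail"])                                     # 23.5%
print(env_friction, model_error)
```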
Recommendation
Do not use OpenCode GPT-5.2 Codex for production CVE remediation; the cost is prohibitive and the pass rate is unacceptable. If you need the OpenCode wrapper, use OpenCode GPT-5.2 (non-sandboxed), which achieves 51.6% at $6.65. If you need Codex sandboxing, use the native Codex CLI at $5.30. Never combine both constraints.
The only reason to evaluate this configuration is to understand how constraint stacking degrades performance. For any practical deployment, alternatives are superior.
Comparative context
To understand why this configuration fails, compare it to its component parts:
- Codex CLI GPT-5.2 (native): 62.7% at $5.30
- OpenCode GPT-5.2 (wrapper only): 51.6% at $6.65
- Codex GPT-5.2 Codex (sandbox only): 49.2% at $6.65
- OpenCode GPT-5.2 Codex (wrapper + sandbox): 37.8% at $8.73
Each additional constraint reduces pass rate and raises cost. Combining multiple constraints creates disproportionate penalties because the model can't adapt its strategy within increasingly restrictive environments.
What this teaches about agent design
This configuration demonstrates that agent success depends heavily on environment fit. A strong model (GPT-5.2) performs worst when the environment is most constrained. Conversely, the same model performs best (62.7%) in the simplest environment (native Codex CLI).
For CVE patching specifically, the agent needs: visibility into the codebase, access to build tools, reasonable execution timeouts, and network access to package managers. When any of these are restricted, success rate drops sharply. The sandbox restrictions on all four dimensions create the worst-case scenario.
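The four prerequisites above lend themselves to a simple pre-flight check. This is a hypothetical sketch with illustrative capability names, not part of any real benchmark or agent API:

```python
# Hypothetical pre-flight check for the four prerequisites named above.
# Capability names are illustrative, not part of any real API.
REQUIRED = {"codebase_visibility", "build_tools", "adequate_timeouts", "network_access"}

def environment_fit(capabilities: set) -> float:
    """Fraction of the required capabilities an environment provides."""
    return len(REQUIRED & capabilities) / len(REQUIRED)

print(environment_fit(REQUIRED))         # 1.0  (native CLI, all four present)
print(environment_fit({"build_tools"}))  # 0.25 (heavily sandboxed)
```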
When you might encounter this
You'd use OpenCode Codex only if both of the following were non-negotiable in your deployment:
- High-security isolation (the Codex sandbox)
- Specific container/orchestration capabilities (the OpenCode wrapper)
In practice, these constraints are rarely both critical. Most high-security environments use native Codex + network policies. Most OpenCode users skip the sandbox for better results.
FAQ
Why does OpenCode GPT-5.2 Codex perform so poorly?
It passes 37.8% of 136 CVE evaluations at $8.73 per fix. The Codex sandbox adds overhead and reduces accuracy: 48 passes, 32 fails, 47 build failures, 9 infrastructure failures.
What is the cost per successful patch for OpenCode GPT-5.2 Codex?
$8.73 per successful patch, third-highest in the benchmark. Stacking the OpenCode wrapper and Codex sandbox creates a combined 24.9 percentage point penalty compared to native Codex CLI (62.7%). For 100 CVE fixes, roughly 264 evaluations are needed at $2,304 total vs $844 with Codex CLI.
How does OpenCode GPT-5.2 Codex compare to other agents?
37.8% pass rate is the second-lowest in the benchmark, well below the 47.3% average. 47 build failures and 9 infra failures account for 41.2% of evaluations. The double constraint of OpenCode wrapper plus Codex sandbox creates disproportionate penalties. Not recommended for production use.