OpenCode GPT-5.2 — CVE-Agent-Bench profile
51.6% pass rate at $6.65 per fix. OpenAI GPT-5.2 via OpenCode. Only 11 fail outcomes overall.
OpenCode GPT-5.2 achieves the same 51.6% pass rate as Cursor GPT-5.2 but through entirely different failure modes. Across 136 evaluations, it recorded 63 passes, only 11 actual failures, 48 build failures, and 14 infrastructure failures. The configuration demonstrates high patch quality masked by environmental constraints. When the wrapper successfully executes, patches almost always work. When they fail, it's usually a build issue, not a model problem.
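The headline rate is not 63/136. A minimal sketch of the arithmetic, assuming (as the counts imply) that infrastructure failures are excluded from the denominator:

```python
# Outcome counts for OpenCode GPT-5.2, as reported above.
passes, fails, build_fails, infra_fails = 63, 11, 48, 14
total = passes + fails + build_fails + infra_fails   # 136 evaluations

# Assumption: infra failures don't count against the agent,
# so the denominator drops them.
pass_rate = passes / (total - infra_fails)
print(f"{pass_rate:.1%}")   # 63 / 122 -> 51.6%
```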
Performance overview
51.6% pass rate ties Cursor GPT-5.2 exactly. But compare the failure distributions:
- Cursor: 34 actual fails, 25 build failures
- OpenCode: 11 actual fails, 48 build failures
The OpenCode environment produces fewer incorrect patches but hits build and infrastructure walls frequently. Of 73 non-passing evals, only 11 are model failures. The other 62 are environmental. This ratio suggests exceptional patch quality constrained by wrapper limitations.
At $6.65 per successful patch, cost is higher than native Codex ($5.30) but comparable to Cursor ($6.26). For 100 CVE fixes, you'd need roughly 194 evaluations (100 ÷ 0.516), costing approximately $1,304 total. The cost-per-success closely matches Cursor despite different failure modes.
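The projection can be reproduced in a few lines. This is a sketch using the document's own figures; `cost_for_fixes` is a hypothetical helper, and the ~$6.73 per-evaluation cost is implied by the $1,304 / 194 figures in the text.

```python
import math

def cost_for_fixes(n_fixes: int, pass_rate: float, cost_per_eval: float):
    """Expected evaluation count and total spend to land n_fixes successful patches."""
    evals = math.ceil(n_fixes / pass_rate)
    return evals, evals * cost_per_eval

# Figures from the text: 51.6% pass rate, ~$6.73 per evaluation.
evals, total = cost_for_fixes(100, 0.516, 6.73)
print(evals, f"${total:,.0f}")   # 194 evaluations, about $1,306
```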
Behavioral profile
A Speed score of 100 makes this the fastest configuration in the entire benchmark: the agent rushes through problems and either succeeds or hits a wall. Efficiency 92 confirms it minimizes token usage and redundant attempts. Precision 100 means every patch it generates is substantive, with no hallucinated or nonsense code.
Accuracy 57 is moderate, but accuracy here includes environmental failures, not just model mistakes. The true model accuracy (patches that build) is likely much higher. Breadth 0 shows a single-strategy approach: commit to one solution method and execute it fast. Reliability 58 is the lowest dimension because the wrapper environment is unstable.
Failure mode analysis
11 actual fails is the lowest in the entire benchmark. When OpenCode successfully builds a patch, that patch almost always fixes the CVE. The model quality is genuine and high. But 48 build failures (35.3%) and 14 infra failures (10.3%) create the appearance of poor performance.
The OpenCode wrapper's environment setup is the bottleneck. Dependency resolution fails more often than in other wrappers, system calls hit sandbox restrictions, and timeout limits trigger. None of these failures reflects the model's ability to generate correct patches; they reflect the wrapper's ability to execute patches in its constrained environment.
Cross-wrapper comparison
Same model, three wrappers:
- Codex CLI: 62.7% pass, 35 build, 12 fails
- Cursor: 51.6% pass, 25 build, 34 fails
- OpenCode: 51.6% pass, 48 build, 11 fails
Codex CLI has the fewest build failures and fewest model failures: the optimal environment. Cursor trades off some accuracy for better build handling. OpenCode reverses the trade: fewer model failures but far more build obstacles.
This pattern shows that OpenCode's sandbox and isolation layers work against build execution. The wrapper can run patches but struggles with dependency chains, build configuration, and environment setup more than Codex or Cursor.
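One way to make the pattern concrete is to compute, for each wrapper, what share of its non-pass failures are build failures rather than model failures. A sketch using only the counts listed above (infra failures are omitted because they are only reported for OpenCode here):

```python
# (build failures, actual model failures) per wrapper, from the list above.
wrappers = {"codex-cli": (35, 12), "cursor": (25, 34), "opencode": (48, 11)}

for name, (build, model) in wrappers.items():
    env_share = build / (build + model)
    print(f"{name}: {env_share:.0%} of these failures are build failures")
# codex-cli ~74%, cursor ~42%, opencode ~81%
```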
Build failure root causes
48 build failures suggest several environmental constraints:
- Dependency resolution: OpenCode may have stricter or smaller dependency caches than standard runtimes
- System call restrictions: sandbox limitations prevent certain build tools from functioning
- Environment variable constraints: build config defaults differ from developer expectations
- Timeout sensitivity: slower build execution hits time limits more often
The 11 actual fails represent genuine model limitations (wrong patches). The 48 build failures represent wrapper limitations (can't execute otherwise-correct patches).
Patch quality insight
Adjusting for environment: 63 passes + 11 actual fails = 74 completed patches, of which 85% are correct. If OpenCode's environment were as permissive as Codex CLI's, fixing the environmental issues alone might push the pass rate to 70-75%. The model itself is stronger than the 51.6% headline rate suggests.
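The 85% figure follows directly from the counts; a minimal sketch (the projection to 70-75% is the document's estimate, not computed here):

```python
# Completed patches = those that built and ran to a pass/fail verdict.
passes, actual_fails = 63, 11
completed = passes + actual_fails          # 74 completed patches
quality = passes / completed
print(f"{quality:.0%} of completed patches are correct")   # 85%
```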
Cost analysis
$6.65 per successful patch makes OpenCode one of the most expensive configurations; only other OpenCode variants cost more. Cost per evaluation is actually competitive. The issue is the low success rate caused by environment friction: you need more evaluations per success, so the per-fix cost climbs.
For 100 CVE fixes:
- Codex CLI: $844 (159 evals × ~$5.31)
- OpenCode GPT-5.2: $1,304 (194 evals × ~$6.73)
- Cost difference: $460 more for the same 100 fixes
When OpenCode makes sense
OpenCode is useful if you need extreme isolation or specific build container capabilities that neither Codex CLI nor Cursor provide. The model quality is high. The wrapper just extracts a cost in environment friction.
For most development teams, Cursor or Codex CLI are better choices. But for security-focused deployments where patch isolation and container control are primary requirements, OpenCode delivers correct patches at the cost of more environment failures.
Recommendation
Use OpenCode GPT-5.2 only if you need its specific isolation or container capabilities. Otherwise, choose Codex CLI for better accuracy and cost, or Cursor for better developer experience. If you do use OpenCode, understand that 62 of its 136 outcomes (48 build + 14 infra) are environmental failures, not model failures. Budget extra time and retries for build failures.
FAQ
What is distinctive about OpenCode GPT-5.2?
51.6% pass rate across 136 evaluations with only 11 fail outcomes (the lowest fail count in the benchmark): 63 passes, 11 fails, 48 build failures, 14 infra failures.
What is the cost per successful patch for OpenCode GPT-5.2?
$6.65 per successful patch across 136 evaluations. Higher than native Codex CLI ($5.30) due to wrapper overhead. For 100 CVE fixes, expect roughly 194 evaluations at $1,304 total.
How does OpenCode GPT-5.2 compare to other agents?
51.6% pass rate is above the 47.3% benchmark average. Only 11 actual failures, the lowest in the entire benchmark, means the model almost always generates correct patches when the environment cooperates. The bottleneck is 48 build failures from wrapper constraints, not model quality.