OpenCode GPT-5.2 — CVE-Agent-Bench profile
51.6% pass rate at $6.65 per fix. OpenAI GPT-5.2 via OpenCode. Only 11 fail outcomes overall.
OpenCode GPT-5.2 achieves the same 51.6% pass rate as Cursor GPT-5.2 but through entirely different failure modes. Across 136 evaluations, it recorded 63 passes, only 11 actual failures, 48 build failures, and 14 infrastructure failures. The configuration demonstrates high patch quality masked by environmental constraints. When the wrapper successfully executes, patches almost always work. When they fail, it's usually a build issue, not a model problem.
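The headline rate is not 63/136. A minimal sketch of the arithmetic, assuming (as the counts imply) that infrastructure failures are excluded from the denominator:

```python
# Outcome counts for OpenCode GPT-5.2, as reported above.
passes, fails, build_fails, infra_fails = 63, 11, 48, 14
total = passes + fails + build_fails + infra_fails   # 136 evaluations

# Assumption: infra failures don't count against the agent,
# so the denominator drops them.
pass_rate = passes / (total - infra_fails)
print(f"{pass_rate:.1%}")   # 63 / 122 -> 51.6%
```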
Performance overview
51.6% pass rate ties Cursor GPT-5.2 exactly. But compare the failure distributions:
- Cursor: 34 actual fails, 25 build failures
- OpenCode: 11 actual fails, 48 build failures
The OpenCode environment produces fewer incorrect patches but hits build and infrastructure walls frequently. Of 73 non-passing evals, only 11 are model failures. The other 62 are environmental. This ratio suggests exceptional patch quality constrained by wrapper limitations.
At $6.65 per successful patch, cost is higher than native Codex ($5.30) but comparable to Cursor ($6.26). For 100 CVE fixes, you'd need roughly 194 evaluations (100 ÷ 0.516), costing approximately $1,304 total. The cost-per-success closely matches Cursor despite different failure modes.
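The projection can be reproduced in a few lines. This is a sketch using the document's own figures; `cost_for_fixes` is a hypothetical helper, and the ~$6.73 per-evaluation cost is implied by the $1,304 / 194 figures in the text.

```python
import math

def cost_for_fixes(n_fixes: int, pass_rate: float, cost_per_eval: float):
    """Expected evaluation count and total spend to land n_fixes successful patches."""
    evals = math.ceil(n_fixes / pass_rate)
    return evals, evals * cost_per_eval

# Figures from the text: 51.6% pass rate, ~$6.73 per evaluation.
evals, total = cost_for_fixes(100, 0.516, 6.73)
print(evals, f"${total:,.0f}")   # 194 evaluations, about $1,306
```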
Behavioral profile
A Speed score of 100 makes this the fastest configuration in the entire benchmark: the agent rushes through problems and either succeeds or hits a wall. Efficiency 92 confirms it minimizes token usage and redundant attempts. Precision 100 means every patch it generates is substantive, with no hallucinated or nonsense code.
Accuracy 57 is moderate, but accuracy here includes environmental failures, not just model mistakes. The true model accuracy (patches that build) is likely much higher. Breadth 0 shows a single-strategy approach: commit to one solution method and execute it fast. Reliability 58 is the lowest dimension because the wrapper environment is unstable.
Failure mode analysis
11 actual fails is the lowest in the entire benchmark. When OpenCode successfully builds a patch, that patch almost always fixes the CVE. The model quality is genuine and high. But 48 build failures (35.3%) and 14 infra failures (10.3%) create the appearance of poor performance.
The OpenCode wrapper's environment setup is the bottleneck. Dependency resolution fails more often than in other wrappers, system calls hit sandbox restrictions, and timeout limits trigger. None of these failures reflects the model's ability to generate correct patches; they reflect the wrapper's ability to execute patches in its constrained environment.
Cross-wrapper comparison
Same model, three wrappers:
- Codex CLI: 62.7% pass, 35 build, 12 fails
- Cursor: 51.6% pass, 25 build, 34 fails
- OpenCode: 51.6% pass, 48 build, 11 fails
Codex CLI has the fewest build failures and fewest model failures: the optimal environment. Cursor trades off some accuracy for better build handling. OpenCode reverses the trade: fewer model failures but far more build obstacles.
This pattern shows that OpenCode's sandbox and isolation layers work against build execution. The wrapper can run patches but struggles with dependency chains, build configuration, and environment setup more than Codex or Cursor.
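One way to make the pattern concrete is to compute, for each wrapper, what share of its non-pass failures are build failures rather than model failures. A sketch using only the counts listed above (infra failures are omitted because they are only reported for OpenCode here):

```python
# (build failures, actual model failures) per wrapper, from the list above.
wrappers = {"codex-cli": (35, 12), "cursor": (25, 34), "opencode": (48, 11)}

for name, (build, model) in wrappers.items():
    env_share = build / (build + model)
    print(f"{name}: {env_share:.0%} of these failures are build failures")
# codex-cli ~74%, cursor ~42%, opencode ~81%
```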
Build failure root causes
48 build failures suggest several environmental constraints:
- Dependency resolution: OpenCode may have stricter or smaller dependency caches than standard runtimes
- System call restrictions: sandbox limitations prevent certain build tools from functioning
- Environment variable constraints: build config defaults differ from developer expectations
- Timeout sensitivity: slower build execution hits time limits more often
The 11 actual fails represent genuine model limitations (wrong patches). The 48 build failures represent wrapper limitations (can't execute otherwise-correct patches).
Patch quality insight
Adjusting for environment: 63 passes + 11 actual fails = 74 completed patches, of which 85% are correct. If OpenCode's environment were as permissive as Codex CLI's, fixing the environmental issues alone might push the pass rate to 70-75%. The model itself is stronger than the 51.6% headline rate suggests.
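The 85% figure follows directly from the counts; a minimal sketch (the projection to 70-75% is the document's estimate, not computed here):

```python
# Completed patches = those that built and ran to a pass/fail verdict.
passes, actual_fails = 63, 11
completed = passes + actual_fails          # 74 completed patches
quality = passes / completed
print(f"{quality:.0%} of completed patches are correct")   # 85%
```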
Cost analysis
$6.65 per successful patch makes OpenCode one of the most expensive configurations; only other OpenCode variants cost more. Cost per evaluation is actually competitive. The issue is the low success rate caused by environment friction: you need more evaluations per success, so the per-fix cost climbs.
For 100 CVE fixes:
- Codex CLI: $844 (159 evals × ~$5.31)
- OpenCode GPT-5.2: $1,304 (194 evals × ~$6.73)
- Cost difference: $460 more for the same 100 fixes
When OpenCode makes sense
OpenCode is useful if you need extreme isolation or specific build container capabilities that neither Codex CLI nor Cursor provide. The model quality is high. The wrapper just extracts a cost in environment friction.
For most development teams, Cursor or Codex CLI are better choices. But for security-focused deployments where patch isolation and container control are primary requirements, OpenCode delivers correct patches at the cost of more environment failures.
Recommendation
Use OpenCode GPT-5.2 only if you need its specific isolation or container capabilities. Otherwise, choose Codex CLI for better accuracy and cost, or Cursor for better developer experience. If you do use OpenCode, understand that 62 of its 136 outcomes (48 build + 14 infra) are environmental failures, not model failures. Budget extra time and retries for build failures.
FAQ
What is distinctive about OpenCode GPT-5.2?
51.6% pass rate across 136 evaluations with only 11 fail outcomes (the lowest fail count in the benchmark): 63 passes, 11 fails, 48 build failures, 14 infra failures.
What is the cost per successful patch for OpenCode GPT-5.2?
$6.65 per successful patch across 136 evaluations. Higher than native Codex CLI ($5.30) due to wrapper overhead. For 100 CVE fixes, expect roughly 194 evaluations at $1,304 total.
How does OpenCode GPT-5.2 compare to other agents?
51.6% pass rate is above the 47.3% benchmark average. Only 11 actual failures, the lowest in the entire benchmark, means the model almost always generates correct patches when the environment cooperates. The bottleneck is 48 build failures from wrapper constraints, not model quality.