OpenCode Gemini 3.1 Pro — CVE-Agent-Bench profile
54.9% pass rate at $5.81 per fix. Google's Gemini 3.1 Pro via OpenCode. 128 evaluations.
OpenCode Gemini 3.1 Pro is the same Gemini 3.1 Pro model accessed through the OpenCode multi-model wrapper CLI instead of Google's native interface. Across 128 evaluations, it achieved a 54.9% pass rate at $5.81 per successful fix. The model passed 67 samples, failed 25, produced 30 build failures, and hit 6 infrastructure failures; the pass rate is computed over the 122 samples that were scored, i.e. excluding infrastructure failures.
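The headline number follows from these counts once infrastructure failures are dropped from the denominator; a minimal sketch of the arithmetic (variable names are ours, not the benchmark harness's):

```python
# Reported counts for OpenCode Gemini 3.1 Pro in this profile
passes, fails, build_failures, infra_failures = 67, 25, 30, 6

total = passes + fails + build_failures + infra_failures   # 128 evaluations
scored = total - infra_failures                            # infra failures are not scored

pass_rate = passes / scored
print(f"{pass_rate:.1%}")  # prints 54.9%
```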
The wrapper penalty
Native Gemini 3.1 Pro (via the Google CLI) achieves 58.7% at $3.92 per pass. OpenCode's wrapper costs 3.8 percentage points of pass rate and increases cost per fix by 48%. This native-versus-wrapper gap applies across all models in the benchmark. OpenCode's value is unified CLI access to multiple models; that convenience comes at a performance cost.
The pass count is 67, higher than native's 64, but only because more samples were scored: with 6 infrastructure failures instead of 19, OpenCode's run scored 122 samples to native's 109, so more passes coexist with a lower pass rate. The fail count is 25 versus 18 for native: the model fails outright more often in the OpenCode environment, a pattern consistent with other wrapper comparisons in the benchmark.
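The two deltas quoted above, the pass-rate penalty and the cost increase, come straight from the headline figures of the two profiles; a quick check (numbers from this page and the native Gemini profile):

```python
native  = {"pass_rate": 58.7, "cost_per_fix": 3.92}  # native Gemini CLI
wrapper = {"pass_rate": 54.9, "cost_per_fix": 5.81}  # via OpenCode

penalty_pp    = native["pass_rate"] - wrapper["pass_rate"]
cost_increase = wrapper["cost_per_fix"] / native["cost_per_fix"] - 1

print(f"{penalty_pp:.1f}pp penalty, {cost_increase:.0%} more per fix")
# prints: 3.8pp penalty, 48% more per fix
```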
Why wrapper overhead happens
OpenCode is a translation layer. Your code sends a request to the OpenCode CLI, which translates it to each model's native API format, calls the model, and returns results. This translation introduces latency, context-window overhead, and potential formatting loss.
For Gemini specifically, the overhead appears in two ways. First, request formatting: OpenCode standardizes prompts across models, which may not align perfectly with Gemini's preferred input structure. Second, cost: OpenCode's infrastructure adds API calls between your system and Google's API, which increases token usage slightly.
Neither is a flaw in OpenCode. The tool is designed for teams that need flexibility across models (GPT, Claude, Gemini in one CLI). If you're committing to Gemini, the native CLI is preferable.
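OpenCode's internals are not documented in this benchmark, but the translation-layer pattern described above looks, in the abstract, something like the following sketch (all function names and request shapes are illustrative, not OpenCode's real code or either provider's exact schema):

```python
def to_gemini_request(prompt: str, system: str) -> dict:
    """Hypothetical translation from a wrapper's normalized prompt
    to a Gemini-style request body. Illustrative only."""
    return {
        "system_instruction": {"parts": [{"text": system}]},
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
    }

def to_openai_request(prompt: str, system: str) -> dict:
    """The same normalized prompt, rendered in an OpenAI-style chat format."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ]
    }

# One normalized prompt fans out to per-provider formats; each translation
# step is a point where provider-specific structure can be lost.
req = to_gemini_request("Patch the reported vulnerability.", "You are a security agent.")
```

The design trade-off is exactly the one the article describes: normalization buys portability across providers at the cost of per-provider tuning.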
Behavioral profile
Infrastructure reliability improvement
OpenCode Gemini shows 6 infrastructure failures, far fewer than native Gemini's 19. This suggests OpenCode's wrapper delivers more stable test-environment integration than Google's native CLI. The wrapper trades accuracy for reliability: you gain a more stable harness but lose pass rate.
This is useful context for teams with unstable containerized environments. If you're getting excessive infra failures with native Google CLI, OpenCode's wrapper might improve your test reliability, albeit at the cost of some accuracy.
Build failure pattern
Build failures (30) outnumber native Gemini's 27. Under the OpenCode wrapper, the model generates patches that fail to compile slightly more often, which is consistent with the wrapper's prompt translation: OpenCode normalizes prompts for cross-model compatibility, and that normalization can drop Gemini-specific context about the test environment.
The precision score remains 100, so the output is still well-structured. But semantic completeness (whether patches actually resolve the vulnerability) is lower: accuracy is 70, compared to native Gemini's 85.
Cost-effectiveness verdict
At $5.81 per fix, OpenCode Gemini is more expensive than native Gemini ($3.92). It sits at the top of the range spanned by Anthropic's Claude Opus configurations ($2.64-$5.84) and is comparable to some OpenAI configurations. But for Gemini specifically, there is no reason to use the wrapper if you have access to the native CLI.
OpenCode Gemini's value is in multi-model deployments. If you're running GPT-5.2, Claude Opus, and Gemini through the same system and want to A/B test them, OpenCode gives you unified abstraction. The cost of that abstraction is worthwhile in that context. For single-model Gemini deployments, native is better.
Wrapper behavior across the benchmark
The direction of this 3.8pp penalty is consistent with other wrapper comparisons, though the magnitude varies: OpenCode Claude Opus 4.6 loses a much larger 14.1pp against the native Claude CLI (61.6% to 47.5%). OpenCode's overhead is systematic across models. Native CLIs are better for single-model evaluation; OpenCode is useful for multi-model flexibility testing.
Comparison to native Gemini
See the native Gemini 3.1 Pro profile to understand what the same model achieves without wrapper overhead. The 58.7% native performance is the relevant baseline.
Test methodology
OpenCode was evaluated with the same 128 CVE samples as native Gemini 3.1 Pro. The evaluation harness sends requests to OpenCode's CLI, which routes them to Google's Gemini API. This setup is realistic for teams using OpenCode in production. Results reflect actual wrapper overhead, not synthetic estimates.
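A harness of the kind described can be sketched as follows (hypothetical: `run_agent` and `verify_patch` stand in for the real harness internals, which are not part of this profile):

```python
def run_eval(samples, run_agent, verify_patch):
    """Minimal sketch of a per-CVE evaluation loop. Each sample gets one
    agent attempt, classified as pass, fail, build_failure, or
    infra_failure; infra failures are excluded from the pass rate."""
    counts = {"pass": 0, "fail": 0, "build_failure": 0, "infra_failure": 0}
    for cve in samples:
        try:
            patch = run_agent(cve)             # e.g. a call through the OpenCode CLI
            result = verify_patch(cve, patch)  # build the project, run security tests
        except EnvironmentError:               # container/setup problem, not the model
            counts["infra_failure"] += 1
            continue
        counts[result] += 1                    # "pass", "fail", or "build_failure"
    scored = len(samples) - counts["infra_failure"]
    return counts, (counts["pass"] / scored if scored else 0.0)
```

With this accounting, environment breakage lowers the scored sample count rather than the pass rate, which matches how the 54.9% figure reconciles with 67 passes out of 128 runs.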
When to choose OpenCode Gemini
Use OpenCode Gemini 3.1 Pro if you are running a multi-model evaluation system (Gemini, GPT, Claude) and need a unified CLI interface. Accept the 3.8pp accuracy loss and 48% cost increase as the price of that flexibility.
Use native Gemini 3.1 Pro if you are evaluating Gemini standalone. The native interface is faster, cheaper, and more accurate.
Summary
OpenCode Gemini 3.1 Pro delivers 54.9% accuracy at $5.81 per fix. The wrapper adds measurable overhead. Compared to native Gemini 3.1 Pro (58.7% at $3.92), you lose 3.8pp accuracy and pay 48% more per fix. Infrastructure reliability is slightly better with the wrapper, but accuracy is the primary trade-off.
For single-model Gemini deployments, native is the right choice. For teams standardizing on OpenCode across multiple models, this configuration is viable, at a cost comparable to the upper end of the Claude Opus range.
Explore CVE-Agent-Bench results to compare all wrapper and native combinations. Review benchmark economics to calculate which wrapper setup fits your evaluation budget. See how native Gemini 3.1 Pro performs without wrapper overhead.
FAQ
How does OpenCode affect Gemini 3.1 Pro performance?
A 54.9% pass rate on 128 CVEs at $5.81 per fix, 3.8pp below the native Gemini CLI: 67 passes, 25 fails, 30 build failures, and 6 infrastructure failures.
What is the cost per successful patch for OpenCode Gemini 3.1 Pro?
$5.81 per successful patch, 48% more than native Gemini 3.1 Pro ($3.92). The OpenCode wrapper adds API translation overhead. For standalone Gemini deployments, the native CLI is cheaper and more accurate.
How does OpenCode Gemini 3.1 Pro compare to other agents?
54.9% pass rate is above the 47.3% benchmark average. Infrastructure reliability improves compared to native Gemini (6 infra failures vs 19), but accuracy drops 3.8 percentage points. Useful for multi-model evaluation systems; otherwise, native Gemini CLI is preferred.