Gemini 3.1 Pro Preview — CVE-Agent-Bench profile

58.7% pass rate at $3.92 per fix. +15.7pp upgrade from Gemini 3 Pro. Best cost/accuracy for Google.

Gemini 3.1 Pro Preview is Google's second-generation model in CVE-Agent-Bench, representing a substantial improvement over its predecessor. Across 128 evaluations, it achieved a 58.7% pass rate with a cost of $3.92 per successful fix. The model passed 64 samples, failed 18, experienced 27 build failures, and encountered 19 infrastructure failures.

Performance improvement

The gap from Gemini 3 Pro to Gemini 3.1 Pro is 15.7 percentage points, the second-largest single-model upgrade in the benchmark (only the Claude Opus 4.5-to-4.6 jump, at 15.9pp, is larger). The raw pass count rose from 55 to 64, outright fails dropped from 36 to 18, and build failures fell from 37 to 27. Every major failure category improved.

Cost per pass dropped to $3.92 despite higher per-token API pricing, because the pass rate improved faster than per-eval cost rose. Gemini 3.1 Pro is therefore both cheaper and more accurate than Gemini 3 Pro. It ranks in the top tier of the benchmark (only the Claude Opus 4.6 variants and the base GPT-5.2 exceed it), making it one of the strongest models available.

Behavioral profile

Accuracy jumps to 85, a major leap from the baseline of 24. Speed remains maxed at 100, efficiency is 97, and precision is 100. Reliability, at 58, is the only dimension that lags: the model's 19 infrastructure failures are the highest count in the benchmark.

The profile now reads differently: Gemini 3.1 Pro is accurate, fast, and cheap. The sole weakness is infrastructure fragility: roughly 15% of evals will fail due to test-environment issues rather than model error.

Infrastructure fragility

Nineteen infra failures out of 128 evals (14.8%) warrant investigation. Infra failures are not model errors; they occur when the test harness itself crashes (network timeout, disk quota, port-binding conflict). They are environment noise rather than model weakness.

To understand Gemini 3.1 Pro's true accuracy, adjust the denominator: 64 passes divided by 109 non-infra evals works out to 58.7%, matching the reported rate. The infra issue doesn't degrade the model's reasoning; it just makes the evaluation environment fragile.
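The adjustment is simple arithmetic; a quick sanity check in Python, using the counts reported in this profile:

```python
# Eval counts reported in this profile.
passes = 64
fails = 18
build_failures = 27
infra_failures = 19

total = passes + fails + build_failures + infra_failures   # 128
stable_runs = total - infra_failures                       # 109 non-infra evals

# Exclude harness crashes from the denominator: they reflect environment
# problems (timeouts, quotas, port conflicts), not model mistakes.
effective_pass_rate = passes / stable_runs
print(f"{effective_pass_rate:.1%}")   # 58.7%
```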

Teams running their own test harness should note this. If you deploy Gemini 3.1 Pro in a production CVE patching system, ensure your environment is stable. Google's hosted API is reliable; containerized test environments sometimes have network or resource constraints.
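One way to account for this in your own harness is to retry evals that die from environment errors instead of recording them as model failures. A minimal sketch, assuming a hypothetical `run_eval` callable and an `InfraFailure` exception raised by your harness on environment errors (both names are illustrative, not part of any Gemini API):

```python
import time

class InfraFailure(Exception):
    """Hypothetical: raised when the harness itself fails (timeout, quota, port conflict)."""

def run_with_retries(run_eval, sample, max_attempts=3, backoff_s=5.0):
    """Re-run an eval on infrastructure errors so transient harness
    crashes are not counted as model failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_eval(sample)
        except InfraFailure:
            if attempt == max_attempts:
                raise  # persistent environment problem; surface it
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
```

With three attempts, a transient failure mode that kills ~15% of single runs drops to roughly 0.3% of triple runs (0.15³), assuming failures are independent.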

Build failure analysis

Build failures (27) are lower than Gemini 3 Pro (37) but still material. These are cases where the patch compiles and runs but doesn't resolve the vulnerability. The model generates syntactically valid code that the test environment can build, but the patch is semantically incomplete.

Gemini 3.1 Pro's reduced build-failure rate (21%, down from 27%) is real progress, but the absolute count is still high. Precision is 100, meaning the model's output is well-structured; the build failures reflect incomplete vulnerability reasoning, not malformed code.

Cost-effectiveness positioning

At $3.92 per fix, Gemini 3.1 Pro is the third-cheapest agent by cost per pass. Only Claude Opus 4.5 ($2.64) and Claude Opus 4.6 ($2.93) are more cost-efficient. Gemini 3.1 achieves 58.7% accuracy while the Claude Opus variants reach 61.6-62.5%, so the trade-off is a slightly higher cost per fix for accuracy within a few points of the leaders.
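To turn the trade-off into a budget, multiply out the figures quoted in this section. A rough sketch (real costs would also include retries and infra-failure overhead):

```python
import math

def budget_for_fixes(n_fixes, pass_rate, cost_per_pass):
    """Rough budget: evals needed and total spend to reach a target fix count."""
    evals_needed = math.ceil(n_fixes / pass_rate)
    total_cost = round(n_fixes * cost_per_pass, 2)
    return evals_needed, total_cost

# Figures quoted on this page.
print(budget_for_fixes(100, 0.587, 3.92))   # Gemini 3.1 Pro  -> (171, 392.0)
print(budget_for_fixes(100, 0.616, 2.93))   # Claude Opus 4.6 -> (163, 293.0)
```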

For teams evaluating models at scale, Gemini 3.1 Pro offers a good balance. You get high accuracy without premium pricing. The infra-failure overhead is manageable if you understand it and account for it in your test design.

Comparison to native integration

Gemini 3.1 Pro is available through both the native Google CLI and third-party wrappers like OpenCode. The native integration (shown here) outperforms the wrapped version: native achieves 58.7% at $3.92, while OpenCode's version reaches 52.3% at $5.81. The wrapper adds overhead and reduces accuracy. If you're using Gemini 3.1 Pro, the native CLI is preferred.

See the OpenCode Gemini 3.1 Pro profile for a detailed wrapper comparison.

Model upgrade context

Gemini 3.1 Pro's success validates Google's rapid iteration cycle. From Gemini 3 Pro to 3.1 Pro, the model improved 15.7pp in accuracy, reduced failure modes, and cut cost per fix. This trajectory suggests that subsequent Gemini releases will continue to improve. For teams uncertain about Gemini's capability, this version is the inflection point where accuracy becomes competitive.

Summary

Gemini 3.1 Pro Preview is a competitive model in the CVE patching space. The 58.7% pass rate, $3.92 cost per fix, and accuracy score of 85 make it suitable for production use. Infrastructure fragility is the main constraint. For teams running containerized test harnesses, ensure your environment can handle transient failures. For cloud-based deployments, Gemini 3.1 Pro is reliable and cost-effective.

Explore CVE-Agent-Bench results to see how Gemini 3.1 Pro ranks against all other agents. Review the benchmark economics to calculate ROI for your specific evaluation volume. Compare Gemini 3 Pro to understand the improvement trajectory.

FAQ

How much did Gemini improve from 3 to 3.1?

Gemini 3.1 Pro reached a 58.7% pass rate on 128 CVEs, up 15.7 percentage points from Gemini 3 Pro, while cost per fix dropped to $3.92. The breakdown: 64 passes, 18 fails, 27 build failures, 19 infra failures.

What is the cost per successful patch for Gemini 3.1 Pro?

$3.92 per successful patch, the third-cheapest in the benchmark. Only Claude Opus 4.5 ($2.64) and Claude Opus 4.6 ($2.93) cost less per fix. Gemini 3.1 Pro offers near-top-tier accuracy at below-average cost.

How does Gemini 3.1 Pro compare to other agents?

58.7% pass rate ranks in the top tier, behind only Claude Opus 4.6 (61.6%), Cursor Opus 4.6 (62.5%), and Codex GPT-5.2 (62.7%). The main weakness is 19 infrastructure failures, the highest count in the benchmark. Effective pass rate on stable runs remains strong.
