
Gemini 3 Pro Preview — CVE-Agent-Bench profile

43.0% pass rate at $4.85 per fix. Google model via native Gemini CLI. 136 evaluations.


Gemini 3 Pro Preview
Pass Rate: 43.0%
Cost per Pass: $4.85
Total Evals: 136
CLI: Gemini (native)
Outcome Distribution: Pass (55), Fail (36), Build (37), Infra (8)

Gemini 3 Pro Preview is Google's first-generation model evaluated in CVE-Agent-Bench. In testing across 136 CVE samples, it achieved a 43.0% pass rate with a cost of $4.85 per successful fix. The model succeeded on 55 samples, failed on 36, experienced 37 build failures, and encountered 8 infrastructure failures.
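The headline metrics can be reconstructed from the raw outcome counts. A quick sketch; note that excluding infra failures from the pass-rate denominator is an inference from the arithmetic (55/136 would be 40.4%, not 43.0%), not a documented benchmark rule:

```python
# Reconstruct Gemini 3 Pro's headline metrics from the outcome counts above.
outcomes = {"pass": 55, "fail": 36, "build": 37, "infra": 8}

total_evals = sum(outcomes.values())           # 136
# Inference: 43.0% only works out if infra failures are excluded
# from the denominator; 55/136 would give 40.4%.
scored = total_evals - outcomes["infra"]       # 128
pass_rate = outcomes["pass"] / scored          # 0.4297 -> 43.0%

cost_per_pass = 4.85                           # reported, USD
total_cost = cost_per_pass * outcomes["pass"]  # implied total: $266.75

print(f"pass rate: {pass_rate:.1%}, total cost: ${total_cost:.2f}")
```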

Performance baseline

As the initial Gemini offering in the benchmark, Gemini 3 Pro establishes a baseline for Google's approach to agentic CVE patching. The 43.0% pass rate sits below the field average of 47.3%. The raw pass count of 55 is competitive, but the 36 fails mark a consistent pattern: patches that apply and build cleanly yet do not resolve the vulnerability.

Cost efficiency shows mixed results. At $4.85 per pass, Gemini 3 Pro lands in the middle range: Anthropic's models span $2.64 to $5.84 per fix, while OpenAI's GPT-5.2 costs more. More important than the headline number is whether your budget accommodates the model's behavior: at 43% accuracy, you need multiple evaluations per CVE to surface a patch that actually works.
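The budget question can be framed by converting the reported cost-per-pass back into a per-evaluation cost and an expected number of attempts per working patch. A rough sketch; the per-evaluation figure is derived from the reported numbers, not reported directly:

```python
# Derive per-evaluation cost and expected attempts from the reported figures.
passes, total_evals = 55, 136
cost_per_pass = 4.85                # reported, USD
pass_rate = 0.43                    # reported

# Total spend implied by cost-per-pass, spread over every evaluation.
per_eval_cost = cost_per_pass * passes / total_evals   # ~$1.96 per eval

# Expected evaluations needed to obtain one working patch.
evals_per_working_patch = 1 / pass_rate                # ~2.3 attempts

print(f"per-eval cost: ${per_eval_cost:.2f}, "
      f"attempts per pass: {evals_per_working_patch:.1f}")
```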

Behavioral profile

Agent personality radar chart (axes: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability)

Gemini 3 Pro's personality dimensions reveal a speed-focused agent. The speed score is 100, efficiency is 96, and precision is 100. These scores mean the model produces output quickly, uses tokens efficiently (low per-evaluation cost), and structures its responses consistently. Accuracy is 24 (one of the lowest in the benchmark), showing the model prioritizes fast generation over thorough vulnerability analysis.

This profile is common among first-generation models. The trade-off is real: you get cheap, fast patches that work sometimes but fail often. Subsequent Gemini releases show how model improvements can shift this balance.

Build environment challenges

The 37 build failures (27.2% of evaluations) rank second-highest in the benchmark. This matters because a build failure is not a verdict on the CVE fix itself: it is a case where the patch is malformed or incompatible with the test environment, so the fix never gets evaluated against the vulnerability.

Most Gemini 3 Pro build failures stem from environment setup. The containerized test harness expects specific toolchain versions, dependency resolutions, and build system configurations. Gemini 3 Pro sometimes generates patches that reference unavailable dependencies or use syntax incompatible with the container's language versions. This is a known class of error in agentic systems: the model is trained on broad internet code, not the specific versions in a test container.
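The four outcome buckets described above can be thought of as a decision ladder applied to each harness run. A hypothetical sketch; the `Result` fields and the `classify` helper are illustrative assumptions, not the benchmark's actual harness API:

```python
from dataclasses import dataclass

# Hypothetical per-run result; field names are assumptions for illustration.
@dataclass
class Result:
    harness_ok: bool      # did the container/network survive the run?
    patch_applied: bool   # did the generated patch apply cleanly?
    build_ok: bool        # did the patched project compile?
    tests_pass: bool      # did the CVE regression/exploit tests pass?

def classify(r: Result) -> str:
    """Map a run to one of the four outcome buckets used above."""
    if not r.harness_ok:
        return "infra"    # harness broke: not the model's fault
    if not (r.patch_applied and r.build_ok):
        return "build"    # malformed or environment-incompatible patch
    return "pass" if r.tests_pass else "fail"

# A patch that applies and builds but doesn't fix the CVE counts as a fail:
print(classify(Result(True, True, True, False)))   # prints "fail"
```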

Version trajectory

Gemini 3 Pro is not the final release. Google has since released Gemini 3.1 Pro, which shows 58.7% pass rate at $3.92 per fix. The improvement from 43.0% to 58.7% is the largest single-generation leap in the entire benchmark. The cost-per-pass drops from $4.85 to $3.92 despite higher API pricing per token. This trajectory is important context for anyone evaluating Gemini: the first release is a baseline, not a ceiling.
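The generation-over-generation deltas quoted above check out arithmetically, using only the reported figures:

```python
# Verify the reported Gemini 3 Pro -> Gemini 3.1 Pro deltas.
v30 = {"pass_rate": 43.0, "cost_per_pass": 4.85}   # Gemini 3 Pro Preview
v31 = {"pass_rate": 58.7, "cost_per_pass": 3.92}   # Gemini 3.1 Pro

pp_gain = v31["pass_rate"] - v30["pass_rate"]            # +15.7 points
cost_drop = v30["cost_per_pass"] - v31["cost_per_pass"]  # $0.93 cheaper/pass
rel_gain = pp_gain / v30["pass_rate"]                    # ~37% relative

print(f"{pp_gain:.1f} pp gain, ${cost_drop:.2f} cheaper per pass, "
      f"{rel_gain:.0%} relative improvement")
```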

Native CLI vs. third-party wrappers

Gemini 3 Pro was tested exclusively through Google's native Gemini CLI; we did not evaluate it through third-party wrappers such as OpenCode. The native integration shows what you get with Google's official tooling. Later Gemini releases (3.1) have wrapper variants with measurable performance differences, which makes this native-only baseline a useful reference point.

Infrastructure reliability

Eight infrastructure failures is a low count compared to other agents. Infra failures occur when the test harness itself breaks (network timeout, disk full, and so on), not when the model produces a bad patch. Gemini 3 Pro's 5.9% infra-failure rate is typical for the benchmark setup.

Summary

Gemini 3 Pro Preview is a capable first entry into CVE-Agent-Bench. The 43.0% pass rate and low per-evaluation cost make it viable for teams running many evaluations. Build failures are a known issue that improves in Gemini 3.1. The speed and efficiency scores suggest the model is useful for rapid iteration, even if that speed comes at the cost of accuracy. For production CVE patching, the pass rate is below where mature models perform.

Explore the full CVE-Agent-Bench results to compare all agents side-by-side. See how Gemini 3.1 Pro improves on this baseline. Review cost-effectiveness analysis to decide if Gemini 3 Pro fits your evaluation budget.

FAQ

How does Gemini 3 Pro perform on CVE fixes?

43.0% pass rate on 136 CVEs at $4.85 per fix: 55 passes, 36 fails, 37 build failures, and 8 infrastructure failures.

What is the cost per successful patch for Gemini 3 Pro?

$4.85 per successful patch across 136 evaluations. Mid-range cost. Google's updated Gemini 3.1 Pro reduces this to $3.92 while improving pass rate from 43.0% to 58.7%.

How does Gemini 3 Pro compare to other agents?

43.0% pass rate sits below the 47.3% benchmark average. It is a first-generation model that establishes Google's baseline. The 15.7 percentage point jump to Gemini 3.1 Pro is the largest single-generation improvement in the benchmark.

Related topics

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.