Gemini 3 Pro Preview — CVE-Agent-Bench profile
43.0% pass rate at $4.85 per fix. Google model via native Gemini CLI. 136 evaluations.
Gemini 3 Pro Preview
Gemini 3 Pro Preview is Google's first-generation model evaluated in CVE-Agent-Bench. Across 136 evaluations, it achieved a 43.0% pass rate at a cost of $4.85 per successful fix. The model succeeded on 55 samples, failed on 36, experienced 37 build failures, and encountered 8 infrastructure failures (infrastructure failures are excluded from the pass-rate denominator).
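The headline figures can be reproduced from the raw counts. Note that 55 passes out of all 136 evaluations would be 40.4%; the published 43.0% matches 55 out of 128, which assumes infrastructure failures are excluded from the denominator — consistent with the benchmark's 128-CVE count. A minimal sketch under that assumption:

```python
# Sketch: reproduce Gemini 3 Pro Preview's headline metrics from raw counts.
# Assumption: infrastructure failures are excluded from the pass-rate
# denominator (55 / 128 = 43.0%), matching the benchmark's 128-CVE count.

passes, fails, build_failures, infra_failures = 55, 36, 37, 8
total_evals = passes + fails + build_failures + infra_failures  # 136
scored = total_evals - infra_failures                           # 128

pass_rate = passes / scored                        # -> 43.0%
build_failure_rate = build_failures / total_evals  # 37/136 -> 27.2%
infra_rate = infra_failures / total_evals          # 8/136  -> 5.9%

print(f"pass rate:          {pass_rate:.1%}")           # 43.0%
print(f"build failure rate: {build_failure_rate:.1%}")  # 27.2%
print(f"infra failure rate: {infra_rate:.1%}")          # 5.9%
```

All three percentages match the figures quoted in this profile, which supports the excluded-infra-failures reading.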
Performance baseline
As the initial Gemini offering in the benchmark, Gemini 3 Pro establishes a baseline for Google's approach to agentic CVE patching. The 43.0% pass rate sits below the field average of 47.3%. The raw pass count of 55 is competitive, but the fail count of 36 points to a consistent failure mode: patches that apply and build cleanly but don't actually resolve the vulnerability.
Cost efficiency is mixed. At $4.85 per pass, Gemini 3 Pro lands mid-range: Anthropic's models span $2.64 to $5.84 per fix, bracketing that figure, while OpenAI's GPT-5.2 costs more. The metric matters less than whether your budget accommodates the model's behavior: at 43% accuracy, roughly one attempt in 2.3 succeeds, so you need enough evaluations to identify which patches actually work.
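To make the budget point concrete, a rough sketch of the derived costs (only the $4.85-per-pass figure, the 43.0% pass rate, and the raw counts are published; the per-attempt cost and total spend are back-calculated assumptions):

```python
# Rough budget sketch for Gemini 3 Pro Preview. Derived values are
# assumptions; only $4.85/pass, 43.0%, and the raw counts are published.

cost_per_pass = 4.85
passes, total_evals = 55, 136

total_spend = cost_per_pass * passes       # ~$266.75 for the whole run
cost_per_attempt = total_spend / total_evals  # ~$1.96 per evaluation

pass_rate = 0.43
expected_attempts = 1 / pass_rate          # ~2.3 attempts per working fix

print(f"total spend:      ${total_spend:.2f}")       # $266.75
print(f"cost per attempt: ${cost_per_attempt:.2f}")  # $1.96
print(f"attempts per fix: {expected_attempts:.1f}")  # 2.3
```

The useful planning number is the last one: at this pass rate, budget for a bit over two full evaluation-plus-verification cycles per vulnerability you actually want fixed.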
Behavioral profile
Gemini 3 Pro's personality dimensions reveal a speed-focused agent. The speed score is 100, efficiency is 96, and precision is 100. These scores mean the model produces output quickly, uses tokens efficiently (low per-evaluation cost), and structures its responses consistently. Accuracy is 24 (one of the lowest in the benchmark), showing the model prioritizes fast generation over thorough vulnerability analysis.
This profile is common among first-generation models. The trade-off is real: you get cheap, fast patches that work sometimes but fail often. Subsequent Gemini releases show how model improvements can shift this balance.
Build environment challenges
The 37 build failures (27.2% of evaluations) are the second-highest count in the benchmark. This matters because build failures aren't scored as pass or fail on the CVE itself: they're cases where the patch is malformed or incompatible with the test environment, so it never reaches verification.
Most Gemini 3 Pro build failures stem from environment setup. The containerized test harness expects specific toolchain versions, dependency resolutions, and build system configurations. Gemini 3 Pro sometimes generates patches that reference unavailable dependencies or use syntax incompatible with the container's language versions. This is a known class of error in agentic systems: the model is trained on broad internet code, not the specific versions in a test container.
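One mitigation for this failure class is a pre-flight check that rejects a patch whose imports don't resolve inside the container before a build is even attempted. A minimal sketch — the unified-diff patch format and the `missing_imports` helper are hypothetical simplifications, not the benchmark's actual harness; `importlib.util.find_spec` is the standard-library probe used here:

```python
# Hypothetical pre-flight check: scan the lines a Python patch adds for
# imported modules, and flag any that don't resolve in this environment.
# The patch format and extraction logic are simplified assumptions.
import importlib.util
import re

def missing_imports(patch_text: str) -> list[str]:
    """Return top-level imported modules that don't resolve here."""
    missing = []
    for line in patch_text.splitlines():
        if not line.startswith("+"):  # only lines the patch adds
            continue
        m = re.match(r"\+\s*(?:from|import)\s+([A-Za-z_]\w*)", line)
        if m and importlib.util.find_spec(m.group(1)) is None:
            missing.append(m.group(1))
    return sorted(set(missing))

patch = """\
+import json
+from totally_unavailable_pkg import thing
"""
print(missing_imports(patch))  # ['totally_unavailable_pkg']
```

A check like this converts a slow, opaque build failure into a fast, explicit "dependency unavailable" signal the agent can react to.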
Version trajectory
Gemini 3 Pro is not the final release. Google has since released Gemini 3.1 Pro, which shows 58.7% pass rate at $3.92 per fix. The improvement from 43.0% to 58.7% is the largest single-generation leap in the entire benchmark. The cost-per-pass drops from $4.85 to $3.92 despite higher API pricing per token. This trajectory is important context for anyone evaluating Gemini: the first release is a baseline, not a ceiling.
Native CLI vs. wrapper
Gemini 3 Pro was tested exclusively through the native Google CLI interface. We did not evaluate Gemini 3 Pro through third-party wrappers (like OpenCode). The native integration shows what you get with Google's official tooling. Later Gemini releases (3.1) have wrapper variants with measurable performance differences, so this native-CLI result is a useful point of comparison.
Infrastructure reliability
Eight infrastructure failures is low compared to some other agents. Infra failures happen when the test harness itself breaks (network timeout, disk full, etc.), not when the model produces a bad patch. Gemini 3 Pro's 5.9% infra-failure rate is typical for the benchmark setup.
Summary
Gemini 3 Pro Preview is a capable first entry into CVE-Agent-Bench. The 43.0% pass rate and moderate cost make it viable for teams running many evaluations. Build failures are a known issue that improves in Gemini 3.1. The speed and efficiency scores suggest the model is useful for rapid iteration, even if that speed comes at the cost of accuracy. For production CVE patching, the pass rate is below where mature models perform.
Explore the full CVE-Agent-Bench results to compare all agents side-by-side. See how Gemini 3.1 Pro improves on this baseline. Review cost-effectiveness analysis to decide if Gemini 3 Pro fits your evaluation budget.
FAQ
How does Gemini 3 Pro perform on CVE fixes?
43.0% pass rate across 136 evaluations at $4.85 per fix: 55 passes, 36 fails, 37 build failures, and 8 infrastructure failures.
What is the cost per successful patch for Gemini 3 Pro?
$4.85 per successful patch across 136 evaluations. Mid-range cost. Google's updated Gemini 3.1 Pro reduces this to $3.92 while improving pass rate from 43.0% to 58.7%.
How does Gemini 3 Pro compare to other agents?
43.0% pass rate sits below the 47.3% benchmark average. It is a first-generation model that establishes Google's baseline. The 15.7 percentage point jump to Gemini 3.1 Pro is the largest single-generation improvement in the benchmark.
Google security AI and CVE-Agent-Bench
How Google's Big Sleep, Naptime, and Sec-Gemini intersect with independent agent evaluation on 128 real CVEs.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Native CLIs vs wrapper CLIs: the 10-16pp performance gap
Claude CLI vs OpenCode, Gemini CLI vs OpenCode, Codex vs Cursor. Same models, different wrappers, consistent accuracy gaps of 10-16 percentage points.
Cost vs performance: where agents sit on the Pareto frontier
15 agents plotted on cost-accuracy. 4 on the Pareto frontier. Best value: claude-opus-4-6 at $2.93/pass, 61.6%.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.