Gemini 3 Pro Preview — CVE-Agent-Bench profile
43.0% pass rate at $4.85 per fix. Google model via native Gemini CLI. 136 evaluations.
Gemini 3 Pro Preview
Gemini 3 Pro Preview is Google's first-generation model evaluated in CVE-Agent-Bench. Across 136 evaluations, it achieved a 43.0% pass rate at a cost of $4.85 per successful fix. The model succeeded on 55 samples, failed on 36, experienced 37 build failures, and encountered 8 infrastructure failures (infrastructure failures are excluded from the pass-rate denominator).
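The headline figures can be reproduced from the raw counts. Note that 55 passes out of all 136 evaluations would be 40.4%; the published 43.0% matches 55 out of 128, which assumes infrastructure failures are excluded from the denominator — consistent with the benchmark's 128-CVE count. A minimal sketch under that assumption:

```python
# Sketch: reproduce Gemini 3 Pro Preview's headline metrics from raw counts.
# Assumption: infrastructure failures are excluded from the pass-rate
# denominator (55 / 128 = 43.0%), matching the benchmark's 128-CVE count.

passes, fails, build_failures, infra_failures = 55, 36, 37, 8
total_evals = passes + fails + build_failures + infra_failures  # 136
scored = total_evals - infra_failures                           # 128

pass_rate = passes / scored                        # -> 43.0%
build_failure_rate = build_failures / total_evals  # 37/136 -> 27.2%
infra_rate = infra_failures / total_evals          # 8/136  -> 5.9%

print(f"pass rate:          {pass_rate:.1%}")           # 43.0%
print(f"build failure rate: {build_failure_rate:.1%}")  # 27.2%
print(f"infra failure rate: {infra_rate:.1%}")          # 5.9%
```

All three percentages match the figures quoted in this profile, which supports the excluded-infra-failures reading.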
Performance baseline
As the initial Gemini offering in the benchmark, Gemini 3 Pro establishes a baseline for Google's approach to agentic CVE patching. The 43.0% pass rate sits below the field average of 47.3%. The raw pass count of 55 is competitive, but the fail count of 36 points to a consistent failure mode: patches that apply and build cleanly but don't actually resolve the vulnerability.
Cost efficiency is mixed. At $4.85 per pass, Gemini 3 Pro lands mid-range: Anthropic's models span $2.64 to $5.84 per fix, bracketing that figure, while OpenAI's GPT-5.2 costs more. The metric matters less than whether your budget accommodates the model's behavior: at 43% accuracy, roughly one attempt in 2.3 succeeds, so you need enough evaluations to identify which patches actually work.
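To make the budget point concrete, a rough sketch of the derived costs (only the $4.85-per-pass figure, the 43.0% pass rate, and the raw counts are published; the per-attempt cost and total spend are back-calculated assumptions):

```python
# Rough budget sketch for Gemini 3 Pro Preview. Derived values are
# assumptions; only $4.85/pass, 43.0%, and the raw counts are published.

cost_per_pass = 4.85
passes, total_evals = 55, 136

total_spend = cost_per_pass * passes       # ~$266.75 for the whole run
cost_per_attempt = total_spend / total_evals  # ~$1.96 per evaluation

pass_rate = 0.43
expected_attempts = 1 / pass_rate          # ~2.3 attempts per working fix

print(f"total spend:      ${total_spend:.2f}")       # $266.75
print(f"cost per attempt: ${cost_per_attempt:.2f}")  # $1.96
print(f"attempts per fix: {expected_attempts:.1f}")  # 2.3
```

The useful planning number is the last one: at this pass rate, budget for a bit over two full evaluation-plus-verification cycles per vulnerability you actually want fixed.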
Behavioral profile
Gemini 3 Pro's personality dimensions reveal a speed-focused agent. The speed score is 100, efficiency is 96, and precision is 100. These scores mean the model produces output quickly, uses tokens efficiently (low per-evaluation cost), and structures its responses consistently. Accuracy is 24 (one of the lowest in the benchmark), showing the model prioritizes fast generation over thorough vulnerability analysis.
This profile is common among first-generation models. The trade-off is real: you get cheap, fast patches that work sometimes but fail often. Subsequent Gemini releases show how model improvements can shift this balance.
Build environment challenges
The 37 build failures (27.2% of evaluations) are the second-highest count in the benchmark. This matters because build failures aren't scored as pass or fail on the CVE itself: they're cases where the patch is malformed or incompatible with the test environment, so it never reaches verification.
Most Gemini 3 Pro build failures stem from environment setup. The containerized test harness expects specific toolchain versions, dependency resolutions, and build system configurations. Gemini 3 Pro sometimes generates patches that reference unavailable dependencies or use syntax incompatible with the container's language versions. This is a known class of error in agentic systems: the model is trained on broad internet code, not the specific versions in a test container.
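One mitigation for this failure class is a pre-flight check that rejects a patch whose imports don't resolve inside the container before a build is even attempted. A minimal sketch — the unified-diff patch format and the `missing_imports` helper are hypothetical simplifications, not the benchmark's actual harness; `importlib.util.find_spec` is the standard-library probe used here:

```python
# Hypothetical pre-flight check: scan the lines a Python patch adds for
# imported modules, and flag any that don't resolve in this environment.
# The patch format and extraction logic are simplified assumptions.
import importlib.util
import re

def missing_imports(patch_text: str) -> list[str]:
    """Return top-level imported modules that don't resolve here."""
    missing = []
    for line in patch_text.splitlines():
        if not line.startswith("+"):  # only lines the patch adds
            continue
        m = re.match(r"\+\s*(?:from|import)\s+([A-Za-z_]\w*)", line)
        if m and importlib.util.find_spec(m.group(1)) is None:
            missing.append(m.group(1))
    return sorted(set(missing))

patch = """\
+import json
+from totally_unavailable_pkg import thing
"""
print(missing_imports(patch))  # ['totally_unavailable_pkg']
```

A check like this converts a slow, opaque build failure into a fast, explicit "dependency unavailable" signal the agent can react to.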
Version trajectory
Gemini 3 Pro is not the final release. Google has since released Gemini 3.1 Pro, which shows 58.7% pass rate at $3.92 per fix. The improvement from 43.0% to 58.7% is the largest single-generation leap in the entire benchmark. The cost-per-pass drops from $4.85 to $3.92 despite higher API pricing per token. This trajectory is important context for anyone evaluating Gemini: the first release is a baseline, not a ceiling.
Native CLI vs. wrapper
Gemini 3 Pro was tested exclusively through the native Google CLI interface. We did not evaluate Gemini 3 Pro through third-party wrappers (like OpenCode). The native integration shows what you get with Google's official tooling. Later Gemini releases (3.1) have wrapper variants with measurable performance differences, so this native-CLI result is a useful point of comparison.
Infrastructure reliability
Eight infrastructure failures is low compared to some other agents. Infra failures happen when the test harness itself breaks (network timeout, disk full, etc.), not when the model produces a bad patch. Gemini 3 Pro's 5.9% infra-failure rate is typical for the benchmark setup.
Summary
Gemini 3 Pro Preview is a capable first entry into CVE-Agent-Bench. The 43.0% pass rate and moderate cost make it viable for teams running many evaluations. Build failures are a known issue that improves in Gemini 3.1. The speed and efficiency scores suggest the model is useful for rapid iteration, even if that speed comes at the cost of accuracy. For production CVE patching, the pass rate is below where mature models perform.
Explore the full CVE-Agent-Bench results to compare all agents side-by-side. See how Gemini 3.1 Pro improves on this baseline. Review cost-effectiveness analysis to decide if Gemini 3 Pro fits your evaluation budget.
FAQ
How does Gemini 3 Pro perform on CVE fixes?
43.0% pass rate across 136 evaluations at $4.85 per fix: 55 passes, 36 fails, 37 build failures, and 8 infrastructure failures.
What is the cost per successful patch for Gemini 3 Pro?
$4.85 per successful patch across 136 evaluations. Mid-range cost. Google's updated Gemini 3.1 Pro reduces this to $3.92 while improving pass rate from 43.0% to 58.7%.
How does Gemini 3 Pro compare to other agents?
43.0% pass rate sits below the 47.3% benchmark average. It is a first-generation model that establishes Google's baseline. The 15.7 percentage point jump to Gemini 3.1 Pro is the largest single-generation improvement in the benchmark.
Google security AI and CVE-Agent-Bench
How Google's Big Sleep, Naptime, and Sec-Gemini intersect with independent agent evaluation on 128 real CVEs.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Native CLIs vs wrapper CLIs: the 10-16pp performance gap
Claude CLI vs OpenCode, Gemini CLI vs OpenCode, Codex vs Cursor. Same models, different wrappers, consistent accuracy gaps of 10-16 percentage points.
Cost vs performance: where agents sit on the Pareto frontier
15 agents plotted on cost-accuracy. 4 on the Pareto frontier. Best value: claude-opus-4-6 at $2.93/pass, 61.6%.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.