
Google security AI and CVE-Agent-Bench

How Google's Big Sleep, Naptime, and Sec-Gemini intersect with independent agent evaluation on 128 real CVEs.

Google's security AI research

Google has three active programs that evaluate AI for vulnerability work. Project Big Sleep found 20 zero-day vulnerabilities in open-source projects including FFmpeg and ImageMagick, using variant analysis to find bugs similar to known patterns. The team disclosed CVE-2025-6965 in SQLite, which was patched within 48 hours.

Project Naptime achieved a 20x improvement on the CyberSecEval2 buffer overflow benchmark, going from 0.05 to 1.00 using four specialized tools for code browsing, debugging, and program execution. Sec-Gemini, a dedicated security LLM platform, leads competing models by 11% on CTI multiple-choice questions and 10.5% on root-cause mapping tasks.

Google models in CVE-Agent-Bench

Three agent configurations use Google models in the 128-sample benchmark. The gemini-3-pro to gemini-3.1-pro upgrade shows a 15.7 percentage point improvement, the second-largest model-upgrade gain in the benchmark. The native Gemini CLI outperforms the OpenCode wrapper by 3.8pp on the same model, consistent with the native-vs-wrapper gap seen across all labs.

Gemini 3 Pro Preview


43.0% pass rate
$4.85/pass
Agent personality radar chart (axes: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability)

55 pass · 36 fail · 37 build

How CVE-Agent-Bench complements Google's work

Big Sleep finds new vulnerabilities. CVE-Agent-Bench measures whether agents can fix known ones. Different scope: discovery versus remediation.

Naptime's 4-tool framework and Sec-Gemini's CTI analysis work upstream of patching. XOR's benchmark picks up where they leave off: once a vulnerability is known, can an agent produce a correct fix? The two approaches are complementary. Google identifies what needs fixing. XOR measures who fixes it well.

The benchmark uses 128 real CVE samples with containerized test environments, automated bug reproduction, and known-good patches. Google's research identifies vulnerabilities at scale. XOR's evaluation measures the remediation phase at independent depth.
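The evaluation loop this implies can be sketched roughly as follows. The function names and the three-way pass/fail/build outcome split are illustrative, inferred from the stats shown on this page; they are not XOR's actual harness API.

```python
# Illustrative sketch: a containerized run builds the patched project,
# reproduces the bug before the patch, and re-runs the reproduction
# test after the patch. Outcome labels mirror the pass/fail/build
# counts shown above; all names here are hypothetical.

def classify_run(build_ok: bool,
                 bug_reproduced_before: bool,
                 bug_reproduced_after: bool) -> str:
    """Map one containerized evaluation run to a benchmark outcome."""
    if not build_ok:
        return "build"   # agent's patch broke the build / environment
    if not bug_reproduced_before:
        return "build"   # baseline bug never triggered; run is invalid
    if bug_reproduced_after:
        return "fail"    # bug still triggers after the patch
    return "pass"        # bug gone and the project still builds

def pass_rate(outcomes: list[str]) -> float:
    """Percentage of runs that count as a pass."""
    return 100.0 * outcomes.count("pass") / len(outcomes)
```

With 55 passes, 36 fails, and 37 build errors across 128 samples, `pass_rate` gives roughly 43.0%, matching the figure reported above.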

Integration points

All three Google programs share the same goal: reduce time-to-fix for security issues. Big Sleep and Naptime compress the discovery and analysis phases. CVE-Agent-Bench measures the fix generation phase. Together, they form a complete remediation pipeline from identification to patch.

The 15.7pp model-upgrade gap between gemini-3-pro and gemini-3.1-pro is the second-largest single improvement in the benchmark, consistent with Google's continued investment in security-specific model training.

See Full benchmark results | Methodology | Economics analysis

FAQ

How does CVE-Agent-Bench relate to Project Big Sleep?

Big Sleep finds new vulnerabilities through variant analysis. CVE-Agent-Bench measures whether agents can fix known ones. Different scope: discovery versus remediation.


See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.