Google security AI and CVE-Agent-Bench
How Google's Big Sleep, Naptime, and Sec-Gemini intersect with independent agent evaluation on 128 real CVEs.
Google's security AI research
Google has three active programs that apply AI to vulnerability work. Project Big Sleep has found 20 zero-day vulnerabilities in open-source projects including FFmpeg and ImageMagick, using variant analysis to surface bugs that resemble known patterns. The team disclosed CVE-2025-6965 in SQLite, which was patched within 48 hours.
Project Naptime achieved a 20x improvement on the CyberSecEval2 buffer overflow benchmark, raising the score from 0.05 to 1.00 using four specialized tools for code browsing, debugging, and program execution. Sec-Gemini, a dedicated security LLM platform, leads competing models by 11% on CTI multiple-choice questions and by 10.5% on root-cause mapping tasks.
Google models in CVE-Agent-Bench
Three agent configurations in the 128-sample benchmark use Google models. The gemini-3-pro to gemini-3.1-pro upgrade shows a 15.7 percentage point improvement, the second-largest model-upgrade gain in the benchmark. The native Gemini CLI outperforms the OpenCode wrapper by 3.8pp on the same model, consistent with the native-versus-wrapper gap seen across all labs.
Gemini 3 Pro Preview [GOOGLE]: 55 pass, 36 fail, 37 build
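The card's raw counts line up with the 43.0% pass rate quoted in the Gemini 3 Pro Preview profile; a quick sanity check:

```python
# Pass rate from the Gemini 3 Pro Preview card: pass / (pass + fail + build).
passed, failed, build = 55, 36, 37
total = passed + failed + build
print(total, f"{passed / total:.1%}")  # 128 43.0%
```

The three outcomes sum to the benchmark's 128 samples, so "build" failures count against the pass rate just like test failures do.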
How CVE-Agent-Bench complements Google's work
Big Sleep finds new vulnerabilities. CVE-Agent-Bench measures whether agents can fix known ones. Different scope: discovery versus remediation.
Naptime's 4-tool framework and Sec-Gemini's CTI analysis work upstream of patching. XOR's benchmark picks up where they leave off. Once a vulnerability is known, can an agent produce a correct fix? The two approaches are complementary. Google identifies what needs fixing. XOR measures who fixes it well.
The benchmark uses 128 real CVE samples with containerized test environments, automated bug reproduction, and known-good patches. Google's research identifies vulnerabilities at scale; XOR's evaluation independently measures the remediation phase.
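The reproduce-patch-verify loop described above can be sketched as a small classifier. This is an illustrative assumption, not XOR's actual harness: the script names (`reproduce.sh`, `run-tests.sh`), the patch path, and the outcome labels are invented for the example, and `run` stands in for whatever executes commands inside the CVE's container (e.g. a `docker exec` wrapper).

```python
from typing import Callable, Sequence

def evaluate_patch(run: Callable[[Sequence[str]], int]) -> str:
    """Illustrative containerized remediation check.

    `run` executes a command inside the CVE's sandbox container and
    returns its exit code. The reproducer exits 0 when the bug triggers.
    """
    if run(["./reproduce.sh"]) != 0:              # bug must trigger pre-patch
        return "repro-failed"
    if run(["git", "apply", "/patch.diff"]) != 0:
        return "build"                            # patch rejected / build broken
    if run(["./reproduce.sh"]) == 0:              # bug still reproduces post-patch
        return "fail"
    if run(["./run-tests.sh"]) != 0:              # candidate fix breaks the suite
        return "fail"
    return "pass"
```

A harness shaped like this can validate itself: the dataset's known-good patches should always classify as "pass" before any agent is scored.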
Integration points
All three Google programs share the same goal: reduce time-to-fix for security issues. Big Sleep and Naptime compress the discovery and analysis phases. CVE-Agent-Bench measures the fix generation phase. Together, they form a complete remediation pipeline from identification to patch.
The 15.7pp model-upgrade gap between Gemini 3 Pro and Gemini 3.1 Pro is the second-largest single improvement in the benchmark, consistent with Google's investment in security-capable model training.
See Full benchmark results | Methodology | Economics analysis
FAQ
How does CVE-Agent-Bench relate to Project Big Sleep?
Big Sleep finds new vulnerabilities through variant analysis. CVE-Agent-Bench measures whether agents can fix known ones. Different scope: discovery versus remediation.
Gemini 3 Pro Preview — CVE-Agent-Bench profile
43.0% pass rate at $4.85 per fix. Google model via native Gemini CLI. 136 evaluations.
Gemini 3.1 Pro Preview — CVE-Agent-Bench profile
58.7% pass rate at $3.92 per fix. +15.7pp upgrade from Gemini 3 Pro. Best cost/accuracy for Google.
OpenCode Gemini 3.1 Pro — CVE-Agent-Bench profile
54.9% pass rate at $5.81 per fix. Google Gemini 3.1 via OpenCode. 128 evaluations.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.