Anthropic security research and patch equivalence validation
Claude Code 500+ zero-days, CyberGym 28.9% SOTA at $2/vuln, BaxBench 62% insecure patches, 1,992 independent evaluations.
Anthropic's security research programs
Anthropic maintains three active security research initiatives. Claude Code has produced over 500 zero-day disclosures through systematic red-teaming of open-source projects. CyberGym, Anthropic's synthetic benchmark, achieves a state-of-the-art 28.9% pass rate at a cost of $2 per vulnerability fixed. BaxBench found that 62% of patched samples still contain insecure patterns, indicating a semantic gap between syntactic fixes and genuine security remediation.
These programs measure different aspects of the patch lifecycle. Claude Code finds vulnerabilities. CyberGym measures fix quality on synthetic samples. BaxBench validates patch semantic correctness. CVE-Agent-Bench extends this work with independent evaluation on 128 real CVE samples.
Claude models in CVE-Agent-Bench
Anthropic has the largest agent representation in the benchmark: five Claude-model configurations. Two test Claude directly via the native API. One tests Claude via Cursor, a commercial IDE integration. Two test Claude via OpenCode, an independent wrapper.
The five agents benchmark Claude Opus models (the larger, more capable variant):
- claude-claude-opus-4-6: Native Claude API, latest Opus model
- claude-claude-opus-4-5: Native Claude API, previous Opus release
- cursor-opus-4.6: Cursor IDE integration, Opus 4.6
- opencode-claude-opus-4-6: OpenCode CLI wrapper, Opus 4.6
- opencode-claude-opus-4-5: OpenCode CLI wrapper, Opus 4.5
Pass rate distribution shows that model version (Opus 4.6 vs 4.5) matters more than environment. The Opus 4.6 upgrade delivers consistent gains in both execution contexts where both versions were tested (native API and OpenCode); Cursor was evaluated with Opus 4.6 only.
All five agents run Anthropic models:

| Agent | Pass | Fail | Build |
|---|---|---|---|
| Claude Opus 4.5 | 58 | 43 | 26 |
| Claude Opus 4.6 | 77 | 28 | 20 |
| Cursor Opus 4.6 | 80 | 24 | 24 |
| OpenCode Claude Opus 4.5 | 46 | 29 | 50 |
| OpenCode Claude Opus 4.6 | 58 | 15 | 49 |
Environment impact
Cursor wrapper slightly outperforms the native API for Opus 4.6 (62.5% vs 61.6%, a 0.9pp gain). Cursor's orchestration adds marginal benefits, but model capability dominates the result.
The OpenCode wrapper underperforms both Cursor and the native API on both versions. Measured against the native API, the gap is roughly 9pp for Opus 4.5 and 14pp for Opus 4.6, driven largely by OpenCode's high build-failure counts (50 and 49 respectively). This reflects orchestration differences (timeout handling, retry logic, prompt format) rather than model capability.
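The headline percentages can be reproduced from the raw pass/fail/build counts in the table above. A minimal sketch, assuming (as the reported figures suggest) that build failures count against the denominator:

```python
# Pass/fail/build counts transcribed from the agent table above.
counts = {
    "claude-claude-opus-4-5":   (58, 43, 26),
    "claude-claude-opus-4-6":   (77, 28, 20),
    "cursor-opus-4.6":          (80, 24, 24),
    "opencode-claude-opus-4-5": (46, 29, 50),
    "opencode-claude-opus-4-6": (58, 15, 49),
}

def pass_rate(passed: int, failed: int, build: int) -> float:
    """Pass rate over all attempted samples, build failures included."""
    return passed / (passed + failed + build)

for agent, (p, f, b) in counts.items():
    print(f"{agent:26s} {pass_rate(p, f, b):6.1%}")
```

With this denominator convention the computed rates match the profile figures quoted later (45.7% for native Opus 4.5, 61.6% for native Opus 4.6, 62.5% for Cursor).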
Patch semantic equivalence
BaxBench showed that 62% of patched code remains insecure. This gap between syntactic correctness (the code compiles and tests pass) and semantic correctness (the vulnerability is closed and no new ones are introduced) is critical for security.
CVE-Agent-Bench extends BaxBench's preliminary analysis with 1,992 independent evaluations across 15 agent-model pairs. The results validate Anthropic's preliminary finding: most agents fix the immediate bug but leave adjacent vulnerabilities open.
Patch quality is measured in two ways:
- Syntactic correctness: Does the fix compile and pass automated tests? (Primary measure)
- Semantic equivalence: Does the fix match Anthropic's reference patch or produce equivalent security properties? (Secondary, preliminary measure)
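As an illustration only, a two-stage check along these lines might look as follows. The test command and the line-set diff comparison are hypothetical stand-ins, not the benchmark's actual harness; in particular, comparing sets of changed lines is a crude proxy for semantic equivalence:

```python
import subprocess

def syntactic_pass(repo_dir: str, test_cmd: list[str]) -> bool:
    """Primary measure: the patched project builds and its tests pass."""
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def changed_lines(diff: str) -> set[str]:
    """Added/removed source lines in a unified diff, ignoring file headers."""
    return {line[1:].strip() for line in diff.splitlines()
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))}

def semantically_equivalent(agent_patch: str, reference_patch: str) -> bool:
    """Secondary measure (crude proxy): the agent's patch touches the
    same set of lines as the reference, regardless of order or context."""
    return changed_lines(agent_patch) == changed_lines(reference_patch)
```

A real equivalence check would need to account for renamed variables, refactored control flow, and fixes applied at a different layer than the reference patch, which is why this measure remains preliminary.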
Anthropic's work on semantic equivalence is ongoing. CVE-Agent-Bench offers a longitudinal dataset to measure improvement as models and tools advance.
Integration with CVE-Agent-Bench
The five Anthropic agents represent the largest single lab in the benchmark. This reflects Anthropic's leadership in security AI research and the breadth of their Claude model lineup.
With 627 evaluations across Anthropic's five agents (1,992 total across all 15 agents), the benchmark yields precise pass-rate estimates. This enables detection of small improvements in future Claude releases.
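To make the precision claim concrete, a Wilson score interval gives a rough sense of the sampling error on a single agent's pass rate. The counts below come from the agent table above; the choice of interval is ours for illustration, not the benchmark's published methodology:

```python
from math import sqrt

def wilson_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial pass rate."""
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

lo, hi = wilson_ci(77, 125)  # native Claude Opus 4.6: 77 passes of 125
print(f"61.6% pass rate, 95% CI [{lo:.1%}, {hi:.1%}]")
```

At roughly 125 samples per agent, the interval spans about ±8 percentage points, so a single release-over-release comparison resolves large jumps (like Opus 4.5 to 4.6) cleanly, while smaller gaps (like Cursor's 0.9pp edge) fall within noise.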
Claude Code's 500+ zero-day contributions fed vulnerability discovery. CyberGym's synthetic benchmark established baseline quality (28.9%). BaxBench identified semantic gaps in patching. CVE-Agent-Bench measures patching performance on real vulnerabilities with full end-to-end validation.
See Full benchmark results | Agent Profiles | Economics analysis
FAQ
How does CVE-Agent-Bench extend Anthropic's security research?
Claude Code finds vulnerabilities. CyberGym measures fix quality on synthetic samples. BaxBench identifies semantic gaps. CVE-Agent-Bench extends this work with independent evaluation on 128 real CVE samples with full end-to-end validation.
Claude Opus 4.5 — CVE-Agent-Bench profile
45.7% pass rate at $2.64 per fix. Anthropic model via Claude Code CLI. 127 real CVEs evaluated.
Claude Opus 4.6 — CVE-Agent-Bench profile
61.6% pass rate at $2.93 per fix. Anthropic model via Claude Code CLI. Second-highest accuracy overall.
Cursor Opus 4.6 — CVE-Agent-Bench profile
62.5% pass rate at $35.40 per fix. Anthropic Opus 4.6 via Cursor. High accuracy, highest cost.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.