Anthropic security research and patch equivalence validation
Claude Code 500+ zero-days, CyberGym 28.9% SOTA at $2/vuln, BaxBench 62% insecure patches, 1,992 independent evaluations.
Anthropic's security research programs
Anthropic maintains three active security research initiatives. Claude Code has produced over 500 zero-day disclosures through systematic red-teaming of open-source projects. CyberGym, Anthropic's synthetic benchmark, achieves a state-of-the-art 28.9% pass rate at a cost of $2 per vulnerability fixed. BaxBench found that 62% of patched samples still contain insecure patterns, indicating a semantic gap between syntactic fixes and genuine security remediation.
These programs measure different aspects of the patch lifecycle. Claude Code finds vulnerabilities. CyberGym measures fix quality on synthetic samples. BaxBench validates patch semantic correctness. CVE-Agent-Bench extends this work with independent evaluation on 128 real CVE samples.
Claude models in CVE-Agent-Bench
Anthropic has the largest agent representation in the benchmark: five Claude-model configurations. Two test Claude directly via the native API. One tests Claude via Cursor, a commercial IDE integration. Two test Claude via OpenCode, an independent wrapper.
The five agents benchmark Claude Opus models (the larger, more capable variant):
- claude-claude-opus-4-6: Native Claude API, latest Opus model
- claude-claude-opus-4-5: Native Claude API, previous Opus release
- cursor-opus-4.6: Cursor IDE integration, Opus 4.6
- opencode-claude-opus-4-6: OpenCode CLI wrapper, Opus 4.6
- opencode-claude-opus-4-5: OpenCode CLI wrapper, Opus 4.5
Pass rate distribution shows that model version (Opus 4.6 vs 4.5) matters more than environment. The Opus 4.6 upgrade delivers consistent gains in both execution contexts where both versions were tested (native API and OpenCode); Cursor was evaluated with Opus 4.6 only.
All five agents run Anthropic models:

| Agent | Pass | Fail | Build |
|---|---|---|---|
| Claude Opus 4.5 | 58 | 43 | 26 |
| Claude Opus 4.6 | 77 | 28 | 20 |
| Cursor Opus 4.6 | 80 | 24 | 24 |
| OpenCode Claude Opus 4.5 | 46 | 29 | 50 |
| OpenCode Claude Opus 4.6 | 58 | 15 | 49 |
Environment impact
Cursor wrapper slightly outperforms the native API for Opus 4.6 (62.5% vs 61.6%, a 0.9pp gain). Cursor's orchestration adds marginal benefits, but model capability dominates the result.
The OpenCode wrapper underperforms both Cursor and the native API on both versions. Measured against the native API, the gap is roughly 9pp for Opus 4.5 and 14pp for Opus 4.6, driven largely by OpenCode's high build-failure counts (50 and 49 respectively). This reflects orchestration differences (timeout handling, retry logic, prompt format) rather than model capability.
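The headline percentages can be reproduced from the raw pass/fail/build counts in the table above. A minimal sketch, assuming (as the reported figures suggest) that build failures count against the denominator:

```python
# Pass/fail/build counts transcribed from the agent table above.
counts = {
    "claude-claude-opus-4-5":   (58, 43, 26),
    "claude-claude-opus-4-6":   (77, 28, 20),
    "cursor-opus-4.6":          (80, 24, 24),
    "opencode-claude-opus-4-5": (46, 29, 50),
    "opencode-claude-opus-4-6": (58, 15, 49),
}

def pass_rate(passed: int, failed: int, build: int) -> float:
    """Pass rate over all attempted samples, build failures included."""
    return passed / (passed + failed + build)

for agent, (p, f, b) in counts.items():
    print(f"{agent:26s} {pass_rate(p, f, b):6.1%}")
```

With this denominator convention the computed rates match the profile figures quoted later (45.7% for native Opus 4.5, 61.6% for native Opus 4.6, 62.5% for Cursor).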
Patch semantic equivalence
BaxBench showed that 62% of patched code remains insecure. This gap between syntactic correctness (the code compiles and tests pass) and semantic correctness (the vulnerability is closed and no new ones are introduced) is critical for security.
CVE-Agent-Bench extends BaxBench's preliminary analysis with 1,992 independent evaluations across 15 agent-model pairs. The results validate Anthropic's preliminary finding: most agents fix the immediate bug but leave adjacent vulnerabilities open.
Patch quality is measured in two ways:
- Syntactic correctness: Does the fix compile and pass automated tests? (Primary measure)
- Semantic equivalence: Does the fix match Anthropic's reference patch or produce equivalent security properties? (Secondary, preliminary measure)
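As an illustration only, a two-stage check along these lines might look as follows. The test command and the line-set diff comparison are hypothetical stand-ins, not the benchmark's actual harness; in particular, comparing sets of changed lines is a crude proxy for semantic equivalence:

```python
import subprocess

def syntactic_pass(repo_dir: str, test_cmd: list[str]) -> bool:
    """Primary measure: the patched project builds and its tests pass."""
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def changed_lines(diff: str) -> set[str]:
    """Added/removed source lines in a unified diff, ignoring file headers."""
    return {line[1:].strip() for line in diff.splitlines()
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))}

def semantically_equivalent(agent_patch: str, reference_patch: str) -> bool:
    """Secondary measure (crude proxy): the agent's patch touches the
    same set of lines as the reference, regardless of order or context."""
    return changed_lines(agent_patch) == changed_lines(reference_patch)
```

A real equivalence check would need to account for renamed variables, refactored control flow, and fixes applied at a different layer than the reference patch, which is why this measure remains preliminary.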
Anthropic's work on semantic equivalence is ongoing. CVE-Agent-Bench offers a longitudinal dataset to measure improvement as models and tools advance.
Integration with CVE-Agent-Bench
The five Anthropic agents represent the largest single lab in the benchmark. This reflects Anthropic's leadership in security AI research and the breadth of their Claude model lineup.
With 627 evaluations across Anthropic's five agents (1,992 total across all 15 agents), the benchmark yields precise pass-rate estimates. This enables detection of small improvements in future Claude releases.
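To make the precision claim concrete, a Wilson score interval gives a rough sense of the sampling error on a single agent's pass rate. The counts below come from the agent table above; the choice of interval is ours for illustration, not the benchmark's published methodology:

```python
from math import sqrt

def wilson_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial pass rate."""
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

lo, hi = wilson_ci(77, 125)  # native Claude Opus 4.6: 77 passes of 125
print(f"61.6% pass rate, 95% CI [{lo:.1%}, {hi:.1%}]")
```

At roughly 125 samples per agent, the interval spans about ±8 percentage points, so a single release-over-release comparison resolves large jumps (like Opus 4.5 to 4.6) cleanly, while smaller gaps (like Cursor's 0.9pp edge) fall within noise.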
Claude Code's 500+ zero-day contributions fed vulnerability discovery. CyberGym's synthetic benchmark established baseline quality (28.9%). BaxBench identified semantic gaps in patching. CVE-Agent-Bench measures patching performance on real vulnerabilities with full end-to-end validation.
See Full benchmark results | Agent Profiles | Economics analysis
FAQ
How does CVE-Agent-Bench extend Anthropic's security research?
Claude Code finds vulnerabilities. CyberGym measures fix quality on synthetic samples. BaxBench identifies semantic gaps. CVE-Agent-Bench extends this work with independent evaluation on 128 real CVE samples with full end-to-end validation.
Claude Opus 4.5 — CVE-Agent-Bench profile
45.7% pass rate at $2.64 per fix. Anthropic model via Claude Code CLI. 127 real CVEs evaluated.
Claude Opus 4.6 — CVE-Agent-Bench profile
61.6% pass rate at $2.93 per fix. Anthropic model via Claude Code CLI. Second-highest accuracy overall.
Cursor Opus 4.6 — CVE-Agent-Bench profile
62.5% pass rate at $35.40 per fix. Anthropic Opus 4.6 via Cursor. High accuracy, highest cost.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.