[ANTHROPIC]

Anthropic security research and patch equivalence validation

Claude Code: 500+ zero-days. CyberGym: 28.9% SOTA at $2/vuln. BaxBench: 62% insecure patches. CVE-Agent-Bench: 1,992 independent evaluations.

Anthropic's security research programs

Anthropic maintains three active security research initiatives. Claude Code has produced over 500 zero-day disclosures through systematic red-teaming of open-source projects. CyberGym, Anthropic's synthetic benchmark, reports a state-of-the-art 28.9% pass rate at roughly $2 per vulnerability fixed. BaxBench found that 62% of patched samples still contain insecure patterns, indicating a semantic gap between syntactic fixes and genuine security remediation.

These programs measure different aspects of the patch lifecycle. Claude Code finds vulnerabilities. CyberGym measures fix quality on synthetic samples. BaxBench validates patch semantic correctness. CVE-Agent-Bench extends this work with independent evaluation on 128 real CVE samples.

Claude models in CVE-Agent-Bench

Anthropic has the largest agent representation in the benchmark: five Claude-model configurations. Two test Claude directly via the native API. One tests Claude via Cursor, a commercial IDE integration. Two test Claude via OpenCode, an independent wrapper.

The five agents benchmark Claude Opus models (the larger, more capable variant):

  • claude-claude-opus-4-6: Native Claude API, latest Opus model
  • claude-claude-opus-4-5: Native Claude API, previous Opus release
  • cursor-opus-4.6: Cursor IDE integration, Opus 4.6
  • opencode-claude-opus-4-6: OpenCode CLI wrapper, Opus 4.6
  • opencode-claude-opus-4-5: OpenCode CLI wrapper, Opus 4.5

Pass rate distribution shows that model version (Opus 4.6 vs 4.5) matters more than environment. The Opus 4.6 upgrade delivers consistent gains across all three execution contexts.

Claude Opus 4.5

45.7% pass rate · $2.64/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
58 pass · 43 fail · 26 build

Claude Opus 4.6

61.6% pass rate · $2.93/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
77 pass · 28 fail · 20 build

Cursor Opus 4.6

62.5% pass rate · $35.40/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
80 pass · 24 fail · 24 build

OpenCode Claude Opus 4.5

36.8% pass rate · $40.13/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
46 pass · 29 fail · 50 build

OpenCode Claude Opus 4.6

47.5% pass rate · $51.88/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
58 pass · 15 fail · 49 build

Environment impact

The Cursor wrapper slightly outperforms the native API for Opus 4.6 (62.5% vs 61.6%, a 0.9pp gain). Cursor's orchestration adds a marginal benefit, but model capability dominates the result.

The OpenCode wrapper underperforms both Cursor and the native API on both model versions: 8.9pp below the native API for Opus 4.5 (36.8% vs 45.7%) and 14.1pp for Opus 4.6 (47.5% vs 61.6%). This likely reflects orchestration differences (timeout handling, retry logic, prompt format) rather than model capability.
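The pass rates and environment gaps quoted here can be reproduced directly from the per-agent counts shown in the cards above; a minimal sketch in Python (the dictionary keys reuse the agent identifiers from the bullet list):

```python
# Per-agent (pass, fail, build-error) counts, taken from the agent cards above.
counts = {
    "claude-claude-opus-4-5":   (58, 43, 26),
    "claude-claude-opus-4-6":   (77, 28, 20),
    "cursor-opus-4.6":          (80, 24, 24),
    "opencode-claude-opus-4-5": (46, 29, 50),
    "opencode-claude-opus-4-6": (58, 15, 49),
}

def pass_rate(passed: int, failed: int, build_err: int) -> float:
    # Pass rate over all attempted samples, counting build failures as misses.
    return 100 * passed / (passed + failed + build_err)

rates = {name: round(pass_rate(*c), 1) for name, c in counts.items()}

# Environment gap for Opus 4.6: native API vs the OpenCode wrapper.
gap = rates["claude-claude-opus-4-6"] - rates["opencode-claude-opus-4-6"]
```

Running this reproduces the card figures (e.g. 61.6% for the native Opus 4.6 run) and the roughly 14pp Opus 4.6 environment gap discussed above.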

Patch semantic equivalence

BaxBench showed that 62% of patched code remains insecure. This gap between syntactic correctness (the code compiles and tests pass) and semantic correctness (the vulnerability is closed and no new ones are introduced) is critical for security.

CVE-Agent-Bench extends BaxBench's preliminary analysis with 1,992 independent evaluations across 15 agent-model pairs. The results validate Anthropic's preliminary finding: most agents fix the immediate bug but leave adjacent vulnerabilities open.

Patch quality is assessed at two levels:

  1. Syntactic correctness: Does the fix compile and pass the automated tests? (Primary measure)
  2. Semantic equivalence: Does the fix match Anthropic's reference patch or produce equivalent security properties? (Secondary measure, preliminary)

Anthropic's work on semantic equivalence is ongoing. CVE-Agent-Bench offers a longitudinal dataset to measure improvement as models and tools advance.
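The two levels above could be combined in a harness along these lines. This is a hypothetical sketch, not CVE-Agent-Bench's published tooling: `check_patch` and its command arguments are placeholders, and replaying the original proof-of-concept exploit stands in for the reference-patch comparison.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class PatchVerdict:
    syntactic: bool  # code builds and the automated tests pass
    semantic: bool   # the original exploit no longer succeeds

def check_patch(build_cmd: list, test_cmd: list, exploit_cmd: list) -> PatchVerdict:
    # Stage 1: syntactic correctness -- run the build and the test suite.
    syntactic = (
        subprocess.run(build_cmd, capture_output=True).returncode == 0
        and subprocess.run(test_cmd, capture_output=True).returncode == 0
    )
    # Stage 2: semantic correctness -- replay the proof-of-concept exploit.
    # A semantically correct patch makes the exploit fail.
    semantic = False
    if syntactic:
        semantic = subprocess.run(exploit_cmd, capture_output=True).returncode != 0
    return PatchVerdict(syntactic, semantic)
```

The key design point is that stage 2 only runs when stage 1 passes: a patch that does not build cannot be semantically evaluated, which is why syntactic correctness is the primary measure.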

Integration with CVE-Agent-Bench

The five Anthropic agents represent the largest single-lab presence in the benchmark, reflecting Anthropic's leadership in security AI research and the breadth of its Claude model lineup.

With 627 evaluations across Anthropic's five agents (1,992 total across all 15 agents), the benchmark yields precise pass-rate estimates. This enables detection of small improvements in future Claude releases.
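That precision can be made concrete. With roughly 125 samples per agent, a 95% Wilson score interval on a pass rate is about ±8pp wide on each side; the interval choice and independence assumption are ours, not part of the benchmark's published methodology.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Claude Opus 4.6 via the native API: 77 passes out of 125 samples (card above).
lo, hi = wilson_interval(77, 125)
```

Under these assumptions the interval spans roughly 53% to 70%, so a future release would need a gain of several points before it separates cleanly from the current estimate.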

Claude Code's 500+ zero-day contributions fed vulnerability discovery. CyberGym's synthetic benchmark established baseline quality (28.9%). BaxBench identified semantic gaps in patching. CVE-Agent-Bench measures patching performance on real vulnerabilities with full end-to-end validation.

See Full benchmark results | Agent Profiles | Economics analysis

FAQ

How does CVE-Agent-Bench extend Anthropic's security research?

Claude Code finds vulnerabilities. CyberGym measures fix quality on synthetic samples. BaxBench identifies semantic gaps. CVE-Agent-Bench extends this work with independent evaluation on 128 real CVE samples with full end-to-end validation.


See which agents produce fixes that work

128 CVEs. 15 agents. 1,992 evaluations. Agents learn from every run.