
OpenAI models and verified performance in CVE-Agent-Bench

Aardvark's self-reported 92% vs XOR's independent 62.7% for codex-gpt-5.2; non-determinism via trajectory clustering; and GPT-5.3 cyber capabilities.

OpenAI's internal benchmarks vs independent evaluation

OpenAI published Aardvark as a security-specific benchmark for code generation. The company reported 92% pass rate on their internal evaluation. CVE-Agent-Bench tests the same models on a different 128-sample dataset with different test harnesses and containerized environments.

For codex-gpt-5.2, OpenAI's Aardvark reports 92% internal pass rate. XOR's independent evaluation reports 62.7% pass rate on CVE samples with automated patch validation. The 29.3pp gap reflects different evaluation methodologies, different sample difficulty, and different success criteria.

This is not a criticism of Aardvark. It shows that internal benchmarks and independent third-party benchmarks measure different signal. Aardvark is narrower (code generation only) and uses self-contained test cases. CVE-Agent-Bench includes discovery, analysis, and remediation with real-world vulnerability context.

GPT-5.2 as baseline: Native CLI outperforms sandboxed runtime

The codex-gpt-5.2 native CLI achieves 62.7%, the highest pass rate in the entire benchmark across all agent-model pairs. The closest competitor is cursor-opus-4.6 (62.5%), a near-tie.

When the same model runs in a sandboxed runtime (codex-gpt-5.2-codex), performance drops to 49.2%, a 13.5pp decline. This reflects the semantic difference between executing code in a native shell versus a containerized, restricted environment. Native execution gives access to system utilities and build tools that sandboxed environments restrict.

Agent                    Pass rate  Cost/pass  Pass  Fail  Build error
codex-gpt-5.2            62.7%      $5.30      79    12    35
codex-gpt-5.2-codex      49.2%      $6.65      63    27    38
cursor-gpt-5.2           50.8%      $6.26      63    34    27
cursor-gpt-5.3-codex     50.4%      $6.16      64    40    23
opencode-gpt-5.2         51.6%      $6.65      63    11    48
opencode-gpt-5.2-codex   37.8%      $8.73      48    32    47

(Each agent card also includes a personality radar chart over Accuracy, Speed, Efficiency, Precision, Breadth, and Reliability.)
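The reported pass rates can be recomputed directly from each agent's raw outcome counts: passes divided by all attempted samples (pass + fail + build error). A minimal sketch in Python; the dict layout is illustrative, not the benchmark's actual data format:

```python
# Recompute pass rates from the per-agent outcome counts reported above.
# The agent names and counts come from this page; the structure is illustrative.
agents = {
    "codex-gpt-5.2":          {"pass": 79, "fail": 12, "build": 35},
    "codex-gpt-5.2-codex":    {"pass": 63, "fail": 27, "build": 38},
    "cursor-gpt-5.2":         {"pass": 63, "fail": 34, "build": 27},
    "cursor-gpt-5.3-codex":   {"pass": 64, "fail": 40, "build": 23},
    "opencode-gpt-5.2":       {"pass": 63, "fail": 11, "build": 48},
    "opencode-gpt-5.2-codex": {"pass": 48, "fail": 32, "build": 47},
}

def pass_rate(counts: dict) -> float:
    """Pass rate in percent over all attempted samples."""
    total = counts["pass"] + counts["fail"] + counts["build"]
    return round(100 * counts["pass"] / total, 1)

for name, counts in agents.items():
    print(f"{name}: {pass_rate(counts)}%")
```

Running this reproduces the table's pass rates (e.g. 79 / 126 = 62.7% for codex-gpt-5.2), which confirms the headline figures are consistent with the outcome counts.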

GPT-5.3 does not improve on GPT-5.2

OpenAI released GPT-5.3 after GPT-5.2. The benchmark shows GPT-5.3 does not outperform GPT-5.2 when controlling for environment.

Comparing within the same wrapper: cursor-gpt-5.2 (50.8%) outperforms cursor-gpt-5.3-codex (50.4%), a 0.4pp decline. This gap likely reflects GPT-5.3 being evaluated in a sandboxed runtime rather than a native CLI, not a regression in the model itself.

If GPT-5.3 were run in a native context (no such configuration was tested), it would likely match or exceed GPT-5.2's performance. But in the available data, any capability gains in GPT-5.3 are offset by its sandboxed evaluation environment.

Sandbox penalty is consistent

Sandboxed runtimes impose a consistent penalty across all OpenAI configurations:

  • GPT-5.2 native (62.7%) → GPT-5.2 sandbox (49.2%): 13.5pp drop
  • GPT-5.2 native (62.7%) → OpenCode sandbox (37.8%): 24.9pp drop (additional 11.4pp from OpenCode wrapper)
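The decomposition above can be written out explicitly: the total 24.9pp drop splits into a sandbox penalty and an additional wrapper penalty. A sketch, using the pass rates reported on this page (all values in percentage points):

```python
# Decompose the total drop from native execution to OpenCode-sandboxed execution.
native = 62.7            # codex-gpt-5.2, native CLI
sandbox = 49.2           # codex-gpt-5.2-codex, sandboxed runtime
opencode_sandbox = 37.8  # opencode-gpt-5.2-codex, OpenCode wrapper + sandbox

sandbox_penalty = round(native - sandbox, 1)            # sandboxing alone
wrapper_penalty = round(sandbox - opencode_sandbox, 1)  # added by the OpenCode wrapper
total_drop = round(native - opencode_sandbox, 1)        # combined effect

print(f"sandbox: {sandbox_penalty}pp, wrapper: {wrapper_penalty}pp, "
      f"total: {total_drop}pp")
```

This prints a 13.5pp sandbox penalty plus an 11.4pp wrapper penalty, summing to the 24.9pp total drop.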

The pattern is clear: remove access to system tools and build infrastructure, and patch quality declines sharply. This is not a model limitation but an environment constraint.

Integration with CVE-Agent-Bench

Six OpenAI-model agents test GPT variants across different environments:

  • Native Codex CLI (highest performing)
  • Cursor IDE wrapper (native execution)
  • OpenCode CLI wrapper (multiple variants)
  • Sandboxed runtimes (restricted environment)

The comparison isolates different variables. Model version (5.2 vs 5.3) shows minimal impact when environment is constant. Environment (native vs sandbox) shows 13-25pp impact. CLI wrapper choice (native vs OpenCode) shows 11pp impact.

This multilayered evaluation enables understanding what drives performance: model capability, execution environment, or orchestration layer.

See Full benchmark results | Methodology | Agent Profiles

FAQ

Why is the Aardvark score different from CVE-Agent-Bench?

Internal benchmarks and independent third-party benchmarks measure different signal. Aardvark is narrower (code generation only). CVE-Agent-Bench includes discovery, analysis, and remediation with real-world vulnerability context.

[RELATED TOPICS]

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.