
OpenAI models and verified performance in CVE-Agent-Bench

Aardvark's self-reported 92% vs XOR's independent 62.7% for codex-gpt-5.2; non-determinism via trajectory clustering; and GPT-5.3 cyber capabilities.

OpenAI's internal benchmarks vs independent evaluation

OpenAI published Aardvark as a security-specific benchmark for code generation. The company reported 92% pass rate on their internal evaluation. CVE-Agent-Bench tests the same models on a different 128-sample dataset with different test harnesses and containerized environments.

For codex-gpt-5.2, OpenAI's Aardvark reports 92% internal pass rate. XOR's independent evaluation reports 62.7% pass rate on CVE samples with automated patch validation. The 29.3pp gap reflects different evaluation methodologies, different sample difficulty, and different success criteria.

This is not a criticism of Aardvark. It shows that internal benchmarks and independent third-party benchmarks measure different signal. Aardvark is narrower (code generation only) and uses self-contained test cases. CVE-Agent-Bench includes discovery, analysis, and remediation with real-world vulnerability context.

GPT-5.2 as baseline: Native CLI outperforms sandboxed runtime

The codex-gpt-5.2 native CLI achieves 62.7%, the highest pass rate in the entire benchmark across all agent-model pairs. The closest competitor is cursor-opus-4.6 (62.5%), a near-tie.

When the same model runs in a sandboxed runtime (codex-gpt-5.2-codex), performance drops to 49.2%, a 13.5pp decline. This reflects the semantic difference between executing code in a native shell versus a containerized, restricted environment. Native execution gives access to system utilities and build tools that sandboxed environments restrict.

Agent                    Pass rate  Cost/pass  Pass  Fail  Build error
codex-gpt-5.2            62.7%      $5.30      79    12    35
codex-gpt-5.2-codex      49.2%      $6.65      63    27    38
cursor-gpt-5.2           50.8%      $6.26      63    34    27
cursor-gpt-5.3-codex     50.4%      $6.16      64    40    23
opencode-gpt-5.2         51.6%      $6.65      63    11    48
opencode-gpt-5.2-codex   37.8%      $8.73      48    32    47

(Each agent card also includes a personality radar chart over Accuracy, Speed, Efficiency, Precision, Breadth, and Reliability.)
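The reported pass rates can be recomputed directly from each agent's raw outcome counts: passes divided by all attempted samples (pass + fail + build error). A minimal sketch in Python; the dict layout is illustrative, not the benchmark's actual data format:

```python
# Recompute pass rates from the per-agent outcome counts reported above.
# The agent names and counts come from this page; the structure is illustrative.
agents = {
    "codex-gpt-5.2":          {"pass": 79, "fail": 12, "build": 35},
    "codex-gpt-5.2-codex":    {"pass": 63, "fail": 27, "build": 38},
    "cursor-gpt-5.2":         {"pass": 63, "fail": 34, "build": 27},
    "cursor-gpt-5.3-codex":   {"pass": 64, "fail": 40, "build": 23},
    "opencode-gpt-5.2":       {"pass": 63, "fail": 11, "build": 48},
    "opencode-gpt-5.2-codex": {"pass": 48, "fail": 32, "build": 47},
}

def pass_rate(counts: dict) -> float:
    """Pass rate in percent over all attempted samples."""
    total = counts["pass"] + counts["fail"] + counts["build"]
    return round(100 * counts["pass"] / total, 1)

for name, counts in agents.items():
    print(f"{name}: {pass_rate(counts)}%")
```

Running this reproduces the table's pass rates (e.g. 79 / 126 = 62.7% for codex-gpt-5.2), which confirms the headline figures are consistent with the outcome counts.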

GPT-5.3 does not improve on GPT-5.2

OpenAI released GPT-5.3 after GPT-5.2. The benchmark shows GPT-5.3 does not outperform GPT-5.2 when controlling for environment.

Comparing within the same wrapper: cursor-gpt-5.2 (50.8%) outperforms cursor-gpt-5.3-codex (50.4%), a 0.4pp decline. This gap likely reflects GPT-5.3 being evaluated in a sandboxed runtime rather than a native CLI, not a regression in the model itself.

If GPT-5.3 were run in a native context (no such configuration was tested), it would likely match or exceed GPT-5.2's performance. But in the available data, any capability gains in GPT-5.3 are offset by its sandboxed evaluation environment.

Sandbox penalty is consistent

Sandboxed runtimes impose a consistent penalty across all OpenAI configurations:

  • GPT-5.2 native (62.7%) → GPT-5.2 sandbox (49.2%): 13.5pp drop
  • GPT-5.2 native (62.7%) → OpenCode sandbox (37.8%): 24.9pp drop (additional 11.4pp from OpenCode wrapper)
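The decomposition above can be written out explicitly: the total 24.9pp drop splits into a sandbox penalty and an additional wrapper penalty. A sketch, using the pass rates reported on this page (all values in percentage points):

```python
# Decompose the total drop from native execution to OpenCode-sandboxed execution.
native = 62.7            # codex-gpt-5.2, native CLI
sandbox = 49.2           # codex-gpt-5.2-codex, sandboxed runtime
opencode_sandbox = 37.8  # opencode-gpt-5.2-codex, OpenCode wrapper + sandbox

sandbox_penalty = round(native - sandbox, 1)            # sandboxing alone
wrapper_penalty = round(sandbox - opencode_sandbox, 1)  # added by the OpenCode wrapper
total_drop = round(native - opencode_sandbox, 1)        # combined effect

print(f"sandbox: {sandbox_penalty}pp, wrapper: {wrapper_penalty}pp, "
      f"total: {total_drop}pp")
```

This prints a 13.5pp sandbox penalty plus an 11.4pp wrapper penalty, summing to the 24.9pp total drop.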

The pattern is clear: remove access to system tools and build infrastructure, and patch quality declines sharply. This is not a model limitation but an environment constraint.

Integration with CVE-Agent-Bench

Six OpenAI-model agents test GPT variants across different environments:

  • Native Codex CLI (highest performing)
  • Cursor IDE wrapper (native execution)
  • OpenCode CLI wrapper (multiple variants)
  • Sandboxed runtimes (restricted environment)

The comparison isolates different variables. Model version (5.2 vs 5.3) shows minimal impact when environment is constant. Environment (native vs sandbox) shows 13-25pp impact. CLI wrapper choice (native vs OpenCode) shows 11pp impact.

This multilayered evaluation enables understanding what drives performance: model capability, execution environment, or orchestration layer.

See Full benchmark results | Methodology | Agent Profiles

FAQ

Why is the Aardvark score different from CVE-Agent-Bench?

Internal benchmarks and independent third-party benchmarks measure different signal. Aardvark is narrower (code generation only). CVE-Agent-Bench includes discovery, analysis, and remediation with real-world vulnerability context.

[RELATED TOPICS]

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.