OpenAI models and verified performance in CVE-Agent-Bench
Aardvark's self-reported 92% vs XOR's independent 62.7% for codex-gpt-5.2, plus the GPT-5.3 comparison and the consistent sandbox penalty.
OpenAI's internal benchmarks vs independent evaluation
OpenAI published Aardvark as a security-specific benchmark for code generation, reporting a 92% pass rate on its internal evaluation. CVE-Agent-Bench tests the same models on a separate 128-sample dataset with its own test harnesses and containerized environments.
For codex-gpt-5.2, OpenAI's Aardvark reports a 92% internal pass rate, while XOR's independent evaluation reports a 62.7% pass rate on CVE samples with automated patch validation. The 29.3pp gap reflects different evaluation methodologies, different sample difficulty, and different success criteria.
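The gap above is plain percentage-point arithmetic. A minimal sketch (the scores come from the article; the helper function is illustrative, not part of either benchmark's tooling):

```python
def pp_gap(internal_pct: float, independent_pct: float) -> float:
    """Gap between two pass rates, in percentage points."""
    return round(internal_pct - independent_pct, 1)

# Aardvark internal (92.0%) vs XOR independent (62.7%) for codex-gpt-5.2
print(pp_gap(92.0, 62.7))  # 29.3
```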
This is not a criticism of Aardvark. It shows that internal benchmarks and independent third-party benchmarks measure different signals. Aardvark is narrower (code generation only) and uses self-contained test cases. CVE-Agent-Bench includes discovery, analysis, and remediation with real-world vulnerability context.
GPT-5.2 as baseline: Native CLI outperforms sandboxed runtime
The codex-gpt-5.2 native CLI achieves 62.7%, the highest pass rate in the entire benchmark across all agent-model pairs. The closest competitor is cursor-opus-4.6 (62.5%), a near-tie.
When the same model runs in a sandboxed runtime (codex-gpt-5.2-codex), performance drops to 49.2%, a 13.5pp decline. This reflects the semantic difference between executing code in a native shell versus a containerized, restricted environment. Native execution gives access to system utilities and build tools that sandboxed environments restrict.
Raw results by configuration (pass / fail / build counts):

Agent                     Provider   Pass   Fail   Build
codex-gpt-5.2             OpenAI       79     12      35
codex-gpt-5.2-codex       OpenAI       63     27      38
cursor-gpt-5.2            OpenAI       63     34      27
cursor-gpt-5.3-codex      OpenAI       64     40      23
opencode-gpt-5.2          OpenAI       63     11      48
opencode-gpt-5.2-codex    OpenAI       48     32      47
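The headline pass rates follow from these raw counts, assuming pass rate = pass / (pass + fail + build) — an assumption of ours that reproduces the quoted figures for the Codex and OpenCode codex configurations; the helper function is illustrative:

```python
def pass_rate(passed: int, failed: int, build: int) -> float:
    """Pass rate as a percentage of all evaluations, rounded to 1 decimal."""
    total = passed + failed + build
    return round(100 * passed / total, 1)

print(pass_rate(79, 12, 35))  # codex-gpt-5.2          -> 62.7
print(pass_rate(63, 27, 38))  # codex-gpt-5.2-codex    -> 49.2
print(pass_rate(48, 32, 47))  # opencode-gpt-5.2-codex -> 37.8
```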
GPT-5.3 does not improve on GPT-5.2
OpenAI released GPT-5.3 after GPT-5.2, but in this benchmark GPT-5.3 does not outperform GPT-5.2 in the closest available comparison.
Comparing the two Cursor configurations: cursor-gpt-5.2 (51.6%) outperforms cursor-gpt-5.3-codex (50.4%), a 1.2pp decline. The comparison is not perfectly controlled, however: cursor-gpt-5.3-codex runs in a sandboxed runtime, while cursor-gpt-5.2 executes natively, so the decline may reflect the environment rather than the model.
If GPT-5.3 were run in a native context (no such configuration was tested), it might match or exceed GPT-5.2's performance. On the available data, though, any capability gains in GPT-5.3 are offset by its sandboxed evaluation environment.
Sandbox penalty is consistent
Sandboxed runtimes impose a consistent penalty across all OpenAI configurations:
- GPT-5.2 native (62.7%) → GPT-5.2 sandbox (49.2%): 13.5pp drop
- GPT-5.2 native (62.7%) → OpenCode sandbox (37.8%): 24.9pp drop (additional 11.4pp from OpenCode wrapper)
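The two bullets decompose into a runtime penalty plus a wrapper penalty. A sketch of that arithmetic (pass rates from the article; variable names are ours):

```python
native = 62.7            # codex-gpt-5.2, native CLI
sandbox = 49.2           # codex-gpt-5.2-codex, sandboxed runtime
opencode_sandbox = 37.8  # opencode-gpt-5.2-codex, OpenCode wrapper + sandbox

runtime_penalty = round(native - sandbox, 1)                 # 13.5 pp
total_penalty = round(native - opencode_sandbox, 1)          # 24.9 pp
wrapper_penalty = round(total_penalty - runtime_penalty, 1)  # 11.4 pp

print(runtime_penalty, wrapper_penalty, total_penalty)
```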
The pattern is clear: remove access to system tools and build infrastructure, and fix quality declines sharply. This is not a model limitation but an environment constraint.
Integration with CVE-Agent-Bench
Six agent configurations run OpenAI GPT variants across different environments:
- Native Codex CLI (highest performing)
- Cursor IDE wrapper (native execution)
- OpenCode CLI wrapper (multiple variants)
- Sandboxed runtimes (restricted environment)
The comparison isolates different variables. Model version (5.2 vs 5.3) shows minimal impact when environment is constant. Environment (native vs sandbox) shows 13-25pp impact. CLI wrapper choice (native vs OpenCode) shows 11pp impact.
This multilayered evaluation enables understanding what drives performance: model capability, execution environment, or orchestration layer.
FAQ
Why is the Aardvark score different from CVE-Agent-Bench?
Internal benchmarks and independent third-party benchmarks measure different signals. Aardvark is narrower (code generation only). CVE-Agent-Bench includes discovery, analysis, and remediation with real-world vulnerability context.
Codex GPT-5.2 — CVE-Agent-Bench profile
62.7% pass rate at $5.30 per fix. OpenAI model via Codex CLI. Highest accuracy across all agents.
Codex GPT-5.2 (Codex runtime) — CVE-Agent-Bench profile
49.2% pass rate at $6.65 per fix. Same GPT-5.2 model, different runtime environment. 136 evaluations.
Cursor GPT-5.2 — CVE-Agent-Bench profile
51.6% pass rate at $6.26 per fix. OpenAI GPT-5.2 via Cursor IDE. 128 evaluations.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.