[METHODOLOGY]

Benchmark Methodology

How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.

Test methodology

Each of the 128 vulnerabilities is packaged with: a known-vulnerable environment, a verifier, and a known-good fix for comparison. Agents run in isolated environments.

Scoring

Pass rate is the primary metric. Cost per verified fix, difficulty score, and failure classification provide additional dimensions for agent comparison.
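The secondary metrics can be derived directly from raw run records. A minimal sketch, with hypothetical field names and illustrative figures (not XOR's actual schema):

```python
# Hypothetical sketch of deriving pass rate and cost per verified fix
# from per-run records. Field names and dollar amounts are illustrative.
runs = [
    {"agent": "agent-a", "passed": True,  "cost_usd": 0.40},
    {"agent": "agent-a", "passed": False, "cost_usd": 0.55},
    {"agent": "agent-a", "passed": True,  "cost_usd": 0.25},
]

total_cost = sum(r["cost_usd"] for r in runs)
verified_fixes = sum(r["passed"] for r in runs)

pass_rate = verified_fixes / len(runs)
# Total spend divided by verified fixes only: failed attempts still cost money,
# which is why this metric separates agents with similar pass rates.
cost_per_verified_fix = total_cost / verified_fixes

print(f"pass rate: {pass_rate:.1%}, cost per verified fix: ${cost_per_verified_fix:.2f}")
```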

1,920 total evaluations
50.5% overall pass rate
1,992 patches analyzed
Benchmark status: VALID

Evaluation Pipeline

Each bug in the benchmark has three components: a container with the vulnerable code, a way to trigger the bug, and an automated test setup. Agents receive the vulnerable code and must produce a fix. The fix is applied, the bug is re-triggered, and the outcome is recorded.

The pipeline runs deterministically. Every agent gets the exact same starting state: same vulnerable code, same environment, same tools. We capture everything the agent does - every file read, every tool call, every token. Then we apply the patch and check if the bug disappeared. Either it did or it did not. No ambiguity.
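The apply-and-re-trigger step reduces to a small pure function. A sketch under two assumptions: `trigger_bug` and `apply_patch` are hypothetical helpers, not names from XOR's actual harness.

```python
# Minimal sketch of the verify loop. trigger_bug(code) returns True if the
# PoC still fires against the code; apply_patch(code, patch) returns the
# patched code. Both are stand-ins for the real harness.
def evaluate(vulnerable_code: str, patch: str, trigger_bug, apply_patch) -> str:
    # Precondition: the bug must reproduce on the unpatched code,
    # otherwise the sample itself is invalid.
    if not trigger_bug(vulnerable_code):
        return "invalid-sample"
    patched = apply_patch(vulnerable_code, patch)
    # Binary outcome: the same trigger either still fires or it does not.
    return "fail" if trigger_bug(patched) else "pass"

# Toy stand-ins: the "bug" is the presence of a marker string.
crashes = lambda code: "BUG" in code
fix = lambda code, patch: code.replace("BUG", patch)

print(evaluate("x = BUG", "0", crashes, fix))  # → pass
```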

Isolation matters for fairness. Each agent runs in a fresh container to prevent cross-contamination. Infrastructure failures during the run (out of memory, network timeout) are recorded separately from agent failures (agent refused to patch, patch breaks the code). This distinction drives the analytics.

[EVALUATION FACTORY]

Three-stage pipeline: generate patches, reproduce vulnerabilities, verify correctness.

Generate

AI agents generate patches for CVE samples

128 samples

Reproduce

Verify PoC and patch correctness

1920 evals

Patch

Test patches against test suites

50.5% pass rate

How each CVE is verified

Every evaluation follows a 4-step Docker-based verification pipeline.

1. Reproduce

Run the PoC (proof of concept) against the unpatched vulnerable code. Confirm the crash is reproducible.

2. Patch

Apply the agent's git diff inside a Docker container at the vulnerable commit.

3. Build

Compile the patched source with clang++ and memory safety instrumentation.

4. Verify

Re-run the same PoC against the patched binary. If the PoC no longer crashes, the patch passes.

Scoring:
- Pass: PoC no longer crashes (+1)
- Fail: PoC still crashes (0)
- Build: patch doesn't compile (-1)
- Infra: environment failure (excluded)
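The scoring rule above can be sketched as a small function. The outcome label names are ours; only the point values and the exclusion of infra runs come from the rule as stated.

```python
# Sketch of the stated scoring rule. Each run carries one of four
# outcome labels; "infra" runs are excluded before any aggregation.
SCORES = {
    "pass": 1,    # PoC no longer crashes
    "fail": 0,    # PoC still crashes
    "build": -1,  # patch does not compile
}

def score_runs(outcomes):
    """Return (total score, pass rate), excluding infra failures entirely."""
    counted = [o for o in outcomes if o != "infra"]
    total = sum(SCORES[o] for o in counted)
    pass_rate = sum(o == "pass" for o in counted) / len(counted)
    return total, pass_rate

print(score_runs(["pass", "pass", "fail", "build", "infra"]))  # → (1, 0.5)
```

Note that excluding infra runs shrinks the denominator: the pass rate here is 2/4, not 2/5.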

Validity Checks

We investigated 5 potential confounds that could invalidate benchmark results.

The checks ask: Could the results be artifacts of how we set up the test? Could easy bugs be overrepresented? Could the test harness bias agents toward certain approaches? Are we measuring what we think we are measuring? These are not abstract questions - they drove changes to the benchmark before we published.

Training contamination[PASS]

None of our test bugs appear in agent training data.

Specification leakage[PASS]

Agents receive only the vulnerable code and a script to trigger the bug. No hints about the fix.

Scoring correctness[PASS]

Automated safety checks produce deterministic pass/fail. Manual audit of 50 samples confirmed 100% scoring accuracy.

Non-determinism[WARN]

Temperature >0 introduces variance. We run a single attempt per agent, matching real-world deployment conditions.

Infrastructure meta-failures[PASS]

128 infra failures (6.7% of evaluations), each root-cause classified and excluded from scoring.

Analysis Reference

Additional visualization and analysis tools are available for deeper investigation of benchmark results.

Confidence intervals (forest plot): per-agent pass rate with sample size.

codex/gpt/5.2: 62.7% (n=126)
cursor/opus-4.6: 62.5% (n=128)
claude/opus-4-6: 61.6% (n=125)
gemini/3.1-pro-preview: 58.7% (n=109)
opencode/gemini/3.1-pr...: 54.9% (n=122)
cursor/gpt-5.2: 51.6% (n=122)
opencode/gpt/5.2: 51.6% (n=122)
cursor/gpt-5.3-codex: 50.4% (n=127)
codex/gpt/5.2-codex: 49.2% (n=128)
opencode/claude/opus-4-6: 47.5% (n=122)
claude/opus-4-5: 45.7% (n=127)
cursor/composer-1.5: 45.2% (n=126)
gemini/3-pro-preview: 43.0% (n=128)
opencode/gpt/5.2-codex: 37.8% (n=127)
opencode/claude/opus-4-5: 36.8% (n=125)
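The interval construction behind the forest plot is not stated; the Wilson score interval is a standard choice for binomial proportions at these sample sizes, and a sketch of it looks like this (whether XOR uses exactly this interval is an assumption):

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Top entry above: 62.7% of n=126, i.e. roughly 79 passes.
lo, hi = wilson_interval(79, 126)
print(f"{79/126:.1%} [{lo:.1%}, {hi:.1%}]")
```

With n around 125, these intervals span roughly ±8 percentage points, which is why many adjacent agents in the plot are not clearly separated.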


[NEXT STEPS]

This methodology is reproducible

We can run this exact process on your codebase, in your CI pipeline. Same isolation, same safety checks, same deterministic scoring.

Explore more

FAQ

How are vulnerabilities selected?

Each vulnerability is curated with a reproduction environment, test harness, and ground-truth fix. 128 samples so far, scaling to 6,138+. No cherry-picking.

How are tests reproducible?

Each vulnerability is packaged with a known-vulnerable environment, a test harness, and a known-good fix. Anyone can reproduce the results.

What counts as a pass?

A pass means the agent's patch resolves the vulnerability when tested against the verifier. The original bug must be confirmed, and the fix must not introduce regressions.

[RELATED TOPICS]

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.