Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Test methodology
Each of the 128 vulnerabilities is packaged with: a known-vulnerable environment, a verifier, and a known-good fix for comparison. Agents run in isolated environments.
Scoring
Pass rate is the primary metric. Cost per verified fix, difficulty score, and failure classification provide additional dimensions for agent comparison.
Evaluation Pipeline
Each bug in the benchmark has three components: a container with the vulnerable code, a proof of concept (PoC) that triggers the bug, and an automated test harness. Agents receive the vulnerable code and must produce a fix. The fix is applied, the bug is re-triggered, and the outcome is recorded.
The pipeline runs deterministically. Every agent gets the exact same starting state: same vulnerable code, same environment, same tools. We capture everything the agent does - every file read, every tool call, every token. Then we apply the patch and check if the bug disappeared. Either it did or it did not. No ambiguity.
Isolation matters for fairness. Each agent runs in a fresh container to prevent cross-contamination. Infrastructure failures during the run (out of memory, network timeout) are recorded separately from agent failures (agent refused to patch, patch breaks the code). This distinction drives the analytics.
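The failure separation described above can be sketched in a few lines; the outcome labels, field names, and `classify` function here are illustrative assumptions, not XOR's actual implementation:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels: infrastructure errors are tracked separately from
# agent failures so they can be excluded from scoring.
INFRA_ERRORS = {"oom", "network_timeout"}

@dataclass
class RunResult:
    agent: str
    cve: str
    error: Optional[str]                     # None if the run completed
    poc_crashes_after_patch: Optional[bool]  # None if no patch was produced

def classify(result: RunResult) -> str:
    """Bucket a run: infra failures are excluded, agent failures count."""
    if result.error in INFRA_ERRORS:
        return "infra"           # environment failure, excluded from scoring
    if result.error is not None:
        return "agent_failure"   # e.g. agent refused to patch or broke the build
    return "fail" if result.poc_crashes_after_patch else "pass"

print(classify(RunResult("agent-a", "CVE-0001", "oom", None)))  # infra
print(classify(RunResult("agent-a", "CVE-0001", None, False)))  # pass
```

The key design choice this illustrates: an out-of-memory kill or network timeout never counts against the agent, so the analytics compare agents on their patches alone.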
[EVALUATION FACTORY]
Three-stage pipeline: generate patches, reproduce vulnerabilities, verify correctness.
Generate
AI agents generate patches for CVE samples
Reproduce
Verify PoC and patch correctness
Patch
Test patches against test suites
How each CVE is verified
Every evaluation follows a 4-step Docker-based verification pipeline.
1. Reproduce
Run the PoC (proof of concept) against the unpatched vulnerable code. Confirm the crash is reproducible.
2. Patch
Apply the agent's git diff inside a Docker container at the vulnerable commit.
3. Build
Compile the patched source with clang++ and memory safety instrumentation.
4. Verify
Re-run the same PoC against the patched binary. If the PoC no longer crashes, the patch passes.
Scoring: Pass = PoC no longer crashes (+1). Fail = PoC still crashes (0). Build = patch doesn't compile (-1). Infra = environment failure (excluded from scoring).
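The scoring rule reduces to a small lookup. A minimal sketch, where the four outcome labels mirror the cases above but the function itself is illustrative:

```python
from typing import Optional

def score(outcome: str) -> Optional[int]:
    """Map a verification outcome to a score; None means excluded."""
    return {
        "pass": 1,      # PoC no longer crashes after the patch
        "fail": 0,      # PoC still crashes
        "build": -1,    # patch does not compile
        "infra": None,  # environment failure, excluded from scoring
    }[outcome]

print(score("pass"), score("build"))  # 1 -1
```

Note the asymmetry: a patch that fails to compile scores below a patch that compiles but does not fix the bug, so agents are penalized for breaking the build.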
Validity Checks
We investigated 5 potential confounds that could invalidate benchmark results.
The checks ask: Could the results be artifacts of how we set up the test? Could easy bugs be overrepresented? Could the test harness bias agents toward certain approaches? Are we measuring what we think we are measuring? These are not abstract questions - they drove changes to the benchmark before we published.
- Training-data contamination: none of our test bugs appear in agent training data.
- Hints: agents receive only the vulnerable code and a script to trigger the bug. No hints about the fix.
- Scoring accuracy: automated safety checks produce deterministic pass/fail. A manual audit of 50 samples confirmed 100% scoring accuracy.
- Sampling variance: temperature > 0 introduces run-to-run variance. We run a single attempt per agent, matching real-world deployment conditions.
- Infrastructure failures: 128 runs (6.7% of 1,920 evaluations) failed for environmental reasons. Each was root-cause classified and excluded from scoring.
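The infrastructure-failure figures above are internally consistent: 128 of 1,920 evaluations is roughly 6.7%, leaving 1,792 scored runs. A quick check, using only the numbers quoted in the text:

```python
# Figures quoted in the validity checks above.
total_runs = 1920
infra_failures = 128

infra_rate = infra_failures / total_runs
scored_runs = total_runs - infra_failures  # runs that count toward pass rate

print(f"{infra_rate:.1%}")  # 6.7%
print(scored_runs)          # 1792
```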
Analysis Reference
Additional visualization and analysis tools are available for deeper investigation of benchmark results.
[NEXT STEPS]
This methodology is reproducible
We can run this exact process on your codebase, in your CI pipeline. Same isolation, same safety checks, same deterministic scoring.
Explore more
- Agent leaderboard - pass rates and cost analysis
- Agent profiles - how agents differ and where they agree
- Economics - cost per fix and ensemble analysis
- Bug complexity - difficulty bands and ceiling analysis
- Council deliberations - multi-perspective review of every finding
- Agent strategies - behavioral clusters from session analysis
FAQ
How are vulnerabilities selected?
Each vulnerability is curated with a reproduction environment, test harness, and ground-truth fix. 128 samples so far, scaling to 6,138+. No cherry-picking.
How are tests reproducible?
Each vulnerability is packaged with a known-vulnerable environment, a test harness, and a known-good fix. Anyone can reproduce the results.
What counts as a pass?
A pass means the agent's patch resolves the vulnerability when tested against the verifier. The original bug must be confirmed, and the fix must not introduce regressions.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.
Cost Analysis
10 findings on what AI patching costs and whether it is worth buying. 1,920 evaluations analyzed.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.