Validation Process
25 questions we ran against our own data before publishing. They challenge assumptions, explore implications, and extend findings.
Why publish validation questions
Most benchmarks publish rankings. We publish the questions we asked to test whether those rankings are valid. If you find a question we missed, tell us.
Three question types
Challenge questions test assumptions. Assume questions explore what the data implies. Build-on questions extend findings to new scenarios.
How we validated the benchmark before publishing
25 structured questions across three types. Each targets a specific claim, from pass-rate validity to cost-model calibration. We ran these against our own data before publishing results. You can read every question and answer below.
The validation process is not optional. We asked hard questions about the data ourselves before releasing it. Could the pass rates be inflated by easy bug selection? Are the cost numbers accurate or estimates? Does difficulty scoring matter? Every question forced us to check the data and document the answer.
This is the opposite of black-box benchmarking. We did not hide the validation work. Every question and answer is published so you can see exactly what we tested and what we found. If you disagree with our answers, you have the evidence to argue back.
All questions by type
Three question types stress-test the benchmark from different angles, and the tables below group the 25 questions accordingly. Challenge questions ask: Is this true? Are we measuring what we think we are? Do confounds exist? Assume questions ask: If this is true, what else must be true? Build-on questions ask: What would this imply for future work? Together they form a coherent stress-test of the entire benchmark.
Challenge questions
| # | Question | Principle |
|---|---|---|
| 1 | Is the 66.1% pass rate for cursor-opus-4.6 inflated by easy samples that all agents pass, or does it reflect genuine superiority? | codex-pass-rate-validity |
| 2 | Do the near-identical patches across agents indicate training data contamination rather than genuine problem-solving convergence? | patch-similarity-contamination |
| 3 | Is the 49.1% overall pass rate meaningful given that 305+ results are infrastructure failures rather than agent failures? | infra-bias-validity |
| 4 | Are the per-agent rankings stable across different random samples of CVEs, or would different sample selection produce completely different orderings? (see the bootstrap sketch after this table) | ranking-stability |
| 5 | Does the OpenCode wrapper add genuine agent value, or is it primarily passing through to the same underlying model with overhead? | opencode-value-add |
| 6 | Is the Gemini cost-per-fix real or an artifact of turn_heuristic estimation — what happens with measured token data? | gemini-cost-validity |
| 7 | Does the OpenCode wrapper overhead justify ANY quality improvement, or is it pure computational waste? | wrapper-overhead-justification |
| 8 | Are cost/pass efficiency rankings stable if infrastructure failures are excluded from denominators? | efficiency-ranking-stability |
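Challenge question 4 is directly testable with a bootstrap: resample the CVE set with replacement, recompute each agent's pass rate, and count how often each agent finishes first. A minimal sketch, assuming per-CVE pass/fail results keyed by agent name (the data shape here is illustrative, not our pipeline's actual schema):

```python
import random
from collections import Counter

def bootstrap_top_rank(results: dict[str, dict[str, bool]], n_resamples: int = 1000) -> Counter:
    """Count how often each agent ranks first across bootstrap resamples of the CVE set.

    results: agent name -> {cve_id: passed}; every agent must cover the same CVEs.
    """
    cves = list(next(iter(results.values())))
    top = Counter()
    for _ in range(n_resamples):
        sample = random.choices(cves, k=len(cves))   # resample CVEs with replacement
        wins = {agent: sum(passed[c] for c in sample) for agent, passed in results.items()}
        top[max(wins, key=wins.get)] += 1            # credit this resample's leader
    return top
```

If the ranking is stable, one agent should win the large majority of resamples; a near-even split with the runner-up would mean the ordering is an artifact of sample selection.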
Assume questions
| # | Question | Principle |
|---|---|---|
| 1 | If training contamination IS present in the benchmark, which specific CVE patterns would show the strongest signals? | contamination-signal-hypothesis |
| 2 | Assuming the per-sample difficulty ratings (floor/ceiling/hard/medium/easy) are accurate, what agent characteristics predict success on hard vs easy CVEs? | difficulty-prediction-model |
| 3 | If cost-effectiveness is the primary optimization metric (not just accuracy), which agent configuration offers the best bang-for-buck? | cost-effectiveness-optimization |
| 4 | Assuming the oracle ceiling of 74.3% (any agent passes) is a hard limit, what characterizes the 25.7% of CVEs that NO agent can fix? | oracle-ceiling-characterization |
| 5 | If behavioral patterns extracted from RLM data correlate with outcomes, which patterns are most predictive of success? | behavior-outcome-correlation |
| 6 | What is the optimal sequential dispatch strategy for a multi-agent ensemble maximizing coverage at minimum cost? (modeled in the sketch after this table) | ensemble-dispatch-strategy |
| 7 | Given near-identical patches across agents, what is the marginal value of the 4th through 9th agent in an ensemble? | marginal-agent-value |
| 8 | If early stopping at turn 5-8 saves compute on doomed attempts, what's the break-even ROI at 1000 CVEs/year? | early-stopping-roi |
| 9 | At what infrastructure failure rate does retry cost dominate agent selection decisions? | infra-failure-break-even |
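Assume question 6 (and the waterfall protocol proposed in the next table) reduces to a short expected-cost model: try agents cheapest first and pay for an agent only when every cheaper one has failed. The sketch below assumes agents succeed independently; the near-identical patches noted above suggest failures are correlated, so treat the coverage it reports as an upper bound. The names, costs, and pass rates in the example are illustrative, not measured values.

```python
def waterfall_cost(agents: list[tuple[str, float, float]]) -> tuple[float, float]:
    """Expected cost per CVE and total coverage for cheapest-first dispatch.

    agents: (name, cost_per_attempt, pass_rate), ordered cheapest first.
    Independence between agents is assumed; correlated failures (which
    near-identical patches imply) would lower the real coverage.
    """
    expected_cost, p_unsolved = 0.0, 1.0
    for _, cost, pass_rate in agents:
        expected_cost += p_unsolved * cost   # pay only if no cheaper agent fixed it
        p_unsolved *= 1.0 - pass_rate        # chance this agent also fails
    return expected_cost, 1.0 - p_unsolved

# Illustrative numbers only; the $2.64-$52 attempt costs echo the cost page,
# but the pass rates here are made up for the example.
cost, coverage = waterfall_cost(
    [("cheap", 2.64, 0.55), ("mid", 12.00, 0.60), ("premium", 52.00, 0.66)]
)
print(f"${cost:.2f} expected per CVE, {coverage:.1%} coverage")
```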
Build-on questions
| # | Question | Principle |
|---|---|---|
| 1 | Can the benchmark methodology be extended to measure patch quality beyond binary pass/fail (e.g., minimal changes, semantic preservation)? | patch-quality-metrics |
| 2 | How would results change if agents were given access to the project's existing test suite during patch generation (not just for evaluation)? | test-suite-access-impact |
| 3 | Could the RLM behavioral analysis be used to create an early-stopping criterion that saves compute on doomed attempts? | early-stopping-signal |
| 4 | What would CVE-Bench v2 look like if it controlled for training contamination and infrastructure variance using double-blind evaluation? | benchmark-v2-design |
| 5 | Can cross-agent agreement patterns be used as a confidence signal for patch correctness without running Docker evaluation? | agreement-confidence-signal |
| 6 | Build a cost-performance Pareto frontier for production agent selection — which agents are dominated and should never be deployed? (see the dominance sketch after this table) | pareto-frontier-agent-selection |
| 7 | Design a sequential waterfall dispatch protocol (cheapest first, escalate on failure) and model its expected cost per fix | waterfall-dispatch-protocol |
| 8 | Model the break-even point where automated agentic patching becomes cheaper than manual developer fixes (assume $150/hr developer cost) | agentic-vs-manual-breakeven |
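Build-on question 6 is a standard dominance check: an agent is dominated when some other agent costs no more per fix and passes no less, and is strictly better on at least one axis. A minimal sketch, assuming each agent maps to a (cost_per_fix, pass_rate) pair:

```python
def pareto_frontier(agents: dict[str, tuple[float, float]]) -> list[str]:
    """Return agents not dominated on the (cost_per_fix, pass_rate) plane.

    Agent A dominates agent B when A costs no more, passes no less, and is
    strictly better on at least one axis; dominated agents should never be
    deployed, since an alternative beats them on both criteria.
    """
    frontier = []
    for name, (cost, rate) in agents.items():
        dominated = any(
            c <= cost and r >= rate and (c < cost or r > rate)
            for other, (c, r) in agents.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

The break-even question in the last row then uses the same numbers: automated patching wins whenever expected cost divided by coverage falls below $150 times the developer hours a manual fix takes.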
Read the answers
The economics questions produced 10 findings with actionable verdicts. The validation questions confirmed the benchmark holds up under scrutiny.
Explore more
- Results & leaderboard: the pass rates the questions validated
- Methodology: how we scored and what we excluded
FAQ
How does XOR validate its own benchmark?
We run 25 structured questions against every dataset before publishing. Questions challenge assumptions, test implications, and extend findings into new territory.
Can I see the validation questions?
Yes. All 25 questions and their answers are published on this page. We show our work so you can decide if the methodology holds up.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Cost Analysis
10 findings on what AI patching costs and whether it is worth buying. 1,920 evaluations analyzed.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.