[VALIDATION]

Validation Process

25 questions we ran against our own data before publishing. They challenge assumptions, explore implications, and extend findings.

Why publish validation questions

Most benchmarks publish rankings. We publish the questions we asked to test whether those rankings are valid. If you find a question we missed, tell us.

Three question types

Challenge questions test assumptions. Assume questions explore what the data implies. Build-on questions extend findings to new scenarios.

25 research questions: 8 challenge assumptions, 9 explore implications, 8 extend findings.

How we validated the benchmark before publishing

25 structured questions across three types, each targeting a specific claim, from pass-rate validity to cost-model calibration. We ran them against our own data before publishing results. You can read every question and answer below.

The validation process is not optional. We asked hard questions about the data ourselves before releasing it. Could the pass rates be inflated by easy bug selection? Are the cost numbers accurate or estimates? Does difficulty scoring matter? Every question forced us to check the data and document the answer.

This is the opposite of black-box benchmarking. We did not hide the validation work. Every question and answer is published so you can see exactly what we tested and what we found. If you disagree with our answers, you have the evidence to argue back.

All questions by type

Three question types stress-test the benchmark from different angles. [CHALLENGE] questions test assumptions. [ASSUME] questions explore what the data implies. [BUILD ON] questions extend findings to new territory.

Challenge questions ask: Is this true? Are we measuring what we think we are? Do confounds exist? Assume questions ask: If this is true, what else must be true? Build-on questions ask: What would this imply for future work? Together they form a coherent stress-test of the entire benchmark.
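
The structure of each question is simple enough to encode directly, which is useful if you want to run the same protocol against your own data. A minimal sketch of one question record; the type and field names are illustrative, not the benchmark's published schema:

```python
from dataclasses import dataclass
from enum import Enum

class QuestionType(Enum):
    CHALLENGE = "challenge"   # is this true? are we measuring what we think?
    ASSUME = "assume"         # if this is true, what else must be true?
    BUILD_ON = "build_on"     # what would this imply for future work?

@dataclass
class ValidationQuestion:
    qtype: QuestionType
    question: str    # e.g. "Is the pass rate inflated by easy samples?"
    principle: str   # slug naming the claim under test, e.g. "ranking-stability"
```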

[CHALLENGE] Test assumptions (8)

1. Is the 66.1% pass rate for cursor-opus-4.6 inflated by easy samples that all agents pass, or does it reflect genuine superiority? (principle: codex-pass-rate-validity)
2. Do the near-identical patches across agents indicate training data contamination rather than genuine problem-solving convergence? (principle: patch-similarity-contamination)
3. Is the 49.1% overall pass rate meaningful given that 305+ results are infrastructure failures rather than agent failures? (principle: infra-bias-validity)
4. Are the per-agent rankings stable across different random samples of CVEs, or would different sample selection produce completely different orderings? A bootstrap check is sketched after this list. (principle: ranking-stability)
5. Does the OpenCode wrapper add genuine agent value, or is it primarily passing through to the same underlying model with overhead? (principle: opencode-value-add)
6. Is the Gemini cost-per-fix real or an artifact of turn_heuristic estimation — what happens with measured token data? (principle: gemini-cost-validity)
7. Does the OpenCode wrapper overhead justify ANY quality improvement, or is it pure computational waste? (principle: wrapper-overhead-justification)
8. Are cost/pass efficiency rankings stable if infrastructure failures are excluded from denominators? (principle: efficiency-ranking-stability)
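
Question 4 is the most mechanically checkable of the eight: resample the CVE set with replacement and count how often each agent keeps its full-sample rank. A minimal sketch, assuming per-sample pass/fail flags are available as a dict keyed by agent; the function name and data layout are illustrative, not the benchmark's published schema:

```python
import random

def bootstrap_rankings(results: dict[str, list[int]], n_boot: int = 1000,
                       seed: int = 0) -> dict[str, float]:
    """Fraction of bootstrap resamples in which each agent keeps the
    rank it holds on the full CVE sample."""
    rng = random.Random(seed)
    n = len(next(iter(results.values())))           # number of CVEs
    full_rank = sorted(results, key=lambda a: -sum(results[a]))
    stable = {agent: 0 for agent in results}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample CVEs with replacement
        rank = sorted(results, key=lambda a: -sum(results[a][i] for i in idx))
        for agent in results:
            if rank.index(agent) == full_rank.index(agent):
                stable[agent] += 1
    return {agent: count / n_boot for agent, count in stable.items()}
```

A stability score near 1.0 means an agent's rank survives almost any sample selection; scores near chance suggest the ordering is an artifact of which CVEs happened to be chosen.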
[ASSUME] Explore implications (9)

1. If training contamination IS present in the benchmark, which specific CVE patterns would show the strongest signals? (principle: contamination-signal-hypothesis)
2. Assuming the per-sample difficulty ratings (floor/ceiling/hard/medium/easy) are accurate, what agent characteristics predict success on hard vs easy CVEs? (principle: difficulty-prediction-model)
3. If cost-effectiveness is the primary optimization metric (not just accuracy), which agent configuration offers the best bang-for-buck? (principle: cost-effectiveness-optimization)
4. Assuming the oracle ceiling of 74.3% (any agent passes) is a hard limit, what characterizes the 25.7% of CVEs that NO agent can fix? (principle: oracle-ceiling-characterization)
5. If behavioral patterns extracted from RLM data correlate with outcomes, which patterns are most predictive of success? (principle: behavior-outcome-correlation)
6. What is the optimal sequential dispatch strategy for a multi-agent ensemble maximizing coverage at minimum cost? An expected-cost model is sketched after this list. (principle: ensemble-dispatch-strategy)
7. Given near-identical patches across agents, what is the marginal value of the 4th through 9th agent in an ensemble? (principle: marginal-agent-value)
8. If early stopping at turn 5-8 saves compute on doomed attempts, what's the break-even ROI at 1000 CVEs/year? (principle: early-stopping-roi)
9. At what infrastructure failure rate does retry cost dominate agent selection decisions? (principle: infra-failure-break-even)
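
Questions 6 and 7 reduce to the same arithmetic: in a cheapest-first waterfall, each agent's cost is weighted by the probability that every cheaper agent has already failed. A minimal sketch under an independence assumption that the near-identical-patches finding argues against, so treat the coverage figure as an upper bound; all agent names and numbers below are placeholders, not measured benchmark values:

```python
def waterfall(agents: list[tuple[str, float, float]]) -> tuple[float, float]:
    """agents: (name, cost_per_attempt_usd, pass_rate), tried cheapest first.
    Returns (expected cost per CVE, probability at least one agent fixes it)."""
    agents = sorted(agents, key=lambda a: a[1])     # cheapest first
    expected_cost, p_unfixed = 0.0, 1.0
    for _name, cost, pass_rate in agents:
        expected_cost += cost * p_unfixed           # pay only if all cheaper agents failed
        p_unfixed *= 1.0 - pass_rate
    return expected_cost, 1.0 - p_unfixed

# Placeholder numbers purely for illustration:
cost, coverage = waterfall([("cheap", 0.10, 0.40),
                            ("mid", 0.50, 0.55),
                            ("premium", 2.00, 0.66)])
print(f"expected ${cost:.2f}/CVE, {coverage:.1%} coverage")
# -> expected $0.94/CVE, 90.8% coverage (under independence)
```

The marginal-value question (7) falls out of the same model: if failures are correlated, each additional agent shrinks p_unfixed by less than its headline pass rate suggests.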
[BUILD ON] Extend findings (8)

1. Can the benchmark methodology be extended to measure patch quality beyond binary pass/fail (e.g., minimal changes, semantic preservation)? (principle: patch-quality-metrics)
2. How would results change if agents were given access to the project's existing test suite during patch generation (not just for evaluation)? (principle: test-suite-access-impact)
3. Could the RLM behavioral analysis be used to create an early-stopping criterion that saves compute on doomed attempts? (principle: early-stopping-signal)
4. What would CVE-Bench v2 look like if it controlled for training contamination and infrastructure variance using double-blind evaluation? (principle: benchmark-v2-design)
5. Can cross-agent agreement patterns be used as a confidence signal for patch correctness without running Docker evaluation? (principle: agreement-confidence-signal)
6. Build a cost-performance Pareto frontier for production agent selection — which agents are dominated and should never be deployed? A dominance check is sketched after this list. (principle: pareto-frontier-agent-selection)
7. Design a sequential waterfall dispatch protocol (cheapest first, escalate on failure) and model its expected cost per fix. (principle: waterfall-dispatch-protocol)
8. Model the break-even point where automated agentic patching becomes cheaper than manual developer fixes (assume $150/hr developer cost). (principle: agentic-vs-manual-breakeven)
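
Question 6's dominance test can be stated precisely: an agent should never be deployed if some other agent is at least as cheap and at least as accurate, and strictly better on one axis. A minimal sketch; the data layout is illustrative, not the benchmark's actual agent list:

```python
def pareto_frontier(agents: dict[str, tuple[float, float]]) -> list[str]:
    """agents maps name -> (cost_per_fix_usd, pass_rate).
    Returns the non-dominated agents, cheapest first."""
    frontier = []
    for name, (cost, rate) in agents.items():
        dominated = any(
            other != name
            and oc <= cost and orate >= rate      # no worse on either axis
            and (oc < cost or orate > rate)       # strictly better on one
            for other, (oc, orate) in agents.items()
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: agents[n][0])
```

Everything off the frontier answers the "should never be deployed" half of the question directly; choosing among frontier agents is then a budget decision.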


[NEXT STEPS]

Read the answers

The economics questions produced 10 findings with actionable verdicts. The validation questions confirmed the benchmark holds up under scrutiny.


FAQ

How does XOR validate its own benchmark?

We run 25 structured questions against every dataset before publishing. The questions challenge assumptions, explore implications, and extend findings into new territory.

Can I see the validation questions?

Yes. All 25 questions and their answers are published on this page. We show our work so you can decide if the methodology holds up.

[RELATED TOPICS]

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.