[VALIDATION]

Validation Process

25 questions we ran against our own data before publishing. They challenge assumptions, explore implications, and extend findings.

Why publish validation questions

Most benchmarks publish rankings. We publish the questions we asked to test whether those rankings are valid. If you find a question we missed, tell us.

Three question types

Challenge questions test assumptions. Assume questions explore what the data implies. Build-on questions extend findings to new scenarios.

25 research questions in total: 8 challenge assumptions, 9 explore implications, 8 extend findings.

How we validated the benchmark before publishing

25 structured questions across three types. Each targets a specific claim, from pass rate validity to cost model calibration. We ran these against our own data before publishing results. You can read every question and answer below.

All questions by type

Three question types stress-test the benchmark from different angles. [CHALLENGE] questions test assumptions. [ASSUME] questions explore what the data implies. [BUILD ON] questions extend findings to new territory.

[CHALLENGE] Test assumptions (8)

1. Is the 66.1% pass rate for cursor-opus-4.6 inflated by easy samples that all agents pass, or does it reflect genuine superiority? (principle: codex-pass-rate-validity)
2. Do the near-identical patches across agents indicate training data contamination rather than genuine problem-solving convergence? (principle: patch-similarity-contamination)
3. Is the 49.1% overall pass rate meaningful given that 305+ results are infrastructure failures rather than agent failures? (principle: infra-bias-validity)
4. Are the per-agent rankings stable across different random samples of CVEs, or would different sample selection produce completely different orderings? (principle: ranking-stability)
5. Does the OpenCode wrapper add genuine agent value, or is it primarily passing through to the same underlying model with overhead? (principle: opencode-value-add)
6. Is the Gemini cost-per-fix real or an artifact of turn_heuristic estimation, and what happens with measured token data? (principle: gemini-cost-validity)
7. Does the OpenCode wrapper overhead justify ANY quality improvement, or is it pure computational waste? (principle: wrapper-overhead-justification)
8. Are cost/pass efficiency rankings stable if infrastructure failures are excluded from denominators? (principle: efficiency-ranking-stability; see the sketch after this list)
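To make question 8 concrete, here is a minimal sketch of how cost/pass efficiency can be recomputed with infrastructure failures either counted in or excluded from the denominator. The agent names and numbers are illustrative placeholders, not benchmark data.

    from dataclasses import dataclass

    @dataclass
    class AgentRuns:
        name: str
        total_cost: float   # USD spent across all attempts
        passes: int         # verified fixes
        fails: int          # genuine agent failures
        infra_fails: int    # harness/Docker failures, not the agent's fault

    def cost_per_pass(a: AgentRuns, exclude_infra: bool) -> float:
        """Cost per passing fix; optionally drop infra-failure attempts
        (and a pro-rated share of their cost) from the accounting."""
        attempts = a.passes + a.fails + a.infra_fails
        cost = a.total_cost
        if exclude_infra and attempts:
            # Assumes cost is spread evenly across attempts (an approximation).
            cost *= (attempts - a.infra_fails) / attempts
        return cost / a.passes if a.passes else float("inf")

    # Illustrative numbers only -- not taken from the benchmark.
    agents = [
        AgentRuns("agent-a", total_cost=95.0, passes=30, fails=10, infra_fails=8),
        AgentRuns("agent-b", total_cost=60.0, passes=25, fails=20, infra_fails=1),
    ]

    for exclude in (False, True):
        ranking = sorted(agents, key=lambda a: cost_per_pass(a, exclude))
        label = "excluding infra failures" if exclude else "counting infra failures"
        print(label, [(a.name, round(cost_per_pass(a, exclude), 2)) for a in ranking])

If the ordering flips between the two settings, the efficiency ranking is not robust to infrastructure noise.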

[ASSUME] Explore implications (9)

1. If training contamination IS present in the benchmark, which specific CVE patterns would show the strongest signals? (principle: contamination-signal-hypothesis)
2. Assuming the per-sample difficulty ratings (floor/ceiling/hard/medium/easy) are accurate, what agent characteristics predict success on hard vs easy CVEs? (principle: difficulty-prediction-model)
3. If cost-effectiveness is the primary optimization metric (not just accuracy), which agent configuration offers the best bang-for-buck? (principle: cost-effectiveness-optimization)
4. Assuming the oracle ceiling of 74.3% (any agent passes) is a hard limit, what characterizes the 25.7% of CVEs that NO agent can fix? (principle: oracle-ceiling-characterization)
5. If behavioral patterns extracted from RLM data correlate with outcomes, which patterns are most predictive of success? (principle: behavior-outcome-correlation)
6. What is the optimal sequential dispatch strategy for a multi-agent ensemble maximizing coverage at minimum cost? (principle: ensemble-dispatch-strategy)
7. Given near-identical patches across agents, what is the marginal value of the 4th through 9th agent in an ensemble? (principle: marginal-agent-value)
8. If early stopping at turn 5-8 saves compute on doomed attempts, what's the break-even ROI at 1000 CVEs/year? (principle: early-stopping-roi; see the sketch after this list)
9. At what infrastructure failure rate does retry cost dominate agent selection decisions? (principle: infra-failure-break-even)
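For question 8 above, the break-even ROI of early stopping can be roughed out from a handful of assumed rates, as in the sketch below. Every parameter is an assumption to be replaced with measured values, not a benchmark result.

    def early_stopping_savings(
        cves_per_year: int = 1000,
        attempts_per_cve: float = 1.5,       # assumed retries / ensemble attempts per CVE
        failure_rate: float = 0.4,           # assumed share of attempts that ultimately fail
        cost_per_attempt: float = 2.64,      # assumed average cost of a full attempt (USD)
        cost_after_stop_point: float = 0.6,  # assumed fraction of cost spent after turn 5-8
        detector_recall: float = 0.8,        # doomed attempts the stopping signal catches
        detector_false_stop: float = 0.05,   # good attempts wrongly killed (must be rerun)
    ) -> float:
        """Estimated annual net savings (USD) from stopping doomed attempts early."""
        attempts = cves_per_year * attempts_per_cve
        doomed = attempts * failure_rate
        saved = doomed * detector_recall * cost_per_attempt * cost_after_stop_point
        # A wrongly stopped success has to be rerun from scratch.
        rerun_penalty = attempts * (1 - failure_rate) * detector_false_stop * cost_per_attempt
        return saved - rerun_penalty

    print(f"net savings per year: ${early_stopping_savings():,.0f}")

Break-even is where the rerun penalty cancels the savings; with these placeholder rates the net is positive, but a noisier stopping signal (lower recall, higher false-stop rate) can push it negative.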

[BUILD ON] Extend findings (8)

1. Can the benchmark methodology be extended to measure patch quality beyond binary pass/fail (e.g., minimal changes, semantic preservation)? (principle: patch-quality-metrics)
2. How would results change if agents were given access to the project's existing test suite during patch generation (not just for evaluation)? (principle: test-suite-access-impact)
3. Could the RLM behavioral analysis be used to create an early-stopping criterion that saves compute on doomed attempts? (principle: early-stopping-signal)
4. What would CVE-Bench v2 look like if it controlled for training contamination and infrastructure variance using double-blind evaluation? (principle: benchmark-v2-design)
5. Can cross-agent agreement patterns be used as a confidence signal for patch correctness without running Docker evaluation? (principle: agreement-confidence-signal)
6. Build a cost-performance Pareto frontier for production agent selection: which agents are dominated and should never be deployed? (principle: pareto-frontier-agent-selection)
7. Design a sequential waterfall dispatch protocol (cheapest first, escalate on failure) and model its expected cost per fix. (principle: waterfall-dispatch-protocol; see the sketch after this list)
8. Model the break-even point where automated agentic patching becomes cheaper than manual developer fixes (assume $150/hr developer cost). (principle: agentic-vs-manual-breakeven)
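The waterfall protocol in question 7 can be modeled directly from per-tier cost and pass rate, and dividing the result by a $150/hr developer rate gives the manual break-even in question 8. The tiers, prices, and pass rates below are illustrative assumptions, and treating pass rates as independent is optimistic given the near-identical patches noted above.

    from typing import List, Tuple

    # Each tier: (name, cost per attempt in USD, pass rate). Cheapest first.
    Tier = Tuple[str, float, float]

    def waterfall_expected_cost(tiers: List[Tier]) -> Tuple[float, float]:
        """Expected cost per CVE and overall fix probability when each tier
        runs only if every earlier tier has already failed."""
        expected_cost = 0.0
        p_unfixed = 1.0  # probability the CVE is still unfixed when a tier starts
        for _name, cost, pass_rate in tiers:
            expected_cost += p_unfixed * cost
            p_unfixed *= 1.0 - pass_rate
        return expected_cost, 1.0 - p_unfixed

    # Illustrative tiers only -- not measured benchmark configurations.
    tiers = [
        ("cheap-agent", 1.50, 0.45),
        ("mid-agent", 3.00, 0.55),
        ("frontier-agent", 9.00, 0.65),
    ]

    cost, coverage = waterfall_expected_cost(tiers)
    print(f"expected cost per CVE: ${cost:.2f}, coverage: {coverage:.1%}")
    print(f"expected cost per fixed CVE: ${cost / coverage:.2f}")
    print(f"manual break-even at $150/hr: {cost / coverage / 150 * 60:.1f} developer-minutes")

Running the sketch with different tier orderings shows how sensitive expected cost is to which agent goes first, which is the core of the dispatch-protocol question.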

FAQ

How does XOR validate its own benchmark?

We run 25 structured questions against every dataset before publishing. Questions challenge assumptions, test implications, and extend findings into new territory.

Can I see the validation questions?

Yes. All 25 questions and their answers are published on this page. We show our work so you can decide if the methodology holds up.

[RELATED TOPICS]

Patch verification

XOR writes a verifier for each vulnerability, then tests agent-generated patches against it. If the fix passes, it ships. If not, the failure feeds back into the agent harness.

Automated vulnerability patching

AI agents generate fixes for known CVEs. XOR verifies each fix and feeds outcomes back into the agent harness so future patches improve.

Benchmark Results

62.7% pass rate. $2.64 per fix. Real data from 1,664 evaluations.

Agent Cost Economics

Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.

Agent Configurations

13 agent-model configurations evaluated on real CVEs. Compare Claude Code, Codex, Gemini CLI, Cursor, and OpenCode.

Benchmark Methodology

How CVE-Agent-Bench evaluates 13 coding agents on 128 real vulnerabilities. Deterministic, reproducible, open methodology.

Agent Environment Security

AI agents run with real permissions. XOR verifies tool configurations, sandbox boundaries, and credential exposure.

Security Economics for Agentic Patching

ROI models for agentic patching, backed by verified pass/fail data and business-impact triage.

Cost Analysis

10 findings on what AI patching costs and whether it is worth buying. 1,664 evaluations analyzed.

Bug Complexity

128 vulnerabilities scored by difficulty. Floor = every agent fixes it. Ceiling = no agent can.

Agent Strategies

How different agents approach the same bug. Strategy matters as much as model capability.

Execution Metrics

Per-agent session data: turns, tool calls, tokens, and timing. See what happens inside an agent run.

Pricing Transparency

Every cost number has a source. Published pricing models, measurement methods, and provider rates.

Automated Vulnerability Patching and PR Review

Automated code review, fix generation, GitHub Actions hardening, safety checks, and learning feedback. One-click install on any GitHub repository.

Continuous Learning from Verified Agent Runs

A signed record of every agent run. See what the agent did, verify it independently, and feed the data back so agents improve.

Signed Compliance Evidence for AI Agents

A tamper-proof record of every AI agent action. Produces evidence for SOC 2, EU AI Act, PCI DSS, and more. Built on open standards so auditors verify independently.

Compliance Evidence and Standards Alignment

How XOR signed audit trails produce evidence for SOC 2, EU AI Act, PCI DSS, NIST, and other compliance frameworks.

See which agents produce fixes that work

128 CVEs. 13 agents. 1,664 evaluations. Agents learn from every run.