Validation Process
25 questions we ran against our own data before publishing. They challenge assumptions, explore implications, and extend findings.
Why publish validation questions
Most benchmarks publish rankings. We publish the questions we asked to test whether those rankings are valid. If you find a question we missed, tell us.
Three question types
Challenge questions test assumptions. Assume questions explore what the data implies. Build-on questions extend findings to new scenarios.
How we validated the benchmark before publishing
25 structured questions across three types. Each targets a specific claim, from pass-rate validity to cost-model calibration. We ran all of them against our own data before publishing the results. You can read every question and answer below.
All questions by type
Three question types stress-test the benchmark from different angles. [CHALLENGE] questions test assumptions. [ASSUME] questions explore what the data implies. [BUILD ON] questions extend findings to new territory.
Challenge questions
| # | Question | Principle |
|---|---|---|
| 1 | Is the 66.1% pass rate for cursor-opus-4.6 inflated by easy samples that all agents pass, or does it reflect genuine superiority? | codex-pass-rate-validity |
| 2 | Do the near-identical patches across agents indicate training data contamination rather than genuine problem-solving convergence? | patch-similarity-contamination |
| 3 | Is the 49.1% overall pass rate meaningful given that 305+ results are infrastructure failures rather than agent failures? | infra-bias-validity |
| 4 | Are the per-agent rankings stable across different random samples of CVEs, or would different sample selection produce completely different orderings? | ranking-stability |
| 5 | Does the OpenCode wrapper add genuine agent value, or is it primarily passing through to the same underlying model with overhead? | opencode-value-add |
| 6 | Is the Gemini cost-per-fix real or an artifact of turn_heuristic estimation — what happens with measured token data? | gemini-cost-validity |
| 7 | Does the OpenCode wrapper overhead justify ANY quality improvement, or is it pure computational waste? | wrapper-overhead-justification |
| 8 | Are cost/pass efficiency rankings stable if infrastructure failures are excluded from denominators? (see the sketch after this table) | efficiency-ranking-stability |
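As an illustration of the efficiency-ranking question above, here is a minimal sketch of how pass-rate and cost-per-pass figures move when infrastructure failures are dropped from the denominator. The agent names and every number in it are hypothetical placeholders, not benchmark results:

```python
# Hypothetical sketch: how cost/pass efficiency shifts when infrastructure
# failures are removed from the denominator. Agent names and all numbers are
# invented for illustration; the benchmark's per-sample data would be
# substituted in.
from dataclasses import dataclass


@dataclass
class AgentRuns:
    name: str
    attempts: int         # total evaluation runs
    passes: int           # runs where the patch passed the verifier
    infra_failures: int   # runs that failed for infrastructure reasons
    total_cost_usd: float


agents = [
    AgentRuns("agent-a", attempts=128, passes=80, infra_failures=4, total_cost_usd=420.0),
    AgentRuns("agent-b", attempts=128, passes=70, infra_failures=30, total_cost_usd=250.0),
]

for a in agents:
    naive_rate = a.passes / a.attempts
    adjusted_rate = a.passes / (a.attempts - a.infra_failures)
    cost_per_pass = a.total_cost_usd / a.passes
    print(f"{a.name}: naive {naive_rate:.1%}, infra-adjusted {adjusted_rate:.1%}, "
          f"${cost_per_pass:.2f} per passing fix")
```

With these made-up numbers the second agent trails on the naive pass rate but overtakes the first once its infrastructure failures are excluded, which is exactly the kind of ranking instability the question probes.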
Assume questions
| # | Question | Principle |
|---|---|---|
| 1 | If training contamination IS present in the benchmark, which specific CVE patterns would show the strongest signals? | contamination-signal-hypothesis |
| 2 | Assuming the per-sample difficulty ratings (floor/ceiling/hard/medium/easy) are accurate, what agent characteristics predict success on hard vs easy CVEs? | difficulty-prediction-model |
| 3 | If cost-effectiveness is the primary optimization metric (not just accuracy), which agent configuration offers the best bang-for-buck? | cost-effectiveness-optimization |
| 4 | Assuming the oracle ceiling of 74.3% (any agent passes) is a hard limit, what characterizes the 25.7% of CVEs that NO agent can fix? | oracle-ceiling-characterization |
| 5 | If behavioral patterns extracted from RLM data correlate with outcomes, which patterns are most predictive of success? | behavior-outcome-correlation |
| 6 | What is the optimal sequential dispatch strategy for a multi-agent ensemble maximizing coverage at minimum cost? | ensemble-dispatch-strategy |
| 7 | Given near-identical patches across agents, what is the marginal value of the 4th through 9th agent in an ensemble? | marginal-agent-value |
| 8 | If early stopping at turn 5-8 saves compute on doomed attempts, what's the break-even ROI at 1000 CVEs/year? (see the sketch after this table) | early-stopping-roi |
| 9 | At what infrastructure failure rate does retry cost dominate agent selection decisions? | infra-failure-break-even |
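The early-stopping ROI question reduces to simple arithmetic once the rates are pinned down. A minimal sketch, in which every parameter except the 1,000 CVEs/year workload is an assumed placeholder rather than a measured value:

```python
# Back-of-the-envelope model for early-stopping ROI. All parameters except
# the 1,000 CVEs/year workload are assumed placeholders, not measurements.
CVES_PER_YEAR = 1_000
DOOMED_RATE = 0.35         # fraction of attempts that will never pass (assumed)
AVG_TURNS_SAVED = 10       # turns avoided by stopping a doomed run early (assumed)
COST_PER_TURN = 0.15       # USD per agent turn (assumed)
FALSE_STOP_RATE = 0.05     # viable runs mistakenly killed early (assumed)
COST_OF_MISSED_FIX = 25.0  # USD to recover a fix lost to a false stop (assumed)

savings = CVES_PER_YEAR * DOOMED_RATE * AVG_TURNS_SAVED * COST_PER_TURN
losses = CVES_PER_YEAR * (1 - DOOMED_RATE) * FALSE_STOP_RATE * COST_OF_MISSED_FIX

print(f"annual savings: ${savings:,.0f}")
print(f"annual losses:  ${losses:,.0f}")
print(f"net:            ${savings - losses:,.0f}")

# The rule breaks even when savings == losses, i.e. when FALSE_STOP_RATE equals
# DOOMED_RATE * AVG_TURNS_SAVED * COST_PER_TURN / ((1 - DOOMED_RATE) * COST_OF_MISSED_FIX)
break_even_false_stop = (DOOMED_RATE * AVG_TURNS_SAVED * COST_PER_TURN) / (
    (1 - DOOMED_RATE) * COST_OF_MISSED_FIX
)
print(f"break-even false-stop rate: {break_even_false_stop:.1%}")
```

With these placeholder numbers the stopping rule actually loses money: the break-even false-stop rate comes out near 3%, and that threshold, recomputed with measured turn costs and doomed-attempt rates from the RLM data, is what the question is asking for.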
Build-on questions
| # | Question | Principle |
|---|---|---|
| 1 | Can the benchmark methodology be extended to measure patch quality beyond binary pass/fail (e.g., minimal changes, semantic preservation)? | patch-quality-metrics |
| 2 | How would results change if agents were given access to the project's existing test suite during patch generation (not just for evaluation)? | test-suite-access-impact |
| 3 | Could the RLM behavioral analysis be used to create an early-stopping criterion that saves compute on doomed attempts? | early-stopping-signal |
| 4 | What would CVE-Bench v2 look like if it controlled for training contamination and infrastructure variance using double-blind evaluation? | benchmark-v2-design |
| 5 | Can cross-agent agreement patterns be used as a confidence signal for patch correctness without running Docker evaluation? | agreement-confidence-signal |
| 6 | Build a cost-performance Pareto frontier for production agent selection — which agents are dominated and should never be deployed? | pareto-frontier-agent-selection |
| 7 | Design a sequential waterfall dispatch protocol (cheapest first, escalate on failure) and model its expected cost per fix (see the sketch after this table) | waterfall-dispatch-protocol |
| 8 | Model the break-even point where automated agentic patching becomes cheaper than manual developer fixes (assume $150/hr developer cost) | agentic-vs-manual-breakeven |
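The waterfall-dispatch and manual break-even questions share one expected-cost model. A minimal sketch with entirely hypothetical per-agent pass rates and costs; only the $150/hr developer rate is taken from the question itself, and treating tiers as independent is optimistic given the near-identical patches noted above:

```python
# Expected-cost model for a cheapest-first waterfall dispatch vs a manual fix.
# Per-tier pass rates and costs are invented; the $150/hr rate comes from the
# break-even question. Tiers are treated as independent, which the
# near-identical patches across agents suggest is optimistic.
waterfall = [
    # (name, pass_rate, cost_per_attempt_usd) - cheapest first, all assumed
    ("cheap-agent", 0.45, 1.00),
    ("mid-agent", 0.60, 3.00),
    ("strong-agent", 0.66, 8.00),
]
MANUAL_HOURS_PER_FIX = 2.0  # assumed developer effort per CVE
MANUAL_RATE = 150.0         # USD/hr, from the question

expected_cost = 0.0
coverage = 0.0
p_unresolved = 1.0  # probability no earlier tier produced a passing patch
for name, pass_rate, cost in waterfall:
    expected_cost += p_unresolved * cost  # a tier is only paid for if reached
    coverage += p_unresolved * pass_rate
    p_unresolved *= 1 - pass_rate

manual_cost = MANUAL_HOURS_PER_FIX * MANUAL_RATE
expected_cost += p_unresolved * manual_cost  # unfixed CVEs fall back to a developer

print(f"waterfall coverage:      {coverage:.1%}")
print(f"expected cost per CVE:   ${expected_cost:.2f}")
print(f"all-manual cost per CVE: ${manual_cost:.2f}")
```

Under these assumptions the waterfall resolves roughly 93% of CVEs at an expected cost under $30 per CVE versus $300 for the all-manual baseline; substituting the measured per-agent pass rates and costs turns this sketch into the actual break-even answer.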
FAQ
How does XOR validate its own benchmark?
We run 25 structured questions against every dataset before publishing. Questions challenge assumptions, test implications, and extend findings into new territory.
Can I see the validation questions?
Yes. All 25 questions and their answers are published on this page. We show our work so you can decide if the methodology holds up.
Patch verification
XOR writes a verifier for each vulnerability, then tests agent-generated patches against it. If the fix passes, it ships. If not, the failure feeds back into the agent harness.
Automated vulnerability patching
AI agents generate fixes for known CVEs. XOR verifies each fix and feeds outcomes back into the agent harness so future patches improve.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,664 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
13 agent-model configurations evaluated on real CVEs. Compare Claude Code, Codex, Gemini CLI, Cursor, and OpenCode.
Benchmark Methodology
How CVE-Agent-Bench evaluates 13 coding agents on 128 real vulnerabilities. Deterministic, reproducible, open methodology.
Agent Environment Security
AI agents run with real permissions. XOR verifies tool configurations, sandbox boundaries, and credential exposure.
Security Economics for Agentic Patching
ROI models for agentic patching, backed by verified pass/fail data and business-impact triage.
Cost Analysis
10 findings on what AI patching costs and whether it is worth buying. 1,664 evaluations analyzed.
Bug Complexity
128 vulnerabilities scored by difficulty. Floor = every agent fixes it. Ceiling = no agent can.
Agent Strategies
How different agents approach the same bug. Strategy matters as much as model capability.
Execution Metrics
Per-agent session data: turns, tool calls, tokens, and timing. See what happens inside an agent run.
Pricing Transparency
Every cost number has a source. Published pricing models, measurement methods, and provider rates.
Automated Vulnerability Patching and PR Review
Automated code review, fix generation, GitHub Actions hardening, safety checks, and learning feedback. One-click install on any GitHub repository.
Continuous Learning from Verified Agent Runs
A signed record of every agent run. See what the agent did, verify it independently, and feed the data back so agents improve.
Signed Compliance Evidence for AI Agents
A tamper-proof record of every AI agent action. Produces evidence for SOC 2, EU AI Act, PCI DSS, and more. Built on open standards so auditors verify independently.
Compliance Evidence and Standards Alignment
How XOR signed audit trails produce evidence for SOC 2, EU AI Act, PCI DSS, NIST, and other compliance frameworks.
See which agents produce fixes that work
128 CVEs. 13 agents. 1,664 evaluations. Agents learn from every run.