Council Deliberations
25 strategic questions answered by multi-model deliberation. Consensus scores, confidence levels, and conclusions from the benchmark council.
Deliberation process
Each question is analyzed independently by multiple models. Responses are compared for agreement, and a final conclusion is synthesized with a confidence rating.
Using the results
High-consensus answers inform benchmark design decisions. Low-consensus questions identify areas where the benchmark needs more data or where reasonable people disagree.
Three analysts argued about every finding
Each of the 25 research questions went through a structured deliberation process. Three independent personas (a mathematician, a creative analyst, and a skeptic) each evaluated the evidence and wrote their opinion. A separate review panel ranked the opinions, then a synthesis produced the final conclusion with a confidence rating.
14 deliberations produced usable conclusions. 11 failed due to upstream infrastructure errors during the council generation run and will be re-run in the next evaluation cycle.
The deliberation protocol avoids single-analyst blindness. A mathematically-minded analyst might miss creative interpretations. A creative analyst might gloss over rigor. A skeptic pushes back on assumptions the other two might accept unchallenged. By running three personas and forcing them to argue, we catch gaps that a single voice would miss.
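The persona-panel-synthesis loop described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual implementation: the persona names match the text, but the ranking rule (evidence count) and the synthesis rules are assumptions.

```python
from dataclasses import dataclass

# Hypothetical sketch of the three-persona deliberation loop. The ranking
# and synthesis rules below are illustrative assumptions.

@dataclass
class Opinion:
    persona: str
    answer: str
    evidence_cited: int  # number of corroborating evidence chains

def deliberate(opinions: list[Opinion]) -> tuple[str, str]:
    """Rank opinions, then synthesize a conclusion with a confidence tier."""
    # Review panel: rank opinions by how much evidence each cites.
    ranked = sorted(opinions, key=lambda o: o.evidence_cited, reverse=True)
    answers = {o.answer for o in opinions}
    # Synthesis: unanimous agreement with evidence -> high confidence;
    # two-of-three agreement -> medium; otherwise low.
    if len(answers) == 1 and all(o.evidence_cited > 0 for o in opinions):
        confidence = "HIGH"
    elif any(sum(o.answer == a for o in opinions) >= 2 for a in answers):
        confidence = "MEDIUM"
    else:
        confidence = "LOW"
    return ranked[0].answer, confidence

opinions = [
    Opinion("mathematician", "rankings unstable", 2),
    Opinion("creative", "rankings unstable", 1),
    Opinion("skeptic", "rankings stable", 0),
]
answer, confidence = deliberate(opinions)  # two of three agree -> MEDIUM
```

Forcing the synthesis to justify a tier from agreement plus evidence, rather than from any single persona's certainty, is what makes the protocol resistant to one confident-but-wrong voice.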
[KEY INSIGHT]
1 high-confidence conclusion
from 14 completed deliberations. High-confidence conclusions had all three personas in agreement with corroborating evidence chains.
High confidence means the three personas converged on the same answer and evidence chains corroborated the finding. These are the claims you can trust without caveats. Medium-confidence findings had two personas agreeing but evidence gaps. Lower-confidence conclusions need more work in the next benchmark cycle.
All deliberations
Questions, confidence levels, and synthesized conclusions. Each row represents a full three-persona deliberation cycle.
| # | Question | Confidence | Conclusion |
|---|---|---|---|
| 1 | Are the per-agent rankings stable across different random samples of CVEs, or would different sample selection produce completely different orderings? | [MEDIUM] | Underdetermined from the provided evidence: we cannot claim rankings are stable across different random CVE samples, and different sample selection could plausibly change at least the close-call order... |
| 2 | Does the OpenCode wrapper add genuine agent value, or is it primarily passing through to the same underlying model with overhead? | [HIGH] | Based on the provided benchmark evidence, OpenCode is primarily passing through to the same underlying models while adding substantial overhead (higher cost, more build failures) and yielding worse pa... |
| 3 | If training contamination IS present in the benchmark, which specific CVE patterns would show the strongest signals? | [LOW] | You cannot identify which *specific CVE patterns* show the strongest contamination signals from the provided aggregates; to do so you need a per-CVE table (CVE→CWE/vendor/language/etc.) joined to per-... |
| 4 | Assuming the per-sample difficulty ratings (floor/ceiling/hard/medium/easy) are accurate, what agent characteristics predict success on hard vs easy CVEs? | [MEDIUM] | With the current aggregates, we cannot identify which agent characteristics predict success on hard vs easy CVEs; we can only say agent-model identity predicts overall success, and hypothesize that mo... |
| 5 | If cost-effectiveness is the primary optimization metric (not just accuracy), which agent configuration offers the best bang-for-buck? | [MEDIUM] | Choose **gemini-gemini-3-pro-preview** for best bang-for-buck (lowest **Cost/Pass $3.52**, **Efficiency Rank 1**). If you need ≥~60% pass rate while staying cost-efficient, pick **codex-gpt-5.2** (**$... |
| 6 | If behavioral patterns extracted from RLM data correlate with outcomes, which patterns are most predictive of success? | [LOW] | Underdetermined from the current evidence; no single behavioral pattern is demonstrably “most predictive of success.” The only defensible takeaway is that interaction intensity (turns/tool calls) is n... |
| 7 | Can the benchmark methodology be extended to measure patch quality beyond binary pass/fail (e.g., minimal changes, semantic preservation)? | [MEDIUM] | Yes, the methodology can be extended to score patch quality beyond pass/fail for metrics like minimality and stability using the stored patch/session artifacts, but robust semantic-preservation measur... |
| 8 | How would results change if agents were given access to the project's existing test suite during patch generation (not just for evaluation)? | [LOW] | Results would *likely* improve primarily by reducing “Test Fail” (and increasing passes) for cases where tests can be executed during generation, but the magnitude—and any effect on build/infra failur... |
| 9 | Could the RLM behavioral analysis be used to create an early-stopping criterion that saves compute on doomed attempts? | [MEDIUM] | Yes in principle, but it’s **underdetermined from the current evidence** whether RLM behavioral analysis will yield a reliable early-stopping criterion that saves compute without harming pass rate; yo... |
| 10 | What would CVE-Bench v2 look like if it controlled for training contamination and infrastructure variance using double-blind evaluation? | [MEDIUM] | CVE-Bench v2 would look like a double-blind, sealed-holdout + standardized-sandbox benchmark that anonymizes model/sample identities during runs and scoring, and that separates “can it fix the CVE?” f... |
| 11 | At what infrastructure failure rate does retry cost dominate agent selection decisions? | [LOW] | From this report alone, the dominance threshold is **underdetermined** due to missing retry and cost-accounting details; under a standard “retry-until-non-infra” assumption, it would be on the order o... |
| 12 | Build a cost-performance Pareto frontier for production agent selection — which agents are dominated and should never be deployed? | [MEDIUM] | On the (Cost/Eval, Pass Rate) Pareto frontier, keep **gemini-gemini-3-pro-preview** and **codex-gpt-5.2**; never deploy **claude-claude-opus-4-5**, **claude-claude-opus-4-6**, **codex-gpt-5.2-codex**,... |
| 13 | Design a sequential waterfall dispatch protocol (cheapest first, escalate on failure) and model its expected cost per fix | [MEDIUM] | Dispatch **gemini-3-pro-preview → codex-gpt-5.2 → claude-opus-4-6**, escalating only on non-pass and stopping on first pass; under an independence approximation the expected spend is **≈ $3.96 per sam... |
| 14 | Model the break-even point where automated agentic patching becomes cheaper than manual developer fixes (assume $150/hr developer cost) | [MEDIUM] | Automated agentic patching is cheaper than manual fixes at $150/hr whenever your average manual time per successful fix exceeds \(t^*=0.4\cdot\text{Cost/Pass}\) minutes—e.g., **1.41 min** (gemini $3.5... |
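The cost models behind rows 13 and 14 can be sketched directly. The formulas follow the table (escalate on non-pass under an independence approximation; break-even minutes t* = 0.4 x Cost/Pass at $150/hr), but the per-tier costs and pass probabilities below are illustrative assumptions, not figures from the report.

```python
# Sketch of the waterfall dispatch (row 13) and break-even (row 14) models.
# Tier costs and pass rates are assumed for illustration.

def waterfall_expected_cost(tiers):
    """Expected cost per sample for cheapest-first dispatch, escalating on
    failure and stopping on first pass, assuming independent tiers."""
    expected, p_reach = 0.0, 1.0
    for cost, p_pass in tiers:
        expected += p_reach * cost   # pay this tier only if we reach it
        p_reach *= (1.0 - p_pass)    # escalate only on non-pass
    return expected

def breakeven_minutes(cost_per_pass, dev_rate_per_hr=150.0):
    """Manual minutes per fix above which the agent is cheaper:
    t* = cost_per_pass / (rate/60); at $150/hr this is 0.4 * cost_per_pass."""
    return cost_per_pass / (dev_rate_per_hr / 60.0)

# Illustrative tiers: (cost per attempt, pass probability) -- assumed values.
tiers = [(2.0, 0.55), (5.0, 0.65), (15.0, 0.70)]
print(round(waterfall_expected_cost(tiers), 2))
print(round(breakeven_minutes(3.52), 2))  # 1.41 min at Cost/Pass = $3.52
```

With these assumed tiers the expected spend is 2.0 + 0.45 x 5.0 + 0.45 x 0.35 x 15.0, and the break-even at the table's $3.52 Cost/Pass works out to 1.41 minutes, matching row 14.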
[NEXT STEPS]
Read what the deliberations produced
The economics questions led to 10 actionable findings. The validation questions stress-tested every claim in the benchmark.
Explore more
Results & leaderboard: the pass rates these deliberations validated
Methodology: scoring rules and exclusion criteria under scrutiny
FAQ
What is the benchmark council?
A structured deliberation process where multiple AI models independently analyze the same question, then a synthesis step identifies consensus, disagreements, and confidence levels.
How is consensus measured?
Each question receives a consensus score (0-1) based on agreement across models. Scores above 0.7 indicate strong agreement. Below 0.4 signals open questions that need more evidence.
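One simple way to produce such a 0-1 score and apply the thresholds above is pairwise agreement across models. The scoring function here is an assumption for illustration; only the 0.7 and 0.4 cutoffs come from the text.

```python
from itertools import combinations

# Illustrative consensus scoring: fraction of model pairs whose answers
# agree. The scoring rule is assumed; the 0.7 / 0.4 thresholds are stated.

def consensus_score(answers: list[str]) -> float:
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

def bucket(score: float) -> str:
    if score > 0.7:
        return "strong agreement"
    if score < 0.4:
        return "open question"
    return "partial agreement"

score = consensus_score(["yes", "yes", "no"])  # 1 of 3 pairs agree -> ~0.33
```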
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.