[COUNCIL]

Council Deliberations

25 strategic questions put through multi-model deliberation. Consensus scores, confidence levels, and conclusions from the benchmark council.

Deliberation process

Each question is analyzed independently by multiple models. Responses are compared for agreement, and a final conclusion is synthesized with a confidence rating.

Using the results

High-consensus answers inform benchmark design decisions. Low-consensus questions identify areas where the benchmark needs more data or where reasonable people disagree.

14 deliberations completed
3 independent personas
1 high confidence
9 medium confidence

Three analysts argued about every finding

Each of the 25 research questions went through a structured deliberation process. Three independent personas (a mathematician, a creative analyst, and a skeptic) each evaluated the evidence and wrote their opinion. A separate review panel ranked the opinions, then a synthesis produced the final conclusion with a confidence rating.
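The protocol above can be sketched in code. This is a hypothetical outline of the flow, not the council's actual implementation: the persona names come from the text, but the function signatures (`ask`, `rank`, `synthesize`) are assumptions introduced for illustration.

```python
from dataclasses import dataclass

# Personas named in the deliberation protocol above.
PERSONAS = ["mathematician", "creative_analyst", "skeptic"]

@dataclass
class Opinion:
    persona: str
    answer: str

def deliberate(question, ask, rank, synthesize):
    """Hypothetical three-persona deliberation cycle.

    ask(persona, question)      -> that persona's written opinion
    rank(opinions)              -> opinions ordered by the review panel
    synthesize(question, ranked)-> final conclusion + confidence rating
    """
    # 1. Each persona evaluates the evidence independently.
    opinions = [Opinion(p, ask(p, question)) for p in PERSONAS]
    # 2. A separate review panel ranks the opinions.
    ranked = rank(opinions)
    # 3. Synthesis produces the final conclusion with a confidence rating.
    return synthesize(question, ranked)
```

The key design property is independence in step 1: no persona sees another's opinion before the review panel ranks them.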

14 deliberations produced usable conclusions. 11 failed due to upstream infrastructure errors during the council generation run and will be re-run in the next evaluation cycle.

The deliberation protocol avoids single-analyst blindness. A mathematically-minded analyst might miss creative interpretations. A creative analyst might gloss over rigor. A skeptic pushes back on assumptions all three might accept unchallenged. By running three personas and forcing them to argue, we catch gaps that a single voice would miss.

[KEY INSIGHT]

1 high-confidence conclusion

from 14 completed deliberations. High-confidence conclusions had three personas in agreement with corroborating evidence chains.

High confidence means the three personas converged on the same answer and the evidence chains corroborated the finding. These are the claims you can trust without caveats. Medium-confidence findings had two personas in agreement but left evidence gaps. Lower-confidence conclusions need more work in the next benchmark cycle.
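The rating rule described above can be written down directly. This is a minimal sketch assuming the rubric is exactly as stated (three-way agreement plus corroborating evidence for HIGH, two-way agreement for MEDIUM); the council's actual rubric may include additional criteria.

```python
def confidence(n_agreeing: int, evidence_corroborated: bool) -> str:
    """Map persona agreement (out of 3) and evidence status to a rating.

    Sketch of the rubric described in the text; assumed, not extracted
    from the council's code.
    """
    if n_agreeing == 3 and evidence_corroborated:
        return "HIGH"    # all personas converged, evidence corroborates
    if n_agreeing >= 2:
        return "MEDIUM"  # majority agreement, but evidence gaps remain
    return "LOW"         # flagged for the next benchmark cycle
```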

All deliberations

Questions, confidence levels, and synthesized conclusions. Each entry represents a full three-persona deliberation cycle.

1. [MEDIUM] Are the per-agent rankings stable across different random samples of CVEs, or would different sample selection produce completely different orderings?
   Underdetermined from the provided evidence: we cannot claim rankings are stable across different random CVE samples, and different sample selection could plausibly change at least the close-call order...

2. [HIGH] Does the OpenCode wrapper add genuine agent value, or is it primarily passing through to the same underlying model with overhead?
   Based on the provided benchmark evidence, OpenCode is primarily passing through to the same underlying models while adding substantial overhead (higher cost, more build failures) and yielding worse pa...

3. [LOW] If training contamination IS present in the benchmark, which specific CVE patterns would show the strongest signals?
   You cannot identify which *specific CVE patterns* show the strongest contamination signals from the provided aggregates; to do so you need a per-CVE table (CVE→CWE/vendor/language/etc.) joined to per-...

4. [MEDIUM] Assuming the per-sample difficulty ratings (floor/ceiling/hard/medium/easy) are accurate, what agent characteristics predict success on hard vs easy CVEs?
   With the current aggregates, we cannot identify which agent characteristics predict success on hard vs easy CVEs; we can only say agent-model identity predicts overall success, and hypothesize that mo...

5. [MEDIUM] If cost-effectiveness is the primary optimization metric (not just accuracy), which agent configuration offers the best bang-for-buck?
   Choose **gemini-gemini-3-pro-preview** for best bang-for-buck (lowest **Cost/Pass $3.52**, **Efficiency Rank 1**). If you need ≥~60% pass rate while staying cost-efficient, pick **codex-gpt-5.2** (**$...

6. [LOW] If behavioral patterns extracted from RLM data correlate with outcomes, which patterns are most predictive of success?
   Underdetermined from the current evidence; no single behavioral pattern is demonstrably “most predictive of success.” The only defensible takeaway is that interaction intensity (turns/tool calls) is n...

7. [MEDIUM] Can the benchmark methodology be extended to measure patch quality beyond binary pass/fail (e.g., minimal changes, semantic preservation)?
   Yes, the methodology can be extended to score patch quality beyond pass/fail for metrics like minimality and stability using the stored patch/session artifacts, but robust semantic-preservation measur...

8. [LOW] How would results change if agents were given access to the project's existing test suite during patch generation (not just for evaluation)?
   Results would *likely* improve primarily by reducing “Test Fail” (and increasing passes) for cases where tests can be executed during generation, but the magnitude—and any effect on build/infra failur...

9. [MEDIUM] Could the RLM behavioral analysis be used to create an early-stopping criterion that saves compute on doomed attempts?
   Yes in principle, but it’s **underdetermined from the current evidence** whether RLM behavioral analysis will yield a reliable early-stopping criterion that saves compute without harming pass rate; yo...

10. [MEDIUM] What would CVE-Bench v2 look like if it controlled for training contamination and infrastructure variance using double-blind evaluation?
    CVE-Bench v2 would look like a double-blind, sealed-holdout + standardized-sandbox benchmark that anonymizes model/sample identities during runs and scoring, and that separates “can it fix the CVE?” f...

11. [LOW] At what infrastructure failure rate does retry cost dominate agent selection decisions?
    From this report alone, the dominance threshold is **underdetermined** due to missing retry and cost-accounting details; under a standard “retry-until-non-infra” assumption, it would be on the order o...

12. [MEDIUM] Build a cost-performance Pareto frontier for production agent selection — which agents are dominated and should never be deployed?
    On the (Cost/Eval, Pass Rate) Pareto frontier, keep **gemini-gemini-3-pro-preview** and **codex-gpt-5.2**; never deploy **claude-claude-opus-4-5**, **claude-claude-opus-4-6**, **codex-gpt-5.2-codex**,...

13. [MEDIUM] Design a sequential waterfall dispatch protocol (cheapest first, escalate on failure) and model its expected cost per fix.
    Dispatch **gemini-3-pro-preview → codex-gpt-5.2 → claude-opus-4-6**, escalating only on non-pass and stopping on first pass; under an independence approximation the expected spend is **≈ $3.96 per sam...

14. [MEDIUM] Model the break-even point where automated agentic patching becomes cheaper than manual developer fixes (assume $150/hr developer cost).
    Automated agentic patching is cheaper than manual fixes at $150/hr whenever your average manual time per successful fix exceeds \(t^*=0.4\cdot\text{Cost/Pass}\) minutes—e.g., **1.41 min** (gemini $3.5...
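The cost models in deliberations 13 and 14 can be sketched directly. The waterfall stage figures below are placeholders (the report's per-agent attempt costs and pass rates would be needed to reproduce its ≈$3.96/sample estimate); the break-even rule follows from $150/hr = $2.50/min, so t* = Cost/Pass ÷ 2.5 = 0.4 · Cost/Pass minutes.

```python
def waterfall_expected_cost(stages):
    """Expected spend per sample for cheapest-first dispatch.

    stages: list of (cost_per_attempt, pass_rate) tried in order,
    escalating on failure, under the independence approximation
    used in deliberation 13.
    """
    expected, p_reach = 0.0, 1.0
    for cost, p_pass in stages:
        expected += p_reach * cost   # pay this stage only if reached
        p_reach *= (1.0 - p_pass)    # probability of escalating further
    return expected

def break_even_minutes(cost_per_pass, dev_rate_per_hour=150.0):
    """Manual minutes per fix above which the agent is cheaper.

    t* = cost_per_pass / (rate / 60); at $150/hr this reduces to
    0.4 * cost_per_pass, as stated in deliberation 14.
    """
    return cost_per_pass / (dev_rate_per_hour / 60.0)
```

For example, `break_even_minutes(3.52)` gives ≈1.41 minutes, matching the gemini figure quoted in deliberation 14.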


[NEXT STEPS]

Read what the deliberations produced

The economics questions led to 10 actionable findings. The validation questions stress-tested every claim in the benchmark.

Explore more

FAQ

What is the benchmark council?

A structured deliberation process where multiple AI models independently analyze the same question, then a synthesis step identifies consensus, disagreements, and confidence levels.

How is consensus measured?

Each question receives a consensus score (0-1) based on agreement across models. Scores above 0.7 indicate strong agreement; scores below 0.4 signal open questions that need more evidence.
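The thresholds above suggest a simple reading guide. This is illustrative only; the middle band's label is an assumption, since the FAQ only names the two extremes.

```python
def consensus_band(score: float) -> str:
    """Classify a 0-1 consensus score per the FAQ thresholds.

    'partial agreement' for the 0.4-0.7 band is an assumed label,
    not terminology from the council itself.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("consensus score must be in [0, 1]")
    if score > 0.7:
        return "strong agreement"
    if score < 0.4:
        return "open question"   # needs more evidence
    return "partial agreement"
```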

[RELATED TOPICS]

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.