
Ensemble strategies: combining agents for higher pass rates

Best single agent: 62.5%. Best pair: 75.0% theoretical. Oracle (any-agent-passes): 79.7%. Marginal analysis shows diminishing returns after 4 agents.

What happens when you run multiple agents on the same CVE and use whichever fix works? Across 128 valid samples and 13 agent configurations, the theoretical ceiling climbs from 62.5% (best single agent) to 79.7% (oracle: any agent passes). This page explores practical ensemble configurations and the cost-benefit tradeoffs of running multiple agents.

Ensemble patching is simple in concept: attempt the same CVE with multiple agents, and use the first successful fix. Each additional agent raises the pass rate, but with diminishing returns. Understanding these diminishing returns helps you decide how many agents to run.
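The concept above can be sketched in a few lines. This is a hypothetical sketch, not the benchmark's harness: `run_agent` and `tests_pass` are placeholders for whatever you use to invoke an agent and run the CVE's regression tests.

```python
def ensemble_patch(cve, agents, run_agent, tests_pass):
    """Try agents in order; return the first fix that passes the tests.

    run_agent(agent, cve) -> candidate patch (or None) and
    tests_pass(cve, patch) -> bool are stand-ins for your own harness.
    """
    for agent in agents:
        patch = run_agent(agent, cve)
        if patch is not None and tests_pass(cve, patch):
            return agent, patch   # first success wins; skip remaining agents
    return None, None             # every agent failed on this CVE
```

Later agents only run when earlier ones fail, which is what makes the cost analysis below (cascade vs. simultaneous) matter.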


Single-agent baseline

The best single agent by pass count is Cursor-Opus-4.6, which reaches 80 passes out of 128 samples (62.5%). This is the starting point. The remaining 48 samples represent either bugs the agent can't solve with its current configuration or bugs that are genuinely hard across all model architectures.

Two-agent ensembles

Running a second agent on the failures from the best single agent increases the pass rate. The best two-agent pair is Codex GPT-5.2 + Cursor-Opus-4.6, reaching 96 passes out of 128 (75.0%). That is a 12.5 percentage point improvement over the best single agent.

Codex contributes 16 additional passes on samples where Cursor-Opus-4.6 fails. These agents have complementary strengths:

  • Cursor-Opus-4.6 pass count: 80/128
  • Codex GPT-5.2 additional passes: 16
  • Combined: 96/128 = 75.0%
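The combined count is the union of the two agents' pass sets. A minimal sketch with made-up toy data (five samples, not the benchmark's 128):

```python
# results[agent][i] is True if that agent's fix passed on sample i.
def union_passes(results, agents):
    """Count samples solved by at least one of the given agents."""
    n = len(next(iter(results.values())))
    return sum(any(results[a][i] for a in agents) for i in range(n))

# Toy data, illustrative only:
results = {
    "cursor": [True, True, True, False, False],
    "codex":  [False, True, False, True, False],
}
union_passes(results, ["cursor"])           # 3 passes alone
union_passes(results, ["cursor", "codex"])  # 4 passes combined
```

The same function computes the oracle rate by passing in all 13 agents at once.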

The marginal cost per additional pass from adding Codex is $24.64: Codex's $3.08 per evaluation across all 128 samples, divided by the 16 extra passes (source: ensemble analysis). That is the price of each bug fixed by the second agent that the first agent missed.

Oracle ceiling and hard cases

The oracle rate (79.7%) represents the ceiling: the maximum pass rate achievable with any combination of the 13 agents tested. Out of 128 samples, 102 are solved by at least one agent. The remaining 26 samples (20.3%) resist all 13 agent configurations.

What makes a CVE "hard"? Preliminary analysis of the hard case set suggests:

  • Complex multi-file changes where agents get lost in scope
  • Language-specific edge cases in memory safety (Rust, C)
  • Novel vulnerability types not well-represented in training data
  • Implicit contract violations requiring understanding of undocumented API behavior

These hard cases likely require human intervention. An agent ensemble gets you from 62.5% to 79.7%, automating most fixable bugs. The remaining 20.3% are boundary cases where human expertise matters most.

Majority vote vs oracle

Majority vote (more than half of agents pass) yields only 64/128 = 50.0%. This is worse than the best single agent (62.5%). The reason: many samples are passed by only a few agents. Taking a majority vote penalizes bugs where only 1-3 agents find the fix. Oracle (any agent passes) is the correct ensemble strategy for patching, not majority vote.
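The gap is easy to reproduce on a toy pass/fail matrix. In the sketch below (agent names and data are illustrative), a sample solved by only one of three agents counts toward the oracle but not toward the majority vote:

```python
def oracle_passes(results):
    """Samples solved by at least one agent."""
    agents = list(results)
    n = len(results[agents[0]])
    return sum(any(results[a][i] for a in agents) for i in range(n))

def majority_vote_passes(results):
    """Samples solved by more than half of the agents."""
    agents = list(results)
    n = len(results[agents[0]])
    return sum(
        sum(results[a][i] for a in agents) > len(agents) / 2
        for i in range(n)
    )

# Sample 0: all three agents pass; sample 1: one agent; sample 2: none.
toy = {
    "a": [True, True, False],
    "b": [True, False, False],
    "c": [True, False, False],
}
oracle_passes(toy)         # 2 (samples 0 and 1)
majority_vote_passes(toy)  # 1 (only sample 0 clears the >50% bar)
```

For patching, any passing fix is usable, so there is no reason to demand agreement.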

Cost analysis

Each agent has a per-evaluation cost. Using estimated costs from the benchmark:

Agent             Cost per eval   Cost per pass   Pass rate
Claude Opus 4.6   $1.66           $2.93           61.6%
Gemini 3.1 Pro    $1.96           $3.92           58.7%
Codex GPT-5.2     $3.08           $5.30           62.7%
Cursor Opus 4.6   $22.12          $35.40          62.5%

The best pair by pass count (Codex + Cursor-Opus-4.6) is expensive because Cursor-Opus-4.6 costs $22.12 per evaluation. For cost-conscious deployments, Claude Opus 4.6 ($1.66/eval) combined with Codex GPT-5.2 ($3.08/eval) offers a more practical alternative at $4.74 per evaluation.

Sequential strategy

Run the cheaper agent first. If it fails, escalate to the more expensive agent. This avoids paying for both agents on every sample.

Example: Run Claude Opus 4.6 first ($1.66/eval, ~61.6% pass rate). On the ~38.4% of failures, escalate to Codex ($3.08/eval).

Expected cost per sample: $1.66 + ($3.08 x 0.384) = $1.66 + $1.18 = $2.84

This is cheaper than running either agent alone at its cost-per-pass rate, while achieving a higher combined pass rate.
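The expected-cost arithmetic generalizes to any cascade pair. A minimal sketch using the costs and pass rate quoted above:

```python
def cascade_cost(first_cost, first_pass_rate, second_cost):
    """Expected per-sample cost of a sequential cascade: the first agent
    always runs; the second runs only on the first agent's failures."""
    return first_cost + second_cost * (1 - first_pass_rate)

# Claude Opus 4.6 first ($1.66/eval, ~61.6% pass), escalate to Codex ($3.08):
cascade_cost(1.66, 0.616, 3.08)   # ~$2.84 per sample
# Running both agents on every sample would cost $1.66 + $3.08 = $4.74.
```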

Simultaneous strategy

Run both agents on all samples. Simpler to implement (no cascade logic), but more expensive: $1.66 + $3.08 = $4.74 per sample regardless of outcome. Good for organizations that want simplicity over cost optimization.

Practical ensemble strategies

Strategy 1: Cost-efficient pair

Run Claude Opus 4.6 + Codex GPT-5.2 sequentially. Low cost per sample (~$2.84), strong combined coverage. Best for teams with budget constraints.

Strategy 2: Maximum coverage pair

Run Codex GPT-5.2 + Cursor-Opus-4.6. Highest measured pass rate (75.0%), but Cursor-Opus-4.6 adds significant cost ($22.12/eval). Best when fix coverage matters more than cost.

Strategy 3: Lab-balanced ensemble

Run one agent from each lab (Anthropic, Google, OpenAI) to maximize architectural diversity. Example: Claude Opus 4.6 + Gemini 3.1 Pro + Codex GPT-5.2. Cost: $1.66 + $1.96 + $3.08 = $6.70 per sample. Different model architectures increase the chance that at least one agent finds the fix.

Strategy 4: Budget-limited

Your budget determines ensemble size. At $2/sample, you get one Claude run. At $5/sample, you get Claude + Codex sequential. At $7+/sample, you can add a third agent.
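One way to operationalize a budget cap is to add agents in priority order while the cumulative per-sample cost still fits. A sketch using the per-eval costs from the table above (the priority ordering is an assumption, not a benchmark recommendation):

```python
def agents_within_budget(ranked_agents, budget):
    """Pick agents in the given priority order until the per-sample
    budget is exhausted. ranked_agents: list of (name, cost_per_eval)."""
    chosen, total = [], 0.0
    for name, cost in ranked_agents:
        if total + cost <= budget:
            chosen.append(name)
            total += cost
    return chosen, round(total, 2)

ranked = [("Claude Opus 4.6", 1.66), ("Codex GPT-5.2", 3.08),
          ("Gemini 3.1 Pro", 1.96)]
agents_within_budget(ranked, 2.00)   # Claude only
agents_within_budget(ranked, 5.00)   # Claude + Codex ($4.74)
agents_within_budget(ranked, 7.00)   # all three ($6.70)
```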

When NOT to use ensembles

Ensembles are overkill if:

  • You need just 50% pass rate (single agent suffices)
  • You're verifying fixes, not generating them (one agent is enough)
  • Your CVEs are all simple (single-file changes where agents agree on 80%+)

Ensembles are worth it if:

  • You need 70%+ pass rate (single agent tops out around 62.5%)
  • Coverage matters more than cost (time to fix is critical)
  • You're running agents as a service (ensemble amortizes cost across many users)


FAQ

Can running multiple agents improve pass rates?

Yes. The best 2-agent ensemble reaches a 75.0% theoretical pass rate (96/128), up from 62.5% for the best single agent. An oracle strategy (accept any agent's pass) reaches 79.7%.
