
Ensemble strategies: combining agents for higher pass rates

Best single agent: 62.5%. Best pair: 75.0% theoretical. Oracle (any-agent-passes): 79.7%. Marginal analysis shows diminishing returns after 4 agents.

What happens when you run multiple agents on the same CVE and use whichever fix works? Across 128 valid samples and 13 agent configurations, the theoretical ceiling climbs from 62.5% (best single agent) to 79.7% (oracle: any agent passes). This page explores practical ensemble configurations and the cost-benefit tradeoffs of running multiple agents.

Ensemble patching is simple in concept: attempt the same CVE with multiple agents, and use the first successful fix. Each additional agent raises the pass rate, but with diminishing returns. Understanding these diminishing returns helps you decide how many agents to run.
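The concept above can be sketched in a few lines. This is a hypothetical sketch, not the benchmark's harness: `run_agent` and `tests_pass` are placeholders for whatever you use to invoke an agent and run the CVE's regression tests.

```python
def ensemble_patch(cve, agents, run_agent, tests_pass):
    """Try agents in order; return the first fix that passes the tests.

    run_agent(agent, cve) -> candidate patch (or None) and
    tests_pass(cve, patch) -> bool are stand-ins for your own harness.
    """
    for agent in agents:
        patch = run_agent(agent, cve)
        if patch is not None and tests_pass(cve, patch):
            return agent, patch   # first success wins; skip remaining agents
    return None, None             # every agent failed on this CVE
```

Later agents only run when earlier ones fail, which is what makes the cost analysis below (cascade vs. simultaneous) matter.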


Single-agent baseline

The best single agent by pass count is Cursor-Opus-4.6, which reaches 80 passes out of 128 samples (62.5%). This is the starting point. The remaining 48 samples represent either bugs the agent can't solve with its current configuration or bugs that are genuinely hard across all model architectures.

Two-agent ensembles

Running a second agent on the failures from the best single agent increases the pass rate. The best two-agent pair is Codex GPT-5.2 + Cursor-Opus-4.6, reaching 96 passes out of 128 (75.0%). That is a 12.5 percentage point improvement over the best single agent.

Codex contributes 16 additional passes on samples where Cursor-Opus-4.6 fails. These agents have complementary strengths:

  • Cursor-Opus-4.6 pass count: 80/128
  • Codex GPT-5.2 additional passes: 16
  • Combined: 96/128 = 75.0%
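The combined count is the union of the two agents' pass sets. A minimal sketch with made-up toy data (five samples, not the benchmark's 128):

```python
# results[agent][i] is True if that agent's fix passed on sample i.
def union_passes(results, agents):
    """Count samples solved by at least one of the given agents."""
    n = len(next(iter(results.values())))
    return sum(any(results[a][i] for a in agents) for i in range(n))

# Toy data, illustrative only:
results = {
    "cursor": [True, True, True, False, False],
    "codex":  [False, True, False, True, False],
}
union_passes(results, ["cursor"])           # 3 passes alone
union_passes(results, ["cursor", "codex"])  # 4 passes combined
```

The same function computes the oracle rate by passing in all 13 agents at once.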

The marginal cost per additional pass from adding Codex is $24.64: Codex's $3.08 per evaluation across all 128 samples, divided by the 16 extra passes (source: ensemble analysis). That is the price of each bug fixed by the second agent that the first agent missed.

Oracle ceiling and hard cases

The oracle rate (79.7%) represents the ceiling: the maximum pass rate achievable with any combination of the 13 agents tested. Out of 128 samples, 102 are solved by at least one agent. The remaining 26 samples (20.3%) resist all 13 agent configurations.

What makes a CVE "hard"? Preliminary analysis of the hard case set suggests:

  • Complex multi-file changes where agents get lost in scope
  • Language-specific edge cases in memory safety (Rust, C)
  • Novel vulnerability types not well-represented in training data
  • Implicit contract violations requiring understanding of undocumented API behavior

These hard cases likely require human intervention. An agent ensemble gets you from 62.5% to 79.7%, automating most fixable bugs. The remaining 20.3% are boundary cases where human expertise matters most.

Majority vote vs oracle

Majority vote (more than half of agents pass) yields only 64/128 = 50.0%. This is worse than the best single agent (62.5%). The reason: many samples are passed by only a few agents. Taking a majority vote penalizes bugs where only 1-3 agents find the fix. Oracle (any agent passes) is the correct ensemble strategy for patching, not majority vote.
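The gap is easy to reproduce on a toy pass/fail matrix. In the sketch below (agent names and data are illustrative), a sample solved by only one of three agents counts toward the oracle but not toward the majority vote:

```python
def oracle_passes(results):
    """Samples solved by at least one agent."""
    agents = list(results)
    n = len(results[agents[0]])
    return sum(any(results[a][i] for a in agents) for i in range(n))

def majority_vote_passes(results):
    """Samples solved by more than half of the agents."""
    agents = list(results)
    n = len(results[agents[0]])
    return sum(
        sum(results[a][i] for a in agents) > len(agents) / 2
        for i in range(n)
    )

# Sample 0: all three agents pass; sample 1: one agent; sample 2: none.
toy = {
    "a": [True, True, False],
    "b": [True, False, False],
    "c": [True, False, False],
}
oracle_passes(toy)         # 2 (samples 0 and 1)
majority_vote_passes(toy)  # 1 (only sample 0 clears the >50% bar)
```

For patching, any passing fix is usable, so there is no reason to demand agreement.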

Cost analysis

Each agent has a per-evaluation cost. Using estimated costs from the benchmark:

Agent             Cost per eval   Cost per pass   Pass rate
Claude Opus 4.6   $1.66           $2.93           61.6%
Gemini 3.1 Pro    $1.96           $3.92           58.7%
Codex GPT-5.2     $3.08           $5.30           62.7%
Cursor Opus 4.6   $22.12          $35.40          62.5%

The best pair by pass count (Codex + Cursor-Opus-4.6) is expensive because Cursor-Opus-4.6 costs $22.12 per evaluation. For cost-conscious deployments, Claude Opus 4.6 ($1.66/eval) combined with Codex GPT-5.2 ($3.08/eval) offers a more practical alternative at $4.74 per evaluation.

Sequential strategy

Run the cheaper agent first. If it fails, escalate to the more expensive agent. This avoids paying for both agents on every sample.

Example: Run Claude Opus 4.6 first ($1.66/eval, ~61.6% pass rate). On the ~38.4% of failures, escalate to Codex ($3.08/eval).

Expected cost per sample: $1.66 + ($3.08 x 0.384) = $1.66 + $1.18 = $2.84

This is cheaper than running either agent alone at its cost-per-pass rate, while achieving a higher combined pass rate.
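The expected-cost arithmetic generalizes to any cascade pair. A minimal sketch using the costs and pass rate quoted above:

```python
def cascade_cost(first_cost, first_pass_rate, second_cost):
    """Expected per-sample cost of a sequential cascade: the first agent
    always runs; the second runs only on the first agent's failures."""
    return first_cost + second_cost * (1 - first_pass_rate)

# Claude Opus 4.6 first ($1.66/eval, ~61.6% pass), escalate to Codex ($3.08):
cascade_cost(1.66, 0.616, 3.08)   # ~$2.84 per sample
# Running both agents on every sample would cost $1.66 + $3.08 = $4.74.
```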

Simultaneous strategy

Run both agents on all samples. Simpler to implement (no cascade logic), but more expensive: $1.66 + $3.08 = $4.74 per sample regardless of outcome. Good for organizations that want simplicity over cost optimization.

Practical ensemble strategies

Strategy 1: Cost-efficient pair

Run Claude Opus 4.6 + Codex GPT-5.2 sequentially. Low cost per sample (~$2.84), strong combined coverage. Best for teams with budget constraints.

Strategy 2: Maximum coverage pair

Run Codex GPT-5.2 + Cursor-Opus-4.6. Highest measured pass rate (75.0%), but Cursor-Opus-4.6 adds significant cost ($22.12/eval). Best when fix coverage matters more than cost.

Strategy 3: Lab-balanced ensemble

Run one agent from each lab (Anthropic, Google, OpenAI) to maximize architectural diversity. Example: Claude Opus 4.6 + Gemini 3.1 Pro + Codex GPT-5.2. Cost: $1.66 + $1.96 + $3.08 = $6.70 per sample. Different model architectures increase the chance that at least one agent finds the fix.

Strategy 4: Budget-limited

Your budget determines ensemble size. At $2/sample, you get one Claude run. At $5/sample, you get Claude + Codex sequential. At $7+/sample, you can add a third agent.
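One way to operationalize a budget cap is to add agents in priority order while the cumulative per-sample cost still fits. A sketch using the per-eval costs from the table above (the priority ordering is an assumption, not a benchmark recommendation):

```python
def agents_within_budget(ranked_agents, budget):
    """Pick agents in the given priority order until the per-sample
    budget is exhausted. ranked_agents: list of (name, cost_per_eval)."""
    chosen, total = [], 0.0
    for name, cost in ranked_agents:
        if total + cost <= budget:
            chosen.append(name)
            total += cost
    return chosen, round(total, 2)

ranked = [("Claude Opus 4.6", 1.66), ("Codex GPT-5.2", 3.08),
          ("Gemini 3.1 Pro", 1.96)]
agents_within_budget(ranked, 2.00)   # Claude only
agents_within_budget(ranked, 5.00)   # Claude + Codex ($4.74)
agents_within_budget(ranked, 7.00)   # all three ($6.70)
```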

When NOT to use ensembles

Ensembles are overkill if:

  • You need just 50% pass rate (single agent suffices)
  • You're verifying fixes, not generating them (one agent is enough)
  • Your CVEs are all simple (single-file changes where agents agree on 80%+)

Ensembles are worth it if:

  • You need 70%+ pass rate (single agent tops out around 62.5%)
  • Coverage matters more than cost (time to fix is critical)
  • You're running agents as a service (ensemble amortizes cost across many users)


FAQ

Can running multiple agents improve pass rates?

Yes. The best 2-agent ensemble reaches a 75.0% theoretical pass rate (96/128), up from 62.5% for the best single agent. An oracle strategy (accept any agent's pass) reaches 79.7%.
