[COMPARISON]

Cross-agent agreement: 105 pairwise comparisons

105 agent pairs compared. Best-pair ensemble: 75.7% theoretical pass rate. Highest agreement: 72.4% between Codex GPT-5.2 and Cursor Opus 4.6.

When two different agents attempt to fix the same CVE, how often do they reach the same outcome? This analysis covers all 105 unique pairwise comparisons across 15 agents evaluated in the benchmark. Understanding agreement patterns reveals which agents are interchangeable and which bring complementary strengths.

Two agents "agree" when they both pass or both fail on the same CVE. This is distinct from saying they produce identical patches. They only need to reach the same verdict (pass/fail) on the bug. Disagreement happens when one agent passes and the other fails, indicating different capabilities or approaches to the same problem.
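The agreement metric can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark's own tooling; the agent names and pass/fail verdicts below are invented:

```python
def agreement(verdicts_a: list[bool], verdicts_b: list[bool]) -> float:
    """Fraction of CVEs on which both agents reach the same pass/fail verdict."""
    assert len(verdicts_a) == len(verdicts_b)
    same = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return same / len(verdicts_a)

# Hypothetical per-CVE verdicts for two agents (True = fix passed).
agent_a = [True, True, False, True, False]
agent_b = [True, False, False, True, True]
print(agreement(agent_a, agent_b))  # same verdict on CVEs 1, 3, 4 -> 0.6
```

Note that two `True` verdicts count as agreement even if the underlying patches differ completely; only the outcome is compared.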

[PAIRWISE ANALYSIS]

How to read the agreement matrix

Agreement scores range from 45% to 72.4% across all agent pairs. This might seem low, but it tells you something useful: no two agents have nearly identical capabilities. If agreement were above 90%, agents would be redundant. The moderate agreement rates mean that running multiple agents on the same CVE would measurably improve coverage.

The highest agreement pair reaches 72.4%: codex-gpt-5.2 and cursor-opus-4.6. These agents reach the same verdict on roughly 3 out of every 4 samples and diverge on the remaining 1 in 4. The lowest agreement pair sits around 45%: opencode-opus-4-5 and gemini-3-pro. These agents diverge on more than half their attempts, with one passing where the other fails.

Agreement scores answer a practical question: are these agents redundant? If you've already run Agent A and it failed, should you try Agent B? The agreement matrix tells you exactly how much marginal value Agent B adds. At 45% agreement, the two agents reach different verdicts on 55% of samples, and on each of those samples exactly one of them passes. A substantial share of Agent A's failures therefore become passes once Agent B is added. That's real coverage gain.

High agreement pairs

The four agent pairs with the highest agreement scores:

  1. Codex GPT-5.2 + Cursor Opus-4.6 (72.4%). These agents are closest in capability. They handle the same bugs well and fail on the same hard cases. If you can only afford to run one of the two, the data shows Codex slightly ahead in pass rate and Cursor cheaper per fix.

  2. Claude Opus-4.6 + Gemini-3.1-Pro (69.8%). Strong agreement between Anthropic's and Google's flagship models. Both hit similar pass rates (~58-60%) and tend to agree on which samples are solvable.

  3. Claude Opus-4.6 + Codex-GPT-5.2 (68.5%). These form the "best ensemble pair" specifically because their agreement is high enough to be reliable but low enough to be complementary (see ensemble strategies).

  4. Gemini-3.1-Pro + Cursor-Opus-4.6 (67.3%). Cross-lab agreement between Google and Anthropic models via Cursor. Still relatively high, suggesting multiple labs converge on similar problem-solving approaches.

High agreement pairs are useful for verification: if Agent A and Agent B both pass the same CVE, two independently produced fixes passed the tests, which raises confidence that the fix is sound rather than a fluke. But as a sole strategy, high-agreement pairs add little value for coverage, since they tend to fail on the same hard bugs.

Low agreement pairs

The four agent pairs with the lowest agreement scores:

  1. Opencode-Opus-4-5 + Gemini-3-Pro (~45%). These agents agree less than half the time. Opencode (a wrapper) dramatically underperforms native models, creating maximum divergence with top-tier models. This pair gives a 55% disagreement rate, the highest marginal value for ensemble runs.

  2. Opencode-Opus-4-5 + Codex-GPT-5.2 (47.2%). The wrapper dampens model performance, creating complementary coverage with the top-tier native CLI agent.

  3. Gemini-3.0 + Claude-Opus-4.6 (48.1%). Older Gemini vs newer Opus. The generation gap creates divergent outcomes on edge cases.

  4. Opencode-GPT-5.2-Codex + Gemini-3.1-Pro (~49%). Double-wrapped sandbox agent paired with a native CLI model. Multiple abstraction layers drive the divergence.

Low agreement pairs are the ensemble workhorses. Running both agents on the same CVE yields maximum coverage gain. The tradeoff is cost: you pay for two agents instead of one. But on the 55% of samples where the agents disagree, exactly one of them produces a passing fix, so the ensemble converts many single-agent failures into passes. That's pure coverage gain.

Agreement by lab

Same-lab agents (from the same company) show higher agreement than cross-lab pairs:

  • Anthropic models (Claude variants): agreement averages ~65%
  • Google models (Gemini variants): agreement averages ~62%
  • OpenAI models (Codex, GPT variants): agreement averages ~64%
  • Cross-lab pairs: agreement drops to ~52% average

This pattern reflects different training data, architectures, and problem-solving philosophies within each lab. Anthropic's agents converge on similar approaches to bug fixing. But a Codex agent and a Gemini agent, trained independently, diverge on roughly half their attempts.

The implication: if you're locked into one lab's models due to cost or organizational preference, you sacrifice ~13 percentage points of ensemble coverage compared to running agents from multiple labs. The diversity of training backgrounds drives disagreement, and disagreement is what ensemble strategies exploit.

Disagreement analysis

When agents disagree, exactly one passes and the other fails. These disagreement samples are the most informative. They reveal:

  • Where model architecture matters (GPT's approach fails but Claude's succeeds)
  • Differences in code understanding depth (one agent misses a bug detail the other catches)
  • Contrasting repair strategies (one generates compilable code, the other doesn't)
  • Tool use patterns (one calls external resources the other ignores)

Analyzing a single disagreement sample often teaches you something about how different agents approach reasoning. If Agent A generates a patch that modifies the wrong file but Agent B correctly identifies the vulnerable function, you learn something about their code comprehension.

A disagreement set across 100+ samples shows empirically which agents are worth running together. The 105 pairwise comparisons give you all the data you need to construct an optimal ensemble strategy for your use case. If coverage is your goal, pick pairs with disagreement rates above 50%. If cost is your goal, pick pairs where one is cheap (Gemini-3.0, Cursor Composer 1.5) and the other is accurate (Codex-GPT-5.2).
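The pair-selection step described above can be sketched as a ranking over a results table. The agent names and verdicts here are hypothetical placeholders, not the benchmark's actual data:

```python
from itertools import combinations

# Hypothetical per-CVE verdicts for three agents (True = fix passed).
results = {
    "cheap_agent":  [True, False, True, False, False],
    "mid_agent":    [True, True, False, False, True],
    "strong_agent": [True, True, True, True, False],
}

def disagreement_rate(a: list[bool], b: list[bool]) -> float:
    """Share of samples where exactly one of the two agents passes."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Rank pairs by disagreement: high-disagreement pairs maximize ensemble coverage.
pairs = sorted(
    combinations(results, 2),
    key=lambda pair: disagreement_rate(results[pair[0]], results[pair[1]]),
    reverse=True,
)
```

For coverage-oriented ensembles, pick from the top of this ranking; for verification runs, pick from the bottom.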

Ensemble potential from disagreement

The disagreement set directly bounds ensemble improvement. If Agent A passes 60% of samples and Agent B passes 60% of samples but they disagree on 25% of samples, the 2-agent ensemble (use either agent's fix) can theoretically reach:

  • Agent A passes 60% of samples
  • Agent B passes 60% of samples
  • Because the pass rates match, the 25% disagreement splits evenly: on 12.5% of samples Agent A fails but Agent B passes
  • Expected ensemble pass: 60% + 12.5% = 72.5%

This theoretical ceiling follows directly from the pass rates and the disagreement rate. For this example, the full joint breakdown is: both agents pass 47.5% of samples, both fail 27.5%, and they disagree on the remaining 25%. The A-fails-but-B-passes half of that disagreement set is exactly the sweet spot for ensemble coverage.
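The arithmetic above generalizes to unequal pass rates. A minimal sketch (an illustration, not benchmark tooling), using the identity that the A-fails-but-B-passes share equals half of (disagreement + pass_b − pass_a):

```python
def ensemble_ceiling(pass_a: float, pass_b: float, disagreement: float) -> float:
    """Theoretical pass rate of a 2-agent ensemble (a sample passes if either agent passes).

    Derivation: let x = share where A passes and B fails, y = share where
    A fails and B passes. Then x + y = disagreement and x - y = pass_a - pass_b,
    so y = (disagreement + pass_b - pass_a) / 2. Ensemble = pass_a + y.
    """
    a_fails_b_passes = (disagreement + pass_b - pass_a) / 2
    return pass_a + a_fails_b_passes

# The worked example from the text: two 60% agents disagreeing on 25% of samples.
print(ensemble_ceiling(0.60, 0.60, 0.25))  # 0.725
```

With equal pass rates this reduces to pass rate plus half the disagreement, which is why the 75.7% best-pair ceiling requires both a strong base agent and a healthy disagreement rate.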

Practical implications

Agreement and disagreement metrics guide agent selection for three scenarios:

  1. Single-agent deployment: Choose the agent with the highest pass rate or best cost per pass. Agreement is irrelevant; you're only running one.

  2. Verification runs: Pick two agents with high agreement (above 65%). If both pass, you have strong signal. If both fail, the bug is genuinely hard. If they disagree, investigate why.

  3. Ensemble/fallback runs: Pick two agents with low agreement (below 50%). Run the cheap one first. If it fails, escalate to the expensive one. Maximum coverage gain for minimum cost.
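The fallback scenario can be wired up as a simple cheap-first escalation. This is a minimal sketch in which the agent callables and labels are placeholders, not real agent APIs:

```python
from typing import Callable

def fallback_fix(cve_id: str,
                 cheap_agent: Callable[[str], bool],
                 strong_agent: Callable[[str], bool]) -> str:
    """Run the cheap agent first; escalate to the expensive one only on failure."""
    if cheap_agent(cve_id):
        return "fixed-by-cheap"
    if strong_agent(cve_id):
        return "fixed-by-strong"
    return "unfixed"

# Toy stand-in agents: the cheap one only handles CVEs tagged "easy".
cheap = lambda cve: cve.startswith("easy")
strong = lambda cve: True
print(fallback_fix("easy-001", cheap, strong))  # fixed-by-cheap
print(fallback_fix("hard-042", cheap, strong))  # fixed-by-strong
```

The escalation order matters: because you only pay for the expensive agent on the cheap agent's failures, expected cost stays close to the cheap agent's while coverage approaches the ensemble ceiling.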

All 105 pairwise comparisons are available in the full benchmark dataset. Use them to build custom ensemble strategies tuned to your cost and coverage constraints.

Explore more

FAQ

Do different agents agree on which CVEs they can fix?

Agreement ranges from 45% to 72.4%. The highest-performing agents agree most. The best 2-agent ensemble could theoretically reach 75.7% pass rate.

[RELATED TOPICS]

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.