Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Configuration details
Each configuration specifies the model, system prompt, tool access, and memory settings. Differences in configuration drive differences in pass rate and cost.
Agent Comparison
15 agents ranked by pass rate on the same 128 real vulnerabilities. Each agent runs in an isolated container with automated safety checks.
The agents span multiple model providers and configurations. Some are CLI tools that read files locally. Others are API-based agents with web access. Some use older models, others use the latest. This diversity matters because it shows what is actually available to security teams right now, not just theoretical best cases.
Each agent configuration is held fixed across all 128 samples. We do not tune agents per-bug or cherry-pick runs. The results are reproducible - if you run the same agent again, you should get similar pass rates. This reproducibility is what makes the benchmark actionable for your own pipeline.
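To make that concrete, here is a minimal sketch of how you might check reproducibility yourself: run the same configuration twice and compare per-sample verdicts. The sample IDs, data structures, and verdict labels below are illustrative assumptions, not the benchmark's actual output format.

```python
# Hypothetical per-sample verdicts from two runs of the same agent configuration.
# Keys are sample IDs; values are "pass", "fail", or "build".
run_a = {"sample-001": "pass", "sample-002": "fail", "sample-003": "pass"}
run_b = {"sample-001": "pass", "sample-002": "pass", "sample-003": "pass"}

def pass_rate(run: dict[str, str]) -> float:
    """Fraction of samples whose patch passed all checks."""
    return sum(v == "pass" for v in run.values()) / len(run)

def agreement(a: dict[str, str], b: dict[str, str]) -> float:
    """Fraction of shared samples with the same verdict in both runs."""
    shared = a.keys() & b.keys()
    return sum(a[s] == b[s] for s in shared) / len(shared)

print(f"run A pass rate: {pass_rate(run_a):.1%}")
print(f"run B pass rate: {pass_rate(run_b):.1%}")
print(f"per-sample agreement: {agreement(run_a, run_b):.1%}")
```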
Agent naming guide
Agent names follow the format harness/model. The same LLM through different coding agents produces different patch quality.
For example: claude/opus-4-5 runs Claude Opus 4.5 through Anthropic's Claude Code agent. opencode/claude-opus-4-5 runs the same model through the OpenCode harness. Same model, different harness — different results. The benchmark measures the agent harness, not just the underlying model.
Why harness matters:
- How the agent reads code (all at once vs incremental)
- Tool availability (git, editor, compiler access)
- Iteration logic (one shot vs refine-and-retry)
- Tokenization and context window management
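As a rough illustration of the naming scheme, the sketch below splits configuration names into their harness and model parts. The first two names are the examples given above; the last two are assumed names added only for illustration, and the parsing helper is not part of the benchmark tooling.

```python
# Configuration names in the harness/model format described above.
configs = [
    "claude/opus-4-5",            # example from this page
    "opencode/claude-opus-4-5",   # example from this page
    "codex/gpt-5.2",              # assumed name, for illustration
    "cursor/gpt-5.2",             # assumed name, for illustration
]

def split_config(name: str) -> tuple[str, str]:
    """Split a 'harness/model' configuration name into (harness, model)."""
    harness, model = name.split("/", 1)
    return harness, model

for name in configs:
    harness, model = split_config(name)
    print(f"harness={harness:<10} model={model}")
```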
Compare agents by lab
See how agents from each model provider perform on CVE patching.
Each configuration is listed with its pass / fail / build counts:

- Claude Opus 4.5 (Anthropic): 58 / 43 / 26
- Claude Opus 4.6 (Anthropic): 77 / 28 / 20
- Codex GPT-5.2 (OpenAI): 79 / 12 / 35
- Codex GPT-5.2 Codex (OpenAI): 63 / 27 / 38
- Cursor Composer 1.5 (Cursor): 57 / 39 / 30
- Cursor GPT-5.2 (OpenAI): 63 / 34 / 25
- Cursor GPT-5.3 Codex (OpenAI): 64 / 40 / 23
- Cursor Opus 4.6 (Anthropic): 80 / 24 / 24
- Gemini 3 Pro Preview (Google): 55 / 36 / 37
- Gemini 3.1 Pro Preview (Google): 64 / 18 / 27
- OpenCode Claude Opus 4.5 (Anthropic): 46 / 29 / 50
- OpenCode Claude Opus 4.6 (Anthropic): 58 / 15 / 49
- OpenCode Gemini 3.1 Pro Preview (Google): 67 / 25 / 30
- OpenCode GPT-5.2 (OpenAI): 63 / 11 / 48
- OpenCode GPT-5.2 Codex (OpenAI): 48 / 32 / 47
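To turn those counts into comparable rates, here is a minimal sketch that computes a per-agent pass rate as pass / (pass + fail + build). The denominator is an assumption for illustration; the headline figures on this page may be computed differently (for example, excluding build failures).

```python
# Pass / fail / build counts copied from the comparison above (subset shown).
results = {
    "Claude Opus 4.5": (58, 43, 26),
    "Claude Opus 4.6": (77, 28, 20),
    "Codex GPT-5.2": (79, 12, 35),
    "Cursor Opus 4.6": (80, 24, 24),
    # ... remaining agents omitted for brevity
}

# Rank agents by pass rate, highest first.
for agent, (passed, failed, build) in sorted(
    results.items(), key=lambda kv: kv[1][0] / sum(kv[1]), reverse=True
):
    total = passed + failed + build
    print(f"{agent:<20} {passed}/{total} = {passed / total:.1%}")
```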
[NEXT STEPS]
Find the right agent for your stack
Different agents excel at different types of bugs. We can test agents against your codebase to find the best fit.
Explore more
- Full leaderboard: pass rates and cost analysis
- Economics: cost per fix, best trade-offs
- Methodology: validity checks, difficulty scoring
- Agent strategies: how agents cluster by approach and behavior
- Execution metrics: turns, tool calls, and token usage by agent
FAQ
Which agents are benchmarked?
15 agent-model configurations including Claude Code, Codex, Gemini CLI, and others. Each tested on the same 128 vulnerabilities.
How often are benchmarks updated?
Benchmarks update as new models ship. We re-run tests regularly so rankings stay current.
Can I add my own agent?
Yes. Any agent that writes code can be benchmarked. Contact us to add your agent configuration to the test suite.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing, challenging assumptions, exploring implications, and extending findings.
Cost Analysis
10 findings on what AI patching costs and whether it is worth buying. 1,920 evaluations analyzed.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.