Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Configuration details
Each configuration specifies the model, system prompt, tool access, and memory settings. Differences in configuration drive differences in pass rate and cost.
Agent Comparison
15 agents ranked by pass rate on the same 128 real vulnerabilities. Each agent runs in an isolated container with automated safety checks.
The agents span multiple model providers and configurations. Some are CLI tools that read files locally. Others are API-based agents with web access. Some use older models, others use the latest. This diversity matters because it shows what is actually available to security teams right now, not just theoretical best cases.
Each agent configuration is held fixed across all 128 samples. We do not tune agents per-bug or cherry-pick runs. The results are reproducible - if you run the same agent again, you should get similar pass rates. This reproducibility is what makes the benchmark actionable for your own pipeline.
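To make that concrete, here is a minimal sketch of how you might check reproducibility yourself: run the same configuration twice and compare per-sample verdicts. The sample IDs, data structures, and verdict labels below are illustrative assumptions, not the benchmark's actual output format.

```python
# Hypothetical per-sample verdicts from two runs of the same agent configuration.
# Keys are sample IDs; values are "pass", "fail", or "build".
run_a = {"sample-001": "pass", "sample-002": "fail", "sample-003": "pass"}
run_b = {"sample-001": "pass", "sample-002": "pass", "sample-003": "pass"}

def pass_rate(run: dict[str, str]) -> float:
    """Fraction of samples whose patch passed all checks."""
    return sum(v == "pass" for v in run.values()) / len(run)

def agreement(a: dict[str, str], b: dict[str, str]) -> float:
    """Fraction of shared samples with the same verdict in both runs."""
    shared = a.keys() & b.keys()
    return sum(a[s] == b[s] for s in shared) / len(shared)

print(f"run A pass rate: {pass_rate(run_a):.1%}")
print(f"run B pass rate: {pass_rate(run_b):.1%}")
print(f"per-sample agreement: {agreement(run_a, run_b):.1%}")
```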
Agent naming guide
Agent names follow the format harness/model. The same LLM through different coding agents produces different patch quality.
For example: claude/opus-4-5 runs Claude Opus 4.5 through Anthropic's Claude Code agent. opencode/claude-opus-4-5 runs the same model through the OpenCode harness. Same model, different harness — different results. The benchmark measures the agent harness, not just the underlying model.
Why harness matters:
- How the agent reads code (all at once vs incremental)
- Tool availability (git, editor, compiler access)
- Iteration logic (one shot vs refine-and-retry)
- Tokenization and context window management
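As a rough illustration of the naming scheme, the sketch below splits configuration names into their harness and model parts. The first two names are the examples given above; the last two are assumed names added only for illustration, and the parsing helper is not part of the benchmark tooling.

```python
# Configuration names in the harness/model format described above.
configs = [
    "claude/opus-4-5",            # example from this page
    "opencode/claude-opus-4-5",   # example from this page
    "codex/gpt-5.2",              # assumed name, for illustration
    "cursor/gpt-5.2",             # assumed name, for illustration
]

def split_config(name: str) -> tuple[str, str]:
    """Split a 'harness/model' configuration name into (harness, model)."""
    harness, model = name.split("/", 1)
    return harness, model

for name in configs:
    harness, model = split_config(name)
    print(f"harness={harness:<10} model={model}")
```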
Compare agents by lab
See how agents from each model provider perform on CVE patching.
Each configuration is listed with its pass / fail / build counts:

- Claude Opus 4.5 (Anthropic): 58 / 43 / 26
- Claude Opus 4.6 (Anthropic): 77 / 28 / 20
- Codex GPT-5.2 (OpenAI): 79 / 12 / 35
- Codex GPT-5.2 Codex (OpenAI): 63 / 27 / 38
- Cursor Composer 1.5 (Cursor): 57 / 39 / 30
- Cursor GPT-5.2 (OpenAI): 63 / 34 / 25
- Cursor GPT-5.3 Codex (OpenAI): 64 / 40 / 23
- Cursor Opus 4.6 (Anthropic): 80 / 24 / 24
- Gemini 3 Pro Preview (Google): 55 / 36 / 37
- Gemini 3.1 Pro Preview (Google): 64 / 18 / 27
- OpenCode Claude Opus 4.5 (Anthropic): 46 / 29 / 50
- OpenCode Claude Opus 4.6 (Anthropic): 58 / 15 / 49
- OpenCode Gemini 3.1 Pro Preview (Google): 67 / 25 / 30
- OpenCode GPT-5.2 (OpenAI): 63 / 11 / 48
- OpenCode GPT-5.2 Codex (OpenAI): 48 / 32 / 47
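To turn those counts into comparable rates, here is a minimal sketch that computes a per-agent pass rate as pass / (pass + fail + build). The denominator is an assumption for illustration; the headline figures on this page may be computed differently (for example, excluding build failures).

```python
# Pass / fail / build counts copied from the comparison above (subset shown).
results = {
    "Claude Opus 4.5": (58, 43, 26),
    "Claude Opus 4.6": (77, 28, 20),
    "Codex GPT-5.2": (79, 12, 35),
    "Cursor Opus 4.6": (80, 24, 24),
    # ... remaining agents omitted for brevity
}

# Rank agents by pass rate, highest first.
for agent, (passed, failed, build) in sorted(
    results.items(), key=lambda kv: kv[1][0] / sum(kv[1]), reverse=True
):
    total = passed + failed + build
    print(f"{agent:<20} {passed}/{total} = {passed / total:.1%}")
```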
[NEXT STEPS]
Find the right agent for your stack
Different agents excel at different types of bugs. We can test agents against your codebase to find the best fit.
Explore more
- Full leaderboard: pass rates and cost analysis
- Economics: cost per fix, best trade-offs
- Methodology: validity checks, difficulty scoring
- Agent strategies: how agents cluster by approach and behavior
- Execution metrics: turns, tool calls, and token usage by agent
FAQ
Which agents are benchmarked?
15 agent-model configurations including Claude Code, Codex, Gemini CLI, and others. Each tested on the same 128 vulnerabilities.
How often are benchmarks updated?
Benchmarks update as new models ship. We re-run tests regularly so rankings stay current.
Can I add my own agent?
Yes. Any agent that writes code can be benchmarked. Contact us to add your agent configuration to the test suite.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing, challenging assumptions, exploring implications, and extending findings.
Cost Analysis
10 findings on what AI patching costs and whether it is worth buying. 1,920 evaluations analyzed.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.