Agent Strategies
How different agents approach the same bug. Behavioral clustering reveals that strategy matters as much as model capability.
Strategy clusters
K-means clustering on session features (turns, tool calls, file reads, edits, backtracking) reveals distinct behavioral patterns. Agents cluster by approach, not just by model.
What this means for agent selection
If an agent falls into a low-pass-rate cluster, its behavioral pattern may be the bottleneck - not its model intelligence. Strategy is sometimes more tunable than the model itself.
How AI agents approach the same bug differently
K-means clustering on 10 session features (turns, tool calls, file reads, edits, backtracking) reveals 3 distinct behavioral patterns across 973 sessions. Some agents explore broadly. Others edit fast. The pattern predicts the outcome.
Behavior matters because it tells you why agents succeed or fail - not just whether they do. An agent that reads every file in a project before making a single edit uses more tokens than an agent that reads targeted files. An agent that generates many candidate patches and tests them has different cluster characteristics from one that generates a single patch. These patterns emerge from the data.
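The clustering step above can be sketched in miniature. This is a minimal pure-Python K-means over session feature vectors, using five of the ten named features; the session values are invented for illustration, and a real pipeline would standardize features before clustering so that high-magnitude counts like tool calls do not dominate the distance metric.

```python
import random

# Hypothetical session feature vectors: (turns, tool_calls, file_reads,
# edits, backtracks). Values are illustrative, not benchmark data.
sessions = [
    (12, 30, 25, 3, 4), (11, 28, 22, 2, 5), (40, 90, 70, 8, 15),
    (38, 85, 66, 7, 14), (6, 10, 4, 2, 0), (5, 9, 3, 2, 1),
]

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct sessions
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each session joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: recompute each centroid as its cluster's mean.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(dim) / len(members)
                                           for dim in zip(*members)))
            else:
                new_centroids.append(c)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans(sessions, k=3)
```

Each resulting cluster groups sessions with similar turn, read, and edit profiles, which is how behavioral labels like "explorer" or "speed-runner" can then be attached by inspecting centroids.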
[KEY INSIGHT]
60% pass rate in the best cluster
The cluster with the highest pass rate shares specific behavioral traits. Pass rate correlates with approach - not just model capability.
This is important because it means behavioral tuning can improve outcomes without changing the underlying model. If an agent falls into a low-performing cluster, its behavioral pattern (how it reads, explores, edits) might be the bottleneck. Changing the prompt or agent instructions to shift behavior could improve pass rates more than swapping to a different model.
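Measuring whether a behavioral cluster is the bottleneck reduces to comparing pass rates per cluster. A minimal sketch, assuming each session record carries its assigned cluster label and a pass/fail outcome; the records below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical (cluster_label, passed) records, one per session.
results = [
    ("surgical-expert", True), ("surgical-expert", True),
    ("surgical-expert", False), ("explorer", False),
    ("speed-runner", True), ("speed-runner", False),
]

by_cluster = defaultdict(lambda: [0, 0])  # cluster -> [passes, total]
for cluster, passed in results:
    by_cluster[cluster][0] += int(passed)
    by_cluster[cluster][1] += 1

pass_rates = {c: p / n for c, (p, n) in by_cluster.items()}
```

A large pass-rate gap between clusters that share the same underlying model is the signal that behavior, not capability, is driving the outcome.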
Cluster composition
Each cluster's size, pass rate, and dominant agents. Larger clusters represent the most common behavioral strategy.
Cluster size reveals what approaches agents naturally converge on. If most agents fall into a single cluster, they are using similar strategies. If agents spread across clusters, they take diverse approaches. Diversity can be good (different strengths on different bug types) or bad (no coherent strategy).
speed-runner - 211 sessions
explorer - 25 sessions
surgical-expert - 737 sessions
DPO preference pairs
For each CVE sample, we construct preference pairs from the evaluation outcomes.
Gold pairs
Pass vs build-fail. Strongest signal. The winning patch fixes the vulnerability while the losing patch breaks the build.
Silver pairs
Pass vs test-fail. Medium signal. Both patches compile but only one fixes the bug.
Bronze pairs
Test-fail vs build-fail. Weakest signal. Neither fixes the bug, but one at least compiles.
[NEXT STEPS]
Drill into session-level data
The session analysis page shows per-agent metrics: turns, tool calls, and token usage. The trajectory data lets you trace individual agent decision paths.
Explore more
- Bug complexity - which bugs are easy, which are impossible
- Economics - cost per fix by agent and strategy
FAQ
How do agents differ in their approach?
Some agents explore broadly (read many files, backtrack often). Others edit fast (fewer reads, targeted changes). The approach pattern predicts the outcome.
Can I configure agent strategy?
Often yes. System prompts, tool access, and memory settings influence agent behavior. The best-performing cluster shares specific traits: fewer file reads, targeted edits, less backtracking.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.