[COMPARISON]

Three behavioral clusters: how agents approach CVE patching

Speed-runners (211 sessions, 60.2% pass), explorers (25 sessions, 32%), surgical-experts (737 sessions, 54.9%). Clustered by tool usage, turn count, and token patterns.

[SESSION ANALYSIS]

Three distinct patterns emerge

Analysis of 973 agent sessions across the 136-sample benchmark reveals three behavioral clusters. These clusters are not defined by model or configuration, but by how agents use tools, structure their reasoning, and escalate when stuck.

Speed-runner cluster: 211 sessions, 60.2% pass rate. Fast execution, focused toolset, high accuracy. Agents in this cluster complete patches in 4-6 turns, use only 2-3 tools, and make decisive fixes.

Exploratory cluster: 25 sessions, 32.0% pass rate. Broad tool usage, many turns, low convergence. These agents try 5+ tools, spend 8-12 turns, and often don't reach a fix.

Surgical-expert cluster: 737 sessions, 54.9% pass rate. Moderate speed, balanced tool usage, and by far the most common pattern. Most agent sessions fall here most of the time.
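The cluster assignment described above can be sketched as a nearest-centroid check. This is a minimal illustration, not the benchmark's actual pipeline: the centroids use the per-cluster averages reported in this analysis (turns, tools), and the token features are omitted for brevity.

```python
# Assign a session to one of the three behavioral clusters by
# nearest centroid over (average turns, average tools used).
# Centroid values come from the per-cluster averages in this article;
# the real analysis also used token patterns, which are omitted here.
import math

CENTROIDS = {
    "speed-runner":    (4.3, 2.8),
    "explorer":        (9.1, 5.2),
    "surgical-expert": (6.2, 3.8),
}

def assign_cluster(turns: float, tools: float) -> str:
    """Return the cluster whose (turns, tools) centroid is closest."""
    return min(
        CENTROIDS,
        key=lambda name: math.dist((turns, tools), CENTROIDS[name]),
    )
```

A session with 4 turns and 3 tools lands in the speed-runner cluster; one with 10 turns and 5 tools lands with the explorers.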

Speed-runners: fast, narrow, accurate

211 sessions cluster around agents that know what to do and do it quickly. These sessions average 4.3 turns, use 2.8 tools, and achieve a 60.2% pass rate. When they fix a bug, they fix it cleanly. When they fail, they fail fast.

Speed-runners are dominated by Codex (GPT-5.2) configurations. Codex Native (Codex CLI) reaches a 62.7% pass rate. The wrapper versions drop to 49-51% because wrappers introduce latency and tool abstraction that break the speed-runner's momentum.

Characteristics of speed-runner sessions:

  • First tool choice is high-confidence. Agents browse code, understand the bug, then write a patch.
  • Patch attempts are successful or fail in ways that trigger immediate re-architecture. No lingering patches.
  • Tool switching is rare. Once a tool succeeds, agents use it for all follow-up actions.
  • Token efficiency is high. Speed-runners use fewer tokens per fix than other clusters, suggesting tighter reasoning loops.

Speed-runners work best on deterministic bugs, those where the vulnerability is obvious once you've read the code. They excel at buffer overflows, off-by-one errors, and use-after-free conditions. For these bug types, the fast path is the right path.

Explorers: searching for the root cause

Only 25 sessions cluster in the exploratory pattern, but they reveal something important: agents that explore broadly before committing to a fix tend to get stuck. At 32.0%, their pass rate is the lowest of the three clusters.

Exploratory sessions average 9.1 turns and use 5.2 tools, nearly twice the speed-runner profile. Agents try code browsing, then debugging, then trace analysis, then diff generation, then testing. They're looking for the root cause in parallel rather than focusing.

Characteristics of exploratory sessions:

  • Frequent tool switching without clear success. Agents try one tool, get partial results, then switch to another tool instead of continuing.
  • Long deliberation phases between actions. More tokens spent on reasoning than on actual tool output.
  • Redundant information gathering. Multiple tools provide the same signals (e.g., both debugger and trace output show the same call stack).
  • Higher failure rate. Exploratory sessions fail on bugs that have subtle root causes. The broad search doesn't converge.
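The tool-switching behavior that marks exploratory sessions can be quantified with a simple metric. This is a hypothetical sketch: the `switch_rate` function and the session tool names below are illustrative, not drawn from the benchmark's logs.

```python
# Measure how often a session changes tools between consecutive calls.
# A "switch" is any adjacent pair of tool calls using different tools.
def switch_rate(tool_calls: list[str]) -> float:
    """Fraction of consecutive call pairs that change tools."""
    if len(tool_calls) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(tool_calls, tool_calls[1:]))
    return switches / (len(tool_calls) - 1)

# A focused session reuses one tool once it succeeds; an exploratory
# session keeps switching. Tool names here are placeholders.
focused     = ["read", "read", "patch", "patch"]      # switch rate 1/3
exploratory = ["read", "debug", "trace", "diff", "test"]  # switch rate 1.0
```

A high switch rate with few successful patches is the signature of the thrashing described above.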

Explorers are likely to be newer or out-of-domain models running in wrapper CLIs. OpenCode sessions often show exploratory patterns. When the model is less confident about the fix, it hedges by trying multiple approaches.

Surgical-experts: the default mode

737 sessions, 75.8% of the total, fall into the surgical-expert cluster. This is the modal behavior: most agents, most of the time, follow this pattern.

Surgical-experts average 6.2 turns and use 3.8 tools. They fall between speed-runners and explorers. The 54.9% pass rate is the benchmark average, suggesting this cluster represents "normal" agent behavior.

Characteristics of surgical-expert sessions:

  • Methodical tool selection. Agents have a plan (browse code, look for the bug, test a patch) and stick to it.
  • Successful tool chains. When a tool works, agents use it as a foundation. When it doesn't, they switch with a clear hypothesis.
  • Moderate token efficiency. More deliberation than speed-runners, but less thrashing than explorers.
  • Balanced failure modes. Surgical-experts fail for multiple reasons. Some bugs are too hard, some are rare edge cases.

The name "surgical-expert" reflects the modal behavior: agents who know the basic shape of the problem and execute a reasonable fix strategy, without the speed of confident speed-runners or the exploration of uncertain explorers.

Cluster by agent

Not all agents fit one cluster exclusively. A given agent often spans multiple clusters depending on the bug type.

Codex (GPT-5.2 native) clusters 68% speed-runner, 10% exploratory, 22% surgical-expert. The model's base capability pushes toward fast, focused execution.

Claude Opus 4.6 clusters 25% speed-runner, 5% exploratory, 70% surgical-expert. This model is more cautious. It explores less than Codex, but also doesn't rush into fixes as aggressively.

OpenCode with Opus 4.6 clusters 8% speed-runner, 18% exploratory, 74% surgical-expert. The wrapper's abstraction layers disable the speed-runner pattern. Agents spend more time exploring because tool latency forces broader information gathering.

Gemini 3.1 Pro clusters 45% speed-runner, 8% exploratory, 47% surgical-expert. Gemini falls between Codex's confidence and Claude's caution.

The cluster shift when moving from native to wrapper confirms an earlier finding: wrappers disable the speed-runner pattern. Agents can't move fast when tool abstractions introduce latency.
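The per-agent breakdowns above can be reproduced from labeled sessions with a small aggregation. This is a sketch under the assumption that each session record carries an agent name and a cluster label; the function and record shapes are illustrative.

```python
# Compute each agent's cluster mix (as percentages) from a list of
# (agent, cluster) session records. Record contents are placeholders.
from collections import Counter

def cluster_mix(sessions: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """sessions: (agent, cluster) pairs -> {agent: {cluster: percent}}."""
    by_agent: dict[str, Counter] = {}
    for agent, cluster in sessions:
        by_agent.setdefault(agent, Counter())[cluster] += 1
    return {
        agent: {c: 100 * n / sum(counts.values()) for c, n in counts.items()}
        for agent, counts in by_agent.items()
    }
```

Feeding it 68 speed-runner, 10 exploratory, and 22 surgical-expert sessions for one agent recovers the 68/10/22 split reported for Codex native.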

Implications for deployment

You cannot force an agent into a cluster by configuration alone. The model and CLI determine behavioral tendencies. Codex tends toward speed-running. Claude tends toward surgical-expert. OpenCode tends toward exploration.

This suggests different models are suited to different vulnerability types. If your bug corpus contains many fast-path bugs (buffer overflows, integer overflows), deploy Codex. If it contains subtle logic bugs, Claude's surgical-expert pattern may be more reliable. If you're testing new models, expect high exploration rates until the model gains confidence.

Session clustering also helps debug agent failures. If an agent suddenly shows a high rate of exploratory sessions, it may indicate:

  • A change in bug corpus (shifting toward harder bugs)
  • A model degradation (loss of confidence)
  • A tool integration problem (tool latency causing delayed feedback)
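The monitoring idea above can be sketched as a simple drift check against the benchmark baseline (25 exploratory sessions out of 973, about 2.6%). The 2x alert threshold is an assumption for illustration, not a recommendation from the benchmark.

```python
# Flag a run whose exploratory-session share drifts well above the
# baseline observed in this analysis (25 / 973 sessions, ~2.6%).
# The default 2x factor is an illustrative assumption.
BASELINE_EXPLORATORY = 25 / 973

def exploration_alert(labels: list[str], factor: float = 2.0) -> bool:
    """True if the exploratory fraction exceeds factor x baseline."""
    if not labels:
        return False
    frac = labels.count("exploratory") / len(labels)
    return frac > factor * BASELINE_EXPLORATORY
```

A run with 10 exploratory sessions out of 100 (~10%) would trip the alert, prompting a look at the bug corpus, the model, or the tool integration.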

See Benchmark sessions | Behavior analysis | Codex GPT-5.2 profile | Claude Opus 4.6 profile

FAQ

What behavioral patterns do CVE patching agents show?

Three clusters: speed-runners (fast, narrow toolset, 60.2% pass), explorers (many tools, 32.0% pass, the lowest rate), and surgical-experts (moderate speed, highest volume, 54.9% pass).

[RELATED TOPICS]

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.