Cost Analysis
9 findings on what AI vulnerability patching costs and whether it is worth buying. 1,664 evaluations analyzed.
Should you buy
Break-even against manual fixing ($150/hr) happens at 10 CVEs per year for the cheapest agent. Most teams patch more than that.
Which agent to buy
Four agents sit on the Pareto frontier: the best trade-offs between cost and pass rate. Nine agents are dominated; for each, another agent is both cheaper and more accurate.
What AI vulnerability patching costs and whether it is worth it
9 findings from the benchmark economics analysis: 6 backed by measured token data, 1 relying on heuristic cost estimates with lower confidence.
[KEY INSIGHT]
3-agent waterfall = 66.9% at $4.22/pass
The top 3 agents cover 66.9% of bugs. Agents 4 through 9 add only 7.4 percentage points more. Diminishing returns hit fast.
Confidence distribution
Each finding rated by evidence quality. High = measured token data from API logs. Medium = partial measurements. Low = heuristic estimates only.
All 9 findings
Is the $4.85/fix Gemini cost real or an artifact of turn_heuristic?
Verdict
UNCERTAIN — heuristic-only, no measured token data
The Gemini cost of $4.85/pass uses turn_heuristic (0 sessions with token data), which assumes $1.96/eval based on agent-average turn counts. Only Claude agents have measured token data (114+122 sessions). The heuristic could over- or under-estimate by 2-5x depending on actual Gemini token usage patterns (…).
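To make the uncertainty concrete, here is a minimal sketch of how a turn-based heuristic like turn_heuristic typically prices an eval. The tokens-per-turn and price constants are assumptions, not the benchmark's actual inputs, and the ~40% pass rate below is simply implied by the two published figures.

```python
def heuristic_cost_per_eval(avg_turns, tokens_per_turn, price_per_mtok):
    """Estimate per-eval cost when no token logs exist: turns x tokens/turn x $/Mtok."""
    return avg_turns * tokens_per_turn * price_per_mtok / 1_000_000

def cost_per_pass(cost_per_eval, pass_rate):
    """Convert a per-eval estimate into cost per verified fix."""
    return cost_per_eval / pass_rate

# A $1.96/eval estimate at a ~40% pass rate works out to roughly $4.85 per pass.
print(round(cost_per_pass(1.96, 0.404), 2))  # -> 4.85
```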
Are cost/pass efficiency rankings stable if infra failures excluded?
Verdict
UNSTABLE — ranking order changes
Excluding 377 infra failures from denominators changes the cost/pass ranking. Agents with high infra failure rates (claude-claude-opus-4-5) benefit most from exclusion. Top-3 order: claude-claude-opus-4-5, claude-claude-opus-4-6, cursor-composer-1.5.
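A sketch of the denominator change, assuming per-agent aggregates with illustrative field names (not the benchmark's actual schema):

```python
def cost_per_pass(total_cost, passes):
    return total_cost / passes if passes else float("inf")

def rank_by_cost_per_pass(agents, exclude_infra=False):
    rows = []
    for a in agents:
        cost = a["total_cost"]
        if exclude_infra:
            # Drop spend attributed to runs that failed for infrastructure reasons.
            cost -= a["infra_failures"] * a["cost_per_eval"]
        rows.append((cost_per_pass(cost, a["passes"]), a["name"]))
    return sorted(rows)  # agents with many infra failures move up when excluded
```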
Optimal sequential dispatch strategy for 3-agent ensemble?
Verdict
Best 3-agent waterfall: claude-claude-opus-4-5 → claude-claude-opus-4-6 → gemini-gemini-3-pro-preview
Simulated all 504 possible 3-agent dispatch orderings across 136 samples. Best strategy: claude-claude-opus-4-5 → claude-claude-opus-4-6 → gemini-gemini-3-pro-preview achieves 66.9% pass rate at $4.22/pass ($2.83/sample avg). Cheapest-3 agents (claude-claude-opus-4-5, claude-claude-opus-4-6, cursor-...
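A minimal sketch of the simulation, assuming `results[agent][sample]` holds a `(passed, cost)` pair for each evaluation; 9 candidate agents give 9 × 8 × 7 = 504 ordered triples.

```python
from itertools import permutations

def simulate_waterfall(order, results, samples):
    passes, total_cost = 0, 0.0
    for s in samples:
        for agent in order:
            passed, cost = results[agent][s]
            total_cost += cost
            if passed:
                passes += 1
                break  # stop escalating once a verified fix lands
    return passes, total_cost

def best_triple(agents, results, samples):
    def cost_per_pass(order):
        passes, cost = simulate_waterfall(order, results, samples)
        return cost / passes if passes else float("inf")
    return min(permutations(agents, 3), key=cost_per_pass)
```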
Marginal value of 4th through 9th agent in ensemble?
Verdict
Top-3 cover 66.9%, agents 4-9 add only 7.4pp more
Adding agents in cost-efficiency order: first 3 agents cover 66.9% of solvable samples. Agents 4-9 collectively add 7.4pp coverage at rapidly increasing cost. The marginal value of each additional agent decreases sharply after the 3rd.
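The coverage curve behind this finding is a running union of per-agent solve sets; the data structure here is an assumption, not the benchmark's schema.

```python
def coverage_curve(agents_in_efficiency_order, solved_samples, n_samples):
    """solved_samples maps each agent to the set of sample IDs it patched."""
    covered, curve = set(), []
    for agent in agents_in_efficiency_order:
        covered |= solved_samples[agent]
        curve.append(len(covered) / n_samples)
    return curve  # per the finding, the third entry is ~0.669 and later gains are small
```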
Break-even ROI for early stopping at turn 5-8 at 1000 CVEs/year?
Verdict
Turn-8 cutoff saves ~$4,658/run, kills 125 passes
Analyzed 465 sessions. Median turns for passes: 5; for fails: 9. Early stopping at turn 8 saves 4,607 turns (~$4,658 in compute) but kills 125 successful patches (a 45.1% false-positive rate). At 1,000 CVEs/year: ~$34,248 in annual savings.
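A sketch of the cutoff trade-off, assuming session records with `turns` and `passed` fields. The report's figures imply roughly $1 of compute per turn, used below only as an illustrative default.

```python
def cutoff_tradeoff(sessions, cutoff, cost_per_turn=1.0):
    """Turns saved, dollars saved, and successful patches lost at a given turn cutoff."""
    turns_saved = sum(max(0, s["turns"] - cutoff) for s in sessions)
    passes_killed = sum(1 for s in sessions if s["passed"] and s["turns"] > cutoff)
    return turns_saved, turns_saved * cost_per_turn, passes_killed

# e.g. cutoff_tradeoff(sessions, cutoff=8) -> (turns saved, $ saved, passes lost)
```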
At what infra failure rate does retry cost dominate agent selection?
Verdict
At the current 21.8% rate, retries add ~$1,885 waste/run
Current infra failure rate is 21.8% (377/1727). For expensive agents (OpenCode-Claude at $22.12/eval), each infra failure wastes $22.12. At >15% infra rate, retry cost for expensive agents exceeds the savings from not using a cheaper agent. For cheap agents (Claude-4.5 at $1.13/eval), infra rate wou...
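A back-of-the-envelope model for the retry waste, assuming each infra failure burns one full eval's cost before the retry:

```python
def retry_waste_per_eval(infra_rate, cost_per_eval):
    """Expected dollars wasted per attempted eval due to infra failures."""
    return infra_rate * cost_per_eval

# At the current 21.8% rate: ~$4.82 wasted per eval for a $22.12/eval agent,
# versus ~$0.25 for a $1.13/eval agent.
print(round(retry_waste_per_eval(0.218, 22.12), 2), round(retry_waste_per_eval(0.218, 1.13), 2))
```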
Build Pareto frontier for production agent selection — which agents are dominated?
Verdict
4 Pareto-optimal agents, 9 dominated
Pareto-optimal agents: cursor-opus-4.6, codex-gpt-5.2, claude-claude-opus-4-6, claude-claude-opus-4-5. Dominated agents (should never be deployed): cursor-gpt-5.3-codex, opencode-gpt-5.2, codex-gpt-5.2-codex, opencode-claude-opus-4-6, cursor-gpt-5.2, cursor-composer-1.5, gemini-gemini-3-pro-preview,...
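The dominance check itself is mechanical. A sketch, assuming each agent is summarized as a `(name, cost_per_pass, pass_rate)` tuple:

```python
def pareto_frontier(agents):
    """Keep agents for which no other agent is both no more expensive and no less accurate."""
    frontier = []
    for name, cost, rate in agents:
        dominated = any(
            c <= cost and r >= rate and (c < cost or r > rate)
            for n, c, r in agents
            if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier
```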
Design sequential waterfall dispatch protocol and model expected cost per fix
Verdict
Waterfall achieves 101/136 (74.3%) at $31.85/pass
Full 9-agent waterfall (cheapest first): 101/136 passes (74.3%) at $31.85/pass ($23.65/sample avg). Average 2.0 agents tried per resolved sample. The cheapest agent resolves most samples, with diminishing returns from escalation.
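The expected per-sample cost of such a waterfall can be modeled stage by stage. A sketch, assuming stage outcomes are independent (real per-sample difficulty correlations will violate this):

```python
def expected_cost_per_sample(stages):
    """stages: list of (cost_per_eval, pass_rate) in dispatch order, cheapest first."""
    expected, p_reach = 0.0, 1.0
    for cost, pass_rate in stages:
        expected += p_reach * cost     # pay this stage only if all earlier stages failed
        p_reach *= (1.0 - pass_rate)   # probability the sample escalates past this stage
    return expected
```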
Break-even: automated patching vs manual developer fixes ($150/hr)?
Verdict
ALL agents are cost-effective vs manual ($600/fix). Best: 227.4x cheaper.
At $150/hr × 4hr = $600/manual fix, every agent is cheaper: from $2.64/fix (claude-claude-opus-4-5, 227.4x cheaper) to $76.54/fix (cursor-opus-4.6, 7.8x cheaper). Break-even: 10 CVEs/year justifies infrastructure for the cheapest agent.
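The break-even arithmetic is a one-liner; the $6,000 infrastructure figure below is hypothetical, used only to show how a 10 CVEs/year break-even can arise.

```python
def breakeven_cves_per_year(annual_infra_cost, manual_cost_per_fix=600.0, agent_cost_per_fix=2.64):
    """CVEs per year needed before per-fix savings cover the fixed infrastructure cost."""
    return annual_infra_cost / (manual_cost_per_fix - agent_cost_per_fix)

# A hypothetical $6,000/yr of infrastructure amortizes after ~10 CVEs.
print(round(breakeven_cves_per_year(6_000), 1))  # -> 10.0
```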
FAQ
What does AI vulnerability patching cost?
$2.64 to $52 per verified fix, depending on agent and model. Up to 227x cheaper than manual fixing at $150/hr.
How many agents do I need?
Three agents cover 66.9% of bugs. Agents 4 through 9 add only 7.4 percentage points more. Diminishing returns hit fast.
Patch verification
XOR writes a verifier for each vulnerability, then tests agent-generated patches against it. If the fix passes, it ships. If not, the failure feeds back into the agent harness.
Automated vulnerability patching
AI agents generate fixes for known CVEs. XOR verifies each fix and feeds outcomes back into the agent harness so future patches improve.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,664 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
13 agent-model configurations evaluated on real CVEs. Compare Claude Code, Codex, Gemini CLI, Cursor, and OpenCode.
Benchmark Methodology
How CVE-Agent-Bench evaluates 13 coding agents on 128 real vulnerabilities. Deterministic, reproducible, open methodology.
Agent Environment Security
AI agents run with real permissions. XOR verifies tool configurations, sandbox boundaries, and credential exposure.
Security Economics for Agentic Patching
ROI models for agentic patching, backed by verified pass/fail data and business-impact triage.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.
Bug Complexity
128 vulnerabilities scored by difficulty. Floor = every agent fixes it. Ceiling = no agent can.
Agent Strategies
How different agents approach the same bug. Strategy matters as much as model capability.
Execution Metrics
Per-agent session data: turns, tool calls, tokens, and timing. See what happens inside an agent run.
Pricing Transparency
Every cost number has a source. Published pricing models, measurement methods, and provider rates.
Automated Vulnerability Patching and PR Review
Automated code review, fix generation, GitHub Actions hardening, safety checks, and learning feedback. One-click install on any GitHub repository.
Continuous Learning from Verified Agent Runs
A signed record of every agent run. See what the agent did, verify it independently, and feed the data back so agents improve.
Signed Compliance Evidence for AI Agents
A tamper-proof record of every AI agent action. Produces evidence for SOC 2, EU AI Act, PCI DSS, and more. Built on open standards so auditors verify independently.
Compliance Evidence and Standards Alignment
How XOR signed audit trails produce evidence for SOC 2, EU AI Act, PCI DSS, NIST, and other compliance frameworks.
See which agents produce fixes that work
128 CVEs. 13 agents. 1,664 evaluations. Agents learn from every run.