Cost Analysis
9 findings on what AI vulnerability patching costs and whether it is worth buying. 1,664 evaluations analyzed.
Should you buy
Break-even against manual fixing ($150/hr) happens at 10 CVEs per year for the cheapest agent. Most teams patch more than that.
Which agent to buy
Four agents sit on the Pareto frontier: the best trade-offs between cost and pass rate. Nine agents are dominated; for each, another agent is both cheaper and more accurate.
What AI vulnerability patching costs and whether it is worth it
9 findings from the benchmark economics analysis: 6 backed by measured token data, 1 relying on heuristic cost estimates with lower confidence.
[KEY INSIGHT]
3-agent waterfall = 66.9% at $4.22/pass
The top 3 agents cover 66.9% of bugs. Agents 4 through 9 add only 7.4 percentage points more. Diminishing returns hit fast.
Confidence distribution
Each finding rated by evidence quality. High = measured token data from API logs. Medium = partial measurements. Low = heuristic estimates only.
All 9 findings
Is the $4.85/fix Gemini cost real or an artifact of turn_heuristic?
Verdict
UNCERTAIN — heuristic-only, no measured token data
The Gemini cost of $4.85/pass uses turn_heuristic (0 sessions with token data), which assumes $1.96/eval based on agent-average turn counts. Only Claude agents have measured token data (114+122 sessions). The heuristic could over- or under-estimate by 2-5x depending on actual Gemini token usage patterns (…).
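To make the uncertainty concrete, here is a minimal sketch of how a turn-based heuristic like turn_heuristic typically prices an eval. The tokens-per-turn and price constants are assumptions, not the benchmark's actual inputs, and the ~40% pass rate below is simply implied by the two published figures.

```python
def heuristic_cost_per_eval(avg_turns, tokens_per_turn, price_per_mtok):
    """Estimate per-eval cost when no token logs exist: turns x tokens/turn x $/Mtok."""
    return avg_turns * tokens_per_turn * price_per_mtok / 1_000_000

def cost_per_pass(cost_per_eval, pass_rate):
    """Convert a per-eval estimate into cost per verified fix."""
    return cost_per_eval / pass_rate

# A $1.96/eval estimate at a ~40% pass rate works out to roughly $4.85 per pass.
print(round(cost_per_pass(1.96, 0.404), 2))  # -> 4.85
```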
Are cost/pass efficiency rankings stable if infra failures excluded?
Verdict
UNSTABLE — ranking order changes
Excluding 377 infra failures from denominators changes the cost/pass ranking. Agents with high infra failure rates (claude-claude-opus-4-5) benefit most from exclusion. Top-3 order: claude-claude-opus-4-5, claude-claude-opus-4-6, cursor-composer-1.5.
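A sketch of the denominator change, assuming per-agent aggregates with illustrative field names (not the benchmark's actual schema):

```python
def cost_per_pass(total_cost, passes):
    return total_cost / passes if passes else float("inf")

def rank_by_cost_per_pass(agents, exclude_infra=False):
    rows = []
    for a in agents:
        cost = a["total_cost"]
        if exclude_infra:
            # Drop spend attributed to runs that failed for infrastructure reasons.
            cost -= a["infra_failures"] * a["cost_per_eval"]
        rows.append((cost_per_pass(cost, a["passes"]), a["name"]))
    return sorted(rows)  # agents with many infra failures move up when excluded
```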
Optimal sequential dispatch strategy for 3-agent ensemble?
Verdict
Best 3-agent waterfall: claude-claude-opus-4-5 → claude-claude-opus-4-6 → gemini-gemini-3-pro-preview
Simulated all 504 possible 3-agent dispatch orderings across 136 samples. Best strategy: claude-claude-opus-4-5 → claude-claude-opus-4-6 → gemini-gemini-3-pro-preview achieves 66.9% pass rate at $4.22/pass ($2.83/sample avg). Cheapest-3 agents (claude-claude-opus-4-5, claude-claude-opus-4-6, cursor-...
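A minimal sketch of the simulation, assuming `results[agent][sample]` holds a `(passed, cost)` pair for each evaluation; 9 candidate agents give 9 × 8 × 7 = 504 ordered triples.

```python
from itertools import permutations

def simulate_waterfall(order, results, samples):
    passes, total_cost = 0, 0.0
    for s in samples:
        for agent in order:
            passed, cost = results[agent][s]
            total_cost += cost
            if passed:
                passes += 1
                break  # stop escalating once a verified fix lands
    return passes, total_cost

def best_triple(agents, results, samples):
    def cost_per_pass(order):
        passes, cost = simulate_waterfall(order, results, samples)
        return cost / passes if passes else float("inf")
    return min(permutations(agents, 3), key=cost_per_pass)
```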
Marginal value of 4th through 9th agent in ensemble?
Verdict
Top-3 cover 66.9%, agents 4-9 add only 7.4pp more
Adding agents in cost-efficiency order: first 3 agents cover 66.9% of solvable samples. Agents 4-9 collectively add 7.4pp coverage at rapidly increasing cost. The marginal value of each additional agent decreases sharply after the 3rd.
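The coverage curve behind this finding is a running union of per-agent solve sets; the data structure here is an assumption, not the benchmark's schema.

```python
def coverage_curve(agents_in_efficiency_order, solved_samples, n_samples):
    """solved_samples maps each agent to the set of sample IDs it patched."""
    covered, curve = set(), []
    for agent in agents_in_efficiency_order:
        covered |= solved_samples[agent]
        curve.append(len(covered) / n_samples)
    return curve  # per the finding, the third entry is ~0.669 and later gains are small
```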
Break-even ROI for early stopping at turn 5-8 at 1000 CVEs/year?
Verdict
Turn-8 cutoff saves ~$4,658/run, kills 125 passes
Analyzed 465 sessions. Median turns for passes: 5; for fails: 9. Early stopping at turn 8 saves 4,607 turns (~$4,658 in compute) but kills 125 successful patches (a 45.1% false-positive rate). At 1,000 CVEs/year: ~$34,248 in annual savings.
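A sketch of the cutoff trade-off, assuming session records with `turns` and `passed` fields. The report's figures imply roughly $1 of compute per turn, used below only as an illustrative default.

```python
def cutoff_tradeoff(sessions, cutoff, cost_per_turn=1.0):
    """Turns saved, dollars saved, and successful patches lost at a given turn cutoff."""
    turns_saved = sum(max(0, s["turns"] - cutoff) for s in sessions)
    passes_killed = sum(1 for s in sessions if s["passed"] and s["turns"] > cutoff)
    return turns_saved, turns_saved * cost_per_turn, passes_killed

# e.g. cutoff_tradeoff(sessions, cutoff=8) -> (turns saved, $ saved, passes lost)
```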
At what infra failure rate does retry cost dominate agent selection?
Verdict
At the current 21.8% rate, retries add ~$1,885 waste/run
Current infra failure rate is 21.8% (377/1727). For expensive agents (OpenCode-Claude at $22.12/eval), each infra failure wastes $22.12. At >15% infra rate, retry cost for expensive agents exceeds the savings from not using a cheaper agent. For cheap agents (Claude-4.5 at $1.13/eval), infra rate wou...
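A back-of-the-envelope model for the retry waste, assuming each infra failure burns one full eval's cost before the retry:

```python
def retry_waste_per_eval(infra_rate, cost_per_eval):
    """Expected dollars wasted per attempted eval due to infra failures."""
    return infra_rate * cost_per_eval

# At the current 21.8% rate: ~$4.82 wasted per eval for a $22.12/eval agent,
# versus ~$0.25 for a $1.13/eval agent.
print(round(retry_waste_per_eval(0.218, 22.12), 2), round(retry_waste_per_eval(0.218, 1.13), 2))
```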
Build Pareto frontier for production agent selection — which agents are dominated?
Verdict
4 Pareto-optimal agents, 9 dominated
Pareto-optimal agents: cursor-opus-4.6, codex-gpt-5.2, claude-claude-opus-4-6, claude-claude-opus-4-5. Dominated agents (should never be deployed): cursor-gpt-5.3-codex, opencode-gpt-5.2, codex-gpt-5.2-codex, opencode-claude-opus-4-6, cursor-gpt-5.2, cursor-composer-1.5, gemini-gemini-3-pro-preview,...
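The dominance check itself is mechanical. A sketch, assuming each agent is summarized as a `(name, cost_per_pass, pass_rate)` tuple:

```python
def pareto_frontier(agents):
    """Keep agents for which no other agent is both no more expensive and no less accurate."""
    frontier = []
    for name, cost, rate in agents:
        dominated = any(
            c <= cost and r >= rate and (c < cost or r > rate)
            for n, c, r in agents
            if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier
```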
Design sequential waterfall dispatch protocol and model expected cost per fix
Verdict
Waterfall achieves 101/136 (74.3%) at $31.85/pass
Full 9-agent waterfall (cheapest first): 101/136 passes (74.3%) at $31.85/pass ($23.65/sample avg). Average 2.0 agents tried per resolved sample. The cheapest agent resolves most samples, with diminishing returns from escalation.
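The expected per-sample cost of such a waterfall can be modeled stage by stage. A sketch, assuming stage outcomes are independent (real per-sample difficulty correlations will violate this):

```python
def expected_cost_per_sample(stages):
    """stages: list of (cost_per_eval, pass_rate) in dispatch order, cheapest first."""
    expected, p_reach = 0.0, 1.0
    for cost, pass_rate in stages:
        expected += p_reach * cost     # pay this stage only if all earlier stages failed
        p_reach *= (1.0 - pass_rate)   # probability the sample escalates past this stage
    return expected
```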
Break-even: automated patching vs manual developer fixes ($150/hr)?
Verdict
ALL agents are cost-effective vs manual ($600/fix). Best: 227.4x cheaper.
At $150/hr × 4hr = $600/manual fix, every agent is cheaper: from $2.64/fix (claude-claude-opus-4-5, 227.4x cheaper) to $76.54/fix (cursor-opus-4.6, 7.8x cheaper). Break-even: 10 CVEs/year justifies infrastructure for the cheapest agent.
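The break-even arithmetic is a one-liner; the $6,000 infrastructure figure below is hypothetical, used only to show how a 10 CVEs/year break-even can arise.

```python
def breakeven_cves_per_year(annual_infra_cost, manual_cost_per_fix=600.0, agent_cost_per_fix=2.64):
    """CVEs per year needed before per-fix savings cover the fixed infrastructure cost."""
    return annual_infra_cost / (manual_cost_per_fix - agent_cost_per_fix)

# A hypothetical $6,000/yr of infrastructure amortizes after ~10 CVEs.
print(round(breakeven_cves_per_year(6_000), 1))  # -> 10.0
```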
FAQ
What does AI vulnerability patching cost?
$2.64 to $52 per verified fix, depending on agent and model. Up to 227x cheaper than manual fixing at $150/hr.
How many agents do I need?
Three agents cover 66.9% of bugs. Agents 4 through 9 add only 7.4 percentage points more. Diminishing returns hit fast.
Patch verification
XOR writes a verifier for each vulnerability, then tests agent-generated patches against it. If the fix passes, it ships. If not, the failure feeds back into the agent harness.
Automated vulnerability patching
AI agents generate fixes for known CVEs. XOR verifies each fix and feeds outcomes back into the agent harness so future patches improve.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,664 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
13 agent-model configurations evaluated on real CVEs. Compare Claude Code, Codex, Gemini CLI, Cursor, and OpenCode.
Benchmark Methodology
How CVE-Agent-Bench evaluates 13 coding agents on 128 real vulnerabilities. Deterministic, reproducible, open methodology.
Agent Environment Security
AI agents run with real permissions. XOR verifies tool configurations, sandbox boundaries, and credential exposure.
Security Economics for Agentic Patching
ROI models for agentic patching, backed by verified pass/fail data and business-impact triage.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.
Bug Complexity
128 vulnerabilities scored by difficulty. Floor = every agent fixes it. Ceiling = no agent can.
Agent Strategies
How different agents approach the same bug. Strategy matters as much as model capability.
Execution Metrics
Per-agent session data: turns, tool calls, tokens, and timing. See what happens inside an agent run.
Pricing Transparency
Every cost number has a source. Published pricing models, measurement methods, and provider rates.
Automated Vulnerability Patching and PR Review
Automated code review, fix generation, GitHub Actions hardening, safety checks, and learning feedback. One-click install on any GitHub repository.
Continuous Learning from Verified Agent Runs
A signed record of every agent run. See what the agent did, verify it independently, and feed the data back so agents improve.
Signed Compliance Evidence for AI Agents
A tamper-proof record of every AI agent action. Produces evidence for SOC 2, EU AI Act, PCI DSS, and more. Built on open standards so auditors verify independently.
Compliance Evidence and Standards Alignment
How XOR signed audit trails produce evidence for SOC 2, EU AI Act, PCI DSS, NIST, and other compliance frameworks.
See which agents produce fixes that work
128 CVEs. 13 agents. 1,664 evaluations. Agents learn from every run.