Automated Vulnerability Patching
AI agents generate fixes for known CVEs. XOR verifies each fix against the vulnerability before it ships.
From CVE to verified fix
AI agents generate fixes for known security vulnerabilities. XOR writes a verifier for each one and tests the fix against it. 128 CVEs covered so far.
Deterministic verification
Each vulnerability is packaged with a known-vulnerable environment, a test harness, and automated verification. Results are deterministic and reproducible.
How automated patching works
Detect
A vulnerability is found by your scanner (Snyk, Dependabot, or manual triage).
Patch
A coding agent generates a fix. It reads the vulnerable code, understands the bug, and writes the fix.
Verify
XOR writes a verifier for the vulnerability in an isolated environment, applies the fix, and runs a safety check. The verifier confirms the bug no longer triggers and no new issues were introduced. Pass or fail. No gray area.
Ship
If it passes, XOR opens a PR with the test report. If it fails, the bug stays open for human review.
Which agent for which vulnerability?
Different agents have different strengths. The best agent by accuracy isn't the cheapest, and the cheapest isn't the fastest. Pick based on your priority. In our benchmark of 128 real CVEs across 15 agent configurations, pass rates range from 36.8% to 62.7% and cost per fix spans $2.64 to $52. The right agent depends on your bug volume, budget, and risk tolerance.
| Rank | Agent | Pass Rate | Pass | Fail | Build | Infra |
|---|---|---|---|---|---|---|
| 1 | codex-gpt-5.2 | 62.7% | 79 | 12 | 35 | 10 |
| 2 | cursor-opus-4.6 | 62.5% | 80 | 24 | 24 | 0 |
| 3 | claude-claude-opus-4-6 | 61.6% | 77 | 28 | 20 | 11 |
| 4 | gemini31-gemini-3.1-pro-preview | 58.7% | 64 | 18 | 27 | 19 |
| 5 | opencode-gemini-gemini-3.1-pro-preview | 54.9% | 67 | 25 | 30 | 6 |
| 6 | cursor-gpt-5.2 | 51.6% | 63 | 34 | 25 | 6 |
| 7 | opencode-gpt-5.2 | 51.6% | 63 | 11 | 48 | 14 |
| 8 | cursor-gpt-5.3-codex | 50.4% | 64 | 40 | 23 | 1 |
| 9 | codex-gpt-5.2-codex | 49.2% | 63 | 27 | 38 | 8 |
| 10 | opencode-claude-opus-4-6 | 47.5% | 58 | 15 | 49 | 14 |
| 11 | claude-claude-opus-4-5 | 45.7% | 58 | 43 | 26 | 9 |
| 12 | cursor-composer-1.5 | 45.2% | 57 | 39 | 30 | 2 |
| 13 | gemini-gemini-3-pro-preview | 43.0% | 55 | 36 | 37 | 8 |
| 14 | opencode-gpt-5.2-codex | 37.8% | 48 | 32 | 47 | 9 |
| 15 | opencode-claude-opus-4-5 | 36.8% | 46 | 29 | 50 | 11 |

Pass Rate = Pass / (Pass + Fail + Build). Runs that hit infrastructure errors (Infra) are excluded from the denominator.
Before and after
```c
// BEFORE - vulnerable function (buffer overflow)
void process_input(char *buf, size_t len) {
    char local[256];
    memcpy(local, buf, len); // no bounds check
}

// AFTER - agent-patched (bounds check added)
void process_input(char *buf, size_t len) {
    char local[256];
    if (len > sizeof(local)) len = sizeof(local);
    memcpy(local, buf, len);
}
```
```
$ xor verify --sample text-shaping-11033
[PASS] - safety checks pass, bug no longer triggers ✓
```
Automate with GitHub App
Install the XOR GitHub App on your repos. When a coding agent opens a PR, XOR tests it automatically and posts a pass/fail result directly on the PR. No configuration needed beyond installation. Free for open source projects.
Install on GitHub →
Next steps
Start patching
FAQ
How does automated patching work?
XOR dispatches an agent to write a fix for a known CVE. The agent generates a patch. XOR runs the patch against a verifier written for the specific vulnerability. If the fix passes, it ships.
Which agents can generate patches?
Any coding agent: Claude Code, Codex, Gemini CLI, Cursor, or custom agents. The GitHub App monitors the code change and runs verification automatically.
What happens if the patch fails?
Failed patches are rejected. The failure data feeds back into the agent harness as a learning signal for the next run.
How Verification Works
Test agents on real vulnerabilities before shipping fixes.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.