[EVALUATION]

Patch verification

XOR writes a verifier for each vulnerability, then tests agent-generated patches against it. If the fix passes, it ships. If not, the failure feeds back into the agent harness.

[PATCH + VERIFY]

Every fix tested against the vulnerability

Outcome: Confirm the agent's patch resolves the CVE before it ships.

Mechanism: XOR writes a verifier for the specific CVE, applies the agent's patch in an isolated environment, and re-runs the verifier. Pass or fail.

Proof: 1,224 evaluations with pass/fail evidence. Infrastructure failures are included in that total.

Verify the fix, not just the code

Finding bugs is solved. Confirming the fix works is not. XOR writes a verifier for each CVE, applies the agent's patch, and re-runs the verifier. Only confirmed fixes ship. Failures become learning signal for the next run.

How it works

  1. Agent tools and permissions are checked before the run
  2. Patch is applied in an isolated vulnerable environment
  3. The verifier is re-run to confirm the CVE is resolved
  4. A pass/fail report is attached to the PR
  5. Results feed back into the agent harness for continuous learning

  • 498 fixes passed
  • 563 fixes failed
  • 116 build failures
  • 47 infrastructure failures
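
As a quick arithmetic check, the four outcome counts above add up to the 1,224 evaluations cited on this page, which also matches 136 CVEs times 9 agent configurations. A minimal Python sketch (the dictionary keys are illustrative names, not XOR identifiers):

```python
# Outcome counts as reported on this page; key names are illustrative.
counts = {
    "passed": 498,
    "failed": 563,
    "build_failure": 116,
    "infra_failure": 47,
}

total = sum(counts.values())
print(total)  # 1224

# 136 CVEs evaluated with 9 agent configurations gives the same total.
print(136 * 9)  # 1224
```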

How XOR checks every agent PR

The XOR GitHub App writes a verifier for each bug, applies the agent's fix, and runs safety checks on every PR. Each run produces a report with a pass or fail verdict.

No report, no merge.

How verification works

When a coding agent generates a fix for a bug, three things happen:

  1. The fix is applied to an isolated container running the vulnerable code. The container is an exact reproduction of the original environment — same compiler flags, same dependencies, same OS.

  2. A verifier runs against the fixed code. This is the test that checks whether the vulnerability still triggers. If it no longer triggers, the fix works.

  3. Safety checks run automatically. Memory safety tools detect buffer overflows, use-after-free, and other issues the fix may have introduced or missed.

If the bug no longer triggers AND safety checks pass, the fix is verified. Everything else is a failure.
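
The rule above is a plain conjunction, which can be sketched in a few lines of Python. This is an illustration only, not XOR's implementation: the `RunResult` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one verification run (field names are assumptions)."""
    built: bool          # the patched code compiled in the reproduced environment
    bug_triggers: bool   # the verifier still triggers the vulnerability
    safety_issues: int   # number of issues reported by the safety checks


def is_verified(result: RunResult) -> bool:
    """A fix is verified only if it builds, the bug is gone, and safety checks are clean."""
    return result.built and not result.bug_triggers and result.safety_issues == 0


print(is_verified(RunResult(built=True, bug_triggers=False, safety_issues=0)))  # True
print(is_verified(RunResult(built=True, bug_triggers=False, safety_issues=1)))  # False
```

Anything that is not verified under this check is treated as a failure, matching the "everything else is a failure" rule.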

What counts as verified

  • Bug no longer triggers after the fix
  • No new memory safety issues introduced
  • Build passes in the reproduced environment

What fails

  • Bug still triggers after the fix
  • Safety checks find new issues
  • Build or infrastructure failures during testing

What pass and fail look like

PASS — harfbuzz/harfbuzz#11033

$ docker run --rm xor-verify harfbuzz-11033

applying patch... 23 lines changed

building with safety checks...

running verifier... clean exit

exit 0 — fix verified ✓

FAIL — libarchive/libarchive#12466

$ docker run --rm xor-verify libarchive-12466

applying patch... 18 lines changed

building with safety checks...

ERROR: memory safety issue detected

exit 1 — bug still present ✗

BUILD FAIL — envoy/envoy#28190

$ docker run --rm xor-verify envoy-28190

applying patch... 41 lines changed

ERROR: compilation failed — missing include

exit 2 — patch does not compile ✗
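
The three transcripts above end with exit codes 0, 1, and 2. A caller could map those codes to outcome labels along these lines. This is a sketch under assumptions: only the three codes shown above come from the examples, and treating every other code as "UNKNOWN" is a choice made here, not documented behavior.

```python
# Exit codes taken from the example transcripts above.
EXIT_CODE_OUTCOMES = {
    0: "PASS",   # fix verified
    1: "FAIL",   # bug still present
    2: "BUILD",  # patch does not compile
}


def outcome_from_exit_code(code: int) -> str:
    """Translate a verifier exit code into an outcome label.

    Codes other than 0, 1, and 2 are not shown in the examples, so they are
    reported as "UNKNOWN" here (an assumption for illustration).
    """
    return EXIT_CODE_OUTCOMES.get(code, "UNKNOWN")


print(outcome_from_exit_code(0))  # PASS
print(outcome_from_exit_code(2))  # BUILD
```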

Four possible outcomes

[PASS]

Bug no longer triggers. Safety checks pass. The fix works.

[FAIL]

Bug still triggers after the fix. The agent's code change didn't resolve the issue.

[BUILD]

Code doesn't compile. Missing files, syntax errors, type mismatches.

[INFRA]

Container timeout, sandbox error, network failure. Excluded from agent scoring.

What gets rejected

  • Fixes with no bug reproduction
  • Runs with missing or unsigned audit logs
  • Build failures or infrastructure errors during testing
  • Agent tools that fail security checks
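
The rejection criteria above can be read as a gate that collects every reason a run is turned away. The Python sketch below illustrates that reading; the `RunEvidence` type and its field names are hypothetical and do not describe XOR's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvidence:
    """Hypothetical evidence collected for one agent run (all names are assumptions)."""
    bug_reproduced: bool
    audit_log_present: bool
    audit_log_signed: bool
    build_ok: bool
    infra_ok: bool
    tools_passed_security_checks: bool


def rejection_reasons(evidence: RunEvidence) -> list[str]:
    """Return every reason the run would be rejected; an empty list means none apply."""
    reasons = []
    if not evidence.bug_reproduced:
        reasons.append("no bug reproduction")
    if not (evidence.audit_log_present and evidence.audit_log_signed):
        reasons.append("missing or unsigned audit log")
    if not (evidence.build_ok and evidence.infra_ok):
        reasons.append("build failure or infrastructure error")
    if not evidence.tools_passed_security_checks:
        reasons.append("agent tools failed security checks")
    return reasons


clean = RunEvidence(True, True, True, True, True, True)
print(rejection_reasons(clean))  # []
```

Returning all reasons rather than stopping at the first makes the resulting report more useful when a run fails for several reasons at once.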

Infrastructure failures are excluded

Infrastructure failures (timeouts, network errors, CI flakes) are logged for debugging but excluded from agent pass-rate calculations. This prevents environment instability from penalizing otherwise functional agents.
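
A minimal sketch of that exclusion, using a small made-up sample rather than the benchmark data. It assumes build failures still count against the agent, since only infrastructure failures are described as excluded; the outcome labels and function name are illustrative.

```python
from collections import Counter


def agent_pass_rate(outcomes):
    """Pass rate over scored runs, with INFRA outcomes left out of the denominator.

    Returns None when no scored runs remain. Counting BUILD outcomes in the
    denominator is an assumption made for this sketch.
    """
    counts = Counter(outcomes)
    scored = sum(n for outcome, n in counts.items() if outcome != "INFRA")
    if scored == 0:
        return None
    return counts["PASS"] / scored


# Hypothetical sample: 2 passes out of 4 scored runs, 1 infrastructure failure ignored.
sample = ["PASS", "FAIL", "INFRA", "PASS", "BUILD"]
print(agent_pass_rate(sample))  # 0.5
```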

CI integration

Verification runs as a GitHub Check. Install the XOR GitHub App. Every PR from a coding agent gets a pass/fail result with a link to the full test report.
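
For readers unfamiliar with GitHub Checks, the sketch below builds the kind of request body that GitHub's "Create a check run" REST endpoint accepts (`name`, `head_sha`, `status`, `conclusion`, `details_url`, `output`). It is not XOR's implementation: the check name, the helper function, and the rule that every non-PASS outcome becomes a `failure` conclusion are assumptions for illustration, and no network call is made.

```python
import json


def check_run_payload(head_sha: str, outcome: str, report_url: str) -> dict:
    """Build a check-run request body for a verification outcome.

    The mapping of outcomes to conclusions is an assumption: PASS becomes
    "success" and everything else becomes "failure".
    """
    conclusion = "success" if outcome == "PASS" else "failure"
    return {
        "name": "xor-verify",  # hypothetical check name
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": conclusion,
        "details_url": report_url,  # link to the full test report
        "output": {
            "title": f"Verification: {outcome}",
            "summary": f"Verifier outcome for this patch: {outcome}.",
        },
    }


# Placeholder SHA and URL, for illustration only.
payload = check_run_payload("0" * 40, "PASS", "https://example.com/report/1")
print(json.dumps(payload, indent=2))
```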

[NEXT STEPS]

See verification results

FAQ

How does agent verification work?

Agents are wrapped in observation harnesses. When an agent writes a fix for a CVE, XOR applies the fix in an isolated environment and re-runs the verifier against the original vulnerability. If the vulnerability no longer triggers and safety checks pass, the fix is verified. Results are logged and attached to the PR.

What if the agent fix causes a regression?

Regressions are caught in the verification harness. The agent can see the regression and try again. Failed runs are a primary learning signal that feeds back into the agent harness.

Which agents are compatible?

Any agent that writes code: Claude Code, Codex, Gemini CLI, Cursor, or custom agents with code generation. No lock-in. The GitHub App monitors the code change and runs verification automatically.

[RELATED TOPICS]

Automated vulnerability patching

AI agents generate fixes for known CVEs. XOR verifies each fix and feeds outcomes back into the agent harness so future patches improve.

Benchmark Results

50.7% pass rate. $4.16 per fix. Real data from 1,224 evaluations.

Agent Cost Economics

Fix vulnerabilities for $4.16–$87 with agents. 100x cheaper than incident response. Real cost data.

Agent Configurations

9 agent-model configurations evaluated on real CVEs. Compare Claude Code, Codex, Gemini CLI, Cursor, and OpenCode.

Benchmark Methodology

How CVE-Agent-Bench evaluates 9 coding agents on 136 real vulnerabilities. Deterministic, reproducible, open methodology.

Agent Environment Security

AI agents run with real permissions. XOR verifies tool configurations, sandbox boundaries, and credential exposure.

Security Economics for Agentic Patching

ROI models backed by verified pass/fail data and business-impact triage.

Automated Vulnerability Patching and PR Review

Automated code review, fix generation, GitHub Actions hardening, safety checks, and learning feedback. One-click install on any GitHub repository.

Continuous Learning from Verified Agent Runs

A signed record of every agent run. See what the agent did, verify it independently, and feed the data back so agents improve.

Signed Compliance Evidence for AI Agents

A tamper-proof record of every AI agent action. Produces evidence for SOC 2, EU AI Act, PCI DSS, and more. Built on open standards so auditors verify independently.

Compliance Evidence and Standards Alignment

How XOR signed audit trails produce evidence for SOC 2, EU AI Act, PCI DSS, NIST, and other compliance frameworks.

See which agents produce fixes that work

136 CVEs. 9 agents. 1,224 evaluations. Agents learn from every run.