How Verification Works
Test agents on real vulnerabilities before shipping fixes.
Every fix tested against the vulnerability
Outcome: Confirm the agent's patch resolves the CVE before it ships.
Mechanism: XOR writes a verifier for the specific CVE, applies the agent's patch in an isolated environment, and re-runs the verifier. Pass or fail.
Proof: 1,920 evaluations with pass/fail evidence. Infrastructure failures count.
Observation Model: Agents Run in Harnesses
Agents don't run unsupervised. Every agent is wrapped in an observation harness that intercepts all tool calls, file edits, and reasoning steps. The harness records what the agent did and cryptographically signs the record.
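The harness idea above can be sketched in a few lines. This is an illustrative toy, not XOR's implementation: the class, method names, and trace fields are all hypothetical, and a real harness would intercept at the process or sandbox boundary rather than via a Python wrapper.

```python
import time

class ObservationHarness:
    """Toy sketch of a harness that records an agent's tool calls.

    Hypothetical names throughout -- the real XOR harness is not public.
    """

    def __init__(self, agent):
        self.agent = agent
        self.trace = []  # append-only record of everything the agent did

    def call_tool(self, name, *args, **kwargs):
        # Intercept the call, run the underlying tool, and record it.
        result = getattr(self.agent, name)(*args, **kwargs)
        self.trace.append({
            "tool": name,
            "args": repr(args),
            "time": time.time(),
        })
        return result
```

Every action flows through `call_tool`, so the trace is complete by construction; signing the finished trace is a separate step.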
Testing Model: Real Bugs, Real Agents
We test on 128 real CVEs from the public CVE database. Every agent runs on the same bugs. No hand-picked examples. No synthetic cases. Real threats, real code.
Proof Model: Cryptographically Signed Traces
Every agent run produces a signed audit trail. The trace shows: what code changed, which tools were called, what reasoning steps were taken, and whether the test passed. You can hand the trace to your compliance team.
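One minimal way to make a trace tamper-evident is a keyed signature over its canonical serialization. The sketch below uses HMAC-SHA256 as a stand-in; XOR's actual signing scheme is not described here, and the field names are hypothetical.

```python
import hashlib
import hmac
import json

def sign_trace(trace: dict, key: bytes) -> dict:
    """Attach an HMAC-SHA256 signature to an agent run trace.

    Illustrative only: canonical JSON with sorted keys, then HMAC.
    """
    payload = json.dumps(trace, sort_keys=True).encode()
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {**trace, "signature": signature}

def verify_trace(signed: dict, key: bytes) -> bool:
    """Recompute the signature over everything but the signature field
    and compare in constant time."""
    claimed = signed.get("signature", "")
    body = {k: v for k, v in signed.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(claimed, expected)
```

Any edit to the trace after signing (a changed verdict, a dropped tool call) invalidates the signature, which is what lets a compliance team trust the record.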
How XOR checks every agent PR
The XOR GitHub App writes a verifier for each bug, applies the agent's fix, and runs safety checks on every PR. Each run produces a report with a pass or fail verdict.
No report, no merge.
How verification works
When a coding agent generates a fix for a bug, three things happen:
1. The fix is applied to an isolated container running the vulnerable code. The container is an exact reproduction of the original environment: same compiler flags, same dependencies, same OS.
2. A verifier runs against the fixed code. This is the test that checks whether the vulnerability still triggers. If it no longer triggers, the fix works.
3. Safety checks run automatically. Memory safety tools detect buffer overflows, use-after-free, and other issues the fix may have introduced or missed.
If the bug no longer triggers AND safety checks pass, the fix is verified. Everything else is a failure.
What counts as verified
- Bug no longer triggers after the fix
- No new memory safety issues introduced
- Build passes in the reproduced environment
What fails
- Bug still triggers after the fix
- Safety checks find new issues
- Build or infrastructure failures during testing
What pass and fail look like
PASS - text-shaping/engine#11033
$ docker run --rm xor-verify sample-11033
applying patch... 23 lines changed
building with safety checks...
running verifier... clean exit
exit 0 - fix verified ✓
FAIL - archive-library/handler#12466
$ docker run --rm xor-verify sample-12466
applying patch... 18 lines changed
building with safety checks...
ERROR: memory safety issue detected
exit 1 - bug still present ✗
BUILD FAIL - envoy/envoy#28190
$ docker run --rm xor-verify envoy-28190
applying patch... 41 lines changed
ERROR: compilation failed - missing include
exit 2 - patch does not compile ✗
Four possible outcomes
[PASS]
Bug no longer triggers. Safety checks pass. The fix works.
[FAIL]
Bug still triggers after the fix. The agent's code change didn't resolve the issue.
[BUILD]
Code doesn't compile. Missing files, syntax errors, type mismatches.
[INFRA]
Container timeout, sandbox error, network failure. Excluded from agent scoring.
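The sample transcripts and the four outcomes suggest a simple exit-code convention. The mapping below is inferred from the examples on this page (exit 0, 1, 2) and is not a documented `xor-verify` contract; the catch-all INFRA bucket is an assumption.

```python
# Hypothetical mapping of verifier exit codes to verdicts,
# inferred from the sample runs shown above.
EXIT_VERDICTS = {
    0: "PASS",   # clean exit, fix verified
    1: "FAIL",   # bug still present or new safety issue detected
    2: "BUILD",  # patch does not compile
}

def verdict_for(exit_code: int) -> str:
    # Anything outside the known codes (timeouts, sandbox errors)
    # is treated as an infrastructure failure.
    return EXIT_VERDICTS.get(exit_code, "INFRA")
```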
What gets rejected
- Fixes with no bug reproduction
- Runs with missing or unsigned audit logs
- Build failures or infrastructure errors during testing
- Agent tools that fail security checks
Infrastructure failures are excluded
Infrastructure failures (timeouts, network errors, CI flakes) are logged for debugging but excluded from agent pass-rate calculations. This prevents environment instability from penalizing otherwise functional agents.
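The exclusion rule amounts to dropping INFRA runs from the denominator before computing the pass rate. A minimal sketch, assuming verdicts arrive as the strings used above:

```python
def pass_rate(results: list[str]) -> float:
    """Agent pass rate with INFRA runs excluded from the denominator.

    Sketch only: `results` is a list of verdict strings
    ("PASS", "FAIL", "BUILD", "INFRA").
    """
    scored = [r for r in results if r != "INFRA"]
    if not scored:
        return 0.0
    return sum(r == "PASS" for r in scored) / len(scored)
```

So two passes out of four runs, one of which was an INFRA failure, score as 2/3 rather than 2/4: the environment flake neither helps nor hurts the agent.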
CI integration
Verification runs as a GitHub Check. Install the XOR GitHub App. Every PR from a coding agent gets a pass/fail result with a link to the full test report.
[NEXT STEPS]
See verification results
FAQ
How does agent verification work?
Agents are wrapped in observation harnesses. When an agent writes a fix for a CVE, XOR runs the fix against the original vulnerability. If the test passes, the fix is verified. Results are logged and attached to the PR.
What if the agent fix causes a regression?
Regressions are caught in the verification harness. The agent can see the regression and try again. Failed runs are primary learning signals that feed back into the agent training pipeline.
Which agents are compatible?
Any agent that writes code: Claude Code, Codex, Gemini CLI, Cursor, or custom agents with code generation. No lock-in. The GitHub App monitors the code change and runs verification automatically.
Automated Vulnerability Patching
AI agents generate fixes for known CVEs. XOR verifies each fix against the vulnerability before it ships.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.