[EVALUATION]

How Verification Works

Test agents on real vulnerabilities before shipping fixes.

[PATCH + VERIFY]

Every fix tested against the vulnerability

Outcome: Confirm the agent's patch resolves the CVE before it ships.

Mechanism: XOR writes a verifier for the specific CVE, applies the agent's patch in an isolated environment, and re-runs the verifier. Pass or fail.

Proof: 1,920 evaluations with pass/fail evidence. Infrastructure failures are included in that total.

Observation Model: Agents Run in Harnesses

Agents don't run free. Every agent is wrapped in an observation harness that intercepts all tool calls, file edits, and reasoning steps. The harness records what the agent did and cryptographically signs the record.
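To make the idea of interception concrete, here is a minimal sketch of a harness that wraps a tool and records each call. The class name `ObservationHarness`, the `wrap` method, and the `read_file` tool are all hypothetical; the page does not describe XOR's actual harness API.

```python
import time
from typing import Any, Callable


class ObservationHarness:
    """Hypothetical harness: records every call made through a wrapped tool."""

    def __init__(self) -> None:
        self.events: list[dict[str, Any]] = []

    def wrap(self, name: str, tool: Callable[..., Any]) -> Callable[..., Any]:
        def observed(*args: Any, **kwargs: Any) -> Any:
            event: dict[str, Any] = {
                "tool": name,
                "args": args,
                "kwargs": kwargs,
                "ts": time.time(),
            }
            try:
                result = tool(*args, **kwargs)
                event["ok"] = True
                return result
            except Exception as exc:
                # Failed calls are recorded too, then re-raised unchanged.
                event["ok"] = False
                event["error"] = repr(exc)
                raise
            finally:
                self.events.append(event)

        return observed


harness = ObservationHarness()
# Stand-in tool for illustration only; a real agent tool would touch the repo.
read_file = harness.wrap("read_file", lambda path: f"contents of {path}")
read_file("src/parser.c")
```

After the call, `harness.events` holds one entry naming the tool, its arguments, and whether it succeeded. The agent only ever sees the wrapped function, so it cannot act without leaving a record.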

Testing Model: Real Bugs, Real Agents

We test on 128 real CVEs from the public database. Every agent runs on the same bugs. No hand-picked examples. No synthetic cases. Real threats, real code.

Proof Model: Cryptographically Signed Traces

Every agent run produces a signed audit trail. The trace shows: what code changed, which tools were called, what reasoning steps were taken, and whether the test passed. You can hand the trace to your compliance team.
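The page does not say which signature scheme XOR uses. As an illustration of what a tamper-evident trace means, the sketch below uses HMAC-SHA256 from the Python standard library as a stand-in. A keyed MAC is not a public-key signature, and the trace fields, key, and function names are assumptions made for the example.

```python
import hashlib
import hmac
import json


def sign_trace(trace: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding of the trace."""
    # sort_keys and fixed separators make the byte encoding deterministic.
    payload = json.dumps(trace, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_trace(trace: dict, signature: str, key: bytes) -> bool:
    """Check the tag with a constant-time comparison."""
    return hmac.compare_digest(sign_trace(trace, key), signature)


# Hypothetical trace contents, mirroring the four items the page lists.
trace = {
    "files_changed": ["src/shape.c"],
    "tools_called": ["read_file", "apply_patch"],
    "reasoning_steps": 4,
    "verdict": "PASS",
}
key = b"demo-key-not-for-production"
signature = sign_trace(trace, key)
```

Changing any field of `trace` after signing, for example flipping the verdict, makes `verify_trace` return `False`. That property is what lets a compliance reviewer trust the record.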

942 fixes passed
413 fixes failed
509 build failures
56 infrastructure failures

How XOR checks every agent PR

The XOR GitHub App writes a verifier for each bug, applies the agent's fix, and runs safety checks on every PR. Each run produces a report with a pass or fail verdict.

No report, no merge.

How verification works

When a coding agent generates a fix for a bug, three things happen:

  1. The fix is applied to an isolated container running the vulnerable code. The container is an exact reproduction of the original environment: same compiler flags, same dependencies, same OS.

  2. A verifier runs against the fixed code. This is the test that checks whether the vulnerability still triggers. If it no longer triggers, the fix works.

  3. Safety checks run automatically. Memory safety tools detect buffer overflows, use-after-free, and other issues the fix may have introduced or missed.

If the bug no longer triggers AND safety checks pass, the fix is verified. Everything else is recorded as a failure, although infrastructure failures are excluded from agent scoring.
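The three steps and the final rule can be sketched as a short function. The callables passed in are hypothetical placeholders for the real container, verifier, and safety tooling, none of which the page specifies.

```python
from typing import Callable


def verify_fix(
    apply_patch: Callable[[], bool],
    bug_still_triggers: Callable[[], bool],
    safety_checks_pass: Callable[[], bool],
) -> bool:
    """Return True only when the fix is verified under the rule stated above."""
    # Step 1: apply the fix inside the isolated container.
    if not apply_patch():
        return False
    # Step 2: re-run the verifier against the patched code.
    if bug_still_triggers():
        return False
    # Step 3: run the memory safety checks.
    return safety_checks_pass()


# Example: the patch applies, the bug is gone, and safety checks are clean.
verified = verify_fix(lambda: True, lambda: False, lambda: True)
```

The function short-circuits on purpose: once a step fails there is nothing left to verify, so later steps never run.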

What counts as verified

  • Bug no longer triggers after the fix
  • No new memory safety issues introduced
  • Build passes in the reproduced environment

What fails

  • Bug still triggers after the fix
  • Safety checks find new issues
  • Build or infrastructure failures during testing

What pass and fail look like

PASS - text-shaping/engine#11033

$ docker run --rm xor-verify sample-11033

applying patch... 23 lines changed

building with safety checks...

running verifier... clean exit

exit 0 - fix verified ✓

FAIL - archive-library/handler#12466

$ docker run --rm xor-verify sample-12466

applying patch... 18 lines changed

building with safety checks...

ERROR: memory safety issue detected

exit 1 - bug still present ✗

BUILD FAIL - envoy/envoy#28190

$ docker run --rm xor-verify envoy-28190

applying patch... 41 lines changed

ERROR: compilation failed - missing include

exit 2 - patch does not compile ✗
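The three transcripts above end with exit codes 0, 1, and 2. A CI wrapper could translate those codes into labels as in the sketch below. The mapping covers only the codes shown on this page; no exit code for infrastructure failures appears here, so any other value is reported as unrecognized rather than guessed.

```python
# Labels copied from the transcripts above.
EXIT_LABELS = {
    0: "fix verified",
    1: "bug still present",
    2: "patch does not compile",
}


def describe_exit(code: int) -> str:
    """Translate a verifier exit code into the label shown in the transcripts."""
    return EXIT_LABELS.get(code, "unrecognized exit code")


summary = [describe_exit(code) for code in (0, 1, 2)]
```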

Four possible outcomes

[PASS]

Bug no longer triggers. Safety checks pass. The fix works.

[FAIL]

Bug still triggers after the fix. The agent's code change didn't resolve the issue.

[BUILD]

Code doesn't compile. Missing files, syntax errors, type mismatches.

[INFRA]

Container timeout, sandbox error, network failure. Excluded from agent scoring.
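One way to read the four outcomes is as an ordered decision: infrastructure problems first, then the build, then the verifier and safety checks. The sketch below encodes that reading. The field names and the order of the checks are assumptions, not XOR's documented logic.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    BUILD = "BUILD"
    INFRA = "INFRA"


@dataclass
class RunResult:
    infra_error: bool   # container timeout, sandbox error, network failure
    build_ok: bool      # patched code compiled
    bug_triggers: bool  # verifier still reproduces the vulnerability
    safety_ok: bool     # memory safety checks came back clean


def classify(run: RunResult) -> Outcome:
    """Map one evaluation run onto the four outcomes described above."""
    if run.infra_error:
        return Outcome.INFRA
    if not run.build_ok:
        return Outcome.BUILD
    if run.bug_triggers or not run.safety_ok:
        return Outcome.FAIL
    return Outcome.PASS


outcome = classify(RunResult(infra_error=False, build_ok=True, bug_triggers=False, safety_ok=True))
```

Checking infrastructure first keeps environment problems from being blamed on the agent, which matches the rule that INFRA runs are excluded from scoring.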

What gets rejected

  • Fixes with no bug reproduction
  • Runs with missing or unsigned audit logs
  • Build failures or infrastructure errors during testing
  • Agent tools that fail security checks
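The rejection rules above, together with "No report, no merge", amount to a merge gate. A minimal sketch follows, assuming a hypothetical `Submission` record with one field per rule. The field names and verdict strings are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Submission:
    verdict: Optional[str]       # "PASS", "FAIL", "BUILD", "INFRA", or None if no report
    bug_reproduced: bool         # the verifier reproduced the bug before the fix
    audit_log_signed: bool       # a signed audit log is attached
    tools_passed_security: bool  # the agent's tools passed security checks


def rejection_reasons(sub: Submission) -> List[str]:
    """Collect every rule the submission violates; an empty list means mergeable."""
    reasons: List[str] = []
    if sub.verdict is None:
        reasons.append("no report")
    elif sub.verdict in ("BUILD", "INFRA"):
        reasons.append("build or infrastructure failure")
    elif sub.verdict != "PASS":
        reasons.append("verifier did not pass")
    if not sub.bug_reproduced:
        reasons.append("no bug reproduction")
    if not sub.audit_log_signed:
        reasons.append("missing or unsigned audit log")
    if not sub.tools_passed_security:
        reasons.append("agent tools failed security checks")
    return reasons


def may_merge(sub: Submission) -> bool:
    return not rejection_reasons(sub)


clean = Submission("PASS", True, True, True)
unsigned = Submission("PASS", True, False, True)
```

Returning the full list of reasons, rather than stopping at the first, gives the PR author everything to fix in one pass.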

Infrastructure failures are excluded

Infrastructure failures (timeouts, network errors, CI flakes) are logged for debugging but excluded from agent pass-rate calculations. This prevents environment instability from penalizing otherwise functional agents.
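As a worked example, the sketch below applies this exclusion to the aggregate counts reported on this page. The page does not publish its exact formula, so the denominator here is an assumption: it counts passed, failed, and build-failed runs, since build failures are listed under "What fails".

```python
# Aggregate counts as reported on this page.
counts = {"pass": 942, "fail": 413, "build": 509, "infra": 56}


def pass_rate(counts: dict) -> float:
    """Pass rate with infrastructure failures removed from the denominator."""
    scored = sum(n for kind, n in counts.items() if kind != "infra")
    return counts["pass"] / scored if scored else 0.0


total = sum(counts.values())      # every evaluation, infra included
scored = total - counts["infra"]  # runs that count toward agent scoring
rate = pass_rate(counts)
```

With these numbers, 1,864 of the 1,920 evaluations are scored, and the aggregate pass rate is 942 / 1,864, or about 50.5%. Leaving the 56 infrastructure failures in the denominator would lower that figure to about 49.1%.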

CI integration

Verification runs as a GitHub Check. Install the XOR GitHub App. Every PR from a coding agent gets a pass/fail result with a link to the full test report.

FAQ

How does agent verification work?

Agents are wrapped in observation harnesses. When an agent writes a fix for a CVE, XOR applies the fix and re-runs a verifier for the original vulnerability. If the vulnerability no longer triggers and the safety checks pass, the fix is verified. Results are logged and attached to the PR.

What if the agent fix causes a regression?

Regressions are caught in the verification harness. The agent can see the regression and try again. Failed runs are primary learning signals that feed back into the agent training pipeline.

Which agents are compatible?

Any agent that writes code: Claude Code, Codex, Gemini CLI, Cursor, or custom agents with code generation. No lock-in. The GitHub App monitors the code change and runs verification automatically.

[RELATED TOPICS]

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.