New: 15 agents benchmarked on 128 real vulnerabilities. Agents learn from every run.
Training signal you can verify by running the code.
Your coding model is graded by humans, or by another model judging it. XOR gives you agent trajectories on real security tasks, each with a deterministic pass/fail label from real execution. No raters. No judge model. The reward is whether the fix actually works.
128 vulnerability test cases (current, verified)
1,920 verified evaluations (current, verified)
6,138+ target vulnerabilities (target)
250+ codebases (target)
You cannot grade your own homework.
Outcome: Internal benchmarks are marketing, not training signal. XOR runs your agent against thousands of real security tasks and labels every attempt by execution: did the fix hold, or did the vulnerability survive.
Mechanism: Every trajectory ships with the full reasoning trace, tool calls, and diff, plus a deterministic pass/fail verdict from running the code. Formatted for supervised fine-tuning and reinforcement learning. The label is not a human opinion. It is what happened when the patch ran.
Proof: Thousands of verified security tasks. Each attempt graded by execution, not by a rater and not by a judge model.
Reward signal from real execution, not human raters.
How agents learn
Human labels drift. Judge-model evals inherit the judge's blind spots. XOR's label is the verdict of a deterministic verifier that runs the patch and replays the trigger. Pass means the vulnerability is gone. The signal is the same one you would compute yourself if you ran every fix by hand, at a scale you cannot staff.
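As a rough illustration of a label that comes from execution, the sketch below maps a process exit code to pass or fail. Everything in it is hypothetical: the `replay_trigger` helper, the stand-in commands, and the convention that exit code 0 means the trigger no longer fires are assumptions for illustration, not XOR's actual verifier.

```python
import subprocess
import sys


def replay_trigger(command: list[str], timeout_s: float = 60.0) -> str:
    """Run a trigger-replay command and label the attempt by its exit code.

    Assumed convention (hypothetical): exit code 0 means the trigger no
    longer reproduces the vulnerability. Any other exit code, or a timeout,
    is labeled as a failure.
    """
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "fail"
    return "pass" if result.returncode == 0 else "fail"


# Stand-in commands. A real setup would build the patched code and replay
# the original exploit input instead of these one-liners.
patched = [sys.executable, "-c", "import sys; sys.exit(0)"]
unpatched = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(replay_trigger(patched))
print(replay_trigger(unpatched))
```

Because the label depends only on the exit code, no rater or judge model sits between the run and the verdict.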
Your model's real rank, head to head.
Benchmark results
Self-reported pass rates are not a leaderboard. XOR benchmarks coding agents on the same real security tasks under the same verifier, so you see where your model lands against the field on work that matters. No marketing numbers. Run it and read the rank.
Formats and rights, settled up front.
Talk to the team
Trajectories ship as structured records: reasoning trace, tool calls, diff, and the execution verdict, in JSON or CBOR. Licensing is perpetual post-training rights, named in the contract. The trained model is your property. You get the schema and a sample drop before you commit a dollar.
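To make the record shape concrete, here is a minimal sketch of what one trajectory could look like as JSON. The field names (`task_id`, `reasoning_trace`, `tool_calls`, `diff`, `verdict`) and the `ToolCall` layout are guesses based on the description above, not the published schema. The sample drop is the authority on the real format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ToolCall:
    # Hypothetical layout for a single tool invocation.
    name: str
    arguments: dict
    output: str


@dataclass
class Trajectory:
    # Hypothetical field names; the real schema may differ.
    task_id: str
    reasoning_trace: list[str]
    tool_calls: list[ToolCall]
    diff: str
    verdict: str  # "pass" or "fail", as produced by execution


def to_json(trajectory: Trajectory) -> str:
    """Serialize a trajectory, including nested tool calls, to JSON."""
    return json.dumps(asdict(trajectory), sort_keys=True)


def from_json(payload: str) -> Trajectory:
    """Rebuild a trajectory from JSON, restoring nested ToolCall objects."""
    data = json.loads(payload)
    data["tool_calls"] = [ToolCall(**call) for call in data["tool_calls"]]
    return Trajectory(**data)


example = Trajectory(
    task_id="example-task-001",  # made-up identifier
    reasoning_trace=["Read the failing input.", "Bounds check is missing."],
    tool_calls=[ToolCall(name="read_file", arguments={"path": "src/parse.c"}, output="...")],
    diff="--- a/src/parse.c\n+++ b/src/parse.c\n",
    verdict="pass",
)

round_tripped = from_json(to_json(example))
```

A CBOR variant would carry the same fields in a binary encoding. It is left out here because it needs a third-party library.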
"We already buy labeled data."
Human labels and judge-model evals carry the labeler's bias. XOR's label is the exit code of the verifier. It cannot be coached, and it does not drift between annotators.
"How do we know the label is correct?"
Every verdict comes from running the patch against the task and replaying the trigger. Pass means the vulnerability is gone. You can re-run it and get the same answer.
"What are the licensing terms?"
Perpetual post-training rights, stated in the contract. The trained model is your property. We deliver the schema and a sample drop before you commit.
FAQ
Where does the training signal come from?
Agent trajectories on real security tasks, each graded by execution and formatted for supervised fine-tuning and reinforcement learning. No human raters. No judge model. The reward is whether the fix actually works when the code runs.
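One plausible way a data team might turn execution-graded records into training inputs is sketched below: keep only passing attempts as supervised examples, and map the verdict to a binary reward for reinforcement learning. The record keys (`task_prompt`, `diff`, `verdict`) and both helper functions are illustrative assumptions, not XOR's delivered format or pipeline.

```python
def to_sft_examples(records: list[dict]) -> list[dict]:
    """Keep only passing attempts as prompt/completion pairs."""
    return [
        {"prompt": r["task_prompt"], "completion": r["diff"]}
        for r in records
        if r["verdict"] == "pass"
    ]


def to_rl_rewards(records: list[dict]) -> list[tuple[str, str, float]]:
    """Attach a binary reward to every attempt: 1.0 for pass, 0.0 for fail."""
    return [
        (r["task_prompt"], r["diff"], 1.0 if r["verdict"] == "pass" else 0.0)
        for r in records
    ]


# Made-up records for illustration only.
records = [
    {"task_prompt": "Fix the overflow in parse()", "diff": "patch A", "verdict": "pass"},
    {"task_prompt": "Fix the overflow in parse()", "diff": "patch B", "verdict": "fail"},
]

sft_examples = to_sft_examples(records)
rl_rewards = to_rl_rewards(records)
```

Failed attempts are dropped from the supervised set but kept for reinforcement learning, where a zero reward still carries information.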
How is this different from human-labeled data?
Human and judge-model labels encode opinion, and opinions disagree. XOR's label is a deterministic pass/fail from running the patch against the task. The same input always produces the same verdict.
What format do trajectories ship in?
Structured records with the reasoning trace, tool calls, diff, and execution verdict, available as JSON or CBOR. You get the schema and a sample drop before any contract so your data team can wire it into a training pipeline first.
What are the training rights?
Perpetual post-training rights, stated in the contract. The trained model is your lab's property. The terms are settled before delivery, not after.
How agents learn
How execution-verified outcomes become training signal.
Benchmark results
Coding agents ranked on real security tasks under one verifier.
How verification works
Deterministic pass/fail verdicts from running the code.
Benchmark methodology
How agents are tested on real-world security tasks.
See your model's real rank.
Thousands of verified security tasks. Every attempt graded by execution. Talk to the team about a sample drop.
$ xor patch --verify --learn