Agent Run Trajectories
A complete record of every agent run. See what the agent did, verify it independently, and feed the data back.
Every run makes agents smarter
Outcome: Feed verified outcomes back into agents so they improve over time.
Mechanism: XOR records every agent action, signs it, and feeds pass/fail results back into the agent harness. Failed fixes become learning signal. Passing fixes expand the training set.
Proof: IETF Internet-Draft format. Open standard, not proprietary.
Record what agents do
XOR captures every action, tool call, and output from each agent run. You get a complete record of what happened and why.
Feed results back into agents
Every verified outcome feeds back into the agent harness. The system prompt is upgraded, memory from previous runs is injected, and the next vulnerability is triaged by business impact.
How trajectories fit the loop
People steer the run while agents execute. Most interaction happens through prompts. XOR captures each run as a verifiable trajectory, then keeps the loop running until reviews are clean.
What trajectories capture
A trajectory is a timestamped record of every action an agent took during a run. It logs which agent ran, what tools it called, what files it changed, what tests it ran, and what the results were. Each action is cryptographically signed so it cannot be altered. The trajectory is immutable proof of what the agent did.
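The tamper-evidence idea can be sketched in a few lines. This is an illustrative toy, not XOR's format: the real trajectories use a COSE_Sign1 envelope over CBOR per the draft, while this sketch uses JSON plus an HMAC, and the key, field names, and helper functions are all hypothetical.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key"  # illustrative only; real trajectories use COSE_Sign1


def sign_step(step: dict) -> dict:
    """Attach a tamper-evident signature to one trajectory step."""
    payload = json.dumps(step, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {**step, "sig": sig}


def verify_step(signed: dict) -> bool:
    """Recompute the signature; any altered field invalidates it."""
    step = {k: v for k, v in signed.items() if k != "sig"}
    payload = json.dumps(step, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["sig"])


step = sign_step({
    "ts": "2025-01-01T00:00:00Z",
    "agent": "agent-a",
    "tool": "run_tests",
    "result": "pass",
})
assert verify_step(step)                          # untouched record verifies
assert not verify_step({**step, "result": "fail"})  # edited record does not
```

Because each step is signed over its full serialized content, changing any field after the fact breaks verification, which is what makes the record immutable proof rather than just a log.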
Why behavioral clustering matters
Different agents approach the same vulnerability differently. Some agents write minimal fixes. Others add safety checks. Some explore the codebase before fixing. Others fix immediately and test. By analyzing thousands of trajectories, XOR identifies patterns: which agents are methodical vs aggressive, which ones double-check their work, which ones get stuck in loops. This helps teams understand agent behavior and predict which agent will work best for their codebase and risk tolerance.
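A minimal sketch of what "methodical vs aggressive" could mean in terms of trajectory data. The action types, thresholds, and labels here are invented for illustration; XOR's actual clustering features are not published in this section.

```python
from collections import Counter

# Hypothetical trajectories: each is the sequence of action types one agent
# took during a run (explore the codebase, edit files, run tests).
trajectories = {
    "agent-a": ["read", "read", "read", "edit", "test"],
    "agent-b": ["edit", "test", "edit", "test"],
    "agent-c": ["read", "edit", "test", "test"],
}


def behavior_profile(steps):
    """Fraction of steps spent on each action type."""
    counts = Counter(steps)
    total = len(steps)
    return {kind: counts.get(kind, 0) / total for kind in ("read", "edit", "test")}


def label(profile):
    """Crude heuristic: exploration-heavy runs read as methodical."""
    return "methodical" if profile["read"] >= 0.5 else "aggressive"


for agent, steps in trajectories.items():
    print(agent, label(behavior_profile(steps)))
# agent-a methodical
# agent-b aggressive
# agent-c aggressive
```

At scale the same idea applies with richer features (loop detection, re-test frequency, time between edits) and a real clustering algorithm instead of a threshold.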
What the draft requires
- Session Trace and File Attribution records
- Signing Envelope with a COSE_Sign1 wrapper for cryptographic verification
- Conformance requirements: Producer/Verifier/Consumer classes with RFC 2119 terminology
- CDDL schema for trace structure and validation
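For orientation, COSE_Sign1 (RFC 9052) is a four-element CBOR array: protected headers, unprotected headers, payload, and signature. The sketch below shows that shape with plain Python objects; the payload field names are hypothetical, the signature is a placeholder digest rather than a real ES256 signature, and a real producer would CBOR-encode the whole structure.

```python
import hashlib

# Hypothetical serialized trace payload; real payloads follow the CDDL schema.
payload = b'{"session_id": "run-123", "steps": []}'

# COSE_Sign1 shape per RFC 9052: [protected, unprotected, payload, signature]
cose_sign1 = [
    {"alg": "ES256"},                  # protected headers: signing algorithm
    {},                                # unprotected headers: empty here
    payload,                           # payload: the serialized trace
    hashlib.sha256(payload).digest(),  # placeholder bytes, NOT a real signature
]

protected, unprotected, body, signature = cose_sign1
assert body == payload and len(signature) == 32
```

A verifier in the draft's sense would check the signature against the key identified in the protected headers before trusting any field of the trace.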
Trace fields that matter
- Agent identity, tool calls, and outputs per step
- File operations tied to patch evidence
- Reasoning entries (optional, privacy-gated)
- Verification outcomes tied to CVE identifiers
Where trajectories show up in XOR
Trajectories are attached to PR test reports and verification runs, so every fix is traceable and replayable. Teams can replay a trajectory to understand why an agent made a decision or to audit the fix for compliance. Security teams can use trajectory patterns to detect suspicious agent behavior or drift from expected approaches.
FAQ
What is an agent trajectory?
A trajectory is a signed record of every action an agent took during a run: tool calls, file edits, reasoning steps, and the final outcome (pass/fail).
How are trajectories used for learning?
Every trajectory feeds back into the agent harness. Failed runs become learning signal. Passing runs expand the training corpus. Each cycle makes agents smarter.
Can I access raw trajectory data?
Yes. Trajectories are available in JSON and CBOR formats. Export to your analytics pipeline or SIEM.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Native CLIs vs wrapper CLIs: the 10-16pp performance gap
Claude CLI vs OpenCode, Gemini CLI vs OpenCode, Codex vs Cursor. Same models, different wrappers, consistent accuracy gaps of 10-16 percentage points.
Cost vs performance: where agents sit on the Pareto frontier
15 agents plotted on cost-accuracy. 4 on the Pareto frontier. Best value: claude-opus-4-6 at $2.93/pass, 61.6%.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.