DPO Training Pairs
Direct Preference Optimization training data from CVE-Agent-Bench. 1,920 labeled examples with Gold, Silver, and Bronze preference pairs for RLHF post-training on security patching.
Direct Preference Optimization pairs for security patching
Each CVE-Agent-Bench evaluation produces a preference signal: pass fixes the bug, test-fail compiles but does not fix it, and build-fail breaks the build. These outcomes form Direct Preference Optimization (DPO) training data: 1,920 labeled examples across 15 agent configurations provide ground truth for post-training security patching models.
Preference pairs tell a model which outcomes are desirable. A pass outcome paired with a build-fail outcome teaches the model that generating compilable code is better than generating broken code. A pass paired with a test-fail teaches that fixing the bug is better than compiling. These distinctions matter because the strongest training signals come from outcomes that differ only in one dimension.
Three tiers of preference signals
- Gold pairs (strongest signal): pass > build-fail. The winning patch fixes the bug; the losing patch breaks compilation.
- Silver pairs (medium signal): pass > test-fail. Both patches compile; only one fixes the bug.
- Bronze pairs (weakest signal): test-fail > build-fail. Neither patch fixes the bug; only compilability differs.
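The three tier rules can be sketched as a small classifier. This is a minimal sketch, assuming the outcome strings used throughout this page ("pass", "test-fail", "build-fail"); the function name is illustrative, not part of CVE-Agent-Bench.

```python
# Hypothetical helper: classify a (winner, loser) outcome pair into a tier.
def pair_tier(winner, loser):
    if winner == "pass" and loser == "build-fail":
        return "gold"    # strongest: fixes the bug vs. breaks the build
    if winner == "pass" and loser == "test-fail":
        return "silver"  # both compile; only the winner fixes the bug
    if winner == "test-fail" and loser == "build-fail":
        return "bronze"  # neither fixes the bug; only compilability differs
    return None          # not a valid preference pair

print(pair_tier("pass", "build-fail"))  # → gold
```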
Outcome Distribution
Agent-level breakdown of pass, fail, and build outcomes across all evaluations. The distribution shows how frequently each outcome type appears in the dataset. Pass outcomes (fixes that work) are the foundation for preference signals. Build-fail outcomes provide negative examples to teach the model what not to generate. Test-fail outcomes sit in the middle, representing patches that compile but miss the bug entirely.
Understanding outcome distribution helps you assess data quality for training. If most evaluations land on one outcome type, you have unbalanced training data. A healthy distribution spreads across all three outcome types, giving the model diverse learning signals.
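One way to check balance is to compute the share of each outcome, excluding infra failures. A minimal sketch, using hypothetical records with only an outcome field:

```python
from collections import Counter

# Hypothetical evaluation records; only the "outcome" field matters here.
evals = [
    {"outcome": "pass"}, {"outcome": "pass"}, {"outcome": "test-fail"},
    {"outcome": "build-fail"}, {"outcome": "pass"}, {"outcome": "infra"},
]

# Count outcomes, dropping infra rows (no learning signal).
counts = Counter(e["outcome"] for e in evals if e["outcome"] != "infra")
total = sum(counts.values())
shares = {outcome: n / total for outcome, n in counts.items()}
print(shares)  # → {'pass': 0.6, 'test-fail': 0.2, 'build-fail': 0.2}
```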
Preference Pair Distribution
Gold, Silver, and Bronze DPO pair counts from the evaluation dataset. Each pair type has different signal strength for training. Gold pairs (pass vs build-fail) provide the clearest learning signal because they differ in the most impactful dimension: does the patch work at all? Silver pairs (pass vs test-fail) teach the model to distinguish between compilable code and correct code.
The ratio of Gold to Silver to Bronze pairs reflects the underlying data. More Gold pairs mean stronger overall training signal. Bronze pairs are rare because they only occur when you have multiple agents all failing on the same sample in different ways.
Reward Signal Comparison
Ternary (+1/0/-1) vs five-level reward shaping distributions. The ternary signal is simple: pass is good, anything else is not. The five-level signal adds nuance by splitting pass into "surgical fix" and "complex fix", and by distinguishing build-fail from infrastructure errors.
More reward levels let you train on finer-grained distinctions, but they also require more data per reward bucket. The five-level approach is more informative when you have enough evaluations per level; with 1,920 total evaluations spread across 15 agents, the ternary signal may be more stable.
Reward Configuration
- Base Reward: 1.0
- Difficulty Bonus: +0.5
- Teamwork Bonus: +0.25
- Exploration Bonus: +0.1
Bonuses applied when agents solve difficult samples, contribute unique solutions, or explore novel reasoning paths.
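A minimal sketch of how these bonuses might stack on the base reward. The bonus values come from the configuration above; the trigger flags (is_difficult, is_unique_solution, is_novel_path) are illustrative assumptions, not part of the published configuration:

```python
# Values from the reward configuration above.
BASE_REWARD = 1.0
DIFFICULTY_BONUS = 0.5
TEAMWORK_BONUS = 0.25
EXPLORATION_BONUS = 0.1

# Hypothetical trigger flags; how they are computed is an assumption here.
def shaped_reward(is_difficult, is_unique_solution, is_novel_path):
    reward = BASE_REWARD
    if is_difficult:
        reward += DIFFICULTY_BONUS    # solved a difficult sample
    if is_unique_solution:
        reward += TEAMWORK_BONUS      # contributed a unique solution
    if is_novel_path:
        reward += EXPLORATION_BONUS   # explored a novel reasoning path
    return reward

print(shaped_reward(True, True, True))  # → 1.85
```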
Reward scoring for RLHF
Use a ternary or five-level reward function to train your model. The ternary signal applies to all 1,920 evaluations across all agents and CVE samples.
Ternary reward
- +1.0 = pass outcome (vulnerability fixed, all tests pass)
- 0.0 = test-fail outcome (patch compiles but bug persists)
- -1.0 = build-fail outcome (patch breaks compilation)
- excluded = infra outcome (environment failure, no learning signal)
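The ternary mapping above is a straight lookup. A minimal sketch, where "infra" returns None to mark the evaluation as excluded rather than assigning it a reward:

```python
# Ternary reward mapping from the list above. Outcome strings follow this
# dataset's naming; None marks an excluded (infra) evaluation.
def ternary_reward(outcome):
    rewards = {"pass": 1.0, "test-fail": 0.0, "build-fail": -1.0}
    return rewards.get(outcome)

print(ternary_reward("pass"))   # → 1.0
print(ternary_reward("infra"))  # → None
```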
Five-level reward (advanced)
Refine the signal by incorporating patch size and semantic complexity:
- +1.0 = pass + surgical patch (≤10 lines). Best outcome: correct and concise.
- +0.7 = pass + complex patch (>10 lines). Correct but larger than necessary.
- 0.0 = test-fail. Compiles but does not fix.
- -0.5 = build-fail. Breaks compilation.
- -1.0 = infra failure (excluded from training).
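The five-level mapping needs the patch size as well as the outcome. A minimal sketch: the 10-line surgical threshold is taken from the list above, the patch_lines parameter is an assumption of this sketch, and infra returns None here to reflect its exclusion from training rather than carrying the -1.0 value:

```python
# Five-level reward from the list above. patch_lines is a hypothetical
# parameter; how lines are counted is an assumption of this sketch.
def five_level_reward(outcome, patch_lines=0):
    if outcome == "pass":
        return 1.0 if patch_lines <= 10 else 0.7  # surgical vs. complex fix
    if outcome == "test-fail":
        return 0.0   # compiles but does not fix
    if outcome == "build-fail":
        return -0.5  # breaks compilation
    return None      # infra: excluded from training

print(five_level_reward("pass", patch_lines=4))   # → 1.0
print(five_level_reward("pass", patch_lines=30))  # → 0.7
```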
Curriculum ordering by difficulty
Start training on easy samples and progress to hard ones. CVE-Agent-Bench assigns empirical difficulty scores (0.0 to 1.0) to each sample based on pass rate across all 15 agents. Use these to order your training curriculum.
Easy samples
58 samples with difficulty 0.0 to 0.25. High pass rate across agents. Provides reward signal baseline.
Training phase 1: learn basic patterns
Medium samples
45 samples with difficulty 0.25 to 0.75. Mixed pass/fail results. Requires distinguishing fix patterns.
Training phase 2: refine strategies
Hard samples
25 samples with difficulty >0.75. Low pass rate across agents. Tests edge cases and domain reasoning.
Training phase 3: handle hard cases
Floor samples
0 samples have difficulty 1.0 (samples no agent passed), so this phase is empty in the current dataset. When floor samples exist, use them as hard negatives to teach failure patterns.
Training phase 4: learn from failure
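The four phases above amount to bucketing samples by difficulty and sorting. A minimal sketch: the cut points mirror the tiers described above, and the record shape is an assumption:

```python
# Map an empirical difficulty score (0.0-1.0) to a curriculum phase,
# using the tier boundaries described above.
def curriculum_phase(difficulty):
    if difficulty >= 1.0:
        return "floor"   # phase 4: no agent passed; hard negatives
    if difficulty > 0.75:
        return "hard"    # phase 3: edge cases and domain reasoning
    if difficulty > 0.25:
        return "medium"  # phase 2: refine strategies
    return "easy"        # phase 1: learn basic patterns

# Hypothetical samples ordered into a curriculum, easiest first.
samples = [{"sample_id": "CVE-B", "difficulty": 0.6},
           {"sample_id": "CVE-A", "difficulty": 0.1},
           {"sample_id": "CVE-C", "difficulty": 0.9}]
ordered = sorted(samples, key=lambda s: s["difficulty"])
print([curriculum_phase(s["difficulty"]) for s in ordered])
# → ['easy', 'medium', 'hard']
```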
Pair generation formula
To generate preference pairs from raw evaluations, cross all outcomes for each sample:
for each CVE sample:
    passing_agents = agents with outcome == "pass"
    test_failing_agents = agents with outcome == "test-fail"
    build_failing_agents = agents with outcome == "build-fail"
    # Gold pairs
    for pass_eval in passing_agents:
        for build_eval in build_failing_agents:
            pair = (pass_eval, build_eval, reward=pass_reward - build_reward)
    # Silver pairs
    for pass_eval in passing_agents:
        for test_eval in test_failing_agents:
            pair = (pass_eval, test_eval, reward=pass_reward - test_reward)
    # Bronze pairs
    for test_eval in test_failing_agents:
        for build_eval in build_failing_agents:
            pair = (test_eval, build_eval, reward=test_reward - build_reward)

The formula generates |passing_agents| × |build_failing_agents| Gold pairs per sample. For samples where all agents fail, only Bronze pairs exist. For samples where all agents pass, no pairs exist (no learning signal).
Patch composition and semantic categories
Passing patches in CVE-Agent-Bench fall into semantic categories. Use this to understand what your model is learning:
- Logic fix. Change an operator, swap variables, or restructure control flow. 71 evaluations.
- Guard check. Add a conditional to check preconditions before an unsafe operation. 155 evaluations.
- Bounds check. Add size validation before buffer access. 164 evaluations.
- Allocation fix. Fix memory allocation (size calculation, leak, use-after-free). 89 evaluations.
- Null check. Validate pointer before dereference. 92 evaluations.
Training on semantically diverse examples teaches your model to recognize different fix patterns, not memorize specific code changes.
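One way to keep training batches semantically diverse is stratified sampling across categories. A minimal sketch, using hypothetical records whose semantic_category values are the ones listed above:

```python
from collections import defaultdict
import random

# Hypothetical records; real records carry more fields.
evals = [{"patch": "p1", "semantic_category": "bounds_check"},
         {"patch": "p2", "semantic_category": "null_check"},
         {"patch": "p3", "semantic_category": "bounds_check"},
         {"patch": "p4", "semantic_category": "guard_check"}]

def stratified_sample(evals, per_category, seed=0):
    """Draw up to per_category examples from each semantic category."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for e in evals:
        by_cat[e["semantic_category"]].append(e)
    batch = []
    for _, group in sorted(by_cat.items()):
        batch.extend(rng.sample(group, min(per_category, len(group))))
    return batch

batch = stratified_sample(evals, per_category=1)
print(sorted(e["semantic_category"] for e in batch))
# → ['bounds_check', 'guard_check', 'null_check']
```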
Data access and format
Request access to the full dataset at /contact. You will receive a JSON file with all 1,920 evaluations. Each record includes:
- sample_id, agent_model, outcome (pass | test-fail | build-fail | infra)
- patch (full text), patch_bytes (size in characters)
- difficulty (0.0-1.0), semantic_category (logic_fix, guard_check, etc.)
- cost_usd (API cost for this evaluation)
Pair generation is deterministic. The same input data produces the same preference signal across all implementations.
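Loading the records is a plain JSON parse. A minimal sketch with an inline record standing in for the delivered file; the field names follow the record format listed above:

```python
import json

# Inline stand-in for the delivered JSON file; values are illustrative.
raw = '''[
  {"sample_id": "CVE-X", "agent_model": "agent-a", "outcome": "pass",
   "patch": "example patch text", "patch_bytes": 120, "difficulty": 0.2,
   "semantic_category": "null_check", "cost_usd": 2.64}
]'''

records = json.loads(raw)
# Drop infra rows before pair generation: they carry no learning signal.
trainable = [r for r in records if r["outcome"] != "infra"]
print(len(trainable), trainable[0]["semantic_category"])  # → 1 null_check
```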
See also
FAQ
What are DPO preference pairs?
Direct Preference Optimization pairs rank outcomes: pass > test-fail > build-fail. Gold pairs (pass vs build-fail) provide strongest training signal. Silver pairs (pass vs test-fail) are medium strength. Bronze pairs (test-fail vs build-fail) are weakest. Use these to post-train your security patching model.
How many preference pairs does the data contain?
1,920 evaluation outcomes across 128 CVE samples and 15 agent configurations. Pair generation is cross-product: for each sample, generate pass > test-fail, pass > build-fail, test-fail > build-fail. Easy samples have more pairs. Hard samples may have only Bronze pairs.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.