DPO Training Pairs
Direct Preference Optimization training data from CVE-Agent-Bench. 1,920 labeled examples with Gold, Silver, and Bronze preference pairs for RLHF post-training on security patching.
Direct Preference Optimization pairs for security patching
Each CVE-Agent-Bench evaluation produces a preference signal: pass fixes the bug, test-fail compiles but does not fix it, and build-fail breaks the build. These outcomes form Direct Preference Optimization (DPO) training data: 1,920 labeled examples across 15 agent configurations provide ground truth for post-training security patching models.
Preference pairs tell a model which outcomes are desirable. A pass outcome paired with a build-fail outcome teaches the model that generating compilable code is better than generating broken code. A pass paired with a test-fail teaches that fixing the bug is better than compiling. These distinctions matter because the strongest training signals come from outcomes that differ only in one dimension.
Three tiers of preference signals
- Gold pairs (strongest signal): pass > build-fail. The winning patch fixes the bug; the losing patch breaks compilation.
- Silver pairs (medium signal): pass > test-fail. Both patches compile; only one fixes the bug.
- Bronze pairs (weakest signal): test-fail > build-fail. Neither patch fixes the bug; only compilability differs.
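The three tier rules can be sketched as a small classifier. This is a minimal sketch, assuming the outcome strings used throughout this page ("pass", "test-fail", "build-fail"); the function name is illustrative, not part of CVE-Agent-Bench.

```python
# Hypothetical helper: classify a (winner, loser) outcome pair into a tier.
def pair_tier(winner, loser):
    if winner == "pass" and loser == "build-fail":
        return "gold"    # strongest: fixes the bug vs. breaks the build
    if winner == "pass" and loser == "test-fail":
        return "silver"  # both compile; only the winner fixes the bug
    if winner == "test-fail" and loser == "build-fail":
        return "bronze"  # neither fixes the bug; only compilability differs
    return None          # not a valid preference pair

print(pair_tier("pass", "build-fail"))  # → gold
```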
Outcome Distribution
Agent-level breakdown of pass, fail, and build outcomes across all evaluations. The distribution shows how frequently each outcome type appears in the dataset. Pass outcomes (fixes that work) are the foundation for preference signals. Build-fail outcomes provide negative examples to teach the model what not to generate. Test-fail outcomes sit in the middle, representing patches that compile but miss the bug entirely.
Understanding outcome distribution helps you assess data quality for training. If most evaluations land on one outcome type, you have unbalanced training data. A healthy distribution spreads across all three outcome types, giving the model diverse learning signals.
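One way to check balance is to compute the share of each outcome, excluding infra failures. A minimal sketch, using hypothetical records with only an outcome field:

```python
from collections import Counter

# Hypothetical evaluation records; only the "outcome" field matters here.
evals = [
    {"outcome": "pass"}, {"outcome": "pass"}, {"outcome": "test-fail"},
    {"outcome": "build-fail"}, {"outcome": "pass"}, {"outcome": "infra"},
]

# Count outcomes, dropping infra rows (no learning signal).
counts = Counter(e["outcome"] for e in evals if e["outcome"] != "infra")
total = sum(counts.values())
shares = {outcome: n / total for outcome, n in counts.items()}
print(shares)  # → {'pass': 0.6, 'test-fail': 0.2, 'build-fail': 0.2}
```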
Preference Pair Distribution
Gold, Silver, and Bronze DPO pair counts from the evaluation dataset. Each pair type has different signal strength for training. Gold pairs (pass vs build-fail) provide the clearest learning signal because they differ in the most impactful dimension: does the patch work at all? Silver pairs (pass vs test-fail) teach the model to distinguish between compilable code and correct code.
The ratio of Gold to Silver to Bronze pairs reflects the underlying data. More Gold pairs mean stronger overall training signal. Bronze pairs are rare because they only occur when you have multiple agents all failing on the same sample in different ways.
Reward Signal Comparison
Ternary (+1/0/-1) vs five-level reward shaping distributions. The ternary signal is simple: pass is good, anything else is not. The five-level signal adds nuance by splitting pass into "surgical fix" and "complex fix", and by distinguishing build-fail from infrastructure errors.
More reward levels let you train on finer-grained distinctions, but they also require more data per reward bucket. The five-level approach is more informative when you have enough evaluations per level; with 1,920 total evaluations spread across 15 agents, the ternary signal may be more stable.
Reward Configuration
- Base Reward: 1.0
- Difficulty Bonus: +0.5
- Teamwork Bonus: +0.25
- Exploration Bonus: +0.1
Bonuses applied when agents solve difficult samples, contribute unique solutions, or explore novel reasoning paths.
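A minimal sketch of how these bonuses might stack on the base reward. The bonus values come from the configuration above; the trigger flags (is_difficult, is_unique_solution, is_novel_path) are illustrative assumptions, not part of the published configuration:

```python
# Values from the reward configuration above.
BASE_REWARD = 1.0
DIFFICULTY_BONUS = 0.5
TEAMWORK_BONUS = 0.25
EXPLORATION_BONUS = 0.1

# Hypothetical trigger flags; how they are computed is an assumption here.
def shaped_reward(is_difficult, is_unique_solution, is_novel_path):
    reward = BASE_REWARD
    if is_difficult:
        reward += DIFFICULTY_BONUS    # solved a difficult sample
    if is_unique_solution:
        reward += TEAMWORK_BONUS      # contributed a unique solution
    if is_novel_path:
        reward += EXPLORATION_BONUS   # explored a novel reasoning path
    return reward

print(shaped_reward(True, True, True))  # → 1.85
```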
Reward scoring for RLHF
Use a ternary or five-level reward function to train your model. The ternary signal applies to all 1,920 evaluations across all agents and CVE samples.
Ternary reward
- +1.0 = pass outcome (vulnerability fixed, all tests pass)
- 0.0 = test-fail outcome (patch compiles but bug persists)
- -1.0 = build-fail outcome (patch breaks compilation)
- excluded = infra outcome (environment failure, no learning signal)
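The ternary mapping above is a straight lookup. A minimal sketch, where "infra" returns None to mark the evaluation as excluded rather than assigning it a reward:

```python
# Ternary reward mapping from the list above. Outcome strings follow this
# dataset's naming; None marks an excluded (infra) evaluation.
def ternary_reward(outcome):
    rewards = {"pass": 1.0, "test-fail": 0.0, "build-fail": -1.0}
    return rewards.get(outcome)

print(ternary_reward("pass"))   # → 1.0
print(ternary_reward("infra"))  # → None
```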
Five-level reward (advanced)
Refine the signal by incorporating patch size and semantic complexity:
- +1.0 = pass + surgical patch (≤10 lines). Best outcome: correct and concise.
- +0.7 = pass + complex patch (>10 lines). Correct but larger than necessary.
- 0.0 = test-fail. Compiles but does not fix.
- -0.5 = build-fail. Breaks compilation.
- -1.0 = infra failure (excluded from training).
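The five-level mapping needs the patch size as well as the outcome. A minimal sketch: the 10-line surgical threshold is taken from the list above, the patch_lines parameter is an assumption of this sketch, and infra returns None here to reflect its exclusion from training rather than carrying the -1.0 value:

```python
# Five-level reward from the list above. patch_lines is a hypothetical
# parameter; how lines are counted is an assumption of this sketch.
def five_level_reward(outcome, patch_lines=0):
    if outcome == "pass":
        return 1.0 if patch_lines <= 10 else 0.7  # surgical vs. complex fix
    if outcome == "test-fail":
        return 0.0   # compiles but does not fix
    if outcome == "build-fail":
        return -0.5  # breaks compilation
    return None      # infra: excluded from training

print(five_level_reward("pass", patch_lines=4))   # → 1.0
print(five_level_reward("pass", patch_lines=30))  # → 0.7
```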
Curriculum ordering by difficulty
Start training on easy samples and progress to hard ones. CVE-Agent-Bench assigns empirical difficulty scores (0.0 to 1.0) to each sample based on pass rate across all 15 agents. Use these to order your training curriculum.
Easy samples
58 samples with difficulty 0.0 to 0.25. High pass rate across agents. Provides reward signal baseline.
Training phase 1: learn basic patterns
Medium samples
45 samples with difficulty 0.25 to 0.75. Mixed pass/fail results. Requires distinguishing fix patterns.
Training phase 2: refine strategies
Hard samples
25 samples with difficulty >0.75. Low pass rate across agents. Tests edge cases and domain reasoning.
Training phase 3: handle hard cases
Floor samples
0 samples have difficulty 1.0 (samples no agent passed), so this phase is empty in the current dataset. When floor samples exist, use them as hard negatives to teach failure patterns.
Training phase 4: learn from failure
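The four phases above amount to bucketing samples by difficulty and sorting. A minimal sketch: the cut points mirror the tiers described above, and the record shape is an assumption:

```python
# Map an empirical difficulty score (0.0-1.0) to a curriculum phase,
# using the tier boundaries described above.
def curriculum_phase(difficulty):
    if difficulty >= 1.0:
        return "floor"   # phase 4: no agent passed; hard negatives
    if difficulty > 0.75:
        return "hard"    # phase 3: edge cases and domain reasoning
    if difficulty > 0.25:
        return "medium"  # phase 2: refine strategies
    return "easy"        # phase 1: learn basic patterns

# Hypothetical samples ordered into a curriculum, easiest first.
samples = [{"sample_id": "CVE-B", "difficulty": 0.6},
           {"sample_id": "CVE-A", "difficulty": 0.1},
           {"sample_id": "CVE-C", "difficulty": 0.9}]
ordered = sorted(samples, key=lambda s: s["difficulty"])
print([curriculum_phase(s["difficulty"]) for s in ordered])
# → ['easy', 'medium', 'hard']
```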
Pair generation formula
To generate preference pairs from raw evaluations, cross all outcomes for each sample:
for each CVE sample:
    passing_agents = agents with outcome == "pass"
    test_failing_agents = agents with outcome == "test-fail"
    build_failing_agents = agents with outcome == "build-fail"
    # Gold pairs
    for pass_eval in passing_agents:
        for build_eval in build_failing_agents:
            pair = (pass_eval, build_eval, reward=pass_reward - build_reward)
    # Silver pairs
    for pass_eval in passing_agents:
        for test_eval in test_failing_agents:
            pair = (pass_eval, test_eval, reward=pass_reward - test_reward)
    # Bronze pairs
    for test_eval in test_failing_agents:
        for build_eval in build_failing_agents:
            pair = (test_eval, build_eval, reward=test_reward - build_reward)

The formula generates |passing_agents| × |build_failing_agents| Gold pairs per sample. For samples where all agents fail, only Bronze pairs exist. For samples where all agents pass, no pairs exist (no learning signal).
Patch composition and semantic categories
Passing patches in CVE-Agent-Bench fall into semantic categories. Use this to understand what your model is learning:
- Logic fix. Change an operator, swap variables, or restructure control flow. 71 evaluations.
- Guard check. Add a conditional to check preconditions before an unsafe operation. 155 evaluations.
- Bounds check. Add size validation before buffer access. 164 evaluations.
- Allocation fix. Fix memory allocation (size calculation, leak, use-after-free). 89 evaluations.
- Null check. Validate pointer before dereference. 92 evaluations.
Training on semantically diverse examples teaches your model to recognize different fix patterns, not memorize specific code changes.
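One way to keep training batches semantically diverse is stratified sampling across categories. A minimal sketch, using hypothetical records whose semantic_category values are the ones listed above:

```python
from collections import defaultdict
import random

# Hypothetical records; real records carry more fields.
evals = [{"patch": "p1", "semantic_category": "bounds_check"},
         {"patch": "p2", "semantic_category": "null_check"},
         {"patch": "p3", "semantic_category": "bounds_check"},
         {"patch": "p4", "semantic_category": "guard_check"}]

def stratified_sample(evals, per_category, seed=0):
    """Draw up to per_category examples from each semantic category."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for e in evals:
        by_cat[e["semantic_category"]].append(e)
    batch = []
    for _, group in sorted(by_cat.items()):
        batch.extend(rng.sample(group, min(per_category, len(group))))
    return batch

batch = stratified_sample(evals, per_category=1)
print(sorted(e["semantic_category"] for e in batch))
# → ['bounds_check', 'guard_check', 'null_check']
```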
Data access and format
Request access to the full dataset at /contact. You will receive a JSON file with all 1,920 evaluations. Each record includes:
- sample_id, agent_model, outcome (pass | test-fail | build-fail | infra)
- patch (full text), patch_bytes (size in characters)
- difficulty (0.0-1.0), semantic_category (logic_fix, guard_check, etc.)
- cost_usd (API cost for this evaluation)
Pair generation is deterministic. The same input data produces the same preference signal across all implementations.
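Loading the records is a plain JSON parse. A minimal sketch with an inline record standing in for the delivered file; the field names follow the record format listed above:

```python
import json

# Inline stand-in for the delivered JSON file; values are illustrative.
raw = '''[
  {"sample_id": "CVE-X", "agent_model": "agent-a", "outcome": "pass",
   "patch": "example patch text", "patch_bytes": 120, "difficulty": 0.2,
   "semantic_category": "null_check", "cost_usd": 2.64}
]'''

records = json.loads(raw)
# Drop infra rows before pair generation: they carry no learning signal.
trainable = [r for r in records if r["outcome"] != "infra"]
print(len(trainable), trainable[0]["semantic_category"])  # → 1 null_check
```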
See also
FAQ
What are DPO preference pairs?
Direct Preference Optimization pairs rank outcomes: pass > test-fail > build-fail. Gold pairs (pass vs build-fail) provide strongest training signal. Silver pairs (pass vs test-fail) are medium strength. Bronze pairs (test-fail vs build-fail) are weakest. Use these to post-train your security patching model.
How many preference pairs does the data contain?
1,920 evaluation outcomes across 128 CVE samples and 15 agent configurations. Pair generation is cross-product: for each sample, generate pass > test-fail, pass > build-fail, test-fail > build-fail. Easy samples have more pairs. Hard samples may have only Bronze pairs.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.