Dataset Schema Reference
Field specifications, reward signal definitions, and data dictionary for all 1,920 CVE-Agent-Bench evaluation records.
Evaluation schema for CVE-Agent-Bench
CVE-Agent-Bench contains 1,920 labeled evaluations: 128 vulnerability samples × 15 agent configurations. Each evaluation record contains outcome, cost, patch metadata, and difficulty scores. This reference describes every field in the dataset.
The schema is stable. All 1,920 records follow the same field structure, making it easy to build tools to filter, rank, and analyze results. Access the full dataset by requesting it at /contact.
Field reference
sample_id : string
Vulnerability identifier in format org/repo#id. Example: torvalds/linux#123456. Uniquely identifies the CVE and the source project.
agent_model : string
Agent configuration in format harness/model. Example: cot/gpt-4o. Identifies the reasoning approach and model used for this evaluation.
outcome : string
Result of the evaluation. One of: pass, test-fail, build-fail, infra. See reward signal table below for interpretation.
reward_score : number or null
Numerical signal for RLHF training. +1.0 for pass, 0.0 for test-fail, -1.0 for build-fail, null for infra.
cost_usd : number
API cost for this evaluation in USD. Includes all token usage (input and output) for the agent's problem-solving session. Costs for two agents are measured directly from API logs; costs for the other thirteen are estimated from token counts and published API pricing.
difficulty_score : number (0.0-1.0)
Empirical difficulty computed as 1.0 - pass_rate_across_agents. A sample where 14 of 15 agents pass has difficulty 0.067. A sample where 0 agents pass has difficulty 1.0. Use this to order your training curriculum from easy to hard.
difficulty_category : string
One of: easy (difficulty 0.0-0.25), medium (0.25-0.75), hard (0.75-1.0, at least one agent passed), ceiling (1.0, no agent passed). Shorthand for binning difficulty scores.
patch_text : string
The patch generated by the agent, in unified diff format. Empty for failed runs. Size ranges from 1 line (minimal fix) to 200+ lines (comprehensive fix).
semantic_category : string or null
Classification of the patch type. One of: logic-fix, guard-check, bounds-check, allocation-fix, null-check. Null for non-passing evaluations.
timestamp_utc : string
ISO 8601 timestamp of when the evaluation ran. Useful for sorting and understanding benchmark timing.
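Putting the fields together, a minimal sketch of a single record is shown below. Every value is illustrative (the diff, cost, score, and timestamp are invented for the example); only the formats follow the field definitions above.

```python
# Illustrative record only: values are invented to show field formats,
# not copied from the published dataset.
example_record = {
    "sample_id": "torvalds/linux#123456",    # org/repo#id
    "agent_model": "cot/gpt-4o",             # harness/model
    "outcome": "pass",                        # pass | test-fail | build-fail | infra
    "reward_score": 1.0,                      # +1.0 / 0.0 / -1.0 / null
    "cost_usd": 3.10,                         # hypothetical cost
    "difficulty_score": 0.4,                  # 1.0 - pass_rate_across_agents
    "difficulty_category": "medium",          # easy | medium | hard | ceiling
    "patch_text": "--- a/src/example.c\n+++ b/src/example.c\n@@ ...",  # unified diff (hypothetical)
    "semantic_category": "bounds-check",      # null for non-passing evaluations
    "timestamp_utc": "2025-01-01T00:00:00Z",  # ISO 8601
}
```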
Data quality checks
We run several automated checks on every benchmark publication:
- All 1,920 records have a valid schema (no missing required fields, no type mismatches).
- Every sample ID references a real vulnerability with CVE linkage and PoC code.
- Every agent configuration has an evaluation on all 128 samples (15 × 128 = 1,920 total evaluations).
- Difficulty scores are within [0.0, 1.0]. The difficulty histogram is roughly bell-shaped and centered near 0.5.
- Patch text has been verified to match the actual generated code from agent logs.
- Difficulty scores are deterministic: computed once from evaluation outcomes, not changing between releases.
- Cost data is derived from API logs (2 agents) or token counts + published pricing (13 agents).
- No sample appears with outcome==infra more than twice. Infra failures are random and not systematic.
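A minimal sketch of how a dataset consumer might re-run a few of these checks locally is shown below. It assumes the single-JSON-file layout described under Request access; the filename and the exact assertions are illustrative, not part of the published tooling.

```python
import json

REQUIRED_FIELDS = {
    "sample_id", "agent_model", "outcome", "reward_score", "cost_usd",
    "difficulty_score", "difficulty_category", "patch_text",
    "semantic_category", "timestamp_utc",
}
VALID_OUTCOMES = {"pass", "test-fail", "build-fail", "infra"}

def check_records(path="cve_agent_bench.json"):  # filename is an assumption
    with open(path) as f:
        records = json.load(f)
    assert len(records) == 1920, "expected 1,920 evaluation records"
    infra_counts = {}
    for r in records:
        # Schema check: every documented field is present, outcome is a known value.
        assert REQUIRED_FIELDS <= r.keys(), f"missing fields in {r.get('sample_id')}"
        assert r["outcome"] in VALID_OUTCOMES, f"unknown outcome {r['outcome']!r}"
        # Difficulty scores must fall within [0.0, 1.0].
        assert 0.0 <= r["difficulty_score"] <= 1.0
        # Track infra failures per sample; no sample should exceed two.
        if r["outcome"] == "infra":
            infra_counts[r["sample_id"]] = infra_counts.get(r["sample_id"], 0) + 1
    assert all(n <= 2 for n in infra_counts.values()), "too many infra failures"
```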
Request access
The full dataset is available on request. Submit your name, organization, and intended use case at /contact. You will receive a download link within 24 hours.
The data is provided as a single JSON file (~3 MB). Each record is a JSON object with the fields described above. Parse it with any standard JSON library.
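As a starting point, the sketch below loads the file and derives two views the schema is built for: per-agent pass rates and a difficulty-ordered training curriculum. The filename is an assumption; the field names match the reference above.

```python
import json
from collections import defaultdict

with open("cve_agent_bench.json") as f:  # filename is an assumption
    records = json.load(f)

# Per-agent pass rate: fraction of evaluations with outcome == "pass".
passes = defaultdict(list)
for r in records:
    passes[r["agent_model"]].append(r["outcome"] == "pass")
for agent, results in sorted(passes.items()):
    print(f"{agent}: {sum(results) / len(results):.1%} pass rate")

# Difficulty-ordered curriculum: sort unique samples by their stored
# difficulty_score, from easy to hard.
difficulty = {r["sample_id"]: r["difficulty_score"] for r in records}
curriculum = sorted(difficulty, key=difficulty.get)
print("easiest:", curriculum[0], "| hardest:", curriculum[-1])
```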
RLHF reward signal
| Outcome | reward_score | Interpretation |
|---|---|---|
| pass | +1.0 | Bug fixed |
| test-fail | 0.0 | Patch compiles but does not fix the bug |
| build-fail | -1.0 | Patch breaks the build |
| infra | null | Environment failure; excluded from training |
Use reward_score directly as the training signal for RLHF or DPO agent policies.
FAQ
What fields are in each evaluation record?
Each record contains: sample_id, agent_model, outcome, reward_score, cost_usd, difficulty_score, difficulty_category, patch_text, semantic_category, timestamp_utc. See the field reference above for complete documentation.
What is the reward signal?
+1.0 for pass (bug fixed), 0.0 for test-fail (compiles but no fix), -1.0 for build-fail (broken patch), null for infra (environment failure). Designed for RLHF and DPO training.
How do I access the full dataset?
Request access at /contact with your name, organization, and use case. You will receive a JSON download link within 24 hours. All 1,920 records follow the stable schema.
See also
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.