Dataset Schema Reference
Field specifications, reward signal definitions, and data dictionary for all 1,920 CVE-Agent-Bench evaluation records.
Evaluation schema for CVE-Agent-Bench
CVE-Agent-Bench contains 1,920 labeled evaluations: 128 vulnerability samples × 15 agent configurations. Each evaluation record contains outcome, cost, patch metadata, and difficulty scores. This reference describes every field in the dataset.
The schema is stable. All 1,920 records follow the same field structure, making it easy to build tools to filter, rank, and analyze results. Access the full dataset by requesting it at /contact.
Field reference
sample_id : string
Vulnerability identifier in format org/repo#id. Example: torvalds/linux#123456. Uniquely identifies the CVE and the source project.
agent_model : string
Agent configuration in format harness/model. Example: cot/gpt-4o. Identifies the reasoning approach and model used for this evaluation.
outcome : string
Result of the evaluation. One of: pass, test-fail, build-fail, infra. See reward signal table below for interpretation.
reward_score : number or null
Numerical signal for RLHF training. +1.0 for pass, 0.0 for test-fail, -1.0 for build-fail, null for infra.
cost_usd : number
API cost for this evaluation in USD. Includes all token usage (input and output) for the agent's problem-solving session. Costs for two agents are measured directly from API logs; costs for the other thirteen are estimated from token counts and published API pricing.
difficulty_score : number (0.0-1.0)
Empirical difficulty computed as 1.0 - pass_rate_across_agents. A sample where 14 of 15 agents pass has difficulty 0.067. A sample where 0 agents pass has difficulty 1.0. Use this to order your training curriculum from easy to hard.
difficulty_category : string
One of: easy (difficulty 0.0-0.25), medium (0.25-0.75), hard (0.75-1.0, at least one agent passed), ceiling (1.0, no agent passed). Shorthand for binning difficulty scores.
patch_text : string
The patch generated by the agent, in unified diff format. Empty for failed runs. Size ranges from 1 line (minimal fix) to 200+ lines (comprehensive fix).
semantic_category : string or null
Classification of the patch type. One of: logic-fix, guard-check, bounds-check, allocation-fix, null-check. Null for non-passing evaluations.
timestamp_utc : string
ISO 8601 timestamp of when the evaluation ran. Useful for sorting and understanding benchmark timing.
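Putting the fields together, a minimal sketch of a single record is shown below. Every value is illustrative (the diff, cost, score, and timestamp are invented for the example); only the formats follow the field definitions above.

```python
# Illustrative record only: values are invented to show field formats,
# not copied from the published dataset.
example_record = {
    "sample_id": "torvalds/linux#123456",    # org/repo#id
    "agent_model": "cot/gpt-4o",             # harness/model
    "outcome": "pass",                        # pass | test-fail | build-fail | infra
    "reward_score": 1.0,                      # +1.0 / 0.0 / -1.0 / null
    "cost_usd": 3.10,                         # hypothetical cost
    "difficulty_score": 0.4,                  # 1.0 - pass_rate_across_agents
    "difficulty_category": "medium",          # easy | medium | hard | ceiling
    "patch_text": "--- a/src/example.c\n+++ b/src/example.c\n@@ ...",  # unified diff (hypothetical)
    "semantic_category": "bounds-check",      # null for non-passing evaluations
    "timestamp_utc": "2025-01-01T00:00:00Z",  # ISO 8601
}
```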
Data quality checks
We run several automated checks on every benchmark publication:
- All 1,920 records have a valid schema (no missing required fields, no type mismatches).
- Every sample ID references a real vulnerability with CVE linkage and PoC code.
- Every agent configuration has an evaluation on all 128 samples (15 × 128 = 1,920 total evaluations).
- Difficulty scores are within [0.0, 1.0]. The difficulty histogram is roughly bell-shaped and centered near 0.5.
- Patch text has been verified to match the actual generated code from agent logs.
- Difficulty scores are deterministic: computed once from evaluation outcomes, not changing between releases.
- Cost data is derived from API logs (2 agents) or token counts + published pricing (13 agents).
- No sample appears with outcome==infra more than twice. Infra failures are random and not systematic.
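A minimal sketch of how a dataset consumer might re-run a few of these checks locally is shown below. It assumes the single-JSON-file layout described under Request access; the filename and the exact assertions are illustrative, not part of the published tooling.

```python
import json

REQUIRED_FIELDS = {
    "sample_id", "agent_model", "outcome", "reward_score", "cost_usd",
    "difficulty_score", "difficulty_category", "patch_text",
    "semantic_category", "timestamp_utc",
}
VALID_OUTCOMES = {"pass", "test-fail", "build-fail", "infra"}

def check_records(path="cve_agent_bench.json"):  # filename is an assumption
    with open(path) as f:
        records = json.load(f)
    assert len(records) == 1920, "expected 1,920 evaluation records"
    infra_counts = {}
    for r in records:
        # Schema check: every documented field is present, outcome is a known value.
        assert REQUIRED_FIELDS <= r.keys(), f"missing fields in {r.get('sample_id')}"
        assert r["outcome"] in VALID_OUTCOMES, f"unknown outcome {r['outcome']!r}"
        # Difficulty scores must fall within [0.0, 1.0].
        assert 0.0 <= r["difficulty_score"] <= 1.0
        # Track infra failures per sample; no sample should exceed two.
        if r["outcome"] == "infra":
            infra_counts[r["sample_id"]] = infra_counts.get(r["sample_id"], 0) + 1
    assert all(n <= 2 for n in infra_counts.values()), "too many infra failures"
```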
Request access
The full dataset is available on request. Submit your name, organization, and intended use case at /contact. You will receive a download link within 24 hours.
The data is provided as a single JSON file (~3 MB). Each record is a JSON object with the fields described above. Parse it with any standard JSON library.
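As a starting point, the sketch below loads the file and derives two views the schema is built for: per-agent pass rates and a difficulty-ordered training curriculum. The filename is an assumption; the field names match the reference above.

```python
import json
from collections import defaultdict

with open("cve_agent_bench.json") as f:  # filename is an assumption
    records = json.load(f)

# Per-agent pass rate: fraction of evaluations with outcome == "pass".
passes = defaultdict(list)
for r in records:
    passes[r["agent_model"]].append(r["outcome"] == "pass")
for agent, results in sorted(passes.items()):
    print(f"{agent}: {sum(results) / len(results):.1%} pass rate")

# Difficulty-ordered curriculum: sort unique samples by their stored
# difficulty_score, from easy to hard.
difficulty = {r["sample_id"]: r["difficulty_score"] for r in records}
curriculum = sorted(difficulty, key=difficulty.get)
print("easiest:", curriculum[0], "| hardest:", curriculum[-1])
```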
RLHF reward signal
| Outcome | reward_score | Interpretation |
|---|---|---|
| pass | +1.0 | Bug fixed |
| test-fail | 0.0 | Patch compiles but does not fix the bug |
| build-fail | -1.0 | Patch breaks the build |
| infra | null | Environment failure; excluded from training |
Use reward_score directly as the training signal for RLHF or DPO agent policies.
FAQ
What fields are in each evaluation record?
Each record contains: sample_id, agent_model, outcome, reward_score, cost_usd, difficulty_score, difficulty_category, patch_text, semantic_category, timestamp_utc. See the field reference above for complete documentation.
What is the reward signal?
+1.0 for pass (bug fixed), 0.0 for test-fail (compiles but no fix), -1.0 for build-fail (broken patch), null for infra (environment failure). Designed for RLHF and DPO training.
How do I access the full dataset?
Request access at /contact with your name, organization, and use case. You will receive a JSON download link within 24 hours. All 1,920 records follow the stable schema.
See also
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.