[SCHEMA]

Dataset Schema Reference

Field specifications, reward signal definitions, and data dictionary for all 1,920 CVE-Agent-Bench evaluation records.

1,920
Evaluation records
128
Unique CVE samples
15
Agent configurations
40+
C/C++ projects

Evaluation schema for CVE-Agent-Bench

CVE-Agent-Bench contains 1,920 labeled evaluations: 128 vulnerability samples × 15 agent configurations. Each evaluation record contains the outcome, cost, patch metadata, and difficulty scores. This reference describes every field in the dataset.

The schema is stable. All 1,920 records follow the same field structure, making it easy to build tools to filter, rank, and analyze results. Access the full dataset by requesting it at /contact.
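A minimal loader-and-filter sketch in Python, assuming the field names described below; the records here are illustrative stand-ins, not real dataset rows (the `curl/curl#98765` sample ID is invented for the example):

```python
import json

# Two records in the documented shape; values are illustrative only.
records = json.loads("""
[
  {"sample_id": "torvalds/linux#123456", "agent_model": "cot/gpt-4o",
   "outcome": "pass", "reward_score": 1.0, "difficulty_score": 0.8},
  {"sample_id": "curl/curl#98765", "agent_model": "cot/gpt-4o",
   "outcome": "build-fail", "reward_score": -1.0, "difficulty_score": 0.4}
]
""")

# Passing evaluations for one agent configuration, hardest samples first.
passes = sorted(
    (r for r in records
     if r["agent_model"] == "cot/gpt-4o" and r["outcome"] == "pass"),
    key=lambda r: r["difficulty_score"],
    reverse=True,
)
```

The same pattern extends to any field in the schema: swap the predicate or the sort key to rank by cost, difficulty, or timestamp.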

Field reference

sample_id : string

Vulnerability identifier in format org/repo#id. Example: torvalds/linux#123456. Uniquely identifies the CVE and the source project.

agent_model : string

Agent configuration in format harness/model. Example: cot/gpt-4o. Identifies the reasoning approach and model used for this evaluation.

outcome : string

Result of the evaluation. One of: pass, test-fail, build-fail, infra. See reward signal table below for interpretation.

reward_score : number

Numerical signal for RLHF training. +1.0 for pass, 0.0 for test-fail, -1.0 for build-fail, null for infra.
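The outcome-to-reward mapping above can be sketched as a lookup, with infra runs dropped before training since they carry no signal (the helper names are ours, not part of the dataset):

```python
# Reward mapping as documented: pass -> +1.0, test-fail -> 0.0,
# build-fail -> -1.0, infra -> None (excluded from training).
REWARD = {"pass": 1.0, "test-fail": 0.0, "build-fail": -1.0, "infra": None}

def reward_for(outcome):
    """Return the documented reward score for an outcome string."""
    return REWARD[outcome]

def training_pairs(records):
    """(sample_id, reward) pairs with infra runs filtered out."""
    return [(r["sample_id"], REWARD[r["outcome"]])
            for r in records if r["outcome"] != "infra"]
```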

cost_usd : number

API cost for this evaluation in USD, covering all API calls (input and output tokens) in the agent's problem-solving session. Two agents have costs measured directly from API logs; the other thirteen have costs estimated from token counts and published API pricing.

difficulty_score : number (0.0-1.0)

Empirical difficulty computed as 1.0 - pass_rate_across_agents. A sample where 14 of 15 agents pass has difficulty 0.067. A sample where 0 agents pass has difficulty 1.0. Use this to order your training curriculum from easy to hard.

difficulty_category : string

One of: easy (difficulty 0.0 to below 0.25), medium (0.25 to below 0.75), hard (0.75 to below 1.0), ceiling (exactly 1.0, i.e. no agent passed). Shorthand bins for difficulty_score.
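The difficulty score and its bins can be reproduced directly from outcomes. This is a sketch; the exact handling of scores falling on a bin boundary (0.25, 0.75) is our assumption, since the reference only gives the ranges:

```python
def difficulty(outcomes):
    """Empirical difficulty: 1.0 minus the fraction of agents that passed."""
    passed = sum(1 for o in outcomes if o == "pass")
    return 1.0 - passed / len(outcomes)

def difficulty_category(score):
    """Bin a difficulty score as documented; ceiling means no agent passed."""
    if score == 1.0:
        return "ceiling"
    if score < 0.25:
        return "easy"
    if score < 0.75:
        return "medium"
    return "hard"
```

For curriculum ordering, sort samples by their difficulty score ascending and train from easy to hard.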

patch_text : string

The patch generated by the agent, in unified diff format. Empty for failed runs. Size ranges from one line (minimal fix) to 200+ lines (comprehensive fix).

semantic_category : string

Classification of the patch type. One of: logic-fix, guard-check, bounds-check, allocation-fix, null-check. Null for non-passing evaluations.

timestamp_utc : string

ISO 8601 timestamp of when the evaluation ran. Useful for sorting and understanding benchmark timing.

Data quality checks

We run several automated checks on every benchmark publication:

  • All 1,920 records have valid schema (no null required fields, no type mismatches).
  • Every sample ID references a real vulnerability with CVE linkage and PoC code.
  • Every agent configuration has an evaluation on all 128 samples (15 × 128 = 1,920 evaluations).
  • Difficulty scores are within [0.0, 1.0]. The histogram is roughly symmetric and centered near 0.5.
  • Patch text has been verified to match the actual generated code from agent logs.
  • Difficulty scores are deterministic: computed once from evaluation outcomes, not changing between releases.
  • Cost data is derived from API logs (2 agents) or token counts + published pricing (13 agents).
  • No sample appears with outcome == infra more than twice. Infra failures appear sporadic rather than systematic.
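The schema checks above can be sketched as a small per-record validator. The required-field list and type table below are our reading of the field reference (reward_score and semantic_category are omitted because they are documented as null for some outcomes):

```python
REQUIRED = {
    "sample_id": str, "agent_model": str, "outcome": str,
    "cost_usd": (int, float), "difficulty_score": (int, float),
    "difficulty_category": str, "patch_text": str, "timestamp_utc": str,
}
VALID_OUTCOMES = {"pass", "test-fail", "build-fail", "infra"}

def check(record):
    """Return a list of problems with a record; empty means it passes."""
    errors = []
    for field, typ in REQUIRED.items():
        if field not in record:
            errors.append(f"missing {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"bad type for {field}")
    if record.get("outcome") not in VALID_OUTCOMES:
        errors.append("unknown outcome")
    ds = record.get("difficulty_score")
    if isinstance(ds, (int, float)) and not 0.0 <= ds <= 1.0:
        errors.append("difficulty out of range")
    return errors
```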

Request access

The full dataset is available on request. Send your name, organization, and intended use case to /contact. You will receive a download link within 24 hours.

The data is provided as a single JSON file (~3 MB). Each record is a JSON object with the fields described above. Parse it with any standard JSON library.

[DATA ACCESS]

Dataset access is gated. Request access and you will receive a download link within 24 hours.


Dataset Schema

Field           Type
sample_id       string
agent_model     string
outcome         string (pass | test-fail | build-fail | infra)
time_seconds    number
cost_usd        number

RLHF Reward Signal

Reward signal: pass = +1.0, test-fail = 0.0, build-fail = -1.0, infra = null (excluded). Use for training agent policies.

See also

FAQ

What fields are in each evaluation record?

Each record contains: sample_id, agent_model, outcome, reward_score, cost_usd, time_seconds, difficulty_score, difficulty_category, patch_text, semantic_category, timestamp_utc. See the schema reference for complete field documentation.

What is the reward signal?

+1.0 for pass (bug fixed), 0.0 for test-fail (compiles but no fix), -1.0 for build-fail (broken patch), null for infra (environment failure). Designed for RLHF and DPO training.

How do I access the full dataset?

Request access at /contact with your name, organization, and use case. You will receive a JSON download link within 24 hours. All 1,920 records follow the stable schema.

[RELATED TOPICS]

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.