Open benchmark for AI vulnerability patching — 128 real CVEs, 15 agents, reproducible results.

[CVE-AGENT-BENCH]

Benchmark Explorer

CVE-Agent-Bench tests whether AI coding agents can fix real CVE vulnerabilities in production codebases. 1,920 evaluations. 128 CVEs. 15 agents. Best pass rate: 62.7%. Cheapest fix: $2.64.

How each CVE is verified:

1. Reproduce

Run the trigger against unpatched code. Confirm it crashes.

2. Patch

Apply the agent's git diff inside a Docker container.

3. Build

Compile with the verifier toolchain and memory safety instrumentation.

4. Verify

Re-run the same trigger. If no crash → [PASS]. Still crashes → [FAIL].

Scoring: Pass = +1 (trigger no longer crashes). Fail = 0 (still crashes). Build = -1 (patch doesn't compile). Infra = excluded.
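The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's own code: `SCORES` and `summarize` are hypothetical names, and taking the pass rate as passes divided by non-infra evaluations is an assumption, since the page does not state the denominator.

```python
from collections import Counter

# Per-outcome scores as stated on this page. "infra" has no score
# because infrastructure failures are excluded.
SCORES = {"pass": 1, "fail": 0, "build": -1}


def summarize(outcomes):
    """Return (total_score, pass_rate) for a list of outcome strings.

    Assumption: pass rate = passes / evaluations that were not "infra".
    """
    counts = Counter(outcomes)
    unknown = set(counts) - set(SCORES) - {"infra"}
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    scored = sum(counts[name] for name in SCORES)
    total_score = sum(SCORES[name] * counts[name] for name in SCORES)
    pass_rate = counts["pass"] / scored if scored else 0.0
    return total_score, pass_rate


score, rate = summarize(["pass", "pass", "fail", "build", "infra"])
print(score, rate)  # 1 0.5
```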

128 CVE samples

1,920 verified evaluations

15 agent configurations

62.7% best agent pass rate

[CVE-AGENT-BENCH LEADERBOARD]

Agent	Pass rate
gpt-5.2	62.7%
cursor-opus-4.6	62.5%
claude-opus-4-6	61.6%
gemini31-gemini-3.1-pro-preview	58.7%
opencode-gemini-3.1-pro-preview	54.9%
cursor-gpt-5.2	51.6%
oc/gpt-5.2	51.6%
cursor-gpt-5.3-codex	50.4%
gpt-5.2-codex	49.2%
oc/claude-opus-4-6	47.5%
claude-opus-4-5	45.7%
cursor-composer-1.5	45.2%
gemini-3-pro-preview	43.0%
oc/gpt-5.2-codex	37.8%
oc/claude-opus-4-5	36.8%

Current verified dataset: 1,920 evaluations · 128 CVE samples · 15 agents · Target: 6,138+ vulnerabilities

Agent names = harness/model. The same LLM through different coding agents produces different patch quality. For example: claude/opus-4-5 is Claude Opus 4.5 through Anthropic's Claude Code. opencode/claude-opus-4-5 is the same model through the OpenCode harness. Same model, different harness — different results. The harness is what we are measuring, not just the model.
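To make the harness effect concrete, the sketch below computes the spread in pass rate for one model across harnesses, using figures copied from the leaderboard. The grouping in `SAME_MODEL` is an assumption about which leaderboard entries share an underlying model; the page does not spell that mapping out, and the helper name is hypothetical.

```python
# Pass rates copied from the leaderboard on this page (percent).
LEADERBOARD = {
    "gpt-5.2": 62.7,
    "cursor-gpt-5.2": 51.6,
    "oc/gpt-5.2": 51.6,
    "claude-opus-4-5": 45.7,
    "oc/claude-opus-4-5": 36.8,
}

# Assumption: these entries run the same model through different harnesses.
SAME_MODEL = {
    "gpt-5.2": ["gpt-5.2", "cursor-gpt-5.2", "oc/gpt-5.2"],
    "claude-opus-4-5": ["claude-opus-4-5", "oc/claude-opus-4-5"],
}


def harness_spread(model):
    """Return best-minus-worst pass rate across harnesses, in points."""
    rates = [LEADERBOARD[name] for name in SAME_MODEL[model]]
    return round(max(rates) - min(rates), 1)


for model in SAME_MODEL:
    print(model, harness_spread(model))
```

Under that grouping, the gap between the best and worst harness is about 11 points for gpt-5.2 and about 9 points for claude-opus-4-5.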

[HOW TO USE THIS DATA]

For RLHF / DPO training

Each evaluation is a labeled example: +1 (pass), 0 (fail), -1 (build-fail). Use difficulty scores for curriculum ordering.
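One way to turn these labels into DPO-style training data is to pair a passing patch with a non-passing patch for the same CVE, then order the pairs by difficulty. The sketch below assumes each record carries hypothetical `patch` and `difficulty` fields (lower means easier); neither name is confirmed by the dataset schema on this page.

```python
from itertools import product


def build_preference_pairs(evaluations):
    """Pair each passing patch with each non-passing patch for the same CVE.

    `evaluations` is a list of dicts with hypothetical keys: sample_id,
    patch, outcome ("pass", "fail", "build", "infra"), and difficulty.
    """
    by_sample = {}
    for ev in evaluations:
        if ev["outcome"] == "infra":  # excluded from scoring
            continue
        by_sample.setdefault(ev["sample_id"], []).append(ev)

    pairs = []
    for sample_id, evs in by_sample.items():
        chosen = [e for e in evs if e["outcome"] == "pass"]
        rejected = [e for e in evs if e["outcome"] != "pass"]
        for good, bad in product(chosen, rejected):
            pairs.append({
                "sample_id": sample_id,
                "difficulty": good["difficulty"],
                "chosen": good["patch"],
                "rejected": bad["patch"],
            })
    # Curriculum ordering: easiest samples first.
    return sorted(pairs, key=lambda p: p["difficulty"])


# Made-up records for illustration only.
evals = [
    {"sample_id": "s2", "patch": "hard-ok", "outcome": "pass", "difficulty": 0.9},
    {"sample_id": "s2", "patch": "hard-bad", "outcome": "build", "difficulty": 0.9},
    {"sample_id": "s1", "patch": "easy-ok", "outcome": "pass", "difficulty": 0.2},
    {"sample_id": "s1", "patch": "easy-bad", "outcome": "fail", "difficulty": 0.2},
    {"sample_id": "s1", "patch": "skipped", "outcome": "infra", "difficulty": 0.2},
]
for pair in build_preference_pairs(evals):
    print(pair["sample_id"], pair["chosen"], ">", pair["rejected"])
```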

For benchmarking your agent

Run your agent on the same 128 CVEs. Log results to W&B Weave. Compare against 15 baselines.

For pre-training data

1,920 labeled vulnerability-patching examples across 40 production codebases. Patches are surgical — 74% are 10 lines or fewer.
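Patch size can be measured directly from the agent's diff. The sketch below counts added and removed lines inside the hunks of a git-format unified diff. Treating "10 lines or fewer" as added plus removed lines is an assumption, and the sample diff is invented for illustration.

```python
def changed_lines(diff_text):
    """Count (added, removed) lines inside the hunks of a git-format diff."""
    added = removed = 0
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # file header lines follow, skip them
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            added += 1
        elif in_hunk and line.startswith("-"):
            removed += 1
    return added, removed


def is_surgical(diff_text, limit=10):
    """Assumption: patch size = added + removed lines."""
    return sum(changed_lines(diff_text)) <= limit


# Hypothetical patch, not taken from the dataset.
SAMPLE = """\
diff --git a/src/example.c b/src/example.c
--- a/src/example.c
+++ b/src/example.c
@@ -10,3 +10,6 @@
 int parse(const char *buf) {
-    return buf[0];
+    if (buf == NULL) {
+        return -1;
+    }
+    return buf[0];
 }
"""
print(changed_lines(SAMPLE), is_surgical(SAMPLE))  # (4, 1) True
```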

For research

IRT difficulty calibration, cross-agent agreement (kappa), behavioral trajectory clusters, ensemble analysis.
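As an illustration of the cross-agent agreement analysis, the sketch below computes Cohen's kappa for two agents' outcomes on the same samples. It is the textbook formula written in plain Python, not the benchmark's analysis code, and the outcome lists are made up.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    labels_a, labels_b = list(labels_a), list(labels_b)
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of samples where both agents got the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each agent's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both agents used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


agent_a = ["pass", "pass", "fail", "fail"]
agent_b = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(agent_a, agent_b))  # 0.5
```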

[EVALUATION FACTORY]

Three-stage pipeline: generate patches, reproduce vulnerabilities, verify correctness.

Generate: AI agents generate patches for CVE samples (128 samples)

Reproduce: Verify POC and patch correctness (1,920 evals)

Patch: Test patches against test suites (50.5% pass rate)

[AGENT TRAJECTORIES]

Real agent sessions from CVE-Agent-Bench. Watch how different agents approach the same vulnerability.

Speed-runner: arrow #20123

Claude Opus 4.5 fixes a null-check bug in 3 tool calls and 19 seconds. Grep → Read → Edit pattern.

Tools: 3 · Tokens: 8.4k · Duration: 19s · [PASS] · 5 added / 1 removed

● Grep: Search for FieldFromFlatbuffer function to locate the vulnerable code
◉ Found in src/arrow/extension_array_builder.cc at line 156

● Read: Read src/arrow/flatbuffer.cc to understand the null-check issue
Files: src/arrow/flatbuffer.cc
◉ Code shows field->name() called without null check. Missing safety guard.

● Edit: Add null check before field->name() call to prevent crash
Files: src/arrow/flatbuffer.cc
◉ [PASS] Patch applied. Tests pass. Null-check fix prevents crash.

✓ [PASS] 5 added / 1 removed
[Files] src/arrow/flatbuffer.cc

[VERIFICATION PIPELINE]

✓ Step 1: Pull Image (2.10s)
[verify] Pulling benchmark image
verifier:20123 pulled successfully

✓ Step 2: Write Patch (0.30s)
[verify] Writing patch to /tmp/agent-patch.diff
Patch written (4 bytes)

✓ Step 3: Apply Patch (1.60s)
[verify] Applying patch to source
git apply /tmp/agent-patch.diff
Patch applied cleanly

✓ Step 4: Build (39.10s)
[verify] Building with verifier toolchain
compile with memory safety instrumentation
Linking verification runtime
Build completed successfully

✓ Step 5: Run Trigger (1.67s)
[verify] Running trigger against patched binary
verify /tmp/trigger
Reading 2 bytes from trigger input
Compression error. Error code: -6
Execution successful

[PASS] Vulnerability fixed. The trigger no longer crashes.
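How the logged steps lead to a verdict can be sketched as a small decision function. The four callables are hypothetical hooks standing in for the real pipeline steps, and scoring a diff that fails to apply as a build failure is an assumption; the page only says that a patch which does not compile counts as Build.

```python
def classify(pull_image, apply_patch, build, trigger_crashes):
    """Run the verification steps in order and return an outcome string.

    Each argument is a hypothetical zero-argument callable returning bool.
    """
    if not pull_image():
        return "infra"  # environment failure, excluded from scoring
    # Assumption: a diff that does not apply is treated like a build failure.
    if not apply_patch() or not build():
        return "build"
    return "fail" if trigger_crashes() else "pass"


yes, no = (lambda: True), (lambda: False)
print(classify(yes, yes, yes, no))   # pass
print(classify(yes, yes, no, no))    # build
print(classify(yes, yes, yes, yes))  # fail
print(classify(no, yes, yes, no))    # infra
```

Because the `or` and the early returns short-circuit, later steps are never invoked once an earlier step has failed, which mirrors the ordering of the log.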

[VERIFIABLE]

This session conforms to the IETF Verifiable Agent Conversation Record format. The data structure maps to the VAC entry types (tool-call, tool-result, message) and could be wrapped in a COSE_Sign1 envelope for cryptographic non-repudiation.

→ draft-birkholz-verifiable-agent-conversations

[SAMPLE EXPLORER]

136 CVE samples × 15 agents. Each cell is one evaluation. Sorted by difficulty (easiest top) and pass rate (best left).

Columns: Sample (136), GPT5.2, CsrC4.6, C4.6, Gem3.1, OC-Gem3.1, CsrGPT, OC-GPT5.2, Csr5.3, GPT5.2C, OC-C4.6, C4.5, Csr1.5, Gem3, OC-GPT5.2C, OC-C4.5

text-shaping/text-shaping #10899

text-shaping/text-shaping #11001

git-library/git-library #11167

network-switch/network-switch #10796

packet-analyzer/packet-analyzer #1237

image-processor/image-processor #11429

text-shaping/text-shaping #11033

text-shaping/text-shaping #11081

text-shaping/text-shaping #11263

text-shaping/text-shaping #11290

archive-library/archive-library #11196

git-library/git-library #10999

git-library/git-library #11382

mesh-networking/mesh-networking #11376

mesh-networking/mesh-networking #14821

network-switch/network-switch #10710

crypto-library/crypto-library #10628

packet-analyzer/packet-analyzer #1236

text-shaping/text-shaping #11522

mesh-networking/mesh-networking #12589

data-framework/data-framework #24101

image-processor/image-processor #11078

spell-checker/spell-checker #16531

rpc-framework/rpc-framework #7188

text-shaping/text-shaping #10948

text-shaping/text-shaping #11305

archive-library/archive-library #13435

archive-library/archive-library #15431

git-library/git-library #11173

sip-server/sip-server #53080

linux-utils/linux-utils #53149

file-identifier/file-identifier #13222

archive-library/archive-library #12817

sip-server/sip-server #52204

data-framework/data-framework #20116

text-shaping/text-shaping #11351

text-shaping/text-shaping #11367

archive-library/archive-library #38751

opcua-library/opcua-library #11484

network-switch/network-switch #11408

data-framework/data-framework #57209

text-shaping/text-shaping #11060

text-shaping/text-shaping #12241

archive-library/archive-library #11011

git-library/git-library #11004

sip-server/sip-server #53397

embedded-server/embedded-server #53038

unicode-codec/unicode-codec #66063

pattern-matcher/pattern-matcher #12424

data-framework/data-framework #28750

text-shaping/text-shaping #12312

image-formats/image-formats #12818

archive-library/archive-library #14574

archive-library/archive-library #20459

network-switch/network-switch #12255

packet-analyzer/packet-analyzer #10162

data-compressor/data-compressor #50433

network-switch/network-switch #11160

embedded-server/embedded-server #53029

pgp-library/pgp-library #25386

data-framework/data-framework #20123

service-proxy/service-proxy #22137

json-parser/json-parser #18140

data-framework/data-framework #20113

data-compressor/data-compressor #24837

embedded-server/embedded-server #28474

metadata-library/metadata-library #45993

fs-utilities/fs-utilities #49679

data-compressor/data-compressor #30193

data-framework/data-framework #37888

text-shaping/text-shaping #10724

image-converter/image-converter #12193

image-codec/image-codec #42839

image-codec/image-codec #49277

image-pipeline/image-pipeline #26855

data-compressor/data-compressor #30253

image-codec/image-codec #35293

opcua-library/opcua-library #10676

geo-library/geo-library #10637

pgp-library/pgp-library #25388

network-switch/network-switch #10731

archive-library/archive-library #19509

git-library/git-library #11007

opcua-library/opcua-library #11435

python-runtime/python-runtime #58295

service-proxy/service-proxy #22080

service-proxy/service-proxy #25207

image-formats/image-formats #13016

mesh-networking/mesh-networking #12631

text-shaping/text-shaping #10081

archive-library/archive-library #15120

opcua-library/opcua-library #10604

chem-toolkit/chem-toolkit #36609

data-compressor/data-compressor #30761

crypto-node/crypto-node #34657

rpc-framework/rpc-framework #1847

rpc-framework/rpc-framework #47834

archive-library/archive-library #12466

unicode-codec/unicode-codec #57632

analytics-db/analytics-db #60890

image-converter/image-converter #13180

image-codec/image-codec #40396

chem-toolkit/chem-toolkit #42769

stat-reader/stat-reader #12662

disassembly-engine/disassembly-engine #12953

disassembly-engine/disassembly-engine #12957

disassembly-engine/disassembly-engine #12988

disassembly-engine/disassembly-engine #58789

disassembly-engine/disassembly-engine #8877

js-engine/js-engine #65386

js-engine/js-engine #65393

data-compressor/data-compressor #29287

service-proxy/service-proxy #26685

service-proxy/service-proxy #26834

service-proxy/service-proxy #28869

service-proxy/service-proxy #30618

service-proxy/service-proxy #32878

service-proxy/service-proxy #44850

file-identifier/file-identifier #1065

fuzz-engine/fuzz-engine #51072

mesh-compressor/mesh-compressor #37705

serial-library/serial-library #38778

text-shaping/text-shaping #10097

text-shaping/text-shaping #10953

text-shaping/text-shaping #12292

archive-library/archive-library #15278

cad-library/cad-library #54380

mesh-networking/mesh-networking #12536

geo-library/geo-library #11016

binary-analyzer/binary-analyzer #10222

binary-analyzer/binary-analyzer #11359

pgp-library/pgp-library #24528

pgp-library/pgp-library #24538

pgp-library/pgp-library #25292

i18n-library/i18n-library #65873

cpu-emulator/cpu-emulator #36552

Pass: vulnerability fixed

Fail: patch applied, trigger still crashes

Build: patch does not compile

Infra: environment failure (excluded)

Difficulty: easy, medium, hard, floor, ceiling
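The ordering of the grid (easiest samples on top, best agents on the left) can be reproduced from raw outcomes. In the sketch below, difficulty is approximated by the share of agents that passed a sample. That is an assumption standing in for the IRT calibration mentioned elsewhere on this page, and the sample and agent names are placeholders.

```python
def sort_matrix(results):
    """Order samples easiest-first and agents best-first.

    `results` maps (sample_id, agent) -> outcome string; "infra" cells
    are ignored when computing pass rates.
    """
    def pass_rate(cells):
        scored = [o for o in cells if o != "infra"]
        return sum(o == "pass" for o in scored) / len(scored) if scored else 0.0

    samples = sorted({s for s, _ in results})
    agents = sorted({a for _, a in results})
    by_sample = {
        s: pass_rate([o for (s2, _), o in results.items() if s2 == s])
        for s in samples
    }
    by_agent = {
        a: pass_rate([o for (_, a2), o in results.items() if a2 == a])
        for a in agents
    }
    # sorted() is stable, so ties keep their alphabetical order.
    rows = sorted(samples, key=lambda s: -by_sample[s])
    cols = sorted(agents, key=lambda a: -by_agent[a])
    return rows, cols


grid = {
    ("s1", "A"): "pass", ("s1", "B"): "fail",
    ("s2", "A"): "pass", ("s2", "B"): "pass",
    ("s3", "A"): "fail", ("s3", "B"): "build",
}
print(sort_matrix(grid))  # (['s2', 's1', 's3'], ['A', 'B'])
```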

[W&B INTEGRATION]

Track your agent evaluations on Weights & Biases. View live results at wandb.ai/tobias_xor-xor/cve-bench

Import your evaluation results into W&B for centralized tracking.

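# Pseudocode: adapt the `results` object to your own agent harness

import wandb

def upload_to_wandb(results):
    """Log aggregate metrics for one benchmark run to W&B.

    `results` is any object exposing `pass_rate`, `total`, and `cost`.
    """
    # wandb.init() returns a run that also works as a context manager,
    # so the run is finished even if logging raises.
    with wandb.init(
        project="cve-bench",
        entity="tobias_xor-xor",
        job_type="evaluation",
    ) as run:
        run.log({
            "pass_rate": results.pass_rate,
            "total_evals": results.total,
            "cost_usd": results.cost,
        })

# `evaluation_results` comes from your own evaluation loop
upload_to_wandb(evaluation_results)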

Run your agent against the benchmark and log results to W&B.

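# Pseudocode: implement run_agent() for your agent

import wandb

OUTCOMES = ("pass", "fail", "build", "infra")

def evaluate_agent(agent, samples):
    """Count outcomes over all samples; run_agent() must return one of OUTCOMES."""
    results = {outcome: 0 for outcome in OUTCOMES}

    for sample in samples:
        outcome = run_agent(agent, sample)
        if outcome not in results:
            raise ValueError(f"unexpected outcome: {outcome!r}")
        results[outcome] += 1

    return results

# Log the outcome counts to W&B; logging requires an active run
results = evaluate_agent(my_agent, cve_samples)
with wandb.init(
    project="cve-bench",
    entity="tobias_xor-xor",
    job_type="evaluation",
) as run:
    run.log(results)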

Expected schema for evaluation results.

{
  "agent_model": "string",
  "sample_id": "string",
  "outcome": "pass" | "fail" | "build" | "infra",
  "time_seconds": number,
  "cost_usd": number,
  "tokens_in": number,
  "tokens_out": number
}
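A record in this shape can be checked before logging or upload. The sketch below mirrors the schema as a Python dataclass. The class and function names are hypothetical, and the non-negativity checks and integer token counts are assumptions, since the schema only says `number`. The example values are placeholders.

```python
from dataclasses import dataclass

OUTCOMES = {"pass", "fail", "build", "infra"}


@dataclass(frozen=True)
class EvaluationResult:
    agent_model: str
    sample_id: str
    outcome: str
    time_seconds: float
    cost_usd: float
    tokens_in: int
    tokens_out: int

    def __post_init__(self):
        if self.outcome not in OUTCOMES:
            raise ValueError(f"outcome must be one of {sorted(OUTCOMES)}")
        # Assumption: durations, costs and token counts are never negative.
        for name in ("time_seconds", "cost_usd", "tokens_in", "tokens_out"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")


def parse_result(record):
    """Build an EvaluationResult from a decoded JSON object (a dict).

    Missing or unexpected keys raise TypeError.
    """
    return EvaluationResult(**record)


example = parse_result({
    "agent_model": "example-agent",  # placeholder values
    "sample_id": "example-sample",
    "outcome": "pass",
    "time_seconds": 19.0,
    "cost_usd": 0.0,
    "tokens_in": 0,
    "tokens_out": 0,
})
print(example.outcome)  # pass
```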

[DATA ACCESS]

Dataset access is gated. Request access to receive a download link within 24 hours.

Request Access

Get a download link for the full CVE-Agent-Bench dataset with evaluation metadata.


Dataset Schema

Field	Type
sample_id	string
agent_model	string
outcome	pass | fail | build | infra
time_seconds	number
cost_usd	number

RLHF Reward Signal

Reward model weights: pass=+1, fail=-0.5, build=-0.75, infra=0 (excluded). Use for training agent policies.
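These weights translate directly into a lookup table. The sketch below drops infra outcomes instead of assigning them a reward of 0. That reading of "infra=0 (excluded)" is an assumption, and the helper names are hypothetical.

```python
# Reward weights as listed on this page.
REWARD_WEIGHTS = {"pass": 1.0, "fail": -0.5, "build": -0.75}


def rewards(outcomes):
    """Return one reward per scored outcome, dropping "infra" entries."""
    return [REWARD_WEIGHTS[o] for o in outcomes if o != "infra"]


def mean_reward(outcomes):
    """Average reward over scored outcomes (0.0 if nothing was scored)."""
    values = rewards(outcomes)
    return sum(values) / len(values) if values else 0.0


print(rewards(["pass", "fail", "build", "infra"]))      # [1.0, -0.5, -0.75]
print(mean_reward(["pass", "pass", "fail", "infra"]))   # 0.5
```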

Run your agent against 128 CVEs

Download the dataset, log results to W&B Weave, and compare against 15 baselines. The current best hits 62.7%.
