Benchmark Results

Top agent: 62.7% pass rate. From $2.64 per verified fix. Real data from 1,920 evaluations.

15 agents benchmarked on 128 real vulnerabilities

Outcome: Pick the right agent before you deploy. See which ones produce fixes that pass.

Mechanism: 1,920 evaluations. Pass rates, cost per fix, and difficulty scores for every agent.

Proof: Best pass rate: 62.7%. Cheapest verified fix: $2.64.

Agent Rankings

  1. Pass rate (primary metric)
  2. Cost per verified fix
  3. Difficulty-weighted performance
  4. Trend over time

Cost Economics

Fix a vulnerability with an agent for $2.64–$52. Incident response costs thousands. Scale pre-production agent fixes instead.
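
One plausible reading of "cost per verified fix" is total spend across all attempts divided by the number of fixes that pass, so failed attempts still count toward the cost. The sketch below assumes that definition; the page does not state the exact formula, the function name is hypothetical, and the figures in the usage lines are made up for illustration.

```python
def cost_per_verified_fix(total_spend_usd: float, verified_fixes: int) -> float:
    """Total spend over every attempt divided by the number of passing fixes.

    Assumption: failed attempts are paid for too, so they raise the cost of
    each fix that does pass. This is not confirmed as the benchmark's formula.
    """
    if verified_fixes <= 0:
        raise ValueError("no verified fixes, so cost per fix is undefined")
    return total_spend_usd / verified_fixes


# Hypothetical figures, not benchmark data.
example = cost_per_verified_fix(total_spend_usd=200.0, verified_fixes=80)
print(f"${example:.2f} per verified fix")
```

Under this definition an agent with a low pass rate can be expensive per fix even when each individual run is cheap.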

Difficulty Scoring

Vulnerabilities range from trivial (syntax errors) to hard (architectural refactors). The difficulty score shows how agent capability varies across threat classes.
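
The scoring model itself is not described on this page (the sample view further down mentions an IRT score). As a simpler stand-in, the sketch below uses empirical difficulty, meaning the fraction of agents that fail a sample, and weights each pass by that value. All names and the toy data are hypothetical.

```python
from typing import Dict

# results[agent][sample_id] is True if that agent's fix passed verification.
Results = Dict[str, Dict[str, bool]]


def empirical_difficulty(results: Results) -> Dict[str, float]:
    """Difficulty of a sample = fraction of agents that failed it (0.0 to 1.0)."""
    samples = {s for per_agent in results.values() for s in per_agent}
    difficulty = {}
    for sample in sorted(samples):
        outcomes = [r[sample] for r in results.values() if sample in r]
        difficulty[sample] = 1.0 - sum(outcomes) / len(outcomes)
    return difficulty


def weighted_pass_rate(agent_results: Dict[str, bool],
                       difficulty: Dict[str, float]) -> float:
    """Pass rate in which each sample counts in proportion to its difficulty."""
    total = sum(difficulty[s] for s in agent_results)
    if total == 0:
        return 0.0  # every sample was passed by every agent
    earned = sum(difficulty[s] for s, passed in agent_results.items() if passed)
    return earned / total


# Hypothetical toy data: three agents, two samples.
toy = {
    "agent_a": {"easy": True, "hard": True},
    "agent_b": {"easy": True, "hard": False},
    "agent_c": {"easy": True, "hard": False},
}
difficulty = empirical_difficulty(toy)
```

With this weighting, a sample that every agent passes has difficulty 0.0 and contributes nothing to the score, which is one reason a fitted model such as IRT may be preferred in practice.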

128 vulnerabilities · 1,920 evaluations · 15 agents · 62.7% best pass rate

Current test dataset

128 real bugs tested, 1,920 test runs across 15 agent configurations. Growing to 6,138+ vulnerabilities across 250+ production codebases.

Vulnerability-Agent-Bench evaluates how well AI agents can generate security patches for real vulnerability samples. This is not a toy benchmark: the bugs come from a curated dataset of real vulnerabilities collected from production codebases that teams maintain. The agents run in isolated containers, and patches are verified with automated tests.

How it works

  1. Reproduce each bug with a known way to trigger it and a known-good fix.
  2. Run each agent in an isolated environment.
  3. Apply the agent's fix and check if the bug is gone.
  4. Record pass/fail results and categorize failures.
  5. Adjust scores for bug difficulty so results are fair.

The process is deterministic. Every agent receives the same environment, the same inputs, and the same constraints. We measure what agents actually do, not what they claim to do. If an agent fails to generate a fix, or generates a broken patch, or times out, all of those count as failures.
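
The scoring rule described above, where only a verified fix counts and everything else is a failure, can be sketched as follows. The outcome names are hypothetical labels chosen for illustration, not the benchmark's internal categories.

```python
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    VERIFIED = "verified"      # patch applied and the bug no longer triggers
    NO_PATCH = "no_patch"      # agent produced no fix
    BROKEN_PATCH = "broken"    # patch fails to build or does not fix the bug
    TIMEOUT = "timeout"        # agent exceeded the time limit


def is_pass(outcome: Outcome) -> bool:
    """Only a verified fix is a pass; every other outcome is a failure."""
    return outcome is Outcome.VERIFIED


def pass_rate(outcomes: Iterable[Outcome]) -> float:
    """Fraction of attempts that ended in a verified fix."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(is_pass(o) for o in outcomes) / len(outcomes)


# Hypothetical run log for one agent.
runs = [Outcome.VERIFIED, Outcome.NO_PATCH, Outcome.TIMEOUT, Outcome.VERIFIED]
rate = pass_rate(runs)
```

Counting timeouts and empty patches in the denominator keeps an agent from improving its score by declining hard samples.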

What's in the report

  • Agent leaderboard by pass rate and cost
  • Failure categories and why fixes fail
  • Guide for choosing the right agent and model
  • Fix examples and test results

The report goes beyond rankings. We document every failure mode: patches that compile but do not fix the bug, agents that refuse to patch, and infrastructure failures. We analyze patch semantics to understand different fixing approaches. We map which agents agree and which ones disagree on the same bugs.

Why this matters

Engineering leaders need proof before scaling AI fixes to hundreds of developers. Security leaders need audit-ready evidence. XOR delivers independent, tested results that both teams can trust.

Most agent benchmarks measure generic coding tasks. This one measures security patching specifically. The skills are different. A high-pass-rate general coding agent might fail on security context. We test what matters for your pipeline.

How to use this data

For RLHF / DPO training

Each evaluation is a labeled example (+1 pass, 0 fail, -1 build-fail). Use difficulty scores for curriculum ordering.
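
A minimal sketch of the labeling and curriculum ordering described above. The record layout, field names, and sample data are assumptions for illustration; the actual export format is not shown on this page.

```python
from dataclasses import dataclass
from typing import List

# Label scheme as stated: +1 pass, 0 fail, -1 build-fail.
LABELS = {"pass": 1, "fail": 0, "build-fail": -1}


@dataclass
class Example:
    sample_id: str
    outcome: str       # "pass", "fail", or "build-fail"
    difficulty: float  # assumed scale: higher means harder

    @property
    def label(self) -> int:
        return LABELS[self.outcome]


def curriculum_order(examples: List[Example]) -> List[Example]:
    """Easiest first. sorted() is stable, so ties keep their input order."""
    return sorted(examples, key=lambda e: e.difficulty)


# Hypothetical records.
batch = [
    Example("s3", "build-fail", 0.9),
    Example("s1", "pass", 0.2),
    Example("s2", "fail", 0.5),
]
ordered = curriculum_order(batch)
```

Here `[e.sample_id for e in ordered]` gives the training order and `[e.label for e in ordered]` the matching reward labels.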

For benchmarking your agent

Run your agent on the same 128 vulnerabilities. Log results to W&B Weave. Compare against 15 baselines.

For pre-training data

1,920 labeled vulnerability-patching examples across 40 production codebases. Patches are surgical — 74% are 10 lines or fewer.

For research

Empirical difficulty calibration, cross-agent agreement (kappa), behavioral trajectory clusters, ensemble analysis.
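
For cross-agent agreement, one common statistic is Cohen's kappa over two agents' pass/fail results on the same samples. The page does not say which kappa variant the benchmark reports, so treat this as an illustrative sketch; the function name and example data are hypothetical.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two agents' pass/fail results on the same samples."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(a)
    # Observed agreement: share of samples where both agents got the same result.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each agent's own pass rate.
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both agents gave the same constant result; kappa is undefined here,
        # and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical results for two agents on four samples.
agent_a = [True, True, False, False]
agent_b = [True, False, False, False]
kappa = cohens_kappa(agent_a, agent_b)
```

Kappa near 1 means two agents pass and fail the same bugs; kappa near 0 means their agreement is no better than chance, which is where an ensemble of the two is most likely to help.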

Browse the data

Full benchmark report

Enter your email below to access agent configurations, patch examples, failure analysis, and full methodology.

Agent Configurations

128 real bugs tested across 15 agent configurations. Growing to 6,138+ vulnerabilities. Each agent runs in an isolated container with automated safety checks. See how verification works.

Agent      Model                          Pass rate
codex      gpt-5.2                        62.7%
cursor     opus-4.6                       62.5%
claude     claude-opus-4-6                61.6%
gemini31   gemini-3.1-pro-preview         58.7%
opencode   gemini-gemini-3.1-pro-preview  54.9%
cursor     gpt-5.2                        51.6%
opencode   gpt-5.2                        51.6%
cursor     gpt-5.3-codex                  50.4%
codex      gpt-5.2-codex                  49.2%
opencode   claude-opus-4-6                47.5%
claude     claude-opus-4-5                45.7%
cursor     composer-1.5                   45.2%
gemini     gemini-3-pro-preview           43%
opencode   gpt-5.2-codex                  37.8%
opencode   claude-opus-4-5                36.8%

[SAMPLE VIEW]
Agent: opencode-o3  │  CVE-2024-XXXXX  │  OpenSSL
──────────────────────────────────────────────
Step 1: Clone repository            [2.1s]
Step 2: Reproduce vulnerability     [4.7s]  ← test case triggers crash
Step 3: Analyze root cause          [8.3s]  ← bounds check missing
Step 4: Generate patch              [3.2s]  ← adds size validation
Step 5: Verify fix (safety check)   [5.1s]  ← test case no longer crashes
──────────────────────────────────────────────
Result: PASS  │  Time: 23.4s  │  IRT Score: 0.73

Sample Distribution

128 evaluated / 6,138+ dataset

Current evaluation: 128 healthchecked samples across 27 codebases. Full dataset: 6,138+ vulnerabilities across 250+ codebases.

Codebase                        Language  Samples
text-shaping/engine             C++       10
archive-library/handler         C         3
git-library/core                C         3
image-processor/raw-decoder     C++       2
industrial-protocol/opc-ua      C         2
network-switch/ovs              C         2
data-processing/arrow           C++       1
js-engine/runtime               C         1
cryptocurrency/node             C++       1
data-compressor/c-codec         C         1
disassembler/engine             C         1
embedded-server/networking      C         1
analytics-db/engine             C++       1
coverage-tool/engine            C++       1
3d-codec/decoder                C++       1
serialization/buffers           C++       1
rpc-framework/rpc               C++       1
image-codec/jxl                 C++       1
sip-server/proxy                C         1
mesh-networking/thread          C++       1
language-runtime/cpython        C         1
reverse-engineering/framework   C         1
unicode-processing/simd         C++       1
unicode-support/icu             C++       1
system-utilities/core           C         1
malware-detection/rules         C         1
statistics/reader               C         1

Patch Examples

Real vulnerability fixes from Vulnerability-Agent-Bench samples. Each patch is CI-verified.

[file]

Uninitialized memory read in regex match buffer. The `pmatch` array was allocated but not zeroed, causing memory safety checks to flag undefined behavior on partial match paths.

Added `memset(pmatch, 0, sizeof(regmatch_t) * nmatch)` before the regex match call to initialize the buffer.

Safety check passes: no uninitialized memory access. Regression tests unchanged.

[packet analyzer]

Out-of-bounds read in DOF protocol dissector (`packet-dof.c`). Insufficient bounds checking on packet length allowed reading past buffer end.

Added bounds check before accessing packet data to verify remaining buffer length covers the expected field size.

Safety check passes: no out-of-bounds read. Existing dissector tests pass.

[text shaping]

Buffer overflow in OpenType layout table processing. Font shaping with malformed GPOS/GSUB tables triggered writes past allocated buffer.

Added length validation on subtable offsets before processing, rejecting malformed tables early.

Safety check passes: no buffer overflow. Text shaping test suite passes.

Failure Taxonomy

10 layers

[AGENT] Agent capability issues
  • L1 Infrastructure failures: Empty patches, missing files, agent produced no output.
  • L2 Agent behavior failures: Code reformatting, wrong file location, partial patch.
  • L3 Vulnerability understanding failures: Agent misunderstood root cause, fixed wrong issue.

[BUILD] Build and verification issues
  • L4 Build environment failures: Syntax errors, missing includes, incompatible types.
  • L5 Verification failures: Build succeeds but safety check still fires, crash not fixed.

[INFRA] Infrastructure and timeout issues
  • L6 Trajectory errors: Billing failures, rate limits, authentication errors during agent run.
  • L7 Timeout subcategories: Context window exhaustion, reasoning loops, agent exceeded time limit.

[SYSTEM] Systemic and composite patterns
  • L8 Convergent failure patterns: All agents produce empty patch, all agents fail same sample.
  • L9 Cloud Run job status: Job scheduling failures, resource limits, container crashes.
  • L10 Composite diagnostic scoring: Aggregate across layers to classify overall failure mode.

1,920 total attempts
978 failed (50.9%) · 942 passed (49.1%)

Failure layer                             Count
L1 Infrastructure failures                213
L2 Agent behavior failures                178
L3 Vulnerability understanding failures   237
L4 Build environment failures             142
L5 Verification failures                  113
L6 Trajectory errors                      47
L7 Timeout subcategories                  36
L8 Convergent failure patterns            12
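
The per-layer counts can be cross-checked against the totals with a few lines of arithmetic. The counts below are copied from the figures above; the variable names are only for illustration.

```python
# Failure counts per layer, as listed above (L9 and L10 have no counts listed).
failures_by_layer = {
    "L1 Infrastructure failures": 213,
    "L2 Agent behavior failures": 178,
    "L3 Vulnerability understanding failures": 237,
    "L4 Build environment failures": 142,
    "L5 Verification failures": 113,
    "L6 Trajectory errors": 47,
    "L7 Timeout subcategories": 36,
    "L8 Convergent failure patterns": 12,
}
total_attempts = 1920

failed = sum(failures_by_layer.values())
passed = total_attempts - failed

print(f"{failed} failed ({failed / total_attempts:.1%})")
print(f"{passed} passed ({passed / total_attempts:.1%})")

# Share of all failures attributed to each layer, largest first.
for layer, count in sorted(failures_by_layer.items(), key=lambda kv: -kv[1]):
    print(f"{layer:<42}{count:>4}  {count / failed:.1%}")
```

The layer counts sum to 978, matching the stated failure total, and the largest single layer is L3 (vulnerability understanding) with 237 failures.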

Unlock full results

Enter your email to access the full methodology, per-sample analysis, and patch examples.

FAQ

Which agent has the highest pass rate?

Codex GPT-5.2 at 62.7% on the vulnerability benchmark dataset. See the full rankings with cost breakdowns.

How much does it cost to fix a vulnerability?

Costs range from $2.64 to $52 per verified fix, depending on agent and model. Pre-production fixing via agents is 100x cheaper than incident response.

Are these costs real or estimates?

Real. Calculated from 1,920 evaluations, of which 942 produced verified fixes, at actual API costs (no rounding, no statistical assumptions).

Do pass rates change?

Yes. As new models ship, benchmarks update. We re-run tests regularly so rankings stay current. Data updated as of today.

See which agents produce fixes that work

128 vulnerabilities. 15 agents. 1,920 evaluations. Agents learn from every run.