W&B Weave Integration
Track CVE-Agent-Bench evaluation results in Weights & Biases. Access 1,920 baseline traces and compare your agent against 15 configurations.
CVE-Agent-Bench results are available on W&B Weave at wandb.ai/tobias_xor-xor/cve-bench/weave. All 1,920 evaluations are logged with full tracing: model calls, token usage, patch output, and outcome. You can import this data into your own W&B workspace and benchmark your agent against 15 baseline configurations.
W&B Weave lets you compare agent behavior across the dimensions that matter: pass rate, cost per fix, time to solution, and token efficiency. You keep ownership of your own data, and no license is required to access the public benchmark results.
Access the baseline data
Public W&B project
Open wandb.ai/tobias_xor-xor/cve-bench/weave in your browser. No authentication required. You can:
- View all 1,920 evaluation traces with full model call logs
- Filter by agent, outcome (pass/fail/build), sample ID, or cost range
- Inspect individual traces to see token counts and model reasoning
- Compare agents side-by-side on pass rate, cost, or time
- Download data as JSON for offline analysis
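As a sketch of that last step, a downloaded JSON export can be filtered with nothing but the standard library. The two records and their values below are made up for illustration; only the field names follow the schema documented later on this page:

```python
import json

# Illustrative export: two made-up records in the documented schema
raw = """[
  {"agent_model": "agent-a", "sample_id": "CVE-1", "outcome": "pass", "cost_usd": 1.2},
  {"agent_model": "agent-a", "sample_id": "CVE-2", "outcome": "infra", "cost_usd": 0.4}
]"""

traces = json.loads(raw)

# Keep only passing evaluations that cost under $2
passing_under_2 = [
    t for t in traces if t["outcome"] == "pass" and t["cost_usd"] < 2.0
]
print([t["sample_id"] for t in passing_under_2])  # -> ['CVE-1']
```

The same list-comprehension pattern covers the other filters the web UI offers (by agent, sample ID, or cost range).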
Import baseline data to your workspace
Use the W&B SDK to import baseline results into your own project for comparison:
import wandb

# Connect to the public baseline project
# (the runs path is "entity/project"; the "/weave" suffix is only for the UI URL)
api = wandb.Api()
baseline_runs = api.runs("tobias_xor-xor/cve-bench")

# Re-log each baseline run (one per agent) into your own project
for run in baseline_runs:
    with wandb.init(project="my-cve-bench", reinit=True):
        wandb.log({
            "baseline_agent": run.name,
            "baseline_pass_rate": run.summary.get("pass_rate"),
            "baseline_cost_usd": run.summary.get("total_cost_usd"),
            "baseline_avg_time_sec": run.summary.get("avg_time_seconds"),
        })
Log your own agent results
Evaluate your agent on the same 128 CVE samples and log results to W&B for comparison:
import wandb
from your_agent import run_agent
from benchmark_samples import load_samples

samples = load_samples()  # 128 CVE samples
price_per_token = 0.000003  # example $/token rate; replace with your model's pricing
results = []

with wandb.init(project="cve-bench", name="my-agent-v1"):
    for sample in samples:
        outcome, patch_text, tokens_used, time_sec = run_agent(sample)
        # Log the trace
        wandb.log({
            "sample_id": sample.id,
            "agent_model": "my-model-v1",
            "outcome": outcome,  # pass | test-fail | build-fail | infra
            "patch_bytes": len(patch_text),
            "time_seconds": time_sec,
            "tokens_used": tokens_used,
            "cost_usd": tokens_used * price_per_token,
        })
        results.append({
            "sample_id": sample.id,
            "outcome": outcome,
        })

    # Log summary stats (the context manager finishes the run on exit)
    pass_rate = sum(1 for r in results if r["outcome"] == "pass") / len(results)
    wandb.log({"pass_rate": pass_rate})
Expected schema
Every evaluation record logged to W&B should follow this schema:
Required fields
- agent_model (string). Name of your agent, e.g., "my-agent-v1"
- sample_id (string). CVE sample ID, must match baseline data
- outcome (string). One of: pass, test-fail, build-fail, infra
- time_seconds (number). Wall-clock time for this evaluation
Recommended fields
- cost_usd (number). API cost for this evaluation
- patch_bytes (number). Size of generated patch
- tokens_input, tokens_output (number). For your own tracking
Optional fields
- model_output (string). Full patch text or model response
- error_message (string). If outcome==infra, why did it fail?
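One way to catch schema mistakes before calling wandb.log is a small validator. This is a sketch: the function name and error-list structure are mine; only the field names, types, and outcome values come from the schema above.

```python
# Field names, types, and outcome values taken from the schema above
VALID_OUTCOMES = {"pass", "test-fail", "build-fail", "infra"}
REQUIRED = {
    "agent_model": str,
    "sample_id": str,
    "outcome": str,
    "time_seconds": (int, float),
}

def validate_record(record: dict) -> list:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    for field, typ in REQUIRED.items():
        if field not in record:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"{field} has wrong type")
    if record.get("outcome") not in VALID_OUTCOMES:
        errors.append(f"outcome must be one of {sorted(VALID_OUTCOMES)}")
    return errors
```

Validating locally is cheaper than discovering after a full 128-sample run that half the records are missing a required field.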
Compare against baselines in W&B dashboard
After logging your results, use W&B's dashboard to compare:
Pass rate
Sort the table by pass rate (highest first). Where does your agent rank among the 15 baselines?
Cost per fix
Divide total cost by number of passes. Which agent achieves the best cost efficiency?
Time to fix
Average time per passing evaluation. Trade off speed vs. accuracy.
Token efficiency
Plot passes against total tokens consumed; more passes per token is better.
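All four of these metrics can also be computed locally from your logged records. A sketch with made-up numbers; the field names follow the schema above, with tokens_used standing in for tokens_input plus tokens_output:

```python
# Made-up evaluation records in the documented schema
records = [
    {"sample_id": "CVE-A", "outcome": "pass",      "cost_usd": 2.0, "time_seconds": 30, "tokens_used": 1000},
    {"sample_id": "CVE-B", "outcome": "test-fail", "cost_usd": 1.0, "time_seconds": 45, "tokens_used": 800},
    {"sample_id": "CVE-C", "outcome": "pass",      "cost_usd": 4.0, "time_seconds": 60, "tokens_used": 2000},
]

passes = [r for r in records if r["outcome"] == "pass"]

pass_rate = len(passes) / len(records)
# Cost per fix: total cost (including failures) divided by number of passes
cost_per_fix = sum(r["cost_usd"] for r in records) / len(passes)
# Time to fix: average time per passing evaluation
avg_time_to_fix = sum(r["time_seconds"] for r in passes) / len(passes)
# Token efficiency: passes per 1,000 tokens consumed overall
passes_per_1k_tokens = len(passes) / (sum(r["tokens_used"] for r in records) / 1000)
```

Note that cost per fix charges failed attempts to the passing ones, which is why a cheap agent with a low pass rate can still lose on this metric.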
Import your evaluation results into W&B for centralized tracking.
# Pseudocode — implement these functions for your agent
import wandb

def upload_to_wandb(results):
    with wandb.init(
        project="cve-bench",
        entity="tobias_xor-xor",
        job_type="evaluation",
    ):
        wandb.log({
            "pass_rate": results.pass_rate,
            "total_evals": results.total,
            "cost_usd": results.cost,
        })

upload_to_wandb(evaluation_results)
Run your agent against the benchmark and log results to W&B.
# Pseudocode — implement these functions for your agent
def evaluate_agent(agent, samples):
    results = {
        "pass": 0,
        "test-fail": 0,
        "build-fail": 0,
        "infra": 0,
    }
    for sample in samples:
        outcome = run_agent(agent, sample)
        results[outcome] += 1
    return results

# Log to W&B with detailed metrics
results = evaluate_agent(my_agent, cve_samples)
wandb.log(results)
Expected schema for evaluation results.
{
  "agent_model": "string",
  "sample_id": "string",
  "outcome": "pass" | "test-fail" | "build-fail" | "infra",
  "time_seconds": number,
  "cost_usd": number,
  "tokens_input": number,
  "tokens_output": number
}
Analyzing per-sample performance
W&B Weave lets you drill into individual samples to understand where your agent differs from baselines:
- Beating the baseline. Which samples does your agent pass that all baselines fail? These are potential strengths.
- Weakness candidates. Which samples does your agent fail that all baselines pass? Good debugging targets.
- Ceiling samples. Which samples have all agents failing? Not worth debugging.
- Cost distribution. Do your infra failures cost more than baseline infra failures?
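The first three buckets reduce to set operations once you have per-sample outcome maps for your agent and each baseline. A sketch with made-up data; the variable names and the three-sample dataset are mine:

```python
# Made-up per-sample outcome maps: sample_id -> outcome
mine = {"CVE-1": "pass", "CVE-2": "test-fail", "CVE-3": "test-fail"}
baselines = {
    "agent-a": {"CVE-1": "test-fail", "CVE-2": "pass", "CVE-3": "build-fail"},
    "agent-b": {"CVE-1": "test-fail", "CVE-2": "pass", "CVE-3": "infra"},
}

def passed(outcomes, sid):
    return outcomes.get(sid) == "pass"

sample_ids = set(mine)

# Samples you pass that every baseline fails: potential strengths
strengths = {s for s in sample_ids
             if passed(mine, s) and not any(passed(b, s) for b in baselines.values())}
# Samples you fail that every baseline passes: debugging targets
weaknesses = {s for s in sample_ids
              if not passed(mine, s) and all(passed(b, s) for b in baselines.values())}
# Samples nobody passes: ceiling samples, probably not worth debugging
ceiling = {s for s in sample_ids
           if not passed(mine, s) and not any(passed(b, s) for b in baselines.values())}
```

With the real data, the outcome maps would be built from the 128 per-sample records you and the baselines log under the same stable sample IDs.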
Using Weave for tracing and debugging
W&B Weave stores full traces of your agent's reasoning. Log model calls with inputs and outputs:
import weave
from anthropic import Anthropic

# weave.init() enables automatic tracing for supported SDKs,
# including the Anthropic client
weave.init("cve-bench")

client = Anthropic()
response = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=4000,
    system="You are a security patch generator...",
    messages=[{
        "role": "user",
        "content": f"Fix this bug: {sample.bug_description}",
    }],
)
# Weave automatically logs the full trace,
# including token counts and response time
This gives you introspection into what your model was thinking while solving a bug. Compare traces across agents to understand their different solving strategies.
Benchmark project policies
- Public baseline project. Everyone can view baseline results. No login required.
- Your results are private. Create your own W&B project. Only people with access can see your agent's performance.
- No API key required to read baseline data. Use the W&B web UI or API with public access.
- Sample IDs are stable. The 128 samples do not change. Compare results across time and configurations using the same sample IDs.
See also
FAQ
Where are the benchmark results on W&B?
Public project at wandb.ai/tobias_xor-xor/cve-bench/weave. All 1,920 evaluations with full traces: model calls, token counts, outcomes, patches. No authentication required to view.
Can I log my own agent results to W&B?
Yes. Create your own W&B project, run your agent on the same 128 CVE samples, and log results using wandb.log(). The schema is: agent_model, sample_id, outcome (pass|test-fail|build-fail|infra), time_seconds, cost_usd.
How do I compare against the baseline?
After logging your results, use W&B dashboard to sort by pass rate, cost per fix, or time to solution. Filter by sample_id to drill into specific CVEs. Identify where your agent beats or trails the 15 baselines.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.