
W&B Weave Integration

Track CVE-Agent-Bench evaluation results in Weights & Biases. Access 1,920 baseline traces and compare your agent against 15 configurations.


Weights & Biases integration for benchmark results

CVE-Agent-Bench results are available on W&B Weave at wandb.ai/tobias_xor-xor/cve-bench/weave. All 1,920 evaluations are logged with full tracing: model calls, token usage, patch output, and outcome. You can import this data into your own W&B workspace and benchmark your agent against 15 baseline configurations.

W&B Weave lets you compare agent behavior across dimensions that matter: pass rate, cost per fix, time to solution, token efficiency. You own your data—no licensing required to access the public benchmark results.

Access the baseline data

Public W&B project

Open wandb.ai/tobias_xor-xor/cve-bench/weave in your browser. No authentication required. You can:

  • View all 1,920 evaluation traces with full model call logs
  • Filter by agent, outcome (pass/fail/build), sample ID, or cost range
  • Inspect individual traces to see token counts and model reasoning
  • Compare agents side-by-side on pass rate, cost, or time
  • Download data as JSON for offline analysis
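Once exported, the JSON can be analyzed offline. A minimal sketch computing per-agent pass rates, assuming each exported record carries the agent_model and outcome fields described in the schema section below:

```python
import json
from collections import defaultdict

def pass_rates(json_path: str) -> dict:
    """Compute per-agent pass rate from a downloaded trace export."""
    with open(json_path) as f:
        records = json.load(f)
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in records:
        agent = r["agent_model"]
        totals[agent] += 1
        passes[agent] += (r["outcome"] == "pass")  # bool counts as 0 or 1
    return {agent: passes[agent] / totals[agent] for agent in totals}
```

The same pattern extends to cost or time: swap the counter for a running sum over cost_usd or time_seconds.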

Import baseline data to your workspace

Use the W&B SDK to import baseline results into your own project for comparison:

import wandb

# Connect to the public baseline project via the W&B API.
# Api.runs() takes an "entity/project" path.
api = wandb.Api()
baseline_runs = api.runs("tobias_xor-xor/cve-bench")

# Copy each baseline run's summary metrics into your own project
for run in baseline_runs:
    with wandb.init(project="my-cve-bench", name=f"baseline-{run.name}", reinit=True):
        wandb.log({
            "baseline_agent": run.name,
            "baseline_pass_rate": run.summary.get("pass_rate"),
            "baseline_cost_usd": run.summary.get("total_cost_usd"),
            "baseline_avg_time_sec": run.summary.get("avg_time_seconds"),
        })

Log your own agent results

Evaluate your agent on the same 128 CVE samples and log results to W&B for comparison:

import wandb
from your_agent import run_agent
from benchmark_samples import load_samples

PRICE_PER_TOKEN = 15e-6  # placeholder; use your model's actual rate

samples = load_samples()  # 128 CVE samples
results = []

with wandb.init(project="cve-bench", name="my-agent-v1"):
    for sample in samples:
        outcome, patch_text, tokens_used, time_sec = run_agent(sample)

        # Log one trace per evaluation
        wandb.log({
            "sample_id": sample.id,
            "agent_model": "my-model-v1",
            "outcome": outcome,  # pass | test-fail | build-fail | infra
            "patch_bytes": len(patch_text),
            "time_seconds": time_sec,
            "tokens_used": tokens_used,
            "cost_usd": tokens_used * PRICE_PER_TOKEN,
        })

        results.append({
            "sample_id": sample.id,
            "outcome": outcome,
        })

    # Log summary stats; the context manager calls wandb.finish() on exit
    pass_rate = sum(1 for r in results if r["outcome"] == "pass") / len(results)
    wandb.log({"pass_rate": pass_rate})

Expected schema

Every evaluation record logged to W&B should follow this schema:

Required fields

  • agent_model (string). Name of your agent, e.g., "my-agent-v1"
  • sample_id (string). CVE sample ID, must match baseline data
  • outcome (string). One of: pass, test-fail, build-fail, infra
  • time_seconds (number). Wall-clock time for this evaluation

Recommended fields

  • cost_usd (number). API cost for this evaluation
  • patch_bytes (number). Size of generated patch
  • tokens_input, tokens_output (number). For your own tracking

Optional fields

  • model_output (string). Full patch text or model response
  • error_message (string). If outcome is infra, the reason for the failure
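A small validator can catch malformed records before they are logged; a minimal sketch, with field names and allowed outcomes taken from the lists above:

```python
# Required fields and their expected types, per the schema above
REQUIRED = {
    "agent_model": str,
    "sample_id": str,
    "outcome": str,
    "time_seconds": (int, float),
}
OUTCOMES = {"pass", "test-fail", "build-fail", "infra"}

def validate_record(record: dict) -> list:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    for field, typ in REQUIRED.items():
        if field not in record:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    if "outcome" in record and record["outcome"] not in OUTCOMES:
        errors.append(f"invalid outcome: {record['outcome']!r}")
    return errors
```

Run this before each wandb.log() call so a typo in an outcome string does not silently skew your pass rate.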

Compare against baselines in W&B dashboard

After logging your results, use W&B's dashboard to compare:

Pass rate

Sort the table by pass rate (highest first). Where does your agent rank among the 15 baselines?

Cost per fix

Divide total cost by number of passes. Which agent achieves the best cost efficiency?

Time to fix

Average time per passing evaluation. Trade off speed vs. accuracy.

Token efficiency

Plot passes against total tokens consumed; more passes per token is better.
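These four comparisons can also be reproduced offline from logged records; a sketch assuming the per-record fields from the schema above (tokens_used as a combined input+output count is an assumption):

```python
def summarize(records: list) -> dict:
    """Aggregate pass rate, cost per fix, time to fix, and token efficiency."""
    passes = [r for r in records if r["outcome"] == "pass"]
    total_cost = sum(r.get("cost_usd", 0.0) for r in records)
    total_tokens = sum(r.get("tokens_used", 0) for r in records)
    return {
        "pass_rate": len(passes) / len(records),
        # Total spend divided by number of passing evaluations
        "cost_per_fix": total_cost / len(passes) if passes else float("inf"),
        # Average wall-clock time over passing evaluations only
        "avg_time_to_fix_sec": (sum(r["time_seconds"] for r in passes) / len(passes))
                               if passes else None,
        # Passes per million tokens consumed across all evaluations
        "passes_per_million_tokens": (len(passes) / total_tokens * 1e6)
                                     if total_tokens else None,
    }
```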


Import your evaluation results into W&B for centralized tracking.

# Minimal upload helper for your agent's aggregate results

import wandb
from dataclasses import dataclass

@dataclass
class EvalResults:
    pass_rate: float
    total: int
    cost: float

def upload_to_wandb(results: EvalResults):
    # Log to your own project; the public tobias_xor-xor baseline is read-only
    with wandb.init(project="cve-bench", job_type="evaluation"):
        wandb.log({
            "pass_rate": results.pass_rate,
            "total_evals": results.total,
            "cost_usd": results.cost,
        })

upload_to_wandb(EvalResults(pass_rate=0.42, total=128, cost=18.70))  # example values

Analyzing per-sample performance

W&B Weave lets you drill into individual samples to understand where your agent differs from baselines:

  • Beating the baseline. Which samples does your agent pass that all baselines fail? These are potential strengths.
  • Weakness candidates. Which samples does your agent fail that all baselines pass? Good debugging targets.
  • Ceiling samples. Which samples do all agents fail? These likely sit beyond current agent capability and are low-priority debugging targets.
  • Cost distribution. Do your infra failures cost more than baseline infra failures?
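The first three buckets can be computed mechanically from per-sample outcomes; a sketch assuming one {sample_id: outcome} mapping per agent:

```python
def bucket_samples(mine: dict, baselines: list) -> dict:
    """Classify samples by comparing your agent's outcomes to the baselines'.

    mine:      {sample_id: outcome} for your agent
    baselines: one {sample_id: outcome} dict per baseline agent
    """
    buckets = {"beats_baseline": [], "weakness": [], "ceiling": []}
    for sid, outcome in mine.items():
        base = [b.get(sid) for b in baselines]  # missing sample counts as a fail
        all_base_fail = all(o != "pass" for o in base)
        all_base_pass = all(o == "pass" for o in base)
        if outcome == "pass" and all_base_fail:
            buckets["beats_baseline"].append(sid)
        elif outcome != "pass" and all_base_pass:
            buckets["weakness"].append(sid)
        elif outcome != "pass" and all_base_fail:
            buckets["ceiling"].append(sid)
    return buckets
```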

Using Weave for tracing and debugging

W&B Weave stores full traces of your agent's reasoning. Log model calls with inputs and outputs:

import weave
from anthropic import Anthropic

# weave.init() patches the Anthropic client so every call is traced automatically
weave.init("cve-bench")
client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=4000,
    system="You are a security patch generator...",
    messages=[{
        "role": "user",
        "content": f"Fix this bug: {sample.bug_description}"
    }]
)

# Weave logs the full trace, including token counts and response time

This gives you insight into the model's reasoning as it works on a bug. Compare traces across agents to understand their different solving strategies.

Benchmark project policies

  • Public baseline project. Everyone can view baseline results. No login required.
  • Your results are private. Create your own W&B project. Only people with access can see your agent's performance.
  • No API key required to read baseline data. Use the W&B web UI or API with public access.
  • Sample IDs are stable. The 128 samples do not change. Compare results across time and configurations using the same sample IDs.
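Because sample IDs are stable, results from different runs or configurations can be joined offline; a minimal sketch assuming record dicts with a sample_id key:

```python
def join_by_sample(run_a: list, run_b: list) -> list:
    """Pair up evaluation records from two runs by their stable sample_id."""
    b_index = {r["sample_id"]: r for r in run_b}
    return [(r, b_index[r["sample_id"]])
            for r in run_a if r["sample_id"] in b_index]
```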

FAQ

Where are the benchmark results on W&B?

Public project at wandb.ai/tobias_xor-xor/cve-bench/weave. All 1,920 evaluations with full traces: model calls, token counts, outcomes, patches. No authentication required to view.

Can I log my own agent results to W&B?

Yes. Create your own W&B project, run your agent on the same 128 CVE samples, and log results using wandb.log(). The schema is: agent_model, sample_id, outcome (pass|test-fail|build-fail|infra), time_seconds, cost_usd.

How do I compare against the baseline?

After logging your results, use W&B dashboard to sort by pass rate, cost per fix, or time to solution. Filter by sample_id to drill into specific CVEs. Identify where your agent beats or trails the 15 baselines.
