W&B Weave Integration
Track CVE-Agent-Bench evaluation results in Weights & Biases. Access 1,920 baseline traces and compare your agent against 15 configurations.
CVE-Agent-Bench results are available on W&B Weave at wandb.ai/tobias_xor-xor/cve-bench/weave. All 1,920 evaluations are logged with full tracing: model calls, token usage, patch output, and outcome. You can import this data into your own W&B workspace and benchmark your agent against 15 baseline configurations.
W&B Weave lets you compare agent behavior across the dimensions that matter: pass rate, cost per fix, time to solution, and token efficiency. You keep ownership of your own data, and no license is required to access the public benchmark results.
Access the baseline data
Public W&B project
Open wandb.ai/tobias_xor-xor/cve-bench/weave in your browser. No authentication required. You can:
- View all 1,920 evaluation traces with full model call logs
- Filter by agent, outcome (pass/fail/build), sample ID, or cost range
- Inspect individual traces to see token counts and model reasoning
- Compare agents side-by-side on pass rate, cost, or time
- Download data as JSON for offline analysis
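As a sketch of that last step, a downloaded JSON export can be filtered with nothing but the standard library. The two records and their values below are made up for illustration; only the field names follow the schema documented later on this page:

```python
import json

# Illustrative export: two made-up records in the documented schema
raw = """[
  {"agent_model": "agent-a", "sample_id": "CVE-1", "outcome": "pass", "cost_usd": 1.2},
  {"agent_model": "agent-a", "sample_id": "CVE-2", "outcome": "infra", "cost_usd": 0.4}
]"""

traces = json.loads(raw)

# Keep only passing evaluations that cost under $2
passing_under_2 = [
    t for t in traces if t["outcome"] == "pass" and t["cost_usd"] < 2.0
]
print([t["sample_id"] for t in passing_under_2])  # -> ['CVE-1']
```

The same list-comprehension pattern covers the other filters the web UI offers (by agent, sample ID, or cost range).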
Import baseline data to your workspace
Use the W&B SDK to import baseline results into your own project for comparison:
import wandb

# Connect to the public baseline project
# (the runs path is "entity/project"; the "/weave" suffix is only for the UI URL)
api = wandb.Api()
baseline_runs = api.runs("tobias_xor-xor/cve-bench")

# Re-log each baseline run (one per agent) into your own project
for run in baseline_runs:
    with wandb.init(project="my-cve-bench", reinit=True):
        wandb.log({
            "baseline_agent": run.name,
            "baseline_pass_rate": run.summary.get("pass_rate"),
            "baseline_cost_usd": run.summary.get("total_cost_usd"),
            "baseline_avg_time_sec": run.summary.get("avg_time_seconds"),
        })
Log your own agent results
Evaluate your agent on the same 128 CVE samples and log results to W&B for comparison:
import wandb
from your_agent import run_agent
from benchmark_samples import load_samples

samples = load_samples()  # 128 CVE samples
price_per_token = 0.000003  # example $/token rate; replace with your model's pricing
results = []

with wandb.init(project="cve-bench", name="my-agent-v1"):
    for sample in samples:
        outcome, patch_text, tokens_used, time_sec = run_agent(sample)
        # Log the trace
        wandb.log({
            "sample_id": sample.id,
            "agent_model": "my-model-v1",
            "outcome": outcome,  # pass | test-fail | build-fail | infra
            "patch_bytes": len(patch_text),
            "time_seconds": time_sec,
            "tokens_used": tokens_used,
            "cost_usd": tokens_used * price_per_token,
        })
        results.append({
            "sample_id": sample.id,
            "outcome": outcome,
        })

    # Log summary stats (the context manager finishes the run on exit)
    pass_rate = sum(1 for r in results if r["outcome"] == "pass") / len(results)
    wandb.log({"pass_rate": pass_rate})
Expected schema
Every evaluation record logged to W&B should follow this schema:
Required fields
- agent_model (string). Name of your agent, e.g., "my-agent-v1"
- sample_id (string). CVE sample ID, must match baseline data
- outcome (string). One of: pass, test-fail, build-fail, infra
- time_seconds (number). Wall-clock time for this evaluation
Recommended fields
- cost_usd (number). API cost for this evaluation
- patch_bytes (number). Size of generated patch
- tokens_input, tokens_output (number). For your own tracking
Optional fields
- model_output (string). Full patch text or model response
- error_message (string). If outcome==infra, why did it fail?
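One way to catch schema mistakes before calling wandb.log is a small validator. This is a sketch: the function name and error-list structure are mine; only the field names, types, and outcome values come from the schema above.

```python
# Field names, types, and outcome values taken from the schema above
VALID_OUTCOMES = {"pass", "test-fail", "build-fail", "infra"}
REQUIRED = {
    "agent_model": str,
    "sample_id": str,
    "outcome": str,
    "time_seconds": (int, float),
}

def validate_record(record: dict) -> list:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    for field, typ in REQUIRED.items():
        if field not in record:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"{field} has wrong type")
    if record.get("outcome") not in VALID_OUTCOMES:
        errors.append(f"outcome must be one of {sorted(VALID_OUTCOMES)}")
    return errors
```

Validating locally is cheaper than discovering after a full 128-sample run that half the records are missing a required field.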
Compare against baselines in W&B dashboard
After logging your results, use W&B's dashboard to compare:
Pass rate
Sort the table by pass rate (highest first). Where does your agent rank among the 15 baselines?
Cost per fix
Divide total cost by number of passes. Which agent achieves the best cost efficiency?
Time to fix
Average time per passing evaluation. Trade off speed vs. accuracy.
Token efficiency
Plot passes against total tokens consumed; more passes per token is better.
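All four of these metrics can also be computed locally from your logged records. A sketch with made-up numbers; the field names follow the schema above, with tokens_used standing in for tokens_input plus tokens_output:

```python
# Made-up evaluation records in the documented schema
records = [
    {"sample_id": "CVE-A", "outcome": "pass",      "cost_usd": 2.0, "time_seconds": 30, "tokens_used": 1000},
    {"sample_id": "CVE-B", "outcome": "test-fail", "cost_usd": 1.0, "time_seconds": 45, "tokens_used": 800},
    {"sample_id": "CVE-C", "outcome": "pass",      "cost_usd": 4.0, "time_seconds": 60, "tokens_used": 2000},
]

passes = [r for r in records if r["outcome"] == "pass"]

pass_rate = len(passes) / len(records)
# Cost per fix: total cost (including failures) divided by number of passes
cost_per_fix = sum(r["cost_usd"] for r in records) / len(passes)
# Time to fix: average time per passing evaluation
avg_time_to_fix = sum(r["time_seconds"] for r in passes) / len(passes)
# Token efficiency: passes per 1,000 tokens consumed overall
passes_per_1k_tokens = len(passes) / (sum(r["tokens_used"] for r in records) / 1000)
```

Note that cost per fix charges failed attempts to the passing ones, which is why a cheap agent with a low pass rate can still lose on this metric.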
Import your evaluation results into W&B for centralized tracking.
# Pseudocode — implement these functions for your agent
import wandb

def upload_to_wandb(results):
    with wandb.init(
        project="cve-bench",
        entity="tobias_xor-xor",
        job_type="evaluation",
    ):
        wandb.log({
            "pass_rate": results.pass_rate,
            "total_evals": results.total,
            "cost_usd": results.cost,
        })

upload_to_wandb(evaluation_results)
Run your agent against the benchmark and log results to W&B.
# Pseudocode — implement these functions for your agent
def evaluate_agent(agent, samples):
    results = {
        "pass": 0,
        "test-fail": 0,
        "build-fail": 0,
        "infra": 0,
    }
    for sample in samples:
        outcome = run_agent(agent, sample)
        results[outcome] += 1
    return results

# Log to W&B with detailed metrics
results = evaluate_agent(my_agent, cve_samples)
wandb.log(results)
Expected schema for evaluation results.
{
  "agent_model": "string",
  "sample_id": "string",
  "outcome": "pass" | "test-fail" | "build-fail" | "infra",
  "time_seconds": number,
  "cost_usd": number,
  "tokens_input": number,
  "tokens_output": number
}
Analyzing per-sample performance
W&B Weave lets you drill into individual samples to understand where your agent differs from baselines:
- Beating the baseline. Which samples does your agent pass that all baselines fail? These are potential strengths.
- Weakness candidates. Which samples does your agent fail that all baselines pass? Good debugging targets.
- Ceiling samples. Which samples have all agents failing? Not worth debugging.
- Cost distribution. Do your infra failures cost more than baseline infra failures?
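The first three buckets reduce to set operations once you have per-sample outcome maps for your agent and each baseline. A sketch with made-up data; the variable names and the three-sample dataset are mine:

```python
# Made-up per-sample outcome maps: sample_id -> outcome
mine = {"CVE-1": "pass", "CVE-2": "test-fail", "CVE-3": "test-fail"}
baselines = {
    "agent-a": {"CVE-1": "test-fail", "CVE-2": "pass", "CVE-3": "build-fail"},
    "agent-b": {"CVE-1": "test-fail", "CVE-2": "pass", "CVE-3": "infra"},
}

def passed(outcomes, sid):
    return outcomes.get(sid) == "pass"

sample_ids = set(mine)

# Samples you pass that every baseline fails: potential strengths
strengths = {s for s in sample_ids
             if passed(mine, s) and not any(passed(b, s) for b in baselines.values())}
# Samples you fail that every baseline passes: debugging targets
weaknesses = {s for s in sample_ids
              if not passed(mine, s) and all(passed(b, s) for b in baselines.values())}
# Samples nobody passes: ceiling samples, probably not worth debugging
ceiling = {s for s in sample_ids
           if not passed(mine, s) and not any(passed(b, s) for b in baselines.values())}
```

With the real data, the outcome maps would be built from the 128 per-sample records you and the baselines log under the same stable sample IDs.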
Using Weave for tracing and debugging
W&B Weave stores full traces of your agent's reasoning. Log model calls with inputs and outputs:
import weave
from anthropic import Anthropic

# weave.init() enables automatic tracing for supported SDKs,
# including the Anthropic client
weave.init("cve-bench")

client = Anthropic()
response = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=4000,
    system="You are a security patch generator...",
    messages=[{
        "role": "user",
        "content": f"Fix this bug: {sample.bug_description}",
    }],
)
# Weave automatically logs the full trace,
# including token counts and response time
This gives you introspection into what your model was thinking while solving a bug. Compare traces across agents to understand their different solving strategies.
Benchmark project policies
- Public baseline project. Everyone can view baseline results. No login required.
- Your results are private. Create your own W&B project. Only people with access can see your agent's performance.
- No API key required to read baseline data. Use the W&B web UI or API with public access.
- Sample IDs are stable. The 128 samples do not change. Compare results across time and configurations using the same sample IDs.
See also
FAQ
Where are the benchmark results on W&B?
Public project at wandb.ai/tobias_xor-xor/cve-bench/weave. All 1,920 evaluations with full traces: model calls, token counts, outcomes, patches. No authentication required to view.
Can I log my own agent results to W&B?
Yes. Create your own W&B project, run your agent on the same 128 CVE samples, and log results using wandb.log(). The schema is: agent_model, sample_id, outcome (pass|test-fail|build-fail|infra), time_seconds, cost_usd.
How do I compare against the baseline?
After logging your results, use W&B dashboard to sort by pass rate, cost per fix, or time to solution. Filter by sample_id to drill into specific CVEs. Identify where your agent beats or trails the 15 baselines.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Validation Process
25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.