Nvidia AI infrastructure and agent evaluation benchmarks
Agent Morpheus (38 CVE triage, 9.3x speedup), NIM guardrails, NeMo-RL, AIQ profiling, and Nemotron models.
Nvidia's agent evaluation infrastructure
Nvidia operates four complementary research programs for agent systems. Agent Morpheus performed triage on 38 CVE samples, identifying vulnerability root causes and prioritization strategies. It achieved a 9.3x speedup over manual triage, compressing discovery-to-analysis time.
Nvidia Inference Microservices (NIM) includes guardrails and safety constraints for production LLM deployment. The NIM platform enforces rate limiting, request filtering, and output sanitization across all Nvidia-hosted models.
NeMo-RL Code Environment is a reinforcement learning simulator for agent training on code generation tasks. Agents train on synthetic code problems, then transfer to real vulnerabilities.
AIQ is Nvidia's agent profiling tool. It measures and reports per-agent metrics: latency, quality, hallucination rates, planning depth, and tool usage patterns.
Agent Morpheus performance
Agent Morpheus achieved 38-sample triage in hours instead of days. The 9.3x speedup came from automated root-cause analysis and priority scoring. It did not attempt to generate patches; its scope was triage only.
CVE-Agent-Bench measures the next phase: patch generation on samples where root cause is known. The relationship is sequential. Morpheus identifies what needs fixing. XOR measures who fixes it well.
| Dimension | Morpheus | CVE-Agent-Bench |
|---|---|---|
| Scope | Triage (38 CVEs) | Patching (128 CVEs) |
| Task | Root-cause identification | Fix generation and validation |
| Output | Priority score + summary | Validated patch code |
| Workflow | Upstream (discovery phase) | Downstream (remediation phase) |
The two benchmarks are complementary. Morpheus compresses the discovery phase. CVE-Agent-Bench measures the remediation phase. Together, they form a complete pipeline.
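The triage-to-remediation handoff described above can be sketched in code. This is an illustrative data-flow sketch, not either system's actual schema: the record types and `remediation_queue` helper are hypothetical names for the shapes the table implies.

```python
from dataclasses import dataclass

@dataclass
class TriageResult:          # upstream (discovery phase), Morpheus-style output
    cve_id: str
    root_cause: str
    priority: float          # higher = fix sooner

@dataclass
class PatchResult:           # downstream (remediation phase), benchmark-style output
    cve_id: str
    patch: str
    validated: bool          # did the patch pass validation?

def remediation_queue(triaged: list[TriageResult]) -> list[str]:
    """Order CVEs for the patching phase by triage priority, highest first."""
    return [t.cve_id for t in sorted(triaged, key=lambda t: -t.priority)]

queue = remediation_queue([
    TriageResult("CVE-2024-0001", "buffer overflow", 9.8),
    TriageResult("CVE-2024-0002", "path traversal", 6.1),
])
print(queue)  # ['CVE-2024-0001', 'CVE-2024-0002']
```

The point of the sketch is the one-way dependency: patch generation consumes triage output, never the reverse.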
NIM guardrails and safety
Nvidia's NIM platform runs Nemotron and other models with production-grade constraints. The guardrails prevent model outputs from:
- Generating exploitable code
- Exposing secrets or credentials
- Outputting binary vulnerabilities in plaintext
- Bypassing security controls
CVE-Agent-Bench does not use NIM directly (it runs publicly available model deployments). But NIM guardrails offer an architectural insight: agents in production need safety layers, not just raw model output.
NeMo-RL code training
NeMo-RL trains agents on synthetic code tasks before deploying to real vulnerabilities. This pre-training is designed to improve sample efficiency on real data.
CVE-Agent-Bench measures zero-shot performance (no task-specific training). No NeMo-RL-trained agents were evaluated in the benchmark. The benchmark establishes a baseline for measuring the impact of pre-training approaches like NeMo-RL in future evaluations.
AIQ profiling
AIQ measures per-agent execution metrics. The profiler reports:
- Latency: Time from problem statement to fix generation
- Quality: Pass rate on sample set
- Hallucination: Factually incorrect assertions in reasoning
- Planning depth: Number of reasoning steps before action
- Tool usage: Which APIs the agent called
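The five metric categories above map naturally onto a per-agent profile record. The schema below is hypothetical (the real AIQ data model is not described in this document); it only shows one plausible way to carry these metrics together.

```python
from dataclasses import dataclass, field

@dataclass
class AgentProfile:
    """Illustrative per-agent profile mirroring the metric categories above."""
    agent: str
    latency_s: float        # time from problem statement to fix generation
    pass_rate: float        # quality: fraction of samples passing
    hallucinations: int     # factually incorrect assertions in reasoning
    planning_depth: int     # reasoning steps before first action
    tool_calls: dict[str, int] = field(default_factory=dict)  # API -> call count

profile = AgentProfile(
    agent="agent-a",
    latency_s=412.0,
    pass_rate=0.627,
    hallucinations=3,
    planning_depth=7,
    tool_calls={"grep": 12, "patch": 2},
)
```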
CVE-Agent-Bench complements AIQ by measuring quality deterministically. AIQ profiles behavior. CVE-Agent-Bench validates correctness independently.
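The deterministic side of that split is simple to state precisely: a patch passes only if the entire fixed test suite passes, and the headline pass rate is just the fraction of passing runs. A toy sketch, with illustrative function names:

```python
def validate_patch(test_results: list[bool]) -> bool:
    """Deterministic validation: a patch is correct iff every test passes."""
    return all(test_results)

def pass_rate(outcomes: list[bool]) -> float:
    """Aggregate pass rate across evaluation runs."""
    return sum(outcomes) / len(outcomes)

# Sanity check against the reported numbers: 1,920 evaluations at a
# 62.7% pass rate corresponds to roughly 1,204 passing runs.
print(round(0.627 * 1920))  # 1204
```

Unlike profiled behavior, this measurement has no sampling noise: rerunning the same validated patch against the same fixed suite yields the same verdict.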
Nemotron models and CVE-Agent-Bench
No Nvidia models are currently included in CVE-Agent-Bench. The benchmark focuses on widely available commercial model deployments (Claude, GPT, Gemini).
However, Nvidia's research outputs (Morpheus methodology, NIM guardrails, NeMo-RL pre-training, AIQ profiling) inform the benchmark design. Specifically:
- Morpheus's 9.3x triage speedup motivated structured root-cause analysis in agent prompts
- NIM guardrails influenced safety constraints in patch validation
- NeMo-RL training insights shaped the decision to focus on zero-shot evaluation
- AIQ metrics informed the per-agent execution tracking
Strategic positioning
Nvidia's infrastructure enables large-scale agent deployment. Their research validates that agents can accelerate security workflows (Morpheus triage speedup), but also need safety controls (NIM guardrails).
CVE-Agent-Bench offers a shared evaluation baseline. If Nvidia releases Nemotron models for public use, they can be evaluated on the 128-CVE dataset to measure competitive positioning.
Until then, the benchmark uses Nvidia's research insights to improve agent evaluation rigor while testing vendor-neutral open models.
See Full benchmark results | Methodology | Validation Process
FAQ
How does Agent Morpheus relate to CVE-Agent-Bench?
Morpheus performs triage on CVE samples (identifying root causes and prioritization). CVE-Agent-Bench measures the next phase: patch generation on samples where root cause is known. Sequential workflow: Morpheus identifies what needs fixing. XOR measures who fixes it well.
Related
- Benchmark Results: 62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
- Agent Configurations: 15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
- Benchmark Methodology: How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
- Google security AI and CVE-Agent-Bench: How Google's Big Sleep, Naptime, and Sec-Gemini intersect with independent agent evaluation on 128 real CVEs.
- OpenAI models and verified performance in CVE-Agent-Bench: Aardvark's 92% self-reported rate vs. XOR's independent 62.7% for codex-gpt-5-2. Non-determinism via trajectory clustering and GPT-5.3 cyber capabilities.
- Anthropic security research and patch equivalence validation: Claude Code 500+ zero-days, CyberGym 28.9% SOTA at $2/vuln, BaxBench 62% insecure patches, 1,992 independent evaluations.