Nvidia AI infrastructure and agent evaluation benchmarks
Agent Morpheus (38 CVE triage, 9.3x speedup), NIM guardrails, NeMo-RL, AIQ profiling, and Nemotron models.
Nvidia's agent evaluation infrastructure
Nvidia operates four complementary research programs for agent systems. Agent Morpheus performed triage on 38 CVE samples, identifying vulnerability root causes and prioritization strategies. It achieved a 9.3x speedup over manual triage, compressing discovery-to-analysis time.
Nvidia Inference Microservices (NIM) includes guardrails and safety constraints for production LLM deployment. The NIM platform enforces rate limiting, request filtering, and output sanitization across all Nvidia-hosted models.
NeMo-RL Code Environment is a reinforcement learning simulator for agent training on code generation tasks. Agents train on synthetic code problems, then transfer to real vulnerabilities.
AIQ is Nvidia's agent profiling tool. It measures and reports per-agent metrics: latency, quality, hallucination rates, planning depth, and tool usage patterns.
Agent Morpheus performance
Agent Morpheus achieved 38-sample triage in hours instead of days. The 9.3x speedup came from automated root-cause analysis and priority scoring. It did not attempt to generate patches; its scope was triage only.
CVE-Agent-Bench measures the next phase: patch generation on samples where root cause is known. The relationship is sequential. Morpheus identifies what needs fixing. XOR measures who fixes it well.
| Dimension | Morpheus | CVE-Agent-Bench |
|---|---|---|
| Scope | Triage (38 CVEs) | Patching (128 CVEs) |
| Task | Root-cause identification | Fix generation and validation |
| Output | Priority score + summary | Validated patch code |
| Workflow | Upstream (discovery phase) | Downstream (remediation phase) |
The two benchmarks are complementary. Morpheus compresses the discovery phase. CVE-Agent-Bench measures the remediation phase. Together, they form a complete pipeline.
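The triage-to-remediation handoff described above can be sketched in code. This is an illustrative data-flow sketch, not either system's actual schema: the record types and `remediation_queue` helper are hypothetical names for the shapes the table implies.

```python
from dataclasses import dataclass

@dataclass
class TriageResult:          # upstream (discovery phase), Morpheus-style output
    cve_id: str
    root_cause: str
    priority: float          # higher = fix sooner

@dataclass
class PatchResult:           # downstream (remediation phase), benchmark-style output
    cve_id: str
    patch: str
    validated: bool          # did the patch pass validation?

def remediation_queue(triaged: list[TriageResult]) -> list[str]:
    """Order CVEs for the patching phase by triage priority, highest first."""
    return [t.cve_id for t in sorted(triaged, key=lambda t: -t.priority)]

queue = remediation_queue([
    TriageResult("CVE-2024-0001", "buffer overflow", 9.8),
    TriageResult("CVE-2024-0002", "path traversal", 6.1),
])
print(queue)  # ['CVE-2024-0001', 'CVE-2024-0002']
```

The point of the sketch is the one-way dependency: patch generation consumes triage output, never the reverse.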
NIM guardrails and safety
Nvidia's NIM platform runs Nemotron and other models with production-grade constraints. The guardrails prevent model outputs from:
- Generating exploitable code
- Exposing secrets or credentials
- Outputting binary vulnerabilities in plaintext
- Bypassing security controls
CVE-Agent-Bench does not use NIM directly (it runs publicly available model deployments). But NIM guardrails offer an architectural insight: agents in production need safety layers, not just raw model output.
NeMo-RL code training
NeMo-RL trains agents on synthetic code tasks before deploying to real vulnerabilities. This pre-training is designed to improve sample efficiency on real data.
CVE-Agent-Bench measures zero-shot performance (no task-specific training). No NeMo-RL-trained agents were evaluated in the benchmark. The benchmark establishes a baseline for measuring the impact of pre-training approaches like NeMo-RL in future evaluations.
AIQ profiling
AIQ measures per-agent execution metrics. The profiler reports:
- Latency: Time from problem statement to fix generation
- Quality: Pass rate on sample set
- Hallucination: Factually incorrect assertions in reasoning
- Planning depth: Number of reasoning steps before action
- Tool usage: Which APIs the agent called
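The five metric categories above map naturally onto a per-agent profile record. The schema below is hypothetical (the real AIQ data model is not described in this document); it only shows one plausible way to carry these metrics together.

```python
from dataclasses import dataclass, field

@dataclass
class AgentProfile:
    """Illustrative per-agent profile mirroring the metric categories above."""
    agent: str
    latency_s: float        # time from problem statement to fix generation
    pass_rate: float        # quality: fraction of samples passing
    hallucinations: int     # factually incorrect assertions in reasoning
    planning_depth: int     # reasoning steps before first action
    tool_calls: dict[str, int] = field(default_factory=dict)  # API -> call count

profile = AgentProfile(
    agent="agent-a",
    latency_s=412.0,
    pass_rate=0.627,
    hallucinations=3,
    planning_depth=7,
    tool_calls={"grep": 12, "patch": 2},
)
```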
CVE-Agent-Bench complements AIQ by measuring quality deterministically. AIQ profiles behavior. CVE-Agent-Bench validates correctness independently.
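The deterministic side of that split is simple to state precisely: a patch passes only if the entire fixed test suite passes, and the headline pass rate is just the fraction of passing runs. A toy sketch, with illustrative function names:

```python
def validate_patch(test_results: list[bool]) -> bool:
    """Deterministic validation: a patch is correct iff every test passes."""
    return all(test_results)

def pass_rate(outcomes: list[bool]) -> float:
    """Aggregate pass rate across evaluation runs."""
    return sum(outcomes) / len(outcomes)

# Sanity check against the reported numbers: 1,920 evaluations at a
# 62.7% pass rate corresponds to roughly 1,204 passing runs.
print(round(0.627 * 1920))  # 1204
```

Unlike profiled behavior, this measurement has no sampling noise: rerunning the same validated patch against the same fixed suite yields the same verdict.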
Nemotron models and CVE-Agent-Bench
No Nvidia models are currently included in CVE-Agent-Bench. The benchmark focuses on widely available commercial model deployments (Claude, GPT, Gemini).
However, Nvidia's research outputs (Morpheus methodology, NIM guardrails, NeMo-RL pre-training, AIQ profiling) inform the benchmark design. Specifically:
- Morpheus's 9.3x triage speedup motivated structured root-cause analysis in agent prompts
- NIM guardrails influenced safety constraints in patch validation
- NeMo-RL training insights shaped the decision to focus on zero-shot evaluation
- AIQ metrics informed the per-agent execution tracking
Strategic positioning
Nvidia's infrastructure enables large-scale agent deployment. Their research validates that agents can accelerate security workflows (Morpheus triage speedup), but also need safety controls (NIM guardrails).
CVE-Agent-Bench offers a shared evaluation baseline. If Nvidia releases Nemotron models for public use, they can be evaluated on the 128-CVE dataset to measure competitive positioning.
Until then, the benchmark uses Nvidia's research insights to improve agent evaluation rigor while testing vendor-neutral open models.
See Full benchmark results | Methodology | Validation Process
FAQ
How does Agent Morpheus relate to CVE-Agent-Bench?
Morpheus performs triage on CVE samples (identifying root causes and prioritization). CVE-Agent-Bench measures the next phase: patch generation on samples where root cause is known. Sequential workflow: Morpheus identifies what needs fixing. XOR measures who fixes it well.
Related
- Benchmark Results: 62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
- Agent Configurations: 15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
- Benchmark Methodology: How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
- Google security AI and CVE-Agent-Bench: How Google's Big Sleep, Naptime, and Sec-Gemini intersect with independent agent evaluation on 128 real CVEs.
- OpenAI models and verified performance in CVE-Agent-Bench: Aardvark's 92% self-reported rate vs. XOR's independent 62.7% for codex-gpt-5-2. Non-determinism via trajectory clustering and GPT-5.3 cyber capabilities.
- Anthropic security research and patch equivalence validation: Claude Code 500+ zero-days, CyberGym 28.9% SOTA at $2/vuln, BaxBench 62% insecure patches, 1,992 independent evaluations.