
Nvidia AI infrastructure and agent evaluation benchmarks

Agent Morpheus (38-CVE triage, 9.3x speedup), NIM guardrails, NeMo-RL, AIQ profiling, and Nemotron models.

Nvidia's agent evaluation infrastructure

Nvidia operates three complementary research programs for agent systems. Agent Morpheus performed triage on 38 CVE samples, identifying vulnerability root causes and prioritization strategies. It achieved a 9.3x speedup over manual triage, compressing discovery-to-analysis time.

Nvidia Inference Microservices (NIM) includes guardrails and safety constraints for production LLM deployment. The NIM platform enforces rate limiting, request filtering, and output sanitization across all Nvidia-hosted models.

NeMo-RL Code Environment is a reinforcement learning simulator for agent training on code generation tasks. Agents train on synthetic code problems, then transfer to real vulnerabilities.

AIQ is Nvidia's agent profiling tool. It measures and reports per-agent metrics: latency, quality, hallucination rates, planning depth, and tool usage patterns.

Agent Morpheus performance

Agent Morpheus completed 38-sample triage in hours instead of days. The 9.3x speedup came from automated root-cause analysis and priority scoring. Its scope was triage only; it did not attempt to generate patches.

CVE-Agent-Bench measures the next phase: patch generation on samples where the root cause is known. The relationship is sequential. Morpheus identifies what needs fixing. CVE-Agent-Bench measures who fixes it well.

| Dimension | Morpheus | CVE-Agent-Bench |
| --- | --- | --- |
| Scope | Triage (38 CVEs) | Patching (128 CVEs) |
| Task | Root-cause identification | Fix generation and validation |
| Output | Priority score + summary | Validated patch code |
| Workflow | Upstream (discovery phase) | Downstream (remediation phase) |

The two benchmarks are complementary. Morpheus compresses the discovery phase. CVE-Agent-Bench measures the remediation phase. Together, they form a complete pipeline.
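The triage-then-patch pipeline described above can be sketched in a few lines. This is a hypothetical illustration, not Nvidia's implementation: the `TriageResult`/`PatchResult` types and the `triage`, `generate_patch`, and `validate` callables are assumed names standing in for a Morpheus-style discovery phase and a CVE-Agent-Bench-style remediation phase.

```python
from dataclasses import dataclass

@dataclass
class TriageResult:
    """Upstream (Morpheus-style) output: what needs fixing, and how urgently."""
    cve_id: str
    root_cause: str
    priority: float  # higher = fix first

@dataclass
class PatchResult:
    """Downstream (CVE-Agent-Bench-style) output: a patch and its validation verdict."""
    cve_id: str
    patch: str
    validated: bool

def remediation_pipeline(cves, triage, generate_patch, validate):
    """Triage all CVEs, then attempt patches in descending priority order."""
    triaged = sorted((triage(c) for c in cves), key=lambda t: -t.priority)
    results = []
    for t in triaged:
        patch = generate_patch(t)
        results.append(PatchResult(t.cve_id, patch, validate(t.cve_id, patch)))
    return results
```

The key design point is the hand-off: the discovery phase only produces root causes and priorities, and the remediation phase consumes them without re-deriving them.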

NIM guardrails and safety

Nvidia's NIM platform runs Nemotron and other models with production-grade constraints. The guardrails prevent model outputs from:

  • Generating exploitable code
  • Exposing secrets or credentials
  • Describing binary-level vulnerabilities in plaintext
  • Bypassing security controls

CVE-Agent-Bench does not use NIM directly (it runs on publicly available model deployments). But NIM guardrails offer an architectural insight: agents in production need safety layers, not just raw model output.
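A safety layer of the kind described above can be sketched as a post-generation output filter. This is a minimal illustration only; the pattern list is hypothetical and far simpler than real NIM guardrails, which combine multiple detection strategies.

```python
import re

# Hypothetical deny-list patterns (illustrative, not exhaustive):
# credential-shaped strings that should never leave the model boundary.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                        # AWS-style access key ID
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)password\s*[:=]\s*\S+"),               # inline password assignment
]

def passes_guardrails(output: str) -> bool:
    """Return False if the model output appears to expose credentials."""
    return not any(p.search(output) for p in SECRET_PATTERNS)
```

The point is architectural: the check runs on model output before it reaches the caller, so a patch-generation agent cannot leak secrets even if its raw completion contains them.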

NeMo-RL code training

NeMo-RL trains agents on synthetic code tasks before deploying to real vulnerabilities. This pre-training is designed to improve sample efficiency on real data.

CVE-Agent-Bench measures zero-shot performance (no task-specific training). No NeMo-RL-trained agents were evaluated in the benchmark. The benchmark establishes a baseline for measuring the impact of pre-training approaches like NeMo-RL in future evaluations.

AIQ profiling

AIQ measures per-agent execution metrics. The profiler reports:

  • Latency: Time from problem statement to fix generation
  • Quality: Pass rate on sample set
  • Hallucination: Factually incorrect assertions in reasoning
  • Planning depth: Number of reasoning steps before action
  • Tool usage: Which APIs the agent called

CVE-Agent-Bench complements AIQ by measuring quality deterministically. AIQ profiles behavior. CVE-Agent-Bench validates correctness independently.
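The five metric categories above map naturally onto a per-agent record. The sketch below is a hypothetical data shape mirroring the AIQ metric list, not AIQ's actual schema; all field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class AgentProfile:
    """Hypothetical per-agent execution record, one per benchmark run."""
    agent: str
    latency_s: float           # time from problem statement to fix generation
    passed: int                # samples passing validation
    total: int                 # samples attempted
    hallucinations: int        # factually incorrect assertions in reasoning traces
    planning_depth: int        # reasoning steps before the first action
    tool_calls: list[str] = field(default_factory=list)  # APIs the agent invoked

    @property
    def pass_rate(self) -> float:
        """Quality metric: fraction of attempted samples that validate."""
        return self.passed / self.total if self.total else 0.0
```

Splitting behavioral metrics (latency, planning depth, tool calls) from the deterministic pass rate reflects the division of labor described above: a profiler observes behavior, while the benchmark validates correctness independently.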

Nemotron models and CVE-Agent-Bench

Nvidia does not currently have proprietary models in CVE-Agent-Bench. The benchmark focuses on widely available commercial model deployments (Claude, GPT, Gemini).

However, Nvidia's research outputs (Morpheus methodology, NIM guardrails, NeMo-RL pre-training, AIQ profiling) inform the benchmark design. Specifically:

  • Morpheus's 9.3x triage speedup motivated structured root-cause analysis in agent prompts
  • NIM guardrails influenced safety constraints in patch validation
  • NeMo-RL training insights shaped the decision to focus on zero-shot evaluation
  • AIQ metrics informed the per-agent execution tracking

Strategic positioning

Nvidia's infrastructure enables large-scale agent deployment. Their research validates that agents can accelerate security workflows (Morpheus triage speedup), but also need safety controls (NIM guardrails).

CVE-Agent-Bench offers a shared evaluation baseline. If Nvidia releases Nemotron models for public use, they can be evaluated on the 128-CVE dataset to measure competitive positioning.

Until then, the benchmark uses Nvidia's research insights to improve agent evaluation rigor while testing vendor-neutral open models.

See Full benchmark results | Methodology | Validation Process

FAQ

How does Agent Morpheus relate to CVE-Agent-Bench?

Morpheus performs triage on CVE samples (identifying root causes and prioritization). CVE-Agent-Bench measures the next phase: patch generation on samples where the root cause is known. The workflow is sequential: Morpheus identifies what needs fixing. CVE-Agent-Bench measures who fixes it well.


See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.