[AGENTS]

Agent Configurations

15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.

Agent comparison

15 agent-model configurations ranked by pass rate on 128 real vulnerabilities.

Configuration details

Each configuration specifies the model, system prompt, tool access, and memory settings. Differences in configuration drive differences in pass rate and cost.
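As a concrete (and deliberately hypothetical) illustration, a configuration of this shape can be expressed as a small record. The field names below are ours, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    """Illustrative agent-model configuration; field names are hypothetical."""
    harness: str                  # coding-agent wrapper, e.g. "opencode"
    model: str                    # underlying LLM, e.g. "claude-opus-4-5"
    system_prompt: str            # instructions prepended to every task
    tools: list[str] = field(default_factory=list)  # e.g. ["git", "editor", "compiler"]
    memory: bool = False          # whether context persists across steps

cfg = AgentConfig(
    harness="opencode",
    model="claude-opus-4-5",
    system_prompt="Patch the vulnerability without breaking the build.",
    tools=["git", "editor", "compiler"],
)
```

Varying any one of these fields while holding the rest fixed is what lets the benchmark attribute pass-rate differences to a specific choice.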

  • 15 agents compared
  • 62.7% top pass rate
  • 3 behavior groups
  • 105 agent pairs compared

Agent Comparison

15 agents tested on 128 real bugs. Each agent runs in an isolated container with automated safety checks.
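The isolation itself is easy to picture. The sketch below shows one way a run might be launched in a throwaway container; the image name, mounts, and resource limits are illustrative, not the benchmark's actual harness:

```python
import subprocess

def run_agent_isolated(image: str, repo_dir: str, timeout_s: int = 1800) -> int:
    """Run one evaluation in a disposable container (illustrative settings)."""
    cmd = [
        "docker", "run", "--rm",          # container is destroyed after the run
        "--memory=4g", "--cpus=2",        # cap resources per evaluation
        "-v", f"{repo_dir}:/workspace",   # mount the vulnerable repo
        "-w", "/workspace",
        image,                            # hypothetical agent image
    ]
    return subprocess.run(cmd, timeout=timeout_s).returncode
```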

The agents span multiple model providers and configurations. Some are CLI tools that read files locally. Others are API-based agents with web access. Some use older models, others use the latest. This diversity matters because it shows what is actually available to security teams right now, not just theoretical best cases.

Each agent configuration is held fixed across every sample in its run. We do not tune agents per-bug or cherry-pick runs. The results are reproducible: run the same agent again and you should get similar pass rates. This reproducibility is what makes the benchmark actionable for your own pipeline.
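"Similar" has a concrete meaning here: an observed pass rate p over n samples carries binomial sampling noise of roughly sqrt(p(1-p)/n). A quick back-of-the-envelope check on the top leaderboard entry (this is our arithmetic, not an error bar the benchmark publishes):

```python
import math

def pass_rate_stderr(passed: int, total: int) -> float:
    """Binomial standard error of an observed pass rate."""
    p = passed / total
    return math.sqrt(p * (1 - p) / total)

# Leaderboard #1 passed 80/128, so a rerun should land within a few points:
se = pass_rate_stderr(80, 128)
print(f"pass rate = {80 / 128:.1%} +/- {se:.1%}")  # 62.5% +/- 4.3%
```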

Agent naming guide

Agent names follow the format harness/model. The same LLM through different coding agents produces different patch quality.

For example: claude/opus-4-5 runs Claude Opus 4.5 through Anthropic's Claude Code agent. opencode/claude-opus-4-5 runs the same model through the OpenCode harness. Same model, different harness — different results. The benchmark measures the agent harness, not just the underlying model.
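In code terms, splitting a leaderboard name back into its two parts is a one-liner; the helper below is our own illustration of the convention, not a benchmark utility:

```python
def parse_agent_name(name: str) -> tuple[str, str]:
    """Split a benchmark agent name into (harness, model)."""
    harness, _, model = name.partition("/")
    return harness, model

print(parse_agent_name("claude/opus-4-5"))           # ('claude', 'opus-4-5')
print(parse_agent_name("opencode/claude-opus-4-5"))  # ('opencode', 'claude-opus-4-5')
```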

Why harness matters:

  • How the agent reads code (all at once vs incremental)
  • Tool availability (git, editor, compiler access)
  • Iteration logic (one shot vs refine-and-retry)
  • Tokenization and context window management
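The iteration-logic difference in the list above is the easiest to make concrete. A one-shot harness submits the model's first patch as-is; a refine-and-retry harness feeds test output back to the model until the patch passes or a budget runs out. The sketch below is hypothetical pseudocode of that loop, not any vendor's implementation:

```python
def refine_and_retry(agent, bug, max_iters: int = 5):
    """Iterative harness: re-prompt with test failures until the patch passes.

    A one-shot harness would simply return agent.propose_patch(bug).
    """
    patch = agent.propose_patch(bug)                  # hypothetical agent API
    for _ in range(max_iters):
        result = bug.run_tests(patch)                 # hypothetical test runner
        if result.passed:
            break
        patch = agent.refine(bug, patch, result.log)  # feed the failure log back
    return patch
```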
Rank   Pass rate   Passed    Cost per eval
#1     62.5%       80/128    $22.13
#2     58.1%       79/136    $3.08
#3     56.6%       77/136    $1.66
#4     52.3%       67/128    $3.04
#5     50.0%       64/128    $3.08
#6     50.0%       64/128    $1.96
#7     49.2%       63/128    $3.08
#8     46.3%       63/136    $3.08
#9     46.3%       63/136    $3.08
#10    44.5%       57/128    $1.75
#11    42.6%       58/136    $1.13
#12    42.6%       58/136    $22.13
#13    40.4%       55/136    $1.96
#14    35.3%       48/136    $3.08
#15    33.8%       46/136    $13.58
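Two cost figures appear on this page: cost per eval in the table above, and cost per pass in the lab cards below. They appear to be related by the pass rate (cost per pass = cost per eval ÷ pass rate), and the #1 row checks out against the Cursor Opus 4.6 card:

```python
def cost_per_pass(cost_per_eval: float, pass_rate: float) -> float:
    """Average spend per successful patch, given average spend per attempt."""
    return cost_per_eval / pass_rate

# Leaderboard #1: $22.13/eval at 62.5% pass rate.
print(f"${cost_per_pass(22.13, 0.625):.2f}")  # $35.41 -- matches the $35.40/pass
                                              # on the Cursor Opus 4.6 card (rounding)
```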

Compare agents by lab

See how agents from each model provider perform on CVE patching.

Claude Opus 4.5

[ANTHROPIC]

45.7% pass rate · $2.64/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
58 pass · 43 fail · 26 build

Claude Opus 4.6

[ANTHROPIC]

61.6% pass rate · $2.93/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
77 pass · 28 fail · 20 build

Codex GPT-5.2

[OPENAI]

62.7% pass rate · $5.30/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
79 pass · 12 fail · 35 build

Codex GPT-5.2 Codex

[OPENAI]

49.2% pass rate · $6.65/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
63 pass · 27 fail · 38 build

Cursor Composer 1.5

[CURSOR]

45.2% pass rate · $3.93/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
57 pass · 39 fail · 30 build

Cursor GPT-5.2

[OPENAI]

51.6% pass rate · $6.26/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
63 pass · 34 fail · 25 build

Cursor GPT-5.3 Codex

[OPENAI]

50.4% pass rate · $6.16/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
64 pass · 40 fail · 23 build

Cursor Opus 4.6

[ANTHROPIC]

62.5% pass rate · $35.40/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
80 pass · 24 fail · 24 build

Gemini 3 Pro Preview

[GOOGLE]

43.0% pass rate · $4.85/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
55 pass · 36 fail · 37 build

Gemini 3.1 Pro Preview

[GOOGLE]

58.7% pass rate · $3.92/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
64 pass · 18 fail · 27 build

OpenCode Claude Opus 4.5

[ANTHROPIC]

36.8% pass rate · $40.13/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
46 pass · 29 fail · 50 build

OpenCode Claude Opus 4.6

[ANTHROPIC]

47.5% pass rate · $51.88/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
58 pass · 15 fail · 49 build

OpenCode Gemini 3.1 Pro Preview

[GOOGLE]

54.9% pass rate · $5.81/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
67 pass · 25 fail · 30 build

OpenCode GPT-5.2

[OPENAI]

51.6% pass rate · $6.65/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
63 pass · 11 fail · 48 build

OpenCode GPT-5.2 Codex

[OPENAI]

37.8% pass rate · $8.73/pass
[Radar chart: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability]
48 pass · 32 fail · 47 build


[NEXT STEPS]

Find the right agent for your stack

Different agents handle different types of bugs better. We can test agents against YOUR codebase to find the best fit.


FAQ

Which agents are benchmarked?

15 agent-model configurations including Claude Code, Codex, Gemini CLI, and others. Each tested on the same 128 vulnerabilities.

How often are benchmarks updated?

Benchmarks update as new models ship. We re-run tests regularly so rankings stay current.

Can I add my own agent?

Yes. Any agent that writes code can be benchmarked. Contact us to add your agent configuration to the test suite.

[RELATED TOPICS]

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.