[COMPARISON]

Model upgrade impact: how much do newer models help?

Gemini 3 to 3.1 upgrade: +15.7pp. Claude 4.5 to 4.6 upgrade: +15.9pp via native, +10.7pp via OpenCode. Model upgrades are the single largest factor in agent improvement.

When a model vendor releases a new version, does it actually help with CVE patching? This page quantifies the impact of major model upgrades across the benchmark. The data shows that model generation improvements are the largest performance lever available, outweighing CLI optimization, wrapper choice, and agent prompt tuning.

Three major model transitions happened during the benchmark window. Each one shows measurable improvement and sets a baseline for expected gains when new models ship. The data helps you decide when to upgrade and how much benefit to expect.

[UPGRADE DATA]

Model upgrade data

| Upgrade Path | Measurement | Old Version | New Version | Change | Domain |
| --- | --- | --- | --- | --- | --- |
| Claude Opus | Pass rate | 45.7% | 61.6% | +15.9pp | Native Claude CLI |
| Claude Opus | Build failures | 19.1% | 14.7% | -4.4pp | Native Claude CLI |
| Claude Opus | Fail outcomes | 31.6% | 20.6% | -11.0pp | Native Claude CLI |
| Gemini | Pass rate | 43.0% | 58.7% | +15.7pp | Native Gemini CLI |
| GPT via Cursor | Pass rate | 51.6% | 50.4% | -1.2pp | Cursor wrapper |

These numbers represent the same benchmark samples tested with both old and new model versions, run through identical CLIs. The comparison is as clean as possible: the model is the only variable.
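The gains in the table are simple percentage-point deltas between old and new pass rates. A minimal sketch that recomputes them (the dictionary and variable names are illustrative, not part of the benchmark tooling):

```python
# Recompute each upgrade's pass-rate change from the table above.
# Pass rates are in percent; the delta is in percentage points (pp).
upgrades = {
    "Claude Opus 4.5 -> 4.6 (native CLI)": (45.7, 61.6),
    "Gemini 3.0 -> 3.1 (native CLI)": (43.0, 58.7),
    "GPT-5.2 -> 5.3 (Cursor wrapper)": (51.6, 50.4),
}

for name, (old, new) in upgrades.items():
    gain_pp = round(new - old, 1)  # percentage-point delta, not a percent change
    print(f"{name}: {old}% -> {new}% ({gain_pp:+}pp)")
```

Note that a percentage-point delta is not the same as a relative improvement: Claude's +15.9pp on a 45.7% base is roughly a 35% relative jump in fix rate.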

Claude Opus 4.5 to 4.6 shows the largest absolute improvement in the benchmark. A single model generation accounts for 15.9 percentage points of pass rate improvement. This single upgrade is worth more than any tuning, prompt engineering, or CLI switching. If you deploy patching agents, upgrading your underlying model is the highest-ROI change you can make.

Gemini 3.0 to 3.1 shows nearly identical gains. Google's model upgrade path mirrors Anthropic's: roughly 16 percentage points of pass rate improvement per generation. This consistency across labs suggests that model scaling and training data improvements compound at similar rates.

GPT-5.2 to 5.3 tells a different story. Tested via the Cursor wrapper, GPT-5.3 actually drops 1.2 percentage points. This doesn't mean GPT-5.3 is worse globally. GPT-5.3 may dominate other domains. But on CVE patching specifically, tested through a wrapper runtime, the new version underperforms. This suggests that wrapper choice matters, and newer models don't always translate their improvements through different runtime environments.

Claude Opus 4.5 to 4.6: the flagship upgrade

Anthropic's upgrade from Opus 4.5 to Opus 4.6 has the most data points in the benchmark. Testing both versions on the same 136 CVEs via the native Claude CLI gives a clean signal.

Pass rate: 45.7% to 61.6%, a gain of 15.9 percentage points. This is not a modest improvement. On a practical level, if you deployed Opus 4.5 patching agents across a codebase and converted them all to Opus 4.6, your successful fix rate jumps from 46% to 62%. You now close 16 more bugs per 100 vulnerabilities discovered.

Build failures drop from 19.1% to 14.7%, a reduction of 4.4 percentage points. Build failures occur when the agent generates syntactically invalid code. Opus 4.6 generates compilable code more consistently than Opus 4.5. This is the model's code understanding improving. Fewer hallucinated imports, fewer malformed functions, fewer missing braces.

Fail outcomes (agent attempted but did not pass) drop from 31.6% to 20.6%. These are cases where the agent understood the task, generated code, tested it, but the test didn't pass. The 11-point improvement here shows Opus 4.6 not only attempts more bugs but attempts them more correctly. It generates patches that are closer to correct on the first try.

The model upgrade touches all three failure modes: fewer attempts at unpatchable bugs, fewer syntactically invalid attempts, and more attempts that are semantically correct. This across-the-board improvement is why model generations matter.
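The three outcome shifts described above can be lined up in one place. A small sketch, using the Opus 4.5 vs 4.6 native-CLI figures from this section (the percentages don't sum to 100 because the benchmark has additional outcome categories not broken out in the text):

```python
# Outcome distribution for Claude Opus 4.5 vs 4.6 on the native CLI, in percent.
# Figures are taken from the section above; the remaining mass belongs to
# outcome categories not itemized here.
outcomes = {
    "pass": (45.7, 61.6),
    "build failure": (19.1, 14.7),
    "fail (attempted, not passing)": (31.6, 20.6),
}

for name, (v45, v46) in outcomes.items():
    print(f"{name}: {v45}% -> {v46}% ({v46 - v45:+.1f}pp)")
```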

Gemini 3.0 to 3.1 Pro

Google's upgrade from Gemini 3.0 to 3.1 Pro shows parallel improvement. Pass rate climbs from 43.0% to 58.7%, a gain of 15.7 percentage points. The magnitude is nearly identical to Anthropic's upgrade, suggesting that model scaling improvements compound similarly across vendors.

Gemini 3.0 was tested via native Google Cloud CLI. Gemini 3.1 Pro is only available via wrapper APIs (no native CLI at the time of benchmarking), creating a slight comparison asymmetry. Despite this, the 15.7-point improvement mirrors Anthropic's trajectory.

This consistency is reassuring. Model upgrades from major labs converge on similar improvement rates: one generation typically yields 15-16 percentage points on CVE patching tasks. If you're evaluating when to upgrade, this number gives you a prediction. When Google releases Gemini 3.2 or Anthropic releases Opus 4.7, expect roughly 15-point improvements.

GPT-5.2 to GPT-5.3: wrapper dynamics

OpenAI's upgrade to GPT-5.3 represents the newest model in the benchmark, but it shows a -1.2pp regression when tested via Cursor. This outlier teaches an important lesson about wrappers.

The regression doesn't prove GPT-5.3 is globally worse. OpenAI's model may excel in other domains. But the CVE patching task, funneled through Cursor's runtime and tool call limitations, shows the new model underperforming. Why?

One hypothesis: GPT-5.3 may have different scaling laws or training instabilities in specific task domains. Another: Cursor's tool calling protocol may not align well with GPT-5.3's output format. A third: GPT-5.3 is larger and may require different prompting strategies than GPT-5.2 to work well through wrapper APIs.

The key learning: model upgrades don't automatically guarantee improvement through every runtime. Native CLI environments (Anthropic's claude-cli, Google's gcloud) directly expose the model's capabilities. Wrapper APIs (Cursor, OpenCode) introduce mediation layers that can dampen or distort improvements. If you upgrade models, test them in your actual deployment environment (native CLI vs wrapper) before trusting the improvement.

Wrapper dampening effect

Claude Opus 4.5 to 4.6 shows different improvements across runtimes:

  • Native CLI: +15.9pp (45.7% to 61.6%)
  • OpenCode wrapper: +10.7pp (36.8% to 47.5%)

The same model upgrade applied through a wrapper shows ~33% less improvement than the native CLI. Wrappers introduce overhead: token limits, tool call constraints, response format translation, and retry logic that native CLIs skip.

The OpenCode wrapper captures roughly two-thirds of the raw model improvement. This ratio is consistent across other model pairs tested through wrappers, suggesting it's a fundamental characteristic of wrapper mediation. When you deploy agents through a wrapper instead of a native CLI, expect to sacrifice ~1/3 of your model improvement.
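The "two-thirds captured" figure falls straight out of the two deltas. A quick sketch of the arithmetic (variable names are illustrative):

```python
# How much of the native-CLI upgrade gain survives the OpenCode wrapper?
native_gain_pp = 61.6 - 45.7   # +15.9pp via native Claude CLI
wrapper_gain_pp = 47.5 - 36.8  # +10.7pp via OpenCode

captured = wrapper_gain_pp / native_gain_pp  # fraction of the gain retained
wrapper_tax = 1 - captured                   # fraction lost to wrapper mediation
print(f"gain captured: {captured:.0%}, wrapper tax: {wrapper_tax:.0%}")
```

This is where the "~1/3 performance tax" rule of thumb in the next paragraph comes from: roughly 67% of the raw model improvement survives the wrapper.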

This has strategic implications. If your infrastructure requires a wrapper for compliance, logging, or integration reasons, understand that you're paying a 33% performance tax on model upgrades. You might recoup that tax by running the model multiple times or running multiple models in ensemble, but the single-run performance will be dampened.

Cost impact of upgrades

Model upgrades do increase cost per attempt. For Claude Opus 4.5 to 4.6, cost per attempt rises from $2.64 to $2.93, an increase of 11%. But measured by cost per successful fix:

  • Opus 4.5: ~$5.78 per fix ($2.64 ÷ 45.7%)
  • Opus 4.6: ~$4.76 per fix ($2.93 ÷ 61.6%)

Cost per fix actually decreases by 18%, even though cost per attempt increases by 11%. This is the power of pass rate improvement: you need fewer attempts to get fixes. The model upgrade pays for itself through efficiency gains.

Similar math applies to Gemini: cost per attempt up ~8%, cost per fix down because pass rate improvement outpaces cost growth.
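The cost-per-fix calculation above generalizes to any model pair. A minimal sketch using the Opus figures from this section (the helper function is illustrative, not part of the benchmark tooling):

```python
def cost_per_fix(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend to obtain one passing patch.

    With independent attempts, you need 1 / pass_rate attempts on average,
    so expected cost per fix is cost_per_attempt / pass_rate.
    """
    return cost_per_attempt / pass_rate

opus_45 = cost_per_fix(2.64, 0.457)  # Opus 4.5: ~$5.78 per fix
opus_46 = cost_per_fix(2.93, 0.616)  # Opus 4.6: ~$4.76 per fix
change = (opus_46 - opus_45) / opus_45
print(f"Opus 4.5: ${opus_45:.2f}/fix, Opus 4.6: ${opus_46:.2f}/fix ({change:+.0%})")
```

The general condition: a price increase is worth it whenever the relative pass-rate gain exceeds the relative cost-per-attempt increase, since cost per fix is their ratio.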

Recommendation

Always upgrade to the latest model. The pattern is clear:

  1. Model upgrades yield 15-16pp pass rate improvement per generation
  2. This improvement outweighs any other optimization (CLI choice, prompt tuning, wrapper selection)
  3. Cost per fix actually improves despite higher per-attempt costs
  4. Native-CLI upgrade gains are consistent across labs (Anthropic, Google); wrapper-mediated results (OpenAI via Cursor) can diverge

If you're deciding between upgrading your model versus tuning your CLI arguments or switching wrappers, upgrade the model. The benchmark data shows this is the highest-ROI change available. Every dollar spent on running a newer model is money well spent compared to optimization efforts that yield 1-3 percentage points.

Monitor new releases from all labs. When Anthropic releases Opus 4.7, when Google releases Gemini 3.2, when OpenAI releases GPT-5.4, test them against your own codebase and expect 15-point improvements. The consistency of this pattern makes it predictable and actionable.

FAQ

How much do model upgrades improve CVE patching performance?

15-16 percentage points for both Gemini and Claude native CLIs. The upgrade gain shrinks through wrapper CLIs to 10.7pp via OpenCode.