Model upgrade impact: how much do newer models help?
Gemini 3.0 to 3.1 Pro upgrade: +15.7pp. Claude Opus 4.5 to 4.6 upgrade: +15.9pp via the native CLI, +10.7pp via OpenCode. Model upgrades are the single largest factor in agent improvement.
When a model vendor releases a new version, does it actually help with CVE patching? This page quantifies the impact of major model upgrades across the benchmark. The data shows that model generation improvements are the largest performance lever available, outweighing CLI optimization, wrapper choice, and agent prompt tuning.
Three major model transitions happened during the benchmark window. Each one shows measurable improvement and sets a baseline for expected gains when new models ship. The data helps you decide when to upgrade and how much benefit to expect.
Model upgrade data
| Upgrade Path | Measurement | Old Version | New Version | Gain | Domain |
|---|---|---|---|---|---|
| Claude Opus | Pass rate | 45.7% | 61.6% | +15.9pp | Native Claude CLI |
| Claude Opus | Build failures | 19.1% | 14.7% | -4.4pp | Native Claude CLI |
| Claude Opus | Fail outcomes | 31.6% | 20.6% | -11.0pp | Native Claude CLI |
| Gemini | Pass rate | 43.0% | 58.7% | +15.7pp | Native Gemini CLI |
| GPT via Cursor | Pass rate | 51.6% | 50.4% | -1.2pp | Cursor wrapper |
These numbers represent the same benchmark samples tested with both old and new model versions, run through identical CLIs. The comparison is as clean as possible: the model is the only variable.
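As a sanity check, the gains in the table can be recomputed directly from the pass rates. A minimal sketch; all figures are copied from the table above:

```python
# Percentage-point ("pp") gains, recomputed from the table above.
upgrades = {
    "Claude Opus 4.5 -> 4.6 (native)": (45.7, 61.6),
    "Gemini 3.0 -> 3.1 Pro":           (43.0, 58.7),
    "GPT-5.2 -> 5.3 (Cursor)":         (51.6, 50.4),
}

for name, (old, new) in upgrades.items():
    # Gain is a simple difference of pass rates, in percentage points.
    print(f"{name}: {new - old:+.1f}pp")
```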
Claude Opus 4.5 to 4.6 shows the largest absolute improvement in the benchmark. A single model generation accounts for 15.9 percentage points of pass rate improvement. This single upgrade is worth more than any tuning, prompt engineering, or CLI switching. If you deploy patching agents, upgrading your underlying model is the highest-ROI change you can make.
Gemini 3.0 to 3.1 shows nearly identical gains. Google's model upgrade path mirrors Anthropic's. Roughly 16 percentage points of pass rate improvement per generation. This consistency across labs suggests that model scaling and training data improvements compound at similar rates.
GPT-5.2 to 5.3 tells a different story. Tested via the Cursor wrapper, GPT-5.3 actually drops 1.2 percentage points. This doesn't mean GPT-5.3 is worse globally. GPT-5.3 may dominate other domains. But on CVE patching specifically, tested through a wrapper runtime, the new version underperforms. This suggests that wrapper choice matters, and newer models don't always translate their improvements through different runtime environments.
Claude Opus 4.5 to 4.6: the flagship upgrade
Anthropic's upgrade from Opus 4.5 to Opus 4.6 has the most data points in the benchmark. Testing both versions on the same 136 CVEs via the native Claude CLI gives a clean signal.
Pass rate: 45.7% to 61.6%, a gain of 15.9 percentage points. This is not a modest improvement. On a practical level, if you deployed Opus 4.5 patching agents across a codebase and converted them all to Opus 4.6, your successful fix rate jumps from 46% to 62%. You now close 16 more bugs per 100 vulnerabilities discovered.
Build failures drop from 19.1% to 14.7%, a reduction of 4.4 percentage points. Build failures occur when the agent generates syntactically invalid code. Opus 4.6 generates compilable code more consistently than Opus 4.5. This is the model's code understanding improving. Fewer hallucinated imports, fewer malformed functions, fewer missing braces.
Fail outcomes (agent attempted but did not pass) drop from 31.6% to 20.6%. These are cases where the agent understood the task, generated code, tested it, but the test didn't pass. The 11-point improvement here shows Opus 4.6 not only attempts more bugs but attempts them more correctly. It generates patches that are closer to correct on the first try.
The model upgrade touches all three failure modes: fewer attempts at unpatchable bugs, fewer syntactically invalid attempts, and more attempts that are semantically correct. This across-the-board improvement is why model generations matter.
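The across-the-board shift for Opus 4.5 to 4.6 can be summarized from the three figures above. A small sketch using the reported percentages (any residual outcome bucket is not broken out in the source, so it is not computed here):

```python
# Opus 4.5 -> 4.6 outcome shift on the native Claude CLI,
# using the pass, build-failure, and fail rates reported above.
old = {"pass": 45.7, "build_failure": 19.1, "fail": 31.6}
new = {"pass": 61.6, "build_failure": 14.7, "fail": 20.6}

for outcome in old:
    # Positive deltas are good for "pass"; negative are good otherwise.
    print(f"{outcome}: {new[outcome] - old[outcome]:+.1f}pp")
```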
Gemini 3.0 to 3.1 Pro
Google's upgrade from Gemini 3.0 to 3.1 Pro shows parallel improvement. Pass rate climbs from 43.0% to 58.7%, a gain of 15.7 percentage points. Identical magnitude to Anthropic's upgrade, suggesting that model scaling improvements compound similarly across vendors.
Gemini 3.0 was tested via native Google Cloud CLI. Gemini 3.1 Pro is only available via wrapper APIs (no native CLI at the time of benchmarking), creating a slight comparison asymmetry. Despite this, the 15.7-point improvement mirrors Anthropic's trajectory.
This consistency is reassuring. Model upgrades from major labs converge on similar improvement rates: one generation typically yields 15-16 percentage points on CVE patching tasks. If you're evaluating when to upgrade, this number gives you a prediction. When Google releases Gemini 3.2 or Anthropic releases Opus 4.7, expect roughly 15-point improvements.
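For rough planning, the pattern above can be encoded as a simple projection. This is a hypothetical heuristic, not a benchmark result: `gain_per_gen` is the observed 15-16pp per-generation average, and real gains vary (the GPT-5.3 case below is the counterexample):

```python
# Hypothetical planning heuristic: one model generation on a native
# CLI has averaged roughly +15-16pp on CVE patching in this benchmark.
def projected_pass_rate(current_pct, gens=1, gain_per_gen=15.5):
    """Project a pass rate forward by `gens` generations, capped at 100%."""
    return min(100.0, current_pct + gens * gain_per_gen)

# e.g. a rough expectation for the generation after Opus 4.6 (61.6%):
print(f"{projected_pass_rate(61.6):.1f}%")
```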
GPT-5.2 to GPT-5.3: wrapper dynamics
OpenAI's upgrade to GPT-5.3 represents the newest model in the benchmark, but it shows a -1.2pp regression when tested via Cursor. This outlier teaches an important lesson about wrappers.
The regression doesn't prove GPT-5.3 is globally worse. OpenAI's model may excel in other domains. But the CVE patching task, funneled through Cursor's runtime and tool call limitations, shows the new model underperforming. Why?
One hypothesis: GPT-5.3 may have different scaling laws or training instabilities in specific task domains. Another: Cursor's tool calling protocol may not align well with GPT-5.3's output format. A third: GPT-5.3 is larger and may require different prompting strategies than GPT-5.2 to work well through wrapper APIs.
The key learning: model upgrades don't automatically guarantee improvement through every runtime. Native CLI environments (Anthropic's claude-cli, Google's gcloud) directly expose the model's capabilities. Wrapper APIs (Cursor, OpenCode) introduce mediation layers that can dampen or distort improvements. If you upgrade models, test them in your actual deployment environment (native CLI vs wrapper) before trusting the improvement.
Wrapper dampening effect
Claude Opus 4.5 to 4.6 shows different improvements across runtimes:
- Native CLI: +15.9pp (45.7% to 61.6%)
- OpenCode wrapper: +10.7pp (36.8% to 47.5%)
The same model upgrade applied through a wrapper shows ~33% less improvement than the native CLI. Wrappers introduce overhead: token limits, tool call constraints, response format translation, and retry logic that native CLIs skip.
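The dampening ratio falls directly out of the two deltas. A quick check using the numbers above:

```python
# Wrapper dampening for the Opus 4.5 -> 4.6 upgrade.
native_gain = 61.6 - 45.7    # +15.9pp via the native Claude CLI
wrapper_gain = 47.5 - 36.8   # +10.7pp via the OpenCode wrapper

# Fraction of the raw model improvement that survives the wrapper.
captured = wrapper_gain / native_gain
print(f"wrapper captures {captured:.0%} of the native gain")
```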
The OpenCode wrapper captures roughly two-thirds of the raw model improvement. This ratio is consistent across other model pairs tested through wrappers, suggesting it's a fundamental characteristic of wrapper mediation. When you deploy agents through a wrapper instead of a native CLI, expect to sacrifice ~1/3 of your model improvement.
This has strategic implications. If your infrastructure requires a wrapper for compliance, logging, or integration reasons, understand that you're paying a 33% performance tax on model upgrades. You might recoup that tax by running the model multiple times or running multiple models in ensemble, but the single-run performance will be dampened.
Cost impact of upgrades
Model upgrades do increase cost per attempt: Claude Opus 4.5 to 4.6 rises from $2.64 to $2.93, an 11% increase. But measured by cost per successful fix:
- Opus 4.5: ~$5.78 per fix ($2.64 ÷ 45.7%)
- Opus 4.6: ~$4.76 per fix ($2.93 ÷ 61.6%)
Cost per fix actually decreases by 18%, even though cost per attempt increases by 11%. This is the power of pass rate improvement: you need fewer attempts to get fixes. The model upgrade pays for itself through efficiency gains.
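The cost-per-fix arithmetic is simply cost per attempt divided by pass rate. A minimal sketch using the figures above:

```python
# Cost per successful fix = cost per attempt / pass rate.
def cost_per_fix(cost_per_attempt, pass_rate_pct):
    return cost_per_attempt / (pass_rate_pct / 100)

opus_45 = cost_per_fix(2.64, 45.7)  # ~$5.78 per fix
opus_46 = cost_per_fix(2.93, 61.6)  # ~$4.76 per fix

# Relative savings per fix despite the higher per-attempt price.
print(f"savings per fix: {(opus_45 - opus_46) / opus_45:.0%}")
```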
Similar math applies to Gemini: cost per attempt up ~8%, cost per fix down because pass rate improvement outpaces cost growth.
Recommendation
Upgrade to the latest model as a default, and verify the gain in your own deployment environment. The pattern is clear:
- Model upgrades yield 15-16pp pass rate improvement per generation
- This improvement outweighs any other optimization (CLI choice, prompt tuning, wrapper selection)
- Cost per fix actually improves despite higher per-attempt costs
- Per-generation improvements are consistent across Anthropic and Google; OpenAI's GPT-5.3 regression via Cursor is the wrapper-mediated exception
If you're deciding between upgrading your model versus tuning your CLI arguments or switching wrappers, upgrade the model. The benchmark data shows this is the highest-ROI change available. Every dollar spent on running a newer model is money well spent compared to optimization efforts that yield 1-3 percentage points.
Monitor new releases from all labs. When Anthropic releases Opus 4.7, when Google releases Gemini 3.2, when OpenAI releases GPT-5.4, test them against your own codebase and expect 15-point improvements. The consistency of this pattern makes it predictable and actionable.
Explore more
- Full leaderboard: all agents and their pass rates
- Cross-agent agreement: how agents compare on the same bugs
- Native vs wrapper: comparing CLI-based and wrapper-based agents
- Cost vs performance: how to evaluate the trade-off
FAQ
How much do model upgrades improve CVE patching performance?
15-16 percentage points for both Gemini and Claude native CLIs. The upgrade gain shrinks through wrapper CLIs to 10.7pp via OpenCode.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Native CLIs vs wrapper CLIs: the 10-16pp performance gap
Claude CLI vs OpenCode, Gemini CLI vs OpenCode, Codex vs Cursor. Same models, different wrappers, consistent accuracy gaps of 10-16 percentage points.
Cost vs performance: where agents sit on the Pareto frontier
15 agents plotted on cost-accuracy. 4 on the Pareto frontier. Best value: claude-opus-4-6 at $2.93/pass, 61.6%.
Three behavioral clusters: how agents approach CVE patching
Speed-runners (211 sessions, 60.2% pass), explorers (25 sessions, 32%), surgical-experts (737 sessions, 54.9%). Clustered by tool usage, turn count, and token patterns.