
Cursor Composer 1.5 — CVE-Agent-Bench profile

45.2% pass rate at $3.93 per fix. Cursor's proprietary model. Second-lowest infra failure count.


Cursor Composer 1.5

45.2%
Pass Rate
$3.93
Cost per Pass
128
Total Evals
cursor
CLI
Outcome Distribution
Pass (57)
Fail (39)
Build (30)
Infra (2)

Cursor Composer 1.5 is Cursor's proprietary model for agentic code generation. It is the only agent in CVE-Agent-Bench that does not rely on OpenAI, Google, or Anthropic's base models. Across 128 evaluations, Composer 1.5 achieved 45.2% pass rate with a cost of $3.93 per successful fix. The model passed 57 samples, failed 39, experienced 30 build failures, and encountered 2 infrastructure failures.
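Note that 57 passes out of 128 evaluations is 44.5%, not 45.2%. The headline figure is consistent with the stated counts if infrastructure failures are excluded from the denominator; that reading is an assumption, not something the benchmark documents, but the arithmetic works out exactly:

```python
# Reproducing the headline pass rate from the stated outcome counts.
# Assumption (not documented by the benchmark): infrastructure
# failures are excluded from the denominator.
passes, fails, build_failures, infra_failures = 57, 39, 30, 2

total_evals = passes + fails + build_failures + infra_failures  # 128
valid_evals = total_evals - infra_failures                      # 126

print(f"{passes / total_evals:.1%}")  # 44.5% -- naive rate
print(f"{passes / valid_evals:.1%}")  # 45.2% -- matches the headline
```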

Proprietary model positioning

Cursor's decision to develop a proprietary model sets Composer 1.5 apart. The model is not a wrapper around GPT, Claude, or Gemini. It is trained on Cursor's code generation data and fine-tuned for the Cursor IDE's use cases. In CVE-Agent-Bench, Composer 1.5 is evaluated in the same way as all other agents: a generic CLI interface with no Cursor IDE integration.

The 45.2% pass rate positions Composer 1.5 slightly below the field average (47.3%) but competitive with first-generation models like Gemini 3 Pro (40.4%) and OpenCode variants. The cost of $3.93 per fix is reasonable, landing it in the lower-cost tier alongside Gemini 3.1 Pro ($3.92).

Speed-efficiency profile

Composer 1.5's behavioral signature shows efficiency as a key strength. Efficiency is 97, speed is 85, and precision is 100. The model generates patches quickly and uses tokens conservatively. This makes Composer 1.5 useful for rapid evaluation cycles where you need many quick passes rather than a few perfect ones.

Accuracy is 32, the sixth-lowest in the benchmark. This reflects the model's design bias: speed and cost over thoroughness. Reliability is 55, in the middle range. The model is not fragile, but it is not the most stable either.

Figure: agent personality radar chart (axes: Accuracy, Speed, Efficiency, Precision, Breadth, Reliability).

Failure modes

The 39 fail count is the third-highest in the benchmark. These are cases where Composer 1.5 generates valid patches that build and run but do not resolve the CVE: the model produces output and the test environment can execute it, but the patch is incomplete.

This is the opposite pattern from many other agents. OpenCode agents tend to fail at build time (build failures outnumber fails). Composer 1.5 fails at the semantic level: the code compiles, but the vulnerability is not fixed. This suggests the model's issue is not environment compatibility but vulnerability reasoning.

Precision is 100, meaning the output is well-formed. The failure mode is logical: the model knows how to generate valid code but sometimes misunderstands what the CVE requires.

Infrastructure reliability

Two infrastructure failures out of 128 evals (1.6%) is the second-best in the benchmark. Only cursor-gpt-5.3-codex achieves lower (1 failure). Composer 1.5's runtime is stable. You can rely on the test environment to stay running.

This stability is valuable for production systems. If you deploy Cursor Composer 1.5 in your CVE patching pipeline, expect minimal operational interruptions from environment issues.

Build failure analysis

Build failures (30) account for 23.4% of evals. This is higher than the field median (20%) but not extreme. These are cases where the test environment cannot compile Composer 1.5's patch. The most common reasons are missing dependencies, incompatible language version assumptions, or syntax that requires a toolchain the container lacks.

Composer 1.5's 30 build failures, combined with 39 semantic fails and 2 infra failures, account for 71 of the 128 evals. The remaining 57 evals pass. This breakdown shows the model's primary challenge is vulnerability reasoning, not environment integration.
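The breakdown above can be sanity-checked with a few lines of arithmetic over the four outcome counts stated in the text:

```python
# Outcome shares out of all 128 evaluations, using the counts
# stated in the profile (57 pass, 39 fail, 30 build, 2 infra).
outcomes = {"pass": 57, "fail": 39, "build": 30, "infra": 2}
total = sum(outcomes.values())  # 128

for name, count in outcomes.items():
    print(f"{name:>5}: {count:3d} ({count / total:.1%})")
# pass 44.5%, fail 30.5%, build 23.4%, infra 1.6%
```

The build share (23.4%) and infra share (1.6%) match the figures quoted earlier in the profile.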

Model exclusivity

Composer 1.5 is only available through Cursor's API. You cannot use it via OpenAI, Google, or Anthropic endpoints. This means evaluating Cursor Composer in your own system requires integrating with Cursor's proprietary infrastructure.

For teams that already use Cursor for IDE-based development, integrating Composer 1.5 into your CVE patching system is straightforward. For teams that don't, it requires an additional API credential and library. This is a trade-off: proprietary models are not universally accessible.

Cost positioning

At $3.93 per fix, Composer 1.5 is cost-competitive. It is cheaper than most Claude variants and comparable to Gemini 3.1 Pro ($3.92). The 45.2% pass rate means you need more evaluations than higher-accuracy models to achieve the same fix volume, but per-evaluation cost is low.

For teams running many proofs-of-concept on many CVEs, Composer 1.5 is affordable. For teams running a few high-value evaluations, the accuracy matters more than cost.
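For budget planning, the two headline numbers are enough for a back-of-envelope estimate. A minimal sketch (the helper below is illustrative, not part of any benchmark tooling):

```python
import math

PASS_RATE = 0.452     # Composer 1.5 pass rate from the benchmark
COST_PER_FIX = 3.93   # dollars per successful fix

def budget_for_fixes(target_fixes: int) -> tuple[int, float]:
    """Expected evaluations needed and total cost to reach a fix target."""
    evals_needed = math.ceil(target_fixes / PASS_RATE)
    total_cost = target_fixes * COST_PER_FIX
    return evals_needed, total_cost

evals, cost = budget_for_fixes(50)
print(evals, f"${cost:.2f}")  # 111 evaluations, $196.50
```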

Comparison to Cursor's other agent

Cursor also offers a GPT-5.2-based variant in the benchmark. That variant achieves 51.6% pass rate, compared to Composer 1.5's 45.2%. The GPT-5.2 variant is 6.4 percentage points more accurate. Cursor offers both options: use Composer 1.5 for rapid, cheap evaluation and GPT-5.2 for accuracy when it matters.

Ecosystem positioning

Composer 1.5 is Cursor's entry into the agentic CVE patching market. The benchmark results show it is competitive on cost but lags on accuracy. For Cursor, this is a reasonable position: the company's strength is the IDE and user experience, not the base model. Integrating better base models (like GPT-5.2) into Cursor is a natural upgrade path.

When to choose Composer 1.5

Use Cursor Composer 1.5 if you need low-cost CVE patch evaluation and you have Cursor infrastructure already deployed. The 45.2% pass rate is acceptable for reconnaissance: run it on 100 CVEs, get 45 patches, manually validate the promising ones.
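The reconnaissance loop described above can be sketched as follows. `run_agent` is a hypothetical callable standing in for whatever harness drives Composer 1.5; it is not a real Cursor API:

```python
# Reconnaissance sketch: run the agent over a CVE backlog and queue
# passing patches for manual validation. `run_agent` is hypothetical.
def triage(cve_ids, run_agent):
    """Return the CVE IDs whose patches passed and need human review."""
    to_review = []
    for cve_id in cve_ids:
        if run_agent(cve_id) == "pass":  # outcome label per evaluation
            to_review.append(cve_id)
    return to_review

# At a 45.2% pass rate, a 100-CVE backlog yields roughly 45 candidates.
```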

Use Cursor's GPT-5.2 variant if you need higher accuracy. Use other agents if you need to minimize proprietary dependencies.

Summary

Cursor Composer 1.5 is a cost-effective proprietary model for CVE patching. The 45.2% pass rate at $3.93 per fix is reasonable for rapid evaluation. Infrastructure reliability is excellent (only 2 failures). Semantic failure (vulnerability reasoning) is the primary limitation. For teams already using Cursor, Composer 1.5 is a viable option. For teams evaluating models independently, GPT and Claude variants offer higher accuracy at comparable or lower cost.

Explore CVE-Agent-Bench results to compare Cursor Composer 1.5 against other agents. Review benchmark economics to calculate evaluation cost for your CVE volume. See Cursor's GPT-5.2 variant for a higher-accuracy alternative within Cursor's ecosystem.

FAQ

How reliable is Cursor Composer 1.5?

45.2% pass rate on 128 CVEs at $3.93 per fix. Second-lowest infrastructure failure count (2 total). 57 passes, 39 fails, 30 build failures.

What is the cost per successful patch for Cursor Composer 1.5?

$3.93 per successful patch across 128 evaluations. Cost-competitive with Gemini 3.1 Pro ($3.92) despite a below-average pass rate. For teams already using Cursor, Composer 1.5 is an affordable option for rapid CVE evaluation.

How does Cursor Composer 1.5 compare to other agents?

45.2% pass rate sits slightly below the benchmark average of 47.3%. Its primary distinction is infrastructure stability: only 2 infra failures from 128 evaluations. The main weakness is semantic failure, 39 incorrect patches that compile but do not fix the CVE.
