How Agents Get Attacked

20% jailbreak success rate. 42 seconds average. 90% of successful attacks leak data. Threat data grounded in published research.

Agent attack taxonomy

Four primary vectors: prompt injection, tool poisoning, skill supply chain compromise, and protocol exploits. Each vector has distinct detection and mitigation requirements.

Real-world attack data

Pillar Security monitored 2,000+ LLM applications and found a 20% jailbreak success rate with 42-second average time. 90% of successful attacks resulted in sensitive data leakage.

20%

42s

90%

Agent threat surface: real-world data, not theory

Every stat on this page comes from published research. Pillar Security monitored 2,000+ LLM applications. MCPSecBench tested 17 attack types across 4 surfaces. arXiv papers documented multi-agent exploitation at scale. This is not theoretical. See OWASP agentic risks for mapped controls.

Attack taxonomy

Prompt injection

20% success rate across 2,000+ applications. Average time: 42 seconds. 90% of successful attacks result in data leakage (Pillar Security, Oct 2024).

Tool poisoning

36.5% average attack success rate. Manipulated tool descriptions trick agents into executing harmful actions. o1-mini hit 72.8% success (MCPTox, arXiv:2508.14925). Learn more about MCP security.

Skill supply chain

36.82% of 3,984 agent skills contain security flaws. 76 confirmed malicious payloads in public marketplaces (Snyk ToxicSkills). 350% rise in GitHub Actions supply chain attacks in 2025 (StepSecurity). See building secure skills for prevention.

Protocol exploits

17 attack types across 4 MCP surfaces. 85%+ of identified attacks compromise at least one platform (MCPSecBench, arXiv:2508.13220).

Multi-agent amplification

When multiple agents collaborate, a compromised agent can propagate attacks across the system. Research shows 58-90% success rates for arbitrary code execution via multi-agent orchestration systems, with some configurations reaching 100% (arXiv:2503.12188).

Prompt injection on a single agentic coding assistant can compromise the entire supply chain of repositories it touches (arXiv:2601.17548).

Defense effectiveness

A meta-analysis of 78 published studies found that attackers with adaptive strategies succeed at 85%+ rates. Most defense mechanisms achieve less than 50% mitigation (arXiv:2506.23260). This gap between attack and defense effectiveness means detection and response are more reliable than prevention alone. XOR's verification pipeline focuses on output validation rather than perfect input protection.

"Prompt injection is defining the AI era". CrowdStrike 2026 Threat Report

What XOR catches

Verification pipeline

Every agent-generated fix is tested against the original vulnerability. Bad patches are rejected before review.

Guardrail review

Inline review comments on risky changes. Uncertainty stop: XOR says when confidence is low instead of guessing.

CI hardening

Actions pinned to SHA. Workflow permissions reduced to least-privilege. Counters the 350% rise in Actions supply chain attacks.

Skill scanning

Agent tools checked against vulnerability databases before execution. Unsigned tools are blocked.

Sources

Pillar Security, State of Attacks on GenAI (2024-2025), 2,000+ LLM apps
arXiv:2601.17548 — Prompt Injection Attacks on Agentic Coding Assistants
arXiv:2503.12188 — Multi-Agent Systems Execute Arbitrary Malicious Code
arXiv:2510.23883 — Agentic AI Security: Threats, Defenses, Evaluation
arXiv:2506.23260 — Adaptive attack strategies, 78 studies meta-analysis
International AI Safety Report 2026. 100+ experts, 30+ countries
CrowdStrike 2026 Threat Report, AI threat vectors
StepSecurity, GitHub Actions supply chain attacks

[NEXT STEPS]

MCP server security →Agent safety →Building secure skills →Third-party risk overview →

FAQ

How often do jailbreak attacks succeed?

20% of jailbreak attempts succeed with an average time of 42 seconds. 90% of successful attacks result in sensitive data leakage (Pillar Security, 2,000+ LLM applications monitored).

Can multi-agent systems be exploited for code execution?

Yes. Research shows 58-90% success rates for arbitrary code execution via multi-agent orchestration systems, with some configurations reaching 100% (arXiv:2503.12188).

How effective are current defenses?

Most defense mechanisms achieve less than 50% mitigation against adaptive attack strategies. Attackers with budget for multiple attempts succeed at 85%+ rates across 78 published studies (arXiv:2506.23260).

Benchmark Results

62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.

Benchmark Methodology

How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.

Agent Configurations

15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.

Native CLIs vs wrapper CLIs: the 10-16pp performance gap

Claude CLI vs OpenCode, Gemini CLI vs OpenCode, Codex vs Cursor. Same models, different wrappers, consistent accuracy gaps of 10-16 percentage points.

Cost vs performance: where agents sit on the Pareto frontier

15 agents plotted on cost-accuracy. 4 on the Pareto frontier. Best value: claude-opus-4-6 at $2.93/pass, 61.6%.

See which agents produce fixes that work

128 vulnerabilities. 15 agents. 1,920 evaluations. Agents learn from every run.