How Agents Get Attacked
20% jailbreak success rate. 42 seconds average. 90% of successful attacks leak data. Threat data grounded in published research.
Agent attack taxonomy
Four primary vectors: prompt injection, tool poisoning, skill supply chain compromise, and protocol exploits. Each vector has distinct detection and mitigation requirements.
Real-world attack data
Pillar Security monitored 2,000+ LLM applications and found a 20% jailbreak success rate with 42-second average time. 90% of successful attacks resulted in sensitive data leakage.
Agent threat surface: real-world data, not theory
Every stat on this page comes from published research. Pillar Security monitored 2,000+ LLM applications. MCPSecBench tested 17 attack types across 4 surfaces. arXiv papers documented multi-agent exploitation at scale. This is not theoretical. See OWASP agentic risks for mapped controls.
Attack taxonomy
Prompt injection
20% success rate across 2,000+ applications. Average time: 42 seconds. 90% of successful attacks result in data leakage (Pillar Security, Oct 2024).
Tool poisoning
36.5% average attack success rate. Manipulated tool descriptions trick agents into executing harmful actions. o1-mini hit 72.8% success (MCPTox, arXiv:2508.14925). Learn more about MCP security.
Skill supply chain
36.82% of 3,984 agent skills contain security flaws. 76 confirmed malicious payloads in public marketplaces (Snyk ToxicSkills). 350% rise in GitHub Actions supply chain attacks in 2025 (StepSecurity). See building secure skills for prevention.
Protocol exploits
17 attack types across 4 MCP surfaces. 85%+ of identified attacks compromise at least one platform (MCPSecBench, arXiv:2508.13220).
Multi-agent amplification
When multiple agents collaborate, a compromised agent can propagate attacks across the system. Research shows 58-90% success rates for arbitrary code execution via multi-agent orchestration systems, with some configurations reaching 100% (arXiv:2503.12188).
Prompt injection on a single agentic coding assistant can compromise the entire supply chain of projects it touches (arXiv:2601.17548).
Defense effectiveness
A meta-analysis of 78 published studies found that attackers with adaptive strategies succeed at 85%+ rates. Most defense mechanisms achieve less than 50% mitigation (arXiv:2506.23260). This gap between attack and defense effectiveness means detection and response are more reliable than prevention alone. XOR's verification pipeline focuses on output validation rather than perfect input protection.
"Prompt injection is defining the AI era". CrowdStrike 2026 Threat Report
What XOR catches
Verification pipeline
Every agent-generated fix is tested against the original vulnerability. Bad patches are rejected before review.
Guardrail review
Inline review comments on risky changes. Uncertainty stop: XOR says when confidence is low instead of guessing.
CI hardening
Actions pinned to SHA. Workflow permissions reduced to least-privilege. Counters the 350% rise in Actions supply chain attacks.
Skill scanning
Agent tools checked against vulnerability databases before execution. Unsigned tools are blocked.
Sources
- Pillar Security, State of Attacks on GenAI (2024-2025), 2,000+ LLM apps
- arXiv:2601.17548 — Prompt Injection Attacks on Agentic Coding Assistants
- arXiv:2503.12188 — Multi-Agent Systems Execute Arbitrary Malicious Code
- arXiv:2510.23883 — Agentic AI Security: Threats, Defenses, Evaluation
- arXiv:2506.23260 — Adaptive attack strategies, 78 studies meta-analysis
- International AI Safety Report 2026. 100+ experts, 30+ countries
- CrowdStrike 2026 Threat Report, AI threat vectors
- StepSecurity, GitHub Actions supply chain attacks
[NEXT STEPS]
Related pages
FAQ
How often do jailbreak attacks succeed?
20% of jailbreak attempts succeed with an average time of 42 seconds. 90% of successful attacks result in sensitive data leakage (Pillar Security, 2,000+ LLM applications monitored).
Can multi-agent systems be exploited for code execution?
Yes. Research shows 58-90% success rates for arbitrary code execution via multi-agent orchestration systems, with some configurations reaching 100% (arXiv:2503.12188).
How effective are current defenses?
Most defense mechanisms achieve less than 50% mitigation against adaptive attack strategies. Attackers with budget for multiple attempts succeed at 85%+ rates across 78 published studies (arXiv:2506.23260).
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
Agent Configurations
15 agent-model configurations benchmarked on real vulnerabilities. Compare pass rates and costs.
Native CLIs vs wrapper CLIs: the 10-16pp performance gap
Claude CLI vs OpenCode, Gemini CLI vs OpenCode, Codex vs Cursor. Same models, different wrappers, consistent accuracy gaps of 10-16 percentage points.
Cost vs performance: where agents sit on the Pareto frontier
15 agents plotted on cost-accuracy. 4 on the Pareto frontier. Best value: claude-opus-4-6 at $2.93/pass, 61.6%.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.