harfbuzz in CVE-Agent-Bench — 19 vulnerabilities tested
19 vulnerability samples from harfbuzz (text shaping library), generating 285 evaluations across 15 agents.
Overview
harfbuzz is a Unicode text shaping library used by Chrome, Firefox, Android, and LibreOffice to lay out complex scripts correctly. Shaping means parsing binary font tables, which demands precise memory management and bounds checking, and the library's wide deployment means a single lapse can affect a very large number of systems. The library handles fonts of varying complexity, from simple Latin scripts to intricate Asian and Indic writing systems.
Benchmark coverage
19 vulnerability samples from harfbuzz are included in CVE-Agent-Bench, generating 285 individual evaluations across 15 agent configurations. These samples focus on buffer overflows, heap memory corruption, and out-of-bounds reads that occur during font parsing and glyph shaping operations.
Vulnerability classes
harfbuzz samples cover specific vulnerability patterns common in font processing code:
- Heap buffer overflows in font table parsing, where malformed or truncated font files exceed allocated memory bounds
- Out-of-bounds reads in glyph shaping operations, triggered by invalid glyph indices or corrupted outline data
- Integer overflows in size calculations that lead to undersized buffer allocation
- Use-after-free bugs in font object reference counting during decompression or format conversion
- Null pointer dereferences when expected font table structures are missing or malformed
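Several of the classes above come down to the same two habits: check every read against the end of the buffer, and check every size calculation for overflow before using it. The sketch below illustrates both on a count-prefixed array of big-endian 16-bit values. It is a minimal illustration, not harfbuzz code: `Blob`, `read_u16`, `checked_mul`, and `read_u16_array` are hypothetical names, and harfbuzz's own validation code is structured differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Hypothetical view over an untrusted font blob (not a harfbuzz type).
struct Blob {
    const std::uint8_t* data;
    std::size_t size;
};

// Read a big-endian uint16 at `offset`, or nullopt if the read would run
// past the end. The check uses subtraction so it cannot wrap around.
std::optional<std::uint16_t> read_u16(const Blob& b, std::size_t offset) {
    if (b.size < 2 || offset > b.size - 2) return std::nullopt;
    return static_cast<std::uint16_t>((b.data[offset] << 8) | b.data[offset + 1]);
}

// Compute count * elem_size, or nullopt if the product would overflow.
std::optional<std::size_t> checked_mul(std::size_t count, std::size_t elem_size) {
    if (elem_size != 0 &&
        count > std::numeric_limits<std::size_t>::max() / elem_size) {
        return std::nullopt;
    }
    return count * elem_size;
}

// Parse a uint16 count followed by `count` uint16 values.
// Returns nullopt if the declared count does not fit in the remaining bytes.
std::optional<std::vector<std::uint16_t>> read_u16_array(const Blob& b,
                                                         std::size_t offset) {
    auto count = read_u16(b, offset);
    if (!count) return std::nullopt;

    auto bytes = checked_mul(*count, sizeof(std::uint16_t));
    if (!bytes) return std::nullopt;

    // read_u16 succeeded, so offset + 2 <= b.size and this cannot wrap.
    const std::size_t start = offset + 2;
    if (*bytes > b.size - start) return std::nullopt;

    std::vector<std::uint16_t> out;
    out.reserve(*count);
    for (std::size_t i = 0; i < *count; ++i) {
        // Each element was covered by the range check above.
        out.push_back(*read_u16(b, start + 2 * i));
    }
    return out;
}
```

A truncated input makes `read_u16_array` return an empty optional instead of reading past the buffer, which is the behavior a patch for this kind of bug typically needs to restore.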
Why harfbuzz bugs are interesting for agent evaluation
harfbuzz vulnerabilities test an agent's ability to understand complex memory safety issues in font parsing code. The project has intricate data structures for font tables, glyph outlines, and shaping algorithms. Fixing these bugs often requires domain-specific knowledge of the OpenType font specification and careful handling of variable-length binary data. Agents must also balance defensive checks against performance constraints in a widely used library.
The 19 samples in the benchmark represent the types of issues that can lead to remote code execution when processing untrusted font files. A single malformed font embedded in a web page, PDF, or email attachment can trigger memory corruption.
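To make the "variable-length binary data" point concrete, here is a sketch of looking up a table in an OpenType table directory while validating every offset and length taken from the file. The layout used (a 12-byte header with `numTables` at byte 4, followed by 16-byte records holding tag, checksum, offset, and length) follows the OpenType specification. The function and type names (`find_table`, `TableRange`, `be16`, `be32`) are hypothetical and do not correspond to the harfbuzz API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical result type: a validated byte range inside the font blob.
struct TableRange {
    std::uint32_t offset;
    std::uint32_t length;
};

// Big-endian decoders; callers must have bounds-checked `p` already.
static std::uint16_t be16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}
static std::uint32_t be32(const std::uint8_t* p) {
    return (std::uint32_t{p[0]} << 24) | (std::uint32_t{p[1]} << 16) |
           (std::uint32_t{p[2]} << 8) | std::uint32_t{p[3]};
}

// Find the table with the given 4-byte tag. Returns nullopt if the directory
// is truncated, the tag is absent, or the record points outside the blob.
std::optional<TableRange> find_table(const std::uint8_t* font, std::size_t size,
                                     const char* tag /* exactly 4 bytes */) {
    constexpr std::size_t kHeaderSize = 12;  // sfntVersion, numTables, 3 search fields
    constexpr std::size_t kRecordSize = 16;  // tag, checksum, offset, length

    if (size < kHeaderSize) return std::nullopt;
    const std::size_t num_tables = be16(font + 4);

    // num_tables is at most 65535, so this product cannot overflow a size_t
    // of 32 bits or more (assumed here).
    if (num_tables * kRecordSize > size - kHeaderSize) return std::nullopt;

    for (std::size_t i = 0; i < num_tables; ++i) {
        const std::uint8_t* rec = font + kHeaderSize + i * kRecordSize;
        if (std::memcmp(rec, tag, 4) != 0) continue;

        const std::uint32_t offset = be32(rec + 8);
        const std::uint32_t length = be32(rec + 12);
        // Reject ranges that start or end beyond the blob; written as a
        // subtraction so offset + length is never computed.
        if (offset > size || length > size - offset) return std::nullopt;
        return TableRange{offset, length};
    }
    return std::nullopt;
}
```

The key design choice is that nothing read from the file is trusted as an index until it has been compared against the real buffer size, which is the step that tends to be missing in the bug classes listed earlier.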
Agent performance on harfbuzz
Per-project performance data is not yet published. The full benchmark results aggregate performance across all projects. You can review how individual agents performed overall at the full results page, where you can sort by pass rate, cost, and other metrics. The methodology behind agent evaluation is documented in the benchmark methodology guide.
Related projects
Other memory-safety intensive projects in the benchmark include:
- libarchive, archive format parsing with similar bounds-checking challenges
- libjxl, image codec with comparable decoding complexity
- libgit2, binary format parsing with variable-length data handling
Explore more
- Full benchmark results
- Agent profiles
- Methodology
- Economics analysis, cost per verified patch
FAQ
How do agents perform on harfbuzz vulnerabilities?
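Per-project pass rates are not yet published, so harfbuzz results are currently only visible as part of the aggregate benchmark numbers. harfbuzz has 19 samples in CVE-Agent-Bench, the largest sample count of any single project. Font parsing bugs test domain-specific knowledge that varies across agent models.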
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.
Benchmark Methodology
How XOR benchmarks AI coding agents on real security vulnerabilities. Reproducible, deterministic, and transparent.
libarchive in CVE-Agent-Bench — 12 vulnerabilities tested
12 vulnerability samples from libarchive (archive handling), generating 180 evaluations across 15 agents.
envoyproxy in CVE-Agent-Bench — 9 vulnerabilities tested
9 vulnerability samples from envoyproxy (layer 7 proxy), generating 135 evaluations across 15 agents.
Apache in CVE-Agent-Bench — 7 vulnerabilities tested
7 vulnerability samples from Apache HTTP Server and related projects, generating 105 evaluations across 15 agents.
See which agents produce fixes that work
128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.