
MD vs MMD vs YAML experiment of speed/tool calls/tokens/efficiency
I benchmarked mermaid vs markdown vs YAML as LLM agent memory — 250+ trials, results flipped depending on the model
TL;DR: I had this intuition that mermaid diagrams should beat markdown as the storage format for agent memory (tasks, project notes, codebase descriptions). Fewer tokens, explicit pointers, faster navigation. So I built a benchmark. The hypothesis was mostly wrong in interesting ways:
- YAML beats mermaid on tokens (−34% vs markdown, compared with mermaid's −20%)
- On Claude subagents, format barely affects speed — system prompt overhead drowns the signal
- On GPT-4o with a clean harness, structured formats are 40% faster than markdown — mermaid and YAML both win
- GPT-4o-mini gets less accurate on structured formats (90–95% vs 100% on markdown) — a model-size-vs-format interaction I didn't expect
- Mermaid's biggest win is variance: 5–6× lower stddev on wall time on Claude. Predictable latency, never the fastest, never the slowest
So the answer to "is mermaid the best format for agent memory?" is: it depends what you're optimizing for, and which model you're running.
What I tested
Three identical fact sets ("memory pack about a fictional staff engineer"), encoded three different ways:
- alex_md/: markdown prose
- alex_mmd/: mermaid diagrams (mindmap for user facts, flowchart for feedback rules, graph for codebase imports)
- alex_yaml/: YAML
Then 7 benchmark tasks across 4 categories:
- Recall — single-fact lookups ("What's the user's timezone?")
- Coding context — needs convention from memory ("Which module for auth?")
- Adversarial — contradiction, multi-hop ("Modules transitively depending on auth?")
- Hard — bigger codebase (25 modules), needs 3+ parallel reads
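To make the multi-hop task concrete, here is a minimal sketch of how a ground-truth answer to "which modules transitively depend on auth?" could be computed. The module names and the `IMPORTS` graph are hypothetical, not the ones from my memory packs.

```python
from collections import deque

# Hypothetical import graph: module -> modules it imports directly.
IMPORTS = {
    "api": ["auth", "billing"],
    "billing": ["auth", "db"],
    "auth": ["db"],
    "admin": ["api"],
    "db": [],
    "utils": [],
}

def transitive_dependents(imports, target):
    """Return every module that depends on `target`, directly or indirectly."""
    # Invert the edges so we can walk from a module to its importers.
    importers = {}
    for module, deps in imports.items():
        for dep in deps:
            importers.setdefault(dep, set()).add(module)

    # Breadth-first search over the inverted graph.
    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for parent in importers.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(target)  # only matters if the graph has a cycle through target
    return sorted(seen)

print(transitive_dependents(IMPORTS, "auth"))
```

In this toy graph, admin never imports auth directly and is only reachable through api, which is the kind of second hop the adversarial task is meant to probe.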
Two harness paths:
- Claude Code subagents (Claude Opus 4.7): ~20k tokens of system-prompt overhead
- OpenAI direct API (gpt-4o and gpt-4o-mini): clean harness, format effects visible
YAML was the critical control. Without it, any win for mermaid could just mean "structured beats prose." YAML lets me ask: is mermaid specifically special, or just any structure?
What surprised me
1. Mermaid's token efficiency depends on the data shape.
For small graphs (6 modules, 5 edges), mermaid was −20% vs markdown. For a bigger codebase (25 modules, 30+ edges), mermaid became 33% larger than markdown: every edge costs its own a --> b line, so size grows linearly with edge count, while bullet lists pack more densely. Mermaid is great for small, dense relationship graphs and bad for large enumeration lists.
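A rough sketch of the per-edge cost, under several assumptions: the graph is synthetic, the markdown side is a compact adjacency bullet list (my real small-graph markdown pack was prose, which is why mermaid won there), and character count stands in for tokens. It will not reproduce my measured numbers. It only shows that a mermaid edge list repeats the source node on every line, while a bullet amortizes it across all targets.

```python
def chain_with_fanout(n_modules, fanout):
    """Synthetic edge list: module i imports the next `fanout` modules."""
    edges = []
    for i in range(n_modules):
        for j in range(i + 1, min(i + 1 + fanout, n_modules)):
            edges.append((f"m{i}", f"m{j}"))
    return edges

def to_mermaid(edges):
    # One line per edge, source name repeated every time.
    return "graph TD\n" + "".join(f"{src} --> {dst}\n" for src, dst in edges)

def to_markdown(edges):
    # One bullet per source, all of its targets packed on the same line.
    grouped = {}
    for src, dst in edges:
        grouped.setdefault(src, []).append(dst)
    return "".join(f"- {src}: {', '.join(dsts)}\n" for src, dsts in grouped.items())

for n_modules in (6, 25, 50):
    edges = chain_with_fanout(n_modules, fanout=3)
    mmd, md = len(to_mermaid(edges)), len(to_markdown(edges))
    print(f"{n_modules} modules, {len(edges)} edges: mermaid={mmd} chars, markdown={md} chars")
```

To compare real token counts, swap `len` for an actual tokenizer for the model you care about.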
2. The "graph pointer enables parallel reads" hypothesis didn't differentiate formats.
When I asked a question requiring 3 file reads, modern Claude (and OpenAI) issued all 3 reads in parallel regardless of format. Markdown bullet lists trigger parallelism just as well as mermaid edges. So the cognitive model "graphs let the agent jump" was wrong — it's actually "any clear file inventory triggers parallel reads."
3. On GPT-4o, the speed gap is huge:
| Format | gpt-4o wall | gpt-4o-mini wall |
|---|---|---|
| md | 3.11s | 2.72s |
| mmd | 1.88s (−40%) | 2.16s (−21%) |
| yaml | 1.80s (−42%) | 2.13s (−22%) |
But the Claude subagent runs barely showed this — because Claude's system prompt is so big the pack format barely matters. This means most blog posts comparing prompt formats with Claude Code are probably noise. You need an API-direct harness to see real format effects.
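For anyone checking the percentages in the table, they are plain relative changes against the markdown wall time. A small sketch that recomputes them from the reported means (the function and variable names are mine, the numbers are from the table):

```python
# Mean wall times in seconds, copied from the table above.
WALL_SECONDS = {
    "gpt-4o": {"md": 3.11, "mmd": 1.88, "yaml": 1.80},
    "gpt-4o-mini": {"md": 2.72, "mmd": 2.16, "yaml": 2.13},
}

def pct_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100

def deltas_vs_markdown(walls):
    """Rounded percent change of each non-markdown format vs markdown, per model."""
    return {
        model: {
            fmt: round(pct_change(times["md"], seconds))
            for fmt, seconds in times.items()
            if fmt != "md"
        }
        for model, times in walls.items()
    }

print(deltas_vs_markdown(WALL_SECONDS))
```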
4. Small models care about format more — in the opposite direction.
gpt-4o-mini's success rate:
- md: 100%
- mmd: 95%
- yaml: 90%
gpt-4o was 100% across all three. So capable models gain speed from structure; smaller models lose accuracy. If you're shipping a hybrid stack (use 4o-mini for cheap calls, 4o for complex ones), you'd want different memory formats per tier. Nobody talks about this.
5. The variance finding (Claude only):
Across 30 trials per format on Claude, mermaid had 5× lower wall-time stddev than markdown or YAML. Markdown occasionally crawled at 20s; mermaid never went above 14.9s. Never won the race, never lost it either. For p99 latency SLOs this might actually matter more than mean.
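To show why spread can matter more than mean, here is a minimal sketch of the summary I'd compute per format. The two sample lists are made-up toy values chosen to illustrate the shape of the effect. They are not my benchmark data.

```python
import statistics

def latency_summary(samples):
    """Mean, sample stddev, and 99th percentile of wall times in seconds."""
    # 99 cut points; index 98 is the 99th percentile. "inclusive" keeps the
    # estimate inside the observed range, which is safer for small samples.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "mean": statistics.mean(samples),
        "stddev": statistics.stdev(samples),
        "p99": cuts[98],
    }

# Toy values only (assumption): one steady format, one that is usually
# faster but occasionally crawls.
steady = [11.0, 12.0, 13.0, 12.5, 11.5, 12.0]
spiky = [6.0, 7.0, 20.0, 6.5, 8.0, 7.5]

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(name, latency_summary(samples))
```

The spiky series has the lower mean but the worse stddev and p99, which is the trade-off a latency SLO cares about.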
Decision matrix I'd use now
| Optimize for | Pick |
|---|---|
| Cheapest tokens | YAML |
| Fastest on big models (measured on gpt-4o; Opus showed little difference) | YAML or mermaid (~tied) |
| Reliability on small models | Markdown |
| Latency consistency (p99) | Mermaid |
| Human-team editability | YAML |
| Small relationship graphs | Mermaid |
| Large lists / enumerations | Markdown |
Caveats I want to flag
- N=3–8 seeds per cell. Means are stable; variance findings are robust; the small-model accuracy gap is from 1–2 failed trials and needs more seeds.
- Memory packs are tiny by production standards (~600–2k tokens). Real CLAUDE.md files at scale would show different effects.
- Single domain ("staff engineer working on a SaaS API"). Different task domains (legal, medical, creative) probably behave differently.
- I built the mermaid representations by hand — a worse mermaid pack would lose harder. Mermaid is sensitive to authoring quality.
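On the small-model accuracy caveat: a quick sketch of why 1–2 failed trials is thin evidence, using a 95% Wilson score interval. The trial count of 20 per format is an assumption, inferred from 95% and 90% corresponding to 1 and 2 failures. It is not a number taken from my logs.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)

TRIALS = 20  # assumption: implied by 1 failure = 95%, 2 failures = 90%

for fmt, passed in (("md", 20), ("mmd", 19), ("yaml", 18)):
    low, high = wilson_interval(passed, TRIALS)
    print(f"{fmt}: {passed}/{TRIALS} -> [{low:.2f}, {high:.2f}]")
```

Under that assumption the three intervals overlap heavily, so the gap could still be noise, which is why this finding needs more seeds.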
What I'd want to test next
- 50+ module codebases — does the format-flip-at-scale generalize?
- Multi-turn conversations where memory accumulates
- Local models (Llama, Qwen) — do they pattern-match more like gpt-4o-mini or gpt-4o?
- Hybrid encoding: pointer-only CLAUDE.md + detail files in a separate format
Happy to share more detail on any specific finding. Curious if anyone else has run similar experiments — particularly on the small-model-format-fragility thing, which feels under-studied.