I benchmarked mermaid vs markdown vs YAML as LLM agent memory — 250+ trials, results flipped depending on the model

TL;DR: I had this intuition that mermaid diagrams should beat markdown as the storage format for agent memory (tasks, project notes, codebase descriptions). Fewer tokens, explicit pointers, faster navigation. So I built a benchmark. The hypothesis was mostly wrong in interesting ways:

YAML beats mermaid on tokens: −34% vs markdown, compared with mermaid's −20%
On Claude subagents, format barely affects speed — system prompt overhead drowns the signal
On GPT-4o with a clean harness, structured formats are 40% faster than markdown — mermaid and YAML both win
GPT-4o-mini gets less accurate on structured formats (90–95% vs 100% on markdown) — a model-size-vs-format interaction I didn't expect
Mermaid's biggest win is variance: 5–6× lower stddev on wall time on Claude. Predictable latency, never the fastest, never the slowest

So the answer to "is mermaid the best format for agent memory?" is: it depends what you're optimizing for, and which model you're running.

What I tested

One fact set (a "memory pack" about a fictional staff engineer), encoded three different ways:

alex_md/ — markdown prose
alex_mmd/ — mermaid diagrams (mindmap for user facts, flowchart for feedback rules, graph for codebase imports)
alex_yaml/ — YAML

Then 7 benchmark tasks across 4 categories:

Recall — single-fact lookups ("What's the user's timezone?")
Coding context — needs convention from memory ("Which module for auth?")
Adversarial — contradiction, multi-hop ("Modules transitively depending on auth?")
Hard — bigger codebase (25 modules), needs 3+ parallel reads
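To score a multi-hop task like "Modules transitively depending on auth?", you need a ground-truth answer computed from the import graph. A minimal sketch of one way to do that, assuming the graph is a dict mapping each module to the modules it imports. The function name and the example graph are hypothetical, not taken from the benchmark packs.

```python
from collections import deque


def transitive_dependents(imports: dict[str, list[str]], target: str) -> set[str]:
    """Return every module that depends on `target`, directly or via other modules."""
    # Reverse the edges: for each module, record who imports it.
    imported_by: dict[str, set[str]] = {}
    for module, deps in imports.items():
        for dep in deps:
            imported_by.setdefault(dep, set()).add(module)

    # Breadth-first walk up the reversed edges, starting from the target.
    seen: set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for parent in imported_by.get(current, ()):
            if parent not in seen and parent != target:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical graph for illustration only.
EXAMPLE_IMPORTS = {
    "api": ["billing", "auth"],
    "billing": ["auth", "db"],
    "auth": ["db"],
    "reports": ["billing"],
    "db": [],
}

print(sorted(transitive_dependents(EXAMPLE_IMPORTS, "auth")))
# ['api', 'billing', 'reports']
```

Note that "reports" never imports auth directly; it only shows up because it imports billing, which is what makes this a multi-hop question.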

Two harness paths:

Claude Code subagents (Claude Opus 4.7): carries ~20k tokens of system-prompt overhead
OpenAI direct API (gpt-4o and gpt-4o-mini): clean harness, format effects visible
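For anyone who wants to try a similar comparison, here is a minimal sketch of a wall-time loop. It assumes a hypothetical ask(pack, question) callable that wraps whatever model API you use; it is not the harness behind the numbers in this post.

```python
import time
from statistics import mean
from typing import Callable


def time_trials(ask: Callable[[str, str], str], pack: str, question: str,
                seeds: int = 5) -> list[float]:
    """Wall-clock seconds for `seeds` repeated calls of ask(pack, question)."""
    timings = []
    for _ in range(seeds):
        start = time.perf_counter()
        ask(pack, question)  # hypothetical model call; the answer is ignored here
        timings.append(time.perf_counter() - start)
    return timings


def compare_formats(ask: Callable[[str, str], str], packs: dict[str, str],
                    question: str, seeds: int = 5) -> dict[str, float]:
    """Mean wall time per format, where `packs` maps a format name to its pack text."""
    return {
        name: mean(time_trials(ask, text, question, seeds))
        for name, text in packs.items()
    }
```

Keeping the raw per-trial timings around (rather than only the mean) matters if you also want to look at variance later.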

YAML was the critical control. Without it, any win for mermaid could just mean "structured beats prose." YAML lets me ask: is mermaid specifically special, or just any structure?

What surprised me

1. Mermaid's token efficiency depends on the data shape.

For small graphs (6 modules, 5 edges), mermaid was −20% vs markdown. For a bigger codebase (25 modules, 30+ edges), mermaid became +33% larger than markdown: every edge needs its own "a --> b" line, so size grows linearly with edge count, while a bullet list can pack several items onto one line. Mermaid is great for small dense relationship graphs; bad for large enumeration lists.
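The per-edge growth is easy to see with a toy renderer. This is a rough sketch with hypothetical render functions and a made-up graph; it uses character counts as a crude stand-in for tokens, so it will not reproduce the −20% / +33% figures, only the shape of the effect.

```python
def render_markdown(imports: dict[str, list[str]]) -> str:
    # One bullet per module, with all of its imports packed onto the same line.
    return "\n".join(
        f"- {module}: {', '.join(deps) if deps else 'none'}"
        for module, deps in imports.items()
    )


def render_mermaid(imports: dict[str, list[str]]) -> str:
    # One line per edge, so output grows with the number of edges.
    lines = ["graph TD"]
    for module, deps in imports.items():
        if not deps:
            lines.append(f"  {module}")
        for dep in deps:
            lines.append(f"  {module} --> {dep}")
    return "\n".join(lines)


def render_yaml(imports: dict[str, list[str]]) -> str:
    # Flow-style lists keep each module on a single line.
    return "\n".join(f"{module}: [{', '.join(deps)}]" for module, deps in imports.items())


def sizes(imports: dict[str, list[str]]) -> dict[str, int]:
    """Character count per format (a crude proxy, not a real tokenizer)."""
    return {
        "md": len(render_markdown(imports)),
        "mmd": len(render_mermaid(imports)),
        "yaml": len(render_yaml(imports)),
    }


# Made-up module with high fan-out: six imports from a single module.
HUB = {"hub": ["m1", "m2", "m3", "m4", "m5", "m6"]}
print(sizes(HUB))
```

For a real measurement you would swap len() for the tokenizer of the model you are targeting.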

2. The "graph pointer enables parallel reads" hypothesis didn't differentiate formats.

When I asked a question requiring 3 file reads, modern Claude and the OpenAI models both issued all 3 reads in parallel regardless of format. Markdown bullet lists trigger parallelism just as well as mermaid edges. So the mental model "graphs let the agent jump" was wrong; the better one is "any clear file inventory triggers parallel reads."
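One way to check for this in a transcript is to count how many read calls land in a single model turn. A tiny sketch, assuming the transcript has already been reduced to a list of turns where each turn is a list of tool names; that structure and the "read_file" tool name are assumptions for illustration.

```python
def max_parallel_reads(turns: list[list[str]], read_tool: str = "read_file") -> int:
    """Largest number of read calls issued within a single turn."""
    return max(
        (sum(1 for call in turn if call == read_tool) for turn in turns),
        default=0,
    )


# Three reads in one turn count as parallel; one read per turn counts as sequential.
print(max_parallel_reads([["read_file", "read_file", "read_file"]]))      # 3
print(max_parallel_reads([["read_file"], ["read_file"], ["read_file"]]))  # 1
```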

3. On GPT-4o, the speed gap is huge:

Format	gpt-4o wall time (vs md)	gpt-4o-mini wall time (vs md)
md	3.11s (baseline)	2.72s (baseline)
mmd	1.88s (−40%)	2.16s (−21%)
yaml	1.80s (−42%)	2.13s (−22%)
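The percentages in the table are relative changes in mean wall time against the markdown baseline. A quick check of the arithmetic, using only the numbers from the table:

```python
def pct_change(value: float, baseline: float) -> float:
    """Relative change versus the baseline, in percent."""
    return (value - baseline) / baseline * 100


# Mean wall times in seconds, copied from the table.
WALL = {
    "gpt-4o": {"md": 3.11, "mmd": 1.88, "yaml": 1.80},
    "gpt-4o-mini": {"md": 2.72, "mmd": 2.16, "yaml": 2.13},
}

for model, times in WALL.items():
    for fmt in ("mmd", "yaml"):
        print(model, fmt, round(pct_change(times[fmt], times["md"])))
# gpt-4o mmd -40
# gpt-4o yaml -42
# gpt-4o-mini mmd -21
# gpt-4o-mini yaml -22
```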

But the Claude subagent runs barely showed this, because Claude Code's system prompt is so big that the pack format barely matters. This means most blog posts comparing prompt formats with Claude Code are probably measuring noise. You need an API-direct harness to see real format effects.

4. Small models care about format more — in the opposite direction.

gpt-4o-mini's success rate:

md: 100%
mmd: 95%
yaml: 90%

gpt-4o was 100% across all three. So capable models gain speed from structure; smaller models lose accuracy. If you're shipping a hybrid stack (use 4o-mini for cheap calls, 4o for complex ones), you'd want different memory formats per tier. Nobody talks about this.

5. The variance finding (Claude only):

Across 30 trials per format on Claude, mermaid had 5× lower wall-time stddev than markdown or YAML. Markdown occasionally crawled at 20s; mermaid never went above 14.9s. Never won the race, never lost it either. For p99 latency SLOs this might actually matter more than mean.
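If you want to compare formats on spread rather than mean, the summary is short to compute. A sketch using Python's statistics module; the two series below are made-up values chosen to have the same mean, not data from this benchmark.

```python
from statistics import mean, quantiles, stdev


def latency_summary(wall_times: list[float]) -> dict[str, float]:
    """Mean, sample standard deviation, p99 and max of a list of wall times (needs 2+ values)."""
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    cuts = quantiles(wall_times, n=100, method="inclusive")
    return {
        "mean": mean(wall_times),
        "stdev": stdev(wall_times),
        "p99": cuts[98],
        "max": max(wall_times),
    }


# Made-up series with identical means but very different tails.
steady = [10.0, 11.0, 10.0, 11.0, 10.0, 11.0]
spiky = [5.0, 5.0, 5.0, 5.0, 5.0, 38.0]
print(latency_summary(steady))
print(latency_summary(spiky))
```

With only 30 trials per format, an interpolated p99 sits very close to the slowest observed run, so it is worth reporting the max alongside it.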

Decision matrix I'd use now

Optimize for	Pick
Cheapest tokens	YAML
Fastest on big models (4o, Opus)	YAML or mermaid (~tied)
Reliability on small models	Markdown
Latency consistency (p99)	Mermaid
Human-team editability	YAML
Small relationship graphs	Mermaid
Large lists / enumerations	Markdown
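If you wanted to wire the matrix into a hybrid stack, it reduces to a lookup. A sketch with hypothetical key names; the values are just the rows of the matrix above.

```python
# Rows of the decision matrix, keyed by what you are optimizing for.
DECISION_MATRIX: dict[str, list[str]] = {
    "cheapest_tokens": ["yaml"],
    "fastest_on_big_models": ["yaml", "mermaid"],  # roughly tied
    "small_model_reliability": ["markdown"],
    "latency_consistency_p99": ["mermaid"],
    "human_editability": ["yaml"],
    "small_relationship_graphs": ["mermaid"],
    "large_lists": ["markdown"],
}


def pick_formats(goal: str) -> list[str]:
    """Return the candidate memory formats for an optimization goal."""
    try:
        return DECISION_MATRIX[goal]
    except KeyError:
        raise ValueError(f"unknown goal {goal!r}; expected one of {sorted(DECISION_MATRIX)}")


print(pick_formats("small_model_reliability"))  # ['markdown']
print(pick_formats("fastest_on_big_models"))    # ['yaml', 'mermaid']
```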

Caveats I want to flag

N=3–8 seeds per cell. Means are stable; variance findings are robust; the small-model accuracy gap is from 1–2 failed trials and needs more seeds.
Memory packs are tiny by production standards (~600–2k tokens). Real CLAUDE.md files at scale would show different effects.
Single domain ("staff engineer working on a SaaS API"). Different task domains (legal, medical, creative) probably behave differently.
I built the mermaid representations by hand — a worse mermaid pack would lose harder. Mermaid is sensitive to authoring quality.
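To put the first caveat in numbers: a Wilson score interval shows how wide the uncertainty is when an accuracy gap comes from 1–2 failures. A sketch of the standard formula; the trial count of 20 in the usage lines is an assumption for illustration, not a figure stated in this post.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 is roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Assumed trial count of 20 per format (hypothetical): 2 failures vs 0 failures.
print(wilson_interval(18, 20))
print(wilson_interval(20, 20))
```

Under that assumed count the two intervals overlap, which is why the small-model accuracy gap needs more seeds before it can be treated as real.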

What I'd want to test next

50+ module codebases — does the format-flip-at-scale generalize?
Multi-turn conversations where memory accumulates
Local models (Llama, Qwen) — do they pattern-match more like gpt-4o-mini or gpt-4o?
Hybrid encoding: pointer-only CLAUDE.md + detail files in a separate format

https://preview.redd.it/bma1tkbhbw1h1.png?width=2585&format=png&auto=webp&s=7d0e7655ca1cf7aad95a8fbf9c217184346612d1

https://preview.redd.it/atfkh3ahbw1h1.png?width=1039&format=png&auto=webp&s=de2b14f7e7b2557927f1abdab246c1dd5df3a882

https://preview.redd.it/fevo54ahbw1h1.png?width=1039&format=png&auto=webp&s=a817befa1cd95cce13206909e563aa2d237496ca

https://preview.redd.it/rnhx92ahbw1h1.png?width=1759&format=png&auto=webp&s=e083d7e23869b666680c5178613abe9f2cf40b22

https://preview.redd.it/12c043ahbw1h1.png?width=1154&format=png&auto=webp&s=8bc3c637637c8f8867752d1df9dc356638ee036c

https://preview.redd.it/re5hv3ahbw1h1.png?width=1239&format=png&auto=webp&s=8ff2bc81d7c8274b853aa82934280d3c5212bd5a

https://preview.redd.it/n23xt3ahbw1h1.png?width=1758&format=png&auto=webp&s=8558401025cbcec5e9eb9a7f595e1341138b2d1e

https://preview.redd.it/ob9fdtahbw1h1.png?width=919&format=png&auto=webp&s=903ab4891fe804be1e263b9b8b396db948f5e924

https://preview.redd.it/0ear3sahbw1h1.png?width=2042&format=png&auto=webp&s=82c670cf9a98e99d6d882530d22e1c573d35528d

https://preview.redd.it/tsdgr4ahbw1h1.png?width=919&format=png&auto=webp&s=259ddb9344542641f00febe984c524f2871f50c7

https://preview.redd.it/rrh9vtahbw1h1.png?width=919&format=png&auto=webp&s=f35fa6cee15c948ffab79daa0f11692a3318eaeb

https://preview.redd.it/825u03ahbw1h1.png?width=918&format=png&auto=webp&s=f6b4437eb661f408ec7ad09a1733eac440921332

https://preview.redd.it/ggqnm3ahbw1h1.png?width=905&format=png&auto=webp&s=7093192ce8f9687c14e8ef4120416c2402a254b2

https://preview.redd.it/j1jgt3ahbw1h1.png?width=919&format=png&auto=webp&s=e7440ea23cd5dea1979a1b7336054d94057bf2c9

https://preview.redd.it/3zv253ahbw1h1.png?width=919&format=png&auto=webp&s=112cb64961bca9baf1f85db67a135f1962e4061e

https://preview.redd.it/r6ys9tahbw1h1.png?width=919&format=png&auto=webp&s=0ebf4f39352097f135254f872cd911ee5e8626a4

https://preview.redd.it/fwtqy3ahbw1h1.png?width=919&format=png&auto=webp&s=63340791884311915d95df65f26cdebead167d0c

Happy to share more detail on any specific finding. Curious if anyone else has run similar experiments — particularly on the small-model-format-fragility thing, which feels under-studied.

u/Ashamed_Safety_9782