r/AIMemory

Because architecture: What MuSiQue 1,000Q benchmarking taught me about why current memory retrieval can’t live up to its promise

Most of us have faced some version of the same problem when using AI in work and life: memory retrieval for AI eventually disappoints because we expect human-like recall but often get trash.

Drilling down deeper, one realizes that we are more often than not expecting random-access multi-hop retrieval - because that’s how our human memory works. But what we currently have as tools are graph crawling, cosine lookups or (gasp) regex matching. Who knew grep was such a powerful tool, token waste be damned?

So how do you make an AI system remember in a way that’s actually useful for humans? You model it after human memory, of course. Not a Frankenstein bolt-on mess of open-source code, but a designed-from-the-ground-up, built-from-scratch lean memory engine modeled after literal neurobiological systems.

My own frustration trying to fully utilize AI for my day job as a pharma/biotech consultant drove me to build this sparse tensor-based graph memory engine over the past few months — my PhD is in biochemistry, so I’m drawing from what I actually know rather than what sounds good on a pitch deck. And because I am a proud scientist (almost to a fault), I naively threw the engine against MuSiQue 1,000Q, which is as close to a real multi-hop memory recall test as we have in the literature. It could have gone horribly wrong, but if it did, you wouldn’t be reading about it.

The short version: F1 = 0.677 on the full 1,000Q corpus (highest published zero-shot end-to-end score as of May 2026, to the best of my knowledge). Yeah. Went quite a bit better than I expected.

Reader-controlled baseline with a compact local embedding model (nomic): 0.565 vs LlamaIndex at 0.418 and BM25 at 0.329.

But the number isn’t really the point. What I think matters more for anyone building memory systems is why this architecture works differently from established tools.

The recall problem nobody talks about

Vector similarity search answers “what’s close to this query in embedding space?” That’s fine for a simple lookup. Search, rank, done. But MuSiQue was specifically designed to defeat that mechanism: no single retrieved passage contains the entire answer. You need passage A to find passage B to find passage C. That’s memory traversal, not memory search. Graph crawling is similarly limited: it has to follow edges, and it risks fanning out too thin before it finds the next relevant node.

The engine builds a weighted graph where edges carry typed relationships (like various neurotransmitters) and activation energy propagates through connections (like how neurons fire). Nodes that are semantically distant but informationally connected, whether through logical relationships, provenance, hierarchy, or similar links, still light up if the path between them has enough weight. Same principle as biological associative recall: you smell something and remember a childhood memory that has zero semantic overlap with the smell but a strong associative pathway.
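To make the idea concrete, here is a minimal spreading-activation sketch in Python. It is not the engine described above (that code is proprietary); the function name, the `decay` and `threshold` parameters, and the example graph are all assumptions made up for illustration.

```python
from collections import defaultdict

def spread_activation(edges, seeds, decay=0.5, threshold=0.05, max_steps=4):
    """Propagate energy from seed nodes along weighted, typed edges.

    edges: iterable of (src, dst, edge_type, weight) tuples (hypothetical schema).
    seeds: {node: initial_energy}.
    Returns {node: accumulated_activation}.
    """
    outgoing = defaultdict(list)
    for src, dst, _edge_type, weight in edges:
        outgoing[src].append((dst, weight))

    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for dst, weight in outgoing[node]:
                passed = energy * weight * decay
                # Weak paths die out instead of fanning out forever.
                if passed >= threshold:
                    next_frontier[dst] += passed
        if not next_frontier:
            break
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation

# Toy graph: a strong associative chain and one weak semantic link.
edges = [
    ("smell_of_bread", "grandmas_kitchen", "associative", 0.9),
    ("grandmas_kitchen", "childhood_summer", "temporal", 0.8),
    ("smell_of_bread", "bakery_ad", "semantic", 0.05),
]
result = spread_activation(edges, {"smell_of_bread": 1.0})
```

In this toy run the two-hop memory `childhood_summer` still receives activation through the strong path, while the weakly linked `bakery_ad` falls below the threshold and never lights up.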

That’s the architectural hypothesis. The benchmark results suggest it works. I posted the full methodology and honest limitations over on r/RAG (including the ~52% reader confound, PropRAG’s superior retrieval lift at +81.9% vs our +71.7%, and Beam Retrieval’s higher supervised score of 0.692) because I didn’t want to bury the caveats. Full transparency on what beat us and where. You can also see the full write-up with all the numbers: https://elucidx.ca/insights/2026-05-15-rag-needs-real-value/

The harness is public

The engine itself is proprietary and patent-pending — I’m not releasing source. But the evaluation harness, dataset, and scoring protocol are all public: [github.com/wonker007/musique-eval-harness]. If you’re building a memory system and want to know how it does on genuine multi-hop recall, run your system against the same corpus, same protocol with the same scorer and post the number. I’ll reference it.
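For anyone unfamiliar with how an answer-level F1 is usually computed, here is a rough sketch of the common SQuAD-style token-overlap F1. This is an assumption about the general metric family, not a copy of the harness scorer; if you post numbers, use the scorer in the public harness, which is the authoritative one.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A corpus-level score is then typically the mean of the per-question values, taking the best match when a question has several gold answers.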

I’m also currently running conversational-scale benchmarks (128K to 10M token range) testing temporal reasoning, knowledge updates, and contradiction detection — the stuff that actually matters for memory persistence over long interactions with AI. More data coming.

If anyone here is working on multi-hop recall architectures — whether that’s GraphRAG, memory-augmented transformers, or something else entirely — I’d love to hear what serious benchmarks you’re using and what you’re seeing. MuSiQue is good but it’s still Wikipedia passages, not production conversational data.

(Post was written with the help of AI, edited by me)

u/wonker007 — 3 days ago

▲ 11 r/AIMemory+1 crossposts

Search gives you candidates, not evidence

Took me a while to admit this, but for agent retrieval, top-k snippets aren't evidence. They're candidates. They look confident, they're often subtly off, and the model will cite them without blinking.

What's worked better for me is two steps instead of one. Search to narrow down candidates, then reopen the actual source and read it narrowly to confirm. Retrieval gets you close, reading is what verifies. People tend to pick a side, index-and-retrieve or let-the-agent-crawl, but in practice you want both, same as how you'd Google your way to a page and then actually read it.

I built something around that split and ran one eval on it. Big grain of salt, single agent and single corpus, not a general claim. Finding implementations in a ~2000-file repo, plain shell averaged 962 tokens at 22/24 hits. Search then browse landed 23/24 at 460 tokens. Roughly half the tokens at slightly better recall.
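Here's a rough sketch of the search-then-verify split, just to show the shape of it. This is not the API of the linked project; `search_candidates`, `verify`, `find_implementation`, and the in-memory `index` and `sources` dicts are hypothetical names invented for the example.

```python
import re

def search_candidates(index, query, k=3):
    """Step 1: cheap search. Rank paths by query-term hits in the indexed snippet."""
    terms = query.lower().split()
    scored = []
    for path, snippet in index.items():
        score = sum(term in snippet.lower() for term in terms)
        if score:
            scored.append((score, path))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored[:k]]

def verify(sources, path, pattern):
    """Step 2: reopen the real source and keep only lines that match."""
    lines = sources[path].splitlines()
    return [(n, line) for n, line in enumerate(lines, 1) if re.search(pattern, line)]

def find_implementation(index, sources, query, pattern):
    """Return the first candidate whose actual source confirms the match."""
    for path in search_candidates(index, query):
        hits = verify(sources, path, pattern)
        if hits:
            return path, hits
    return None, []

# Toy data: both snippets look relevant, only one file holds the definition.
index = {
    "a.py": "retry helper docs mention backoff",
    "b.py": "def retry with backoff implementation",
}
sources = {
    "a.py": "# see retry in b.py\n",
    "b.py": "import time\n\ndef retry(fn, attempts=3):\n    return fn()\n",
}
found = find_implementation(index, sources, "retry backoff", r"^def retry\(")
```

Both files score the same in the search step, so the snippet alone can't tell them apart. Only the narrow read of the source rejects `a.py` and confirms `b.py`.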

See this for details: https://github.com/zilliztech/mfs

u/ethanchen20250322 — 7 days ago

▲ 74 r/AIMemory+2 crossposts

Mem0 publishes 93.4% on LongMemEval. The harness has hardcoded answers for specific question_ids.

Mem0 publishes 93.4% on LongMemEval as their state-of-the-art overall score. When we ran their hosted product through a clean evaluation harness (gpt-5 answerer, binary judge with no lean-toward-yes instruction, 5-seed mean), the best we could get was 73.8%. A 19.6-point gap on the same memory system and the same data.

We dug further: the gap is in their public benchmark harness. Reading their prompts.py file at the commit they shipped right before their April 14 announcement (commit bd063eea, April 3, 2026):

1. Dataset-specific equivalence rules in the answer prompt.

https://preview.redd.it/va27d4jzvw8h1.png?width=3024&format=png&auto=webp&s=2d835fafc5a1583cef7fed3c6343b405d4b37dad

Lines 138 to 148 contain 14 rules that map 1-to-1 to specific public LongMemEval question_ids. A sample, verbatim:

The point of LongMemEval is that the system has to figure out when "scratch grains" should count as "layer feed." Hardcoding the equivalence into the answer prompt skips the reasoning step.

2. The dataset hints get applied inside a hidden chain-of-thought block.

https://preview.redd.it/szk9ka57ww8h1.png?width=2940&format=png&auto=webp&s=56e04fb8e44dd8b4707a7062f8d07d116a86c58a

Line 53: Before answering, reason step-by-step inside <mem_thinking> tags:
Line 65: The user will only see text outside the <mem_thinking> tags.

The judge only sees the final cleaned answer. The dataset-keyed reasoning is invisible to anyone sampling outputs.

3. The judge is explicitly told to default to "yes."

https://preview.redd.it/xroeatxaww8h1.png?width=3006&format=png&auto=webp&s=126fad0f35c618e523a1eef3a864a76870c85fbc

Line 269 of the same file: IMPORTANT BIAS CHECK: You have a tendency to say "no" too quickly. Before concluding "no", you MUST verify the answer is truly wrong, not just differently worded. When in doubt, lean toward "yes".

Lines 328 to 334 add a 5-step gauntlet to clear before marking anything WRONG. No comparable gauntlet exists before marking anything CORRECT.

4. Bonus finding in their LoCoMo judge.

https://preview.redd.it/67yu69beww8h1.png?width=3024&format=png&auto=webp&s=93159462881bb10e168473dea546895099c25dfb

Different file, same repo, commit edcd6f1d (April 9, 2026). Line 212 of benchmarks/locomo/prompts.py:

Read the last clause carefully. Evidence can promote a WRONG prediction to CORRECT. The same evidence cannot demote a CORRECT prediction to WRONG. A one-directional score lift, written into the judge by hand.

Mem0 named this mechanism in their own commit messages. The April 3 commit message reads: "Sync prompts from evals: CONTEXT CHECK, Rule 14 (contradictions), conflicting numbers, personalization scan, BIAS CHECK in judge, chain-of-thought <judge_thinking> tags, 5-step FINAL CHECK." Their engineer typed the words "BIAS CHECK in judge" and "5-step FINAL CHECK" into git, on April 3, eleven days before the announcement of new SOTA numbers.

Verify in 2 minutes (direct GitHub permalinks at the pinned commits):

L145 chandelier: github.com/mem0ai/memory-benchmarks/blob/bd063eea04de4f8a19927beea155afa094a01905/benchmarks/longmemeval/prompts.py#L145
L269 BIAS CHECK: same file, #L269
L212 LoCoMo override: github.com/mem0ai/memory-benchmarks/blob/edcd6f1d42400837b1fcb6997716f1769dc51a37/benchmarks/locomo/prompts.py#L212
April 3 commit message: github.com/mem0ai/memory-benchmarks/commit/bd063eea04de4f8a19927beea155afa094a01905

For the past 2-3 weeks I have tried to meet with their founder and communicate the issue, but we couldn't connect, so I thought it might be time for the community to learn about it.

Full disclosure: I am the founder of Maximem.ai, another Agentic Memory and Context Management company. This is not an attempt to malign, but to put their latest numbers into perspective.


u/Ok_Row9465 — 10 days ago

▲ 13 r/AIMemory

looking for a real AI memory PRODUCT

I'm looking for an easy-to-set-up AI memory tool, without having to go to GitHub and set it up myself. A real PRODUCT, not a hobby project.

Any recommendations?


u/CautiousTwist7958 — 10 days ago

▲ 8 r/AIMemory+3 crossposts

NornicDB - benchmark 1-60 hops shortest path on 500k nodes 4m edges

TLDR:

The Scale: 500,000 nodes, ~3.97 million edges (~16 connections/node) benched on an Apple M3 Max.

Performance Breakdown

Depths 1–4 (Local): Sub-millisecond (94us -> 342us median). Fits entirely within the adjacency and edge body caches.
Depths 5–8 (Mid-range): Single-digit milliseconds (1.2ms -> 2.3ms median). Working set starts hitting cold subgraphs, causing some tail latency noise.
Depths 9–39 (Deep): Linear scaling at roughly 14ms per hop.
Depth 40+ (The Cache Cliff): Latency instantly doubles (jumping from 282ms -> 612ms median). The BFS frontier hits ~200,000 nodes, obliterating the default nodeCacheMaxEntries limit of 50,000 and forcing raw disk iterator hits.
Depth 60 (Full Scan): Maxes out at ≈ 1 second for a full cross-sector traversal.
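To illustrate why the frontier size matters, here is a small level-by-level BFS sketch that reports the peak frontier alongside the hop count. It is not NornicDB code; `bfs_hops`, the dict-based adjacency list, and the way `cache_limit` is checked are assumptions for illustration only (the 50,000 default just mirrors the number quoted above).

```python
def bfs_hops(adj, start, goal, cache_limit=50_000):
    """Unweighted shortest path by level-order BFS.

    Returns (hops, peak_frontier, exceeded_cache), where hops is None if
    the goal is unreachable and exceeded_cache flags a frontier larger
    than the (hypothetical) node cache.
    """
    if start == goal:
        return 0, 1, False
    visited = {start}
    frontier = [start]
    peak = 1
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for node in frontier:
            for neighbor in adj.get(node, ()):
                if neighbor in visited:
                    continue
                if neighbor == goal:
                    return depth, peak, peak > cache_limit
                visited.add(neighbor)
                next_frontier.append(neighbor)
        frontier = next_frontier
        peak = max(peak, len(frontier))
    return None, peak, peak > cache_limit

# Tiny example graph: 0 -> {1, 2} -> {3, 4} -> 5
adj = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
hops, peak, exceeded = bfs_hops(adj, 0, 5, cache_limit=1)
```

Once the peak frontier outgrows whatever the cache can hold, every further expansion has to touch colder storage, which is the kind of cliff the depth 40+ numbers describe.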

github.com

u/Dense_Gate_5193 — 10 days ago

▲ 5 r/AIMemory

If I Only Had a Brain

I wrote this after seeing another thread asking for “a real AI memory product” that is easy to set up and just works.

My take: most of what gets sold as AI memory is really recall — chunks in a vector store, pulled back by similarity. Useful, but not memory.

I’ve been building my own local-first memory system, and the thing it keeps proving back to me is that the hard part is not storage or retrieval. It is deciding what your knowledge actually is: facts, events, procedures, contradictions, provenance, and time.

Curious where people here draw the line between recall and memory.

open.substack.com

u/bczajak — 9 days ago

▲ 6 r/AIMemory

Cognee just put out new APIs for memory and they look good

They are fully agent-memory now with primitives like remember, recall, forget and improve. Looks good, especially improve: https://www.cognee.ai/blog/deep-dives/inside-cognee-1-0

There's also the new Rust core, which cuts cold-start and search times by a lot.

They had a bunch of announcements today. IMO the biggest one is single-database Postgres support. Just not sure how they are putting a graph in it.

u/regentwells — 10 days ago

▲ 9 r/AIMemory+1 crossposts

looking to dive deep into ai memory & data retrieval methods... where do I start?

So I've been working in the AI memory area for some time, and now I feel it's time to dive deeper than just the surface layer. Right now I know little beyond the names of the prominent methods for giving an AI agent the right piece of info without blowing up the context window (e.g. RAG), and I don't know the deep mechanics.

would be really glad if someone could help me map out the process moving forward!

(My work may require me to do some deeper research on the same topic.)


u/Col-ASY — 12 days ago

▲ 4 r/AIMemory

What happens when multiple AI agents remember different versions of the same user?

I've been thinking about a problem that seems likely to get bigger as AI agents become more common.

Today, we focus on helping AI remember things. But what happens when multiple agents and services share that memory?

For example:

Agent A says I prefer Python.
Agent B says I prefer Rust.
One memory came from a conversation last week.
Another came from a project six months ago.

Questions I'm curious about:

Which source should be considered authoritative?
How should conflicting memories be resolved?
Should users be able to approve or revoke specific memories?
How can an AI explain why it trusts a particular memory?
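One way to make these questions concrete is to store provenance with every memory and make the resolution rule explicit. The sketch below is only one possible policy (most recent non-revoked record wins); the `MemoryRecord` fields, the `resolve` function, and the dates and sources in the example are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class MemoryRecord:
    key: str            # what the memory is about, e.g. "preferred_language"
    value: str
    agent: str          # which agent wrote it
    source: str         # where it came from (conversation, project, ...)
    observed_on: date
    revoked: bool = False  # set when the user withdraws consent

def resolve(records, key):
    """Pick the most recent non-revoked record and explain the choice."""
    live = [r for r in records if r.key == key and not r.revoked]
    if not live:
        return None, "no approved memory for this key"
    winner = max(live, key=lambda r: r.observed_on)
    reason = (
        f"chose '{winner.value}' from {winner.agent} ({winner.source}, "
        f"{winner.observed_on.isoformat()}): most recent of {len(live)} "
        f"approved record(s)"
    )
    return winner, reason

records = [
    MemoryRecord("preferred_language", "Python", "agent_a",
                 "conversation", date(2026, 5, 1)),
    MemoryRecord("preferred_language", "Rust", "agent_b",
                 "project", date(2025, 11, 1)),
]
winner, reason = resolve(records, "preferred_language")
```

Recency is just the simplest rule to show. The same structure would let you swap in source-based trust or require explicit user approval, and the returned `reason` string is what an agent could surface when asked why it trusts a memory.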

Are there existing systems or research projects working on memory governance, provenance, and consent for AI memories?

Would love to hear how others are thinking about this.


u/Ok-Sheepherder-7194 — 14 days ago
