u/j-m-k-s

Introducing Exabase M-1: State-of-the-art AI memory with a smaller, cheaper model

We want to share some research we've been working on around memory retrieval for agents.

TLDR: our memory engine (M-1) scored 96.4% on LongMemEval, a widely used benchmark for long-term conversational memory. That's the highest reported score we're aware of, and we did it with Gemini 3 Flash, not Pro.

The small model is the part we care about most, because it's what keeps the cost per query down.

When we started building our memory engine, we kept running into the same pattern: memory systems that only worked well when paired with big, expensive models.

The model ends up compensating for weak retrieval. Fine for a benchmark, but it falls apart in production where every query costs money and latency matters.

We wanted to know: can you build retrieval good enough that a cheap model gets the right answer?

That question led us to look at how human memory actually works – not as database lookup, but as reconstructive, associative, temporally-aware recall. We collaborated with Hyperplane Labs, a European applied research lab focused on cognitive AI architectures, on the retrieval architecture.

Three ideas shaped the design:

  • Retrieval as reconstructive recall, not keyword search
  • Temporal awareness built into scoring, not bolted on
  • Context that's coherent and ordered, not just relevant

We evaluated on LongMemEval, a benchmark for conversational memory designed to stress multi-session reasoning, temporal understanding, and knowledge updates. These are the scenarios where current systems tend to break or fall back to larger models.

M-1 scored 96.4%, the highest reported result, with a smaller, cheaper model than any of the other reported systems used.

Full paper with methodology, comparative results, and downloadable data: https://exabase.io/research/exabase-achieves-state-of-the-art-on-longmemeval-benchmark

The system powers our own apps in production, and the memory API is available if anyone wants to try it.

If you're building agents with memory, we'd be curious to hear what retrieval problems you're running into, especially around multi-session reasoning and temporal updates. That's where we've seen the biggest gap between current approaches and what's actually needed.

u/j-m-k-s — 1 day ago

#1 on memory benchmark LongMemEval with Gemini Flash, not Pro [R]

Disclosure: first author.

We evaluated an experimental memory retrieval system on LongMemEval (Wang et al., 2024). The results may be of interest here, particularly the deliberate use of a smaller answering model to separate retrieval quality from model capability.

Our result: 96.4% at top-50 with Gemini 3 Flash. Scores reported by other systems, all using Gemini 3 Pro: Mem0 94.8%, Honcho 92.6%, HydraDB 90.79%, Supermemory 85.2%.

Retrieval architecture draws on episodic memory theory (Tulving, 1972), reconstructive recall (Bartlett, 1932), and temporal context models (Howard & Kahana, 2002). Three design choices we think mattered:

  • Query decomposition: parallel retrieval passes targeting distinct information needs. Critical for multi-session questions where no single query surfaces all relevant fragments.
  • Temporal salience scoring: candidates scored on semantic similarity, lexical precision, and temporal salience, reflecting associative and recency factors in human recall (Polyn et al., 2009).
  • Coherence re-ranking: re-ranked for cross-memory coherence and temporal chain resolution before presentation to the answering model.
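
The three steps above can be sketched as a toy pipeline. This is only an illustration under loud assumptions, not our implementation: `decompose` splits on " and " instead of using a model, lexical overlap stands in for both semantic similarity and lexical precision, temporal salience is a plain exponential recency decay, and "coherence re-ranking" is reduced to putting the selected memories in chronological order. All names, weights, and the half-life are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Memory:
    text: str
    timestamp: datetime


def tokens(text: str) -> set[str]:
    """Lowercased word set with basic punctuation stripped."""
    return {t.strip(".,?!").lower() for t in text.split() if t.strip(".,?!")}


def decompose(query: str) -> list[str]:
    """Hypothetical stand-in for query decomposition: split on ' and '."""
    parts = [p.strip() for p in query.replace("?", "").split(" and ")]
    return [p for p in parts if p]


def lexical_score(query: str, memory: Memory) -> float:
    """Fraction of query tokens that also appear in the memory."""
    q, m = tokens(query), tokens(memory.text)
    return len(q & m) / len(q) if q else 0.0


def temporal_salience(memory: Memory, now: datetime, half_life_days: float = 30.0) -> float:
    """Assumed recency decay: salience halves every `half_life_days`."""
    age_days = max((now - memory.timestamp).total_seconds() / 86400, 0.0)
    return 0.5 ** (age_days / half_life_days)


def score(query: str, memory: Memory, now: datetime,
          w_lex: float = 0.7, w_time: float = 0.3) -> float:
    """Weighted blend of lexical match and recency (weights are arbitrary)."""
    return w_lex * lexical_score(query, memory) + w_time * temporal_salience(memory, now)


def retrieve(query: str, store: list[Memory], now: datetime,
             per_query: int = 5, k: int = 50) -> list[Memory]:
    best: dict[Memory, float] = {}
    # One retrieval pass per sub-query; keep each memory's best score.
    for sub in decompose(query):
        ranked = sorted(store, key=lambda m: score(sub, m, now), reverse=True)
        for mem in ranked[:per_query]:
            best[mem] = max(best.get(mem, 0.0), score(sub, mem, now))
    top = sorted(best, key=best.get, reverse=True)[:k]
    # Simplified "coherence" step: present the selected memories oldest first.
    return sorted(top, key=lambda m: m.timestamp)


store = [
    Memory("I adopted a dog named Biscuit", datetime(2024, 1, 10)),
    Memory("Oolong tea tastes great", datetime(2024, 2, 1)),
    Memory("I moved to Lisbon", datetime(2024, 3, 1)),
]
now = datetime(2024, 3, 11)
context = retrieve("What is my dog named and which city have I moved to?",
                   store, now, per_query=1)
print([m.text for m in context])
```

With `per_query=1`, each sub-query contributes its single best match, so the dog memory and the Lisbon memory are both returned, oldest first, even though no single query string matches both well.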

Methodology: forked Mem0's open-source benchmarking script, replaced storage and retrieval with our system, stripped all question-specific prompt templates. Single generic prompt, 500 questions.
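
To make "single generic prompt" concrete, here is a minimal sketch of what such a harness could look like. It is not the forked Mem0 script and the prompt text is invented for illustration. `retrieve`, `answer`, and `judge` are hypothetical callables the reader would supply. The point is only that `build_prompt` never branches on question type.

```python
from typing import Callable

# Hypothetical prompt wording; the prompt actually used is linked at the end of the post.
GENERIC_PROMPT = (
    "Answer the question using only the memories below.\n\n"
    "Memories:\n{memories}\n\nQuestion: {question}\nAnswer:"
)


def build_prompt(question: str, memories: list[str]) -> str:
    """Same template for every question, regardless of category."""
    listing = "\n".join(f"- {m}" for m in memories)
    return GENERIC_PROMPT.format(memories=listing, question=question)


def evaluate(
    items: list[dict],
    retrieve: Callable[[str], list[str]],
    answer: Callable[[str], str],
    judge: Callable[[str, str], bool],
    top_k: int = 50,
) -> float:
    """Return accuracy in percent over all benchmark items."""
    correct = 0
    for item in items:
        memories = retrieve(item["question"])[:top_k]
        prediction = answer(build_prompt(item["question"], memories))
        correct += bool(judge(prediction, item["expected"]))
    return 100 * correct / len(items)
```

In a real run `answer` would call the answering model and `judge` would be whatever grading procedure the benchmark prescribes. Both are left as plain functions here so the loop can be tested with stubs.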

Category results at top-50: single-session (user) 98.6%, single-session (assistant) 100%, single-session (preference) 96.7%, knowledge update 97.4%, multi-session 94.0%, temporal reasoning 95.5%.
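
As a sanity check, the per-category numbers can be combined into the overall score. The question counts per category are not stated in this post. The ones below are an assumption taken from the published LongMemEval question-type breakdown and should be checked against the dataset. The labels also assume that "assistant" and "preferences" refer to the single-session categories.

```python
# Per-category accuracy (%) at top-50, as reported above.
reported = {
    "single-session (user)": 98.6,
    "single-session (assistant)": 100.0,
    "single-session (preference)": 96.7,
    "knowledge update": 97.4,
    "multi-session": 94.0,
    "temporal reasoning": 95.5,
}

# ASSUMPTION: questions per category in the 500-question set (verify against the dataset).
counts = {
    "single-session (user)": 70,
    "single-session (assistant)": 56,
    "single-session (preference)": 30,
    "knowledge update": 78,
    "multi-session": 133,
    "temporal reasoning": 133,
}

# Recover the number of correct answers per category from the rounded percentages.
correct = {name: round(reported[name] / 100 * counts[name]) for name in counts}

total = sum(counts.values())
overall = 100 * sum(correct.values()) / total
print(f"{sum(correct.values())}/{total} correct = {overall:.1f}%")
```

Under those assumed counts the categories add up to 482 of 500 correct, which matches the 96.4% headline figure.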

Limitations: single benchmark evaluation; architecture details intentionally limited; single model configuration, no ablations; production conditions (adversarial inputs, privacy, contradictory information) not tested.

Above ~96% we hit evaluation ceiling effects: ambiguous questions, narrow expected answers, and dataset inconsistencies. We identified some benchmark errors and reported them upstream.

Paper | Results | Answerer prompt

Curious if others have explored similar cognitive-science-informed retrieval architectures for conversational memory.

u/j-m-k-s — 6 days ago