
Introducing Exabase M-1: State-of-the-art AI memory with a smaller, cheaper model
We want to share some research we've been working on around memory retrieval for agents.
TL;DR: our memory engine (M-1) just scored 96.4% on LongMemEval, the main benchmark for conversational memory. That's the highest reported score, and we did it with Gemini 3 Flash, not Pro.
The small model is the part we care about most, because it's what keeps the system cost-efficient.
When we started building our memory engine, we kept running into the same pattern: memory systems that only worked well when paired with big, expensive models.
The model ends up compensating for weak retrieval. Fine for a benchmark, but it falls apart in production where every query costs money and latency matters.
We wanted to know: can you build retrieval good enough that a cheap model gets the right answer?
That question led us to look at how human memory actually works – not as a database lookup, but as reconstructive, associative, temporally aware recall. We collaborated on the retrieval architecture with Hyperplane Labs, a European applied research lab focused on cognitive AI architectures.
Three ideas shaped the design:
- Retrieval as reconstructive recall, not keyword search
- Temporal awareness built into scoring, not bolted on
- Context that's coherent and ordered, not just relevant
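To make the second and third ideas concrete, here is a minimal sketch of what "temporal awareness in the score" and "ordered context" could look like. This is not M-1's actual implementation: the `Memory` type, the precomputed `similarity` field, the exponential half-life decay, and the `build_context` helper are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Memory:
    text: str
    timestamp: datetime
    similarity: float  # hypothetical relevance to the query, in [0, 1]


def temporal_score(memory: Memory, now: datetime, half_life_days: float = 30.0) -> float:
    """Relevance multiplied by a recency decay (an assumed scoring rule)."""
    age_days = max((now - memory.timestamp).total_seconds() / 86400.0, 0.0)
    decay = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    return memory.similarity * decay


def build_context(memories: list[Memory], now: datetime, k: int = 3,
                  half_life_days: float = 30.0) -> list[Memory]:
    """Pick the top-k memories by temporal score, then return them oldest first."""
    ranked = sorted(memories,
                    key=lambda m: temporal_score(m, now, half_life_days),
                    reverse=True)
    return sorted(ranked[:k], key=lambda m: m.timestamp)


# Toy knowledge-update scenario: an old fact superseded by a newer one.
now = datetime(2025, 1, 1)
memories = [
    Memory("User lives in Berlin", now - timedelta(days=200), 0.90),
    Memory("User moved to Lisbon", now - timedelta(days=5), 0.85),
    Memory("User likes hiking", now - timedelta(days=10), 0.20),
]
for m in build_context(memories, now, k=2):
    print(m.timestamp.date(), m.text)
```

In this toy case the stale "Berlin" memory is outranked despite having the highest raw similarity, and the two selected memories come back in chronological order rather than score order. A pure decay like this would also bury old facts that are still true, so a real system would likely need something less blunt.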
We evaluated on LongMemEval, the most comprehensive benchmark for conversational memory, which is designed to stress multi-session reasoning, temporal understanding, and knowledge updates. These are the kinds of scenarios where current systems tend to break or fall back to larger models.
We achieved state-of-the-art results while using a smaller, cheaper model than any other reported system.
Full paper with methodology, comparative results, and downloadable data: https://exabase.io/research/exabase-achieves-state-of-the-art-on-longmemeval-benchmark
The system powers our own apps in production, and the memory API is available if anyone wants to try it.
If you're building agents with memory, we'd be curious to hear what retrieval problems you're running into.
We're especially interested in multi-session reasoning and temporal updates, which is where we've seen the biggest gap between current approaches and what's actually needed.