r/ContextEngineering

Agent with tiered working memory and cross-session learning — architecture, gaps, and what the research didn't cover

I've been building PRAANA — a coding agent with two systems I couldn't find combined in one self-contained binary: an Adaptive Context Engine (within-session) and Cognitive Memory (cross-session). Posting because the architectural decisions may be useful independent of the coding use case.

The core problem:

Every agent session is a context window management problem. Append-until-full plus reactive compaction is lossy — by the time you compact, you've already paid the drift cost, and you've lost track of which information was load-bearing.

PRAANA's ACE curates on every turn. A deterministic compiler assembles the prompt in 5 sections:

1. System Frame       — identity + tools
2. Memory Digest      — ranked cross-session learnings
3. Active State       — current work objects, full resolution
4. Peripheral Stubs   — everything inactive, one-line anchors
5. Recent Turns       — last N turns, budget-capped

State objects demote Active → Soft → Hard based on idle turns. Two-pass auto-hydration before each turn: substring keyword match, then BM25 for fuzzy overlap. Scores are density-weighted: decisions score 1.0, narrative scores 0.6, errors score 0.8. The compiler knows what kind of information is filling up, not just token count.
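A minimal sketch of the demotion and first-pass hydration idea, written in Python for brevity rather than PRAANA's TypeScript. The idle-turn thresholds, the `StateObject` shape, and the way the density weight multiplies a match score are all assumptions for illustration; only the three tier names and the three weights come from the post.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the post does not say how many idle turns trigger a demotion.
SOFT_AFTER_IDLE_TURNS = 3
HARD_AFTER_IDLE_TURNS = 8

# Density weights as stated in the post.
DENSITY_WEIGHTS = {"decision": 1.0, "error": 0.8, "narrative": 0.6}


@dataclass
class StateObject:
    name: str
    kind: str               # "decision", "error" or "narrative"
    last_touched_turn: int

    def tier(self, current_turn: int) -> str:
        """Active -> Soft -> Hard, based only on how long the object has been idle."""
        idle = current_turn - self.last_touched_turn
        if idle >= HARD_AFTER_IDLE_TURNS:
            return "hard"
        if idle >= SOFT_AFTER_IDLE_TURNS:
            return "soft"
        return "active"


def hydrate_by_keyword(objects, user_message, current_turn):
    """First hydration pass: a substring hit on the object name re-activates it.

    Assumption: the object name doubles as its keyword. The BM25 second pass is omitted.
    """
    text = user_message.lower()
    for obj in objects:
        if obj.name.lower() in text:
            obj.last_touched_turn = current_turn
    return objects


def weighted_score(match_score: float, kind: str) -> float:
    """Assumed use of the density weight: scale a relevance score by information kind."""
    return match_score * DENSITY_WEIGHTS[kind]
```

With these thresholds, an object last touched on turn 1 reads as soft by turn 4 and hard by turn 9, and mentioning its name in the next message brings it back to active.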

Cognitive Memory:

At /exit, a summariser extracts structured learnings from the transcript. Six kinds: fact, preference, decision, pattern, mistake, constraint — domain-agnostic; coding-specific knowledge lives in content, not schema. Stored in SQLite with sqlite-vec + Transformers.js (in-process, 384-dim). Confidence decays 5%/day. Entries confirmed across two or more sessions promote to Consolidated Memory (10x slower decay). Ranked recall: cosine × confidence × recency × pin_boost.
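The ranking formula can be sketched as below, again in Python rather than the project's TypeScript. The compounding form of the decay, reading "10x slower" as one tenth of the daily rate, the `pin_boost` value of 1.5, and recency as a number in [0, 1] are assumptions; the post only gives the 5%/day figure, the 10x factor, and the product of the four terms.

```python
import math

DAILY_DECAY = 0.05           # from the post: confidence decays 5%/day
CONSOLIDATED_SLOWDOWN = 10   # from the post: consolidated entries decay 10x slower


def decayed_confidence(confidence, days, consolidated=False):
    """Assumption: decay compounds daily; consolidated entries use a tenth of the rate."""
    rate = DAILY_DECAY / CONSOLIDATED_SLOWDOWN if consolidated else DAILY_DECAY
    return confidence * (1 - rate) ** days


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_score(query_vec, memory_vec, confidence, recency, pinned, pin_boost=1.5):
    """cosine x confidence x recency x pin_boost, with a hypothetical boost value."""
    boost = pin_boost if pinned else 1.0
    return cosine(query_vec, memory_vec) * confidence * recency * boost
```

Because the terms multiply, a memory with near-zero confidence or recency drops out of the digest no matter how close its embedding is to the query.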

Where the research fell short:

I surveyed 20+ agent-memory repos. What I found:

Mem0, LangChain, and most memory backends are retrieval systems. They store and recall but have no outcome-based feedback loops. No architecture for "this memory was used and confirmed, increase confidence" vs "this memory was contradicted, reduce it." Letta has the most interesting consolidation work (sleep-time agents) but it's a platform, not extractable, and consolidation is partial.

Nobody combined proactive context curation with learning memory in one self-contained process. The compression tools — Headroom, ACON — are SDK/proxy layers that sit between you and the LLM. They don't own agent state.

The gap I missed: the research covered storage architecture, not learning signal. The reinforcement path in PRAANA — boost confidence when a session succeeds, decay when contradicted — is wired but the session-success signal hasn't shipped yet (#162). I designed a complete feedback loop and then discovered the trigger was the hard part.

The larger plan:

Four systems — Adaptive Context, Cognitive Memory, Background Consolidation, Intelligent Router — all domain-agnostic. No system encodes anything about code. The coding agent is the proving ground; coding outcomes are measurable. Once Phase 1 validates the architecture, Phase 2 extracts the runtime as @praana/runtime. I'm not extracting it until the coding agent proves it works.

Gaps:

Reinforcement path dormant (#162). No A/B eval harness — scorecard ships, headless task runner is next, no published benchmark claims. Background Consolidation Processor schema exists, not scalable yet. Runtime extraction is Phase 2, not started.

GitHub: amitkumardubey/praana — MIT, TypeScript, Bun.

If you're working on agent memory or context management architecturally, I'd welcome the comparison. What are you seeing in production that the research repos didn't surface?

u/Reasonable_Craft_425 — 8 hours ago

▲ 52 r/ContextEngineering+1 crossposts

Notes from a conversation with a Large Enterprise CIO; about enterprise context management, ontologies and semantic layer

Recently, I had the chance to speak at length with the CIO of a large enterprise (obviously can't share the identity), around their thoughts on semantic layers, ontologies and agentic systems. They are fairly active in the CIO circles and have been engaging with their peers on the topic. Notes below are a mix from both our observations.

Some obvious observations first:

Large enterprises are disproportionately focussed on building internal agents (rather than customer-facing ones), mainly to reduce talent costs, and they are already realizing that the infra for it is far from ready.
Enterprises understand the pain and the need for context management, but they don't have the right terminology for it yet.
Most enterprises are pointing agents at fragmented internal systems and hoping the model infers business meaning across them, which obviously breaks quickly in production.

A few interesting aspects that emerged:

1. Static ontologies are dead on arrival. The real world environment changes daily but the semantic model updates once a quarter and hence the system is stale before it ships. Even human organizations get redesigned every few years because reality moves. An intelligent system should be able to reorganize its internal understanding far more often than that. The better analogy is cognition, not schema design: continuous consolidation, continuous re-linking, continuous updating of what matters.

2. The bottleneck is not data access, it is context selection. The real question is rarely "how do I retrieve more information." It is what context is right for this decision, what should be ignored and how fast that can be assembled at the speed the task demands. A person making a judgment call is not querying a giant flat database. They are drawing on a compressed, evolving, relevance-weighted internal model and that is much closer to the actual design problem.

3. Enterprise semantics gets misread in two opposite directions. Some people flatten it to metadata and catalog descriptions. Others make it so abstract it cannot be operationalized. The real need sits in between: technical enough to run in production, dynamic enough to evolve with the business and grounded enough to encode institutional meaning without collapsing under latency, security and ownership constraints.

4. Vendor semantics is not organizational semantics. Every major platform is now shipping its own semantic layer, but a company's core institutional knowledge cannot be fully outsourced to whichever vendor has the best UI this quarter. Meaning scattered across product surfaces owned by different vendors gets you local optimizations but never a coherent institutional model. This might be one of the more unresolved problems in enterprise AI right now.

5. The hard part is representing judgment, not just knowledge. Most valuable work inside a company is not a deterministic logic tree. People get hired for how they interpret incomplete information and make calls under ambiguity, not just for what they know. So the real question is not how to build a company knowledge base. It is how to build systems that inherit evolving decision context, not just stored facts.

One more thing, the same need gets called an ontology, a knowledge graph, a semantic layer, a context graph, a company brain, agent memory or institutional memory, sometimes all in one conversation. That pattern usually means the need is ahead of the label.

My rough takeaway: we may be underrating how much "intelligence at work" depends on continuously evolving context, not model quality or data availability alone. The next real layer probably is not another copilot or orchestration framework. It is whatever can unify fragmented meaning, keep it current, and make it queryable at decision speed without collapsing under latency, trust, or governance constraints.

Genuinely curious how people here see it: are semantic layers and context graphs the actual missing layer for enterprise agents or is this still too early, too abstract, or too category-confused to matter yet?

u/Ok_Row9465 — 1 day ago

▲ 160 r/ContextEngineering+27 crossposts

How to build an AGY WIKI OKF on the Antigravity CLI

AGY Builders,

We are all trying to build useful and scalable workflows for our AGY CLI and ecosystem, but the speed at which we need to learn, build, and deploy new things is incredibly overwhelming. If you are feeling that pressure, you are in the right place here at r/GoogleAntigravityCLI.

Over the past few weeks, I have been testing an "AGY WIKI OKF" setup that I put together myself (after inviting some members of this community to collaborate; mod is not proud). I know some folks might hesitate to trust a tutorial from a random Redditor, but I wanted to share this with the community anyway because it actually works.

I was able to build this because I am all-in on Google and the Antigravity ecosystem. I'm a true AGY builder: not some ultra-smart 10x developer, but I know how to work hard, dig for the right information, and iterate.

AGY WIKI OKF | The Idea

To build a frictionless, token-efficient knowledge WIKI engine that transforms static documentation or notes (information) into an active, intelligent collaborator—orchestrated entirely by Antigravity CLI.

The core philosophy is simple: treat knowledge management as a clean pipeline and tokens as a premium, finite resource.

By anchoring this architecture to Google’s Antigravity CLI, the AGY WIKI OKF bypasses heavy middleware and complex UI layers, delivering a hyper-focused AI partner built entirely for execution speed, context hygiene, and minimal footprint.

Why adopting AGY WIKI OKF matters:

Stay organized (AGY OCD): Structured Markdown and YAML keep the chaos in check.
Save tokens: Doing more with less context window bloat.
Scale shareable knowledge: Making it easy to pass context and logic between different LLMs.
Humans and Agents working together: One standardized, readable format that works perfectly for both of us.
BYOD (Bring Your Own Data): Own your context. Port it to the newest model, platform, or OS instantly.

The Tools

Antigravity CLI
Obsidian: the IDE for the knowledge bank
Obsidian Web Clipper

The WIKI

In the agent-first era, a WIKI is no longer just a static graveyard for human notes; it is the operational hard drive for your agents. By maintaining a highly structured WIKI, you ensure that every piece of context is stored in a clean, machine-readable format. This means that whether you are testing a new modular skill or spinning up a specialized agent, your AGY CLI knows exactly where to find the precise context it needs to generate autonomous action, moving you far beyond simple, reactive conversational text.

Reference: Gist on Knowledge Representation

Google Open Knowledge Format (OKF)

Google’s Open Knowledge Format (OKF) feels like the exact missing piece we've needed for orchestrating multiple AI agents effectively. It provides a vendor-neutral, interoperable standard for storing and sharing organizational knowledge.

Why this is huge for orchestration:

The "Lingua Franca" for Agents: Any agent can read it out of the box without platform-specific integrations.
Seamless Context Passing: Specialized agents can access, update, and pass the exact same foundational context back and forth.
Human-in-the-Loop Oversight: Because OKF is just Markdown and YAML, it’s inherently readable and auditable.
Scalable Knowledge: It acts as a shared, living library that grows alongside your agents.

AGY WIKI OKF Integration

Structuring an AGY Wiki using OKF revolutionizes how complex knowledge is shared. By standardizing documentation with concise Markdown and YAML frontmatter, OKF provides a unified taxonomy for cataloging AGY CLI slash commands or skills. It is highly token-efficient, stripping away bloated formatting and making the most of a limited context window.

The Prompt for Building an AGY WIKI OKF

AGY CLI WIKI OKF PROMPT EXAMPLE

/grillme I want to initialize a brand-new, empty Obsidian vault from scratch that adheres strictly to the Open Knowledge Format (OKF) standard, with the specific intent of potentially open-sourcing or sharing this architecture later. I want a purely blank, skeletal framework with no pre-populated data. Please grill me to define the optimal architectural blueprint for this vault. I need you to interrogate me on:

Core Directory Hierarchy: How should we structure the root (e.g., /concepts, /resources, /indices, /log) to be intuitive for external users?
Template Strategy: What base boilerplate templates do we need to ensure every new file is automatically OKF-compliant and structured for consistent metadata?
Workflow Logic: Since this is a fresh start, what processes should we bake in for capturing information vs. refining knowledge that could be easily documented for others?
CLI Integration: What specific file locations or configurations do we need to ensure this vault plays nicely with the Antigravity CLI from day one?
Open-Source & Contributor Documentation: What files should we create to make this a "deployable" standard? Please include requirements for: a README.md with installation and usage instructions; a CONTRIBUTING.md that defines how to add new concepts or schemas; a "System Architecture" document that explains the logic behind the folder structure and metadata fields, ensuring anyone who clones this vault understands how to extend it.

Do not generate the directory structure or files until you are satisfied that you have captured all my requirements for a production-ready, shareable knowledge base.

The Final File Structure

AGY WIKI OKF
    ├── .agyrc
    ├── ARCHITECTURE.md
    ├── CONTRIBUTING.md
    ├── README.md
    ├── .agy
    │   └── .keep
    ├── .obsidian
    │   ├── app.json
    │   ├── appearance.json
    │   ├── core-plugins.json
    │   └── workspace.json
    ├── 00-Inbox
    │   └── .keep
    ├── 10-Projects
    │   └── .keep
    ├── 20-Areas
    │   └── .keep
    ├── 30-Resources
    │   ├── .keep
    │   └── Google Antigravity Documentation.md
    ├── 40-Archive
    │   └── .keep
    ├── 99-Meta
    │   └── Templates
    │       ├── Base_Template.md
    │       ├── Project_Template.md
    │       └── Resource_Template.md
    └── Clippings

TL;DR

AGY WIKI OKF: Organizes your information (context), AGY CLI commands, skill behaviors, and A2A workflows into a token-efficient, shareable format that reduces inference costs for any LLM.
Open Knowledge Format (OKF): Provides a standardized, vendor-neutral way to share context (Markdown + YAML), preventing platform lock-in and eliminating data fragmentation.

AGY Builders, I genuinely want your input on this. Please comment, grill me, roast me, ask questions, or give me your raw feedback on this AGY WIKI OKF setup. We are building the foundation to organize and share our data in the BYOD era. Let's build the future together.

u/AgentPadrino — 2 days ago

▲ 326 r/ContextEngineering+69 crossposts

I built an open-source, self-hosted AI gateway: 237 providers (90+ free), auto-fallback combos, and a 10-engine token-compression pipeline (MIT)

Builders-welcome post with the substance up front (disclosure: I'm the maintainer). OmniRoute is a free, MIT, self-hosted AI gateway — one OpenAI-compatible endpoint over 237 providers — built around two problems: runs dying on a provider 429, and tokens bleeding on tool/log output.

One endpoint, 237 providers — 90+ of them free. You point any tool or agent at a single OpenAI-compatible endpoint (localhost:20128/v1) and it can reach 237 LLM providers without you rewriting anything. 90+ have free tiers and 11 are free forever (no card), which aggregates to ~1.6B documented free tokens/month — and that's honest, pool-deduped math (we count each shared pool once instead of inflating it; the methodology is public in the repo). There's a one-command setup-* for 13+ coding tools (Claude Code, Codex, Cursor, Cline, Roo, Kilo, Gemini CLI…), so switching your existing setup over takes seconds.

Fallback combos — so it never stops mid-task. A "combo" is a ladder of models the router walks automatically: your subscription first, then API keys, then cheap models, then free ones. When a provider returns a 500 or you hit a rate limit, it slides to the next target in milliseconds, mid-request, and your tool never even sees the error. There are 17 routing strategies (priority, weighted, round-robin, cost-optimized, auto/coding:fast…) plus three resilience layers — a per-provider circuit breaker, a per-key cooldown, and a per-model lockout — so one dead key can't take down a whole provider.
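A rough sketch of the ladder-plus-cooldown idea in Python. This is not OmniRoute's code: the `FallbackLadder` class, the `RateLimited` exception, and the 30-second cooldown are hypothetical, and it only models one of the three resilience layers (a per-target cooldown).

```python
import time


class RateLimited(Exception):
    """Raised by a target when the provider answers with a 429 or a 5xx."""


class FallbackLadder:
    def __init__(self, targets, cooldown_seconds=30.0, clock=time.monotonic):
        self.targets = targets              # ordered list of (name, callable)
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.cooling_until = {}             # target name -> time it may be retried

    def call(self, prompt):
        """Walk the ladder top to bottom and return (target name, response)."""
        errors = []
        for name, send in self.targets:
            if self.cooling_until.get(name, 0.0) > self.clock():
                continue                    # still cooling down, skip without calling
            try:
                return name, send(prompt)
            except RateLimited as exc:
                # Park the failing target so later requests do not hit it again.
                self.cooling_until[name] = self.clock() + self.cooldown_seconds
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all targets failed or are cooling down: " + "; ".join(errors))
```

The caller only ever sees the response of whichever rung answered, which is the property the post describes: the error on the first rung stays inside the router.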

Fusion — an ensemble mode for the hard steps. Beyond simple routing, there's a fusion strategy that fans a single prompt out to a panel of different models in parallel and then has a judge model synthesize one best answer (mixture-of-agents, built in). It's cost-aware, so easy turns stay on one fast model and it only fuses when the step is worth it.

A 10-engine compression pipeline — the part most routers don't have. Every request flows through a transparent compression pass you can toggle/stack per combo. Instead of one trick, it stacks the best of the open-source ecosystem: RTK filters command/tool output (git diffs, test logs, builds) at 60–90%, Microsoft's LLMLingua-2 does ML semantic pruning, Caveman handles prose, session-dedup strips repeats across turns. Critically, code, URLs and JSON are preserved byte-perfect, and a default-on inflation guard throws the compressed version away and sends the original if compressing would actually grow the prompt — it never makes things worse. On tool-heavy sessions that's ~89% average input-token reduction (an 8k-token git diff becomes a few hundred). Full credit to every upstream project (RTK, Caveman, LLMLingua-2, Troglodita) is in the README.
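The inflation guard is simple to illustrate. The sketch below is Python and makes several assumptions: a whitespace split stands in for a real tokenizer, and `collapse_repeats` is a toy compressor invented for the example, not one of the engines named above.

```python
def count_tokens(text):
    """Stand-in tokenizer (whitespace split); a real gateway would use the model's tokenizer."""
    return len(text.split())


def collapse_repeats(text):
    """Toy compressor: collapse runs of identical lines into one line plus a repeat count."""
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        j = i
        while j < len(lines) and lines[j] == lines[i]:
            j += 1
        run = j - i
        out.append(lines[i] if run == 1 else f"{lines[i]} [x{run}]")
        i = j
    return "\n".join(out)


def compress_with_guard(text, compressor, counter=count_tokens):
    """Keep the compressed text only if it is strictly smaller; otherwise send the original."""
    candidate = compressor(text)
    if counter(candidate) >= counter(text):
        return text
    return candidate
```

For example, `"ok\nok\nok\nfail"` shrinks from four tokens to three and is sent compressed, while `"a\na"` would become `"a [x2]"` with no saving, so the guard falls back to the original.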

Agent-native — the agent can drive the router itself. There's a built-in MCP server (95 tools across 30 audited scopes, over stdio / SSE / streamable-HTTP), plus A2A (v0.3, JSON-RPC 2.0) support. That means an agent can query providers, switch combos, read its own remaining quota and manage memory through the gateway — not just consume tokens through it.

It's 100% local (zero telemetry, AES-256-GCM at rest), MIT-licensed, has a prompt-injection guard on every LLM route, opt-in memory, and runs on npm, Docker, desktop or your phone via Termux.

For context on whether it's worth your time: it's grown to ~9.8K GitHub stars, 1,490+ forks and 280+ contributors in ~4.5 months, with 21,000+ automated tests and 1,830+ issues closed — so it's a battle-tested project, not a brand-new experiment.

npm install -g omniroute

GitHub: https://github.com/diegosouzapw/OmniRoute · Site: https://omniroute.online

Would value a critique of the routing/compression architecture from this crowd.

u/ZombieGold5145 — 2 days ago

▲ 61 r/ContextEngineering+40 crossposts

Ask questions across your Markdown notes using a fully local Graph RAG engine. Built for Obsidian vaults, works with any folder of Markdown files. Extracts entity-relation triples from wikilinks & YAML frontmatter, retrieves answers via hybrid search (vector + BM25 + temporal). Multilingual. No cloud. Runs on Ollama.
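To make the triple extraction concrete, here is a small Python sketch. It is not Kwipu's implementation: the `links_to` relation name, the flat `key: value` frontmatter parsing (no nested YAML or lists), and the function names are assumptions for illustration.

```python
import re

# Matches [[Target]], [[Target|alias]] and [[Target#heading]]; captures only the target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def split_frontmatter(text):
    """Return (metadata dict, body). Handles only flat 'key: value' frontmatter lines."""
    if not text.startswith("---\n"):
        return {}, text
    end = text.find("\n---", 4)
    if end == -1:
        return {}, text
    meta = {}
    for line in text[4:end].splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta, text[end + 4:]


def extract_triples(note_name, text):
    """Emit (subject, relation, object) triples from frontmatter fields and wikilinks."""
    meta, body = split_frontmatter(text)
    triples = [(note_name, key, value) for key, value in meta.items() if value]
    triples += [(note_name, "links_to", target.strip()) for target in WIKILINK.findall(body)]
    return triples
```

Frontmatter keys become relation names and each wikilink becomes an edge from the note to its target, which is enough to build a graph without any LLM call.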

https://github.com/benmaster82/Kwipu

u/WritHerAI — 2 days ago

▲ 8 r/ContextEngineering+1 crossposts

Cross agents assistance/memory layer - ideal solution

My first post in a while, so bear with me.
A bit about myself: I exited a company in 2023. Since then I have worked on software architecture and, over the last couple of years, on AI architecture that helps an organization (mostly R&D) make better use of AI.

In a recent project, I was asked to build a knowledge layer for a small startup (10 R&D employees). I researched quite extensively (Supermemory, etc.), but everything looked like something that wouldn't last and that the devs wouldn't actually call from their agents.

Another issue: even if it worked, how would we use it for other agents, like a KB Slackbot that the sales team uses, or an SRE bot that needs to decide whether an event it sees in the logs is a bug or a feature?

So, bottom line, the project is part success, part failure. Not something I'm proud of. That got me thinking about how to effectively capture and share context across the organization with zero or minimal burden on people.

What I envision is how we did buddy training for a new employee (back in the old days...): we would sit a new employee next to a senior one (whether they liked it or not) and let them watch how the work is done and ask questions.

Taking notes on design choices
How to troubleshoot some problems
How to raise a local environment
Where to look for the ticket
What is a known issue that we should tackle later after we do X
What dashboard in Grafana has the important logs about this system
etc.

But instead of putting a person next to the developer, there is already an AI agent working with them.

Such a system (and I need your help defining it❤️) would:

Work on every agent type: coding, internal bot, framework, etc.
Capture and recall memories natively during the conversation with the AI agent
Capture and recall needs natively
Create and optimize workflows (skills) natively as we activate and feedback these workflows
Promote/Graduate memories/needs/skills from a local level to team/org level as they mature and get more traction
Share the collected memories/needs with other agents (plugin?)

Basically, doing compound knowledge growth via the conversations with AI agents

Would be happy to hear your thoughts.

u/Yarharel — 3 days ago

▲ 4 r/ContextEngineering+3 crossposts

REQL: a relational entities query language context engine for coding agents

I recently published REQL on GitHub after working on it for some time. It is a local repository context engine designed for coding agents and developer tools.

To clarify its positioning: REQL is not another graph database, graph framework, or graph visualization tool. It uses a graph internally to represent relationships between files, symbols, imports, calls, tests, documentation, and other repository elements, but the graph itself is not the product.

The project is intended to be embedded into existing workflows as a structured, end-to-end pipeline for repository indexing, incremental updates, querying, and context generation. The goal is to let tools and agents retrieve a compact, connected, and source-grounded view of a codebase instead of scanning the entire repository or relying only on whatever fits into a prompt.

REQL currently includes:

Tree-sitter-based analysis for more than 30 languages;
deeper extraction for Python, JavaScript, and TypeScript;
incremental compilation, caching, deletion handling, and watch mode;
a dedicated query language;
local storage without requiring an external graph database;
a CLI, Python API, and optional MCP server.

There are no mandatory LLM calls in the core indexing and retrieval pipeline.

The project is still in alpha and there are certainly areas that need improvement, but I decided to publish it because I hope it can already be useful to people working on coding agents, repository analysis tools, or structured context pipelines.

GitHub: https://github.com/sh1zen/reql

I would really appreciate feedback from anyone willing to test it on a real repository, especially regarding retrieval quality, unsupported project structures, integration issues, or anything that feels unnecessarily complicated.

I also hope some of you may find it useful enough to participate in its development. Issues, pull requests, and contributions are very welcome.

u/cl0wnfire — 2 days ago

▲ 2 r/ContextEngineering+1 crossposts

New, not-a-wrapper RAG engine: MuSiQue 1000Q multi-hop benchmark against HippoRAG2, BM25 and LlamaIndex

Been lurking and commenting here and there for a while, hinting at building something out of sheer frustration with the crappy state of context management in AI, especially as it relates to my day job in pharma and healthcare. So I just up and built a new-from-the-ground-up graph-based retrieval engine and ran it through MuSiQue - the 1,000Q set.

This is not a wrapper, not a Frankenstein mish-mash of open source code. Legit new architecture based on what I know best - biology. And I think I’m as qualified as they come as a PhD in biochemistry working in biotech and pharma nearing twenty years now.

Posting the full results, methodology, and limitations here because I actually have the balls to put it all out there - and the results are damn impressive, if I do say so myself.

And yes, the dry bits below are written with the help of AI (thank you Claude) because this is an AI-related sub.

Setup

Same corpus as HippoRAG 2: 1,000 questions and 11,656 Wikipedia passages from their published HuggingFace dataset (osunlp/HippoRAG_2). 496 answerable questions scored. Evaluation metric: SQuAD F1 — deterministic token-level precision/recall, no LLM judge involved. All comparators (BM25, LlamaIndex) run through the same reader model (Gemini Flash, temperature=0) on the same hardware to control variables.
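For readers unfamiliar with the metric, token-level F1 in the SQuAD style can be sketched as follows. This is a Python re-statement of the usual normalization (lowercase, strip punctuation and English articles) and is not the linked harness; it scores against a single gold string and does not take the maximum over answer aliases.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and the articles a/an/the, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def squad_f1(prediction, gold):
    """Harmonic mean of token precision and recall between prediction and gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A prediction of "Paris, France" against the gold answer "Paris" has precision 0.5 and recall 1.0, so it scores about 0.667 instead of the 0 an exact-match metric would give.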

The engine is a Rust-based sparse tensor graph that retrieves through associative activation pathways rather than pure vector similarity search. It runs as a single 12.5 MB binary. The entire benchmark was run on a laptop (i7, 16GB RAM, RTX 3050 Ti).

Results

Reader-controlled baseline (same reader, same embedding model across all three):

System	F1
BM25 (whitespace tokenization, top_k=50)	0.329
LlamaIndex (nomic-embed-text-v1.5, 768d)	0.418
Donna-Alfred (nomic-embed-text-v1.5, "Eager Mode")	0.565

With optimized configuration (stronger embedding model (Gemini) + reader reasoning enabled): F1 = 0.677. To the best of our knowledge as of May 2026, this is the highest published zero-shot end-to-end F1 on MuSiQue. Yeah. Good stuff.

Total benchmark cost: $30.04.

Now the honest part

The 0.677 number needs context that I’m not going to bury. Three things:

Reader confound. HippoRAG 2 used Llama-3.3-70B as their reader; I used Gemini Flash. Comparing BM25 baselines across readers (theirs: 0.288, ours: 0.329), roughly 52% of the raw F1 gap between our baseline and HippoRAG 2’s published 0.486 is attributable to reader advantage, not retrieval quality. The fairer comparison is BM25-relative retrieval lift — how much each system improves over BM25 using the same reader:

System	F1	BM25 (same reader)	Retrieval lift
LlamaIndex (Flash)	0.418	0.329	+27.1%
HippoRAG 2 (Llama-3.3-70B)	0.486	0.288	+68.8%
Donna w/ nomic (Flash)	0.565	0.329	+71.7%
PropRAG (Llama-3.3-70B)	0.524	0.288	+81.9%

PropRAG beats us on retrieval lift. +81.9% vs our +71.7%. We are not claiming to be the best retrieval system in the world for everything. That kind of thing just can't exist. We are claiming competitive retrieval quality at a fraction of the computational cost — our embedding model was 137M parameters vs NV-Embed-v2 at 7-8B.
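The lift and reader-share figures follow directly from the table. The Python below just redoes that arithmetic; the function names are made up for the example, and the numbers are the ones quoted in this post.

```python
def retrieval_lift(system_f1, bm25_f1):
    """Relative improvement over the BM25 baseline measured with the same reader."""
    return (system_f1 - bm25_f1) / bm25_f1


def reader_share_of_gap(bm25_a, bm25_b, system_a, system_b):
    """Fraction of the raw F1 gap between two systems already present in their BM25 baselines."""
    return (bm25_a - bm25_b) / (system_a - system_b)


# (system F1, BM25 F1 with the same reader), copied from the table above.
RESULTS = {
    "LlamaIndex (Flash)": (0.418, 0.329),
    "HippoRAG 2 (Llama-3.3-70B)": (0.486, 0.288),
    "Donna w/ nomic (Flash)": (0.565, 0.329),
    "PropRAG (Llama-3.3-70B)": (0.524, 0.288),
}

if __name__ == "__main__":
    for name, (f1, bm25) in RESULTS.items():
        print(f"{name}: {retrieval_lift(f1, bm25):+.1%}")
    share = reader_share_of_gap(0.329, 0.288, 0.565, 0.486)
    print(f"reader share of the Donna vs HippoRAG 2 gap: {share:.0%}")
```

The reader share is (0.329 - 0.288) / (0.565 - 0.486) = 0.041 / 0.079, which is where the "roughly 52%" comes from.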

Supervised systems score higher. Beam Retrieval (Zhang et al., NAACL 2024), fine-tuned on MuSiQue’s own training data, reaches 0.692. Our engine is zero-shot — no task-specific training. The gap is 1.5 F1 points.

What the engine is NOT

It’s not open-source. It’s proprietary and patent-pending. I’m not releasing code, binaries, or API access. I will be opening up slots for alpha testers in the near future though, so stay tuned.

What IS public: the benchmark methodology, the dataset (HippoRAG 2’s published corpus on HuggingFace), the evaluation protocol, and the evaluation harness. The eval harness is here: https://github.com/wonker007/musique-eval-harness

Per the original protocol, the scoring metric is deterministic. Anyone can reproduce the comparator arms and verify the methodology claims independently.

I built this solo using AI - lots of AI. Claude, Gemini, Perplexity (well, Perplexity technically isn't AI but why not give a shoutout - RIP), ChatGPT. Part of me wants this to be proof that vibe coding can actually produce production quality software, although with over 1,300 quality and governance documents weighing in at over 145 MB (not code, just the markdown documentation part), it isn't exactly "vibe" coding per se. FYI, quality management principles were borrowed from my wheelhouse of pharma and diagnostics manufacturing.

As I said, my background is biochemistry and pharma commercial strategy, not CS. The architectural approach is neurobiology-inspired - associative activation over a sparse tensor graph, same way biological neural networks process and retrieve by spreading activation through synapse connections of varying affinities and through several different neurotransmitters. The CS establishment will probably hate this claim because there are so many kids claiming to have solved RAG by “modeling after biology and the brain”. But I actually have the credentials to back my claim up.

But the thing is, F1 doesn’t care about your pedigree or your claims, and neither does MuSiQue. This is hard data from hard code, plain and simple.

I say bring your benchmark data in with full transparency if you want to play with the big boys.

What I’m looking for from this community

Methodological criticism. If the experimental design has a flaw, I want to know. If there’s a comparator I should be running against, tell me. If the reader confound analysis is insufficient, challenge it. The full write-up with all the numbers, per-hop breakdowns, the 2×2 optimization matrix, production calibration curves, and the data sovereignty argument for single-binary deployment is here: https://elucidx.ca/insights/2026-05-15-rag-needs-real-value/

I’m also working toward formalizing this for peer-reviewed publication and running additional benchmarks as we speak (conversational RAG at 128K-10M token scale). More data coming.

And if you’re really interested, as I mentioned, I’m planning to open up alpha testing in the near future, probably when I finish up the conversational benchmark. Only serious enterprise-level engineers need apply - it’s a highly-customizable drop-in Rust-based RAG engine with 70+ tunable variables on a clean API surface.

u/wonker007 — 3 days ago

▲ 1 r/ContextEngineering

I was getting frustrated with how AI coding agents navigate large repos, so I started building some helper scripts

I've been spending a lot of time using Codex and Antigravity on a fairly large Laravel + React project.

After a while I noticed the same patterns over and over again.

The agent would:

read way more terminal output than necessary
dump huge files just to inspect a single function
repeat similar searches across multiple folders
burn through context on information it never actually used
end up asking for approval dozens of times because of lots of tiny shell commands

The models themselves weren't really the problem. The workflow was.

So I started writing a small set of PowerShell helper scripts to guide repository navigation instead of letting the agent freely explore everything.

Things like:

compacting noisy build/test output
investigating a feature across multiple folders with a single command
reading specific symbols instead of entire files
keeping searches focused
reducing repeated repository exploration
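As an illustration of the first item, here is a sketch of an output compactor. It is Python, not the PowerShell scripts described in this post, and the keyword pattern, context window, and line cap are arbitrary assumptions.

```python
import re

# Hypothetical signal words; a real script would tune this per toolchain.
KEEP = re.compile(r"error|fail|warning|exception", re.IGNORECASE)


def compact_output(output, context=1, max_lines=40):
    """Keep lines that look like problems plus a little context; summarize the rest."""
    lines = output.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if KEEP.search(line):
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    kept = sorted(keep)[:max_lines]

    result = []
    previous = -1
    for i in kept:
        if i != previous + 1:
            # Replace each skipped stretch with a one-line marker.
            result.append(f"... ({i - previous - 1} lines omitted)")
        result.append(lines[i])
        previous = i
    trailing = len(lines) - 1 - previous
    if trailing > 0:
        result.append(f"... ({trailing} lines omitted)")
    return "\n".join(result)
```

Piping build or test output through something like this means the agent reads the failure and its neighbours rather than hundreds of "compiling" lines.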

I'm still experimenting with the workflow, but it's already made a noticeable difference for me.

I'm curious how everyone else is approaching this.

Do you just let your agent explore freely, or have you built your own tooling/rules to keep context usage under control?

If people are interested, I'm happy to share what I've built in the comments.

u/yxf2y — 3 days ago

▲ 20 r/ContextEngineering+1 crossposts

AMA with MongoDB: Max Marcon (Director of Product), Mikiko Bazeley (Staff Developer Advocate), and Yang Li (Senior Solutions Architect). They work on AI agents in production. Ask them anything about context engineering at our AMA next Wednesday (7/8)!

Hi r/ContextEngineering!

I’m Nina (u/ContextualNina), your friendly AMA moderator for next week, the inaugural AMA for this subreddit! I’m excited to introduce the three people who will be taking all of your questions for our upcoming AMA: Max Marcon (u/mmarcon), Mikiko Bazeley (u/mmbaze), and Yang Li (u/Ok-Amphibian6116). Between the three of them, they spend a lot of time working with teams building AI agent systems that need to hold up in production.

Ask them anything during a live AMA right here on Wednesday, July 8 from 12-1 PM ET (9-10 AM PT). The real tradeoffs, the messy parts, AI hype vs. reality - whatever you’ve got.

I invited this group because they work directly on the data layer for production AI agents, which gives them a pretty grounded view of where things get hard: context design, retrieval quality, memory, state, multi-step workflows, and the parts of agent systems that tend to fail outside of demos.

We’ll be answering questions about:

Where context engineering ends and memory engineering begins
What “context rot” looks like as context gets longer
How to think about memory in multi-agent systems
When RAG beats long context, and when it doesn’t
The context mistakes that can quietly sink agent systems in production

You can start dropping in questions now ahead of time (they’ll answer them during the live window), or ask them live next Wednesday!

Full disclosure: I’m the founding mod of this subreddit, and I recently started at MongoDB. I thought this subreddit could benefit from chatting with some of my new colleagues.

https://preview.redd.it/cclnm62oqoah1.jpg?width=720&format=pjpg&auto=webp&s=29a99450aedd525142f33da5dfef545874c8715a

https://preview.redd.it/dlgzx0dpqoah1.jpg?width=720&format=pjpg&auto=webp&s=695190cb7e5bce175ea56ab7726899f1dd6a1d7b

https://preview.redd.it/tecqw15qqoah1.jpg?width=1440&format=pjpg&auto=webp&s=256dcfaa205a8e94161dfdb3fb5997784ad7d196

u/ContextualNina — 3 days ago

▲ 5 r/ContextEngineering

I built a live repo-map endpoint for coding agents: context before generation

I’ve been working on SigMap, and the newest part is SigMap Live: a public demo/API-style endpoint where you paste a GitHub repo and get a verified context map back.

The problem I’m trying to solve:

AI coding agents often waste a lot of context before they even start editing. They first need to work out:

where the relevant files are
which functions/classes matter
what parts of the repo can be ignored
whether the answer is grounded in real code or just plausible

SigMap Live turns a public repo into a compact signature map first.

Flow:

Paste GitHub repo
→ detect source folders
→ extract function/class signatures
→ redact obvious secrets
→ rank relevant files
→ ask the codebase / judge groundedness / adapt for agents
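
To make the "extract function/class signatures" step concrete, here is a deliberately crude sketch. It is not SigMap's implementation: a single regex stands in for real parsing, and `extractSignatures` and `DECL` are names invented for this example.

```typescript
// Toy extractor: matches one-line function and class declarations only.
// A real tool would use a parser; the regex is an assumption for illustration.
const DECL = /^\s*(?:export\s+)?(?:async\s+)?(function\s+\w+\s*\([^)]*\)|class\s+\w+)/;

function extractSignatures(source: string): string[] {
  const signatures: string[] = [];
  for (const line of source.split("\n")) {
    const sig = DECL.exec(line)?.[1];
    // Collapse internal whitespace so the map stays compact.
    if (sig) signatures.push(sig.replace(/\s+/g, " "));
  }
  return signatures;
}
```

Function bodies, imports, and comments never reach the map, which is where most of the token reduction in an approach like this would come from.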

The live routes include:

POST /api/analyze   repo URL → verified context map
POST /api/ask       context map + question → grounded answer
POST /api/query     plain-English query → ranked files, no LLM
POST /api/judge     answer + context → groundedness score
POST /api/adapt     convert map for Cursor, Claude Code, etc.
GET  /api/benchmark repo URL → before/after token stats

Current benchmark page reports:

405 repos evaluated
321 supported repos in the headline token result
98.7% overall token reduction
95.6% average per-repo reduction
51 real coding tasks
96× cheaper context
82.4% retrieval hit@5 with BM25 re-ranking

One thing I’m intentionally not claiming: agent wall-clock speedup. The latest A/B result was too close to call, so the proven value right now is smaller, cheaper, better-ranked context — not “agents are definitely faster.”

Demo: https://sigmap-live.vercel.app/demo

Live repo: https://github.com/manojmallick/sigmap-live

Benchmark suite: https://github.com/manojmallick/sigmap-benchmark-suite

Core CLI: https://github.com/manojmallick/sigmap

Question for people building or using coding agents:

Would you rather consume this as:

a hosted endpoint,
an MCP server,
a CLI step before the agent starts,
or something that writes directly into AGENTS.md / CLAUDE.md / Cursor rules?

u/Independent-Flow3408 — 4 days ago

▲ 2 r/ContextEngineering+2 crossposts

GOAT 2.0 — Proactive Memory Demo

Fresh session.

First message: "Goat" — one word, essentially no semantic retrieval signal.

Second message: "Ce ai notat mă?" (Romanian, roughly "What did you note down?"), which is ambiguous: no topic, no keywords.

The prefetch daemon runs as the first step on every turn, before the LLM call, retrieving from episodic memory concurrently with context assembly — independent of what the user said.

Result:

source_tier: episodic

results_found: 15

results_used: 10

tokens_l3: 1533

latency_search: 0.234s

Retrieval was not driven by the semantic content of the query, but by the daemon running proactively on every turn regardless of input.
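
A minimal sketch of that ordering, assuming an async runtime: retrieval and context assembly start together, and retrieval is keyed on the session rather than on the message text. `prefetchEpisodic`, `assembleContext`, and `runTurn` are hypothetical stand-ins, not the actual GOAT code.

```typescript
type Memory = { id: string; text: string };

// Hypothetical stand-in for the episodic store. Note that it never sees the
// user's message: it retrieves by session, which is the point of the demo.
async function prefetchEpisodic(sessionId: string, limit: number): Promise<Memory[]> {
  return [{ id: "m1", text: `note from ${sessionId}` }].slice(0, limit);
}

// Hypothetical stand-in for context assembly.
async function assembleContext(userMessage: string): Promise<string> {
  return `user: ${userMessage}`;
}

async function runTurn(sessionId: string, userMessage: string): Promise<string> {
  // Both promises are created before either is awaited, so the two steps overlap.
  const [memories, context] = await Promise.all([
    prefetchEpisodic(sessionId, 10),
    assembleContext(userMessage),
  ]);
  const memoryBlock = memories.map((m) => `- ${m.text}`).join("\n");
  return `${memoryBlock}\n${context}`;
}
```

With this shape, a one-word message like "Goat" still gets a populated memory block, because nothing in the retrieval path depends on it.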

Raw logs below. No edits.

I'm interested in technical criticism. If you think this would fail under a specific scenario, tell me which one.

u/Takashikiari — 4 days ago

▲ 1 r/ContextEngineering

Free, private, cross-tool "Context OS" layer that lives on your Git repo

or can be pure text.

[edit: link might help]

https://readthemanifest.net/

It's free, no sign-up, and fully private. Context lives as plain markdown in a GitHub repo you own, with a plain-text layer of operating rules ("Context OS"). Nothing on my servers, nothing to log into. You can open it in any text editor and point any model at it: Claude, ChatGPT, Cursor, whatever.

It's worked well for me; hoping it's useful to someone else too. Happy to answer anything, and genuinely open to feedback.

u/ItsSillySeason — 5 days ago

▲ 1 r/ContextEngineering

Google quietly dropped a new open standard for AI agents in June 2026. Most people missed it. It's called OKF.

Been diving deep into agent memory architecture lately and stumbled on OKF - Open Knowledge Format - published by Google Cloud on June 12th. It's gotten way less attention than it deserves.

The core idea is simple: instead of explaining your codebase/systems to an AI agent every single session, you build a .okf/ directory of markdown files with YAML frontmatter that any agent can read. One required field (type). No SDK, no schema registry, no vendor lock-in. Just files.

What makes it interesting vs. just using CLAUDE.md or AGENTS.md:

It's a knowledge graph, not a flat list - concepts link to each other via plain markdown links
Versioned in git next to your code
Works across any agent (Claude Code, Cursor, Codex, 20+)
Karpathy's LLM wiki gist basically predicted this pattern; Google just formalized it

I wrote two pieces on it if anyone wants to go deeper:

Part 1 - What OKF is and how it works: Google Just Quietly Released the Missing Piece for AI Agents. It's Called OKF.

Part 2 - OKF + RAG together (when to use each, hybrid architecture): Your AI Agent Has Two Memory Problems. OKF Solves One. RAG Solves the Other.

The OKF vs RAG breakdown is the part I found most useful - they're not competing, they solve different memory problems. OKF handles your "known-knowns." RAG handles the large unstructured corpus. Most production stacks need both.

Curious if anyone here is already using something like this pattern.

u/Akhil_vallala — 5 days ago

▲ 5 r/ContextEngineering+4 crossposts

Memory Abstraction Layer: MAL is HAL concepts applied to agentic memory systems

I am a mechanical engineer by trade. I build CNC robots. In that world, two things cause errors and crashes: bad program instructions and noise. A programmatic error comes from a bug, either in the control system or in the subprogram instructions the machine is running. Noise is electrical: EMI out of circuit coupling, current taking a path it should not because of impedance back to the source. One is a fault in what you told the machine to do. The other is the environment corrupting a signal that was clean when it left.

I have run LinuxCNC for years. It uses a system called HAL, the Hardware Abstraction Layer, to define and control the machine. HAL is how you describe every pin, signal, and component, then wire them into one running system you can read off a page.

When I started pulling AI into what I do, the biggest hurdle was not a new problem. It was the same two failure modes in different clothes. A model gives you bad instructions when its context is wrong, and it drifts when the known-good state degrades over time, which is just noise corrupting a signal that used to be clean. Keeping the model's current state accurate, and stopping the good state from rotting, was the whole fight.

So I treated it like a machine fault. I put my critical thinking, problem solving, and diagnostic troubleshooting to work on it the same way I would on a crash on the shop floor. The result is MAL, the Memory Abstraction Layer, the functional layer of how Recall works. It is a distillation of what I already knew, applied to AI systems and accelerated by AI to fill the gaps in my knowledge and write the harder code syntax for me.

MAL is HAL one layer up. Instead of abstracting hardware, it abstracts memory. It is not a literal port, not HAL's wiring copied onto a database pin for pin. It is the concept of how HAL works, the whole pattern of pins, signals, components, and a scheduler, applied to an AI's durable memory. HAL controls a machine. MAL controls the thing that kept breaking when I put AI on the bench: the state carried across each user and AI turn.

Status: this is implemented as a running Recall prototype, not just an architecture sketch. The screenshot shows the Recall panel operating against a persistent graph, and the code snippets later in this post show the four boundaries that matter: compiling a mini-index, expanding selected cells, writing claims through an admission gate, and running deterministic recomputation outside the model. The full source is not published here, so read this as a prototype disclosure rather than a reproducible benchmark.

https://preview.redd.it/26rrrkzae8ah1.png?width=1601&format=png&auto=webp&s=06cab6cc448afdacec61f68d6edaf92f437aeb83

Recall running inside the local agent workspace. The Recall panel is connected to a SQLite-backed graph, showing 1,148 cells, 1,143 relations, active memory-in-use cards, compile/search/write controls, and a 900-word compiled memory budget. This screenshot demonstrates the working interface; the snippets below show the MAL loop underneath.

What it actually does, one turn at a time

MAL is a control system, and the thing it controls is the user-and-AI exchange. Each turn is one cycle. The per-turn protocol has five beats: push, expand, work, write-back, tick. A session primes once at the start, then every turn runs the cycle.

Push. A prompt arrives. Before the model sees it, a hook pushes a mini-index: a short list of candidate cells, each shown as an id, a title, a compact score row, and any flags. Not the contents, just the headers. The lines look like this:

67ee107d [decision] Recall v5 architecture named: MAL (Memory Abstraction Layer)
b63c2d54 [decision] MAL offloads the work: model states claim + confidence [SUPERSEDED?]

Expand. The model reads by title and pulls the full body of only the few cells worth reading; the rest stay as one-line headers. A 200-cell graph and a 200,000-cell graph cost the model the same amount here, because it only ever reads the slice it asked for. If a row carries a flag (stale, challenged, superseded), the model has to open that cell before it can act on the topic. That rule is enforced, not suggested: skip the dig and the turn is blocked until it is done.
Work. The model does the real task with the expanded cells in hand.
Write-back. On the way out, the model writes what it learned. Its entire authoring job is a claim (a kind, a title, a body) and one calibrated confidence number, plus the edges it intends. If the new fact corrects an old one, it points a contradicts edge at that old cell's id, and the old cell loses standing. The model never hand-formats the notation or computes a score. The builder and the admission firewall do that.
Tick. Between turns, with no model running, a deterministic operator pass recomputes the scores, currency, salience, and the standing signals. When the next prompt arrives, the push already reflects the new state.
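
The push and the expand rule can be sketched like this. It is a simplified illustration, not the Recall source: `IndexRow`, `renderRow`, and `unexpandedFlags` are hypothetical names, the flag text is made up, and the gate here checks every flagged row in the push rather than only rows relevant to the topic.

```typescript
interface IndexRow {
  id: string;
  kind: string;
  title: string;
  flagged: boolean; // stale, challenged, or superseded
}

// Push: render headers only, never cell bodies.
function renderRow(row: IndexRow): string {
  const flag = row.flagged ? " [EXPAND REQUIRED]" : "";
  return `${row.id} [${row.kind}] ${row.title}${flag}`;
}

// Gate: list the flagged rows the model has not opened yet.
// An empty result means the turn may proceed.
function unexpandedFlags(rows: IndexRow[], expanded: Set<string>): string[] {
  return rows.filter((r) => r.flagged && !expanded.has(r.id)).map((r) => r.id);
}
```

A hook would call `unexpandedFlags` before releasing the turn and block while the list is non-empty, which is what makes the rule enforced and not merely suggested.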

The hooks that close the loop

The five beats are not something the model remembers to do. They fire on their own, driven by three hooks at three moments. In HAL terms, the hooks are the thread: the scheduler that runs the parts in order, every cycle, whether or not anyone is paying attention.

Session start (orient). Once per session, before any work, a hook injects the operating manual: how the memory works and what the graph is about. It is inject-only. It primes the context window and then gets out of the way.
Prompt submit (push). On every prompt, before the model runs its forward pass, a hook pushes the mini-index: the seed cells, their flags, and a few terse reminders. This hook has teeth. It can block, so a flag like "expand required" is not a polite request. It also nudges the model to consider standing up a recurring read as its own op during the turn, before write-back.
Stop (write-back and backstop). After the answer, a third hook handles the end of the turn. It is the wrong place to prime anything, because the pass is already done, so its job is the opposite: make sure the turn wrote back what it learned, and refuse to release the turn if a flagged cell was never opened.

Between turns, with no model in the loop at all, the deterministic tick runs the ops and recomputes the signals. Orient before the session, push before the pass, write-back after it, tick between turns. That is the whole schedule, and the model only occupies the middle of it.

One rule keeps the hooks lean. The expensive, stable content (what every op means, how the addressing works) is taught once, in a single map cell inside the graph. The per-turn push never re-explains any of it. It only points, carrying the cheap, changing part: which cells are in play this turn and which ones are flagged. Teach once in the graph, reference tersely every turn. It is the same split as keeping the operating manual as cells instead of as a string baked into a hook.

The concept, mapped from HAL to memory

The reason HAL was the right thing to copy is that its parts already have clean jobs, and every one of them has a memory counterpart. This is the correspondence, not a literal rewrite:

HAL	MAL
pin	a cell field
signal	an addressable value (a derived field has one owning op, for tick determinism)
component	an op (watch, watchdog, trend, drift, quorum, score, reflex, smooth, clamp, latch, route, fanout, snapshot, record, replay, pid, oneshot)
thread	the operator tick, running between turns
net (the wire)	the dotted address
netlist (the .hal file)	the memory netlist

In HAL you wire components to signals on a thread and you get a machine you can read off one file. In MAL you wire ops to values on the tick and you get a memory you can read off one netlist. The structure carried over. What changed is what flows through it.

Why a control layer is the right shape

The analogy is not decoration. It holds because the two problems are the same problem.

A control system exists to keep a process in a known-good state against two enemies: bad commands and noise. On the machine, a bad command is a buggy instruction in the program, and noise is EMI corrupting a signal that left clean. The whole job of HAL is to make the machine legible enough that you can see both coming: every signal named, every connection on the page, a scheduler keeping the readings current.

Memory degradation in an AI is the same two enemies under different names. A bad command is a wrong or stale fact entering the model's context. Noise is drift: the known-good state decaying as new, weaker, or contradictory claims pile up over time. Left alone, both corrupt the state the model acts on, the same way they corrupt a machine. So the fix has the same shape: name every value, keep the wiring legible, reconcile conflicting inputs into one trustworthy reading, catch the bad state and replace it on the record, and run a scheduler that keeps the picture current between moves.

That is why a hardware abstraction layer, of all things, was the right pattern to lift. Not because memory is like hardware, but because keeping memory accurate is a control problem, and HAL is a control-system design that already solved the legibility and scheduling parts. MAL is that design pointed at the state of the user-AI exchange instead of at motors.

Where MAL leaves HAL behind

A concept is only worth borrowing if you are honest about where it stops fitting. Three places MAL departs from HAL, and they are the interesting part.

Many writers, one reader. This is the inversion, and it is the heart of it. HAL is one writer, many readers: one pin drives a signal, many components read it, and the value is whatever the writer put there. MAL is the opposite. Many actors write to a cell over time, claims, edges, supersessions, from different agents and different sessions, and there is one reader: the single agent reading the compiled slice this turn. Because the writers are many and fallible, the value a cell shows is not any one writer's number. It is a reconciliation. This is why a cell has both a stated confidence and an effective confidence, and why they differ: stated is what a writer claimed, effective is what survives calibration, support, and contradiction once everyone's contributions are weighed.

The edges are real and directional. HAL draws arrows on its signals but ignores them, because in hardware the direction of flow is already implied by who writes and who reads. MAL edges carry meaning, so direction is load-bearing. a > b is the directed edge from a to b; a < b is from b to a. A supports edge and a contradicts edge pointing the same direction do very different things to the effective value downstream.

Versions and supersession. HAL is a flat wiring layer with no history. MAL has a time axis: a cell can be superseded, and the supersede chain is addressable by version (@vN). A correction does not overwrite the old value; it demotes it and records the replacement, so a later reader sees both the current fact and the one it replaced, plus why. That is the whole defense against the known-good state quietly rotting: nothing good gets silently overwritten, it gets superseded on the record.
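
A small sketch of supersede-without-overwrite, assuming an append-only list per cell. `SupersedeChain` and its methods are hypothetical names for this example; the real storage is a graph, not an in-memory class.

```typescript
interface Version {
  v: number;
  body: string;
  reason?: string; // why this version replaced the previous one
}

class SupersedeChain {
  private versions: Version[] = [];

  constructor(initialBody: string) {
    this.versions.push({ v: 1, body: initialBody });
  }

  // A correction appends a new version; earlier entries are never rewritten.
  supersede(body: string, reason: string): number {
    const v = this.versions.length + 1;
    this.versions.push({ v, body, reason });
    return v;
  }

  // The current fact is always the last entry.
  current(): Version {
    return this.versions[this.versions.length - 1]!;
  }

  // Address a point on the chain, in the spirit of @vN.
  at(v: number): Version | undefined {
    return this.versions.find((x) => x.v === v);
  }
}
```

A later reader can call `current()` for the live fact and `at(1)` for what it replaced, with the reason attached to the newer entry.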

Put together, these are why MAL is a control system and not just storage. It does not only hold the state of the user-AI exchange; it reconciles many fallible inputs into one trustworthy reading, keeps direction and history, and recomputes the picture every tick.

The notation

Because the rendered graph is meant to be read by sight, MAL has its own small language, modeled on HAL's. It has a lexicon (the words) and a grammar (the sentences).

The lexicon

Handle: kind_hex, a three-letter kind prefix and a short hex tag, like dec_a3ee for a decision. ALLCAPS marks an immutable cell (RECALL_v5); lowercase is mutable.
Separators, by how tightly they bind: _ joins words inside one name; - walks a field within a cell (dec_a3ee-scores-eff); . crosses an edge to a neighbor (dec_a3ee.supports), so the number of periods is the number of graph hops.
Values: written field(value). A ! inside marks an immutable number (conf(.7!)); bare is mutable. Types are float for scores and bit for actuators.
Version: @vN is a point on the supersede chain. Wildcard: .* fans out over every neighbor through an edge (dec_a3ee.supports.*).
Expand-required: a leading ^ in the mini-index means the cell is superseded, stale, or challenged, and the model must expand it before use (^dec_a3ee ...). That caret is the dig flag from the loop above, written in one character.
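
Since the reader for this notation is not written yet, the following is only a sketch of how an address could be split along those separators. `parseAddress` and the `Address` shape are hypothetical, and two things are assumptions: that the version tag attaches directly to the handle, and that field walks after an edge hop can be ignored.

```typescript
interface Address {
  expandRequired: boolean;
  handle: string;
  version: number | undefined;
  fields: string[];   // segments after "-" inside the cell
  edgePath: string[]; // segments after each "."
  hops: number;       // one hop per period, per the lexicon
}

function parseAddress(input: string): Address {
  const expandRequired = input.startsWith("^");
  const rest = expandRequired ? input.slice(1) : input;
  // Periods bind loosest: split on them first to get the edge path.
  const [head = "", ...edgePath] = rest.split(".");
  // Inside the head, "-" walks fields within the cell.
  const [cell = "", ...fields] = head.split("-");
  // Assumption: "@vN" sits directly on the handle.
  const [handle = "", versionTag] = cell.split("@v");
  return {
    expandRequired,
    handle,
    version: versionTag ? Number(versionTag) : undefined,
    fields,
    edgePath,
    hops: edgePath.length,
  };
}
```

Underscores need no handling because they only join words inside a name, so `dec_a3ee` passes through as one handle.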

The grammar

The sentences follow HAL's halcmd style. Tokens are separated by a single space, the name comes first, and connections follow. A quoted "..." string is one token, exempt from the space rule, used for free text like a title or body. A # runs to end of line as a comment. Direction with < and > is meaningful.

The sentence forms:

form	shape	example
wire (net)	`net <signal> <target> <inputs>...`	`net eff dec_a3ee < conf calib supports.* contradicts.*`
set (setp)	`<addr> = <value>`	`dec_a3ee-flags-annexed = true`
schedule (addf)	`addf <op> tick`	`addf contradiction-load tick`
edge	`<source> <relation>> <target> (<weight>)`	`dec_a3ee supports> dec_signals_a2b7 (+.6)`
render (read)	`<handle> "<title>" <field(value)>... <relation>-><target>(<w>)...`	see below
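
As a rough illustration of the grammar rules (space-separated tokens, quoted strings as one token, # comments) and of the edge form in the table, here is a sketch of a tokenizer and an edge parser. This is not the planned reader: `tokenize` and `parseEdge` are hypothetical, the tokenizer is lenient about repeated spaces, and only the forward `>` direction is handled.

```typescript
function tokenize(line: string): string[] {
  const tokens: string[] = [];
  let i = 0;
  while (i < line.length) {
    const ch = line[i];
    if (ch === " ") { i++; continue; }
    if (ch === "#") break; // a comment runs to end of line
    if (ch === '"') {
      // A quoted string is one token, spaces included.
      const end = line.indexOf('"', i + 1);
      if (end === -1) throw new Error("unterminated string");
      tokens.push(line.slice(i, end + 1));
      i = end + 1;
      continue;
    }
    let j = i;
    while (j < line.length && line[j] !== " ") j++;
    tokens.push(line.slice(i, j));
    i = j;
  }
  return tokens;
}

interface Edge { source: string; relation: string; target: string; weight: number }

// Parses: <source> <relation>> <target> (<weight>)
function parseEdge(line: string): Edge | null {
  const t = tokenize(line);
  if (t.length !== 4) return null;
  const [source, rel, target, w] = t as [string, string, string, string];
  const m = /^\(([+-]?\d*\.?\d+)\)$/.exec(w);
  if (!rel.endsWith(">") || !m) return null;
  return { source, relation: rel.slice(0, -1), target, weight: Number(m[1]) };
}
```

For example, `parseEdge("dec_a3ee supports> dec_signals_a2b7 (+.6)")` yields the source, the relation `supports`, the target, and a weight of 0.6.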

A netlist snippet

Here is one cell rendered in read form, then wired and scheduled in write form:

# a cell, rendered: handle, title, scores, then edges
dec_a3ee "add watchdog op" conf(.7!) unc(.10) eff(.61) curr(.9) sal(.5) annexed(0) pinned(0)
  supports> dec_signals_a2b7(+.6)  contradicts> obs_9c1f(-.8)

# wire the effective-confidence signal on it (write form)
net eff dec_a3ee < conf calib supports.* contradicts.*

# declare an edge (direction: > forward a to b, < reverse)
dec_a3ee supports> dec_signals_a2b7 (+.6)

# fire an actuator
dec_a3ee-flags-annexed = true

# schedule a between-turn signal onto the tick
addf contradiction-load tick

Read the top line and the many-writers-one-reader idea becomes concrete. conf(.7!) is the stated confidence, immutable, what the author claimed. eff(.61) is the effective confidence, mutable, what is left after calibration plus the +.6 support and the -.8 contradiction are reconciled. The reader gets .61, not .7. The net eff line is the wiring that produces it: the effective signal is a function of the stated confidence, the writer's calibration, and the fan-out over every supporting and contradicting edge.

What the language does not do

The grammar wires ops; it does not define their math. The formulas (the effective-confidence reconciliation, the per-type currency decay, the allocation-pressure math) live inside the ops, the way a HAL component's math lives in compiled C and not in the .hal file. The language only connects pre-built ops to values and to the tick. The one op you can configure without code is the reflex, set with a truth-table personality rather than a formula, so even user-defined boolean logic needs no expression language. That keeps the surface small on purpose.

Status of the language. Be clear about what runs. The graph renders to this notation today, but one direction only: graph to text. A parser and loader that read a netlist back into a wired graph are specified here and not yet written. That reader is the next piece, and its acceptance test is a round trip: render the graph, parse it, load it, render again, and require the two renders to match. The model never reads the netlist either way; it reads the compiled slice. The netlist is for human audit and for tooling such as replay, diff, and version control.

Borrowing the next layer: components

Everything so far buys one thing: a durable, structured state with a gate on what gets in, where admission has the same shape no matter who wrote it. Every claim, from any actor, any agent, any session, goes through the one firewall and comes out in the one contract. That uniformity is not a nicety. It is the precondition for the next borrow from HAL.

Here is why. In HAL, a component can read a signal without knowing or caring which component drives it, because every signal is a typed value with one shape. That is the only reason you can wire a deterministic component to a wire and trust what it reads. MAL gets the same guarantee from the admission gate: many writers, one shape. Once a value is guaranteed to have that shape regardless of author, a deterministic subprogram can wire to it and run on it safely. The gate is what turns a pile of claims into clean signals.

So you can take the second layer of HAL, the components. In HAL a component is a small compiled subprogram that reads signals, computes something, and drives other signals, all scheduled on the thread. In MAL a component is the same idea over memory: a small deterministic program that reads cell values, computes something more involved than a single score, and either writes a derived value back or fires an actuator, scheduled on the tick between turns. No model runs inside one, the same way no model runs inside any op.

The ones I wired up are the controls-room set: a watch that trips on a threshold, a trend that takes the rate and acceleration over a series of cells, a drift that measures a value against a pinned baseline, a quorum that fires on k-of-m agreement, a score that rolls a metric. The boolean logic is one configurable component, a reflex, that covers the whole and2, or2, xor2 family with a truth table instead of a formula. That is what lets you connect them the way you connect logic on a machine: wire two watches through an or2 so the alert trips if either condition goes bad, latch it so it stays tripped across turns, fan it out to a severity readout. A tripwire is that composition given a job: a deterministic condition that stays silent until it trips, so silence itself becomes the all-good signal, and the only thing that ever speaks up is a real change.
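
The watch, or2, latch composition described above can be sketched in a few lines. This is an illustration and not the Recall ops: the names, the "trips when above threshold" direction, and the threshold values 0.2 and 0.05 are assumptions.

```typescript
type Bit = 0 | 1;

// watch: trips when a value exceeds its threshold (direction is an assumption).
const watch = (value: number, threshold: number): Bit => (value > threshold ? 1 : 0);

// reflex: a two-input boolean op defined by a truth table, not a formula.
// Outputs are listed for inputs 00, 01, 10, 11.
type TruthTable = readonly [Bit, Bit, Bit, Bit];
const AND2: TruthTable = [0, 0, 0, 1];
const OR2: TruthTable = [0, 1, 1, 1];
const XOR2: TruthTable = [0, 1, 1, 0];
const reflex = (table: TruthTable, a: Bit, b: Bit): Bit => table[(a << 1) | b]!;

// latch: once tripped, stays tripped across ticks until reset.
class Latch {
  private state: Bit = 0;
  update(input: Bit): Bit {
    if (input === 1) this.state = 1;
    return this.state;
  }
  reset(): void {
    this.state = 0;
  }
}

// Tripwire: two watches through an or2, latched. Runs on the tick, no model.
const alertLatch = new Latch();
function tick(drift: number, errorRate: number): Bit {
  return alertLatch.update(reflex(OR2, watch(drift, 0.2), watch(errorRate, 0.05)));
}
```

Swapping `OR2` for `AND2` or `XOR2` changes the logic without touching any code path, which is the appeal of configuring the reflex with a table.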

This is where the memory stops being a place you read from and starts being a system that watches itself. The components run between turns whether or not anyone asked. A threshold passes, a webhook fires, and a decision that drifted out of its known-good band tells you on its own.

It is not rebuilt every turn

A fair worry about a stateless model is that it has to stand the whole apparatus up again on every fresh turn. It does not. The system persists in the store and in the deterministic tick, both of which run between turns with no model involved. The only thing that is fresh each turn is the model's working context, and rebuilding that context is exactly the cost MAL removes. Instead of re-deriving state from scratch or re-reading raw transcripts, the model reads back a thin, pre-digested, trust-weighted slice: the mini-index first, then selective expansion. And because the model wrote those cells in the first place, reading them re-evokes its earlier reasoning instead of reconstructing it cold.

The graph boots itself

A fresh MAL graph starts from a deterministic 10-cell bootstrap, then the normal loop takes over and init never fires again for that graph.

Cells 1 to 5 are the system layer, the constitution: auto-written, locked, pinned, immutable, and identical in every graph.

purpose
method
map (the MAL structure itself: addressing, cell anatomy, edge semantics)
hooks (the lifecycle: orient, push, write-back, tick, the compaction boundary)
expectations (the behavioral contract: wire your edges, pick the right kind, supersede on real change, confidence is recorded and weighed, do not assert from unchecked memory, dig flagged cells)

Cells 6 to 10 are the foundation, the project charter: answered one question at a time by the user, and mutable.

objective
constraints
risks
success criteria
carried context
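
A sketch of the bootstrap shape, assuming a flat array of cells: five locked system cells followed by five mutable foundation cells that start empty. `bootstrap`, `answer`, and the placeholder bodies are hypothetical; the real system-cell text is not reproduced here.

```typescript
type Layer = "system" | "foundation";

interface BootCell {
  index: number;
  name: string;
  layer: Layer;
  locked: boolean;
  body: string | null; // null until the user answers
}

const SYSTEM_CELLS = ["purpose", "method", "map", "hooks", "expectations"];
const FOUNDATION_CELLS = ["objective", "constraints", "risks", "success criteria", "carried context"];

// Deterministic: same ten cells, same order, every time.
function bootstrap(): BootCell[] {
  const system = SYSTEM_CELLS.map((name, i): BootCell => ({
    index: i + 1, name, layer: "system", locked: true,
    body: `<fixed ${name} text>`, // placeholder for the auto-written content
  }));
  const foundation = FOUNDATION_CELLS.map((name, i): BootCell => ({
    index: SYSTEM_CELLS.length + i + 1, name, layer: "foundation", locked: false, body: null,
  }));
  return [...system, ...foundation];
}

// The user fills foundation cells one question at a time; system cells refuse writes.
function answer(cells: BootCell[], name: string, body: string): void {
  const cell = cells.find((c) => c.name === name);
  if (!cell) throw new Error(`Unknown cell: ${name}`);
  if (cell.locked) throw new Error(`Cell ${name} is locked`);
  cell.body = body;
}
```

Calling `answer(cells, "map", ...)` throws, while `answer(cells, "objective", ...)` succeeds, which mirrors the constitution versus charter split.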

Putting the operating manual in the graph as cells, rather than as a string baked into a hook, is what lets it survive a context compaction and be re-evoked afterward. The map being cell 3 is the point: the structure teaches itself from inside the store it describes.

How it came together

Two things had to meet for this to work, and they came from opposite directions.

The first was the problem, seen from the inside. Recall was not built as a database for me to query. It was built for the agent. It started by asking the model what it actually needed in order to remember well and to trust what it remembered, and the answers are the whole design: typed claims with a calibrated confidence, supersession instead of overwrite, and a record of what contradicts what. Earlier versions were far more ambitious and sprawling; the part that survived and narrowed into Recall was the memory core. Most pull-based memory tools inherited the human metaphor of a database you go and search. This came from asking the thing that has to live in the memory what would keep it honest.

The second was the structure, brought in from another trade. I already knew HAL cold from years on LinuxCNC, and when I sketched how to address and wire a memory graph, it landed on the same path-addressing shape HAL uses. Recalling HAL from the shop and deriving the addressing for memory met in the same place. Two independent routes arriving at one design is about the strongest signal you get that the design is sound.

After that it was diagnostic work plus acceleration. I used the troubleshooting habits I lean on for a machine crash to find where the memory state was breaking, and I used AI to fill the gaps in what I did not know and to write the harder code syntax. The concept is mine and comes off the shop floor. The speed of building it came from the same kind of system it was built to improve.

Under the hood: the four boundaries

This part is a prototype disclosure, not a reproducible benchmark. The snippets below are from the running Recall v5 source, trimmed for readability with elisions marked; the formulas and signatures are verbatim. They show the four boundaries where the design either holds or it does not: Recall sits upstream of the model, the read is a mini-index then a selective expand, every write goes through one gate, and the scores recompute deterministically with no model in the loop.

Recall is upstream of the model. Before the model runs, the prompt's objective is compiled into a Recall packet and merged into the text the model receives. The packet is built first, so the model sees reconciled memory before it acts.

export function buildPromptContextPush(
  store: Store,
  objective: string,
  options: ContextCompileOptions & DirectiveOptions = {},
): PromptContextPush {
  const packet = compileContext(store, objective, options);
  const directive = recallDirectiveBlock(options);
  const expansionRequired =
    packet.staleOrLowTrust.length > 0 || packet.conflicts.length > 0;
  const text = [
    "[Recall context push for this prompt]",
    directive.trimEnd(),
    "",
    formatContextPacket(packet),
    expansionRequired
      ? "EXPAND REQUIRED: conflicts or low-trust cells are present; inspect relevant handles before relying on them."
      : "Use expansion_handles only when exact evidence matters.",
    "",
  ].join("\n");
  return { objective, directive, packet, text, expansionRequired };
}

The Codex adapter wires Recall's MCP server into Codex so the same packet and tools are reachable there; the push itself is platform-neutral.

1. Compile the mini-index. The prompt becomes a ranked seed set, one mini-index line per hit, and a cell that needs review carries the expand flag. compileContext wraps this and trims the packet to a word budget (the 900 in the screenshot).

export function compile(
  store: Store,
  query: string,
  opts: { limit?: number } = {},
): CompileResult {
  const limit = opts.limit ?? 10;
  const hits = store.search(query, { limit });
  const lines = hits.map((h) =>
    renderMiniIndexLine(h.cell, { expand: h.cell.flags.requiresReview }),
  );
  return { hits, lines };
}

2. Expand selected cells. Mini-index first, selective expansion second. A handle (a full id, or id#field.path) opens exactly one cell plus its neighbor links, never the whole graph.

export function inspectCell(store: Store, handle: string): CellContext {
  const parsed = parseExpansionHandle(handle);
  const cell = store.get(parsed.target) ?? store.getByHandle(parsed.target);
  if (!cell) throw new Error(`Unknown cell: ${parsed.target}`);
  const neighbors = store.neighbors(cell.key);
  const incoming = neighbors.filter((link) => link.direction === "in");
  const outgoing = neighbors.filter((link) => link.direction === "out");
  // ... footprint (word and byte counts), optional field preview ...
  return { cell, incoming, outgoing, /* footprint, */ expansionHandles };
}

3. Write through the admission gate. The model hands in a claim (a kind, a title, a body), one confidence number, and the edges it intends. Every author runs the same pipeline: validate, screen for secrets, attenuate unsupported confidence, build the cell, then fold in the actor's calibration to get effective confidence. The model never formats the cell or computes a score.

export interface WriteProposal {
  kind: string;
  title: string;
  body: string;
  confidence: number; // (0, 1], required, no default
  edges?: { relation: string; target: string; weight?: number }[];
  // ... topics, entities, sourceRefs, operation, origin, verification ...
}

export function admit(proposal: WriteProposal, ctx: AdmitContext = {}): AdmissionResult {
  const validation = validateProposal(proposal);   // R0 schema; reject on any structural issue
  if (!validation.ok) return { accepted: false, issues: validation.issues, warnings: [], attenuations: [] };

  const screen = screenSecrets(proposal);           // reject if a credential pattern is present
  if (!screen.allowed) return { accepted: false, issues: screen.issues, warnings: [], attenuations: [] };

  const factor = ctx.calibrationFactor ?? 1;         // 0.5..1 from the actor's track record; 1 = neutral
  const att = attenuateConfidence(proposal);         // cap unsupported high confidence
  const cell = buildCell({ ...proposal, confidence: att.confidence }, { key: ctx.key, now: ctx.now });

  cell.scores.actorCalibration = factor;
  cell.scores.effective = effectiveConfidence({
    stated: att.confidence, calibration: factor, supportMass: 0, challengeMass: 0,
  });
  // with a store: dedup, apply supersedes edges, recompute neighbors' effective ...
  return { accepted: true, cell, issues: [], warnings: att.warnings, attenuations: att.attenuations };
}
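
The attenuation step is only named in the excerpt, not shown. One way it could work, sketched under loud assumptions: the 0.7 cap, the rule that "no sourceRefs means unsupported", and the string-typed attenuation records are all invented here for illustration and are not taken from Recall.

```typescript
// Hypothetical types: only the three result field names come from the excerpt above.
interface ProposalLike { confidence: number; sourceRefs?: string[] }
interface AttenuationResult { confidence: number; warnings: string[]; attenuations: string[] }

const UNSUPPORTED_CAP = 0.7; // assumed value, not Recall's

function attenuateSketch(p: ProposalLike): AttenuationResult {
  const supported = (p.sourceRefs?.length ?? 0) > 0;
  if (supported || p.confidence <= UNSUPPORTED_CAP) {
    // Supported claims and modest claims pass through unchanged.
    return { confidence: p.confidence, warnings: [], attenuations: [] };
  }
  return {
    confidence: UNSUPPORTED_CAP,
    warnings: [`confidence ${p.confidence} capped at ${UNSUPPORTED_CAP}: no sourceRefs`],
    attenuations: ["unsupported-high-confidence"],
  };
}

console.log(attenuateSketch({ confidence: 0.95 }));
console.log(attenuateSketch({ confidence: 0.95, sourceRefs: ["src/run.ts"] }));
```

The point of doing this in code rather than in the prompt is that the model cannot talk its way past the cap; it can only supply evidence.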

4. Recompute on the tick, with no model. This is the line between MAL and a plain memory database. Between turns, every active cell decays its currency from its own timestamp and recomputes its effective confidence from current support and contradiction mass. Pinned cells are exempt from decay, and a tick never counts as reinforcement.

// effective = clamp01(stated*calibration + 0.15*tanh(support) - 0.6*tanh(challenge))
export function effectiveConfidence(
  { stated, calibration, supportMass, challengeMass }:
    { stated: number; calibration: number; supportMass: number; challengeMass: number },
): number {
  return clamp01(
    stated * calibration + 0.15 * Math.tanh(supportMass) - 0.6 * Math.tanh(challengeMass),
  );
}

// currency = cFloor + (c0 - cFloor) * exp(-dt/tau)   (dt and tau in days)
export function currency(
  { c0, dt, tau, cFloor = 0.1 }: { c0: number; dt: number; tau: number; cFloor?: number },
): number {
  return cFloor + (c0 - cFloor) * Math.exp(-dt / tau);
}

// the between-turn deterministic tick (HAL's "thread"); no LLM runs here
function recompute(store: Store, cell: Cell, now: string): Cell {
  const scores = { ...cell.scores };
  if (!cell.flags.pinned) {
    const dt = Math.max(0, (Date.parse(now) - Date.parse(cell.updatedAt)) / DAY_MS);
    scores.currency = currency({ c0: cell.scores.currencyC0, dt, tau: TAU_DAYS[cell.stability] });
  }
  const m = neighborMass(store, cell.key);
  scores.effective = effectiveConfidence({
    stated: cell.scores.conf, calibration: cell.scores.actorCalibration,
    supportMass: m.supportMass, challengeMass: m.challengeMass,
  });
  return { ...cell, scores }; // updatedAt preserved: a tick is not a reinforcement
}
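
To make the two formulas concrete, here is a small worked example that re-implements them standalone and plugs in numbers. The coefficients come from the code above; the inputs (stated 0.8, one unit of mass, tau of 30 days) are assumed values chosen only for illustration.

```typescript
const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Same formula as effectiveConfidence above, with positional arguments for brevity.
function effective(stated: number, calibration: number, support: number, challenge: number): number {
  return clamp01(stated * calibration + 0.15 * Math.tanh(support) - 0.6 * Math.tanh(challenge));
}

// Same formula as currency above.
function currencyAt(c0: number, dt: number, tau: number, cFloor = 0.1): number {
  return cFloor + (c0 - cFloor) * Math.exp(-dt / tau);
}

// One stated confidence (0.8, neutral calibration), three neighborhoods:
const alone = effective(0.8, 1, 0, 0);      // no neighbors: stays at stated * calibration
const supported = effective(0.8, 1, 1, 0);  // one unit of support mass
const challenged = effective(0.8, 1, 0, 1); // one unit of challenge mass

// Currency after exactly one time constant (dt equals tau).
const afterOneTau = currencyAt(1, 30, 30);

console.log({ alone, supported, challenged, afterOneTau });
```

With these inputs, one unit of support lifts 0.80 to roughly 0.91, while the same amount of challenge drops it to roughly 0.34: contradiction weighs four times as much as support. After one time constant, a currency that started at 1 has decayed to roughly 0.43 and will level off at the 0.1 floor.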

The verifier. A functional verifier, npm run verify:recall-panel, was added for the Recall panel and passes. It checks that the panel is correctly wired to the graph (the SQLite-backed store and the compile, search, and write controls), not that it clears any performance number. Read it as a wiring check, not a benchmark.

Recall, MAL, and AIDDE

A quick map of the three names, because they get used together and they are not the same thing.

Recall is the programming foundation. At the bottom is a local-first memory substrate: a SQLite-backed graph of typed cells, an admission gate every write passes through, calibrated confidence, supersession instead of overwrite, and a compile path that returns a ranked, budgeted slice. That layer ships as a package and runs today. It is the working base everything else stands on, and it is what the four boundaries above are made of.

MAL is what that foundation evolves into. v5 recasts the same primitives as a hardware abstraction layer for memory: a cell field is a pin, an addressable value is a signal, an op is a component, the between-turn tick is the thread, and the rendered graph is a netlist. On top of the proven store it adds the deterministic op and signal layer and the addressing language. The four boundaries earlier in this post are MAL running. The netlist language is MAL specified, with the reader still to come.

AIDDE is where it runs. The screenshot at the top is AIDDE (Artificial Intelligence Driven Development Environment), with Recall embedded as a panel. The agent compiles, searches, and writes the same SQLite graph from inside the editor, against a live cell count and a word budget, so the memory layer is not a side service the agent calls out to; it sits in the workspace the agent already works in. MAL is the layer that panel stands on.

So Recall is the substrate, MAL is the abstraction layer it grows into, and AIDDE is the workspace that puts both in front of a working agent.

Why this shape holds up

Two things make MAL age well. It rides capability gains for free: a stronger model uses the same layer better with no rewrite, and a weaker model still gets the deterministic floor underneath it. And it keeps the expensive, stateful, always-on work in deterministic code where it belongs, leaving the model to do the one thing only it can do, which is to state a calibrated claim and judge relevance.

That is the whole bet, and it comes straight off the shop floor. A machine does not stay accurate because the controller is smart. It stays accurate because the wiring is legible, the signals are reconciled, the bad state gets caught and replaced instead of silently riding along, and a scheduler keeps the picture current between every move.

If you want to try Recall, it is standalone and OSS: https://github.com/H-XX-D/recall-memory-substrate

AIDDE (Artificial Intelligence Driven Development Environment) is a Codex/Claude SDK-native, bring-your-subscription development environment that shifts the old IDE-with-AI-chat model to a high-level cockpit view where you specify design, direct intent, monitor changes, audit actions, and control permissions and access in real time across a codebase. Beta is done, and if you're interested, ask in the comments for a link to the Alpha.

u/Empty-Poetry8197 — 6 days ago

▲ 3 r/ContextEngineering+1 crossposts

The hard part of a customer-facing agent is trusting the context it acts on.

I've been building agents for sales/CS workflows and kept hitting the same wall: the demo is great, but nobody will actually point it at a real customer w/o keeping a human in the loop (HITL).

What finally clicked is that it's not a context problem, it's a trust problem. Inside any account, the "context" you feed the agent is a mix of what the customer actually said, what someone inferred, what was true last quarter, and what the model made up. To a retriever those all look identical, so the agent treats a hallucination like a signed commitment.

The one that pushed me over the edge: an agent congratulated a customer on "expanding with the platform" based on a real note from a deal that had churned two quarters earlier. The note was real. It just wasn't true anymore.

What actually helped was treating customer knowledge the way a good rep does, with four things the raw context doesn't carry:

provenance (did the customer say this, or did we infer it?)
freshness (a champion or a next step has a shelf life)
action boundaries (drafting is fine; sending or writing to the CRM needs a check)
proof (what did the agent rely on, and what changed)
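
The four properties above can be carried on the fact itself instead of living in a rep's head. A minimal sketch, assuming a per-fact shelf life in days and a rule that side-effecting actions need fresh, customer-stated facts. All type names, field names, and dates here are hypothetical and are not CRMy's schema:

```typescript
type Provenance = "customer_stated" | "inferred" | "model_generated";
type Action = "draft" | "send" | "crm_write";

interface CustomerFact {
  claim: string;
  provenance: Provenance;
  observedAt: string;    // ISO date the fact was last confirmed
  shelfLifeDays: number; // freshness window
}

const DAY_MS = 86400000;

function isFresh(fact: CustomerFact, now: string): boolean {
  return (Date.parse(now) - Date.parse(fact.observedAt)) / DAY_MS <= fact.shelfLifeDays;
}

// Action boundary plus proof: every decision returns the reason it was made.
function mayAct(action: Action, fact: CustomerFact, now: string): { allowed: boolean; reason: string } {
  if (action === "draft") return { allowed: true, reason: "drafting has no side effects" };
  if (fact.provenance !== "customer_stated") return { allowed: false, reason: `provenance is ${fact.provenance}` };
  if (!isFresh(fact, now)) return { allowed: false, reason: `last confirmed ${fact.observedAt}, past shelf life` };
  return { allowed: true, reason: "customer-stated and fresh" };
}

// Illustrative data only: a real note that has outlived its shelf life.
const staleNote: CustomerFact = {
  claim: "Customer is expanding with the platform",
  provenance: "customer_stated",
  observedAt: "2025-01-10",
  shelfLifeDays: 90,
};
console.log(mayAct("send", staleNote, "2025-09-01"));
console.log(mayAct("draft", staleNote, "2025-09-01"));
```

In this sketch the churned-deal note from the story above would still be retrievable, but the send is blocked with a reason a human can audit.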

Disclosure: I work on an open-source project (CRMy) that does this, so I'm biased.

More interested in how the rest of you handle it: are you keeping agents off stale/made-up customer data in the prompt, in retrieval, or as a separate layer?


u/rangerrrr — 6 days ago

▲ 3 r/ContextEngineering+1 crossposts

Everyone talks about the "second brain" pattern for AI dev. Here's my actual one-file implementation.

There's been a lot of discussion here about giving LLMs persistent context across sessions. Most solutions I see are over-engineered: vector databases, embeddings, memory plugins.

Here's what actually works for me as a solo developer. Two files:

CHANGELOG.md: An append-only architectural decision ledger. Single-line entries, newest at top. When you load this at session start, the model immediately knows your project's history, every decision, and why things are the way they are, without a single word of re-explanation.

.dory/agents.md: Operational directives. Intent-first. Zero padding. Decompose before implementing. Verify state before acting. These aren't prompt hacks, they're constraints that make the model faster and more precise on engineering work.
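
The CHANGELOG.md convention above (single-line entries, newest at top) is simple enough to enforce with a few lines of code. A sketch, assuming the file starts with one `# ` heading and entries are `- ` bullets; `prependEntry` is a hypothetical helper, not part of dory:

```typescript
// Hypothetical helper: insert one single-line entry directly under the first heading.
function prependEntry(changelog: string, entry: string): string {
  const line = `- ${entry.replace(/\s+/g, " ").trim()}`; // collapse newlines: one entry, one line
  const lines = changelog.split("\n");
  const headingIdx = lines.findIndex((l) => l.startsWith("# "));
  const insertAt = headingIdx === -1 ? 0 : headingIdx + 1; // no heading: put it at the very top
  lines.splice(insertAt, 0, line);
  return lines.join("\n");
}

const before = "# Project Changelog\n- Persistence: state.json only.";
const after = prependEntry(before, "Scheduling: hourly cron\n  via crontab.");
console.log(after);
```

Existing lines are never rewritten, only pushed down, which is what keeps the ledger append-only in spirit even though new entries land at the top.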

The key insight: sessions should be atomic. Start fresh, work fast, persist state locally, close the tab. Context doesn't live in the chat thread, it lives in your repo, in version control, where it belongs.

Works with Claude, ChatGPT, or any local model. With Claude Code, agents.md injects into the system prompt automatically.

I'll paste the full agents.md in the comments for anyone who wants to see the actual directives.

→ github.com/tjqscott/dory

# Apathy Esports Changelog
- Architecture — single `run.py` with five sequential phases: Scan → Sync → Execute → Settle → Email.
- Scheduling — hourly cron job on Raspberry Pi via `crontab`.
- Persistence — `state.json` as sole persistence layer; rolling 7-day window, pruned each run. No database; Polymarket tracks full bet history independently.
- Volume module — `volume_model.py` as a separate module for volume projection (later inlined).
- Market scanning — polls `gamma-api.polymarket.com/markets` for 4 game tags: LoL (65), Dota 2 (102366), CS2 (100780), Valorant (101672).
- Scan params — 48h end-date window, `volume_num_min=1000` floor to avoid pagination cap issues, `limit=1000` per tag.

u/tjqscott — 6 days ago

▲ 74 r/ContextEngineering+2 crossposts

Mem0 publishes 93.4% on LongMemEval. The harness has hardcoded answers for specific question_ids.

Mem0 publishes 93.4% on LongMemEval as their state-of-the-art overall score. When we ran their hosted product through a clean evaluation harness (gpt-5 answerer, binary judge with no lean-toward-yes instruction, 5-seed mean), the best we could get was 73.8%. A 19.6-point gap on the same memory system and the same data.

We dug further: the gap is in their public benchmark harness. Reading their prompts.py file at the commit they shipped right before their April 14 announcement (commit bd063eea, April 3, 2026):

1. Dataset-specific equivalence rules in the answer prompt.

https://preview.redd.it/va27d4jzvw8h1.png?width=3024&format=png&auto=webp&s=2d835fafc5a1583cef7fed3c6343b405d4b37dad

Lines 138 to 148 contain 14 rules that map 1-to-1 to specific public LongMemEval question_ids. A sample, verbatim:

The point of LongMemEval is that the system has to figure out when "scratch grains" should count as "layer feed." Hardcoding the equivalence into the answer prompt skips the reasoning step.

2. The dataset hints get applied inside a hidden chain-of-thought block.

https://preview.redd.it/szk9ka57ww8h1.png?width=2940&format=png&auto=webp&s=56e04fb8e44dd8b4707a7062f8d07d116a86c58a

Line 53: Before answering, reason step-by-step inside <mem_thinking> tags:
Line 65: The user will only see text outside the <mem_thinking> tags.

The judge only sees the final cleaned answer. The dataset-keyed reasoning is invisible to anyone sampling outputs.

3. The judge is explicitly told to default to "yes."

https://preview.redd.it/xroeatxaww8h1.png?width=3006&format=png&auto=webp&s=126fad0f35c618e523a1eef3a864a76870c85fbc

Line 269 of the same file: IMPORTANT BIAS CHECK: You have a tendency to say "no" too quickly. Before concluding "no", you MUST verify the answer is truly wrong, not just differently worded. When in doubt, lean toward "yes".

Lines 328 to 334 add a 5-step gauntlet to clear before marking anything WRONG. No comparable gauntlet exists before marking anything CORRECT.

4. Bonus finding in their LoCoMo judge.

https://preview.redd.it/67yu69beww8h1.png?width=3024&format=png&auto=webp&s=93159462881bb10e168473dea546895099c25dfb

Different file, same repo, commit edcd6f1d (April 9, 2026). Line 212 of benchmarks/locomo/prompts.py:

Read the last clause carefully. Evidence can promote a WRONG prediction to CORRECT. The same evidence cannot demote a CORRECT prediction to WRONG. A one-directional score lift, written into the judge by hand.
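
Why a one-directional rule matters is easy to see with a toy model. The sketch below models only the behavior as described in this post (evidence may promote WRONG to CORRECT, never the reverse); it is not the actual judge prompt, and the four sample rows are invented for illustration, not drawn from any benchmark run.

```typescript
type Verdict = "CORRECT" | "WRONG";

interface Judged { verdict: Verdict; evidenceSaysCorrect: boolean }

// Symmetric rule: evidence can move a verdict in either direction.
function symmetric(j: Judged): Verdict {
  return j.evidenceSaysCorrect ? "CORRECT" : "WRONG";
}

// One-directional rule: evidence may promote WRONG to CORRECT, never demote.
function promoteOnly(j: Judged): Verdict {
  return j.verdict === "WRONG" && j.evidenceSaysCorrect ? "CORRECT" : j.verdict;
}

const accuracy = (vs: Verdict[]): number =>
  vs.filter((v) => v === "CORRECT").length / vs.length;

// Toy inputs covering all four combinations once.
const sample: Judged[] = [
  { verdict: "CORRECT", evidenceSaysCorrect: true },
  { verdict: "CORRECT", evidenceSaysCorrect: false },
  { verdict: "WRONG", evidenceSaysCorrect: true },
  { verdict: "WRONG", evidenceSaysCorrect: false },
];

const base = accuracy(sample.map((j) => j.verdict));
const sym = accuracy(sample.map(symmetric));
const lifted = accuracy(sample.map(promoteOnly));
console.log({ base, sym, lifted });
```

On this balanced toy set the symmetric rule leaves the score at 0.5 while the promote-only rule raises it to 0.75. A promote-only rule can never lower a score, whatever the data, so it can only flatter the system under test.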

Mem0 named this mechanism in their own commit messages. The April 3 commit message reads: "Sync prompts from evals: CONTEXT CHECK, Rule 14 (contradictions), conflicting numbers, personalization scan, BIAS CHECK in judge, chain-of-thought <judge_thinking> tags, 5-step FINAL CHECK." Their engineer typed the words "BIAS CHECK in judge" and "5-step FINAL CHECK" into git, on April 3, eleven days before the announcement of new SOTA numbers.

Verify in 2 minutes (direct GitHub permalinks at the pinned commits):

L145 chandelier: github.com/mem0ai/memory-benchmarks/blob/bd063eea04de4f8a19927beea155afa094a01905/benchmarks/longmemeval/prompts.py#L145
L269 BIAS CHECK: same file, #L269
L212 LoCoMo override: github.com/mem0ai/memory-benchmarks/blob/edcd6f1d42400837b1fcb6997716f1769dc51a37/benchmarks/locomo/prompts.py#L212
April 3 commit message: github.com/mem0ai/memory-benchmarks/commit/bd063eea04de4f8a19927beea155afa094a01905

For the past 2-3 weeks I have tried to meet with their founder and communicate the issue, but we couldn't connect, so I thought it was time for the community to learn about it.

Full disclosure: I am the founder of Maximem.ai, another agentic memory and context management company. This is not an attempt to malign, but to put their latest numbers into perspective.


u/Ok_Row9465 — 10 days ago

▲ 6 r/ContextEngineering+1 crossposts

Why RAG Fails Before the Model Gets Involved

Most teams blame the model when an AI agent gives a weak answer.

The model hallucinated.

The prompt was bad.

The context window was too small.

The instructions were unclear.

Sometimes that is true. But in many production RAG systems, the failure happens earlier.

The agent gets the wrong context before the model ever starts generating.

That means the answer is already compromised before the LLM gets involved.

This is the hidden failure mode behind many AI agent rollouts: RAG does not fail at generation first. It fails at retrieval.

The Answer Was Broken Before the Model Started

RAG was supposed to solve a simple problem.

Companies had knowledge trapped in documents, databases, policies, tickets, contracts, manuals, and internal systems. AI agents needed access to that knowledge to answer useful questions and automate real work.

So teams built RAG systems.

They chunked documents. Embedded text. Stored vectors. Retrieved the nearest matches. Sent those matches into the model.

For simple recall, this worked.

Ask a question with clear wording. Retrieve a few related chunks. Generate an answer.

But production workflows are rarely that simple.

An insurance underwriter checking an endorsement might need one clause from 300 pages across six PDFs.

A support agent might need the latest version of a policy, not the retired one with similar wording.

A healthcare workflow might require patient history, time-sensitive risk factors, and permissioned clinical data.

An internal operations agent might need to know which entity is connected to which workflow, what changed over time, and which source is allowed for the user asking.

This is where RAG starts to break.

The system retrieves what sounds similar, not necessarily what is current, connected, allowed, or correct.

The agent then compensates by reading more.

More files.
Longer chunks.
Bigger context windows.
More tool calls.
More reasoning tokens.

The cost grows fast.

The issue is not whether the agent can eventually find the answer.

The issue is how much context it has to read to get there.

RAG Works Until the Data Gets Real

Most RAG systems are built on vector search.

Vector search ranks content by semantic similarity. It is useful because it can find related language even when the query and document do not use the exact same words.

But semantic similarity is not the same as structural context.

Enterprise data is not flat.

It has versions.
Relationships.
Permissions.
Timelines.
Policies.
Hierarchies.
Domain-specific rules.

Flat vectors compress meaning, but they do not naturally preserve all of that structure.

That creates predictable failure modes.

A policy from 2022 and a policy from 2025 might look semantically similar.

A clause that supports an action and a clause that restricts it might both sit near the same topic.

Two customer records might mention the same product, but only one belongs to the account in question.

A document might be topically relevant, but inaccessible under the user’s permissions.

To the model, this matters.

If retrieval gives the model the wrong context, the model still tries to answer.

And modern models are good at sounding confident.

So the output looks polished, even when the retrieval set is weak.

This is why RAG failures are hard to diagnose. The visible failure is the answer. The root cause is often the context selection step that happened before generation.

Why More Context Becomes the Trap

When RAG starts failing, teams usually add more.

More chunks.

More metadata filters.

More reranking.

More prompt instructions.

More tool calls.

More context.

Larger context windows are especially tempting. If the agent is missing the right information, send it more information.

That can improve accuracy. It also shifts the burden from retrieval to the model.

Instead of narrowing context before generation, the system asks the LLM to read a larger stack and reason through it.

This works until cost, latency, and reliability become the next problem.

The agent might reach a better answer, but it burns more tokens to get there. Each query becomes more expensive. Each repeated task reloads the same context. Each workflow compounds the cost.

For a prototype, that might be fine.

For production, it becomes a blocker.

Teams then face a hard tradeoff:

Keep context narrow and risk weak answers
Add more context and watch costs climb
Build custom GraphRAG infrastructure
Pause the rollout until the economics improve

None of these are ideal.

The core issue remains the same: the retrieval layer is not representing the structure of the business.

The Retrieval Layer Needs Structure

Better RAG does not start by making the model read more.

It starts by making retrieval smarter.

The retrieval layer should know more than which text sounds related.

It should understand:

Which version is current
Which entities are connected
Which relationships matter
Which permissions apply
Which facts are relevant to the task
Which signals belong together
Which context can be ignored

This is the shift from semantic retrieval to structured retrieval.
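
A minimal sketch of what that shift can look like in code: apply the structural checks (current version, permissions) first, and only then rank by similarity. The `Chunk` fields and the function name are assumptions made for illustration, and the similarity scores are taken as precomputed.

```typescript
interface Chunk {
  id: string;
  docId: string;
  version: number;
  allowedRoles: string[];
  similarity: number; // precomputed semantic score for the query; higher is closer
}

function structuredRetrieve(chunks: Chunk[], role: string, k: number): Chunk[] {
  // 1. Find the latest version of each document.
  const latest = new Map<string, number>();
  for (const c of chunks) {
    latest.set(c.docId, Math.max(latest.get(c.docId) ?? -Infinity, c.version));
  }
  // 2. Drop retired versions and anything the caller may not read.
  const eligible = chunks.filter(
    (c) => c.version === latest.get(c.docId) && c.allowedRoles.includes(role),
  );
  // 3. Only then rank by semantic similarity.
  return eligible.sort((a, b) => b.similarity - a.similarity).slice(0, k);
}

const chunks: Chunk[] = [
  { id: "a", docId: "policy", version: 2022, allowedRoles: ["support"], similarity: 0.93 },
  { id: "b", docId: "policy", version: 2025, allowedRoles: ["support"], similarity: 0.91 },
  { id: "c", docId: "board-minutes", version: 1, allowedRoles: ["exec"], similarity: 0.95 },
];
console.log(structuredRetrieve(chunks, "support", 2).map((c) => c.id));
```

Pure similarity ranking would have returned the restricted chunk and the retired policy first; here only the current, permitted chunk survives, so the model reads one chunk instead of three.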

The goal is not to replace language models. The goal is to give them better context before they start reasoning.

If the retrieval layer gives the model the right context, the model spends fewer tokens finding the answer and more tokens doing useful work.

That is where the cost savings come from.

Not magic compression.

Not weaker reasoning.

Not worse answers.

The savings come from reducing waste before generation.

Index once. Retrieve precisely. Send the model the context it needs, not every document that might be related.

A Different Approach to Retrieval

The focus is not on making models read more information.

It is on helping them reach the right information faster.

Instead of relying on semantic similarity alone, add business context to retrieval so agents can make better decisions about what to read and what to ignore.

That means retrieval can account for things like versions, relationships, permissions, timelines, and other signals that matter in real-world workflows.

The result is a shorter path to the answer.

Agents spend less time searching through loosely related content and more time working with relevant context.

For teams already running RAG in production, this matters because model costs are often a symptom of retrieval inefficiency.

When retrieval is imprecise, agents compensate by loading more context, making more tool calls, and consuming more tokens.

When retrieval improves, those costs often fall naturally.

The goal is simple:

Give the model better context before generation.

Reduce unnecessary reading.

Improve reliability without forcing teams to rebuild their entire stack.

The future of production RAG is not bigger context windows.

It is shorter paths to the answer.

That starts with retrieval.


u/RecommendationFit374 — 12 days ago

▲ 1 r/ContextEngineering+1 crossposts

I think we're treating LLM context completely wrong (2 AM brain damage, please tell me where this falls apart)

I was supposed to be asleep 3 hours ago.

Instead I somehow ended up questioning why every AI framework on earth seems to do this:

prompt = system_prompt + retrieved_docs + memory + tool_outputs + chat_history

send_to_model(prompt)

Then...

it does it again.

And again.

Every request.

Same system prompt.

Same company docs.

Same memory.

Same retrieved knowledge.

Same everything.

And nobody seems particularly bothered by this.

---

At some point I found myself staring at my monitor thinking:

Why are we treating context like a disposable string?

Why is the mental model:

context = build_context()

instead of:

context_v2 = context_v1.branch(delta)

where context is persistent state that evolves over time?

At this point I felt very smart.

This feeling would not survive the evening.

---

My first idea was:

"Easy. Build a Git-style DAG."

Something like:

ROOT

SYSTEM

DOCS

MEMORY

/ \

A B

Then use Lowest Common Ancestor.

Find shared history.

Reuse context.

Collect Nobel Prize 👀.

You know. Standard procedure.

---

Then reality arrived.

Transformers do not care about my beautiful graph theory.

They care about token sequences.

This:

SYSTEM → DOCS → MEMORY

and this:

DOCS → SYSTEM → MEMORY

might be logically equivalent.

But they are different token streams.

Different token streams mean different computation.

Different computation means no KV-cache reuse.

So graph theory and I are no longer on speaking terms.
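
The ordering problem can be stated in a few lines. Prefix-style KV caching reuses work only for the identical leading part of two sequences, so what matters is the shared prefix, not the shared set. The sketch below compares component names as a stand-in for real token sequences, which is a simplification:

```typescript
// Number of leading components two prompts share. Only this prefix is a
// candidate for cache reuse; anything after the first difference is recomputed.
function sharedPrefixLength(a: string[], b: string[]): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

const sameOrder = sharedPrefixLength(
  ["SYSTEM", "DOCS", "MEMORY"],
  ["SYSTEM", "DOCS", "TASK_B"],
);
const reordered = sharedPrefixLength(
  ["SYSTEM", "DOCS", "MEMORY"],
  ["DOCS", "SYSTEM", "MEMORY"],
);
console.log({ sameOrder, reordered }); // { sameOrder: 2, reordered: 0 }
```

Same three components, but swapping the first two leaves nothing to reuse. The practical consequence is to put the most stable components first and the most volatile last.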

---

Okay.

New idea.

Don't build a DAG of prompts.

Build a DAG of context components.

Like:

SYSTEM_NODE

DOC_NODE_1

DOC_NODE_2

MEMORY_NODE

TOOL_OUTPUT_NODE

Then an execution becomes:

Execution([

SYSTEM_NODE,

DOC_NODE_2,

MEMORY_NODE

])

Which started looking suspiciously less like prompt engineering...

and more like Bazel.

Or Nix.

Or build systems.

Which was concerning.

---

Then things got weird.

I realized this graph accidentally becomes provenance tracking.

Example:

Run A

SYSTEM

DOC_v1

MEMORY_v1

TASK_A

Later:

Run B

SYSTEM

DOC_v2

MEMORY_v1

TASK_B

Now I can ask:

* Which document changed?

* Which memory update changed behavior?

* Which retrieval increased cost?

* Which branch introduced hallucinations?

* Why did this run suddenly become expensive?

Without building a separate observability system.

I somehow accidentally invented git blame for prompts.
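
The "git blame for prompts" idea reduces to diffing two runs slot by slot. A sketch, assuming each run is recorded as a map from slot name to component version id; the `Run` shape and slot names are hypothetical:

```typescript
type Run = Record<string, string>; // slot name -> component version id

function diffRuns(a: Run, b: Run): string[] {
  // Union of slot names, first-seen order preserved.
  const slots = Array.from(new Set(Object.keys(a).concat(Object.keys(b))));
  const changes: string[] = [];
  for (const slot of slots) {
    if (a[slot] !== b[slot]) {
      changes.push(`${slot}: ${a[slot] ?? "(absent)"} -> ${b[slot] ?? "(absent)"}`);
    }
  }
  return changes;
}

const runA: Run = { system: "SYSTEM", doc: "DOC_v1", memory: "MEMORY_v1", task: "TASK_A" };
const runB: Run = { system: "SYSTEM", doc: "DOC_v2", memory: "MEMORY_v1", task: "TASK_B" };
console.log(diffRuns(runA, runB)); // ["doc: DOC_v1 -> DOC_v2", "task: TASK_A -> TASK_B"]
```

If Run B misbehaves, the diff narrows the suspects to the document update and the task itself; the system prompt and memory are ruled out without reading either.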

---

At this point I was no longer solving the original problem.

I was just following the raccoon deeper into the sewer.

---

Then I thought:

What if every context component tracked metadata?

{

"tokens": 1800,

"cost": "$0.03",

"latency": "400ms",

"usefulness_score": "???"

}

Now imagine thousands of runs later:

DOC_17

Appeared in 1400 runs

Consumed 18% of total token budget

Improved output quality by 0.3%

System says:

Delete DOC_17

Now we're not asking:

Can I fit context?

We're asking:

What is the cheapest context plan

that achieves target quality?

Which sounds suspiciously like database query optimization.
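
As a rough illustration of that query-planner framing, here is a greedy sketch: rank components by quality gain per token and add them until a target is met. It leans on two big assumptions, that per-component gains are known and that they add up independently, and greedy selection is a heuristic rather than an optimal planner. All names and numbers are invented.

```typescript
interface Component { id: string; tokens: number; qualityGain: number } // gains assumed additive

function cheapestPlan(components: Component[], targetQuality: number): Component[] {
  const ranked = components
    .filter((c) => c.qualityGain > 0)
    .sort((a, b) => b.qualityGain / b.tokens - a.qualityGain / a.tokens); // best value first
  const plan: Component[] = [];
  let quality = 0;
  for (const c of ranked) {
    if (quality >= targetQuality) break; // target met: stop paying for tokens
    plan.push(c);
    quality += c.qualityGain;
  }
  return plan;
}

const candidates: Component[] = [
  { id: "SYSTEM", tokens: 500, qualityGain: 0.5 },
  { id: "DOC_2", tokens: 1200, qualityGain: 0.3 },
  { id: "DOC_17", tokens: 1800, qualityGain: 0.003 },
  { id: "MEMORY", tokens: 300, qualityGain: 0.15 },
];
console.log(cheapestPlan(candidates, 0.9).map((c) => c.id)); // ["SYSTEM", "MEMORY", "DOC_2"]
```

DOC_17 never makes the plan: it is the most expensive component and contributes the least, so the target is reached before it is considered.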

And that's when I got nervous.

---

Then I found another problem.

Who computes this?

{

"usefulness_score": 0.72

}

Because maybe:

Document appears useless 99% of the time.

But:

That 1% prevents catastrophic failure.

So frequency is not usefulness.

Now I had somehow wandered into causal inference.

I have never once asked to be near causal inference.

Yet there I was.

---

Then things got worse.

I realized context could have a memory hierarchy.

HOT CONTEXT

Always resident

WARM CONTEXT

Frequently loaded

COLD CONTEXT

Retrieved on demand

Which means I may have accidentally reinvented RAM management.

For prompts.

At 2 AM.

For free.
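
A tiny sketch of the hot/warm/cold split, assuming each component has a measured use rate (fraction of recent runs that needed it) and an optional pin for rarely used but critical content. The thresholds are illustrative guesses, not measured values:

```typescript
type Tier = "hot" | "warm" | "cold";

// useRate: fraction of recent runs that used the component, in [0, 1].
// pinned: forces residency for content that is rare but must never be missing.
function assignTier(useRate: number, pinned = false): Tier {
  if (pinned || useRate >= 0.9) return "hot"; // always resident
  if (useRate >= 0.3) return "warm";          // frequently loaded
  return "cold";                              // retrieved on demand
}

console.log(assignTier(0.97));       // "hot"
console.log(assignTier(0.5));        // "warm"
console.log(assignTier(0.01));       // "cold"
console.log(assignTier(0.01, true)); // "hot"
```

The pin is the escape hatch for the "useless 99% of the time" document: frequency alone would send it to cold storage.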

---

Then I found the giant flaw.

Current APIs mostly work like:

full prompt

↓

request

↓

response

So even if I build this beautiful context architecture internally...

the model provider still says:

Cool...

Send the whole prompt again.

Bruhhhhh

Which means:

# Version A (possible today)

Context Graph

↓

Optimizer

↓

Prompt Assembly

↓

API

Benefits:

* provenance

* observability

* optimization

* context analytics

But not much actual compute reuse.

---

# Version B (future)

Imagine providers exposed persistent context handles.

Something like:

ctx = create_context(company_docs)

ctx2 = ctx.branch(memory_delta)

generate(ctx2)

Now context becomes a first-class object.

Now the model understands persistence natively.

Now things get interesting.
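
The branch semantics can at least be mocked client-side today. This sketch is a local data structure only: it does not talk to any provider, and the `Context` class is invented here. Each branch stores just its delta and a pointer to its parent, and the full text is assembled only when a request has to be sent.

```typescript
class Context {
  private readonly parent: Context | null;
  private readonly delta: string;

  private constructor(parent: Context | null, delta: string) {
    this.parent = parent;
    this.delta = delta;
  }

  static create(base: string): Context {
    return new Context(null, base);
  }

  // Branching is O(1): nothing is copied, and the parent is never mutated.
  branch(delta: string): Context {
    return new Context(this, delta);
  }

  // Today's APIs still need the full text, so rendering walks the chain root-first.
  render(): string {
    const parts: string[] = [];
    for (let node: Context | null = this; node !== null; node = node.parent) {
      parts.unshift(node.delta);
    }
    return parts.join("\n");
  }
}

const ctx = Context.create("company docs");
const ctx2 = ctx.branch("memory delta");
const ctx3 = ctx.branch("other delta"); // sibling branch sharing the same root
console.log(ctx2.render());
console.log(ctx3.render());
```

Because siblings share a root and render root-first, their prompts share a leading prefix, which is the shape prefix caching rewards.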

---

The weird part is that the deeper I went, the less this felt like AI engineering.

And the more it felt like operating systems.

Or databases.

Or build systems.

Or some horrible combination of all three.

---

I started with:

Why are agents wasting tokens?

And somehow ended up here:

Context Operating Systems

Which sounds either profound or deeply stupid.

I genuinely cannot tell which.

---

I am NOT claiming this is practical.

I am NOT claiming this is novel.

I am NOT claiming I've solved anything.

I am mostly asking:

Why are we still treating context as temporary text?

Should context eventually become a managed computational resource?

Or am I just rediscovering three existing papers, two caching systems, and something inference engineers solved six months ago?

Either way, I'm curious.

Please tell me where this entire thing falls apart.


u/Unfair_Layer3085 — 12 days ago