r/ContextEngineering

Fine-tuned RAG: teaching your retriever which embedding dimensions matter (+11% hit rate, +12% completeness, +9% faithfulness)
▲ 29 r/ContextEngineering+4 crossposts


Hi all,

I developed a fine-tuned retrieval head (neural net) for RAG that transforms query embeddings before retrieval, so the system learns which embedding dimensions actually matter for your corpus — rather than weighting them all equally as standard cosine similarity does.

The problem

In any domain-specific corpus, some embedding dimensions are highly predictive for matching queries to the right passages, while others are effectively noise. Standard cosine similarity can't distinguish between the two, so retrieval gets pulled toward superficially similar but substantively irrelevant passages. The fine-tuned RAG is designed to prevent exactly that.

How it works

  1. Synthetic question generation: An LLM generates multiple questions per chunk in the corpus, each answerable from that chunk alone. This creates a dataset of question-chunk pairs (QA-pairs). These are embedded using an embedding model and divided into a training and validation set.
  2. Neural net training: A lightweight neural network is trained on the training QA-pairs using multiple negatives ranking (MNR) loss. After each epoch, the model is evaluated on the validation set by measuring retrieval hit rate: the proportion of validation questions for which the correct chunk appears in the top-5 retrieved results. Retrieval works by embedding the question, passing it through the neural network to transform the embedding, and ranking all corpus chunks by cosine similarity to the transformed embedding.

Through this mechanism, the projection head learns which embedding dimensions are informative for finding the best chunks for these types of questions, and which are irrelevant.
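
To make the two steps above concrete, here is a minimal NumPy sketch of an in-batch MNR loss and the top-k hit rate check. It is not the code from the repository: the single linear projection `W`, the `scale` value, and all function names are assumptions chosen for illustration, and the actual head may well be non-linear.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mnr_loss(q, c, W, scale=20.0):
    """In-batch multiple negatives ranking loss for a linear query projection W.

    q: (batch, dim) question embeddings.
    c: (batch, dim) embeddings of the paired chunks.
    Row i of c is the positive for row i of q; all other rows act as negatives.
    `scale` is an assumed temperature-like factor.
    """
    sims = scale * l2_normalize(q @ W) @ l2_normalize(c).T  # (batch, batch)
    sims = sims - sims.max(axis=1, keepdims=True)           # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def hit_rate_at_k(q, W, corpus, gold_idx, k=5):
    """Share of questions whose gold chunk is among the k most similar chunks."""
    sims = l2_normalize(q @ W) @ l2_normalize(corpus).T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == np.asarray(gold_idx)[:, None]).any(axis=1)
    return float(hits.mean())
```

Training would minimise `mnr_loss` over `W` with any gradient-based optimiser and keep the checkpoint with the best validation `hit_rate_at_k`.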

Results

To validate the architecture, I used the Legal RAG Bench dataset as a proof of concept — evaluating on 100 held-out test questions.

Retrieval Hit Rate:

  • The fine-tuned retriever achieves 82% Hit Rate (k = 20), compared to 71% for the standard cosine retriever — an 11 percentage point improvement, meaning the correct chunk appears in the top 20 results significantly more often when the query embedding is first transformed through the fine-tuned retriever.

Answer quality (LLM-as-judge, 1–5 scale across 6 metrics):

  • Outperforms traditional RAG (top-k cosine sim) on all 6 metrics
  • Largest gains in completeness (+12%) and faithfulness (+9%)
  • Consistent improvement across every metric — not just isolated gains — suggesting that retrieving more relevant context has a broad positive effect on answer quality

Code and full write-up available on GitHub: https://github.com/BartAmin/Fine-tuned-RAG

u/Much_Pie_274 — 9 hours ago
▲ 5 r/ContextEngineering+1 crossposts

Which knowledge bases are you connecting as an MCP?

I'm looking for the best MCP knowledge base connectors as I use multiple AI tools and need something that I can plug into whichever AI tool I want.

I have heard about Obsidian, but what other options are out there?

Show me your best tools/setups?

u/Reasonable-Jump-8539 — 3 days ago
▲ 5 r/ContextEngineering+1 crossposts

How to properly benchmark a context/memory solution

I want to benchmark my own memory tool. So far I have done a bunch of runs in Codex headless mode using --json.

https://developers.openai.com/codex/noninteractive

You can fire a prompt and everything is recorded end-to-end: how many tool calls were made, what was called with which inputs and outputs, how long the prompt took, and how many tokens were consumed.

For small codebases (under 100 code files) I know my tool loses against vanilla on performance, and the answers were of the same quality.

But when I ran it on a 350-file codebase, Codex using my memory layer outperformed vanilla in both performance and response quality. The prompt was about discovery and figuring out the architecture.

The only thing I expected was that the answers would be better. I had expected there would always be a tax, because my system banks on sidecar files: every code file has its own sidecar, which you can find under the same path in a parallel folder.
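
As an illustration of the sidecar layout described above, the lookup can be sketched as a pure path mapping. The folder names and the `.md` suffix here are hypothetical and not taken from the repo.

```python
from pathlib import Path

def sidecar_path(code_file, repo_root, sidecar_root, suffix=".md"):
    """Map a code file to its sidecar note.

    The sidecar keeps the same relative path as the code file but lives
    under a parallel root folder. `suffix` is an assumed naming convention.
    """
    rel = Path(code_file).relative_to(repo_root)
    return Path(sidecar_root) / rel.parent / (rel.name + suffix)

# Hypothetical layout: code in "repo", notes in "repo-memory".
print(sidecar_path("repo/src/app/main.py", "repo", "repo-memory"))
```

Because the mapping is deterministic, an agent that has just opened a code file needs only one extra read to fetch the note sitting beside it.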

What was funky is the README.md. In the 350-file case the file was mostly correct and should have been a bigger help for vanilla Codex, which couldn't rely on the memory layer. But at several points in my code it still jumped to the wrong conclusions and said that an old code path is the mature, current one. That was really weird. I took the README.md out and, of course, got the same issue.

And no matter how often I ran it, it would stubbornly take the wrong path and say the outdated path is the right one. Codex using my memory layer knew every single time what the correct path was. When it gets to the old code parts it "finds" a note right beside them that says this code is a dead end. The README.md might already be deeply buried in the context here, so it doesn't matter much. And I feel this is what helps it to be reliable. So that part I know for sure.

But I don't know if I can trust the "performance" numbers. Sure the Codex tool measures deterministically. And the thing was faster with the analysis prompt. I could tell that without the tool. However it doesn't mean I can draw the right conclusions. I have a hint.

**So if you were in my shoes what would you test next and what tools would you use?**

I am certainly going to try a larger codebase from GitHub and use older tickets that have been solved recently. And I will publish the artifacts and the memory artifacts in a separate GitHub repo, so everyone can just download the memory and test it on that code repo themselves without needing to build one from scratch. I think that would make things repeatable for everyone.

But other than that I am open to suggestions regarding methodology.

For anyone interested, you can check my repo here. It is still in alpha and there is still one major issue: I want to make the coordination folder the only runtime artifact. But this is an ergonomics thing. The memory system is fully operational.

https://github.com/Foxfire1st/agents-remember-md

u/FoxFire17739 — 7 days ago
▲ 4 r/ContextEngineering+3 crossposts

Is anyone else drowning in AI context management on large codebases?

Working on a fairly large Azure microservices system (.NET, 40+ services, 5+ years old). We've adopted AI coding assistants across the team and there's genuine productivity gain for individual tasks.
 
But there's a problem nobody seems to talk about: every new chat session is a blank slate.
 
Our codebase has years of accumulated decisions:
• We use a specific handler pattern for vendor integrations
• Auth service has a specific cache-aside setup with historical reasons
• Service boundaries that look weird but make sense given our deployment constraints
• Interface conventions that all the senior engineers know but aren't written anywhere useful
 
When I open a new AI chat, none of that context exists. I either paste a context dump (expensive, eats token budget) or the AI generates code that's syntactically correct but architecturally wrong for our system.
 
We've tried:
• System prompts with architecture descriptions - partial help
• Cursor rules files - limited
• Just re-explaining every session - waste of time
 
I'm actually building a tool to solve this (happy to share more if there's interest) but first wanted to know — is this a widespread problem or specific to how we work?
 
How are experienced devs handling context management with AI assistants on mature codebases?

u/killerexelon — 11 days ago
▲ 14 r/ContextEngineering+4 crossposts

I built a context engine that indexes your codebase and serves it to your coding agent via MCP. The agent understands the architecture before making changes instead of exploring blindly.

On benchmarks it takes Sonnet 4.0 from 66% to 73.4% on SWE-bench. Biggest help on complex repos (Django +12%, sympy +17%).

Most AI coding agents struggle when they hit 10k+ line repositories because of context loss. I’ve been benchmarking Xanther.ai using a proprietary PRAT protocol designed to handle systemic validation rather than just code completion.

Key Results:

  • Context Handling: Zero-shot success on multi-file PRs in complex repos.
  • Orchestration: Integrated with MCP for real-time tool use.
  • Quality: Focused on deterministic, enterprise-grade output that passes CI/CD on the first run.

Curious to hear what you guys think about the transition from "chat-with-code" to fully autonomous agents.

Results on SWE-bench Verified (500 real bugs)

MiniMax M2.5 + Xanther: 78.2% ($0.22/instance)

Sonnet 4.0 + Xanther: 73.4% (baseline was 66%)

Claude Opus without it: 76.8% ($0.75/instance)

Biggest gains on complex repos — sympy +17%, scikit-learn +13%, django +12%.

Looking for people to try it on real projects. Free tier, 60 second setup:

https://preview.redd.it/xpf20k6ugtyg1.png?width=1137&format=png&auto=webp&s=c6091dae916b0a6e8762b2323eedcbd1477962bb

Works with Claude Code, Cursor, Kiro, Windsurf — anything that supports MCP.

https://xanther.ai

Discord: https://discord.gg/Y768kBRS

https://medium.com/@xanther.ai/how-a-0-02-call-model-scored-78-2-on-swe-bench-verified-beating-every-model-on-the-leaderboard-153be05a60f1

u/Economy_Leopard112 — 11 days ago

Context engineering for AI coding tools across a multi-repo enterprise is a different problem than anyone documents

Most of the context engineering content I find assumes a single repository. Feed the AI your codebase, build a context layer, get better suggestions. Clean and simple. The reality for any non-trivial enterprise is multiple repos, multiple services, internal libraries that live in separate repos, platform code that everything depends on but nobody on any individual team owns, and shared standards documents that apply across all of it.

Context engineering for that environment is genuinely hard and I haven't found good documentation on how teams are actually solving it. The naive approach is to index everything and let the context layer figure it out. The problem is that context from unrelated services generates noise. The backend API team doesn't need suggestions informed by the mobile app codebase. But they do need suggestions informed by the shared internal library that both use.

The questions we're working through: how do you scope context per team without losing cross-cutting signal? How do you handle the internal library layer that needs to be in everyone's context but at different depths? How do you prevent the context layer from becoming a maintenance burden as repos evolve independently?

u/ninjapapi — 10 days ago
▲ 12 r/ContextEngineering+3 crossposts

NornicDB 1.1.0 preview - memory decay as declarative policy - MIT Licensed

hey guys so i wrote a database, NornicDB.

https://github.com/orneryd/NornicDB/releases/tag/v1.1.0-preview-1

it got mentioned in research last month. https://arxiv.org/pdf/2604.11364

the researcher actually commented on issue #100 here:

https://github.com/orneryd/NornicDB/issues/100#issuecomment-4296916032

and i’ve released a preview tag for people to play with. 1.1.0-preview. docker images, mac installer, or build it locally.

the idea is to convert memory decay into policy that can be declared in cypher. it started with Ebbinghaus but, as the researcher pointed out, that is insufficient for agentic memory.

with the policies you can define the decay curve profiles. when you enable memory decay, it sets up policies to match the Ebbinghaus-Roynard model as he describes in the paper. that plus the “canonical graph ledger” bootstrap enables you to move a lot of glue code into the database using the primitives i provide. (cardinality, temporal no-overlap constraints, etc…)

the way it works is a visibility suppression layer in between Cypher and badger. on-access meta is stored in a separate index. there are reveal and decay-scoring functions in cypher for debugging queries or bypassing the visibility layer. having the layer there and the meta flushed separately from the data itself keeps the performance overhead of enabling it at the data layer negligible.
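
to illustrate the general idea of a visibility layer over decayed records, here is a small sketch. it uses the classic Ebbinghaus retention form R = exp(-t / S), not the Ebbinghaus-Roynard model and not NornicDB's actual implementation; the dict-based stores, field names, and threshold are all hypothetical.

```python
import math

def retention(elapsed_hours, stability_hours):
    """Classic Ebbinghaus forgetting curve: R = exp(-t / S)."""
    return math.exp(-elapsed_hours / stability_hours)

def visible_ids(data, access_meta, now_hours, threshold=0.2):
    """Return ids of records whose decayed score is at or above the threshold.

    data: {id: payload} stands in for the primary store.
    access_meta: {id: {"last_access": hours, "stability": hours}} stands in
    for a separate on-access index, kept apart from the data itself.
    Suppressed records are hidden from results, not deleted.
    """
    out = []
    for node_id in data:
        meta = access_meta[node_id]
        score = retention(now_hours - meta["last_access"], meta["stability"])
        if score >= threshold:
            out.append(node_id)
    return out

data = {"a": "old note", "b": "recent note"}
meta = {
    "a": {"last_access": 0.0, "stability": 24.0},
    "b": {"last_access": 90.0, "stability": 24.0},
}
print(visible_ids(data, meta, now_hours=100.0))
```

in this toy version a different decay profile is just a different `retention` function, which is roughly what moving the curve into declarative policy buys you.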

it’s research backed. I’m writing my own research paper in response to 4 different papers converging on my database implementation.

726 stars and counting. MIT licensed. neo4j and qdrant driver compatible.

enjoy!

edit: clarity on performance overhead. the way i’ve built it and benchmarked it, the performance overhead is within noise tolerances. +/- <1% variance across runs and overhead measures in nanoseconds in tests.

u/Dense_Gate_5193 — 9 days ago
▲ 4 r/ContextEngineering+5 crossposts

Context Engineering Is the Compass Your Coding Agent Needs

Coding agents are powerful ships, but they’re sailing without a map. They can write code, run tests, and iterate — but they don’t know where they are in the codebase. Context engineering is the discipline of giving agents the architectural awareness they need to navigate effectively. Without it, even the best models waste tokens exploring dead ends. With it, a cheap model outperforms an expensive one.

https://medium.com/@xanther.ai/context-engineering-is-the-compass-your-coding-agent-needs-6eef30c66286?postPublishedType=initial

The Navigation Problem

Picture a ship in open water. It has a powerful engine, a skilled crew, and enough fuel to reach any destination. But it has no compass, no charts, and no GPS. What happens?

It explores. It tries directions. It backtracks when it hits land where it expected open water. Eventually, through trial and error, it might reach its destination — but it burns 3x the fuel and takes 5x the time.

This is exactly what happens when you point a coding agent at a large codebase without architectural context.

https://preview.redd.it/nr5idnhzj90h1.png?width=720&format=png&auto=webp&s=90ca6ff90066501de6e3f0c66828309d212b2832

The agent has all the capabilities it needs. It can read files, write code, run tests, search for patterns. But it doesn’t know the architecture. It doesn’t know that django/db/models/sql/compiler.py is the heart of query generation, or that changing BaseCache.set() affects every cache backend downstream. It discovers these things through exploration — expensive, token-heavy, error-prone exploration.

Without context engineering:

Agent: "I need to fix the cache race condition"
→ Searches for "cache" → finds 47 files
→ Reads django/core/cache/__init__.py → not helpful
→ Reads django/core/cache/backends/filebased.py → finds the class
→ Reads django/core/cache/backends/base.py → understands inheritance
→ Searches for "thread" → finds 23 files
→ Reads django/utils/autoreload.py → wrong file
→ Reads django/core/files/locks.py → relevant but doesn't know why yet
→ Eventually pieces together the architecture after 12 file reads
Total: ~4,000 tokens, 45 seconds, 2 wrong attempts

With context engineering:

Agent: "I need to fix the cache race condition"
→ Queries XCE: "FileBasedCache race condition threading"
→ Gets back: inheritance chain, threading concerns, related utilities, test infrastructure
→ Goes directly to the right files with full architectural understanding
Total: ~1,500 tokens, 15 seconds, correct on first attempt

Same agent. Same model. Same capabilities. The only difference is the map.

The Three Levels of Context

Not all context is created equal. There’s a hierarchy:

Level 1: Code Context (What exists)

This is what most tools provide today — file contents, function signatures, grep results. It answers “what code is here?” but not “why?” or “how does it connect?”

Tools at this level: file search, grep, symbol lookup, embeddings-based RAG.

Limitation: Finding a function doesn’t tell you what calls it, what it depends on, or what breaks if you change it.

Level 2: Structural Context (How things connect)

This captures relationships — call graphs, inheritance chains, import dependencies, module boundaries. It answers “what depends on what?” and “what’s the execution flow?”

Tools at this level: static analysis, dependency graphs, call chain extraction.

Limitation: Knowing the call graph doesn’t tell you the design intent or architectural role of each component.

Level 3: Architectural Context (Why things exist)

This captures design intent — why a module exists, what role it plays in the system, what design patterns it implements, what constraints it must satisfy. It answers “what is this component’s job?” and “what are the rules?”

Tools at this level: XCE’s PRAT-powered structured index.

This is the level that changes agent behavior. When an agent knows that CsrfViewMiddleware must run before CacheMiddleware (and why), it doesn't accidentally break that constraint. When it knows that BaseCache defines a contract that all backends must satisfy, it doesn't write a fix that violates that contract.

https://preview.redd.it/6xf1g7t2k90h1.png?width=720&format=png&auto=webp&s=bf6efe957fc9eb347c86c5ffa4d5f9f940d88a5a

Why embeddings fail for this:

Embedding-based code search finds textually similar code. But the questions agents actually need answered are structural:

  • "What depends on this function?" — not a text similarity question
  • "If I change this file, what breaks?" — requires call graph knowledge
  • "What's the inheritance chain?" — structural, not textual
  • "What module owns this logic?" — architectural, not lexical

Two functions can be textually similar but architecturally unrelated. Two functions can be textually different but tightly coupled through a call chain. Embeddings can't distinguish these cases.
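
A question like "if I change this function, what breaks?" is a graph traversal rather than a similarity lookup. The sketch below walks a toy call graph backwards to collect every transitive caller. It is not XCE's implementation, and the graph and function names are made up for illustration.

```python
from collections import defaultdict, deque

def reverse_graph(calls):
    """Invert {caller: [callees]} into {callee: {callers}}."""
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)
    return callers

def impacted_by(calls, changed):
    """Return every function that directly or transitively calls `changed`."""
    callers = reverse_graph(calls)
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

# Hypothetical call graph: names are illustrative only.
calls = {
    "view": ["cache_get"],
    "cache_get": ["backend_read"],
    "report": ["backend_read"],
    "helper": [],
}
print(sorted(impacted_by(calls, "backend_read")))
```

Note that `view` is reported even though its text shares nothing with `backend_read`; the link exists only through the call chain.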

The compass metaphor:

A compass doesn't tell you the answer. It tells you which direction to look. That's what architectural context does for agents — it doesn't write the fix, but it tells the agent:

  • Which files are relevant (and which aren't)
  • How those files relate to each other
  • What constraints must be preserved
  • What patterns to follow
  • What will break if you get it wrong

The agent still does the work. But it does the right work, in the right place, on the first try.

Real numbers:

We tested this on SWE-bench Verified (500 real bugs from Django, scikit-learn, sympy, matplotlib, pytest):

https://preview.redd.it/klbpkr2mk90h1.png?width=805&format=png&auto=webp&s=bbe7166f5ad2336455749f9ec2581c4326de4e6a

A $0.02/call model with the right context beats a $0.30/call model without it. The improvement scales with complexity:

  • Simple codebases (flat architecture): +8%
  • Medium codebases (some layering): +12%
  • Complex codebases (deep dependencies): +17%

This makes intuitive sense. If your codebase is a 500-line Express app, the agent doesn't need a map. If it's Django with 4,000 files across 50 modules with deep inheritance chains and cross-cutting middleware — the map is everything.

What we built:

We built a context layer that indexes codebases into a structural map (not just embeddings) and serves it via MCP. Any MCP-compatible agent (Claude Code, Cursor, Kiro, OpenCode, Windsurf, Cline) gets architectural context on every tool call without any changes to the agent itself.

npx xanther-cli init --api-key YOUR_KEY

One command indexes your repo. Then add to your agent's MCP config:

{
  "mcpServers": {
    "xanther-xce": {
      "url": "https://mcp.xanther.ai/sse?repo_id=YOUR_REPO_ID",
      "headers": { "Authorization": "Bearer YOUR_KEY" }
    }
  }
}

The agent gets five tools: xce_get_context (full architectural context for a problem), xce_search (semantic search), xce_architecture_context (deep dive on a file/symbol), xce_trace (trace code to architecture), xce_impact_analysis (what breaks if you change files).

The takeaway:

Everyone's focused on making models smarter. That matters. But the bottleneck for coding agents right now isn't model capability — it's context quality. A fast ship without a compass burns fuel going in circles. A slower ship with a compass reaches the destination first.

Context engineering — giving agents the right information at the right time — is the multiplier that makes every model better. And unlike model improvements (which require billions in training), context improvements are cheap and compound with every model upgrade.

Links:

Free tier: 3 repos, 100 queries/month. Curious what others think about this approach — is context the bottleneck you're hitting too?

u/Economy_Leopard112 — 12 days ago
▲ 7 r/ContextEngineering+1 crossposts

Auto Graph Color

Anyone spinning up knowledge bases quicker than they have time to color might like this plugin... It’s super new so would love some feedback if anyone is interested in the v1.

linkedin.com
u/Willing-Topic556 — 11 days ago
▲ 3 r/ContextEngineering+1 crossposts

We hit this while building an RFP automation system. The client had hundreds of documents: past RFPs, RFIs, proposal templates, internal reference files spanning years. When we asked for a single source of truth, they confessed that they had none. We had a hunch that this was going to lead to a funny outcome.

We ingested everything and started taking queries.

First real tests:

- "What's our pricing?" Three different numbers depending on which document you pull.

- "How many employees?" Four different answers.

- "What's our compliance certification status?" One doc says pending. Another says SOC 2 Type 1. The most recent one says HITRUST.

At cogniswitch we take a neuro-symbolic approach, yet the system still generated answers the team was not really stoked about. On a feedback call, the client's growth team mentioned that the answers were dated. Obviously: the documents just had tons of conflicts and contradictions.

We went back and asked for the source of truth. There wasn't one. These were live internal documents that had accumulated years of drift. Nobody had reconciled them because nobody needed to until an AI had to answer from all of them at once.

We ended up building a conflict detection layer before the answer generation layer. Scan the corpus for conflicting facts (pricing, headcount, certification status) that have different stated values across documents. Flag them. A human resolves which is authoritative. Then you can build anything on top of this knowledge foundation.
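
A rough sketch of the flagging step, assuming facts have already been extracted into (key, value, document) triples; the extraction itself is out of scope here. This is not our production code, and the key names and document ids are hypothetical.

```python
from collections import defaultdict

def find_conflicts(facts):
    """Group extracted facts by key and flag keys with more than one value.

    facts: iterable of (fact_key, value, doc_id) triples.
    Returns {fact_key: {normalized_value: [doc_ids]}} for conflicting keys only.
    """
    values_by_key = defaultdict(lambda: defaultdict(list))
    for key, value, doc_id in facts:
        values_by_key[key][value.strip().lower()].append(doc_id)
    return {
        key: dict(values)
        for key, values in values_by_key.items()
        if len(values) > 1
    }

# Illustrative triples; doc ids and the second key are made up.
facts = [
    ("certification_status", "Pending", "doc_a"),
    ("certification_status", "SOC 2 Type 1", "doc_b"),
    ("certification_status", "HITRUST", "doc_c"),
    ("support_hours", "24/7", "doc_a"),
    ("support_hours", "24/7 ", "doc_c"),
]
for key, values in find_conflicts(facts).items():
    print(key, "needs human review:", values)
```

Everything flagged goes to a human reviewer; only keys with a single agreed value pass through to answer generation untouched.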

Lesson learnt the hard way: there is a gap in output-only evals. Your benchmark asks whether the AI answered correctly, but if your knowledge base has contradictions, "correct" doesn't have a stable meaning.

There is a clear need for context evals, which check whether your retrieval corpus is internally consistent before you ever run a query, but they are barely a discipline. I don't know of good tooling for it. Most teams discover this problem the same way we did.

Anyone building RAG on messy enterprise document sets running into this?

u/Ok_Gas7672 — 14 days ago
▲ 18 r/ContextEngineering+5 crossposts

I built and tested a zero-dependency TUI library with modern layout support using the Convo-Lang VSCode extension

u/iyioioio — 13 days ago