r/Rag

▲ 0 r/Rag

Most RAG failures are not retrieval failures. They’re assumption inheritance failures.

One thing I noticed after stress-testing long-context RAG pipelines:

once the retriever surfaces a weak or slightly wrong premise early, the generator often treats it as “ground truth” for the rest of the chain.

The dangerous part is that the reasoning still looks coherent.

The model keeps building on top of the initial assumption, retrieves supporting context around it, and gradually locks into a self-reinforcing narrative.

I started calling this:

Recursive Agreement

where each stage silently inherits the previous stage’s assumptions without re-validating them.

A few patterns consistently showed up in larger RAG systems:

• retrieval outputs becoming “authoritative” even when relevance is weak

• local coherence overpowering global correctness

• constraint decay across long multi-step chains

• agents optimizing for narrative consistency instead of contradiction detection

Ironically, increasing context size sometimes made this worse because the bad premise simply had more room to accumulate supporting evidence.

The biggest improvements came from surprisingly small structural changes:

• explicit assumption extraction before reasoning

• lightweight contradiction passes

• confidence scoring on retrieved context

• re-ranking focused on disagreement, not just similarity

• forcing checkpoints between retrieval and synthesis
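A rough sketch of what a checkpoint between retrieval and synthesis could look like. The `Chunk` type, the 0-to-1 score scale, and the thresholds are all assumptions for illustration, not anything from a specific framework:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float  # retriever relevance score, assumed to be in [0, 1]

def checkpoint(chunks, min_score=0.5, min_kept=1):
    """Gate between retrieval and synthesis.

    Returns (kept_chunks, ok). ok is False when too little trustworthy
    context survives, so the caller can re-retrieve or abstain instead of
    letting the generator build on a weak premise.
    """
    kept = [c for c in chunks if c.score >= min_score]
    return kept, len(kept) >= min_kept

retrieved = [
    Chunk("Policy v2 allows refunds within 30 days.", 0.82),
    Chunk("Refund rules vary by region.", 0.31),
]
kept, ok = checkpoint(retrieved)
```

The point of returning `ok` separately is that "nothing good was retrieved" becomes an explicit branch in the pipeline rather than something the generator papers over.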

Feels like a lot of “prompt engineering” discussions are actually architecture discussions in disguise.

I wrote a short free PDF breaking down these failure patterns and mitigation structures if anyone wants to explore the idea deeper.

(Free download in comments.)

reddit.com
u/HDvideoNature — 5 hours ago
▲ 12 r/Rag+1 crossposts

Introducing Exabase M-1: State-of-the-art AI memory with a smaller, cheaper model

We want to share some research we've been working on around memory retrieval for agents.

TLDR: our memory engine (M-1) just scored 96.4% on LongMemEval, the main benchmark for conversational memory. Highest reported score, and we did it with Gemini 3 Flash, not Pro.

The small model is the bit we care about most (cost efficiency).

When we started building our memory engine, we kept running into the same pattern: memory systems that only worked well when paired with big, expensive models.

The model ends up compensating for weak retrieval. Fine for a benchmark, but it falls apart in production where every query costs money and latency matters.

We wanted to know: can you build retrieval good enough that a cheap model gets the right answer?

That question led us to look at how human memory actually works – not as database lookup, but as reconstructive, associative, temporally-aware recall. We collaborated with Hyperplane Labs, a European applied research lab focused on cognitive AI architectures, on the retrieval architecture.

3 ideas that shaped the design:

  • Retrieval as reconstructive recall, not keyword search
  • Temporal awareness built into scoring, not bolted on
  • Context that's coherent and ordered, not just relevant
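To make "temporal awareness built into scoring" concrete, here is one generic way it could be done: multiply similarity by an exponential recency decay. This is only an illustrative sketch, not the M-1 method; the function name and the 30-day half-life are assumptions.

```python
def time_aware_score(similarity, age_days, half_life_days=30.0):
    """Blend relevance with recency: the score halves every half_life_days."""
    return similarity * 0.5 ** (age_days / half_life_days)

# A fresh, slightly less similar memory can outrank an older, more similar one.
fresh = time_aware_score(0.7, age_days=1)
old = time_aware_score(0.8, age_days=60)
```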

We evaluated on the most comprehensive benchmark for conversational memory – designed to stress multi-session reasoning, temporal understanding, and knowledge updates. The kinds of scenarios where current systems tend to break or fall back to larger models.

We achieved state-of-the-art results, with a smaller, cheaper model than every other system reported.

Full paper with methodology, comparative results, and downloadable data: https://exabase.io/research/exabase-achieves-state-of-the-art-on-longmemeval-benchmark

The system powers our own apps in production, and the memory API is available if anyone wants to try it.

If you're building agents with memory, we'd be curious to hear what retrieval problems you're running into.

Especially around multi-session reasoning and temporal updates, which is where we've seen the biggest gap between current approaches and what's actually needed.

u/j-m-k-s — 12 hours ago
▲ 16 r/Rag

Am I alone in telling my RAG clients to re-do their data from scratch?

I understand that the use case for most RAG systems is to let LLMs intelligently interrogate existing data/documents, but that's also where the common problems occur.

I'm an old school IT guy and, back in the day, we always used the term 'garbage in, garbage out' when talking about systems. And from years of experience, it's nearly always crappy data that causes problems, not the solution itself.

So when I talk to clients about new systems, I immediately start talking about accuracy of retrieval. This is when I hit them with the 'garbage in, garbage out' talk and include how AI isn't a magic bullet to improve data accuracy. I start talking to them about how to spend considerable effort completely re-doing the data they want to interrogate, explaining how this effort will pay off in accuracy of retrieval. In one case, we started out with a blank spreadsheet where the client started adding in the data they wanted to interrogate as text organised into chunks.

This transparency helps the client understand the challenges. It also gives the client ownership of their data. Plus the exercise of transforming their old datastores into something designed for AI helps the client become more familiar with their own data, plus the 'cleaned data' is a new business asset to be used in other facets of the business.

And, it makes developing a RAG system much easier, tweakable, and reliable.

But I don't hear many people talking about challenging the client to clean their data. The emphasis seems to be on making the RAG jump through hoops (badly) to deal with crappy data. Am I just lucky to find amenable clients interested in clean data?

u/JackStrawWitchita — 16 hours ago
▲ 5 r/Rag

Struggling with LLM Re-Ranking in Our Product Recommendation System – Any Advice?

Hey fellow data enthusiasts,

I've been experimenting with using LLMs for re-ranking in our product recommendation system. We already have collaborative filtering and popularity-based algorithms in place, but real order data shows that almost half of our users end up buying products ranked beyond position 20. And keep in mind, our first page only shows 4–5 items.

Here’s what I’ve tried so far:

  1. Directly re-ranking the top 100 candidate products using an LLM. Unfortunately, due to attention limitations, the results were sometimes worse than the original ranking. The model tends to push popular items back, even though users clearly exhibit herd behavior.
  2. Feeding the model user demand signals and profiles, scoring each product individually. This was a mixed bag: sometimes it correctly promoted the products users wanted, sometimes the opposite. Overall, performance slightly lagged behind the original ranking.
  3. Hierarchical / group-wise re-ranking. For example, protecting the top 10 items while re-ranking items 11–100. This gave a modest +2pp lift in conversion.
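For reference, the protected-head variant in option 3 can be sketched roughly like this. `score_fn` stands in for whatever the LLM scorer returns, and the ids and scores below are made up:

```python
def rerank_tail(ranked_ids, score_fn, protect=10):
    """Keep the first `protect` items in place; re-sort the rest by score_fn, highest first."""
    head, tail = ranked_ids[:protect], ranked_ids[protect:]
    # sorted() is stable, so ties keep their original relative order.
    return head + sorted(tail, key=score_fn, reverse=True)

ranked = list(range(1, 16))          # original ranking: product ids 1..15
llm_scores = {13: 0.9, 11: 0.2}      # hypothetical LLM relevance scores
result = rerank_tail(ranked, lambda pid: llm_scores.get(pid, 0.0))
```

Because ties fall back to the original order, items the LLM has no opinion on stay where collaborative filtering put them.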

A big challenge is that most of our users are new, so we have very little behavioral history to analyze, and even the data we have is noisy.

I’m curious if anyone has suggestions on:

  • Other techniques to improve LLM-based re-ranking under low-data / new-user scenarios
  • Using methods like GraphRAG or vector embeddings to enhance re-ranking effectiveness

Any thoughts or references would be greatly appreciated!


u/Recent_Source_4251 — 15 hours ago
▲ 0 r/Rag

nobody tells you that RAG in production is mostly just babysitting a broken retrieval pipeline

every tutorial is embed your docs, query, done. built something "working" in like 3 days and genuinely thought I understood it.

then I started going deeper for a writeup and realized how much was quietly broken under the surface.

the retrieval step is where everything dies. not the model. not the prompt. the part every tutorial skips because it's "straightforward."

spent way too long thinking the LLM was hallucinating. it wasn't. it was answering correctly based on the wrong document. was blaming the model the whole time while the actual problem was vector search not knowing what a version number is. semantically nearest != correct. "v2.3 release notes" and "v1.8 release notes" look almost identical to an embedding model.
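one common workaround is to pull the version out of the query and filter on metadata before any similarity ranking happens. minimal sketch below, assuming each doc carries a `version` field (hypothetical schema) and versions look like `v2.3`:

```python
import re

# Matches dotted versions such as "v2.3" or "V1.8.1"; a bare "v2" is not matched.
VERSION_RE = re.compile(r"\bv(\d+(?:\.\d+)+)\b", re.IGNORECASE)

def extract_version(text):
    m = VERSION_RE.search(text)
    return m.group(1) if m else None

def filter_by_version(query, docs):
    """docs: list of dicts with 'text' and 'version' keys (hypothetical schema)."""
    wanted = extract_version(query)
    if wanted is None:
        return docs  # no version mentioned, leave candidates untouched
    return [d for d in docs if d["version"] == wanted]

docs = [
    {"text": "v2.3 release notes: adds SSO", "version": "2.3"},
    {"text": "v1.8 release notes: adds export", "version": "1.8"},
]
candidates = filter_by_version("what changed in v2.3?", docs)
```

vector search then only ranks within `candidates`, so the embedding never has to tell 2.3 from 1.8.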

chunking is the other one. fixed-size chunking will cut a sentence in half, retrieve one half, and the model will confidently complete the thought. that's literally the problem you built RAG to solve. happening inside your solution.
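a sentence-aware chunker avoids the half-sentence problem. rough sketch only: the regex split is naive (it will trip on abbreviations like "e.g."), and `max_chars` is an arbitrary budget:

```python
import re

def chunk_by_sentence(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own chunk
    rather than being cut in half.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}" if current else s
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_by_sentence("Alpha is first. Beta is second. Gamma is third.", max_chars=32)
```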

stale indexes too. update a doc, forget to re-index, users get confidently wrong answers until someone notices. not even a hard problem, just nobody mentions it exists.
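a cheap guard is to store a content hash at index time and compare it on a schedule. sketch below, assuming you keep a `{doc_id: hash}` map somewhere next to the index (names are made up):

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_doc_ids(docs, indexed_hashes):
    """docs: {doc_id: current_text}; indexed_hashes: {doc_id: hash stored at index time}.

    Returns ids whose content is new or has changed since indexing.
    """
    return sorted(
        doc_id for doc_id, text in docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )

indexed = {"a": content_hash("old text"), "b": content_hash("same text")}
current = {"a": "new text", "b": "same text", "c": "brand new doc"}
to_reindex = stale_doc_ids(current, indexed)
```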

gone through this pipeline multiple times now across different projects. each tutorial solves a different 20% of it.

has anyone actually gotten to a point where this feels stable or is it just permanently on fire

u/SilverConsistent9222 — 20 hours ago
▲ 12 r/Rag

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Here is a research paper from PricewaterhouseCoopers that you might be interested in.

Lexical retrieval using grep consistently outperforms vector retrieval across various agent harness architectures under realistic noise conditions.

https://arxiv.org/abs/2605.15184

u/Express-Passion4896 — 1 day ago
▲ 3 r/Rag

Surprise: I gave sonnet 4.6 a go at turning a 90-page pdf into markdown and it did an excellent job

I've been playing with RAG and, like many, faced the challenge of what to do with PDF ingestion. Super frustrating: I've tried 10 different pipelines in the last few months. I hadn't tried just going back to a basic LLM in a while. I asked gpt 5.5 to do the same and it performed poorly, but sonnet 4.6 did great.

▲ 3 r/Rag

Is RAG for PDFs really marketable

I am planning to build a desktop app that allows users to query PDFs, weblinks, and local folders within a chat interface. But I can see that companies like Ollama and LM Studio already have similar apps. Is it worth building and competing with them, given that I'll be charging $50 for lifetime access while my open-source competitors offer it for free? I think my competitors don't focus on the researcher's niche as much as I plan to. Plus, if I can get 100 users to onboard, I'll be able to successfully break even. I don't need too much cash. Is this still a viable path to follow, or will I end up wasting my time? 

Edit:

Thanks for your feedback guys, I really appreciate it

u/Soren_Professor — 1 day ago
▲ 4 r/Rag+1 crossposts

Web scraping for LLMs was driving us insane, so we built our own Search API with native MCP support

Hey 👋

My team and I build AI agents, and web search has been our biggest pain point for the last six months.

The standard developer workflow right now is kind of awful: You hit a search API, get back links, write a scraper, deal with captchas and blocking, then end up feeding your LLM a giant pile of HTML full of cookie banners, menus, and random junk. The model gets confused and your token usage explodes.

So we decided to build something specifically for RAG pipelines and AI agents: Search Router (https://search-router.com)

A few things we focused on:

  • Speed: P99 latency under 800ms. Agents respond fast and users don’t sit around waiting.
  • MCP-ready: native support for Model Context Protocol. You can plug our config directly into Claude Desktop and let it run searches through the tool without burning Anthropic limits.
  • Clean JSON output: structured responses that are actually pleasant to work with programmatically.

What we shipped recently:

Added the Retrieved Context for LLM endpoint - instead of giving you the whole site or short snippets, our API returns a structured JSON with extracted relevant context. This heavily reduces the need for manual HTML cleanup and saves LLM tokens.

We want your feedback:

The project just launched and is still very early. So that you can properly break it on your pet projects and real projects without worrying about limits, we made the free tier unlimited during the launch period.

You just sign up (no card required) and get 2000 requests. Once the limit is out, you can just go to the dashboard and hit the "refill" button to get more free test credits.

Would love bug reports, edge cases, feature requests, or honestly just hearing where the product sucks right now!

u/ummitluyum — 1 day ago
▲ 37 r/Rag+1 crossposts

my coworker adrien (former elasticsearch / lucene committer) recently wrote a nice article about incorporating numerical attributes into a unified query plan with BM25 text scoring to provide better relevance in first-stage retrieval while still scaling to very large corpora

https://turbopuffer.com/blog/rank-by-attribute

for transparency, i work at turbopuffer : )

u/itty-bitty-birdy-tb — 1 day ago
▲ 3 r/Rag

Genuinely want to learn RAG

Hi Team,

I have built a RAG system on a self-hosted vector DB, doing the chunking myself and embedding with ChatGPT embeddings. I have asked different AI platforms (Gemini Pro, Claude, ChatGPT) to teach me how to perfect every step, but it feels like I only ever get an answer to the one question I asked. The system's answers are okay for now, but I feel they could be better. So far I have only used clean data and still need to test with raw data. I am a little lost, and I sometimes get anxious when it does not return a result. But I get a different kind of happiness when it gives the correct answer.

For instance, if I tell the AI that my RAG did not answer a specific question, it tailors the answer or system prompt so that the question passes only when the user asks it in that particular way.

I understand this question has been asked several times here, but I genuinely wish to ace RAG and want to learn more. Happy to pay for a course too if it is really good.

No, I do not want shortcut schemes, and I am willing to spend lots and lots of time tinkering with it. Developing this has given me immense joy, but I feel like there is a better way to learn.

I am very interested in learning more but I don't know how to go about it.

Could you please share any books, videos, or lectures that have helped you? It would mean a lot to me.

Sorry for the long post, and many thanks for listening to my story.

u/Sufficient-Ad-595 — 1 day ago
▲ 39 r/Rag

Do we really need embedding vectors?

Re-embedding source documents that update 10+ times a day is incredibly expensive and slow. It's making me question if we actually need the embedding layer at all.

Has anyone tried completely dropping vector similarity and relying purely on keyword search?
My thought: What if we use a fast LLM upfront to expand the user's prompt into multiple keyword variations (simple terms, complex phrases, synonyms), and run those against a standard keyword index?

Has anyone run this pattern? Can LLM query expansion + pure keyword search actually match the accuracy of dense embeddings? Would love to hear if this actually saves money or just creates a new bottleneck.
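For anyone who wants to try the pattern, here is a toy sketch: a basic BM25 scorer plus a query-expansion step. The `expand_query` function is a stand-in for the fast LLM call (here it is just a hard-coded synonym table), and the documents are made up. It assumes a non-empty corpus.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each doc against query_terms with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def expand_query(query):
    # Stand-in for the LLM expansion step: a hypothetical synonym table.
    synonyms = {"refund": ["reimbursement", "chargeback"]}
    terms = tokenize(query)
    return set(terms) | {s for t in terms for s in synonyms.get(t, [])}

docs = [
    "reimbursement policy for annual plans",
    "shipping times for europe",
]
scores = bm25_scores(expand_query("refund policy"), docs)
best = docs[scores.index(max(scores))]
```

Without expansion, the single term "refund" matches neither document; with it, "reimbursement" pulls in the right one. Whether that matches dense retrieval on a real corpus is exactly the open question, so this only shows the mechanics.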

u/sotpak_ — 2 days ago
▲ 4 r/Rag

What improved your RAG system accuracy the MOST?

Curious what actually moved the needle for people building production RAG systems.

Was it:

  • better embeddings?
  • hybrid retrieval?
  • reranking?
  • chunking?
  • metadata filtering?
  • larger models?

For me, retrieval improvements consistently mattered more than model upgrades.

Would love to hear real production experiences.

u/SheCodesSoftly — 1 day ago
▲ 2 r/Rag

Hot take: context pollution is becoming a bigger issue than hallucinations in RAG.

People talk a lot about hallucinations.

But honestly, I think a lot of “hallucinations” are just retrieval systems feeding garbage context into the model.

Once the context window gets polluted with:

  • partially relevant chunks
  • outdated docs
  • duplicated embeddings
  • weak semantic matches

the model starts reasoning on noisy evidence.
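One small mitigation for the duplicate part of this: drop near-duplicate chunks before they reach the prompt. A minimal sketch using token-set Jaccard overlap, where the 0.8 threshold is an arbitrary assumption and the input is assumed to be sorted best-first:

```python
def dedupe_chunks(chunks, max_overlap=0.8):
    """Drop chunks whose token set overlaps an already-kept chunk by more
    than max_overlap (Jaccard similarity)."""
    kept, kept_tokens = [], []
    for text in chunks:
        tokens = set(text.lower().split())
        is_dup = any(
            len(tokens & seen) / len(tokens | seen) > max_overlap
            for seen in kept_tokens
            if tokens | seen  # skip the empty-vs-empty case
        )
        if not is_dup:
            kept.append(text)
            kept_tokens.append(tokens)
    return kept

clean = dedupe_chunks([
    "Reset your password in Settings",
    "reset your password in settings",
    "Billing runs monthly",
])
```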

And the scary part is:
the answer still sounds intelligent.

Anyone else seeing this happen in production systems?

u/SheCodesSoftly — 1 day ago
▲ 4 r/Rag

Bigger models don’t fix bad retrieval.

A lot of RAG systems fail because:

  • the wrong chunks are retrieved
  • noisy context gets injected
  • relevance ranking is weak

Then teams try solving it by upgrading the LLM.

Feels like retrieval quality is still the most underrated part of AI infrastructure.

u/SheCodesSoftly — 1 day ago
▲ 3 r/Rag

The model can only reason about what retrieval gives it.

That sounds obvious.

But I think a lot of teams forget this while building RAG systems.

You can use the strongest LLM available…

but if retrieval sends:

  • incomplete evidence
  • outdated docs
  • loosely related chunks

the model is basically reasoning inside a distorted context window.

At that point the issue isn’t intelligence.

It’s information access.

u/SheCodesSoftly — 1 day ago
▲ 4 r/Rag

We replaced our RAG pipeline with persistent KV cache. It works. Now we want you to break it.

Hey Guys, I posted last week about replacing parts of our RAG pipeline with persistent KV instead of the usual chunk/embed/retrieve setup.

Way more people were interested than we expected, and a bunch asked if they could actually try it.

So we opened a beta.

This isn’t meant to replace RAG for everything.
If your data is massive, constantly changing every second, or way beyond context limits, traditional retrieval still makes sense.

But for certain workloads, it’s been surprisingly effective.

Think business docs, manuals, internal knowledge bases, etc., with repeated Q&A over the same document set.

The model sees the full context once, the KV cache stays persistent, and repeated queries don’t need the whole retrieval dance every time.

If the underlying information changes, we just resnapshot.

It’s basically less infra, less tuning, and fewer weird retrieval misses.

We’re looking for 5 people with real workloads who want to try it and help us figure out where it breaks.

Not toy prompts but real use cases would be helpful. Please either comment or DM me if you want to try it out. I will send a link.

Happy to answer any questions.

u/pmv143 — 2 days ago
▲ 23 r/Rag

Are people still using LangChain for their production RAG pipelines?

Feels like production RAG stacks are getting less LangChain-centric lately. A few months ago LangChain felt like the default answer for almost every LLM/RAG workflow discussion. Now I mostly see people moving toward LangGraph, MCP-style workflows, lighter custom orchestration, or fully in-house pipelines.

For people still using LangChain heavily in production RAG systems:

- what made you stay with it?

- did LangGraph replace most of your old chain setups?

- are you using LangSmith or open-source tooling for observability/evals?

u/Meher_Nolan — 3 days ago
▲ 6 r/Rag

How to parse tables from pdfs with 100% accuracy?

I've tried a lot over the past two weeks but can't find a simple solution. I have PDFs with 100-row tables and want to extract the tables into CSVs. I tried paid online services like extend, reducto, landing, and gemini; none are 100% accurate since they are OCR models.

I get accurate text extraction if I use Python PDF libraries like pdfplumber/camelot. The problem is that PDFs don't have a standard way of representing tables, so the output columns are sometimes split or combined improperly, e.g. two columns get merged into one. I tried adjusting some parameters, but it either over-merges or under-merges columns.
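One approach that may help when the column positions are stable across pages: skip automatic table detection and assign words to columns by explicit x boundaries. The sketch below is pure Python and assumes word dicts with `text`, `x0`, and `top` keys (the shape pdfplumber's `extract_words()` returns); the boundary values and the row tolerance are made up and would need to be measured from your PDFs.

```python
from bisect import bisect_right

def words_to_rows(words, column_edges, row_tol=3.0):
    """Group words into rows by vertical position, then into columns by x0.

    column_edges: sorted x positions where each column after the first begins.
    """
    lines = []  # each entry: [top of first word in the row, [words...]]
    for w in sorted(words, key=lambda w: w["top"]):
        if lines and abs(w["top"] - lines[-1][0]) <= row_tol:
            lines[-1][1].append(w)
        else:
            lines.append([w["top"], [w]])
    table = []
    for _, line_words in lines:
        cells = [""] * (len(column_edges) + 1)
        for w in sorted(line_words, key=lambda w: w["x0"]):
            col = bisect_right(column_edges, w["x0"])
            cells[col] = f"{cells[col]} {w['text']}".strip()
        table.append(cells)
    return table

words = [
    {"text": "Widget", "x0": 10, "top": 100.0},
    {"text": "Pro", "x0": 50, "top": 100.4},
    {"text": "19.99", "x0": 120, "top": 100.2},
    {"text": "Gadget", "x0": 10, "top": 115.0},
    {"text": "5.00", "x0": 121, "top": 115.1},
]
rows = words_to_rows(words, column_edges=[100])
```

pdfplumber's table settings also document an `explicit_vertical_lines` option for the same idea, if you would rather stay inside its table extractor.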

What is the right way to use the Python libraries for this? It's a pita to solve and I'm surprised it's not easier.

u/bravelogitex — 3 days ago
▲ 29 r/Rag

Fine-tuned RAG: teaching your retriever which embedding dimensions matter (+11% hit rate, +12% completeness, +9% faithfulness)

Hi all,

I developed a fine-tuned retrieval head (neural net) for RAG that transforms query embeddings before retrieval, so the system learns which embedding dimensions actually matter for your corpus — rather than weighting them all equally as standard cosine similarity does.

The problem

In any domain-specific corpus, some embedding dimensions are highly predictive for matching queries to the right passages, while others are effectively noise. Standard cosine similarity can't distinguish between the two, so retrieval gets pulled toward superficially similar but substantively irrelevant passages. The fine-tuned RAG is designed to prevent exactly that.

How it works

  1. Synthetic question generation — An LLM generates multiple questions per chunk in the corpus, for which the answers can be inferred from that chunk. This creates a dataset of question-chunk pairs (QA-pairs). These are embedded using an embedding model and divided into a training and validation set.
  2. Neural net training — A lightweight neural network using MNR loss is trained on the training QA-pairs. After each epoch, the model is evaluated on the validation set by measuring retrieval hit rate: the proportion of validation questions for which the correct chunk appears in the top-5 retrieved results. Retrieval works by embedding the question, passing it through the neural network to transform the embedding, and ranking all corpus chunks by cosine similarity to the transformed embedding.

Through this mechanism, the projection head learns for these 'type of questions' which dimensions in the embeddings are informative for finding the best chunks — and which are irrelevant.
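For readers unfamiliar with MNR (multiple negatives ranking) loss, here is a small pure-Python sketch of the objective itself, not the training code from the repo. Each query's paired chunk is the positive and every other chunk in the batch acts as a negative; the `scale` of 20 is an assumed temperature, and the 2-d vectors are toy data.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(queries, chunks, scale=20.0):
    """Mean cross-entropy where chunks[i] is the positive for queries[i]
    and all other chunks in the batch are negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, c) for c in chunks]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(queries)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(queries, [[1.0, 0.0], [0.0, 1.0]])   # positives match their queries
swapped = mnr_loss(queries, [[0.0, 1.0], [1.0, 0.0]])   # positives point the wrong way
```

The loss is near zero when each transformed query sits closest to its own chunk and grows when another chunk in the batch is closer, which is the signal the projection head is trained on.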

Results

To validate the architecture, I used the Legal RAG Bench dataset as a proof of concept — evaluating on 100 held-out test questions.

Retrieval Hit Rate:

  • The fine-tuned retriever achieves 82% Hit Rate (k = 20), compared to 71% for the standard cosine retriever — an 11 percentage point improvement, meaning the correct chunk appears in the top 20 results significantly more often when the query embedding is first transformed through the fine-tuned retriever.

Answer quality (LLM-as-judge, 1–5 scale across 6 metrics):

  • Outperforms traditional RAG (top-k cosine sim) on all 6 metrics
  • Largest gains in completeness (+12%) and faithfulness (+9%)
  • Consistent improvement across every metric — not just isolated gains — suggesting that retrieving more relevant context has a broad positive effect on answer quality

Code and full write-up available on GitHub: https://github.com/BartAmin/Fine-tuned-RAG

u/Much_Pie_274 — 2 days ago