r/Rag

▲ 4 r/Rag+1 crossposts

Can RAGFlow damage a GPU?

Long story short, I downloaded RAGFlow on my gaming computer to explore its AI capabilities and to develop my own GraphRAG. But now, every time I play any game, there are graphical artifacts.

I uninstalled everything and did everything by the book, but there are still weird artifacts that were not there before.

Am I cooked?

reddit.com

u/Popeye_Qc — 6 hours ago

▲ 0 r/Rag

Is RAG still relevant in 2026?

As the title suggests, is RAG still valuable in 2026? Is it worth digging into for research? I've heard a couple of critics say modern LLMs have become too powerful to benefit from RAG. What do you think?

reddit.com

u/Crazy-Economist-3091 — 14 hours ago

▲ 23 r/Rag+2 crossposts

What if retrieval used attention instead of embeddings? I built a local retriever with SOTA results on long-memory and code benchmarks.

Embedding-based RAG is easy to demo, but high-recall production retrieval is hard.

The core issue is that embeddings lose a lot of context. Nearest-vector search can miss evidence that a model would recognize if it could actually read the surrounding memory. Once recall starts failing, retrieval often turns into a pile of compensating tricks: chunk-size tuning, overlap tuning, keyword + semantic fusion, rerankers, metadata filters, query rewriting, summaries, thresholds, and more. These pieces can help, but nearest-vector search is still not the same thing as reading the evidence.

I built Attemory, an attention-native retrieval engine for long memory, documents, and codebases.

The core idea is simple: instead of embedding chunks and searching by vector distance, Attemory indexes raw corpora into reusable KV state. At search time, a local Qwen3.5 retrieval model attends over the indexed memory and the query, then returns compact evidence: memory ids, snippets, or file + line ranges.

So the retriever is not just matching compressed vectors. It is using model attention over model-readable memory.

My current view is that attention helps for three reasons.

First, embeddings force each chunk into a fixed vector before the query is known. That is efficient, but it can lose token-level details such as names, dates, code identifiers, negation, and local relationships between facts.

Second, attention lets the query interact with the original memory text at retrieval time. The model can score evidence in context instead of relying only on distance in embedding space.

Third, the retrieval policy is promptable. The system prompt, memory-local context, and query context can define what kind of evidence should be retrieved, while the returned candidates are still the original memory items.

The key performance idea is not to generate answers during retrieval. Attemory uses a decode-free retrieval path: index the corpus into reusable KV state, then use attention signals from the query to rank candidate memories. That keeps retrieval closer to model reading while avoiding a full generation loop for every candidate.
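To make the decode-free idea concrete, here is a toy Python sketch of ranking memories by the attention mass query tokens place on them. It is not Attemory's implementation: the per-token key vectors stand in for cached KV state, and pooling by summing softmax weights per memory id is an assumption chosen for illustration.

```python
import math
from collections import defaultdict


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rank_memories(query_vecs, memory_keys, top_k=2):
    """Rank memory ids by the attention mass query tokens place on them.

    query_vecs:  one vector per query token (stand-in for query projections).
    memory_keys: (memory_id, key_vector) pairs, one per indexed token
                 (stand-in for cached KV state; nothing is decoded).
    """
    scale = 1.0 / math.sqrt(len(query_vecs[0]))
    mass = defaultdict(float)
    for q in query_vecs:
        # Scaled dot-product score of this query token against every cached key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for _, k in memory_keys]
        # Attention is normalized over the whole memory, then pooled per memory id.
        for (mem_id, _), weight in zip(memory_keys, softmax(scores)):
            mass[mem_id] += weight
    return sorted(mass, key=mass.get, reverse=True)[:top_k]


# Toy index: memory "m1" has two tokens, "m2" and "m3" have one each.
keys = [
    ("m1", [1.0, 0.0]),
    ("m1", [0.9, 0.1]),
    ("m2", [0.0, 1.0]),
    ("m3", [-1.0, 0.0]),
]
print(rank_memories([[1.0, 0.0]], keys))  # ['m1', 'm2']
```

Summing weights per memory favors memories with more tokens; a real system would likely normalize or select specific layers and heads, which this sketch does not attempt.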

The benchmark results are something we take seriously, not a marketing slogan. The repo includes reproducible benchmark scripts, notes, commands, and result summaries. The results below are from raw corpus + raw benchmark query runs, without benchmark-specific retrieval hacks: no query rewriting, no summarization, no agent-driven exploration, and no external cloud service for retrieval.

Current results:

LongMemEval-S: 98.72% session Recall_any@5, 92.77% session Recall_all@5, 98.94% message Recall_all@50
LongMemEval-M: 94.89% session Recall_any@5, 83.62% session Recall_all@5, 92.55% message Recall_all@50
LoCoMo: 94.52% long-conversation QA accuracy
Semble: 0.9055 file-level NDCG@10 across 63 repos and 19 languages
SWE-QA: one Attemory code-search hint reduced Claude Code token usage by 43.8%, with near-tied judge quality across 15 repos and 720 questions
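The metrics in the list above are easy to misread, so here are their common definitions as a short Python sketch. These are standard formulations (binary relevance, log2 position discount); each benchmark's own harness may differ in details, and the session ids are made up. A benchmark score is the mean of the per-query values.

```python
import math


def recall_any_at_k(ranked, gold, k):
    """1.0 if at least one gold id appears in the top-k, else 0.0."""
    return 1.0 if set(ranked[:k]) & set(gold) else 0.0


def recall_all_at_k(ranked, gold, k):
    """1.0 only if every gold id appears in the top-k."""
    return 1.0 if set(gold) <= set(ranked[:k]) else 0.0


def ndcg_at_k(ranked, gold, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    gold = set(gold)
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(ranked[:k]) if doc in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0


ranked = ["s3", "s7", "s1", "s9", "s2"]
gold = ["s1", "s4"]
print(recall_any_at_k(ranked, gold, 5))  # 1.0: s1 is in the top 5
print(recall_all_at_k(ranked, gold, 5))  # 0.0: s4 is missing
```

The gap between Recall_any and Recall_all is why the two numbers above differ so much: Recall_all only counts a query when every piece of evidence is retrieved.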

One result worth highlighting is LongMemEval-M. It is around 1.5M tokens / 5k messages, and many memory systems do not evaluate on it at all. Attemory still retrieves all labeled evidence messages in the top 50 for 92.55% of answerable queries.

Because the retrieval path is decode-free, query-time search remains efficient in practice. For large indexes, especially the largest tests I have run at nearly 10M tokens, retrieval still benefits significantly from GPU or Metal acceleration.

Attemory runs locally and exposes a Python / HTTP retrieval API.

I also built a repository search CLI on top of the same retrieval engine. With `atcode`, you can index a repo once, ask natural-language repository questions, and get compact file + line-range evidence back. That makes it easy to try the retrieval quality directly without wiring the API into an app first.

Attemory is still early stage, and I am working on MCP integrations for coding-agent frameworks right now.

I would love feedback from people building agents, memory systems, RAG pipelines, or code-search tools. If embeddings have become a bottleneck in your retrieval stack, please try Attemory and tell us what works, what breaks, and what you would want next.

u/langsfang — 22 hours ago

▲ 3 r/Rag

Should RAG systems ever make memories permanent?

Curious what people here think.

Most RAG stuff is retrieve, answer, disappear.

Which makes sense most of the time.

But for longer-running agents or tools, is there a point where some retrieved knowledge should become permanent memory?

Not everything. Obviously that would be a mess.

But maybe the system decides “this matters, I should remember it.”

Would that actually help, or does it just create more problems around stale knowledge and bad assumptions?

reddit.com

u/iCryptoDude — 20 hours ago

▲ 7 r/Rag

What is dragging Knowledge Graphs down?

Suppose:

Law A states: Stealing is illegal.

Law B states: Theft of food in case of necessity is permitted.

Lexical search will only capture one of these, since there is no keyword match between them. Semantic search could potentially see the connection, but there is a good chance the right chunk falls outside the top-k window.

Knowledge graphs are supposed to solve this issue and sound really good on paper... but they don't seem to perform any better than other RAG techniques.

Is the issue mostly that the LLM at indexing time is not extracting complex enough relationships? Or is it the embedding model not bridging the gap (at indexing and query time)? A knowledge graph sounds like the closest thing to how a human brain would store (or even update) information.
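A minimal Python sketch of what the graph is supposed to add here: a lexical query hits only Law A, and a one-hop expansion over a typed edge pulls in Law B. The `exception_to` edge is written by hand and is hypothetical; whether an indexing-time LLM actually extracts edges like it is exactly the open question.

```python
import re

LAWS = {
    "A": "Stealing is illegal.",
    "B": "Theft of food in case of necessity is permitted.",
}
# Hypothetical typed edge that an LLM extractor would need to produce at indexing time.
EDGES = [("B", "exception_to", "A")]
STOPWORDS = {"a", "an", "the", "is", "of", "in", "if"}


def tokens(text):
    """Lowercased word tokens with a few stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def keyword_hits(query):
    """Pure lexical retrieval: laws sharing at least one non-stopword token."""
    q = tokens(query)
    return {law_id for law_id, text in LAWS.items() if q & tokens(text)}


def expand_one_hop(hits):
    """Add every law connected to a hit by an edge, in either direction."""
    expanded = set(hits)
    for src, _relation, dst in EDGES:
        if src in hits:
            expanded.add(dst)
        if dst in hits:
            expanded.add(src)
    return expanded


query = "Is stealing bread illegal?"
print(sorted(keyword_hits(query)))                  # ['A']
print(sorted(expand_one_hop(keyword_hits(query))))  # ['A', 'B']
```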

reddit.com

u/Nervous-Positive-431 — 22 hours ago

▲ 1 r/Rag

Most "our RAG is inaccurate" problems are actually retrieval problems.

I've spent a lot of time fixing RAG systems. I think "our RAG is inaccurate" problems are actually about finding the right information, not about the model generating answers.

The model usually isn't making things up. It's answering based on what it was given.

The real issue is that it's getting the wrong piece of information to work with.

The biggest improvements I've seen come from:

Breaking up documents into chunks based on how they're structured, not just using fixed sizes.
Adding a reranking step to reorder the results after vector search.
Creating a test set from questions people asked instead of just guessing what would work.
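Structure-aware chunking can be as simple as splitting at headings. Here is a minimal Python sketch, assuming Markdown input; the `max_chars` limit and the paragraph-packing fallback are illustrative choices, not the poster's pipeline.

```python
import re
from typing import List


def chunk_by_headings(markdown: str, max_chars: int = 1200) -> List[str]:
    """Split Markdown at headings so each chunk starts at a section boundary.

    Sections longer than max_chars are packed paragraph by paragraph; a single
    paragraph that is longer than max_chars is kept whole.
    """
    # Zero-width split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        current = ""
        for para in section.split("\n\n"):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)
    return chunks


doc = "# Intro\nRAG basics.\n\n## Setup\nInstall things.\n\n## Usage\nRun it."
print(chunk_by_headings(doc))
```

One common refinement is to repeat the section heading at the top of every sub-chunk of an oversized section, so each chunk keeps its context.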

What surprised me most was how little difference switching models made compared to improving how we find the right information.

Models that can reason make this even clearer. They don't fix bad information. They just give a more convincing answer based on the wrong idea.

GraphRAG definitely has its use for complex questions that involve many connected documents. For simple questions about a document, I've often found that fixing how we break up documents and find the right information solves the problem before we need a more complicated system.

For those who've shipped RAG to production:

What ended up making the biggest difference for accuracy? Was it the model, or was it finding the right information that was the real problem?

reddit.com

u/recro69 — 18 hours ago

▲ 4 r/Rag+1 crossposts

I want to build an Enterprise Knowledge Management System using company data. What all design decisions should I make?

Hi guys,

I have theoretical knowledge of RAG, agents, etc. I have been a computer vision engineer for the past 6 years, and now I am starting projects in gen AI.

The use case: I want to build a knowledge management system. How do I design the entire pipeline?

For now, I have to do a POC with a few documents in SharePoint. Storage, deployment, etc. will be on an HPC cluster for now and will later move to the cloud.

reddit.com

u/Appropriate_Dirt8284 — 22 hours ago

▲ 2 r/Rag

When to use Graph RAG? Traditional RAG vs GraphRAG

I have been playing with RAG; recently I built a traditional RAG application over the LangChain docs.
During development I ran into multiple errors: chunking failed, and manually overlapped chunks took around 49 minutes to ingest into the vector DB (I don't know how to speed this up).
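Slow ingestion is often caused by making one embedding call and one database write per chunk. A hedged Python sketch of overlapping chunking plus batched ingestion: `embed_batch` and `upsert` are hypothetical callables standing in for your embedding model and vector DB client, and this only helps if they accept lists.

```python
from typing import Callable, Iterator, List, Sequence


def sliding_chunks(text: str, size: int = 800, overlap: int = 100) -> List[str]:
    """Fixed-size character windows; each repeats the last `overlap` chars of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def ingest(chunks: Sequence[str],
           embed_batch: Callable[[List[str]], List[List[float]]],
           upsert: Callable[[List[str], List[List[float]]], None],
           batch_size: int = 64) -> None:
    """One embedding call and one DB write per batch instead of per chunk."""
    for batch in batched(chunks, batch_size):
        vectors = embed_batch(list(batch))
        upsert(list(batch), vectors)


chunks = sliding_chunks("a" * 2000, size=800, overlap=100)
writes = []
ingest(chunks,
       embed_batch=lambda b: [[float(len(c))] for c in b],  # fake embedder
       upsert=lambda b, v: writes.append(len(b)),           # fake DB write
       batch_size=2)
print(writes)  # [2, 1]
```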

Anyhow, I was able to complete the project, but the retrieved results aren't relevant. After debugging, I realised I made a huge mistake in ingestion.

Recently I learned about GraphDB, so I want to try it out.
But before doing that, I want to know when to use GraphRAG and when not to.
What are the pros and cons?
I've also read that the graph keeps building every time I chat, which I think would get expensive.
How can I solve this?

Let me know folks.
Any help is much appreciated.

reddit.com

u/Pleasant-Survey6861 — 1 day ago

▲ 23 r/Rag

What is the most useful RAG pipeline for you in production?

Hello everyone,

I’ve been enjoying working on RAG applications for quite some time now. I experiment with and use multiple RAG techniques for each client. What are you all doing in this area? What methods, techniques, and tech stacks do you use?

I believe that each of our experiences can serve as inspiration for one another.

reddit.com

u/Lanaxsa — 1 day ago

▲ 2 r/Rag

Provenance in RAG/agent memory authenticates the source, not the truth — we red-teamed our own poison defenses

We ship an open-source agent-memory core and added four defenses against memory/RAG poisoning: value-weighting, a corroboration gate, deterministic supersession, and earned-outcome credit. Then we red-teamed all four against an attacker who *knows* them.

All four fall, for the same reason: each scores a record by something computable from the record's *own content* — and the attacker writes the content. Value is self-declared, corroboration is self-sourced, the winning write is self-timed, the "success" is self-graded.

The only signals the writer can't author are provenance (where it came from) and cost. But the catch that matters for RAG: **provenance authenticates the source, not the truth.** MINJA poisons memory from inside a legitimate, authenticated session — real provenance, false content — so a provenance check waves it through. PoisonedRAG shows the same on the retrieval side.

So provenance/cost is a floor, not a fix. The rule we landed on: **write-cheap, influence-expensive** — store anything in its own scope, but require corroboration by *distinct* anchored sources before a memory can influence an answer outside its scope.
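A minimal Python sketch of the "write-cheap, influence-expensive" rule as stated: any memory is usable inside its own scope, but influencing anything outside that scope requires corroboration from enough distinct origins. The `Memory` schema, the `(origin, record_id)` anchors, and `min_origins=2` are hypothetical choices for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class Memory:
    content: str
    scope: str  # e.g. the session or tenant that wrote it
    # Hypothetical (origin, record_id) anchors; "origin" should be something
    # the writer cannot mint freely.
    anchors: Set[Tuple[str, str]] = field(default_factory=set)


def distinct_origins(memory: Memory) -> int:
    """Count corroborating origins, so many records from one origin count once."""
    return len({origin for origin, _record in memory.anchors})


def may_influence(memory: Memory, request_scope: str, min_origins: int = 2) -> bool:
    """Write-cheap, influence-expensive gate."""
    if memory.scope == request_scope:
        return True  # cheap: a memory always applies inside its own scope
    return distinct_origins(memory) >= min_origins  # expensive: independent corroboration


m = Memory("x", "session:alice", {("wiki", "1"), ("wiki", "2")})
print(may_influence(m, "session:alice"), may_influence(m, "global"))  # True False
```

As the post notes, a gate like this does nothing about the authenticated-but-false case, where every counted origin is individually legitimate.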

It's all textbook (Sybil, Goodhart, CRDT last-writer-wins, adaptive-eval) — the value is the runnable red-team of one real stack + the honest ceiling. Writeup + runnable probe: https://dancenitra.github.io/agora/public/posts/agent-memory-defense-provenance-not-truth.html

The part I don't have a clean answer for: the *authenticated-but-false* case — when every corroborating source is individually legitimate. How are you handling that in RAG?

reddit.com

u/Danculus — 2 days ago

▲ 6 r/Rag+2 crossposts

I built an LLM eval gate that can't silently pass

https://github.com/albertofettucini/faithgate

Most LLM eval setups I've seen have a failure mode ops people will recognize: the happy path is green, and every unhappy path is also green. Judge API dies, no scores, nothing to compare, pass. That's an availability metric wearing a quality-gate costume.

I built faithgate around the opposite default. It's a faithfulness regression gate (suite of cases, score per prompt/model version, diff vs baseline, nonzero exit on regression) where every ambiguous state fails closed. Zero matched cases: fail. Unscored run: fail. Every score an abstention: fail. Abstentions are a distinct state in storage, never coerced to 0.0, and there's a --max-abstained policy flag for when you actually want tolerance.

Reproducibility bits: every run writes a manifest with judge id, model, kind, ragas and runner versions, and the suite hash. If the judge changed between baseline and head, comparing the scores is meaningless, so the gate exits 3 unless you explicitly pass --allow-judge-change. A corrupted manifest also fails closed. Duplicate case keys resolve pessimistically (baseline keeps max, head keeps min) so dupes can't quietly lower the bar.
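The fail-closed rules described above map naturally onto a small decision function. This is a hedged re-creation in Python, not faithgate's code: the function names, exit code 1, comparing mean scores, and the `tolerance` parameter are assumptions. The judge-change exit code 3, the pessimistic duplicate handling, and the fail-closed cases come from the post.

```python
from statistics import mean

EXIT_OK = 0
EXIT_FAIL = 1           # assumption: the post only says "nonzero exit on regression"
EXIT_JUDGE_CHANGED = 3  # from the post


def dedupe(cases, pick):
    """Collapse duplicate case keys with `pick` (max or min); None means an abstention.

    Simplifying assumption: an abstained duplicate is replaced by a scored one.
    """
    out = {}
    for key, score in cases:
        prev = out.get(key)
        if prev is None:
            out[key] = score
        elif score is not None:
            out[key] = pick(prev, score)
    return out


def gate(baseline, head, baseline_judge, head_judge, *,
         allow_judge_change=False, max_abstained=0, tolerance=0.0):
    if baseline_judge != head_judge and not allow_judge_change:
        return EXIT_JUDGE_CHANGED  # scores from different judges aren't comparable
    base = dedupe(baseline, max)  # pessimistic: baseline keeps its best duplicate
    cur = dedupe(head, min)       # pessimistic: head keeps its worst duplicate
    matched = base.keys() & cur.keys()
    if not matched:
        return EXIT_FAIL  # zero matched cases (including an unscored run) fails closed
    scored = [k for k in matched if base[k] is not None and cur[k] is not None]
    if not scored or len(matched) - len(scored) > max_abstained:
        return EXIT_FAIL  # all abstentions, or more than the policy allows
    if mean(cur[k] for k in scored) < mean(base[k] for k in scored) - tolerance:
        return EXIT_FAIL  # regression vs baseline
    return EXIT_OK
```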

My favorite part lives in CI. Next to the normal green gate there's a proves-detection job that runs the gate against a deliberately regressed suite and inverts the exit code. If the gate ever loses the ability to catch a known-bad change, dependency bump, refactor, whatever, the pipeline itself goes red. Tests for the test.

Judge honesty: the default is Claude via your own key (RAGAS underneath). The keyless offline mode is published as untrustworthy: 68% balanced accuracy on a 40-example hand-labeled set, catching 9/20 unfaithful examples, with a unit test asserting the weakness.

Storage is one SQLite file with WAL, no server. Python 3.9 to 3.13, MIT. Known limitation: case identity is content-based, rewording a question mints a new case.

u/ahumanbeingmars — 2 days ago

▲ 3 r/Rag

Bratan is a self-improving Retrieval-Augmented Generation framework built on an adversarial three-agent loop

Just curious to get your guys input on the direction/viability of this approach.

https://github.com/AllanWessels/Bratan

Bratan is a self-improving Retrieval-Augmented Generation framework built on an adversarial three-agent loop: Red Team breaks the pipeline. Blue Team fixes it. Judge keeps the score. They iterate against a co-evolving test set until your RAG converges on something genuinely good — not just something that scores well on a static benchmark.

u/Altruistic-Data-7773 — 1 day ago

▲ 48 r/Rag

fine-tuned a VLM for messy-PDF extraction, 46% → 91.1% on OmniDocBench. runs fully on your own hardware, looking for people to break it

edit:
46% → 79% on Parsebench (ranks 2nd), and 91.1% on OmniDocBench.

hey folks,

been heads-down on this for a while so figured i'd finally show it

TLDR; i've been working on document extraction, the boring-but-painful part where you take a nasty PDF (multi-column, merged-cell tables, half-scanned garbage) and try to get clean structured data out of it. took a base VLM sitting around 46% on OmniDocBench, did a bunch of LoRA + a few architecture changes, and got it to 91.1%. tables were the big unlock, that's usually where everything falls apart.

couple things someone might care about:

- it runs fully on your own hardware. no shipping documents off to some API. that was kinda the whole reason i started this.

- performs really well on charts/tables (an area I love working on)

- near-zero hallucination, it doesn't invent rows or numbers that aren't there.

not trying to do a big pitch. i just want people to throw hard stuff at it and tell me where it breaks. so if you've got a PDF that's been the bane of your existence, the kind that makes every parser cry, drop it on me (or DM)

happy to nerd out on the training setup or the arch changes too if anyone's curious.

cheers

reddit.com

u/aabbyyyy038 — 3 days ago

▲ 6 r/Rag

Need advice on digitizing hospital paper records into a RAG system (first large-scale project)

Hi everyone,

I recently got an opportunity to pitch an AI solution to a hospital. The interesting part is that most of their patient records are still stored as physical paper files. They haven't digitized much yet.

My idea is to eventually build a RAG-based assistant where doctors or hospital staff can ask questions like:

"Summarize this patient's medical history."

"Has this patient ever been diagnosed with diabetes?"

"What medications has this patient taken before?"

The challenge is that I've never worked on a project that starts with years of paper records. I've built RAG systems before, but not for something this large or in healthcare.

I'm trying to figure out how I should approach this. Should I first propose a digitization phase (scanning + OCR + structuring the data) and then build the RAG system? Or is there a better way to tackle it?

I'd also love to hear from anyone who's worked on hospital records or healthcare AI.

What would your architecture look like?

What were the biggest challenges?

Any mistakes you made that I should avoid?

Is this something that's realistic for a small team, or am I underestimating the effort?

Right now I'm preparing my proposal for the hospital, so I want to make sure I'm thinking about this the right way before I commit to anything.

I'd really appreciate any advice or experiences you can share. Thanks!

reddit.com

u/MRScientists — 2 days ago

▲ 20 r/Rag+1 crossposts

I ship one Rust core to 14 languages from a single config

For the last few months I've been building document and RAG infrastructure (crawling, HTML-to-Markdown, extraction) on a Rust core, and shipping each library to Python, Node, Go, Ruby, Java, and a dozen more.

The hard part isn't the Rust. It's producing genuinely idiomatic native packages for every language and keeping them in sync as the core changes. Doing that by hand across 14 targets is a nightmare.

So I built alef: one config generates the bindings and the packaging for all targets straight from the Rust type definitions. It now drives all my polyglot libraries (the crawler, the HTML-to-Markdown engine, an LLM client, a tree-sitter grammar pack). All MIT.

https://github.com/xberg-io/alef

Happy to get into the binding and packaging weeds if you're shipping cross-language libraries.

u/Goldziher — 2 days ago

▲ 2 r/Rag

Best open source model for RAG application

We're at a hackathon and want to build a RAG application that runs locally. We have 2 laptops: one with 32GB RAM and a 13th-gen i7 (no GPU), and another with 16GB RAM, a 12th-gen i5, and a 4GB NVIDIA graphics card. Which open-source LLM would you recommend for the best performance and fewest hallucinations?

reddit.com

u/ScalarNeo — 2 days ago

▲ 6 r/Rag+3 crossposts

Tried a recurrent architecture (HRM) for reasoning-retrieval, the bet held up.

The bet: BRIGHT is a retrieval benchmark where finding the right doc usually takes a few hops of reasoning, not just semantic overlap. Most embedders do a single forward pass. I wanted to see if a depth-recurrent architecture, one that loops over its own hidden state, would fit that better, so I built an embedder on HRM (Sapient's Hierarchical Reasoning Model). As far as I can tell it's the first time HRM's been used for retrieval.

The recurrence helped on the reasoning side, which was the whole bet. When I dialed the recurrence down at eval on pony (one of the BRIGHT domains), accuracy dropped with every loop I removed. Where it hit a wall was knowledge: the base was pretrained on a deliberately thin slice of text (Sapient built HRM-Text for pretraining efficiency, not breadth), so it's weak on knowledge-heavy domains. The part I find coolest: at 0.6B, the reasoning is coming from the architecture, not from scale.

Details:

~0.6B params, trained on one 3060 Ti (8GB).
Recipe's deliberately boring: mean-pool + L2, bidirectional (LLM2Vec style), contrastive InfoNCE. Only the backbone is unusual. Same recipe as RakanEmbed4B.
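Since the training recipe itself is standard, here is a dependency-free Python sketch of just its math: mean pooling over unmasked tokens, L2 normalization, and an InfoNCE loss. In-batch negatives and the 0.05 temperature are assumptions (the post doesn't state them), real training would run on tensors in a framework, and the bidirectional (LLM2Vec-style) attention lives inside the backbone and isn't shown.

```python
import math


def mean_pool_l2(token_vecs, mask):
    """Mean-pool the token vectors where mask is 1, then L2-normalize."""
    kept = [v for v, m in zip(token_vecs, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    pooled = [sum(v[d] for v in kept) / len(kept) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in pooled]


def info_nce(queries, docs, temperature=0.05):
    """In-batch InfoNCE: docs[i] is the positive for queries[i]; other docs are negatives."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in docs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
        loss += log_z - logits[i]  # cross-entropy with the positive as target
    return loss / len(queries)


q = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(q, q) < info_nce(q, [[0.0, 1.0], [1.0, 0.0]]))  # True
```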

Numbers (BRIGHT, mean nDCG@10, 12 domains):

original: 18.1
query rewriting: 34.3
merged: 33.7

Weights are Apache-2.0 and the full BRIGHT eval harness is in the repo.

Open questions / discussion:

Would a massively pretrained HRM push this further? The ceiling here looks like knowledge, not reasoning, so a broadly-pretrained base might lift it a lot. I don't have the compute to try that myself.
Would other recurrent architectures show the same effect, or is something specific to HRM doing the work?

Model: https://huggingface.co/viventhraa96/HRM-Embed-0.6b

Code: https://github.com/okaybroda/hrm-embed

Full credits to Sapient Inc for open sourcing the code and the architecture for this work.

u/v1v55 — 1 day ago

▲ 6 r/Rag

RAG + MCP: Are we heading toward over-engineering?

Genuine question.

Every new AI stack seems to include RAG + MCP + Agents + Memory + Tools.

At what point does a well-designed MCP server replace the need for RAG, and when is RAG still the better choice?

Curious how experienced builders are deciding between the two.

reddit.com

u/Diamond_1974 — 3 days ago

▲ 6 r/Rag

I built BaryGraph - knowledge graph where every relationship is its own embedded document (not an edge)

Instead of node --edge--> node, every relationship is a first-class document with its own vector, called a BaryEdge. Stack pairs of BaryEdges recursively and you get "MetaBary" triads that surface structural bridges between concepts that live nowhere near each other in embedding space. Running locally on MongoDB Community + mongot + nomic-embed-text over the full English Wiktionary (6.6M docs). MCP server is live if you want to poke at it. Preprint + benchmark CSVs: https://zenodo.org/records/20186500

The problem I was chasing

Flat vector search treats a relationship as a byproduct of two points being close. That throws away information. Two papers can describe the same underlying phenomenon (a flyby anomaly in orbital mechanics, an anomalous residual in stellar dynamics) without ever citing each other and without their embeddings landing anywhere near each other. Nothing in standard RAG surfaces that connection.

What I did instead

Every relationship gets embedded too:

bary_vector = normalize(q·v(CM1) + q·v(CM2) + (1−q)·v(type))

q is connection quality, v(type) is a contextual embedding of what kind of relationship it is. This BaryEdge is now a retrievable document in its own right — not metadata on an edge.
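The BaryEdge formula translates directly into a few lines of Python. This sketch takes the formula as written; restricting q to [0, 1] is an assumption, and the input vectors are toy values rather than nomic-embed-text outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def bary_vector(v_cm1, v_cm2, v_type, q):
    """bary_vector = normalize(q·v(CM1) + q·v(CM2) + (1−q)·v(type))."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q is assumed to lie in [0, 1]")
    return normalize([q * a + q * b + (1.0 - q) * t
                      for a, b, t in zip(v_cm1, v_cm2, v_type)])


print(bary_vector([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], 0.5))
```

Reading the formula: at q = 1 the relationship-type embedding drops out entirely, and at q = 0 only the type embedding remains.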

Then it recurses: two BaryEdges at the same level get bridged by a third one level below, forming a MetaBary triad. Do that repeatedly and you climb a hierarchy of abstraction triads built entirely from algebra, with zero additional embedding calls above the base level. It's a forest (every node has at most one parent), so traversal to root is a single $graphLookup, with no cycle handling needed.

Does it actually do anything useful?

Ran it against SimLex-999 and WordSim-353 as a sanity check (not the main claim, just "is the substrate coherent"). Raw cosine similarity barely correlates with human similarity judgments (ρ ≈ −0.04 on SimLex). Structural metrics — how many BaryEdges two words share, how much their relational neighborhoods overlap — correlate at ρ ≈ 0.32–0.53, p < 10⁻¹⁵. So the graph is encoding something cosine alone doesn't.
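One plausible reading of "how much their relational neighborhoods overlap" is Jaccard overlap of each word's set of related words. The Python sketch below assumes a simplified `(edge_id, word1, word2)` edge schema, which is hypothetical; real BaryEdges also carry a type and a quality q.

```python
from collections import defaultdict


def neighborhoods(bary_edges):
    """Map each word to the set of words it is directly related to.

    bary_edges: iterable of (edge_id, word1, word2) tuples (hypothetical schema).
    """
    nbrs = defaultdict(set)
    for _edge_id, w1, w2 in bary_edges:
        nbrs[w1].add(w2)
        nbrs[w2].add(w1)
    return nbrs


def neighborhood_overlap(nbrs, a, b):
    """Jaccard overlap of two words' neighborhoods (0.0 if both are empty)."""
    na, nb = nbrs.get(a, set()), nbrs.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


edges = [("e1", "car", "wheel"), ("e2", "truck", "wheel"), ("e3", "car", "road"),
         ("e4", "truck", "road"), ("e5", "truck", "cargo")]
print(neighborhood_overlap(neighborhoods(edges), "car", "truck"))
```

Computed over SimLex-999 word pairs, a score like this can be compared with the human ratings using a rank correlation such as `scipy.stats.spearmanr`.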

The part I actually care about is cross-domain bridging. Some probe traces from the live graph:

octopus neuroscience ↔ distributed sensor networks, bridged by shared structural-motif vocabulary (neuroarchitecture, smartdust)
collagen folding ↔ linguistic syntax, bridged by etymological + structural motif overlap (plicature / hypotaxis-parataxis)
grief ↔ depression, not bridged and this is a correctness demonstration, not a missing capability. The DSM-5 added a much-debated "bereavement exclusion" precisely because grief and depression share surface symptoms but are different kinds of state, with different prognosis and treatment
radioactive decay ↔ obsolete words falling out of use, bridged at a high abstraction level by register-varied decay verbs (collapsed, decayed, declined, disintegrated) — naming a Poisson-process state-loss pattern that both physics and historical linguistics instantiate, with no single word doing the work

That last one is the case flat retrieval structurally cannot produce — there's no embedding axis for "verbs co-occurring with reduction-of-state across unrelated domains."

Stack (all local, all free)

GitHub: https://github.com/oleksiy-perepelytsya/bary-vector

MongoDB Community Edition + mongot for storage/vector search
nomic-embed-text, 768-dim
Python 3.11+
Full build: ~6.66M documents, 8–14 hrs on a single workstation (8–16GB VRAM)

Try it

The MCP server is available on request (SSE transport), with read-only tools for searching the live graph: find_word, semantic_search, edge_info, leaf_nodes, traverse_up, sample_metabary. If you've got an MCP-capable client, you can point it at the graph and run your own probe queries in a few minutes.

What I'd actually want feedback on

Whether the cross-domain bridges hold up to someone who isn't me poking at them: try a probe query on a domain pair you know well and tell me if the bridge is real or if I'm pattern-matching myself into seeing structure that isn't there. Some bridges may not be obvious at first glance, but those are often the most intriguing ones and worth digging into to understand why they formed, so treat them as points of investigation.
Whether this is worth comparing directly against GraphRAG/RAPTOR-style hierarchical retrieval (I haven't done that benchmark yet, and I know that's the first thing this sub will ask)
Whether anyone's tried something structurally similar and it fell apart at scale for reasons I haven't hit yet

Preprint, architecture spec, and the raw SimLex/WordSim CSVs are all here: https://zenodo.org/records/20186500

Happy to drop the MCP endpoint on request if there's interest.

reddit.com

u/adseipsum — 3 days ago

▲ 2 r/Rag

Need help with project

My part in this project is hybrid search (RRF, BM25). Please tell me where to study from, because most tutorials I've seen are AI-generated BS. Please help.
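For reference while studying, here is a from-scratch Python sketch of both pieces. k1 = 1.5, b = 0.75, and k = 60 are commonly used defaults rather than requirements. The "semantic" ranking is made up for illustration instead of coming from a real vector search.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: str, docs: Sequence[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of `query` against each doc (lowercased whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for t in tokenized for term in set(t))  # document frequency
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def rrf(rankings, k: int = 60) -> list:
    """Reciprocal Rank Fusion: each ranking adds 1 / (k + rank) per doc, ranks starting at 1."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]


docs = [
    "hybrid search combines bm25 and vectors",
    "reciprocal rank fusion merges rankings",
    "bm25 is a lexical ranking function",
]
scores = bm25_scores("bm25 ranking", docs)
lexical = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)  # [2, 0, 1]
semantic = [1, 2, 0]  # made-up vector-search order, for illustration only
print(rrf([lexical, semantic]))  # [2, 1, 0]
```

RRF uses only ranks, not raw scores, so BM25 scores and cosine similarities never need to be put on the same scale.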

reddit.com

u/PatienceAdmirable659 — 2 days ago