u/Financial_Ad8530

I Tried to Answer a Researcher's Question On a GPU

Lucas Soares asked the right question at ODSC London. I tried to answer it with Qwen3 32B, a reranker, and zero OpenAI calls.

//Background

Last year, a data scientist named Lucas Soares gave a workshop at the Open Data Science Conference in London. His central question was genuinely interesting:

>"How can we leverage LLMs to enhance research workflows without diminishing the cognitive engagement of researchers?"

He showed elegant ideas: structured outputs, hypothesis extraction, and evidence scoring. The whole thing ran on GPT-4o. Every single call, paid, cloud, OpenAI.

I read it and thought: what if you built the same thing, but nothing left your machine?

That question sat with me for a while. Then I had a cloud GPU, a free afternoon, and a specific complaint from the comments section of my last article: someone asked why I didn't use a reranker.

So I built it. Here's what happened.

https://preview.redd.it/m5tnf40xom2h1.png?width=1280&format=png&auto=webp&s=61caac7316501eeae778a528cef63aeb621110da

//The Result First

Three research queries. One pipeline. Everything runs locally except the final polish, which is a single Claude call at the very end.

Query 1: How does reranking improve RAG retrieval quality?

BGE Reranker top results:

  1. paper_03.txt — Vendi-RAG (score: 0.9506)
  2. paper_16.txt — InfoGain-RAG (score: 0.7686)
  3. paper_06.txt — Blended RAG (score: 0.4694)
  4. paper_14.txt — RankArena (score: 0.4469)
  5. paper_06.txt — Blended RAG (score: 0.0944)

https://preview.redd.it/x3ro5tt3pm2h1.png?width=954&format=png&auto=webp&s=c456ddb389bbe97cc1c984a8778e704085af3c15

>!Qwen3 32B Analysis:!<

Reranking improves RAG retrieval quality by filtering out irrelevant or redundant documents, enhancing relevance and diversity, and ensuring only the most informative documents are used for answer generation.

>!Key findings:!<

  • Reranking filters irrelevant/misleading documents (InfoGain-RAG)
  • Hybrid reranking significantly enhances accuracy at scale (Blended RAG)
  • Information-gain-based reranking reduces noise and hallucination
  • Iterative diversity-aware retrieval improves multi-source reasoning (Vendi-RAG)

>!Time: 8.68 seconds total | Retrieval + rerank: 0.03s!<

Query 2: What are the main failure modes of RAG systems?

BGE Reranker top results:

  1. paper_07.txt — CARROT (score: 0.1545)
  2. paper_01.txt — RAG Stack Review (score: 0.0096)
  3. paper_13.txt — Ragas (score: 0.0046)

https://preview.redd.it/xkwz2ckcpm2h1.png?width=976&format=png&auto=webp&s=7cef8dfc07e75b2f02fef401614db21c5a71c431

>!Qwen3 32B Analysis:!<

Three fundamental failure modes identified:

  • Chunks retrieved in isolation — ignoring relationships and redundancy
  • Non-monotonic utility — more context can actively degrade output
  • Query-insensitive retrieval — same strategy for every question type

>!Time: 7.49 seconds total | Retrieval + rerank: 0.03s!<

Real arXiv papers. Real findings. Cited sources. Under 15 seconds per query.

//What This Is Actually Useful For

Before the technical breakdown — who should care about this?

  • Researchers who spend hours doing literature reviews manually. This pipeline reads 20 papers and surfaces the relevant findings in seconds, with source citations you can verify.
  • Developers building internal knowledge tools who want the answers grounded in real documents, not hallucinated from model weights.
  • Any team sitting on a corpus of documents — reports, papers, policies, case studies — that people reference but never fully read. Make it queryable.
  • Anyone who got burned by RAG hallucinations and wants a system where you can actually trace every answer back to its source.

//How It Works

The idea is called RAG — Retrieval-Augmented Generation. Instead of asking a model to answer from memory, you first retrieve the relevant text from real documents, then ask the model to reason only from what was retrieved.

My previous article built a basic version of this. It worked. Then someone in the comments asked why I didn't use a reranker. Fair point. This is the upgraded version.

https://preview.redd.it/u10elfj2qm2h1.png?width=796&format=png&auto=webp&s=4a629f9441cf983f941587cbf58603535d4d5769

The reranker is the piece that makes this meaningfully different from basic RAG. Here's why it matters.

//The Reranker — Why It's the Real Upgrade

In my last build, I used FAISS and got back the top-k most similar chunks. Similar by vector distance. Fast, reasonable, but blunt.

The problem: vector similarity finds things that look related. The reranker asks a different question: is this actually useful for answering this specific query?

It's a CrossEncoder model (BGE Reranker Base) that takes each retrieved chunk and scores it against the full query text directly. No vector shortcuts. It reads both and decides.

Look at the scores from Query 1:

  1. Vendi-RAG → 0.9506 ← extremely confident
  2. InfoGain-RAG → 0.7686 ← confident
  3. Blended RAG → 0.4694 ← moderate
  4. RankArena → 0.4469 ← moderate
  5. Blended RAG → 0.0944 ← low confidence

https://preview.redd.it/pxoe9gbbqm2h1.png?width=676&format=png&auto=webp&s=b36af219919a5538524c054ec71cd19c98ce7a04

That last result — score 0.09 — would have been ranked much higher by pure vector similarity. The reranker correctly identified it as a weak match and pushed it to the bottom. That's the signal vector search alone can't give you.

This directly addressed the criticism from my last article's comments. And the numbers back it up — retrieval plus reranking takes 0.03 seconds. Quality improvement costs almost nothing.

//The Stack

Everything local except one:

| Component | Tool | Where It Runs |
|---|---|---|
| LLM inference | Qwen3 32B (Q4_K_M) | Local (RTX PRO 6000) |
| Embeddings | BGE-base-en-v1.5 | Local (RTX PRO 6000) |
| Vector store | ChromaDB | Local (RTX PRO 6000) |
| Reranker | BGE Reranker Base | Local (RTX PRO 6000) |
| Final synthesis | Claude (via AutoDL) | One API call |
| Paper source | arXiv API | Fetch only |

  • Cloud GPU: NVIDIA RTX PRO 6000
  • Papers indexed: 20 real arXiv papers on RAG and retrieval
  • Chunks in ChromaDB: 74
  • Cost per query: ~$0.05 (the Claude synthesis call)

//Building It — The Key Steps

Step 1 — Fetch Real Papers

import arxiv

client = arxiv.Client()
search = arxiv.Search(
    query="RAG retrieval augmented generation quality reranking",
    max_results=20,
    sort_by=arxiv.SortCriterion.Relevance
)

papers = []
for result in client.results(search):
    papers.append({
        "title": result.title,
        "abstract": result.summary,
        "url": result.entry_id
    })

20 papers. Real titles, real abstracts, real findings. Not synthetic data.

Step 2 — BGE Embed and Index into ChromaDB

from sentence_transformers import SentenceTransformer
import chromadb

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")

chroma_client = chromadb.PersistentClient(path="./data/chroma_db")
collection = chroma_client.get_or_create_collection(
    name="research_papers",
    metadata={"hnsw:space": "cosine"}
)

embeddings = embedder.encode(chunks, normalize_embeddings=True)
collection.add(ids=ids, embeddings=embeddings.tolist(),
               documents=chunks, metadatas=metas)

ChromaDB persists the index to disk. Index once, query forever.
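
The snippet above assumes `chunks`, `ids`, and `metas` already exist. The post does not show how they were built, so what follows is only a sketch of one way to derive them from the `papers` list in Step 1. The window sizes, the `paper_NN.txt` naming, the metadata keys, and the helper name `build_index_inputs` are all assumptions for illustration.

```python
def build_index_inputs(papers, chunk_words=60, overlap_words=10):
    """Split each abstract into overlapping word windows and build the
    parallel lists that collection.add() expects.

    chunk_words / overlap_words are illustrative values, not the post's.
    """
    if not 0 <= overlap_words < chunk_words:
        raise ValueError("overlap_words must be >= 0 and smaller than chunk_words")
    step = chunk_words - overlap_words
    chunks, ids, metas = [], [], []
    for paper_no, paper in enumerate(papers, start=1):
        words = paper["abstract"].split()
        source = f"paper_{paper_no:02d}.txt"  # assumed naming scheme
        for chunk_no, start in enumerate(range(0, len(words), step)):
            chunks.append(" ".join(words[start:start + chunk_words]))
            ids.append(f"paper_{paper_no:02d}_chunk_{chunk_no}")
            metas.append({"source": source, "title": paper["title"], "url": paper["url"]})
            if start + chunk_words >= len(words):
                break  # this window already reached the end of the abstract
    return chunks, ids, metas


# Tiny stand-in for the list produced in Step 1
demo_papers = [
    {"title": "Demo A", "abstract": " ".join(f"a{i}" for i in range(130)), "url": "https://example.org/a"},
    {"title": "Demo B", "abstract": "short abstract", "url": "https://example.org/b"},
]
chunks, ids, metas = build_index_inputs(demo_papers)
print(len(chunks), ids)
```

Keeping the source file name in each metadata entry is what lets the later stages print a citation next to every retrieved chunk.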

Step 3 — Retrieve Then Rerank

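from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

# Step 1: ChromaDB gets top-10 by vector similarity
query_embedding = embedder.encode([query], normalize_embeddings=True).tolist()
results = collection.query(query_embeddings=query_embedding, n_results=10)
candidates = results["documents"][0]   # chunk texts for the first (only) query
sources = results["metadatas"][0]      # matching metadata dicts

# Step 2: Reranker scores each against the actual query
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

# Step 3: Take top-5 by reranker score
# (sort on the score only, so tied scores never fall back to comparing dicts)
ranked = sorted(zip(scores, candidates, sources), key=lambda t: t[0], reverse=True)[:5]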

This two-stage approach is what separates production RAG from demo RAG.

Step 4 — Qwen3 32B Analyzes Locally

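import requests

response = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "qwen3:32b",
        "prompt": prompt,
        "stream": False,      # return one JSON object instead of a token stream
        "think": False,       # disable reasoning mode for speed
        "options": {
            "temperature": 0.1,
            "num_predict": 800
        }
    },
    timeout=120
)
response.raise_for_status()
analysis = response.json()["response"]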

think: False is important. Qwen3 has a built-in chain-of-thought reasoning mode that consumes tokens before generating the actual response. For structured analysis tasks, disabling it gives faster and cleaner output.
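
The `prompt` variable is not shown in the post. As a rough sketch, a grounded prompt could be assembled from the reranked results along these lines. The instruction wording and the function name `build_prompt` are assumptions, and `ranked` is assumed to hold `(score, chunk, metadata)` tuples whose metadata has a `source` key.

```python
def build_prompt(query, ranked):
    """Number each reranked chunk, tag it with its source file, and ask the
    model to answer only from that evidence."""
    evidence_lines = []
    for n, (score, chunk, meta) in enumerate(ranked, start=1):
        evidence_lines.append(f"[{n}] ({meta['source']}, rerank score {score:.4f})\n{chunk}")
    evidence = "\n\n".join(evidence_lines)
    return (
        "Answer the question using only the evidence below. "
        "Cite sources by their bracketed number. "
        "If the evidence is insufficient, say so.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {query}\n\nAnswer:"
    )


# Placeholder chunks; the scores and file names echo the Query 1 listing
demo_ranked = [
    (0.9506, "Placeholder chunk text A.", {"source": "paper_03.txt"}),
    (0.7686, "Placeholder chunk text B.", {"source": "paper_16.txt"}),
]
prompt = build_prompt("How does reranking improve RAG retrieval quality?", demo_ranked)
print(prompt)
```

Numbering the evidence gives the model something concrete to cite, which makes it easier to trace each claim in the answer back to a file.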

Step 5 — Claude Polishes the Final Report

response = requests.post(
    "https://www.autodl.art/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-7",
        "messages": [{"role": "user", "content": synthesis_prompt}],
        "max_tokens": 1000
    }
)

One call. Takes Qwen3's structured analysis and turns it into a readable research narrative. This is the only moment data leaves the GPU.

//Performance

| Metric | Result |
|---|---|
| Papers indexed | 20 real arXiv papers |
| Total chunks in ChromaDB | 74 |
| Indexing time | ~45 seconds |
| Retrieval + rerank | 0.03 seconds |
| Qwen3 analysis | 7–13 seconds |
| Average total pipeline | ~10 seconds |
| External API calls | 1 (Claude synthesis only) |

https://preview.redd.it/nk08kotdrm2h1.png?width=625&format=png&auto=webp&s=9ce577c17fe6aef155ba3a7cb67327b43121ef89

The retrieval and reranking are essentially instant. The bottleneck is Qwen3's analysis step, and 10 seconds for a cited, multi-paper research analysis is a trade I'll take every time.
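
For anyone reproducing these numbers, per-stage timings like the ones reported here can be collected with a small context manager. This is a generic sketch, not the post's measurement code: the stage names and the `timed` helper are assumptions, and the `time.sleep` calls merely stand in for real work.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Record wall-clock seconds spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


with timed("retrieve_and_rerank"):
    time.sleep(0.01)  # placeholder for the vector query plus reranker scoring

with timed("llm_analysis"):
    time.sleep(0.02)  # placeholder for the local LLM request

total = sum(timings.values())
for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.2f}s ({seconds / total:.0%} of total)")
```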

https://preview.redd.it/oedesvnjrm2h1.png?width=1280&format=png&auto=webp&s=3e9f7be7c3d4a46d52169d74a460d9a335b92e04

https://preview.redd.it/0wftpufmrm2h1.png?width=1280&format=png&auto=webp&s=5a1d71921ca91bd4f2d058d708f5b0e45e8ec900

https://preview.redd.it/efjcnufmrm2h1.png?width=1280&format=png&auto=webp&s=b6201ab01f0bbf242b7d75c46fce1fcbd79e0ddd

https://preview.redd.it/3qj49vfmrm2h1.png?width=1280&format=png&auto=webp&s=ede235fb9802c3e3bdd7bf4b58746e9400b552ff

https://preview.redd.it/sau82wfmrm2h1.png?width=1280&format=png&auto=webp&s=8713358dc420cc107d9c26d6dcb13f6800ceb68d

https://preview.redd.it/pqfg7vfmrm2h1.png?width=1280&format=png&auto=webp&s=2997b3fbc6da78490f202cfacac9378c2e8023bc

https://preview.redd.it/xicqjvfmrm2h1.png?width=1280&format=png&auto=webp&s=1d06599109d038c656bcf43f7caa9f7d77eb4a11

https://preview.redd.it/k2525wfmrm2h1.png?width=1280&format=png&auto=webp&s=41d730448804083a2deb8a3f91696312cae8d2ec

//What Lucas Built vs What I Built

I want to be clear about this because honesty matters more than positioning.

Lucas's article is better in two specific ways. His structured Pydantic outputs, which treat research primitives such as hypotheses and evidence as validated data objects, are a cleaner engineering pattern than what I built. And his agentic loop using GPT Researcher, where the system iteratively generates, critiques, and refines its own report, is genuinely more sophisticated than my single-pass pipeline.

What I built is better in different ways. Every component except the final synthesis runs on local hardware — no data leaves the GPU, no per-token cost on the heavy lifting. The reranker adds a quality layer that wasn't in his stack. And the system is actually deployed and measurable — not illustrative code snippets but a working pipeline with real latency numbers.

Neither is the complete answer. Both are pointing at the same thing from different angles.

https://preview.redd.it/ynrrf40qrm2h1.jpg?width=693&format=pjpg&auto=webp&s=8d6693ee746f0b9ae468ef057aef0bf7230a1c7e

//Where This Goes Next

The research workflow pipeline is a template, not a one-off. The same four steps — fetch, embed, rerank, synthesize — apply anywhere you have a document corpus and questions to ask of it.

  • Academic research teams could use this to run literature reviews across hundreds of papers in minutes rather than weeks. Ask "what does 2024 literature say about attention mechanisms in vision transformers?" and get a cited synthesis, not a hallucination.
  • Legal and compliance teams could index case law, contracts, and regulatory documents, then query across thousands of pages and get answers that trace back to sources they can check.
  • Product teams could build on their own support ticket history, user research, and internal wikis. Every answered ticket becomes retrievable context for the next one.
  • Journalists and analysts could synthesize large document dumps quickly: FOIA releases, earnings transcripts, policy documents. The reranker helps surface the most relevant excerpts, not just the most similar ones.

The pattern scales. What changes is the folder of documents you point it at and the questions you care about answering.

//What I'd Do Differently Next Time

Structured outputs from Qwen3. Right now, the analysis comes back as formatted text. Lucas's Pydantic approach — returning validated objects with typed fields for positions, evidence, and confidence scores — would make the outputs more reliable and composable. That's on my list.
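
To make the idea concrete, here is a minimal sketch of validating a model's JSON output into typed objects. It uses the standard library's `dataclasses` as a stand-in for Pydantic so that it runs without extra dependencies. The field names (`claim`, `evidence`, `source`, `confidence`) and the JSON shape are assumptions, not Lucas's schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    claim: str
    evidence: str
    source: str
    confidence: float  # expected to lie in [0, 1]

    def __post_init__(self):
        for name in ("claim", "evidence", "source"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")
        if isinstance(self.confidence, bool) or not isinstance(self.confidence, (int, float)):
            raise ValueError("confidence must be a number")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def parse_findings(raw_json):
    """Turn the model's JSON text into validated Finding objects.
    Raises ValueError on malformed or incomplete output."""
    try:
        payload = json.loads(raw_json)
        return [Finding(**item) for item in payload["findings"]]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"unusable model output: {exc}") from exc


raw = (
    '{"findings": [{"claim": "Reranking filters irrelevant documents", '
    '"evidence": "placeholder excerpt", "source": "paper_16.txt", "confidence": 0.8}]}'
)
findings = parse_findings(raw)
print(findings[0].source, findings[0].confidence)
```

The benefit is that a missing field or an out-of-range score fails loudly at parse time instead of leaking into the final report.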

The agentic loop. A single-pass pipeline answers questions. An iterative one refines them — generate a draft report, critique it against the sources, identify gaps, retrieve more evidence, and revise. That's where this gets genuinely powerful.

Bigger corpus. Twenty papers is a proof of concept. Five hundred papers is where retrieval quality really gets tested, and where the reranker earns its keep most visibly.

//Closing Thought

Lucas asked how LLMs can augment researchers without replacing their thinking. I think the answer lives somewhere in the pipeline I built today: fast enough to be useful, grounded enough to be trusted, local enough to be private.

The reranker wasn't in my last build. It's 0.03 seconds and meaningfully better results. Sometimes the upgrade is smaller than you expect.

Next time someone asks why I didn't use something, I'll try to just build it.

u/Financial_Ad8530 — 1 day ago

I Built a Research Pipeline for Reading Papers

The Problem

There’s a very specific kind of frustration that anyone doing research eventually runs into.

You have a question. Not a vague one, a precise one. Something like:

“How do rerankers actually improve RAG quality?”

“What are the real failure modes of retrieval systems?”

“Do hybrid retrieval methods consistently outperform vector search?”

You know the answer exists somewhere across a pile of papers. Probably 10 of them. Maybe 20.

So you start reading.

One abstract turns into three. One citation trail becomes another. Forty minutes later you have fragments of answers scattered across tabs, but still nothing grounded enough that you’d confidently hand to another engineer or researcher.

After doing this enough times, I got tired of it and built a small local research pipeline to handle the first pass for me.

The original goal wasn’t to build an “AI research agent” or some autonomous system. I just wanted something that could read a set of papers, pull the useful evidence, and answer questions with citations instead of hallucinations.

What surprised me was not the LLM. It was the reranker.

The Pipeline

The pipeline itself is pretty straightforward.

https://preview.redd.it/e5oi584dcm2h1.png?width=1478&format=png&auto=webp&s=f83410425dd79d52943e0b0ef5809965d10f2bd9

I pull papers from arXiv, chunk them, embed them with BGE, store everything in ChromaDB, retrieve candidate chunks with vector search, rerank them with BGE Reranker, then pass the highest-quality evidence into Qwen3 32B running locally.

https://preview.redd.it/ocj5carncm2h1.png?width=1280&format=png&auto=webp&s=603fa9b8505625a023f39323bc9620733efa4e35

At the very end, I make a single Claude call just to turn the raw analysis into something readable.

https://preview.redd.it/j2vucc2rcm2h1.png?width=1280&format=png&auto=webp&s=d6b9b48dd4edb319805616c0f084b27ad33cf743

Most of the stack is local. The only external step is the final synthesis.

The hardware was just a single RTX PRO 6000.

https://preview.redd.it/ry1cwhyvcm2h1.png?width=493&format=png&auto=webp&s=a6d9d2c2138ead59b7b3f82e7bf1f158883f931f

Nothing distributed.
No orchestration layer.
No fancy agent framework.

Just a retrieval pipeline focused on one thing:

finding useful evidence fast.

The Part That Actually Mattered

I expected the model to matter most. It didn’t. (T_T)

The reranker improved answer quality far more than switching generators.

Before adding reranking, the retrieval stage behaved the way most simple RAG systems behave: semantically related chunks would show up, but not necessarily useful ones.

The vector store was good at finding documents that looked related to the query, but not documents that actually answered it.

One query I tested was:

“How does reranking improve RAG retrieval quality?”

Before reranking, the pipeline returned a mix of useful evidence and loosely connected chunks. The model then had to reason across noisy or weak context, which degraded output quality surprisingly fast.

After adding BGE Reranker, the strongest evidence consistently floated to the top while weaker matches collapsed in score.

The surprising part was the cost.

The reranking step only added around 0.03 seconds of latency, but improved retrieval quality more than changing the LLM itself.

It honestly changed how I think about RAG systems.

A lot of discussion online focuses on bigger models and longer context windows, but most of the ugly failures I’ve seen in production RAG systems actually begin much earlier.

Bad retrieval poisons everything downstream.

What The Papers Kept Saying

One of the more interesting outputs came from asking the system about RAG failure modes themselves.

Across multiple papers, three themes kept appearing.

  1. Chunks are usually retrieved in isolation, without understanding their relationships to each other.
  2. Retrieval quality is often non-monotonic. More context does not always improve answers. In many cases, adding additional chunks eventually degrades output quality because redundancy and conflicting information accumulate faster than useful signal.
  3. Most systems still use essentially the same retrieval strategy for every query type, even though technical questions and conceptual questions benefit from very different evidence structures.

That second point especially stuck with me. There’s still a common assumption that: more retrieved context = better answers

But a surprising number of papers suggest the opposite once retrieval noise crosses a certain threshold.

The Most Interesting Realization

Another thing that became obvious while building this: retrieval is fast now. Extremely fast. The retrieval plus reranking stage was almost instantaneous. The bottleneck was reasoning over evidence.

https://preview.redd.it/anekv0ywem2h1.png?width=625&format=png&auto=webp&s=6994f29b2bd51a3352a2563378829b8a6b12630c

And honestly, that feels correct.

I’d much rather spend 10 seconds analyzing grounded evidence than get a confident hallucination in 1 second.

https://preview.redd.it/svniv2g1fm2h1.png?width=1280&format=png&auto=webp&s=a33e5347192576db50b6abd551ae22fea2fae328

https://preview.redd.it/booc0pl3fm2h1.png?width=1280&format=png&auto=webp&s=33856fd49a305e6b3ce4370ca092c06e0121113d

https://preview.redd.it/1j50cnm4fm2h1.png?width=1280&format=png&auto=webp&s=87e409c49a18a7f5b25bdb1abc9179261862bcec

https://preview.redd.it/kn313lh5fm2h1.png?width=1280&format=png&auto=webp&s=215a9346bef66a5e592e634a0b690cd078c0e2b5

https://preview.redd.it/b64b5ebafm2h1.png?width=1280&format=png&auto=webp&s=fed6657bf53fe2cfe1c5203c7fa3b9fdd7c8fd74

https://preview.redd.it/s7nnxrhbfm2h1.png?width=1280&format=png&auto=webp&s=3c95bd3573dfd33bc64bd8a99a45aab652bb54e3

https://preview.redd.it/74s53i6dfm2h1.png?width=1280&format=png&auto=webp&s=d47682488f1ae45f9f219aa24598d5d5791f2bf2

Thanks, Claude! That trade feels worth it every time.

Where This Starts Becoming Useful

Right now the system is still small.

I only indexed around 20 papers and mostly worked with abstracts instead of full PDFs because I wanted to test architecture before scaling corpus size.

But even at that scale, the pattern already feels useful.

The same structure could work almost anywhere there are large document collections:

  • research papers
  • legal archives
  • internal company docs
  • support tickets
  • policy documents
  • analyst reports

The architecture barely changes. Only the documents do.

What I’d Build Next

The next version probably needs:

  • full PDF ingestion
  • section-aware chunking
  • iterative retrieval
  • self-critique loops
  • evidence confidence scoring
  • retrieval refinement passes

Right now the system answers questions in one pass.

The more interesting version would generate a draft, critique it against the sources, identify weak evidence, retrieve again, and revise itself.

That’s where this starts becoming less like “RAG demo code” and more like actual research infrastructure.
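
The loop described above could be sketched roughly as follows. This is an outline under assumptions, not an implementation from the post: `retrieve`, `draft`, and `critique` are hypothetical callables that would be backed by the retriever/reranker and the LLM, and `max_rounds` is an arbitrary cap.

```python
def refine_report(query, retrieve, draft, critique, max_rounds=3):
    """Draft, critique against the evidence, retrieve for each gap, redraft."""
    evidence = list(retrieve(query))
    report = draft(query, evidence)
    for _ in range(max_rounds):
        gaps = critique(report, evidence)  # follow-up questions; empty means done
        if not gaps:
            break
        added = False
        for gap in gaps:
            for chunk in retrieve(gap):
                if chunk not in evidence:
                    evidence.append(chunk)
                    added = True
        if not added:
            break  # the corpus has nothing new to offer for these gaps
        report = draft(query, evidence)
    return report, evidence


# Toy stand-ins so the control flow can be exercised without a model
corpus = {
    "reranking": ["Rerankers rescore retrieved chunks against the query."],
    "latency": ["Reranking a handful of chunks adds little latency."],
}


def toy_retrieve(question):
    q = question.lower()
    return [c for topic, chunks in corpus.items() if topic in q for c in chunks]


def toy_draft(question, evidence):
    return " ".join(evidence)


def toy_critique(report, evidence):
    return [] if "latency" in report else ["What is the latency cost?"]


report, evidence = refine_report(
    "How does reranking help?", toy_retrieve, toy_draft, toy_critique
)
print(len(evidence))
```

The two exit conditions matter in practice: one stops when the critique finds no gaps, the other stops when further retrieval adds nothing, so the loop cannot spin on a question the corpus cannot answer.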

The biggest lesson from this whole experiment was surprisingly simple. Most RAG tutorials skip reranking because it adds complexity. Most production RAG failures come from retrieval quality. Those two facts are probably related.

u/Financial_Ad8530 — 1 day ago

There's a specific kind of frustration that every developer knows. You're in the middle of something, you hit a wall, and you open the PyTorch docs. Twenty minutes later, you've read three pages, followed two rabbit holes, and you still haven't found the one line you needed.

I got tired of that. So I built something to fix it.

In four hours on a single GPU instance, I put together a system that lets you ask plain English questions and get answers pulled directly from real documentation — cited, grounded, no hallucination. Ask it "how do I move a model to GPU?" and it tells you .to(device), points you to exactly which file that came from, and moves on.

Here's how it went.

Result First

Before anything else, this is what it actually looks like in practice.

Q: How do I move a PyTorch model to a GPU?

https://preview.redd.it/vyl685rl8wzg1.png?width=1280&format=png&auto=webp&s=4e907f63366d625f716b68575fd5a42e269e04cb

Q: How do I use a tokenizer with Hugging Face Transformers?

https://preview.redd.it/qmf7j0sq9wzg1.png?width=1280&format=png&auto=webp&s=b733b846938047989e0f254b9fb1e411330c0cfa

Q: How do I use DataLoader in PyTorch?

https://preview.redd.it/cpny6k8r9wzg1.png?width=1280&format=png&auto=webp&s=da5f2bb7c9ed9e53d3e1ff4088f5c25156e6d8be

I also built a second version of this — the same architecture, but pointed at internal office documents instead of PyTorch. HR policies, IT procedures, and finance reimbursement guides.

An employee asks, "How do I request annual leave?" and gets a cited answer in under 2 seconds. Same idea, completely different world.

Q: How do I request annual leave?

https://preview.redd.it/5m6dedju9wzg1.png?width=1280&format=png&auto=webp&s=96b22caf0fb282af286421e08d594a3e7c7c9038

Q: How do I submit a travel reimbursement?

https://preview.redd.it/tivazy7v9wzg1.jpg?width=1280&format=pjpg&auto=webp&s=31d162aef9b12430ccd1c96c943021f6ac5f2a1c

Q: Who should I contact for IT support?

https://preview.redd.it/t0supqhx9wzg1.png?width=1280&format=png&auto=webp&s=86ba7fff79775a14e44793ea4a0da1936c49765a

Both versions. One afternoon. One GPU. This becomes genuinely useful anywhere people are tired of manually searching through documentation: developers jumping between hundreds of pages to find a single method, teams building internal assistants that understand their own codebase or company policies, or new hires trying to onboard into an unfamiliar framework, tool, or organization without constantly asking someone else for help.

The Concept

This pattern is called RAG — Retrieval-Augmented Generation. The name is a mouthful, but the idea is simple: instead of asking a language model to answer from memory (where it might hallucinate), you first retrieve the relevant text from a real source, then ask the model to generate an answer based only on what was retrieved.

It's the difference between asking someone to guess an answer and handing them the right page of a textbook first.

Here's the full flow:

https://preview.redd.it/nqrma0jy9wzg1.png?width=628&format=png&auto=webp&s=121bb05e3a3040e9c8161d1339b968acb10c81f7

The key insight: the LLM never has to know PyTorch from memory. It only has to read what you hand it. That's what keeps the answers grounded and the sources honest.

Step by Step

1. Setup

Everything ran on a single instance from a cloud GPU platform. One GPU. No cluster. No expensive infrastructure. That matters: it means this is something you can actually replicate.

https://preview.redd.it/knx64osz9wzg1.png?width=1057&format=png&auto=webp&s=78cb2d609f26bdf929580804fa6d4d5516c5d662

  • GPU: NVIDIA RTX 5090 (32GB VRAM)
  • CUDA: 13.0
  • Framework: PyTorch
  • Cost: $0.38 / hour
  • Region: Singapore-A

2. Data

Developer Assistant: I pulled the actual source repositories — not a curated sample, the real thing:

git clone --depth 1 https://github.com/huggingface/transformers.git
git clone --depth 1 https://github.com/pytorch/pytorch.git

  • 884 corpus files across both repos
  • ~6.2 million characters of raw text
  • 9,192 chunks after splitting

That's a realistic knowledge base. Not a demo dataset with 20 files. The kind of scale where retrieval actually has to work.

https://preview.redd.it/6uwuycx0awzg1.png?width=499&format=png&auto=webp&s=97a84989b930eab5aa9f7359a44bfba459c4afa0

Office Knowledge Assistant: For the internal office version, I generated a structured synthetic dataset simulating a real company's internal documentation:

  • 300 documents across HR, IT, Finance, Operations, and Admin
  • Topics: leave policy, remote work, reimbursement, VPN access, onboarding, and more
  • ~600 chunks after splitting

Smaller scale, but deliberately structured to mirror the messy reality of how internal company knowledge actually lives — spread across departments, sometimes overlapping, never perfectly organized.

https://preview.redd.it/f6ivxnx1awzg1.png?width=547&format=png&auto=webp&s=82286915d69c1d43b9415a826064c142ec1ff161

Collect and Prepare the Documents

For the developer assistant, this meant cloning the repos. For the office assistant, generating the document set. Either way, the output is the same: a folder of raw text files representing your knowledge domain.

The preprocessing step strips noise, normalizes whitespace, and converts everything to clean UTF-8 text. Nothing fancy — just making sure the data is consistent before it gets split up.

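# prepare_corpus.py (simplified version)
import os
import re

def clean_text(text):
    # Minimal stand-in; the post does not show the original helper.
    # Collapses runs of spaces/tabs and squeezes repeated blank lines.
    text = re.sub(r'[^\S\n]+', ' ', text)
    text = re.sub(r'\n\s*\n+', '\n\n', text)
    return text.strip()

def prepare_corpus(source_dir, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    files_processed = 0
    total_chars = 0

    for root, _, files in os.walk(source_dir):
        for fname in files:
            if fname.endswith(('.rst', '.md', '.txt')):
                src_path = os.path.join(root, fname)
                with open(src_path, 'r', encoding='utf-8', errors='ignore') as f:
                    text = f.read()

                # Clean and normalize
                text = clean_text(text)

                # Write to output; flatten the relative path so files with
                # the same name in different folders don't overwrite each other
                rel_name = os.path.relpath(src_path, source_dir).replace(os.sep, '__')
                out_path = os.path.join(output_dir, rel_name)
                with open(out_path, 'w', encoding='utf-8') as f:
                    f.write(text)

                files_processed += 1
                total_chars += len(text)

    print(f"Prepared corpus files: {files_processed}")
    print(f"Total characters: {total_chars:,}")
    print(f"Output directory: {output_dir}")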

Output when you run it:

Prepared corpus files: 884 
Total characters: 6,264,627
Output directory: /root/dev_doc_rag/corpus

3. Chunking

You can't embed an entire 50-page document as a single vector — the signal gets lost. You split it into chunks, small enough to be semantically focused, large enough to contain a complete thought.

The important detail: overlapping chunks. If you split cleanly every 512 words (the function here counts whitespace-separated words as a rough stand-in for tokens), you'll sometimes cut a sentence right in the middle of the answer. Overlap means each chunk shares some content with its neighbors, so nothing falls through the cracks.

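def chunk_text(text, chunk_size=512, overlap=50):
    # Sizes are counted in whitespace-separated words, not model tokens
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    chunks = []

    # Each window starts (chunk_size - overlap) words after the previous one
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)

    return chunks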

Result: 9,192 chunks from 884 files. Each chunk is one searchable unit.
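
To see what the overlap does, here is the same windowing logic run on a ten-word toy input with deliberately tiny sizes (`chunk_size=4`, `overlap=1`, values chosen only for readability). The function is restated compactly so the example runs on its own.

```python
def chunk_text(text, chunk_size=512, overlap=50):
    # Windows are measured in whitespace-separated words
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


toy = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(toy, chunk_size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
# w9
```

Each chunk begins with the last word of its predecessor, which is the overlap at work. The final `w9` chunk is already fully contained in the one before it. A variant could stop as soon as a window reaches the end of the text, at the cost of slightly different chunk counts.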

4. Embedding

Every chunk gets converted into a dense vector — a list of numbers that represents its semantic meaning. Chunks that mean similar things will have similar vectors, even if they use different words. That's what makes semantic search work.

FAISS (Facebook AI Similarity Search) stores all those vectors and makes it fast to find the closest matches to any new query.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed all chunks
embeddings = model.encode(chunks, show_progress_bar=True)
embeddings = np.array(embeddings).astype('float32')

# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print(f"Indexed {index.ntotal} chunks")

Indexing 9,192 chunks on the RTX 5090 took 13.43 seconds. Once the index is built, it lives in memory and queries hit it in milliseconds.
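
Conceptually, a flat L2 index does an exhaustive comparison: it measures the distance from the query vector to every stored vector and returns the closest ones. The pure-Python sketch below mimics that behaviour on three hand-written 2-D vectors. It is an illustration only (FAISS does this with optimized native code), and `flat_l2_search` is a hypothetical helper name.

```python
def flat_l2_search(query_vec, vectors, k):
    """Return (squared_distances, indices) of the k stored vectors closest
    to query_vec, found by comparing against every vector."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist_sq = sum((q - v) ** 2 for q, v in zip(query_vec, vec))
        scored.append((dist_sq, idx))
    scored.sort()
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


stored = [
    [1.0, 0.0],  # chunk 0
    [0.0, 1.0],  # chunk 1
    [0.9, 0.1],  # chunk 2, close to chunk 0
]
distances, indices = flat_l2_search([1.0, 0.0], stored, k=2)
print(indices)  # [0, 2]
```

Like `IndexFlatL2`, this reports squared distances, so smaller means more similar and 0 means an exact match.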

5. Query Processing

When a user asks a question, the same embedding model converts it to a vector, FAISS finds the top-k most similar chunks, and those chunks get handed to the LLM as context.

def answer_question(query, index, chunks, chunk_sources, model, llm, k=5):
    # chunk_sources[i] is the file that chunks[i] came from

    # Embed the query
    query_embedding = model.encode([query]).astype('float32')

    # Retrieve top-k chunks (FAISS pads with -1 when fewer than k exist)
    distances, indices = index.search(query_embedding, k)
    hits = [i for i in indices[0] if i != -1]
    relevant_chunks = [chunks[i] for i in hits]

    # Build prompt
    context = "\n\n".join(relevant_chunks)
    prompt = f"""Answer the following question based only on the provided context.

Context:
{context}

Question: {query}

Answer:"""

    # Generate answer (llm is any object exposing generate(prompt))
    response = llm.generate(prompt)
    sources = [chunk_sources[i] for i in hits]

    return response, sources

The LLM doesn't browse the internet. It doesn't guess. It reads what FAISS found and answers from that. That's the whole trick.

The Performance

Developer Documentation Assistant

| Metric | Result |
|---|---|
| Indexing time | 13.43 seconds |
| Query latency | 2.3 – 2.6 seconds |
| Files indexed | 884 |
| Total chunks | 9,192 |

Office Knowledge Assistant

| Metric | Result |
|---|---|
| Indexing time | 8.35 seconds |
| Query latency | 0.15 – 1.96 seconds |
| Files indexed | 300 |
| Total chunks | ~600 |

The office assistant is faster because it has a smaller index, with fewer vectors to search. The developer assistant handles a roughly 15x larger dataset and still responds in under 3 seconds. Both are interactive. Neither requires a cluster.

https://preview.redd.it/a87mfw33awzg1.png?width=565&format=png&auto=webp&s=a0919280ac3a94404028af8eb48ae4533f7a11c8

Honest Warnings

  • Smaller models drift. When I used a lighter LLM for generation, the answers occasionally padded themselves with unnecessary detail or made small inferential leaps that weren't in the source text. Bigger models stay closer to the retrieved content.
  • Similar documents confuse retrieval. If you have 10 files that all describe the same leave policy with slightly different wording, FAISS might return 5 of them as top-k for one query. The answer might be fine, but the sources feel redundant.
  • Synthetic data has limits. The office assistant ran on documents I generated to simulate company policies. Real internal documents are messier — inconsistent formats, missing context, ambiguous wording. The system would need more careful preprocessing in a real deployment.
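
One possible mitigation for the redundant-sources problem, which the post does not implement, is to drop retrieved chunks that are near-duplicates of a chunk already kept. The sketch below uses word-set Jaccard similarity with an arbitrary threshold of 0.8. Both the measure and the threshold are assumptions to tune, and the sample sentences are made up.

```python
def jaccard(a, b):
    """Word-set overlap between two strings, from 0.0 to 1.0."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def drop_near_duplicates(chunks, threshold=0.8):
    """Keep chunks in their ranked order, skipping any chunk that is too
    similar to one already kept."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk, earlier) < threshold for earlier in kept):
            kept.append(chunk)
    return kept


retrieved = [
    "Employees may request annual leave through the HR portal",
    "Employees may request annual leave through the HR portal today",
    "VPN access requires a ticket to the IT help desk",
]
unique = drop_near_duplicates(retrieved)
print(unique)
```

Because the input is processed in ranked order, the highest-ranked copy of each near-duplicate group is the one that survives.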

Both systems I built are deliberately simple. A developer doc assistant. An office knowledge base. But the same four steps — collect, chunk, embed, query — apply to a much wider surface area than that.

Think about what "documents your team is tired of searching" looks like in different contexts:

  • Legal teams have contracts, clauses, and precedents. Instead of a lawyer spending an hour locating a specific indemnification clause across 200 past contracts, a RAG system retrieves it in seconds.
  • Support teams have ticket histories, resolution logs, and product manuals. A RAG assistant indexed on past resolved tickets can suggest answers to new ones automatically and cut handling time.
  • Research teams have papers, notes, and literature reviews. Ask "what did the 2023 papers say about attention mechanisms in vision transformers?" and get a synthesized answer with citations, instead of manually rereading 40 PDFs.
  • Onboarding is a particularly compelling one. Every company has a mountain of documentation that new hires need to absorb in their first few weeks. Instead of burying them in Notion pages, give them a system they can just ask. The knowledge is already there — it just needs to be made queryable.

The architecture doesn't change. The embedding model doesn't change. FAISS doesn't change. What changes is the folder of documents you point it at.

That's the part I find genuinely interesting about this — it's a general-purpose tool dressed up as a specific solution. Once you understand the pipeline, you start seeing document retrieval problems everywhere.

Four hours. One GPU. Two working systems.

The developer documentation assistant handles 884 real files from PyTorch and Hugging Face, answers in under 3 seconds, and cites its sources. The office assistant handles 300 internal policy documents across 5 departments and responds in under 2 seconds.

Neither of these is rocket science. The pieces — FAISS, sentence transformers, a language model — are all open and well-documented. What this project is really about is putting them together in the right order and pointing them at a real problem.

If you're sitting on a pile of documentation that people in your team are tired of searching through manually, this is the pattern you want. The setup cost is one afternoon. The payoff is a system that keeps working after you've moved on to the next thing.

That's a trade I'll take every time.

u/Civil_Lingonberry678 — 11 days ago