
I Tried to Answer a Researcher's Question On a GPU
Lucas Soares asked the right question at ODSC London. I tried to answer it with Qwen3 32B, a reranker, and zero OpenAI calls.
//Background
Last year, a data scientist named Lucas Soares gave a workshop at the Open Data Science Conference in London. His central question was genuinely interesting:
>"How can we leverage LLMs to enhance research workflows without diminishing the cognitive engagement of researchers?"
He showed elegant ideas: structured outputs, hypothesis extraction, and evidence scoring. The whole thing ran on GPT-4o. Every single call: paid, cloud, OpenAI.
I read it and thought: what if you built the same thing, but nothing left your machine?
That question sat with me for a while. Then I had a cloud GPU, a free afternoon, and a specific complaint from the comments section of my last article: someone asked why I didn't use a reranker.
So I built it. Here's what happened.
//The Result First
Three research queries. One pipeline. Everything runs locally except the final polish, which is one Claude call at the very end.
Query 1: How does reranking improve RAG retrieval quality?
BGE Reranker top results:
- paper_03.txt — Vendi-RAG (score: 0.9506)
- paper_16.txt — InfoGain-RAG (score: 0.7686)
- paper_06.txt — Blended RAG (score: 0.4694)
- paper_14.txt — RankArena (score: 0.4469)
- paper_06.txt — Blended RAG (score: 0.0944)
>!Qwen3 32B Analysis:!<
Reranking improves RAG retrieval quality by filtering out irrelevant or redundant documents, enhancing relevance and diversity, and ensuring only the most informative documents are used for answer generation.
>!Key findings:!<
- Reranking filters irrelevant/misleading documents (InfoGain-RAG)
- Hybrid reranking significantly enhances accuracy at scale (Blended RAG)
- Information-gain-based reranking reduces noise and hallucination
- Iterative diversity-aware retrieval improves multi-source reasoning (Vendi-RAG)
>!Time: 8.68 seconds total | Retrieval + rerank: 0.03s!<
Query 2: What are the main failure modes of RAG systems?
BGE Reranker top results:
- paper_07.txt — CARROT (score: 0.1545)
- paper_01.txt — RAG Stack Review (score: 0.0096)
- paper_13.txt — Ragas (score: 0.0046)
>!Qwen3 32B Analysis:!<
Three fundamental failure modes identified:
- Chunks retrieved in isolation — ignoring relationships and redundancy
- Non-monotonic utility — more context can actively degrade output
- Query-insensitive retrieval — same strategy for every question type
>!Time: 7.49 seconds total | Retrieval + rerank: 0.03s!<
Real arXiv papers. Real findings. Cited sources. Under 15 seconds per query.
//What This Is Actually Useful For
Before the technical breakdown — who should care about this?
- Researchers who spend hours doing literature reviews manually. This pipeline reads 20 papers and surfaces the relevant findings in seconds, with source citations you can verify.
- Developers building internal knowledge tools who want the answers grounded in real documents, not hallucinated from model weights.
- Any team sitting on a corpus of documents — reports, papers, policies, case studies — that people reference but never fully read. Make it queryable.
- Anyone who got burned by RAG hallucinations and wants a system where you can actually trace every answer back to its source.
//How It Works
The idea is called RAG — Retrieval-Augmented Generation. Instead of asking a model to answer from memory, you first retrieve the relevant text from real documents, then ask the model to reason only from what was retrieved.
My previous article built a basic version of this. It worked. Then someone in the comments asked why I didn't use a reranker. Fair point. This is the upgraded version.
The reranker is the piece that makes this meaningfully different from basic RAG. Here's why it matters.
//The Reranker — Why It's the Real Upgrade
In my last build, I used FAISS and got back the top-k most similar chunks. Similar by vector distance. Fast, reasonable, but blunt.
The problem: vector similarity finds things that look related. The reranker asks a different question: is this actually useful for answering this specific query?
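It's a cross-encoder model (BGE Reranker Base) that takes each retrieved chunk and scores it against the full query text directly. No vector shortcuts. The query and the chunk go through the model together, so it can weigh how the words in one relate to the words in the other instead of comparing two vectors that were computed separately.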
Look at the scores from Query 1:
- Vendi-RAG → 0.9506 ← extremely confident
- InfoGain-RAG → 0.7686 ← confident
- Blended RAG → 0.4694 ← moderate
- RankArena → 0.4469 ← moderate
- Blended RAG → 0.0944 ← low confidence
That last result — score 0.09 — would have been ranked much higher by pure vector similarity. The reranker correctly identified it as a weak match and pushed it to the bottom. That's the signal vector search alone can't give you.
This directly addressed the criticism from my last article's comments. And the numbers back it up — retrieval plus reranking takes 0.03 seconds. Quality improvement costs almost nothing.
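To make the two-stage idea concrete without downloading a model, here is a minimal sketch in which the scoring function is just a parameter. The names `rerank` and `overlap_score` are illustrative assumptions, and `overlap_score` is a toy stand-in that counts shared words; in the real pipeline that slot is filled by the BGE reranker.

```python
from typing import Callable, List, Tuple


def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 5) -> List[Tuple[float, str]]:
    """Score every (query, chunk) pair and return the best top_k, highest first."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in scorer: fraction of query words that also appear in the chunk."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


# Candidates arrive in vector-search order; the scorer decides the final order.
candidates = [
    "dense retrieval uses embeddings",
    "reranking improves retrieval quality in rag",
    "a survey of language models",
]
top = rerank("how does reranking improve retrieval quality",
             candidates, overlap_score, top_k=2)
```

The point of the shape is that the first stage only has to be good enough to get the right chunk into the candidate list. The second stage is free to reorder it.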
//The Stack
Everything runs locally except the final synthesis call (the arXiv API is used only to fetch the papers):
| Component | Tool | Where It Runs |
|---|---|---|
| LLM Inference | Qwen3 32B (Q4_K_M) | Local — RTX PRO 6000 |
| Embeddings | BGE-base-en-v1.5 | Local — RTX PRO 6000 |
| Vector Store | ChromaDB | Local — RTX PRO 6000 |
| Reranker | BGE Reranker Base | Local — RTX PRO 6000 |
| Final Synthesis | Claude (via AutoDL) | One API call |
| Paper Source | arXiv API | Fetch only |
- Cloud GPU: NVIDIA RTX PRO 6000
- Papers indexed: 20 real arXiv papers on RAG and retrieval
- Chunks in ChromaDB: 74
- Cost per query: ~$0.05 (the Claude synthesis call)
//Building It — The Key Steps
Step 1 — Fetch Real Papers
import arxiv

client = arxiv.Client()
search = arxiv.Search(
    query="RAG retrieval augmented generation quality reranking",
    max_results=20,
    sort_by=arxiv.SortCriterion.Relevance,
)

# Keep only the fields the pipeline needs from each result
papers = []
for result in client.results(search):
    papers.append({
        "title": result.title,
        "abstract": result.summary,
        "url": result.entry_id,
    })
20 papers. Real titles, real abstracts, real findings. Not synthetic data.
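The results earlier refer to files such as paper_03.txt. A minimal sketch of how the fetched records could be written to disk in that naming scheme follows. The `save_papers` helper and the file layout (title, URL, blank line, abstract) are assumptions for illustration, not necessarily the exact code from this build.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def save_papers(papers: List[Dict[str, str]], out_dir: str) -> List[Path]:
    """Write each paper to paper_NN.txt (title, URL, abstract) and return the paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, paper in enumerate(papers, start=1):
        path = target / f"paper_{index:02d}.txt"
        text = f"{paper['title']}\n{paper['url']}\n\n{paper['abstract']}\n"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


# Example with a throwaway directory and one made-up record
with tempfile.TemporaryDirectory() as tmp:
    written = save_papers(
        [{"title": "Example title", "url": "http://example.org/1", "abstract": "Example abstract."}],
        tmp,
    )
    first_name = written[0].name
    first_text = written[0].read_text(encoding="utf-8")
```

Keeping one file per paper makes it easy to open the cited source and check a claim by hand.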
Step 2 — BGE Embed and Index into ChromaDB
from sentence_transformers import SentenceTransformer
import chromadb

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
chroma_client = chromadb.PersistentClient(path="./data/chroma_db")
collection = chroma_client.get_or_create_collection(
    name="research_papers",
    metadata={"hnsw:space": "cosine"},  # cosine distance for the HNSW index
)

# chunks, ids, and metas are parallel lists built from the papers beforehand
embeddings = embedder.encode(chunks, normalize_embeddings=True)
collection.add(
    ids=ids,
    embeddings=embeddings.tolist(),
    documents=chunks,
    metadatas=metas,
)
ChromaDB persists the index to disk. Index once, query forever.
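The indexing snippet above uses `chunks`, `ids`, and `metas` without showing where they come from. Here is a minimal sketch of one way to build them, using overlapping word windows over each abstract. The function names, the window sizes, and the metadata keys are all assumptions for illustration, not the settings that produced the 74 chunks.

```python
from typing import Dict, List, Tuple


def chunk_text(text: str, size: int = 120, overlap: int = 20) -> List[str]:
    """Split text into windows of `size` words that share `overlap` words with the previous window."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def build_index_inputs(
    papers: List[Dict[str, str]],
) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Produce parallel lists of chunk texts, unique ids, and metadata dicts."""
    chunks, ids, metas = [], [], []
    for paper_no, paper in enumerate(papers, start=1):
        for chunk_no, chunk in enumerate(chunk_text(paper["abstract"])):
            chunks.append(chunk)
            ids.append(f"paper_{paper_no:02d}_chunk_{chunk_no}")
            metas.append({"source": f"paper_{paper_no:02d}.txt", "title": paper["title"]})
    return chunks, ids, metas
```

The overlap is there so that a sentence cut by a window boundary still appears whole in one of the two neighbouring chunks.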
Step 3 — Retrieve Then Rerank
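from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

# Stage 1: ChromaDB returns the top 10 chunks by vector similarity
query_embedding = embedder.encode([query], normalize_embeddings=True).tolist()
results = collection.query(query_embeddings=query_embedding, n_results=10)
candidates = results["documents"][0]
sources = results["metadatas"][0]  # per-chunk metadata stored at index time

# Stage 2: the reranker scores each chunk against the actual query text
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

# Stage 3: keep the top 5 by reranker score
ranked = sorted(zip(scores, candidates, sources), key=lambda t: t[0], reverse=True)[:5]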
This two-stage approach, a cheap broad retrieval followed by a careful rescoring of a short list, is what separates production RAG from demo RAG.
Step 4 — Qwen3 32B Analyzes Locally
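import requests

response = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "qwen3:32b",
        "prompt": prompt,
        "think": False,   # disable reasoning mode for speed
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "temperature": 0.1,
            "num_predict": 800,
        },
    },
)
analysis = response.json()["response"]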
Setting "think": False matters. Qwen3 has a built-in chain-of-thought reasoning mode that spends tokens on reasoning before it generates the actual response. For structured analysis tasks, disabling it gives faster and cleaner output.
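The request above sends a `prompt` whose construction isn't shown. A minimal sketch of a grounded prompt builder follows, assuming each reranked hit has been reduced to a (score, chunk text, source label) triple. The `build_prompt` name and the instruction wording are illustrative assumptions, not the exact prompt from this build.

```python
from typing import List, Tuple


def build_prompt(query: str, ranked: List[Tuple[float, str, str]]) -> str:
    """Assemble a grounded prompt from (score, chunk_text, source_label) triples."""
    context_blocks = []
    for number, (_score, chunk, source) in enumerate(ranked, start=1):
        # Number each excerpt so the model can cite it and a reader can trace it.
        context_blocks.append(f"[{number}] Source: {source}\n{chunk}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the numbered excerpts below. "
        "Cite excerpts by number, and say so if they do not contain the answer.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


example_prompt = build_prompt(
    "How does reranking improve RAG retrieval quality?",
    [(0.9506, "First excerpt text.", "paper_03.txt"),
     (0.7686, "Second excerpt text.", "paper_16.txt")],
)
```

Numbering the excerpts and naming the source file in the prompt is what lets the final answer carry citations you can check.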
Step 5 — Claude Polishes the Final Report
import requests

# API_KEY and synthesis_prompt must be defined before this call
response = requests.post(
    "https://www.autodl.art/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-7",
        "messages": [{"role": "user", "content": synthesis_prompt}],
        "max_tokens": 1000,
    },
)
One call. It takes Qwen3's structured analysis and turns it into a readable research narrative. This is the only point at which your data leaves the machine.
//Performance
| Metric | Result |
|---|---|
| Papers indexed | 20 real arXiv papers |
| Total chunks in ChromaDB | 74 |
| Indexing time | ~45 seconds |
| Retrieval + rerank | 0.03 seconds |
| Qwen3 analysis | 7–13 seconds |
| Average total pipeline | ~10 seconds |
| External API calls | 1 (Claude synthesis only) |
The retrieval and reranking are essentially instant. The bottleneck is Qwen3 generation, and 10 seconds for a cited, multi-paper research analysis is a trade I'll take every time.
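Per-stage numbers like the ones in the table can be collected with a small wrapper around each call. This is a minimal sketch, assuming wall-clock timing with `time.perf_counter`; the `timed` helper and the stage names are illustrative, and the two calls at the bottom are stand-ins for the real retrieval and generation functions.

```python
import time
from typing import Any, Callable, Dict


def timed(stage: str, timings: Dict[str, float],
          fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run fn, record its wall-clock duration in seconds under `stage`, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[stage] = time.perf_counter() - start
    return result


timings: Dict[str, float] = {}
# Stand-in calls; in the pipeline these would be the retrieval and generation steps.
hits = timed("retrieve_rerank", timings, sorted, [3, 1, 2])
answer = timed("generate", timings, str.upper, "draft answer")
timings["total"] = sum(timings.values())
```

Timing stages separately is what shows where the seconds actually go, rather than just reporting one end-to-end number.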
//What Lucas Built vs What I Built
I want to be clear about this because honesty matters more than positioning.
Lucas's article is better in two specific ways. His structured Pydantic outputs, which treat research primitives such as hypotheses and evidence as validated data objects, are a cleaner engineering pattern than what I built. And his agentic loop using GPT Researcher, where the system iteratively generates, critiques, and refines its own report, is genuinely more sophisticated than my single-pass pipeline.
What I built is better in different ways. Every component except the final synthesis runs on local hardware — no data leaves the GPU, no per-token cost on the heavy lifting. The reranker adds a quality layer that wasn't in his stack. And the system is actually deployed and measurable — not illustrative code snippets but a working pipeline with real latency numbers.
Neither is the complete answer. Both are pointing at the same thing from different angles.
//Where This Goes Next
The research workflow pipeline is a template, not a one-off. The same four steps — fetch, embed, rerank, synthesize — apply anywhere you have a document corpus and questions to ask of it.
- Academic research teams could run literature reviews across hundreds of papers in minutes rather than weeks. Ask "what does the 2024 literature say about attention mechanisms in vision transformers?" and get a cited synthesis, not a hallucination.
- Legal and compliance teams could index case law, contracts, and regulatory documents, then query across thousands of pages with sources they can actually verify.
- Product teams could build on their own support ticket history, user research, and internal wikis. Every answered ticket becomes retrievable context for the next one.
- Journalists and analysts could synthesize large document dumps quickly: FOIA releases, earnings transcripts, policy documents. The reranker helps surface the most relevant excerpts, not just the most similar ones.
The pattern scales. What changes is the folder of documents you point it at and the questions you care about answering.
//What I'd Do Differently Next Time
Structured outputs from Qwen3. Right now, the analysis comes back as formatted text. Lucas's Pydantic approach — returning validated objects with typed fields for positions, evidence, and confidence scores — would make the outputs more reliable and composable. That's on my list.
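A minimal sketch of what a validated result object could look like. To keep it dependency-free, it uses standard-library dataclasses as a stand-in for Pydantic. The `Finding` fields, the JSON shape, and the example confidence value are assumptions for illustration, not Lucas's schema or real pipeline output.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    claim: str
    source: str
    confidence: float

    def __post_init__(self) -> None:
        # Reject values a downstream step could not rely on.
        if not self.claim.strip():
            raise ValueError("claim must not be empty")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def parse_findings(raw_json: str) -> List[Finding]:
    """Parse a JSON array of findings; malformed input raises ValueError, KeyError, or TypeError."""
    items = json.loads(raw_json)
    return [
        Finding(
            claim=str(item["claim"]),
            source=str(item["source"]),
            confidence=float(item["confidence"]),
        )
        for item in items
    ]


# Made-up example of what the model would be asked to return
raw = '[{"claim": "Reranking filters irrelevant documents", "source": "paper_16.txt", "confidence": 0.8}]'
findings = parse_findings(raw)
```

Once the output is an object rather than a paragraph, a bad response fails loudly at parse time instead of quietly flowing into the final report.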
The agentic loop. A single-pass pipeline answers questions. An iterative one refines them — generate a draft report, critique it against the sources, identify gaps, retrieve more evidence, and revise. That's where this gets genuinely powerful.
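The loop described above can be sketched with the model calls passed in as plain functions. This is a minimal sketch under stated assumptions: `refine_report` and its three callables are hypothetical names, and in a real build `draft` and `critique` would each be an LLM call while `retrieve` would be the retrieve-then-rerank step.

```python
from typing import Callable, List


def refine_report(question: str,
                  retrieve: Callable[[str], List[str]],
                  draft: Callable[[str, List[str]], str],
                  critique: Callable[[str, List[str]], List[str]],
                  max_rounds: int = 3) -> str:
    """Draft a report, then loop: critique it, retrieve evidence for each gap, and redraft."""
    evidence = list(retrieve(question))
    report = draft(question, evidence)
    for _ in range(max_rounds):
        gaps = critique(report, evidence)
        if not gaps:
            break  # the critic found nothing left to fix
        for gap in gaps:
            for chunk in retrieve(gap):
                if chunk not in evidence:
                    evidence.append(chunk)
        report = draft(question, evidence)
    return report
```

The `max_rounds` cap matters: without it, a critic that always finds something to complain about would keep the loop running forever.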
Bigger corpus. Twenty papers is a proof of concept. 500 papers is where retrieval quality really gets tested, and where the reranker earns its keep most visibly.
//Closing Thought
Lucas asked how LLMs can augment researchers without replacing their thinking. I think the answer lives somewhere in the pipeline I built today: fast enough to be useful, grounded enough to be trusted, local enough to be private.
The reranker wasn't in my last build. Retrieval plus reranking takes 0.03 seconds, and the results are meaningfully better. Sometimes the upgrade is smaller than you expect.
Next time someone asks why I didn't use something, I'll try to just build it.