u/False_Being_6483 — reddlx

Metric

Score

Faithfulness

0.974 ✅

Context Precision

0.993 ✅

Answer Relevancy

0.820 ⚠️

Context Recall

0.889 ⚠️

Hey everyone, long post, but we're genuinely stuck and would love some input from people who've been down this road.

What we're building

A fully voice-driven RAG bot. User asks a question out loud, we transcribe it, retrieve context, and speak the answer back. No keyboard, no UI — just talk and listen.

How our retrieval stack works (quick overview)

We went with a two-layer parent-child chunking setup:

Parent blocks are ~300–500 words, child snippets are ~80–150 words
Children are indexed in Pinecone (dense) + BM25Okapi on parent text (sparse)
At query time, we do a hybrid search (0.7 dense + 0.3 BM25), then a conditional sibling expansion step — if a child's score beats the batch mean, we pull its siblings, score them with cosine, stitch survivors in reading order, and pass the whole context block to the LLM
Then MMR for diversity, then Pinecone's bge-reranker-v2-m3 cross-encoder for final ranking
We also generate section and document summary chunks and index those separately
For tables and images, we inject 300 chars of surrounding parent text into the embed so BM25 can actually surface them
Each text chunk gets 3 LLM-generated questions appended to the embed — this was specifically to bridge the gap between how someone speaks a question vs. how a document is written

Honestly, we're pretty happy with the architecture. The problems are downstream.

Our RAGAS eval results (13 questions)

Metric	Score
Faithfulness	0.974 ✅
Context Precision	0.993 ✅
Answer Relevancy	0.820 ⚠️
Context Recall	0.889 ⚠️

Two specific failures are dragging those numbers down.

Problem 1 — Answer relevancy scoring 0.0 on a dead-simple question

The question: "What was the ratio of job openings to unemployment in 2022?"

Context precision is 0.99. Context recall is 1.0. The retrieved context has the exact table with year-by-year ratios sitting right there. The LLM clearly found the data. But RAGAS scored answer relevancy at zero.

Our best guess? The LLM answered with framing language — something like "based on the table, the values were..." instead of just stating the number directly. RAGAS embeds the generated answer and the question, computes similarity, and if the answer is hedged or context-wrapped, the embedding drifts far enough from the question that it scores poorly.

This feels like either a prompt issue (we need to tell the LLM to answer directly and not reference the source) or just RAGAS noise on short numeric answers. Has anyone seen this specific pattern?

Problem 2 — Context recall dropping to 0.5 on multi-hop questions

The question: "What was the trend in job openings to unemployment ratio from 2018 to 2023, and how does this relate to [CEO survey insight]?"

The reference answer needs two separate pieces — the trend data AND a CEO survey finding. We're consistently pulling one but not both.

The bottleneck is our retrieval pipeline: we cap at k=10 parents, then MMR cuts to 8, then the reranker cuts to 3–5. By the time we hand context to the LLM, the second hop has been pruned out entirely.

What we're thinking of trying

For the multi-hop recall problem:

Raise k specifically for queries we detect as multi-hop (we already have keyword-based detection for this)
Either re-enable our graph expansion layer (we have a KG with summary_similarity and entity overlap edges built out, but currently bypassed) or add a sub-question decomposition step before retrieval — split "A and how does it relate to B" into two separate retrievals, then merge

For the answer relevancy 0.0:

Tighten the prompt — something like "answer directly and concisely, do not reference the source or table."
Or just accept it as a RAGAS artifact on numeric answers and move on

The core question we're stuck on

For anyone who's built a multi-hop RAG and gone through the MMR + reranker pipeline — how do you balance diversity vs. completeness for compound questions? MMR is great for avoiding redundant chunks, but it's actively hurting us when both hops are legitimately needed and happen to talk about related topics (so MMR treats the second one as redundant).

In a voice context, especially, we can't just throw 10 chunks at the LLM and hope — latency matters, and bloated context causes rambling answers.

Any war stories, approaches, or even just "yeah, sub-question decomp is the right call" would be genuinely helpful. Thanks in advance.