I stopped using LangChain for my retrieval pipeline — here's what the benchmark numbers actually look like
Building a transcript intelligence system for management consultants. The use case: query across 10+ hours of client meetings and get cited, verifiable answers — not summaries, exact source spans with speaker and timestamp.
Started with LangChain. Switched to a custom pipeline. Here's the honest account.
Why I left LangChain
It's great for prototyping. It's not great when you need partial failure recovery, concurrent independent stages, and stateful checkpointing on long documents. Once I needed the pipeline to survive mid-run crashes and resume from the last completed stage without restarting, LangChain became more obstacle than tool. Built a custom DAG runner instead.
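For anyone wondering what "resume from the last completed stage" looks like in practice, here is a minimal sketch, not my actual runner. It assumes stages are plain callables that return JSON-serializable results, runs them sequentially in dependency order (no concurrency), and keeps one checkpoint file per run. `run_pipeline`, the checkpoint layout, and the stage names you pass in are all illustrative.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from pathlib import Path


def run_pipeline(stages, deps, checkpoint_path):
    """Run stages in dependency order, skipping any that are already checkpointed.

    stages: dict of stage name -> callable(results_so_far) -> JSON-serializable value
    deps:   dict of stage name -> set of stage names it depends on
    """
    path = Path(checkpoint_path)
    # Load results of stages completed in a previous (possibly crashed) run.
    done = json.loads(path.read_text()) if path.exists() else {}

    for name in TopologicalSorter(deps).static_order():
        if name in done:
            continue  # finished in an earlier run, do not redo it
        done[name] = stages[name](done)
        # Write to a temp file and rename it over the checkpoint, so a crash
        # mid-write cannot leave a half-written checkpoint behind.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(done))
        tmp.replace(path)
    return done
```

If a stage raises, everything before it is already on disk, so calling `run_pipeline` again with the same checkpoint path only runs the failed stage and whatever comes after it.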
The decision I'm most confident about
The backend never calls an LLM at query time. It returns an evidence pack — ranked source spans, citations, topic structure. The client LLM does synthesis. This keeps query latency at 2-3 seconds regardless of how many transcripts are in the system, and it means retrieval quality and synthesis quality are independently debuggable. This separation has saved me more debugging time than anything else.
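To make "evidence pack" concrete, here is a rough sketch of what such a payload could look like. The field names (`transcript_id`, `start_s`, `score`, and so on) are assumptions for illustration, not my real schema; the point is only that the backend hands back ranked, cited spans and nothing LLM-generated.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SourceSpan:
    transcript_id: str   # which meeting the span comes from
    speaker: str
    start_s: float       # span start, seconds into the recording
    end_s: float
    text: str            # exact source text, never a summary
    score: float         # retrieval score, higher is better


@dataclass
class EvidencePack:
    query: str
    spans: list[SourceSpan] = field(default_factory=list)
    topics: dict[str, list[str]] = field(default_factory=dict)  # topic -> subtopics

    def ranked(self) -> list[SourceSpan]:
        """Spans ordered best-first by retrieval score."""
        return sorted(self.spans, key=lambda s: s.score, reverse=True)

    def to_json(self) -> str:
        """Serialize for the client LLM; no model is called on the backend."""
        payload = {
            "query": self.query,
            "spans": [asdict(s) for s in self.ranked()],
            "topics": self.topics,
        }
        return json.dumps(payload, ensure_ascii=False)
```

Because the pack is plain data, you can diff it between runs: if the right span is in the pack and the answer is still wrong, the bug is on the synthesis side.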
The problem nobody warned me about
My design partner's transcripts are Hinglish: Hindi and English mixed, sometimes with Devanagari script mid-sentence. Naive full-text search (FTS) indexing on raw text means English queries hit a Devanagari index and return zero results. That's not a retrieval failure, it's an indexing failure. Took me an embarrassingly long time to find it.
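In hindsight, a cheap script check at ingest time would likely have surfaced this much earlier. A minimal sketch, assuming you only care about the basic Devanagari Unicode block (U+0900 to U+097F); `devanagari_ratio` is a hypothetical helper, not something from my pipeline.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)
```

Logging this per transcript segment before indexing makes it obvious when a chunk of the index is in a script your English queries will never match.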
The fix involved pre-extracting a domain glossary per transcript before translation, injecting it as locked terms so the translator doesn't destroy acronyms and proper nouns, and indexing only on the translated text. Naive translation alone doesn't work — it flattens the terminology that actually matters in business conversations.
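One way to implement "locked terms" is placeholder substitution: swap glossary terms for opaque tokens, translate, then swap them back. This is a sketch under that assumption and may differ from how you would inject terms into an LLM-based translator. The `translate` argument is a stand-in for whatever translation call you use, and the `[[T0]]` token format is arbitrary; it only has to survive translation unchanged.

```python
import re


def lock_terms(text, glossary):
    """Replace glossary terms with placeholder tokens; return (locked_text, mapping)."""
    mapping = {}
    # Longest terms first, so "EBITDA margin" wins over "EBITDA".
    terms = sorted(glossary, key=len, reverse=True)
    if not terms:
        return text, mapping
    pattern = re.compile("|".join(re.escape(t) for t in terms))

    def repl(match):
        token = f"[[T{len(mapping)}]]"
        mapping[token] = match.group(0)
        return token

    return pattern.sub(repl, text), mapping


def unlock_terms(text, mapping):
    """Put the original glossary terms back in place of their tokens."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text


def translate_with_locked_terms(text, glossary, translate):
    """Translate text while keeping glossary terms byte-for-byte intact."""
    locked, mapping = lock_terms(text, glossary)
    return unlock_terms(translate(locked), mapping)
```

The translated output, with the original acronyms and proper nouns restored, is what goes into the index.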
The benchmark numbers
Tested on one 2.5hr Hinglish business meeting, 30 questions across 3 difficulty sets, graded against the actual transcript.
On a single transcript, Claude with the full document in context scores 87%. My system scores 70%. Claude wins — expected, it reads everything at once.
At 4 transcripts (~10 hours of meetings), Claude's context window saturates. It starts confusing which meeting said what and filling gaps with plausible-sounding wrong answers. My system's score improves as the library grows because it only ever retrieves the relevant portion of content per query. The crossover is somewhere between 2 and 4 transcripts.
One fabricated answer in 30: asked about a resignation decision, the system returned a wrong answer it had no evidence for. That's a synthesis prompt failure, not a retrieval failure: the right content was retrieved, but the prompt had no rules for what to do with ambiguous evidence. I'm fixing it now with explicit abstention logic.
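A rough sketch of what explicit abstention could look like on the synthesis side. Two assumptions here: evidence arrives as `(score, citation, text)` tuples, and "too weak to answer" is approximated by a score threshold. The `min_score` value and the prompt wording are placeholders, not tuned numbers from my benchmark.

```python
ABSTAIN_MESSAGE = "I don't know: the retrieved evidence does not support an answer."


def build_synthesis_prompt(question, spans, min_score=0.5):
    """Return (prompt, abstained).

    spans: list of (score, citation, text) tuples from retrieval.
    If no span clears min_score, skip the LLM call and abstain outright.
    """
    strong = [s for s in spans if s[0] >= min_score]
    if not strong:
        return None, True

    evidence = "\n".join(f"[{citation}] {text}" for _, citation, text in strong)
    prompt = (
        "Answer using ONLY the evidence below and cite the bracketed source for every claim.\n"
        "If the evidence is ambiguous, conflicting, or does not directly answer the question, "
        f"reply exactly: {ABSTAIN_MESSAGE}\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}"
    )
    return prompt, False
```

The hard gate handles the "nothing relevant was retrieved" case without an LLM call; the prompt rule covers the case where evidence exists but is ambiguous.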
What I'd tell myself from 2 months ago
Build abstention first. "I don't know" is more valuable than a confident wrong answer in any high-stakes context. I bolted it on late and it cost me benchmark cycles.
Also: graph expansion only helps when your edges are high quality. Noisy edges actively hurt retrieval. I overestimated how clean automatically extracted relationships would be.
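A minimal sketch of gating expansion on edge quality, assuming each extracted edge carries a confidence score. The `edges` layout and the `min_conf` default are illustrative, not values from my system.

```python
def expand(seeds, edges, min_conf=0.8, max_hops=1):
    """Expand seed nodes along edges whose confidence is at least min_conf.

    edges: dict of node -> list of (neighbor, confidence) pairs.
    Returns the set of seeds plus every node reached within max_hops.
    """
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(max_hops):
        next_frontier = set()
        for node in frontier:
            for neighbor, conf in edges.get(node, []):
                # Low-confidence edges are ignored entirely rather than down-weighted.
                if conf >= min_conf and neighbor not in seen:
                    next_frontier.add(neighbor)
        seen |= next_frontier
        frontier = next_frontier
    return seen
```

With a strict threshold the expansion often adds nothing, which is fine: no expansion beats pulling in spans through an edge that was never real.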
Still open questions
How do you handle cross-document temporal reasoning — not just "what did person X say about topic Y" but "how has their position evolved across calls"? And at what point does adding more retrieved context start hurting synthesis quality rather than helping it?
Genuinely curious if anyone has hit the bilingual FTS problem and solved it differently.