u/Deep-Stick9614

Working on a production-scale document AI chat system using a LangGraph-based RAG pipeline with Pinecone.

The main challenge we’re facing is reliability with unpredictable user queries. We’ve added multiple LLM stages and a lot of prompt instructions over time to improve accuracy, but now the system is becoming inconsistent and sometimes hallucinates or fails on valid document-related questions.

Some common issues:
- Broad questions don’t retrieve enough overall document context
- Some answers are spread across multiple chunks, but retrieval misses important pieces, leading to incomplete responses
- Sometimes the system says it cannot answer even when the information clearly exists in the document

We’ve also tried re-ranking and MMR (maximal marginal relevance) retrieval, but neither consistently solves the problem for this use case.
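
For readers unfamiliar with MMR, here is a minimal generic sketch in Python of what that selection step does: it greedily picks chunks that are relevant to the query but not too similar to chunks already picked. This is not the pipeline described in this post. The function names, the `lambda_weight` parameter, and the toy 2-D vectors are illustrative assumptions, and a real system would use embedding vectors returned by the vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, chunk_vecs, k=3, lambda_weight=0.5):
    """Return indices of up to k chunks chosen by maximal marginal relevance.

    lambda_weight (hypothetical name) trades relevance against redundancy:
    1.0 ranks purely by relevance, lower values push for more diverse chunks.
    """
    relevance = [cosine(query_vec, v) for v in chunk_vecs]
    candidates = list(range(len(chunk_vecs)))
    selected = []

    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any chunk already selected.
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_weight * relevance[i] - (1 - lambda_weight) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return selected


# Toy data: chunk 1 is a near-duplicate of chunk 0, chunk 2 covers a different direction.
query = [1.0, 0.0]
chunks = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]

diverse = mmr_select(query, chunks, k=2, lambda_weight=0.3)
relevance_only = mmr_select(query, chunks, k=2, lambda_weight=1.0)
```

With the toy data above, a low `lambda_weight` skips the near-duplicate chunk in favor of the different one, while `lambda_weight=1.0` reduces to plain similarity ranking. This illustrates why MMR alone may not fix the multi-chunk problem: it diversifies among the candidates it is given, but it cannot recover pieces that never made it into the candidate set.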

The biggest challenge is that we cannot predict what kinds of questions users will ask, so the system needs to adapt its retrieval and reasoning strategies dynamically without relying on heavily overloaded prompts.

Has anyone faced similar issues in production RAG systems?
Would love to hear practical or architectural approaches that improved:
- Retrieval reliability
- Multi-chunk reasoning
- Handling broad vs specific queries
- Prompt complexity/LLM instability
- Dynamic retrieval strategies

Production RAG System Struggling With Unpredictable User Queries — Looking for Architectural Advice