u/Automatic_Fault4483

Reddit and X are littered with people struggling to implement Q&A RAG over internal docs, aka the use case that tens of thousands of companies are pining for. What I don't get is why the community treats this type of use case as a bespoke problem for every implementation. I've built this type of agentic RAG several times and it's always the same, and I would bet for 99% of use cases there's a simple standard that will suffice. The 1% of remaining use cases are ones that involve extremely weird data formats like, idk, super niche structured data that's only used to represent building blueprints in Zimbabwe.

Here's the one agentic RAG to rule them all. Any internal docs RAG should be able to follow this blueprint as a starting point and strip out the parts that aren't needed.

Tell me why this won't work for your use.

The assumption is this is for internal docs so the upper bound on data might be a few hundred GiB.

Modalities Supported

PDF (textual, handwritten, images)
Tabular (CSV, TSV, XLSX)
Plain text (including docx, JSON, yaml, etc.)
Images
Audio
Video

Ingestion

Take every modality and standardize it to an embeddable format: OCR the PDFs, transcribe the audio/video. If you want visual recognition of videos as extra credit, sample one frame per second and treat the frames as images. Any modern transcription or text extraction service (e.g. the ones AWS offers) should be able to get the job done.
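As a rough illustration of "standardize everything", here's a minimal sketch of the routing layer. The extension table, the `Extracted` record, and the `extractors` registry are all names I made up for this example; the actual OCR/transcription calls would live inside whatever callables you register.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical extension-to-modality table; extend it to match your corpus.
MODALITY_BY_EXT = {
    ".pdf": "pdf",
    ".csv": "tabular", ".tsv": "tabular", ".xlsx": "tabular",
    ".txt": "text", ".docx": "text", ".json": "text", ".yaml": "text",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".mp3": "audio", ".wav": "audio",
    ".mp4": "video", ".mov": "video",
}


@dataclass
class Extracted:
    """One standardized, embeddable unit of text plus where it came from."""
    source: str
    modality: str
    text: str
    locator: dict = field(default_factory=dict)  # page, row range, timestamps, ...


def detect_modality(path: str) -> str:
    """Map a file path to a modality by its (case-insensitive) extension."""
    ext = Path(path).suffix.lower()
    if ext not in MODALITY_BY_EXT:
        raise ValueError(f"unsupported file type: {ext!r}")
    return MODALITY_BY_EXT[ext]


def ingest(path: str, extractors: dict) -> list[Extracted]:
    """Route a file to the extractor registered for its modality.

    `extractors` maps modality -> callable(path) -> list of (text, locator) pairs.
    Those callables are where an OCR or transcription service gets plugged in.
    """
    modality = detect_modality(path)
    return [Extracted(path, modality, text, loc) for text, loc in extractors[modality](path)]
```

The point of the single `Extracted` shape is that everything downstream (chunking, embedding, citing) only ever has to deal with text plus a locator, regardless of what the file originally was.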

Chunking

Chunk however you like, as long as each chunk's metadata lets you cite it in a pinch. Include the page number for PDFs, the row range for CSVs, the cell range for XLSX, and the timestamps for audio/video.

Chunking strategy doesn't have to be that complicated - use a recursive text split, a static chunk size per modality, whatever. Optimizing beyond a sane, reasonable strategy is diminishing returns.
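To make that concrete, here's a small sketch of a recursive text split that keeps a page number on every chunk. The function names, the separator order, and the 1000-character default are my own choices for illustration, not a recommendation; a library splitter would do the same job.

```python
def recursive_split(text: str, max_chars: int = 1000,
                    seps: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing to finer ones for oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not seps:
        # No separators left: hard-cut into fixed-size windows.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = seps[0], seps[1:]
    chunks, buf = [], ""
    for piece in text.split(sep):
        candidate = buf + sep + piece if buf else piece
        if len(candidate) <= max_chars:
            buf = candidate  # keep packing pieces into the current chunk
            continue
        if buf:
            chunks.append(buf)
        if len(piece) <= max_chars:
            buf = piece
        else:
            # A single piece is still too big: retry with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
            buf = ""
    if buf:
        chunks.append(buf)
    return [c for c in chunks if c.strip()]


def chunk_pages(source: str, pages: list[str], max_chars: int = 1000) -> list[dict]:
    """Chunk page by page so every chunk carries a citable page number."""
    out = []
    for page_no, page_text in enumerate(pages, start=1):
        for piece in recursive_split(page_text, max_chars):
            out.append({"source": source, "page": page_no, "text": piece})
    return out
```

Splitting per page (or per row range, or per transcript segment) before splitting by size is what keeps the citation metadata trivially correct: a chunk never straddles two locators.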

Embedding

Use any modern embedding model to embed the chunks. Performance variations are minor and unpredictable. If you need multimodal then add another column to your search index for that modality. Save in Postgres, use Pinecone, offload to LlamaIndex, etc. Performance differences are minor at this scale. Use an index like HNSW if needed, with a minimum filter count threshold to prevent overfiltering.
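For a sense of how little machinery the baseline needs, here's a sketch of exact (brute-force) cosine search over stored rows, which is the thing an HNSW index approximates. `toy_embed` is a hashed bag-of-words stand-in I wrote so the example runs on its own; it is not a real embedding model, and you'd swap in whichever one you actually use.

```python
import math
import zlib
from collections import Counter


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine_top_k(query_vec: list[float], rows: list[dict], k: int = 5) -> list[dict]:
    """Exact nearest neighbours by dot product; vectors are assumed unit-length."""
    scored = [
        (sum(q * v for q, v in zip(query_vec, row["embedding"])), row)
        for row in rows
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:k]]


# Each row is one chunk: its text, its citation metadata, and its vector.
rows = [
    {"text": t, "embedding": toy_embed(t)}
    for t in (
        "vacation policy for employees",
        "quarterly revenue table",
        "server outage postmortem",
    )
]
best = cosine_top_k(toy_embed("vacation policy"), rows, k=1)
```

If this exact scan is fast enough on your corpus, you can defer the index question entirely.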

Querying the Index

Use embedding search + BM25 with a reranker. You can optimize with fancy techniques like HyDE or SIRA if you want, but be wary of diminishing returns once you have the basic setup down.
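Here's a minimal sketch of the keyword half plus one way to merge it with the embedding results. I'm using reciprocal rank fusion to combine the two ranked lists; that's my assumption, since nothing above prescribes a fusion method. The whitespace tokenizer and the `k1`, `b`, and `k=60` defaults are conventional illustrative values, and the reranker would run on the fused list afterwards.

```python
import math
from collections import Counter


def bm25_rank(query: str, docs: dict, k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Rank doc ids by Okapi BM25 over whitespace tokens (a deliberately tiny tokenizer)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / max(n_docs, 1)
    # Number of documents each term appears in.
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / length_norm
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def rrf_fuse(rankings: list, k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]
```

Usage would look like `rrf_fuse([bm25_rank(query, docs), vector_ranking])`, where `vector_ranking` is the list of doc ids your embedding search returned. Fusing on ranks rather than raw scores sidesteps the fact that BM25 scores and cosine similarities live on different scales.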

The index is a search index. The main goal is to find relevant documents, not to answer the question wholesale.

Completing the Q&A

Leverage the search index to find the relevant documents. Let the agent decide whether to search again, answer the question, or pull the document(s) in their entirety to examine more closely. Set up a code execution sandbox so the agent can examine the documents as needed (pandas for CSVs, pypdf for PDFs, etc.).
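The control flow described above can be sketched in a few lines. Everything here is hypothetical scaffolding: `decide` stands in for your LLM call, `search` for the index lookup, and `load_document` for fetching a full document. The sandbox step is left out, but it would be one more action alongside `read`.

```python
from typing import Callable


def run_agent(question: str,
              decide: Callable[[str, list], dict],
              search: Callable[[str], list],
              load_document: Callable[[str], str],
              max_steps: int = 8) -> str:
    """Loop until the agent answers or the step budget runs out.

    `decide` sees the question plus everything gathered so far and returns one of:
      {"action": "search", "query": ...}
      {"action": "read",   "doc_id": ...}
      {"action": "answer", "text": ...}
    """
    context: list = []
    for _ in range(max_steps):
        step = decide(question, context)
        if step["action"] == "answer":
            return step["text"]
        if step["action"] == "search":
            context.append(("search", step["query"], search(step["query"])))
        elif step["action"] == "read":
            context.append(("read", step["doc_id"], load_document(step["doc_id"])))
        else:
            raise ValueError(f"unknown action: {step['action']!r}")
    return "No answer within the step budget."


def scripted_decide(question: str, context: list) -> dict:
    """A canned policy for demonstration: search once, read the top hit, then answer."""
    if not context:
        return {"action": "search", "query": question}
    if len(context) == 1:
        return {"action": "read", "doc_id": context[0][2][0]}
    return {"action": "answer", "text": "found in " + context[1][1]}


result = run_agent("pto policy?", scripted_decide,
                   search=lambda q: ["hr.pdf"],
                   load_document=lambda d: "full text")
```

The `max_steps` cap matters more than it looks: without it, an agent that keeps deciding to search again never returns.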

-----

Everything else (GraphRAG, BGE-M3, fiddling with embedding benchmarks, etc.) is noise with diminishing returns and should only be addressed once the problem is "Things work, they're just a bit slow, and once in a blue moon I find a document wasn't fetched correctly." Unless you're building a massive enterprise-scale search index (Perplexity, Glean, etc.) that needs to be best-in-class, this setup should get the job done.

One agentic RAG to rule them all. Debate me.