I'm trying to build a chat agent with a persona that can do RAG over a PDF (converted to chunked embeddings for easier search). I'm using Llama 3.2B and gave it a detailed system prompt covering the persona, the basic things the bot needs to know about itself, and how it should answer. Explicitly telling it to admit when it doesn't know something because the information isn't in the PDF content only works to an extent.
I read somewhere that apps like NotebookLM route the prompt by intent classification and apply strict mathematical gating to the information coming from RAG. So I started routing by classifying the user's intent first. If they say "hello there billy", the router sends it straight to the LLM for a response instead of doing RAG. But this breaks the bot's persona every now and then: when the user asks something like "how's the day feeling?", it gets wrongly routed to RAG and the bot ends up saying "I don't know", as instructed in the system prompt.
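To make the setup concrete, here is a minimal Python sketch of the routing and gating idea. It is not my actual code: the function names, the `SMALLTALK_CUES` list, and the 0.35 threshold are placeholders made up for illustration, and the bag-of-words `embed` is only a stand-in for a real embedding model. It does reproduce the failure I described, where a casual question with no greeting word falls through to the RAG path and gets refused.

```python
import math
import re
from collections import Counter

# Hypothetical cue list; a real router would use a classifier or an LLM call.
SMALLTALK_CUES = {"hello", "hi", "hey", "thanks", "bye"}


def embed(text):
    """Toy bag-of-words vector; stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_intent(query):
    """Label a query 'smalltalk' if it contains a greeting cue, else 'question'."""
    return "smalltalk" if set(embed(query)) & SMALLTALK_CUES else "question"


def gate_chunks(query, chunks, threshold=0.35):
    """Keep only chunks whose similarity to the query meets the threshold."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return [c for score, c in sorted(scored, reverse=True) if score >= threshold]


def route(query, chunks, threshold=0.35):
    """Return (route_name, context_chunks) for a user query."""
    if classify_intent(query) == "smalltalk":
        return "persona", []   # answer in character, no retrieval
    passed = gate_chunks(query, chunks, threshold)
    if passed:
        return "rag", passed   # answer only from the gated chunks
    return "refuse", []        # nothing relevant, so the bot says "I don't know"


chunks = [
    "The warranty covers battery replacement for two years.",
    "Returns are accepted within 30 days of purchase.",
]
```

With these toy chunks, `route("hello there billy", chunks)` takes the persona path, while `route("how's the day feeling?", chunks)` is classified as a question, finds no chunk above the threshold, and lands on the refusal path, which is the persona break I'm seeing.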
I'm new to this, and I'm asking here for suggestions after trying a bunch of different system prompts and different models (Llama and Gemma, in various sizes under 8B). Is this a limitation of the model size itself? I get that NotebookLM might be using a model with a million-token context, but should I go the route of Open-notebook or similar methods even for this simple conversational bot?