
u/JackStrawWitchita

Am I alone in telling my RAG clients to re-do their data from scratch?
While I understand the use case for most RAQ systems is to allow LLMs to intelligently interrogate existing data/documents, but we can also see that's where the common problems occur.
I'm an old school IT guy and, back in the day, we always used the term 'garbage in, garbage out' when talking about systems. And from years of experience, it's nearly always crappy data that causes problems, not the solution itself.
So when I talk to clients about new systems, I immediately start talking about accuracy of retrieval. This is when I hit them with the 'garbage in, garbage out' talk and include how AI isn't a magic bullet to improve data accuracy. I start talking to them about how to spend considerable effort completely re-doing the data they want to interrogate, explaining how this effort will pay off in accuracy of retrieval. In one case, we started out with a blank spreadsheet where the client started adding in the data they wanted to interrogate as text organised into chunks.
This transparency helps the client understand the challenges. It also gives the client ownership of their data. Plus the exercise of transforming their old datastores into something designed for AI helps the client become more familiar with their own data, plus the 'cleaned data' is a new business asset to be used in other facets of the business.
And, it makes developing a RAG system much easier, tweakable, and reliable.
But I don't hear many people talking about challenging the client to clean their data. The emphasis seems to be on making the RAG jump through hoops (badly) to deal with crappy data. Am I just lucky to find amenable clients interested in clean data?