

Ran the same question 3 ways against a knowledge graph. Retrieved the same 90 entities and triples each time. LLM output still varied. That's the finding.
Most demos are run against curated documents nobody's seen fail. We wanted to test differently, so we upped the ante: we asked Claude to generate a pediatric antibiotic protocol on the fly, fed it into a knowledge graph pipeline neither of us had touched beforehand, and then ran questions against it live.
The screenshot shows two different phrasings of the same clinical question, run against the same document. Same entities, same triples, both times.
This is what deterministic retrieval actually looks like in practice. No LLM in the retrieval path - the system traverses a knowledge graph of entities and relationships, not chunks of text. So the same conceptual territory gets covered regardless of how you worded the question.
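For anyone who wants to see the idea in code, here is a minimal sketch of LLM-free retrieval over a graph. It is not the pipeline we tested: the toy triples, the substring-based `find_seeds` lookup, and the two-hop limit are all assumptions made up for illustration. It only shows why two phrasings that mention the same entities land on the same triples.

```python
from collections import deque

# Hypothetical toy triples for illustration; not taken from the generated protocol.
TRIPLES = [
    ("amoxicillin", "treats", "otitis media"),
    ("amoxicillin", "dosed_by", "body weight"),
    ("otitis media", "occurs_in", "children"),
    ("penicillin allergy", "contraindicates", "amoxicillin"),
]

def find_seeds(question, triples):
    """Match known entity names in the question by plain substring lookup (assumed)."""
    text = question.lower()
    entities = {e for s, _, o in triples for e in (s, o)}
    return sorted(e for e in entities if e in text)

def retrieve(question, triples, hops=2):
    """Breadth-first traversal from the seed entities, up to `hops` edges away."""
    neighbors = {}
    for triple in triples:
        s, _, o = triple
        neighbors.setdefault(s, []).append((o, triple))
        neighbors.setdefault(o, []).append((s, triple))
    seeds = find_seeds(question, triples)
    seen = set(seeds)
    found = set()
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        entity, depth = queue.popleft()
        if depth == hops:
            continue
        for other, triple in neighbors.get(entity, []):
            found.add(triple)
            if other not in seen:
                seen.add(other)
                queue.append((other, depth + 1))
    # Sorting makes the result independent of traversal order.
    return sorted(found)

open_ended = retrieve("How should amoxicillin be used for otitis media?", TRIPLES)
pointed = retrieve("Amoxicillin in otitis media: what applies?", TRIPLES)
```

Both calls return the same sorted list of triples, because nothing in the path depends on sampling or on how the sentence is phrased beyond which entities it names.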
What happened after retrieval is the interesting part. Open-ended phrasing got a longer, more explanatory answer. Pointed phrasing got a tighter one. Same concepts retrieved underneath, different output on top. That split is useful. If your stack doesn't separate retrieval from synthesis clearly, you'll end up tuning the model when the problem is retrieval, or rebuilding retrieval when the problem is synthesis.
This test let us isolate exactly which layer the inconsistency lived in - and it was definitely not the retrieval.
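A rough sketch of that check, assuming each run is logged as a dict holding the retrieved triples and the final answer (the `runs` shape and the function names are hypothetical, not part of the system we tested): fingerprint the retrieval output per phrasing, then see whether the difference shows up before or after synthesis.

```python
import hashlib
import json

def fingerprint(triples):
    """Order-independent hash of a retrieved triple set."""
    canonical = json.dumps(sorted(list(t) for t in triples))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def locate_variation(runs):
    """runs: one dict per phrasing, with 'triples' and 'answer' keys (assumed shape)."""
    retrieval_ids = {fingerprint(r["triples"]) for r in runs}
    answers = {r["answer"].strip() for r in runs}
    if len(retrieval_ids) > 1:
        return "retrieval"   # phrasings pulled different triples
    if len(answers) > 1:
        return "synthesis"   # same triples in, different text out
    return "none"
```

If the fingerprints match and the answers differ, the variation lives in synthesis, which is the situation described above.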