u/Ok_Gas7672 — reddlx

Ran the same question 3 ways against a knowledge graph. Retrieved the same 90 entities and triples each time. LLM output still varied. That's the finding.

Most demos are run against curated documents nobody's seen fail. We wanted to test differently - so we decided to up the ante and asked Claude to generate a pediatric antibiotic protocol on the fly, fed it into a knowledge graph pipeline neither of us had touched beforehand, and then ran questions against it live.

The screenshot shows two different phrasings of the same clinical question, run against the same document. Same entities, same triples, both times.

This is what deterministic retrieval actually looks like in practice. No LLM in the retrieval path - the system traverses a knowledge graph of entities and relationships, not chunks of text. So the same conceptual territory gets covered regardless of how you worded the question.
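To make the "no LLM in the retrieval path" idea concrete, here is a minimal sketch, not the pipeline from the test. It assumes a toy in-memory triple list, entity linking by plain substring match, and a fixed-depth breadth-first traversal. The triples and the `link_entities` / `retrieve` names are made up for illustration.

```python
from collections import deque

# Toy graph of (subject, predicate, object) triples. Contents are invented.
TRIPLES = [
    ("amoxicillin", "treats", "otitis media"),
    ("amoxicillin", "dosed_by", "body weight"),
    ("otitis media", "occurs_in", "children"),
    ("azithromycin", "alternative_to", "amoxicillin"),
]

def link_entities(question, triples):
    """Find graph entities mentioned in the question by string matching (no LLM)."""
    text = question.lower()
    entities = {s for s, _, _ in triples} | {o for _, _, o in triples}
    return sorted(e for e in entities if e in text)

def retrieve(question, triples, max_hops=2):
    """Breadth-first traversal from the linked entities; returns sorted triples."""
    seeds = link_entities(question, triples)
    seen = set(seeds)
    frontier = deque((e, 0) for e in seeds)
    found = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for s, p, o in triples:
            if node in (s, o):
                found.add((s, p, o))
                neighbor = o if node == s else s
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return sorted(found)

open_ended = retrieve("Can you explain how amoxicillin is used for otitis media?", TRIPLES)
pointed = retrieve("Amoxicillin for otitis media: what applies?", TRIPLES)
assert open_ended == pointed  # same entities linked, same subgraph returned
```

Both phrasings link to the same two entities, so the traversal covers the same subgraph. Nothing in this path samples or ranks with a model, which is why the result does not change between runs.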

What happened after retrieval is the interesting part. Open-ended phrasing got a longer, more explanatory answer. Pointed phrasing got a tighter one. Same concepts retrieved underneath, different output on top. That split is useful. If your stack doesn't separate retrieval from synthesis clearly, you'll end up tuning the model when the problem is retrieval, or rebuilding retrieval when the problem is synthesis.

This test let us isolate exactly which layer the inconsistency lived in - and it was definitely not the retrieval.
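A rough sketch of that kind of check, assuming you log the retrieved IDs and the final answer for each run. The `Run` record and `locate_variance` helper are hypothetical names, and the IDs and answers are placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    phrasing: str
    retrieved: frozenset  # IDs of the entities/triples returned by retrieval
    answer: str           # final text produced by the synthesis step

def locate_variance(runs):
    """Report which layer differs across runs of the same underlying question."""
    retrieval_stable = len({r.retrieved for r in runs}) == 1
    synthesis_stable = len({r.answer.strip() for r in runs}) == 1
    if not retrieval_stable:
        return "retrieval"  # inputs differ, so the answers are not comparable yet
    if not synthesis_stable:
        return "synthesis"  # same inputs, different outputs
    return "none"

runs = [
    Run("open-ended", frozenset({"t1", "t2", "t3"}), "A longer, explanatory answer."),
    Run("pointed", frozenset({"t1", "t2", "t3"}), "A tight answer."),
]
print(locate_variance(runs))  # synthesis
```

Retrieval is checked first on purpose: if the retrieved sets differ, comparing the answers tells you nothing about the model.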

u/Ok_Gas7672 — 3 days ago

▲ 6 r/MachineLearning

Sharing all KGC 2026 decks. More production-grade KG systems than I've seen at any conference. [D]

Didn't make it to New York for the Knowledge Graph Conference this year, but caught some talks virtually and managed to download all the decks. Sharing them below because some of what was shown is worth knowing about.

The majority of the presentations described live production systems: enterprises showing up with real engineers delivering against real compliance requirements. That's unusual for AI events, where most talks are proofs of concept with a "coming soon to prod" slide at the end.

For example, Bloomberg showed a formal dependency model for ontology governance. AbbVie walked through ARCH, their internal KG for drug and disease-area intelligence, connected to a scoring engine, a researcher dashboard, and an LLM companion for plain-language queries. The KG is the source of truth. The LLM is the interface. Morgan Stanley showed continuous SHACL drift detection on risk reporting data - automated weekly checks that alert when the semantic layer deviates from what's governed.

Crux: knowledge graphs are being actively used as infrastructure, not a retrieval layer on top of vectors. The graph is doing reasoning work, not lookup work.

We've been skeptical of the "only using vector dbs" framing for a while. These production systems are the clearest evidence I've seen of where that breaks down - and what the alternative actually looks like when it's running. Link to all the decks is in the comments.

All decks here:

https://drive.google.com/drive/folders/1Csdv4hZePrBMJGggsisPXYBueTRCK1kV?usp=sharing

reddit.com

u/Ok_Gas7672 — 9 days ago

▲ 6 r/AI_Governance

The bulk of the non-AI work (to make AI actually work) is change management. Not even knowledge management

This is the third time I've seen this exact people gap, and potential failure mode.

Most of the mid-market companies are not prepared to deploy AI systematically. Not because of the model, not because of the tooling. Because there is literally no one whose job it is to own the knowledge that feeds the system.

We also work with large enterprises - think big pharma. They figured out part of this a long time ago. They have knowledge management functions, ontologists, content governance teams, information architecture - these are established roles. These just don't exist in smaller companies. But they badly need them.

AI is the forcing function, but gosh, that leads to scope creep. The PoC expands, gets delayed, or gets narrowed down drastically because the org is just not ready.

Nobody really owns knowledge management. Marketing has one version. Legal has another. The clinical team has a third. There is obviously no consensus - because nobody ever had to.

The pitch is "upload your documents and the AI will keep itself up to date." The reality is hunting across five different teams to find out who owns the current version of a stat, and discovering that nobody has a process for what happens when something changes. Vendors trying that - good luck. No FDE (forward-deployed engineer) can do that.

The AI tool is downstream of all of this. It can surface conflicts. It can flag stale content. But it can't fix the organizational reality that nobody agreed on who owns the answer. Teams that are actually getting this to work have usually solved the ownership problem first. Or they've scoped the tool narrowly enough that ownership is manageable.

Anyone else seeing this in mid-market specifically? Curious whether this is a segment problem or a universal one that just shows up more visibly at this scale.

u/Ok_Gas7672 — 11 days ago

▲ 3 r/ContextEngineering+1 crossposts

We hit this while building an RFP automation system. Client had hundreds of documents: past RFPs, RFIs, proposal templates, internal reference files spanning years. When we asked for a single source of truth, they confessed that they had none. We had a hunch that this was going to lead to a funny outcome.

We ingested everything and started taking queries.

First real tests:

- "What's our pricing?" Three different numbers depending on which document you pull.

- "How many employees?" Four different answers.

- "What's our compliance certification status?" One doc says pending. Another says SOC 2 Type 1. The most recent one says HITRUST.

At cogniswitch, we take a neuro-symbolic approach. Still, the system generated answers the team was not really stoked about. It was on a feedback call that the client's growth team mentioned the answers were dated. Obviously. The documents just had tons of conflicts and contradictions.

We went back and asked for the source of truth. There wasn't one. These were live internal documents that had accumulated years of drift. Nobody had reconciled them because nobody needed to until an AI had to answer from all of them at once.

We ended up building a conflict detection layer before the answer generation layer. Scan the corpus for conflicting facts - pricing, headcount, certification status - with different stated values across documents. Flag them. A human resolves which is authoritative. Then you can build anything on top of this knowledge foundation.
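A minimal sketch of the scan-flag-resolve loop, assuming facts have already been extracted into (document, attribute, value) rows. This is not our production layer. The document names, the `headquarters` rows, and the `find_conflicts` / `resolve` helpers are hypothetical. The certification values echo the example above.

```python
from collections import defaultdict

# Hypothetical extracted facts: (document, attribute, stated value).
FACTS = [
    ("doc_a", "compliance_certification", "pending"),
    ("doc_b", "compliance_certification", "SOC 2 Type 1"),
    ("doc_c", "compliance_certification", "HITRUST"),
    ("doc_a", "headquarters", "Austin"),    # invented, shows a non-conflict
    ("doc_c", "headquarters", "austin "),   # same value after normalization
]

def normalize(value):
    """Lowercase and collapse whitespace so formatting noise is not a conflict."""
    return " ".join(value.lower().split())

def find_conflicts(facts):
    """Return attributes that have more than one distinct stated value."""
    by_attribute = defaultdict(lambda: defaultdict(list))
    for doc, attribute, value in facts:
        by_attribute[attribute][normalize(value)].append(doc)
    return {
        attribute: {value: sorted(docs) for value, docs in values.items()}
        for attribute, values in by_attribute.items()
        if len(values) > 1
    }

def resolve(conflicts, decisions):
    """Apply human decisions; anything without a decision stays flagged."""
    resolved = {a: decisions[a] for a in conflicts if a in decisions}
    unresolved = sorted(a for a in conflicts if a not in decisions)
    return resolved, unresolved

conflicts = find_conflicts(FACTS)
resolved, unresolved = resolve(conflicts, {"compliance_certification": "hitrust"})
```

The code only surfaces the disagreement and records who won. Which value is authoritative stays a human call, passed in through `decisions`.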

Lesson learnt the hard way: there is a gap in output-only evals. Your benchmark asks whether the AI answered correctly. But if your knowledge base has contradictions, "correct" doesn't have a stable meaning.

There is a clear need for context evals: checking whether your retrieval corpus is internally consistent before you ever run a query. Right now they are barely a discipline. I don't know of good tooling for it. Most teams discover this problem the same way we did.

Anyone building RAG on messy enterprise document sets running into this?

u/Ok_Gas7672 — 14 days ago

▲ 13 r/AIQuality+1 crossposts

We wrapped up a 120-question UAT with a CMO and his team. This is where it gets funny. According to one of their team members, we had a 99% accuracy and answer completeness score. The CMO actually flagged a bunch of answers as hallucinations.

We pulled every flagged answer and traced it back through the source documents. For context - we have a neuro-symbolic approach towards grounding agents.

There was 0 fabrication and every answer was grounded in the actual clinical guidance we'd ingested. What actually got flagged:

- An answer used "physician" where the organization says "provider." And it was sourced from a document that the reviewer didn't know had been uploaded.

- The CMO's definition of hallucination: the AI made something up that wasn't in any source. Our definition: the AI went to the open internet instead of using the knowledge base. We figured out the hard way that those two are not the same thing.

And it turns out there's a third definition that came up separately - using a valid source document to give an incorrect answer. That one is neither of the other two.
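One way to pin the three definitions down before UAT is to write them as explicit labels that reviewers pick from. A minimal sketch: the `Flag` names and the three boolean inputs are my own labels for illustration, not terms the CMO's team used.

```python
from collections import Counter
from enum import Enum

class Flag(Enum):
    FABRICATED = "claim is not in any source"               # the CMO's definition
    OFF_CORPUS = "claim came from outside the knowledge base"  # our definition
    MISREAD = "valid source document, incorrect answer"     # the third definition
    GROUNDED = "supported by an ingested source"

def classify(source_in_kb, source_external, faithful_to_source):
    """Map one traced answer to exactly one label."""
    if source_in_kb:
        return Flag.GROUNDED if faithful_to_source else Flag.MISREAD
    if source_external:
        return Flag.OFF_CORPUS
    return Flag.FABRICATED

# Placeholder review rows: (source_in_kb, source_external, faithful_to_source).
reviews = [
    (True, False, True),
    (True, False, False),
    (False, True, True),
    (False, False, False),
]
tally = Counter(classify(*row) for row in reviews)
```

With labels like these agreed up front, a reviewer flagging "hallucination" has to say which kind. A terminology mismatch like "physician" vs "provider" then lands under `GROUNDED` rather than inflating the failure count.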

We eventually did clear the "hallucinations" by walking the CMO through where each answer came from. But the exercise made us realize what we had taken for granted: if you don't align on what you're measuring before UAT starts, your accuracy scores mean nothing.

You get misaligned pass/fail calls on things that should have been caught much earlier.

This is not specific to healthcare. Anyone building eval pipelines for regulated domains is going to hit this. The terminology needs to come from a shared definition, not from a random article on the internet.

u/Ok_Gas7672 — 15 days ago