
Anthropic just confirmed pointing Claude at a warehouse doesn't work. What are folks here doing differently for production agentic analytics?
I've been deep-reading Anthropic's new post on how their internal teams use Claude for self-service analytics. There's a line buried in it that I think is the whole story: pointing Claude at a warehouse and letting the agents execute "can create a false sense of precision."
The rest of the post describes the four-layer stack they built to undo that. Canonical datasets, semantic layer enforcement, skill files for every domain, adversarial review on critical queries. Without the skill files, their internal accuracy sits at 21%. With them, 95%. Without active maintenance, it drifts back to 65% in a single month.
For anyone here running an agentic analytics stack in production:
Are you actually maintaining the skill / context layer at the cadence Anthropic implies is necessary, or are you finding cheaper ways to keep it from rotting?
Is anyone running LLM-drafted definitions through a human review loop, or has everyone settled on hand-authored only?
This breakdown of what Anthropic got right and what they got wrong for teams without a 50-person data org lines up with what I keep seeing in the field, but I want to know how others are actually closing the maintenance gap.
If your stack is in production today, what's the part that actually breaks?