Building AI for communications: context layer, hard rules, multi-model conflict
I've been building an AI workspace for communications teams and the same failure keeps showing up across every client I've onboarded. Sharing the architecture I'm landing on in case it helps anyone else working on AI for non-technical professional domains.
The failure pattern
Out-of-the-box LLMs are remarkable at generating plausible language and useless at generating correct language for a specific organization. They miss what matters most: context. The story behind the org, the prior decisions, the way this particular company talks about itself.
Most teams try to fix this by stuffing context into a system prompt or uploading a bunch of brand docs into a vector store. That works for two weeks. Then the narrative drifts. New strategy lands and never gets reflected. Old talking points keep coming back out. The model writes from an outdated version of the organization because nobody's tending the layer.
Garbage in, garbage out, but slower and harder to spot.
What I'm building toward
Three pieces, all of which seem necessary, none of which is sufficient on its own:
- A living context archive, not a brand doc dump. Structured fields (positioning, voice, audience), free-form vault, memory entries from past conversations. Auditable. Has a visible state ("Empty / Sparse / Growing / Solid") so the user can see what's underspecified. Gets re-audited every ~90 days via a guided conversation where the model proposes updates and the user accepts, edits, or skips each one.
- Hard operational rules from experienced practitioners. LLMs are generalists by design. Without explicit constraints ("third person externally," "no fabricated quotes," "EASY ON THE EM-DASHES"), they default to the most generic version of whatever you asked for. The rules layer is separate from the context layer because it's about how, not what. (This is where my expertise comes in. I've spent 25 years in organizational comms.)
- Multi-model adversarial review. One AI model generates a draft. A second model attacks it for the failure modes I care about (advisory hedging, fabricated specifics, off-brand voice). Both passes are visible to the user. The point isn't averaging. Consensus among models is worse than useless. It converges on the safest, most reliable answer. Conflict surfaces where the work actually is.
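To make the first piece concrete, here is a minimal sketch of how an archive could report its visible state and flag when a re-audit is due. The class name, the `SOLID_MIN_ENTRIES` threshold, and the rule mapping contents to "Empty / Sparse / Growing / Solid" are assumptions for illustration, not the actual implementation.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional

# Field names come from the post; the thresholds below are illustrative guesses.
STRUCTURED_FIELDS = ("positioning", "voice", "audience")
AUDIT_INTERVAL = timedelta(days=90)
SOLID_MIN_ENTRIES = 5  # hypothetical: supporting entries needed for "Solid"


@dataclass
class ContextArchive:
    fields: dict = field(default_factory=dict)    # structured fields
    vault: list = field(default_factory=list)     # free-form documents
    memories: list = field(default_factory=list)  # entries from past conversations
    last_audit: Optional[date] = None

    def missing_fields(self) -> list:
        """Structured fields that are absent or blank."""
        return [n for n in STRUCTURED_FIELDS if not self.fields.get(n, "").strip()]

    def state(self) -> str:
        """Map how much is filled in to the label shown to the user."""
        missing = self.missing_fields()
        supporting = len(self.vault) + len(self.memories)
        if len(missing) == len(STRUCTURED_FIELDS) and supporting == 0:
            return "Empty"
        if missing:
            return "Sparse"
        return "Solid" if supporting >= SOLID_MIN_ENTRIES else "Growing"

    def audit_due(self, today: date) -> bool:
        """True if the archive was never audited or the interval has elapsed."""
        return self.last_audit is None or today - self.last_audit >= AUDIT_INTERVAL


archive = ContextArchive(fields={"voice": "Plain, direct, no jargon."})
print(archive.state(), archive.missing_fields())  # Sparse ['positioning', 'audience']
```

Keeping the state a pure function of the archive contents means the label can never drift from what is actually stored, which is the same property the archive is supposed to give the model.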
On top of that: a risk classifier that decides when to require a human review step before output reaches the user. Human-in-the-loop isn't a fallback for low-confidence cases. For high-stakes work it's the point. The model's job is to do the legwork and surface decisions. A human's job is to make them.
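A rough sketch of that gating, assuming the generator, critic, and classifier are plain callables. All names here (`Risk`, `Turn`, `run_turn`, `keyword_classifier`) are hypothetical, the two thresholds are tunable guesses, and the keyword classifier is a deliberately naive placeholder standing in for whatever real classifier is used.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Optional


class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


@dataclass
class Turn:
    risk: Risk
    draft: str
    critique: Optional[str]   # None when the adversarial pass was skipped
    needs_human_review: bool  # True means a person decides before release


def keyword_classifier(request: str) -> Risk:
    """Placeholder only: a real classifier would not rely on keywords."""
    text = request.lower()
    if any(k in text for k in ("crisis", "press release", "statement")):
        return Risk.HIGH
    return Risk.LOW


def run_turn(
    request: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    classify: Callable[[str], Risk] = keyword_classifier,
    critique_at: Risk = Risk.MEDIUM,  # assumed threshold for the second model
    human_at: Risk = Risk.HIGH,       # assumed threshold for human review
) -> Turn:
    risk = classify(request)
    draft = generate(request)
    # Only pay for the second model when the risk justifies it.
    review = critique(request, draft) if risk >= critique_at else None
    # Draft and critique are both returned, so both passes stay visible.
    return Turn(risk, draft, review, risk >= human_at)
```

Returning the critique alongside the draft, rather than merging them, keeps the disagreement between models in front of the person making the call.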
What's still open
- The audit conversation pattern works but has been brittle (model paraphrases the existing field instead of byte-quoting it, flip-flops between values, hits token limits mid-JSON). Most of my last week was filter logic to catch those failure modes.
- Memory hygiene at scale. When does old context become noise vs. useful long-tail? Haven't solved it.
- Adversarial review costs roughly 2x per turn. Worth it for high-risk responses, overkill for "hey reformat this list." Currently risk-gated, but the classifier is the weak link.
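For the audit failure modes in the first bullet, a filter along these lines is one way to catch them before a proposal reaches the user. The JSON shape (`field`, `current`, `proposed`) and the problem labels are assumptions made for this sketch, and "flip-flop" is interpreted here as proposing a value the field has already held.

```python
import json


def check_proposal(raw: str, current_fields: dict, history: dict) -> list:
    """Return a list of problems; an empty list means the proposal passes.

    raw            -- the model's JSON output for one proposed update
    current_fields -- field name -> value currently stored in the archive
    history        -- field name -> list of values the field held before
    """
    try:
        proposal = json.loads(raw)
    except json.JSONDecodeError:
        # Also covers output cut off mid-JSON by a token limit.
        return ["invalid_json"]
    if not isinstance(proposal, dict):
        return ["invalid_json"]

    name = proposal.get("field")
    if not isinstance(name, str) or name not in current_fields:
        return ["unknown_field"]

    problems = []
    # The model must quote the stored value exactly, not paraphrase it.
    if proposal.get("current") != current_fields[name]:
        problems.append("current_not_verbatim")

    proposed = proposal.get("proposed")
    if not isinstance(proposed, str) or not proposed.strip():
        problems.append("missing_proposed")
    elif proposed == current_fields[name]:
        problems.append("no_change")
    elif proposed in history.get(name, []):
        # Proposing a value the field held before suggests oscillation.
        problems.append("flip_flop")
    return problems


fields = {"voice": "Plain, direct, no jargon."}
history = {"voice": ["Warm and conversational."]}
paraphrased = json.dumps({
    "field": "voice",
    "current": "Plain and direct, without jargon.",
    "proposed": "Plain and direct.",
})
print(check_proposal(paraphrased, fields, history))  # ['current_not_verbatim']
```

Exact string comparison is strict on purpose: if the model cannot reproduce the stored value byte for byte, there is no guarantee it is reasoning about the right version of the field.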
Happy to go deeper on any of these. Curious if anyone else is doing similar work in other professional domains (legal, medical, finance) where the context + hard rules + human-in-the-loop shape probably generalizes.