r/mlops

▲ 6 r/mlops+2 crossposts

I built an LLM eval gate that can't silently pass

https://github.com/albertofettucini/faithgate

Most LLM eval setups I've seen have a failure mode ops people will recognize: the happy path is green, and every unhappy path is also green. Judge API dies, no scores, nothing to compare, pass. That's an availability metric wearing a quality-gate costume.

I built faithgate around the opposite default. It's a faithfulness regression gate (suite of cases, score per prompt/model version, diff vs baseline, nonzero exit on regression) where every ambiguous state fails closed. Zero matched cases: fail. Unscored run: fail. Every score an abstention: fail. Abstentions are a distinct state in storage, never coerced to 0.0, and there's a --max-abstained policy flag for when you actually want tolerance.
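For anyone who wants the fail-closed idea in code, here is a minimal sketch. It is not faithgate's actual implementation: the function name `gate`, the `PASS`/`FAIL` values, the `tolerance` parameter, and the choice to represent an abstention as `None` are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical exit codes for this sketch; the real tool's codes may differ.
PASS, FAIL = 0, 1


def gate(baseline: Dict[str, Optional[float]],
         head: Dict[str, Optional[float]],
         max_abstained: int = 0,
         tolerance: float = 0.0) -> int:
    """Return PASS only if a comparison actually ran and found no regression."""
    matched = sorted(set(baseline) & set(head))
    if not matched:
        return FAIL  # zero matched cases (or an unscored run): nothing compared

    # None marks an abstention. It is a distinct state, never coerced to 0.0.
    abstained = [k for k in matched if baseline[k] is None or head[k] is None]
    scored = [k for k in matched if k not in abstained]
    if not scored:
        return FAIL  # every score is an abstention
    if len(abstained) > max_abstained:
        return FAIL  # more abstentions than the policy tolerates

    regressed = [k for k in scored if head[k] < baseline[k] - tolerance]
    return FAIL if regressed else PASS
```

The point of the shape is that `PASS` is reachable from exactly one place, after every "nothing was measured" branch has already returned `FAIL`.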

Reproducibility bits: every run writes a manifest with judge id, model, kind, ragas and runner versions, and the suite hash. If the judge changed between baseline and head, comparing the scores is meaningless, so the gate exits 3 unless you explicitly pass --allow-judge-change. A corrupted manifest also fails closed. Duplicate case keys resolve pessimistically (baseline keeps max, head keeps min) so dupes can't quietly lower the bar.
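Roughly how the manifest check and the duplicate-key rule could look, as a sketch under assumptions. Only the exit code 3 and the `--allow-judge-change` behavior come from the post; the JSON manifest format, the field names in `JUDGE_FIELDS`, the `EXIT_FAIL` value, and both function names are hypothetical.

```python
import json
from typing import Dict, Iterable, Tuple

EXIT_OK = 0
EXIT_FAIL = 1           # hypothetical code for a corrupted manifest
EXIT_JUDGE_CHANGED = 3  # the code stated in the post

# Assumed manifest field names that identify the judge.
JUDGE_FIELDS = ("judge_id", "model", "kind")


def check_manifests(baseline_raw: str, head_raw: str,
                    allow_judge_change: bool = False) -> int:
    """Refuse to compare runs that were scored by different judges."""
    try:
        base, head = json.loads(baseline_raw), json.loads(head_raw)
        base_judge = tuple(base[f] for f in JUDGE_FIELDS)
        head_judge = tuple(head[f] for f in JUDGE_FIELDS)
    except (ValueError, KeyError, TypeError):
        return EXIT_FAIL  # corrupted or incomplete manifest fails closed
    if base_judge != head_judge and not allow_judge_change:
        return EXIT_JUDGE_CHANGED
    return EXIT_OK


def resolve_duplicates(rows: Iterable[Tuple[str, float]], side: str) -> Dict[str, float]:
    """Collapse duplicate case keys pessimistically.

    Baseline keeps the max and head keeps the min, so a duplicate can only
    make the comparison harder to pass, never easier.
    """
    pick = max if side == "baseline" else min
    out: Dict[str, float] = {}
    for key, score in rows:
        out[key] = score if key not in out else pick(out[key], score)
    return out
```

With duplicates `("q1", 0.7)` and `("q1", 0.9)`, the baseline side resolves to 0.9 and the head side to 0.7, which is the worst case for the head.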

My favorite part lives in CI. Next to the normal green gate there's a proves-detection job that runs the gate against a deliberately regressed suite and inverts the exit code. If the gate ever loses the ability to catch a known-bad change (because of a dependency bump, a refactor, whatever), the pipeline itself goes red. Tests for the test.
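The inversion itself is tiny. A minimal sketch, assuming the gate is invoked as a subprocess; `proves_detection` is a hypothetical helper name and the command you pass in is whatever your CI uses to run the gate on the regressed suite.

```python
import subprocess
import sys
from typing import List


def proves_detection(gate_cmd: List[str]) -> int:
    """Run the gate on a known-bad suite and invert its exit code.

    Returns 0 (CI green) only if the gate exited nonzero, i.e. it caught
    the deliberate regression.
    """
    result = subprocess.run(gate_cmd)
    if result.returncode == 0:
        print("gate passed a deliberately regressed suite", file=sys.stderr)
        return 1
    return 0
```

A plain inversion treats any nonzero exit as "detected", including a crash. A stricter variant would compare against the specific regression exit code instead.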

Judge honesty: default is Claude via your own key (RAGAS underneath). The keyless offline mode is published as untrustworthy: 68% balanced accuracy on a 40-example hand-labeled set, catching 9/20 unfaithful cases, with a unit test asserting the weakness.

Storage is one SQLite file with WAL, no server. Python 3.9 to 3.13, MIT. Known limitation: case identity is content-based, so rewording a question mints a new case.
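To make the storage and identity points concrete, here is a sketch under stated assumptions. The table schema, the `case_key` and `open_store` names, and the choice of which fields feed the hash are all hypothetical. It only illustrates content-based keys, WAL mode, and an abstention stored as NULL rather than 0.0.

```python
import hashlib
import json
import sqlite3


def case_key(question: str, context: str) -> str:
    """Derive a case id from content; any rewording yields a different key."""
    payload = json.dumps({"question": question, "context": context}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def open_store(path: str) -> sqlite3.Connection:
    """Open a single-file store in WAL mode (requires a file, not :memory:)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    # score is nullable: NULL means the judge abstained, distinct from 0.0.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores ("
        "run_id TEXT, case_key TEXT, score REAL, "
        "PRIMARY KEY (run_id, case_key))"
    )
    return conn
```

Because the key is a hash of the text, "What is X?" and "What's X?" are two unrelated cases, which is exactly the limitation described above.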

u/ahumanbeingmars — 1 day ago

▲ 3 r/mlops

Is there any course to learn MLOps?

I am looking for a course that covers:

How does a devops guy train a model?
How does a devops guy create a pipeline for the model?
How does a devops guy deploy the model?
How does a devops guy monitor the model?
Also, MLOps best practices.
