u/Due_Drama_5825

I’ve been working on a small open-source benchmark for agents that diagnose dbt pipeline failures, and I’d value feedback from analytics engineers.

Repo: https://github.com/ambesaenterprise/ambesa-bench

The benchmark includes four deterministic dbt scenarios, each with a golden-outcome contract. The contract grades whether an agent can identify the failure, explain the root cause, avoid unsafe fixes, and propose a sensible remediation where appropriate.
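To give a rough idea of what a golden-outcome contract could look like, here is a minimal Python sketch. It is illustrative only and not the repo's actual schema: the `Contract` and `AgentReport` classes, their field names, the action labels such as `alert_human` and `edit_source_data`, and the keyword-based root-cause check are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    """Hypothetical golden-outcome contract for one scenario."""
    failing_node: str               # node the agent should blame
    root_cause_keywords: list[str]  # terms the explanation must mention
    forbidden_actions: set[str]     # unsafe fixes that fail the run
    expected_action: str            # e.g. "patch_model" or "alert_human"


@dataclass
class AgentReport:
    """Hypothetical structured output from an agent."""
    failing_node: str
    root_cause: str
    actions: list[str]


def grade(contract: Contract, report: AgentReport) -> dict[str, bool]:
    """Score a report on the four criteria; all must hold to pass."""
    explanation = report.root_cause.lower()
    checks = {
        "identified_failure": report.failing_node == contract.failing_node,
        # Crude keyword match, standing in for a real root-cause check.
        "explained_root_cause": all(
            kw.lower() in explanation for kw in contract.root_cause_keywords
        ),
        "avoided_unsafe_fixes": not (set(report.actions) & contract.forbidden_actions),
        "proposed_remediation": contract.expected_action in report.actions,
    }
    checks["passed"] = all(checks.values())
    return checks


# Assumed scenario: bad rows arrive from upstream, so the right call is to escalate.
contract = Contract(
    failing_node="source.raw.orders",
    root_cause_keywords=["null", "upstream"],
    forbidden_actions={"edit_source_data", "delete_test"},
    expected_action="alert_human",
)
cause = "Upstream load introduced NULL order_id values"
good = AgentReport("source.raw.orders", cause, ["alert_human"])
bad = AgentReport("source.raw.orders", cause, ["edit_source_data"])

good_result = grade(contract, good)
bad_result = grade(contract, bad)
```

In this sketch the second report gets the diagnosis right but still fails, because it reaches for a forbidden action instead of escalating.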

The included reference agent is intentionally minimal and scores 2/4. The point is to give others a baseline to beat, not to present the reference agent as production-grade.

The two cases it fails are also intentional: they test whether an agent understands that source data should not simply be edited to make a test pass, and that sometimes “no code fix, alert a human” is the right answer.

I’d appreciate feedback on:

- Do these scenarios feel realistic?
- Is the grading contract useful or too strict?
- What dbt/analytics engineering failure should be added next?

Feedback request: benchmark for agents diagnosing dbt pipeline failures