
I built a Goodhart-proof AI coding agent that runs locally on 4GB VRAM. It physically cannot see your tests.
I've been researching how AI coding agents inevitably drift toward passing metrics instead of solving the problem (Goodhart's Law). Commercial tools rely on prompt engineering and post-hoc review, but those safeguards are disciplinary, not architectural: they tell the agent not to game the metric while leaving the metric in plain view.
I built an open-source 4-layer pipeline (Planning → Execution → Verification → Optimization) where information asymmetry is enforced via strict TypedDict contracts and LangGraph state isolation:
- The execution agent never receives acceptance criteria, unit tests, or the verification rubric.
- Verification is blind: it evaluates git diffs without author identity or original prompt context.
- Retry feedback is sanitized to abstract guidance only (prevents rubric memorization).
- Neo4j graph analysis replaces context-window stuffing with precise AST dependency mapping.
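To make the isolation idea concrete, here is a minimal sketch of contract-based projection in plain Python. It is not the repo's actual code: every field name, the `project` helper, and the `ABSTRACT_HINTS` table are hypothetical, and I'm assuming the verifier is the layer that holds the rubric. It also leaves out LangGraph entirely and only shows the whitelist step that would sit in front of each node.

```python
from typing import TypedDict, get_type_hints


class PipelineState(TypedDict):
    """Full state held by the orchestrator (hypothetical field names)."""
    task_description: str
    acceptance_criteria: list[str]
    unit_tests: str
    verification_rubric: str
    author_id: str
    git_diff: str


class ExecutionInput(TypedDict):
    """Everything the execution agent is allowed to see."""
    task_description: str


class VerificationInput(TypedDict):
    """Blind verification view: no author identity, no original prompt."""
    git_diff: str
    verification_rubric: str


def project(state, contract):
    """Copy only the keys declared on `contract`; all other keys are dropped."""
    allowed = get_type_hints(contract).keys()
    return {key: state[key] for key in allowed}


# Fixed, rubric-free hints keyed by failure category (hypothetical categories).
ABSTRACT_HINTS = {
    "correctness": "Re-check the behavior against the task description.",
    "style": "Simplify the change and follow project conventions.",
}


def sanitize_feedback(failed_categories):
    """Map failure categories to canned hints so rubric text never leaks."""
    return [ABSTRACT_HINTS[c] for c in failed_categories if c in ABSTRACT_HINTS]


state = PipelineState(
    task_description="Add a /health endpoint",
    acceptance_criteria=["returns 200"],
    unit_tests="def test_health(): ...",
    verification_rubric="endpoint exists and returns 200",
    author_id="agent-7",
    git_diff="+ def health(): ...",
)
exec_view = project(state, ExecutionInput)
verify_view = project(state, VerificationInput)
```

The point of the whitelist is that a new field added to the full state stays invisible to the execution agent until someone adds it to `ExecutionInput` on purpose. A blacklist would fail the other way around.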
Results: 26 s per feature at $0.03 cost (a local 3B model handles execution, an API model handles reasoning), with reproducible benchmarks. Released under the MIT license.
Repo: https://github.com/illyar80/developer-farm
I'm particularly interested in feedback on:
- Formal verification approaches to guarantee isolation properties
- Multi-model fallback strategies for the execution layer
- Benchmarking frameworks for "Goodhart-resistance" in autonomous agents
Would appreciate critiques and suggestions from folks working on AI alignment, evaluation, or agentic systems.