Guardrails take an 8B model from 53% to 99% on agentic tasks - model size might not be the bottleneck
New results from Forge show that adding guardrails to an 8B-parameter model pushed its agentic task performance from 53% to 99%. That is a 46-percentage-point jump without changing the underlying model at all.
What makes this surprising is what it says about where the performance ceiling actually sits. The assumption in the local inference community has been that bigger models are needed for reliable agentic behavior - tool calling, multi-step reasoning, structured outputs. But if an 8B model can hit 99% with the right scaffolding, the bottleneck may not be model intelligence at all. It may be that small models know what to do but lack the discipline to do it consistently without external structure.
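To make "external structure" concrete, here is a minimal sketch of one common kind of guardrail: validate the model's tool call against a schema and re-prompt with the error when it fails. This is not Forge's implementation, which the results above do not describe. The `TOOLS` registry, the `{"tool": ..., "args": ...}` JSON shape, and the `validate_tool_call` and `guarded_call` names are all hypothetical, and `model` stands in for whatever function sends a prompt to your local model and returns its text.

```python
import json
from typing import Callable, Optional, Tuple

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "get_weather": {"city": str},
    "add": {"a": (int, float), "b": (int, float)},
}


def validate_tool_call(raw: str) -> Tuple[Optional[dict], Optional[str]]:
    """Return (call, None) if raw is a valid tool call, else (None, error)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Output was not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "Output must be a JSON object."
    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOLS:
        return None, f"Unknown tool {name!r}. Choose one of: {sorted(TOOLS)}"
    args = call.get("args")
    if not isinstance(args, dict):
        return None, "Field 'args' must be a JSON object."
    for key, expected in TOOLS[name].items():
        if key not in args:
            return None, f"Missing argument {key!r} for tool {name!r}."
        if not isinstance(args[key], expected):
            return None, f"Argument {key!r} has the wrong type."
    return call, None


def guarded_call(model: Callable[[str], str], prompt: str,
                 max_retries: int = 3) -> dict:
    """Call the model, validate its output, and re-prompt with the error."""
    current = prompt
    for _ in range(max_retries):
        raw = model(current)
        call, error = validate_tool_call(raw)
        if call is not None:
            return call
        # Feed the specific failure back so the model can correct itself.
        current = (f"{prompt}\n\nYour previous reply was rejected: {error}\n"
                   "Reply with JSON only.")
    raise ValueError(f"No valid tool call after {max_retries} attempts")


# Usage with a stand-in model that fails once, then answers correctly.
replies = iter([
    "Sure! I will call the weather tool.",
    '{"tool": "get_weather", "args": {"city": "Oslo"}}',
])
print(guarded_call(lambda _prompt: next(replies), "What is the weather in Oslo?"))
```

The point of the sketch is that the model is unchanged: the first reply is chatty prose, the wrapper rejects it and says why, and the second attempt passes. Whether this kind of loop accounts for a gain of the size reported above is an open question.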
The implication is straightforward: if you are choosing between spending compute on a larger model and investing in better guardrails and tooling around a smaller one, the guardrails route might deliver more reliable gains. That would change the economics of local agentic workflows considerably.
For people running agents locally: has your experience been that better prompting and guardrails matter more than model size for reliable tool use?