I audited a fine-tuned LLM that lost 50 percentage points on BFCL after training. Here’s what actually caused it.
A client came to me with a Qwen-2.5-7B LoRA fine-tune that was supposed to improve function-calling performance. Instead, it regressed by 50 percentage points on the Berkeley Function-Calling Leaderboard (BFCL) compared to the base model.
They wanted to know if they should just retrain. The real answer was more uncomfortable than that.
What I found after the audit:
The regression wasn’t one thing; it was layered. The training data had contamination issues that pushed the model away from the structured output format BFCL tests for. The LoRA rank and alpha config was reasonable on paper but wrong for this use case. On top of that, the inference stack (SGLang with FP8 quantization) was introducing its own silent degradation, and the evaluation wasn’t separating that from the model quality issues.
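To make the data layer concrete: one way to check for this kind of format drift is to measure what fraction of training targets actually parse as the tool-call structure you expect. Here is a minimal Python sketch. It assumes each training target is a JSON string with a `name` and an `arguments` object; that schema and the `audit_targets` helper are illustrative assumptions, not the client's actual data format.

```python
import json


def is_valid_tool_call(target: str) -> bool:
    """True if target parses as a JSON object with a string 'name' and a dict 'arguments'.

    The schema is an assumed one, chosen only for illustration.
    """
    try:
        obj = json.loads(target)
    except (json.JSONDecodeError, TypeError):
        return False
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("name"), str)
        and isinstance(obj.get("arguments"), dict)
    )


def audit_targets(targets: list[str]) -> dict:
    """Count malformed targets and keep their indices for manual review."""
    bad = [i for i, t in enumerate(targets) if not is_valid_tool_call(t)]
    total = len(targets)
    return {
        "total": total,
        "malformed": len(bad),
        "malformed_rate": len(bad) / total if total else 0.0,
        "malformed_indices": bad,
    }
```

A high malformed rate doesn't prove the data caused the regression, but it tells you which examples to read by hand before deciding anything about retraining.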
So when they asked “is the model bad?”, the honest answer was: we don’t fully know, because the serving layer is also broken, and your eval setup can’t isolate them.
The recommendation nobody wanted:
Don’t ship it. Not yet. Fix the inference stack first so you have a clean measurement baseline, then re-evaluate whether the LoRA actually needs to be retrained or just reconfigured.
The most expensive thing in ML production isn’t retraining, it’s shipping something you can’t diagnose when it breaks.
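One way to get that clean baseline is to run the same checkpoint on the same prompts through a trusted reference path and through the FP8 serving path, then count per-example flips. This is a sketch that assumes you already have per-example pass/fail results keyed by test-case ID; the function and field names are hypothetical.

```python
def compare_serving_paths(reference: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-example pass/fail between two serving paths of the same model.

    reference: results from the trusted path (e.g. full precision)
    candidate: results from the path under test (e.g. FP8)
    """
    # Only compare test cases that both runs actually produced a result for.
    shared = sorted(reference.keys() & candidate.keys())
    if not shared:
        raise ValueError("no overlapping test-case IDs to compare")
    regressions = [k for k in shared if reference[k] and not candidate[k]]
    fixes = [k for k in shared if not reference[k] and candidate[k]]
    n = len(shared)
    return {
        "n": n,
        "reference_acc": sum(reference[k] for k in shared) / n,
        "candidate_acc": sum(candidate[k] for k in shared) / n,
        "regressions": regressions,  # passed on reference, failed on candidate
        "fixes": fixes,              # failed on reference, passed on candidate
    }
```

Two paths can land on the same aggregate accuracy while flipping individual cases in both directions, which is why this looks at test-case IDs and not only the headline number.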
A few things this reinforced for me:
• BFCL regression in fine-tunes is almost never just a data problem. It’s usually a data + config + serving interaction.
• FP8 quantization with SGLang needs explicit validation against your eval suite before you trust it in production.
• “The fine-tune made it worse” and “the serving stack made it worse” are different problems that look identical if your eval pipeline doesn’t separate them.
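The separation in that last point can be sketched as a 2x2 grid: base vs. fine-tuned model, each evaluated on a reference serving path and on the production one. Below is a minimal Python sketch. The cell names and the additive decomposition are illustrative assumptions, and the accuracies are whatever you measure on your own eval set.

```python
from dataclasses import dataclass


@dataclass
class Grid:
    """Accuracy (0-1) for each model/serving combination on the same eval set."""
    base_ref: float    # base model, reference serving (e.g. full precision)
    base_prod: float   # base model, production serving (e.g. FP8)
    tuned_ref: float   # fine-tuned model, reference serving
    tuned_prod: float  # fine-tuned model, production serving


def decompose(g: Grid) -> dict:
    """Split the end-to-end change into fine-tune, serving, and interaction terms."""
    finetune = g.tuned_ref - g.base_ref       # model changed, serving held fixed
    serving = g.base_prod - g.base_ref        # serving changed, model held fixed
    total = g.tuned_prod - g.base_ref         # both changed at once
    interaction = total - finetune - serving  # change that only appears when combined
    return {
        "finetune": finetune,
        "serving": serving,
        "interaction": interaction,
        "total": total,
    }
```

If you only ever compare the fine-tuned model on production serving against the base model, all three terms collapse into the single `total` number, and you can't tell which layer to fix.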
Happy to go deeper on any part of this: BFCL decomposition, LoRA config decisions, or the inference stack audit.
(Full writeup on my Medium profile if you want the methodology.)