u/KingHenryMorgan

A client came to me with a Qwen-2.5-7B LoRA fine-tune that was supposed to improve function-calling performance. Instead, it regressed by 50 percentage points on BFCL benchmarks compared to the base model.

They wanted to know if they should just retrain. The real answer was more uncomfortable than that.

What I found after the audit:

The regression wasn’t one thing; it was layered. The training data had contamination issues that pushed the model away from the structured output format BFCL tests for. The LoRA rank and alpha config was reasonable on paper but wrong for this use case. On top of that, the inference stack (SGLang with FP8 quantization) was introducing its own silent degradation, and the evaluation wasn’t separating that from the model-quality issues.

So when they asked “is the model bad?”, the honest answer was: we don’t fully know, because the serving layer is also broken, and your eval setup can’t isolate them.

The recommendation nobody wanted:

Don’t ship it. Not yet. Fix the inference stack first so you have a clean measurement baseline, then re-evaluate whether the LoRA actually needs to be retrained or just reconfigured.

The most expensive thing in ML production isn’t retraining, it’s shipping something you can’t diagnose when it breaks.

A few things this reinforced for me:
• BFCL regression in fine-tunes is almost never just a data problem. It’s usually a data + config + serving interaction.
• FP8 quantization with SGLang needs explicit validation against your eval suite before you trust it in production.
• “The fine-tune made it worse” and “the serving stack made it worse” are different problems that look identical if your eval pipeline doesn’t separate them.
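
One way to keep those two failure modes apart is a 2x2 ablation: score the base model and the fine-tune under both a reference-precision serving config and the FP8 one, then split the total drop into its parts. This is a minimal sketch, not the audit's actual tooling; the `Scores` and `decompose` names are hypothetical, and it assumes you already have four accuracy numbers from the same eval suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Benchmark accuracy (0-100) for each model/serving combination."""
    base_ref: float   # base model, reference precision (e.g. BF16)
    base_fp8: float   # base model, FP8 serving
    tuned_ref: float  # fine-tune, reference precision
    tuned_fp8: float  # fine-tune, FP8 serving


def decompose(s: Scores) -> dict[str, float]:
    """Split the total change into fine-tune, serving, and interaction terms.

    All terms are measured against the base model at reference precision.
    A negative value means that factor made the score worse.
    """
    finetune = s.tuned_ref - s.base_ref   # effect of the LoRA alone
    serving = s.base_fp8 - s.base_ref     # effect of FP8 serving alone
    total = s.tuned_fp8 - s.base_ref      # what production actually sees
    interaction = total - finetune - serving  # what neither explains alone
    return {
        "finetune": finetune,
        "serving": serving,
        "interaction": interaction,
        "total": total,
    }
```

With made-up scores such as `Scores(70.0, 66.0, 45.0, 35.0)`, the total drop of 35 points splits into 25 from the fine-tune, 4 from serving, and 6 from the two interacting. If you only ever measure the `tuned_fp8` cell, all of that looks like "the fine-tune made it worse".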

Happy to go deeper on any part of this: BFCL decomposition, LoRA config decisions, or the inference stack audit.

(Full writeup on my Medium profile if you want the methodology.)

I recently built a RAG system for a client that needed to make two internal knowledge sources, Document360 and SharePoint, queryable through a single interface.

Here’s the architecture I landed on and the decisions I’d revisit.

The stack:
• Document360 + SharePoint as source systems (very different ingestion patterns)
• Aurora PostgreSQL with pgvector for embedding storage
• Standard chunking + embedding pipeline (sentence-transformers)
• FastAPI serving layer

The decisions that mattered most:

Ingestion unification is harder than it looks. Document360 has a clean API. SharePoint is SharePoint. Getting consistent metadata, versioning, and chunk-level provenance across both required a normalization layer I hadn’t initially scoped. Build this first, not as an afterthought.
Aurora + pgvector is underrated for enterprise RAG. Everyone reaches for Pinecone or Weaviate, but if your client is already AWS-native and data residency matters (it always does in enterprise), pgvector inside Aurora gives you retrieval quality that’s good enough and operational complexity that’s much lower. The hybrid search (vector + BM25 fallback) is what made the difference in retrieval precision.
Chunk strategy is a retrieval strategy. I used hierarchical chunking: smaller chunks for retrieval, larger parent chunks for context injection. The naive fixed-size approach gave poor results on structured docs like policy documents and SOPs. This was the single biggest quality improvement.
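
To make the hierarchical chunking idea concrete, here is a rough sketch of the small-chunk / parent-chunk split. It is an illustration under simplifying assumptions, not the pipeline I shipped: it splits on word counts (a real version for policy docs and SOPs would more likely split on headings or sections), and `Chunk`, `hierarchical_chunks`, and `context_for` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    parent_id: str  # provenance link back to the larger parent chunk
    text: str


def split_words(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def hierarchical_chunks(doc_id: str, text: str,
                        parent_size: int = 200,
                        child_size: int = 50):
    """Return (parents, children): children get embedded, parents get injected."""
    parents: dict[str, str] = {}
    children: list[Chunk] = []
    for p_idx, parent_text in enumerate(split_words(text, parent_size)):
        parent_id = f"{doc_id}:p{p_idx}"
        parents[parent_id] = parent_text
        for c_idx, child_text in enumerate(split_words(parent_text, child_size)):
            children.append(Chunk(f"{parent_id}:c{c_idx}", parent_id, child_text))
    return parents, children


def context_for(hits: list[Chunk], parents: dict[str, str]) -> list[str]:
    """Map retrieved child chunks to parent texts, de-duplicated, order kept."""
    seen: list[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.append(hit.parent_id)
    return [parents[parent_id] for parent_id in seen]
```

Only the child chunks are embedded and searched; `context_for` then swaps each hit for its parent, so two hits from the same section inject that section once instead of twice.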

What I’d do differently:

Add a re-ranking step (cross-encoder) before context injection. I skipped it for scope reasons and it cost retrieval quality on ambiguous queries. It’s not optional for production enterprise RAG.
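
For reference, the re-ranking step itself is small. A minimal sketch, assuming the first-stage retriever has already returned candidate passages: `rerank` and the toy `overlap_scorer` are hypothetical names, and the scorer is a stand-in so the example runs without a model. In a real pipeline `score_pairs` would be a cross-encoder, for example the `predict` method of a sentence-transformers `CrossEncoder`, which scores (query, passage) pairs.

```python
from typing import Callable, Sequence

Pair = tuple[str, str]


def rerank(query: str,
           candidates: Sequence[str],
           score_pairs: Callable[[list[Pair]], Sequence[float]],
           top_k: int = 5) -> list[str]:
    """Re-order first-stage candidates by a pairwise relevance score."""
    if not candidates:
        return []
    pairs = [(query, passage) for passage in candidates]
    scores = score_pairs(pairs)
    # Sort by score only; ties keep their first-stage order (sort is stable).
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return [passage for _, passage in ranked[:top_k]]


def overlap_scorer(pairs: list[Pair]) -> list[float]:
    """Toy stand-in for a cross-encoder: counts shared lowercase words."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]
```

Keeping the scorer as a plain callable also makes it easy to A/B the pipeline with and without re-ranking in the eval harness.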

Also: build an eval harness before you build the pipeline. I built mine halfway through and had to retrofit it. Painful.
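
The harness does not need to be elaborate to be useful on day one. A bare-bones sketch, assuming a hand-labelled "golden" set that maps each query to the IDs of the chunks that should come back; the function names and the shape of the golden set are my own illustration, not a specific library's API.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(retrieve: Callable[[str], list[str]],
             golden: dict[str, set[str]],
             k: int = 5) -> dict[str, float]:
    """Run every golden query through `retrieve` and average the metrics."""
    if not golden:
        raise ValueError("golden set is empty")
    recalls, rrs = [], []
    for query, relevant in golden.items():
        retrieved = retrieve(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        rrs.append(reciprocal_rank(retrieved, relevant))
    n = len(golden)
    return {"recall@k": sum(recalls) / n, "mrr": sum(rrs) / n}
```

Because `retrieve` is just a function from query to ranked IDs, the same harness can score fixed-size vs hierarchical chunking, or with and without a re-ranker, before any of it is wired into the API.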

Happy to go deeper on the pgvector setup, chunking strategy, or the SharePoint ingestion headaches if anyone’s dealing with similar problems.

(Full architecture writeup on my Medium profile.)

I audited a fine-tuned LLM that lost 50 percentage points on BFCL after training. Here’s what actually caused it.

Architecture decisions I made building a production RAG backend for enterprise knowledge bases — what I’d do differently