u/Gaga_Cee_3903

Sharing a small project that ended up being more about benchmarking discipline than about fine-tuning per se.

Setup

Base: Mistral-7B-Instruct-v0.3
Method: QLoRA, r=16, α=32, all projection matrices, 2 epochs, LR=2e-4
Data: 1,690 Indian legal instruction pairs (IndianKanoon scrape + GPT-4o synthesis with citation-forcing prompt). 1,521 train / 169 eval.
Tracking: MLflow (training, evaluation, A/B). Streamlit dashboard for side-by-side base vs fine-tuned.

The methodological point

My RTX 5090 on CUDA 12.8 cannot run a transformers forward pass — every cuBLAS GEMM kernel hits CUBLAS_STATUS_INVALID_VALUE on sm_120, across fp16/bf16/fp32/bnb-4bit. This forced the entire inference pipeline through llama.cpp's GGUF stack, which has its own CUDA kernels.

The accidental benefit: every comparison in the project runs at the same quantization level, on the same inference engine, with the same seed. Most "fine-tuned LLM" reports compare an FP16 transformers base against an INT4 llama.cpp fine-tune and attribute the entire delta to fine-tuning — that conflates the training intervention with the inference-stack intervention. I rebuilt the vanilla base as a Q4_K_M GGUF specifically so fine-tuning is the only varying factor.

Quantization results (same 169-item eval)

Variant       Size       Latency      ROUGE-L
─────────────────────────────────────────────
FP16 GGUF     14.50 GB   10.8 ms/tok  0.367
INT8 (Q8_0)    7.70 GB    6.8 ms/tok  0.371
INT4 (Q4_K_M)  4.37 GB    4.9 ms/tok  0.371

Q4_K_M shows no measurable quality loss vs FP16 within ROUGE noise. INT8 is strictly dominated by INT4 when k-quants are available.

Fine-tune evaluation (full eval set): ROUGE-1 = 0.57, ROUGE-L = 0.40, RAGAS Faithfulness = 0.66 (judge: GPT-4o-mini, embeddings: text-embedding-3-small).

Things that didn't go to plan and what I'd change

r=8 underfit — the model defaulted to generic hedges instead of citing IPC/CrPC sections. Bumping to r=16 fixed it.
3-epoch checkpoint had lower training loss but worse RAGAS faithfulness than 2-epoch — citation memorization applied to wrong contexts.
I built the eval set after the first training run. Later runs use a slightly drifted eval set. The eval set is the spec; write it first.
Dataset size is the bottleneck. 1,690 pairs is below the regime where a 7B model cleanly absorbs a niche domain. Next iteration targets 5,000+ with an explicit refusal subset.

Repo, training notebooks, MLflow runs, Streamlit dashboard, and the LoRA adapter on the Hub are all open. Happy to take questions, especially on the Blackwell/cuBLAS debugging or the eval methodology.

Repo: github.com/gauravgarwal9011/NyayGPT

Adapter: huggingface.co/gauravgarwal/NyayaGPT-Mistral7B-adapter

NyayaGPT: 7-day QLoRA fine-tune of Mistral-7B on Indian legal Q&amp;A, with an apples-to-apples quantization benchmark forced by a broken cuBLAS on RTX 5090

NyayaGPT: 7-day QLoRA fine-tune of Mistral-7B on Indian legal Q&A, with an apples-to-apples quantization benchmark forced by a broken cuBLAS on RTX 5090