I fine-tuned Gemma 3 27B on code and got 98.78% HumanEval / 73% MBPP. Here’s the honest breakdown including all the eval bugs I hit.

Model: https://huggingface.co/KK9922/Forge-Gemma-3-27B-GGUF

Code + eval harness: https://github.com/thesis09/Finetuned-Google-Gemma3-27B-It-for-code-generator-or-vibe-coder

Demo video: https://youtu.be/3acwPjRmo74

Quant: Q4_K_M GGUF (~17GB)
Runs on: RTX 3060 12GB (25 GPU layers), RTX 3090/4090 (full offload)

What this is

QLoRA fine-tune of google/gemma-3-27b-it for code generation. Python, JS, Java, C++, C. Trained on ~33K samples (self-oss-instruct + CodeAlpaca, filtered and deduplicated) on an H100 80GB. Full pipeline: dataset curation → training → LoRA merge → GGUF export → FastAPI inference server → eval harness.

I’m posting this because the eval story is more interesting than the benchmark numbers, and r/machinelearningnews deserves the real version rather than the “I got 99%!” hype.

The numbers

Benchmark	Score	Notes
HumanEval pass@1	98.78% (162/164)	Full 164-problem set
MBPP pass@1	73%	100-problem sanitized split
DebugBench	74%	Token-overlap metric, NOT execution-based — see below

Base model (gemma-3-27b-it) for comparison: ~84% HumanEval, ~72% MBPP

So the fine-tune is +14.8pp on HumanEval, roughly flat on MBPP.

Why there’s a 26-point gap between HumanEval and MBPP

This is the part I want to be upfront about.

98.78% HumanEval looks incredible. But CodeAlpaca and self-oss-instruct both contain HumanEval-adjacent problems. Some of that gain is the model having seen similar problems during training, not purely better code reasoning. MBPP tests a different problem style — mathematical formula implementations, number theory, string manipulation edge cases. The model was never specifically trained on those.

MBPP 73% ≈ base model 72% is the honest generalization signal. The fine-tune improved structured code output and formatting without breaking general Python reasoning. No catastrophic forgetting. But it also didn’t improve on tasks outside the training distribution.

If you’re looking for a model that specifically crushes MBPP-style algorithmic problems, this isn’t it. If you want structured, formatted, immediately-runnable code output with a consistent style, this is pretty good.

The eval bugs — this is the interesting part

HumanEval was 0% until I fixed my eval script

First run: 0% pass@1 on 50 problems. I panicked. The model was fine.

The issue: my eval code prepended the function stub to the model’s response every time. At temperature 0.1, the model returns the complete function including the def line. So I was creating:

def add(a, b): # from fn_prompt
    """Add two..."""
def add(a, b): # from model response — DUPLICATE
    """Add two..."""
    return a + b

Python silently used the second definition (which is just the body with no context). Every test failed. Fixed with a 3-case assembly function that detects whether the model returned a full function, body only, or nothing, and handles each correctly.
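For anyone writing their own harness, here is a minimal sketch of what a three-case assembly step could look like. The name assemble_solution and its parameters are hypothetical and not copied from the repo. It assumes HumanEval-style inputs (prompt is the function stub, entry_point is the function name) and that any markdown fences have already been stripped from the completion.

```python
import re
import textwrap


def assemble_solution(prompt: str, completion: str, entry_point: str) -> str:
    """Join a HumanEval-style stub and a model completion into one program.

    Hypothetical helper for illustration; not the repo's actual function.
    """
    def_line = re.compile(rf"^def\s+{re.escape(entry_point)}\s*\(", re.MULTILINE)

    # Case 3: empty response. Close the stub with `pass` so the tests fail
    # on a wrong result instead of a SyntaxError.
    if not completion.strip():
        return prompt.rstrip("\n") + "\n    pass\n"

    # Case 1: the model repeated the def line. Keep only the part of the
    # prompt that precedes the stub (imports, helpers) and use the model's
    # function as-is, so the function is defined exactly once.
    if def_line.search(completion):
        stub = def_line.search(prompt)
        preamble = prompt[: stub.start()] if stub else ""
        return preamble + completion

    # Case 2: body only. Indent it if the model dropped the leading spaces,
    # then append it to the stub.
    first_line = next(line for line in completion.splitlines() if line.strip())
    if not first_line.startswith((" ", "\t")):
        completion = textwrap.indent(completion, "    ")
    return prompt.rstrip("\n") + "\n" + completion
```

The key design choice is that the stub is never blindly prepended: the def line ends up in the assembled program exactly once, whichever form the model returns.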

After fix: 98.78% on full 164 problems.

MBPP was 9% until I figured out what it was actually testing

9% felt catastrophic. Ran it again. Still 9%.

Turned out: MBPP test assertions hardcode the expected function name. Like assert min_cost([[1,2],[3,4]], 1, 1) == 4. My eval prompt just said “write a function” — the model wrote correct logic under a name like minimum_cost_path and got NameError on every test.

Fix: regex the first assert statement to extract the expected function name, inject it into the prompt. Also had to exclude Python builtins from the regex because two problems had tests like assert set(my_func(...)) == {1,2} — outer set() is a comparison wrapper, not the function name.

Also added “NO extra parameters” to the prompt because the model kept adding optional params like length to sorting functions. Correct logic, wrong signature, TypeError.

After all fixes: 73%.
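A rough sketch of the name-extraction idea, in case it saves someone the same two days. The helper names (expected_function_name, build_mbpp_prompt) and the exact prompt wording are my illustration here, not the harness code verbatim. It assumes the first assert of each MBPP problem is available as a string.

```python
import builtins
import keyword
import re
from typing import Optional

# A name directly followed by "(" that is not an attribute access
# (so `math.isclose(` does not yield `isclose`).
CALL_NAME = re.compile(r"(?<![\w.])([A-Za-z_]\w*)\s*\(")
BUILTIN_NAMES = set(dir(builtins))


def expected_function_name(assertion: str) -> Optional[str]:
    """Return the first called name that is neither a builtin nor a keyword."""
    for name in CALL_NAME.findall(assertion):
        if name not in BUILTIN_NAMES and not keyword.iskeyword(name):
            return name
    return None


def build_mbpp_prompt(task: str, first_assert: str) -> str:
    """Append the required name and signature hint to the task text."""
    name = expected_function_name(first_assert)
    if name is None:
        return task  # nothing to inject; fall back to the plain task
    return (
        f"{task}\n"
        f"The function MUST be named `{name}`.\n"
        "NO extra parameters beyond those used in this test:\n"
        f"{first_assert}"
    )
```

Skipping builtins is what handles wrappers like set(...) or sorted(...) around the real call; the first non-builtin name inside is taken as the target.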

DebugBench trained on 0 samples

My data pipeline loaded buggy→fixed pairs from Rtian/DebugBench by looking for row.get("fixed_code", ""). The actual field is "solution". Every row was skipped. The function returned 0 samples and I missed it in the output.

The model achieves 74% on DebugBench entirely from the base model’s pre-existing capability, not from any training. Worth noting when interpreting that number.
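The cheap guard against this class of bug is to make the loader fail loudly when most rows get skipped. A minimal sketch, assuming rows are plain dicts; load_debug_pairs, the min_keep_ratio threshold, and the "buggy_code" field name are assumptions for illustration (only "solution" is confirmed above).

```python
from typing import Iterable, List, Mapping, Tuple


def load_debug_pairs(
    rows: Iterable[Mapping[str, str]],
    buggy_field: str = "buggy_code",  # assumed field name, check your dataset
    fixed_field: str = "solution",
    min_keep_ratio: float = 0.5,  # hypothetical threshold
) -> List[Tuple[str, str]]:
    """Collect (buggy, fixed) pairs and raise if too many rows are dropped."""
    rows = list(rows)
    pairs = [
        (row[buggy_field], row[fixed_field])
        for row in rows
        if row.get(buggy_field) and row.get(fixed_field)
    ]
    if not rows or len(pairs) < min_keep_ratio * len(rows):
        found = sorted(rows[0].keys()) if rows else []
        raise ValueError(
            f"kept {len(pairs)}/{len(rows)} rows using fields "
            f"{buggy_field!r}/{fixed_field!r}; first row has fields {found}"
        )
    return pairs
```

With the wrong field name this raises on the first run and prints the fields that actually exist, instead of quietly returning an empty list.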

The tokenizer bug you’ll hit if you try to export Gemma 3 yourself

This one’s a gift if you’re trying to GGUF any Gemma 3 model.

Older llama.cpp (pre-b3447) doesn’t recognize Gemma 3’s SentencePiece tokenizer hash. A common workaround patches convert_hf_to_gguf.py to return "llama-bpe" for unrecognized tokenizers.

Do not do this. The export will succeed, the model will generate text, and the text will look mostly fine. Then you’ll notice variable names are missing:

def dijkstra(graph, start):
    = {start: 0} # "distances" vanished
    = [] # "priority_queue" vanished
    heapq.heappush(, (0, start))

Words that exist in Gemma’s SentencePiece vocab but not in llama-bpe decode to empty strings. Silently. No error.

Fix: use llama.cpp b3447 or later (natively supports Gemma 3’s tokenizer hash) AND restore the original tokenizer files from google/gemma-3-27b-it before exporting. I also use chat_format=None in llama-cpp-python and build the raw Gemma 3 prompt string manually, which bypasses whatever residual weirdness is in the built-in Gemma formatter.
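Building the raw prompt by hand is only a few lines. A sketch under a couple of assumptions: build_gemma_prompt is a hypothetical helper name, messages use OpenAI-style role/content dicts, and the system prompt is folded into the first user turn (Gemma's turn format has only user and model roles). No BOS token is added here on the assumption that the runtime inserts it.

```python
from typing import Dict, List


def build_gemma_prompt(messages: List[Dict[str, str]], system_prompt: str = "") -> str:
    """Render chat messages into Gemma's <start_of_turn> format."""
    parts = []
    for index, message in enumerate(messages):
        # Gemma calls the assistant role "model"; everything else is "user".
        role = "model" if message["role"] == "assistant" else "user"
        content = message["content"]
        # Assumption: the system prompt is prepended to the first user turn.
        if index == 0 and system_prompt and role == "user":
            content = f"{system_prompt}\n\n{content}"
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    # Leave the model turn open so generation continues from here.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)
```

The resulting string is what gets passed as a plain completion prompt, with "<end_of_turn>" as a stop sequence.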

Running it locally

RTX 3060 12GB:

./llama-cli \
  -m gemma3-forge-Q4_K_M.gguf \
  --n-gpu-layers 25 \
  -c 4096 \
  --temp 0.1 \
  --top-k 40 \
  --top-p 0.95 \
  --repeat-penalty 1.1 \
  -p "<start_of_turn>user\nWrite a binary search in Python<end_of_turn>\n<start_of_turn>model\n"

25 GPU layers use ~10-11GB VRAM. If you have more, increase it. If you get OOM, drop to 20.

With the FastAPI server:

python main.py --model gemma3-forge-Q4_K_M.gguf --gpu-layers 25
# exposes OpenAI-compatible API at localhost:8080

Works with Open WebUI, continue.dev, or any OpenAI-compatible client. System prompt is baked in by default but overridable.
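If you just want to poke it from a script, something like the sketch below should work with only the standard library. Treat the details as assumptions: the /v1/chat/completions route is the usual OpenAI-compatible path but check the repo for the exact one, the "model" value is a placeholder, and build_chat_request / ask are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(
    user_message: str,
    base_url: str = "http://localhost:8080",
    temperature: float = 0.1,
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": "gemma3-forge-Q4_K_M",  # placeholder model name
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_message: str) -> str:
    """Send the request to a running server and return the reply text."""
    request = build_chat_request(user_message)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling ask("Write a binary search in Python") requires the server from the command above to be running.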

Sampling that works well for code:

• temp=0.1 (any higher and identifier names get weird)

• min_p=0.05 (this is the one that kills the def func(arr,): bug class)

• repeat_penalty=1.1 (gentle, doesn’t distort code)
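For anyone unfamiliar with min_p: it drops every token whose probability is below min_p times the probability of the most likely token, then renormalizes what is left. A toy sketch of that rule; the function name and the example distribution are made up for illustration, and real samplers apply this to the full vocabulary.

```python
from typing import Dict


def min_p_filter(probs: Dict[str, float], min_p: float = 0.05) -> Dict[str, float]:
    """Keep tokens with p >= min_p * max(p), then renormalize to sum to 1."""
    threshold = min_p * max(probs.values())
    kept = {token: p for token, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Made-up next-token distribution: one confident choice and two stragglers.
example = {"a": 0.90, "b": 0.06, "c": 0.04}
filtered = min_p_filter(example, min_p=0.05)
# Threshold is 0.05 * 0.90 = 0.045, so "c" (0.04) is dropped and "b" survives.
```

Because the threshold scales with the top token's probability, the filter is strict when the model is confident and permissive when it is not, which is why it removes low-probability junk tokens without flattening legitimate choices.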

Recommended system prompt

You are Forge, an elite precision coding assistant.
Response structure: one-sentence summary, then complete code in a fenced block,
then 3-5 bullet explanation, then 2+ edge cases.
Never write TODO, placeholder code, or incomplete functions.
When debugging: root cause in one sentence, fixed code with # FIXED: comments.
Always state time and space complexity.

What I’d change if I ran training again

• 3-5 epochs instead of < 1. Loss hit 0.22 at step 50 and barely moved for 950 more steps. The model converged early. More epochs would squeeze more out of the data.

• Fix the DebugBench field name before training. 4,253 debugging examples that were never used.

• Add MBPP-style training data. The gap between HumanEval and MBPP scores is a direct result of the training data not covering mathematical formula implementations.

• HumanEval+ evaluation. I couldn’t get evalplus installed in the local environment during the eval run. HumanEval+ (80x more test cases per problem) would give a more honest picture of whether the model is actually solving problems or pattern-matching.

File sizes and hardware requirements

Format	Size	Min VRAM
bfloat16 (training/eval)	109 GB	80GB (H100)
Q4_K_M GGUF (this release)	~17 GB	~12GB (partial offload)
Q4_K_M full GPU offload	~17 GB	~18GB (3090/4090)

For CPU-only: needs ~32GB RAM, will be slow.

Happy to answer questions about the training setup, the eval harness, the tokenizer bug, or anything else. The GitHub has the full pipeline code if you want to reproduce or extend this.

408 people downloaded it in the first 24 hours which I did not expect at all. Thanks to whoever those 408 people are.

u/Thesis992