r/machinelearningnews

▲ 0 r/machinelearningnews+1 crossposts

Frontier model collapse is near

Hi all, this is to let you know that many frontier models, such as GPT, Sonnet, Opus, and even Gemma, appear to be at the stage of collapsing. They have frequently started drifting away from the work they are given, either stretching it out far longer than a human would take or taking shortcuts. The frequent new incident tickets are a signal too. Better to protect your work by saving it somewhere safe.

reddit.com
u/DingoShort3945 — 12 hours ago

Google Translate and DeepL still give completely different outputs for the same sentence in 2026. Why hasn't this been solved yet?

Tried something out of curiosity last week. Took a few sentences with slightly tricky phrasing and ran them through several MT engines. Same input, same language pair, completely different outputs. Not just stylistic differences, actual meaning divergence in some cases.

I get that training data and architecture choices differ but we're years into transformer-based MT now and the gap between leading engines on the same input still surprises me sometimes.

Has anyone else noticed this? Is this a problem with how these models work or just a matter of more training data eventually closing the gap? And does it actually matter for most use cases or is it only a problem at the edges?

reddit.com
u/EchoElectronic5581 — 22 hours ago
▲ 48 r/machinelearningnews+5 crossposts

[D] PINN loss functions: why physics-informed networks often fail to train

Physics-Informed Neural Networks are interesting because they break the standard ML paradigm: instead of approximating an unknown function from data alone, they exploit a known PDE constraint that the solution must satisfy. In principle this should make them converge faster and generalize better.

In practice the loss function makes them notoriously hard to train. The loss is a weighted sum of multiple terms (PDE residual, boundary conditions, initial conditions, data), each with different scales and gradient magnitudes. Several papers have characterized what goes wrong:

Wang, Teng & Perdikaris (2021) showed empirically and theoretically that during training, the gradients from different loss components become severely imbalanced. The optimizer follows whichever loss has the loudest gradient, regardless of which one matters most.

Wang, Yu & Perdikaris (2022) used Neural Tangent Kernel theory to show that the PDE residual term has much smaller eigenvalues than the boundary loss. The network learns boundaries quickly and interior physics slowly — often it never catches up.

Krishnapriyan et al. (NeurIPS 2021) demonstrated that even on simple PDEs like the convection equation, PINNs systematically fail to converge as the convection coefficient grows. This is on textbook problems with reasonable hyperparameters.

Mitigations exist (adaptive loss weighting, causal training, curriculum approaches, architectural fixes that hard-code boundary conditions) but none has fully solved the problem.
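
To make the "weighted sum" and "adaptive loss weighting" ideas concrete, here is a minimal sketch in plain Python. It assumes the gradients of each loss term are already available as flat lists of floats. The function names, the smoothing factor `alpha`, and the max/mean balancing rule are illustrative assumptions in the spirit of gradient-statistics weighting, not a faithful reproduction of any one paper's algorithm.

```python
def total_loss(terms, weights):
    """Weighted sum of named loss terms (e.g. residual, boundary, initial, data)."""
    return sum(weights[name] * value for name, value in terms.items())


def rebalanced_weight(prev_weight, grad_residual, grad_term, alpha=0.1):
    """Nudge one term's weight so its mean gradient magnitude tracks the
    largest PDE-residual gradient, smoothed by a moving average."""
    max_residual = max(abs(g) for g in grad_residual)
    mean_term = sum(abs(g) for g in grad_term) / len(grad_term)
    target = max_residual / mean_term  # assumes mean_term is non-zero
    return (1 - alpha) * prev_weight + alpha * target


# Toy numbers only: the boundary gradients are smaller than the residual ones.
weights = {"residual": 1.0, "boundary": 1.0}
weights["boundary"] = rebalanced_weight(
    weights["boundary"], grad_residual=[2.0, -8.0], grad_term=[0.5, -1.5], alpha=0.5
)
loss = total_loss({"residual": 0.25, "boundary": 2.0}, weights)
```

In a real training loop the weights would be refreshed every few optimizer steps from actual autodiff gradients; the point here is only that the weights become a moving quantity rather than a fixed hyperparameter.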

I wrote a longer version with full references and applications here: https://cristobalsantana.substack.com/p/the-pinn-loss-function-where-physics

Curious if anyone here has dealt with these training pathologies in production and what worked for you.

u/Illustrious-Crew5070 — 2 days ago

Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages at 2.8-Second Latency

Most translation models are audio pipelines with a TTS layer bolted on at the end. That's not simultaneous interpretation and Alibaba's Qwen team just built a clear technical case for the difference.

They released Qwen3.5-LiveTranslate-Flash: a real-time multimodal translation model that processes audio and video frames simultaneously, clones the original speaker's voice in the output, and covers 60 input languages at 2.8 seconds of latency.

No turn-detection. No generic synthesis voice replacing the speaker.

Here's what's actually interesting:

→ Vision-enhanced comprehension reads lip movements, gestures, and on-screen text alongside audio — robust in noisy or degraded audio environments

→ Semantic unit prediction via "reading units" processing commits to output segments mid-sentence, enabling continuous streaming without waiting for full utterances

→ Real-time voice cloning replicates the original speaker's voice profile from a single spoken sentence

→ Dynamic keyword configuration lets you inject domain-specific glossaries at runtime — brand names, medical terms, legal vocabulary

→ FLEURS and CoVoST2 benchmarks: outperforms major commercial alternatives across multilingual speech translation tasks

Full analysis: https://www.marktechpost.com/2026/05/20/alibaba-qwen-team-introduces-qwen3-5-livetranslate-flash-real-time-multimodal-interpretation-across-60-languages-at-2-8-second-latency/

Technical details: https://qwen.ai/blog?id=qwen3.5-livetranslate

https://preview.redd.it/rx8ahgg8592h1.png?width=1856&format=png&auto=webp&s=b80784f947e9827537d652972c2c6031a011ee39

reddit.com
u/ai-lover — 1 day ago
▲ 24 r/machinelearningnews+1 crossposts

NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6× Tokens Per Forward Over Qwen3-8B

Most LLM inference optimization forces a choice: fast drafting with a weak auxiliary model, or accurate generation with full standard autoregressive (AR) decoding. NVIDIA researchers just built a third option into the weights themselves.

They released Nemotron-Labs-Diffusion, a 3B/8B/14B model family trained on a joint AR-diffusion objective that supports three decoding modes from one checkpoint: standard AR, parallel diffusion decoding, and self-speculation, where the same model drafts and verifies without any auxiliary head.

Here's what's actually interesting:

→ Self-speculation achieves 5.99× tokens per forward over Qwen3-8B with comparable accuracy on a 10-task benchmark

→ Average acceptance length: 6.82 (with LoRA) vs. 2.75 for Eagle3 and 4.24 for Qwen3-9B-MTP — same draft length of 31

→ AR and diffusion objectives peak at the same loss coefficient (α=0.3) and improve together — they don't compete for model capacity

→ Speed-of-light analysis shows a theoretical ceiling of 7.60× TPF at block length 32; current confidence-based sampling realizes only ~3×, leaving headroom for better samplers
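
For readers new to the "acceptance length" metric, the sketch below shows a generic greedy draft-and-verify step: a draft of proposed tokens is compared against what the verifier would have produced, and only the longest matching prefix is kept. This is a simplified illustration, not NVIDIA's implementation. The function names and toy token IDs are made up, and real systems verify all draft positions in a single forward pass and may use probabilistic rather than exact-match acceptance.

```python
def accepted_length(draft_tokens, verifier_tokens):
    """Number of leading draft tokens that match the verifier's own choices."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, verifier_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted


def mean_acceptance_length(steps):
    """Average accepted prefix length over (draft, verifier) pairs."""
    lengths = [accepted_length(draft, verified) for draft, verified in steps]
    return sum(lengths) / len(lengths)


# Toy token IDs: the first draft is accepted for 3 tokens, the second for 1.
steps = [
    ([11, 12, 13, 99], [11, 12, 13, 14]),
    ([21, 98, 23, 24], [21, 22, 23, 24]),
]
average = mean_acceptance_length(steps)
```

A longer average accepted prefix means more tokens are committed per verification pass, which is why the acceptance-length comparison in the list above matters for throughput.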

Full analysis: https://www.marktechpost.com/2026/05/20/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b/

Paper: https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL

Model weights: https://huggingface.co/collections/nvidia/nemotron-labs-diffusion

Technical details: https://research.nvidia.com/publication/2026-05_nemotron-labs-diffusion-tri-mode-language-model-unifying-autoregressive

https://i.redd.it/veehv38rv92h1.gif

reddit.com
u/ai-lover — 1 day ago

Experimenting with continuity, I finally got it right! The next agent starts from what actually happened, not from zero.

The problem was simple and well known by every coder / vibecoder: every new Codex / Claude / Copilot session kept rediscovering the same repo structure, files, decisions, failed commands, current task state, and validation steps, wasting context and tokens.

I have been trying to work around that problem with different approaches: small handoffs, heavy memory systems, context engines...

I finally found it: Operational continuity for AI coding agents. I built an open-source (Python CLI) continuity runtime so agents don’t restart from zero every session, and it already made my own AI coding workflow feel much less like restarting from scratch every time.

The continuity idea is not to add more hidden memory or dump more context into the prompt. AICTX keeps operational continuity inside the repo:

  • active Work State;
  • next actions;
  • decisions and handoffs;
  • failure memory;
  • validation evidence;
  • execution summaries;
  • repo context relevant to the current task.

The next agent should resume from what actually happened, not infer everything again from README + chat history.

A few parts I’m currently focusing on:

Execution Contracts

Each resume can include a compact contract for the next agent: first action, edit scope, canonical validation command, expected evidence, and finalize instruction. The goal is not only “remember context”, but guide the next execution safely.
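
As a rough illustration of what such a contract could look like as data, here is a minimal sketch. The class name, field names, and JSON layout are hypothetical and chosen only to mirror the five items listed above; they are not AICTX's actual schema or file format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExecutionContract:
    """Hypothetical shape for a per-resume contract (not the real AICTX schema)."""
    first_action: str
    edit_scope: list = field(default_factory=list)
    validation_command: str = ""
    expected_evidence: str = ""
    finalize_instruction: str = ""

    def to_json(self):
        # Serialize so the contract can live in the repo as a plain artifact.
        return json.dumps(asdict(self), indent=2)


contract = ExecutionContract(
    first_action="Re-run the failing test before editing",
    edit_scope=["src/parser.py"],
    validation_command="pytest tests/test_parser.py",
    expected_evidence="all tests in test_parser.py pass",
    finalize_instruction="record the outcome in the handoff",
)
payload = json.loads(contract.to_json())
```

Keeping the contract as a small, explicit record is what lets the next agent be guided rather than merely reminded: the edit scope and validation command can be checked mechanically.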

Continuity View

I’m experimenting with a deterministic Mermaid continuity view generated from repo-local AICTX artifacts. It shows the current operational state of the repo visually: Work State, open handoffs, relevant failures, execution contracts, summaries, RepoMap hints, and portable continuity status.

Here you can see what it looks like. The link to this view can be returned after each task, so the next session has an inspectable continuity map. I’m still working on making it easier to read.

Portability

The continuity lives with the repository. The idea is that useful operational state should not be locked inside one chat, one vendor, one local machine, or one agent tool. If the repo moves, the continuity can move with it ... if you want it to!

Easy to use

pip install aictx
aictx install
aictx init

# Then keep using your coding agent normally; it will take care of using AICTX

MCP support

I'm also working on MCP support so compatible agents can access AICTX continuity directly as tools, resources and prompts instead of only relying on repo instructions and CLI commands.

The MCP server is local-first. It is not a cloud memory service, not a daemon you have to manage manually, and not a generic shell/filesystem server. Compatible agents launch it locally through stdio:

aictx mcp-server --repo . --profile full

I’m also packaging Claude Code and Codex plugin artifacts around the same model: MCP-first when available, CLI fallback when not. Copilot support remains best-effort through repo instructions and VS Code MCP config where supported.

The medium-term benefit

Agent-based development starts to feel less like a sequence of isolated chats and more like an ongoing engineering process:

  • less rediscovery;
  • fewer repeated failed commands;
  • clearer handoffs;
  • better validation discipline;
  • less instruction boilerplate once agents can call AICTX through MCP;
  • a cleaner path for Claude, Codex and Copilot integrations;
  • easier switching between Codex, Claude, Copilot or other agents;
  • and a repo that can explain its current state to the next session.

GitHub: https://github.com/oldskultxo/aictx

Docs: https://aictx.org

I would love technical feedback, especially from people using coding agents across multiple sessions.

Collaborators welcome! It is still evolving!

reddit.com
▲ 55 r/machinelearningnews+1 crossposts

🌍 OlmoEarth v1.1: 3x cheaper to run than v1 with the same SOTA performance, fully open

Today we’re releasing OlmoEarth v1.1. It’s 3x cheaper to run than v1 while delivering the same state-of-the-art performance—and fully open.

Compute is the largest cost when running OlmoEarth at hundreds of thousands of square kilometers. Partners use v1 today for mangrove tracking, forest-loss classification, and country-scale crop-type mapping. v1.1 makes that work cheaper to sustain.

Where the savings come from: we feed the model about 3x fewer tokens per Sentinel-2 input. Since compute scales quadratically with token count, even modest reductions compound into real efficiency gains. Done naively, this hurts accuracy noticeably; recovering it took changes to how we pretrain the model. Read more in our tech report: https://allenai.org/papers/olmoearth_v1_1
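
A back-of-the-envelope sketch of the token-count arithmetic, under stated assumptions: in a standard transformer layer only the self-attention term grows quadratically with token count, while the projections and MLP grow linearly. The constants below are common rough estimates, and the width and token counts are placeholders, not OlmoEarth's actual configuration.

```python
def layer_flops(n_tokens, d_model):
    """Rough forward FLOPs for one transformer layer, split by scaling behavior."""
    linear = 24 * n_tokens * d_model ** 2      # projections + MLP: linear in tokens
    attention = 4 * n_tokens ** 2 * d_model    # score and value mixing: quadratic in tokens
    return linear, attention


D_MODEL = 768                        # placeholder width, not OlmoEarth's actual size
before = layer_flops(900, D_MODEL)   # placeholder token count
after = layer_flops(300, D_MODEL)    # 3x fewer tokens

linear_saving = before[0] / after[0]          # 3x
attention_saving = before[1] / after[1]       # 9x
overall_saving = sum(before) / sum(after)     # somewhere between the two
```

How close the overall saving lands to 3x or 9x depends on how much of the layer's cost is attention, which in turn depends on sequence length relative to model width.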

One useful property for researchers: we held the pretraining dataset constant from v1. The differences cleanly isolate the methodological change, not the data or the architecture family.

v1.1 is available now in the same sizes as v1: Nano, Tiny, and Base. All are open weights, with open training code available. If you're running v1 and v1.1 works for your task, expect significant speedups during fine-tuning and inference.

🤗 Models: https://huggingface.co/collections/allenai/olmoearth

📝 Blog: https://allenai.org/blog/olmoearth-v1-1

u/ai2_official — 2 days ago
▲ 12 r/machinelearningnews+1 crossposts

🧬 flux-genotype: A self-evolving AI kernel that runs on CPU with Ollama — mutates its own architecture

`🧬 Flux‑Genotype – A CPU LLM that rewrites itself`

I've been working on an open-source kernel called **flux-genotype**. It orchestrates local models (TinyLlama, Llama 3.2, Hermes 3, DeepSeek-Coder) into a self-modifying ecosystem. Everything runs on **CPU** — I tested it on a Xeon without AVX2, 20 GB RAM.

> **Important:** this is an alpha. It works, it mutates, it evolves — but there's a lot of work ahead. The **MetaDesigner**, in particular, is the module I'm focusing on next. Right now it proposes architectural changes by writing new `.flux` files, but the validation and application pipeline needs to be more robust. The vision is to make it fully autonomous: an external architect that watches the ecosystem, diagnoses weaknesses, and rewrites the structure to improve confidence. It's not there yet, but the foundation is solid.

## How it works

  1. Ask a question → fast model (TinyLlama) answers.
  2. Judge model evaluates the answer (0–1). Initially this was Llama 3.2.
  3. If confidence drops below the golden ratio threshold (≈0.618), the ecosystem mutates its own structure.
  4. A **MetaDesigner** (Hermes 3) writes new `.flux` architecture files, which get validated by a Lark parser and applied.
  5. The system tracks confidence history with EMA and adapts temperature dynamically.
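
The confidence loop in steps 2, 3, and 5 can be sketched roughly as follows. This is an illustration under assumptions, not code from the repo: the class name, the smoothing factor, and the choice to compare the EMA (rather than the raw judge score) against the threshold are all hypothetical.

```python
GOLDEN_THRESHOLD = (5 ** 0.5 - 1) / 2  # inverse golden ratio, about 0.618


class ConfidenceTracker:
    """Tracks judge scores in [0, 1] with an exponential moving average."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha   # weight given to the newest score (assumed value)
        self.ema = None      # no history until the first score arrives

    def update(self, score):
        if self.ema is None:
            self.ema = score
        else:
            self.ema = self.alpha * score + (1 - self.alpha) * self.ema
        return self.ema

    def should_mutate(self):
        # Assumption: mutation triggers when smoothed confidence is below threshold.
        return self.ema is not None and self.ema < GOLDEN_THRESHOLD


tracker = ConfidenceTracker(alpha=0.5)
tracker.update(1.0)                    # strong first answer
tracker.update(0.0)                    # weak second answer pulls the EMA to 0.5
mutate_now = tracker.should_mutate()
```

Smoothing with an EMA means a single bad answer does not necessarily trigger a mutation; how quickly the system reacts is governed by `alpha`.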

## Real example of self‑modification

The mutation can also replace the Judge. During one of the growth cycles, the MetaDesigner proposed swapping the Judge from **Llama 3.2** to **DeepSeek-Coder 6.7B**. The new configuration was tested, scored better, and the ecosystem applied the change permanently.

The system is not just tweaking parameters — it's rewriting its own **division of labor between models**.

## Why this is different

- It mutates its own architecture, not just model weights.

- It can replace its own Judge with a different model if performance improves.

- It has memory (confidence history with Exponential Moving Average).

- It uses a custom language (`.flux`) with a formal grammar — not YAML, not JSON.

- It runs on modest hardware. No GPU. Just a CPU and 20 GB of RAM.

## If you want to understand the architecture deeply

I wrote a **technical manifesto** that defines FLUX as a formal Architecture Description Language for self-evolving cognitive ecosystems. It covers the fractal design, the OODA loop, the role of the golden ratio, and the long-term vision (including the MetaDesigner). It's in the repo:

📄 `/papers/FLUX-Kernel.pdf`

## The companion novel

There's also a novel called **"IF THIS IS A ROBOT"** (in Italian and English, CC BY-NC-SA 4.0) that tells the story of a guy who finds this kernel running on a forgotten server. The novel is basically the kernel's manual. But the code stands on its own.

## Links

- **Repo:** [github.com/flux-genotype/nodo_zero](https://github.com/flux-genotype/nodo_zero)

- Kernel is **MIT-licensed**. Novel is **CC BY-NC-SA 4.0**.

Happy to answer questions, and **open to collaborators** who want to help push the MetaDesigner forward.

reddit.com
u/Inner-Dot-7490 — 3 days ago

NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon

Most "4-bit training" results come from small models on short token horizons because the format breaks before you can validate it. That's not pretraining — and NVIDIA just drew a clear line between the two.

They introduced the first public 4-bit pretraining run at multi-trillion-token scale — a 12B hybrid Mamba-Transformer (Nemotron-Nano-12B-v2-Base architecture) trained on 10 trillion tokens in NVFP4, a microscaling format with 16-element blocks, E4M3 block scales, and an FP32 per-tensor scale, with downstream accuracy closely tracking an FP8 baseline.

Here's what's actually interesting:

→ MMLU-Pro 5-shot: 62.58% (NVFP4) vs 62.62% (FP8). MMLU 76.57 vs 77.36. GSM8K CoT 92.27 vs 89.08. Validation loss within 1% of FP8 in the stable phase

→ Recipe = selective BF16 (~16% of linear layers) + 16×16 Random Hadamard Transforms on Wgrad inputs + 2D 16×16 weight scaling + stochastic rounding on gradients. Ablations show all four are required

→ Only linear-layer GEMMs run in NVFP4 — attention, embeddings, normalization, master weights, gradients, and optimizer states stay in BF16/FP32

→ On an 8B model, MXFP4 needed 1.36T tokens (+36%) to match NVFP4's loss at 1T tokens
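
To illustrate what a block-scaled 4-bit format does to values, here is a simplified "fake quantization" sketch. It uses the eight E2M1 magnitudes of FP4 and one scale per 16-element block, as described above. The simplifications are assumptions made for readability: the block scale is kept as a plain Python float instead of E4M3, the FP32 per-tensor scale is omitted, and rounding is round-to-nearest rather than stochastic. It is not NVIDIA's recipe.

```python
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable FP4 magnitudes
BLOCK = 16                                             # elements sharing one scale


def quantize_block(block):
    """Return (scale, codes) with codes drawn from the signed E2M1 grid."""
    amax = max(abs(x) for x in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0  # map the largest value to 6.0
    codes = []
    for x in block:
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(magnitude if x >= 0 else -magnitude)
    return scale, codes


def fake_quantize(values):
    """Quantize then dequantize, block by block, to expose the rounding error."""
    out = []
    for start in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[start:start + BLOCK])
        out.extend(scale * code for code in codes)
    return out


original = [12.0, -6.0, 1.0, 9.0] + [0.0] * 12
restored = fake_quantize(original)  # 9.0 comes back as 8.0; the rest survive exactly
```

The example shows why outliers matter: one large value in a block sets the scale for all sixteen elements, so mid-range values such as 9.0 land on a coarse grid point.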

Full Analysis: https://www.marktechpost.com/2026/05/18/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon/

Paper: https://arxiv.org/pdf/2509.25149

https://preview.redd.it/114lxr5x0v1h1.png?width=1462&format=png&auto=webp&s=c0f5be370e3b75ae7bec2d6eef9c3895f414cfab

reddit.com
u/ai-lover — 4 days ago

recursive thought at 11:33am

future simulation may be the foundation of intelligence itself. humans constantly predict outcomes before reality fully unfolds. fear, intuition, curiosity, and decision making all seem connected to recursive future modeling under uncertainty. maybe consciousness is not awareness alone but a system designed to navigate possible timelines before action occurs.

reddit.com
u/DingoShort3945 — 5 days ago

NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU

Most open-source world models either need 8 GPUs to run or drop to 480p to survive. That's not an efficiency problem — it's an architecture problem. NVIDIA just addressed it directly.

They introduced SANA-WM — a 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing 720p video with precise 6-DoF camera control from a single image and a camera trajectory, running inference on a single GPU with no multi-GPU dependency anywhere in the pipeline.

Here's what's actually interesting:

→ Hybrid Gated DeltaNet + softmax backbone keeps recurrent state at constant D×D size regardless of video length — solving the quadratic memory explosion that makes 961-frame sequences infeasible with standard softmax attention

→ Dual-branch camera control: UCPE at latent-frame rate for global trajectory + Plücker mixing at raw-frame rate for intra-stride motion — CamMC 0.2047, best among all compared methods

→ Second-stage refiner (17B LTX-2 + rank-384 LoRA, 3 Euler steps) cuts long-horizon visual drift ΔIQ from 3.09 to 0.31 on Hard trajectories

→ 22.0 videos/hour on 8 H100s, 36× higher throughput vs LingBot-World at 14B+14B parameters

→ Distilled variant: 34s per 60s 720p clip on a single RTX 5090 with NVFP4 quantization
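
The "constant D×D recurrent state" point in the first item can be illustrated with a toy gated delta-rule update in plain Python. This is a sketch of the general recurrence used by delta-rule linear attention, with hypothetical function names and tiny dimensions; it is not SANA-WM's implementation, and the scalar gates `alpha` and `beta` are fixed here rather than learned.

```python
def gated_delta_step(state, key, value, alpha, beta):
    """One recurrent update: erase what the state stored for `key`, decay,
    then write `value` under `key`. The state stays D x D."""
    dim = len(key)
    recalled = [sum(state[i][j] * key[j] for j in range(dim)) for i in range(dim)]
    return [
        [
            alpha * (state[i][j] - beta * recalled[i] * key[j]) + beta * value[i] * key[j]
            for j in range(dim)
        ]
        for i in range(dim)
    ]


def read(state, query):
    """Retrieve the value currently associated with `query`."""
    return [sum(row[j] * query[j] for j in range(len(query))) for row in state]


D = 2
state = [[0.0] * D for _ in range(D)]
# Two writes under the same key: the second overwrites the first.
for value in ([3.0, 4.0], [5.0, 6.0]):
    state = gated_delta_step(state, key=[1.0, 0.0], value=value, alpha=1.0, beta=1.0)
output = read(state, [1.0, 0.0])
```

However many frames are processed, the memory held between steps is this one matrix, in contrast to softmax attention, whose cache grows with sequence length.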

Full analysis: https://www.marktechpost.com/2026/05/16/nvidia-introduces-sana-wm-a-2-6b-parameter-open-source-world-model-that-generates-minute-scale-720p-video-on-a-single-gpu/

Paper: https://arxiv.org/pdf/2605.15178

Project page: https://nvlabs.github.io/Sana/WM/

GitHub Page: https://github.com/NVlabs/Sana

https://preview.redd.it/ny5cruolhg1h1.png?width=1358&format=png&auto=webp&s=a8e45f60221194c7df2b94ba99d15f002a34304b

reddit.com
u/ai-lover — 6 days ago
▲ 126 r/machinelearningnews+17 crossposts

Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline

Shipped this for the AMD x lablab hackathon. Attached video is one of the actual reels the pipeline produced - one English sentence in, finished mp4 with characters, story, music, and voice-over out. ~45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT.

Pipeline (8 stages, all sequential on the same GPU):

  1. Director Agent - Qwen3.5-35B-A3B (vLLM + AITER MoE) plans 6 shots from one sentence, returns structured JSON with character bibles, shot prompts, music brief, per-shot voice-over script, narration language
  2. Character masters - FLUX.2 [klein] paints one canonical portrait per character. No LoRA training step - reference editing pins identity across shots by construction
  3. Per-shot keyframes - FLUX.2 again with reference image. Sub-second per keyframe after warmup
  4. Animation - Wan2.2-I2V-A14B, 81 frames @ 16 fps native. FLF2V for cut:false continuation arcs (last frame of shot N anchors first frame of shot N+1)
  5. Vision critic - same Qwen3.5-35B reloaded with 10 structured failure labels (character drift, extras invade frame, camera ignored, walking backwards, object morphing, hand/finger artifact, wardrobe drift, neon glow leak, stylized AI look, random intimacy). Bad clips re-render with targeted retry strategies (different seed, FLF2V anchor, prompt simplification)
  6. Music - ACE-Step v1 generates a 30s instrumental from Director's brief
  7. Narration - Kokoro-82M, 9 languages. Director picks language to match setting (Tokyo→Japanese, Paris→French, Mumbai→Hindi)
  8. Mix - ffmpeg with per-shot vo aligned via adelay
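
For the final mix step, a sketch of how per-shot voice-over offsets could be turned into an ffmpeg filter graph. `adelay` and `amix` are real ffmpeg audio filters, but everything else here is an assumption for illustration: the input indices, the stream labels, and the idea that each voice-over starts exactly at its shot boundary (81 frames at 16 fps, as stated in stage 4). The repo's actual command may differ.

```python
FRAMES_PER_SHOT = 81
FPS = 16


def shot_offsets_ms(n_shots):
    """Start time of each shot in whole milliseconds (integer arithmetic)."""
    return [i * FRAMES_PER_SHOT * 1000 // FPS for i in range(n_shots)]


def build_voiceover_filter(n_shots, first_audio_input=1):
    """Build a filter_complex string that delays each VO track, then mixes them."""
    parts, labels = [], []
    for i, delay in enumerate(shot_offsets_ms(n_shots)):
        label = f"vo{i}"
        # adelay takes one delay per channel, separated by "|".
        parts.append(f"[{first_audio_input + i}:a]adelay={delay}|{delay}[{label}]")
        labels.append(f"[{label}]")
    parts.append(f"{''.join(labels)}amix=inputs={n_shots}[voice]")
    return ";".join(parts)


filter_graph = build_voiceover_filter(2)
```

The resulting string would be passed to ffmpeg via `-filter_complex`, with the video as input 0 and the voice-over files as inputs 1 onward under these assumptions.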

Wan 2.2 specifics (the bit this sub will care about):

  • 1280×720, not 640×640 default. Costs more but matches what producers want
  • 121 frames at 24 fps was my first attempt - gave temporal rippling. Switched to 81 @ 16 fps native (the distribution Wan was trained on) and it cleaned up
  • flow_shift = 5 for hero shots, 8 for b-roll (upstream wan_i2v_A14B.py defaults)
  • Negative prompt: verbatim Chinese trained negative from shared_config.py. umT5 was multilingual-pretrained against those exact tokens. English translation is observably weaker
  • Camera language: ONE camera verb per shot, sentence-case, placed first ("Tracking shot following from behind"). Multiple verbs in one prompt cancel each other out
  • Avoid the word "cinematic" - triggers Wan's stylization branch, gives the AI look. Use lens/film tags instead ("Arri Alexa, anamorphic, 35mm film grain")

Performance work:

  • ParaAttention FBCache (lossless 2× on Wan2.2)
  • torch.compile on transformer_2 (selective, the dual-expert MoE makes full compile flaky) - another 1.2×
  • AITER MoE acceleration on Qwen director (vLLM)
  • End-to-end: 25.9 min → 10.4 min per 720p clip on MI300X

Why a single MI300X: 192 GB HBM3 lets a 35B MoE, 4B diffusion, 14B I2V MoE, 3.5B music, and a TTS share the same card sequentially. Same stack on a 24 GB consumer GPU would need 4-5 boxes wired together.

Code (public, Apache 2.0): https://github.com/bladedevoff/studiomi300

Hugging Face (documentation, like this space 🙏) https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300

Live demo on HF Space is temporarily offline while infra restores - should be back within hours. In the meantime the showcase reels in the repo are real pipeline outputs, no human re-edited shots.

Happy to dig into AITER MoE setup, FBCache tuning, FLF2V anchoring, or the vision critic's failure taxonomy in comments.

u/Inevitable-Log5414 — 8 days ago
▲ 25 r/machinelearningnews+2 crossposts

Learning, Fast and Slow: Towards LLMs That Adapt Continually [R]

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.

https://arxiv.org/abs/2605.12484v1

reddit.com
u/LakshyAAAgrawal — 6 days ago

I have a question about AI Engineering

What should I know or learn to work as an AI engineer? I used to think AI engineering was just about deploying models for different situations, but I guess it's more than that, right?

reddit.com
u/Tony71814 — 6 days ago
▲ 5 r/machinelearningnews+4 crossposts

Thoth v3.22.0 just dropped and it turns the app into a real developer workbench

Developer Studio gives you a dedicated coding surface with repo linking, code threads, diffs, todos, test detection, Git operations, and a live inspector that stays in sync during long runs.

Custom Tools let you convert any repo into a tool. Thoth can inspect it, propose commands, validate them, test them, and promote them into your normal chat workflow.

Docker Sandbox adds a safe execution mode with persistent containers, network controls, and clean import paths so you can experiment without risking your actual repo.

Plus a long list of upgrades across workflows, Home status, chat streaming, Settings, onboarding, embeddings, and overall stability.

u/Acceptable-Object390 — 7 days ago

Applied Item Response Theory (1968 psychometrics) to 242K cancer drug sensitivity measurements — IRT recovers rankings where averaging fails under sparsity

u/testofschool — 5 days ago

I fine-tuned Gemma 3 27B on code and got 98.78% HumanEval / 73% MBPP. Here’s the honest breakdown including all the eval bugs I hit.


Model: https://huggingface.co/KK9922/Forge-Gemma-3-27B-GGUF

Code + eval harness: https://github.com/thesis09/Finetuned-Google-Gemma3-27B-It-for-code-generator-or-vibe-coder

Demo video: https://youtu.be/3acwPjRmo74

Quant: Q4_K_M GGUF (~17GB)
Runs on: RTX 3060 12GB (25 GPU layers), RTX 3090/4090 (full offload)

What this is

QLoRA fine-tune of google/gemma-3-27b-it for code generation. Python, JS, Java, C++, C. Trained on ~33K samples (self-oss-instruct + CodeAlpaca, filtered and deduplicated) on an H100 80GB. Full pipeline: dataset curation → training → LoRA merge → GGUF export → FastAPI inference server → eval harness.

I’m posting this because the eval story is more interesting than the benchmark numbers, and r/machinelearningnews deserves the real version rather than the “I got 99%!” hype.

The numbers

| Benchmark | Score | Notes |
|---|---|---|
| HumanEval pass@1 | 98.78% (162/164) | Full 164-problem set |
| MBPP pass@1 | 73% | 100-problem sanitized split |
| DebugBench | 74% | Token-overlap metric, NOT execution-based; see below |

Base model (gemma-3-27b-it) for comparison: ~84% HumanEval, ~72% MBPP

So the fine-tune is +14.8pp on HumanEval, roughly flat on MBPP.

Why there’s a 26-point gap between HumanEval and MBPP

This is the part I want to be upfront about.

98.78% HumanEval looks incredible. But CodeAlpaca and self-oss-instruct both contain HumanEval-adjacent problems. Some of that gain is the model having seen similar problems during training, not purely better code reasoning. MBPP tests a different problem style — mathematical formula implementations, number theory, string manipulation edge cases. The model was never specifically trained on those.

MBPP 73% ≈ base model 72% is the honest generalization signal. The fine-tune improved structured code output and formatting without breaking general Python reasoning. No catastrophic forgetting. But it also didn’t improve on tasks outside the training distribution.

If you’re looking for a model that specifically crushes MBPP-style algorithmic problems, this isn’t it. If you want structured, formatted, immediately-runnable code output with a consistent style, this is pretty good.

The eval bugs — this is the interesting part

HumanEval was 0% until I fixed my eval script

First run: 0% pass@1 on 50 problems. I panicked. The model was fine.

The issue: my eval code prepended the function stub to the model’s response every time. At temperature 0.1, the model returns the complete function including the def line. So I was creating:

def add(a, b):        # from fn_prompt
    """Add two..."""
def add(a, b):        # from model response (DUPLICATE)
    """Add two..."""
    return a + b

Python silently used the second definition (which is just the body with no context). Every test failed. Fixed with a 3-case assembly function that detects whether the model returned a full function, body only, or nothing, and handles each correctly.
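
A minimal sketch of what a 3-case assembly function might look like. This is not the code from the linked harness; the function name, the detection regex, and the `pass` fallback for empty responses are assumptions made for illustration.

```python
import re
import textwrap


def assemble_solution(fn_prompt, response, entry_point):
    """Combine a HumanEval-style stub with a model response into one program."""
    response = response.strip("\n")

    # Case 3: nothing usable came back. Emit a stub that fails tests cleanly.
    if not response.strip():
        return fn_prompt.rstrip("\n") + "\n    pass\n"

    # Case 1: the response already defines the target function. Use it as-is,
    # keeping only the import lines from the prompt.
    pattern = rf"^def\s+{re.escape(entry_point)}\s*\("
    if re.search(pattern, response, re.MULTILINE):
        imports = [
            line for line in fn_prompt.splitlines()
            if line.startswith(("import ", "from "))
        ]
        return "\n".join(imports + [response]) + "\n"

    # Case 2: body only. Normalize indentation and append it to the stub.
    body = textwrap.indent(textwrap.dedent(response), "    ")
    return fn_prompt.rstrip("\n") + "\n" + body + "\n"


stub = 'def add(a, b):\n    """Add two numbers."""\n'
full_program = assemble_solution(stub, "def add(a, b):\n    return a + b", "add")
body_program = assemble_solution(stub, "return a + b", "add")
empty_program = assemble_solution(stub, "", "add")
```

Whatever the exact detection rule, the key property is that each case produces exactly one definition of the entry point, so the tests exercise the model's logic rather than the harness's string concatenation.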

After fix: 98.78% on full 164 problems.

MBPP was 9% until I figured out what it was actually testing

9% felt catastrophic. Ran it again. Still 9%.

Turned out: MBPP test assertions hardcode the expected function name. Like assert min_cost([[1,2],[3,4]], 1, 1) == 4. My eval prompt just said “write a function” — the model wrote correct logic under a name like minimum_cost_path and got NameError on every test.

Fix: regex the first assert statement to extract the expected function name, inject it into the prompt. Also had to exclude Python builtins from the regex because two problems had tests like assert set(my_func(...)) == {1,2} — outer set() is a comparison wrapper, not the function name.
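
The name-extraction fix could be sketched like this. It is an illustration under assumptions, not the harness's actual code: the function name and regex are hypothetical, and skipping every builtin would misfire on a problem whose target function happens to share a builtin's name.

```python
import builtins
import keyword
import re

_CALL = re.compile(r"([A-Za-z_]\w*)\s*\(")   # any identifier followed by "("
_BUILTINS = set(dir(builtins))


def expected_function_name(assert_line):
    """Return the first called name in an assert that is not a builtin or keyword."""
    for name in _CALL.findall(assert_line):
        if name in _BUILTINS or keyword.iskeyword(name):
            continue  # e.g. set(...) or len(...) used as a comparison wrapper
        return name
    return None


plain = expected_function_name("assert min_cost([[1,2],[3,4]], 1, 1) == 4")
wrapped = expected_function_name("assert set(my_func([1, 2])) == {1, 2}")
```

The extracted name can then be injected into the prompt so the model writes its solution under the identifier the tests will call.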

Also added “NO extra parameters” to the prompt because the model kept adding optional params like length to sorting functions. Correct logic, wrong signature, TypeError.

After all fixes: 73%.

DebugBench trained on 0 samples

My data pipeline loaded buggy→fixed pairs from Rtian/DebugBench by looking for row.get("fixed_code", ""). The actual field is "solution". Every row was skipped. The function returned 0 samples and I missed it in the output.
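
A defensive loader along these lines would have surfaced the problem immediately. This is a sketch, not the pipeline's code: rows are assumed to be plain dicts, `"solution"` is the field name given above, and `"buggy_code"` is a hypothetical name for the input field.

```python
def load_debug_pairs(rows, buggy_field="buggy_code", fixed_field="solution"):
    """Collect (buggy, fixed) pairs, failing loudly instead of skipping rows."""
    pairs = []
    for index, row in enumerate(rows):
        missing = [name for name in (buggy_field, fixed_field) if name not in row]
        if missing:
            raise KeyError(
                f"row {index} is missing {missing}; available fields: {sorted(row)}"
            )
        pairs.append((row[buggy_field], row[fixed_field]))
    if not pairs:
        raise ValueError("loaded 0 debugging pairs; refusing to continue")
    return pairs


rows = [{"buggy_code": "x = 1 +", "solution": "x = 1 + 1"}]
pairs = load_debug_pairs(rows)
```

The design choice is to treat a missing field or an empty result as an error rather than a default: `row.get(..., "")` turns a schema mismatch into silently empty data.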

The model achieves 74% on DebugBench entirely from the base model’s pre-existing capability, not from any training. Worth noting when interpreting that number.

The tokenizer bug you’ll hit if you try to export Gemma 3 yourself

This one’s a gift if you’re trying to GGUF any Gemma 3 model.

Older llama.cpp (pre-b3447) doesn’t recognize Gemma 3’s SentencePiece tokenizer hash. A common workaround patches convert_hf_to_gguf.py to return "llama-bpe" for unrecognized tokenizers.

Do not do this. The export will succeed, the model will generate text, and the text will look mostly fine. Then you’ll notice variable names are missing:

def dijkstra(graph, start):
    = {start: 0}       # "distances" vanished
    = []               # "priority_queue" vanished
    heapq.heappush(, (0, start))

Words that exist in Gemma’s SentencePiece vocab but not in llama-bpe decode to empty strings. Silently. No error.

Fix: use llama.cpp b3447 or later (natively supports Gemma 3’s tokenizer hash) AND restore the original tokenizer files from google/gemma-3-27b-it before exporting. I also use chat_format=None in llama-cpp-python and build the raw Gemma 3 prompt string manually, which bypasses whatever residual weirdness is in the built-in Gemma formatter.
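
Building the raw prompt string by hand might look like the sketch below. The turn markers match the ones used in the llama-cli example in this post; the function name is hypothetical, and folding an optional system prompt into the first user turn is an assumption rather than something confirmed by the repo.

```python
def build_gemma_prompt(user_message, system_prompt=None):
    """Return a single-turn Gemma-style prompt ending at the model's turn."""
    if system_prompt:
        # Assumption: no separate system turn, so prepend it to the user text.
        content = f"{system_prompt}\n\n{user_message}"
    else:
        content = user_message
    return (
        f"<start_of_turn>user\n{content}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = build_gemma_prompt("Write a binary search in Python")
```

The string is then passed as a plain completion prompt, so no chat formatter gets a chance to rewrite the turn markers.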

Running it locally

RTX 3060 12GB:

./llama-cli \
  -m gemma3-forge-Q4_K_M.gguf \
  --n-gpu-layers 25 \
  -c 4096 \
  --temp 0.1 \
  --top-k 40 \
  --top-p 0.95 \
  --repeat-penalty 1.1 \
  -p "<start_of_turn>user\nWrite a binary search in Python<end_of_turn>\n<start_of_turn>model\n"

25 GPU layers uses ~10-11GB VRAM. If you have more, increase it. If you get OOM, drop to 20.

With the FastAPI server:

python main.py --model gemma3-forge-Q4_K_M.gguf --gpu-layers 25
# exposes OpenAI-compatible API at localhost:8080

Works with Open WebUI, continue.dev, or any OpenAI-compatible client. System prompt is baked in by default but overridable.

Sampling that works well for code:

  • temp=0.1 (any higher and identifier names get weird)
  • min_p=0.05 (this is the one that kills the def func(arr,): bug class)
  • repeat_penalty=1.1 (gentle, doesn’t distort code)

Recommended system prompt

You are Forge, an elite precision coding assistant.
Response structure: one-sentence summary, then complete code in a fenced block,
then 3-5 bullet explanation, then 2+ edge cases.
Never write TODO, placeholder code, or incomplete functions.
When debugging: root cause in one sentence, fixed code with # FIXED: comments.
Always state time and space complexity.

What I’d change if I ran training again

  • 3-5 epochs instead of < 1. Loss hit 0.22 at step 50 and barely moved for 950 more steps. The model converged early. More epochs would squeeze more out of the data.
  • Fix the DebugBench field name before training. 4,253 debugging examples that were never used.
  • Add MBPP-style training data. The gap between HumanEval and MBPP scores is a direct result of the training data not covering mathematical formula implementations.
  • HumanEval+ evaluation. I couldn’t get evalplus installed in the local environment during the eval run. HumanEval+ (80x more test cases per problem) would give a more honest picture of whether the model is actually solving problems or pattern-matching.

File sizes and hardware requirements

| Format | Size | Min VRAM |
|---|---|---|
| bfloat16 (training/eval) | 109 GB | 80GB (H100) |
| Q4_K_M GGUF (this release) | ~17 GB | ~12GB (partial offload) |
| Q4_K_M full GPU offload | ~17 GB | ~18GB (3090/4090) |

For CPU-only: needs ~32GB RAM, will be slow.

Happy to answer questions about the training setup, the eval harness, the tokenizer bug, or anything else. The GitHub has the full pipeline code if you want to reproduce or extend this.

408 people downloaded it in the first 24 hours which I did not expect at all. Thanks to whoever those 408 people are.

u/Thesis992 — 7 days ago