
Compressed Whisper large-v3-turbo to 368 MB with Q3_K-matched QAT — multilingual WER results
I’ve released Orbination Whisper AI, an experiment in compressing Whisper large-v3-turbo into a compact multilingual speech-to-text engine.
The default model is 368 MB using Q3_K quantization and runs through a Go runtime built on whisper.cpp, with no Python required at runtime. It supports CPU/GPU backends and includes CLI + HTTP server modes.
I focused on reducing the train/inference mismatch by training with the actual ggml Q3_K quantize/dequantize path in the forward pass, using a straight-through estimator and teacher distillation. The goal was to make the exported Q3_K checkpoint behave like the model seen during training, rather than fine-tuning first and losing accuracy after quantization.
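To make the straight-through idea concrete, here is a minimal pure-Python sketch. It is not the repo's training code, and it rests on several loudly stated assumptions: `fake_quant` is a simplified symmetric 3-bit round-to-nearest quantizer over small blocks, standing in for the real ggml Q3_K path; the "model" is a single dot product; and a squared error toward a teacher output stands in for whatever distillation loss was actually used. All function names are hypothetical.

```python
def fake_quant_block(block, bits=3):
    """Quantize then dequantize one block (simplified stand-in, NOT ggml Q3_K)."""
    qmax = 2 ** (bits - 1) - 1              # 3 for signed 3-bit
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return list(block)
    scale = amax / qmax                     # one scale per block
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in block]


def fake_quant(weights, block_size=16, bits=3):
    """Apply the block quantizer across a flat weight list."""
    out = []
    for i in range(0, len(weights), block_size):
        out.extend(fake_quant_block(weights[i:i + block_size], bits))
    return out


def train_step(weights, x, teacher_out, lr=0.05):
    """One QAT step for y = w . x with a squared-error distillation loss."""
    wq = fake_quant(weights)                # forward pass sees quantized weights
    student_out = sum(w * xi for w, xi in zip(wq, x))
    err = student_out - teacher_out         # d(loss)/d(student_out) for 0.5 * err^2
    # Straight-through estimator: treat d(wq)/d(w) as 1, so the gradient
    # computed at the quantized weights updates the full-precision weights.
    new_weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return new_weights, 0.5 * err * err


# Hypothetical toy values, for illustration only.
weights, loss = train_step([0.6, -0.25], x=[1.0, 2.0], teacher_out=0.0)
```

The point of the pattern is that the loss is always measured on the weights the deployed model will actually use, while the optimizer keeps updating a full-precision copy.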
WER on held-out FLEURS (as a fraction, lower is better), using beam search in the deployed Go runtime:
- Q3_K, 368 MB: EN 0.065, ES 0.050, FR 0.065, EL 0.148
- Q4_K, 474 MB: EN 0.062, ES 0.048, FR 0.063, EL 0.124
- Q5_K, 574 MB: EN 0.061, ES 0.047, FR 0.061, EL 0.110
- FP16 (quality upper bound), 1.6 GB: EN 0.061, ES 0.046, FR 0.060, EL 0.108
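The gap to FP16 is easy to work out from the numbers above. A small Python snippet for this, using only the WER values from the list (the `degradation` helper is just an illustrative name):

```python
# WER values copied from the list above.
wer = {
    "Q3_K": {"EN": 0.065, "ES": 0.050, "FR": 0.065, "EL": 0.148},
    "Q4_K": {"EN": 0.062, "ES": 0.048, "FR": 0.063, "EL": 0.124},
    "Q5_K": {"EN": 0.061, "ES": 0.047, "FR": 0.061, "EL": 0.110},
    "FP16": {"EN": 0.061, "ES": 0.046, "FR": 0.060, "EL": 0.108},
}


def degradation(variant, baseline="FP16"):
    """Return {lang: (absolute, relative)} WER increase of variant over baseline."""
    out = {}
    for lang, base in wer[baseline].items():
        delta = wer[variant][lang] - base
        out[lang] = (round(delta, 3), round(delta / base, 3))
    return out


for variant in ("Q3_K", "Q4_K", "Q5_K"):
    for lang, (abs_d, rel_d) in degradation(variant).items():
        print(f"{variant} {lang}: {abs_d:+.3f} WER ({rel_d:+.1%} relative)")
```

Looking at relative rather than absolute change makes the comparison across languages fairer, since the Greek baseline WER is already higher than the others.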
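The interesting part for me is that the high-resource languages stay close across precisions (EN, ES, and FR at Q3_K are all within 0.005 WER of FP16), while Greek shows by far the biggest sensitivity to quantization: 0.108 at FP16 versus 0.148 at Q3_K, roughly a 37% relative increase.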
Repo:
https://github.com/amichail-1/Orbination-Whisper-AI
I’d be interested in feedback from people working with Whisper, whisper.cpp, QAT, or multilingual ASR deployment.