u/Lanaxsa

Hey folks — spent the last week running a real-world RAG benchmark and the results surprised me enough that I wanted to share and get a sanity check from the community.

Test setup

Domain: Turkish-language enterprise RAG (emails, contracts, postmortems, SOWs, CSVs — 60 mixed docs)
Stack: ParadeDB (Postgres + pgvector + native BM25), Vercel AI SDK, Google embedding (1536d), no reranker yet
Test set: 20 questions across 7 categories — simple lookup, multi-hop chains, contradiction resolution (doc A says X, doc B says Y — which is current?), numeric/CSV aggregation, hallucination traps (asking about projects that don't exist), Turkish morphology variants
All models were served through a gateway, so latency is dominated by the network rather than by local inference
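
For anyone who wants to build a similar test set, here is a minimal sketch of how the questions could be structured. It is not my actual harness: the type names, field names, and sample entries are all hypothetical, and it only covers the six categories spelled out above.

```typescript
// Hypothetical category labels for the question types named in the post.
type Category =
  | "simple-lookup"
  | "multi-hop"
  | "contradiction"
  | "numeric-csv"
  | "hallucination-trap"
  | "tr-morphology";

// Hypothetical shape of one benchmark question.
interface TestCase {
  id: string;
  category: Category;
  question: string;
  // Document IDs a correct answer should cite; empty for trap questions,
  // where the right behavior is to say the information does not exist.
  expectedSources: string[];
}

// Count questions per category to check how evenly the set covers them.
function countByCategory(cases: TestCase[]): Record<string, number> {
  const counts: Record<string, number> = {};
  cases.forEach((c) => {
    counts[c.category] = (counts[c.category] || 0) + 1;
  });
  return counts;
}

// Placeholder entries purely for illustration, not real benchmark questions.
const sampleCases: TestCase[] = [
  { id: "q1", category: "simple-lookup", question: "<lookup question>", expectedSources: ["doc-01"] },
  { id: "q2", category: "multi-hop", question: "<two-step question>", expectedSources: ["doc-02", "doc-03"] },
  { id: "q3", category: "hallucination-trap", question: "<question about a nonexistent project>", expectedSources: [] },
];

console.log(countByCategory(sampleCases));
```

Keeping expectedSources empty for trap questions makes it easy to grade refusals separately from normal answers.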

Models tested (open-weight only) — Q1-Q2 baseline score out of 5

Gemma 4 26B-A4B (MoE, 26B total / 4B active) — 5.0 / 5
Ministral 14B (dense) — 4.75 / 5
Qwen 3.6-27B (dense) — 4.75 / 5
Nemotron 3 Super 120B (MoE, 120B / 12B active) — 3.5 / 5
Qwen 3-30B A3B (MoE, 30B / 3B active) — 2.0 / 5
Qwen 3-Next 80B (MoE, 80B / 3B active) — 1.0 / 5
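
If you want to put a number on how size relates to score in the list above, one option is a Spearman rank correlation between total parameter count and score. The sketch below is just an illustration using the six rows from the list. The helper names are mine, and with six models and two questions the result is a rough indicator at best.

```typescript
interface ModelResult {
  model: string;
  totalParamsB: number; // total parameters in billions, from the list above
  score: number; // Q1-Q2 baseline score out of 5, from the list above
}

const results: ModelResult[] = [
  { model: "Gemma 4 26B-A4B", totalParamsB: 26, score: 5.0 },
  { model: "Ministral 14B", totalParamsB: 14, score: 4.75 },
  { model: "Qwen 3.6-27B", totalParamsB: 27, score: 4.75 },
  { model: "Nemotron 3 Super 120B", totalParamsB: 120, score: 3.5 },
  { model: "Qwen 3-30B A3B", totalParamsB: 30, score: 2.0 },
  { model: "Qwen 3-Next 80B", totalParamsB: 80, score: 1.0 },
];

// Assign 1-based ascending ranks; tied values share the average of their ranks.
function averageRanks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const ranks: number[] = new Array(values.length).fill(0);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const avg = (start + end) / 2 + 1; // mean of ranks start+1 .. end+1
    for (let k = start; k <= end; k++) ranks[order[k].i] = avg;
    start = end + 1;
  }
  return ranks;
}

// Pearson correlation of two equal-length arrays.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((s, v) => s + v, 0) / n;
  const my = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) * (x[i] - mx);
    syy += (y[i] - my) * (y[i] - my);
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Spearman correlation is Pearson correlation computed on the ranks.
function spearman(x: number[], y: number[]): number {
  return pearson(averageRanks(x), averageRanks(y));
}

const rho = spearman(
  results.map((r) => r.totalParamsB),
  results.map((r) => r.score),
);
console.log(rho.toFixed(2));
```

On these six rows the coefficient comes out negative, meaning the larger models tended to score lower in this particular run.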

What surprised me

A bigger model does not mean better RAG. Qwen 3-Next 80B and Qwen 3-30B A3B both fell apart on tool-calling discipline: hallucinating arguments, skipping retrieval, and confidently making things up. Nemotron 120B got the easy answer right but missed nuance. Meanwhile Gemma, with 4B active params, and Ministral, at 14B dense, crushed it.
Ministral 14B is the dark horse. It has the smallest total parameter count on the list, and its Turkish output quality is arguably the cleanest of the bunch. It only loses points on completeness: it sometimes skips a citation Gemma would catch. For edge or laptop deployment this is hard to beat.
Gemma 4 26B-A4B is the most disciplined. It was best at "I don't know" refusals (it didn't hallucinate on trap questions about fictional projects) and best at multi-hop chains. Its tool calls are minimal and on-target: 3 to 4 times fewer calls than Ministral for the same answer. The 4B-active-param MoE design is genuinely impressive here.
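
A comparison like "3 to 4 times fewer calls" can be computed from per-question logs. Here is a rough sketch, assuming you record the number of tool calls each model made per question. The record shape and sample numbers are made up for illustration and are not my benchmark data or anything from the AI SDK.

```typescript
// Hypothetical log record for one answered question.
interface RunRecord {
  model: string;
  questionId: string;
  toolCalls: number; // retrieval/tool invocations made before the final answer
}

// Mean tool calls per question for one model (NaN if the model has no records).
function meanToolCalls(records: RunRecord[], model: string): number {
  const own = records.filter((r) => r.model === model);
  if (own.length === 0) return NaN;
  const total = own.reduce((sum, r) => sum + r.toolCalls, 0);
  return total / own.length;
}

// How many times more calls model `a` makes than model `b`, on average.
function callRatio(records: RunRecord[], a: string, b: string): number {
  return meanToolCalls(records, a) / meanToolCalls(records, b);
}

// Made-up numbers purely to exercise the functions.
const sampleRuns: RunRecord[] = [
  { model: "model-a", questionId: "q1", toolCalls: 2 },
  { model: "model-a", questionId: "q2", toolCalls: 1 },
  { model: "model-b", questionId: "q1", toolCalls: 6 },
  { model: "model-b", questionId: "q2", toolCalls: 3 },
];

console.log(callRatio(sampleRuns, "model-b", "model-a"));
```

Comparing only questions that both models answered correctly keeps the ratio fair, since a model that gives up early would otherwise look artificially frugal.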

What about you — which models have you tried on similar RAG workloads? Any that surprised you in either direction? Anything you'd recommend I throw into the next round?

TL;DR: Gemma 4 26B-A4B wins on quality, Ministral 14B is the small-footprint sweet spot, the bigger MoE models (Qwen 3-Next 80B, Qwen 3-30B A3B) underperformed badly on tool use.


Benchmarked Gemma 4 26B-A4B vs Ministral 14B vs Qwen3 variants on a Turkish RAG workload — small models punch way above their weight