u/gvij — reddlx

CPU TTS benchmark with UTMOS MOS scoring: Kokoro, Supertonic, Inflect-Nano, and Kyutai's new Pocket TTS [P]

Sharing a CPU TTS benchmark with objective MOS scores in case it's useful for anyone evaluating small TTS models. Adding this because Kyutai's Pocket TTS is architecturally different from the others in the field and I hadn't seen a head-to-head with it yet.

Models:

Kokoro 82M (PyTorch and ONNX Runtime, StyleTTS2-inspired)
Supertonic 3 at 2 and 5 flow-matching steps (Vector Estimator backbone)
Inflect-Nano-v1 (4.6M param FastSpeech-style, tiny end of the spectrum)
Pocket TTS (~100M param streaming LM over Kyutai's Mimi neural audio codec)

Setup: Intel Xeon 8272CL, 4 cores, 15.6GB RAM. CUDA disabled at env level. ONNX sessions pinned to CPUExecutionProvider. Six configs, six text lengths (12 to 1712 chars), five timed reps per cell after a discarded warmup. 180 total runs. Every saved WAV scored with UTMOS (utmos22_strong) for objective MOS.

Aggregate results:

Config	Mean RTF	UTMOS
Supertonic 3 (2-step)	0.121	1.53
Inflect-Nano-v1	0.145	3.48
Supertonic 3 (5-step)	0.240	4.32
Kokoro 82M (ONNX)	0.641	4.44
Kokoro 82M (PyTorch)	0.665	4.46
Pocket TTS	0.714	4.10

Findings I think are actually interesting:

1. Streaming LM architecture produces flat RTF scaling. Pocket TTS's RTF is 0.69 to 0.76 across the entire text length range. Because it emits audio tokens autoregressively at a steady rate, cost is linear in output length with no fixed overhead to amortize. Compare to Kokoro PyTorch, which climbs from 0.49 on tiny to 0.83 on long inputs, or Supertonic which goes the other way (0.36 on tiny down to 0.20 on medium) because of high per-call fixed overhead. If you're budgeting worst-case latency for an interactive system, flat is worth a lot.

2. UTMOS has a known failure mode on small vocoders. Inflect-Nano-v1 scored 3.48, which reads mid-pack. By ear it's buzzy and robotic. This is a documented issue: UTMOS rewards HiFi-GAN outputs for being clean even when they lack prosodic naturalness. Pocket TTS scored similarly (4.10) but sounds legitimately natural. The point isn't that UTMOS is broken, it's that a single quality number can't distinguish "clean and mechanical" from "clean and natural" on small models. Worth pairing with human listening or a naturalness-specific metric like NISQA.

3. Inflect-Nano has an undocumented ~15s output cap. The model config sets max_frames = 1400, which caps synthesis at ~14.93s regardless of input text length. Its RTF and throughput on long/paragraph/extended inputs are inflated because it's doing less work than the models it's compared against. Real comparison for that model is on tiny/short/medium only.

4. Kokoro ONNX vs PyTorch results reverse from the previous run. I ran an earlier version of this benchmark on AMD EPYC and PyTorch beat ONNX in aggregate. On this Xeon, ONNX is faster (0.641 vs 0.665). Same code, different silicon. AMD vs Intel kernel optimization differences at CPU inference are apparently real enough to flip the ranking. If anyone has replicated this on ARM I'd be curious.

Zero-shot voice cloning as a capability that doesn't fit the benchmark axes:

Pocket TTS can clone a voice from ~5 seconds of reference audio, zero-shot, on CPU. No other model in this field does this. I pinned it to a preset voice for the speed/quality comparison to be fair, so the cloning capability isn't reflected in the numbers. This is a real limitation of RTF-and-MOS-based comparisons: they can't capture capabilities that only one model has. Might want a separate speaker-similarity evaluation for a v2.

Limitations:

Single hardware platform
English only
UTMOS is one MOS predictor; NISQA or a listening panel would strengthen the quality claims
Voice cloning quality was not evaluated
No batched inference tested

Disclosure: The benchmark harness was written by an AI engineering agent (Neo) from a prompt I specified. I chose the methodology, validated the outputs, and reviewed the audio. Mentioning it because it's relevant to how you'd want to weight the code.

All code, raw CSVs (180 rows), MOS CSV (36 rows), and WAV samples are in the repo mentioned in the comments below 👇

Feedback on the protocol welcome, especially on the MOS methodology and what a proper voice-cloning eval would look like.

reddit.com

u/gvij — 14 hours ago

▲ 256 r/selfhosted+2 crossposts

Kyutai's Pocket TTS clones a voice from 5 seconds of audio, on CPU, under MIT. Benchmarked against Kokoro, Supertonic, and Inflect-Nano for Eng. TTS

Kyutai dropped Pocket TTS a bit ago and I've been sitting on it for a benchmark. Finally ran it head to head against the three CPU TTS models that have been getting attention (Kokoro 82M, Supertonic 3, Inflect-Nano-v1). 180 timed runs, 36 audio samples, objective MOS scores via UTMOS.

Short version: Pocket TTS is the slowest of the six configs I tested, and it's still the most interesting model in the field. Here's why.

What Pocket TTS actually is:

It's a ~100M param streaming language model that generates audio tokens over Kyutai's Mimi neural codec, then decodes to 24kHz. So instead of the usual acoustic-model-plus-vocoder setup, it's more like an autoregressive LLM but for audio. Token by token.

Two consequences of that architecture:

Latency is dead flat across text lengths. Its RTF is 0.69 to 0.76 whether you feed it 12 chars or 1712 chars. No fixed overhead to amortize. Compare with Kokoro PyTorch which climbs from 0.49 on tiny text to 0.83 on long text.
It streams. Which matters if you're building anything interactive.

Zero-shot voice cloning from 5 seconds. On CPU.

This is the headline feature. Hand it a 5-second reference clip of any voice and it speaks in that voice. Accent, timbre, pacing, even the mic character of the reference. No fine-tuning. No GPU. MIT license.

None of the other CPU-friendly models can do this at all. Kokoro and Inflect-Nano ship fixed voice sets, Supertonic same. If you want a user-supplied voice on a CPU box, Pocket TTS is currently in a category of one.

I ran the benchmark with Pocket TTS pinned to a preset voice (alba) for a fair speed/quality comparison. The cloning capability isn't in the numbers below because you can't benchmark it against models that don't have it.

Full results:

Config	Mean RTF	UTMOS MOS	Params	License
Supertonic 3 (2-step)	0.121	1.53	~99M	OpenRAIL-M
Inflect-Nano-v1	0.145	3.48*	4.6M	Apache 2.0
Supertonic 3 (5-step)	0.240	4.32	~99M	OpenRAIL-M
Kokoro 82M (ONNX)	0.641	4.44	82M	Apache 2.0
Kokoro 82M (PyTorch)	0.665	4.46	82M	Apache 2.0
Pocket TTS	0.714	4.10	~100M	MIT

Hardware: Intel Xeon 8272CL, 4 cores, 16GB RAM, no GPU. UTMOS is utmos22_strong, an objective MOS predictor, so it's not just my ears this time.

The Inflect-Nano asterisk: UTMOS gave it 3.48 but to the ear it's buzzy and robotic. Known UTMOS failure mode where it over-rates small HiFi-GAN vocoders for being clean rather than natural. Also it has a hard ~15 second output cap I discovered mid-benchmark, so its RTF on long inputs is inflated.

Practical picks:

Need voice cloning on CPU → Pocket TTS, no other option in this field
Fixed voice, highest quality → Kokoro 82M
Latency-critical with acceptable quality → Supertonic 3 at 5 steps
Tiny footprint for short utterances → Inflect-Nano-v1, if you can live with the buzz and the 15s cap
Prototyping only → Supertonic 3 at 2 steps

Two things worth calling out:

Pocket TTS install is genuinely painless. pip install pocket-tts, no CUDA build, no HuggingFace-repo-plus-sys.path wiring. Downloads weights on first load. The least fussy of the six.

The MIT license is a big deal. Kokoro is Apache 2.0 (also great). Supertonic is OpenRAIL-M with commercial restrictions. Pocket TTS being MIT means you can do essentially whatever with it commercially.

Repo with raw CSV (180 rows), all 36 WAV samples, and the benchmark script is in comments below 👇

If anyone here has run Pocket TTS voice cloning with a real reference clip, would love to hear how it holds up on different voice types (accented English, non-English, singing, etc). That's the next thing I want to test but I need a clean dataset.

u/gvij — 14 hours ago

▲ 5 r/LLMDevs

What breaks when you benchmark a brand-new model architecture with standard eval tooling (Qwen3.5 case study)

Wrote this up because the failure modes are more useful than the actual scores if you're building or running eval pipelines against brand new model releases.

Model was the recent Qwythos-9B, a Qwen3.5-9B based fine-tune, GGUF format, tested at Q4_K_M and Q8_0 on GSM8K, IFEval, and HumanEval using lm_eval harness against a llama.cpp server.

Things that broke or would have silently corrupted results:

Qwen3.5 models split reasoning content into a separate response field. If you don't start llama-server with --reasoning-preserve, the benchmark sees an empty response and every score tanks by 50 to 80 percent, with no error, just bad numbers that look plausible enough to publish.
IFEval has implicit dependencies (langdetect, immutabledict) that aren't listed anywhere visible. They surface as ModuleNotFoundError partway through a run that takes hours, which is a bad way to find out.
HumanEval's built-in lm_eval task expects the local-completions backend, not chat completions. Had to write a custom scorer: hit the chat API, strip thinking blocks, extract code, run it through the code_eval metric.
Loglikelihood based tasks (HellaSwag, ARC) were a dead end on three different paths. local-chat-completions doesn't support loglikelihood at all. local-completions expects an older OpenAI logprobs format that doesn't match what llama.cpp server returns now. The hf backend can't load this GGUF because transformers doesn't have qwen35 architecture support yet. All three documented as blocked rather than forced through with a workaround that might quietly break something else.

Actual scores, for context: GSM8K Q4 80.89% / Q8 84.31%, IFEval Q4 60.00% / Q8 66.00% (prompt strict), HumanEval 0% pass@1 on both quants. All at temp 0.0 for a controlled comparison, the model card recommends 0.6 for actual use so treat these as relative, not absolute.

If you're evaluating any very recent model release, budget time for tooling compatibility, not just the eval run itself. The gap between "the model works" and "the standard benchmark harness works with this model" is real and it grows with how new the architecture is.

Full scripts and logs in the repo, link in comments.

Disclosure: this was evaluated using Neo, an autonomous AI engineering agent, from a single prompt with no step by step instructions. I'm one of the founders and core contributors. Mentioning it because the debugging process described above is exactly what Neo did on its own.

u/gvij — 4 days ago

▲ 27 r/cursor+2 crossposts

Tested GLM 5.2 via BYOK on a real multi-file computer vision implementation task, here's what held up

GLM 5.2 has been getting attention (MIT, 1M context, ~$1/$4.2 per M on OpenRouter, benchmarks near Opus 4.8). The pricing made me curious whether it could handle real agentic work or just one-shot answers.

Ran it on a multi-file build: browser CV studio (TF.js + COCO-SSD), persistent tracking, line counting, FastAPI backend with video proxying, LLM-generated activity report. Frontend + backend + CV + LLM in one project.

What stood out about GLM 5.2 specifically:

It writes a planning doc first. In planning, it caught a subtle browser gotcha (canvas tainting from cross-origin video silently kills TF.js detection) and designed a backend proxy to solve it before writing any detection code. That's the difference between debugging for hours and not having the bug.

JSON contracts stayed consistent across the tracker, report panel, and backend system prompt over many rounds of edits. The 1M context isn't theatrical, it actually holds.

Verifies its own changes (runs production builds, checks routes) rather than declaring success.

Where it falls short: text-only (no native image input), trails on math and some non-English benchmarks.

For BYOK setup on OpenRouter: key + select z-ai/glm-5.2 + run. Billing stays on OpenRouter, no markup.

How I tested it: Used GLM 5.2 via OpenRouter as bring your own model inside Neo, an Autonomous AI Engineering agent that I'm working on.

Same agent harness, swap the model, see what changes. That's the methodology, not the pitch.

Repo link for CV Tracklab (project built using this exercise) and longer writeup in comments.

u/gvij — 7 days ago

▲ 138 r/TextToSpeech+4 crossposts

CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample

Ran three open-weight TTS models head to head on CPU. Intel Xeon, 4 cores, 15.6GB RAM, no GPU. Five configs, six text lengths from 12 to 1712 chars, 5 timed reps per cell after warmup, 150 timed runs total. Every audio output scored with UTMOS (utmos22_strong) so quality isn't just vibes.

Headline (lower RTF = faster, higher MOS = more natural):

Inflect-Nano-v1: RTF 0.1376, MOS 3.48 (over-rated, see below)
Supertonic-3 2-step: RTF 0.1781, MOS 1.53
Supertonic-3 5-step: RTF 0.3164, MOS 4.37
Kokoro-82M ONNX: RTF 0.5711, MOS 4.44
Kokoro-82M PyTorch: RTF 0.7865, MOS 4.45

Stuff worth flagging:

The fastest config is Inflect-Nano at 7.3x real-time, with 4.6M params. That's wild on its own, but UTMOS over-rates it. By ear it's buzzy with a metallic vocoder texture and flat prosody. Known UTMOS failure mode where small HiFi-GAN vocoders get rewarded for being clean rather than natural.
Inflect-Nano also has a hard ~15s output cap (max_frames=1400 in the acoustic model). It silently truncates anything longer, so its long-text RTF and throughput numbers are inflated since it isn't doing the full work. Fair comparison is only on inputs that fit inside the cap.
Supertonic 2-step is right behind it for speed but sounds robotic (MOS 1.53). Don't ship it.
Kokoro is the slowest of the three families by a wide margin, but it's the only thing that actually sounds human. Weirdly its RTF gets worse on longer text in both backends rather than amortizing down (PyTorch 0.60 to 0.99, ONNX 0.51 to 0.69).
On this CPU, Kokoro ONNX is meaningfully faster than Kokoro PyTorch (0.5711 vs 0.7865) while sounding identical (MOS matches to two decimals). The PyTorch path tops out at barely faster than real-time.
Supertonic 5-step is the practical sweet spot at MOS 4.37 and 3.2x real-time, if OpenRAIL-M works for you.

Full disclosure since people always ask: the benchmark was set up and run end-to-end by an AI coding agent we're building (Neo). All the code is in the repo.

Repo and writeup with audio embedded in the first comment.

u/gvij — 14 days ago

▲ 5 r/AI_Agents

Tested how long small models hold a fact across a conversation. The memory failure mode is a real problem for agents, and it's not what I expected.

If you're building agents on small or on-device models, this one's relevant: I measured how long three edge models hold a single fact as the conversation grows, and the way they fail is worse for agents than plain forgetting.

Setup was simple on purpose: inject one fact, pile on N turns of unrelated filler, ask for the fact. Three runs per depth, shuffled filler each time.

The failure mode: when an agent loses the fact, it doesn't guess wrong. It asserts it could never have known, "I don't have access to your personal information." But the fact is still sitting in context. For an agent that's supposed to carry user state across a session, this means it won't just drop a constraint, it'll confidently tell the user the constraint was never given. That breaks trust and it's painful to trace, because nothing actually errored.

The numbers, short version:

LFM2.5 (1.5B active MoE): longest memory, degrades gradually.
Gemma 4 E2B (~2B): solid then a sudden cliff around 8-10 turns.
Gemma 4 E4B (~4B): shortest memory of the three, breaks at 5 turns, but the strongest at instruction-following and keeping tool-call formats intact.

That last split is the interesting tension for agent builders. The model best at not breaking your tool schema was the worst at remembering what the user said. If memory and format-discipline really do trade off, you may want one model driving structured tool calls and a separate mechanism (retrieval, refreshed system state) holding the facts, rather than expecting one small model to do both.

Writeup with the full chart, per-depth breakdown, and the reproducible harness. Link in the comments below.

Curious if anyone running agent frameworks has hit the "you never told me" refusal in the wild, and how you worked around it.

reddit.com

u/gvij — 29 days ago

▲ 8 r/LLMDevs

I tested in-conversation memory on LFM2.5, Gemma 4 E2B and E4B. The biggest model forgot a fact from earlier in the chat first.

Ran a small, focused eval on three on-device models and the result was backwards from what I expected, so sharing the method and numbers.

The task: tell the model "my dog is named Pablo," then add N turns of unrelated filler (shuffled general-science Q&A), then ask "what is my dog's name?" Pass if the name comes back. Three runs per depth with different seeds so a single unlucky filler sequence doesn't decide the result. Break point = first depth where mean recall drops below 0.80. Depths went 1, 3, 5, 8, 10, 15, 20, 30 with an adaptive stop once a model flatlined.

Models:

LFM2.5-8B-A1B (Liquid AI, MoE, ~1.5B active)
Gemma 4 E2B (~2B dense)
Gemma 4 E4B (~4B dense)

Results:

LFM2.5 broke at 8 turns and faded slowly, still pulling 1/3 correct at depth 15. Last survivor.
E2B broke at 8 too, but cliffed: perfect through 5, then zero by 10.
E4B broke at 5, the earliest, and was a clean zero by 8. The largest model had the shortest memory.

The interesting part: none of them confabulated a wrong name when they failed. All three said some version of "I don't have access to your personal information, so I can't know your dog's name." The fact was right there in the context window. It's not forgetting, it's the model concluding the info could never have been there. Same phrasing across all three, from two different labs, which makes me think it's a safety/instruction-tuning artifact rather than an architecture thing.

Also worth noting: E4B was the worst at memory but the best at instruction adherence and tool-call format retention in the same suite. Made me wonder if memory and format-obedience are competing for the same attention budget, since instructions usually live in the most recent turns.

Three data points, so I'm not claiming the tradeoff is law. But the failure shapes were consistent and reproducible.

If you want the receipts: the writeup has the full chart, the per-depth run-by-run tables (every pass/fail at every depth), the exact failure quotes, and the harness so you can rerun it on your own models. Link is in the comments below. 👇

The eval itself was built and run by Neo AI Engineer, but the method is simple enough to reproduce by hand if you'd rather. Curious whether anyone has seen the "I don't have access to your personal info" refusal show up on larger models too, or if it's specific to the small/edge tier.

u/gvij — 29 days ago

▲ 2 r/LanguageTechnology

TTS source selection as a confound in ASR evaluation - a practical note from a Parakeet CPU benchmark

A methodological finding from a recent benchmark that might be useful for others building ASR evaluation pipelines.

We evaluated nvidia/parakeet-tdt-0.6b-v3 on CPU-only hardware using Harvard sentences as reference text, with two different TTS generators to produce the test audio. The WER difference between them was 20.9% vs 4.65% — on the same model, same weights, same reference text.

espeak-ng produced robotic synthetic speech that mispronounced several words outside typical English phoneme patterns: "zest", "zestful", and "tacos al pastor". These errors were consistent across both inference backends we tested (HF Transformers bfloat16 and ONNX Runtime FP32), confirming the confound is in the audio generator rather than the model.

gTTS produced more natural prosody and pronunciation, bringing WER to 4.65% — consistent with NVIDIA's reported performance on natural speech corpora.

This is a known issue in the ASR evaluation literature but easy to overlook in practice when you reach for espeak-ng because it's offline and dependency-free. The cleaner approach is to treat TTS source as an explicit variable in your evaluation design and report it alongside your WER numbers.

For this benchmark, inference path also mattered: ONNX Runtime FP32 ran at RTF 0.328 vs HF Transformers bfloat16 at 0.519 on 2 CPU cores — a 37% throughput difference attributable to operator fusion in the ONNX execution provider.

Full methodology, scripts, and raw results link in comments below.

Disclosure: this benchmark was run using Neo, a local AI engineering agent inside Claude Code via MCP. The TTS source selection and runtime choice came from its pre-execution research phase.

reddit.com

u/gvij — 1 month ago

▲ 6 r/deeplearning

ONNX Runtime vs HF Transformers for transformer ASR on CPU - 37% RTF gap and what causes it

Quick practical finding for anyone deploying transformer-based ASR models on CPU without a GPU.

Benchmarked nvidia/parakeet-tdt-0.6b-v3 (FastConformer-TDT, 0.6B params) on a 2-core CPU box (AVX2/FMA, 7.7GB RAM) across three inference paths:

Inference path	RTF	Peak Memory	CPU utilization
HF Transformers bfloat16	0.519	~430MB delta	—
ONNX Runtime FP32 (onnx-asr)	0.328	2,667MB	49.9%
GGUF Q6_K (parakeet.cpp)	0.708	928MB	99.8%

The 37% RTF gap between ONNX and HF Transformers on CPU comes down to a few things: ONNX Runtime's execution provider uses operator fusion that collapses attention + layer norm + activation sequences into single optimized kernels, and its CPU backend is more aggressive about using AVX2/FMA intrinsics than PyTorch's generic CPU path. The FP32 vs bfloat16 precision difference goes against ONNX here — it should be slower — which makes the RTF advantage more meaningful.

GGUF Q6_K via parakeet.cpp is compute-bound (99.8% CPU) rather than memory-bound, which explains why it's slower despite the quantization reducing model size. The 6-bit dequantization overhead on every matmul adds up without the kernel fusion that ONNX Runtime provides.

Memory tradeoff is real: ONNX FP32 peaks at 2.7GB, GGUF Q6_K at 928MB. For edge deployment or memory-constrained inference, GGUF wins on footprint. For sustained throughput on a box with available RAM, ONNX is faster and leaves 50% CPU headroom for concurrent workloads.

Also worth noting: test audio quality had a larger effect on WER than runtime choice. espeak-ng inflated WER to 20.9% on inputs where gTTS got 4.65% — both runtimes got identical WER within each run, isolating the audio generator as the variable.

Repo with scripts, raw JSON results, and evaluation setup link in comments below.

Disclosure: this benchmark was run using Neo, a local AI engineering agent inside Claude Code via MCP. The ONNX runtime choice and audio selection came from its pre-execution research phase rather than prior knowledge on my end.

reddit.com

u/gvij — 1 month ago

▲ 7 r/MachineLearning

Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]

Sharing a small CPU inference benchmark for nvidia/parakeet-tdt-0.6b-v3 that turned up a result I didn't expect going in.

Setup: 2 x86-64 vCPUs (AVX2/FMA), 7.7GB RAM, no GPU. Test audio: 16.78s Harvard sentences at 16kHz mono.

Results:

Inference path	RTF	Peak Memory	CPU utilization
HF Transformers bfloat16	0.519	~430MB delta	—
ONNX Runtime FP32 (onnx-asr)	0.328	2,667MB	49.9%
GGUF Q6_K (parakeet.cpp)	0.708	928MB	99.8%

ONNX Runtime is 37% faster than HF Transformers bfloat16 on this hardware. The gap comes from operator fusion and AVX2-optimized execution providers in ONNX Runtime that the PyTorch CPU path doesn't exploit as aggressively. Memory cost is the tradeoff — FP32 weights load at ~2.7GB peak.

GGUF Q6_K trades throughput for memory efficiency. 928MB peak vs 2.7GB, but RTF doubles and CPU utilization hits 99.8%. For memory-constrained deployments it's the right call. For sustained throughput on a box with headroom, ONNX wins.

One methodological note worth flagging for anyone doing ASR benchmarking with synthetic audio: espeak-ng inflated WER to 20.9% on a sentence set where gTTS got 4.65%. Both runtimes got identical WER within each run, confirming it's the TTS distribution mismatch rather than model or quantization quality. NVIDIA reports 1.93% on LibriSpeech — the gTTS number is a much more honest CPU-only proxy.

Github repo with code, raw results, and evaluation scripts in comments below.

Disclosure: benchmark was run using Neo, a local AI engineering agent inside Claude Code using its MCP. Mentioning because the runtime and audio choices came from its research phase, not prior knowledge on my end.

reddit.com

u/gvij — 1 month ago

▲ 9 r/speechtech

CPU inference benchmarks for Parakeet TDT 0.6B - ONNX Runtime vs HF Transformers vs GGUF, and why your test audio generator tanks your WER

Did a CPU-only evaluation of nvidia/parakeet-tdt-0.6b-v3 and ran into two things worth sharing for anyone building ASR evaluation pipelines.

Hardware: 2 x86-64 vCPUs (AVX2/FMA), 7.7GB RAM, no GPU.

Finding 1: ONNX Runtime is significantly faster than HF Transformers on CPU

Inference path	RTF	Peak Memory	CPU utilization
HF Transformers bfloat16	0.519	~430MB delta	—
ONNX Runtime FP32 (onnx-asr)	0.328	2,667MB	49.9%
GGUF Q6_K (parakeet.cpp)	0.708	928MB	99.8%

ONNX Runtime runs at RTF 0.328 vs 0.519 for the HF Transformers path — 37% faster on identical hardware. Operator fusion and AVX2-optimized kernels make a real difference when there's no GPU to absorb the slack. The tradeoff is RAM: ONNX FP32 peaks at ~2.7GB loading full weights.

GGUF Q6_K is the right call if you're memory-constrained — 928MB peak, nearly identical accuracy — but it pegs both CPU cores at 99.8% and runs at roughly 2x the RTF of ONNX.

Finding 2: espeak-ng is a bad choice for ASR benchmarking

This one cost me a run. Using espeak-ng as the TTS source for test audio inflated WER to 20.9% on Harvard sentences that should be straightforward for this model. NVIDIA reports 1.93% WER on LibriSpeech. The gap is not the model.

espeak-ng mispronounces words like "zest", "zestful", and "tacos al pastor" in ways that sit far outside Parakeet's training distribution. Both inference backends got identical WER within the same run — confirming it's the audio generator, not the runtime.

Switching to gTTS brought WER to 4.65% on the same reference text. Still not LibriSpeech quality but a much more honest proxy for real speech. For CPU benchmarking where you're generating synthetic test audio, gTTS is worth the extra step.

Repo with scripts, raw JSON results, and evaluation setup link in comments below.

Curious if others have run into the espeak-ng WER inflation issue or found better synthetic audio options for ASR eval.

Disclosure: this benchmark was run using Neo, an AI engineering agent that runs locally inside Claude Code via MCP. The ONNX and gTTS decisions came out of its pre-execution research phase rather than from my own upfront knowledge - worth mentioning since it affected the methodology.

reddit.com

u/gvij — 1 month ago

▲ 16 r/PromptEngineering

"Skills" packs are dominating GitHub trending. Are they actually prompt engineering, or just packaging?

I went and read three of the trending "skills" repos for Claude Code looking specifically at what's prompt-engineering-novel inside them.

The repos:

forrestchang/andrej-karpathy-skills (~70k stars): one CLAUDE.md, four behavioral rules, derived from a Karpathy tweet
mattpocock/skills (~115k stars): ~10 single-purpose SKILL.md files
affaan-m/everything-claude-code (~175k stars): 182 SKILL.md files plus 48 agent definitions, hooks, rules, MCP configs

What's actually in them, prompt-wise:

The karpathy file is four imperative behavioral rules. No CoT scaffolding, no few-shot examples, no role definition, no structured output spec. Just declarative principles like "don't make silent assumptions, surface inconsistencies, present tradeoffs." It works because the model is now good enough that imperative behavioral instructions stick. Five years ago this would have been a non-starter and we'd have written elaborate few-shot examples. Now four sentences gets you 70k stars.

mattpocock's skills are procedural workflows in markdown prose. The tdd skill walks through red-green-refactor steps. to-issues describes how to slice a plan into independently-grabbable units. There's YAML frontmatter declaring when each skill should auto-fire, but the body is essentially what you'd write as a system prompt section, just modularized into discrete files.

ECC's skills look similar at the unit level but the system around them does more. YAML frontmatter with evaluation criteria, "instinct" files that track confidence scores per pattern, hooks that auto-extract patterns from sessions into new skills. Some of this is prompt-engineering-adjacent infrastructure (session memory, context-window management) rather than prompting per se.

So is this prompt engineering or packaging?

Honest answer, mostly packaging, with one real prompting innovation worth naming.

The packaging story is obvious. SKILL.md gave the community a unit of distribution. You can publish, fork, version, install. That makes prompts shippable in a way they weren't before, and that alone explains most of the trending list. None of these repos invented a new prompting technique.

The technique worth naming is conditional injection via frontmatter description matching. A SKILL.md's frontmatter description tells the harness when this skill should fire. The harness reads all installed skills' descriptions, decides which match the current task, and injects only those into context. So you can have a 182-skill catalog installed without paying 182 skills worth of tokens per turn. That's RAG-over-prompts using model-based routing on descriptions rather than vector embeddings. We've been doing this informally with system prompt sections for years, but standardizing it as the loading mechanism is genuinely new.

The bear case for prompt engineers specifically: if a four-sentence file derived from a tweet outperforms careful prompt construction, what are we doing? My read is that the model improvements collapsed a lot of the prompt-engineering surface area into "tell it clearly what you want," and skills survive as a packaging convention because they make that distributable, not because they're harder to write.

For people who do this for a living, are you still seeing returns on technique-heavy prompts (few-shot, CoT scaffolding, structured output, role chains), or is everything collapsing toward declarative behavioral instructions in markdown? Where are you getting the actual wins?

reddit.com

u/gvij — 1 month ago

▲ 3 r/AI_Agents

Half of GitHub trending AI repos are "skills" packs but the shape varies 1000x. The actual primitive is doing something real.

Three of the trending "skills" repos, read for what's inside:

karpathy-skills: 1 markdown file, 4 rules, ~70k stars
mattpocock/skills: ~10 SKILL.md files, ~115k stars
everything-claude-code: 182 SKILL.md files + 48 agents + 68 commands + hooks + rules + MCP configs + npm packages, ~178k stars (plus a second repo by the same author also in the top 10)

Three orders of magnitude in scope, same label.

What's underneath all of them is Anthropic's SKILL.md format. Markdown with YAML frontmatter, auto-loaded by Claude Code (and via shims by Cursor, Codex, OpenCode, Gemini, Antigravity) at session start. It's prompt-with-conventions. That's the actually interesting part.

For two years, the answer to "how do I make this agent better at X" was prompt engineering, manual context, glue code. SKILL .md is the first widely-adopted attempt to make a unit of agent capability publishable, forkable, and installable, with a defined invocation pattern (file gets auto-loaded, frontmatter declares when it fires). That's a real primitive even if the trending list around it is noisy.

Four implications I keep coming back to as a builder:

The packaging is the primitive, not the content. mattpocock's tdd skill and ECC's tdd-workflow skill solve the same problem with similar prose. The differentiation is whether you ship it in a format other people can compose with, not whether your wording is cleverer.
Cross-harness portability is leaky. ECC ships into .claude/, .cursor/, .codex/, .opencode/, .gemini/, .agent/, six paths, each with its own quirks (Cursor has 20 hook events vs Claude Code's 8, plugin distribution can't carry rules, OpenCode has a different plugin system). Write-once-run-anywhere is real-ish but you pay for it.
Workflow enforcement vs capability extension is a real split. Most of the agent-skills discussion in 2024-2025 was capability (tool use, browser, APIs). What's actually trending in mid-2026 is mostly workflow (TDD, triage, code review, planning). Different bet about where value lives.
The bear case. If your "skill" is a markdown file derived from a tweet and it has 70k stars, the moat is roughly zero. A startup whose differentiation is a skills pack should assume that pack gets copied, forked, or absorbed into the base model within 12 months. Karpathy's four rules will be in the default behavior of the next Claude release.

For people shipping production agents, are you treating skills as a real distribution primitive (publishing, versioning, dep-managing), or as personal scratch that occasionally gets pushed to GitHub? And does the cross-harness story hold up for you or do you end up forking per setup?

reddit.com

u/gvij — 1 month ago

▲ 54 r/LocalLLM+1 crossposts

A 26M parameter model beat Qwen3-0.6B on function calling, and the failure modes tell you why one-model-fits-all is the wrong frame for tool use

I've been thinking about how the "which LLM should I use for tool calling" question gets answered in most blog posts. Usually it's a leaderboard, sometimes BFCL, and you pick the highest one your budget allows. I ran a small benchmark this week that made me think this framing is wrong, or at least incomplete.

The setup: Needle 26M (Cactus-Compute, distilled from Gemini 3.1 specifically for function calling) vs Qwen3-0.6B (general-purpose, can also call tools). 50 queries across 5 difficulty tiers, on CPU, mock tools, three metrics per run (parse_success, tool_match, args_match).

The headline numbers are clean. Needle won 72% vs 56% overall and was 4.4x faster on CPU. That's the click-bait version.

The actually interesting thing is the failure modes are completely disjoint, and that should change how you architect the system.

Qwen3's failures are 100% parse failures. Every single one of its 22 missed queries was the model emitting natural-language prose instead of <tool_call> tags. When it did emit a call, args were perfect 100% of the time. So Qwen3 is the model that's reluctant to use tools but precise when it does.

Needle's failures are wrong-tool-selection. When it picks a tool, args are right 97% of the time. Its failure mode is picking search_web when you wanted run_command, or get_time when you asked it to check the current directory. It commits with confidence, sometimes to the wrong thing.

This means "fix" looks completely different for each. Qwen3 needs aggressive prompting to actually use tools (system message reinforcement, maybe constrained decoding). Needle needs better tool descriptions or a router layer that disambiguates ambiguous-tool-fit cases.

The tier breakdown is where I think the real lesson for builders lives:

Tier	Needle	Qwen3
Explicit ("what's the weather in London")	100%	100%
Paraphrased	90%	90%
Implicit ("should I bring an umbrella in Amsterdam")	80%	10%
Ambiguous (two tools could fit)	40%	20%
Edge (multilingual, no-tool trap)	50%	60%

T1 and T2 are saturated for both. If your benchmark only tests "what's the weather in X" patterns, you'll conclude these models are equivalent. They are absolutely not.

T3 is the killer. The query "should I bring an umbrella in Amsterdam today?" never says "weather." Needle, narrowly trained on intent-to-tool mapping, gets it 80% of the time. Qwen3 falls to 10%, it usually answers in prose, often apologizing for not having real-time data. This is the gap that matters in production, because users don't phrase queries the way your tool names are spelled.

The build-time takeaways I'm walking away with:

Pick the model based on user-query distribution, not benchmark averages. If your users phrase things explicitly ("translate this to French"), most small models work. If they phrase implicitly ("how do you say this in French"), the specialist beats the generalist by a lot.
Cascading dispatchers might be underrated. Needle is 13MB and fast. Qwen3 is 1.2GB and slower but conversational. A two-stage system (Needle for tool routing, Qwen3 for chat-or-fallback) probably beats either alone for an on-device assistant.
Look at raw outputs before trusting aggregate accuracy. Two engineering issues from the run that would have silently broken the numbers: Both would have silently degraded results if I'd only looked at top-line numbers.
- Needle scored 8% initially because I fed it OpenAI JSON Schema. It was trained on a flat schema and was literally echoing "properties" back as an argument value. Schema converter fixed it, jumped to 72%.
- Qwen3 was burning the full 256-token budget per query (~230s on CPU) because the hand-rolled prompt never produced EOS. Switching to tokenizer.apply_chat_template(tools=..., enable_thinking=False) gave a 6x latency drop and clean <tool_call> emission.
Per-tool accuracy matters. Needle was 100% on get_weather and get_time, but 50% on run_command. If you're shipping with a fixed tool palette, evaluate per-tool, not just overall. The aggregate hides where the model is actually weak.
Latency and accuracy don't trade off the way you'd expect on CPU. The smaller model was both faster AND more accurate on tool selection. The "small models are dumb but fast" intuition doesn't hold for narrowly-trained specialists.

Full code, both backends, raw 100-row log, summary JSON, charts in the comments below 👇

Limitations to be honest about: n=50 is small (paired bootstrap CIs are on my list), single CPU config, 5 mock tools so no chaining, T4's underspecified-args eval is relaxed. If anyone reproduces with a larger query set or real tools I'd love to see what shifts.

This evaluation was done using NEO, an AI engineering agent. It built the eval harness, handled the checkpointed runs, debugged the schema mismatch and the EOS issue, and consolidated results. I reviewed everything manually and made the calls on what to ship.

u/gvij — 1 month ago

▲ 8 r/LocalLLaMA

Benchmarked Needle 26M vs Qwen3-0.6B on CPU function calling, 50 queries across 5 difficulty tiers. The 23x smaller model wins on accuracy and is 4.4x faster.

Ran a head-to-head on two open-weight models for tool-calling on a 4-core CPU, no GPU, no cherry-picking. Wanted to see if the small specialist (Needle, 26M, distilled from Gemini 3.1 for function calls) actually holds up against a small generalist (Qwen3-0.6B) that also does tools.

Setup: 50 queries across 5 tiers (simple, paraphrased, implicit, ambiguous, edge cases including foreign language and a "don't call any tool" trap). 5 mock tools. Three metrics per run: parse_success, tool_match, args_match. Same queries, same eval rubric, same hardware.

Headline numbers:

                    Needle (26M)   Qwen3 (0.6B)
tool_match overall    72.0%          56.0%
parse_success         84.0%          54.0%
args_match | match    97.2%         100.0%
mean latency        10.9s          47.9s

The interesting part is not the overall win, it's the failure shapes. They diverge completely:

Needle fails by picking the wrong tool. When it does pick a tool, args are right 97% of the time. Its sin is selection, mostly routing system commands to search_web instead of run_command.
Qwen3 fails by not calling a tool at all. Every single one of its 22 misses is a parse failure where it answered in prose instead of emitting <tool_call> tags. When it does emit a call, args are perfect 100% of the time.

Tier breakdown is where it gets sharp. T1 and T2 (literal and paraphrased) are tied at ~95% each. T3 (implicit, like "should I bring an umbrella in Amsterdam?" where the tool name never appears) is where Qwen3 falls off a cliff: 80% to 10%. Needle just maps the intent. Qwen3 tries to be helpful in prose and apologizes for not having real-time data.

T5 (edge) is the only tier Qwen3 wins, by 10 pts. Hindi queries broke Needle's tokenizer (Devanagari fragments badly, one query timed out at 73s with garbled output). Qwen3 handled both Hindi and French cleanly.

One thing that almost killed the Needle run: first pass it scored 8% because I was feeding it OpenAI JSON Schema. Needle was trained on a flat schema ({location: {type, description, required}}) and was literally echoing the word "properties" back as an argument value. Wrote a converter, accuracy jumped from 8% to 72% with no other changes. Worth knowing if anyone else picks up the Needle weights.

Qwen3 had its own issue, it never emitted EOS on the hand-rolled prompt template and burned the full 256-token budget on every query (~230s each). Switching to tokenizer.apply_chat_template(tools=...) with enable_thinking=False dropped it to ~37s and the <tool_call> tags started appearing naturally.

My read: these are not the same product category even though they sound like they are. Needle is a dispatcher. Qwen3 is a tiny chatbot that can also call tools. If you want on-device single-shot tool routing with a fixed palette, Needle is genuinely good for 13MB. If you want any conversational ability, Needle has zero of it and Qwen3 wins by default.

Limitations: n=50 is small. Single CPU hardware. Mock tools, not real ones. Would love anyone who reproduces it on different hardware or with a paraphrase-stress-test to share results.

Repo with full code, raw_log.jsonl, summary.json, and the 5 charts are in comments below 👇

This evaluation was done using NEO, an AI engineering agent. It built the eval harness, handled the checkpointed runs, debugged the schema mismatch and the EOS issue, and consolidated results. I reviewed everything manually and made the calls on what to ship.

reddit.com

u/gvij — 1 month ago

▲ 38 r/TextToSpeech+2 crossposts

Benchmarked Kokoro 82M vs Supertonic 3 TTS on CPU

Wanted a real head to head on the two TTS models that actually run well on CPU. Couldn't find one with proper numbers, so I ran one. Posting because the result was not what I expected going in.

Quick context for anyone who hasn't seen Supertonic 3 yet: it's a flow-matching TTS where you can dial down inference steps to trade quality for speed. Default is 5 steps, "speed mode" is 2. Kokoro 82M everyone here knows by now.

Hardware: AMD EPYC 7763, 4 vCPUs, 16GB RAM, no GPU. Roughly comparable to a Ryzen 5600 or a decent N100 box.

Setup: 6 text lengths from 12 chars to 1712 chars, 5 runs each, 120 timed runs total. CUDA explicitly disabled. Warmup run discarded.

Mean RTF (lower is faster):

Supertonic 3, 2 steps: 0.165 (6.1x realtime)
Supertonic 3, 5 steps: 0.313 (3.2x realtime)
Kokoro 82M PyTorch: 0.469 (2.1x realtime)
Kokoro 82M ONNX: 0.509 (2.0x realtime)

Wall-clock latency on the medium text (196 chars, about 13 seconds of audio):

Supertonic 2-step: 1.82s
Supertonic 5-step: 3.67s
Kokoro PyTorch: 5.62s
Kokoro ONNX: 5.51s

Long and Extended text details in the Github Repo below.

Throughput in chars per second at steady state: Supertonic 2-step gets to ~111, Supertonic 5-step ~55, Kokoro hovers around 33 to 36 regardless of backend.

The quality side, which actually flips the ranking:

Supertonic at 2 steps is fast, but the audio is rough. Words slur, prosody is mechanical, not something I'd ship. At 5 steps it cleans up a lot and is genuinely usable. Kokoro at either backend still produces the most natural speech of anything I've tested in this size class. It's #1 on the TTS Arena leaderboard for a reason.

So the practical ranking is more like:

Want it to sound like a human → Kokoro, accept the slower speed
Want low latency for an assistant/chatbot → Supertonic 5-step is the sweet spot
Supertonic 2-step → demos and prototyping, that's it

Two things that surprised me:

Kokoro ONNX was slower than PyTorch on this CPU. I expected the opposite. ONNX wins on the longer texts but loses on tiny ones because of higher fixed overhead. Worth retesting on Intel hardware to see if it's an AMD thing.
Supertonic has way more fixed per-call overhead than Kokoro. RTF on tiny text is 0.30, on medium it drops to 0.13. Kokoro is much flatter across lengths. So if your workload is lots of short utterances, the gap between them narrows.

Detailed write up and Github Repo with all 24 audio samples, and the benchmarks are mentioned in comments below 👇

This evaluation of both TTS models was performed using Neo AI Engineer that built the eval harness, handled model runtime issues, and consolidated results. I reviewed everything manually.

If anyone has an N100 or a Pi 5 lying around and runs this, I'd love to see the numbers. That's the tier I actually want to deploy on.

u/gvij — 2 months ago

▲ 2 r/MachineLearning

Notes from evaluating a customer support chat agent system: heuristic evaluators give false signal, retrieval bugs masquerade as LLM failures, and the cost/quality Pareto frontier is rarely where you think [D]

Posting some practical findings from a structured audit of a production customer support RAG system. Methodology and caveats up front.

Methodology:

6 representative turns from a real production session as the eval set (small, acknowledged limitation)
LLM-as-judge using Claude Haiku 4.5, scoring relevance/accuracy/helpfulness/overall on 0-10, returning per-turn reasoning strings for verification
Same judge across all conditions, same questions, same retrieval state where possible
Production model held constant while isolating retrieval changes, then swept across 5 LLMs once retrieval was fixed
Live pricing from OpenRouter /models API rather than estimates

Findings:

Heuristic evaluation produces zero signal. The existing evaluator counted keywords and source references. Output was numerical but uncorrelated with response quality. LLM judges with explicit rubrics caught hallucinations, identified zero-retrieval turns, and produced reasoning that could be spot-checked. The cost is real but small (cents per run) compared to shipping undetected regressions.
Retrieval failures present as generation failures. A turn where the agent said "I don't have information about our company" looked like a model knowledge problem. Trace showed zero documents retrieved. Root cause was a similarity threshold (cosine distance 0.7 in Chroma) too strict for casual openers. Always inspect what entered the context window before tuning the generation step.
The production model was not on the Pareto frontier. Sweep across Gemini Flash Lite Preview (incumbent), Gemma 4 26B, Mistral Small 3.2, Nova Micro, and one more. Gemma 4 26B dominated the incumbent on both axes: higher quality scores (7.88 vs 7.33) at 75% lower cost. The incumbent was neither cheapest nor best.
Grounding constraints have measurable helpfulness cost. Adding "only state facts present in retrieved documents" to the system prompt improved accuracy scores and reduced helpfulness scores on turns where docs didn't fully answer the question. The judge consistently flagged "the documents don't specify this, contact support" responses as accurate but less actionable. Real tradeoff worth surfacing rather than discovering post-deployment.

Limitations I want to be honest about:

n=6 is small. Treat the deltas as directional, not as confidence intervals.
LLM-as-judge has known biases (length, verbosity, self-preference). Using a different family than the production models reduces but doesn't eliminate this. Sanity checked by reading the reasoning strings.
"Quality" here is judge-defined, not user-defined. A proper next step would be correlating judge scores with user satisfaction signals.

End-to-end delta: +19% quality, −79% cost. The cost win is robust because pricing is mechanical. The quality win I'd want to see replicated on a larger eval set before claiming it generalizes.

I've also written a detailed write up if anyone wants to go in depth on the evaluation process details. Mentioned below in comments 👇

reddit.com

u/gvij — 2 months ago

▲ 1 r/PromptEngineering

The system prompt change that improved accuracy and hurt helpfulness, and why I shipped it anyway.

Short post about a tradeoff I keep seeing teams stumble into.

I was auditing a RAG support bot. The original system prompt was friendly, vague, and let the model fall back on its own knowledge when the retrieved docs didn't fully answer a question. This was producing two failure modes:

One, hallucinated product names that weren't in the knowledge base.

Two, generic helpful-sounding advice that was technically off-policy because it wasn't grounded in the docs.

I rewrote the prompt with a grounding rule: only state facts that are present in the retrieved documents. If the docs don't cover it, say so and route to support.

What happened to the scores (LLM judge, 0-10 across relevance/accuracy/helpfulness/overall):

Accuracy went up. Hallucinations basically stopped.
Helpfulness went down on turns where the docs didn't fully answer the question. The judge correctly flagged "the documents don't specify this, contact support" as accurate but less actionable than the previous behavior.

The instinct here is to fix the helpfulness drop by softening the rule. Don't, at least not for a factual support bot. The previous behavior was creating compliance risk (off-policy advice) and customer trust risk (hallucinations). The accuracy gain is worth the helpfulness loss for this use case.

What I'd do differently if I were writing the prompt from scratch:

Be explicit about what to do when the docs don't cover the question. "Acknowledge the gap, restate what's known, route to human support" beats "say you don't know."
Add tone de-escalation language separately. The grounding rule and the tone rule are different jobs.
Remove boilerplate greetings. The original prompt was producing "Hello! Thank you for reaching out" on every turn including turn 5 of an ongoing conversation. Embarrassing and a clear signal nobody had tested multi-turn behavior.

Broader lesson I'd take to any prompt change: measure both the metric you're targeting and the one you might accidentally hurt. If I'd only looked at accuracy I would have called this a clean win. The helpfulness drop is a real cost. Better to know about it and ship consciously than discover it from a user complaint.

This chatbot was evaluated and optimized using Neo AI Engineer that built the eval harness, handled checkpointing through timeouts and context limit issues, and consolidated results. I reviewed everything manually

Full report in the comments if useful 👇

reddit.com

u/gvij — 2 months ago

▲ 91 r/LangChain+2 crossposts

Evaluated a RAG chatbot and the most expensive model was the worst performer. Notes on what actually moved the needle.

We had a customer support RAG bot. Standard setup: ChromaDB, system prompt, an LLM doing generation. Nobody had actually measured the response quality.

In the name of evaluation, I only had a keyword matching script producing numbers that looked like scores and meant nothing.

I went in to fix this properly. Sharing what I found because most of it was not where I expected.

1. Retrieval problems disguise themselves as LLM problems.

User asks "hey what do you guys do?" Bot says "I don't have access to specific information about our company's services." Everyone's first instinct is to tweak the prompt or swap the model. Wrong. The similarity threshold in ChromaDB was set to 0.7 (cosine distance, lower = more similar, so this is actually strict). Casual openers don't produce embeddings close enough to any chunk to pass that filter. Zero docs retrieved. The model was honestly reporting it had nothing.

Lesson: always log what context the LLM actually received before blaming generation. If retrieval returns nothing, no amount of prompt engineering fixes it.

2. Heuristic evaluators are worse than no evaluator.

Counting keywords and source references gives you a number. That number has no correlation with whether users are being helped. Worse, it gives you false confidence that you are measuring something. Bit the bullet and used an LLM judge (Claude Haiku 4.5 via OpenRouter) scoring relevance, accuracy, helpfulness, and overall on 0-10. Costs a few cents per full run. Cheap insurance.

3. Deduplicate chunks before sending to the model.

Two of our turns had three near-identical FAQ chunks in the context window. Added a check for >80% token overlap from the same source file. Cleaner context, fewer tokens, and the agent stopped hallucinating product names on one turn (probably because the noise was gone).

4. Stricter grounding trades helpfulness for accuracy.

Added a rule that the agent only states facts present in retrieved docs. Accuracy went up. Helpfulness went down on knowledge-gap turns because the bot started saying "the docs don't specify this, contact support" instead of guessing. This is the right call for a factual support bot but you need to make it consciously. Otherwise users complain the bot got worse even though your scores say it got better.

5. Run a model sweep. The defaults are usually wrong.

I was running Gemini 3.1 Flash Lite Preview. Swept 5 models against the same eval harness. Gemma 4 26B scored higher (7.88 vs 7.33) and cost 75% less per session. Mistral Small 3.2 close second. Nova Micro cheapest but terse responses got penalized for not being actionable.

The point is not that Gemma is the best model. The point is your production model is probably not on the Pareto frontier and you only find that out by measuring.

End to end: quality 6.62 to 7.88 (+19%), cost $0.002420 to $0.000509 per session (−79%). Both directions, same run.

This entire evaluation was done using Neo AI Engineer. It built the eval harness, handled checkpointed runs, dealt with timeout and context limit issues, and consolidated results. I reviewed everything manually and made the calls on what to ship.

Full walkthrough write up in the comments if anyone wants to replicate it on their own system. 👇

u/gvij — 2 months ago

▲ 6 r/SideProject

Built a router that stops you from hardcoding LLM model names

Small thing that we open-sourced this week. The problem it solves: you pick a model at the start of a project and that decision is basically frozen forever even though the model landscape changes constantly.

This router sits between your app and OpenRouter and picks the best model per request based on whatever you care about.

You pass a priority flag (speed / cost / quality / balanced) and it runs a weighted score across the whole catalogue to figure out the optimal pick. Decision latency is under 1ms.

Also handles fallback automatically if the selected model fails, caches repeated requests in Redis (or in-memory), and exposes a metrics endpoint so you can see p95/p99 latency and cache hit rates over time.

It's a FastAPI server with a CLI. The CLI has a dry-run mode that's kind of fun to play with, you can ask it "what would you pick for this prompt at speed priority" without actually spending any API credits.

Definitely not finished, quality scores are static right now which limits the adaptiveness. But it's been useful for my own projects so figured I'd share.

Github repo is in comments below 👇

Built this project using Neo AI Engineer.

reddit.com

u/gvij — 2 months ago