u/gvij

Benchmarked Kokoro 82M vs Supertonic 3 TTS on CPU
▲ 38 r/TextToSpeech+2 crossposts

Wanted a real head to head on the two TTS models that actually run well on CPU. Couldn't find one with proper numbers, so I ran one. Posting because the result was not what I expected going in.

Quick context for anyone who hasn't seen Supertonic 3 yet: it's a flow-matching TTS where you can dial down inference steps to trade quality for speed. Default is 5 steps, "speed mode" is 2. Kokoro 82M everyone here knows by now.

Hardware: AMD EPYC 7763, 4 vCPUs, 16GB RAM, no GPU. Roughly comparable to a Ryzen 5600 or a decent N100 box.

Setup: 6 text lengths from 12 chars to 1712 chars, 5 runs each, 120 timed runs total. CUDA explicitly disabled. Warmup run discarded.

Mean RTF (lower is faster):

  • Supertonic 3, 2 steps: 0.165 (6.1x realtime)
  • Supertonic 3, 5 steps: 0.313 (3.2x realtime)
  • Kokoro 82M PyTorch: 0.469 (2.1x realtime)
  • Kokoro 82M ONNX: 0.509 (2.0x realtime)
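
For anyone reproducing the numbers: RTF is synthesis wall-clock time divided by the duration of the audio produced, and the "x realtime" figure is its inverse. A minimal sketch of that arithmetic (the function names are mine, not from the benchmark repo):

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: seconds of compute per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def realtime_multiple(rtf_value: float) -> float:
    """Seconds of audio produced per second of compute (inverse of RTF)."""
    return 1.0 / rtf_value


# Mean RTFs reported in the list above.
for name, value in [
    ("Supertonic 3, 2 steps", 0.165),
    ("Supertonic 3, 5 steps", 0.313),
    ("Kokoro 82M PyTorch", 0.469),
    ("Kokoro 82M ONNX", 0.509),
]:
    print(f"{name}: {realtime_multiple(value):.1f}x realtime")
```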

Wall-clock latency on the medium text (196 chars, about 13 seconds of audio):

  • Supertonic 2-step: 1.82s
  • Supertonic 5-step: 3.67s
  • Kokoro PyTorch: 5.62s
  • Kokoro ONNX: 5.51s

Results for the long and extended texts are in the GitHub repo linked below.

Throughput in chars per second at steady state: Supertonic 2-step gets to ~111, Supertonic 5-step ~55, Kokoro hovers around 33 to 36 regardless of backend.

The quality side, which actually flips the ranking:

Supertonic at 2 steps is fast, but the audio is rough. Words slur, prosody is mechanical, not something I'd ship. At 5 steps it cleans up a lot and is genuinely usable. Kokoro at either backend still produces the most natural speech of anything I've tested in this size class. It's #1 on the TTS Arena leaderboard for a reason.

So the practical ranking is more like:

  • Want it to sound like a human → Kokoro, accept the slower speed
  • Want low latency for an assistant/chatbot → Supertonic 5-step is the sweet spot
  • Supertonic 2-step → demos and prototyping, that's it

Two things that surprised me:

  1. Kokoro ONNX was slower than PyTorch on this CPU. I expected the opposite. ONNX wins on the longer texts but loses on tiny ones because of higher fixed overhead. Worth retesting on Intel hardware to see if it's an AMD thing.
  2. Supertonic has way more fixed per-call overhead than Kokoro. RTF on tiny text is 0.30, on medium it drops to 0.13. Kokoro is much flatter across lengths. So if your workload is lots of short utterances, the gap between them narrows.
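
One way to read point 2: if latency is roughly a fixed per-call cost plus a cost per second of audio, RTF falls as utterances get longer. A small sketch of that model follows. The 0.3 s overhead and 0.11 slope are made-up illustrative values, not measurements from this benchmark.

```python
def modeled_rtf(audio_seconds: float, fixed_overhead_s: float, per_audio_second_s: float) -> float:
    """RTF under a simple 'fixed overhead + linear cost' latency model."""
    latency = fixed_overhead_s + per_audio_second_s * audio_seconds
    return latency / audio_seconds


# Illustrative only: the overhead dominates on short clips and fades on long ones.
for audio_seconds in (1.0, 5.0, 13.0, 60.0):
    print(audio_seconds, round(modeled_rtf(audio_seconds, 0.3, 0.11), 3))
```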

The detailed write-up and the GitHub repo, with all 24 audio samples and the benchmark data, are linked in the comments below 👇

This evaluation of both TTS models was performed using Neo AI Engineer, which built the eval harness, handled model runtime issues, and consolidated results. I reviewed everything manually.

If anyone has an N100 or a Pi 5 lying around and runs this, I'd love to see the numbers. That's the tier I actually want to deploy on.

u/gvij — 5 days ago

Notes from evaluating a customer support chat agent system: heuristic evaluators give false signal, retrieval bugs masquerade as LLM failures, and the cost/quality Pareto frontier is rarely where you think [D]

Posting some practical findings from a structured audit of a production customer support RAG system. Methodology and caveats up front.

Methodology:

  • 6 representative turns from a real production session as the eval set (small, acknowledged limitation)
  • LLM-as-judge using Claude Haiku 4.5, scoring relevance/accuracy/helpfulness/overall on 0-10, returning per-turn reasoning strings for verification
  • Same judge across all conditions, same questions, same retrieval state where possible
  • Production model held constant while isolating retrieval changes, then swept across 5 LLMs once retrieval was fixed
  • Live pricing from OpenRouter /models API rather than estimates

Findings:

  1. Heuristic evaluation produces zero signal. The existing evaluator counted keywords and source references. Output was numerical but uncorrelated with response quality. LLM judges with explicit rubrics caught hallucinations, identified zero-retrieval turns, and produced reasoning that could be spot-checked. The cost is real but small (cents per run) compared to shipping undetected regressions.
  2. Retrieval failures present as generation failures. A turn where the agent said "I don't have information about our company" looked like a model knowledge problem. Trace showed zero documents retrieved. Root cause was a similarity threshold (cosine distance 0.7 in Chroma) too strict for casual openers. Always inspect what entered the context window before tuning the generation step.
  3. The production model was not on the Pareto frontier. Sweep across Gemini Flash Lite Preview (incumbent), Gemma 4 26B, Mistral Small 3.2, Nova Micro, and one more. Gemma 4 26B dominated the incumbent on both axes: higher quality scores (7.88 vs 7.33) at 75% lower cost. The incumbent was neither cheapest nor best.
  4. Grounding constraints have measurable helpfulness cost. Adding "only state facts present in retrieved documents" to the system prompt improved accuracy scores and reduced helpfulness scores on turns where docs didn't fully answer the question. The judge consistently flagged "the documents don't specify this, contact support" responses as accurate but less actionable. Real tradeoff worth surfacing rather than discovering post-deployment.

Limitations I want to be honest about:

  • n=6 is small. Treat the deltas as directional, not as precise estimates.
  • LLM-as-judge has known biases (length, verbosity, self-preference). Using a different family than the production models reduces but doesn't eliminate this. Sanity checked by reading the reasoning strings.
  • "Quality" here is judge-defined, not user-defined. A proper next step would be correlating judge scores with user satisfaction signals.

End-to-end delta: +19% quality, −79% cost. The cost win is robust because pricing is mechanical. The quality win I'd want to see replicated on a larger eval set before claiming it generalizes.

I've also written a detailed write-up for anyone who wants to go deeper on the evaluation process. It's linked in the comments below 👇

u/gvij — 7 days ago

The system prompt change that improved accuracy and hurt helpfulness, and why I shipped it anyway.

Short post about a tradeoff I keep seeing teams stumble into.

I was auditing a RAG support bot. The original system prompt was friendly, vague, and let the model fall back on its own knowledge when the retrieved docs didn't fully answer a question. This was producing two failure modes:

One, hallucinated product names that weren't in the knowledge base.

Two, generic helpful-sounding advice that was technically off-policy because it wasn't grounded in the docs.

I rewrote the prompt with a grounding rule: only state facts that are present in the retrieved documents. If the docs don't cover it, say so and route to support.

What happened to the scores (LLM judge, 0-10 across relevance/accuracy/helpfulness/overall):

  • Accuracy went up. Hallucinations basically stopped.
  • Helpfulness went down on turns where the docs didn't fully answer the question. The judge correctly flagged "the documents don't specify this, contact support" as accurate but less actionable than the previous behavior.

The instinct here is to fix the helpfulness drop by softening the rule. Don't, at least not for a factual support bot. The previous behavior was creating compliance risk (off-policy advice) and customer trust risk (hallucinations). The accuracy gain is worth the helpfulness loss for this use case.

What I'd do differently if I were writing the prompt from scratch:

  • Be explicit about what to do when the docs don't cover the question. "Acknowledge the gap, restate what's known, route to human support" beats "say you don't know."
  • Add tone de-escalation language separately. The grounding rule and the tone rule are different jobs.
  • Remove boilerplate greetings. The original prompt was producing "Hello! Thank you for reaching out" on every turn including turn 5 of an ongoing conversation. Embarrassing and a clear signal nobody had tested multi-turn behavior.

Broader lesson I'd take to any prompt change: measure both the metric you're targeting and the one you might accidentally hurt. If I'd only looked at accuracy I would have called this a clean win. The helpfulness drop is a real cost. Better to know about it and ship consciously than discover it from a user complaint.

This chatbot was evaluated and optimized using Neo AI Engineer, which built the eval harness, handled checkpointing through timeouts and context limit issues, and consolidated results. I reviewed everything manually.

Full report in the comments if useful 👇

u/gvij — 7 days ago
▲ 91 r/LangChain+2 crossposts

Evaluated a RAG chatbot and the most expensive model was the worst performer. Notes on what actually moved the needle.

We had a customer support RAG bot. Standard setup: ChromaDB, system prompt, an LLM doing generation. Nobody had actually measured the response quality.

What passed for evaluation was a keyword-matching script producing numbers that looked like scores and meant nothing.

I went in to fix this properly. Sharing what I found because most of it was not where I expected.

1. Retrieval problems disguise themselves as LLM problems.

User asks "hey what do you guys do?" Bot says "I don't have access to specific information about our company's services." Everyone's first instinct is to tweak the prompt or swap the model. Wrong. The similarity threshold in ChromaDB was set to 0.7 (cosine distance, lower = more similar, so this is actually strict). Casual openers don't produce embeddings close enough to any chunk to pass that filter. Zero docs retrieved. The model was honestly reporting it had nothing.

Lesson: always log what context the LLM actually received before blaming generation. If retrieval returns nothing, no amount of prompt engineering fixes it.
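
A minimal sketch of that logging step in plain Python. It assumes your retriever hands back (chunk, distance) pairs where lower distance means more similar. `filter_and_log` and its arguments are hypothetical names for illustration, not ChromaDB API.

```python
import logging

logger = logging.getLogger("rag.retrieval")


def filter_and_log(query, scored_chunks, max_distance):
    """Keep chunks within max_distance and log what the LLM will actually see.

    scored_chunks: list of (chunk_text, distance) pairs, lower = more similar.
    """
    kept = [(text, dist) for text, dist in scored_chunks if dist <= max_distance]
    if not kept:
        # Surface the near misses so a too-strict threshold is obvious in the logs.
        closest = min((dist for _, dist in scored_chunks), default=None)
        logger.warning(
            "zero docs retrieved for %r (threshold=%.2f, closest=%s)",
            query, max_distance, closest,
        )
    else:
        logger.info("%d docs retrieved for %r", len(kept), query)
    return [text for text, _ in kept]
```

With a log line like this, the "hey what do you guys do?" turn shows up as a retrieval warning instead of a mysterious model answer.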

2. Heuristic evaluators are worse than no evaluator.

Counting keywords and source references gives you a number. That number has no correlation with whether users are being helped. Worse, it gives you false confidence that you are measuring something. Bit the bullet and used an LLM judge (Claude Haiku 4.5 via OpenRouter) scoring relevance, accuracy, helpfulness, and overall on 0-10. Costs a few cents per full run. Cheap insurance.

3. Deduplicate chunks before sending to the model.

Two of our turns had three near-identical FAQ chunks in the context window. Added a check for >80% token overlap from the same source file. Cleaner context, fewer tokens, and the agent stopped hallucinating product names on one turn (probably because the noise was gone).
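
Roughly what that check can look like. This is a sketch, not the harness code: it assumes whitespace tokenization and measures overlap as shared tokens divided by the smaller chunk's token count, both of which are my choices for illustration.

```python
def token_overlap(a: str, b: str) -> float:
    """Shared unique tokens divided by the smaller chunk's unique token count."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))


def dedupe_chunks(chunks, threshold=0.8):
    """Drop near-duplicates from the same source file; order is preserved.

    chunks: list of dicts with 'text' and 'source' keys.
    """
    kept = []
    for chunk in chunks:
        is_duplicate = any(
            chunk["source"] == other["source"]
            and token_overlap(chunk["text"], other["text"]) > threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(chunk)
    return kept
```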

4. Stricter grounding trades helpfulness for accuracy.

Added a rule that the agent only states facts present in retrieved docs. Accuracy went up. Helpfulness went down on knowledge-gap turns because the bot started saying "the docs don't specify this, contact support" instead of guessing. This is the right call for a factual support bot but you need to make it consciously. Otherwise users complain the bot got worse even though your scores say it got better.

5. Run a model sweep. The defaults are usually wrong.

I was running Gemini 3.1 Flash Lite Preview. Swept 5 models against the same eval harness. Gemma 4 26B scored higher (7.88 vs 7.33) and cost 75% less per session. Mistral Small 3.2 close second. Nova Micro cheapest but terse responses got penalized for not being actionable.

The point is not that Gemma is the best model. The point is your production model is probably not on the Pareto frontier and you only find that out by measuring.
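
Checking whether a model is on the frontier takes a few lines once you have (cost, quality) per model. In this sketch the two quality scores and the 75% cost gap come from the sweep above, with cost expressed relative to the incumbent. The third entry is a hypothetical placeholder to show a cheap, non-dominated option.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (lower cost, higher quality).

    models: list of (name, cost, quality) tuples.
    """
    frontier = []
    for name, cost, quality in models:
        dominated = any(
            other_cost <= cost
            and other_quality >= quality
            and (other_cost < cost or other_quality > quality)
            for _, other_cost, other_quality in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


models = [
    ("incumbent", 1.00, 7.33),           # relative cost 1.0, score from the sweep
    ("gemma-4-26b", 0.25, 7.88),         # 75% cheaper, score from the sweep
    ("hypothetical-cheap", 0.10, 6.50),  # placeholder, not a measured result
]
print(pareto_frontier(models))
```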

End to end: quality 6.62 to 7.88 (+19%), cost $0.002420 to $0.000509 per session (−79%). Both directions, same run.

This entire evaluation was done using Neo AI Engineer. It built the eval harness, handled checkpointed runs, dealt with timeout and context limit issues, and consolidated results. I reviewed everything manually and made the calls on what to ship.

Full walkthrough write up in the comments if anyone wants to replicate it on their own system. 👇

u/gvij — 8 days ago

Built a router that stops you from hardcoding LLM model names

Small thing that we open-sourced this week. The problem it solves: you pick a model at the start of a project and that decision is basically frozen forever even though the model landscape changes constantly.

This router sits between your app and OpenRouter and picks the best model per request based on whatever you care about.

You pass a priority flag (speed / cost / quality / balanced) and it runs a weighted score across the whole catalogue to figure out the optimal pick. Decision latency is under 1ms.

Also handles fallback automatically if the selected model fails, caches repeated requests in Redis (or in-memory), and exposes a metrics endpoint so you can see p95/p99 latency and cache hit rates over time.

It's a FastAPI server with a CLI. The CLI has a dry-run mode that's kind of fun to play with, you can ask it "what would you pick for this prompt at speed priority" without actually spending any API credits.

Definitely not finished, quality scores are static right now which limits the adaptiveness. But it's been useful for my own projects so figured I'd share.

Github repo is in comments below 👇

Built this project using Neo AI Engineer.

u/gvij — 11 days ago

Built a routing layer for multi-model pipelines, picks the right LLM per request based on priority

If you're building agents that chain multiple LLM calls, you've probably hit this: not every step in your pipeline needs the same model. A quick extraction step doesn't need Opus.

A final synthesis step probably shouldn't use Flash. But you still end up hardcoding something and hoping it works for all of them.

This router lets you set a priority flag per request (speed / cost / quality / balanced) and it picks the best model automatically using a weighted score.

Routing decision is under 1ms since it's pure math, no extra network hop. Auto-fallback if the selected model fails, Redis caching for repeated requests, metrics endpoint for p95/p99 latency per model.

Built on OpenRouter, so anything in their catalogue is fair game. Would be pretty easy to wire into an agent pipeline at the LLM call layer.

Github repo is in comments below 👇

Built this project using Neo AI Engineer.

u/gvij — 11 days ago
▲ 18 r/openrouter+1 crossposts

Built a router that picks the right LLM for each request automatically, under 1ms overhead

Been working on something that's been bugging me for a while. Every time I build something with LLMs, I end up hardcoding a model name and then spending weeks second-guessing that decision. Is gpt-4o overkill for this? Should I have used Haiku? What happens when this model has an outage?

So we built a router that handles all of this at the request level. You tell it your priority (speed, cost, quality, or balanced) and it scores every model in the catalogue using a weighted formula across latency, cost, and quality dimensions, then picks the best one. The whole scoring decision takes under 1ms because it's just math, no network call.

The weights look like this:

  • speed priority: 0.70 latency / 0.20 cost / 0.10 quality
  • cost priority: 0.20 / 0.70 / 0.10
  • quality priority: 0.10 / 0.20 / 0.70
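
To make the scoring concrete, here is a minimal sketch of a weighted pick using the three weight sets above. The catalogue entries, the min-max normalization, and the function names are assumptions for illustration, not the router's actual code. The balanced weights are not listed in this post, so they are left out.

```python
WEIGHTS = {  # (latency, cost, quality), as listed above
    "speed": (0.70, 0.20, 0.10),
    "cost": (0.20, 0.70, 0.10),
    "quality": (0.10, 0.20, 0.70),
}


def _normalize(values, higher_is_better):
    """Min-max scale to [0, 1] so that 1.0 is always the best value."""
    low, high = min(values), max(values)
    if high == low:
        return [1.0] * len(values)
    scaled = [(v - low) / (high - low) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def rank_models(catalogue, priority):
    """Return model names, best first, for the given priority."""
    w_latency, w_cost, w_quality = WEIGHTS[priority]
    latency = _normalize([m["latency_ms"] for m in catalogue], higher_is_better=False)
    cost = _normalize([m["cost_per_mtok"] for m in catalogue], higher_is_better=False)
    quality = _normalize([m["quality"] for m in catalogue], higher_is_better=True)
    scored = [
        (w_latency * lat + w_cost * cst + w_quality * qual, m["name"])
        for m, lat, cst, qual in zip(catalogue, latency, cost, quality)
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical catalogue entries, made up for the example.
catalogue = [
    {"name": "small-fast", "latency_ms": 200, "cost_per_mtok": 0.10, "quality": 0.60},
    {"name": "mid", "latency_ms": 600, "cost_per_mtok": 0.50, "quality": 0.80},
    {"name": "large-slow", "latency_ms": 1500, "cost_per_mtok": 5.00, "quality": 0.95},
]
print(rank_models(catalogue, "speed")[0])
print(rank_models(catalogue, "quality")[0])
```

Returning the full ranking rather than a single winner means the same list can double as the fallback order.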

It sits in front of OpenRouter, so you get access to the full catalogue. If the selected model fails, it falls back to the next best candidate automatically. Repeated identical requests hit Redis cache (or in-memory if you're not running Redis). FastAPI server with a CLI for dry-runs if you want to see routing decisions without burning tokens.
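
The fallback behaviour can be pictured as walking a ranked candidate list until one call succeeds. A sketch under that assumption: `call_model` stands in for whatever function actually hits the API and is not part of the project.

```python
def call_with_fallback(ranked_models, prompt, call_model):
    """Try each model in ranked order; return (model_name, response).

    call_model(model_name, prompt) is a caller-supplied function that
    raises an exception on failure.
    """
    errors = []
    for model_name in ranked_models:
        try:
            return model_name, call_model(model_name, prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append((model_name, exc))
    raise RuntimeError(f"all candidates failed: {errors}")
```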

Curious if anyone has tried something similar or has thoughts on the scoring approach. The quality scores are static right now which is the obvious weak point.

Github repo is in comments below 👇

This project was built using Neo AI Engineer. Evaluated by myself.

u/gvij — 11 days ago
▲ 101 r/LocalLLM+1 crossposts

Compared qwen3.6, qwen3-coder, and deepseek-coder on three coding benchmarks. All running locally on Ollama

Wanted to figure out which of the popular coding models actually deserves the disk space, so I built a small eval harness and ran four of them through three benchmarks: code generation, function calling, and multi step agent tasks.

Everything runs through Ollama on CPU. No GPU, no API keys, no cloud.

Headline numbers:

deepseek-coder:33b is the strongest at writing single functions (90% on code gen) but it cratered on agent tasks (10%). That actually fits with how it was trained. It's heavily fine tuned for code completion, optimized for producing correct code given a clear prompt. As soon as the task requires planning across steps and reasoning about intermediate outputs, that specialization stops helping. Same harness scored it 90% on code gen, so it's not a scoring issue, the gap is real.

qwen3.6:27b was the best balanced model across the board. 80% on code gen, 84% on tool selection, 100% on agent tasks. This is what I'd keep on disk if I could only pick one.

qwen3-coder:30b sits in the middle on everything. Solid but not exceptional.

qwen3.6:35b-a3b (the MoE) matched 27b on agent and tool tasks but lost some ground on raw code gen.

Two practical things if you reproduce this:

The default num_predict of 2048 is not enough for any reasoning model that emits <think> blocks. qwen3.6 burned its entire budget thinking and never reached the answer. I had to bump it to 8192 and strip the think tags before parsing.

Dense 27b models on CPU need a longer timeout than you'd guess. I set 1200s per task for qwen3.6 dense and 600s for the MoE models.
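
For the think-tag stripping, something along these lines works as a starting point. It is a sketch, not the harness code, and the handling of an unterminated block (treating everything after the opening tag as reasoning) is my assumption.

```python
import re

_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks; drop an unterminated block entirely."""
    text = _THINK_BLOCK.sub("", text)
    # If generation was cut off mid-thought, everything after <think> is reasoning.
    unterminated = text.find("<think>")
    if unterminated != -1:
        text = text[:unterminated]
    return text.strip()
```

An empty string coming back from this is a useful signal that the token budget ran out before the model reached an answer.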

This evaluation was done using Neo AI Engineer, which built the eval harness, handled checkpointed runs, timeout issues, context limit issues and consolidated the results. I manually reviewed the outcomes.

Code, full report, and per task JSON in the comments 👇

u/gvij — 15 days ago
▲ 17 r/artificial+1 crossposts

Open-sourced a new dataset generation tool: Synthetic Data Flywheel.

The feedback loop is the interesting part. After each cycle the pipeline pulls out the pairs that failed the quality filter and uses them as seeds for generation in the next cycle.

The idea is that the generator keeps being pushed toward examples the judge finds hard, so the dataset does not just accumulate easy cases.
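
The loop described above can be sketched in a few lines. This is an illustration of the idea, not the project's code: `generate` and `judge` are placeholders for real model calls, and reseeding from the failed pairs' instructions is my reading of "uses them as seeds".

```python
def run_flywheel(seeds, generate, judge, threshold, cycles):
    """Generate pairs, keep those that pass the judge, reseed from the failures.

    generate(seed) -> (instruction, output) and judge(pair) -> float are
    caller-supplied placeholders for real model calls.
    """
    accepted = []
    for _ in range(cycles):
        failed = []
        for seed in seeds:
            pair = generate(seed)
            if judge(pair) >= threshold:
                accepted.append(pair)
            else:
                failed.append(pair)
        if not failed:
            break
        # Instructions from the failed pairs become the next cycle's seeds.
        seeds = [instruction for instruction, _ in failed]
    return accepted
```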

The judge can be run locally with Ollama or through OpenRouter or Anthropic. You can also calibrate it against your own labels to get a sense of how much it agrees with human judgment before you trust it at scale.
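
As a rough picture of what calibration means here, the simplest check is how often the judge's pass/fail decision matches a human label. The sketch below is my own illustration under that assumption; the tool's actual calibration metric may differ.

```python
def judge_agreement(judge_scores, human_labels, threshold):
    """Fraction of items where (judge_score >= threshold) matches the human label.

    human_labels: booleans, True meaning a human accepted the pair.
    """
    if len(judge_scores) != len(human_labels) or not judge_scores:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(
        (score >= threshold) == bool(label)
        for score, label in zip(judge_scores, human_labels)
    )
    return matches / len(judge_scores)
```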

Fine-tuning is handled via an auto-generated Unsloth notebook, runs on a free Colab T4.

Github project link is in comments below 👇

u/gvij — 18 days ago

Built this because I wanted a reproducible way to build fine-tuning datasets without doing it all by hand.

You give it seed prompts or an existing dataset, it generates instruction-output pairs via any OpenRouter model, scores them with a local or remote LLM judge, and exports a clean JSONL you can use directly for training.

You can also ingest datasets straight from HuggingFace and filter or relabel them through the same pipeline.

The export step lets you set a score threshold and a train/val split ratio so what comes out is ready to use.
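
A minimal sketch of what such an export step does, assuming each scored pair is a dict with `instruction`, `output`, and `score` keys. The function name, field names, and file layout are illustrative assumptions, not the tool's actual interface.

```python
import json
import random


def export_jsonl(pairs, min_score, val_ratio, train_path, val_path, seed=0):
    """Filter by score, shuffle deterministically, and write train/val JSONL files.

    Returns (n_train, n_val).
    """
    kept = [p for p in pairs if p["score"] >= min_score]
    random.Random(seed).shuffle(kept)
    n_val = int(len(kept) * val_ratio)
    splits = {val_path: kept[:n_val], train_path: kept[n_val:]}
    for path, rows in splits.items():
        with open(path, "w", encoding="utf-8") as handle:
            for row in rows:
                record = {"instruction": row["instruction"], "output": row["output"]}
                handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(kept) - n_val, n_val
```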

MIT licensed, everything is stored locally, no data leaves your machine unless you choose a cloud judge backend.

Github project link is in comments below 👇

u/gvij — 18 days ago
▲ 33 r/LocalLLM+3 crossposts

OpenAI dropped Privacy Filter last month under Apache 2.0 and I wanted to see how it actually stacks up against the other serious open weight option for PII detection, GLiNER large-v2.1. Ran a full head to head on 600 labeled samples from ai4privacy (400 English, 200 across French, German, Spanish, Italian, Dutch).

The headline finding is that openai/privacy-filter is genuinely strong, but you'd never know it from a quick benchmark.

Here's why:

Openai/privacy-filter is a token classifier with a GPT style BPE tokenizer. BPE prepends a space to most tokens, so when you decode token boundaries back to character offsets, every span is off by one character compared to a human annotation. Score the model with strict exact span matching, which is the obvious first thing to do, and it looks much worse than it is. Almost every "miss" is actually a correct detection with a one character offset.

The numbers tell the story:

Model                   Strict F1   Boundary F1
GLiNER large-v2.1       0.367       0.416
openai/privacy-filter   0.155       0.498

The 0.34 strict-to-boundary gap for openai/privacy-filter is almost entirely a tokenizer artifact, not real misses. Once you score with boundary overlap (any character overlap with a gold span of the correct label), the model wins overall.
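
To show how a one-character offset swings the score, here is a small sketch of the two matching rules. Spans are (start, end, label) with an exclusive end. The greedy one-to-one matching in `f1` is my simplification and may differ from the scoring code in the repo.

```python
def strict_match(pred, gold):
    """Start, end, and label must all be identical."""
    return pred == gold


def boundary_match(pred, gold):
    """Same label and at least one character of overlap."""
    pred_start, pred_end, pred_label = pred
    gold_start, gold_end, gold_label = gold
    return pred_label == gold_label and pred_start < gold_end and gold_start < pred_end


def f1(predictions, golds, match):
    """F1 with greedy one-to-one matching of predictions to gold spans."""
    matched_gold = set()
    true_positives = 0
    for pred in predictions:
        for index, gold in enumerate(golds):
            if index not in matched_gold and match(pred, gold):
                matched_gold.add(index)
                true_positives += 1
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predictions)
    recall = true_positives / len(golds)
    return 2 * precision * recall / (precision + recall)


text = "Contact jane@example.com today"
gold = [(8, 24, "EMAIL")]  # "jane@example.com"
pred = [(7, 24, "EMAIL")]  # same span plus the leading space

print("strict:", f1(pred, gold, strict_match))
print("boundary:", f1(pred, gold, boundary_match))
```

The prediction covers the whole email address, yet strict matching counts it as both a false positive and a miss.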

Per category on boundary scoring (English):

  • EMAIL: openai 0.99, GLiNER 0.73
  • PHONE: openai 0.67, GLiNER 0.51
  • PERSON: openai 0.69, GLiNER 0.62
  • DATE: openai 0.27, GLiNER 0.26
  • ADDRESS: GLiNER 0.39, openai 0.37

EMAIL is essentially solved. 0.987 F1 in English, 1.000 across the multilingual set.

A few other things worth knowing if you're considering deploying it:

  • It's faster than GLiNER on CPU (~2.8 vs ~1.1 samples/sec) thanks to the MoE sparse activation. 1.5B total params but only 50M active per forward pass.
  • Multilingual performance is actually stronger than English on boundary scoring. Counterintuitive given the model card flags non-English as a risk, but the numbers are what they are.
  • The model is more conservative than GLiNER. Higher precision, lower recall. If you're building a redaction pipeline where missing PII is unacceptable, GLiNER's recall heavy profile may be a better fit. If false positives break downstream parsing, openai/privacy-filter wins.
  • It needs trust_remote_code=True and the dev branch of transformers right now. The model class hasn't landed in a stable release yet. Mildly annoying but not a blocker.
  • The eight categories are fixed (person, address, email, phone, url, date, account_number, secret). For anything outside that you'd need GLiNER's zero shot interface.

Two openai/privacy-filter categories (account_number and secret) had no equivalent gold labels in ai4privacy and were excluded from scoring. A finance or credentials heavy dataset would be needed to evaluate those.

Full writeup, code, predictions, and all CSVs are in the comments below 👇

Disclosure: I work on Neo AI Engineer, and the eval pipeline was built by Neo from a single prompt. I reviewed the methodology and validated the results before publishing. The numbers and findings stand on their own, happy to talk about the agent side separately if anyone's interested.

u/gvij — 22 days ago

Was going through AI Signals today, which tracks open issues across 300+ trending AI/ML repos. Pulled up openclaw (366k stars) and these four stood out. Sharing for anyone who wants to contribute or just wants to know what's coming.

1. Images sent through channels never reach the model.

On Discord, Telegram, Feishu, and OpenWebUI, when a user sends an image, the channel adapter strips it and passes only text to the model. Vision-capable models on the other end respond as if no image was provided, or hallucinate descriptions. The fix is extending the channel adapter interface to pass image URLs and base64 payloads through. Multiple issues are open on this; #23452 is the clearest summary.

2. WhatsApp and Telegram die permanently on a brief DNS blip.

The reconnect logic on both channel adapters only handles mid-session disconnects. If a DNS failure hits during initial connection, the error escapes the retry loop entirely and the channel exits with no further reconnect attempts. Gateway keeps running, channel is just dead. People are running external watchdog scripts to restart it. Issues #2198 and #13506.

3. No way to see what's holding the session lock.

The session store uses .jsonl.lock files and they get stuck regularly under large sessions, cron load, parallel agents, and config reloads. When it happens all models fail with session file locked (timeout 10000ms) and your only option is to manually kill the lock file or restart the gateway. The lock file already contains pid and createdAt but there's no CLI command to surface that. An openclaw locks command reading that metadata and showing active waiters would give operators something to actually work with. Issues #31489 and #11950.

4. No parallel coordination between agents.

Current session_spawn and send tools only support hierarchical delegation: one agent passes work down and waits. There's no way for multiple agents to collaborate on a task simultaneously or share state. Issue #43367 shows people trying to run parallel coding agents and hitting config overwrites and lock contention on top of the architecture limitation.

Issues and AI Signals link in the comments below 👇

u/gvij — 23 days ago
▲ 1 r/github

If you're building on the Copilot SDK and want to show users what agents or skills are available before they start a conversation, you're stuck. The only way to enumerate them right now is to create a full session first.

VS Code team flagged this in issue #1161 because they want to surface these in the UI pre-session. Makes sense. Feels like a pretty fundamental gap for anyone building tooling on top of the SDK.

SDK is in public preview so hopefully this gets prioritized. Anyone else running into this while building extensions or integrations?

Issue link in comments below.

u/gvij — 23 days ago
▲ 801 r/LocalLLM+2 crossposts

Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer.

Benchmarks used:

  • HumanEval: code generation
  • HellaSwag: commonsense reasoning
  • BFCL: function calling

Total samples:

  • HumanEval: 164
  • HellaSwag: 100
  • BFCL: 400

Results:

BF16

  • HumanEval: 56.10% 92/164
  • HellaSwag: 90.00% 90/100
  • BFCL: 63.25% 253/400
  • Avg accuracy: 69.78%
  • Throughput: 15.5 tok/s
  • Peak RAM: 54 GB
  • Model size: 53.8 GB

Q4_K_M

  • HumanEval: 50.61% 83/164
  • HellaSwag: 86.00% 86/100
  • BFCL: 63.00% 252/400
  • Avg accuracy: 66.54%
  • Throughput: 22.5 tok/s
  • Peak RAM: 28 GB
  • Model size: 16.8 GB

Q8_0

  • HumanEval: 52.44% 86/164
  • HellaSwag: 83.00% 83/100
  • BFCL: 63.00% 252/400
  • Avg accuracy: 66.15%
  • Throughput: 18.0 tok/s
  • Peak RAM: 42 GB
  • Model size: 28.6 GB

What stood out:

Q4_K_M looks like the best practical variant here. It keeps BFCL almost identical to BF16, drops about 5.5 points on HumanEval, and is still only 4 points behind BF16 on HellaSwag.

The tradeoff is pretty good:

  • 1.45x faster than BF16
  • 48% less peak RAM
  • 68.8% smaller model file
  • nearly identical function calling score
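
Those ratios follow directly from the BF16 and Q4_K_M figures in the results above. A quick check of the arithmetic:

```python
# Figures copied from the BF16 and Q4_K_M result blocks.
bf16 = {"tok_s": 15.5, "peak_ram_gb": 54, "size_gb": 53.8}
q4_k_m = {"tok_s": 22.5, "peak_ram_gb": 28, "size_gb": 16.8}

speedup = q4_k_m["tok_s"] / bf16["tok_s"]
ram_saving = 1 - q4_k_m["peak_ram_gb"] / bf16["peak_ram_gb"]
size_saving = 1 - q4_k_m["size_gb"] / bf16["size_gb"]

print(f"{speedup:.2f}x faster, {ram_saving:.0%} less peak RAM, {size_saving:.1%} smaller file")
```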

Q8_0 was a bit underwhelming in this run. It improved HumanEval over Q4_K_M by ~1.8 points, but used 42 GB RAM vs 28 GB and was slower. It also scored lower than Q4_K_M on HellaSwag in this eval.

For local/CPU deployment, I would probably pick Q4_K_M unless the workload is heavily code-generation focused. For maximum quality, BF16 still wins.

Evaluation setup:

  • GGUF via llama-cpp-python
  • n_ctx: 32768
  • checkpointed evaluation
  • HumanEval, HellaSwag, and BFCL all completed
  • BFCL had 400 function calling samples

This evaluation was done using Neo AI Engineer, which built the GGUF eval setup, handled checkpointed runs, and consolidated the benchmark results. I manually reviewed the outcome as well.

The complete case study with benchmarking results, approach, and code snippets is linked in the comments below 👇

u/gvij — 25 days ago
▲ 192 r/claude+4 crossposts

Ran a small head-to-head eval between Kimi K2.6 and Claude Opus 4.7 on 10 hard reasoning, coding, and analysis tasks.

Setup:

  • Kimi: moonshotai/kimi-k2.6
  • Opus: anthropic/claude-opus-4.7
  • Both via OpenRouter
  • Judge: GPT-5.4
  • A/B anonymized judging
  • 10 tasks total
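
For anyone unfamiliar with A/B anonymized judging: the judge sees the two responses as "A" and "B" in random order, so it cannot favour a model by name or position. A minimal sketch of that shuffling; the function names are mine and the eval's actual implementation may differ.

```python
import random


def anonymize_pair(response_x, response_y, rng):
    """Randomly assign two responses to slots 'A' and 'B'.

    Returns (slots, key): slots maps 'A'/'B' to response text for the judge,
    key maps 'A'/'B' back to 'x'/'y' so the verdict can be de-anonymized.
    """
    order = [("x", response_x), ("y", response_y)]
    rng.shuffle(order)
    slots = {"A": order[0][1], "B": order[1][1]}
    key = {"A": order[0][0], "B": order[1][0]}
    return slots, key


def resolve_winner(judge_verdict, key):
    """judge_verdict is 'A' or 'B'; returns 'x' or 'y'."""
    return key[judge_verdict]
```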

Results:

  • Kimi wins: 6
  • Opus wins: 4
  • Ties: 0
  • Avg judge score: Opus 8.0, Kimi 7.2
  • Avg latency: Opus 29.7s, Kimi 496.8s
  • Avg total tokens: Opus 3,561, Kimi 14,297

The interesting part is that Kimi won more tasks, but Opus had the higher average score.

Kimi was stronger on tasks where exhaustive reasoning and detailed coverage mattered. It won the Zebra puzzle, causal inference, Redis rate limiter, production memory leak debugging, autonomous vehicle ethics, and Alzheimer’s trial critique.

Opus was much faster, more concise, and more reliable. It won the St. Petersburg paradox, distributed ID generator, query optimization, and repeated duopoly game theory task.

Kimi also had two bad failure cases: one upstream JSONDecodeError from OpenRouter/Moonshot, and one response that spent around 21k completion tokens in reasoning but never emitted final content. Opus completed all 10 tasks cleanly.

My takeaway:

Kimi K2.6 is surprisingly strong when it completes properly, especially for deep reasoning and long-form implementation tasks.

But Opus 4.7 is much faster and more predictable. For interactive coding agents, Opus still feels safer. For slower offline evals or deep analysis, Kimi looks very interesting.

The eval was performed by Neo AI Engineer.

The complete breakdown of the evaluation, along with the approach, code, and prompts, is linked in the comments below 👇

This was a small eval, only 10 tasks, so don’t treat this as a full benchmark. But the result was interesting enough to share.

u/gvij — 25 days ago
▲ 43 r/Qwen_AI

Ran a serving benchmark on 8 small and mid-size models on a single H100 80GB to figure out which ones are actually worth running in production.

Setup:

- vLLM 0.19.1, vllm bench serve

- 100 prompts per run, 128 in / 128 out tokens

- Concurrency: 1, 4, 8, 16

- Metrics: throughput (tok/s) and TTFT (ms)

Throughput at c=16 (tok/s):

- Gemma 4 E2B-it: 3180

- Gemma 4 E4B-it: 2015

- Qwen 3.6 35B-A3B-FP8: 1243

- Gemma 4 26B-A4B-it: 1033

- Qwen 3.6 35B-A3B: 718

- Qwen 3.6 27B-FP8: 557

- Qwen 3.6 27B: 439

- Gemma 4 31B-it: 226

Three findings:

  1. Small expert models dominate. Gemma E2B hit 14x the throughput of Gemma 31B dense on the same GPU. TTFT under load: 55 ms vs 4.1 seconds. Architecture is eating parameter count for serving workloads.

  2. FP8 is a bigger win on MoE than dense. Qwen 35B-A3B FP8 vs BF16: +73% throughput. Qwen 27B dense FP8 vs BF16: +27%. MoE benefits more because expert weight movement through HBM is the bottleneck, and FP8 halves that traffic. For MoE on H100, FP8 should be the default now.

  3. Dense 30B-class models don't serve on a single H100. Gemma 31B dense TTFT goes from 130 ms at c=1 to 4159 ms at c=16. Treat it as a batch model, not a serving model.
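
The ratios quoted in the findings can be recomputed from the c=16 throughput list above:

```python
# tok/s at concurrency 16, copied from the throughput list.
throughput_c16 = {
    "gemma-4-e2b-it": 3180,
    "qwen3.6-35b-a3b-fp8": 1243,
    "qwen3.6-35b-a3b": 718,
    "qwen3.6-27b-fp8": 557,
    "qwen3.6-27b": 439,
    "gemma-4-31b-it": 226,
}

e2b_vs_31b = throughput_c16["gemma-4-e2b-it"] / throughput_c16["gemma-4-31b-it"]
moe_fp8_gain = throughput_c16["qwen3.6-35b-a3b-fp8"] / throughput_c16["qwen3.6-35b-a3b"] - 1
dense_fp8_gain = throughput_c16["qwen3.6-27b-fp8"] / throughput_c16["qwen3.6-27b"] - 1

print(f"E2B vs 31B dense: {e2b_vs_31b:.1f}x")
print(f"FP8 gain, 35B-A3B MoE: {moe_fp8_gain:+.0%}")
print(f"FP8 gain, 27B dense: {dense_fp8_gain:+.0%}")
```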

Who should use what (just my personal preference, you should run your own evals):

- Latency-sensitive chat: Gemma 4 E2B-it

- High-throughput batch: Gemma E2B-it, or E4B if you need more capability

- Quality + speed balance: Qwen 3.6 35B-A3B in FP8 (~1,200 tok/s)

- Skip dense 27B and 31B unless you have a specific reason

Disclosure: The complete experimentation setup, evaluation and analysis was performed end to end by Neo AI Engineer based on my initial task prompt and then I evaluated the final outcome manually.

u/gvij — 27 days ago

The workflow this solves: I want to contribute to open source, I check GitHub trending, I see what's popular, but I have no idea which of those repos has a contributor-friendly issue queue. So I open tabs, drill into Issues, scan for help-wanted labels, get tired, close everything.

This tool shows both axes in one view. Top 360 repos in AI/ML and SWE, sorted by stars / forks / 24h growth / momentum. Each row pulls live open-issue counts from GitHub split into features, bugs, and enhancements.

The pattern that emerges when you put both axes together:

  • Megaprojects (Linux, React, transformers) are popular but have tight issue queues. Hard to break in.
  • Stagnant repos have lots of open issues but no momentum. Your PR sits forever.
  • Mid-size rising repos with healthy issue counts are the actual contributor sweet spot. Visible work, responsive maintainers, real entry points.

This tool makes that third category easy to find.

A few examples from today's data:

  • openclaw: AI assistant repo, +572 stars in 24h, 913 open enhancements
  • everything-claude-code: agent harness, +1.1k stars in 24h, 145 open enhancements
  • ollama: +75 stars, 28 open issues, very active maintainer team

Project link is in the comments below 👇

Built by Neo AI Engineer. Posting here because the contributor-flow angle felt like a fit for this subreddit.

u/gvij — 29 days ago