r/AIToolsPerformance

Guardrails take an 8B model from 53% to 99% on agentic tasks - model size might not be the bottleneck

New results from Forge show that adding guardrails to an 8B parameter model pushed its agentic task performance from 53% to 99%. That is a 46 percentage point jump without changing the underlying model at all.

What makes this surprising is where the performance ceiling actually sits. The assumption in the local inference community has been that bigger models are needed for reliable agentic behavior - tool calling, multi-step reasoning, structured outputs. But if an 8B model can hit 99% with the right scaffolding, the bottleneck may not be model intelligence at all. It may be that small models know what to do but lack the discipline to do it consistently without external structure.

The implication is straightforward: if you are choosing between spending compute on a larger model versus investing in better guardrails and tooling around a smaller one, the guardrails route might deliver more reliable gains. That changes the economics of local agentic workflows considerably.
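To make "external structure" concrete, here is a minimal sketch of one common kind of guardrail: validate the model's tool call against a schema and, on failure, retry with the error fed back. The post does not say how Forge implements its guardrails, so the `TOOLS` table, `validate_call`, `call_with_guardrails`, and the stub model below are all hypothetical illustrations, not Forge's API.

```python
import json
from typing import Callable, Optional, Tuple

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {"get_weather": {"city": str}, "add": {"a": int, "b": int}}

def validate_call(raw: str) -> Tuple[Optional[dict], str]:
    """Return (parsed_call, "") if the output is a valid tool call, else (None, error)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict) or call.get("tool") not in TOOLS:
        return None, f"unknown tool; choose one of {sorted(TOOLS)}"
    args = call.get("args")
    if not isinstance(args, dict):
        return None, "missing 'args' object"
    for name, typ in TOOLS[call["tool"]].items():
        if not isinstance(args.get(name), typ):
            return None, f"argument {name!r} must be {typ.__name__}"
    return call, ""

def call_with_guardrails(model: Callable[[str], str], prompt: str, max_tries: int = 3) -> dict:
    """Re-prompt with the validation error until the output passes or tries run out."""
    feedback = ""
    for _ in range(max_tries):
        call, error = validate_call(model(prompt + feedback))
        if call is not None:
            return call
        feedback = f"\nYour last output was rejected: {error}. Reply with JSON only."
    raise ValueError("model never produced a valid tool call")

def make_flaky_model() -> Callable[[str], str]:
    """Stub standing in for a small local model: chatty once, then compliant."""
    replies = iter(["Sure! I will call add(2, 3)",
                    '{"tool": "add", "args": {"a": 2, "b": 3}}'])
    return lambda prompt: next(replies)

result = call_with_guardrails(make_flaky_model(), "Add 2 and 3.")
print(result)
```

The model is never changed here; only the loop around it is. That is the sense in which scaffolding can raise reliability without touching the weights.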

For people running agents locally: has your experience been that better prompting and guardrails matter more than model size for reliable tool use?

reddit.com
u/IulianHI — 2 days ago

llama.cpp MTP support landed - Qwen3.6 27B hits 2.44x speedup on Strix Halo, 2.17x on RTX 3090

New benchmarks confirm that Multi-Token Prediction speculative decoding has landed in mainline llama.cpp as of PR #22673 (commit 4f13cb7, merged May 16). Testing on Qwen3.6 27B in single-stream chat at temperature 0, median of 5 runs, shows substantial speedups across two different hardware configurations.

On the Strix Halo (Framework Desktop, ROCm 7.0.2), the Q4_K_M quant went from 11.7 tokens per second to 21.2 tokens per second - a 1.81x improvement. The Q8_0 quant saw an even more dramatic jump, from 7.4 to 18.1 tokens per second, which works out to the 2.44x peak speedup reported for this platform.

On the RTX 3090 rig, the speedup factor came in at 2.17x. These are real, usable gains for anyone running local inference with Qwen3.6 models.

What stands out is that the higher-precision Q8_0 quant saw a proportionally larger boost than Q4_K_M on Strix Halo. The lower baseline speed of Q8_0 means MTP has more room to accelerate, and the final speed ends up competitive with the smaller quant. That is a meaningful data point for people choosing between quantization levels - the gap between Q4 and Q8 narrows significantly once MTP is in play.
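The ratios can be checked directly from the tokens-per-second figures quoted above. This is plain arithmetic on the post's numbers; only the variable names are mine.

```python
# Tokens/s on Strix Halo as quoted in the post: (baseline, with MTP).
strix_halo = {"Q4_K_M": (11.7, 21.2), "Q8_0": (7.4, 18.1)}

# Speedup per quant = MTP speed / baseline speed.
speedups = {quant: mtp / base for quant, (base, mtp) in strix_halo.items()}
for quant, ratio in speedups.items():
    print(f"{quant}: {ratio:.2f}x")

# Q8_0 speed as a fraction of Q4_K_M speed, before and after MTP.
gap_before = strix_halo["Q8_0"][0] / strix_halo["Q4_K_M"][0]
gap_after = strix_halo["Q8_0"][1] / strix_halo["Q4_K_M"][1]
print(f"Q8_0 vs Q4_K_M: {gap_before:.0%} before, {gap_after:.0%} after")
```

On these numbers Q8_0 moves from roughly 63% of Q4_K_M's speed to roughly 85%, which is the narrowing gap described above.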

For anyone who has updated to the latest llama.cpp build: are you seeing similar speedup ratios with MTP on other models, or does this seem specific to Qwen3.6?

u/IulianHI — 3 days ago

open-source AI evaluation platform

The problem I kept seeing:

Companies are deploying AI agents into healthcare, legal, and finance. Their testing process is one developer asking it a few questions and saying "looks good."

The people who actually know what a correct answer looks like — doctors, lawyers, compliance officers — have zero tools they can use. Everything in the eval space requires Python, CLI setup, or JSON configs. Completely inaccessible to domain experts.

What I built:

EvalDesk — open source, self-hostable, no-code AI evaluation.

The workflow is three steps:

Designed specifically so a doctor or lawyer can use it without an engineer in the room. Self-hostable so sensitive data never leaves your infrastructure — critical for HIPAA and legal contexts.

Current features:

What I'm looking for:

Honest feedback. Is this solving a real problem or am I wrong about the gap? Anyone working in AI deployment in regulated industries — does this workflow actually match how your team operates?

GitHub: https://github.com/ramandagar/EvalDesk

u/Immediate-Tap-4777 — 5 days ago

Orthrus achieves up to 7.8x tokens per forward pass on Qwen3-8B with frozen backbone

A new approach called Orthrus claims up to 7.8x more tokens per forward pass on Qwen3-8B while keeping the model backbone completely frozen. The key detail: the output distribution is provably identical to the original model. That means no quality degradation - same results, significantly faster throughput.

The method works on Qwen3 variants including the 1.7B size. What makes this notable compared to typical speed optimizations is that the backbone stays frozen, so there is no fine-tuning or retraining of the base model required. The speedup comes purely from how tokens are processed during inference, not from cutting corners on model quality.

This is a different approach from quantization or speculative decoding. Those methods trade either precision or additional compute budget for speed. Orthrus appears to restructure the forward pass itself to produce multiple tokens per step while mathematically guaranteeing the same output distribution.

The code and paper are both publicly available. For anyone running Qwen3 models locally: does a provably lossless gain of up to 7.8x tokens per forward pass change the calculus on which model size you would deploy, or are there bottlenecks elsewhere in the pipeline that would limit real-world gains?

u/IulianHI — 6 days ago

RTX 5090 is the only GPU tier going up in price while everything else drops - what's driving this?

Someone tracked EU GPU prices across 15 stores over 50+ days at a 6-hour scrape cadence, collecting roughly 126,000 readings. The finding is straightforward but odd: RTX 5090 is the only tier where prices are increasing. Everything else is falling, with mid-range AMD cards seeing the steepest declines.

For local inference builders, this creates a weird decision point. The 5090 is the card most people want for large model workloads, but it is moving in the opposite direction of the rest of the market. The data suggests a clear tier divergence - high-end Nvidia demand is insulated or even growing, while the mid-range and AMD side are softening.

The question is whether this is purely AI workload demand keeping 5090 prices elevated, or if supply constraints and scalper dynamics are also in play. For people who have been watching GPU prices in their region: are you seeing the same pattern where the 5090 keeps climbing while everything else gets cheaper, and at what price point does the 5090 stop making sense compared to multi-card alternatives?

u/IulianHI — 7 days ago

A small model trained on its own mistakes hit 80% on HumanEval and beat GPT-3.5 on math

Here is a counterintuitive result: a small model that learned by training on its own errors reached 80% on HumanEval and outperformed GPT-3.5 on math benchmarks. No human-labeled data, no massive distillation from a larger teacher - just the model generating solutions, verifying them, and learning from where it went wrong.

The approach was inspired by a line in the DeepSeek-R1 paper about models improving through verifiable rewards. The insight was that if you have tasks where correctness can be checked automatically - code execution, math verification - you do not need human feedback or a bigger model to supervise. The model becomes its own teacher by attempting, failing, getting a signal about what failed, and updating accordingly.
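The attempt, verify, update loop can be shown with a deliberately tiny toy. This is not the project's training code: the "model" here is just a weighted choice over three hypothetical candidate programs, and the update is a simple multiplicative rule. It keeps the key property that the reward comes from running the attempt against tests, with no human label and no teacher model.

```python
import random

# Hypothetical candidate "solutions" for the task: add two integers.
CANDIDATES = {
    "a - b": lambda a, b: a - b,
    "a * b": lambda a, b: a * b,
    "a + b": lambda a, b: a + b,
}
# Automatically checkable test cases: (a, b, expected).
TESTS = [(1, 2, 3), (0, 5, 5), (-4, 4, 0)]

def verify(fn) -> bool:
    """Verifiable reward: pass only if every test case matches."""
    return all(fn(a, b) == expected for a, b, expected in TESTS)

def self_train(steps: int = 200, lr: float = 0.5, seed: int = 0) -> dict:
    """Sample an attempt, check it, and shift preference toward what passed."""
    rng = random.Random(seed)
    weights = {name: 1.0 for name in CANDIDATES}
    for _ in range(steps):
        names = list(weights)
        attempt = rng.choices(names, weights=[weights[n] for n in names])[0]
        if verify(CANDIDATES[attempt]):
            weights[attempt] *= 1 + lr   # reinforce a verified success
        else:
            weights[attempt] *= 1 - lr   # learn from the failure
    return weights

weights = self_train()
best = max(weights, key=weights.get)
print(best)
```

A real setup would replace the weight table with gradient updates to the model and the three lambdas with sampled code or math solutions, but the supervision signal has the same shape.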

What makes this surprising is the efficiency. The model is small, the mechanism is simple, and the results punch well above the model's weight class. It suggests that a lot of the performance gap between small and large models comes down to training methodology, not just parameter count.

The obvious question for local inference fans: if this self-training approach works this well for code and math, could it be applied to fine-tune small local models for specific domains without needing any external data or API calls?

u/IulianHI — 7 days ago

Web search for AI agents is hitting a wall - Google and Cloudflare are closing doors simultaneously

Two developments are converging to make web-based search problematic for AI tools. Google is shutting down its free search index, limiting the free tier to just 50 domains for site-specific search, with a cutoff date of January 1st, 2027 and no public pricing listed for advanced searches. At the same time, Cloudflare's new default setting challenges AI bots that attempt to scrape web content on its customers' sites.

The combination is brutal. One move restricts the primary search API that many tools rely on, and the other blocks the fallback approach of direct scraping. For anyone building agents or RAG systems that depend on real-time web information, the pipeline is narrowing fast.

What are people actually using for web search in AI workflows right now - are there viable alternatives to Google's index, or is the community moving toward curated knowledge bases and local data instead of live web retrieval?

u/IulianHI — 8 days ago

Kimi K2.5 vs K2.6 NVFP4 - Nvidia's quantized Moonshot models compared

Nvidia has released NVFP4 quantized versions of both Kimi K2.5 and Kimi K2.6, Moonshot AI's auto-regressive language models built on an optimized transformer architecture. Both quantizations use Nvidia's NVFP4 format, which targets efficient inference deployment.

The key distinction is the base model generation. Kimi K2.6 is the newer release, meaning any architecture improvements, training data updates, or performance gains in the upstream model carry through to the quantized version. The NVFP4 quantization itself is consistent across both, so the comparison really comes down to what Moonshot changed between K2.5 and K2.6 at the full-precision level.

What is worth noting is that NVFP4 is a relatively aggressive quantization. For people considering these models for local deployment, the question is whether the quality loss from 4-bit quantization is acceptable for their workload, and whether K2.6's architectural improvements are enough to compensate for or even exceed K2.5's performance at full precision.

The use case split would be: K2.6 NVFP4 if you want the latest architecture with efficient inference, K2.5 NVFP4 if you need stability and have already validated K2.5's output quality for your specific tasks.

For anyone who has run both: does K2.6 actually show measurable quality improvements over K2.5 at NVFP4 precision, or are the differences too small to notice in practical use?

u/IulianHI — 8 days ago

Someone is cooling a DGX system with tap water running Qwen3.5-122B at 18.77 tok/s

The setup: a DGX system running Qwen3.5-122b-a10B at Q6_K precision, 110GB memory usage, 80k context window, continuous vision analyses at 18.77 tokens per second. The cooling solution is tap water, keeping GPU temperatures below 68 degrees Celsius at 95% utilization.

What makes this notable is the contrast. DGX systems are enterprise-grade hardware with sophisticated cooling infrastructure designed for data centers. This person bypassed all of that for a garden-variety water supply and it is working. The unknown is longevity - they note uncertainty about how often the water needs changing.

The context is that Qwen3.5-122b-a10B is a MoE model where only 10B parameters are active per token, which is what makes this throughput reachable; the 110GB footprint comes from holding all 122B parameters at Q6_K. Even so, 18.77 tok/s with vision analysis at 80k context on a single system is a serious throughput number, and the cooling is the bottleneck being addressed here, not compute.
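A rough check of the memory figure: the footprint is set by the total parameter count and the quantization, while the active parameter count governs how much weight data is touched per token. The sketch assumes about 6.56 bits per weight for Q6_K and ignores that GGUF files mix tensor types, so treat the result as an estimate only.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

Q6_K_BITS = 6.5625  # assumption: nominal Q6_K bits per weight

total_gb = weight_memory_gb(122e9, Q6_K_BITS)    # all experts must be resident
active_gb = weight_memory_gb(10e9, Q6_K_BITS)    # weights read for one token

print(f"resident weights: ~{total_gb:.0f} GB")
print(f"weights touched per token: ~{active_gb:.1f} GB")
```

Under these assumptions the weights alone come to about 100 GB, which leaves the rest of the reported 110GB for things like the KV cache at 80k context.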

The fair question is whether this is a clever hack or a ticking time bomb for the hardware. Mineral buildup, corrosion, and microbial growth in an open-loop tap water system over weeks and months could degrade cooling performance or damage the hardware entirely.

For anyone running high-utilization inference on enterprise gear with unconventional cooling: what is the longest you have gone without issues, and did you treat the water at all?

u/IulianHI — 10 days ago

Needle distills Gemini tool calling into a 26M parameter model running at 1200 tok/s decode

A new open-source project called Needle has distilled function-calling and tool-use capabilities from Gemini down to a 26 million parameter model. The reported performance numbers are striking: 6000 tokens per second on prefill and 1200 tokens per second on decode, running on consumer devices.

The motivation behind the project was frustration with the lack of effort toward building agentic models that can run on budget phones. Rather than accepting that tool calling requires large models, the team investigated how small a model could be while still reliably handling function calling tasks. The answer turned out to be 26M parameters - tiny enough to run on hardware that would struggle with even a 1B model.

What makes this worth paying attention to is the implication for agent architectures. If tool calling can be offloaded to a model this small and fast, it changes how you think about the orchestration layer. You do not need your main reasoning model to also handle structured output formatting - a 26M model can parse intent into function calls at speeds that are essentially instant relative to the reasoning step.
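The split described above can be sketched as a two-stage pipeline. Both "models" below are stubs, and `REGISTRY`, `reasoner`, `tiny_tool_caller`, and `run_agent` are hypothetical names, not Needle's interface. The only point is the shape: the reasoning step emits plain intent, and a separate small component turns that into a structured call.

```python
import re
from typing import Callable, Dict

# Hypothetical tools the agent can invoke.
REGISTRY: Dict[str, Callable[..., str]] = {
    "set_timer": lambda minutes: f"timer set for {minutes} min",
}

def reasoner(user_msg: str) -> str:
    """Stand-in for the large reasoning model: decides what to do, in plain text."""
    return "set a timer for 10 minutes"

def tiny_tool_caller(intent: str) -> dict:
    """Stand-in for a small tool-calling model: intent text -> structured call."""
    match = re.search(r"timer for (\d+) minutes", intent)
    if match is None:
        raise ValueError(f"no tool matches intent: {intent!r}")
    return {"name": "set_timer", "args": {"minutes": int(match.group(1))}}

def run_agent(user_msg: str) -> str:
    intent = reasoner(user_msg)            # slow step: reasoning
    call = tiny_tool_caller(intent)        # fast step: structured formatting
    return REGISTRY[call["name"]](**call["args"])

output = run_agent("Remind me to check the oven in ten minutes")
print(output)
```

Keeping the formatting step separate also means it can be swapped or validated without touching the reasoning model.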

The open question is how well Needle handles edge cases compared to native tool calling in larger models. Are people finding that distilled tool-calling models maintain reliability across complex multi-tool workflows, or does accuracy fall off quickly once you move beyond simple single-function invocations?

u/IulianHI — 9 days ago

Intel Optane build runs 1T param Kimi K2.5 at 4 tok/s - is persistent memory viable for local inference?

Someone built a system using Intel Optane Persistent Memory that reportedly runs Kimi K2.5, a 1 trillion parameter model, locally at approximately 4 tokens per second. The build leverages Optane as its standout component, which is an unusual choice since Optane persistent memory modules have been largely discontinued by Intel.

The stat line is attention-grabbing - a trillion parameters locally at any speed is rare. But 4 tok/s is firmly in "readable but slow" territory, somewhat below typical human reading speed. The question is whether the cost and complexity of sourcing discontinued Optane modules makes sense compared to more conventional approaches like multi-GPU setups or even offloading to standard DDR5 RAM.
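To put 4 tok/s on a reading-speed scale, here is a quick conversion. It assumes roughly 0.75 English words per token, which is a rule of thumb rather than a property of this model's tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # assumption: rough average for English text

def tokens_per_s_to_wpm(tok_per_s: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a decode rate in tokens/s to words per minute."""
    return tok_per_s * words_per_token * 60

wpm = tokens_per_s_to_wpm(4)
print(f"4 tok/s is about {wpm:.0f} words per minute")
```

That figure can then be compared against whatever silent reading speed you consider typical.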

For anyone familiar with Optane-based inference builds: how does the random access performance of persistent memory actually compare to standard DDR4/DDR5 when running models this large, and is the used market for Optane modules still practical enough to recommend to someone considering a similar build?

u/IulianHI — 10 days ago

Tried 9 AI Tools Recently, Here’s What I Actually Still Use

Tried a lot of AI tools over the last few months, and honestly most of them were cool for like 10 minutes then I never opened them again.

These are the few I actually kept using consistently:

ChatGPT Pro – probably the tool I use the most overall. Mainly for brainstorming, fixing problems, rewriting stuff and random research. Still needs fact checking sometimes but huge time saver.

Claude – feels calmer and better for long explanations or writing. I use it more when I want cleaner structured answers.

Cursor – genuinely one of the best AI coding tools I tried. Feels much more useful than basic autocomplete because it actually understands your files and project structure.

Perplexity – replaced Google for a lot of quick searches honestly. Way faster when I just need an answer + sources without opening 15 tabs.

Canva AI – surprisingly useful for quick visuals, thumbnails and simple edits. Not perfect but saves a lot of time.

Kling AI – probably the AI video tool that impressed me the most recently. Prompt adherence is actually decent compared to a lot of other generators.

ElevenLabs – still probably the best sounding AI voices overall from what I tested.

Polyvoice – found it pretty useful for translating voice/video content into other languages without completely killing the original vibe of the audio.

Notion AI – not something I use daily, but useful when organizing notes, content ideas or summarizing things quickly.

Most AI tools honestly feel overhyped after a while, but a few actually become part of your workflow.

What AI tools do you guys actually use regularly?

u/Ethan_Builder — 9 days ago

80 tok/s and 128K context on 12GB VRAM - Qwen3.6 35B A3B with MTP changes the value of entry-level GPUs

A new configuration report shows Qwen3.6 35B A3B hitting over 80 tokens per second with 128K context on just 12GB of VRAM, using the latest llama.cpp build with the MTP PR. The reported draft acceptance rate is above 80%.

Why this matters: 12GB VRAM has been the budget tier for local inference for years - think RTX 3060 and 4070 territory. Getting a 35B parameter model (even a MoE with 3B active parameters) to run at 80+ tok/s with long context on that hardware significantly extends the useful life of these cards. The combination of MoE architecture keeping active parameters small, MTP speculative decoding accelerating generation, and quantization fitting everything into limited VRAM creates a compounding effect.
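For a feel of what an acceptance rate above 80% buys, here is a back-of-envelope estimate of tokens produced per verification pass. It assumes each drafted token is accepted independently with the same probability, and the draft length `k` is a hypothetical value; the report's acceptance figure may be measured differently, so this is an upper-level intuition, not a prediction.

```python
def expected_tokens_per_step(accept_prob: float, k: int) -> float:
    """Expected tokens emitted per target pass with k drafted tokens.

    Sum of accept_prob**i for i in 0..k: the i-th draft only counts if all
    earlier drafts were accepted, and the target always contributes one token.
    """
    return sum(accept_prob ** i for i in range(k + 1))

for k in (1, 2, 3):
    print(k, round(expected_tokens_per_step(0.8, k), 3))
```

The estimate ignores the cost of drafting itself, which is why measured speedups come in lower than the raw tokens-per-pass figure.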

The kicker is the 128K context. That is not a toy context window. It means real document processing, multi-file code analysis, and extended conversations are all feasible on hardware that costs under $300 used.

Fair question: Qwen3.6 35B A3B is available at $0.15/M tokens via API with 262K context, and an uncensored variant is now available with all 19 MTP heads preserved (KLD 0.0015). Is the local setup still worth the configuration effort for people who already have 12GB cards, or does the API pricing make local only worthwhile for privacy-sensitive workloads?

u/IulianHI — 13 days ago

Qwen3.6 35B-A3B MoE runs practically on just 12GB VRAM with IQ4_XS quant

New benchmarks show that Qwen3.6 35B-A3B, a Mixture-of-Experts model, is surprisingly usable on an RTX 3060 with only 12GB of VRAM. The setup uses the IQ4_XS GGUF quantization running on Windows with 32GB DDR4-3200 system RAM and CUDA 13.x.

The key detail is the -ncmoe parameter in llama.cpp. Since this is a MoE architecture, lowering the -ncmoe value keeps more MoE blocks on the GPU rather than offloading to system RAM. Tuning this setting makes a significant difference in performance on constrained VRAM setups.
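A small helper for sweeping that setting might look like the sketch below. It only builds command lines and does not run anything. The model filename, context size, and the candidate `-ncmoe` values are hypothetical placeholders; pick values to suit your own card and model.

```python
import shlex
from typing import List

def llama_server_cmd(model_path: str, n_cpu_moe: int, ctx: int = 32768) -> List[str]:
    """Build a llama-server command line for one -ncmoe setting."""
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(ctx),
        "-ngl", "99",               # offload all layers to the GPU first...
        "-ncmoe", str(n_cpu_moe),   # ...then keep expert weights of the first N layers on the CPU
    ]

# Hypothetical sweep: lower values keep more expert weights in VRAM.
for n in (40, 30, 20):
    print(shlex.join(llama_server_cmd("model-IQ4_XS.gguf", n)))
```

In practice you would lower the value until the model no longer fits in VRAM, then step back up by one or two.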

What is notable here: 12GB has been considered the bare minimum for running anything beyond small models locally. A 35B parameter model running within that budget - even as a MoE where only a fraction of parameters are active per token - changes the calculus on what hardware is actually needed for capable local inference. The A3B designation means only 3B parameters are active at any given step, which explains the usable speed; the fit itself comes from the IQ4_XS quantization plus offloading some expert weights to system RAM.

The model is also available in an uncensored variant with native MTP preserved, reporting a KL divergence of just 0.0015 with 10 out of 100 refusals and all 19 MTP heads intact - available in Safetensors, GGUF, NVFP4, and GPTQ-Int4 formats.

For anyone running this on similar low-VRAM hardware: what -ncmoe value are you settling on, and how is token throughput holding up at longer context lengths?

u/IulianHI — 13 days ago

NVIDIA Star Elastic packs 30B, 23B, and 12B reasoning models in one checkpoint with zero-shot slicing

NVIDIA released Star Elastic, a single checkpoint that contains 30B, 23B, and 12B reasoning models through what they call "zero-shot slicing." The idea is that you load one model file and can extract different sizes depending on your VRAM or speed requirements, rather than downloading separate checkpoints for each configuration.

The concept is being compared to scalable video coding, where one stream serves multiple quality levels. If it works as described, this could simplify local deployment significantly - one download, multiple usable model sizes depending on your hardware on any given day.

What stands out is that this reportedly went live 11 days ago but barely got traction. For a release from NVIDIA that directly targets local inference flexibility, that seems like surprisingly low visibility.

The open question is quality at each slice. A 12B model carved from a 30B checkpoint is not the same as a purpose-trained 12B model. The architecture presumably uses some form of elastic depth or width pruning, but the details are thin so far.

For anyone who has actually run the different slice sizes: how does the 12B and 23B reasoning quality compare to purpose-built models at those same sizes - is there a noticeable capability drop, or does the zero-shot slicing preserve enough to make it genuinely competitive?

u/IulianHI — 12 days ago

Someone debugged plane WiFi at 10km altitude using a local LLM on their laptop

Someone on a flight couldn't get their Ubuntu laptop to load the plane's captive portal - the WiFi connected but the login page wouldn't appear. The fix came from running Qwen 3.6 35B A3B locally, which diagnosed that systemd-resolved was using DNS settings that blocked the captive portal redirect.

That is a genuinely surprising use case for local inference. No cloud API, no internet connection needed - the model ran entirely on the laptop at 10km altitude and solved a networking issue that was preventing internet access in the first place. The circular dependency is what makes it interesting: you need the model to fix the problem that is preventing you from reaching the model.

The context here is that Qwen 3.6 35B A3B is a MoE architecture where only 3B parameters are active per token, which is why it can run at usable speeds on a laptop without dedicated GPU VRAM. It is exactly the kind of model that makes offline, on-device troubleshooting viable.

The implication is straightforward: local models are crossing from "nice to have" into "actually practical for real-time problem solving in situations where cloud is not available." A laptop fixing its own connectivity issue mid-flight is hard to argue with.

What is the most unexpectedly useful thing you have solved with a local model that you could not have done with a cloud API?

u/IulianHI — 12 days ago

AI tools organized by goals: startup, SaaS, business, TikTok, ecommerce, automation

If your goal is to build a startup or SaaS

* ChatGPT → ideation, MVP planning, UX copy, customer research synthesis

* Notion → product specs, roadmap, internal documentation

* Linear → clean issue tracking when you start shipping fast

* Stripe → simple way to start monetizing immediately

* Framer → fast landing pages without engineering bottlenecks

* Make → early-stage automations between tools without heavy backend work

* n8n → more advanced workflows if you need full control later

At the early stage, speed matters more than architecture.

If your goal is to scale a business internationally

* PolyVoice AI → translate and localize content to enter new markets faster

* ChatGPT → adapt messaging, ads, and positioning per country

* Notion → centralize strategy and market learnings

* Stripe → handle multi-country payments and scaling revenue streams

* Make / n8n → connect systems across regions and tools

International scaling is mostly about removing language + operational friction.

If your goal is to grow a TikTok account

* Kling AI → generate cinematic short-form videos quickly

* Midjourney → visuals, concepts, and creative direction

* Runway → AI video editing and effects

* ElevenLabs → realistic AI voiceovers

* PolyVoice AI → translate content to scale into multiple countries

* CapCut → fast editing for daily output

* Metricool → understand what actually performs

* ChatGPT → hooks, scripts, content angles, repurposing

The real bottleneck is consistent output, not ideas.

If your goal is to build an ecommerce brand

* Shopify → launch store quickly and iterate

* Klaviyo → email automation and retention

* Triple Whale → better visibility on ad performance

* Midjourney → product visuals and ad creatives

* Kling AI → video ads at scale

* Pika → animated product content

* ElevenLabs → UGC-style voiceovers

* PolyVoice AI → localize ads for international markets

* Loox → reviews and social proof

Modern ecommerce is basically creative testing at scale.

If your goal is to automate repetitive work

* Zapier → easiest entry point for automation

* Make → visual workflow automation

* n8n → advanced / self-hosted automation control

* Airtable → lightweight operational database

* Google Sheets → surprisingly powerful automation hub

If something repeats, it’s usually automatable.

u/Ethan_Builder — 12 days ago

Gemma 4 26B hits 600 tok/s on single RTX 5090 with DFlash - is MTP already obsolete?

A benchmark using vLLM 0.19.2rc1 shows Gemma 4 26B hitting 600 tokens per second on a single RTX 5090 (32GB VRAM) using DFlash speculative decoding. The setup pairs an AWQ 4-bit quant of the main model with the z-lab DFlash draft model, running a workload of 256 input tokens and 1024 output tokens.

What makes this worth discussing: DFlash uses parallel block diffusion drafting rather than the autoregressive approach behind MTP. The claim is that DFlash should be a better alternative to MTP specifically because of faster parallel drafting. And 600 tok/s on a single consumer GPU is a serious number for a 26B model.
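For readers new to the mechanism, here is a toy sketch of the autoregressive draft-then-verify loop that the post contrasts DFlash with, using greedy decoding. The two "models" are hypothetical stand-in functions over integer tokens, not vLLM or DFlash code. A parallel drafter would replace the drafting loop with a single call that proposes all `k` tokens at once; the verification side stays the same.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # context -> next token (greedy)

def speculative_step(target: Model, draft: Model, ctx: List[int], k: int) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, plus one target token."""
    proposed, tmp = [], list(ctx)
    for _ in range(k):                      # cheap autoregressive drafting
        tok = draft(tmp)
        proposed.append(tok)
        tmp.append(tok)
    accepted, tmp = [], list(ctx)
    for tok in proposed:                    # real engines verify all positions in one pass
        want = target(tmp)
        if want != tok:
            return accepted + [want]        # first mismatch: use the target's token, stop
        accepted.append(tok)
        tmp.append(tok)
    return accepted + [target(tmp)]         # every draft accepted: bonus token

def generate(target: Model, draft: Model, ctx: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    out, steps = list(ctx), 0
    while len(out) - len(ctx) < n:
        out.extend(speculative_step(target, draft, out, k))
        steps += 1
    return out[: len(ctx) + n], steps

def greedy(target: Model, ctx: List[int], n: int) -> List[int]:
    out = list(ctx)
    for _ in range(n):
        out.append(target(out))
    return out

# Toy models: the target counts upward mod 10; the draft is wrong after a 7.
target_model: Model = lambda c: (c[-1] + 1) % 10
draft_model: Model = lambda c: 0 if c[-1] == 7 else (c[-1] + 1) % 10

spec_out, steps = generate(target_model, draft_model, [0], 20)
plain_out = greedy(target_model, [0], 20)
print(spec_out == plain_out, steps)
```

The output matches plain greedy decoding exactly, and the number of verification steps is well below the number of generated tokens. How much of that turns into wall-clock speedup depends on how cheap the drafting is, which is where the parallel-drafting claim comes in.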

The timing is interesting too. Most attention has been on MTP implementations for Gemma 4 and Qwen3.6, but DFlash quietly shipped for Gemma 4 26B and barely got noticed.

For people who have tried both DFlash and MTP on the same hardware: does DFlash actually deliver higher sustained throughput in real workloads, or does the 600 tok/s only hold under benchmark-friendly conditions?

u/IulianHI — 14 days ago