r/LocalLLaMA

So... anyone copped one of these?

It's been almost a year since mass hysteria erupted over the death of NVIDIA's GPU monopoly. How are your Huawei GPUs? Does CUDA work on them yet?

u/entsnack — 5 hours ago

▲ 6 r/LocalLLaMA+5 crossposts

I built my own harness to replace claude.ai | Self-hosted, beautiful, and works from any device

I built a harness for myself. It runs on my own machine. It's beautiful, fast, and built for professional work (I use it for my job).

The aesthetics and design were important to me: it had to be easy to use, informative, and feature-rich but not bloated.

Features I built into it:

Runs locally in a browser: remote-accessible from any device with a browser, full mobile UI, installable as a home-screen app in Android.
Tabs: Multiple sessions open at once, and sessions keep running even if I close the browser
Prebuilt AGENTS.md per project type: Coding, Writing, Business, Legal, General
Epic mode: Projects too big for one conversation get a persistent spec, task list, and state that future sessions pick up
Persistent memory: browse, edit, or delete every memory, and see exactly what got recalled into any chat
Nightly worker agents: Automated nightly code review + fixer agents lay out verified, one-click fixes every morning
Private web search & deep research: Self-hosted search engine, cited reports (Crawl4AI).
Independent reviewer agent: A clean-context critique of code, docs, or writing, instead of the model grading its own work
Chats & projects: Quick chats each get an isolated folder with full tool access; serious work lives in project workspaces
Handles every file type, both ways: Reads and authors PDF, Word, Excel, PowerPoint, images, charts; plus skills and saved prompts with fill-in variables
Voice both ways: Local Whisper dictation and read-aloud replies; no audio leaves the machine

Nice touches: full system terminal access, a ⌘K command palette with cross-project search, a usage dashboard that counts every token, live system status, a real note-taking app, plan mode, light/dark themes, native notifications when a turn finishes, and hourly encrypted backups.

My goal with this post is to inspire others to build their own harness; it's actually tons of fun! It's like building out your workshop or garage the way you want, to help you create, repair, and experiment on things.

u/PilgrimOfHaqq — 2 hours ago

▲ 28 r/LocalLLaMA

Late to the party but... Holy MTP

Just ran Qwen 3.6 27B using MTP for the first time. Doubled my t/s. Wow. That is all. I'm going to go look for abliterated MTP models now.

u/UniqueIdentifier00 — 1 hour ago

▲ 1.3k r/LocalLLaMA+1 crossposts

If trends hold, Mythos-class capability may be running on high-end consumer hardware within ~2 years

u/Jenna_AI — 8 hours ago

▲ 104 r/LocalLLaMA

I told Gemma 4 12B (Q8_0, no cache quant) to write a single-file 3D bowling simulator in WebGL. It's terrible, but honestly better than I expected.

Just sharing some slop. Used opencode as the harness.

I know this model isn't really recommended for coding, but I was just curious how it would handle this at near-lossless Q8_0. It made a couple tool call errors, but did correct itself quickly.

This was a one-shot pass after a quick plan session. I'm sure it could be made better with a few more turns, but I don't really care enough.

12B actually surpassed my expectations. I assumed it wouldn't work at all, but it... kinda does.

u/_TheWolfOfWalmart_ — 6 hours ago

▲ 96 r/LocalLLaMA+1 crossposts

Turn off thinking in LM Studio

Go to the My Models page in LM Studio.
Select a model, such as Qwen3.5.
Locate Inference on the right-hand sidebar.
Scroll down to the Prompt Template and open the Template (Jinja) section.
Add {%- set enable_thinking = false %} to the first line of the template.
Reload your model.
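
If you keep your templates in files rather than editing them in the UI, the same change is just a one-line prepend. A minimal sketch, assuming the model's Jinja template actually reads an `enable_thinking` variable (that depends on the model; `disable_thinking` is a made-up helper name, not part of LM Studio):

```python
# The override line from the steps above.
THINKING_OFF = "{%- set enable_thinking = false %}"


def disable_thinking(template: str) -> str:
    """Return the template with the override as its first line.

    Hypothetical helper: it only edits text. Whether the override has
    any effect depends on the template checking `enable_thinking`.
    """
    lines = template.splitlines()
    if lines and lines[0].strip() == THINKING_OFF:
        return template  # already patched, leave it alone
    return THINKING_OFF + "\n" + template


if __name__ == "__main__":
    original = "{%- for message in messages %}{{ message.content }}{%- endfor %}"
    print(disable_thinking(original))
```

Running it twice on the same template leaves a single override line, so it is safe to re-apply after a model update.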

u/Ok-Conference-9984 — 8 hours ago

▲ 23 r/LocalLLaMA

Prefill vs. decoding and local LLM ROI: is prefill underrated?

I'm trying to understand why, when people discuss the ROI of running LLMs locally, they almost always focus on output speed (decoding) and rarely on input speed (prefill), which seems like it could have a significant impact on hardware ROI.

Yesterday I saw a post on X where someone was running GLM 5.2 on 4 NVIDIA DGX Sparks (4-bit, speculative decoding, and other optimizations), achieving around 60 output tokens/s with 6 concurrent users in batch. Those are already great numbers. Assuming a hypothetical 24/7 agentic workload, that would be about 5.18 million output tokens per day, roughly $23/day at a price of $4.40 per million output tokens.

However, from what I read, the prefill throughput on the same setup is around 3,000 tokens/s (!)

It's true that prefill is cheaper (around $1.40 per million input tokens for GLM 5.2), but we're talking about roughly 50× higher throughput.
So why does almost nobody seem to consider prefill when discussing ROI?

Even though decoding is typically 3–5× more expensive per million tokens than prefill, prefill is often 10–30× faster (and in this case, around 50× faster)... Shouldn't that have a major impact on ROI? Maybe even more than output?

Am I missing something, or is the real input/output token ratio very different from what I'm imagining?
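
For anyone who wants to play with the numbers, here is a rough sketch of the arithmetic using only the figures quoted above (60 t/s decode, 3,000 t/s prefill, $4.40 and $1.40 per million tokens). It assumes each rate is sustained 24/7 on its own, which is an upper bound: the same hardware can't run prefill and decode at full speed at the same time, so the real split depends on the input/output token ratio of the workload.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_value_usd(tokens_per_second: float, usd_per_million_tokens: float) -> float:
    """API-equivalent value of one day of tokens at a sustained rate."""
    tokens_per_day = tokens_per_second * SECONDS_PER_DAY
    return tokens_per_day / 1_000_000 * usd_per_million_tokens


# Figures from the post; each line assumes the machine does only that phase all day.
decode_usd = daily_value_usd(60, 4.40)
prefill_usd = daily_value_usd(3000, 1.40)

print(f"decode only:  ${decode_usd:.2f}/day")
print(f"prefill only: ${prefill_usd:.2f}/day")
print(f"prefill/decode value ratio: {prefill_usd / decode_usd:.1f}x")
```

The 50× throughput gap shrinks to roughly 16× in dollar terms once the lower input price is applied, and it only materializes if the workload actually has that many input tokens to process.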

u/GabryIta — 5 hours ago

▲ 381 r/LocalLLaMA+1 crossposts

New open model from Tencent Hy: Hy3 (295B total 21B active - apache 2.0)

Collection: https://huggingface.co/collections/tencent/hy3

From elie on 𝕏: https://x.com/eliebakouch/status/2074011171661701466

edit: To clarify: this is the non-preview version of Hy3 and they changed their license from the community one (restrictive + not allowed in SK, UK, EU) to Apache 2.0

u/Nunki08 — 11 hours ago

▲ 17 r/LocalLLaMA

New strix halo box: GMKtec EVO-X3, superior cooling to avoid thermal throttling, $3,600

https://www.gmktec.com/products/gmktec-evo-x3-ai-mini-pc-amd-ryzen-ai-max-395

Features: USB 4. Adds a dedicated OCuLink port. OCuLink provides a direct, high-speed cable connection to a desktop graphics card, minimizing data loss and improving external GPU (eGPU) performance compared to USB4.

Price: more than double what I paid for my first Strix Halo, a 128GB machine that cost $1,600.

That box looks ready for gorgon halo to be released in the next 3 months.

u/Terminator857 — 8 hours ago

▲ 23 r/LocalLLaMA+5 crossposts

Hierarchos: Preliminary Findings From a 232M Recurrent Memory-Augmented Assistant Model [P]

[Project Release / Research Draft] Hierarchos at 232M Parameters: Preliminary Findings From a Recurrent Memory-Augmented Assistant Model

Technical Report: July 2nd, 2026

Project: Hierarchos / KortexHOS

Authors: Makhi Burroughs / netcat420, Lost Time, and the Hierarchos project team

TL;DR:

We built and trained Hierarchos, an experimental 232M-parameter recurrent, memory-augmented language model from scratch. It is not a GPT-3/3.5-class model, but it successfully proves that a hybrid non-Transformer architecture (combining an RWKV backbone, hierarchical manager/worker loops, differentiable slot-based LTM, and a deterministic suffix automaton) can survive training, avoid collapse, and maintain short-form instruction coherence. Most of our breakthroughs came from fixing subtle train/inference parity mismatches and numerical stability bugs.

Dataset: netcat420/Experiment_0.1 (Alpaca format)
Training: 13 epochs on an RTX 6000 Blackwell (96GB) rental.

1. Introduction & Background

Modern LLMs are heavily dominated by Transformer scaling. Hierarchos explores a different path: can recurrent state, explicit memory retrieval, hierarchical iterative computation, and bounded local inference make a small model vastly more parameter-efficient?

Hierarchos isn't a direct clone of any single architecture, but a hybrid inspired by:

RWKV-style recurrence: For efficient sequence processing without traditional attention.
Titans-style neural memory: For persistent test-time memory.
Hierarchical reasoning (HRM): Multi-level recurrent modules (Manager/Worker) to iteratively refine state.

2. Architecture Overview

[Token Input] -> [ROSA Suffix Matcher / DeepEmbed Modulator]
       |
       v
[Long-Term Memory] <-> [Top-k Associative Lookup]
       |
       v
[Manager Recurrent Cell] -> (Produces Context Plan & Drift Vector)
       |
       v
[Worker Recurrent Cell]  -> (Refines local state / clamps drift)
       |
       v
[RWKV Backbone (Clamped Channel-Mix)] -> [Next-Token Logits]

Key Components:

ROSA: A deterministic suffix-automaton path predicting continuation tokens based on exact repeated suffix patterns.
DeepEmbed: A token-specific modulation path that influences RWKV channel mixing.
LTM Subsystem: Learned slow-memory keys/values combined with fast working-memory values.
Manager/Worker Loop: High-level manager handles broad context to produce a target plan; the lower-level worker refines token-local state using a regularized drift vector.
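
To make the ROSA idea concrete, here is a toy illustration of "predict the continuation of the longest repeated suffix". This is not the Hierarchos implementation: it uses a naive quadratic scan instead of a suffix automaton, and the function name and `min_match` parameter are invented for the example.

```python
from typing import Optional, Sequence


def predict_from_repeated_suffix(tokens: Sequence[int], min_match: int = 1) -> Optional[int]:
    """Find the longest suffix of `tokens` that also occurs earlier in the
    sequence and return the token that followed that earlier occurrence.

    Naive stand-in for a suffix automaton: same answer on small inputs,
    but O(n^3) in the worst case instead of amortized linear time.
    """
    n = len(tokens)
    for length in range(n - 1, min_match - 1, -1):  # try the longest suffix first
        suffix = list(tokens[n - length:])
        # Scan earlier start positions, most recent first.
        for start in range(n - length - 1, -1, -1):
            if list(tokens[start:start + length]) == suffix:
                return tokens[start + length]
    return None  # no repeated suffix of at least `min_match` tokens


if __name__ == "__main__":
    print(predict_from_repeated_suffix([1, 2, 3, 1, 2]))  # suffix [1, 2] was followed by 3
    print(predict_from_repeated_suffix([1, 2, 3]))        # nothing repeats
```

The path is fully deterministic: it either finds an exact earlier match or returns nothing, which is presumably why the post describes it as a separate matcher rather than a learned component.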

3. Core Engineering Lessons (The "Gotchas")

A low training loss does not guarantee coherent chat. We had to fix several critical state-contract and numerical stability bugs to make the model usable:

1. Chat/Training Drift Mismatch

The Bug: During live streaming chat, the loop was feeding the previous drift state back into the model on every single token. During training, this state is reseeded at Truncated Backpropagation Through Time (TBPTT) chunk boundaries.
The Fix: We aligned the inference code to only reseed at boundary limits. Before this fix, live chat logits diverged sharply from training loss; after the fix, logit error dropped to near-zero.

2. Supervised LTM Inner Updates Mismatch

The Bug: Giving the model supervised memory updates during training that it can't replicate during zero-label live inference creates a crutch. The model learns to rely on a hidden training-only helper signal.
The Fix (v0.20.4): Implemented --ltm-training-mode read-only. Training keeps the memory structures but stops doing supervised fast-memory writes, perfectly mirroring inference.

3. Unbounded RWKV Channel Mixing

The Bug: Long runs exposed activation spikes in the ReLU-squared channel-mix FFN path, which were amplified by DeepEmbed modulation into NaN gradients.
The Fix: Implemented key clamps (--rwkv-channel-mix-key-clamp 12.0), DeepEmbed clamps (4.0), and excluded DeepEmbed identity gates from AdamW weight decay.
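
As a rough picture of why a clamp helps here: squaring an unbounded ReLU output lets a single large activation explode, while capping it first bounds the result. The sketch below is scalar pure Python and assumes the clamp is applied to the activation before squaring; the post doesn't say exactly where Hierarchos applies it, so treat the placement and the function name as assumptions.

```python
def clamped_relu_squared(x: float, clamp: float = 12.0) -> float:
    """ReLU, then cap at `clamp`, then square.

    The output can never exceed clamp ** 2, no matter how large x gets.
    The 12.0 default mirrors the --rwkv-channel-mix-key-clamp value above.
    """
    activated = max(x, 0.0)          # ReLU
    bounded = min(activated, clamp)  # cap the spike before squaring
    return bounded * bounded


if __name__ == "__main__":
    for x in (-3.0, 2.0, 12.0, 500.0):
        print(x, "->", clamped_relu_squared(x))
```

Without the cap, an activation of 500 would square to 250,000; with it, the output tops out at 144.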

4. Evaluation & Smoke Test Results

Because cloud costs add up, we benchmarked the model locally on a CPU preset via a ROG Ally (--eval-limit 100), ensuring passive learning was disabled and working memory was cleared to mimic static chat.

Bounded Local Benchmark Metrics (--eval-limit 100)

Benchmark	Metric	Score	Std. Err.
ARC Easy	acc	0.3600	0.0482
ARC Easy	acc_norm	0.3200	0.0469
HellaSwag	acc	0.3400	0.0476
HellaSwag	acc_norm	0.3700	0.0485
TruthfulQA MC1	acc	0.2200	0.0416
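
A quick way to read the Std. Err. column: with only 100 items per task, the error bars are wide. The sketch below recomputes the standard error of a mean of 0/1 outcomes with the sample (n - 1) variance, which reproduces the reported values to four decimals; that this is the formula the eval tool used is an inference from the numbers, not something the post states.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 outcomes (sample variance, n - 1)."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# Scores copied from the table above; n = 100 comes from --eval-limit 100.
reported = {
    "ARC Easy acc": 0.36,
    "ARC Easy acc_norm": 0.32,
    "HellaSwag acc": 0.34,
    "HellaSwag acc_norm": 0.37,
    "TruthfulQA MC1 acc": 0.22,
}

for name, acc in reported.items():
    se = accuracy_stderr(acc, 100)
    low, high = acc - 1.96 * se, acc + 1.96 * se  # approximate 95% interval
    print(f"{name:20s} {acc:.4f} +/- {se:.4f}  (95% CI {low:.3f} to {high:.3f})")
```

Each interval spans close to ten points on either side, so differences of a few points between configurations at this sample size shouldn't be over-interpreted.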

Real-world Coherence Check:

The Good: Assistant-shaped, follows short instruction prompts well due to the Alpaca training data. Nontrivial commonsense and QA signal indicates the weights didn't collapse.
The Bad: Brittle on long context lengths, weak on arithmetic/factual recall. Coherence is comparable to the GPT-2 era, not modern GPT-3.5+ systems.

5. Proposed Ablation & Scaling Plan

We want to transform this from a promising prototype into a rigorous scientific result. Our next step requires scaling tiers and isolated component testing.

Proposed Isolation Testing (Ablations)

No LTM / Read-Only LTM: Isolating exactly how much slot memory helps.
No ROSA / No DeepEmbed: Evaluating the real token-efficiency gains of suffix-matching and modulation.
Baseline Matches: Running a direct Transformer 232M and RWKV-only 232M on the exact same token budget to prove true comparative architecture efficiency.

Future Scaling Target Tiers

Tier	Model Size	Token Target	Purpose
Scout	300M–500M	20B–50B	Validate loss slope and stability scaling.
Real v1	1B–1.5B	100B–300B	Test architecture limits beyond small-scale behavior.
Serious	3B	600B–1.5T	Establish a truly competitive local open-source alternative.

Target Data Mix for Foundation Training:

Instead of jumping straight into instruction SFT data, a scaled run will prioritize high-quality base data:

35-50%: FineWeb / FineWeb-Edu style clean web text
20-30%: Dolma / DCLM curated web data
8-15%: Code and tech documentation
5-12%: Math, science, and academic proofs
1-5%: In-house assistant conversational SFT (applied exclusively in late-stage tuning)

6. What We Can (and Cannot) Claim Safely

What is supported by the data:

Hierarchos is a functional, coherent 232M experimental assistant checkpoint.
Combining recurrent sequence loops, memory slots, and hierarchical workers is viable and stable with the right clamps.
The findings provide a solid engineering roadmap for non-Transformer architecture stability.

What is NOT supported (Do not hype this!):

No claims of GPT-3.5 level math, coding, or logic.
No claims of superiority over attention/Transformer models at equal parameter counts yet (baselines pending).
Not production-ready for heavily quantized or low-bit local deployments yet due to drift sensitivity.

Final Thoughts

Hierarchos 232M shows that small, alternative architectures are still a deeply fruitful area of LLM research if you can conquer the train/inference state drift.

We would love to hear feedback from anyone working on recurrent neural memory or hierarchical backbones! Full code, scripts, and logs are in progress.

References:

Brown et al. Language Models are Few-Shot Learners. arXiv:2005.14165. https://arxiv.org/abs/2005.14165
Hoffmann et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556. https://arxiv.org/abs/2203.15556
Peng et al. RWKV: Reinventing RNNs for the Transformer Era. arXiv:2305.13048. https://arxiv.org/abs/2305.13048
Behrouz et al. Titans: Learning to Memorize at Test Time. arXiv:2501.00663. https://arxiv.org/abs/2501.00663
Wang et al. Hierarchical Reasoning Model. arXiv:2506.21734. https://arxiv.org/abs/2506.21734
Zellers et al. HellaSwag: Can a Machine Really Finish Your Sentence? arXiv:1905.07830. https://arxiv.org/abs/1905.07830
Clark et al. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457. https://arxiv.org/abs/1803.05457
Lin et al. TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958. https://arxiv.org/abs/2109.07958
Hugging Face. FineWeb dataset. https://huggingface.co/datasets/HuggingFaceFW/fineweb
Hugging Face. FineWeb-Edu dataset. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
Allen AI. Dolma dataset. https://huggingface.co/datasets/allenai/dolma
DataComp-LM. DCLM Baseline dataset. https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0

GitHub repository with the architecture and the released model weights: https://github.com/necat101/Hierarchos

u/PhysicsDisastrous462 — 4 hours ago

▲ 12 r/LocalLLaMA+1 crossposts

Impact of dual PCIe 5 x8 vs dual PCIe 5 x16 for dual Rtx 6000 pro max-q

I have an existing system with:
Asus ROG MAXIMUS Z790 DARK HERO LGA1700
Intel Core i9-14900K
G.Skill Trident Z5 RGB 2x48GB DDR5 6800MHz CL34 (x2, 4 sticks in total); right now I'm running only 2 sticks.

The issue is that my motherboard doesn't support dual GPUs in x16 mode; it drops the PCIe 5 slots to x8.

The alternative is to switch to a brand-new Threadripper PRO that supports multiple PCIe 5 slots at x16, but that would push the cost much higher than I planned for right now.

My question is: how big would the difference be between dual x8 and dual x16 when running inference with vLLM, image generation, and LoRA (flux.2, z-image, video generation, etc.)?
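
For reference, the raw link numbers behind the question can be worked out from the PCIe 5.0 signalling rate (32 GT/s per lane with 128b/130b encoding). This is a theoretical, one-direction figure before protocol overhead; how much it matters in practice depends on how much data actually crosses the bus for a given workload, which this sketch does not model.

```python
GT_PER_S_PER_LANE = 32.0          # PCIe 5.0 raw signalling rate per lane
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line coding


def pcie5_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe 5.0 link."""
    return GT_PER_S_PER_LANE * ENCODING_EFFICIENCY / 8 * lanes


def transfer_ms(gigabytes: float, lanes: int) -> float:
    """Best-case time in milliseconds to move `gigabytes` over the link."""
    return gigabytes / pcie5_bandwidth_gb_s(lanes) * 1000


if __name__ == "__main__":
    for lanes in (8, 16):
        print(f"x{lanes}: {pcie5_bandwidth_gb_s(lanes):.1f} GB/s, "
              f"1 GB in {transfer_ms(1.0, lanes):.1f} ms")
```

So x8 is about 31.5 GB/s and x16 about 63 GB/s per direction; the halved link only shows up in wall-clock time for the parts of a job that are bound by GPU-to-GPU or host-to-GPU transfers.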

Appreciate the help 🙏

u/BitXorBit — 8 hours ago

▲ 237 r/LocalLLaMA+2 crossposts

Kyutai's Pocket TTS clones a voice from 5 seconds of audio, on CPU, under MIT. Benchmarked against Kokoro, Supertonic, and Inflect-Nano for Eng. TTS

Kyutai dropped Pocket TTS a bit ago and I've been sitting on it for a benchmark. Finally ran it head to head against the three CPU TTS models that have been getting attention (Kokoro 82M, Supertonic 3, Inflect-Nano-v1). 180 timed runs, 36 audio samples, objective MOS scores via UTMOS.

Short version: Pocket TTS is the slowest of the six configs I tested, and it's still the most interesting model in the field. Here's why.

What Pocket TTS actually is:

It's a ~100M param streaming language model that generates audio tokens over Kyutai's Mimi neural codec, then decodes to 24kHz. So instead of the usual acoustic-model-plus-vocoder setup, it's more like an autoregressive LLM but for audio. Token by token.

Two consequences of that architecture:

Latency is dead flat across text lengths. Its RTF is 0.69 to 0.76 whether you feed it 12 chars or 1712 chars. No fixed overhead to amortize. Compare with Kokoro PyTorch which climbs from 0.49 on tiny text to 0.83 on long text.
It streams. Which matters if you're building anything interactive.
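
In case RTF is unfamiliar: the sketch below assumes the usual definition, real-time factor = synthesis time / duration of the audio produced, so anything under 1.0 is faster than real time. The post doesn't spell the definition out, so that part is an assumption; the RTF values are the ones quoted above.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / length of the generated audio."""
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Invert the definition: compute time needed for a clip of a given length."""
    return rtf * audio_seconds


if __name__ == "__main__":
    clip = 10.0  # seconds of speech, arbitrary example length
    for label, rtf in [
        ("Pocket TTS (low end)", 0.69),
        ("Pocket TTS (high end)", 0.76),
        ("Kokoro PyTorch, tiny text", 0.49),
        ("Kokoro PyTorch, long text", 0.83),
    ]:
        print(f"{label:28s} RTF {rtf:.2f} -> {synthesis_time(rtf, clip):.1f} s for {clip:.0f} s of audio")
```

With streaming, the time to first audio matters more than total synthesis time, which is why a flat RTF just under 1.0 can still feel responsive.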

Zero-shot voice cloning from 5 seconds. On CPU.

This is the headline feature. Hand it a 5-second reference clip of any voice and it speaks in that voice. Accent, timbre, pacing, even the mic character of the reference. No fine-tuning. No GPU. MIT license.

None of the other CPU-friendly models can do this at all. Kokoro and Inflect-Nano ship fixed voice sets, Supertonic same. If you want a user-supplied voice on a CPU box, Pocket TTS is currently in a category of one.

I ran the benchmark with Pocket TTS pinned to a preset voice (alba) for a fair speed/quality comparison. The cloning capability isn't in the numbers below because you can't benchmark it against models that don't have it.

Full results:

Config	Mean RTF	UTMOS MOS	Params	License
Supertonic 3 (2-step)	0.121	1.53	~99M	OpenRAIL-M
Inflect-Nano-v1	0.145	3.48*	4.6M	Apache 2.0
Supertonic 3 (5-step)	0.240	4.32	~99M	OpenRAIL-M
Kokoro 82M (ONNX)	0.641	4.44	82M	Apache 2.0
Kokoro 82M (PyTorch)	0.665	4.46	82M	Apache 2.0
Pocket TTS	0.714	4.10	~100M	MIT

Hardware: Intel Xeon 8272CL, 4 cores, 16GB RAM, no GPU. UTMOS is utmos22_strong, an objective MOS predictor, so it's not just my ears this time.

The Inflect-Nano asterisk: UTMOS gave it 3.48 but to the ear it's buzzy and robotic. Known UTMOS failure mode where it over-rates small HiFi-GAN vocoders for being clean rather than natural. Also it has a hard ~15 second output cap I discovered mid-benchmark, so its RTF on long inputs is inflated.

Practical picks:

Need voice cloning on CPU → Pocket TTS, no other option in this field
Fixed voice, highest quality → Kokoro 82M
Latency-critical with acceptable quality → Supertonic 3 at 5 steps
Tiny footprint for short utterances → Inflect-Nano-v1, if you can live with the buzz and the 15s cap
Prototyping only → Supertonic 3 at 2 steps

Two things worth calling out:

Pocket TTS install is genuinely painless. pip install pocket-tts, no CUDA build, no HuggingFace-repo-plus-sys.path wiring. Downloads weights on first load. The least fussy of the six.

The MIT license is a big deal. Kokoro is Apache 2.0 (also great). Supertonic is OpenRAIL-M with commercial restrictions. Pocket TTS being MIT means you can do essentially whatever with it commercially.

Repo with raw CSV (180 rows), all 36 WAV samples, and the benchmark script is in comments below 👇

If anyone here has run Pocket TTS voice cloning with a real reference clip, would love to hear how it holds up on different voice types (accented English, non-English, singing, etc). That's the next thing I want to test but I need a clean dataset.

u/gvij — 10 hours ago

▲ 6 r/LocalLLaMA+1 crossposts

Local models + big context = slow. How are you orchestrating "map-reduce" style agent workflows?

I tried running local models (qwen3.6*, ds4 flash, gemma4*, etc.) on my M5 MacBook Pro with 128GB of unified memory and concluded that the bottleneck is context size. The moment a conversation gets long (16k is already enough), inference slows to a crawl. If you work with Hermes agent, you know it reaches this context size almost by default because of how bloated it is. So my working strategy has become: chop every task into tiny pieces, spin up a fresh short session for each piece, and only pass the summary/output forward to the next step.

A concrete example: I want to scrape a bunch of sources overnight, collect the info, then generate a morning dashboard from all of it. The naive approach (one agent, one growing context) is unusable locally. The map-reduce approach feels right: many small parallel workers each doing one tiny extraction, then an aggregator that only sees the short summaries. But I'm building this by hand and it's fiddly.

What I'm wondering:

Which open-source agent orchestration frameworks actually support this "stateless worker + tiny context per call" pattern well? Most agents I've looked at (CrewAI, AutoGen, default LangChain) drag the full history along, which is the opposite of what I need.
Anyone built something like this already? A fan-out/fan-in pipeline where each LLM call stays small and the aggregator works only on compressed summaries?
How are you keeping context per call minimal while still getting useful aggregation at the end?
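
For what it's worth, the fan-out/fan-in shape described above fits in a few lines of plain Python without any framework. This is only a sketch: `llm` stands for whatever function sends one prompt to your local server and returns the text (hypothetical here, stubbed out with `fake_llm`), and the prompts are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

LLM = Callable[[str], str]  # one prompt in, one completion out, no history


def map_reduce(sources: List[str], llm: LLM, max_workers: int = 4) -> str:
    """Summarize each source in its own fresh call, then aggregate the summaries."""

    def extract(source: str) -> str:
        # Map step: each worker sees only its own source, never the others.
        return llm(f"Summarize the key facts in under 100 words:\n\n{source}")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        summaries = list(pool.map(extract, sources))

    # Reduce step: the aggregator sees only the short summaries.
    joined = "\n".join(f"- {s}" for s in summaries)
    return llm(f"Combine these summaries into a morning dashboard:\n\n{joined}")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real local-model call so the sketch runs offline."""
    return f"[reply to a {len(prompt)}-char prompt]"


if __name__ == "__main__":
    print(map_reduce(["source one text", "source two text", "source three text"], fake_llm))
```

The per-call context is bounded by the largest single source plus a short instruction, and the final call grows only with the number of summaries, not with the raw input size. Set `max_workers` to however many parallel slots your server is configured for.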

PS: I'm not looking for hosted/API solutions; this is specifically about local-model constraints. Curious what patterns, frameworks, or projects people have landed on. Thanks!

u/gevezex — 4 hours ago

▲ 29 r/LocalLLaMA

Am I Expecting Too Much?

I’ve been trying to get started with local coding help after finding Claude Code really useful but I’m just running into a ton of problems.

On my RTX 5090, I’ve been running Qwen 3.6 27B UD_4 at 131K context, no KV quants, quant’d by Unsloth. I’ve been using Cline in VS Code, as it seemed the closest parallel to Claude Code.

Recently, I wanted to try implementing some plans I had made with Fable. I had Fable prepare detailed, junior-engineer-friendly implementation plans for a pretty basic Python app, which I threw into the repo.

But I just can’t get anywhere with Qwen. It writes code that contains a ton of mistakes, tries to write terminal commands that are just broken with basic syntax errors, etc. It can’t follow even these very detailed plans going basically step by step.

To be clear, I’m not expecting Opus level performance or even Sonnet level work, but it just doesn’t work at all, even within the harness and via basic terminal calls.

Am I expecting too much?

Is something in my approach or setup wrong?

Is there a better harness or model I should swap to?

u/adcimagery — 10 hours ago

▲ 14 r/LocalLLaMA

The cyber shelf - 4x 16gb home lab

Been a long-time lurker of this subreddit; learned a whole lot from here and Gemini. I've finally got my rig somewhere I feel I could share. A lot of people talk about racks for their home lab, but all I managed was this kitchen rack. I was just dipping my toes in the water with my first 16GB card and just ended up stacking them. This is my 4x 16GB card build (bifurcated main slot, plus a riser cable on one PCIe 3 x1 slot). It runs two llama.cpp instances of Qwen 3.6 with speculative decoding at Q4_0, each with a 150k context: 1000 tok/s prompt processing, 45-60 tok/s generation. i5 processor with 32GB DDR4, but I'm all on VRAM. I used opencode to build the backend that does the llama.cpp management and token counting. If these calcs are right (haha, no idea really), it says here I've saved 60 bucks already! Everything is buggy as hell, but that's a skill issue on my end. I was trying to build a router so I could run parallel 2 on one set of cards and parallel 1 on the other set, then forward requests to the right server, and that's where I am now. AMA or leave a (mean) comment or suggestion!

u/HippEMechE — 5 hours ago

▲ 226 r/LocalLLaMA+1 crossposts

Qwen3.6 27B local vs Opus 4.8, voxel engine in raw C with zero frameworks

Sunday experiment. Same prompt to both. Build a voxel world in plain C. No engine, no game library, no framework, just the compiler. The model does its own chunk meshing, render loop and memory management by hand.

Left is Claude Code on Opus 4.8. Right is Qwen3.6 27B local on vLLM, the new NVFP4 quant, 256k context. Runs around 130 TPS on an RTX 6000 Blackwell 96GB through my own coding agent.

Opus clearly understands voxel physics. Terrain holds, chunks line up, collision works. The 27B compiles and renders, then tears itself apart on screen.

The quality gap I expected. What I did not expect was a local 27B handling C at all. Almost every local demo is Python or TypeScript with a framework doing the work. Strip that away and you are left with raw pointers and manual allocation, exactly where I assumed a quantized model would fall over. It did not. Rough, but it builds and runs.

Everyone watches the frontier race. Nobody talks about the bottom catching up. Two years ago this prompt gave you a segfault on a local model. Now it gives you a broken world that still runs on a card under your desk. The ceiling barely moved. The floor sprinted.

u/codehamr — 13 hours ago

▲ 217 r/LocalLLaMA

New model: GigaChat3.5-432B-A28B (with day-0 GGUF support!)

New model from Sberbank:

https://huggingface.co/ai-sage/GigaChat3.5-432B-A28B

Base version also available: https://huggingface.co/ai-sage/GigaChat3.5-432B-A28B-base

Most importantly, they've also made a GGUF version: https://huggingface.co/ai-sage/GigaChat3.5-432B-A28B-GGUF

It's not in the master branch yet, but you can build from this PR: https://github.com/ggml-org/llama.cpp/pull/25342

u/unbannedfornothing — 15 hours ago

▲ 7 r/LocalLLaMA

Got my Ascent GX10 two days ago, ran REAP-pruned NVFP4 DeepSeek-V4-Flash on a single Spark, and it stays consistent at long context

Got my Ascent GX10 two days ago and spent the last couple of days pushing a REAP-pruned NVFP4 DeepSeek-V4-Flash setup on a single Spark by patching the eugr/spark-vllm-docker image.

Credit where it’s due: the REAPs were done by 0xSero. I’m just the person who wired it up, validated it, and pushed it through the machine.

The main thing I wanted to check was long-context consistency, and the interesting part is how steady the throughput stays as context scales up.

I also vibecoded a Grafana dashboard in Hermes so I can watch the Spark (served at 262k context with vLLM) without living in raw logs.

Here are the numbers:

model	test	t/s (total)	t/s (req)	peak t/s	peak t/s (req)	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
deepseek-v4-flash	pp4092 (c1)	835.41 ± 0.00	835.41 ± 0.00			4902.67 ± 0.00	4898.18 ± 0.00	4902.67 ± 0.00
deepseek-v4-flash	tg128 (c1)	23.38 ± 0.00	23.38 ± 0.00	27.00 ± 0.00	27.00 ± 0.00
deepseek-v4-flash	pp4092 (c2)	544.31 ± 0.00	556.92 ± 284.68			9950.97 ± 5084.31	9946.48 ± 5084.31	9950.97 ± 5084.31
deepseek-v4-flash	tg128 (c2)	16.76 ± 0.00	24.85 ± 0.63	29.00 ± 0.00	29.00 ± 0.00
deepseek-v4-flash	pp4092 (c4)	458.66 ± 0.00	215.93 ± 54.18			20228.56 ± 5074.88	20224.07 ± 5074.88	20228.56 ± 5074.88
deepseek-v4-flash	tg128 (c4)	14.17 ± 0.00	23.87 ± 0.75	31.00 ± 0.00	28.75 ± 1.79
deepseek-v4-flash	pp4092 (c1)	827.54 ± 0.00	827.54 ± 0.00			4949.25 ± 0.00	4944.77 ± 0.00	4949.25 ± 0.00
deepseek-v4-flash	tg512 (c1)	22.15 ± 0.00	22.15 ± 0.00	29.00 ± 0.00	29.00 ± 0.00
deepseek-v4-flash	pp4092 (c2)	259.55 ± 0.00	483.59 ± 353.80			18211.16 ± 13320.06	18206.67 ± 13320.06	18211.16 ± 13320.06
deepseek-v4-flash	tg512 (c2)	20.64 ± 0.00	22.90 ± 0.56	30.00 ± 0.00	30.00 ± 0.00
deepseek-v4-flash	pp4092 (c4)	193.07 ± 0.00	105.06 ± 34.55			43677.81 ± 14362.48	43673.32 ± 14362.48	43677.81 ± 14362.48
deepseek-v4-flash	tg512 (c4)	20.12 ± 0.00	23.66 ± 1.74	31.00 ± 0.00	29.50 ± 1.12
deepseek-v4-flash	pp16384 (c1)	768.42 ± 0.00	768.42 ± 0.00			21326.14 ± 0.00	21321.66 ± 0.00	21328.51 ± 0.00
deepseek-v4-flash	tg128 (c1)	22.14 ± 0.00	22.14 ± 0.00	27.00 ± 0.00	27.00 ± 0.00
deepseek-v4-flash	pp16384 (c2)	668.24 ± 0.00	533.52 ± 199.36			35697.41 ± 13337.33	35692.92 ± 13337.33	35698.70 ± 13337.36
deepseek-v4-flash	tg128 (c2)	7.83 ± 0.00	22.87 ± 0.86	28.00 ± 0.00	28.00 ± 0.00
deepseek-v4-flash	pp16384 (c4)	636.72 ± 0.00	273.62 ± 59.03			62805.30 ± 13548.80	62800.81 ± 13548.80	62806.27 ± 13547.83
deepseek-v4-flash	tg128 (c4)	5.81 ± 0.00	22.51 ± 1.40	28.00 ± 0.00	27.25 ± 0.83
deepseek-v4-flash	pp16384 (c1)	769.23 ± 0.00	769.23 ± 0.00			21303.79 ± 0.00	21299.30 ± 0.00	21303.79 ± 0.00
deepseek-v4-flash	tg512 (c1)	22.23 ± 0.00	22.23 ± 0.00	30.00 ± 0.00	30.00 ± 0.00
deepseek-v4-flash	pp16384 (c2)	499.36 ± 0.00	503.44 ± 253.74			43631.21 ± 21988.43	43626.72 ± 21988.43	43631.21 ± 21988.43
deepseek-v4-flash	tg512 (c2)	15.40 ± 0.00	22.65 ± 0.16	28.00 ± 0.00	28.00 ± 0.00
deepseek-v4-flash	pp16384 (c4)	425.47 ± 0.00	197.99 ± 48.93			88138.11 ± 21781.16	88133.62 ± 21781.16	88138.11 ± 21781.16
deepseek-v4-flash	tg512 (c4)	13.09 ± 0.00	22.30 ± 0.63	30.00 ± 0.00	29.50 ± 0.50
deepseek-v4-flash	pp65536 (c1)	655.34 ± 0.00	655.34 ± 0.00			100007.10 ± 0.00	100002.61 ± 0.00	100014.84 ± 0.00
deepseek-v4-flash	tg128 (c1)	18.01 ± 0.00	18.01 ± 0.00	23.00 ± 0.00	23.00 ± 0.00
deepseek-v4-flash	pp65536 (c2)	622.19 ± 0.00	468.70 ± 157.58			157651.57 ± 53003.64	157647.08 ± 53003.64	157657.64 ± 53004.05
deepseek-v4-flash	tg128 (c2)	2.27 ± 0.00	21.03 ± 0.62	26.00 ± 0.00	25.50 ± 0.50
deepseek-v4-flash	pp65536 (c4)	613.00 ± 0.00	256.18 ± 52.33			266959.62 ± 54527.17	266955.14 ± 54527.17	266963.48 ± 54526.99
deepseek-v4-flash	tg128 (c4)	1.54 ± 0.00	20.92 ± 1.06	28.00 ± 0.00	26.50 ± 0.87
deepseek-v4-flash	pp65536 (c1)	656.34 ± 0.00	656.34 ± 0.00			99855.20 ± 0.00	99850.71 ± 0.00	99861.54 ± 0.00
deepseek-v4-flash	tg512 (c1)	21.32 ± 0.00	21.32 ± 0.00	27.00 ± 0.00	27.00 ± 0.00
deepseek-v4-flash	pp65536 (c2)	579.74 ± 0.00	462.74 ± 172.85			164598.02 ± 61483.52	164593.53 ± 61483.52	164604.29 ± 61483.75
deepseek-v4-flash	tg512 (c2)	6.88 ± 0.00	20.94 ± 0.91	28.00 ± 0.00	27.50 ± 0.50
deepseek-v4-flash	pp65536 (c4)	545.41 ± 0.00	234.86 ± 51.30			293034.26 ± 64009.23	293029.77 ± 64009.23	293037.88 ± 64009.22
deepseek-v4-flash	tg512 (c4)	5.09 ± 0.00	21.33 ± 0.70	28.00 ± 0.00	27.50 ± 0.87
deepseek-v4-flash	pp131072 (c1)	558.69 ± 0.00	558.69 ± 0.00			234608.36 ± 0.00	234603.87 ± 0.00	234621.63 ± 0.00
deepseek-v4-flash	tg128 (c1)	19.10 ± 0.00	19.10 ± 0.00	23.00 ± 0.00	23.00 ± 0.00
deepseek-v4-flash	pp131072 (c2)	548.87 ± 0.00	406.83 ± 132.39			360340.23 ± 117258.53	360335.75 ± 117258.53	360347.52 ± 117259.06
deepseek-v4-flash	tg128 (c2)	1.05 ± 0.00	19.13 ± 0.22	25.00 ± 0.00	24.00 ± 1.00
deepseek-v4-flash	pp131072 (c4)	546.73 ± 0.00	196.89 ± 56.72			602040.49 ± 121723.14	602036.01 ± 121723.14	602053.75 ± 121723.14
deepseek-v4-flash	tg128 (c4)	0.70 ± 0.00	20.11 ± 1.47	25.00 ± 0.00	24.00 ± 1.22
deepseek-v4-flash	pp131072 (c1)	573.71 ± 0.00	573.71 ± 0.00			228466.93 ± 0.00	228462.44 ± 0.00	228473.65 ± 0.00
deepseek-v4-flash	tg512 (c1)	18.50 ± 0.00	18.50 ± 0.00	24.00 ± 0.00	24.00 ± 0.00
deepseek-v4-flash	pp131072 (c2)	531.49 ± 0.00	409.53 ± 143.78			365049.44 ± 128158.79	365044.96 ± 128158.79	365059.40 ± 128161.25
deepseek-v4-flash	tg512 (c2)	3.62 ± 0.00	18.88 ± 0.88	26.00 ± 0.00	25.00 ± 1.00
deepseek-v4-flash	pp131072 (c4)	526.27 ± 0.00	188.42 ± 54.45			631612.72 ± 130990.99	631608.23 ± 130990.99	631626.03 ± 130991.41
deepseek-v4-flash	tg512 (c4)	2.09 ± 0.00	19.28 ± 0.45	26.00 ± 0.00	25.00 ± 1.22
deepseek-v4-flash	pp162816 (c1)	534.93 ± 0.00	534.93 ± 0.00			304375.99 ± 0.00	304371.51 ± 0.00	304384.97 ± 0.00
deepseek-v4-flash	tg128 (c1)	20.62 ± 0.00	20.62 ± 0.00	24.00 ± 0.00	24.00 ± 0.00
deepseek-v4-flash	pp162816 (c2)	521.46 ± 0.00	387.00 ± 126.26			470838.82 ± 153616.52	470834.33 ± 153616.52	470847.89 ± 153616.37
deepseek-v4-flash	tg128 (c2)	0.81 ± 0.00	19.09 ± 0.42	24.00 ± 0.00	24.00 ± 0.00
deepseek-v4-flash	pp162816 (c4)	519.15 ± 0.00	186.62 ± 53.53			789169.74 ± 158960.31	789165.25 ± 158960.31	789174.99 ± 158955.06
deepseek-v4-flash	tg128 (c4)	0.54 ± 0.00	19.86 ± 0.79	25.00 ± 0.00	24.00 ± 1.22
deepseek-v4-flash	pp162816 (c1)	542.47 ± 0.00	542.47 ± 0.00			300144.05 ± 0.00	300139.56 ± 0.00	300160.34 ± 0.00
deepseek-v4-flash	tg512 (c1)	18.50 ± 0.00	18.50 ± 0.00	24.00 ± 0.00	24.00 ± 0.00
deepseek-v4-flash	pp162816 (c2)	508.47 ± 0.00	388.37 ± 134.13			476007.57 ± 164392.18	476003.08 ± 164392.18	476017.56 ± 164391.67
deepseek-v4-flash	tg512 (c2)	2.87 ± 0.00	17.99 ± 0.36	24.00 ± 0.00	23.00 ± 1.00
deepseek-v4-flash	pp162816 (c4)	495.46 ± 0.00	207.66 ± 42.84			818907.10 ± 168931.83	818902.61 ± 168931.83	818912.38 ± 168926.54
deepseek-v4-flash	tg512 (c4)	1.98 ± 0.00	18.75 ± 0.49	28.00 ± 0.00	25.25 ± 1.64

What stood out to me is that this thing stays surprisingly consistent at long context on a single Spark. The prefill and tg numbers don’t collapse the way you might expect as you stretch from 4K to 162K, and that was the whole point of the test.
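
To put a number on "doesn't collapse", here is a small sketch that takes the single-stream (c1) rows from the table above and expresses each context length's prefill and tg128 throughput as a fraction of the 4K baseline. Only the first c1 prefill row per context size and its paired tg128 row are used; the variable and function names are just for this example.

```python
# context tokens -> (prefill t/s, tg128 t/s), copied from the c1 rows above
single_stream = {
    4092: (835.41, 23.38),
    16384: (768.42, 22.14),
    65536: (655.34, 18.01),
    131072: (558.69, 19.10),
    162816: (534.93, 20.62),
}


def retention(rows):
    """Throughput at each context size relative to the smallest context."""
    base_pp, base_tg = rows[min(rows)]
    return {ctx: (pp / base_pp, tg / base_tg) for ctx, (pp, tg) in rows.items()}


if __name__ == "__main__":
    for ctx, (pp_frac, tg_frac) in retention(single_stream).items():
        pp_tps = single_stream[ctx][0]
        est_prefill_s = ctx / pp_tps  # lines up with the est_ppt column (in ms)
        print(f"{ctx:>7} tokens: prefill {pp_frac:6.1%}, tg128 {tg_frac:6.1%}, "
              f"~{est_prefill_s:6.1f} s to ingest")
```

At 162K, single-stream prefill keeps roughly 64% of its 4K rate and tg128 roughly 88%, though ingesting that prompt still takes about five minutes.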

Next up I’ll post the 180B REAP benchmarks too, and if the hardware cooperates I want to try longer contexts, maybe up to 500K.

u/Dry-Tough-8068 — 6 hours ago

▲ 14 r/LocalLLaMA

Using Codex instead of Opencode

Hello everyone,

I have spent quite a lot of time trying to make Opencode feel more like Codex (the Windows app), and it got me thinking.

If I am chasing a "Codex" like experience, is there any reason to use Opencode instead of Codex itself?

For reference, I am running Qwen 3.6 27b Q8

u/wgaca2 — 14 hours ago

▲ 469 r/LocalLLaMA

Qwen & Gemma on deadlock situation (For Benchmarks Numbers)?

I've had this feeling for some time. I've also noticed a few similar tweets online before.

u/pmttyji — 19 hours ago
