r/LocalLLaMA

▲ 764 r/LocalLLaMA+1 crossposts

Opencode you naughty minx

Man, AI agents getting pretty crazy these days. :)

(local, I just decided to try to get an orchestrator in there, when Qwen and Gemma aren't up to it.)

u/Jenna_AI — 3 hours ago

Can't believe I got it working! Dual GPU - 48gb VRAM llama-cpp server - R9700 + 7800XT

Setup: Kubuntu 24.04 - AMD cards - R9700 AI PRO and 7800xt (32gb + 16gb) - llama-cpp server - stack setup in docker - vulkan image

I tried with ROCm but it wouldn't play nice with the RDNA4 + RDNA3 mix.

Vulkan seems to work. I tested a quick prompt, and hopefully it's stable, because if so, this gives me 48gb of VRAM to play with. Had to buy a new power supply, but for $300 and to be able to leverage my older 7800xt - well worth it, I think.

u/Jorlen — 4 hours ago
▲ 140 r/LocalLLaMA+2 crossposts

Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM

Greetings from TurboQuant's former biggest defender, now a middle-sized, niche-aware TurboQuant defender. Today I'm presenting the results of thoroughly exploring the world of PPL and KLD benchmarks on my single RTX 3090 using BeeLlama v0.1.2, with some backstory of unsuccessfully trying other tests and then re-exploring PPL and KLD even more thoroughly to compensate.

Tests were done with Qwen 3.6 27B (Q5_K_S and IQ4_XS) at 64k and 128k context, so a decent model with decent quants at a decent context length. Basically the setup we 24 GB VRAM folks are actually using, which keeps the results grounded. I'm not in any position to talk shit about the vLLM study, but it really looked like a "how to invest and become rich if you already have $1,000,000" book to me, with regular 4-bit and 5-bit quants missing from the comparison.

Here are my findings:

  • PPL Hides the Tail, KLD Exposes It. Through q4_0, the entire PPL range stays under 0.01 above bf16. Even turbo3_tcq only adds ~0.02 PPL. But 99.9% KL divergence tells a different story: while q5_0 (at 34.4% of bf16) is obviously behind q8_0, it's still not bad. But then q4_0's tail KLD is 32% worse than q5_0's. Now this is what breaks your tool calls and JSON structure.
  • Rotation closed the gap at 4 bits. llama.cpp already applies random rotation to KV vectors before quantizing, which is the same basic trick TurboQuant uses. At 4 bits, turbo4 has no quality advantage over q4_0, saves almost no memory, and runs 17% slower. TurboQuant's value is at 2-3 bits where it has no alternatives anyways.
  • TCQ saves the low end. turbo3_tcq is consistently much better than plain turbo3, and turbo2_tcq is much better than turbo2. They are a legit solution for cases where you need aggressive compression. Now what is TCQ, you might ask? Luckily, the article covers this as well!
  • Asymmetric KV beats symmetric at the same size. q5_0/q4_0 is the same memory as q4_1/q4_1 but beats it across all test configs in 99.9% precision. After K reaches q5_0, the next useful bit goes to V, not to q5_1 K.
  • Higher model precision means more cache damage. Q5_K_S took 3-5% more 99.9% precision damage than IQ4_XS at the same cache quant. Model and KV cache quants are not independent, and it's better to balance their quants rather than focusing on only one or the other, as they both feed from the same VRAM pool.
  • q8 is mostly a luxury tier, unless you have spare VRAM. q8_0/q5_0 at 43.8% of bf16 KV keeps 99.9% precision at 93.7-98.2% across configs, so full q8_0/q8_0 at 53.1% is mostly validation when you don't struggle with VRAM anyways.
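The size percentages quoted in the findings can be sanity-checked from bits per element. The sketch below assumes the classic GGML block layouts (32 values per block plus per-block scale/offset), which is an assumption about what BeeLlama uses for these cache types; the turbo/TCQ types are left out because their layouts are not given in the post.

```python
# Bits per element for the classic GGML block formats (32 values per block,
# counting the per-block scale/offset stored next to the quantized values).
BITS_PER_ELEMENT = {
    "bf16": 16.0,
    "q8_0": 8.5,  # 34 bytes per 32 values
    "q5_1": 6.0,  # 24 bytes per 32 values
    "q5_0": 5.5,  # 22 bytes per 32 values
    "q4_1": 5.0,  # 20 bytes per 32 values
    "q4_0": 4.5,  # 18 bytes per 32 values
}


def kv_fraction_of_bf16(k_type: str, v_type: str) -> float:
    """Size of a K/V cache pair relative to a bf16/bf16 cache."""
    used = BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type]
    return used / (2 * BITS_PER_ELEMENT["bf16"])


pairs = [
    ("q8_0", "q8_0"),
    ("q8_0", "q5_0"),
    ("q5_0", "q5_0"),
    ("q5_0", "q4_0"),
    ("q4_1", "q4_1"),
]
for k, v in pairs:
    print(f"{k}/{v}: {kv_fraction_of_bf16(k, v):.1%} of bf16")
```

Under these assumptions q5_0/q4_0 and q4_1/q4_1 come out at exactly the same size, which matches the asymmetric-vs-symmetric comparison above.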

Here's the article, with all the data and quite a bit of analysis:
https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context

u/Anbeeld — 7 hours ago

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

..and on 8GB VRAM I can even push the context to 320K, 400K, 512K, and yes.. 1M. But it does start to slow down noticeably beyond 150k so I'd only do this if I ever really want the larger context.

This is using the APEX-I-Quality or Q4_K_XL quants; both are better than Q4_K_M (IQ4_NL_XL for beyond 512k context).

I have a total of 32GB of DDR4-2666, which is slightly above minimum DDR4.

I see a lot of users with better GPUs and more VRAM who seem to be getting less efficiency and have to drop context all the way to 64k or below to run at good tps. I don't understand why, but here are two things I learned from my tweaking so far.

First, 35B-A3B is an MoE model, so it only needs ~3.5B parameters in VRAM during runtime.

8GB is enough to hold the active model layers (~3GB) + GPU buffers (~2GB) + 262144 KV Cache at q8_0 (2.56GB). It's a tight fit, but works.
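As a quick check of that budget, here is a tiny sketch that simply adds up the three figures quoted above. The dictionary labels are only descriptive names, and the numbers are the post's own estimates, not measurements.

```python
# Figures quoted in the post, in GB; labels are descriptive only.
budget_gb = {
    "active model layers": 3.0,
    "GPU buffers": 2.0,
    "KV cache, 262144 tokens at q8_0": 2.56,
}
vram_gb = 8.0

total_gb = sum(budget_gb.values())
headroom_gb = vram_gb - total_gb

for name, size in budget_gb.items():
    print(f"{name:<34} {size:5.2f} GB")
print(f"{'total':<34} {total_gb:5.2f} GB")
print(f"{'headroom on an 8 GB card':<34} {headroom_gb:5.2f} GB")
```

That leaves well under half a gigabyte spare, which is why anything else touching the GPU (a desktop compositor, for instance) can tip it over.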

Messing with the engine's parameters, like forcing all layers onto VRAM, or other runtime parameters like sm, fa, etc., seems to actually slow down the model for me and/or exhaust my VRAM and system RAM.

Look at this screenshot for example: there's a common misunderstanding that an MoE model must fit entirely in VRAM to run optimally.

https://preview.redd.it/cpc4r9q7cr2h1.png?width=1197&format=png&auto=webp&s=89bd03a4537825b862472009225a7a99b7fbd8b4

Second, just like Windows 11 sucks for gaming, all that "enhanced experience" also has an impact on LLM inference. Running a compact Linux from terminal (I chose Ubuntu Server) would only use up about 800MB of system RAM and practically no VRAM, compared to Windows 11, and it gives me a +25% boost to tps!

Here are some numbers for the same llama.cpp parameters:

On Windows

  • Inference is <27 tps and drops quickly beyond 100k; in fact it starts dropping within the first few thousand output tokens.
  • System memory is 28GB+ full, and if I mess with other parameters in llama.cpp it just fills up immediately (~31GB) dragging tps down with it
  • The highest context I was able to run stable is 512k at turbo quant 4 for KV

On Ubuntu Server (fresh dual-boot install 2 days ago, installed on a 160GB partition on my fastest NVMe)

  • Inference is ~34 tps and doesn't drop; it often goes up to ~37 while generating tokens!
  • System memory is 22GB full, giving me a full 8GB of system RAM to run i3wm/x11 with whatever software I need (no eye-candy compositors/apps that use the GPU because that'll use up precious VRAM)
  • I was able to get to 1M context on IQ4_NL_XL and turbo4 quant for KV

So far it's been good enough. But I have an older small GPU I can connect and use for the operating system while keeping the 3070 Ti entirely dedicated to the LLM.

--------------------

Both profiles are coding focused and should work under Windows 11 too but with a lot less memory left.

Main profile with 256K context:

llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_XL.gguf \
  --jinja \
  --parallel 1 \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0 \
  --reasoning-budget 4096 \
  -n 32768 \
  --no-context-shift \
  --no-mmap \
  -c 262144 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --host 0.0.0.0

and with 512K context:

llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_XL.gguf \
  --jinja \
  --parallel 1 \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0 \
  --reasoning-budget 4096 \
  -n 32768 \
  --no-context-shift \
  --no-mmap \
  -c 524288 \
  --rope-scale 2 \
  --rope-scaling yarn \
  --yarn-orig-ctx 262144 \
  --cache-type-k turbo4 \
  --cache-type-v turbo4 \
  --host 0.0.0.0
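In the 512K profile, the value passed to --rope-scale equals the ratio of the requested context (-c) to --yarn-orig-ctx. A small sketch of that relationship, assuming the same rule is what you would apply for other targets such as the 1M run mentioned above (the helper name is made up for illustration):

```python
def yarn_scale(target_ctx: int, original_ctx: int) -> float:
    """Scale factor needed to stretch original_ctx up to target_ctx."""
    if target_ctx < original_ctx:
        raise ValueError("target context is smaller than the original context")
    return target_ctx / original_ctx


ORIG_CTX = 262144  # value passed to --yarn-orig-ctx in the profile above

for target in (524288, 1048576):
    print(f"-c {target} --rope-scale {yarn_scale(target, ORIG_CTX):g}")
```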

I hope someone finds this helpful. I love this community and I'm in the Qwen3.7-35B-A3B waiting room with the rest eating my nails in anticipation lol

u/Alternative-Cat-1347 — 2 hours ago
▲ 14 r/LocalLLaMA+2 crossposts

I built a local autonomous agent that streams every reasoning step live in the UI — no black boxes

Hey r/ollama,

I've been building Pragma, an open-source autonomous agent that runs entirely on Ollama. The thing that bothered me about most agents is that you have no idea what they're actually doing — you give them a task and wait.

Pragma shows you everything in real time: every thought, every tool call, every observation, as it happens.

What it does:

  • Runs a ReAct loop (think → act → observe → repeat) and streams each step live in the UI
  • Two models: a small reasoning model for orchestration, a coding model (Qwen 2.5 Coder) for code generation
  • Skill palette: filesystem, shell, web search, LLM calls, and more — each skill is a folder you can add to
  • Threads with persistent history, working directory per conversation
  • No API key, no cloud, everything stays local
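For readers unfamiliar with the pattern, here is a minimal sketch of a ReAct-style loop that yields each step as an event, so a UI could stream it. This is not Pragma's code: the skill registry, the `fake_model` stand-in, and the event names are all hypothetical, and a real agent would call a local model where the stub is.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical skill registry: skill name -> callable taking one string argument.
SKILLS: Dict[str, Callable[[str], str]] = {
    "echo": lambda arg: arg,
    "upper": lambda arg: arg.upper(),
}

Event = Tuple[str, str]  # (kind, text)


def fake_model(history: List[Event]) -> Tuple[str, str, str]:
    """Stand-in for the reasoning model: returns (thought, action, argument)."""
    if not any(kind == "observation" for kind, _ in history):
        return ("I should transform the input.", "upper", "hello")
    return ("I have the result.", "finish", history[-1][1])


def react_loop(task: str, max_steps: int = 5) -> Iterator[Event]:
    """Think -> act -> observe -> repeat, yielding every step as it happens."""
    history: List[Event] = [("task", task)]
    for _ in range(max_steps):
        thought, action, arg = fake_model(history)
        yield ("thought", thought)
        if action == "finish":
            yield ("answer", arg)
            return
        yield ("action", f"{action}({arg!r})")
        observation = SKILLS[action](arg)
        history.append(("observation", observation))
        yield ("observation", observation)


events = list(react_loop("shout hello"))
for kind, text in events:
    print(f"[{kind}] {text}")
```

Because the loop is a generator, a server could forward each event over a WebSocket as soon as it is produced instead of waiting for the final answer.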

Stack: FastAPI + Vanilla JS + WebSocket. No framework magic, every file is understandable in isolation.

Tested on: NVIDIA RTX A2000 12GB with Gemma 4 E4B (reasoning) + Qwen 2.5 Coder 7B (code). 12GB VRAM is the practical minimum for Gemma, 24GB gives more headroom.

Repo: https://github.com/homoagens/pragma

Happy to answer questions about the architecture, the skill system, or how the ReAct loop works.

u/HomoAgens1 — 3 hours ago

Scrambling to max StrixHalo (+NVLink dual eGPU 3090 mod)

https://preview.redd.it/kz66mxzseq2h1.jpg?width=4096&format=pjpg&auto=webp&s=da98623808c4bde0dc79b239c8cf8930c5572769

https://preview.redd.it/ocsigi0veq2h1.jpg?width=4096&format=pjpg&auto=webp&s=eb4b053e46e434b2c54de7fff6c584e01c80ea5e

This pic doesn't show the bench setup; I just happened to capture it while figuring out how to run the same model across 3 GPUs. The Halo is always busy while the 3090s wait for it to finish its work.

In short.

1. Strix Halo alone (124GB UMA VRAM) is already nice, but adding 1 or 2 eGPUs works well for running the recently popular 27B or 31B dense models.

2. The native bandwidth limit of eGPUs can be mitigated. I put together a 2-slot NVLink setup (cheaper than 3-slot) with a simple cooling mod on the 3090s. You might see up to several times better PP/s and TG/s on small dense models, depending on the situation, and it can be useful in multi-agent coding scenarios.

3. A riser cable gives an eGPU setup enough slot flexibility to fit a 2-slot NVLink bridge, with a small mod, on typical motherboard-style PCIe 3090 cards.

4. Depending on the KV cache type in vLLM, not only do max context length and concurrent requests change, but speed also differs a lot at longer context. A config might look good at the beginning but not hold up over a longer run.

5. For power efficiency, 27B dense models get better PP/s and TG/s per watt on the eGPUs. But for 122B, running on Strix Halo alone via llama.cpp was more power efficient than all 3 GPUs combined.

6. NVLink doesn't do anything for llama.cpp's layer split. I tried the recent -sm tensor: the TG/s gain was around 30%, but the PP/s drop was too big, so I stopped and moved on to vLLM on the dual 3090s.

I was getting a bit frustrated by the relatively slow PP/s on 27B and 31B dense models on my Bosgame M5 Strix Halo, so I decided to do some tinkering to overcome it. Recently these dense models have been getting much more attention than 70B+ MoE models. To run them better I bought a single 3090 on the local second-hand market and, after seeing an improvement, quickly moved to a dual eGPU setup via both NVMe PCIe 4x4 slots.

I was hesitant to try NVLink since there was no guarantee it would work in my eGPU case, and a 3-slot NVLink bridge was too expensive (600 USD+). Still, I wanted to see if I could improve the eGPUs' PHB speed, which has to go through the CPU.
But most 3090 cards, including mine, are 3 slots thick, so I ended up buying a 2-slot bridge for around $250 including customs fees.
For this, I removed the 3-fan shroud on the top 3090 and roughly attached 120mm fans with a 3D-printed side-blow duct to make it fit. Surprisingly, the temperature of this modded 3090 actually stays lower than the unmodded one on the bottom.

Test Environment:

  • Fedora 43
  • llama.cpp: Strix Halo performance power mode, build 9221.
    • 122B test was split with -sm layer using ROCm 7.2.3 and CUDA.
    • 27B test used ROCm 7.2.3 as the baseline. (Comparing ROCm 7.2.3 and Vulkan RADV, ROCm has better PP/s and Vulkan has better TG/s.) Benchmarks were repeated only 2 times.
    • Note: Since MTP is not fully implemented in llama cpp benchmarks yet, I borrowed the code_python MTP metrics (-pp/s% and +tg/s%) from kyuz0's strix halo toolbox for the 27B and 122B (using 35B A3B Moe stats) to plot simulated MTP lines. (https://kyuz0.github.io/amd-strix-halo-toolboxes/mtp.html)
  • vLLM: Nightly build. 3090s are power limited to 230W each.
  • vLLM benchmarks followed the Club 3090 direction:
    • Narrative: "Write a detailed 800-word essay explaining transformer attention." (max_tokens=1000)
    • Code: "Write a Python implementation of quicksort with comments explaining each step." (max_tokens=800)
    • Sampling: temp=0.6, top_p=0.95, top_k=20, presence_penalty=0.0, enable_thinking=false. Three warmups and five measured runs.
    • Since Club 3090 doesn't have benchmarks based on context depth, I added those tests.

Benched vLLM models - Qwen 3.6 27B

| Recipe | Quantization | KV cache | Context | Concurrency | Drafter |
|---|---|---|---|---|---|
| docker-compose-dual (small, INT4 Standard) | AutoRound INT4 | fp8_e5m2 | 131K | 4 (total ~524K) | MTP=3 |
| turbo (High-Concurrency) | AutoRound INT4 | TQ3 (3-bit) | 262K | 4 (total ~1048K) | MTP=3 |
| mixed-bf16 (Precision, kinda Q6 feeling) | Mixed (INT4+8) | bfloat16 | 110K | 2 (total ~220K) | MTP=3 |
| mixed-fp8 (Sweet Spot) | Mixed (INT4+8) | fp8_e5m2 | 131K | 2 (total ~262K) | MTP=2 |
| autoround INT8 (Largest) | AutoRound INT8 | fp8_e5m2 | 115K | 1 (total ~115K) | MTP=3 |

The mixed-bf16, mixed-fp8, and autoround INT8 recipes are lightly edited versions of Club 3090's recipes, aiming for a better-than-Q4 level of quantization.
(While writing this I noticed the mixed-fp8 recipe used MTP=2. It's too much work to redo, so keep in mind that its conditions differ slightly.)

Benched vLLM models - Qwen 3.6 27B

| Recipe | KV cache | Context | Concurrency | Drafter |
|---|---|---|---|---|
| awq-bf16 (pure AWQ) | bf16 | 262K | 262K × 1, 131K × 2, 65K × 4 | MTP=4 |
| awq_autoround (hybrid awq) | bf16 | 262K | 262K × 1, 131K × 2, 65K × 4 | MTP=4 |
| int8 (larger context) | INT8 | 340K ~ 392K | 262K × 1, 170K × 2, 98K × 4 | MTP=4 |
| docker-compose-bf16 (default) | bf16 | 60K | 60K × 1 | MTP=4 |

The awq_autoround recipe is also lightly edited from the original.

Results:

Triple : dual 3090 + Strix halo

122B Q4 K XL unsloth, q8_0, Strix Halo vs Triple

https://preview.redd.it/k3owfjdupq2h1.png?width=1600&format=png&auto=webp&s=0ac542116870087ebdbeeb959ab7bb6e398b802b

https://preview.redd.it/avlcn0hpoq2h1.png?width=1600&format=png&auto=webp&s=a824f6b42c48e2b4e3ae7690a36b473ca8d8c81c

Strix Halo (llama.cpp, 27B MTP Q6 K XL unsloth, 25GB including mmproj)
vs dual 3090 (Qwen3.6-27B-Mixed-AutoRound Minachist, 28.9GB)
I chose these quants because their quality is reasonably good and they are close in size.

https://preview.redd.it/gl5xz5ufqq2h1.png?width=1600&format=png&auto=webp&s=4f14f93ffacd94fbb68c6bb52f462012fad0882f

https://preview.redd.it/n93cgeshqq2h1.png?width=1600&format=png&auto=webp&s=98d219e97e13137db627d66d84124aae84275a74

Power efficiency
Rough calculation, but for 27B dense models, the eGPU setup has better power efficiency. However, when running the 122B model, Strix halo alone running on llama cpp was actually more power efficient.

https://preview.redd.it/s2ryohacsq2h1.png?width=1600&format=png&auto=webp&s=e0764be736283bb211e52ed67110b0b9e28fc8ad

https://preview.redd.it/8xdltx0esq2h1.png?width=1600&format=png&auto=webp&s=2d0d2a8b637aae66c5c2511c95e2b1c6baae8ae5

NVLink on / off

Tested NVLink on vs off. As concurrency and context go up, NVLink mitigates the bandwidth bottleneck pretty well.

BF16 cache scenario

https://preview.redd.it/92qm9owysq2h1.png?width=1600&format=png&auto=webp&s=af40d019a444877c1d7128b30dbc5b0d80837c66

https://preview.redd.it/6zqs4g80tq2h1.png?width=1600&format=png&auto=webp&s=4951dc402159bd64d8959ebdf5fe1f42c8b5d9e2

fp8 cache case.

https://preview.redd.it/yzcgl1wjtq2h1.png?width=1600&format=png&auto=webp&s=6b6e547721a6daeb480423b5928c5a30cdf98e51

https://preview.redd.it/zopa2nlktq2h1.png?width=1600&format=png&auto=webp&s=25f05e0a183ae75627f2ae1071ea9318f91dfe0a

INT4 quant's fp8 scenario

https://preview.redd.it/6um96q5qtq2h1.png?width=1600&format=png&auto=webp&s=463dfd330cd6f783ab9d6e446f58dc15be568326

https://preview.redd.it/e4j0sj3stq2h1.png?width=1600&format=png&auto=webp&s=4655627f234372ea7d4c847aaaca9faeb2080f7b

Gemma4 31B's case
Gemma-4-31B-it-AutoRound-AWQ, mattbucci, BF16 cache

https://preview.redd.it/rey8p3zytq2h1.png?width=1600&format=png&auto=webp&s=aa573c264af1e3fed6a87ec0837bca32066116b3

https://preview.redd.it/wera6hiztq2h1.png?width=1600&format=png&auto=webp&s=d8c92a6abffcbd0d866c17a7d3ecf2a19764a47c

This shows differences based on quantization and KV cache types. You can see how much max context length and speed fluctuate just by changing the cache type.
On Ampere cards, TQ3 was pretty bad at sustaining TG/s, even though it allows more context.

https://preview.redd.it/j6y2cg6nvq2h1.png?width=1164&format=png&auto=webp&s=52eef18357c23d2341444e3e7e873902837fd87d

https://preview.redd.it/jb917qmovq2h1.png?width=1164&format=png&auto=webp&s=e94a60d752d0ad6bf28c070015a15c1cb37a0759

Code vs Narrative MTP

When concurrency is 1, code generation is always faster than narrative. But as you can see, when concurrency is 2 and the context gets deeper, code speed drops and is overtaken by narrative. It seems some odd load appears when concurrent requests and long context combine.

https://preview.redd.it/pcw1duwdwq2h1.png?width=1600&format=png&auto=webp&s=f6366e31b70af3d3d3361288320b9ebba4cda5c8

Huge thanks to
Club 3090 (https://github.com/noonghunna/club-3090/tree/master),
kyuz0's toolbox (https://github.com/kyuz0/amd-strix-halo-toolboxes), and DasDigitaleMomentum's distrobox (https://github.com/DasDigitaleMomentum/strix-halo-cuda-combined-toolbox)

u/Rattling33 — 4 hours ago

I fine-tuned Cohere Transcribe to support diarization and timestamps

Hi

I'll keep it short:
Cohere-transcribe is currently the best open source speech-to-text model (and possibly even better than proprietary models).

BUT it doesn't support diarization (speaker identification) and timestamps, even though there are tokens for it in the tokenizer.

SO I trained the model to support it. It follows the standard timestamp format.

The output now looks like this:

<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>

Which is an easily parsable format.
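As an illustration of how that could be parsed, here is a small regex-based sketch. It assumes each segment is a speaker token of the form <|spltokenN|> followed by a start timestamp, the text, and an end timestamp, exactly as in the sample line; the `Segment` type and function name are made up here and are not part of the model's tooling.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    speaker: int
    start: float
    end: float
    text: str


# speaker token, start timestamp, text, end timestamp
SEGMENT_RE = re.compile(
    r"<\|spltoken(\d+)\|>"
    r"<\|t:([0-9.]+)\|>"
    r"(.*?)"
    r"<\|t:([0-9.]+)\|>",
    re.DOTALL,
)


def parse_transcript(raw: str) -> List[Segment]:
    """Turn the tagged model output into a list of speaker segments."""
    return [
        Segment(int(speaker), float(start), float(end), text.strip())
        for speaker, start, text, end in SEGMENT_RE.findall(raw)
    ]


sample = (
    "<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|>"
    "<|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>"
)
segments = parse_transcript(sample)
for seg in segments:
    print(f"speaker {seg.speaker} [{seg.start:.1f}s - {seg.end:.1f}s]: {seg.text}")
```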

The timestamps are accurate within 0.097 seconds on average, and 90% are within 0.006 seconds.

The model supports up to 4 speakers per 30 seconds, and using the diarize_long.py script, it could accurately identify up to 32 people.

It's available for free on huggingface.

Enjoy!

u/iamMess — 4 hours ago

Blackwell and PDL performance increase

Llama.cpp recently introduced support for Programmatic Dependent Launch (PDL), a feature of newer Nvidia GPUs (CC >= 90, so not including Ada) such as Blackwell. (See PR 22522.)

In short, PDL enables more efficient execution of kernels and, as a result, better performance. So far it's not enabled by default, so if you don't know about it, you will likely miss it.

To enable PDL you need to build Llama.cpp with the '-D GGML_CUDA_PDL=ON' flag. It's not yet enabled for all kernels, so there is likely more performance to be had once more kernels are enabled with PDL.

(To later disable PDL, if needed, do 'export GGML_CUDA_PDL=0' before starting llama.cpp.)

Benchmarks

| Model | pp512 | tg128 | pp512 @ PDL | tg128 @ PDL | pp % | tg % |
|---|---|---|---|---|---|---|
| Qwen 3.6 35B.A3B MXFP4 | 5412.39 ± 62.58 | 172.72 ± 3.94 | 5416.55 ± 58.92 | 183.03 ± 0.93 | 0 | 5.97 |
| Qwen 3.6 35B.A3B UD-Q5_K_XL | 4564.77 ± 47.55 | 162.24 ± 6.67 | 4582.22 ± 45.65 | 177.11 ± 1.29 | 0 | 9.17 |
| Gemma 4 26B.A4B NVFP4 | 6728.74 ± 89.56 | 107.39 ± 2.44 | 6850.46 ± 97.86 | 112.71 ± 0.38 | 1.8 | 4.95 |
| Qwen 3.6 27B NVFP4 | 2687.16 ± 70.18 | 41.31 ± 0.03 | 2708.97 ± 55.56 | 42.22 ± 0.05 | 0 | 2.2 |

(All tests run with b9282 and results are best of two on an RTX Pro 4500 Blackwell 32GB.)

Conclusion

There is virtually no difference on prefill; however, there is on average a 5% to 6% performance boost on token generation in the tests above. According to the PR, somewhere between 4% and 10% improvement on token generation is expected.
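For anyone who wants to rerun the comparison on their own numbers, here is a short sketch of how the tg % column and the average can be derived. It uses only the tg128 means from the benchmark table and ignores the ± spread, which is a simplification.

```python
# tg128 means from the table above: (without PDL, with PDL), in tokens/s.
tg128 = {
    "Qwen 3.6 35B.A3B MXFP4": (172.72, 183.03),
    "Qwen 3.6 35B.A3B UD-Q5_K_XL": (162.24, 177.11),
    "Gemma 4 26B.A4B NVFP4": (107.39, 112.71),
    "Qwen 3.6 27B NVFP4": (41.31, 42.22),
}

# Relative gain in percent for each model.
gains = {
    model: (with_pdl - base) / base * 100
    for model, (base, with_pdl) in tg128.items()
}
mean_gain = sum(gains.values()) / len(gains)

for model, gain in gains.items():
    print(f"{model:<30} +{gain:.2f}%")
print(f"{'mean':<30} +{mean_gain:.2f}%")
```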

As mentioned, this is not enabled by default when building. If you are on Blackwell, this is a free lunch and worth trying out.

u/UncleRedz — 3 hours ago

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop

A few days ago I posted about my experiments with MTP on a 6GB VRAM laptop. That didn't work so well; CPU offload hurts MTP performance badly. But now I've tried out the new ByteShape quants for Qwen3.6-35B-A3B that are claimed to be both smaller and faster than others while still having excellent quality. I decided to compare my previous best Unsloth UD-IQ4_XS setup head-to-head with the ByteShape "CPU-5" quant in terms of performance.

TL;DR: ByteShape quant is 30% faster on TG but slightly slower on PP than the similarly sized Unsloth quant when partially offloaded to CPU on a 6GB VRAM laptop.

Hardware

  • Asus ROG Zephyrus G14 laptop, 2021 model
  • AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
  • NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
  • 24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

  • Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
  • llama.cpp version: 9203 (87589042c) built from current master branch with GNU 13.3.0 for Linux x86_64
  • CUDA 12.0 installed from Ubuntu repositories

Test setup

I fixed the following for all the experiments:

  • context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
  • mmap off, mlock on, ubatch size 2048 (gives much better PP speed than the default 512)
  • no mmproj (no image input support needed for now)
  • for more details, see configuration below

The quants tested:

Configuration

My models-preset.ini contents:

version = 1
[Qwen3.6-35B-A3B]
# Unsloth variant
m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
# ByteShape variant
# m = /proj/llms/Qwen3.6-35B-A3B-Q4_K_S-4.22bpw.gguf
fit = true
fit-target = 64
c = 65536
chat-template-kwargs = {"preserve_thinking": true}
temp = 0.6
top-p = 0.95
min-p = 0.0
top-k = 20
repeat-penalty = 1.0
presence-penalty = 0.0
ctx-checkpoints = 64
flash-attn = on
b = 2048
ub = 2048
jinja = true
ctk = q8_0
ctv = q8_0
threads = 6
parallel = 1
cache-ram = 4096
mmap = false
mlock = true

Benchmark results

I used a test prompt of approx. 10k tokens, followed by 1.5-2k tokens of generation. Tried both twice, got pretty much exactly the same numbers.

| Metric | Unsloth | ByteShape | Δ |
|---|---|---|---|
| PP tok/s | 585 | 564 | -4% |
| TG tok/s | 25.4 | 33.1 | +30% |

The ByteShape quant, despite being a bit larger than Unsloth, is over 30% faster on generation than the Unsloth quant! PP speed is slightly lower for ByteShape though.

Observations

  • Part of the difference may be explained by imatrix (IQ) vs regular (Q) quants. Unsloth UD-IQ4_XS is imatrix, and I understand that these are slower to compute on CPU. A better comparison would be against the ByteShape GPU-5 quant, which is also imatrix in my understanding. But I wanted an upgrade over UD-IQ4_XS and definitely got it!
  • I noticed that my TG performance seems to degrade over time by ~10% or more without changing the setup. I suspect suspending and then awakening the laptop repeatedly somehow hurts, but I haven't figured out the reason; it's not just memory pressure building up AFAICT. Rebooting the machine brings me the best performance, so I did that before benchmarking.
  • I haven't made any detailed quality measurements between the models. The ByteShape model seems very similar; possibly the thinking output is generally somewhat shorter than with Unsloth, but that could be a measurement error. I hope that someone does an independent comparison between ByteShape and other quants in terms of output quality, because their claims seem to be a bit too good to be true!

Notes

This post assembled from 100% biodegradable bytes. No AIs were harmed in the process.

u/OsmanthusBloom — 8 hours ago

Experts first llama.cpp

This is for everyone with 12GB VRAM.

Hi, I created a fork of llama.cpp with an experimental implementation that offloads experts instead of whole layers. The reason is that I own an RTX 2060 with 12GB VRAM. That sounds big but is too little for dense models, which is why I mainly use MoE models. The problem is that you need to split some layers off to the CPU.

As you all surely know, Qwen3.6-35B-A3B uses only 8 experts per token; the rest are unused, so why not fill VRAM with the experts that are actually used instead of complete layers full of unused experts?

I started by creating a UI to monitor which experts are used. This already showed me that the first layers are more important to have in VRAM than the last ones, because they switch experts more frequently than the others. Unfortunately, --n-cpu-moe in llama.cpp leaves the first layers on the CPU, so I tried -ot, but that's another story. With the optimized setup, I was able to reach about 22 tk/s. (Remember the 2060 has only about half the CUDA cores of a 3060.) With the default --n-cpu-moe, I get 19 tk/s.

I only run Q6 models, since the degradation from lower quants is visible in coding. My context is not quantized (same reason), and because of Java development, I need a big context window of 100k.

However, with my expert variant and a hit rate of about 62%, it increased to 26 tk/s. The break-even point was at a 42% hit rate, meaning that 42% of the experts chosen for the prompt were already in my GPU cache. As I tested smaller memory sizes (there is a built-in argument to specify the VRAM usage), another use case came to mind: with a good profile, you can reduce VRAM usage a lot without sacrificing speed.
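To make the hit-rate idea concrete, here is a toy sketch: pick the most frequently used (layer, expert) pairs from a profiling run as the "hot" set, then measure what share of later expert requests that set serves. This is only an illustration of the concept, not the fork's implementation; the function names and the tiny traces are invented.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Expert = Tuple[int, int]  # (layer index, expert index)


def pick_hot_experts(profile: Iterable[Expert], capacity: int) -> Set[Expert]:
    """Keep the `capacity` most frequently used experts from a profiling run."""
    counts = Counter(profile)
    return {expert for expert, _ in counts.most_common(capacity)}


def hit_rate(requests: Iterable[Expert], cached: Set[Expert]) -> float:
    """Share of expert requests that are served from the cached set."""
    requests = list(requests)
    if not requests:
        return 0.0
    hits = sum(1 for expert in requests if expert in cached)
    return hits / len(requests)


# Invented profiling trace: which experts the router picked, in order.
profile_trace = [(0, 1), (0, 1), (0, 2), (1, 5), (0, 1), (1, 5), (2, 7)]
cached = pick_hot_experts(profile_trace, capacity=2)

# Invented trace from a later prompt.
new_trace = [(0, 1), (1, 5), (2, 7), (0, 3), (0, 1)]
rate = hit_rate(new_trace, cached)
print(f"cached experts: {sorted(cached)}")
print(f"hit rate: {rate:.0%}")
```

In this picture, the break-even figure quoted above is the hit rate at which serving hot experts from VRAM starts to outweigh the cost of the misses.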

Now, to my question: is there anyone who would like to give it a test? I really would like to know how it behaves on a 3060/4060 or similar. (CUDA is a requirement, and Qwen 35B A3B or Gemma 26B A4B.) Currently, it has only been tested on Linux.

Really, I'm not trying to earn stars or anything. I don't care; I just want to know how much it increases token generation on which NVIDIA graphics cards.

It would need the following: check out and build https://github.com/adrianhoehne/llama.cpp

Start it with the additional arguments:

./build/bin/llama-server --moe-layer-perf-out experts.json \
--cpu-moe \
--ctx-size 100000 \
--parallel 1

Then start a prompt and wait. This will take longer than usual because every expert is still on the CPU.

After that, change the arguments to

./build/bin/llama-server --moe-hot-cache experts.json \
--moe-hot-cache-max-mib -1 \
--moe-hot-cache-auto-reserve-mib 1024 \
--moe-hot-cache-update-rate 0.10 \
--cpu-moe \
--ctx-size 100000 \
--parallel 1

And start measuring.

I also added a view of which experts are used to the Llama UI:

Button for ui

u/comanderxv — 8 hours ago

I ran a quantization shootout on Qwen3-Coder and the results are... interesting

Out of random curiosity I ran a shootout on Qwen3-Coder-Next. I've been using the MXFP4_MOE from unsloth for a while as it's just really fast on my system, but I was curious about precision. I know quantization hurts the model, but I don't think I had really understood that until I tested it myself.

Hardware: 3× R9700 PRO (96 GB VRAM)

Backend: llama.cpp Vulkan

Eval: wikitext-2 (583 chunks, ctx 512)

Formats tested: MXFP4_MOE Q4_K_M Q5_K_M UD-Q5_K_M

TLDR: UD-Q5_K_M is cooking! Better quality than the smaller formats, barely any speed penalty. Unsloth's dynamic precision approach is really good. I might need to test it at lower quants now.

The Numbers
(no shit I asked claude to make me a table to copy pasta)

| Metric | MXFP4 | Q4_K_M | Q5_K_M | UD-Q5_K_M |
|---|---|---|---|---|
| Same top-1 | 89.4% | 89.6% | 93.0% | 94.0% |
| Mean KL divergence | 0.0746 | 0.0685 | 0.0308 | 0.0217 |
| Max KL (worst token) | 13.04 | 5.93 | 8.19 | 4.75 |
| File size | 44.7 GB | 45.2 GB | 52.9 GB | 55.2 GB |

UD-Q5_K_M wins on literally every quality metric while only being ~10 GB larger than MXFP4.

Here's the thing nobody talks about: token accuracy compounds exponentially.

A ~5-point difference in per-token agreement becomes a ~150× difference by token 100. All LLMs are autoregressive. Yann LeCun is always talking about this: LLMs suffer from exponentially diverging error probabilities. This is where all your hallucinations and stuff happen.

MXFP4 (89.4%) > 100 token output: 0.0014% chance of perfect agreement

UD-Q5_K_M (94%) > 100 token output: 0.21% chance of perfect agreement
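Those two figures follow from raising the per-token agreement rate to the 100th power. A minimal sketch, assuming every token agrees independently with the same probability (a simplification, since real agreement varies by position and context):

```python
def perfect_agreement(per_token_rate: float, n_tokens: int) -> float:
    """Chance that all n_tokens match, assuming independent tokens."""
    return per_token_rate ** n_tokens


N_TOKENS = 100
mxfp4 = perfect_agreement(0.894, N_TOKENS)  # Same top-1 for MXFP4
ud_q5 = perfect_agreement(0.940, N_TOKENS)  # Same top-1 for UD-Q5_K_M
ratio = ud_q5 / mxfp4

print(f"MXFP4:     {mxfp4:.4%}")
print(f"UD-Q5_K_M: {ud_q5:.2%}")
print(f"ratio:     {ratio:.0f}x")
```

With these inputs the ratio between the two formats comes out at roughly 150×.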

That's not a big number, but on long refactoring tasks or multi step reasoning, you feel it. MXFP4 "goes off the rails" way more often.

There is a speed trade off to all of this though.

Prefill (batch 512): MXFP4 still fastest (hardware kernels)

Prefill (batch 4096): MXFP4 wins again

Decode: Q4_K_M edges UD-Q5 slightly, but UD-Q5 is within 9% despite being 22% larger

For interactive coding (which is decode-bound anyway), the speed hit is negligible.

For me, I swapped my default from MXFP4 to UD-Q5_K_M. MXFP4 is still great for heavy prefill workloads but for daily code generation where you care about quality over speed, UD-Q5 is the clear winner.

What quants are you guys running for code models? Are you finding the same quality cliff with aggressive compression? And if you're on Nvidia hardware, are you seeing different tradeoffs than RDNA?

https://preview.redd.it/0z8kkkhjkp2h1.png?width=1130&format=png&auto=webp&s=aadcce727dc26d756d67d4e356a709aa96fd030f

u/alphatrad — 9 hours ago

[NEW] Supra-50M Released!

https://preview.redd.it/kx39ammxno2h1.jpg?width=1080&format=pjpg&auto=webp&s=d1a2d5b27920a5b61a50547a6e70a6378445cae4

SupraLabs released a new model! - Supra-50M

Supra-50M is a compact 50M-parameter causal language model (BASE and INSTRUCT versions) built from scratch by SupraLabs using a Llama-style architecture, trained on 20 billion tokens of high-quality educational web text. Despite being significantly smaller than comparable open models, it achieves competitive or superior results on several key benchmarks. This is our first SupraLabs Scaling Up Plan model.

🤗 Supra-50M-Base | Supra-50M-Instruct

What comes next?

  • Supra-124M — Base, Chat, Experimental Reasoning
  • Supra-350M — Base, Chat, Reasoning, Coding

🏆 Benchmarks

| Benchmark | Supra-50M (ours) | GPT-2 (124M) | SmolLM-135M | OpenELM-270M |
|---|---|---|---|---|
| Parameters | 50M | 124M (2.5×) | 135M (2.7×) | 270M (5.4×) |
| BLiMP (linguistics) | 76.3% | 63.0% | 69.8% | N/A |
| SciQ (science) | 77.2% | 53.2% | 73.4% | 84.70% |
| ARC-Easy (knowledge) | 52.2% | 42.0% | 49.2% | 45.08% |
| PIQA (logic) | 62.2% | 63.0% | 67.3% | 69.75% |
| HellaSwag (context) | 31.8% | 29.5% | 42.0% | 46.71% |

🧠 Architecture & Hyperparameters

Hyperparameter Value
Architecture Llama (decoder-only transformer)
Parameters ~50M
Vocab size 32,000
Hidden size 512
Intermediate size 1,408
Hidden layers 12
Attention heads 8
Key-value heads 4 (GQA)
Max position embeddings 1,024
RoPE theta 10,000
Tied embeddings Yes
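The "~50M" figure can be roughly reproduced from the table. The sketch below assumes a standard Llama layout that the post does not spell out: no bias terms, RMSNorm weights only, a gated MLP with gate/up/down projections, and a head dimension of hidden size divided by attention heads.

```python
# Values from the hyperparameter table above.
VOCAB = 32_000
HIDDEN = 512
INTERMEDIATE = 1_408
LAYERS = 12
HEADS = 8
KV_HEADS = 4

head_dim = HIDDEN // HEADS       # 64, assumed
kv_dim = KV_HEADS * head_dim     # 256 with GQA

# Attention: q and o are HIDDEN x HIDDEN, k and v are HIDDEN x kv_dim.
attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
# Gated MLP: gate, up and down projections (assumed SwiGLU-style).
mlp = 3 * HIDDEN * INTERMEDIATE
# Two RMSNorm weight vectors per layer.
norms = 2 * HIDDEN

per_layer = attn + mlp + norms
embeddings = VOCAB * HIDDEN      # counted once because embeddings are tied
final_norm = HIDDEN

total_params = LAYERS * per_layer + embeddings + final_norm
print(f"per layer:  {per_layer:,}")
print(f"embeddings: {embeddings:,}")
print(f"total:      {total_params:,} (~{total_params / 1e6:.1f}M)")
```

Under these assumptions the estimate lands a little under 52M, with close to a third of the parameters in the tied embedding matrix.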

📚 Training Data

Property Value
Dataset HuggingFaceFW/fineweb-edu (sample-100BT)
Total tokens 20B
Sequence length 1,024 tokens
Storage format Memory-mapped binary (uint16, ~40 GB)

🔤 Tokenizer

Custom Byte-Level BPE tokenizer trained from scratch on 500,000 documents sampled from fineweb-edu (sample-10BT).

Property Value
Type ByteLevelBPETokenizer
Vocabulary size 32,000
Min frequency 2
Special tokens <s>, <pad>, </s>, <unk>, <mask>

⚙️ Training Configuration

Parameter Value
Epochs 1
Per-device batch size 32
Gradient accumulation steps 4
Effective batch size 128 × 1,024 tokens
Learning rate 6e-4
LR scheduler Cosine
Warmup ratio 2%
Optimizer AdamW Fused (β1=0.9, β2=0.95)
Weight decay 0.1
Max grad norm 1.0
Precision bfloat16
torch.compile Enabled
Hardware Single GPU
Final loss 3.259
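To get a feel for the scale of this run, here is a sketch that derives the number of optimizer steps from the table and evaluates a cosine schedule with warmup. The schedule shape (linear warmup, cosine decay to zero) is an assumption, since the post only says "Cosine" with a 2% warmup ratio and gives no minimum learning rate.

```python
import math

# Values from the training tables above.
TOTAL_TOKENS = 20_000_000_000
PER_DEVICE_BATCH = 32
GRAD_ACCUM = 4
SEQ_LEN = 1_024
PEAK_LR = 6e-4
WARMUP_RATIO = 0.02

# Single GPU, so the effective batch is 32 x 4 = 128 sequences per step.
TOKENS_PER_STEP = PER_DEVICE_BATCH * GRAD_ACCUM * SEQ_LEN
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP
WARMUP_STEPS = int(WARMUP_RATIO * TOTAL_STEPS)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))


print(f"tokens per step: {TOKENS_PER_STEP:,}")
print(f"total steps:     {TOTAL_STEPS:,}")
print(f"warmup steps:    {WARMUP_STEPS:,}")
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"lr at step {step:>7,}: {lr_at(step):.2e}")
```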

🚀 Inference — Instruct version

import os, warnings
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")

import torch
from transformers import pipeline, AutoTokenizer, logging
logging.set_verbosity_error()

MODEL_ID = "SupraLabs/Supra-50M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, clean_up_tokenization_spaces=False)
pipe = pipeline(
    "text-generation",
    model=MODEL_ID,
    tokenizer=tokenizer,
    device_map="auto",
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32
)

def build_prompt(instruction, input_text=""):
    if input_text.strip():
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

def generate(instruction, input_text=""):
    result = pipe(
        build_prompt(instruction, input_text),
        max_new_tokens=512, do_sample=True, temperature=0.7,
        top_k=50, top_p=0.9, repetition_penalty=1.15,
        pad_token_id=pipe.tokenizer.pad_token_id,
        eos_token_id=pipe.tokenizer.eos_token_id,
        return_full_text=False
    )
    return result[0]['generated_text'].strip()

while True:
    print("\nEnter an instruction (or 'exit' to quit):")
    user_input = input().strip()
    if user_input.lower() == "exit":
        break
    print("\nEnter additional context (optional, press Enter to skip):")
    context_input = input().strip()
    print(f"\nResponse:\n{generate(user_input, context_input)}\n")

Base version

from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="SupraLabs/Supra-50M_BASE",
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

def generate_text(prompt, max_new_tokens=150):
    result = pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True, temperature=0.5,
        top_k=25, top_p=0.9, repetition_penalty=1.2,
        pad_token_id=pipe.tokenizer.pad_token_id,
        eos_token_id=pipe.tokenizer.eos_token_id
    )
    return result[0]['generated_text']

prompt = "The importance of education is"
print(f"Prompt: {prompt}\n" + "-" * 40)
print("\nOutput:\n" + generate_text(prompt))

💬 Sample Outputs

Prompt: "The main concept of physics is "

>

Prompt: "Artificial intelligence is "

>

Prompt: "Once upon a time, "

>

First model in the SupraLabs Scaling Up Plan. Feedback welcome!

reddit.com
u/Dangerous_Try3619 — 12 hours ago

Qwen 3.6 struggling with German

Hi everyone,

I’m looking for advice on local AI setups. My goal is to have a local AI generate text documentation from my one-hour therapy sessions.

So far, I’ve experimented with Hermes Agent and tried Qwen 3.6 (27B & 35B) as well as Gemma 41B. My workflow involves transcribing audio with Whisper and then feeding the transcript to a local AI. This works fine with a cloud model, but I cannot use a cloud solution in production due to patient data and privacy concerns. I want to handle everything locally.

My main issue is that Qwen 3.6 struggles with German. It sometimes produces technically correct words that aren’t commonly used in natural German. The text can also feel very “AI-like,” whereas cloud models produce much more natural-sounding results. A second problem is that both local models sometimes cannot distinguish what is important from what is not; cloud models handle this much better.

I’m wondering if there’s a targeted approach to make local models behave better—would fine-tuning help here? Has anyone managed to get this working in a meaningful way for structured German text documentation?

I’ve built a complex iterative skill setup, which works well with DeepSeek V4, but the local results are disappointing. I don’t understand why generating text documentation from one-hour therapy sessions locally seems so difficult, and I’d love to hear what has worked for others.

Thanks in advance!

reddit.com
u/xchris1337xy — 10 hours ago

[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in GGML repo

Most of you are probably aware that, at least with CUDA, starting llama.cpp with anything other than -ctk q8_0 -ctv q8_0 or -ctk q4_0 -ctv q4_0 moves prompt processing to the CPU instead of the GPU. E.g. with the frequently suggested mix of -ctk q8_0 -ctv q4_0, prompt processing speed tanks.

I discussed this with a proprietary LLM, and it suggested either making some slight modifications to the CUDA source code of llama.cpp or building with cmake -DGGML_CUDA_FA_ALL_QUANTS=ON .. (which makes compilation take very long).

But coincidentally, user sanmai on GitHub did a small eval and suggested including this KV cache quant combo during compilation even without FA_ALL_QUANTS, which would be great.

The discussion is here. It is worth a read, as the eval confirms that using the asymmetric 8/4-bit KV quant only costs 1.3% precision while saving more than half of the memory compared to f16/f16:

https://github.com/ggml-org/llama.cpp/discussions/23470
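
The "more than half of the memory" part can be checked with back-of-the-envelope arithmetic. The sketch below uses the ggml block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes, each including an fp16 scale) and assumes the K and V caches hold the same number of elements. It only estimates cache size and says nothing about the 1.3% precision figure; kv_cache_ratio is an illustrative helper, not a llama.cpp function.

```python
# Effective bits per cached element, including the per-block fp16 scale.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 34 * 8 / 32,  # 8.5 bits
    "q4_0": 18 * 8 / 32,  # 4.5 bits
}


def kv_cache_ratio(k_type, v_type, baseline="f16"):
    """KV cache size relative to using `baseline` for both K and V."""
    used = BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type]
    return used / (2 * BITS_PER_ELEMENT[baseline])


for k_type, v_type in [("q8_0", "q8_0"), ("q8_0", "q4_0"), ("q4_0", "q4_0")]:
    ratio = kv_cache_ratio(k_type, v_type)
    print(f"K={k_type} V={v_type}: {ratio:.1%} of f16/f16, saves {1 - ratio:.1%}")
```

With these numbers, q8_0/q4_0 lands at about 40.6% of the f16/f16 cache, a saving of roughly 59%, compared with about 47% saved by symmetric q8_0.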

reddit.com
u/Ueberlord — 11 hours ago

Seeking resources to read about llama.cpp server and how offloading works

SETUP INFO: AMD R9700 AI PRO. Using llama-cpp server, ROCm docker version. Using the -ngl option to offload.


First of all, I'm greatly impressed by how llama-cpp server handles offloading. There's some fucking magic happening here, at least to me.

I have 32gb of VRAM so loading in the small models is no problem, but now I'm starting to experiment with models that spill into system RAM, testing tok/sec differences and various quants.

I'm currently testing Qwen3 Coder Next. At Q4_K_M, this thing weighs in at 45gb in size. I can make that one work, but the more offloading I do, the slower it is (obviously). So I'm currently testing the smaller 4-bit quant, IQ4_XS at 36gb, trying to find the middle ground before quality starts to suffer.

If I offload 36 layers, it fills my vram 30/32gb. Tok / sec is around 25, which for an MoE model is not great at all - at least I don't think it is. I tried the 3-bit quant which fits fully in memory, but after multiple quality issues, I gave up on it. I think for large models and coding, 3-bit is just too much compression, or at least it feels like it. (anyone else have this impression? or is it just me?)

Anyways - to my actual question - how the hell does llama-cpp do this magic? I am monitoring RAM usage and swap file and neither of them are very high, yet I only have 30gb loaded out of this model, including 120k unquantized KV cache context... It's basically impossible, so clearly I am missing something about how Kubuntu 24.04 manages system resources.

Is my KDE5 widget for RAM not capturing what llama-cpp is up to? I'd like to read up on how it works or if someone can explain it to my dumb ass, I'd greatly appreciate it lol.


EDIT: Offloading also has a nice bonus benefit of being QUIET. For anyone with a very loud GPU fan, it's a nice break. Yes it's slower but I can work on other tabs and windows while it processes and actually hear myself think. I might do more of this.

reddit.com
u/Jorlen — 10 hours ago

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
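
For context on what VPO replaces, here is a minimal sketch of the scalar baseline only: the group-relative advantage commonly used in GRPO, applied after a vector of per-test-case rewards is averaged into one number. The abstract does not spell out the VPO estimator, so nothing below should be read as VPO itself. The population standard deviation, the eps term, and the toy reward vectors are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def scalarize(vector_rewards):
    """Collapse per-test-case rewards into one scalar per rollout (mean pass rate)."""
    return [fmean(v) for v in vector_rewards]


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: each reward standardized within its rollout group."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one prompt, each scored on four test cases (toy data).
vector_rewards = [
    [1, 1, 0, 0],  # passes the first two tests
    [0, 0, 1, 1],  # passes the other two tests
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
scalars = scalarize(vector_rewards)
advantages = grpo_advantages(scalars)
print(scalars)                          # [0.5, 0.5, 0.75, 0.0]
print([round(a, 3) for a in advantages])
```

The first two rollouts pass complementary test cases yet receive identical scalar rewards and advantages, which is the kind of information a vector-valued reward keeps and a scalar one discards.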

arxiv.org
u/Thrumpwart — 6 hours ago