

Opencode you naughty minx
Man, AI agents getting pretty crazy these days. :)
(local, I just decided to try to get an orchestrator in there, when Qwen and Gemma aren't up to it.)


Setup: Kubuntu 24.04 - AMD cards - R9700 AI PRO and 7800 XT (32GB + 16GB) - llama-cpp server - stack setup in Docker - Vulkan image
I tried ROCm, but it wouldn't play nice with the RDNA4 + RDNA3 mix.
Vulkan seems to work. I tested a quick prompt; hopefully it's stable, because if so, this gives me 48GB of VRAM to play with. I had to buy a new power supply, but for $300, and to be able to leverage my older 7800 XT, well worth it, I think.
Greetings from TurboQuant's former biggest defender, now a middle-sized, niche-aware TurboQuant defender. Today I'm presenting the results of thoroughly exploring the world of PPL and KLD benchmarks on my single RTX 3090 using BeeLlama v0.1.2, with some backstory of unsuccessfully trying other tests and then re-exploring PPL and KLD even more thoroughly to compensate.
Tests were done with Qwen 3.6 27B (Q5_K_S and IQ4_XS) at 64k and 128k context, so a decent model with decent quants at a decent context length. Basically the setup we 24 GB VRAM folks are actually using, which keeps the results grounded. I'm not in any position to talk shit about the vLLM study, but it really looked like a "how to invest and become rich if you already have $1,000,000" book to me, with regular 4-bit and 5-bit quants missing from the comparison.
Here are my findings:
- q4_0, the entire PPL range stays under 0.01 above bf16. Even turbo3_tcq only adds ~0.02 PPL. But 99.9% KL divergence tells a different story: while q5_0 (at 34.4% of bf16) is obviously behind q8_0, it's still not bad. But then q4_0's tail KLD is 32% worse than q5_0's. Now this is what breaks your tool calls and JSON structure.
- turbo4 has no quality advantage over q4_0, saves almost no memory, and runs 17% slower. TurboQuant's value is at 2-3 bits where it has no alternatives anyways.
- turbo3_tcq is consistently much better than plain turbo3, and turbo2_tcq is much better than turbo2. They are a legit solution for cases where you need aggressive compression. Now what is TCQ, you might ask? Luckily, the article covers this as well!
- q5_0/q4_0 is the same memory as q4_1/q4_1 but beats it across all test configs in 99.9% precision. After K reaches q5_0, the next useful bit goes to V, not to q5_1 K.
- Q5_K_S took 3-5% more 99.9% precision damage than IQ4_XS at the same cache quant. Model and KV cache quants are not independent, and it's better to balance their quants rather than focusing on only one or the other, as they both feed from the same VRAM pool.
- q8_0/q5_0 at 43.8% of bf16 KV keeps 99.9% precision at 93.7-98.2% across configs, so full q8_0/q8_0 at 53.1% is mostly validation when you don't struggle with VRAM anyways.

Here's the article, with all the data and quite a bit of analysis:
https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
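The memory percentages quoted in the findings can be reproduced from the block layouts of llama.cpp's legacy quant types. This is a minimal sketch, assuming the standard ggml layouts (32 values per block with an fp16 scale, plus an fp16 min for the `_1` types) and a 16-bit bf16 baseline. The helper names are made up for illustration, and the turbo types are left out because their layouts aren't given in the post.

```python
# Bytes per 32-value block for the legacy ggml cache types (assumed layouts):
# fp16 scale (2 bytes) + packed values, plus an fp16 min for the *_1 types.
BLOCK_BYTES = {"q4_0": 18, "q4_1": 20, "q5_0": 22, "q5_1": 24, "q8_0": 34}
BLOCK_VALUES = 32


def bits_per_value(cache_type: str) -> float:
    """Effective storage cost of one cached value, in bits."""
    return BLOCK_BYTES[cache_type] * 8 / BLOCK_VALUES


def kv_fraction_of_bf16(k_type: str, v_type: str) -> float:
    """Size of a K/V cache pair relative to bf16/bf16 (16 bits per value each)."""
    return (bits_per_value(k_type) + bits_per_value(v_type)) / (2 * 16)


pairs = [("q8_0", "q8_0"), ("q8_0", "q5_0"), ("q5_0", "q5_0"),
         ("q5_0", "q4_0"), ("q4_1", "q4_1")]
for k, v in pairs:
    print(f"{k}/{v}: {kv_fraction_of_bf16(k, v):.1%} of bf16")
```

Under these assumptions q5_0/q4_0 and q4_1/q4_1 come out identical in size, which is what makes that comparison a fair one.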
..and on 8GB VRAM I can even push the context to 320K, 400K, 512K, and yes.. 1M. But it does start to slow down noticeably beyond 150k so I'd only do this if I ever really want the larger context.
This is using APEX-I-Quality or Q4_K_XL quants, both of which are better than Q4_K_M (IQ4_NL_XL for contexts beyond 512k).
I have a total of 32GB of DDR4-2666, which is only slightly above the minimum DDR4 speed.
I see a lot of users with better GPUs and more VRAM seem to be getting less efficiency and have to drop context all the way to 64k or below to run at good tps, I don't understand why. But here are two things I learned from my tweaking so far.
First, since 35B-A3B is an MoE model, it only needs ~3.5B parameters to be in VRAM at runtime.
8GB is enough to hold the active model layers (~3GB) + GPU buffers (~2GB) + 262144 KV Cache at q8_0 (2.56GB). It's a tight fit, but works.
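To make the "tight fit" concrete, here is a quick budget check using only the figures quoted above. It's a rough sketch: the per-token number assumes decimal gigabytes, and the variable names are just for illustration.

```python
# Figures quoted in the post, in GB; 8 GB is the card's stated VRAM.
VRAM_GB = 8.0
budget_gb = {
    "active model layers": 3.0,
    "GPU buffers": 2.0,
    "KV cache (262144 tokens, q8_0)": 2.56,
}

total_gb = sum(budget_gb.values())
headroom_gb = VRAM_GB - total_gb

# Rough per-token KV cost implied by the quoted cache size (assumes decimal GB).
kv_bytes_per_token = 2.56e9 / 262144

print(f"total: {total_gb:.2f} GB, headroom: {headroom_gb:.2f} GB")
print(f"KV cache per token: {kv_bytes_per_token / 1024:.1f} KiB")
```

That leaves under half a gigabyte free, which is why anything else touching VRAM (a desktop environment, for instance) matters at this size.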
Messing with the engine's parameters, like forcing all layers onto VRAM, or with other runtime parameters like sm, fa, etc., actually seems to slow down the model for me and/or exhaust my VRAM and system RAM.
There's a common misunderstanding that an MoE model must fit in its entirety in VRAM to run optimally.
Second, just like Windows 11 sucks for gaming, all that "enhanced experience" also has an impact on LLM inference. Running a compact Linux from terminal (I chose Ubuntu Server) would only use up about 800MB of system RAM and practically no VRAM, compared to Windows 11, and it gives me a +25% boost to tps!
Here are some numbers for the same llama.cpp parameters:
On Windows
On Ubuntu Server (fresh dual-boot install 2 days ago, installed on a 160GB partition on my fastest NVMe)
So far it's been good enough. But I have an older small GPU I can connect and use for the operating system while keeping the 3070 Ti entirely dedicated to the LLM.
--------------------
Both profiles are coding focused and should work under Windows 11 too but with a lot less memory left.
Main profile with 256K context:
llama-server \
-m Qwen3.6-35B-A3B-Q4_K_XL.gguf \
--jinja \
--parallel 1 \
--temp 0.7 \
--top-k 20 \
--top-p 0.95 \
--min-p 0 \
--reasoning-budget 4096 \
-n 32768 \
--no-context-shift \
--no-mmap \
-c 262144 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--host 0.0.0.0
and with 512K context:
llama-server \
-m Qwen3.6-35B-A3B-Q4_K_XL.gguf \
--jinja \
--parallel 1 \
--temp 0.7 \
--top-k 20 \
--top-p 0.95 \
--min-p 0 \
--reasoning-budget 4096 \
-n 32768 \
--no-context-shift \
--no-mmap \
-c 524288 \
--rope-scale 2 \
--rope-scaling yarn \
--yarn-orig-ctx 262144 \
--cache-type-k turbo4 \
--cache-type-v turbo4 \
--host 0.0.0.0
I hope someone finds this helpful. I love this community and I'm in the Qwen3.7-35B-A3B waiting room with the rest eating my nails in anticipation lol
Hey r/ollama,
I've been building Pragma, an open-source autonomous agent that runs entirely on Ollama. The thing that bothered me about most agents is that you have no idea what they're actually doing — you give them a task and wait.
Pragma shows you everything in real time: every thought, every tool call, every observation, as it happens.
What it does:
Stack: FastAPI + Vanilla JS + WebSocket. No framework magic, every file is understandable in isolation.
Tested on: NVIDIA RTX A2000 12GB with Gemma 4 E4B (reasoning) + Qwen 2.5 Coder 7B (code). 12GB VRAM is the practical minimum for Gemma, 24GB gives more headroom.
Repo: https://github.com/homoagens/pragma
Happy to answer questions about the architecture, the skill system, or how the ReAct loop works.
In short.
1. Strix Halo alone (124GB UMA VRAM) is already nice, but adding 1 or 2 eGPUs works well for running the recently popular 27B or 31B dense models.
2. The native bandwidth limit of eGPUs can be mitigated. I tried scrambling together a 2-slot NVLink setup (cheaper than 3-slot) with a simple cooling mod on the 3090s. You might see up to several times better PP/s and TG/s on small dense models, depending on the situation, and it can be useful in multi-agent coding scenarios.
3. Basically, a riser cable gives the eGPU enough slot flexibility to fit a 2-slot NVLink bridge, with a small mod, on typical motherboard PCIe 3090 cards.
4. Depending on the KV cache type in vLLM, not only do max context length and concurrent requests change, but speed also differs a lot at longer context. It might look good at the beginning but not hold up over a longer run.
5. For power efficiency, 27B dense models get better PP/s and TG/s per watt on the eGPUs. But for 122B, running on Strix Halo alone via llama.cpp showed better power efficiency than the combined 3 GPUs.
6. NVLink does not do anything for llama.cpp's layer split. I tried the recent -sm tensor: the TG/s gain was 30%-ish, but the PP/s drop was too big, so I stopped and continued with vLLM on the dual 3090s.
I was getting a bit frustrated by the relatively slow PP/s on 27B and 31B dense models on my Bosgame M5 Strix Halo, so I decided to do some scrambling to overcome it. Recently these dense models have been getting much more attention than 70B+ MoE models. To run them better, I bought a single 3090 on the local second-hand market and, after seeing the improvement, quickly moved to a dual eGPU setup via both NVMe PCIe 4x4 slots.
I hesitated to try NVLink since there was no guarantee it would work in my eGPU case, and a 3-slot NVLink bridge was too expensive (600 USD+). Still, I wanted to see if I could improve the eGPUs' PHB speed, which has to go through the CPU.
But most 3090 cards, including mine, are 3 slots thick, so I ended up buying a 2-slot bridge for around $250 including customs fees.
For this, I removed the 3-fan shroud on the top 3090 and roughly attached 120mm fans with a 3D-printed side-blow duct to make it fit. Surprisingly, the temperature of this modded 3090 actually stays lower than the unmodded one on the bottom.
Test Environment:
-sm layer using rocm7.2.3 and cuda.
Benched vLLM models - Qwen 3.6 27B
| Recipe | Quantization | KV cache | Context | Concurrency | Drafter |
|---|---|---|---|---|---|
| docker-compose-dual (small, INT4 Standard) | AutoRound INT4 | fp8_e5m2 | 131K | 4 (total ~524K) | MTP=3 |
| turbo (High-Concurrency) | AutoRound INT4 | TQ3 (3-bit) | 262K | 4 (total ~1048K) | MTP=3 |
| mixed-bf16 (Precision,kinda Q6 feeling) | Mixed (INT4+8) | bfloat16 | 110K | 2 (total ~220K) | MTP=3 |
| mixed-fp8 (Sweet Spot) | Mixed (INT4+8) | fp8_e5m2 | 131K | 2 (total ~262K) | MTP=2 |
| autoround INT8 (Largest) | AutoRound INT8 | fp8_e5m2 | 115K | 1 (total ~115K) | MTP=3 |
The Mixed bf16, Mixed fp8, and AutoRound INT8 recipes are lightly edited from Club 3090's recipes, looking for something better than Q4-level quantization.
(While writing this I noticed the mixed-fp8 recipe uses MTP=2. Re-running is too much work, so keep the slightly different condition in mind.)
Benched vLLM models - Qwen 3.6 27B
| Recipe | KV cache | Context | Concurrency | Drafter |
|---|---|---|---|---|
| awq-bf16 (pure AWQ) | bf16 | 262K | 262K × 1, 131K × 2, 65K × 4 | MTP=4 |
| awq_autoround (hybrid awq) | bf16 | 262K | 262K × 1, 131K × 2, 65K × 4 | MTP=4 |
| int8 (larger context) | INT8 | 340K ~ 392K | 262K × 1, 170K × 2, 98K × 4 | MTP=4 |
| docker-compose-bf16 (default) | bf16 | 60K | 60K × 1 | MTP=4 |
The awq_autoround recipe is also lightly edited from the original.
Results:
Triple : dual 3090 + Strix halo
122B Q4 K XL unsloth, q8_0, Strix Halo vs Triple
Strix Halo (llama.cpp, 27B MTP Q6 K XL unsloth, 25GB including mmproj)
vs Dual 3090 (Qwen3.6-27B-Mixed-AutoRound Minachist, 28.9GB)
I chose these quants because their quality is good enough and their sizes are close.
Power efficiency
Rough calculation, but for 27B dense models, the eGPU setup has better power efficiency. However, when running the 122B model, Strix halo alone running on llama cpp was actually more power efficient.
NVLink on / off
Tested NVLink on vs off. As concurrency and context go up, NVLink defends the bandwidth bottleneck pretty well.
BF16 cache scenario
fp8 cache case.
INT4 quant's fp8 scenario
Gemma4 31B's case
Gemma-4-31B-it-AutoRound-AWQ, mattbucci, BF16 cache
This shows differences based on quantization and KV cache types. You can see how much max context length and speed fluctuate just by changing the cache type.
On Ampere cards, TQ3 was pretty bad at maintaining TG/s, even though it allows more context.
Code vs Narrative MTP
When concurrency is 1, code generation is always faster than narrative. But as you can see, when concurrency is 2 and it goes into deeper context, code speed drops and gets reversed by narrative. Seems like a weird load happens when concurrent requests and long context combine.
Huge thanks to
Club 3090 (https://github.com/noonghunna/club-3090/tree/master),
kyuz0's toolbox (https://github.com/kyuz0/amd-strix-halo-toolboxes), and DasDigitaleMomentum's distrobox (https://github.com/DasDigitaleMomentum/strix-halo-cuda-combined-toolbox)
Hi
I'll keep it short:
Cohere-transcribe is currently the best open source speech to text model (and possibly even better than other proprietary models).
BUT it doesn't support diarization (speaker identification) and timestamps, even though there are tokens for it in the tokenizer.
SO I trained the model to support it. It follows the standard timestamp format.
The output now looks like this:
<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>
Which is an easily parsable format.
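As a rough illustration of how parsable it is, here is a minimal regex sketch. It assumes every segment has the shape speaker token, start timestamp, text, end timestamp, which is inferred from the single example above. The function name and record layout are hypothetical, not part of the model's tooling.

```python
import re

# Assumed segment shape, inferred from the one example in the post:
# <|spltokenN|><|t:START|> text <|t:END|>
SEGMENT_RE = re.compile(
    r"<\|spltoken(\d+)\|><\|t:([\d.]+)\|>(.*?)<\|t:([\d.]+)\|>", re.DOTALL
)


def parse_transcript(raw: str) -> list[dict]:
    """Turn the tagged output into speaker/start/end/text records."""
    return [
        {
            "speaker": int(speaker),
            "start": float(start),
            "end": float(end),
            "text": text.strip(),
        }
        for speaker, start, text, end in SEGMENT_RE.findall(raw)
    ]


sample = (
    "<|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|>"
    "<|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>"
)
for segment in parse_transcript(sample):
    print(segment)
```

If the real output ever puts several timestamped spans under one speaker token, the pattern would need adjusting.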
The timestamps are accurate within 0.097 seconds on average, and 90% are within 0.006 seconds.
The model supports up to 4 speakers per 30 seconds, and using the diarize_long.py script, it could accurately identify up to 32 people.
It's available for free on huggingface.
Enjoy!
Llama.cpp recently introduced support for Programmatic Dependent Launch (PDL), a new feature in Nvidia GPUs (CC >= 90, not including Ada) such as Blackwell. (See PR 22522.)
In short, PDL enables more efficient execution of kernels and, as a result, better performance. So far it's not enabled by default, so if you don't know about it, you will likely miss it.
To enable PDL you need to build Llama.cpp with the '-D GGML_CUDA_PDL=ON' flag. It's not yet enabled for all kernels, so there is likely more performance to be had once more kernels are enabled with PDL.
(To later disable PDL, if needed, do 'export GGML_CUDA_PDL=0' before starting llama.cpp.)
| Model | pp512 | tg128 | pp512 @ PDL | tg128 @ PDL | pp % | tg % |
|---|---|---|---|---|---|---|
| Qwen 3.6 35B.A3B MXFP4 | 5412.39 ± 62.58 | 172.72 ± 3.94 | 5416.55 ± 58.92 | 183.03 ± 0.93 | 0 | 5.97 |
| Qwen 3.6 35B.A3B UD-Q5_K_XL | 4564.77 ± 47.55 | 162.24 ± 6.67 | 4582.22 ± 45.65 | 177.11 ± 1.29 | 0 | 9.17 |
| Gemma 4 26B.A4B NVFP4 | 6728.74 ± 89.56 | 107.39 ± 2.44 | 6850.46 ± 97.86 | 112.71 ± 0.38 | 1.8 | 4.95 |
| Qwen 3.6 27B NVFP4 | 2687.16 ± 70.18 | 41.31 ± 0.03 | 2708.97 ± 55.56 | 42.22 ± 0.05 | 0 | 2.2 |
(All tests run with b9282 and results are best of two on an RTX Pro 4500 Blackwell 32GB.)
There is virtually no difference on pre-fill, however there is on average 5% to 6% performance boost on token generation based on above tests. According to the PR, somewhere between 4% and 10% improvement on token generation is expected.
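For anyone who wants to redo the percentage columns with their own runs, here is a small sketch of the calculation using the mean values from the table above (the ± spreads are ignored). The dictionary layout and function name are just for illustration.

```python
# (baseline, with PDL) mean tokens/s taken from the table above.
RESULTS = {
    "Qwen 3.6 35B.A3B MXFP4": {"pp512": (5412.39, 5416.55), "tg128": (172.72, 183.03)},
    "Qwen 3.6 35B.A3B UD-Q5_K_XL": {"pp512": (4564.77, 4582.22), "tg128": (162.24, 177.11)},
    "Gemma 4 26B.A4B NVFP4": {"pp512": (6728.74, 6850.46), "tg128": (107.39, 112.71)},
    "Qwen 3.6 27B NVFP4": {"pp512": (2687.16, 2708.97), "tg128": (41.31, 42.22)},
}


def gain_percent(baseline: float, with_pdl: float) -> float:
    """Relative speedup of the PDL build over the baseline, in percent."""
    return (with_pdl / baseline - 1) * 100


tg_gains = []
for model, runs in RESULTS.items():
    pp = gain_percent(*runs["pp512"])
    tg = gain_percent(*runs["tg128"])
    tg_gains.append(tg)
    print(f"{model}: pp {pp:+.2f}%, tg {tg:+.2f}%")

mean_tg_gain = sum(tg_gains) / len(tg_gains)
print(f"mean tg gain: {mean_tg_gain:.2f}%")
```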
As mentioned, this is not enabled by default when building, if you are on Blackwell, this is a free lunch and worth trying out.
A few days ago I posted about my experiments with MTP on a 6GB VRAM laptop. That didn't work so well; CPU offload hurts MTP performance badly. But now I've tried out the new ByteShape quants for Qwen3.6-35B-A3B that are claimed to be both smaller and faster than others while still having excellent quality. I decided to compare my previous best Unsloth UD-IQ4_XS setup head-to-head with the ByteShape "CPU-5" quant in terms of performance.
TL;DR: ByteShape quant is 30% faster on TG but slightly slower on PP than the similarly sized Unsloth quant when partially offloaded to CPU on a 6GB VRAM laptop.
I fixed the following for all the experiments:
The quants tested:
My models-preset.ini contents:
version = 1
[Qwen3.6-35B-A3B]
# Unsloth variant
m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
# ByteShape variant
# m = /proj/llms/Qwen3.6-35B-A3B-Q4_K_S-4.22bpw.gguf
fit = true
fit-target = 64
c = 65536
chat-template-kwargs = {"preserve_thinking": true}
temp = 0.6
top-p = 0.95
min-p = 0.0
top-k = 20
repeat-penalty = 1.0
presence-penalty = 0.0
ctx-checkpoints = 64
flash-attn = on
b = 2048
ub = 2048
jinja = true
ctk = q8_0
ctv = q8_0
threads = 6
parallel = 1
cache-ram = 4096
mmap = false
mlock = true
I used a test prompt of approx. 10k tokens, followed by 1.5-2k tokens of generation. Tried both twice, got pretty much exactly the same numbers.
| | Unsloth | ByteShape | Δ |
|---|---|---|---|
| PP tok/s | 585 | 564 | -4% |
| TG tok/s | 25.4 | 33.1 | +30% |
The ByteShape quant, despite being a bit larger than Unsloth, is over 30% faster on generation than the Unsloth quant! PP speed is slightly lower for ByteShape though.
This post assembled from 100% biodegradeable bytes. No AIs were harmed in the process.
This is for all with 12GB VRAM.
Hi, I created a fork of llama.cpp with an experimental implementation that places experts, instead of whole layers, in VRAM. The reason is that I own an RTX 2060 with 12GB VRAM. That sounds big but is too little for dense models, which is why I mainly use MoE models. The problem is that you need to split some layers off to the CPU.
As you all surely know, Qwen3.6-35B-A3B uses only 8 experts per token; the rest are unused, so why not fill VRAM with experts instead of complete layers full of unused experts?
I started by creating a UI to monitor which experts are used. This already showed me that the first layers are more important to have in VRAM than the last ones, because they change experts more frequently than the others. Unfortunately, n-cpu-moe in llama.cpp leaves the first layers on the CPU, so I tried -ot, but that's another story. With the optimized setup, I was able to reach about 22 tk/s. (Remember the 2060 has only about half the CUDA cores of a 3060.) With the default --n-cpu-moe, I get 19 tk/s.
I only run Q6 models, since the degradation at coding is visible. My context is not quantized (same reason), and because of Java development, I need a big context window of 100k.
However, with my expert variant and a hit rate of about 62%, it increased to 26 tk/s. The break-even point was at a 42% hit rate, meaning 42% of the experts chosen for the prompt were found in my GPU cache. As I tested smaller VRAM sizes (there is a built-in argument to specify the VRAM usage), another use case came to mind: with a good profile, you can reduce the usage a lot without sacrificing speed.
Now, to my question. Is there a person who would like to give it a test? I really would like to know how it behaves on a 3060/4060 or similar. (CUDA is a requirement, and Qwen 35B A3B or Gemma 26B A4B). Currently, it is tested only on Linux.
Really, I don't want to earn any stars or so. I don't care; I just want to know how much it increases the token generation on which NVIDIA graphics card.
It would need the following: checkout and build https://github.com/adrianhoehne/llama.cpp
Start it with the additional arguments:
./build/bin/llama-server --moe-layer-perf-out experts.json \
--cpu-moe \
--ctx-size 100000 \
--parallel 1
Then start a prompt and wait. This will take longer than usual because every expert is still on the CPU.
After that, exchange the arguments to
./build/bin/llama-server --moe-hot-cache experts.json \
--moe-hot-cache-max-mib -1 \
--moe-hot-cache-auto-reserve-mib 1024 \
--moe-hot-cache-update-rate 0.10 \
--cpu-moe \
--ctx-size 100000 \
--parallel 1
And start measurement.
I also included the view of which experts are used to the Llama UI:
The new models are being tested on the Huawei Ascend 910B
Out of random curiosity I ran a shootout on Qwen3-Coder-Next. I've been using the MXFP4_MOE from unsloth for a while as it's just really fast on my system, but I was curious about precision. I know quantization hurts the model, but I don't think I had really understood that till I tested it myself.
Hardware: 3× R9700 PRO (96 GB VRAM)
Backend: llama.cpp Vulkan
Eval: wikitext-2 (583 chunks, ctx 512)
Formats tested: MXFP4_MOE Q4_K_M Q5_K_M UD-Q5_K_M
TLDR: UD-Q5_K_M is cooking! Better quality than the smaller formats, barely any speed penalty. Unsloth's dynamic precision approach is really good. I might need to test it at lower quants now.
The Numbers
(no shit I asked claude to make me a table to copy pasta)
| Metric | MXFP4 | Q4_K_M | Q5_K_M | UD-Q5_K_M |
|---|---|---|---|---|
| Same top-1 | 89.4% | 89.6% | 93.0% | 94.0% |
| Mean KL divergence | 0.0746 | 0.0685 | 0.0308 | 0.0217 |
| Max KL (worst token) | 13.04 | 5.93 | 8.19 | 4.75 |
| File size | 44.7 GB | 45.2 GB | 52.9 GB | 55.2 GB |
UD-Q5_K_M wins on literally every quality metric while only being ~10 GB larger than MXFP4.
Here's the thing nobody talks about: token accuracy compounds exponentially.
A 5% difference in per-token agreement becomes a roughly 150× difference by token 100. All LLMs are autoregressive. Yann LeCun is always talking about this, that LLMs suffer from exponentially diverging error probabilities. This is where all your hallucinations and stuff happen.
MXFP4 (89.4%) > 100 token output: 0.0014% chance of perfect agreement
UD-Q5_K_M (94%) > 100 token output: 0.21% chance of perfect agreement
That's not a big number, but on long refactoring tasks or multi step reasoning, you feel it. MXFP4 "goes off the rails" way more often.
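The compounding numbers above come from raising the per-token agreement to the power of the output length. Here's a quick sketch, assuming each token agrees independently with the same probability (a simplification, since real errors are correlated). The function name is made up.

```python
def p_all_tokens_agree(per_token_agreement: float, n_tokens: int) -> float:
    """Chance that every one of n_tokens matches the reference top-1 token,
    assuming independent per-token agreement."""
    return per_token_agreement ** n_tokens


mxfp4 = p_all_tokens_agree(0.894, 100)
ud_q5 = p_all_tokens_agree(0.940, 100)

print(f"MXFP4:     {mxfp4:.4%}")
print(f"UD-Q5_K_M: {ud_q5:.2%}")
print(f"ratio:     {ud_q5 / mxfp4:.0f}x")
```

With these two agreement rates the gap comes out near 150×, and it keeps growing with output length.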
There is a speed trade off to all of this though.
Prefill (batch 512): MXFP4 still fastest (hardware kernels)
Prefill (batch 4096): MXFP4 wins again
Decode: Q4_K_M edges UD-Q5 slightly, but UD-Q5 is within 9% despite being 22% larger
For interactive coding (which is decode-bound anyway), the speed hit is negligible.
For me, I swapped my default from MXFP4 to UD-Q5_K_M. MXFP4 is still great for heavy prefill workloads but for daily code generation where you care about quality over speed, UD-Q5 is the clear winner.
What quants are you guys running for code models? Are you finding the same quality cliff with aggressive compression? And if you're on Nvidia hardware, are you seeing different tradeoffs than RDNA?
Supra-50M is a compact 50M-parameter causal language model (BASE and INSTRUCT versions) built from scratch by SupraLabs using a Llama-style architecture, trained on 20 billion tokens of high-quality educational web text. Despite being significantly smaller than comparable open models, it achieves competitive or superior results on several key benchmarks. This is our first SupraLabs Scaling Up Plan model.
🤗 Supra-50M-Base | Supra-50M-Instruct
| Benchmark | Supra-50M (ours) | GPT-2 (124M) | SmolLM-135M | OpenELM-270M |
|---|---|---|---|---|
| Parameters | 50M | 124M (2.5×) | 135M (2.7×) | 270M (5.4×) |
| BLiMP (linguistics) | 76.3% | 63.0% | 69.8% | N/A |
| SciQ (science) | 77.2% | 53.2% | 73.4% | 84.70% |
| ARC-Easy (knowledge) | 52.2% | 42.0% | 49.2% | 45.08% |
| PIQA (logic) | 62.2% | 63.0% | 67.3% | 69.75% |
| HellaSwag (context) | 31.8% | 29.5% | 42.0% | 46.71% |

| Hyperparameter | Value |
|---|---|
| Architecture | Llama (decoder-only transformer) |
| Parameters | ~50M |
| Vocab size | 32,000 |
| Hidden size | 512 |
| Intermediate size | 1,408 |
| Hidden layers | 12 |
| Attention heads | 8 |
| Key-value heads | 4 (GQA) |
| Max position embeddings | 1,024 |
| RoPE theta | 10,000 |
| Tied embeddings | Yes |
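The "~50M" figure can be sanity-checked from the hyperparameters above. This is a sketch assuming a standard Llama block (no biases, a three-matrix gated MLP, two RMSNorm weight vectors per layer plus a final norm); the function is illustrative and not taken from the SupraLabs code.

```python
def llama_param_count(vocab, hidden, intermediate, layers, heads, kv_heads, tied=True):
    """Parameter count for a Llama-style decoder under the assumptions above."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim                       # GQA shrinks K and V
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # q_proj, o_proj + k_proj, v_proj
    mlp = 3 * hidden * intermediate                    # gate, up, down projections
    norms = 2 * hidden                                 # two RMSNorm weights per layer
    embed = vocab * hidden
    lm_head = 0 if tied else vocab * hidden            # tied head reuses the embedding
    final_norm = hidden
    return embed + lm_head + layers * (attn + mlp + norms) + final_norm


total = llama_param_count(
    vocab=32_000, hidden=512, intermediate=1_408,
    layers=12, heads=8, kv_heads=4, tied=True,
)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions roughly a third of the parameters sit in the embedding table, which is typical for models this small.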

| Property | Value |
|---|---|
| Dataset | HuggingFaceFW/fineweb-edu (sample-100BT) |
| Total tokens | 20B |
| Sequence length | 1,024 tokens |
| Storage format | Memory-mapped binary (uint16, ~40 GB) |

Custom Byte-Level BPE tokenizer trained from scratch on 500,000 documents sampled from fineweb-edu (sample-10BT).
| Property | Value |
|---|---|
| Type | ByteLevelBPETokenizer |
| Vocabulary size | 32,000 |
| Min frequency | 2 |
| Special tokens | <s>, <pad>, </s>, <unk>, <mask> |

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Per-device batch size | 32 |
| Gradient accumulation steps | 4 |
| Effective batch size | 128 × 1,024 tokens |
| Learning rate | 6e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 2% |
| Optimizer | AdamW Fused (β1=0.9, β2=0.95) |
| Weight decay | 0.1 |
| Max grad norm | 1.0 |
| Precision | bfloat16 |
| torch.compile | Enabled |
| Hardware | Single GPU |
| Final loss | 3.259 |
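The batch settings above imply a rough optimizer step count for the single epoch. A back-of-the-envelope sketch, treating "20B" as exactly 20 billion tokens and the run as one GPU, as listed in the table:

```python
import math

per_device_batch = 32
grad_accum_steps = 4
num_gpus = 1                 # "Single GPU" in the table
seq_len = 1_024
total_tokens = 20_000_000_000  # "20B", treated as exact here

sequences_per_step = per_device_batch * grad_accum_steps * num_gpus
tokens_per_step = sequences_per_step * seq_len
optimizer_steps = math.ceil(total_tokens / tokens_per_step)
warmup_steps = round(0.02 * optimizer_steps)  # 2% warmup ratio

print(f"{sequences_per_step} sequences/step, {tokens_per_step:,} tokens/step")
print(f"~{optimizer_steps:,} optimizer steps, ~{warmup_steps:,} warmup steps")
```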
import os, warnings
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")

import torch
from transformers import pipeline, AutoTokenizer, logging
logging.set_verbosity_error()

MODEL_ID = "SupraLabs/Supra-50M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, clean_up_tokenization_spaces=False)
pipe = pipeline(
    "text-generation",
    model=MODEL_ID,
    tokenizer=tokenizer,
    device_map="auto",
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32
)

# Build the instruction prompt, with or without an extra input section.
def build_prompt(instruction, input_text=""):
    if input_text.strip():
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

# Sample a response and return only the newly generated text.
def generate(instruction, input_text=""):
    result = pipe(
        build_prompt(instruction, input_text),
        max_new_tokens=512, do_sample=True, temperature=0.7,
        top_k=50, top_p=0.9, repetition_penalty=1.15,
        pad_token_id=pipe.tokenizer.pad_token_id,
        eos_token_id=pipe.tokenizer.eos_token_id,
        return_full_text=False
    )
    return result[0]['generated_text'].strip()

# Simple interactive loop; type 'exit' to quit.
while True:
    print("\nEnter an instruction (or 'exit' to quit):")
    user_input = input().strip()
    if user_input.lower() == "exit":
        break
    print("\nEnter additional context (optional, press Enter to skip):")
    context_input = input().strip()
    print(f"\nResponse:\n{generate(user_input, context_input)}\n")

from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="SupraLabs/Supra-50M_BASE",
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

# Plain text continuation with the base model (the output includes the prompt).
def generate_text(prompt, max_new_tokens=150):
    result = pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True, temperature=0.5,
        top_k=25, top_p=0.9, repetition_penalty=1.2,
        pad_token_id=pipe.tokenizer.pad_token_id,
        eos_token_id=pipe.tokenizer.eos_token_id
    )
    return result[0]['generated_text']

prompt = "The importance of education is"
print(f"Prompt: {prompt}\n" + "-" * 40)
print("\nOutput:\n" + generate_text(prompt))
Prompt: "The main concept of physics is "
>
Prompt: "Artificial intelligence is "
>
Prompt: "Once upon a time, "
>
First model in the SupraLabs Scaling Up Plan. Feedback welcome!
Hi everyone,
I’m looking for advice on local AI setups. My goal is to have a local AI generate text documentation from my one-hour therapy sessions.
So far, I’ve experimented with Hermes Agent and tried Qwen 3.6 (27B & 35B) as well as Gemma 41B. My workflow involves transcribing audio with Whisper and then feeding the transcript to a local AI. This works fine with a cloud model, but I cannot use a cloud solution in production due to patient data and privacy concerns. I want to handle everything locally.
My main issue is that Qwen 3.6 struggles with German. It sometimes produces technically correct words that aren’t commonly used in natural German. Additionally, the text can sometimes feel very “AI-like,” whereas cloud models produce much more natural-sounding results. The second problem is that both models sometimes cannot distinguish what is important from what is not; cloud models handle this much better.
I’m wondering if there’s a targeted approach to make local models behave better—would fine-tuning help here? Has anyone managed to get this working in a meaningful way for structured German text documentation?
I’ve built a complex iterative skill setup, which works well with DeepSeek V4, but the local results are disappointing. I don’t understand why generating text documentation from one-hour therapy sessions locally seems so difficult, and I’d love to hear what has worked for others.
Thanks in advance!
Probably most of you are aware that using anything other than -ctk q8_0 -ctv q8_0 / -ctk q4_0 -ctv q4_0 as startup options for llama.cpp leads to prompt processing on cpu instead of gpu for cuda at least. E.g. when we use the frequently suggested mix of -ctk q8_0 -ctv q4_0 pps tanks.
I discussed this with a proprietary LLM, and it suggested either making some slight modifications to llama.cpp's CUDA source code or using cmake -DGGML_CUDA_FA_ALL_QUANTS=ON .. which takes a very long time to build.
But coincidentally, user sanmai on GitHub did a small eval and suggested including this KV cache quant combo during compilation, even without FA_ALL_QUANTS, so that would be great.
The discussion is here. It is worth a read, as the eval confirms that using the asymmetric 8/4-bit KV quant only costs 1.3% precision while saving more than half of the memory compared to f16/f16:
SETUP INFO: Amd R9700 AI PRO. Using llama-cpp server, ROCM docker version. Using the --ngl option to offload.
First of all, I'm greatly impressed by how llama-cpp server handles offloading. There's some fucking magic happening here, at least to me.
I have 32gb of VRAM so loading in the small models is no problem, but now I'm starting to experiment with models that spill into system RAM, testing tok/sec differences and various quants.
I'm currently testing Qwen3 Coder Next. At Q4_K_M, this thing weighs in at 45GB. I can make that one work, but the more offloading I do, the slower it is (obviously). So for now I'm testing the smaller 4-bit quant, IQ4_XS at 36GB, trying to find the middle ground before quality starts to suffer.
If I offload 36 layers, it fills my vram 30/32gb. Tok / sec is around 25, which for an MoE model is not great at all - at least I don't think it is. I tried the 3-bit quant which fits fully in memory, but after multiple quality issues, I gave up on it. I think for large models and coding, 3-bit is just too much compression, or at least it feels like it. (anyone else have this impression? or is it just me?)
Anyways - to my actual question - how the hell does llama-cpp do this magic? I am monitoring RAM usage and swap file and neither of them are very high, yet I only have 30gb loaded out of this model, including 120k unquantized KV cache context... It's basically impossible, so clearly I am missing something about how Kubuntu 24.04 manages system resources.
Is my KDE5 widget for RAM not capturing what llama-cpp is up to? I'd like to read up on how it works or if someone can explain it to my dumb ass, I'd greatly appreciate it lol.
EDIT: Offloading also has a nice bonus benefit of being QUIET. For anyone with a very loud GPU fan, it's a nice break. Yes it's slower but I can work on other tabs and windows while it processes and actually hear myself think. I might do more of this.
Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
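For context, here is a minimal sketch of the scalar, group-normalized advantage used in common GRPO formulations, which is the piece the abstract says VPO replaces. The abstract does not spell out VPO's own estimator, so none is shown. The pass/fail vectors are hypothetical, and averaging them into one scalar is just one simple way to collapse a vector reward.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage: each rollout's reward, standardized
    against the other rollouts sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical per-test-case pass/fail vectors for three sampled solutions.
test_results = [
    [1, 1, 0, 0],  # passes only the first two tests
    [0, 0, 1, 1],  # passes only the last two tests
    [1, 1, 1, 0],
]
scalar_rewards = [mean(v) for v in test_results]  # collapse vector -> scalar
advantages = grpo_advantages(scalar_rewards)
print(scalar_rewards)
print(advantages)
```

The first two solutions solve complementary test cases yet receive the same scalar reward and therefore the same advantage, which is the kind of information a vector-valued objective could keep.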