r/Vllm

LlamaStation v0.9 — llama.cpp GUI for Windows with multi-backend support, TurboQuant, MTP and more
▲ 18 r/Vllm+1 crossposts

GitHub: https://github.com/vico-png/llamastation

I've been building this for the past few months as a side project — started because I didn't want to run llama.cpp from the command line every time I wanted to try a model. I just wanted something that worked with a click.
Fair warning: I'm not a developer. This is 100% vibe coded with AI assistance. If something in the codebase makes you cringe, please be kind and open a PR instead 🙏
Most frontends either hide everything behind abstractions (Ollama, LM Studio) or leave you writing command lines manually. LlamaStation tries to sit in the middle: a clean UI with full access to every parameter.
What makes it different
Runs llama-server directly — no intermediate layer, no daemon, no abstraction. LlamaStation launches llama-server.exe as a subprocess with full control over every flag. What you configure is exactly what gets passed to the binary. This means you get the full performance of llama.cpp with none of the overhead that tools like Ollama add on top.
Multiple backends, switchable from the UI:

⚡ Official llama.cpp (with MTP support since PR #22673)
🔬 TurboQuant fork — asymmetric KV cache quantization. This is the killer feature for me: 200k+ context on 24GB VRAM (dual RTX 3060) with minimal quality loss
⚛️ AtomicChat — TurboQuant + MTP combined
🐝 BeeLlama — DFlash + TurboQuant (experimental)

Real-time VRAM meter per GPU — color coded, updates live as the model loads.
Per-model profiles — every setting remembered automatically per model file.
Voice mode — push-to-talk or always-listening, voice cloning via XTTS v2, speech recognition via faster-whisper. Fully offline.
Headless mode — run without GUI using saved profiles, for servers or automation.
Auto-updater — updates llama.cpp official (and checks AtomicChat releases) from inside the app.
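To illustrate the "what you configure is exactly what gets passed to the binary" idea, here is a minimal Python sketch of turning a saved per-model profile into a llama-server command line. This is not LlamaStation's actual code: the function name, the profile keys, and the model path are hypothetical. The flags themselves (`-m`, `--ctx-size`, `--n-gpu-layers`, `--host`, `--port`, `--cache-type-k`, `--cache-type-v`) are stock llama.cpp server options; fork-specific options such as TurboQuant cache types are not covered.

```python
def build_llama_server_cmd(profile, binary="llama-server.exe"):
    """Translate a saved profile (hypothetical dict layout) into an argv list."""
    cmd = [
        binary,
        "-m", profile["model_path"],
        "--ctx-size", str(profile["ctx_size"]),
        "--n-gpu-layers", str(profile["n_gpu_layers"]),
        "--host", profile.get("host", "127.0.0.1"),
        "--port", str(profile.get("port", 8080)),
    ]
    # Optional KV cache quantization, passed through only when the profile sets it.
    for key, flag in (("cache_type_k", "--cache-type-k"),
                      ("cache_type_v", "--cache-type-v")):
        if key in profile:
            cmd += [flag, profile[key]]
    # Any extra flags are appended untouched, so nothing is hidden from the user.
    cmd += profile.get("extra_args", [])
    return cmd


# Hypothetical profile for illustration only.
profile = {
    "model_path": "models/example-q4_k_m.gguf",
    "ctx_size": 32768,
    "n_gpu_layers": 99,
    "cache_type_k": "q8_0",
    "cache_type_v": "q8_0",
}
cmd = build_llama_server_cmd(profile)
print(" ".join(cmd))
```

Launching would then be a matter of handing `cmd` to `subprocess.Popen`, which keeps the GUI a thin layer over the binary.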

My setup for context
Dual RTX 3060 (24GB total), Ryzen 7 5700X, 32GB DDR4 3600MHz, Windows 11. Running Qwen3.6 27B Q4_K_M with TurboQuant KV cache and MTP — 177k context. Without MTP the same model starts at ~17 tok/s and drops to ~10 on long responses. With MTP it starts at ~29 tok/s and holds at ~22 even on long code generation. This is what I built LlamaStation for.

Status
v0.9 — it works well for my daily use. I've fully replaced other tools with it — I use it as the backend for coding agents, Telegram bots, voice assistants and other local automations. There's one known bug (server watchdog gets stuck in "restarting" state after OOM crash) and probably others I haven't hit yet. Opening it up to get feedback and contributions.
Not a programmer by trade — built this entirely with AI assistance. The codebase is a single main file by design, easy to read and modify.
Contributions very welcome — especially:

Linux/Mac port (currently Windows only)
Bug fixes
New backend integrations
UI improvements

GitHub — MIT license, no telemetry, no accounts.

u/Responsible_Egg9736 — 23 hours ago
▲ 9 r/Vllm

Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?

I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with:

docker run --gpus all \
--name qwen36-aggressive \
--restart unless-stopped \
-p 8000:8000 \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--shm-size=32g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e VLLM_HTTP_TIMEOUT_KEEP_ALIVE=600 \
vllm/vllm-openai:cu130-nightly \
--model Qwen/Qwen3.6-35B-A3B-FP8 \
--served-model-name qwen36 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.75 \
--dtype auto \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--max-num-batched-tokens 32768 \
--max-num-seqs 4 \
--attention-backend flashinfer \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--trust-remote-code \
--reasoning-parser qwen3 \
--performance-mode throughput \
--default-chat-template-kwargs '{"preserve_thinking":true}' \
--override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}' \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

It boots successfully and seems stable so far, but I’d love opinions from people running similar long-context / agentic setups.

Any feedback or suggestions are welcome.

u/povedaaqui — 2 days ago
▲ 4 r/Vllm

Qwen3.6-27B AWQ-INT4 on RTX 5090: KV cache FP8 at 24K context, and why low-temperature guided JSON loops on you

**TL;DR**: Running Qwen3.6-27B AWQ-INT4 on a single RTX 5090 (32 GB) for legal-claim extraction in a Go pipeline. Hit two non-obvious walls that cost me half a day: (1) BF16 KV cache caps you at 16K max-model-len, but FP8 KV gets you to 24K with the same VRAM footprint; (2) `temperature=0.2` under guided JSON schema triggers infinite repetition loops on this model — and the loop is not on text, it's on a numeric field generating a single integer with 5000+ digits. Sharing 42-run sampling benchmark, exact configs, and what actually works.

Posted to corroborate the [vLLM #40080 Gemma observation](https://github.com/vllm-project/vllm/issues/40080) and the [Qwen3.5 issue #145](https://github.com/QwenLM/Qwen3.6/issues/145) with concrete numbers on a Blackwell SM_120 setup.

---

## Hardware and stack

- GPU: NVIDIA RTX 5090, 32 GB VRAM, Blackwell SM_120
- CUDA 12.8, cuDNN 9.6
- vLLM 0.19.0 via `nvcr.io/nvidia/vllm:26.04-py3`
- llama-swap v216 orchestrating three model slots:
  - Vision: Qwen2-VL-7B-Instruct (16K context, BF16 KV, swap)
  - Reason: **Qwen3.6-27B AWQ-INT4** (this is the one I'm writing about)
  - Embed: BGE-M3 (resident, ~2.3 GB)
- Workload: legal-claim extraction of structured output via JSON Schema, ~5W1H decomposition per claim

The reasoning slot uses the [cyankiwi/Qwen3.6-27B-AWQ-INT4](https://huggingface.co/cyankiwi/Qwen3.6-27B-AWQ-INT4) build. Internal architecture is `Qwen3_5ForConditionalGeneration` (GDN hybrid + Mamba) — needs vLLM ≥ 0.17 to run at all.

---

## Wall #1: VRAM math for max-model-len on a single 32 GB card

Initial config: `--gpu-memory-utilization 0.85 --max-model-len 12288 --dtype auto`. 26.5 GB VRAM, working fine for short docs. But 37 % of my email corpus exceeds 8K tokens, and the chain-of-thought prompt I use needs ~8K output tokens for the scratchpad. So `12288 - 8192 = 4096` input budget, which overflows on most non-trivial emails.

Measured KV cache scaling with BF16:

| max-model-len | KV cache (BF16) | Weights + KV | Verdict on 32 GB |
|---|---|---|---|
| 12288 (start) | ~13 GB | 27 GB | ✓ comfortable margin |
| 16384 | ~17 GB | 31 GB | ⚠ 1 GB free, kills multi-slot co-tenancy |
| 24576 | ~26 GB | 40 GB | ✗ overflow |
| 32768 | ~35 GB | 49 GB | ✗ physically impossible |

The bench tool community on r/LocalLLaMA was telling me to "just bump to 32K", but that's not feasible at all on a 32 GB card without quantizing the KV cache. So I tried FP8 KV.

### FP8 KV cache changes the picture

Adding `--kv-cache-dtype fp8` halves the KV memory:

| max-model-len + FP8 KV | KV cache | Total | Fits in 32 GB |
|---|---|---|---|
| 16384 + FP8 | ~8.5 GB | 22.5 GB | ✓ huge margin |
| 24576 + FP8 | ~13 GB | 27 GB | ✓ same footprint as 12K BF16 start |
| 32768 + FP8 | ~17 GB | 31 GB | ⚠ tight |

Empirical measurement on the live server, after killing the container and warm-up:

| Config | VRAM steady-state | Cold start (warm cache) | Free VRAM |
|---|---|---|---|
| 12288 BF16 (start) | 26.5 GB | 96 s | 5.5 GB |
| 16384 BF16 | 28.0 GB | not retested | 4.0 GB |
| **24576 FP8 (chosen)** | **28.4 GB** | **131 s** (+35 s vs BF16) | **3.6 GB** |

Counterintuitive: 24K FP8 consumes nearly the same VRAM as 16K BF16, because vLLM pre-allocates the KV pool to `gpu-memory-utilization=0.85` regardless of effective dtype/length. You don't see VRAM savings on the gauge — you capitalize the saving in *input capacity*. Net gain: input budget moves from 4K → 16K tokens at `max_tokens=8192`.
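The budget arithmetic above can be sketched in a few lines of Python. This is a rough estimator, not a measurement: it assumes the KV cache grows linearly with `max-model-len` from the ~13 GB observed at 12288 with BF16, and that FP8 exactly halves it. Both assumptions only approximately match the tables (a hybrid architecture need not scale perfectly linearly). The function names are made up for illustration.

```python
# Figures from the tables: ~13 GB of BF16 KV cache at max-model-len 12288,
# and 27 GB for weights + KV at that setting, so the weights are about 14 GB.
KV_GB_AT_12288_BF16 = 13.0
WEIGHTS_GB = 27.0 - KV_GB_AT_12288_BF16
BYTES_PER_ELEM = {"bf16": 2, "fp8": 1}


def kv_cache_gb(max_model_len, kv_dtype="bf16"):
    """Linear extrapolation of KV cache size (an assumption, see lead-in)."""
    scale = max_model_len / 12288
    return KV_GB_AT_12288_BF16 * scale * BYTES_PER_ELEM[kv_dtype] / BYTES_PER_ELEM["bf16"]


def input_budget(max_model_len, max_tokens):
    """Tokens left for the prompt once the output reservation is subtracted."""
    return max_model_len - max_tokens


for length, dtype in [(12288, "bf16"), (16384, "bf16"), (24576, "fp8"), (32768, "fp8")]:
    kv = kv_cache_gb(length, dtype)
    print(f"{length:>6} {dtype:>4}: KV ~{kv:.1f} GB, total ~{WEIGHTS_GB + kv:.1f} GB, "
          f"input budget {input_budget(length, 8192)} tokens")
```

The useful number is the last column: at `max_tokens=8192`, moving from 12288 to 24576 raises the prompt budget from 4096 to 16384 tokens while the estimated footprint stays the same.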

FP8 KV quality cost on AWQ-INT4 weights: theoretical 2–3 % degradation, in practice noise-level on AWQ-INT4 (the 4-bit weight quantization dominates). Validated empirically — see end of post.

### Production llama-swap config for Reason slot

```yaml
qwen3.6-27b:
  cmd: >
    docker run --rm --name vllm-reason
    --gpus all --ipc=host
    -v /inference/models:/models
    -v vllm-cache:/root/.cache/vllm
    -p 127.0.0.1:8003:8000
    nvcr.io/nvidia/vllm:26.04-py3
    vllm serve /models/qwen3.6-27b-awq-int4
    --served-model-name qwen3.6-27b
    --gpu-memory-utilization 0.85
    --max-model-len 24576 --kv-cache-dtype fp8
    --max-num-seqs 4
  ttl: 300
```

---

## Wall #2: guided JSON + low temperature = infinite repetition

First smoke test of the pipeline with `max-model-len 24576` plus the corresponding client-side `MaxTokens: 8192`: one document (`04546`, a short 953-char .md) generated **68 claims, of which 67 had `text=""` and identical `char_start=107, char_end=238`**. Pure loop fail mode.

Initial hypothesis: model-level repetition bias. Looked at the literature:

- vLLM bug [#40080 (Gemma)](https://github.com/vllm-project/vllm/issues/40080): "When grammar restricts the token space to valid JSON tokens, the model's slight repetition bias becomes a strong loop because the grammar prevents the model from generating an EOS or breaking out of the pattern."

- [Qwen3.5/3.6 issue #145](https://github.com/QwenLM/Qwen3.6/issues/145): official sampling recommendation, **explicitly states "greedy decoding should not be used as it can lead to performance degradation and endless repetitions."** The pipeline was running at `T=0.2`, which is quasi-greedy.

So the bug is exactly what the vLLM ticket describes: the model has a baseline repetition tendency, guided JSON masks every token outside the schema, model can't emit EOS in the middle of an array, so it fills the array with whatever fits. On this corpus, sometimes that's `text=""` repeated, sometimes (as I found later in benchmarking) it's a single `char_start` integer with 5000+ digits.

### Bench protocol

7 sampling configs × 3 prototype documents (short, medium, complex) × 2 runs each = 42 calls against the live `:8156/v1/chat/completions` proxy (which forwards to llama-swap → vLLM Reason). Same JSON Schema, same prompt, same `max_tokens=8192`. Configs:

| Label | Sampling params |
|---|---|
| baseline_T02 | T=0.2 |
| hardened_T02 | T=0.2 + schema `minLength=1` on text + `maxItems=30` on claims |
| qwen_instruct | T=0.7, top_p=0.8, top_k=20, presence_penalty=1.5 (official Qwen instruct mode) |
| qwen_reasoning | T=0.6, top_p=0.95, top_k=20, presence_penalty=0.0 (official Qwen reasoning mode) |
| intermediate_T04 | T=0.4, top_p=0.9, presence_penalty=0.3 |
| reppen_only | T=0.2, repetition_penalty=1.1 |
| conservative_T03 | T=0.3, top_p=0.9, presence_penalty=0.5 |
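For anyone rebuilding the protocol, the run matrix can be enumerated as below. This is a sketch of the bookkeeping only, not the author's bench script: the sampling values are copied from the table above, the document IDs are the three that appear in the results, and the variable names are made up.

```python
from itertools import product

# Sampling parameters per label, copied from the table above.
CONFIGS = {
    "baseline_T02": {"temperature": 0.2},
    "hardened_T02": {"temperature": 0.2},  # same sampling, stricter schema
    "qwen_instruct": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 1.5},
    "qwen_reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "intermediate_T04": {"temperature": 0.4, "top_p": 0.9, "presence_penalty": 0.3},
    "reppen_only": {"temperature": 0.2, "repetition_penalty": 1.1},
    "conservative_T03": {"temperature": 0.3, "top_p": 0.9, "presence_penalty": 0.5},
}
DOCS = ["04546", "04547", "19958"]
RUNS_PER_CELL = 2

# One entry per API call: (config label, document id, run index).
matrix = list(product(CONFIGS, DOCS, range(1, RUNS_PER_CELL + 1)))
print(len(matrix), "calls")
for label, doc, run in matrix[:3]:
    print(label, doc, run, CONFIGS[label])
```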

### Bench results

Unique claims persisted per run, two runs per cell:

| Config | doc 04546 | doc 04547 | doc 19958 | Avg total | Loop fails |
|---|---|---|---|---|---|
| baseline_T02 | 11 / 12 | 5 / 3 | 11 / 11 | 26.5 | 0 |
| hardened_T02 | 11 / **FAIL** | 2 / 5 | 11 / 10 | 25.0 | **1** |
| qwen_instruct | 10 / 8 | 4 / 2 | 10 / 10 | 22.0 | 0 |
| **qwen_reasoning** | **11 / 12** | **6 / 4** | **10 / 18** | **30.5** | **0** |
| intermediate_T04 | 9 / 12 | 4 / 4 | 13 / 11 | 26.5 | 0 |
| reppen_only | 6 / 6 | 3 / 5 | 10 / 10 | 20.0 | 0 |
| conservative_T03 | **FAIL** / 12 | 6 / 3 | 9 / 9 | 25.5 | **1** |

Aggregate: 42 runs, 40 successes, **2 loop failures**. Both fails were on document 04546 (the short one), both at `T ≤ 0.3`. Failure mode confirmed by Python `int()` overflow: model emitted a 5000+ digit integer in a `char_start` or `char_end` field — pure numeric loop, not a text loop. A more permissive parser (which is what I had in Go originally) would silently truncate and accept garbage.
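A strict parser for the offset fields makes this failure loud instead of silent. The sketch below is Python rather than the Go used in the pipeline, and the function name is hypothetical; the upper bound of 100000 is the one used in the schema later in this post. Note that recent Python versions (3.11 and later, by default) already refuse to convert digit strings longer than 4300 characters, which is likely the `int()` error mentioned above; the length check here rejects the runaway value much earlier.

```python
MAX_OFFSET = 100_000                 # same bound as the schema's "maximum"
MAX_DIGITS = len(str(MAX_OFFSET))    # 6 digits is the longest valid value


def parse_offset(raw):
    """Parse a char_start / char_end value, rejecting anything out of range."""
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a non-negative integer: {text[:20]!r}")
    if len(text) > MAX_DIGITS:
        # A numeric loop shows up here as thousands of digits; never call int() on it.
        raise ValueError(f"offset has {len(text)} digits, limit is {MAX_DIGITS}")
    value = int(text)
    if value > MAX_OFFSET:
        raise ValueError(f"offset {value} exceeds {MAX_OFFSET}")
    return value


print(parse_offset("107"))
try:
    parse_offset("9" * 5000)
except ValueError as err:
    print("rejected:", err)
```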

Average successful run latency: 63.8 s. Range 23.8–111.4 s on this prompt size (~6 KB system + 1 KB user).

### Findings

  1. **`qwen_reasoning` is the winner**: +15 % unique claim coverage over baseline, zero loop fails on the pathological doc, conforms to official Qwen3.6 recommendation. Higher variance on complex docs (19958: 10 vs 18 unique claims between runs) — to absorb with defensive dedup on the consumer side.

  2. **`T=0.2` (quasi-greedy) is the actual bug source.** On the pathological doc, 2 of the 8 runs at T ≤ 0.3 looped (25 %; that is 2 of 14, about 14 %, across all configs), versus 0 of 6 at T ≥ 0.4. The official Qwen advice is empirically correct.

  3. **`repetition_penalty=1.1` strangles** — −25 % coverage. Not the right knob for structured generation.

  4. **`presence_penalty=1.5`** (official Qwen instruct mode value) is meant for short conversational replies, not multi-page JSON. Strangles too (−17 %).

  5. **`frequency_penalty=0.5`** (a desperate fix I tried earlier in the day) is catastrophic on structured output — −77 % coverage measured in production smoke. Avoid.

  6. **Schema hardening (`minLength=1` on text, `minimum/maximum` on integer fields, `maxItems`) is complementary**, not a replacement for sampling fix. Hardened schema still failed once at T=0.2 — the loop just shifted to another field (numeric instead of text).
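The coverage percentages quoted in the findings follow directly from the "Avg total" column of the results table. A short Python check (the variable and function names are made up for illustration):

```python
# "Avg total" column from the bench results table.
AVG_TOTAL = {
    "baseline_T02": 26.5,
    "hardened_T02": 25.0,
    "qwen_instruct": 22.0,
    "qwen_reasoning": 30.5,
    "intermediate_T04": 26.5,
    "reppen_only": 20.0,
    "conservative_T03": 25.5,
}


def delta_vs_baseline(label):
    """Relative change in unique-claim coverage versus baseline_T02, in percent."""
    base = AVG_TOTAL["baseline_T02"]
    return (AVG_TOTAL[label] - base) / base * 100


for label in AVG_TOTAL:
    print(f"{label:18s} {delta_vs_baseline(label):+6.1f} %")
```

With only two runs per cell, these deltas are indicative rather than statistically tight.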

### Final production config

Three coordinated changes, none of them sufficient alone:

**Server (vLLM)** — already shown above, the `24576 FP8` config.

**Client sampling** (Go pipeline payload):

```json
{
  "model": "qwen3.6-27b",
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "presence_penalty": 0.0,
  "max_tokens": 8192,
  "response_format": {"type": "json_schema", "json_schema": {...}}
}
```

**Client schema** (in addition to the domain fields):

```json
{
  "type": "object",
  "properties": {
    "claims": {
      "type": "array",
      "maxItems": 30,
      "items": {
        "properties": {
          "text": {"type": "string", "minLength": 1},
          "char_start": {"type": ["integer", "null"], "minimum": 0, "maximum": 100000},
          "char_end": {"type": ["integer", "null"], "minimum": 0, "maximum": 100000}
        }
      }
    }
  },
  "required": ["claims"]
}
```

**Client post-LLM**: defensive dedup on `(lowercased_stripped_text, char_start, char_end)` before INSERT, with a `needs_review` flag when `unique_count / total_count < 0.5` or `total > 30`. Catches the residual variance.
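The post-LLM guard could look roughly like this. It is a Python sketch, not the Go code the pipeline actually runs; the function name `dedup_claims` and its parameter names are invented for illustration, while the dedup key and both thresholds come from the paragraph above.

```python
def dedup_claims(claims, max_items=30, min_unique_ratio=0.5):
    """Drop duplicate claims and decide whether the document needs review."""
    seen = set()
    unique = []
    for claim in claims:
        key = (
            (claim.get("text") or "").strip().lower(),
            claim.get("char_start"),
            claim.get("char_end"),
        )
        if key in seen:
            continue
        seen.add(key)
        unique.append(claim)

    total = len(claims)
    # Flag when most of the output was repetition, or when the array is oversized.
    low_ratio = total > 0 and len(unique) / total < min_unique_ratio
    needs_review = low_ratio or total > max_items
    return unique, needs_review


# Shape of the smoke-test failure: 1 real claim plus 67 identical empty ones.
looped = [{"text": "A real claim", "char_start": 0, "char_end": 12}]
looped += [{"text": "", "char_start": 107, "char_end": 238}] * 67
unique, needs_review = dedup_claims(looped)
print(len(unique), needs_review)
```

On that 68-claim output the guard keeps two rows and raises the flag, so the document is reviewed rather than inserted as-is.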

### Cost on the full run

Estimated for 1402 .md files:

| Metric | Baseline (T=0.2) | qwen_reasoning |
|---|---|---|
| Avg claims latency per doc | 35–80 s | 60–110 s (+30 %) |
| Unique claims per doc | n | n × 1.15 |
| Loop-failed docs | ~2–5 % expected | 0 measured in 42 runs |
| Docs flagged `needs_review` | n/a | est. 5–15 / 1402 |

---

## What I'd hammer if anyone is doing the same setup

  1. **Don't trust `T=0.2` for any non-trivial JSON-schema-constrained generation on Qwen3 family.** The official Qwen team flagged it, the vLLM Gemma ticket confirms it's a grammar+repetition interaction, my 42-run bench reproduces it. Use T=0.6 minimum.

  2. **Don't use `repetition_penalty` or `frequency_penalty` to fight JSON loops** — they punish lexical variation in legitimate paraphrases. Wrong knob.

  3. **Schema fields that accept integers need bounded ranges.** A `char_start: integer` without `maximum` is an invitation to a numeric loop.

  4. **FP8 KV cache is the single best knob to push context length on a 32 GB consumer card.** Same VRAM footprint, ~2x effective context. Quality impact is negligible on top of an already-INT4-quantized model.

  5. **Always log `usage.completion_tokens`** when calling `/v1/chat/completions` with structured output — if your call routinely hits the max, you've got a silent failure mode.

  6. **Cold start on Qwen3.6-27B AWQ-INT4** with the `torch.compile` cache persisted to a Docker volume: ~96 s BF16, ~131 s with FP8 KV (extra calibration step). Without persisted cache: 141 s. Worth the volume mount.
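For point 5, a small check on each response is enough to surface the silent failure. The sketch below assumes the standard OpenAI-compatible chat completion response shape (`usage.completion_tokens` and `choices[0].finish_reason`); the function name and the sample response are made up for illustration.

```python
def hit_token_cap(response, max_tokens):
    """Return True when a completion was cut off by the max_tokens limit."""
    usage = response.get("usage", {})
    completion_tokens = usage.get("completion_tokens", 0)
    finish_reason = response["choices"][0].get("finish_reason")
    capped = finish_reason == "length" or completion_tokens >= max_tokens
    # Log every call so that a rising share of capped responses is visible.
    print(f"completion_tokens={completion_tokens} finish_reason={finish_reason} capped={capped}")
    return capped


# Illustrative response, trimmed to the fields the check reads.
sample = {
    "choices": [{"finish_reason": "length", "message": {"content": "{\"claims\": ["}}],
    "usage": {"prompt_tokens": 2100, "completion_tokens": 8192},
}
print(hit_token_cap(sample, max_tokens=8192))
```

A capped response under a JSON schema is almost certainly truncated mid-structure, so it is safer to retry or flag it than to parse it.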

### Reproducibility

42-run bench script, results JSON, and exact prompt assets are kept on the server side under `/tmp/claim_bench/`. Happy to share if anyone wants to repro on their own Qwen3.6 quant variant — I expect the loop behavior to generalize across AWQ-INT4 / NVFP4 / GGUF, since the root cause is the model-level repetition bias × grammar masking, not the quantization.

If anyone has a clean explanation for why the loop on `char_start` produces a *single* 5000-digit integer rather than a stream of normal integers, I'd love to hear it. My hypothesis is that once the model commits to a digit token after `"char_start": `, the only grammar-valid next tokens are more digits or `,` / `}` — and if the digit-token transition probability beats the closing-token probability, it never closes.
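A toy model of that hypothesis, in Python. It assumes a two-way choice at every step (another digit versus a closing token) with a constant, made-up probability, which a real model does not have; it only shows how strongly low temperature amplifies a modest preference for "one more digit". Temperature scaling here is the usual softmax rule, where each probability is raised to the power 1/T and renormalized.

```python
def digit_prob_at_temperature(p_digit, temperature):
    """Probability of emitting another digit after temperature scaling."""
    p_close = 1.0 - p_digit
    a = p_digit ** (1.0 / temperature)
    b = p_close ** (1.0 / temperature)
    return a / (a + b)


def expected_digits(p_digit, temperature):
    """Mean length of the integer if each step continues with the scaled probability."""
    q = digit_prob_at_temperature(p_digit, temperature)
    return 1.0 / (1.0 - q)


P_DIGIT = 0.9  # hypothetical preference for another digit at T=1
for t in (1.0, 0.6, 0.4, 0.2):
    print(f"T={t}: P(digit)={digit_prob_at_temperature(P_DIGIT, t):.6f}, "
          f"expected digits ~{expected_digits(P_DIGIT, t):.0f}")
```

Under these assumptions the expected run length grows by orders of magnitude between T=0.6 and T=0.2, and pure greedy decoding never closes the number at all, which would be consistent with one enormous integer rather than many normal ones.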

---

## References

- Qwen3.5/3.6 sampling recommendations: [QwenLM/Qwen3.6 issue #145](https://github.com/QwenLM/Qwen3.6/issues/145)

- Grammar-amplified repetition (vLLM): [vllm-project/vllm issue #40080](https://github.com/vllm-project/vllm/issues/40080)

- Empty-array bug under guided JSON: [vllm-project/vllm issue #13821](https://github.com/vllm-project/vllm/issues/13821)

- vLLM Quantized KV Cache doc: [docs.vllm.ai — quantized_kvcache](https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/)

- vLLM Structured Outputs: [docs.vllm.ai — structured_outputs](https://docs.vllm.ai/en/v0.8.2/features/structured_outputs.html)

- Qwen3 official model card with sampling guidance: [Qwen/Qwen3-0.6B on HF](https://huggingface.co/Qwen/Qwen3-0.6B)

Setup date: 2026-05-19. Environment: `nvcr.io/nvidia/vllm:26.04-py3` (vLLM 0.19.0), Blackwell SM_120, RTX 5090 32 GB.

u/hazyhaar — 3 days ago
▲ 9 r/Vllm+1 crossposts

5060ti chads -> gemma-4-31b-it-nvfp4 + vllm + mtp

Hey all,

While nvfp4 still seems to be a work in progress, the latest version of vllm 0.21 finally has mtp working for gemma. With all the talk of qwen being badass I thought I would revisit gemma.

Here is my working set up in a venv with uv:

cuda 13.1 && nvidia driver 590.48.01 (driver 595 and ubuntu 26.04 had difficulty finding all the cards and would only show 3/4 for some reason)

Environment="CUDA_HOME=/usr/local/cuda"

Environment="LD_LIBRARY_PATH=/usr/local/cuda/lib64"

Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"

Environment="VLLM_SKIP_P2P_CHECK=1"

vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \

--kv-cache-dtype fp8 \

--tensor-parallel-size 4 \

--max-num-seqs 2 \

--max-model-len auto \

--enable-auto-tool-choice \

--tool-call-parser gemma4 \

--chat-template examples/tool_chat_template_gemma4.jinja \

--language-model-only \

--reasoning-parser gemma4 \

--speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}' \

--port 9999

Now, I got this off of the vllm recipes website with some caveats. In the speculative config, the recipe website does not list "method":"mtp" as being needed but the github documentation does say it is needed. It also seems that either will work and there is a closed issue with current comments about mtp and gemma documentation being inconsistent.

I have some environment variables set. This is because on ubuntu 24.04 there is a mismatch between the cuda version it ships with and the one I installed, so you need to declare it. I am also skipping the p2p check for right now, since I didn't go through the trouble of installing it and skipping it gives a slight speedup in boot.

Other issues. The kv cache is at fp8, I tried changing it but it crashes at start. This is from the recipe and I guess it might be in the model card or something. Probably something I have been too lazy to look into. Right now it is working well.

Unlike tool calling with qwen, gemma seems to do okay with mtp of 4 tokens (instead of 2, at least for me). You will also need to put the chat template in a template folder, see the vllm recipe website. I gave up on mistral-vibe after like 2 minutes of using it. There is an issue on their github (mistral-vibe) about tool calling problems with vllm. I switched over to pi dev and it is so much faster that I probably won't go back.

Overall I am able to reach ~60 t/s on generation with this setup as a single user. Random generation is around 40 t/s and there are bursts up to 90 t/s sometimes, but these are just bursts.

I have my concurrency at 2, but this is because my wife sometimes uses it through openwebui and she never uses a lot of context. With the current settings, vllm reports that I can load around 470k tokens of context, or around 1.85x concurrency at full length. For me and my setup this is fine. You may need more vram, and you probably won't use a 5060ti setup anyway if you have a company with a lot of users.

While nvfp4 support is not all ironed out, it seems to be doing okay right now with the latest vllm. Have fun.

u/see_spot_ruminate — 5 days ago
▲ 7 r/Vllm

Request for vllm settings / setup for using with Claude Code on 16 GB VRAM + 32 GB RAM

Wondering if anyone here can help me out with any settings / starting points / a reality check.

My goal is to use Claude Code for some hobby apps primarily using local LLMs on my 3080 16gb with 32 GB RAM & Windows 11. Does anyone have a similar setup working relatively smoothly with tool calling on similar specs?

----

I got CC working with Ollama easily, but it's very slow. I was told vllm might work better, and I managed to get a vllm + LiteLLM setup running, but I'm struggling to get it to work with tool calls without being even slower than Ollama. It's OK-ish without the tool calling, but that doesn't work for what I want to do. I feel like there might be some settings tweaking I can do to get it to work, but I've tried a bunch of things and no joy yet. I don't have a lot more time to stumble through setup stuff, which is why I'm reaching out.

I know my machine will never be fast, and I'll struggle to run even a model above about 12B (though Ollama seems to handle it). I'm not concerned about that. I just need it to be a little faster than 2+ hours to create CRUD stored procedures for 4 tables. Most of what I want can probably be handled by a basic < 12B model.

I'm cool with moving away from Claude Code for something lighter weight (maybe Pi?), but I need something with similar code management, tool management, and execution capabilities. I also don't have much time to play around with setup or building out capabilities or custom guides, skills, personalities, etc. to get to relatively basic functionality of managing the LLM and implementing what it suggests.

u/darkeagle03 — 5 days ago
▲ 84 r/Vllm+2 crossposts

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results

Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset.

Setup:

  • Hardware: 1x H100 80GB
  • Runtime: vLLM
  • Dataset: SPEED-Bench qualitative
  • Prompts: 880 total, 80 prompts across each of 11 categories
  • Models: google/gemma-4-31B-it and google/gemma-4-26B-A4B-it
  • MTP drafts: Google's matching Gemma 4 assistant models
  • DFlash drafts: z-lab's matching Gemma 4 DFlash models
  • MTP used num_speculative_tokens=8
  • DFlash used num_speculative_tokens=15
  • Context length / max model length: 32768
  • Temperature: 0
  • Prefix caching was disabled

Results:

  • For Gemma 4 31B dense, MTP was 3.11x faster and DFlash was 3.03x faster than baseline decoding at concurrency 1. Baseline hit 40.3 output tok/s, MTP hit 125.3 output tok/s, and DFlash hit 122.1 output tok/s. At concurrency 16, baseline reached 375 tok/s, MTP reached 953 tok/s, and DFlash reached 725 tok/s.

https://preview.redd.it/4zyyt58j7p0h1.png?width=2571&format=png&auto=webp&s=930d3a8383fb7fe40749217867f4f3ab9877b4a4

  • For Gemma 4 26B-A4B MoE, the result flipped. DFlash was 1.73x faster and MTP was 1.49x faster than baseline decoding at concurrency 1. Baseline hit 177.1 output tok/s, MTP hit 264.2 output tok/s, and DFlash hit 306.4 output tok/s. At concurrency 16, baseline reached 975 tok/s, MTP reached 1808 tok/s, and DFlash reached 1957 tok/s.

  • The MoE speedups were smaller than the dense-model speedups because the baseline MoE target is already relatively cheap to run. Gemma 4 26B-A4B has 25.2B total parameters, but only 3.8B active parameters during inference. That means speculative decoding has less target-model compute to remove compared with the dense 31B model.

https://preview.redd.it/twdqm7pk7p0h1.png?width=2596&format=png&auto=webp&s=71b388e143bd384fec08e299b3996ba8337e42f8

  • The gains were not uniform across workloads. Coding, math, STEM, and reasoning benefited more because these tasks often have more predictable token patterns. Writing, summarization, and roleplay improved less because there are many valid ways for the model to continue the text.

  • Higher per-position acceptance did not automatically mean higher throughput. MTP accepted more draft tokens, but DFlash showed better throughput on the MoE model. Acceptance is only one side of it. DFlash drafts the whole block in a single forward pass, while MTP drafts token by token. When the target is this fast, the cheaper draft path can matter more even with lower acceptance.

  • Most accepted draft tokens came from the first few positions. Position-1 acceptance was around 80% for MTP and 75% for DFlash, but by position 8 it dropped to under 20% for both.

https://preview.redd.it/di8n1c3m7p0h1.png?width=2615&format=png&auto=webp&s=e769d24d5ae9ad4722270437eef1f26a998ac6e8

For a real deployment, try both approaches on your own setup and workload instead of assuming one will always be better. The results can change with the model, prompts, hardware, and serving configuration. Hope these numbers give people a useful reference point.
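The speedup ratios quoted above can be recomputed from the reported throughput numbers, including the concurrency-16 ratios the post does not spell out. A short Python sketch (the dictionary layout and function name are made up; the tok/s values are the ones in the results):

```python
# Output tok/s as reported in the results, keyed by (model, concurrency).
TOK_S = {
    ("31B dense", 1): {"baseline": 40.3, "mtp": 125.3, "dflash": 122.1},
    ("31B dense", 16): {"baseline": 375.0, "mtp": 953.0, "dflash": 725.0},
    ("26B-A4B MoE", 1): {"baseline": 177.1, "mtp": 264.2, "dflash": 306.4},
    ("26B-A4B MoE", 16): {"baseline": 975.0, "mtp": 1808.0, "dflash": 1957.0},
}


def speedup(model, concurrency, method):
    """Throughput of a speculative method relative to baseline decoding."""
    row = TOK_S[(model, concurrency)]
    return row[method] / row["baseline"]


for (model, concurrency), row in TOK_S.items():
    ratios = ", ".join(f"{m} {speedup(model, concurrency, m):.2f}x" for m in ("mtp", "dflash"))
    print(f"{model:12s} c={concurrency:<2d} {ratios}")
```

Running the same calculation on your own measurements is an easy way to compare the two draft methods on a different workload.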

All the benchmark setup and scripts used for benchmarking and to reproduce these results are in the Github repository.

You can read about more results and in-depth analysis in our blog: https://jarvislabs.ai/blog/gemma-4-mtp-vs-dflash-benchmark

u/LayerHot — 10 days ago
▲ 34 r/Vllm+1 crossposts

Qwen3.6-27B 8bit DFLASH performance vs num_speculative_tokens

I'm running Qwen3.6-27B 8bit on my RTX PRO 6000 Blackwell workstation edition and I was trying to figure out the optimal setting for `num_speculative_tokens` while using DFLASH. So I decided to run some benchmarks where I varied `num_speculative_tokens` from 1 to 20 to find the optimal value. Hopefully it's helpful to you guys!

Here are the results in text format:

🏆 FINAL RESULTS

| k | Avg tok/s | ±std | Best? |
|---|---|---|---|
| 1 | 67.4 | ±0.1 | |
| 2 | 88.8 | ±0.1 | |
| 3 | 102.5 | ±0.8 | |
| 4 | 116.1 | ±0.1 | |
| 5 | 124.7 | ±0.1 | |
| 6 | 127.6 | ±0.1 | |
| 7 | 126.6 | ±0.1 | |
| 8 | 133.8 | ±0.1 | |
| 9 | 126.8 | ±0.4 | |
| 10 | 136.8 | ±0.1 | |
| 11 | 140.0 | ±0.3 | ← BEST |
| 12 | 132.5 | ±0.2 | |
| 13 | 137.8 | ±0.1 | |
| 14 | 135.0 | ±3.9 | |
| 15 | 136.7 | ±1.3 | |
| 16 | 132.2 | ±0.2 | |
| 17 | 129.8 | ±0.1 | |
| 18 | 123.4 | ±0.1 | |
| 19 | 123.8 | ±0.4 | |
| 20 | 125.0 | ±0.1 | |

🎯 Recommended: k = 11 (140.0 tok/s)
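For anyone repeating the sweep, picking and printing the best `k` from the collected averages takes only a few lines. This is a sketch assuming a Python harness, not the author's script; the numbers are the ones reported above and the variable names are made up.

```python
# k -> (average tok/s, standard deviation), as reported in the results.
RESULTS = {
    1: (67.4, 0.1), 2: (88.8, 0.1), 3: (102.5, 0.8), 4: (116.1, 0.1),
    5: (124.7, 0.1), 6: (127.6, 0.1), 7: (126.6, 0.1), 8: (133.8, 0.1),
    9: (126.8, 0.4), 10: (136.8, 0.1), 11: (140.0, 0.3), 12: (132.5, 0.2),
    13: (137.8, 0.1), 14: (135.0, 3.9), 15: (136.7, 1.3), 16: (132.2, 0.2),
    17: (129.8, 0.1), 18: (123.4, 0.1), 19: (123.8, 0.4), 20: (125.0, 0.1),
}

best_k = max(RESULTS, key=lambda k: RESULTS[k][0])

print(f"{'k':>3} | {'Avg tok/s':>10} | {'±std':>6} | Best?")
for k, (avg, std) in RESULTS.items():
    marker = "← BEST" if k == best_k else ""
    print(f"{k:>3} | {avg:>10.1f} | ±{std:>5.1f} | {marker}")
print(f"Recommended: k = {best_k} ({RESULTS[best_k][0]:.1f} tok/s)")
```

Since k = 10, 13, and 15 land within a few tok/s of the peak, and k = 14 shows a much larger spread, the curve is fairly flat around the optimum.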

Here's my vLLM setup:

  qwen-vllm: # ← Qwen3.6-27B via vLLM (OpenAI-compatible API)
    image: vllm/vllm-openai:latest
    container_name: qwen-vllm
    ipc: host
    shm_size: 32g                    # Critical for large context + Qwen3.6 performance
    ports:
      - "8000:8000"                  # OpenAI-compatible endpoint (http://localhost:8000/v1)
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface   # Persists the ~55 GB model download
    environment:
      - HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      - HF_HUB_ENABLE_HF_TRANSFER=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all             # ← Change to 1 if you only want to use a single GPU
              capabilities: [ gpu ]
    command: >
      --model Qwen/Qwen3.6-27B-FP8
      --served-model-name qwen3.6-27b
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.90
      --max-model-len 262144
      --kv-cache-dtype auto
      --attention-backend flash_attn
      --max-num-batched-tokens 16384
      --max-num-seqs 24
      --trust-remote-code
      --enable-prefix-caching
      --enable-chunked-prefill
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 11}'
      -O3
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - hermes-net
u/dxplq876 — 11 days ago
▲ 3 r/Vllm

PLX 88096 - Opinions.

Does anyone use PLX 88096 or something similar?
If anyone has something similar, could they tell me what the tokens/s would look like using a PLX 88096 + five RTX 5060Ti 16GB with qwen3.6-35b-a3b?

I currently have four RTX 5060Ti cards in an MZ32-AR0 Ver3.0 motherboard. I currently use it with qwen3.6-27b, but I'd like to add five more to use with qwen3.6-35b-a3b and mistral-nemo-instruct-2407.

I actually wanted to assemble two PLX systems, each with 4-5 RTX 5060 Ti cards, so I would have one model in each PLX system.

However, I didn't find much information about performance, such as how it would be using PLX, and if token generation would be too slow.

If anyone could shed some light on how the performance would be affected, I would be very grateful.

u/offzinho3k — 10 days ago
▲ 18 r/Vllm

vLLM on Arc B70

Does anyone have that card? I am interested given its price and the available memory. I am aware that its speed wouldn't be comparable with the Nvidia competitor (the cheapest 32GB card should be the RTX PRO 4500, at roughly 3 times the price).

If anyone has it, can you share some benchmarks? Which quantization dtypes are supported by that card? What's the experience in general in terms of features? Is everything so experimental that the chances of things not working are high?

u/-elmuz- — 14 days ago
▲ 10 r/Vllm

Help! VLLM makes my PC shutdown

Hello everybody! I need some help. I started using vLLM not long ago to squeeze out more performance, and for some reason, after a few text generations, my computer shuts down as if it had hit overcurrent protection, regardless of the model. With llama.cpp I have no problem.

Is it just that my PSU doesn't have enough power?

If it is the PSU, which one do you recommend?

Rig:

Ryzen 3900x

4 A4000 power limited to 100w each and clocks lowered

PSU Antec Signature 1000w

Ubuntu 22.04, VLLM 0.20.1 (but with previous versions it's the same)

UPDATE: I put another 650W PSU in tandem using the OC Link cable, connected to one GPU and so far so good, so yes, looks like it was transients killing my PSU ;-;

u/Impressive-Gain-1061 — 12 days ago