r/Vllm

▲ 16 r/Vllm+1 crossposts

Qwen3.6 vs 3.5 on DGX Spark: identical throughput, except with one flag flipped

Same config, same benchmark: Qwen3.6 performs within ±1% of Qwen3.5 across all four scenarios. Enable MTP and the 16-concurrent stress test jumps +24% throughput with −57% TTFT — exactly where speculative decoding is supposed to lose. Hypothesis: memory-bandwidth-bound serving on GB10 unified memory leaves compute headroom that MTP verification fills. 72.5% global acceptance rate, full breakdown by workload type.
https://docai.hu/en/blog/qwen36-mtp-gb10

u/Ok-Simple459 — 1 day ago

▲ 3 r/Vllm+1 crossposts

Performance issue: Low token generation (~20 tok/s vs 50 tok/s) on Radeon AI PRO R9700 (gfx1201) with vLLM ROCm & Gemma 4-26B

I’m testing cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit on a single AMD Radeon AI PRO R9700 (32 GB, gfx1201) using vLLM ROCm.

Current result: about 19–20 generated tok/s after warmup, with a short single-user decode benchmark. I expected something closer to 50 tok/s based on a published R9700 result for this model.

My current vLLM setup:

- vLLM: 0.22.1rc1.dev499+g470229c37.d20260613

- model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit

- TP: 1

- max-num-seqs: 1

- max-model-len: 8192

- gpu-memory-utilization: 0.90

- kv-cache-dtype: fp8

- attention backend: TRITON_ATTN

- --enforce-eager

- --language-model-only

I see this warning on startup:

> Using default MoE config. Performance might be sub-optimal!

> Config file not found: E=128,N=704,device_name=AMD-gfx1201,dtype=int4_w4a16.json

CPU is a Ryzen 5 3600 with 32 GB DDR4-2666 (Upgrading to 900 series soon), but GPU memory is not full and RAM has ~22 GB available.

Is there an existing or recommended tuned MoE config for gfx1201, E=128, N=704, and int4_w4a16?
Is this model known to be slow on R9700 with Triton W4A16 / current ROCm vLLM?
What exact vLLM/ROCm image, flags, and benchmark method produced ~50 tok/s?
Is --enforce-eager costing meaningful decode performance on gfx1201, or is the missing MoE config the main issue?
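
On the benchmark-method question, one way to keep numbers comparable is to separate prefill time from steady-state decode. The following is a minimal sketch, assuming a timestamp is recorded for each streamed token; the function name and the sample timestamps are made up for illustration and are not part of vLLM.

```python
def decode_tokens_per_second(token_times):
    """Steady-state decode rate from per-token arrival timestamps (seconds).

    The first timestamp marks the first generated token, so prefill time
    (TTFT) is excluded; the rate covers only the intervals between tokens.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two token timestamps")
    elapsed = token_times[-1] - token_times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return (len(token_times) - 1) / elapsed


# Made-up example: 101 tokens arriving 50 ms apart after a 0.4 s prefill.
times = [0.4 + 0.05 * i for i in range(101)]
rate = decode_tokens_per_second(times)
print(f"{rate:.1f} tok/s")
```

Reporting the rate this way, along with prompt length and output length, makes it easier to tell whether two "tok/s" figures measure the same thing.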

I also tried Lemonade’s portable gfx120X runtime, but it ships gfx1200 kernel packs and fails on this gfx1201 card with hipErrorInvalidImage.

u/veryhasselglad — 2 days ago

▲ 86 r/Vllm+8 crossposts

Mac | Cubix | V620 | Ubuntu | ROCm | vLLM | Local AI Data Center

What a loaded title.

It started with the 2019 Mac Pro, however it has since grown into so much more, evolving from niche to explicitly unique. Allow me to explain.

TL;DR

Three 2019 Mac Pro systems (MacPro7,1)
Cubix Xpander Rackmount (8 PCIe slots, passive cooling)
AMD Radeon PRO V620
AMD Radeon PRO W6900X
AMD Radeon PRO W6800X Duo
AMD Radeon PRO W6800
Sonnet eGPU Breakaway Box 750/750ex
Ubuntu Server 24.04 LTS — bare metal
ROCm 7.2.3
vLLM ~~0.24.0~~ 0.25.1
FP16 and AWQ
Qwen3.6-27B / gemma-4-31B-it
Several Hermes agents
[ SUCCESS ]
Looking forward to hearing from you all

The Dream

Achieving the dream is the goal here. The journey is half the dream, with the technical goal being the ability to run 30 to 50 concurrent agents. Currently, that means Hermes agents, each with a unique profile, role, and human name. (Adam, Samar, Sami, Dalia, Basil, Leen, Ziyad, Sultan, and many more)

On this journey, I hope to master vLLM, multi-GPU setups, high concurrency, general optimization, and troubleshooting wherever possible.

I will keep the actual goal and final purpose of all of this private for now.

GPUs | AMD? | NVIDIA? | Tenstorrent?

I previously discussed multiple-GPU setups in this post. u/Guanaalex introduced me to the world of Cubix Xpanders, and I was hooked. I managed to find a 4U Cubix Xpander Rackmount on eBay. The seller was kind enough to offer it at a price I could reasonably afford. Please support the seller, Mara7Electronics.

I decided to buy a full-fledged 42U server rack to host it and migrate all my hardware into it.

I had previously bought a nice, rack-mountable online double-conversion 3.6 kVA / 3.6 kW UPS to power the two Macs I was using. I decided to buy a couple more: one for each Mac and one for the Cubix Xpander. I also decided to replace my daily-driver 2019 Mac Pro with a Mac mini M4, allowing the Mac Pro to become my third Local AI system: LinuxAI-03.

Although I already had three AMD Radeon PRO W6800 GPUs that I had purchased for use as eGPUs, that plan was abandoned in favor of the Cubix Xpander's cleaner eight-GPU setup.

I considered purchasing five more W6800 GPUs, eight AMD Radeon AI PRO R9700 GPUs, or even eight Tenstorrent Blackhole p150 AI accelerators. I considered NVIDIA GPUs for a quick second, but the cost quickly killed that idea. Eventually, I stumbled across AMD Radeon PRO V620 cards on eBay, which came with fan shrouds, had been flashed with W6800 firmware, and included a comment explaining that the V620 firmware could be restored for pure compute use.

I had not considered these cards before. I barely knew anything about them. I looked them up and found them on eBay for a pleasant $350 USD each. Eight of them would cost about the same as three W6800 GPUs. The only challenge was cooling.

Lo and behold, the Cubix Xpander I had bought happened to be the model that supports passively cooled hardware. I did not give it another thought. I immediately started discussions with the seller. They refused to gift me a rhino prop with my purchase. I was kind of disappointed. I appreciate Rhino Technology, perhaps not for the missing rhino prop, but certainly for their communication and respect. Please support them.

You may notice that I did not consider Intel cards. The reason was simple: I did not know Intel's direction for its GPU business, and I did not want to invest in the hardware only to see development of its software stack discontinued if Intel sold or shut down that part of the business.

The Data Center

Although I had an old 12U server rack, it was more of a wall-mounted networking rack, and it was already full. I searched online for the 42U server rack I wanted, but everything was either moderately priced with no description beyond “42U,” or fully documented but insanely expensive.

I ended up sending my son to the local computer market, which is labeled a bazaar even though it is not really one. I loved the experience for him. He managed to find several shops carrying server racks with the specifications I wanted. He then found the cheapest shop that also offered delivery and installation, and bargained with the shop owner.

With that, I had my first 42U server rack: front-to-back airflow, double mesh doors on both sides, and fans preinstalled at the top. The server rack was delivered and installed on the same day.

Next came the UPS devices.

The Tecnoware UPS I mentioned earlier was no longer available for sale anywhere. Nothing online was both good enough and cheap enough. I sent my son back to the computer bazaar, but he could not find anything reasonably comparable to the UPS I already had in terms of its kilowatt-to-price ratio, online double-conversion capability, and rack-mountable design.

I ended up searching Haraj, the local equivalent of Craigslist, for UPS options, as well as Microless, which I would describe as Dubai's version of Newegg. I found a local vendor selling enterprise-grade 6 kVA / 6 kW UPS devices from a well-known international manufacturer for roughly half price. The catch? They were old stock from mid-2023, apparently unsold hardware left over from a project whose contract had ended.

I tried to purchase only two UPS units, but the company insisted on selling each one with three rack-mounted battery packs and would not budge on the price. I was about to cancel the purchase when work pulled me away. Later, I had a nice conversation with u/Long-Shine-3701, who convinced me to go for it, particularly with my future green-energy project in mind.

At the time, I did not know exactly how old the batteries were. I only knew they were “old” and had generally been kept in room-temperature storage. Regardless, my goal was never to keep the servers alive for long periods during power outages. My main goals were to provide clean, pure sine-wave power and allow for safe shutdowns. It is worth noting that each battery pack contains twenty standard, replaceable 9 Ah battery cells, although I do not have the faintest idea how to replace them yet.

I reached an agreement with the company to provide each UPS with four battery packs, the maximum number supported by these UPS units, along with a warranty, free delivery, and installation.

I went for it.

I did the rack-space math. It went something like this: a 1U UPS plus four 3U battery packs, with 1U of space between each unit to reduce heat buildup and prolong battery life... Thirty-seven rack units?! That was almost my entire rack.

I measured the data room quickly, then proceeded with a quick phone call to the server-rack supplier my son had found, followed by a bank transfer, and I had same-day delivery and installation of a second rack. I barely had 2 cm, roughly three-quarters of an inch, of clearance after installing the second rack. It was a perfect fit. I felt like a child at a candy store at that point.

The next day, the UPS units and battery packs were installed. The company was concerned about the available power, but I had already purchased five 10 mm² copper conductors, obtained a second meter from the electric company for this setup, and purchased a couple of breakers—one manual and one smart—as well as power-distribution equipment.

All that remained was to hire an electrician to connect the second meter to the breaker in the room. I had already arranged for one to work on a Saturday so the task could be completed quickly. The plan was ready; only the execution remained.

The company set up the UPS units and battery packs and initially connected them to my home meter to charge the batteries and test the system. Everything seemed to be working well, pending grounding, neutral wiring, and connection of the second meter. If the absence of neutral wiring raises questions for you: I used two live wires to complete the circuit and obtain the higher voltage, 220 V rather than 110 V.

The electrical work, while impressive in my opinion, does not get a detailed mention here beyond the fact that it is now part of the home data center and is controlled through Home Assistant, after the electrician completed the connection. If anyone wants to know more, I would be more than happy to share.

Resources

While working on this project, I experimented and learned a great deal. I then shared a great deal and received a tremendous amount of valuable knowledge and education from the community, which changed my plans midway through the project.

The target was always higher concurrency through more VRAM. Unified memory, or uRAM, was not an option for me, as one of my goals was to master dedicated hardware—AI accelerators in one form or another—for inference.

The first idea was to add four eGPUs to the Mac that already had four GPUs. I bought:

~~Four~~ Three AMD Radeon PRO W6800 GPUs. The fourth was canceled by the seller.
Four Sonnet eGPU Breakaway Box 750/750ex enclosures.

Then the plan shifted to the Cubix Xpander, and I bought:

The Cubix Xpander
Eight AMD Radeon PRO V620 GPUs
A Mac mini M4 to replace my daily-driver 2019 Mac Pro
A fifth Sonnet eGPU Breakaway Box 750ex to use a PCIe card from the Mac Pro with the Mac mini
Two 42U server racks
Two enterprise-grade UPS units with four battery packs each
Two patch panels, one for each server rack
Two SilverStone HELA 2050R Platinum PSUs

I then found a pair of Cubix Xpander Desktop Elite systems, each with four PCIe slots, and bought those as well.

With international shipping and double taxation, I have severely exceeded my budget. I have had to bring all further spending to a complete stop and limit myself to covering only operational and maintenance costs.

The electricity bill alone will be an insane operating expense.

Something worth mentioning though, I would love to get my hands on sixteen Tenstorrent Blackhole p150a-series accelerators and QSFP-DD 800G cables. Testing all of them on a single server using every available Cubix Xpander would truly push every piece of hardware involved to its limit. Had I possessed the necessary capital, that is probably the direction I would have taken instead. I am just putting the thought out there. A Tenstorrent Galaxy Blackhole or four would be insane as well, would it not? A guy can only dream.

I am genuinely hopeful, believing in the work these guys are doing there. I would also like to highlight Tenstorrent's documentation and software stack.

The Challenge

I am happy to say that I am satisfied with the results, and I look forward to pushing further and expanding the stack.

Power:

The first hurdle was power. Not its availability, but its deliverability.

The PSUs in the Cubix Xpander were only designed to power eight cards using 8-pin and 6-pin connectors. For the V620 cards, I had to replace those PSUs with SilverStone HELA 2050R Platinum units to provide dual 8-pin connections to each GPU. That is sixteen 8-pin connections total, at 150 watts each.
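
A rough check of that connector budget, as a minimal sketch. It uses the 150 W per 8-pin figure quoted above and assumes the "2050" in the PSU model name is the rated wattage of each unit; that rating is an assumption read from the name, and slot power and PSU efficiency are ignored.

```python
GPUS = 8
CONNECTORS_PER_GPU = 2       # dual 8-pin per V620, as described above
WATTS_PER_CONNECTOR = 150    # figure quoted above
PSU_COUNT = 2
PSU_RATED_WATTS = 2050       # assumption: taken from the "2050R" model name

connectors = GPUS * CONNECTORS_PER_GPU
connector_budget_w = connectors * WATTS_PER_CONNECTOR
psu_capacity_w = PSU_COUNT * PSU_RATED_WATTS

print(f"{connectors} connectors, {connector_budget_w} W connector budget")
print(f"{psu_capacity_w} W combined PSU rating")
```

Under these assumptions the sixteen connectors add up to 2,400 W against a combined 4,100 W rating, which leaves headroom for whatever else the enclosure draws.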

They cost me a pretty penny, but I was lucky enough to find them on Microless for half the price listed on Amazon and eBay.

Assembly:

During my first exploratory disassembly of the Cubix Xpander, I may have overtightened the screws. When it came time to open the unit again, install the new PSUs, and then install the GPUs, the screws simply would not budge. I was unable to open it. I even stripped the screw heads while trying...

I performed some clever analysis and concluded that when I first opened the Cubix Xpander, it had just arrived and was incredibly hot. After I tightened the screws and placed the unit in the so-called data center, where the room temperature was below 18 degrees Celsius, the metal contracted and squeezed the life out of those screws.

Whether or not that logic makes sense, I decided to use a heat gun on the two screws that had become stuck and stripped while I was trying to remove them. Like magic, the screws, although extremely stripped, came out with absolutely no resistance. I suspect that using a heat gun is simply a well-known technique for removing stuck or stripped screws and that I got lucky there.

Moving on, due to my lack of experience, I had to repeat every step two or three times. The experience went something like this:

Place the Xpander in the rack. Remove the Xpander from the rack. Install the GPUs. Screw them in. Unscrew them. Remove the GPUs. Connect the 8-pin cables first. Reinstall the GPUs. Screw them in again. Finally, struggle with the weight while putting the Xpander back into the rack.

Wait—I do not like the Cubix Xpander mounted this high now that both UPS units have been installed. I am not changing anything at this point...

Mounting:

The heaviest items are supposed to be installed at the bottom of the rack. I figured that 300 kg, or 660 pounds, of batteries qualified, so they ended up in the lower sections.

However, I also wanted to place my tower-style 2019 Mac Pro systems in the rack.

How was I supposed to do that?

A few rack shelves, some foam from the UPS and battery boxes, and some tinkering to remove the feet, wheels, and handles from the three Macs, and they were ready to be inserted into the server rack at appropriately higher levels.

The shelves are what actually mount to the rack. The foam acts as a type of sliding drawer and as an insulation layer to prevent metal-on-metal damage. The stripped-down Macs are just the right size to slide in and out.

The Jet Engine:

I was excited to have everything set up and ready to go.

I powered on the Mac and Cubix setup and was blown away—pun intended—by the jet engine I had just installed in my home. Even though I have a dedicated data room for all of this, I could hear that engine from the living room and from my bedroom. Soundproofing the doors reduced the noise enough that it barely carried from the data room into my office, which was decent.

I hope to measure the decibel level one day, but that day is not today.

Killer Feature:

The AMD Radeon PRO V620 GPUs have a killer feature through which each card virtualizes up to twelve GPUs, each with 32 GB of VRAM (You could say it is the same VRAM viewed from a different angle). While my explanation of the feature may be completely off, it just means that each physical card can request roughly 384 GB of BAR address space. This incredibly useful feature (/sarcasm) prevented the amdgpu driver from loading on even a single GPU.

Fortunately, the fix was simple! I only needed to disable SR-IOV in the BIOS... on my Mac... Which had no access to BIOS...

I called my guy at Apple to request a special exception allowing me to access the BIOS on this one Mac. I had a really good feeling about it.

Then I woke up, realized I had to stop dreaming up fake solutions, and started searching for something real that I could do to solve this.

I brought up the topic over at r/AMD_V620. I could not find a solution that worked specifically for the Mac, so I created my own.

To put it simply, I patched the kernel to disable emulation for these cards. Yet another reason to patch the Ubuntu kernel on Macs.

It is not ideal, but it did the job.

Power, Phase 02:

Now that I could work in a sane, working environment, it was time to compare the eight-GPU setup with 20% more compute units against the previous four-GPU setup.

Nice. The numbers look bett... Wait. The data center is beeping.

What is going on?

Why is the UPS reporting “Mains Overload” and switching to battery power?

One step at a time:

Let me reclamp the ring terminals that do not look right.
The power distributors support up to 80 amperes each.
The smart breaker supports up to 100 amperes.
The in-room manual breaker supports up to 63 amperes.
The distant breaker supports up to 63 amperes.
The electric-company meter/breaker is limited to 70 amperes.

Everything seemed to be in order.

Why was this happening? Was the UPS defective?

Let me connect the server and Cubix Xpander to the other UPS.

Twenty amperes. So far, so good.

Nope. I spoke too soon. The same problem appeared.

If the alarm starts when I reach only 20 amperes... Light-bulb moment

Is the 70-ampere service from the power company divided across the three phases? Does that mean I only have access to roughly 23 amperes before maxing out any one of the three phases?

Me: Hello, power company? Could you please double the capacity of the meter I have here? One hundred and fifty amperes? Yes, that sounds good. Let us make it happen.

While waiting for the power company to upgrade the meter, I decided to balance the Cubix Xpander's load between the two UPS units, since it uses two PSUs.

I can imagine several potential problems with this arrangement, but the alternative was to wait at the mercy of the power company until it decided whether to approve my request or not. Then again, I could just be imagining things.

Miscellaneous:

Both personal and work-related matters delayed me severely and kept me occupied, including summer vacation and corporate restructuring.

The Outcome

Pure vLLM power.

I wrote this post over several days, almost weeks ago.

I am currently running vLLM 0.25.1. After weeks of experimenting with Qwen3.6-27B and gemma-4-31B-it in FP16, AWQ 4-bit, and AWQ 6-bit configurations, using both standard configurations and custom chat templates, I decided to focus on gemma-4-31B-it in FP16 using the chat template published in the vLLM GitHub repository.

I am now considering looking into the gemma-4-31B-it-assistant model, as well as what I understand to be a new chat template released by Google for it.

I developed a small package, with AI assistance, to automate vLLM serving. The goal is to enter a simple, single-word command, or something as close to that as possible, select a few options, and then serve the model through vLLM until I intentionally stop it.

After a crash, the environment is cleaned up and vLLM automatically starts serving again. A reboot has a similar effect. Since I have access to as many as eight GPUs in one machine, support for serving two models simultaneously is built directly into the package.
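
The restart-after-crash behavior can be sketched roughly as follows. This is not the package described above, only a minimal illustration of the idea: `supervise`, its parameters, and the `cleanup` hook are hypothetical names, and the demo runs a stand-in Python command rather than an actual `vllm serve` process.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, backoff_s=0.0, cleanup=None):
    """Run cmd and restart it after a non-zero exit, up to max_restarts times.

    Returns the list of exit codes observed, oldest first.
    """
    exit_codes = []
    while True:
        code = subprocess.call(cmd)
        exit_codes.append(code)
        # Stop on a clean exit, or once the restart budget is used up.
        if code == 0 or len(exit_codes) > max_restarts:
            return exit_codes
        if cleanup is not None:
            cleanup()  # hypothetical hook, e.g. clearing stale state
        time.sleep(backoff_s)


# Stand-in for a crashing server; a real setup would pass something like
# ["vllm", "serve", "<model>"] instead.
crashing = [sys.executable, "-c", "import sys; sys.exit(3)"]
codes = supervise(crashing, max_restarts=2)
print(codes)
```

Surviving a reboot is a separate concern; that part is usually handled by the init system rather than by a loop like this.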

I currently do not use MTP. I found that, while it dramatically increases inference speed at low context lengths, it dramatically reduces inference speed at higher context lengths. Since my goal involves agentic workloads, which use high context lengths from the beginning, enabling MTP results in slower inference almost immediately.

With Gemma 4 in FP16, I can support roughly 10 concurrent requests at a 60K context length. Qwen3.6 gives me more than twice that concurrency at the same context length. On the 4-GPU Mac, I was able to achieve as many as 40 concurrent users with Qwen3.6-27B-AWQ (4-bit), a 64K context window, and a special form of KV cache compression. However, I opted to stick to no compression, for maximum accuracy, since the end goal is essentially repeated loops.
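
The link between context length and concurrency comes down to KV cache size. A minimal sketch of the usual estimate follows, assuming plain full attention in every layer; the layer count, head counts, and memory budget are hypothetical placeholders, not the real dimensions of Gemma 4 or Qwen3.6, and models with sliding-window or hybrid attention need less than this.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # Keys and values are both cached, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(kv_budget_bytes, context_len, bytes_per_token):
    # Worst case: every request fills its whole context window.
    return kv_budget_bytes // (context_len * bytes_per_token)


# Hypothetical model shape and KV budget, for illustration only.
per_token = kv_cache_bytes_per_token(
    num_layers=48, num_kv_heads=8, head_dim=128, bytes_per_elem=2
)
budget = 64 * 1024**3  # 64 GiB left for KV cache after weights
slots = max_concurrent_requests(budget, 60_000, per_token)
print(per_token, slots)
```

Halving `bytes_per_elem` (for example with an FP8 KV cache) or using a model with fewer KV heads roughly doubles the number of requests that fit, which is the kind of difference described above.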

With a single concurrent request, generation starts at approximately 18.5 tokens per second but quickly falls to an average of around 16 tokens per second. That is very slow for an eight-GPU rig. I cannot complain, though: these GPUs are limited to PCIe 3.0 x4 connections by the Cubix Xpander, and for pushing VRAM to the extreme at one-fifth the cost, I am very happy.

I imagine I would have achieved dramatically greater throughput with Tenstorrent Blackhole p150 AI accelerators, but that is a project for another day.

I wonder what four Tenstorrent Galaxy Blackhole systems could do with trillion-parameter models.

Only The Beginning

This is far from the grand finale I am striving toward. It is, however, a good start, and I plan to continue pushing forward.

Today, I have approximately 10 agents running. Each has a specific responsibility or role. They have supported me immensely, and although they have shown me only a glimpse of what they can do, I am both impressed and satisfied with the results.

I have configured the agents with failover models. With hardware redundancy, automated vLLM recovery, and several hours of battery capacity, I expect downtime to be negligible. I would like to learn how to properly calculate the number of “nines” of availability, such as 99.9% uptime, that this setup could realistically achieve. Once I add green energy and a local generator, and eliminate any single point of failure, I will give this much more serious thought. When that happens, it will no longer be a "Home Lab" though. Or are we past that point already?
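
For the "nines" question, a common starting point is steady-state availability from mean time between failures (MTBF) and mean time to repair (MTTR), combined in series or in parallel. The sketch below is a simplified model that assumes independent failures; the function names and the example figures are illustrative, not measurements from this setup.

```python
import math


def availability(mtbf_hours, mttr_hours):
    """Fraction of time a component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*avails):
    """All components must be up (e.g. power AND server AND network)."""
    result = 1.0
    for a in avails:
        result *= a
    return result


def parallel(*avails):
    """Any one component is enough (e.g. redundant servers)."""
    down = 1.0
    for a in avails:
        down *= 1.0 - a
    return 1.0 - down


def nines(avail):
    return -math.log10(1.0 - avail)


def downtime_hours_per_year(avail):
    return (1.0 - avail) * 365 * 24


# Illustrative: one failure every 999 hours, one hour to recover.
a = availability(999, 1)
print(f"{a:.3f} -> {nines(a):.1f} nines, {downtime_hours_per_year(a):.2f} h/year down")
```

Real numbers need measured failure and recovery times, and shared dependencies such as one utility feed or one network uplink break the independence assumption.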

At this point, I have three 2019 Mac Pro systems running Ubuntu Server 24.04 directly on bare metal. Depending on the machine, they contain AMD Radeon PRO V620, W6900X, W6800X Duo, or W6800 GPUs, or some combination of them.

They run ROCm 7.2.3 and serve models through vLLM 0.25.1, primarily using FP16 versions of gemma-4-31B-it and Qwen3.6-27B.

The hardware and software stacks are about as good as they can be without additional spending.

Next, I will focus on purely agentic workloads and goal-oriented loops.

I would love to receive feedback on what I did wrong, what I could improve, where I could learn and grow, and anything else I could do better.

To our Saudi and Arab youth in the Kingdom of Saudi Arabia in general, and in Riyadh in particular: please reach out to me, as a request and not a demand, to collaborate and exchange knowledge in the field of artificial intelligence and agents or assistants, such as Hermes.

Looking forward to hearing from you

^(Disclaimer: I wrote this post myself. I also used AI as a tool to help clean up the wording and formatting.)

u/Faisal_Biyari — 3 days ago

▲ 131 r/Vllm+1 crossposts

Created vLLM Breath monitor like bTOP

Hello guys. I just wanted to share a visual monitor for vLLM with you.
Feel free to tweak my code. It is posted on my private server.

Copy and paste in a browser tab.
http://s2.igrnt.info/.misc_code_shared/vllm_breath.py

Language switch: l.
Russian and English are both in the code.

u/EmergencyFly476 — 3 days ago

▲ 8 r/Vllm+2 crossposts

Compiling and running a pre-trained LLM on AWS Inferentia accelerator

In this tutorial, we are going to compile and run a small Llama-architecture model on an EC2 instance. If it passes the compilation and inference test, the model is compatible.

Source code on GitHub: https://github.com/p0o/run-models-in-aws-inf2-ml-accelerator

pooria.co

u/opshack — 2 days ago

▲ 5 r/Vllm+7 crossposts

AI-coding tools and harnesses have changed the role of software engineering – and fast. Devs have started tokenmaxxing, with some companies, including Meta, launching leaderboards to measure their workers’ ability to burn through AI usage.

Agentic coding tools have radically changed AI usage, acting semi-independently through command lines, files, internet access, and tool calls. This also means they burn far more tokens than older chat-based tools, says Sebastian Baltes, professor of software engineering at the University of Heidelberg. “I’m a daily user of these tools myself, and it’s just astonishing how fast the field moves,” he says.

As software engineering evolves, the suppliers of that AI inference, including the biggest providers within the tech sector, have pulled the rug from beneath coders and the companies they work for.

Price hikes, tighter usage limits, and the shutting down of some providers to new signups have upended the new way of working that software engineers were just getting used to.

Read the full article: https://leaddev.com/ai/your-ai-coding-budget-just-got-a-lot-more-complicated

u/OfficialLeadDev — 2 days ago

▲ 73 r/Vllm+10 crossposts

Byte exact KV cache grafting on frozen Gemma 4

We published a method to store verified knowledge as KV state and restore it byte identical to fresh computation.

On Gemma 4 12B, cached knowledge improved the same routing system from 76.7% to 90.0% on AIME 2025.

I will pitch this at AGI Summit on July 19.

Paper: https://arxiv.org/abs/2607.14431

u/MindPsychological140 — 4 days ago

▲ 15 r/Vllm

Just learned (the hard way) that long-running vllm can accumulate POISONED prefix-cache/MTP state: tool calls derail into mask-token/XML garbage for specific prompt prefixes while other prompts stay clean.

u/Afraid-Yoghurt6731 — 4 days ago

▲ 3 r/Vllm

Does vllm need a restart once in a while?

Out of the blue, I started getting replies from the agent that completely broke tool parsing. Like

<read", "path": "/home/agent/.agents/skills/research/SKILL.md"}
{"path": "/home/agent/.agents/skills/research/SKILL.md"}
</read>

I checked all my code and made sure nothing I did today could have impacted this. No changes in version, nothing. But try after try, it just kept outputting this kind of garbage.

Out of desperation, I shut down vllm and turned it back on. Lo and behold, it works like a charm again.

So now I'm really confused: are we supposed to restart vllm once in a while? Could long-running sessions corrupt memory in a way that harms the way it works?

u/nunodonato — 4 days ago

▲ 19 r/Vllm+4 crossposts

Looking for feedback on a small test SLM I built completely from scratch [P]

Architecture:

- Parameter count: 216.5M

- Layers: 10

- Attention / no attention: Attention. 12-head multi-head self-attention, RoPE positional encoding, SDPA. Decoder-only, pre-norm, RMSNorm + SwiGLU, tied input/output embeddings. (hidden 1032, head_dim 86, FFN 4416)

- Tokenizer: Custom 36k SentencePiece unigram, case-preserving, byte-fallback, with atomic chat/role + memory special tokens (`<|user|>`, `<|assistant|>`, `<|end|>`, …)

- Context length: 768

- Dataset size/source: ~551M tokens. Pretraining on public English text (Wikipedia, TinyStories, OpenWebText2); SFT on public instruction/chat sets (SmolTalk, UltraChat, Magpie, AM-DeepSeek-R1, Orca-Math) + a small custom greetings/assistant set.

- Training time: ~15 hours

- Hardware: single NVIDIA RTX 3080 (10GB), bf16

- Training setup: AdamW (β 0.9/0.95), LR 3e-4, 1k warmup, weight decay 0.1, grad-clip 1.0, effective batch 16,384 tokens/step (micro 4 × grad-accum 8 × seq 512), 33,650 steps.
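
For anyone who wants to sanity-check the parameter count against the shape above, here is a rough tally. It is a sketch under assumptions not spelled out in the list: no bias terms, full multi-head attention (no grouped-query sharing), two RMSNorm vectors per block plus a final one, and a vocabulary of exactly 36,000.

```python
VOCAB = 36_000   # assumption: "36k" taken as exactly 36,000
HIDDEN = 1032
FFN = 4416
LAYERS = 10

embedding = VOCAB * HIDDEN         # shared with the output head (tied)
attention = 4 * HIDDEN * HIDDEN    # Q, K, V, and output projections
swiglu = 3 * HIDDEN * FFN          # gate, up, and down projections
norms = 2 * HIDDEN                 # two RMSNorm weight vectors per block
per_layer = attention + swiglu + norms
total = embedding + LAYERS * per_layer + HIDDEN  # + final RMSNorm

print(f"{total / 1e6:.1f}M parameters")
```

Under those assumptions the tally lands on the stated 216.5M, with roughly 37M of it in the tied embedding table.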

Loss curve

Random init → 1.77 smoothed (1.27 last step). Started at exactly ln(36000)=10.49 (true from-scratch).

You can see SFT kick in — the sharp step-down at ~410M tokens.

(attach tinybrainbot_loss_curve.png)
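
The starting loss figure is what a uniform guess over the vocabulary gives: cross-entropy equals the natural log of the vocabulary size. A quick check, assuming the vocabulary is exactly 36,000 tokens:

```python
import math

vocab_size = 36_000
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess, in nats
print(f"{uniform_loss:.2f}")
```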

Sample outputs

(F16 GGUF, temp 0.7 / top-p 0.9 / repeat-penalty 1.15 / freq 0.5 / presence 0.3)

> **User:** Hello

> **Bot:** Hello! How can I help you today?

> **User:** Tell me a story about a robot who learns to paint.

> **Bot:** In the heart of the city, there lived an assistant named Max. He was a kind and gentle soul with eyes that shone like diamonds, and he had always known that art would bring people together...

> **User:** Write a short poem about the rain.

> **Bot:** The rain was crisp and warm, casting dappled shadows on the walls of your town square...

> Your voice was so cool and gentle, so bright and calming all that you could feel at home.

> **User:** Explain what a computer is.

> **Bot:** ...the development of a computer can be categorized as either an operating system or a popular language, such as JavaScript or C++...

(^ that last one is the honest failure mode — confidently wrong on facts. It nails conversational form far better than it knows anything.)

What shocked me

- The tokenizer was way more important than I thought. I lost quite a few full runs on a broken char-level (386 vocab) fallback before building a real 36k subword tokenizer. That fixed it, and it was a bigger quality jump than any architecture or data tweak.

- A GGUF export issue: llama.cpp has no unigram-tokenizer runtime on the default path, so a SentencePiece unigram model gets silently re-segmented (Hello → Hell+o) and produces garbage, even though the weights were fine. The fix was exporting the GGUF vocab as UGM (t5 type) + the SP charsmap so it runs Viterbi. Cost me a day of pain.

What I want feedback on

- Token budget: 551M tokens for 216M params is way under Chinchilla (~4.3B). Would you push pretraining much longer before SFT, or is more/cleaner data the better spend at this scale for a quick demo?

- Data mix / timing: the SFT step-down in the loss is dramatic — is it real learning, or mostly the model latching onto chat format? How would you tell them apart?

- Cutting factual drift at ~200M params: more pretraining, longer context, or distillation from a bigger teacher — where's the best ratio for noticeably better factual responses?

- Arch nits: head_dim 86 is ugly (1032/12). Would 1024 hidden / 16 heads / 64 head_dim be a meaningfully better shape at this size (also allows quantization)?
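
The numbers behind the token-budget and shape questions can be reproduced in a few lines. This sketch assumes the common reading of Chinchilla as roughly 20 training tokens per parameter, which is a rule of thumb rather than an exact prescription.

```python
micro_batch, grad_accum, seq_len = 4, 8, 512
steps = 33_650
params = 216.5e6

tokens_per_step = micro_batch * grad_accum * seq_len
total_tokens = tokens_per_step * steps
chinchilla_tokens = 20 * params  # assumption: ~20 tokens per parameter

print(f"{tokens_per_step:,} tokens/step, {total_tokens / 1e6:.0f}M tokens seen")
print(f"Chinchilla-style target: {chinchilla_tokens / 1e9:.2f}B "
      f"({total_tokens / chinchilla_tokens:.0%} reached)")

# Head dimensions for the current and the proposed shape.
current_head_dim = 1032 // 12
proposed_head_dim = 1024 // 16
print(current_head_dim, proposed_head_dim)
```

So the run saw about an eighth of the rule-of-thumb budget, and the proposed shape gives a power-of-two head dimension.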

Happy to answer anything about the pipeline (tokenizer training, the from-scratch loop, GGUF export).

Here's the model link as well https://huggingface.co/nkthebass/TinyBrainBot-demo-216.5m

u/nkthebass — 6 days ago

▲ 57 r/Vllm+3 crossposts

llm-d: Distributed LLM Inference on Kubernetes

youtu.be

u/RelevantEmergency707 — 6 days ago

▲ 251 r/Vllm+36 crossposts

I would like to share my latest open-source local LLM inference tool, implemented in C#. It supports models such as Gemma4 and Qwen3.6, with multimodal input (image, vision, audio), reasoning, and function/tool calling. It runs on Windows, macOS, and Linux and takes full advantage of the GPU. The API is fully compatible with the OpenAI and Ollama interfaces.

I would really appreciate it if you could try it and give me some feedback. If you like it, a star would be a big thank-you. Thank you very much!

u/fuzhongkai — 8 days ago

▲ 14 r/Vllm+2 crossposts

I benchmarked N-gram, MTP, EAGLE3, and DFlash speculative decoding on Qwen3.5-122B on a single DGX Spark

Been diving into speculative decoding lately and wanted to share some real numbers instead of just theory, since most explainers stop at "draft model proposes, target model verifies" without showing what happens when you actually run it.

Setup: Qwen3.5-122B-A10B (hybrid INT4/FP8 checkpoint) on a single DGX Spark, tested against the same prompt across all methods, vLLM logs for acceptance rate + throughput.

Baseline (no spec decoding): ~37 tok/s

N-gram: Actually came in below baseline (~30 tok/s). No neural component — it just matches repeated token sequences in context. On general prompts the acceptance rate was low enough that the drafting overhead cost more than it saved. Would probably do better on highly repetitive tasks like doc summarization, but not a general win.

MTP (native multi-token prediction head): This is where it got interesting. The model ships with an MTP head, but the INT4 checkpoint I used doesn't register those weights in its index file by default — needed a patch to get vLLM to actually load and use it (shoutout to the recipe author for documenting that). Once patched:

1 token: ~44-46 tok/s, ~85% acceptance

2 tokens: ~48-50 tok/s peak, this was the sweet spot

4 tokens: dropped back down — acceptance rate falls off hard the further out the head predicts, so the overhead of drafting/verifying/discarding stopped being worth it

EAGLE3: Couldn't find a good EAGLE3 drafter for the 122B model specifically, so I ran this on a smaller Qwen (30B-A3B) instead. At 1 speculative token, went from ~33 tok/s baseline to ~38-40 tok/s. Pushed it to 12 speculative tokens just to see — acceptance rate dropped to 8-11% and throughput roughly halved, even though the drafter itself was cranking out 85+ tok/s in draft speed. Perfect illustration of why raw draft speed doesn't matter if acceptance is low.

DFlash: The most novel of the four — instead of autoregressive drafting, it uses block diffusion to generate a whole chunk of tokens (I ran 16) in a single non-autoregressive pass. In theory this should sidestep the "overhead compounds with more speculative tokens" problem entirely. In practice, it was volatile — swung between 30-52 tok/s on general prompts, and even on a coding-focused test (where I expected structured/predictable output to shine) it peaked in the mid-40s rather than the ~80 tok/s I'd seen in some published benchmarks. My take: it's a genuinely exciting technique in an early state — MTP is more consistent/production-ready today, but DFlash's approach (non-autoregressive drafting) seems like the more scalable long-term direction.

Here are my results from all the tests I ran.

Final Results

Some of my takeaways from the tests:

Acceptance rate is everything. More speculative tokens are not always better. The acceptance rate at each position compounds — even a single bad position in an autoregressive chain invalidates all subsequent drafts.
DFlash's parallel drafting breaks the serial constraint. By generating blocks in a single forward pass, DFlash eliminates the per-token cost overhead that limits autoregressive drafters at high K. This is the architectural reason its peak throughput is so high.
Task type matters. Structured output (code, JSON, math) consistently benefits more from speculative decoding than open-ended generation, because the conditional distribution is sharper and easier for any drafter to model.
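
The compounding effect in the first takeaway can be sketched with a simplified model. This is not how vLLM measures anything; it assumes a draft token is only useful if every earlier draft position in the same step was also accepted, and it ignores drafting and verification cost entirely. The function name and the per-position lists are illustrative, apart from the 0.85 figure, which is the 1-token MTP acceptance reported above.

```python
def expected_tokens_per_step(per_position_accept):
    """Expected tokens emitted per target-model verification step.

    per_position_accept[i] is the assumed probability that draft position i
    is accepted, given that all earlier positions were accepted.
    """
    expected = 1.0  # the target model always contributes one token per step
    survive = 1.0   # probability that the draft chain is still intact
    for p in per_position_accept:
        survive *= p
        expected += survive
    return expected


# One draft token at ~85% acceptance (figure from the MTP run above).
print(round(expected_tokens_per_step([0.85]), 4))

# Illustrative only: acceptance that falls off with distance from the context.
print(round(expected_tokens_per_step([0.85, 0.60, 0.40, 0.25]), 4))

# Illustrative only: twelve draft positions at ~10% acceptance each.
print(round(expected_tokens_per_step([0.10] * 12), 4))
```

In this model each extra position contributes only the probability that the whole chain up to it survived, so later positions add very little once acceptance drops, while their drafting and verification cost is still paid.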

Curious if anyone's gotten more consistent DFlash numbers, or found a good EAGLE3 drafter for larger MoE models like the Qwen-122b-a10b.

reddit.com

u/kristiyanstoyanovAI — 6 days ago

▲ 19 r/Vllm+1 crossposts

HGX200 - GLM5.2

My company is getting a single HGX H200 node with 8× H200 GPUs, and we’re planning to serve GLM-5.2 using vLLM.

Since the hardware has already been purchased, no budget was allocated for renting an equivalent cloud system for testing beforehand. I’m therefore looking for real-world performance data from anyone running GLM-5.2 on similar hardware with vLLM.

I’m particularly interested in:
- Maximum practical concurrency at 128K, 256K, 512K, and up to 1M context
- Aggregate throughput in tokens per second
- Per-user generation speed at different concurrency levels
- TTFT and inter-token latency
- KV-cache memory usage at each context size
- Performance with FP8 KV cache
- The impact of prefix caching
- The effect of TP=8 compared with other vLLM configurations
- Any experience with disaggregated prefill or KV-cache offloading

Even partial benchmark results would be very helpful. I’m mainly trying to estimate realistic concurrency and throughput rather than theoretical maximums.

reddit.com

u/Some-Manufacturer-21 — 7 days ago

▲ 15 r/Vllm+1 crossposts

Scaling my LLM inference for reply suggestions using disaggregated prefill

Sharing what I learned from separating prefill and decode into separate stages to increase processing throughput and significantly reduce TTFT.

saraswatmks.github.io

u/Hairy_Goose9089 — 6 days ago

▲ 5 r/Vllm+1 crossposts

My organisation ditched Ollama as the backend AI engine, which I used to run an embedding model. Now they say to use vLLM, but vLLM requires Linux. The company gave me an IP to access a Linux machine and use vLLM via PuTTY. Could you help me with how to serve a model via vLLM and get an endpoint? Do I need to open

2 PuTTY windows: one for serving the model and one for starting my FastAPI application to get an API endpoint?

reddit.com

u/ash_pix — 8 days ago

▲ 12 r/Vllm+1 crossposts

Qwen3.6-35B-A3B-MTP + vision: does --mmproj disable MTP drafting?

Conclusion: No. In this test, using vision did not disable speculative drafting from the MTP head, either during the vision turn or for the rest of the session. I've seen worries about this in Reddit posts. Maybe it was a real issue in the past, but it is gone now.

Tested with llama.cpp b9620 (57fe1f07c), an M5 Max, and unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL.

2-minute video: https://www.youtube.com/watch?v=lcXdfkXdLE0

Launch

llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
  --mmproj .../mmproj-BF16.gguf \
  -ngl 99 -c 262144 -fa on -np 1 \
  --spec-type draft-mtp --spec-draft-n-max 2 --port 8081

Both subsystems initialize in the same process. llama.cpp reports separate memory budgets for the projector and MTP context:

load_model: [mtmd] estimated worst-case memory usage of mmproj is 1134.00 MiB
load_model: [spec] estimated memory usage of MTP context is 826.70 MiB
common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
common_speculative_impl_draft_mtp: n_max=2, n_min=0, p_min=0.00, n_embd=2048, backend_sampling=1
load_model: speculative decoding context initialized
load_model: loaded multimodal model, '.../mmproj-BF16.gguf'

Text turns

Five text turns, with a prompt reaching roughly 29k tokens. Drafting remained active throughout:

task	draft acceptance	eval tok/s
1	0.862 (100/116)	95.29
78	0.955 (168/176)	99.76
171	0.828 (53/64)	89.36
209	0.739 (525/710)	79.48
570	0.865 (83/96)	88.65

Vision turn

Same server and session, now with a turn that actually encoded an image:

slot process_mtmd: id 0 | task 627 | encoding mtmd batch from idx = 82, n_chunks = 1
slot print_timing: task 627 | eval time = 6801.48 ms / 712 tokens
slot print_timing: task 627 | draft acceptance = 0.61950 (394 accepted / 636 generated)

That is 104.68 tok/s, the fastest eval measured in this session. While an image was in context, the MTP head generated 636 draft tokens and 394 were accepted.
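
For reference, both figures can be recomputed from the two print_timing lines above. This is a small sketch with the log values copied in by hand; the variable names are mine.

```python
# Values copied from the print_timing lines for task 627.
eval_time_ms = 6801.48
eval_tokens = 712
draft_accepted = 394
draft_generated = 636

# Throughput: tokens divided by elapsed seconds.
tok_per_s = eval_tokens / (eval_time_ms / 1000.0)

# Draft acceptance: accepted draft tokens over generated draft tokens.
acceptance = draft_accepted / draft_generated

print(f"{tok_per_s:.2f} tok/s")         # 104.68 tok/s
print(f"{acceptance:.5f} acceptance")   # 0.61950 acceptance
```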

Two messages appeared during the run:

find_slot: non-consecutive token position 82 after 81 ... during image encoding. Generation and drafting continued.
Qwen-VL models require at minimum 1024 image tokens ... try adding --image-min-tokens 1024. This is worth enabling for OCR or grounding, but was not required for a plain image description.

Results

Drafting survived the image. Acceptance fell to 0.62 compared with roughly 0.74-0.95 on the earlier text turns, but drafting was still active. The head was simply less confident about tokens following an image.

Drafting survived the session. Task 949, a text turn after the image, still drafted at 0.680 acceptance (151/222).

Final counters across seven turns:

generated drafts = 1010
accepted drafts  = 818
generated tokens = 2020
accepted tokens  = 1474

That is roughly 73% accepted overall.
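
The token counters line up with the per-turn numbers reported earlier. A quick sketch that sums the seven turns (pairs copied by hand from the table and log lines above):

```python
# (accepted, generated) draft tokens per task, as reported in this post.
turns = {
    1: (100, 116),    # text
    78: (168, 176),   # text
    171: (53, 64),    # text
    209: (525, 710),  # text
    570: (83, 96),    # text
    627: (394, 636),  # vision turn
    949: (151, 222),  # text turn after the image
}

accepted_total = sum(accepted for accepted, _ in turns.values())
generated_total = sum(generated for _, generated in turns.values())
overall = accepted_total / generated_total

print(accepted_total, generated_total)  # 1474 2020
print(f"{overall:.1%}")                 # 73.0%
```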

No throughput penalty was observed in this run. The vision turn was actually the fastest measured, although this single test does not establish a universal performance result.

So, on llama.cpp b9620, --mmproj and --spec-type draft-mtp coexist. If you previously saw MTP drafting stop when vision was enabled, it may be worth retesting on a current build.

Happy coding!

u/Academic-Most6214 — 8 days ago

▲ 27 r/Vllm

I asked Fable to add qwen3.6 gguf, MoE and MTP support

It took Fable all day, and I can’t say I’ve looked at the code much atm or, heck, that I even understand vLLM enough to know what I’m looking at.

However, what I do know is, Fable borrowed code from llama.cpp, I’m now using my own IQ4 quants for 27b and 35b, and it’s all working great!!

Before I wouldn’t have been able to run 35b with vllm, and AWQ 27b was at 89k context. Now I can not only run 35b, but at 245k ctx (196k for 27b) with (more importantly) all the batch support from vllm.

Same IQ4_XS GGUFs, same 3090 Ti @350W, single stream:

Model	ik_llama.cpp base	ik_llama MTP	vLLM (ours) base	vLLM MTP k=1	vLLM ctx configured
27B	48	79	44.0	64.4	192k
35B-A3B	120	~200 (n_max=2: 198.7)	179.6	203.6	245k

Got a couple more performance enhancements but here’s the repo if anyone wanted to take a look. I was going to PR it, but felt ‘dirty’ doing that when I don’t even know what Fable has done and I don’t have the hw to test larger quants.

https://github.com/localweights/vllm-gguf-plugin

u/yes_i_tried_google — 10 days ago

▲ 7 r/Vllm

Help me understand max_num_seqs

Hi all

I've been a bit confused on how to better tune max_num_seqs

When my vLLM starts and loads the model, it gives me the max number of requests at full context (usually around 12).

If I exceed this number, they go into waiting, and I see that in the logs.

So what is max_num_seqs used for? Is there any reason why we would set this value to be lower than the max requests vllm can handle?

thanks

reddit.com

u/nunodonato — 8 days ago

▲ 24 r/Vllm+4 crossposts

I trained a 200M Mixture-of-Experts language model (90M active) from scratch on 8B tokens at 15. I'd love some feedback.

Hi everyone,

Over the past several months, I've been working on **Tiny-MoE**, a language model that I built and trained from scratch as a learning project. I'm 15 years old, and my goal wasn't to compete with models like Llama or Qwen, but to understand how modern LLMs work by implementing the entire pipeline myself.

Some highlights:

200M parameter Mixture-of-Experts model (~90M active parameters per token)
Trained from scratch on **8 billion tokens**
Native PyTorch implementation (not built on Hugging Face Transformers)
Multi-Head Latent Attention (MLA)
RoPE + YaRN positional embeddings
Streaming and packed data pipeline
Custom training and inference code
Text generation with Top-k, Top-p, repetition penalty, and n-gram blocking

**Training datasets:**

FineWeb-Edu
Cosmopedia v2
OpenWebMath

The model is definitely **not state-of-the-art**. It can generate English, answer simple questions, and perform basic coding and math tasks, but it still struggles with longer generations. Given the model size and the compute available to me (primarily Kaggle GPUs), I'm happy with it as a learning project.

I'd really appreciate feedback on:

Code quality
Project structure
Model architecture
Training pipeline
Documentation
Anything else you think could be improved

GitHub:
https://github.com/AbdelrhmanEbied/Tiny-MoE

u/Odd-Ship226 — 12 days ago
