
Scrambling to max StrixHalo (+NVLink dual eGPU 3090 mod)
In short.
1. Strix halo alone (124GB UMA VRAM) is already nice but adding 1 or 2 eGPUs is pretty good for running the recently popular 27B or 31B dense models.
2. The native bandwidth limit of eGPUs can be mitigated. I tried scrambling a 2slot NVLink (cheaper than 3 slots) setup with a simple cooling mod on 3090s. You might experience up to several times better PP/s and TG/s on small densed models, depending on the situation, and it can be useful in multi coding agents scenarios.
3. Basically using riser cable can achieve eGPU's slot flexibility to fit 2slot NVLink with small mod on typical motherboard pcie 3090 cards.
4. Depending on KVcache types in vLLM, not only max context length and concurrent requests change but speed differs a lot in longer context. It might look good at beginning but not promising longer run.
5. For power efficiency, 27B dense models get better PP/s and TG/s per watt on eGPU. But for 122B, running on Strix halo alone via llama cpp showed better power efficiency than combined 3 GPUs.
6. NVLink does not do anything on llama.cpp's layer split, I have tried recent -sm tensor, gaining Tg/s was 30%ish but pp/s down performance was too big, so I stopped, and continue to vLLM on dual 3090.
I was getting a bit frustrated by the relatively slow PP/s on 27B, 31B densed models of my Bosgame M5 Strix Halo, So I decided to do some scrambling to overcome it. Recently, these dense models are getting much more attention than 70B+ MoE models. To run them better I bought single 3090 via local second hand market, after I saw improvement, then quickly moved to dual egpu setup via both nvme pcie 4x4.
I was hesitated to try NVLink since no gurantee on my eGPU case, and 3 slot NVLink was too expensive(600USD+). Still I wanted to see if I could improve the eGPU's PHB speed which has to go through CPU.
But most 3090 cards including mine are 3 slot thick, so I end up buying a 2slot bridge for around $250 including custom fees.
For this, I removed the 3 fan shroud on the top 3090 and roughly attached 120mm fans with a 3D printed side blow duct to make it fit. Surprisingly, the temperature of this modded 3090 actually stays lower than the unmodded one on bottom.
Test Environment:
- Fedora 43
- llama cpp: Strix halo performance power mode, build 9221.
- 122B test was split by
-sm layerusing rocm7.2.3 and cuda. - 27B test used rocm 7.2.3 as baseline. (Comparing rocm 7.2.3 and vulkan radv, rocm has better pp/s and vulkan has better tg/s). Benchmarks were repeated only 2 times.
- Note: Since MTP is not fully implemented in llama cpp benchmarks yet, I borrowed the code_python MTP metrics (-pp/s% and +tg/s%) from kyuz0's strix halo toolbox for the 27B and 122B (using 35B A3B Moe stats) to plot simulated MTP lines. (https://kyuz0.github.io/amd-strix-halo-toolboxes/mtp.html)
- 122B test was split by
- vLLM: Nightly build. 3090s are power limited to 230W each.
- vLLM benchmarks followed the Club 3090 direction:
- Narrative: "Write a detailed 800-word essay explaining transformer attention." (max_tokens=1000)
- Code: "Write a Python implementation of quicksort with comments explaining each step." (max_tokens=800)
- Sampling: temp=0.6, top_p=0.95, top_k=20, presence_penalty=0.0, enable_thinking=false. Three warmups and five measured runs.
- Since Club 3090 doesn't have benchmarks based on context depth, I added those tests.
Benched vLLM models - Qwen 3.6 27B
| Recipe | Quantization | KV cache | Context | Concurrency | Drafter |
|---|---|---|---|---|---|
| docker-compose-dual (small, INT4 Standard) | AutoRound INT4 | fp8_e5m2 | 131K | 4 (total ~524K) | MTP=3 |
| turbo (High-Concurrency) | AutoRound INT4 | TQ3 (3-bit) | 262K | 4 (total ~1048K) | MTP=3 |
| mixed-bf16 (Precision,kinda Q6 feeling) | Mixed (INT4+8) | bfloat16 | 110K | 2 (total ~220K) | MTP=3 |
| mixed-fp8 (Sweet Spot) | Mixed (INT4+8) | fp8_e5m2 | 131K | 2 (total ~262K) | MTP=2 |
| autoround INT8 (Largest) | AutoRound INT8 | fp8_e5m2 | 115K | 1 (total ~115K) | MTP=3 |
Mixed bf16, Mixed fp8, Autoround INT8 recipes are small edited from Club 3090's recipe to look for better than Q4 level of quantization.
(I noticed MTP 2 on mixed-fp8 recipe while I am writing, too much work again to fix, so, keep it mind some different condition)
Benched vLLM models - Qwen 3.6 27B
| Recipe | KV cache | Context | Concurrency | Drafter |
|---|---|---|---|---|
| awq-bf16 (pure AWQ) | bf16 | 262K | 262K × 1, 131K × 2, 65K × 4 | MTP=4 |
| awq_autoround (hybrid awq) | bf16 | 262K | 262K × 1, 131K × 2, 65K × 4 | MTP=4 |
| int8 (larger context) | INT8 | 340K ~ 392K | 262K × 1, 170K × 2, 98K × 4 | MTP=4 |
| docker-compose-bf16 (default) | bf16 | 60K | 60K × 1 | MTP=4 |
Awq_autoround recipe is also small edited from original.
Results:
Triple : dual 3090 + Strix halo
122B Q4 K XL unsloth, q8_0, Strix Halo vs Triple
Strix halo (llama cpp 27B MTP Q6 K XL unsloth, 25GB including mmproj)
vs Dual 3090, Qwen3.6-27B-Mixed-AutoRound Minachist 28.9GB)
I chose these quants since considerably good enough quality and size wise close
Power efficiency
Rough calculation, but for 27B dense models, the eGPU setup has better power efficiency. However, when running the 122B model, Strix halo alone running on llama cpp was actually more power efficient.
NVLink on / off
Tested NVLink on vs off. As concurrency and context go up, NVLink defends the bandwidth bottleneck pretty well.
BF16 cache senario
fp8 cache case.
INT4 quant's fp8 senario
Gemma4 31B's case
Gemma-4-31B-it-AutoRound-AWQ, mattbucci, BF16 cache
This shows differences based on quantization and KV cache types. You can see how much max context length and speed fluctuate just by changing the cache type.
on Amphere card, TQ3 was pretty bad to keep Tg/s despite it can give more context amount..
Code vs Narrative MTP
When concurrency is 1, code generation is always faster than narrative. But as you can see, when concurrency is 2 and it goes into deeper context, code speed drops and gets reversed by narrative. Seems like a weird load happens when concurrent requests and long context combine.
Huge thanks to
Club 3090 (https://github.com/noonghunna/club-3090/tree/master),
kyuz0's toolbox (https://github.com/kyuz0/amd-strix-halo-toolboxes), and DasDigitaleMomentum's distrobox (https://github.com/DasDigitaleMomentum/strix-halo-cuda-combined-toolbox)