r/StrixHalo

RCCL Optimized for Multi‑Node Strix Halo Ethernet Deployments with Tensor, Expert Parallelism and rocSHMEM

Multi-Node Communication Strix Halo

RCCL (ROCm Collective Communications Library) receives targeted optimizations for Strix Halo multi-node configurations over Ethernet. Building on the initial multi-node enablement delivered in ROCm 7.12, this release optimizes RCCL for distributed AI inference using tensor parallelism (TP) and expert parallelism (EP) across up to four Ethernet-connected nodes, standardizing the network topology for Strix Halo clustering deployments.

Additionally, RCCL integrates rocSHMEM operations to improve all-to-all collective communication. rocSHMEM is AMD’s GPU-native communication library that enables GPUs to directly read and write each other’s memory without routing data through the CPU. By using rocSHMEM for GPU Direct Access (GDA) in all-to-all operations, RCCL reduces the overhead of exchanging data between GPUs. RCCL also implements threshold-based point-to-point batching by default, which groups smaller messages together to reduce communication overhead in multi-node configurations.

https://rocm.blogs.amd.com/ecosystems-and-partners/rocm-7.13-blog/README.html

Disclosure: I used Nemotron 3 Nano Omni to come up with the title for the news.

▲ 37 r/StrixHalo+1 crossposts

MTP in llama.cpp (PR #22673) tested on AMD Strix Halo: Qwen 3.6 35B-A3B hits 71 t/s short / 48 t/s at 62K via Vulkan RADV

Llama.cpp merged PR #22673 last week with MTP support. Three days later unsloth shipped Qwen 3.6 35B-A3B-MTP-GGUF. Today I swapped the vision endpoint on my Strix Halo box. Sharing because the numbers honestly surprised me.

Same hardware. Measurements:

Gemma 4 26B-A4B Q8 (before):

- 41 t/s short ctx

- 36 t/s at 22K

- 66 MiB KV per 1K tokens (SWA)

- 96K practical ceiling

Qwen 3.6 35B-A3B Q6_K + MTP-2 (now):

- 71 t/s short

- 48 t/s sustained at 62K (2200+ tokens in one decode)

- 2 MiB KV per 1K (Gated DeltaNet, linear attention in select layers)

- Running native 256K ctx, nowhere near hitting the memory wall

- MTP accept rate 86% average, peak 96.7%

+60-90% to generation speed. KV 15x more compact. Multimodal still works (mmproj-F16 in the same repo), tool calling works, thinking mode works. Nothing to build manually, just the stock kyuz0/amd-strix-halo-toolboxes:vulkan-radv image with llama.cpp master.
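For anyone who wants to see what the per-1K KV figures mean at full context, here is a minimal arithmetic sketch. It uses only the MiB-per-1K numbers quoted above and assumes "96K" and "256K" mean multiples of 1024 tokens; the function name is just illustrative.

```python
# KV-cache footprint from the per-1K-token figures quoted in the post.
# Assumption: context sizes like "96K" are multiples of 1024 tokens.
MIB_PER_GIB = 1024


def kv_cache_gib(mib_per_1k_tokens: float, context_k: int) -> float:
    """KV-cache size in GiB for a context of `context_k` thousand tokens."""
    return mib_per_1k_tokens * context_k / MIB_PER_GIB


gemma_at_96k = kv_cache_gib(66, 96)   # Gemma 4 26B-A4B at its practical ceiling
qwen_at_256k = kv_cache_gib(2, 256)   # Qwen 3.6 35B-A3B at native context

print(f"Gemma 4 @ 96K:   {gemma_at_96k:.2f} GiB of KV cache")
print(f"Qwen 3.6 @ 256K: {qwen_at_256k:.2f} GiB of KV cache")
```

This only counts the KV cache, not weights or compute buffers, so treat it as a lower bound on what the context costs.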

Hardware: AMD Ryzen AI Max+ 395, 128 GB UMA, Radeon 8060S gfx1151, Vulkan RADV backend.

The actual surprise was DeltaNet, not MTP. I assumed MTP was doing all the heavy lifting, but on long context most of the win comes from DeltaNet. Gemma's SWA falls off a cliff past 30K. Qwen stays almost flat. At 62K it loses about a third, not half.

#LocalLLM #StrixHalo #LlamaCpp #Qwen

reddit.com
u/voStragaIT — 3 days ago

Step by step guide to help me get Ubuntu 26.04 set up correctly for best performance

Okay, so firstly, thank you for all your comments on my other questions about which OS is best to use; they were very useful.

I have now tried Windows 11 IoT Enterprise LTSC. I ran the same benchmarks using llama-bench and asked the same questions using Ollama on both. Yes, LTSC is a much slimmed-down version, but in my basic local LLM testing it was no quicker than my full, bloated Windows 11 install, so I don't think that is the route for me.

So I think Linux may well be my best route, and it is also a good excuse to learn new areas:

  • I had a look at CachyOS and just didn't really like the interface etc, if I am going to Linux I at least need to like it
  • I tried Fedora - seems okay
  • Finally for now went with Ubuntu 26.04 as most guides on internet I see seem to refer / relate to using Ubuntu, so for now for a Linux newbie it makes sense I think to go this route.

So onto some questions please:

I am getting confused about ROCm and what exactly I need to do on Ubuntu (or any other distro) to install it. There seem to be so many mixed comments, suggestions, and guides around. For example, on the AMD site I see two different pages relating to it:

https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/docs/install/installryz/native_linux/install-ryzen.html

and

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html

but I'm not sure which one I should follow and which steps are actually relevant for me.

To recap, my device is a Corsair AI Workstation 300 with an AMD Ryzen AI Max+ 395 (Strix Halo) and 128GB RAM. I have set 1GB in the BIOS, so as I understand it the rest is available to Linux as shared VRAM.

So perhaps someone who uses a similar CPU and Ubuntu could guide me through the steps I should take to get the most out of this system.

So, from a base install of Ubuntu 26.04, what should or shouldn't I be installing?

Much appreciated.

reddit.com
u/wingers999 — 4 days ago
▲ 63 r/StrixHalo+1 crossposts

Lemonade v10.5.1: an MTP + ROCm 7.13 quick start for Strix Halo

Update to Lemonade v10.5.1, then:

# Get the model
lemonade pull Qwen3.6-27B-MTP-GGUF

# Get ROCm 7.13
lemonade backends install llamacpp:rocm

# Load the model (MTP args auto-applied)
lemonade load Qwen3.6-27B-MTP-GGUF --llamacpp rocm --ctx-size 0

Shown in the video taking a look in the mirror with the help of Pi agent.

GitHub: https://github.com/lemonade-sdk/lemonade
Discord: https://discord.gg/5xXzkMu8Zk

PS. u/lucifer-vali fixed Fedora 43 support in this release as well :)

u/jfowers_amd — 4 days ago

llama.cpp MTP on Strix Halo: Qwen3.6 27B Q8 hits 2.44×, MoE 1.40×

MTP support landed in mainline llama.cpp on May 16 (PR #22673, commit 4f13cb7). Ran it on a Framework Desktop Strix Halo with ROCm 7.0.2.

Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs:

  • Q4_K_M: 11.7 → 21.2 tok/s (1.81×, n=3)
  • Q8_0: 7.4 → 18.1 tok/s (2.44×, n=3)

Qwen3.6 35B-A3B (MoE), same harness:

  • 49.5 → 69.4 tok/s (1.40×, n=3)

The Q8 gain is bigger than Q4 because baseline Q8 was bandwidth-bound on the 215 GB/s LPDDR5X - MTP turns N decode steps into one heavier forward pass that the bandwidth can actually hide, so more of the weight traffic gets reused per token generated.

The MoE gain is smaller because only ~3B of 35B params run per token. Each forward pass is already cheap, so saving N-1 of them is a smaller win.
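A rough way to sanity-check the bandwidth-bound claim: if every decoded token has to stream all the weights once, the ceiling is bandwidth divided by model size. The sketch below uses the 215 GB/s figure from above; the bits-per-weight values (8.5 for Q8_0, about 4.85 for Q4_K_M) and the flat 27B parameter count are my assumptions, not measured file sizes.

```python
# Back-of-envelope decode ceiling for a dense model that is bandwidth-bound.
# Assumptions (not from the benchmark runs): 27e9 parameters, 8.5 bits/weight
# for Q8_0 and ~4.85 bits/weight for Q4_K_M.
GB = 1e9


def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_bytes: float) -> float:
    """Upper bound on tok/s if each token reads every weight exactly once."""
    return bandwidth_gb_s * GB / weight_bytes


PARAMS = 27e9
q8_bytes = PARAMS * 8.5 / 8
q4_bytes = PARAMS * 4.85 / 8

q8_ceiling = decode_ceiling_tok_s(215, q8_bytes)
q4_ceiling = decode_ceiling_tok_s(215, q4_bytes)

print(f"Q8_0 ceiling:   {q8_ceiling:.1f} tok/s (measured baseline: 7.4)")
print(f"Q4_K_M ceiling: {q4_ceiling:.1f} tok/s (measured baseline: 11.7)")
```

Under these assumptions the Q8 baseline sits right at its ceiling while Q4 has some slack, which fits the observation that Q8 benefits more from verifying several tokens per weight pass.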

Enable with --spec-type draft-mtp --spec-draft-n-max N. Output is byte-identical to baseline at the same seed and temperature.
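How much a given `--spec-draft-n-max` can help depends on how often drafts are accepted. A minimal sketch of the usual geometric model follows, assuming every drafted token is accepted independently with the same probability (real acceptance is not independent, so this is only a guide). The 0.86 value is illustrative.

```python
# Expected tokens emitted per verification pass in speculative decoding,
# assuming each drafted token is accepted independently with probability p.
# One token is always produced by the verifier itself, hence the k = 0 term.
def expected_tokens_per_pass(accept_rate: float, draft_n: int) -> float:
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_n < 0:
        raise ValueError("draft_n must be non-negative")
    return sum(accept_rate ** k for k in range(draft_n + 1))


for n in (1, 2, 3, 6):
    tokens = expected_tokens_per_pass(0.86, n)
    print(f"draft_n={n}: about {tokens:.2f} tokens per verification pass")
```

The model ignores the cost of the drafting and of the wider verification pass, so larger N is not automatically faster even when the expected token count keeps rising.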

Writeup with build commands and per-shape tables (chat, rag, codegen, agent c=4): https://calebcoffie.com/blog/benchmarking-llama-cpp-mtp-on-strix-halo

Raw YAML per run: https://github.com/CCoffie/CalebCoffie.com/tree/main/content/benchmarks/runs

Build: llama.cpp 4f13cb7, ROCm 7.0.2, llama-server with --device Vulkan0 --split-mode none --main-gpu 0 to pin to the iGPU (the Strix APU enumerates as Vulkan0; CPU is Vulkan1 and shouldn't get any layers).

reddit.com
u/C_Coffie — 4 days ago

Six months daily-driving the Corsair AI Workstation 300 for production LLM fine-tuning + inference — settings, real workloads, and the upstream PRs that landed this week

Box: Corsair AI Workstation 300 (CS-9080002-NA), AXB35-02 v3.07 BIOS, 128 GB LPDDR5X @ 8000 MT/s. Bought ~6 months ago, replaced the shipped Windows 11 with Ubuntu 24.04, and have been daily-driving it for LLM fine-tuning and inference since. Sharing the actual settings I landed on and real workload numbers, since most Strix Halo writeups I've seen are reviews-from-a-week and don't cover what you actually need to tune for sustained AI workloads.

The settings that actually mattered

BIOS
- UMA = 1 GB (the BIOS minimum). Don't raise this. The GPU does NOT need pre-allocated VRAM on Strix Halo: `amdgpu` auto-sizes GTT to the full 128 GB at boot, dynamically. Setting BIOS UMA higher just wastes RAM. Multiple Strix Halo OEMs default to 8-16 GB UMA, which is wrong for AI.
- Front-panel power profile: Balanced (~85W) for most workloads; Max (~120W) for active training. Quiet (~55W) chokes the GPU clock.
- BIOS 3.07 is one ahead of Corsair's current public release (3.06); I'm running it because it ships from the factory on later production runs. Public 3.06 is fine for AI work. Don't cross-flash GMKtec firmware to a Corsair board: I tried v1.12 of theirs, hit a 2 GB UMA floor and an O-key bug, and had to recover.
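To check what GTT size the driver actually ended up with after changing UMA, you can read the amdgpu memory counters from sysfs. This is a small sketch, assuming the iGPU shows up as `card0` (it may be `card1` on some systems); the function name is mine.

```python
# Read amdgpu's GTT counters from sysfs. Values are reported in bytes.
# Assumption: the iGPU is card0; adjust `card` if your system enumerates it differently.
from pathlib import Path

GIB = 2**30


def read_gtt_gib(card: str = "card0", sysfs_root: str = "/sys/class/drm") -> tuple[float, float]:
    """Return (total, used) GTT memory in GiB as reported by amdgpu."""
    device = Path(sysfs_root) / card / "device"
    total = int((device / "mem_info_gtt_total").read_text())
    used = int((device / "mem_info_gtt_used").read_text())
    return total / GIB, used / GIB


if __name__ == "__main__":
    try:
        total, used = read_gtt_gib()
        print(f"GTT: {used:.1f} / {total:.1f} GiB in use")
    except FileNotFoundError:
        print("No amdgpu GTT counters under card0; try another card index.")
```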

 Kernel boot params (Ubuntu 24.04, kernel 6.19.14 mainline)
```
quiet splash iommu=pt pcie_aspm.policy=performance amdgpu.runpm=0
ttm.pages_limit=33554432 ttm.page_pool_size=33554432
transparent_hugepage=always numa_balancing=disable
```
- The `ttm.*` and `transparent_hugepage=always` ones are the under-discussed tunings. They fight unified-memory page-allocator fragmentation during long-running training. Default Ubuntu settings will fragment the GTT pool and you'll see RSS climb for hours without OOM-ing; these stabilize it.
- `amdgpu.runpm=0` keeps the GPU out of runtime-PM stalls during back-to-back inference calls.
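If you are adapting the `ttm.*` values to a box with a different amount of RAM, the number is a page count, not bytes. A quick conversion sketch, assuming the standard 4 KiB x86-64 page size (helper names are mine):

```python
# Convert between GiB and TTM page counts for ttm.pages_limit / ttm.page_pool_size.
# Assumption: 4 KiB pages, the x86-64 default.
PAGE_SIZE = 4096


def gib_to_ttm_pages(gib: int) -> int:
    """Number of 4 KiB pages that make up `gib` GiB."""
    return gib * 1024**3 // PAGE_SIZE


def ttm_pages_to_gib(pages: int) -> float:
    """Size in GiB covered by `pages` 4 KiB pages."""
    return pages * PAGE_SIZE / 1024**3


print(gib_to_ttm_pages(128))        # value used in the boot params above
print(ttm_pages_to_gib(33554432))   # and back again
```

So 33554432 pages corresponds to the full 128 GiB; a 64 GB machine would need a proportionally smaller value.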

 sysctl (`/etc/sysctl.d/zz-aurora.conf`)
```
vm.compaction_proactiveness=20
vm.compact_unevictable_allowed=1
```
- Mandatory for unified-memory APUs per AMD's MI300A guidance. Without them, long inference sessions on big models trigger reclaim storms.

ROCm install
- ROCm 7.13 nightly is the only build with native `gfx1151` kernels. Earlier ROCm (7.1) wheels have a `_grouped_mm` null kernel that segfaults under MoE training.
- Install command for PyTorch:
  ```
  pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch torchvision torchaudio
  ```
  Lands you on `torch==2.11.0+rocm7.13.0` directly — no manual wheel wrangling.

GPU performance level
- Set `auto`, not `high`. I had `high` set for a while; it locks clocks at max and disables thermal throttling, which caused an 89 °C crash mid-training. `auto` lets the firmware manage clocks across the Quiet/Balanced/Max profiles and stays well under 84 °C even at 120W.

Real workloads this box runs daily

| Workload | Quant / Format | Memory | Speed | Stable? |
| --- | --- | --- | --- | --- |
| Qwen3.5-27B bf16 LoRA training | r=64 α=128, max_seq 8192 | ~80 GB peak | ~12.5 min/step (FLA), ~16 min (PyTorch fallback) | Yes; multi-day runs are routine |
| Qwen3.5-122B-A10B Q6 inference | llama.cpp HIP | ~95 GB | ~5-6 tok/s tg | Yes |
| Qwen3-Coder-Next 80B-A3B Q8 inference | llama.cpp HIP | ~79 GB | similar to 122B | Yes |
| MiniMax M2.5 230B-A10B Q3 inference | llama.cpp HIP | ~95 GB | ~5 tok/s tg | Yes |
| ComfyUI (Wan 2.2 I2V) | bf16 | ~30 GB peak | varies by workflow | Yes, but don't run with LLM training |

What's still rough

- FlashAttention 2 / xformers don't run on `gfx1151`. SDPA via PyTorch HIP only for training. The single biggest perf gap vs equivalent NVIDIA setups.
- AOTriton coverage is incomplete. Most fast-path kernels in libraries like Unsloth fall through to generic Triton on AMD. Closing fast but not closed.
- `mmap`-only model loading on gfx1151 triggers ~30 min of GPU page-table setup. Use `--no-mmap` (or `--mmap --direct-io` together) for llama.cpp. Not Corsair's fault (it's how unified-memory address translation interacts with mmap), but it traps new users.
- `causal_conv1d` from PyPI builds from source AND fails on Ubuntu 24.04 default. ROCm clang-20 picks gcc-14's runtime dir, which doesn't ship `<cstdlib>` headers in 24.04's default apt set. Fix landed in Unsloth this morning (see below); for direct pip install, set `HIPCC_COMPILE_FLAGS_APPEND='--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13'` before installing.
- bitsandbytes for AMD needs a from-source rebuild with `-DROCM_VERSION=83 -DBNB_ROCM_ARCH=gfx1151`. Official wheels don't work.
- NPU (XDNA 2) isn't usable yet. Kernel driver mainlined in 6.18+, kernel sees the device, but the userspace XRT shim + AMD XDNA SDK aren't packaged for general use. AMD has it on the roadmap. For now: 50 TOPS sitting idle.

Upstream contributions this week

Two PRs merged to `unslothai/unsloth:main` this morning (May 18) that directly affect this hardware:
- PR #5517 (mine, merged 08:05 UTC): fixes the `--gcc-install-dir` issue I mentioned above. Studio's `causal_conv1d` auto-install path now works on stock Ubuntu 24.04 without the env-var workaround. First upstream PR I've had merged.
- PR #5434 (Daniel Han-Chen, merged 10:49 UTC): auto-installs `flash-linear-attention` + tilelang for the Qwen3.5 / 3.6 / Qwen3-Next model family in Unsloth Studio. I caught a hard tilelang-on-HIP regression mid-review; the merged code has a two-layer HIP gate that fixes it specifically for our hardware.

Still in queue:
- PR #5301 (Leo Borcherding): ROCm unified-memory detection + routing Linux Strix Halo installers to `repo.amd.com/rocm/whl/gfx1151/`. Mergeable, awaits maintainer approval. This is the most important still-open one: when it lands, fresh installs auto-route to the right wheel without manual `--index-url` gymnastics.
- PR #5303 (Leo Borcherding): per-GPU lemonade-sdk llama.cpp prebuilts (Studio can use the gfx1151 prebuilt instead of source-building). I benched it as performance-equivalent to my hand-built binary: `pp64` ~254 ± 28 tok/s vs ~270 ± 2, `tg16` 7.48 vs 7.52, all inside σ. CI green.

Public artifacts

Reproduction scripts, faulthandler logs, bench outputs, BIOS notes:

https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/

One subdirectory per PR with the actual probe scripts that produced each finding.

Open questions for fellow Strix Halo owners

- What's the BIOS UMA default on the GMKtec EVO-X2, HP ZBook Ultra, MinisForum, and Beelink GTR variants? I keep seeing 8-16 GB defaults online; the 1 GB minimum is critical for the GTT-auto-sizing path.
- Anyone got the NPU (XDNA 2) actually doing useful inference work yet via the kernel's `amdxdna` driver? I've heard the XRT shim works for some workloads but haven't seen a clean writeup.
- Anyone running training on Max profile (120W) for >24h continuously? Thermals on Balanced are stable indefinitely; I haven't done Max sustained.
- Anyone tried 2-channel LPDDR5X tuning at 8533 MT/s instead of the shipped 8000 MT/s via the AGESA timing tables, or is that locked at BIOS level?

Happy to share specific configs / scripts / step-time numbers in the comments

reddit.com
u/Outrageous_Bug_669 — 4 days ago
▲ 26 r/StrixHalo+3 crossposts

fine-tuning 27B hybrid models on strix halo (ryzen ai max+ 395 / gfx1151, 128 gb unified memory) — full guide, patches, orchestrator

Sharing a guide I just published for fine-tuning 27B+ LLMs on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S / gfx1151, 128 GB unified memory). MIT licensed.

Repo: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide

None of the individual pieces are novel — kernel patches, ROCm 7.13 nightly, FLA, bitsandbytes, LoRA, llama.cpp. The intersection (Strix Halo + gfx1151 + FLA + Qwen3.5 hybrid at 27B) isn't documented anywhere I could find, and getting it stable took a lot of dead ends I'd rather other people skip.

Stack tested: kernel 6.19.14, PyTorch 2.11.0+rocm7.13.0a20260506, ROCm 7.13 nightly, FLA 0.5.1 patched, bitsandbytes 0.50.0.dev0 built from source for gfx1151, llama.cpp b867+. Hardware: Corsair AI Workstation 300 (Sixunited AXB35-02 board, BIOS 3.07).

Things the guide actually covers that I had to figure out the hard way:

  • PyPI bitsandbytes ships zero ROCm binaries. From-source build with -DROCM_VERSION=83, plus a runtime symlink libbitsandbytes_rocm83.so → libbitsandbytes_rocm713.so so bnb's HIP detection on PyTorch 2.10/2.11 stops complaining.
  • FLA's Triton kernels crash on gfx1151 (RDNA 3.5) with num_warps > 4 (Triton#5609) and a tl.cumsum + tl.sum codegen interaction (Triton#3017). Idempotent re-patch script included.
  • In-process Trainer eval at 27B / 8192 seq length is structurally broken on unified-memory APUs — either kernel TTM page allocation failure from fragmentation, or memory watchdog SIGKILL when free RAM drops under ~8 GB. Eval is moved out-of-process via a bash orchestrator aligned to save_steps, waiting for full GPU release between train and eval, with a JSONL trend log.
  • Mainline kernel .deb run-parts double-dir bug on Ubuntu 24.04+ leaves packages half-configured. Repack script included.
  • /srv perms regressing to 0750 mid-training breaks importlib.metadata path traversal and crashes TRL's create_model_card. Cron watchdog restoring 755.
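To give a feel for the out-of-process eval idea from the list above, here is a minimal Python sketch of the eval side only. It is not the guide's bash orchestrator: the function names, the JSONL record shape, and `eval_cmd` (a command assumed to print one JSON metrics object on stdout) are all hypothetical. It relies only on the Hugging Face Trainer convention of saving `checkpoint-<step>` directories.

```python
# Sketch: evaluate saved checkpoints in a child process and keep a JSONL trend log.
# Hypothetical pieces: eval_cmd and the metrics format it prints.
import json
import re
import subprocess
from pathlib import Path

CKPT_RE = re.compile(r"^checkpoint-(\d+)$")  # Trainer's checkpoint directory naming


def logged_steps(trend_log: Path) -> set[int]:
    """Steps that already have a line in the JSONL trend log."""
    if not trend_log.exists():
        return set()
    lines = trend_log.read_text().splitlines()
    return {json.loads(line)["step"] for line in lines if line.strip()}


def pending_steps(output_dir: Path, done: set[int]) -> list[int]:
    """Checkpoint steps on disk that have not been evaluated yet, oldest first."""
    steps = []
    for entry in output_dir.iterdir():
        match = CKPT_RE.match(entry.name)
        if entry.is_dir() and match and int(match.group(1)) not in done:
            steps.append(int(match.group(1)))
    return sorted(steps)


def eval_checkpoint(output_dir: Path, step: int, trend_log: Path, eval_cmd: list[str]) -> None:
    """Run one eval in a child process and append its metrics to the trend log.

    subprocess.run() returns only after the child has exited, so the memory the
    eval held is released before the caller continues.
    """
    ckpt = output_dir / f"checkpoint-{step}"
    proc = subprocess.run(eval_cmd + [str(ckpt)], check=True,
                          capture_output=True, text=True)
    record = {"step": step, **json.loads(proc.stdout)}
    with trend_log.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

A driver loop would call `pending_steps()` after each save and run `eval_checkpoint()` while training is paused; how to pause and resume the trainer is deliberately left out here.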

Verified result: in-progress production fine-tune of Qwen3.5-27B (hybrid, 16 full-attention + 48 GatedDeltaNet layers), bf16 LoRA r=128/α=256, eval rolling at 0.13 loss / 96.5% token accuracy, ~11 min/step, ~4-day total runtime.

Feedback and issues welcome, especially from people on different AXB35-02 boards or non-Corsair Strix Halo systems — I'd like to know what's board-specific vs. generic.

https://preview.redd.it/8i3ebs27h00h1.jpg?width=649&format=pjpg&auto=webp&s=1a4fe453e9e46c97b71a14b993b9536288169ca1

reddit.com
u/Outrageous_Bug_669 — 4 days ago

Strix Halo inference benchmarks across 20 models on llama.cpp ROCm, Vulkan, and CPU — side-by-side with a 3090 and 5070

I kept seeing inference-speed claims for these models and wanting an apples-to-apples comparison on the hardware I actually have. So I built a harness and a public page that dumps every run as YAML.

The dataset: 55 runs, three rigs, five backends (rocm, vulkan, cpu, cuda, vllm-cuda), models from 0.35B (LFM2.5) through 35B-A3B (Qwen3.5 MoE). Workloads: short-prompt chat, long-context RAG, codegen long-output, and an agent shape at concurrency 1 and 4. Three measured iterations after one warmup, temperature 0, VRAM-fit verified before each run.

A few patterns from the data:

Memory bandwidth runs the show for decode. The RTX 5070 (12 GiB GDDR7, Vulkan) actually beats the RTX 3090 (24 GiB GDDR6X, CUDA) on every model that fits in 12 GiB:

Gemma-3-4b      chat:   5070 = 156.6  vs  3090 = 142.0   tok/s
Gemma-4-E4B     chat:   5070 = 124.3  vs  3090 = 118.4   tok/s
LFM2-8B-A1B     chat:   5070 = 336.1  vs  3090 = 318.7   tok/s

The 3090 wins decisively in the 14-31B band where the model fits in 24 GiB but not 12 GiB:

Gemma-4-26B-A4B chat:   3090 = 100.5  |  Strix ROCm = 43.7  |  Strix Vulkan = 47.7  tok/s
Qwen3.6-27B     chat:   3090 = 21.1   |  Strix ROCm = 11.2  |  Strix Vulkan = 11.6  tok/s

Strix Vulkan is often a hair faster than Strix ROCm on the same hardware/model. Biggest gap I saw was Gemma-4-26B-A4B at +9% (43.7 → 47.7). Some models are basically tied. Probably a gfx1151 kernel tuning gap on the bundled ROCm build; haven't dug in.

Quant cost on the 3090 for Qwen3.6-27B chat:

Q2_K = 24.0   Q3_K_M = 20.5   Q4_K_M = 21.1   Q5_K_M = 18.6   Q6_K = 15.3   tok/s

Q2 to Q6 is a 1.6x range. Q4 is the sweet spot. Q2 buys you ~14% over Q4 in exchange for the quality hit; Q6 costs ~28% for the quality bump. Surprised the curve isn't steeper.
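The percentages above come straight from the tok/s row; a small sketch that recomputes them relative to Q4_K_M, using only the numbers quoted in this post:

```python
# Relative decode speed of each quant vs Q4_K_M on the 3090 (numbers from the row above).
tok_s = {"Q2_K": 24.0, "Q3_K_M": 20.5, "Q4_K_M": 21.1, "Q5_K_M": 18.6, "Q6_K": 15.3}

baseline = tok_s["Q4_K_M"]
relative = {quant: speed / baseline - 1.0 for quant, speed in tok_s.items()}
spread = max(tok_s.values()) / min(tok_s.values())

for quant, delta in relative.items():
    print(f"{quant:7s} {tok_s[quant]:5.1f} tok/s  {delta:+.1%} vs Q4_K_M")
print(f"fastest / slowest: {spread:.2f}x")
```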

Reasoning models look ~5x slower than they actually are if you only watch output tok/s. Qwen3.5/3.6 stream most output through a hidden reasoning_content channel that counts in the decode rate but isn't part of the user-visible answer. Worth knowing when picking a coding assistant.

CPU on Strix is not nothing. Gemma-4-26B-A4B MoE runs at ~5-9 tok/s on pure CPU thanks to unified memory + active-param routing. Not fast, but usable for batch work where you don't need the GPU.

Site has every run plus the rest of the models if you want to dig: https://calebcoffie.com/benchmarks. Methodology and the rest of the writeup: https://calebcoffie.com/blog/introducing-open-weight-model-benchmarks.

Things I know I haven't done: vLLM on Strix (lemonade's backend-readiness timeout kills the FP8 autotune; fix queued), CUDA on the 5070 (Arch gcc transition broke the cuda package mid-update, parked), the 70-130B Strix-only models (queued for v2). I don't own a 4090/5080/5090, so those aren't represented; the writeup has a back-of-envelope bandwidth extrapolation.

Not trying to replace existing benchmark sites. Just wanted another data point for my own setup and figured the same combo of rigs would be useful to someone else. Happy to be wrong on methodology if anyone spots a flaw.

reddit.com
u/C_Coffie — 5 days ago

Qwen 3.6-27B Dense with MTP on Strix Halo Windows - Benchmarks

Here are some results (llama.cpp)!

Task 1: write a short poem
27B Dense: 12.5 tokens/s
27B Dense MTP (spec-draft-n-max 6): 14.5 tokens/s
27B Dense MTP (spec-draft-n-max 3): 18.7 tokens/s

Task 2: edit a hello world HTML artifact
27B Dense: 12.6 tokens/s
27B Dense MTP (spec-draft-n-max 6): 14.2 tokens/s
27B Dense MTP (spec-draft-n-max 3): 19.8 tokens/s

Task 3: create a hello world html directly in chat
27B Dense: 12.6 tokens/s
27B Dense MTP (spec-draft-n-max 6): 17.9 tokens/s
27B Dense MTP (spec-draft-n-max 3): 23.2 tokens/s

https://preview.redd.it/i8f0cj0zrn1h1.png?width=1797&format=png&auto=webp&s=a48dd04bdfa4ace1e9bceb8e79415971c5085e95

It's fascinating how it varies with tasks!
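To make the per-task variation easier to compare, here is a small sketch that turns the numbers above into speedups over the plain 27B Dense run. The task labels are shortened by me. Note that the settings below list different GGUF files for the baseline and the MTP run, so these ratios are not purely an MTP effect.

```python
# Speedup of each MTP setting over the non-MTP baseline, per task (tokens/s from above).
results = {
    "short poem":         {"base": 12.5, "n6": 14.5, "n3": 18.7},
    "edit HTML artifact": {"base": 12.6, "n6": 14.2, "n3": 19.8},
    "HTML in chat":       {"base": 12.6, "n6": 17.9, "n3": 23.2},
}

speedups = {
    task: {cfg: run[cfg] / run["base"] for cfg in ("n6", "n3")}
    for task, run in results.items()
}

for task, s in speedups.items():
    print(f"{task:20s} n-max 6: {s['n6']:.2f}x   n-max 3: {s['n3']:.2f}x")
```

In all three tasks the shorter draft (n-max 3) comes out ahead of n-max 6 on this hardware.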

Settings used:

[
  {
    "name": "Qwen3.6-27B-UD-Q4_K_M",
    "file": "Qwen3.6-27B-UD-Q4_K_M.gguf",
    "custom": ["--mmproj", "C:/CarlAI/models/mmproj-Qwen_Qwen3.6-27B-bf16.gguf"],
    "backend": "vulkan",
    "parameters": {
      "temp": 0.8,
      "top_k": 20,
      "top_p": 0.95,
      "min_p": 0.00,
      "repeat_penalty": 1.0,
      "ngl": 99,
      "context_length": 65000,
      "jinja": true,
      "flash_attn": "on"
    }
  },
  {
    "name": "Qwen3.6-27B-UD-Q4_K_XL_MTP",
    "file": "Qwen3.6-27B-UD-Q4_K_XL_MTP.gguf",
    "custom": ["-np", "1", "--spec-type", "draft-mtp", "--spec-draft-n-max", "6"],
    "backend": "vulkan",
    "parameters": {
      "temp": 0.8,
      "top_k": 20,
      "top_p": 0.95,
      "min_p": 0.00,
      "repeat_penalty": 1.0,
      "ngl": 99,
      "context_length": 65000,
      "jinja": true,
      "flash_attn": "on"
    }
  }
]

reddit.com
u/PromptInjection_ — 5 days ago
▲ 45 r/StrixHalo+1 crossposts

Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP

for anyone who cares... 😄

prompt = spend a 1000 tokens
unsloth MTP models
strix halo

llama.cpp:server-rocm-mtp \
--spec-type draft-mtp \
--spec-draft-n-max 3

Qwen3.5-122B-Q5-MTP-General

n_decoded = 100 tg = 29.77 t/s
n_decoded = 179 tg = 27.95 t/s
n_decoded = 254 tg = 26.80 t/s

n_decoded = 4056 tg = 20.23 t/s
n_decoded = 4120 tg = 20.23 t/s
n_decoded = 4181 tg = 20.22 t/s

prompt eval time = 408.99 ms / 19 tokens
eval time = 207516.64 ms / 4200 tokens
tg = 20.24 t/s

Qwen3.5-122B-Q6-MTP-General

n_decoded = 102 tg = 25.10 t/s
n_decoded = 174 tg = 24.25 t/s
n_decoded = 225 tg = 22.04 t/s

n_decoded = 3193 tg = 17.27 t/s
n_decoded = 3244 tg = 17.26 t/s
n_decoded = 3281 tg = 17.18 t/s

prompt eval time = 488.39 ms / 19 tokens
eval time = 191156.72 ms / 3283 tokens
tg = 17.17 t/s
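For anyone reading llama.cpp timing lines for the first time, the final tg figure is just tokens divided by eval time. A tiny sketch using the two summary lines above:

```python
# Recompute generation speed from the "eval time = <ms> / <n> tokens" summary lines.
def tokens_per_second(eval_ms: float, n_tokens: int) -> float:
    """Average generation speed over the whole run."""
    return n_tokens / (eval_ms / 1000.0)


q5_tg = tokens_per_second(207516.64, 4200)   # Qwen3.5-122B-Q5-MTP
q6_tg = tokens_per_second(191156.72, 3283)   # Qwen3.5-122B-Q6-MTP

print(f"Q5: {q5_tg:.2f} t/s")
print(f"Q6: {q6_tg:.2f} t/s")
```

Because this is a whole-run average, it sits below the early `n_decoded` readings, which were taken while the context was still short.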

reddit.com
u/Boring_Office — 6 days ago

Do I need the 128GB vs 64GB?

It looks like prompt processing and text generation are slow but usable on Strix Halo, and I think I want to get one. But do I need the 128GB version? Do people regularly use models that need 128GB, or do they just get too slow as model size increases, so that 64GB is the practical choice? Or is 128GB amazing, and models that need >64GB still run okay?

Thank you for your thoughts. I'm trying to figure out whether I would be happy with 64GB or need to spend more.

reddit.com
u/Sporkers — 6 days ago

What llama.cpp / local LLM configs are people using on laptops like Ryzen AI Max+ 395?

I'm experimenting with local LLMs on a laptop and would love to compare configurations with people running similar hardware. I'm not new to this, but I'm not quite an expert either :).

My setup:

  • ASUS ROG Flow Z13
  • Ryzen AI Max+ 395
  • 128GB unified memory
  • Radeon 8060S iGPU
  • Windows
  • llama.cpp with Vulkan backend or lemonade

I'm not expecting desktop GPU performance, but I want to understand what is realistic and what people have found to work well in daily use.

Thanks

reddit.com
u/seti_at_home — 5 days ago

I'm thinking about selling my Strix Halo

As in the title.
I've found very little use for a single machine. It sits in a weird spot where there are no good models in that size. The largest model that is still usable would be about 70B, and the only good one I found at about that size is qwen-coder-next.
It's a shame that rn there is little support from AMD for ROCm or their software. I know that they are working on their own model quantization, but seeing how ROCm works I can't help but be sceptical.
The king rn is qwen3.6 27b, which is unusable as a daily driver. The prompt processing is killing me. The 35b variant can be run on much cheaper hardware with better performance.

On the other hand, if I had two of those, I could run MiniMax at a decent speed, which would otherwise cost way more in GPU VRAM.

I wish I still had my return period; now I have to look for a B2B buyer, as I bought it for my company.

reddit.com
u/PrzemChuck — 7 days ago
▲ 0 r/StrixHalo+1 crossposts

Anyone using Windows 11 for Local LLM - what do you do to "debloat" Windows...

Anyone using Windows 11 for Local LLM - what do you do to "debloat" Windows to get rid of unnecessary RAM hogging background services etc? to make it quicker and free up more resources for LLM?

Or do you instead use a version such as Windows 11 IOT LTSC

Or do you just not use Windows and use Linux instead?

Interested to hear everyone's thoughts on this

reddit.com
u/wingers999 — 6 days ago

Ongoing SOTA setups?

Hello everybody,

I finally got my Strix Halo (after a very long wait)! It's a Minisforum MS S1 MAX.

Feeling like a kid with the latest gadget, I started my real research into how to get the most out of my new baby, but I got a bit lost and I don't know what to do.

When I ask Claude/ChatGPT questions like "what are the best models to run for general use cases?" or "I'm trying to decide between DeepSeek V4 Flash, MiniMax 2.7, Qwen 3.6 models (either 27B or 35B-A3B), and Gemma 4 31B", all I get is mixed responses.

Sometimes the recommended OS is Ubuntu 26.04 because it has the latest drivers; sometimes it is Ubuntu 24.04 HWE because "the official AMD-built ROCm binaries target Ubuntu 24.04". Sometimes ROCm, sometimes "Mesa RADV (already in the kernel/userspace) — for llama.cpp Vulkan builds", sometimes both.

Model advice is also all over the place, mostly among:

DeepSeek V4 Flash at Q4

Qwen 3.6-27B

Gemma 4 31B

MiniMax M2.7

Honestly, I am not technical and knowledgeable enough (yet!) to figure out the best setups, but I think collectively we could create our own benchmarks for the best models that we can run.

I would also love to hear your opinions/preferences.

reddit.com
u/anonrftw — 7 days ago

Anyone tried StepFun 3.5-flash on Strix Halo?

I tried a Q4 quant of StepFun 3.5-flash, it started out using 107GB (with 150K context), but with each prompt the memory use grew until it hit 120GB and soon after was OOM. Has anyone run this model longer than about 20 minutes and if so what llama.cpp settings did you use?

reddit.com
u/cafedude — 6 days ago
▲ 119 r/StrixHalo+1 crossposts

Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP

Hey fellow Llamas, keeping it short.

We just shipped DFlash and PFlash support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). Same Luce DFlash stack from the RTX 3090 post a couple weeks back, now running on the consumer AMD APU class.

Repo: https://github.com/Luce-Org/lucebox-hub (MIT)

TL;DR

End-to-end on Qwen3.6-27B Q4_K_M with the Luce Q8_0 DFlash drafter: 26.85 tok/s decode and 20.2 s prefill at 16K context.

That is 2.23x faster decode and 3.05x faster prefill than llama.cpp HIP on the same silicon. At a 16K prompt + 1K generation workload, total wall clock drops from 147 s to 58 s, 2.5x faster end to end.

The same 128 GiB box hosts checkpoints up to ~100 GiB, a class of models a 24 GiB consumer GPU cannot touch (Qwen3.5-122B-A10B, MiniMax-M2.7-REAP 139B-A10B, full BF16 27B).

The numbers

Hardware: Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GiB LPDDR5X-8000, ROCm 7.2.2
Target: Qwen3.6-27B Q4_K_M (15.65 GiB)
Drafter: Lucebox/Qwen3.6-27B-DFlash-GGUF Q8_0 with DFLASH27B_DRAFT_SWA=2048
Bench: 10-prompt HumanEval-style, --n-gen 128 --ddtree-budget 22 --fast-rollback

Decode (Qwen3.6-27B Q4_K_M, tok/s):

Engine                  tok/s   vs AR
llama.cpp HIP AR        12.02   1.00x
llama.cpp Vulkan AR     12.45   1.04x
Luce DFlash (this PR)   26.85   2.23x

Prefill (Qwen3.6-27B, 16K tokens):

Engine              TTFT      vs AR
llama.cpp HIP AR    61.69 s   1.00x
Luce PFlash         20.2 s    3.05x

Speedup grows with context: PFlash compress is O(S), AR prefill is O(S^2). NIAH retrieval still passes at 16K.
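The end-to-end figure in the TL;DR can be reproduced from the two tables: wall clock is prefill time plus generated tokens divided by decode speed. A minimal sketch, assuming "1K generation" means 1024 tokens:

```python
# End-to-end wall clock for a 16K prompt + 1K generation, from the table values above.
# Assumption: "1K" generation is taken as 1024 tokens.
def wall_clock_s(prefill_s: float, decode_tok_s: float, gen_tokens: int) -> float:
    """Prefill time plus time to decode `gen_tokens` at a steady rate."""
    return prefill_s + gen_tokens / decode_tok_s


GEN_TOKENS = 1024

baseline_s = wall_clock_s(61.69, 12.02, GEN_TOKENS)   # llama.cpp HIP AR
luce_s = wall_clock_s(20.2, 26.85, GEN_TOKENS)        # PFlash prefill + DFlash decode

print(f"llama.cpp HIP: {baseline_s:.0f} s")
print(f"Luce:          {luce_s:.0f} s")
print(f"speedup:       {baseline_s / luce_s:.2f}x")
```

This treats decode speed as constant over the generation, which is a simplification since decode slows somewhat as context grows.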

Tuning note: --ddtree-budget=22 is the gfx1151 optimum. Higher budgets accept more tokens per step but each step gets more expensive on LPDDR5X. Bandwidth caps the benefit before tile utilization pays off. Contrast with gfx1100 (7900 XTX, GDDR6 936 GB/s) where budget=8 wins, tile waste matters more than launch amortization. Default ship is arch-aware.

Reproduce

# 1. Build PR #119 for gfx1151
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/119/head:pr119 && git checkout pr119
git submodule update --init --recursive
cd dflash
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DDFLASH27B_GPU_BACKEND=hip \
  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
  -DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -j

# 2. Models: Qwen3.6-27B target + Lucebox Q8_0 DFlash drafter
mkdir -p models/draft
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/

# 3. Bench (DFlash decode + PFlash long-context prefill)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
DFLASH_BIN=$PWD/build/test_dflash \
DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
DFLASH27B_DRAFT_SWA=2048 \
DFLASH27B_PREFILL_UBATCH=512 \
  python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 22

DFLASH27B_PREFILL_UBATCH=512 applies the PR #159 fix on top of PR #119. Once #159 merges, this is the daemon default.

What is still missing

  • BSA scoring kernel on HIP. The drafter compress-score path uses BSA (block-sparse attention) on CUDA. PR #119 disables it on HIP and falls back to ggml's flash_attn_ext, which the daemon's own warning flags as ~3.4x slower. A rocWMMA-native sparse-FA kernel closes the gap. After it lands, PFlash TTFT at 16K drops from 27.6 s to roughly 8 s. At 128K, projected 7-10x over llama.cpp AR.
  • Multi-row q4_K decode GEMV. RDNA-native multi-row pattern (R=4-8 output rows sharing activation register state) for the drafter forward, currently 30% of compress time at long context.
  • Phase 2 tile shape tuning for gfx1151. Current rocWMMA flashprefill tiles are tuned for gfx1100. Strix Halo has different LDS and VGPR characteristics.
  • 70B+ MoE targets. 128 GiB headroom is wasted on a 27B. Qwen3.5-122B-A10B and MiniMax-M2.7-REAP 139B-A10B both fit. DFlash math ports cleanly to MoE; big work is wiring the expert-routed forward into the spec verify loop.

Constraints

ROCm 7.2.2+, gfx1151 tuned (gfx1100 also supported with arch-aware defaults), greedy verify only, no Vulkan / Metal / multi-GPU on this path yet.

We're working hard on this but we know we need to improve on many things.

Feedback is more than welcome :)

u/sandropuppo — 10 days ago
▲ 27 r/StrixHalo+1 crossposts

Strix Halo plus R9700 eGPU, Fedora 44. Best of both worlds.

I recently connected an R9700 to my Strix Halo. On Fedora 44 it was very easy. The iGPU renders the OS to save VRAM on the R9700. I am using the llama.cpp toolbox for the iGPU and HIP_VISIBLE_DEVICES to target the right GPU. The R9700 feels lightning fast; speed does fluctuate, but Qwen3.6 35B Q4_K_M gets PP 2100 and TG 87.
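If you want to script the per-GPU targeting, one way is to set `HIP_VISIBLE_DEVICES` in the environment of each process you launch. This is a minimal sketch; the helper names are mine, and which index maps to the iGPU or the R9700 is an assumption you need to check on your own system.

```python
# Launch a process pinned to one GPU by setting HIP_VISIBLE_DEVICES for that process only.
# Assumption: device indices (e.g. 0 = iGPU, 1 = eGPU) must be verified locally.
import os
import subprocess
from typing import Mapping, Optional


def env_for_gpu(device_index: int, base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Copy of the environment with HIP_VISIBLE_DEVICES restricted to one device."""
    env = dict(os.environ if base_env is None else base_env)
    env["HIP_VISIBLE_DEVICES"] = str(device_index)
    return env


def launch_on_gpu(cmd: list, device_index: int) -> subprocess.Popen:
    """Start `cmd` so that its HIP runtime only sees the chosen device."""
    return subprocess.Popen(cmd, env=env_for_gpu(device_index))
```

Two servers started this way with different indices can run side by side, one per GPU, each on its own port.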

One possible use would be to have a big, slow 27B on the iGPU create plans and perform reviews, and have the fast R9700 execute the plans. You could assign different agents to separate GPUs and work concurrently without any slowdown. If you need someone to talk to, you can still load a chat model on the NPU to keep you busy while your agents work.

As far as I know there aren't many software options that take advantage of this setup, but I'll start with Open-Notebook and see what else I can find. Send me any ideas you have for software or workflows.

reddit.com
u/I-will-allow-it — 8 days ago

LocalLightChat - the new portable lightweight ChatUI for LLMs

I got tired of every local AI frontend being either not portable, extremely slow and bloated, or both. So I developed my own. It can handle even 500k+ tokens on a laptop from 2010!

LocalLightChat is a standalone chat interface for local LLMs and cloud APIs. Single binary, no installation, no dependencies. You download it, you run it, you're chatting. Works on Windows, Linux (x64/ARM64), and macOS.

What it actually does:

  • 500k+ token context – runs smooth even on old hardware
  • Full-text search across your entire chat history in under 100ms
  • Compress & Clone – squeeze 50k tokens down to 2k while keeping the stuff that matters
  • Documents & Artifacts – create and edit long-form content without drowning your chat
  • Web search built in (Serper/SearXNG/Brave/custom) with minimal token overhead
  • Image generation via API or ComfyUI auto-detection
  • Multi-modal input – PDFs, images, CSV, YAML, XML, logs, all processed client-side
  • Full LLM parameter control – temperature, sampling, DRY, Mirostat, everything
  • Multi-user system with role-based auth if you need it

There's also a Docker image and a self-hosted option if you want to run it on your own nginx/PHP stack.

Currently at v0.5. Happy to answer questions or take feedback.

reddit.com
u/PromptInjection_ — 7 days ago