u/Outrageous_Bug_669

Six months daily-driving the Corsair AI Workstation 300 for production LLM fine-tuning + inference — settings, real workloads, and the upstream PRs that landed this week

Box:   Corsair AI Workstation 300   (CS-9080002-NA), AXB35-02 v3.07 BIOS, 128 GB LPDDR5X @ 8000 MT/s. Bought ~6 months ago, replaced the shipped Windows 11 with Ubuntu 24.04, and have been daily-driving it for LLM fine-tuning and inference since. Sharing the actual settings I landed on and real workload numbers — most Strix Halo writeups I've seen are reviews-from-a-week and don't cover what you actually need to tune for sustained AI workloads.

  The settings that actually mattered  

 BIOS
-   UMA = 1 GB (the BIOS minimum).   Don't raise this. The GPU does NOT need pre-allocated VRAM on Strix Halo — `amdgpu` auto-sizes GTT to the full 128 GB at boot, dynamically. Setting BIOS UMA higher just wastes RAM. Multiple Strix Halo OEMs default to 8-16 GB UMA, which is wrong for AI.
- Front-panel power profile:   Balanced (~85W)   for most workloads;   Max (~120W)   for active training. Quiet (~55W) chokes the GPU clock.
- BIOS 3.07 is one ahead of Corsair's current public (3.06) — I'm running it because it ships from the factory on later production runs. Public 3.06 is fine for AI work.   Don't cross-flash GMKtec firmware   to a Corsair board — I tried v1.12 of theirs, hit a 2 GB UMA floor and an O-key bug, had to recover.

 Kernel boot params (Ubuntu 24.04, kernel 6.19.14 mainline)
```
quiet splash iommu=pt pcie_aspm.policy=performance amdgpu.runpm=0
ttm.pages_limit=33554432 ttm.page_pool_size=33554432
transparent_hugepage=always numa_balancing=disable
```
- The `ttm. ` and `transparent_hugepage=always` ones are the under-discussed tunings. They fight unified-memory page-allocator fragmentation during long-running training. Default Ubuntu settings will fragment the GTT pool and you'll see RSS climb without OOM-ing for hours; these stabilize it.
- `amdgpu.runpm=0` keeps the GPU out of runtime-PM stalls during back-to-back inference calls.

 sysctl (`/etc/sysctl.d/zz-aurora.conf`)
```
vm.compaction_proactiveness=20
vm.compact_unevictable_allowed=1
```
- Mandatory for unified-memory APUs per AMD's MI300A guidance. Without them, long inference sessions on big models trigger reclaim storms.

 ROCm install
-   ROCm 7.13 nightly   is the only build with native `gfx1151` kernels. Earlier ROCm (7.1) wheels have a `_grouped_mm` null kernel that segfaults under MoE training.
- Install command for PyTorch:
  ```
  pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch torchvision torchaudio
  ```
  Lands you on `torch==2.11.0+rocm7.13.0` directly — no manual wheel wrangling.

 GPU performance level
-   Set `auto`, not `high`.   I had `high` set for a while; it locks clocks at max and disables thermal throttling, which caused a 89 °C crash mid-training. `auto` lets the firmware manage clocks across the Quiet/Balanced/Max profiles and stays well under 84 °C even at 120W.

  Real workloads this box runs daily  

Workload Quant / Format Memory Speed Stable?
Qwen3.5-27B bf16 LoRA training r=64 α=128, max_seq 8192 ~80 GB peak ~12.5 min/step (FLA), ~16 min (PyTorch fallback) Yes — multi-day runs are routine
Qwen3.5-122B-A10B Q6 inference llama.cpp HIP ~95 GB ~5-6 tok/s tg Yes
Qwen3-Coder-Next 80B-A3B Q8 inference llama.cpp HIP ~79 GB similar to 122B Yes
MiniMax M2.5 230B-A10B Q3 inference llama.cpp HIP ~95 GB ~5 tok/s tg Yes
ComfyUI (Wan 2.2 I2V) bf16 ~30 GB peak varies by workflow Yes — but don't run with LLM training

  What's still rough  

-   FlashAttention 2 / xformers don't run on `gfx1151`.   SDPA via PyTorch HIP only for training. The single biggest perf gap vs equivalent NVIDIA setups.
-   AOTriton coverage is incomplete.   Most fast-path kernels in libraries like Unsloth fall through to generic Triton on AMD. Closing fast but not closed.
-   `mmap`-only model loading on gfx1151 triggers ~30 min of GPU page-table setup.   Use `--no-mmap` (or `--mmap --direct-io` together) for llama.cpp. Not Corsair's fault — it's how unified-memory address translation interacts with mmap — but it traps new users.
-   `causal_conv1d` from PyPI builds-from-source AND fails on Ubuntu 24.04 default.   ROCm clang-20 picks gcc-14's runtime dir, which doesn't ship `<cstdlib>` headers in 24.04's default apt set. Fix landed in Unsloth this morning (see below); for direct pip install, set `HIPCC_COMPILE_FLAGS_APPEND='--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13'` before installing.
-   bitsandbytes for AMD needs a from-source rebuild   with `-DROCM_VERSION=83 -DBNB_ROCM_ARCH=gfx1151`. Official wheels don't work.
-   NPU (XDNA 2) isn't usable yet.   Kernel driver mainlined in 6.18+, kernel sees the device, but the userspace XRT shim + AMD XDNA SDK aren't packaged for general use. AMD has it on the roadmap. For now: 50 TOPS sitting idle.

  Upstream contributions this week  

Two PRs merged to `unslothai/unsloth:main` this morning (May 18) that directly affect this hardware:
-   PR #5517 (mine, merged 08:05 UTC):   fixes the `--gcc-install-dir` issue I mentioned above — Studio's `causal_conv1d` auto-install path now works on stock Ubuntu 24.04 without the env-var workaround. First upstream PR I've had merged.
-   PR #5434 (Daniel Han-Chen, merged 10:49 UTC):   auto-installs `flash-linear-attention` + tilelang for the Qwen3.5 / 3.6 / Qwen3-Next model family in Unsloth Studio. I caught a hard tilelang-on-HIP regression mid-review; the merged code has a two-layer HIP gate that fixes it specifically for our hardware.

Still in queue:
-   PR #5301 (Leo Borcherding):   ROCm unified-memory detection + routing Linux Strix Halo installers to `repo.amd.com/rocm/whl/gfx1151/`. Mergeable, awaits maintainer approval. This is the most important still-open one — when it lands, fresh installs auto-route to the right wheel without manual `--index-url` gymnastics.
-   PR #5303 (Leo Borcherding):   per-GPU lemonade-sdk llama.cpp prebuilts (Studio can use the gfx1151 prebuilt instead of source-building). I benched it as performance-equivalent to my hand-built — `pp64` ~254 ± 28 tok/s vs ~270 ± 2, `tg16` 7.48 vs 7.52, all inside σ. CI green.

  Public artifacts  

Reproduction scripts, faulthandler logs, bench outputs, BIOS notes:

https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/

One subdirectory per PR with the actual probe scripts that produced each finding.

  Open questions for fellow Strix Halo owners  

- What's the BIOS UMA default on the GMKtec EVO-X2, HP ZBook Ultra, MinisForum, and Beelink GTR variants? I keep seeing 8-16 GB defaults online; the 1 GB minimum is critical for the GTT-auto-sizing path.
- Anyone got the NPU (XDNA 2) actually doing useful inference work yet via the kernel's `amdxdna` driver? I've heard the XRT shim works for some workloads but haven't seen a clean writeup.
- Anyone running training on Max profile (120W) for >24h continuously? Thermals on Balanced are stable indefinitely; I haven't done Max sustained.
- Anyone tried 2-channel LPDDR5X tuning at 8533 MT/s instead of the shipped 8000 MT/s via the AGESA timing tables, or is that locked at BIOS level?

Happy to share specific configs / scripts / step-time numbers in the comments

reddit.com
u/Outrageous_Bug_669 — 3 days ago
▲ 3 r/ROCm

fine-tuning 27B hybrid models on stri to ox halo (ryzen ai max+ 395 / gfx1151, 128 gb unified memory) — full guide, patches, orchestrator

reddit.com
u/Outrageous_Bug_669 — 9 days ago
▲ 26 r/AMDRyzen+3 crossposts

fine-tuning 27B hybrid models on strix halo (ryzen ai max+ 395 / gfx1151, 128 gb unified memory) — full guide, patches, orchestrator

Sharing a guide I just published for fine-tuning 27B+ LLMs on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S / gfx1151, 128 GB unified memory). MIT licensed.

Repo: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide

None of the individual pieces are novel — kernel patches, ROCm 7.13 nightly, FLA, bitsandbytes, LoRA, llama.cpp. The intersection (Strix Halo + gfx1151 + FLA + Qwen3.5 hybrid at 27B) isn't documented anywhere I could find, and getting it stable took a lot of dead ends I'd rather other people skip.

Stack tested: kernel 6.19.14, PyTorch 2.11.0+rocm7.13.0a20260506, ROCm 7.13 nightly, FLA 0.5.1 patched, bitsandbytes 0.50.0.dev0 built from source for gfx1151, llama.cpp b867+. Hardware: Corsair AI Workstation 300 (Sixunited AXB35-02 board, BIOS 3.07).

Things the guide actually covers that I had to figure out the hard way:

  • PyPI bitsandbytes ships zero ROCm binaries. From-source build with -DROCM_VERSION=83, plus a runtime symlink libbitsandbytes_rocm83.so → libbitsandbytes_rocm713.so so bnb's HIP detection on PyTorch 2.10/2.11 stops complaining.
  • FLA's Triton kernels crash on gfx1151 (RDNA 3.5) with num_warps &gt; 4 (Triton#5609) and a tl.cumsum + tl.sum codegen interaction (Triton#3017). Idempotent re-patch script included.
  • In-process Trainer eval at 27B / 8192 seq length is structurally broken on unified-memory APUs — either kernel TTM page allocation failure from fragmentation, or memory watchdog SIGKILL when free RAM drops under ~8 GB. Eval is moved out-of-process via a bash orchestrator aligned to save_steps, waiting for full GPU release between train and eval, with a JSONL trend log.
  • Mainline kernel .deb run-parts double-dir bug on Ubuntu 24.04+ leaves packages half-configured. Repack script included.
  • /srv perms regressing to 0750 mid-training breaks importlib.metadata path traversal and crashes TRL's create_model_card. Cron watchdog restoring 755.

Verified result: in-progress production fine-tune of Qwen3.5-27B (hybrid, 16 full-attention + 48 GatedDeltaNet layers), bf16 LoRA r=128/α=256, eval rolling at 0.13 loss / 96.5% token accuracy, ~11 min/step, ~4-day total runtime.

Feedback and issues welcome, especially from people on different AXB35-02 boards or non-Corsair Strix Halo systems — I'd like to know what's board-specific vs. generic.

https://preview.redd.it/8i3ebs27h00h1.jpg?width=649&format=pjpg&auto=webp&s=1a4fe453e9e46c97b71a14b993b9536288169ca1

reddit.com
u/Outrageous_Bug_669 — 3 days ago