Six months daily-driving the Corsair AI Workstation 300 for production LLM fine-tuning + inference — settings, real workloads, and the upstream PRs that landed this week
Box: Corsair AI Workstation 300 (CS-9080002-NA), AXB35-02 v3.07 BIOS, 128 GB LPDDR5X @ 8000 MT/s. Bought ~6 months ago, replaced the shipped Windows 11 with Ubuntu 24.04, and have been daily-driving it for LLM fine-tuning and inference since. Sharing the actual settings I landed on and real workload numbers — most Strix Halo writeups I've seen are reviews-from-a-week and don't cover what you actually need to tune for sustained AI workloads.
The settings that actually mattered
BIOS
- UMA = 1 GB (the BIOS minimum). Don't raise this. The GPU does NOT need pre-allocated VRAM on Strix Halo — `amdgpu` auto-sizes GTT to the full 128 GB at boot, dynamically. Setting BIOS UMA higher just wastes RAM. Multiple Strix Halo OEMs default to 8-16 GB UMA, which is wrong for AI.
- Front-panel power profile: Balanced (~85W) for most workloads; Max (~120W) for active training. Quiet (~55W) chokes the GPU clock.
- BIOS 3.07 is one ahead of Corsair's current public (3.06) — I'm running it because it ships from the factory on later production runs. Public 3.06 is fine for AI work. Don't cross-flash GMKtec firmware to a Corsair board — I tried v1.12 of theirs, hit a 2 GB UMA floor and an O-key bug, had to recover.
Kernel boot params (Ubuntu 24.04, kernel 6.19.14 mainline)
```
quiet splash iommu=pt pcie_aspm.policy=performance amdgpu.runpm=0
ttm.pages_limit=33554432 ttm.page_pool_size=33554432
transparent_hugepage=always numa_balancing=disable
```
- The `ttm. ` and `transparent_hugepage=always` ones are the under-discussed tunings. They fight unified-memory page-allocator fragmentation during long-running training. Default Ubuntu settings will fragment the GTT pool and you'll see RSS climb without OOM-ing for hours; these stabilize it.
- `amdgpu.runpm=0` keeps the GPU out of runtime-PM stalls during back-to-back inference calls.
sysctl (`/etc/sysctl.d/zz-aurora.conf`)
```
vm.compaction_proactiveness=20
vm.compact_unevictable_allowed=1
```
- Mandatory for unified-memory APUs per AMD's MI300A guidance. Without them, long inference sessions on big models trigger reclaim storms.
ROCm install
- ROCm 7.13 nightly is the only build with native `gfx1151` kernels. Earlier ROCm (7.1) wheels have a `_grouped_mm` null kernel that segfaults under MoE training.
- Install command for PyTorch:
```
pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch torchvision torchaudio
```
Lands you on `torch==2.11.0+rocm7.13.0` directly — no manual wheel wrangling.
GPU performance level
- Set `auto`, not `high`. I had `high` set for a while; it locks clocks at max and disables thermal throttling, which caused a 89 °C crash mid-training. `auto` lets the firmware manage clocks across the Quiet/Balanced/Max profiles and stays well under 84 °C even at 120W.
Real workloads this box runs daily
| Workload | Quant / Format | Memory | Speed | Stable? |
|---|---|---|---|---|
| Qwen3.5-27B bf16 LoRA training | r=64 α=128, max_seq 8192 | ~80 GB peak | ~12.5 min/step (FLA), ~16 min (PyTorch fallback) | Yes — multi-day runs are routine |
| Qwen3.5-122B-A10B Q6 inference | llama.cpp HIP | ~95 GB | ~5-6 tok/s tg | Yes |
| Qwen3-Coder-Next 80B-A3B Q8 inference | llama.cpp HIP | ~79 GB | similar to 122B | Yes |
| MiniMax M2.5 230B-A10B Q3 inference | llama.cpp HIP | ~95 GB | ~5 tok/s tg | Yes |
| ComfyUI (Wan 2.2 I2V) | bf16 | ~30 GB peak | varies by workflow | Yes — but don't run with LLM training |
What's still rough
- FlashAttention 2 / xformers don't run on `gfx1151`. SDPA via PyTorch HIP only for training. The single biggest perf gap vs equivalent NVIDIA setups.
- AOTriton coverage is incomplete. Most fast-path kernels in libraries like Unsloth fall through to generic Triton on AMD. Closing fast but not closed.
- `mmap`-only model loading on gfx1151 triggers ~30 min of GPU page-table setup. Use `--no-mmap` (or `--mmap --direct-io` together) for llama.cpp. Not Corsair's fault — it's how unified-memory address translation interacts with mmap — but it traps new users.
- `causal_conv1d` from PyPI builds-from-source AND fails on Ubuntu 24.04 default. ROCm clang-20 picks gcc-14's runtime dir, which doesn't ship `<cstdlib>` headers in 24.04's default apt set. Fix landed in Unsloth this morning (see below); for direct pip install, set `HIPCC_COMPILE_FLAGS_APPEND='--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13'` before installing.
- bitsandbytes for AMD needs a from-source rebuild with `-DROCM_VERSION=83 -DBNB_ROCM_ARCH=gfx1151`. Official wheels don't work.
- NPU (XDNA 2) isn't usable yet. Kernel driver mainlined in 6.18+, kernel sees the device, but the userspace XRT shim + AMD XDNA SDK aren't packaged for general use. AMD has it on the roadmap. For now: 50 TOPS sitting idle.
Upstream contributions this week
Two PRs merged to `unslothai/unsloth:main` this morning (May 18) that directly affect this hardware:
- PR #5517 (mine, merged 08:05 UTC): fixes the `--gcc-install-dir` issue I mentioned above — Studio's `causal_conv1d` auto-install path now works on stock Ubuntu 24.04 without the env-var workaround. First upstream PR I've had merged.
- PR #5434 (Daniel Han-Chen, merged 10:49 UTC): auto-installs `flash-linear-attention` + tilelang for the Qwen3.5 / 3.6 / Qwen3-Next model family in Unsloth Studio. I caught a hard tilelang-on-HIP regression mid-review; the merged code has a two-layer HIP gate that fixes it specifically for our hardware.
Still in queue:
- PR #5301 (Leo Borcherding): ROCm unified-memory detection + routing Linux Strix Halo installers to `repo.amd.com/rocm/whl/gfx1151/`. Mergeable, awaits maintainer approval. This is the most important still-open one — when it lands, fresh installs auto-route to the right wheel without manual `--index-url` gymnastics.
- PR #5303 (Leo Borcherding): per-GPU lemonade-sdk llama.cpp prebuilts (Studio can use the gfx1151 prebuilt instead of source-building). I benched it as performance-equivalent to my hand-built — `pp64` ~254 ± 28 tok/s vs ~270 ± 2, `tg16` 7.48 vs 7.52, all inside σ. CI green.
Public artifacts
Reproduction scripts, faulthandler logs, bench outputs, BIOS notes:
https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/
One subdirectory per PR with the actual probe scripts that produced each finding.
Open questions for fellow Strix Halo owners
- What's the BIOS UMA default on the GMKtec EVO-X2, HP ZBook Ultra, MinisForum, and Beelink GTR variants? I keep seeing 8-16 GB defaults online; the 1 GB minimum is critical for the GTT-auto-sizing path.
- Anyone got the NPU (XDNA 2) actually doing useful inference work yet via the kernel's `amdxdna` driver? I've heard the XRT shim works for some workloads but haven't seen a clean writeup.
- Anyone running training on Max profile (120W) for >24h continuously? Thermals on Balanced are stable indefinitely; I haven't done Max sustained.
- Anyone tried 2-channel LPDDR5X tuning at 8533 MT/s instead of the shipped 8000 MT/s via the AGESA timing tables, or is that locked at BIOS level?
Happy to share specific configs / scripts / step-time numbers in the comments