r/StrixHalo

▲ 44 r/StrixHalo+2 crossposts

Who've told you that distributed training is impossible? Democratizing AI: The Psyche Network Architecture

It seems that not only it is totally possible without incurring in unfeasible excessively narrow train data transfer bottlenecks but that several models have already been trained using this method. It mostly depends on how many GPUs join such kind of network.

See here: https://psyche.network/runs

nousresearch.com

u/DevelopmentBorn3978 — 22 hours ago

▲ 31 r/StrixHalo+1 crossposts

Strix.Monitor : web-based resource monitor for Linux systems

If you use your Strix Halo -- or any Linux box, for that matter -- as a headless AI server and want to monitor if your LLMs have gone into a doom loop or just want to avoid OOM errors, I built a tool to help you stay on top of things:

https://github.com/levanillawafer/strix-monitor

It was born out of necessity, but also as a test of how far I could push the local LLMs I was running on the that same box. Spoiler alert, not that far, it turns out. So, Claude to the rescue.

Why not Windows? There's already plenty of decent monitoring apps. Why not htop or nvtop? I use them, but it's nice being able to check on things from my phone. Why release it? Why not. Strix Halo is already a niche of a niche, but if you occupy the same section of the Venn diagram as me, this might help you out. Enjoy.

u/fantastic_mr_wolf — 21 hours ago

▲ 43 r/StrixHalo+2 crossposts

Pub-Beta: Hal0 - Local Homelab LLM+ Inference Powerhouse for StrixHalo / Proxmox / More

Hey r/StrixHalo — I built hal0.dev with the goal of optimizing for exactly this hardware and extracting the best possible performance, functionality, and value from it.

We're finally ready and opening public beta this weekend. Would love to have you kick the tires — I've had limited testers so far and we're ready for more.

The idea. A Strix Halo box is a genuinely special piece of kit — Radeon iGPU, XDNA NPU, and one big unified-memory pool — and hal0's goal is to extract the most performance, value, and functionality possible from it.
Chat, embeddings, rerank, transcription, live speech, image gen — answers on one local /v1/ API.

This is my first real shot at something this ambitious, so the philosophy is deliberately narrow: high impact features, reliable, proven tools, wired up automatically, and integrated deeply across the platform.

One-line install builds and wires up — automatically

Models across llama.cpp (Vulkan/ROCm FPX / MTP) and the XDNA NPU via FastFlowLM — running co-resident, highly tuned - chat, embed, rerank, vision, STT, TTS, and image gen via ComfyUI
Hermes agent provisioned with auto model/slot detection and custom Hindsight memory integration with MCP access for outside agents/tools - no manual config
Operator Board — a multi agent capable Hermes-backed Kanban that tracks tasks across profiles, lanes, and projects, with gated actions pausing for your sign-off and live agent chat beside it to help you orchestrate.
Open WebUI for chat, RAG, and more, alongside the dashboard - models & slots appear automatically.
Custom Hindsight memory + knowledge graph (NPU Extraction by default) wired to Hermes out of the box and exposed via MCP for Claude, Pi etc.
MCP server exposing hal0 admin surfaces to agents — keeps agents in the know about the entire lab structure and lets them tweak it on your command.

Slots: every model runs in a "slot" — one model, one container, with a typed lifecycle and a GPU arbiter that assigns unified-memory to either always-on concurrent LLMs or image gen, one group at a time — so GPU workloads never fight over the pool, yet multiple LLMs stay concurrent and always ready.

Agents & memory: striving for the deepest, most seamless Hermes integration possible — kanban, delegation, and hal0 administration, all out of the box. Memory is a constantly improving shared brain: a fully built-out Hindsight custom-provider system with a primary private bank (seeded per child profile) plus a shared bank with MCP access, so agents like claude-code, pi, and opencode can learn from and teach your agent as the homelab evolves.

Developed on a Ryzen AI Max+ 395 / 128 GB. I run mine in a Proxmox LXC for the exceptional quality-of-life wins — resource sharing/allocation without being captured, plus the reliability. Bare-metal Ubuntu and WSL2 (WIP) paths are in the docs too. It's hardware-agnostic in principle but tuned for Strix Halo first, particularly on Proxmox — NVIDIA/CUDA is being worked toward as a supported runtime device, but don't count on it working just yet.

Open-source, Apache-2.0. Come kick it around and tell me what falls off 🙂

⭐ https://github.com/Hal0ai/hal0 - Give Us A Star!

🗪 https://discord.gg/n2ftGqYr8 - Join Us In The Discord!

💫 https://hal0.dev - Promo, Info, Docs & More

u/horratiocornbl0wer — 1 day ago

▲ 13 r/StrixHalo

How are you guys running your models?

I am very curious because right now I am using cachy os and lm studio on mine, its not the best idea but it works for now. with some research I found that amd has made Lemonade and I was tempted to try it but it lacks some stuff that I would like in terms of customization of how the model runs. I also found hal0 and this one seems really interesting, but I would like to see more from it before committing to something long term. has anyone tried any of these? what other options are you guys using? what models as well :) I am new to this platform and want to take the most out of it, specially for coding.

reddit.com

u/Doge_Plays — 2 days ago

▲ 13 r/StrixHalo

Local Setup: Strix Halo 128gb

Does anybody have this agent set up locally on a 128gb unified strix halo 395+ AI max pc?

What models are best to give hermes for this kind of local setup with 96gb vram? I got my pc coming Monday and need some community tips. Ive been considering hermes 4.3 36b and qwen 3.6 27b. I am looking for all-around best model to serve as the daily driver for hermes.

I am also going to connect a 4090 via usb4 to run smaller models totally offloaded (or stable diffusion models completely offloaded). Not sure if this will be tricky with an AMD setup. Any tips are welcome. Thanks in advance.

reddit.com

u/ReipuSarada — 2 days ago

▲ 23 r/StrixHalo+2 crossposts

PSA - New Corsair AI Workstation 300 Arrived Infected!!

PSA - Corsair AI Workstation 300 (Revival/Refurbished Series) came with active malware (Virus:Win64/Expiro.DD!MTB) straight out of the box

Posting this as a warning for anyone looking at Corsair’s Revival line, specifically the AI Workstation 300.

I bought a Corsair AI Workstation 300, Revival Series, for $2,719.99 (sad to say but a great price in the current environment) Specs are Ryzen AI Max+ 395, 128GB RAM, and dual 2TB SSDs. It also comes with a 90-day warranty instead of standard coverage, which is worth knowing going in.

Full disclosure, I work in cybersecurity and offensive cyberspace operations professionally. That said, my home setup was not exactly a model of best practices. My home network was flat, with no VLANs and no real segmentation. That part is on me. I am not pretending I had a perfect lab environment here.

The unit arrived with the stock Windows 11 install. The first odd thing I noticed was that it would not complete the Windows 11 setup over Wi-Fi during the Out-of-Box Experience. It kept failing. I eventually got impatient and plugged it directly into my router with ethernet so I could finish setup.

Once I got to the desktop for the first time, before I installed any third-party software, Windows Security started throwing active detections. A lot of them. They were all flagged as Severe and Active, and they were all the same threat:

Virus/Expiro.DD!MTB

Defender was not just cleaning them up and moving on. It kept showing action needed, and the remediation loop did not appear to be resolving the issue.

Around the same general time, I got a Microsoft account sign-in notification from an unfamiliar device showing a location in Harare, Zimbabwe. I cannot prove from my side exactly how that happened, but the timing was close enough that I treated it as related until proven otherwise.

I immediately moved to a separate machine, started account recovery, regained access, changed credentials, and killed active sessions. After that, I started treating anything that had been on the same flat network as potentially exposed.

I also had trouble accessing my router admin panel during the same window. I am not going to claim with certainty that the router was compromised, because I do not have enough evidence to prove that. But given everything else going on, I treated it as unsafe, pulled it offline, reset it, flashed clean firmware, changed admin credentials, changed the SSID, and hardened the wireless settings.

The part I am still most concerned about is the NAS. I had a NAS on the same LAN with years of data on it. I physically disconnected it and started reviewing logs and scanning it from a clean system. The logs showed an authenticated session from the Corsair’s MAC address during the same general window. I do not know yet whether anything malicious was written to the NAS, but that was enough for me to take it seriously.

For anyone unfamiliar with Expiro, it is a file-infecting virus family that targets Windows executables. Rather than just dropping one obvious malicious file, it can infect legitimate .exe files. That makes cleanup more annoying because you may not be dealing with a single file to delete. You may be dealing with a system where many executables have been touched.

That also means mapped drives and shared folders matter. If infected executables were written to a shared location, the next system may not be infected immediately. The risk is that someone later runs one of those infected files.

My working theory, and I want to be clear that it is only a theory, is that this may be related to the refurbished Revival process. This could be a bad image, a failed drive wipe, or some other issue in the refurb pipeline. I am not claiming this affects new Corsair systems, and I am not claiming I can prove exactly where the infection came from yet. What I can say is that this machine started throwing severe Defender detections immediately after first boot, before I had installed anything myself.

Current status:

The Corsair is physically disconnected and isolated. I am doing a full secure erase of the SSDs, deleting all partitions, and reinstalling Windows from trusted installation media before it touches a network again.

The NAS is disconnected from the network while I scan the contents from a clean device and look specifically for infected executables or anything written during that access window.

I am rotating credentials that may have been exposed and moving more accounts toward hardware security keys where possible.

I opened an official support case with Corsair that includes the SKU and screenshots of the Defender detections. I have not received a resolution yet.

I am posting this before Corsair finishes responding because people buying refurbished systems should be careful about plugging them directly into a flat home network. At minimum, I would recommend isolating Revival/refurbished systems during first boot, wiping and reinstalling from trusted media, and not trusting the factory image until you verify it yourself.

Screenshots of the initial Defender logs that I first saw flying down the screen attached, for no real reason but help recreate the feeling of panic I felt when I first saw them, hahaahahaha.

Has anyone else bought from the Corsair Revival line recently and seen anything like this?

https://preview.redd.it/8xikn3awn0bh1.jpg?width=3024&format=pjpg&auto=webp&s=8672c9271ed83e9c2fd6d809c1f88383fea77e72

reddit.com

u/ChemicalMemory — 3 days ago

▲ 24 r/StrixHalo+1 crossposts

An argument for 40–50W on the ROG Flow Z13 (Ryzen AI Max+ 395)

I’ve been messing around with power limits on my Z13 for the past couple of weeks, and honestly I think people are way too obsessed with running this thing at 60W+.

Sure, you can squeeze out a few more FPS, but after a certain point you’re just trading a ton of extra heat and fan noise for a pretty small performance gain.

I’ve settled on this setup and don’t really see myself changing it:

G-Helper Settings

Windows Power Plan
Balanced

CPU Boost
Disabled

Power Limits
SPL: 40W
sPPT: 45W
fPPT: 50W

Advanced
CPU Temp Limit: 80°C
CPU Undervolt: -15
GPU Undervolt: -15

Fan Curve (CPU & GPU)
≤55°C: 0 RPM
60°C: 2,200 RPM
70°C: 3,800 RPM
80°C: 4,800 RPM
85°C: 5,600 RPM

Benchmarks

Cyberpunk 2077
1900×1200 | Highest | RT Off
Native: 56 FPS
Native + FSR 3.1 Frame Gen: 91 FPS
FSR 4 Quality: 62.8 FPS
Temps sit around 64–67°C.

007 First Light
1900×1200 | Ultra | FSR Quality: 65 FPS
Temps are 68–70°C.

Resident Evil 9
1900×1200 | Max | RT High
Native: 55 FPS
FSR 4.1.1 Quality: 65 FPS
Native + Frame Gen: 92 FPS
Temps stay around 65–68°C.

For me, this is exactly where the Z13 shines.
It’s cool enough that I never think about temperatures.

The fans are noticeable, but they’re nowhere near jet engine territory.

The tablet itself never gets uncomfortably hot.

And I’m still getting 60–90 FPS in modern AAA games depending on whether I use FSR or Frame Generation.

I know there are people running 60–70W profiles, but I’m genuinely curious, are you actually noticing a meaningful difference while playing?
Looking at frame counters is one thing, but in actual gameplay, I don’t think I’d notice an extra 5–10 FPS nearly as much as I’d notice the extra heat and fan noise.

Beyond 50W, it feels like you’re hitting diminishing returns pretty quickly.

Anyone else end up around the same power limits?

reddit.com

u/NootropicNinja — 3 days ago

▲ 10 r/StrixHalo

Cachy-Router an OpenAI-compatible router for llama.cpp with shared KV-cache

Disclaimer: all code was written by Codex, I am not a coder so please don't be too harsh. This post is written 100% by me. I don't know how it works and this is the first thing I have ever put on GitHub. It isn't perfect but it actually works for me. The performance of how much less time work was being done got me so excited I wanted to share it.

https://github.com/JCFrags/Cachy-Router

The problem I wanted solved was making effective use of multiple Strix Halo PCs, running llama.cpp on both meant entering 2 APIs into an agent and effective coordination of the machines for sub-agent use was a nightmare. It also meant that agent id is set to a specific PC creating idle time waiting for the right one to become available or forcing long prompt processing.

The inspiration came from seeing CachyLLama ( https://github.com/fewtarius/CachyLLama#MIT-1-ov-file ) saving KV-cache to disk and I had a thought that files on disk can be shared between computers.

CachyLLama is a fork of llama.cpp, this tries to imitate the caching ideas while being an independent router layer so you can bring any version of llama.cpp that you want.

Strix Halo is great but if you have a large system prompt it can easily take 1-2 minutes just to say "Hi", now it is processed once and future messages begin generating almost immediately. That change alone felt magical. Now with this if you are using sub-agents it doesn't matter which one finishes first because the KV-cache is shared between them.

Cachy-Router is designed to reduce idle time on a multi PC setup when work is queued and focus on token generation instead of re-processing the same things over and over.

I had Codex SSH into the Strix Halo PCs to optimize and set everything up. Now I am running Step-3.7-Flash Q4_K_S with MTP at sustained 32 t/s and now with Cachy-Router it actually feels usable and competitive with mid-range cloud options.

reddit.com

u/random_user-1234 — 3 days ago

▲ 10 r/StrixHalo

Buy now or wait on a local LLM box during the memory crunch? How I'd read it after running a Strix Halo daily for 6 months

I get asked some version of "should I buy a local AI box now or wait for the next one" a lot, and my answer changed this year, so I wrote up how I currently think about it. The short version here, full writeup linked at the bottom.

As I'm sure you can all relate, the usual instinct is to wait, because hardware normally gets cheaper and better. Right now that's kind of backwards for this class of machine, for two reasons.

One, the memory is soldered (at least for the non-DIY boxes, I am not talking about multi-gpu self-built setups here) and it's getting more expensive, not cheaper. LPDDR5X jumped 89% in a single quarter and the analysts I've read don't expect real relief before late 2027. So for these, the capacity you pick is permanent, and you're picking it in a bad market. Concretely: the same box I paid under $2,200 for is about $2,800 now.

Two, for token generation these boxes are all bandwidth-bound, and the announced successors mostly add memory at the same bandwidth. A 192GB or 768GB successor lets you load a bigger model but runs it at more or less the same speed. So waiting for "the bigger one" only helps if you're actually running out of capacity, not if you want more speed.

One caveat so I'm not oversimplifying: prompt processing (time to first token) is a separate, compute-bound story, and there the picture flips. The Spark is much faster at it, Apple and Strix Halo are weaker. The M5 Mac is interesting because it is reported to target exactly that, and it's the one case where I'd actually think about waiting, with the honest caveat that Apple just raised prices and the release date is still an expectation, not a certainty.

I'm running a Strix Halo box myself, so I'm not neutral, but I tried to be fair to the Spark and the Mac in the full post. So I'm curious what you'd do: if someone came to you today and asked whether to buy now or wait, what would you tell them? Especially keen to hear from people actually running a Spark or a Mac Studio, and how prompt processing feels with daily use.

https://thefrontierlab.ai/buy-or-wait-local-llm-hardware/

u/uncanny_instinct — 4 days ago

▲ 0 r/StrixHalo

GPT-OSS-120B (MXFP4) resident on a Strix Halo iGPU, +2 more models in 96GB UMA — real tok/s, Vulkan not ROCm, and the gotchas

Running a 3-model stack fully resident on an **AMD Ryzen AI MAX+ 395 (Strix Halo)** integrated GPU — no discrete card, no cloud. Sharing the setup and honest numbers because most "local LLM on an APU" posts skip the part where the vendor stack doesn't work.

**The box:** GMKtec EVO-X2, Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GB unified RAM, Ubuntu 24.04 / 6.17 HWE kernel. BIOS UMA split set to **96 GiB VRAM / 32 GiB system**.

**Why Vulkan, not ROCm:** ROCm/HIP is broken on gfx1151 — sentence-transformers hangs on HIP init, the compute stack won't come up. So the whole thing runs on **llama.cpp's Vulkan backend (RADV)**. If you're on Strix Halo, save yourself the week: go Vulkan.

**What's resident (~78 GiB of the 96):**

* **lm1 — GPT-OSS-120B, MXFP4** (~60 GiB) — the reasoner; ~5B active params/token (MoE)

* **lm2 — Llama3.2-24B-A3B (i1-Q4_K_M)** (~11 GiB) — creative/uncensored slot

* **lm3 — Qwen2.5-Coder-7B (Q4_K_M)** (~5 GiB) — FIM/autocomplete

* plus MiniLM + nomic embeddings on their own llama-server

**Numbers (GPT-OSS-120B, MXFP4, Vulkan, direct completion):** ~**81 t/s prompt**, ~**38 t/s generation**. First request after load is slow (cold buffers) then it settles. For reference, the previous lm1 (Qwen3.6-35B-A3B) did ~823 PP / 46 TG — the 120B trades throughput for depth. 38 tok/s on a 120B on an *integrated* GPU is the headline for me.

**llama-server flags that matter here:** `-ngl -1`, `--flash-attn on`, `--cache-type-k q8_0 --cache-type-v q8_0` (KV quant is what makes 64k ctx fit alongside everything), `--batch-size 2048 --ubatch-size 512`, `--no-mmap`. Note: **ubatch is not a lever for A3B MoE** — llama.cpp's default 512 is already the peak, bigger just wastes VRAM.

**Gotchas nobody tells you:**

**gpt-oss reasons on *every* call** via the harmony analysis channel, and `--reasoning-budget 0` / `<think>` stripping does **not** suppress it — that's a Qwen-ism that doesn't apply. A trivial query still eats ~18s of reasoning. What *does* work: pass `chat_template_kwargs: {"reasoning_effort": "low"}` per request — measured ~5× faster with equivalent output on mechanical tasks. Budget your client timeouts for reasoning+generation, not generation alone (learned that the hard way when a batch worker's 240s timeout started failing on big prompts).
**On a 96/32 UMA split, the *system* side becomes the tight budget**, not VRAM. I'm at ~5.6 GiB free of 32 with all three models up. Watch RAM, not just VRAM.
`--parallel 1` means callers queue serially — fine for batch, plan for it.

**What it actually drives:** a private OpenAI-compatible gateway that fronts all three as watched `llama-server` children (VRAM-largest-first, watchdog restarts). It powers a news-corroboration engine, a VSCode agent extension, and an always-on story generator — all pointing at `:5000`, nothing leaving the box.

Happy to answer anything — flags, the Vulkan path, MoE-on-bandwidth-bound-APU tradeoffs, the multi-model VRAM budgeting.

reddit.com

u/arhalfax — 4 days ago

▲ 15 r/StrixHalo

Ryzen AI Max (Strix Halo) Mobile Devices, as of the end of June 2026.

Ryzen AI Max (Strix Halo) Mobile Devices, as of the end of June 2026

Mobile devices contain a battery for use away from a desk power outlet, and come in several forms including clamshell laptops, convertibile notebook with 360° hinge, tablets with detachable keyboard, and handheld device with game controllers. Ryzen AI Max mobile devices appear to include the following:

Company	Brand	Model	Date	Strix Halo(s)	RAM (GB)	Scrn Diag.	Scrn Res.	Screen Feat.	Form	Key Layout	m.2 size	Batt (Wh)	PS (W)	Wt (kg)
Asus	RoG	Flow Z13 ^(GZ302)	2025	+395 or 390	128 or 64 or 32	13.4	1600p	IPS touch 180Hz	detachable	60%, ½←↕→	2230 × 1	70.0	200	1.20
HP	ZBook	Ultra G1a	2025	Pro +395 or 385	128 or 64 or 32	14.0	1800p or 1200p	OLED touch 120Hz or IPS matte	clamshell	60%, ½⇇↕⇉	2280 × 1	74.5	140	1.60
GPD		Win 5	2025	+395 or 385	128 or 64 or 32	7.0	1080p	LTPS touch 120Hz	handheld	none	2280 × 1	80.0	180	0.94

OneXPlayer		Super X	2026	+395 or 385	128 or 64 or 48 or 32	14.0	1800p	OLED pen 120Hz	detachable	60%, ½⇐↕⇒	2280 × 1	85.6		1.30
Asus	TUF	A14 ^(FA401EA)	2026	+392	64 or 32	14.0	1600p	IPS 165Hz	clamshell	60%, ½←↕→	2280 × 2	73.0	200	1.48
Asus	ProArt	PX13 ^(HN7306EAC) ^(HN7306EA)	2026	+395	128 or 64	13.3	1800p	OLED touch 60Hz	clamshell	60%, ½←↕→	2230 × 1	73.0	200	1.39
Lenovo	Yoga	Pro 7a ^(15ASH11)	2026	+388	128 or 64 or 32	15.3	1600p	OLED 165Hz	convertible	60%, ½←↕→	2242, 2280	84.0	140	1.69
Lenovo	Legion	7a Gen 11 ^(15ASH11)	2026	+392 or +388	64 or 32	15.3	1600p	OLED 165Hz	clamshell	60%, ful⇇↕⇉	2242, 2280	84.0	180	1.55
Nimo		Axis	2026	+395	128	16.0	1600p	165Hz	clamshell	FulNum, ful←↕→	2280 × 2	99.0	230	2.5

u/gc9r — 5 days ago

▲ 9 r/StrixHalo+1 crossposts

I made a local model that gives you a multimedia role playing experience (and you can too!)

[deleted]

u/[deleted] — 4 days ago

▲ 5 r/StrixHalo

Thoughts on this M.2 NVME adaptor?

One of the original reasons behind my purchase of the MS-S1 was that it could take a PCIe card. And you can get a PCIe card that can add NMVE slots. Brilliant.

Grabbed one from AliExpress. Didn't measure the available space and ended up with a dead weight as it was full size and didn't fit.

I thought OK, I'll go for a half height one. It's only 2 slots and it fits inside. Didn't realise that PCIe 4.0 x4 means only one lane and therefore only 1 drive would be available because the motherboard doesn't support bifurcation. Another dead weight!

My current solution is using x2 WD Blue SN5000, each inside a USB adaptor and plugged into the front 2 USB ports (couldn't get them to work in the rear ports)

It works but seems risky.

Anyway, ramble over. I've just seen this one on Amazon that works on PCIe 4.0 x4 and exposes all 4 drives (2 on top, 2 underneath) due to its onboard PLX 8747 controller. Sounds ideal, but this time I'm not rushing in. I know the overall speed would be split between the drives but still seems like it would be faster than my current setup?

Anyone have any thoughts? Good? Bad? Indifferent?

It's £140 which is cheaper than any others I've seen

https://www.amazon.co.uk/gp/product/B0F316W6PJ

Upgrade your system's storage capabilities with this high-performance PCIe 3.0 x8 SSD to Dual M.2 NVMe adapter card. With the reliable PLX 8747 chipset, this adapter lets you add four M.2 NVMe SSDs to your system through a single PCIe slot. The card supports several M.2 form factors, including 2242, 2260 and 2280mm, offering excellent flexibility for your storage needs. Built-in cooling fan ensures optimal temperature management for sustained performance. This adapter card features a PCIe 3.0 x8 host truck interface for peak performance and comes with a standard bracket for secure installation. Perfect for enthusiasts and professionals looking to extend the storage capacity of their system while maintaining access to high-speed data. Dual M.2 connectors allow for multiple drive configurations, making it an ideal solution for both storage expansion and performance improvement in compatible systems.

reddit.com

u/ZeroThaHero — 6 days ago

▲ 20 r/StrixHalo

Toolbox Runner — a web UI alternative to the Strix Halo toolbox TUI

If you're running kyuz0's Strix Halo toolboxes (llama.cpp / vLLM / ComfyUI), you know the TUI. I built a browser UI for the same containers — screenshots in the gallery.

It started as me just typing commands into terminal sessions on my box. Then I baked the repetitive ones into shell scripts. Then the scripts grew into a small Flask app — and it kind of snowballed into a full web UI. The whole thing was pair-programmed with Claude.(I´m bad on the UI part)

What it does:

Launch any toolbox command from the browser, output streams live
Sessions run on tmux — close the tab, reload, or reboot and your llama-server keeps running; reconnect and the output is still there
Model manager: browse / download / update / delete HF models, flags ones not wired to any launch
Toolbox updates: checks each source repo against GitHub and each container image against its registry digest
Point-and-click editors for commands and llama.cpp presets — no hand-editing YAML/INI

It's a single-user tool I made for my own machine, not a polished product. Before I sink more time into it, I wanted to ask the people actually running this hardware: would you use something like this, or is the TUI already enough for you?

Happy to answer anything about the setup — and glad to share the full README / code if there's interest.

https://preview.redd.it/zabk3enl2hah1.png?width=3200&format=png&auto=webp&s=4756aaab462f9b1d692c7d6e8937e39715ec03cb

https://preview.redd.it/g7eddq8n2hah1.png?width=3200&format=png&auto=webp&s=b8760a2fad00d8ceaac6676859894288f73f33e6

https://preview.redd.it/18qf9xvo2hah1.png?width=3200&format=png&auto=webp&s=792bcf623bf5f42e57327b0533b7789b676a6991

https://preview.redd.it/8mzteigq2hah1.png?width=3200&format=png&auto=webp&s=c5656f1cfe55c3802de5a66bf4873002d7b4d30a

https://preview.redd.it/hg3pvb7r2hah1.png?width=3200&format=png&auto=webp&s=8c96e7df44ed7987ef6a376919cdb2f73b5a2dee

https://preview.redd.it/9fybjujs2hah1.png?width=3200&format=png&auto=webp&s=1a0964bc3a9b00f869aa13df9c3d5bb65b1c0327

https://preview.redd.it/ceodshdt2hah1.png?width=3200&format=png&auto=webp&s=0c4920849fe736e5f86146185e4859315116dfb6

https://preview.redd.it/vx4v7wqu2hah1.png?width=3200&format=png&auto=webp&s=59dbd79cbf2b5083f18f16825974d6147c440d7e

1. overview.png

   Main view: a llama-server session running on tmux, output streaming live. Close the browser and it keeps going.


2. launcher.png
   Launch modal — pick a toolbox on the left, hit Run on one of its commands. No memorizing flags.


3. models.png
   Model manager — browse, download, update or delete Hugging Face models, with unused ones flagged.


4. toolboxes.png
   Toolbox updates — checks each source repo against GitHub and each container image against its registry digest (✓ / ⬆ badges).


5. commands.png
   Add/edit the commands shown per toolbox from a form — it rewrites commands.yaml for you, comments preserved.


6. presets.png
   llama.cpp preset builder — add model presets without hand-editing the .ini files.


7. raw-config.png
   Prefer raw text? Edit commands.yaml and the preset .ini files directly in the browser, validated before save.


8. logs.png
   Every session is logged to disk — browse, view or download past runs, even after they exit.

reddit.com

u/urbanswelt — 6 days ago

▲ 9 r/StrixHalo+2 crossposts

GEEKOM A9 Mega

Been running a full local LLM stack on a GEEKOM A9 Mega for a few weeks. 128GB unified memory, 170mm mini PC, runs models that normally need an A100. The hardware delivers. The AMD software ecosystem around it is still catching up.

Sharing the friction points because I couldn't find anything specific to gfx1151 when I was setting this up.

Specs

- CPU: AMD Ryzen AI Max+ 395, 16C/32T, 5.1GHz boost

- GPU: Radeon 8060S, RDNA 3.5, gfx1151

- RAM: 128GB LPDDR5x unified, 96GB carved to VRAM in BIOS

- OS: Ubuntu 24.04, OEM kernel 6.17

Current stack: Qwen3-235B (107GB), Qwen3-30B, DeepSeek-R1 70B, Qwen3-VL 30B vision, few 27B variants. All tested on one box.

The issues; none of these are hardware faults, all ecosystem/tooling maturity

ROCm 7.2 lies about VRAM on gfx1151

hipMemGetInfo returns ~26GB (system free RAM) instead of the actual 96GB. Model loads hang forever at "fitting params to device memory." Fix is HIP_VISIBLE_DEVICES=-1 in your Ollama service environment to force Vulkan/RADV, which correctly sees 111.5 GiB. gfx1151 is new enough that ROCm just hasn't caught up yet.

MTP is blocked

llama-server's multi-token prediction path uses HIP compute dispatch throws WALKER_ERROR and MAPPING_ERROR in dmesg on gfx1151 then page-faults. No workaround, waiting on ROCm 8.0. Not a hardware limitation, purely a driver gap.

Vulkan caps context efficiency around 32K

Token gen is good ~63 t/s on 30B, ~15 t/s on 235B. But prompt processing on long contexts is slow. ROCm would be 3x faster on prefill for 130K+ context. Since ROCm is broken you feel this on large document ingestion. Again a tooling problem not a silicon one.

Ollama rough edges

ollama pull hf.co/ fails due to a redirect auth bug download GGUFs manually with the hf CLI instead. Split GGUFs (00001-of-00009) can't be registered directly, merge with llama-gguf-split first. Neither is AMD-specific, just things you hit when most community docs assume CUDA.

The most frustrating documentation everything assumes Nvidia, AMD need to up their game here else good hardware with no or limited tooling support will discourage adoption.

-----

Bottom line

The silicon is ahead of its software support. AMD is putting out genuinely competitive hardware for local inference — 128GB unified at this price and form factor is hard to beat. But gfx1151 is new enough that you're in early-adopter territory. ROCm docs mostly cover gfx1100/1101, community guides assume Nvidia, and you'll be reading kernel logs more than you'd like.
If you want plug-and-play today, wait for ROCm 8.0. If you're okay with some manual setup it's worth it.

Happy to answer questions on the Vulkan setup or specific model configs.

----------Update on what all was tried and failed 😄 -----------

Done exhaustive kernel testing trying to get ROCm HIP working for llama.cpp inference. Everything fails with the same page fault:

amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:35)

GCVM_L2_PROTECTION_FAULT_STATUS: 0x00800932

PERMISSION_FAULTS: 0x3 ← both read AND write denied

WALKER_ERROR: 0x1

MAPPING_ERROR: 0x1

Tested on every kernel I could find. All fail identically:

- Ubuntu OEM 6.17.0-1025-oem → PERMISSION_FAULTS 0x3

- Ubuntu OEM 6.17.0 + amdgpu-dkms 6.19.4 (AMD 31.30 repo) → PERMISSION_FAULTS 0x3

- Ubuntu mainline 6.18.9 → PERMISSION_FAULTS 0x3

- Ubuntu mainline 7.0.14 → PERMISSION_FAULTS 0x3

- Fedora 42 kernel 6.18.0-rc5 vanilla → WORKS (per kyuz0/amd-strix-halo-toolboxes benchmarks)

Also tried every env var and kernel param people suggest:

- amdgpu.noretry=0 → no effect, XNACK stays NO regardless

- HSA_XNACK=1 → no effect

- amdgpu.vm_fragment_size=9 → no effect on permissions

- GGML_HIP_UMA=OFF (forces regular hipMalloc instead of SVM) → same faults

- amd_iommu=off + amdgpu.gttsize=126976 → GTT confirmed at 124GB, fault unchanged

Key finding: Checked kernel config on both Ubuntu kernels — they have identical flags to Fedora (CONFIG_HSA_AMD_SVM=y, CONFIG_HMM_MIRROR=y, CONFIG_DEVICE_PRIVATE=y, CONFIG_ZONE_DEVICE=y). The fix in Fedora's kernel is not a config difference. It's amdgpu driver code patches — presumably from AMD's drm-next/amdgpu-next branch — that haven't landed in Ubuntu mainline or the 6.18 stable series yet.

The fault PERMISSION_FAULTS: 0x3 means both read AND write are denied. The GPU driver is mapping memory into the GPU's address space but the page table entries are missing the r/W permission bits. gfx1151-specific bug.

Workaround: Ollama with Vulkan/RADV backend (HIP_VISIBLE_DEVICES=-1 disables broken ROCm, forces Vulkan). Running ~17.5 t/s on Qwen3-235B Q3_K_M. Not as fast as HIP but stable.

Can anyone confirm which specific commits fix gfx1151 page table permissions in Fedora's tree? Is this targeting a specific drm-next PR for 6.19? Would help users know whether to patch or just wait.

u/Working-Release-3771 — 6 days ago

▲ 4 r/StrixHalo

torn between gmktec and corsair ai 300

I couldn't find one source that has tested both, besides tomshardware, but they didn't test them over the same criteria. (fan noise, power draw, thermals, etc)

From what I can find, they are very similar, and It all comes down to these differences:

Corsair AI 300:

+ Internal PSU = better portability
+ theoretically better hardware and software support
+ slightly higher SSD speeds (6k in crystal mark)
- comes with windows 11 home, no Hyper-V support, no BitLocker Encryption
- more fans, and Tomshardware says it was noticeably noisy during load
- integrated PSU means it's more trouble if it breaks, and potentially drives heat inside
- $200 more expensive in my region
- comes with a 1TB SSD
- Wifi 6

GMKtec Evo X2

+ $200 cheaper in my region
+ comes with a 2TB SSD
+ comes with Windows 11 Pro, Hyper-V support, BitLocker Encryption
+ less fans so theoretically, quieter
+ easier to replace PSU if it breaks, since it's external
+ Wifi 7
- slightly slower SSD speeds (5k in crystal mark)
- external PSU, worse portability
- I've read they have issues with software/bios/ drivers that could be flimsy

Has anyone used both and can tell me if there is any meaningful difference in hardware reliability or something else that would make one a better choice over the other?

reddit.com

u/nemuro87 — 7 days ago

▲ 18 r/StrixHalo

Minisforum MS-S1 Max 128GB back in stock

Finally, back in stock for anyone waiting.

u/JumpyMasterpiece4511 — 7 days ago

▲ 34 r/StrixHalo

4 Months with Strix Halo: From Gaming to Local AI Exploration

I’ve been an AMD enthusiast since the Slot A era. My recent journey included owning the ASUS ROG Zephyrus (4800HS) and Strix Point.

As many of you know, RAM prices spiked unexpectedly earlier this year. Seeing an opportunity to grab a high-spec machine while prices were still somewhat manageable, I picked up an HP Zbook Ultra G1a with 128GB of RAM. Looking back, it seems like this specific configuration is almost impossible to find in the US now, regardless of the price.

Initially, I bought it as a step up from Strix Point, mainly hoping for more stable frame rates in certain 3D games, like Tokyo Xtreme Racer. At the time, when Lisa Su pitched this as a tool for Creators, Gamers, and AI Developers, AI wasn’t really on my radar.

That changed in April when my company started providing a subscription for a Coding AI Assistant. It made me wonder: “What can I actually do with my own laptop?”

I started experimenting with Ollama and became fascinated with abliterated LLMs. The built-in web search tools were also a huge plus. While trying to compare Vulkan and ROCm via LM Studio, I came across “Lemonade.” I realized I could run lightweight models (<10B) directly on the NPU, and now I use it as a handy, lightweight translator (since English isn’t my first language).

To be honest, Strix Halo isn’t a “perfect” AI machine, as many of you have likely experienced. It really shines with MoE models, but anything with more than 15B active parameters tends to struggle. Still, it’s enough to carve out some very practical use cases.

In terms of models, the Gemma 4 series has excellent language support. I’m currently testing gpt-oss 120b with some challenging questions to see if it truly lives up to its 120b intelligence. I’ve also found that Qwen 3.6 offers superior coding assistance compared to Gemma 4 at the same size, and Ministral serves as a great European alternative.

There are still some frustrations with AMD’s software ecosystem. For instance, EfficientNet for digiKam’s AutoTag doesn’t run on the GPU, and Lemonade lacks an auto-updater. However, I’m glad I dove into this a bit late—the practical utility has been surprisingly high.

So, that’s my experience with Strix Halo so far. How are you all getting on with your setups?

reddit.com

u/nsfw_tie — 8 days ago

▲ 15 r/StrixHalo

We ran Qwen3-0.6B on the Strix Halo NPU at 4.8 tok/s -- need help unlocking INT8

We ran Qwen3-0.6B on the Strix Halo NPU at 4.8 tok/s -- need help unlocking INT8

TL;DR: Reverse-engineered the undocumented AMD XDNA2 NPU on Strix Halo (Ryzen AI MAX+ 395) to run Qwen3-0.6B at 210ms/tok (4.8 tok/s) -- 3.2x faster than CPU. Built everything from scratch -- 15 xclbins, 7 compiler bug fixes, full IRON API pipeline. But INT8 is blocked by the MLIR toolchain (only accepts BFP16 types), and BF16 DMA hangs due to aiecc descriptor generation bugs. Need community help.

What We Built

The vision: Fold amdxdna (NPU) into amdgpu so the NPU, GPU, and CPU share one memory manager, one DRM fd, and one ROCm compute API.

                         ┌─────────────────────────────────────┐
                         │        Lemonade SDK / Ollama        │
                         │   (Single API -- any model, any HW) │
                         └──────────────┬──────────────────────┘
                                        │
                         ┌──────────────▼──────────────────────┐
                         │        ROCm HIP Runtime             │
                         └──────────────┬──────────────────────┘
                                        │
             ┌──────────────────────────┼──────────────────────────┐
             │                          │                          │
   ┌─────────▼─────────┐    ┌──────────▼─────────┐    ┌──────────▼─────────┐
   │     GPU (GFX)     │    │     NPU (XDNA2)    │    │     CPU (x86)      │
   │  amdgpu driver    │    │  amdgpu NPU IP     │    │  Native (Zen 5)    │
   │  RDNA 3.5 CUs     │    │  8 AIE columns     │    │  ~2-3 tok/s CPU    │
   │  80 TFLOPS FP16   │    │  31 TFLOPS BFP16   │    │  (llama.cpp)       │
   └─────────┬─────────┘    └──────────┬─────────┘    └──────────┬─────────┘
             │                          │                          │
             └──────────────────────────┼──────────────────────────┘
                                        │
                         ┌──────────────▼──────────────────────┐
                         │       Unified Memory Manager        │
                         │   One DRM fd, one address space     │
                         │   Shared page table (GPU-&gt;NPU-&gt;CPU) │
                         └─────────────────────────────────────┘

The current stack (pre-unification):

6 custom xclbins -> 4 GEMMs/layer x 28 layers = 112 NPU calls/token Fused QKV (1024x4096) + Fused GU (1024x6144) + O + D projections Threaded LM head (4x) + Threaded attention (4x) BFP16 format (hardware-native block float, RMSE 0.0003)

Performance

Metric	CPU (llama.cpp)	NPU (This Work)	Gain
Decode latency	~668 ms/tok	210 ms/tok	3.2x
Throughput	~1.5 tok/s	4.8 tok/s	3.2x
Power efficiency	~25-35W	~12W NPU+CPU	~3x perf/W

The NPU pulls about 2W during inference, the CPU about 10W for attention/LM head. Total system ~12W vs ~30W for CPU-only.

What Works

6 BFP16 xclbins (QKV fused, O, GU fused, D, KV) -- RUNNING IN PRODUCTION
4 Multi-token M=256 xclbins (for batched decode) -- BUILT (needs aiecc fix)
4 2-layer batch N=8320 xclbins (for layer pairs) -- BUILT (needs engine integration)
4 INT8 xclbins (QKV, O, GU, D) -- BUILT (DMA stride needs fix)
IRON API @iron.jit -- FULL PIPELINE WORKING END-TO-END
INT8 matmul via IRON (64x64x64) -- EXACT MATCH, ERROR=0

7 Compiler/API Bugs Fixed

All patched, tested, and documented in the repo:

ScalarValue nanobind type mismatch -- removed ArithValueMeta metaclass
AIE ELF symbol rename -- pure-Python ELF32 parser (objcopy can't handle 32-bit AIE ELFs)
transpose.hpp incomplete type -- template deduction fallback
mm.cc missing extern "C" -- Peano compiler needs C linkage for symbol resolution
SKIP_VECTORIZED flag -- preprocessor guard for scalar-only kernel compilation
MLIR parser i8/i16 rejection -- patched AIEXDialect.cpp + AIETargetModel.cpp, rebuilt with ninja
~15 IRON API integration issues -- all fixed, @iron.jit works end-to-end

The Wall: INT8 and BF16

INT8 -- 50 TOPS, Blocked by MLIR Dialect

The NPU hardware fully supports INT8. We proved it: IRON API INT8 matmul at 64x64x64 produces exact match, error=0. But the aiecc MLIR parser only accepts v8bfp16ebs8 and v16bfp16ebs16 types -- i8 is rejected at parse time.

What we did: Patched AIEXDialect.cpp and AIETargetModel.cpp in the aiecc source, rebuilt with ninja, and successfully built working INT8 xclbins (66KB each, all 4 projections).

The last mile: The DMA stride formulas in the MLIR generator were written for BFP16's 1.125-byte packed format (v8bfp16ebs8). They need recalibration for INT8's 1-byte element type. The MLIR looks correct but the DMA descriptors access wrong memory offsets.

What INT8 unlocks: 2x throughput. ~100ms/tok, ~10 tok/s. This is the standard precision for quantized LLM inference (GGUF Q4_0, IQ4_XS, etc.).

BF16 -- Hangs at Runtime

BF16 xclbins compile successfully but the DMA controller hangs on the first kernel call. Every kernel variant (identity copy, native matmul, emulated matmul) hangs identically. The Chess compiler generates incorrect DMA descriptors for bfloat16 memory types.

Windows handles BF16 correctly through a different NPU stack (DirectML/QNN). The Linux aiecc/Chess toolchain has a bug in its bfloat16 DMA descriptor generation that we haven't been able to work around.

What We Need Help With

INT8 DMA strides -- The n1_core tile streaming hierarchy in the MLIR generator produces DMA descriptors that don't work for INT8. Someone familiar with AIE DMA multi-dimensional addressing (the dimensionsToStream / dimensionsFromStream attributes in aie.objectfifo) could probably fix this in minutes.
BF16 descriptor fix -- The Chess compiler's bfloat16 DMA descriptor generation has a bug. A newer toolchain version, a workaround in the MLIR generator, or reverse-engineering the Windows NPU DMA path would fix this.
Windows NPU stack -- What does DirectML/QNN do differently for BF16/INT8? If someone can trace the Windows NPU driver calls for BF16/INT8 DMA, we can replicate the correct behavior on Linux.

Repository

Full handoff + xclbins + engine source + all investigation docs:

https://github.com/bong-water-water-bong/npu-gpu-cpu

Key files in docs/:

HANDOFF-NPU-OPTIMIZATION.md -- Complete 3-day journey (880+ lines)
INT8-HANDOFF.md -- INT8 deep-dive: 6 failed paths, root cause analysis
AGENTS.md -- Session summary for future coding agents
REDDIT_POST.md -- This post

Built from scratch over 3 days. No documentation. No support. Just reverse engineering, 7 compiler bug fixes, 15 xclbins, and a lot of coffee.

u/Creepy-Douchebag — 7 days ago

▲ 16 r/StrixHalo

The cheapest 128GB Strix Halo?

https://preview.redd.it/z1x34f023o9h1.png?width=1303&format=png&auto=webp&s=a095ac0d51874cf0fd353ed1de1cbce14c9136da

reddit.com

u/tecneeq — 10 days ago

r/StrixHalo

Who've told you that distributed training is impossible? Democratizing AI: The Psyche Network Architecture

Strix.Monitor : web-based resource monitor for Linux systems

Pub-Beta: Hal0 - Local Homelab LLM+ Inference Powerhouse for StrixHalo / Proxmox / More

How are you guys running your models?

Local Setup: Strix Halo 128gb

PSA - New Corsair AI Workstation 300 Arrived Infected!!

An argument for 40–50W on the ROG Flow Z13 (Ryzen AI Max+ 395)

Cachy-Router an OpenAI-compatible router for llama.cpp with shared KV-cache

Buy now or wait on a local LLM box during the memory crunch? How I'd read it after running a Strix Halo daily for 6 months

GPT-OSS-120B (MXFP4) resident on a Strix Halo iGPU, +2 more models in 96GB UMA — real tok/s, Vulkan not ROCm, and the gotchas

Ryzen AI Max (Strix Halo) Mobile Devices, as of the end of June 2026.

Ryzen AI Max (Strix Halo) Mobile Devices, as of the end of June 2026

I made a local model that gives you a multimedia role playing experience (and you can too!)

Thoughts on this M.2 NVME adaptor?

Toolbox Runner — a web UI alternative to the Strix Halo toolbox TUI

GEEKOM A9 Mega

torn between gmktec and corsair ai 300

Minisforum MS-S1 Max 128GB back in stock

4 Months with Strix Halo: From Gaming to Local AI Exploration

**We ran Qwen3-0.6B on the Strix Halo NPU at 4.8 tok/s -- need help unlocking INT8**

The cheapest 128GB Strix Halo?

We ran Qwen3-0.6B on the Strix Halo NPU at 4.8 tok/s -- need help unlocking INT8