▲ 37 r/ROCm

Easy ROCm + llama.cpp build script for AMD GPUs

I made a setup script that handles ROCm installation and builds llama.cpp from source with ROCm + Vulkan support. It lets you pick your GPU target and ROCm version interactively, then clones the latest llama.cpp master, configures it automatically, and builds it quickly. I've attached my current server config, which gives the following benchmark results.

Bench (RX 6700 XT - gfx1030):

Ornith-1.0-35B-MTP-APEX-I-Compact (17GB Q4_K)

- pp3000: 786 tok/s

- tg128: 40 tok/s

Note: Use the latest llama.cpp master. My 2-month-old build got 25-30 tps; the updated build now gets 40-42 tps on the same model. Recent commits make a big difference.
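For anyone who wants to see roughly what the configure step of such a script boils down to, here is a minimal sketch. It is not taken from the repo: the helper name is hypothetical, and the CMake option names (`GGML_HIP`, `AMDGPU_TARGETS`, `GGML_VULKAN`) are my recollection of llama.cpp's build docs, so check them against the current README (newer trees may spell the target option `GPU_TARGETS`).

```python
import shlex


def llama_cpp_cmake_args(gpu_target: str, vulkan: bool = True,
                         build_dir: str = "build") -> list[str]:
    """Assemble a CMake configure command for a HIP (ROCm) build of llama.cpp.

    Option names are assumptions based on llama.cpp's build documentation.
    """
    args = [
        "cmake", "-S", ".", "-B", build_dir,
        "-DCMAKE_BUILD_TYPE=Release",
        "-DGGML_HIP=ON",                   # enable the ROCm/HIP backend
        f"-DAMDGPU_TARGETS={gpu_target}",  # e.g. gfx1030
    ]
    if vulkan:
        args.append("-DGGML_VULKAN=ON")    # also build the Vulkan backend
    return args


configure = llama_cpp_cmake_args("gfx1030")
build = ["cmake", "--build", "build", "--config", "Release", "-j"]

# Print the commands instead of running them, so this is safe to try anywhere.
print(shlex.join(configure))
print(shlex.join(build))
```

Run the printed commands from inside a fresh llama.cpp checkout; passing the lists to `subprocess.run(..., check=True)` would do the same thing from Python.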

Repo in comment

reddit.com

u/Key_Flatworm7995 — 2 days ago

▲ 9 r/ROCm+1 crossposts

DeepSeek V4 Flash

Anyone run deepseek-v4-flash on amd gpu here?

u/djdeniro — 1 day ago

▲ 2 r/ROCm+1 crossposts

Face swap / deepfake workflow on RX 9070 XT under Arch Linux?

Hi everyone,

I'm trying to build a reliable deepfake / face swap workflow on Arch Linux using an AMD Radeon RX 9070 XT (tried ROCm 6.4, 7.0).

So far I haven't been able to get a working pipeline. ReActor doesn't work correctly on my system. The GPU is detected by PyTorch (torch.cuda.is_available() == True), but face swapping either falls back to the CPU or produces broken results.

My goal is simple:

* use a **source image** (the face that should be swapped in),

* use a **target video** (a different person),

* process the video frame by frame,

* replace the face on every frame,

* then reassemble the processed frames back into the final video.
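The frame-by-frame part of the steps above can be sketched independently of whichever swap tool ends up working. This is only an outline under assumptions: the function names are hypothetical, `swap_face` is a placeholder for the actual face-swap call (ReActor, FaceFusion, etc.), and the ffmpeg options shown are the standard image-sequence ones.

```python
def extract_frames_cmd(video: str, frames_dir: str) -> list[str]:
    """ffmpeg command that dumps every frame of `video` as numbered PNGs."""
    return ["ffmpeg", "-i", video, f"{frames_dir}/%06d.png"]


def swap_all(frame_paths, source_face, swap_face):
    """Apply a caller-supplied swap function to each frame, in order.

    `swap_face(source_face, frame_path)` is a placeholder for the real tool.
    """
    return [swap_face(source_face, frame) for frame in frame_paths]


def reassemble_cmd(frames_dir: str, fps: float, audio_source: str,
                   output: str) -> list[str]:
    """ffmpeg command that encodes numbered PNGs back into an H.264 video,
    copying the audio track from the original video if it has one."""
    return [
        "ffmpeg",
        "-framerate", str(fps),
        "-i", f"{frames_dir}/%06d.png",
        "-i", audio_source,
        "-map", "0:v", "-map", "1:a?",   # video from frames, optional audio
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "copy",
        output,
    ]


print(extract_frames_cmd("target.mp4", "frames"))
print(reassemble_cmd("swapped", 30, "target.mp4", "result.mp4"))
```

Keeping extraction and reassembly outside the swap tool makes it easier to tell whether a broken result comes from the GPU path or from the video handling.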

Has anyone with an RX 9070 XT or another RDNA4 GPU successfully done this in ComfyUI?

Which nodes or tools are you using?

* ReActor

* FaceFusion

* SimSwap

* InstantID

* PuLID

* Something else?

I'm mainly looking for a workflow that works reliably with AMD ROCm on Arch Linux.

Any recommendations, working workflows, or setup tips would be greatly appreciated. Thanks!

u/Interesting_Ad7497 — 1 day ago

▲ 7 r/ROCm+1 crossposts

vLLM on native Windows ROCm RDNA3 — custom kernel port, gfx1100.

vLLM has no native Windows support (WSL2 or a couple of community forks only), and even on Linux its cpp_extension-based hipify path doesn't handle a Windows torch-rocm build cleanly. I put together an out-of-tree platform plugin plus a build harness that compiles vLLM's own csrc HIP kernels natively on Windows + RDNA3, without forking vLLM itself.

Repo: https://github.com/ThePie88/vLLM-ROCm-Windows

Stack: RX 7900 XT (gfx1100), Windows 11, HIP SDK 7.2 (MSVC + clang 22), torch 2.10.0+rocm7.13 (TheRock-class Windows build), vLLM v0.19.1.

The build problem and the workaround

vLLM's Linux build relies on a CUDA→HIP header redirect that the Windows torch wheel doesn't ship, and cpp_extension's hipify orchestrator mishandles Windows paths outright. Instead of fighting that, the harness applies torch's own hipify regex-substitution engine (RE_PYTORCH_PREPROCESSOR + PYTORCH_MAP) directly to the csrc sources, with a small set of redirect shim headers, then compiles with torch.utils.cpp_extension.load() (--rocm-device-lib-path, -DUSE_ROCM=1, -DTORCH_HIP_VERSION=0, HALF-guard undefs, linking rocblas/hipblas/amdhip64). This is the one load-bearing trick the whole thing depends on.
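To illustrate the general idea of regex-substitution hipification described above, here is a toy sketch. It does not use torch's actual RE_PYTORCH_PREPROCESSOR or PYTORCH_MAP, and it is not the harness from the repo; the five-entry mapping and the `hipify_text` name are made up for illustration.

```python
import re

# Tiny illustrative subset; torch's real mapping is far larger.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaMemcpy": "hipMemcpy",
    "cudaStream_t": "hipStream_t",
    "cuda_runtime.h": "hip/hip_runtime.h",
}

# One alternation over all keys, longest first, bounded so that a key only
# matches as a whole identifier (cudaMemcpy must not match cudaMemcpyAsync).
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9_])("
    + "|".join(re.escape(k) for k in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")(?![A-Za-z0-9_])"
)


def hipify_text(source: str) -> str:
    """Replace every mapped CUDA name in `source` with its HIP counterpart."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)


cuda_src = "#include <cuda_runtime.h>\ncudaMalloc(&ptr, n);\ncudaMemcpyAsync(a, b, n);\n"
print(hipify_text(cuda_src))
```

Names missing from the map (here `cudaMemcpyAsync`) pass through untouched, which is where a small set of shim headers would have to pick up the slack.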

Ops registration also has a Windows-specific gotcha worth flagging for anyone else doing this: this torch-rocm Windows build presents HIP devices under the CUDA dispatch key, not torch::kHIP — every native op has to be .impl(..., torch::kCUDA, ...), or it silently fails to bind.

Currently compiled and validated this way:

- silu_and_mul, rms_norm, fused_add_rms_norm, rotary_embedding (fused activation/norm/RoPE)
- the W4A16 GPTQ/exllama GEMM (gptq_gemm, gptq_shuffle) from csrc/quantization/gptq/q_gemm.cu, including its small-batch decode path; without this port there is no kernel for it on Windows at all

For AWQ-uint4 (no fast kernel on ROCm at all — exllama only takes uint4b8, Marlin is CUDA-only), I wrote a Triton M=1 dequant-GEMV, a real reduction (no tl.dot/split-K/atomicAdd) that reuses conch's weight normalization and autotunes per shape. Takes AWQ decode from 12.2 to 50.9 tok/s on a 14B model.
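As a reference for what an M=1 dequant-GEMV computes, here is a plain-Python sketch of the math only. It is not the Triton kernel: the packing layout (eight 4-bit values per 32-bit word, lowest nibble first, one list of words per output row) is a simplification I chose for readability, and real AWQ checkpoints use a different packing order. All names are hypothetical.

```python
def unpack_uint4(packed: int, count: int = 8) -> list[int]:
    """Split one 32-bit word into `count` 4-bit values, lowest nibble first."""
    return [(packed >> (4 * i)) & 0xF for i in range(count)]


def dequant_gemv(packed_rows, scales, zeros, x, group_size):
    """Compute y[n] = sum_k (q[n][k] - zeros[n][g]) * scales[n][g] * x[k],
    where g = k // group_size is the quantization group of input index k."""
    y = []
    for n, row in enumerate(packed_rows):
        # Unpack the whole row of 4-bit weights.
        q = [v for word in row for v in unpack_uint4(word)]
        acc = 0.0
        for k, qk in enumerate(q):
            g = k // group_size
            acc += (qk - zeros[n][g]) * scales[n][g] * x[k]
        y.append(acc)  # one full reduction per output element
    return y


# One output row, K = 8, a single group: weights 0..7, scale 0.5, zero point 0.
y = dequant_gemv(
    packed_rows=[[0x76543210]],
    scales=[[0.5]],
    zeros=[[0]],
    x=[1.0] * 8,
    group_size=8,
)
print(y)
```

A reference like this is mainly useful for checking a fast kernel's output on small shapes before autotuning it.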

Full inventory of what's ported vs. what's left (with hipify/adapt/rewrite verdicts per file across csrc/, csrc/rocm/, csrc/attention/, csrc/moe/, csrc/quantization/) is in docs/csrc-native-build-roadmap.md.

Numbers (single-stream decode, batch 1, all verified coherent)

| Model | Quant | tok/s |
|---|---|---|
| Qwen2.5-7B-Instruct-GPTQ-Int4 (dense) | GPTQ Int4 | 115 |
| ERNIE-4.5-21B-A3B-Thinking (MoE) | W4A16 gs32 | 62.7 → 79.2 |
| Qwythos-9B (Qwen3.5 hybrid) | W4A16 | 61.7 |
| DeepSeek-R1-Distill-Qwen-14B-AWQ | AWQ Int4 | 12.2 → 50.9 |

torch.compile/inductor and hipGraph decode capture (FULL_DECODE_ONLY) both work. Getting inductor to run at all needed a torch.distributed.tensor (DTensor) stub — the module is genuinely absent on this build, but a bare missing module raises a half-initialized ImportError, and inductor's graph logging only guards the import with except ModuleNotFoundError. One stub module fixes it.
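The stub-module trick can be sketched generically like this. It is an assumption about how such a stub might look, not the plugin's actual code: `install_stub` is a hypothetical helper that registers an empty placeholder in `sys.modules` so that a later `import` of that name succeeds instead of raising.

```python
import importlib
import sys
import types


def install_stub(name: str) -> types.ModuleType:
    """Return the real module if it imports; otherwise register an empty stub."""
    try:
        return importlib.import_module(name)
    except ImportError:  # also covers ModuleNotFoundError
        pass
    stub = types.ModuleType(name)
    stub.__path__ = []  # mark as a package so submodule imports fail cleanly
    sys.modules[name] = stub
    # Attach the stub to its parent package, if the parent is already loaded.
    parent_name, _, child = name.rpartition(".")
    if parent_name and parent_name in sys.modules:
        setattr(sys.modules[parent_name], child, stub)
    return stub


# Demo with a module name that does not exist anywhere.
stub = install_stub("hypothetical_absent_module_xyz")
import hypothetical_absent_module_xyz  # now succeeds and yields the stub

print(hypothetical_absent_module_xyz is stub)

# On the stack described above, the equivalent call would presumably be made
# before anything imports inductor:
#   install_stub("torch.distributed.tensor")
```

Any code that actually touches attributes of the stub will still fail, so this only helps where the import itself is the problem.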

Native paged attention: built it, it's faster, and it still loses

This is the part I'd most want ROCm-side eyes on. I ported vLLM's generic wave32 paged attention (csrc/attention/, not the gfx9/MFMA csrc/rocm/attention.cu, which has no gfx11 path) to compile natively. In isolation it's ~3.2x faster than the Triton decode kernel it replaces, numerically correct (rel err ~5e-4).

Wired end-to-end: -9% on one model, -5% on another. The kernel itself is faster, but the backend path around it (cache-write op + wrapper + metadata) is heavier than the Triton path's fused version. Ablation (no-op each component under cudagraph, measure the tok/s delta — the only reliable method here, since torch.profiler misattributes time to zero-kernel view ops on this stack even under cudagraph) confirmed attention compute is genuinely the biggest lever at ~27% of decode time.
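The arithmetic behind an ablation measurement like this is short enough to write down. A minimal sketch, assuming per-token time is simply the inverse of throughput; the function name and the 100 and 137 tok/s figures are made-up illustration values, not numbers from the post.

```python
def time_share(baseline_tok_s: float, ablated_tok_s: float) -> float:
    """Fraction of per-token decode time attributable to the ablated component.

    baseline_tok_s: throughput with the component running normally.
    ablated_tok_s:  throughput with the component replaced by a no-op.
    """
    t_base = 1.0 / baseline_tok_s      # seconds per token, full pipeline
    t_ablated = 1.0 / ablated_tok_s    # seconds per token, component removed
    return (t_base - t_ablated) / t_base


# Hypothetical example: no-oping a component lifts 100 tok/s to 137 tok/s.
share = time_share(100.0, 137.0)
print(f"{share:.1%} of decode time")
```

Because the share is computed from end-to-end throughput, it does not depend on the profiler attributing time to the right op.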

Follow-up: a flash-layout kernel reading the Triton path's KV cache directly, to keep the light fused path and avoid the heavier backend. Still lost, -26%, at head_size=128. Turns out the native kernel's advantage is head_size=256-specific — that's where Triton's own kernel is pathologically slow; at head 128 Triton is already near the bandwidth roofline and beating it needs a genuinely faster kernel, not just a native one. Parked until I have a head-256 model that fits cleanly in 20GB to test on (the one I have overflows and spills).

AITER on gfx1100 — the verdict I landed on

Evaluated whether AITER's kernels are portable to RDNA3. Short version: no, not for the parts that matter. AITER's Python dispatch accepts gfx1100 with no compile-time gate, but the two things worth having — fmha_v3 paged attention and MLA decode — are shipped as ASM-tuned .co blobs for gfx942/950/1250 only, no gfx1100 blob, and regenerating them needs AMD's tuning pipeline, not something a hipify pass gets you. Composable Kernel's instance templates are gfx9-only. And on Windows specifically, setup.py forces AITER_TRITON_ONLY=True / ENABLE_CK=False regardless. What is portable (the Triton-based paths) runs at parity with what vLLM's own Triton fallback already does — no net gain from vendoring it.

My gfx1100 native paged-attention kernel above is functionally the RDNA3-native equivalent of fmha_v3 — the kernel-level win is real (3.2x), same as AITER's headline claim for CDNA. The gap is entirely in the integration path, not the hardware ceiling. If anyone on the ROCm side has thoughts on why the ROCM_ATTN backend path is that much heavier than TRITON_ATTN's fused version, or wants the isolated kernel to poke at, I'll take pointers.

Not done

- Single GPU only: RCCL doesn't exist on Windows, so torch.distributed is a single-process shim.
- fp8 KV cache works; sub-8-bit (a KVarN calibration-free port, Hadamard+Sinkhorn+RTN) runs end-to-end at ~4.7x KV capacity but isn't production-ready yet (workspace over-allocation).
- Most of csrc (paged attention native path, MoE expert GEMM, several fusion kernels) is still unported; the roadmap with effort/payoff per kernel is in the repo.

Setup steps and the pinned (fragile) dependency versions are in the README. Questions welcome, especially from anyone else fighting RDNA3 on native Windows.

u/Former-University905 — 2 days ago

▲ 9 r/ROCm

ComfyUI on AMD R9700: FP8 not working, ComfyUI manually casts to FP16 and needs 2x more VRAM for models.

[INFO] Native ops: float8_e5m2, float8_e4m3fn, int8_tensorwise , emulated ops: mxfp8, nvfp4

[INFO] model weight dtype torch.float8_e4m3fn, manual cast: torch.float16

Has anyone managed to solve this problem for the AMD R9700 GPU in ComfyUI on Linux?

https://github.com/Comfy-Org/ComfyUI/issues/11519

Is there anyone here who has successfully run FP8 Wan 2.2 on an R9700 GPU? By "successfully," I mean achieving the correct VRAM usage and speed, without ComfyUI automatically converting the model weights to FP16 and increasing VRAM consumption. If so, please share the VRAM usage for FP8 on this GPU at 1280x720x81. I’m starting to wonder if it actually works on this card at the moment.
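For comparing numbers, the expected weight footprint per dtype is easy to estimate. This is just back-of-the-envelope arithmetic: the 14e9 parameter count is a placeholder and not a claim about any particular Wan 2.2 checkpoint, and activations, latents, and caches are not included.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {
    "float8_e4m3fn": 1,
    "float8_e5m2": 1,
    "float16": 2,
    "bfloat16": 2,
    "float32": 4,
}


def weight_gib(num_params: float, dtype: str) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


params = 14e9  # placeholder parameter count, for illustration only
for dtype in ("float8_e4m3fn", "float16"):
    print(f"{dtype}: {weight_gib(params, dtype):.1f} GiB")
```

If the VRAM reported for the model is close to the FP16 figure rather than the FP8 one, the weights are most likely being held in FP16.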

u/Glittering-Cold-2981 — 3 days ago

▲ 31 r/ROCm+1 crossposts

Toward Better HIP Kernel Generation for AMD GPUs

https://scalingintelligence.stanford.edu/blogs/hipkernels/

u/Superb-Translator236 — 3 days ago

▲ 7 r/ROCm+1 crossposts

hipfire engine for consumer cards

I found this repo https://github.com/Kaden-Schutt/hipfire that’s apparently made specifically for consumer amd cards and wanted to know if anyone has used it successfully. Right now I have a 7900xtx & 7900xt trying to run qwen3.6 27b but I can’t get it to run on both cards just on my xtx. Apparently it’s not supported yet but it uses some interesting quants and could be worth looking into/following the updates.

u/Beneficial-Border-26 — 5 days ago

▲ 3 r/ROCm

How do I properly do int8 quantization for a model like Anima on RDNA2 cards?

I have a 6700 XT and 32 GB of RAM. From what I found online, int8 quantization should improve speed by 30%, but after setting up the fast int8 Triton backend, the speed practically didn't change: 4.65 s/it in fp16 versus 4.53 s/it in int8 at 832x1216. Did I do something wrong, or is this an RDNA2 limitation?

ComfyUI options I use: cache none, pinned memory disabled, PyTorch cross attention.
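To put the two timings in perspective, here is the speedup arithmetic as a small sketch. It assumes "30% faster" means 30% more iterations per second, which is my reading and not something stated in the post; the function names are hypothetical.

```python
def throughput_gain_percent(before_s_per_it: float, after_s_per_it: float) -> float:
    """Percentage increase in iterations per second going from before to after."""
    return (before_s_per_it / after_s_per_it - 1.0) * 100.0


def target_s_per_it(before_s_per_it: float, gain_percent: float) -> float:
    """Seconds per iteration you would see if throughput rose by gain_percent."""
    return before_s_per_it / (1.0 + gain_percent / 100.0)


measured = throughput_gain_percent(4.65, 4.53)  # fp16 vs int8 from the post
expected = target_s_per_it(4.65, 30.0)          # what a 30% gain would look like

print(f"measured gain: {measured:.2f}%")
print(f"a 30% gain would mean about {expected:.2f} s/it")
```

So the measured change is a few percent, well short of what a 30% throughput gain would show on the same resolution.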

u/ziege159 — 5 days ago

▲ 11 r/ROCm

Try my roctop, a lightweight terminal monitor for AMD/ROCm GPUs

Built roctop, a lightweight terminal monitor for AMD/ROCm GPUs.
It gives you a nvitop-style view of GPU utilization, memory, temps, power, and running processes, designed for a clean terminal-first workflow on AMD systems.
If you work with AMD GPUs and want a fast, readable monitoring tool, check it out:
https://github.com/nrhevu/roctop
#ROCm #AMD #GPU #Python #OpenSource

https://preview.redd.it/u777y225ifah1.png?width=2610&format=png&auto=webp&s=c36cd5571deaa9208793dea1ecc58a90170b4577

u/Rhev-2001 — 6 days ago

▲ 6 r/ROCm+1 crossposts

GTX 980 4GB or RX 580 8GB for running AI models locally?

I am going to buy a budget GPU. The RX 580 8GB and the GTX 980 4GB are about the same price and performance.

The RX 580 8GB has the advantage of 4 GB more VRAM; however, the GTX 980 has CUDA support, which (from what I've read) performs much better.

So, which to choose? The exact model I am going to be using is mdx-q (a vocal remover).

*Note: I am not living in the US so the prices are very different.

u/Budget_Astronaut_956 — 7 days ago

▲ 162 r/ROCm+1 crossposts

Dual GPU Build - 2x R9700

If you’re wondering, can NR200 fit 2xGPU? Yes, it can.

My hardware setup includes:

- Minisforum BD795i SE motherboard with an AMD Ryzen 9 7945HX processor and Noctua NF-A12x15 PWM chromax.black.swap fan.
- 2x ASUS Turbo Radeon AI Pro R9700 GPUs with 32GB VRAM (64GB total).
- ASUS ROG Loki SFX-L 1200W Titanium PSU
- M.2 to 10GbE AQC107 Ethernet adapter.
- TISHRIC PCIe 5.0 Gen5 x16 to Dual MCIO SFF-1016 adapter riser card with Bifurcation 128Gbps 8x/8x.
- 2x Crucial 32GB 5600MHz SODIMM DDR5 RAM (64GB total).
- Ubuntu 26.04, along with OpenClaw and LM Studio.

Both GPUs are running at full Gen5 8x speed. I added two Noctua NF-A9x14 PWM fans next to the side-mounted vertical GPU and limited the GPUs to 225W each. Under full load with qwen3.6-27B, the system remains relatively quiet and cool, with the GPUs staying around 70°C at most. The only reason I use power limits is noise; at full blast, the noise level is uncomfortable. I found that at 225W, the fan noise level remains comfortable and performance loss is minimal.

With the new ASUS R9700 BIOS, the GPU fans sit at 12% and are almost silent while idle.

The build is not yet complete. I plan to upgrade the 10GbE to a single 25GbE port and possibly swap the motherboard for the ASUS ROG STRIX X870-I GAMING WIFI with a Ryzen 9 9950X3D CPU, if I find a reliable AIO CPU cooler that fits at the top of the case. In that scenario, I will connect two more eGPUs via USB4.

u/Legitimate_Fold8314 — 11 days ago

▲ 2 r/ROCm

Hi, I'm looking for the best combination of ROCm/Vulkan/model for a 9070 XT 16GB for coding, and another one for software engineering and related tasks

Honestly, I'm annoyed that I can't buy a 9700 with 32 GB and run Qwen 35B with less quantization. I'm at over 30 tokens/sec with my current config and want to get to the most performant model possible.

Any links for someone on this specific journey, or anyone who can share additional info?

My ROCm is currently 7.13.

thanks!

u/aftasardemmuito — 8 days ago

▲ 2 r/ROCm

ROCm - Qwen3 TTS - Slow processing - help

I've been trying to use Qwen3-TTS on my AMD Radeon 9700 32GB. I've finally got to the point where the card seems to be used when generating audio. See the screenshot.

The problem is, it's no quicker than running it on the CPU. 2mins to generate 20 seconds of audio, way above what it should be.
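Expressed as a real-time factor (generation time divided by audio duration), the numbers above work out as follows. A trivial sketch; the function name is mine, and values below 1.0 would mean faster than real time.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of audio produced."""
    return generation_seconds / audio_seconds


# 2 minutes of generation for 20 seconds of audio, as reported above.
rtf = real_time_factor(120.0, 20.0)
print(f"RTF = {rtf:.1f}")
```

Tracking this one number across container or driver changes makes it easier to tell whether a tweak actually helped.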

I've been trying to troubleshoot it for days. When generation first starts (blue, at the first level in the screenshot), the GPU and VRAM seem to be used properly, but when GPU % rises to the next level at 100%, the memory clock drops to its base speed of 96 MHz. CPU usage also seems higher than it should be, even though GPU % is at max too.

I've shared my work in progress at: https://github.com/8perezm/esuyo-qwen3-tts-rocm

The docker files are where most of the magic happens:
https://github.com/8perezm/esuyo-qwen3-tts-rocm/blob/main/Dockerfile
https://github.com/8perezm/esuyo-qwen3-tts-rocm/blob/main/compose.yaml
https://github.com/8perezm/esuyo-qwen3-tts-rocm/blob/main/app/server.py

Does anyone have any ideas of alterations I can make? All the other images I've tried including voicebox don't work, so I decided to start from scratch.

https://preview.redd.it/0bgaczr7wh9h1.png?width=1102&format=png&auto=webp&s=87779b78934ce6a20decce3e0d946b9d0c773246

My test command:

python test_custom_voice.py --url "http://192.168.5.4:8001/v1/audio/speech" --text "It worked beautifully in narrow, well-defined domains. The most famous example is MYCIN, built at Stanford in the 1970s, which diagnosed bacterial infections and recommended antibiotics. In tests, it actually outperformed human doctors." --output "speech5.wav" --speaker "Ryan"

u/migsperez — 10 days ago

▲ 55 r/ROCm+2 crossposts

ROCm vs Vulkan vs vLLM on Dual R9700's

Just wanted to share these numbers I saw running Qwen3.6 35BA3 and Qwen3.6 27B and the big increase I saw going to vLLM. I was just expecting better concurrency but ended up with a lot better speeds.

llama.cpp services Running ROCm and Vulkan

| Model | Backend | Gen |
|---|---|---|
| 35B-A3B Q6_K_XL (MTP) | ROCm | ~106 t/s |
| 27B Q6_K_XL (MTP) | ROCm | ~44 t/s |
| 35B-A3B Q6_K_XL (MTP) | Vulkan | ~87 t/s |
| 27B Q6_K_XL (MTP) | Vulkan | ~41 t/s |

vLLM

| Model | Backend | Gen |
|---|---|---|
| 35B-A3B MoE FP8 (MTP) | ROCm + AITER | 156 t/s |
| 27B FP8 (MTP) | ROCm + AITER | 69 t/s |

**EDIT:** here are prefill speeds for 35B-A3B, since several people were asking:

Pulled these from vLLM logger.

| Prompt size | Prefill speed | Tokens ÷ TTFT |
|---|---|---|
| ~10K | ~10,000 tok/s | 10,033 ÷ 0.98s |
| ~40K | ~6,600 tok/s | 39,997 ÷ 6.0s |
| ~70K | ~5,500 tok/s | 70,027 ÷ 12.7s |
| ~100K | ~4,400 tok/s | 99,991 ÷ 22.9s |
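The prefill figures above are just prompt tokens divided by time to first token, which can be reproduced like this. The token counts and TTFT values are the ones from the table; the variable and function names are mine. Since TTFT can include scheduling overhead on top of the prefill itself, treat the result as an estimate.

```python
# (prompt tokens, time to first token in seconds), taken from the table above.
measurements = [
    (10_033, 0.98),
    (39_997, 6.0),
    (70_027, 12.7),
    (99_991, 22.9),
]


def prefill_tok_s(prompt_tokens: int, ttft_seconds: float) -> float:
    """Estimate prefill throughput as prompt tokens per second of TTFT."""
    return prompt_tokens / ttft_seconds


rates = [prefill_tok_s(tokens, ttft) for tokens, ttft in measurements]
for (tokens, ttft), rate in zip(measurements, rates):
    print(f"{tokens:>7} tokens / {ttft:>5.2f} s = {rate:,.0f} tok/s")
```

The per-token cost grows with prompt length, so the throughput drop from 10K to 100K is expected rather than a sign of a problem.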

I am curious what speeds others are seeing on Qwen3.6 35BA3 and 27B.

u/whodoneit1 — 14 days ago

▲ 22 r/ROCm+1 crossposts

ROCm 7.2 working on AMD Vega 8 (Ryzen 5700G); could also work on Vega 56/64

A few months ago, when I tried to squeeze the maximum from my AMD Vega 8 APU — Ryzen 5700G, I was not able to find the latest custom-baked ROCm for LLM anywhere, so I decided to build one for Linux — here is ROCm 7.2 working on AMD Vega 8: https://github.com/daimonionnn/amd-vega-rocm-vulkan-llm-toolkit

This can also work on Vega 56/64 (Vega 10), since it is the same architecture. Maybe only some minor config changes are needed.

Tested on Qwen35B and smaller Gemma4 models. It was better in prefill than Vulkan, but since then Vulkan has improved even in prefill. I did not have time to extensively test it on more LLM models, and the results are a mixture of older and newer ROCm, Vulkan, different settings, and different Ubuntu versions/Docker images. My plan was to test and optimize it on Vega 56/64 (Vega 10), but my only Vega 56 died some time ago — I shorted it badly when I started the PC with the graphics card not fully seated in the PCIe slot. I also recently upgraded to a new MOBO, CPU, and 2x Radeon 9700 AI Pro (Asus ProArt Z890 and Intel Core Ultra 5 250K Plus) and I'm not planning to develop/optimize this anymore, but any pull requests or forks for Vega 56/64 + PyTorch/ComfyUI support, optimization, benchmarks are welcome. See https://github.com/daimonionnn/amd-vega-rocm-vulkan-llm-toolkit/blob/main/docs/ARCHITECTURE.md for details.

u/Daimonionnnn — 10 days ago

▲ 1 r/ROCm

RDNA4 WSL2

Is WSL2 still not working on RDNA4?

u/DAMDMA — 12 days ago

▲ 0 r/ROCm

I need something as good as Claude Opus, is 24GB RX7900 XTX enough?

I really need a good coding agent. Like really really good, probably closer to Claude Fable but can't build something that good with budget. So, is this enough, close enough instead?

u/Emre-Y — 13 days ago