r/ROCm

▲ 11 r/ROCm+1 crossposts

AMD AI Pro 9700 — anyone using MTP?

With my 32GB R9700 using llama.cpp and Vulkan on Qwen3.6 35B Q4 or Q5 I’ve been happy getting 115-130 tps, which is much better than my RTX 4070 Super which spills over too much. For the price the 9700 seems, for me, to be a better choice than a used 24GB 3090. Has anyone been able to do MTP on a 9700 with that Qwen model or similar?

u/WSTangoDelta — 1 day ago
▲ 4 r/ROCm

9070 help with ROCm

So I have a 9070 and I've recently been trying to run local LLMs like Qwen 2.5 32B in LM Studio and Ollama on Windows 11. It keeps defaulting to my CPU and doesn't even detect my GPU. Does anyone know why this is happening? I think it has to do with drivers or ROCm support, so does anyone have a workaround?

u/iggy_btd — 1 day ago
▲ 5 r/ROCm

ROCm and ComfyUI inside Docker or Podman

I have tried my best to get ROCm and ComfyUI running inside Docker or Podman, but so far to no avail.
Is there anyone out there who has a good guide or script that I can look at?

I am running Fedora 44 and want to run ROCm 7.2 inside a container together with ComfyUI.

u/druidican — 1 day ago
▲ 37 r/ROCm+1 crossposts

MTP in llama.cpp (PR #22673) tested on AMD Strix Halo: Qwen 3.6 35B-A3B hits 71 t/s short / 48 t/s at 62K via Vulkan RADV

Llama.cpp merged PR #22673 last week with MTP support. Three days later unsloth shipped Qwen 3.6 35B-A3B-MTP-GGUF. Today I swapped the vision endpoint on my Strix Halo box. Sharing because the numbers honestly surprised me.

Same hardware. Measurements:

Gemma 4 26B-A4B Q8 (before):

- 41 t/s short ctx

- 36 t/s at 22K

- 66 MiB KV per 1K tokens (SWA)

- 96K practical ceiling

Qwen 3.6 35B-A3B Q6_K + MTP-2 (now):

- 71 t/s short

- 48 t/s sustained at 62K (2200+ tokens in one decode)

- 2 MiB KV per 1K (Gated DeltaNet, linear attention in select layers)

- Running native 256K ctx, nowhere near hitting the memory wall

- MTP accept rate 86% average, peak 96.7%

+60-90% to generation speed. KV 15x more compact. Multimodal still works (mmproj-F16 in the same repo), tool calling works, thinking mode works. Nothing to build manually, just the stock kyuz0/amd-strix-halo-toolboxes:vulkan-radv image with llama.cpp master.

Hardware: AMD Ryzen AI Max+ 395, 128 GB UMA, Radeon 8060S gfx1151, Vulkan RADV backend.

The actual surprise was DeltaNet, not MTP. I assumed MTP was doing all the heavy lifting, but on long context most of the win comes from DeltaNet. Gemma's SWA falls off a cliff past 30K. Qwen stays almost flat. At 62K it loses about a third, not half.
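
A rough sketch of the KV-cache arithmetic behind the figures above, assuming the per-1K-token costs quoted in the post scale linearly with context length (that linearity and the function name are assumptions, not something measured here). Taken at face value the two per-1K figures differ by 33x rather than 15x, so the 2 MiB value is presumably rounded or the comparison uses a different accounting.

```python
def kv_cache_mib(context_tokens: int, mib_per_1k: float) -> float:
    """Estimated KV-cache size in MiB, assuming linear growth with context."""
    return context_tokens / 1024 * mib_per_1k


GEMMA_MIB_PER_1K = 66.0  # Gemma 4 26B-A4B figure quoted in the post
QWEN_MIB_PER_1K = 2.0    # Qwen 3.6 35B-A3B figure quoted in the post

gemma_96k = kv_cache_mib(96 * 1024, GEMMA_MIB_PER_1K)   # at the 96K practical ceiling
qwen_256k = kv_cache_mib(256 * 1024, QWEN_MIB_PER_1K)   # at the native 256K context
ratio = GEMMA_MIB_PER_1K / QWEN_MIB_PER_1K

print(f"Gemma @ 96K : {gemma_96k:.0f} MiB")
print(f"Qwen  @ 256K: {qwen_256k:.0f} MiB")
print(f"Per-token ratio: {ratio:.0f}x")
```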

#LocalLLM #StrixHalo #LlamaCpp #Qwen

u/voStragaIT — 3 days ago
▲ 16 r/ROCm

7900 XTX fp16/bf16 pytorch matmul performance

I cannot find a proper source for dense fp16 throughput (with fp32 accumulation) on the 7900 XTX, and I cannot find anywhere to rent one. Could someone who owns a 7900 XTX run this torch benchmark script and report the metrics? If you have uv, you should be able to just run "uv run script.py":

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "torch"
# ]
# ///


# Run with "uv run script.py" (or whatever filename you saved this as)


import time
import torch
import warnings
warnings.filterwarnings("ignore", category=UserWarning)


# Matrix size and benchmark parameters
N = 4096
FLOPS = N*N*N*2  # For GEMM operations
warmup = 10
iterations = 512
cooldown = 1
mem_size_gb = 1.0
mem_warmup = 5
mem_iterations = 32


def get_gpu_info():
    """Get GPU model name and other details"""
    if torch.cuda.is_available():
        gpu_name = torch.cuda.get_device_name(0)
        gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
        return f"{gpu_name} ({gpu_mem:.2f} GB)"
    return "No GPU detected"


def run_compute_benchmark(dtype_name):
    """Run a compute benchmark with high precision mode and specified data type"""
    torch.cuda.empty_cache()
    torch.set_float32_matmul_precision('high')  # Use TF32 for float32
    
    dtype = getattr(torch, dtype_name)
    
    # Create random matrices
    b = torch.rand((N, N), dtype=dtype, device="cuda")
    c = torch.rand((N, N), dtype=dtype, device="cuda")
    
    # Warmup
    for _ in range(warmup):
        a = b @ c
        torch.cuda.synchronize()
    
    # Benchmark
    times = []
    for _ in range(iterations):
        st = time.perf_counter()
        a = b @ c
        torch.cuda.synchronize()
        times.append(time.perf_counter() - st)
    
    # Calculate performance
    tm = min(times)
    tflops = FLOPS * 1e-12 / tm
    
    print(f"{dtype_name:10s}: {tm*1e6:8.2f} μs, {tflops:7.2f} TFLOPS")
    
    # Cooldown period
    time.sleep(cooldown)
    
    return tflops


def run_amp_benchmark():
    """Run benchmark with Automatic Mixed Precision"""
    torch.cuda.empty_cache()
    torch.set_float32_matmul_precision('high')
    
    # Create FP32 tensors
    b = torch.rand((N, N), dtype=torch.float32, device="cuda")
    c = torch.rand((N, N), dtype=torch.float32, device="cuda")
    
    # Warmup
    for _ in range(warmup):
        with torch.amp.autocast(device_type='cuda'):
            a = b @ c
        torch.cuda.synchronize()
    
    # Benchmark
    times = []
    for _ in range(iterations):
        st = time.perf_counter()
        with torch.amp.autocast(device_type='cuda'):
            a = b @ c
        torch.cuda.synchronize()
        times.append(time.perf_counter() - st)
    
    # Calculate performance
    tm = min(times)
    tflops = FLOPS * 1e-12 / tm
    
    print(f"{'amp':10s}: {tm*1e6:8.2f} μs, {tflops:7.2f} TFLOPS")
    
    # Cooldown period
    time.sleep(cooldown)
    
    return tflops


def measure_memory_bandwidth():
    """Measure memory bandwidth in GB/s using tensor operations"""
    torch.cuda.empty_cache()
    
    # Calculate tensor size to match desired memory usage
    num_elements = int(mem_size_gb * 1e9 / 4)  # 4 bytes per float
    
    # For memory bandwidth testing, use flat vectors to ensure
    # contiguous memory access patterns
    x = torch.ones(num_elements, dtype=torch.float32, device="cuda")
    y = torch.ones(num_elements, dtype=torch.float32, device="cuda")
    
    # Bytes moved in each test (read x, y, write z)
    bytes_per_iter = num_elements * 4 * 3  # 3 = 2 reads + 1 write
    
    # Warmup
    for _ in range(mem_warmup):
        z = x + y
        torch.cuda.synchronize()
    
    # Benchmark
    times = []
    for _ in range(mem_iterations):
        torch.cuda.synchronize()
        st = time.perf_counter()
        z = x + y
        torch.cuda.synchronize()
        times.append(time.perf_counter() - st)
    
    # Calculate bandwidth
    tm = min(times)
    bandwidth_gbps = bytes_per_iter / tm / 1e9
    
    print(f"\nMemory Bandwidth Test ({mem_size_gb:.1f} GB tensor)")
    print(f"Vector Addition: {bandwidth_gbps:.2f} GB/s")
    
    # Additional memory test: copy operation
    times = []
    for _ in range(mem_iterations):
        torch.cuda.synchronize()
        st = time.perf_counter()
        z = x.clone()
        torch.cuda.synchronize()
        times.append(time.perf_counter() - st)
    
    # Calculate bandwidth (copy is 1 read + 1 write)
    tm = min(times)
    memcpy_bandwidth_gbps = (num_elements * 4 * 2) / tm / 1e9
    
    print(f"Memory Copy:     {memcpy_bandwidth_gbps:.2f} GB/s")


def measure_cpu_gpu_transfer():
    """Measure CPU<->GPU transfer speed in GB/s"""
    torch.cuda.empty_cache()
    
    # Use half the memory size for transfer tests to avoid OOM
    transfer_size_gb = mem_size_gb / 2
    num_elements = int(transfer_size_gb * 1e9 / 4)  # 4 bytes per float
    
    # Create CPU tensor
    x_cpu = torch.ones(num_elements, dtype=torch.float32)
    
    # Warmup
    for _ in range(mem_warmup):
        x_gpu = x_cpu.cuda()
        torch.cuda.synchronize()
        x_back = x_gpu.cpu()
    
    # CPU -> GPU transfer
    times_to_gpu = []
    for _ in range(mem_iterations):
        torch.cuda.synchronize()
        st = time.perf_counter()
        x_gpu = x_cpu.cuda()
        torch.cuda.synchronize()
        times_to_gpu.append(time.perf_counter() - st)
    
    # GPU -> CPU transfer
    times_to_cpu = []
    for _ in range(mem_iterations):
        torch.cuda.synchronize()
        st = time.perf_counter()
        x_back = x_gpu.cpu()
        # No synchronize needed for CPU operations
        times_to_cpu.append(time.perf_counter() - st)
    
    # Calculate bandwidth
    tm_to_gpu = min(times_to_gpu)
    tm_to_cpu = min(times_to_cpu)
    
    bytes_transferred = num_elements * 4
    to_gpu_gbps = bytes_transferred / tm_to_gpu / 1e9
    to_cpu_gbps = bytes_transferred / tm_to_cpu / 1e9
    
    print(f"\nCPU<->GPU Transfer Test ({transfer_size_gb:.1f} GB tensor)")
    print(f"CPU -> GPU:      {to_gpu_gbps:.2f} GB/s")
    print(f"GPU -> CPU:      {to_cpu_gbps:.2f} GB/s")


def main():
    # Print header information first
    print(f"GPU: {get_gpu_info()}")
    print(f"Matrix Size: {N}x{N} ({N*N*4/1e9:.2f} GB per matrix)")
    print("=" * 60)
    
    # Compute benchmarks
    print("Matrix Multiplication Performance:")
    for dtype in ["float32", "float16", "bfloat16"]:
        try:
            run_compute_benchmark(dtype)
        except Exception as e:
            print(f"Error testing {dtype}: {e}")
    
    try:
        run_amp_benchmark()
    except Exception as e:
        print(f"Error testing AMP: {e}")
    
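    # Memory bandwidth benchmarks
    try:
        measure_memory_bandwidth()
    except Exception as e:
        print(f"Error in memory bandwidth test: {e}")

    # Host <-> device transfer benchmarks (defined above but previously never called)
    try:
        measure_cpu_gpu_transfer()
    except Exception as e:
        print(f"Error in CPU<->GPU transfer test: {e}")
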
if __name__ == "__main__":
    main()
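
For anyone sanity-checking reported numbers by hand, here is a small sketch of the conversion the script performs: an N x N x N matmul is counted as 2*N^3 floating-point operations, divided by the best observed time. The 2.0 ms timing below is a made-up placeholder, not a measurement from any card.

```python
def gemm_tflops(n: int, seconds: float) -> float:
    """TFLOPS for one n x n x n matmul: n**3 multiply-adds, 2 FLOPs each."""
    flops = 2 * n ** 3
    return flops / seconds / 1e12


print(f"{2 * 4096 ** 3:,} FLOPs per 4096^3 GEMM")
# Hypothetical timing, only to show the conversion.
print(f"{gemm_tflops(4096, 2.0e-3):.2f} TFLOPS at 2.0 ms per matmul")
```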
u/cyberuser42 — 3 days ago
▲ 5 r/ROCm

TurboQuant with 16 GB VRAM?

I've got Qwen3.6-27B IQ4_XS (14.7 GB, cHunter789's build) on an RX 7800 XT with ROCm 7.1. Display on iGPU, full 16 GB available for compute. Currently running 64K context with q8_0/q4_0 KV cache and ~915 MiB to spare.

Tried domvox/llama.cpp-turboquant-hip, but it OOMs at 512 tokens, the fixed overhead from codebooks and lookup tables alone blows past 16 GB. Now that I've freed ~600 MB by switching quants, I have ~1.6 GB headroom before KV allocation.

Anyone found a way to reduce TurboQuant's fixed VRAM cost, or gotten it working on a 16 GB card with a large model? Or is it just fundamentally designed for cards with more headroom?

u/Haunting-Stretch8069 — 3 days ago
▲ 5 r/ROCm

Building First AI/LLM PC With Dual 9070 XT GPUs – Any ROCm or AMD Issues I Should Know About?

I wanted to build a PC for a long time, but prices always stopped me. Recently I read that prices may not come down anytime soon, so I decided to just build it now.

My main use case will be running local AI models, along with occasional gaming, some Python/Go automation coding, and research automation. I want to process a lot of PDF and XLS documents using tools like Docling, Granite, and OCR models, then store the chunked data in Qdrant or another vector DB.

I am new to this area, but I work as a cloud engineer and I am comfortable setting up technical systems and learning new things quickly.

Right now I am planning this build:

  • 2 × 9070 XT GPUs
  • Ryzen 9950X CPU
  • 64GB 6000MHz RAM
  • 1200W PSU
  • Proper case and cooling

My main concern is that I am choosing AMD GPUs over NVIDIA, because AMD gives more VRAM for the price.

Before I buy everything, I wanted to ask:

  • Will this setup work well for local AI workloads?
  • Are there any major ROCm or AMD GPU problems I should know about?
  • Anything important I should be careful about before building?

Any advice would help.

u/AnmolLFC — 4 days ago
▲ 26 r/ROCm+3 crossposts

fine-tuning 27B hybrid models on strix halo (ryzen ai max+ 395 / gfx1151, 128 gb unified memory) — full guide, patches, orchestrator

Sharing a guide I just published for fine-tuning 27B+ LLMs on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S / gfx1151, 128 GB unified memory). MIT licensed.

Repo: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide

None of the individual pieces are novel — kernel patches, ROCm 7.13 nightly, FLA, bitsandbytes, LoRA, llama.cpp. The intersection (Strix Halo + gfx1151 + FLA + Qwen3.5 hybrid at 27B) isn't documented anywhere I could find, and getting it stable took a lot of dead ends I'd rather other people skip.

Stack tested: kernel 6.19.14, PyTorch 2.11.0+rocm7.13.0a20260506, ROCm 7.13 nightly, FLA 0.5.1 patched, bitsandbytes 0.50.0.dev0 built from source for gfx1151, llama.cpp b867+. Hardware: Corsair AI Workstation 300 (Sixunited AXB35-02 board, BIOS 3.07).

Things the guide actually covers that I had to figure out the hard way:

  • PyPI bitsandbytes ships zero ROCm binaries. From-source build with -DROCM_VERSION=83, plus a runtime symlink libbitsandbytes_rocm83.so → libbitsandbytes_rocm713.so so bnb's HIP detection on PyTorch 2.10/2.11 stops complaining.
  • FLA's Triton kernels crash on gfx1151 (RDNA 3.5) with num_warps > 4 (Triton#5609) and a tl.cumsum + tl.sum codegen interaction (Triton#3017). Idempotent re-patch script included.
  • In-process Trainer eval at 27B / 8192 seq length is structurally broken on unified-memory APUs — either kernel TTM page allocation failure from fragmentation, or memory watchdog SIGKILL when free RAM drops under ~8 GB. Eval is moved out-of-process via a bash orchestrator aligned to save_steps, waiting for full GPU release between train and eval, with a JSONL trend log.
  • Mainline kernel .deb run-parts double-dir bug on Ubuntu 24.04+ leaves packages half-configured. Repack script included.
  • /srv perms regressing to 0750 mid-training breaks importlib.metadata path traversal and crashes TRL's create_model_card. Cron watchdog restoring 755.
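
As an illustration of the last item, a permissions watchdog can be very small. This is a minimal Python sketch of the idea only, not the script from the repo (which, per the bullet above, runs from cron); the function name and the temporary directory used in the demo are hypothetical.

```python
import os
import stat
import tempfile


def restore_mode(path: str, wanted: int = 0o755) -> bool:
    """Reset `path` to `wanted` permissions if they drifted. Returns True if changed."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    if current == wanted:
        return False
    os.chmod(path, wanted)
    return True


# Demo on a throwaway directory standing in for /srv.
with tempfile.TemporaryDirectory() as demo_dir:
    os.chmod(demo_dir, 0o750)                      # simulate the regression
    print("fixed:", restore_mode(demo_dir))        # first pass repairs the mode
    print("fixed again:", restore_mode(demo_dir))  # second pass is a no-op
```

Run periodically (for example from cron), a check like this keeps the directory traversable for the training process without touching anything when the mode is already correct.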

Verified result: in-progress production fine-tune of Qwen3.5-27B (hybrid, 16 full-attention + 48 GatedDeltaNet layers), bf16 LoRA r=128/α=256, eval rolling at 0.13 loss / 96.5% token accuracy, ~11 min/step, ~4-day total runtime.
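
Two quick numbers implied by that line, as a sketch. It assumes the conventional LoRA scaling of alpha / r (not a rank-stabilized variant) and treats the rounded "~11 min/step" and "~4-day" figures as exact, so the step count is only a ballpark.

```python
def lora_scale(alpha: float, rank: int) -> float:
    """Conventional LoRA scaling applied to the low-rank update: alpha / r."""
    return alpha / rank


def approx_steps(total_days: float, minutes_per_step: float) -> float:
    """Rough optimizer-step count implied by total runtime and time per step."""
    return total_days * 24 * 60 / minutes_per_step


print("LoRA scale:", lora_scale(256, 128))
print("Approx. steps:", round(approx_steps(4, 11)))
```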

Feedback and issues welcome, especially from people on different AXB35-02 boards or non-Corsair Strix Halo systems — I'd like to know what's board-specific vs. generic.

https://preview.redd.it/8i3ebs27h00h1.jpg?width=649&format=pjpg&auto=webp&s=1a4fe453e9e46c97b71a14b993b9536288169ca1

u/Outrageous_Bug_669 — 4 days ago
▲ 12 r/ROCm

Considering the R9700

I currently have a 7900 GRE, which is a great card, but I was considering an R9700. I've had the GRE for two years now and, from what I've read, we won't see RDNA5 for at least another year. The 32 GB of VRAM is also pretty tempting.

I've been tinkering with local LLMs for code use. I've got the Qwen3-Coder-30B-A3B running with Unsloth's Q4_K_XL quant with 65k context using Vulkan (ROCm seems kinda trash here).

I use Llama.cpp + Llama-Swap.

Other than that, I also run diffusion workloads - I2V with Wan 2.2 - and also game a little (definitely not as much as I used to though).

OS: Ubuntu 26.04

CPU: 7800X3D

RAM: 32GB DDR5

My concerns:

- Price. If the card were ~1 - 1.1k I'd be less hesitant, but it's currently sitting at ~1.4k. I don't know if I can justify that cost for essentially a 9070XT with 32GB VRAM.

- Noise. I've heard the blower-style fan can be quite loud. This is my main PC, so if it sounds like a jet engine is spooling up next to me that'd be a deal breaker.

R9700 owners - I'd love to have your insight on this.

EDIT: Well, I decided to go for it. Let's see how it goes.

u/DecentEscape228 — 5 days ago
▲ 50 r/ROCm+5 crossposts

New Asus Flow Z13 KJP Edition Laptop Purchased - Guidance Needed for Dev Env Setup

Good People.

I purchased a new Asus Flow Z13 KJP Edition Strix Halo laptop (which doubles as a tablet). It has 128 GB of unified memory, and I bought it specifically for running local LLMs. I have been in the Apple ecosystem and a macOS user for 18 years. However, as Mac Studio prices are beyond what I can afford, I have gone ahead with this instead. I have been doing web development for quite some time and I am comfortable trying out new things.

I would like some guidance on setting up this machine for development, specifically web and mobile apps, and for running local LLMs. I have bought a Windows 11 Pro license and upgraded to it, but I would also like to dual boot into Ubuntu or another Linux distribution that is optimised for performance and has broad driver support and easily available development packages. The idea is not to spend a lot of time tinkering with or learning the Linux side of things, but to optimise the machine for running local LLMs.

Coming from macOS to Windows and Linux, I would also like your input on anything in particular that would help me on this journey.

Please do share your input and thoughts in the comments, not just on what I am asking about but on anything that might help me. I would really appreciate it.

Thank you.

u/bmanojk — 6 days ago
▲ 1 r/ROCm

Quad MI50 setup - weird behaviour

Hi all, I have a quad MI50 32GB setup on the same motherboard (an old Supermicro X10DRX with 8 PCIe 3.0 x8 slots, not ideal but I'm experimenting).

I'm using llama.cpp in docker with images from:

https://github.com/mixa3607/ML-gfx906

but also tried other ones with the same results.

This is what happens. It seems like my GPUs are grouped into two groups, 0+1 and 2+3. If I stick to just one of these groups, llama.cpp (but also Ollama) works fine. If I use the full quad GPU (so 0+1+2+3) or if I mix the groups (like 0+2, 0+3, 1+2, 1+3) I get:

ggml_cuda_compute_forward: SCALE failed
current device: 0, in function ggml_cuda_compute_forward at /build/llamacpp/ggml/src/ggml-cuda/ggml-cuda.cu:3114

and a bunch of trace-back messages:

[40461] libggml-base.so.0(+0x1addb)[0x7bf09bcbaddb]
[40461] libggml-base.so.0(ggml_print_backtrace+0x21c)[0x7bf09bcbb25c]
[40461] libggml-base.so.0(ggml_abort+0x15b)[0x7bf09bcbb43b]
[40461] /app/libggml-hip.so(+0x27f262)[0x7bf097ef7262]
[40461] /app/libggml-hip.so(+0x28a534)[0x7bf097f02534]
[40461] /app/libggml-hip.so(+0x2862a1)[0x7bf097efe2a1]
[40461] libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x817)[0x7bf09bcd88c7]
[40461] libllama.so.0(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x7bf09be38a31]
[40461] libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x114)[0x7bf09be3b124]
[40461] libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x390)[0x7bf09be42630]
[40461] libllama.so.0(llama_decode+0xf)[0x7bf09be440ff]
[40461] libllama-common.so.0(_Z23common_init_from_paramsR13common_params+0x3ff)[0x7bf09c35e93f]
[40461] /app/llama-server(+0x11b668)[0x63bdd6185668]
[40461] /app/llama-server(+0x6bc41)[0x63bdd60d5c41]
[40461] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7bf09b71c1ca]
[40461] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7bf09b71c28b]
[40461] /app/llama-server(+0x6c875)[0x63bdd60d6875]

I can also launch two Docker containers in parallel, each assigned to one of the two groups, and they work flawlessly, so I'm ruling out motherboard problems.

I'm using ROCm 6.3.3. For llama, this is what I get:

ggml_cuda_init: found 2 ROCm devices (Total VRAM: 65504 MiB):
 Device 0: AMD Radeon Graphics, gfx906:sramecc-:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
 Device 1: AMD Radeon Graphics, gfx906:sramecc-:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
load_backend: loaded ROCm backend from /app/libggml-hip.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
version: 1 (81b0d88)
built with GNU 13.3.0 for Linux x86_64

Any ideas?
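
One way to narrow this down is to test every cross-group device pair in isolation. A minimal sketch, assuming the 0+1 / 2+3 grouping described above, that lists the pairs to try and a `HIP_VISIBLE_DEVICES` value for each run (the helper name is hypothetical; how the variable is passed into the container depends on the image):

```python
from itertools import combinations


def cross_group_pairs(groups):
    """Return device pairs whose members come from different groups."""
    group_of = {dev: gi for gi, devs in enumerate(groups) for dev in devs}
    devices = sorted(group_of)
    return [
        (a, b)
        for a, b in combinations(devices, 2)
        if group_of[a] != group_of[b]
    ]


pairs = cross_group_pairs([[0, 1], [2, 3]])
for a, b in pairs:
    # One llama.cpp run per line, restricted to exactly these two GPUs.
    print(f"HIP_VISIBLE_DEVICES={a},{b}")
```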

u/rhpk — 6 days ago
▲ 2 r/ROCm

9070 xt and 6800 ?

Has anybody combined these cards with ROCm or Vulkan to get 32 GB of VRAM in llama.cpp or LM Studio?

u/Brave_Load7620 — 5 days ago
▲ 5 r/ROCm

Running SDXL Image Generation on 9060 xt 16gb

Hello, I think this is the right place to ask.

I recently got a PC with a 9060 XT 16GB. I've tried using Stability Matrix to download and run ComfyUI and SD.Next with DirectML and ZLUDA, but so far it has not worked, and when it did work it used my CPU to generate images. After some more research I found that I can use ROCm to generate images.

So I'm asking whether there is a guide for running ComfyUI on my 9060 XT. Thanks.

u/CamperSlayer69 — 6 days ago
▲ 126 r/ROCm+17 crossposts

Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline

Shipped this for the AMD x lablab hackathon. Attached video is one of the actual reels the pipeline produced - one English sentence in, finished mp4 with characters, story, music, and voice-over out. ~45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT.

Pipeline (8 stages, all sequential on the same GPU):

  1. Director Agent - Qwen3.5-35B-A3B (vLLM + AITER MoE) plans 6 shots from one sentence, returns structured JSON with character bibles, shot prompts, music brief, per-shot voice-over script, narration language
  2. Character masters - FLUX.2 [klein] paints one canonical portrait per character. No LoRA training step - reference editing pins identity across shots by construction
  3. Per-shot keyframes - FLUX.2 again with reference image. Sub-second per keyframe after warmup
  4. Animation - Wan2.2-I2V-A14B, 81 frames @ 16 fps native. FLF2V for cut:false continuation arcs (last frame of shot N anchors first frame of shot N+1)
  5. Vision critic - same Qwen3.5-35B reloaded with 10 structured failure labels (character drift, extras invade frame, camera ignored, walking backwards, object morphing, hand/finger artifact, wardrobe drift, neon glow leak, stylized AI look, random intimacy). Bad clips re-render with targeted retry strategies (different seed, FLF2V anchor, prompt simplification)
  6. Music - ACE-Step v1 generates a 30s instrumental from Director's brief
  7. Narration - Kokoro-82M, 9 languages. Director picks language to match setting (Tokyo→Japanese, Paris→French, Mumbai→Hindi)
  8. Mix - ffmpeg with per-shot vo aligned via adelay
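
The mix stage can be sketched as a generated ffmpeg filter graph. This is an illustration under stated assumptions, not the repo's code: it assumes shots play back to back at 81 frames @ 16 fps each, that input 0 is the video and inputs 1..N are the per-shot voice-over files, and the function name and stream labels are hypothetical. `adelay` and `amix` are standard ffmpeg audio filters.

```python
def narration_filter(num_shots: int, frames_per_shot: int = 81, fps: int = 16) -> str:
    """Build an ffmpeg filter_complex that delays each voice-over to its shot start."""
    shot_ms = frames_per_shot * 1000 / fps  # 81 frames @ 16 fps = 5062.5 ms
    parts = []
    labels = []
    for i in range(num_shots):
        delay = int(i * shot_ms)  # adelay takes milliseconds; truncate to whole ms
        # Input 0 is assumed to be the video; voice-over i is input i + 1.
        parts.append(f"[{i + 1}:a]adelay=delays={delay}:all=1[vo{i}]")
        labels.append(f"[vo{i}]")
    # Sum the delayed voice-overs into a single narration track.
    parts.append(f"{''.join(labels)}amix=inputs={num_shots}[narration]")
    return ";".join(parts)


print(narration_filter(3))
```

The resulting string would be passed to `ffmpeg -filter_complex`, with `[narration]` then mixed against the music track.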

Wan 2.2 specifics (the bit this sub will care about):

  • 1280×720, not 640×640 default. Costs more but matches what producers want
  • 121 frames at 24 fps was my first attempt - gave temporal rippling. Switched to 81 @ 16 fps native (the distribution Wan was trained on) and it cleaned up
  • flow_shift = 5 for hero shots, 8 for b-roll (upstream wan_i2v_A14B.py defaults)
  • Negative prompt: verbatim Chinese trained negative from shared_config.py. umT5 was multilingual-pretrained against those exact tokens. English translation is observably weaker
  • Camera language: ONE camera verb per shot, sentence-case, placed first ("Tracking shot following from behind"). Multiple verbs in one prompt cancel each other out
  • Avoid the word "cinematic" - triggers Wan's stylization branch, gives the AI look. Use lens/film tags instead ("Arri Alexa, anamorphic, 35mm film grain")

Performance work:

  • ParaAttention FBCache (lossless 2× on Wan2.2)
  • torch.compile on transformer_2 (selective, the dual-expert MoE makes full compile flaky) - another 1.2×
  • AITER MoE acceleration on Qwen director (vLLM)
  • End-to-end: 25.9 min → 10.4 min per 720p clip on MI300X
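
A quick consistency check on those figures, as a sketch. It assumes the FBCache and torch.compile gains multiply independently, which is a simplification, and it ignores the AITER item since that applies to the director model rather than the clip render.

```python
fbcache_gain = 2.0   # ParaAttention FBCache, as quoted above
compile_gain = 1.2   # selective torch.compile, as quoted above

expected = fbcache_gain * compile_gain  # combined speedup if the gains compose
measured = 25.9 / 10.4                  # end-to-end minutes per 720p clip

print(f"expected from components: {expected:.2f}x")
print(f"measured end-to-end:      {measured:.2f}x")
```

The two land close to each other, so the end-to-end number is roughly what the individual gains predict.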

Why a single MI300X: 192 GB HBM3 lets a 35B MoE, 4B diffusion, 14B I2V MoE, 3.5B music, and a TTS share the same card sequentially. Same stack on a 24 GB consumer GPU would need 4-5 boxes wired together.

Code (public, Apache 2.0): https://github.com/bladedevoff/studiomi300

Hugging Face (documentation, like this space 🙏) https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300

Live demo on HF Space is temporarily offline while infra restores - should be back within hours. In the meantime the showcase reels in the repo are real pipeline outputs, no human re-edited shots.

Happy to dig into AITER MoE setup, FBCache tuning, FLF2V anchoring, or the vision critic's failure taxonomy in comments.

u/Inevitable-Log5414 — 8 days ago
▲ 10 r/ROCm

OS choice for AMD GPUs? (Fedora vs. Ubuntu)

I see Ubuntu mentioned most frequently in relation to ROCm and AMD cards, but I'm somewhat partial to Fedora. Would that be asking for trouble?

Does single-vs-multiple GPUs (R9700) affect the calculation?

Thanks.

u/ShadowyTreeline — 7 days ago
▲ 2 r/ROCm

Rx9060xt 16gb vs RTX5060 8gb

Hello everyone,

I need opinions. In my country, a new RTX 5060 8GB costs almost $350, a new RX 9060 XT 16GB costs almost $440, and a new RTX 5060 Ti 16GB costs almost $585. I am planning to buy a GPU for ML training and inference, and I am a little confused. I know that CUDA is much more mature than ROCm, and I don't have the budget for the RTX 5060 Ti 16GB, so I am torn between the 5060 and the 9060 XT. The 9060 XT has more VRAM, but the 5060 has better ML support. I will be training CNNs and small LLMs on a good amount of data. Which one should I choose? Is there any chance that ROCm will become better optimized for ML in the future?

u/Specialist-Zone-8296 — 7 days ago
▲ 84 r/ROCm

I got tired of hunting AMD GPU + AI configs across blog posts and Discord threads, so I built a curated index — rocmate

Every time I set up a new AI tool on my RX 7900 XTX, I spent hours digging through GitHub issues, outdated blog posts, and Discord threads just to find the right HSA_OVERRIDE value or the correct PyTorch ROCm wheel URL. Information exists, but it's scattered and rarely chip-specific.

So I built rocmate, a version-controlled compatibility index + CLI that tells you what works on your specific AMD GPU:

pip install rocmate
rocmate doctor        # check your system
rocmate show ollama   # see tested config for your chip
rocmate install ollama # install with correct ENV vars

Stable Diffusion WebUI, vLLM, Axolotl, ExLlamaV2) across 5 chip generations (gfx1100, gfx1101, gfx1102, gfx1030, gfx1034).

What I actually need from this community: configs for chips I don't own. If you have an RX 6700 (gfx1031), RX 5700 (gfx1010), or any RDNA1 card, and you've gotten any of these tools running, a 5-minute PR with your config would help everyone with the same hardware.
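
For context on the kind of value the index records, here is a sketch of how a gfx target name maps onto the `major.minor.stepping` form that `HSA_OVERRIDE_GFX_VERSION` expects. This is not rocmate's implementation; the helper name is hypothetical, it only handles all-numeric targets like the ones named in this post, and whether presenting a gfx1031 card as gfx1030 works for a given tool is exactly what a tested config would need to confirm.

```python
import os


def gfx_to_version(target: str) -> str:
    """Convert a target like 'gfx1030' to 'major.minor.stepping' (numeric targets only)."""
    digits = target.removeprefix("gfx")
    # The last two characters are minor and stepping; the rest is the major version.
    major, minor, stepping = digits[:-2], digits[-2], digits[-1]
    return f"{int(major)}.{int(minor)}.{int(stepping)}"


print(gfx_to_version("gfx1030"))
print(gfx_to_version("gfx1100"))

# Commonly reported workaround (an assumption here, not a tested config):
# present a gfx1031 card as gfx1030. This must be set before the ROCm runtime loads.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = gfx_to_version("gfx1030")
```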

GitHub: https://github.com/T0nd3/rocmate

PyPI: https://pypi.org/project/rocmate/

u/T0nd3 — 8 days ago
▲ 11 r/ROCm

Update on Isaac sim port

Hey everyone. I wanted to give a quick update on Void Compute and the mission to get Isaac Sim running natively on AMD hardware. I hit a massive wall with the original plan of full OptiX emulation. After digging into the telemetry logs and the RDNA 3 instruction set, it became clear that trying to emulate an NVIDIA proprietary compiler at the transistor level is just falling for their trap. It is slow and it creates a performance bottleneck that wastes the raw power of the 7800 XT.

I am officially pivoting the project architecture of Phase 2. We are no longer trying to be NVIDIA. We are building a compatibility middleware that bypasses the proprietary OptiX moat entirely. Instead of emulating the silicon, I am targeting the Hydra Render Delegate. This allows the engine to talk natively to AMD Ray Accelerators through the Vulkan RT API. By forcing the Vulkan delegate and using rasterization with a native ray tracing pipeline, we get hardware intersection testing without the CUDA overhead.

I am also ripping out the PhysX backend. PhysX is designed to gatekeep performance on non NVIDIA hardware. I am replacing it with the Newton Dynamics solver because it is backend agnostic and deterministic. For the final visual output, I am using Intel OIDN 2.2 with a HIP backend to handle denoising. This provides visual parity with OptiX while staying within the open standard ecosystem.

u/ChrisGamer5013 — 6 days ago
▲ 34 r/ROCm

SageAttention v2 native port running on RDNA4

https://github.com/thu-ml/SageAttention/pull/368

Please try it out and let me know here or on the PR if you face any problems or if you see any speedups. Thanks and enjoy!

Currently only tested on Windows, not Linux, but it should work on Linux with no or minimal changes.

On my side on Windows with a 9070XT when using comfyui with `--use-sage-attention` and running WAN2.1 1.4b (fp8_e4m3fn weight dtype), I'm seeing about a 42% speedup on the diffusion step times vs. `--use-pytorch-cross-attention`.

u/adyaman — 8 days ago