u/Duviwin

Antirez DS4 Q2 on Strix : works, ~80 t/s prefill and ~7 t/s decode

Edit: By now it's actually already at ~220 tok/s prefill and ~14 tok/s decode without any measurable quality loss compared to my original measurements. See https://www.reddit.com/r/StrixHalo/s/kH7f3E4mAV

TLDR: the antirez/ds4 ROCm branch with the Q2 model works on a 128 GB Strix Halo, at ~80 tok/s prefill and ~7 tok/s decode.

Hi all, I searched this subreddit for someone who had tested antirez/ds4 on Strix Halo but couldn't find anything, so I gave openclaw+gpt5.5 the task of testing it on my machine and writing a post, to save you all some tokens.

Hardware / setup:

• Machine: Bosgame M5 / Strix Halo
• APU: AMD Ryzen AI Max+ 395 w/ Radeon 8060S
• RAM: 128 GB installed, Linux reports ~124 GiB total; I treated ~120 GB as the practical safety envelope
• Kernel: Linux 7.0.1
• Backend: ROCm/CUDA path from local ds4-rocm
• DS4 commit: 7a751eb with local ds4_cuda.cu modifications present
• Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
• Model file size: 86,720,111,488 bytes, about 80.76 GiB
• Benchmark command shape followed the upstream speed-bench README: https://github.com/antirez/ds4/blob/main/speed-bench/README.md
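The GiB figure for the model file follows directly from the byte count. A small sketch of the conversion, showing both the binary (GiB) and decimal (GB) reading, since the two units are easy to mix up when comparing against installed RAM:

```python
# Convert the GGUF file size quoted in the post into binary and decimal units.
MODEL_BYTES = 86_720_111_488  # file size as reported in the post

model_gib = MODEL_BYTES / 2**30   # GiB: 1024**3 bytes
model_gb = MODEL_BYTES / 10**9    # GB: 1000**3 bytes

print(f"{model_gib:.2f} GiB ({model_gb:.2f} GB)")
```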

Command used:

    ./ds4-bench \
      -m ds4flash.gguf \
      --prompt-file speed-bench/promessi_sposi.txt \
      --ctx-start 2048 \
      --ctx-max 65536 \
      --step-mul 2 \
      --gen-tokens 128 \
      --csv /tmp/ds4-bosgame-m5-65k.csv
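For reference, here is a sketch of which context sizes this command visits. It assumes the sweep is geometric (start at `--ctx-start`, multiply by `--step-mul`, stop after `--ctx-max`); that is inferred from the flag names and the six result rows, not taken from the ds4-bench source. The function name `context_sweep` is made up for illustration.

```python
def context_sweep(ctx_start: int, ctx_max: int, step_mul: int) -> list[int]:
    """Context sizes visited, assuming a geometric sweep up to and including ctx_max."""
    if ctx_start <= 0 or step_mul < 2:
        raise ValueError("ctx_start must be positive and step_mul at least 2")
    sizes = []
    ctx = ctx_start
    while ctx <= ctx_max:
        sizes.append(ctx)
        ctx *= step_mul
    return sizes

# Values taken from the command in the post.
sizes = context_sweep(ctx_start=2048, ctx_max=65536, step_mul=2)
print(sizes)
```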

Results:

| Context (tokens) | Prefill (t/s) | Decode (t/s) | Live KV payload (MiB) |
|---|---|---|---|
| 2,048 | 55.67 | 7.81 | 49.8 |
| 4,096 | 86.61 | 7.75 | 76.6 |
| 8,192 | 86.51 | 7.69 | 130.4 |
| 16,384 | 83.92 | 7.58 | 237.9 |
| 32,768 | 81.51 | 7.44 | 453.0 |
| 65,536 | 79.84 | 7.15 | 883.1 |
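Two things can be read off these rows with a little arithmetic: how much the live KV payload grows per added token, and how much decode speed drops across the sweep. A minimal sketch, with the numbers copied from the results above (the helper name `kv_kib_per_token` is just for illustration):

```python
# (context tokens, prefill t/s, decode t/s, live KV payload in MiB), from the results in the post
ROWS = [
    (2048, 55.67, 7.81, 49.8),
    (4096, 86.61, 7.75, 76.6),
    (8192, 86.51, 7.69, 130.4),
    (16384, 83.92, 7.58, 237.9),
    (32768, 81.51, 7.44, 453.0),
    (65536, 79.84, 7.15, 883.1),
]

def kv_kib_per_token(rows):
    """Incremental KV payload growth between consecutive rows, in KiB per added token."""
    growth = []
    for (ctx0, _, _, kv0), (ctx1, _, _, kv1) in zip(rows, rows[1:]):
        growth.append((kv1 - kv0) * 1024 / (ctx1 - ctx0))
    return growth

growth = kv_kib_per_token(ROWS)
decode_drop_pct = (ROWS[0][2] - ROWS[-1][2]) / ROWS[0][2] * 100

print("KV growth (KiB/token):", [round(g, 2) for g in growth])
print(f"Decode drop from 2k to 65k context: {decode_drop_pct:.1f}%")
```

On these numbers the live KV payload grows by roughly 13.4 KiB per token throughout the sweep, and decode loses a bit under 9% of its speed between the first and last row.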

So the short version is: yes, it runs on Strix Halo with the Q2 imatrix model. Prompt processing runs at roughly 80-87 t/s once the benchmark gets past the small first segment (55.67 t/s at 2,048 tokens). Decode is much slower, around 7-8 t/s, and drops gradually with longer context, from 7.81 t/s at 2,048 tokens to 7.15 t/s at 65,536.

Memory math before testing:

The Q2 model itself is about 80.76 GiB. DS4 reports its estimated context buffer allocation before running. On this backend, the estimates were:

• 65k context: ~1.28 GiB context buffers, ~82.05 GiB model + context buffers
• 131k: ~2.37 GiB, ~83.14 GiB total
• 262k: ~4.55 GiB, ~85.32 GiB total
• 524k: ~8.91 GiB, ~89.67 GiB total
• 1M: ~17.63 GiB, ~98.39 GiB total
• 2M: ~35.07 GiB, ~115.83 GiB total, but this is too close to the 120 GB practical envelope once OS/runtime overhead is included
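The totals in this list are just model size plus context buffers, so the fit check is easy to reproduce. The sketch below makes two assumptions that are not stated in the post: that "65k" through "2M" mean powers of two (65,536 to 2,097,152 tokens), and that the 120 GB envelope could be read either as GiB or as decimal GB, so both readings are checked.

```python
MODEL_GIB = 86_720_111_488 / 2**30  # model file size from the post, in GiB

# Context-buffer estimates quoted in the post, in GiB.
# Assumption: the context labels are powers of two.
CONTEXT_BUFFERS_GIB = {
    65_536: 1.28,
    131_072: 2.37,
    262_144: 4.55,
    524_288: 8.91,
    1_048_576: 17.63,
    2_097_152: 35.07,
}

# Assumption: the post says "120 GB"; both readings of the unit are checked.
ENVELOPE_IF_GIB = 120.0
ENVELOPE_IF_GB = 120e9 / 2**30  # about 111.8 GiB

def total_gib(buffers_gib: float) -> float:
    """Model weights plus context buffers, in GiB."""
    return MODEL_GIB + buffers_gib

for tokens, buf in CONTEXT_BUFFERS_GIB.items():
    total = total_gib(buf)
    print(f"{tokens:>9,} tokens: {total:7.2f} GiB total, "
          f"headroom {ENVELOPE_IF_GIB - total:6.2f} GiB (GiB reading), "
          f"{ENVELOPE_IF_GB - total:6.2f} GiB (GB reading)")

# Buffer growth per added token between consecutive estimates, in KiB.
sizes = sorted(CONTEXT_BUFFERS_GIB)
kib_per_token = [
    (CONTEXT_BUFFERS_GIB[b] - CONTEXT_BUFFERS_GIB[a]) * 2**20 / (b - a)
    for a, b in zip(sizes, sizes[1:])
]
print("Buffer growth (KiB/token):", [round(k, 2) for k in kib_per_token])
```

Under either reading 1M context leaves double-digit GiB of headroom, while 2M only fits if the envelope is taken as 120 GiB, and then by about 4 GiB. The estimates also grow at a near-constant rate per token, which suggests the buffers scale linearly with context on top of a small fixed part.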

Since DeepSeek V4 Flash is described as a 1M-token-context model anyway, my takeaway is that 1M context should be the realistic upper target for this 128 GB Strix Halo setup with Q2. I did not run a full 1M prefill benchmark because that would take hours, but the memory estimate says it should fit without getting near the dangerous edge. 2M is theoretically close to fitting by buffer math alone, but I would not treat it as safe on a 120 GB envelope.
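To put a rough number on "hours": a back-of-the-envelope estimate of a full 1M-token prefill, assuming throughput stays flat at the measured rate. That is optimistic for the original build, since prefill speed was already declining with context in the 65k sweep, and it assumes 1M means 1,048,576 tokens.

```python
def prefill_hours(tokens: int, tokens_per_second: float) -> float:
    """Time to prefill `tokens` at a constant rate, in hours."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second / 3600

ONE_M_TOKENS = 1_048_576  # assumption: "1M" as a power of two

# 79.84 t/s is the measured prefill rate at 65,536 tokens;
# 220 t/s is the later figure mentioned in the edit at the top of the post.
hours_original = prefill_hours(ONE_M_TOKENS, 79.84)
hours_updated = prefill_hours(ONE_M_TOKENS, 220.0)

print(f"At 79.84 t/s: {hours_original:.1f} h")
print(f"At 220 t/s:   {hours_updated:.1f} h")
```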

Limitations / caveats:

• This is DS4-specific. It is not a general GGUF runner.
• I tested the Q2 imatrix model, not Q4.
• I benchmarked up to 65k context for speed. Larger-context fit is based on DS4’s own context-buffer estimate plus model size, not a multi-hour 1M benchmark run.
• This was on a local ROCm branch/build, so upstream behavior may move quickly.

u/Duviwin — 17 days ago