
RDNA2 consumer GPU owners: double your tok/s. You are missing out.
What's good everybody, I probably have the fastest possible setup on consumer-grade AMD Radeon RDNA2 GPUs with qwen3.6 35B: an RX 6800 (16 GB) and an RX 6700 XT (12 GB). Flash attention is not enabled for our cards, but with this workaround, you too can get the speed boost of flash attention.
tl;dr: Vulkan: 30 tok/s. Stock ROCm: doesn't run. This build: 70-80 tok/s.
Try it yourself:
https://github.com/Minerest/llama.cpp_RDNA2_FlashAttnEnabled/releases/tag/mtp-fa-workaround
If you try to run flash attention on ROCm with this hardware on a stock llama.cpp build, you will hit a wall:
GGML Flash Attention Crash (gfx1030/gfx1031)
GGML_ASSERT(max_blocks_per_sm > 0) failed
ggml/src/ggml-cuda/fattn-common.cuh:1054
Basically, HIP reports hipOccupancyMaxActiveBlocksPerMultiprocessor = 0, which is wrong. This build is working proof that we do, indeed, have the memory. I patched in a workaround that logs a message at the point where a stock build would have crashed. There are some technical findings on GitHub, but for the rest of you who just want a faster build, this is it.
Buyer beware: local AI on ROCm crashes often. Gemma crashes on bigger contexts with this build. DeepSeek ran very, very slowly. The only models I've confirmed working are qwen3.6 35B and 27B.
And for those who want the llama-server flags:
exec "$REPO/mtp-build/bin/llama-server" \
-m "$MODEL" \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
-fa on \
--no-mmproj \
-ngl 50 \
-ts 16,10 \
-c 64192 \
--parallel 1 \
--host 127.0.0.1 --port 8080
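If you want to check your own tok/s against the numbers above, here is a rough sketch. It assumes the server is up on 127.0.0.1:8080 and that the JSON it returns contains a "predicted_per_second" field (that is my understanding of llama-server's timings output; if your build names it differently, adjust the pattern). The extract_tok_per_sec helper is just a name I made up for this post, and the prompt is a placeholder.

```shell
#!/bin/sh
# Pull the generation speed (tokens per second) out of a llama-server
# JSON response read from stdin.
# Assumption: the response contains a numeric "predicted_per_second" field.
extract_tok_per_sec() {
  grep -oE '"predicted_per_second"[[:space:]]*:[[:space:]]*[0-9]+(\.[0-9]+)?' \
    | grep -oE '[0-9]+(\.[0-9]+)?$'
}

# Example usage against a running server (not executed here):
#   curl -s http://127.0.0.1:8080/completion \
#     -H 'Content-Type: application/json' \
#     -d '{"prompt": "Write a haiku about GPUs.", "n_predict": 128}' \
#     | extract_tok_per_sec
```

It only uses grep, so you don't need jq installed. Run it a few times, since the first request includes model warm-up.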
And finally, the llama.cpp build command, post patch:
cmake -S . -B build-instrumented \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_HIP=ON \
-DGPU_TARGETS="gfx1030;gfx1031" \
-DROCM_PATH=/usr \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_HIP_FLAGS="-DGGML_FATTN_TRACE"
cmake --build build-instrumented --target llama-bench -j6
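The GPU_TARGETS value above matches my two cards (the RX 6800 is gfx1030, the RX 6700 XT is gfx1031). If you are not sure what your cards report, something like this should tell you. It is a small sketch that assumes rocminfo from your ROCm install is on your PATH and prints agent names containing the gfx target; list_gfx_targets is a made-up helper name.

```shell
#!/bin/sh
# List the unique gfx targets found in rocminfo-style text read from stdin.
list_gfx_targets() {
  grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Example usage (not executed here):
#   rocminfo | list_gfx_targets
```

Whatever it prints is what goes into -DGPU_TARGETS, separated by semicolons. I have only tested this build on gfx1030 and gfx1031.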