u/DiscipleofDeceit666

RDNA2 Consumer GPU, get double your tok/s. You are missing out.

What's good everybody, I probably have the fastest possible setup on consumer-grade AMD Radeon RDNA2 GPUs with Qwen3.6 35B: an RX 6800 (16 GB) and an RX 6700 XT (12 GB). Flash attention is not enabled for our cards, but with this workaround, you too can get the speed boost of flash attention.

tl;dr: Vulkan: 30 tok/s. Stock ROCm: doesn't run. This build: 70-80 tok/s.
Try it yourself.

https://github.com/Minerest/llama.cpp_RDNA2_FlashAttnEnabled/releases/tag/mtp-fa-workaround

If you guys try to run flash attention on ROCm with this hardware on a stock llama.cpp build, you will hit a wall.

GGML Flash Attention Crash (gfx1030/gfx1031)
GGML_ASSERT(max_blocks_per_sm > 0) failed
ggml/src/ggml-cuda/fattn-common.cuh:1054

Basically, HIP reports hipOccupancyMaxActiveBlocksPerMultiprocessor = 0, which is wrong. This build is working proof that we do, indeed, have memory. I patched in a workaround that logs at the point where you would have crashed instead of asserting. There are some technical findings on GitHub, but for the rest of you who just want a faster build, this is it.
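For anyone who wants the gist of the workaround without reading the diff, here is a minimal sketch of the idea in Python. It is not the actual patch (that lives in the C++/HIP source in the repo linked above and may work differently). The function name `safe_blocks_per_sm` and the fallback value of 1 are assumptions made up for illustration.

```python
import logging

logger = logging.getLogger("fattn_workaround")


def safe_blocks_per_sm(reported: int, fallback: int = 1) -> int:
    """Return a usable occupancy value instead of aborting on a bad one.

    `reported` stands in for whatever the HIP occupancy query returned.
    `fallback` is a hypothetical default, not the value used in the real patch.
    """
    if reported > 0:
        return reported
    # Log where the stock build would have hit the assert, then carry on.
    logger.warning(
        "occupancy query returned %d; falling back to %d", reported, fallback
    )
    return fallback
```

The point is only the shape of the fix: treat a zero from the occupancy query as a bad report, log it, and substitute a value the kernel launch can live with.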

Buyer beware: local AI on ROCm crashes often. Gemma crashes on bigger contexts with this build. DeepSeek ran very, very slowly. The only models I've confirmed working are Qwen3.6 35B and 27B.

And for those who want the llama-server flags:

exec "$REPO/mtp-build/bin/llama-server" \
-m "$MODEL" \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
-fa on \
--no-mmproj \
-ngl 50 \
-ts 16,10 \
-c 64192 \
--parallel 1 \
--host 127.0.0.1 --port 8080
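
A quick note on `-ts 16,10`: as far as I understand the llama.cpp docs, the tensor-split values are relative proportions per GPU, not gigabytes. A small Python sketch of what that ratio works out to (the helper name `split_fractions` is made up for this example):

```python
def split_fractions(ts: str) -> list[float]:
    """Turn a tensor-split string like "16,10" into per-GPU fractions."""
    parts = [float(p) for p in ts.split(",")]
    total = sum(parts)
    return [p / total for p in parts]


fractions = split_fractions("16,10")
print([round(f, 3) for f in fractions])  # [0.615, 0.385]
```

So roughly 62% of the offloaded model goes to the first card and 38% to the second.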

And finally, the llama.cpp build command, post patch:

cmake -S . -B build-instrumented \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_HIP=ON \
-DGPU_TARGETS="gfx1030;gfx1031" \
-DROCM_PATH=/usr \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_HIP_FLAGS="-DGGML_FATTN_TRACE"
cmake --build build-instrumented --target llama-bench -j6

u/DiscipleofDeceit666 — 4 days ago

▲ 8 r/LocalLLM

Got the cloud -> Local LLM code flow nailed. 10x token savings using qwen code CLI wrapper

Edit: Open sourced this tool. Check it out here: https://github.com/Minerest/lean-qwen

tl;dr: the 10x savings come from using your local hardware for writing code rather than Claude Code.

The only dependency right now is Qwen Code CLI, but soon it will be harness-agnostic. This is my first contribution to the open source community, vibed out.

The biggest challenge for local LLMs on consumer GPUs is context management. If we use them as agents, they’ll suck up all the context they can and become useless very quickly, so I needed a tool to manage that for me: something that gives me the chance to clear the context and fill it with the relevant information to finish the next task at hand. Been tinkering around and finally have a proof of concept that works! Like 10x cloud savings on actual coding tasks. Claude is claiming 5x, but most of the token burn it’s counting was just chit chat that didn’t need to happen.

Talk about the toml

Basically, a cloud model and I talk through a feature set and it creates a TOML file broken up into bite-sized tasks. That’s all the cloud token burn. Then I run a Python script that reads the TOML file and sends each task to Qwen Code CLI, the absolute best harness for the Qwen3.6 MoE. It does the heavy lifting of writing code, finding files, etc.

Built in unit testing

Once the code writing is done, the Python script runs the unit tests defined in the TOML. It passes, next task. It fails, the stack trace and relevant context get fed back into Qwen Code to fix itself. Lean like sizzurp.
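
The test-then-retry loop described above can be sketched roughly like this. It is a simplified illustration, not the tool's real code: `run_task`, `run_tests`, the `agent` callback (a stand-in for whatever sends a prompt to the coding harness), and the 2000-character cutoff are all assumptions.

```python
import subprocess
import sys
from typing import Callable


def run_tests(command: list[str]) -> tuple[bool, str]:
    """Run a test command; return (passed, combined stdout and stderr)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def run_task(
    prompt: str,
    test_cmd: list[str],
    agent: Callable[[str], object],
    max_attempts: int = 3,
) -> bool:
    """Send the task to the agent, then retry with the failure output."""
    message = prompt
    for _ in range(max_attempts):
        agent(message)
        passed, output = run_tests(test_cmd)
        if passed:
            return True
        # Keep only the tail of the output so the retry prompt stays small.
        message = (
            f"{prompt}\n\nThe tests failed with:\n{output[-2000:]}\n"
            "Please fix the code."
        )
    return False


# Demo with a stand-in agent and a "test suite" that always passes.
sent = []
ok = run_task(
    "Add a hello() function.",
    [sys.executable, "-c", "raise SystemExit(0)"],
    agent=sent.append,
)
print(ok, len(sent))  # True 1
```

Trimming the failure output before feeding it back is the part that matters for context management: the model sees the end of the stack trace, not the whole log.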

~~Optionally, on multiple test failures, it can kill the ai server and launch the 27B dense model to take from there. [this has been stripped]~~

The longer this feature flow, the more cloud savings you get.

I’m probably gonna open source this project for those of us who like to combine cloud and local AI. I’ve tested this with the Q4_K_M quant of Qwen3.6 35B A3B at 48k max context. I’ve got 28 GB of VRAM across two GPUs. The Qwen3.6 output quality has been validated by Claude in a real codebase: full-stack feature development broken into bite-sized tasks. LFG

Happy to answer any questions. Got my own llama.cpp build and everything.

u/DiscipleofDeceit666 — 6 days ago

▲ 16 r/DeepSeek+1 crossposts

Deepseek got my local AI server running at 50 tok/s! Custom llama cpp build and all

I'm hoping to use this model to subsidize the already subsidized DeepSeek model. Super excited that I went from something that doesn't work, even with the other guy's help, to something that runs. Wrote a code patch for llama.cpp to default to some value when a bad value is encountered. Crazy but very exciting stuff!

Hardware: dream team AMD
2600x

rx6800 16gb vram

rx 6700xt 12 gb vram

and a metric ton of ram

That being said, anyone know what terminal the DeepSeek TUI runs on? I get weird glitches that force me to kill the session and restart.

u/DiscipleofDeceit666 — 7 days ago

▲ 40 r/OpenSourceeAI+1 crossposts

Had the RX 6800 16 GB for a few years. Had fun running local things and decided to fork over an arm and a leg to boost myself up to 64 GB of RAM and 28 GB of VRAM with the addition of the 6700 XT.

RDNA2, come holler at me. I can run a 27B dense model at 10 tok/s output with quality work. But the real win is being able to load a mini model for ✨speculative decoding ✨ The way I understand it, it’s basically an autocomplete for your AI model. It costs about 1 GB of RAM and it boosted my writes from 10 to 15 tokens a second.
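
Here is a toy sketch of that "autocomplete" idea, in case it helps. It is a greedy, simplified version with made-up stand-in models (`draft` and `target` are plain Python functions here, not real LLMs): the small model guesses a few tokens ahead, the big model checks them, and the output is always exactly what the big model would have written on its own.

```python
from typing import Callable

Token = str
NextToken = Callable[[list[Token]], Token]


def speculative_step(
    context: list[Token], draft: NextToken, target: NextToken, k: int = 2
) -> list[Token]:
    """One round of greedy speculative decoding; returns the accepted tokens."""
    # 1. The small model guesses k tokens ahead.
    guesses: list[Token] = []
    for _ in range(k):
        guesses.append(draft(context + guesses))
    # 2. The big model checks each guess. A real implementation scores all
    #    positions in one batched pass; this toy calls it once per position.
    accepted: list[Token] = []
    for guess in guesses:
        correct = target(context + accepted)
        accepted.append(correct)
        if correct != guess:
            break  # first mismatch: keep the big model's token and stop
    else:
        # Every guess matched, so the same pass also yields one bonus token.
        accepted.append(target(context + accepted))
    return accepted


SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target(ctx: list[Token]) -> Token:
    """Stand-in big model: recites a fixed sentence."""
    return SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"


def draft(ctx: list[Token]) -> Token:
    """Stand-in small model: same sentence, but it gets one word wrong."""
    guess = target(ctx)
    return "cat" if guess == "fox" else guess


print(speculative_step([], draft, target))  # ['the', 'quick', 'brown']
print(speculative_step(["the", "quick", "brown"], draft, target))  # ['fox']
```

When the guesses are right you get several tokens for one big-model pass, and when they are wrong you still get the big model's token, which is where the speedup comes from.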

I’ve experimented with the new tensor parallelism setting, but it’s a bit slower than the normal layer thing I set up. Also, can’t compress the kv cache yet. Either way, the ceiling only goes up from here.

u/DiscipleofDeceit666 — 5 days ago

▲ 22 r/LocalLLM

I’m using an older RDNA2 card, and prior to today my months-old build had very spotty support for flash attention.

I just downloaded the latest release and started toying around with different models on my 16 GB VRAM GPU. Turns out, I can now use Gemma A4B and get speeds of like 60 tokens per second output. Time to first token is like 1 second even after sending it a big file. Might be worth putting something into a script that checks for, pulls, and installs the latest stable release from GitHub.
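
A rough sketch of the "check" half of such a script, using the GitHub REST API's latest-release endpoint. It assumes the upstream repo is `ggml-org/llama.cpp`. The helper names are made up for this example, and the download and install steps are left out because they depend on your platform.

```python
import json
import urllib.request

# Assumed upstream repo; swap in a fork if that is what you track.
LATEST_URL = "https://api.github.com/repos/ggml-org/llama.cpp/releases/latest"


def fetch_latest_tag(url: str = LATEST_URL) -> str:
    """Ask the GitHub REST API for the newest release tag (needs network)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["tag_name"]


def needs_update(installed_tag: str, latest_tag: str) -> bool:
    """True when the installed build is not the newest published release."""
    return installed_tag.strip() != latest_tag.strip()
```

Usage would be something like `needs_update(my_installed_tag, fetch_latest_tag())`, where `my_installed_tag` is whatever tag you recorded the last time you installed.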

I might be convinced to get a second GPU just for this cause. Support is moving so fast!


u/DiscipleofDeceit666 — 21 days ago

▲ 3 r/metalguitar

What’s up fellow circle pit enjoyers. When I was in high school, I was in a bunch of bands and it was dope. But real quick, we noticed that it was really hard to network with all the local talent and venues. How can we organize shows without diving deep into Facebook groups or Craigslist listings?

In comes http://www.bandeezy.com

A future exists where if you want to throw a house party, you’d just open up bandeezy and scroll through a list of local bands to invite to your show.

All the data in the app is faked and the images AI generated. My purpose here is to gauge interest before I start throwing money at this idea. I’m not here to make it rich, or even make money. I built this because I love live music. And I want to support the local scene.

I’d appreciate any feedback, like missing features. Click the "login as dev" button to log into the dev account. At the top, you can switch between a band user, a venue manager user, and just a regular fan.

Please don’t graffiti the comment section, database writes are active and uncensored.

This has been my dream in the making.


u/DiscipleofDeceit666 — 23 days ago
