u/Disastrous-Cat-7016

An optimized Qwopus3.6 27B v2 that consistently hits 30-40 tokens per second on STRIX HALO?

Yes!

CHADROCK QWOPUS3.6 27B v2 with MTP and Charlie12345's ROCmFP4 optimizations is now the smartest and fastest 27B I've ever used.

Highlights:

HumanEval:

• Chadrock: 159/164 = 96.95%
• Original Q5_K_M: 151/164 = 92.07%

HumanEval+:

• Chadrock: 155/164 = 94.51%
• Original Q5_K_M: 147/164 = 89.63%

That’s +8 tasks on HumanEval and +8 tasks on HumanEval+ compared with the recorded results for the original Qwopus3.6 27B v2 Q5_K_M.
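If you want to sanity-check the percentages and the +8 deltas yourself, here's a quick sketch. The pass counts are copied straight from the numbers above; the dictionary layout and the `pass_rate` helper are just names I made up for illustration, not part of any benchmark harness.

```python
# Pass counts copied from the figures above; 164 is the HumanEval task count.
TOTAL_TASKS = 164
passed = {
    "HumanEval": {"Chadrock": 159, "Original Q5_K_M": 151},
    "HumanEval+": {"Chadrock": 155, "Original Q5_K_M": 147},
}


def pass_rate(n_passed, total=TOTAL_TASKS):
    """Return the pass rate as a percentage rounded to two decimals."""
    return round(100 * n_passed / total, 2)


for bench, rows in passed.items():
    for model, n in rows.items():
        print(f"{bench:<11} {model:<16} {n}/{TOTAL_TASKS} = {pass_rate(n):.2f}%")
    # Difference in solved tasks between the two rows of this benchmark.
    delta = rows["Chadrock"] - rows["Original Q5_K_M"]
    print(f"{bench:<11} delta: +{delta} tasks")
```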

Speed:

• 164 HumanEval tasks
• 45,033 completion tokens
• 1346.8s cumulative request latency
• ~59 tok/s mean total-token request speed
• ~60 tok/s median total-token request speed
• ~33.44 tok/s completion-only llama.cpp eval speed
• ~37.14 tok/s peak active completion speed

The original Q5_K_M run recorded 3834s generation time. Chadrock completed the same 164-task codegen workload with ~2.8x lower recorded request-generation time while also scoring higher.
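For anyone who wants to check the arithmetic, a rough sketch follows. It assumes the completion-only figure is simply completion tokens divided by cumulative request latency (that ratio does land on ~33.44 tok/s), and that the ~2.8x is the ratio of the two recorded times. The variable names are mine. The ~59 tok/s total-token numbers can't be reproduced this way because they also depend on prompt token counts, which aren't listed here.

```python
# Figures copied from the Speed section above.
completion_tokens = 45_033
chadrock_latency_s = 1346.8      # cumulative request latency, Chadrock run
original_generation_s = 3834.0   # recorded generation time, original Q5_K_M run

# Assumption: completion-only speed = completion tokens / cumulative latency.
completion_tok_per_s = completion_tokens / chadrock_latency_s

# Ratio of recorded times for the same 164-task workload.
speedup = original_generation_s / chadrock_latency_s

print(f"completion-only speed: {completion_tok_per_s:.2f} tok/s")
print(f"time ratio (original / Chadrock): {speedup:.2f}x")
```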

BFCL v4 non-live tool calling:

• Overall: 85.88%
• Simple Python AST: 94.50%
• Multiple-call AST: 96.00%
• Parallel-call AST: 86.50%
• Parallel multiple-call AST: 85.50%
• Irrelevance detection: 81.67%

This is the profile to try if you want a local Strix Halo model that feels fast while staying sharp on coding and tool-use formats.

You need Carlo's custom llama.cpp fork, which is built for AMD; details are in the model card. I couldn't attach it because of Reddit's filter.

Fastest Qwopus 27B for Strix Halo so far!