TurboQuant with 16 GB VRAM?
I've got Qwen3.6-27B IQ4_XS (14.7 GB, cHunter789's build) on an RX 7800 XT with ROCm 7.1. The display runs on the iGPU, so the full 16 GB is available for compute. Currently running 64K context with a q8_0/q4_0 KV cache and ~915 MiB to spare.
Tried domvox/llama.cpp-turboquant-hip, but it OOMs at 512 tokens: the fixed overhead from codebooks and lookup tables alone blows past 16 GB. Now that I've freed ~600 MB by switching quants, I have ~1.6 GB of headroom before KV allocation.
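For anyone who wants to sanity-check the KV side of the budget, here is a rough Python sketch of how the cache size scales with context length. It assumes "q8_0/q4_0" means q8_0 for K and q4_0 for V, and it uses the standard ggml block sizes (34 bytes per 32 values for q8_0, 18 bytes per 32 values for q4_0). The layer count, KV head count, and head dimension are placeholders, not the real Qwen3.6-27B values, so plug in the numbers from the GGUF metadata. It also ignores compute buffers and anything TurboQuant allocates on top.

```python
# Rough KV cache size estimate for llama.cpp-style quantized caches.
# (bytes per block, elements per block) for each ggml type.
GGML_BLOCK = {
    "f16": (2, 1),
    "q8_0": (34, 32),  # 32 int8 values + one fp16 scale
    "q4_0": (18, 32),  # 32 4-bit values + one fp16 scale
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, type_k, type_v):
    """Return the combined size in bytes of the K and V caches."""
    elems = n_ctx * n_layers * n_kv_heads * head_dim  # elements per cache

    def tensor_bytes(ggml_type):
        block_bytes, block_elems = GGML_BLOCK[ggml_type]
        return elems * block_bytes // block_elems

    return tensor_bytes(type_k) + tensor_bytes(type_v)


if __name__ == "__main__":
    # HYPOTHETICAL architecture values, for illustration only.
    total = kv_cache_bytes(
        n_ctx=65536, n_layers=16, n_kv_heads=4, head_dim=128,
        type_k="q8_0", type_v="q4_0",
    )
    print(f"KV cache: {total / 2**20:.0f} MiB")
```

Whatever this returns for the real architecture has to fit inside the headroom left after the weights and any fixed overhead, which is the part TurboQuant seems to blow up.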
Anyone found a way to reduce TurboQuant's fixed VRAM cost, or gotten it working on a 16 GB card with a large model? Or is it just fundamentally designed for cards with more headroom?