u/Haunting-Stretch8069

▲ 5 r/ROCm

TurboQuant with 16 GB VRAM?

I've got Qwen3.6-27B IQ4_XS (14.7 GB, cHunter789's build) on an RX 7800 XT with ROCm 7.1. Display on iGPU, full 16 GB available for compute. Currently running 64K context with q8_0/q4_0 KV cache and ~915 MiB to spare.

Tried domvox/llama.cpp-turboquant-hip, but it OOMs at 512 tokens: the fixed overhead from codebooks and lookup tables alone blows past 16 GB. Now that I've freed ~600 MB by switching quants, I have ~1.6 GB of headroom before KV allocation.

Anyone found a way to reduce TurboQuant's fixed VRAM cost, or gotten it working on a 16 GB card with a large model? Or is it just fundamentally designed for cards with more headroom?
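For anyone comparing KV cache options against a fixed VRAM budget, here is a rough sketch of the arithmetic behind a block-quantized cache such as the q8_0/q4_0 setup mentioned above. The block sizes are ggml's (32 elements per block: 64 bytes for f16, 34 for q8_0, 18 for q4_0). The function names and the architecture numbers in the example call are hypothetical placeholders, not Qwen3.6-27B's real values; read the actual layer, KV head, and head dimension counts from the GGUF metadata. The estimate covers only the K and V tensors and ignores compute buffers and any other per-backend overhead.

```shell
#!/usr/bin/env bash
# Rough KV cache size estimate for block-quantized cache types.
# Illustrative sketch only; helper names are made up for this example.

# Bytes needed to store one block of 32 cache elements in a given type.
block_bytes() {
  case "$1" in
    f16)  echo 64 ;;  # 32 x 2-byte halves
    q8_0) echo 34 ;;  # 32 x int8 plus one f16 scale
    q4_0) echo 18 ;;  # 32 x 4-bit plus one f16 scale
    *)    echo "unknown cache type: $1" >&2; return 1 ;;
  esac
}

# Usage: kv_cache_mib KV_LAYERS KV_HEADS HEAD_DIM N_CTX TYPE_K TYPE_V
# Assumes KV_HEADS * HEAD_DIM is a multiple of 32.
kv_cache_mib() {
  local layers=$1 heads=$2 dim=$3 ctx=$4 bk bv
  bk=$(block_bytes "$5") || return 1
  bv=$(block_bytes "$6") || return 1
  # Elements per token per layer, counted once for K and once for V.
  local elems=$(( heads * dim ))
  local bytes=$(( layers * ctx * (elems / 32) * (bk + bv) ))
  echo $(( bytes / 1024 / 1024 ))
}

# Hypothetical architecture numbers, for illustration only.
kv_cache_mib 16 4 128 65536 q8_0 q4_0   # prints 832
```

With the same placeholder numbers, q4_0/q4_0 comes out at 576 MiB and f16/f16 at 2048 MiB, which gives a feel for how much the cache type alone moves the budget.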

reddit.com
u/Haunting-Stretch8069 — 3 days ago

Best llama.cpp launch config for Qwen3.6 27B on RX 7800 XT (16 GB VRAM) for OpenClaw?

I’m trying to find the best llama-server launch command / runtime config for running Qwen3.6 27B GGUF with full GPU offload on ROCm.

I’m currently using the IQ4_XS quant, but I’m not sure if that’s the best option for my setup. This is on Ubuntu, with the display connected to my iGPU, so the RX 7800 XT should have no display overhead. I only have 16 GB DDR4 RAM, which is why I haven’t tried the 35B MoE model.

My goal is to optimize for agentic use (OpenClaw, Hermes Agent, etc.) across capability, token generation speed, context length, and reliability.

Current command:

GPU_MAX_HEAP_SIZE=100 \
GPU_MAX_ALLOC_PERCENT=100 \
./build/bin/llama-server \
  -m /home/guy/.cache/huggingface/hub/models--bartowski--Qwen_Qwen3.6-27B-GGUF/snapshots/f73b625d7ceedbd05d14a93874387cd3bcd673b7/Qwen_Qwen3.6-27B-IQ4_XS.gguf \
  -ngl 999 \
  -c 65536 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --parallel 1 \
  --prio 2 \
  --fit off \
  --no-mmap \
  -b 65536 \
  -ub 512 \
  --reasoning-format deepseek \
  --temp 0.6 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0 \
  --presence-penalty 1.5 \
  --repeat-penalty 1.0 \
  -n 32768 \
  --no-context-shift
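
When comparing launch configs like the one above, it can help to check how much VRAM is actually left once the server has finished loading. A minimal sketch, assuming the amdgpu driver's sysfs counters `mem_info_vram_total` and `mem_info_vram_used` are available; the function name is made up for this example, and `card1` is only a guess, since the discrete card's index varies when an iGPU is present.

```shell
#!/usr/bin/env bash
# Print free VRAM in MiB for one amdgpu device, read from sysfs.
# The driver exposes byte counts in mem_info_vram_total and
# mem_info_vram_used under /sys/class/drm/cardN/device.

vram_free_mib() {
  local dev=$1 total used
  total=$(<"$dev/mem_info_vram_total") || return 1
  used=$(<"$dev/mem_info_vram_used") || return 1
  echo $(( (total - used) / 1024 / 1024 ))
}

# Example call (card index is an assumption, adjust for your system):
# vram_free_mib /sys/class/drm/card1/device
```

Running it before and after changing `-c` or the cache types shows the real cost of each change instead of relying on estimates.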
reddit.com
u/Haunting-Stretch8069 — 5 days ago

Scrolling screenshot extension that works on Atlas?

I tried five different Chrome extensions, and all of them failed on Atlas for one reason or another. In fact, many extensions don't work correctly there, most notably Claude on Chrome.

Does anyone know a scrolling screenshot extension that works?

reddit.com
u/Haunting-Stretch8069 — 13 days ago