u/Embarrassed-Result87

Just wanted to share a massive win for the low-VRAM gang. I’ve been tinkering with an old RX 570 4GB paired with an i5-9400F on CachyOS, and the results with the latest llama.cpp are honestly mind-blowing. I initially struggled with the AUR versions of llama-vulkan, hitting VRAM limits almost instantly when loading Gemma. But then I switched to the latest official llama.cpp binaries (the Ubuntu build), and everything just clicked.

The Setup: GPU: AMD Radeon RX 570 4GB (Polaris 10) OS: CachyOS (Linux) using RADV drivers Model: gemma-4-E2B-it-Q4_K_M.gguf Backend: Vulkan

The "Magic" Command: ./llama-server -m gemma-4-E2B-it-Q4_K_M.gguf --host 0.0.0.0 --port 11435 --ctx-size 8192 --n-gpu-layers 99 --threads 4 --no-warmup --reasoning off -np 2

The Numbers: Context Size: 8192 (8k) Speed: 56 tokens/sec consistently. VRAM Usage: 3.6 GB total (System takes ~600MB, the model + 8k KV cache takes ~3GB).

Key Takeaways: -np 2 is the sweet spot: Surprisingly, setting parallel slots to 2 worked flawlessly while keeping the VRAM usage within the 4GB limit. It handles the 8k context without any crashes.

Official binaries > AUR: At least for this specific setup, the official llama.cpp build handled Vulkan memory mapping much more efficiently than the community packages I tried earlier. 8k Context on 4GB: It’s actually usable! I’m getting lightning-fast responses for RAG tasks and medical paper summarization. If you have an old Polaris card lying around, don't sleep on it. With the right quantization and the latest llama.cpp optimizations, these "relics" are still absolute demons for small models.

Stay local!

Who said 4GB VRAM is dead? 56 t/s on a Polaris RX 570 with 8k Context!