r/LocalLLM
Help with further optimisation
I ran Qwen 3.6 35B A3B (llama.cpp) with pi agent on Ubuntu and got good speed: about 250+ tk/s prompt processing and around 25 tk/s generation at 32k context. I'm looking to scale this up by adding more tools, but how feasible is that? I'm doing this on my LOQ laptop with an RTX 3050 6GB, an i5 HX CPU, and 24 GB RAM.
optimisation command:
./build/bin/llama-server \
-m "/mnt/models/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" \
-ngl 999 \
-ot "exps=CPU" \
--no-mmap \
--mlock \
--ctx-size 32768 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-t 8 \
-b 4096 -ub 4096
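For comparing settings such as `-t` on equal footing, llama.cpp ships `llama-bench`, which reports prompt-processing and generation throughput separately. The following is only a rough sketch mirroring the server flags above. The `BENCH` path, the thread candidates, and the `-p 2048 -n 128` sizes are assumptions, and `-ot` in `llama-bench` needs a reasonably recent build (check `llama-bench --help`).

```shell
#!/usr/bin/env bash
# Sketch: sweep thread counts with llama-bench, reusing the server's offload settings.

MODEL="${MODEL:-/mnt/models/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf}"
BENCH="${BENCH:-./build/bin/llama-bench}"   # assumed to be built next to llama-server
THREADS="${THREADS:-4,6,8,10,12}"           # assumed candidates; comma-separated values are swept

# Runs "$@" (a program name) followed by the benchmark arguments.
# `bench echo ...` prints the command instead of running it.
bench() {
  "$@" \
    -m "$MODEL" \
    -ngl 999 -ot "exps=CPU" \
    -mmp 0 \
    -fa 1 -ctk q4_0 -ctv q4_0 \
    -b 4096 -ub 4096 \
    -t "$THREADS" \
    -p 2048 -n 128
}

if [ -x "$BENCH" ] && [ -f "$MODEL" ]; then
  bench "$BENCH"
else
  echo "llama-bench or model not found; the command would be:"
  bench echo "$BENCH"
fi
```

`-mmp 0` is the `llama-bench` counterpart of `--no-mmap`, and `-fa 1` is included because a quantized V cache relies on flash attention. The output has one row per thread count, so the best `-t` for prompt processing and for generation can be read off directly.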
What's already tried:
- `-ot "exps=CPU"` instead of `-ngl X`: biggest gain, doubled speed
- `--no-mmap`: loads full model into RAM, eliminates disk reads
- `--mlock`: pins model in RAM, prevents OS paging (works natively on Linux)
- `-b 4096 -ub 4096`: large batch size, massively improves prompt processing
- `q4_0` KV cache: compressed KV cache saves VRAM for more layers
- Running on Linux instead of Windows: saves 3-4GB RAM overhead
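Since `--no-mmap` and `--mlock` both depend on the model actually fitting in RAM, a quick check before launch can help on a 24 GB machine. This is a minimal sketch, not part of llama.cpp. The helper name `fits_in_ram` and the 2 GiB headroom are assumptions, and the file size is only an upper bound on what ends up in host memory.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check RAM and the memlock limit before using --no-mmap with --mlock.

MODEL="${MODEL:-/mnt/models/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf}"
HEADROOM_KIB="${HEADROOM_KIB:-2097152}"   # 2 GiB kept free for the OS and the agent (assumption)

# Succeeds if model size plus headroom fits into available memory (all values in KiB).
fits_in_ram() {
  local model_kib=$1 avail_kib=$2 headroom_kib=$3
  [ $(( model_kib + headroom_kib )) -le "$avail_kib" ]
}

if [ -f "$MODEL" ]; then
  model_kib=$(( $(stat -c %s "$MODEL") / 1024 ))
  avail_kib=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
  echo "model file:    ${model_kib} KiB (upper bound on host RAM use)"
  echo "MemAvailable:  ${avail_kib} KiB"
  echo "memlock limit: $(ulimit -l) (KiB, or 'unlimited')"
  if fits_in_ram "$model_kib" "$avail_kib" "$HEADROOM_KIB"; then
    echo "OK: model plus headroom fits in available RAM"
  else
    echo "Tight: model plus headroom exceeds available RAM; expect swapping or a failed mlock"
  fi
else
  echo "model not found at $MODEL; set MODEL=/path/to/model.gguf"
fi
```

If the memlock limit printed by `ulimit -l` is smaller than the amount being pinned, `--mlock` may not be able to lock everything, so that number is worth checking alongside free RAM.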
my questions:
- Can `-ot` be tuned further, e.g. keep some experts on GPU selectively?
- Is `iq4_nl` KV cache better than `q4_0` for this model?
- Worth trying the `ik_llama.cpp` fork for extra speed?
- Flash attention flags: any benefit for MoE on RTX 3050?
- Optimal `-t` thread count for i5 HX: currently 8, should it be higher?
- Model is on NTFS drive mounted in Linux: would copying to ext4 improve speed?
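On the first question, one common way to keep some experts on the GPU is to narrow the `-ot` regex so that only the experts of certain layers go to the CPU. The sketch below builds such a pattern. The helper `cpu_expert_pattern`, the layer count `N_LAYERS=40`, and `KEEP=2` are illustrative assumptions (the real layer count appears as `n_layer` in the llama-server load log), and it assumes expert tensors are named like `blk.<n>.ffn_*_exps`.

```shell
#!/usr/bin/env bash
# Sketch: build an -ot pattern that leaves the experts of the first KEEP layers
# on the GPU and sends the experts of all later layers to the CPU.

# Prints blk\.(a|b|...)\.ffn_.*_exps\.=CPU for layers first..last (inclusive).
cpu_expert_pattern() {
  local first=$1 last=$2 alt="" i
  i=$first
  while [ "$i" -le "$last" ]; do
    alt="${alt:+$alt|}$i"     # join layer indices with "|"
    i=$(( i + 1 ))
  done
  printf 'blk\\.(%s)\\.ffn_.*_exps\\.=CPU\n' "$alt"
}

N_LAYERS="${N_LAYERS:-40}"   # assumption: read the real value from the load log
KEEP="${KEEP:-2}"            # assumption: number of layers whose experts stay on the GPU

# Layers are numbered from 0, so layers KEEP..N_LAYERS-1 go to the CPU.
PATTERN=$(cpu_expert_pattern "$KEEP" $(( N_LAYERS - 1 )))
echo "-ot \"$PATTERN\""
```

The printed value replaces `-ot "exps=CPU"` in the server command. Raising `KEEP` one layer at a time while watching VRAM in `nvidia-smi` shows how many expert layers fit alongside the KV cache in 6 GB. Recent llama.cpp builds also appear to offer `--n-cpu-moe N` for a similar effect without a regex; `llama-server --help` will confirm whether a given build has it.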
u/Funny-Factor-6082 — 6 days ago