u/Funny-Factor-6082

Help with further optimisation

I ran Qwen3.6 35B A3B (llama.cpp) with the pi agent on Ubuntu and got good speeds: about 250+ tk/s prompt processing and around 25 tk/s generation at 32k context. I'm looking to scale this up by adding more tools, but how feasible is that? I'm doing this on my LOQ with an RTX 3050 6GB, an i5 HX CPU and 24 GB RAM.

Optimisation command:

./build/bin/llama-server \
  -m "/mnt/models/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" \
  -ngl 999 \
  -ot "exps=CPU" \
  --no-mmap \
  --mlock \
  --ctx-size 32768 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -t 8 \
  -b 4096 -ub 4096
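
For a rough idea of what `--ctx-size 32768` costs with different `--cache-type-k/v` settings, here is a minimal sketch of the usual KV cache arithmetic. The function name `kv_cache_mib` is made up for illustration, and the layer/head numbers in the example calls are placeholders, not the real Qwen3.6 values. It also assumes every layer keeps a standard attention KV cache, which may not hold for hybrid architectures. llama-server's startup log should report the real KV buffer size.

```shell
#!/usr/bin/env bash
# Rough KV cache size estimate in MiB (hypothetical helper, not part of llama.cpp).
# Usage: kv_cache_mib <n_layers> <n_kv_heads> <head_dim> <ctx> <f16|q8_0|q4_0>
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 type=$5
  local block_bytes
  # Bytes per block of 32 values in ggml: f16 = 64, q8_0 = 34, q4_0 = 18.
  case "$type" in
    f16)  block_bytes=64 ;;
    q8_0) block_bytes=34 ;;
    q4_0) block_bytes=18 ;;
    *) echo "unknown cache type: $type" >&2; return 1 ;;
  esac
  # K and V each store layers * kv_heads * head_dim values per token.
  local elems=$(( 2 * layers * kv_heads * head_dim * ctx ))
  echo $(( elems / 32 * block_bytes / 1048576 ))
}

# Placeholder architecture values (ASSUMED, not the real model's numbers).
kv_cache_mib 48 4 128 32768 f16    # prints 3072
kv_cache_mib 48 4 128 32768 q4_0   # prints 864
```

With these placeholder numbers, q4_0 needs roughly 28% of the f16 cache (4.5 bits per value instead of 16), which is where the VRAM saving comes from.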

What's already tried:

  • -ot "exps=CPU" instead of -ngl X — biggest gain, doubled speed
  • --no-mmap — loads full model into RAM, eliminates disk reads
  • --mlock — pins model in RAM, prevents OS paging (works natively on Linux)
  • -b 4096 -ub 4096 — large batch size, massively improves prompt processing
  • q4_0 KV cache — compressed KV cache saves VRAM for more layers
  • Running on Linux instead of Windows — saves 3-4GB RAM overhead
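
One thing worth checking with `--mlock`: it only pins memory if the process's memlock limit allows it, and that limit is often low by default. A minimal sketch of the check, assuming bash and GNU `stat` (the helper name `memlock_ok` is made up). It compares against the full file size, which is an upper bound, since part of the model sits in VRAM with `-ngl 999`.

```shell
#!/usr/bin/env bash
# Succeeds if a memlock limit (as printed by `ulimit -l`: KiB or "unlimited")
# is large enough to lock size_bytes bytes. Hypothetical helper.
memlock_ok() {
  local limit=$1 size_bytes=$2
  [ "$limit" = "unlimited" ] && return 0
  [ $(( limit * 1024 )) -ge "$size_bytes" ]
}

MODEL="/mnt/models/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
if [ -f "$MODEL" ]; then
  size=$(stat -c %s "$MODEL")   # file size in bytes (GNU coreutils)
  if memlock_ok "$(ulimit -l)" "$size"; then
    echo "memlock limit is high enough for --mlock"
  else
    echo "memlock limit may be too low for --mlock; consider raising it"
  fi
fi
```

If the limit is too low, it can usually be raised for the session with `ulimit -l unlimited` (as root) or permanently through `/etc/security/limits.conf`.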

My questions:

  • Can -ot be tuned further — e.g. keep some experts on GPU selectively?
  • Is iq4_nl KV cache better than q4_0 for this model?
  • Worth trying ik_llama.cpp fork for extra speed?
  • Flash attention flags — any benefit for MoE on RTX 3050?
  • Optimal -t thread count for i5 HX — currently 8, should it be higher?
  • Model is on NTFS drive mounted in Linux — would copying to ext4 improve speed?
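
To make the first question concrete, here is a sketch of what "keep some experts on GPU" could look like: a small helper that builds an `-ot` pattern pinning the expert tensors of the first N layers to the GPU, followed by the existing catch-all `exps=CPU`. The helper name `gpu_expert_pattern` is made up. The `blk.N.ffn_*_exps` tensor naming and the `CUDA0` buffer name are assumptions for a single CUDA GPU. As far as I understand, the first matching pattern wins, so the GPU pattern has to come before `exps=CPU`.

```shell
#!/usr/bin/env bash
# Print an -ot pattern that places the expert tensors of layers 0..N-1
# on the first CUDA device. Hypothetical helper; expects N >= 1.
gpu_expert_pattern() {
  local n=$1
  local ids
  ids=$(seq -s '|' 0 $(( n - 1 )))   # e.g. "0|1|2" for n=3
  printf 'blk\\.(%s)\\.ffn_.*_exps\\.=CUDA0\n' "$ids"
}

# Show the resulting arguments without launching anything.
echo ./build/bin/llama-server -ngl 999 \
  -ot "$(gpu_expert_pattern 3)" -ot "exps=CPU"
```

How many layers fit next to a 32k KV cache in 6 GB is trial and error: raise N until VRAM runs out (watch `nvidia-smi`), then back off. Recent llama.cpp builds also have an `--n-cpu-moe N` option that expresses the same idea from the other direction. Check `llama-server --help` to see whether your build has it.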
u/Funny-Factor-6082 — 6 days ago