
Followed the turboquant llama.cpp setup from this video and got Qwen3.6-35B-A3B running at 256k / ~40 tok/s on RTX 3080 20GB — looking for advice on further tuning + agent use
I wanted to share a setup result and get some advice from people here who know llama.cpp / turboquant better than I do.
I followed the general approach from this video:
https://www.youtube.com/watch?v=8F_5pdcD3HY
I did not copy it 1:1, but I used it as the main reference and adapted it to my own machine.
My current setup:
- GPU: RTX 3080 20GB
- RAM: 15 GB
- CPU: i3-10100F
- llama.cpp turboquant build
- Model: Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
- mmproj: mmproj-F16.gguf
- Context: 256k
- n-cpu-moe: 22
- cache-type-k: turbo4
- cache-type-v: turbo3
- flash-attn on
Current result:
- stable at 256k context
- roughly 40 tok/s
- model load time is around 5 minutes
- vision also works after adding mmproj
What I found interesting is that the biggest unlock was not just using a quantized GGUF, but combining that with turboquant KV cache settings. That was the part that made 256k actually possible on this machine.
What I’m hoping to learn from people here:
- Performance tuning
Given this hardware and this model, is there anything obvious I should still try to improve throughput or latency?
For example:
- different n-cpu-moe values
- different batch / ubatch
- different cache type combos
- whether 256k is worth keeping vs dropping to 128k for better real-world performance
- Thinking mode vs no thinking mode
For agentic workloads (Hermes, OpenClaw, tool-using assistants, coding flows, etc.), would you keep thinking enabled or disable it?
My intuition is:
- thinking mode = better for hard reasoning / planning
- no thinking = better for speed / responsiveness / lower token cost
But I’d love to hear from people actually using Qwen in agent-style workflows.
Do you find thinking mode worth it for tool use, or does it mostly just add latency?
- Agent use in general
If the goal is to use this model for agentic tasks rather than just chat, would you optimize differently?
For example:
- lower context but faster response
- no thinking mode
- different quant choice
- maybe a different model entirely for the controller / planner role
I’m pretty happy that I got this working at all on this box, but I also suspect I’m still in the “it works” phase rather than the “it’s really optimized” phase.
Would really appreciate any suggestions, corrections, or things you’d test next.