u/Aggressive-Support15

I wanted to share a setup result and get some advice from people here who know llama.cpp / turboquant better than I do.

I followed the general approach from this video:

https://www.youtube.com/watch?v=8F_5pdcD3HY

I did not copy it 1:1, but I used it as the main reference and adapted it to my own machine.

My current setup:

- GPU: RTX 3080 20GB

- RAM: 15 GB

- CPU: i3-10100F

- llama.cpp turboquant build

- Model: Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

- mmproj: mmproj-F16.gguf

- Context: 256k

- n-cpu-moe: 22

- cache-type-k: turbo4

- cache-type-v: turbo3

- flash-attn on

Current result:

- stable at 256k context

- roughly 40 tok/s

- model load time is around 5 minutes

- vision also works after adding mmproj

What I found interesting is that the biggest unlock was not just using a quantized GGUF, but combining that with turboquant KV cache settings. That was the part that made 256k actually possible on this machine.

What I’m hoping to learn from people here:

Performance tuning

Given this hardware and this model, is there anything obvious I should still try to improve throughput or latency?

For example:

- different n-cpu-moe values

- different batch / ubatch

- different cache type combos

- whether 256k is worth keeping vs dropping to 128k for better real-world performance

Thinking mode vs no thinking mode

For agentic workloads (Hermes, OpenClaw, tool-using assistants, coding flows, etc.), would you keep thinking enabled or disable it?

My intuition is:

- thinking mode = better for hard reasoning / planning

- no thinking = better for speed / responsiveness / lower token cost

But I’d love to hear from people actually using Qwen in agent-style workflows.

Do you find thinking mode worth it for tool use, or does it mostly just add latency?

Agent use in general

If the goal is to use this model for agentic tasks rather than just chat, would you optimize differently?

For example:

- lower context but faster response

- no thinking mode

- different quant choice

- maybe a different model entirely for the controller / planner role

I’m pretty happy that I got this working at all on this box, but I also suspect I’m still in the “it works” phase rather than the “it’s really optimized” phase.

Would really appreciate any suggestions, corrections, or things you’d test next.

Followed the turboquant llama.cpp setup from this video and got Qwen3.6-35B-A3B running at 256k / ~40 tok/s on RTX 3080 20GB — looking for advice on further tuning + agent use