
Hermes + Local - Performance breakthrough with Llama CPP + Multi-Token Prediction
Current setup: Hermes Agent running on a Mac mini. LLM is Qwen3.6 MOE via Ollama running on Strix Halo .
Good results, but so slow that it's barely usable - I've been switching to a ChatGPT Plus plan to get serious work done. Not ideal. Spending all this money and still have to use a paid subscription? No thank you.
Saw this video today:
https://www.youtube.com/watch?v=MI0Pm1d6YF4
Thought it was worth a try, so I spent an hour switching over to llama cpp for a test.
Woo hoo! This thing is actually usable now. More testing to do, but it feels 5-10x faster than before.
If you're still using ollama, consider switching to llama cpp. This is a revelation for me.
** Updated - tweaked further **
./build/bin/llama-server \
-m "$MODEL_PATH" \
--host 0.0.0.0 \
--port "$PORT" \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--spec-draft-p-min 0.85 \
--parallel 1 \
-ngl 99 \
-fa on \
-c 65536 \
--timeout 600 \
--keep 12000 \
--no-slots \
--no-mmap \
--jinja