u/OddUnderstanding2309

Hey folks.

For weeks I have been trying to get a "good setup" running for a local Hermes agent.

This is my hardware:
- Ryzen 9 5950X, 48GB DDR4-3600, some NVMe disks, blablabla
- 2x RTX 3080 12GB
- 2x RTX 3090 24GB
- 1x NZXT 1500W PSU
- 1x Corsair 750W PSU

So, a quite capable system with 72GB of VRAM, or so I thought.

Software:
- Fedora 44 Workstation
- llama-cpp compiled from source
- Hermes AI agent
- local LLM (I do not want any cloud LLM; I need my data in my house all the time)

Startup parameters right now (working kinda):

~/Tools/llama.cpp/build/bin/llama-server \
-hf llmfan46/gemma-4-31B-it-uncensored-heretic-GGUF:Q8_0 \
--mmproj-offload \
--host 0.0.0.0 --port 8080 \
--reasoning-budget 4000 \
--reasoning-budget-message "... thinking budget exceeded, let's answer now." \
--jinja \
--chat-template-kwargs "{\"preserve_thinking\": true}" \
-c 256000 \
-np 1 \
-ngl 99 \
-t 16 \
-b 2048 -ub 1024 \
-fa on \
-fit on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--no-warmup \
--slot-prompt-similarity 0.1 \
--cache-prompt --no-context-shift \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.1 \
-ts 2.2,2.2,0.9,0.9 -sm tensor
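
As a sanity check on the `-ts` values, here is a small sketch of what those proportions work out to per GPU, compared with each card's share of the total VRAM. The device order (3090s first, 3080s last) is an assumption; it is only what the ratios suggest, not something stated in the command.

```python
# Normalize the -ts (tensor split) proportions into per-GPU fractions.
# Assumption: devices 0 and 1 are the RTX 3090s, devices 2 and 3 the RTX 3080s.
tensor_split = [2.2, 2.2, 0.9, 0.9]  # values from the -ts flag above
vram_gb = [24, 24, 12, 12]           # assumed device order

ts_fractions = [t / sum(tensor_split) for t in tensor_split]
vram_fractions = [v / sum(vram_gb) for v in vram_gb]

for i, (ts, vr) in enumerate(zip(ts_fractions, vram_fractions)):
    print(f"GPU {i}: -ts share {ts:.1%}, VRAM share {vr:.1%}")
```

Under that assumption each 3090 gets about 35% of the split and each 3080 about 15%, so the 3080s carry slightly less than their 16.7% share of the 72GB.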

That runs at about 1000 t/s pp and 30 t/s tg, which is fine with me for a Q8 model with Q8 K and V caches.

My problem is ~~this~~ are these:
- llama-cpp reprocesses the KV cache quite often (every 5 messages or so). This takes 1-2 minutes at ~1000 t/s pp.
- I have to hold gemma4's hand all the time. It gets a complex task and always reports back after a few minutes instead of just chugging along on its own. I set a /goal and "told it" not to ask back and to work on its own, but it stops all the time and tells me how great it did and where we are right now.
What I want is for it to run for an hour or so without waiting for me all the time...
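
To put the reprocessing time in perspective, here is a rough back-of-the-envelope sketch. It assumes the ~1000 t/s prompt-processing rate holds for the whole 1-2 minutes, which is a simplification.

```python
CONTEXT_SIZE = 256_000  # from -c 256000


def tokens_reprocessed(pp_tokens_per_s: float, seconds: float) -> int:
    """Tokens pushed through prompt processing in the given time."""
    return int(pp_tokens_per_s * seconds)


for seconds in (60, 120):
    n = tokens_reprocessed(1000, seconds)
    share = n / CONTEXT_SIZE
    print(f"{seconds}s at 1000 t/s = {n} tokens ({share:.0%} of the context window)")
```

So each reprocessing pass is in the range of 60,000 to 120,000 tokens, which looks more like the whole conversation being redone than just the latest message.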

I tried the A3B MoE versions. They are fast, but completely unusable for agentic work.
I tried qwen 27b A LOT, but the KV reprocessing is even worse there: a few runs take hours because of constant KV cache reprocessing.
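
One thing that may be worth checking here (a guess, not something confirmed for this setup): prompt caching can only reuse the part of the prompt that is identical, token for token, from the start. If anything early in the prompt changes between turns, everything after the first difference has to be processed again. A minimal sketch of that check, with made-up token IDs and a hypothetical helper name:

```python
def common_prefix_len(prev_tokens, next_tokens):
    """Number of leading tokens that are identical in both prompts."""
    n = 0
    for a, b in zip(prev_tokens, next_tokens):
        if a != b:
            break
        n += 1
    return n


# Made-up token IDs for illustration only.
turn_1 = [1, 10, 11, 12, 20, 21]
turn_2_appended = turn_1 + [30, 31]                # new message appended at the end
turn_2_edited = [1, 10, 99, 12, 20, 21, 30, 31]    # something changed near the start

for name, prompt in [("appended", turn_2_appended), ("edited early", turn_2_edited)]:
    reused = common_prefix_len(turn_1, prompt)
    print(f"{name}: {reused} tokens reusable, {len(prompt) - reused} to reprocess")
```

Comparing the tokenized prompts of two consecutive agent turns this way would show whether the agent only appends to the conversation or rewrites something near the top.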

I tried qwen 27B on vLLM, on the two 3090s only, with the club-3090 project. That was "fine" but also not really good. It ground to a halt after about 24h, every day, so I had to restart the vLLM server to get it out of the 0.1 t/s mud hell.

Are my llama-cpp parameters wrong, and is that what leads to my "agent asking on every turn" problem? Or what else can I do?

I need help running a local Hermes Agent on my rig (llama-cpp, self-compiled).