u/Doug_Fripon

I've been using LM Studio for a few months. I want to try Hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp, and I don't fully understand how the number of server slots (-np) and the context size (-c) interact.

As far as I can tell, the total context is split evenly across the server slots, so each parallel client gets c / np tokens of context.
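To make that concrete, here is a minimal sketch of the arithmetic as I understand it. It assumes the even split I described is correct; the function name is my own and not anything from llama.cpp.

```python
def per_slot_context(ctx_size: int, n_parallel: int) -> int:
    """Tokens of context per slot, ASSUMING -c is split evenly across -np slots."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return ctx_size // n_parallel


# Example: a server started with -c 32768 -np 4
print(per_slot_context(32768, 4))  # 8192
```

So with -c 32768 and -np 4, I would expect each client to be limited to 8192 tokens.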

I have some questions:

- What are the consequences of launching a server with a greater context -c than what the model allows?

- What if c / np is greater than the model's maximum context? Does that have any downside for model performance?

- If a rig has enough VRAM to hold twice the model's maximum context, is serving two agents in parallel twice as energy- and time-efficient as serving them sequentially?

llama.cpp server: How do the -np and -c flags interact?