
For those of you hosting LLMs locally, how do you monitor usage and performance?
I’m hosting a couple of local models on a not-so-powerful machine. To make that workable, I use llama.cpp in router mode so switching models is seamless: the old model gets unloaded and the new one gets loaded automatically.
I was previously using llama-swap, but moved to llama.cpp's router mode. The first thing I missed was proper per-invocation monitoring (prompt processing time, token generation speed, overall response latency, etc.).
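To make those numbers concrete, here's a rough client-side sketch of how they could be measured for a streamed response. It's not what llama-swap or llama.cpp do internally. `measure_stream` and `InvocationStats` are names I made up for illustration, and it assumes you already have some iterable that yields response chunks as they arrive. Time to first chunk is only a proxy for prompt processing time, and chunks are not necessarily tokens.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class InvocationStats:
    time_to_first_chunk_s: float  # rough proxy for prompt processing time
    total_latency_s: float        # request start to last chunk
    chunks: int                   # number of streamed chunks (not exact tokens)
    chunks_per_second: float      # generation speed after the first chunk


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> InvocationStats:
    """Consume a stream of chunks and time it (hypothetical helper)."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # first chunk arrived: prompt processing is done
        count += 1
    end = clock()
    if first is None:
        first = end  # empty stream: nothing was generated
    gen_time = end - first
    # Rate counts the intervals between chunks, so it needs at least two.
    rate = (count - 1) / gen_time if count > 1 and gen_time > 0 else 0.0
    return InvocationStats(first - start, end - start, count, rate)


if __name__ == "__main__":
    def fake_stream():
        # Stand-in for a real streaming response from the server.
        time.sleep(0.2)  # pretend prompt processing
        for word in ["hello", " ", "world"]:
            time.sleep(0.05)  # pretend generation
            yield word

    print(measure_stream(fake_stream()))
```

The clock is injectable only so the timing logic can be checked without real delays. In practice you'd wrap whatever streaming client you already use.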
After messing around for a couple of hours, I ended up setting up Prometheus to scrape metrics from all loaded models and built a Grafana dashboard on top of it (I'll leave an image if you are curious).
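For anyone who hasn't looked at what Prometheus actually scrapes: the endpoint returns plain text, one sample per line. Below is a minimal sketch of parsing that format by hand. `parse_metrics` is a hypothetical helper, the metric names in the sample are illustrative and may not match what your build exposes, and the parser assumes no trailing timestamps and no label values containing spaces.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {sample_name: value}.

    Simplified sketch: assumes each sample line is "<name> <value>",
    with no trailing timestamp and no spaces inside label values.
    """
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, raw = line.rpartition(" ")
        if not name:
            continue  # malformed line without a value
        values[name] = float(raw)
    return values


# Illustrative sample only; the names here are assumptions.
SAMPLE = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 1234
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 17.5
"""

if __name__ == "__main__":
    for name, value in parse_metrics(SAMPLE).items():
        print(name, value)
```

Prometheus does all of this for you, of course. The sketch is just handy for a quick sanity check of what a model is reporting before wiring up a dashboard.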
Unfortunately, I discovered that the /metrics endpoint in llama.cpp seems to be broken in this setup: querying it keeps the models awake, which prevents them from being swapped out and keeps the server from entering an idle state.
Issue here if anyone is interested:
https://github.com/ggml-org/llama.cpp/issues/20227
So now I’m curious: how are you all monitoring local LLM performance and usage?