u/Count_Rugens_Finger

I have an old server with 96GB of ECC DDR4 RAM and a 24-core Xeon. It has an RTX 3070 GPU with 8GB of VRAM. I mostly use my main PC for LLMs, but I have started using the server to host LLMs in the 120B class (gpt-oss, Qwen3.5, Nemotron) because it is the only machine I have with enough RAM. Since most of the processing happens on the CPU, it is very slow (3 tok/sec). So the idea is that I use my main PC with smaller models for fast responses, and jobs that need more smarts get sent off to the server for slow processing. That works fine, but I would still like to improve the generation speed if I can.
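
The fast/slow split described above can be sketched roughly as follows. This is only an illustration: the endpoint URLs, the backend names, and the `needs_more_smarts` flag are hypothetical placeholders, and the only figure taken from the post is the 3 tok/sec generation speed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str  # hypothetical address, not taken from the post


# Placeholder endpoints for the two machines (assumed, for illustration only).
FAST = Backend("main-pc-small-model", "http://localhost:8080/v1")
SLOW = Backend("server-120b", "http://server.local:8080/v1")


def pick_backend(needs_more_smarts: bool = False) -> Backend:
    """Send hard jobs to the slow 120B server, everything else to the main PC."""
    return SLOW if needs_more_smarts else FAST


def estimated_seconds(output_tokens: int, tok_per_sec: float = 3.0) -> float:
    """Rough generation time at a given speed (default: the 3 tok/sec from the post)."""
    return output_tokens / tok_per_sec


if __name__ == "__main__":
    job = pick_backend(needs_more_smarts=True)
    print(job.name, estimated_seconds(600))
```

At 3 tok/sec, a 600-token answer takes about 200 seconds, which is why even a modest speedup on the server matters for this kind of workflow.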

For my hardware (mostly CPU), I really don't know where to start. Is there any baseline guidance for optimizing an LLM setup where only a small part of the model can be offloaded to the GPU?

How to properly optimize a 120B local LLM on an 8GB GPU?