How do I properly optimize a 120B local LLM on an 8GB GPU?
I have an old server with 96GB of ECC DDR4 RAM and a 24-core Xeon. It has an RTX 3070 GPU with 8GB of VRAM. I mostly use my main PC for LLMs, but I have started using the server to host LLMs in the 120B class (gpt-oss, Qwen3.5, Nemotron) because it is the only machine I have with enough RAM. Since inference runs mostly on the CPU, generation is very slow (3 tok/sec). So the idea is that I use my main PC with smaller models for fast responses, and for jobs that need more smarts, I send them off to the server for slow processing. That works fine, but if I can improve the generation speed, I would like to.
With this hardware (mostly CPU), I really don't know where to start. Is there any baseline guidance for optimizing an LLM setup where only a small part of the model can be offloaded to the GPU?