u/Friendly_Beginning24

How do local users run large models locally?

Just as the title says, the furthest I can go is 31B. But I'm curious how people are able to run larger models at respectable quants with seemingly modest hardware.

Or are those setups only "technically" able to run them, with slow text generation and prefill speeds?

I'd like to run models larger than 31B, so I'm looking for ways to do so.

Thanks!

reddit.com

Enjoying Qwen 3.6 but it thinks too much!

Hello! Does anyone know how to make Qwen 3.6 think less? I'm enjoying it very much and it follows instructions really well, but it thinks too much!

I'm running Qwen 3.6 27b on LM Studio.

u/Friendly_Beginning24 — 12 days ago

Hello! I'd like to know what the trade-offs are of running Gemma 4 31B Q6 across a 3090 and a 5060 Ti, since I've read enough to know that multi-GPU is going to slow things down, especially when the GPUs are different. I don't mind a generation speed of 10 t/s, but I would like the prefill to be decently fast: say, reading 32k tokens of context in 60 seconds. I'm not opposed to dropping to Q5, though. Would this setup be able to do that, or are my expectations too high?
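
For reference, the numbers in that question can be turned into a rough back-of-the-envelope check. This is only a sketch: the bits-per-weight figures are approximate values for llama.cpp-style quants, "32k" is taken to mean 32 * 1024 tokens, and the 5060 Ti is assumed to be the 16 GB variant. None of it is a benchmark of the actual cards.

```python
# Back-of-the-envelope check; every constant here is an assumption, not a measurement.

def required_prefill_rate(context_tokens: int, seconds: float) -> float:
    """Tokens per second needed to process the whole prompt in the given time."""
    return context_tokens / seconds


def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumption: "32k" means 32 * 1024 tokens.
rate = required_prefill_rate(32 * 1024, 60)
print(f"prefill needed: {rate:.0f} tokens/s")

# Assumptions: approximate bits per weight for each quant,
# and the 16 GB variant of the 5060 Ti next to the 24 GB 3090.
APPROX_BPW = {"Q6_K": 6.56, "Q5_K_M": 5.7}
TOTAL_VRAM_GB = 24 + 16

for quant, bpw in APPROX_BPW.items():
    size = weight_size_gb(31, bpw)
    left = TOTAL_VRAM_GB - size
    print(f"{quant}: ~{size:.1f} GB weights, ~{left:.1f} GB left for KV cache and overhead")
```

So the prefill target is roughly 550 tokens/s sustained. Whether a mixed 3090 plus 5060 Ti split reaches that depends on how the layers are divided between the cards, which this arithmetic does not capture.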

I can run Gemma 4 31B Iq4ks on my 3090, but I'm very limited in context size even with the KV cache set to Q4. Flash attention is always on.
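
To see why context is the limiting factor, here is a rough sketch of how KV cache size scales with context length and cache precision. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Gemma 4 31B configuration, and the formula assumes plain full attention on every layer (models that use sliding-window layers need less). The bits-per-element values follow the usual llama.cpp block layouts for `q8_0` and `q4_0`.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bits_per_element: float) -> float:
    """Estimate KV cache size in GiB, assuming full attention on every layer."""
    # Factor of 2: one tensor for keys and one for values per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elements * bits_per_element / 8 / 2**30


# Hypothetical architecture, for illustration only (not the real model config).
HYPOTHETICAL_ARCH = dict(n_layers=60, n_kv_heads=16, head_dim=128)

# Effective bits per element: q8_0 and q4_0 store one 16-bit scale per 32 values.
CACHE_BITS = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

for ctx in (8192, 16384, 32768):
    for name, bits in CACHE_BITS.items():
        size = kv_cache_gib(context_tokens=ctx, bits_per_element=bits, **HYPOTHETICAL_ARCH)
        print(f"ctx={ctx:>6} cache={name:<5} ~{size:.2f} GiB")
```

The cache grows linearly with context, so on a single 24 GB card that already holds the weights, even a Q4 cache leaves limited room for long prompts.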

I'm using LM Studio, as I'm not particularly knowledgeable about running LLMs locally yet.

u/Friendly_Beginning24 — 17 days ago