u/comanderxv

This is for all with 12GB VRAM.

Hi, I created a fork of llama.cpp with an experimental implementation of experts instead of layers. The reason is I own an RTX 2060 with 12GB VRAM. That sounds big but is too little for dense models. That is why I use mainly MoE models because of that. The problem is, you need to split some layers to the CPU lane.

As you all surely know, Qwen3.6-35B-A3B uses only 8 experts per token; the rest are unused, so why not fill the experts into VRAM instead of complete layers full of unused experts?

I started to create a UI to monitor which experts are used. This already showed me that the first layers are more important to have on VRAM than the last ones; the reason is that they would change the experts more frequently than the others. Unfortunately, n-cpu-moe with llama.cpp will let the first layers on the CPU, so I tried -ot, but that's another story. With the optimized setup, I was able to reach about 22 tk/s. (Remember the 2060 has only about half the CUDA cores of a 3060.) With the default --n-cpu-moe, I get 19 tk/s

I only run Q6 models, since the degradation at coding is visible. My context is not quantized (same reason), and because of Java development, I need a big context window of 100k.

However, with my expert variant and a hit rate of about 62%, it increased to 26 tks. The break-even point was at a 42% hit rate. This means the prompt has used 42% of the chosen experts on the GPU in my cache. As I tested smaller sizes of RAM (built-in argument to specify the VRAM usage), another use case came into my mind. With a good profile, you can reduce the usage a lot without sacrificing speed.

Now, to my question. Is there a person who would like to give it a test? I really would like to know how it behaves on a 3060/4060 or similar. (CUDA is a requirement, and Qwen 35B A3B or Gemma 26B A4B). Currently, it is tested only on Linux.

Really, I don't want to earn any stars or so. I don't care; I just want to know how much it increases the token generation on which NVIDIA graphics card.

It would need the following: checkout and build https://github.com/adrianhoehne/llama.cpp

Start it with the additional arguments:

./build/bin/llama-server --moe-layer-perf-out experts.json \
--cpu-moe \
--ctx-size 100000 \
--parallel 1

Then start a prompt and wait. This will take longer than usual because every expert is still on the CPU.

After that, exchange the arguments to

./build/bin/llama-server --moe-hot-cache experts.json \
--moe-hot-cache-max-mib -1 \
--moe-hot-cache-auto-reserve-mib 1024 \
--moe-hot-cache-update-rate 0.10 \
--cpu-moe \
--ctx-size 100000 \
--parallel 1

And start measurement.

I also included the view of which experts are used to the Llama UI:

Button for ui

Experts first llama.cpp