u/JGeek00

llama-server RAM usage grows to OOM

I'm running some tests with llama-benchy while tuning some settings. I always use the same llama-benchy config, so the prompt should always be the same. With each round, RAM usage climbs higher until it hits OOM, and then the RAM is cleared again.

This is what happens:

- Round 1: RAM usage bumps to 20%

- Ends round 1 and usage falls to 0%

- Round 2: RAM usage bumps to 40%

- Ends round 2 and usage falls to 0%

- Round 3: RAM usage bumps to 60%

- Ends round 3 and usage falls to 0%

...

This continues until it hits OOM, at which point the usage "resets" to 0% and the whole cycle starts again.

The same issue has also happened with OpenCode. I work in a coding session that pushes memory usage to 60%, then I start a new session, clearing the conversation history (and the context). Instead of starting from 0% again, memory usage starts from that 60% and soon reaches OOM.
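To put numbers on the per-round growth described above, something like the following could be run between rounds. It is only a rough sketch: it assumes a Linux host, reads the resident set size (VmRSS) from /proc, and the helper names are made up for this example. It does not count swap.

```python
import os
import re


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def rss_mib(pid):
    """Read the resident set size of a process in MiB (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        kib = parse_vmrss_kib(f.read())
    return None if kib is None else kib / 1024


if __name__ == "__main__":
    # Demonstrated on this script's own PID; substitute the llama-server PID.
    if os.path.exists("/proc/self/status"):
        value = rss_mib(os.getpid())
        if value is not None:
            print(f"RSS: {value:.1f} MiB")
```

Logging this value once after each benchmark round would show whether the baseline really climbs by a similar amount every time.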

Config

model: models/Qwen3.6-27B-MTP-Q4_K_M.gguf
mmproj: models/mmproj-BF16.gguf
webui-config-file: webui-config.json
batch-size: 1024
ubatch-size: 512
ctx-size: 131072
cache-type-k: q8_0
cache-type-v: q8_0
threads: 4
threads-batch: 8
parallel: 1
spec-type: draft-mtp
spec-draft-n-max: 2
spec-draft-p-min: 0.4
flash-attn: on
gpu-layers: all
n-gpu-layers: 99
checkpoint-every-n-tokens: -1
ctx-checkpoints: 0
cache-ram: 12288
tools: all
alias: Qwen3.6-27B
chat-template-kwargs: '{"preserve_thinking": true}'
jinja
no-mmproj-offload
webui-mcp-proxy
host: 0.0.0.0
port: 8080
reddit.com
u/JGeek00 — 4 days ago

Tested MTP with llama.cpp and Qwen3.6-27B on RTX 3090

I have just compiled the new release of llama.cpp that includes MTP and tried it for agentic coding on my RTX 3090.

Model: Qwen3.6-27B-Q4_K_M

MTP config: --spec-type draft-mtp --spec-draft-n-max 2 --parallel 1

Without MTP: 100K context with mmproj enabled -> 21.5 GB VRAM usage

With MTP: 100K context with mmproj enabled -> 22.1 GB VRAM usage

Numeric results with llama-benchy:

  • Without MTP: 1020 t/s for prompt processing and 42 t/s for token generation
  • With MTP: 830 t/s for prompt processing and 60 t/s for token generation

Using MTP costs about 19% in prompt processing speed (1020 → 830 t/s) and gains about 43% in token generation speed (42 → 60 t/s)
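For anyone who wants to check the percentages, this is the arithmetic on the numbers quoted in this post. The function name is just for illustration.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Throughput figures from the llama-benchy results above (t/s).
pp = percent_change(1020, 830)   # prompt processing
tg = percent_change(42, 60)      # token generation

# VRAM figures from the 100K-context runs above (GB).
extra_vram = 22.1 - 21.5

print(f"prompt processing: {pp:+.1f}%")
print(f"token generation:  {tg:+.1f}%")
print(f"extra VRAM with MTP: {extra_vram:.1f} GB")
```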

I think MTP is a good improvement, but it is only usable if you currently (without MTP) have at least 2 GB of memory free. If your setup is memory constrained, don't even try it.

EDIT:

Retried everything with a more conservative and more suitable config (I was previously using --spec-draft-n-max 6)

u/JGeek00 — 6 days ago

Time for small models to reach Opus 4.6?

How long do you think it will take for small open models like Qwen3.6-27B or Gemma4-31B to reach Opus 4.6 level on coding tasks?

u/JGeek00 — 7 days ago

llama-server uses RAM even when it has VRAM available

I’m running llama-server on a machine with an RTX 3090 and 16 GB of RAM. I’m using Qwen3.6-27B with the context set to 128K and q8 for both parts of the KV cache. According to nvidia-smi, VRAM usage is at 22.5 GB of 24.5 GB, so there are 2 GB of VRAM available, yet llama-server still uses 60% of the system RAM, and sometimes that goes up to 90% and llama-server throws an out-of-memory error. I thought this was happening because the VRAM was full, but there was at least 1.5 GB free. I don’t understand why it uses RAM when it has free VRAM.

Log:

may 14 13:30:21 ai-server systemd[1592]: llama-cpp.service: The kernel OOM killer killed some processes in this unit.
may 14 13:30:22 ai-server systemd[1592]: llama-cpp.service: Main process exited, code=killed, status=9/KILL
may 14 13:30:22 ai-server systemd[1592]: llama-cpp.service: Failed with result 'oom-kill'.
may 14 13:30:22 ai-server systemd[1592]: llama-cpp.service: Consumed 10min 52.373s CPU time over 54min 33.678s wall clock time, 14G memory peak, 3.7G memory swap peak.
may 14 13:30:28 ai-server systemd[1592]: llama-cpp.service: Scheduled restart job, restart counter is at 1.
may 14 13:30:29 ai-server systemd[1592]: Starting llama-cpp.service - llama.cpp daemon...
may 14 13:30:40 ai-server systemd[1592]: Started llama-cpp.service - llama.cpp daemon.
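The memory peak in the "Consumed" line of the log above can be pulled out and compared against the 16 GB in the machine. This is only a rough sketch: the regex only handles values with a "G" suffix as printed here, it ignores the difference between GB and GiB, and the function name is made up for this example.

```python
import re

PEAK_RE = re.compile(r"([\d.]+)G memory peak, ([\d.]+)G memory swap peak")


def parse_peaks_g(line):
    """Return (memory_peak, swap_peak) in the log's 'G' unit, or None if absent."""
    m = PEAK_RE.search(line)
    return (float(m.group(1)), float(m.group(2))) if m else None


# The "Consumed" line copied from the systemd log above.
LINE = (
    "may 14 13:30:22 ai-server systemd[1592]: llama-cpp.service: Consumed "
    "10min 52.373s CPU time over 54min 33.678s wall clock time, "
    "14G memory peak, 3.7G memory swap peak."
)

peak, swap = parse_peaks_g(LINE)
total_ram_g = 16  # RAM size stated in this post
print(f"memory peak: {peak} G ({peak / total_ram_g:.0%} of RAM), swap peak: {swap} G")
```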

Config:

model: models/Qwen3.6-27B-Q4_K_M.gguf
mmproj: models/mmproj-BF16.gguf
webui-config-file: webui-config.json
batch-size: 1024
ubatch-size: 512
ctx-size: 131072
cache-type-k: q8_0
cache-type-v: q8_0
threads: 4
threads-batch: 8
flash-attn: on
gpu-layers: all
n-gpu-layers: 99
tools: all
alias: Qwen3.6-27B
chat-template-kwargs: '{"preserve_thinking": true}'
jinja
webui-mcp-proxy
host: 0.0.0.0
port: 8080
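For reference, the key/value lines in this config appear to correspond to llama-server command-line flags of the same name (the MTP post writes the same kind of options as `--spec-type draft-mtp --spec-draft-n-max 2 --parallel 1`). A minimal sketch of that mapping, assuming every key is simply the flag name without the leading dashes; the function name is made up for this example:

```python
def config_to_args(text):
    """Turn 'key: value' lines and bare-flag lines into a list of CLI arguments."""
    args = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first ': ' only, so JSON values containing ': ' stay intact.
        key, sep, value = line.partition(": ")
        args.append("--" + key)
        if sep:
            # Drop one pair of surrounding single quotes, as around the JSON value.
            if len(value) >= 2 and value[0] == value[-1] == "'":
                value = value[1:-1]
            args.append(value)
    return args


# A few lines taken from the config above.
sample = """\
ctx-size: 131072
cache-type-k: q8_0
chat-template-kwargs: '{"preserve_thinking": true}'
jinja
"""
print(config_to_args(sample))
```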
u/JGeek00 — 8 days ago

Switch from llama.cpp to vLLM?

I'm currently using llama.cpp on my AI server to run Qwen3.6-27B. I use it for agentic coding with OpenCode. I'm running it on an RTX 3090.

This is my config:

model: llama.cpp/models/Qwen3.6-27B-Q4_K_M.gguf
mmproj: llama.cpp/models/mmproj-BF16.gguf
webui-config-file: llama.cpp/webui-config.json
batch-size: 4096
ubatch-size: 1024
ctx-size: 131072
cache-type-k: q8_0
cache-type-v: q8_0
threads: 8
threads-batch: 16
mlock
jinja
webui-mcp-proxy
tools: all
alias: Qwen3.6-27B
flash-attn: on
gpu-layers: all
chat-template-kwargs: '{"preserve_thinking": true}'
host: 0.0.0.0
port: 8080

With this config I'm getting 38 t/s when the context is empty and around 28 t/s when it's full. Do you think it would be a good idea to switch to vLLM?

u/JGeek00 — 9 days ago

I have been doing some research on TurboQuant and it looks like a huge advantage. What improvements can I expect when switching the KV cache from Q8 to TQ4? I haven’t tried it yet because llama.cpp doesn’t support it yet. I saw that vLLM already supports it, but I also saw that it’s more difficult to set up than llama.cpp, and that put me off.

u/JGeek00 — 20 days ago

Hi everyone. First of all, I’m fairly new to the local AI world, so my knowledge is still limited, but I want to learn more about it. After the recent changes in GitHub Copilot I decided to ditch it and use local LLMs, mainly because I don’t like the uncertainty around cloud AI services in terms of price and usage limits: they change almost every month! With a local server I have no usage limits and I don’t have to think about price increases.

I pulled my old gaming PC, with an i7 6700K and a GTX 1070, out of a closet, installed Debian and llama.cpp, and ran Qwen3.5-9B, which works surprisingly well for normal chat prompts. But I want to use it for agentic coding, so I will buy a second-hand RTX 3090 to run Qwen3.6-27B-Q4, which weighs around 16.5 GB. For agentic coding with OpenCode I need a large context, so I have to compress the KV cache. My question is: which is better for agentic coding, 56K context with a Q8 KV cache or 256K context with a Q4 KV cache? I also read that RAG can help when using coding agents with smaller context windows. I’m also open to other recommendations. Thank you.
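To compare the two options in memory terms, here is a back-of-the-envelope sketch using the usual KV cache size formula (K and V, per layer, per KV head, per token). The per-element sizes follow the q8_0 and q4_0 block layouts (34 and 18 bytes per 32 values), and "Q4" is assumed to mean q4_0. The layer count, KV head count and head dimension below are HYPOTHETICAL placeholders, not the real Qwen3.6-27B values, and the formula assumes every layer keeps a full-attention KV cache.

```python
# Bytes per cached element: f16 is 2 bytes; q8_0 stores 32 values in 34 bytes,
# q4_0 stores 32 values in 18 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimated KV cache size in GiB for one sequence of n_ctx tokens."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2 = K and V
    return elems * BYTES_PER_ELEM[cache_type] / 2**30


# HYPOTHETICAL architecture numbers, for illustration only.
ARCH = {"n_layers": 64, "n_kv_heads": 8, "head_dim": 128}

small = kv_cache_gib(56 * 1024, cache_type="q8_0", **ARCH)
large = kv_cache_gib(256 * 1024, cache_type="q4_0", **ARCH)

print(f"56K  @ q8_0: {small:.1f} GiB")
print(f"256K @ q4_0: {large:.1f} GiB")
print(f"ratio: {large / small:.2f}x")
```

The absolute sizes depend entirely on the placeholder architecture, but the ratio does not: under this formula, 256K at Q4 needs roughly 2.4 times the cache memory of 56K at Q8, so the two options are not equal in memory cost.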

u/JGeek00 — 24 days ago