u/Jorlen

Can't believe I got it working!  Dual GPU - 48gb VRAM llama-cpp server - R7900 + 7800XT

Setup: Kubuntu 24.04 - AMD cards - R9700 AI PRO and 7800xt (32gb + 16gb) - llama-cpp server - stack setup in docker - vulkan image

I tried with ROCM but it wouldn't play nice with RDNA4 + RDNA3 mix.

Vulkan seems to work. I tested a quick prompt; hopefully it's stable, because if so, this gives me 48gb of VRAM to play with. Had to buy a new power supply, but for $300 and to be able to leverage my older 7800xt - well worth it, I think.
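
For anyone trying the same 32gb + 16gb mix: a rough sketch of working out a split ratio for llama.cpp's --tensor-split option, which takes a comma-separated list of proportions per GPU. The helper name and the device order are assumptions for illustration, not something from my actual config, and a plain VRAM-proportional split is only a starting point.

```python
from functools import reduce
from math import gcd


def tensor_split(vram_gb):
    """Reduce per-GPU VRAM sizes (whole GB) to the smallest whole-number ratio."""
    divisor = reduce(gcd, vram_gb)
    return ",".join(str(v // divisor) for v in vram_gb)


# 32 GB R9700 + 16 GB 7800 XT.
# Assumption: the backend enumerates the R9700 first; check the server log.
split = tensor_split([32, 16])
print(f"--tensor-split {split}")
```

In practice the ratio may need nudging, since the KV cache and compute buffers also take VRAM on each card.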

Edit: I have dyslexia with numbers - the title reads R7900, but it's an R9700.

u/Jorlen — 12 hours ago

Seeking resources to read about llama.cpp server and how offloading works

SETUP INFO: AMD R9700 AI PRO. Using llama-cpp server, ROCm docker version. Using the -ngl (--n-gpu-layers) option to offload.


First of all, I'm greatly impressed by how llama-cpp server handles offloading. There's some fucking magic happening here, at least to me.

I have 32gb of VRAM so loading in the small models is no problem, but now I'm starting to experiment with models that spill into system RAM, testing tok/sec differences and various quants.

I'm currently testing Qwen3 Coder Next. At Q4_K_M, this thing weighs in at 45gb. I can make that one work, but the more offloading I do, the slower it is (obviously). So I'm currently testing the smaller 4-bit quant, IQ4_XS at 36gb, trying to find the middle ground before quality starts to suffer.

If I offload 36 layers, it fills my vram 30/32gb. Tok / sec is around 25, which for an MoE model is not great at all - at least I don't think it is. I tried the 3-bit quant which fits fully in memory, but after multiple quality issues, I gave up on it. I think for large models and coding, 3-bit is just too much compression, or at least it feels like it. (anyone else have this impression? or is it just me?)

Anyways - to my actual question - how the hell does llama-cpp do this magic? I am monitoring RAM usage and swap file and neither of them are very high, yet I only have 30gb loaded out of this model, including 120k unquantized KV cache context... It's basically impossible, so clearly I am missing something about how Kubuntu 24.04 manages system resources.

Is my KDE5 widget for RAM not capturing what llama-cpp is up to? I'd like to read up on how it works or if someone can explain it to my dumb ass, I'd greatly appreciate it lol.
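
One possible explanation, offered as a guess rather than a diagnosis: llama.cpp memory-maps the model file by default (the --no-mmap flag turns that off), and pages of a mapped file are typically counted by Linux as page cache rather than as "used" application memory. A widget that only shows "used" would then miss them. A small sketch for checking this against /proc/meminfo while the model is loaded; the function names are made up for illustration.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def summarize(info):
    """Convert the fields of interest from kB to GiB."""
    kb_per_gib = 1024 * 1024
    return {
        "total_gib": info["MemTotal"] / kb_per_gib,
        "available_gib": info["MemAvailable"] / kb_per_gib,
        "page_cache_gib": info["Cached"] / kb_per_gib,
    }


meminfo = Path("/proc/meminfo")
if meminfo.exists():  # Linux only
    for name, value in summarize(parse_meminfo(meminfo.read_text())).items():
        print(f"{name}: {value:.1f}")
```

If page_cache_gib jumps by roughly the size of the non-offloaded part of the model after loading, that would be consistent with the mmap explanation.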


EDIT: Offloading also has a nice bonus benefit of being QUIET. For anyone with a very loud GPU fan, it's a nice break. Yes it's slower but I can work on other tabs and windows while it processes and actually hear myself think. I might do more of this.

u/Jorlen — 18 hours ago

Developers who use local AI - Q4_0 vs Q8_0 KV quant?

I'd love to hear from developers who use big context windows if they notice a difference?

Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territory.
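
To put rough numbers on "cut in half", here is a back-of-the-envelope sketch. It assumes plain attention where every layer keeps a full K and V cache (hybrid architectures will differ), and the model dimensions are hypothetical placeholders, not the real Qwen values. The per-element sizes come from the quant block layouts: Q8_0 stores 32 int8 values plus an fp16 scale (34 bytes), Q4_0 stores 32 4-bit values plus an fp16 scale (18 bytes).

```python
# Bytes per cached element for each KV cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate KV cache size in GiB for a plain-attention model."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 1024**3


# Hypothetical dimensions, for illustration only.
dims = dict(n_ctx=50_000, n_layers=48, n_kv_heads=8, head_dim=128)
for cache_type in BYTES_PER_ELEMENT:
    print(f"{cache_type}: {kv_cache_gib(cache_type=cache_type, **dims):.2f} GiB")
```

Under these assumptions Q4_0 lands at 18/34 of the Q8_0 size, so slightly more than half.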

I don't really need a full study, just wondering, anecdotally, what people have experienced.

My current setup: Docker stack with llama.cpp server at the helm (Vulkan - I pay AMD tax daily) - 32GB VRAM, using mostly Qwen 3.6 models for development. I go back and forth between the 27b dense and 35b MoE, with a dash of the lil guy (3.5 9B omnicoder variant) for smaller stuff since it's so zippy and uses a shite-ton less VRAM.

u/Jorlen — 6 days ago

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)

In my opinion, MTP models are 100% game changer for local LLMs.

In terms of speed, I was getting around 1.5x the tok/sec of previous tests.

The project was a test - building a full iterative step-by-step pygame; a small mystery dungeon-style game. At first I set 100-200k context and raised it to 300k. This is at KV Q8_0 quant. Edit: I was wrong, I had mistakenly left it at q4_0. I will redo tests tomorrow with Q8.

I use VSCodium and Roo. The idea was to see how far I can push the context window and measure (by feel) if a large context window with a multi-file project slows it down too much to be effective.

Model used: Qwen3.6-35B-A3B-UD-Q5_K_S (MTP version) - link

OS/Software: Ubuntu 24.04 - Vulkan - To use MTP I had to use a docker version of the MTP prototype of llama.cpp server (image: havenoammo/llama:vulkan-server)

My current window is 300k context but I feel like I can go even higher as my VRAM used is 28.3gb / 32gb. Likely 400k is viable (with the 35B MoE model that is).

GPU: Asus Radeon R9700 AI Pro card (32gb RDNA 4 card)

Just want to shoot my appreciation for the local LLM community and everyone responsible for enabling us to run these kinds of powerful models at home. Amazing when I think where we were just a year ago. I am having a blast exploring all this tech and every day that I learn something new, it just leaves me astounded.

EDIT: Switched to the Qwen 3.6 27b model (non-MoE) as I was running into issues with the MoE model when deep in a context session (200k-ish). Will update results.

u/Jorlen — 8 days ago

Linux - Why does llama.cpp ROCm consume SO much VRAM for KV cache compared to Vulkan?

I have a docker stack with a bunch of AI services and llama.cpp server is the brain.

I've got a working Vulkan yml snippet for llama.cpp, but out of curiosity I flipped it to ROCm (latest build) and did not see ANY performance improvement. In fact, I noticed that for the SAME model, SAME context setting and SAME KV cache quant (Q8_0), the ROCm version consumed 29.1gb of VRAM vs. 25.3gb with Vulkan.

Am I missing something here? Is this phenomenon unique to my GPU or some other variable in my setup, hardware or software?

Edit: To clarify, the above test was done on the same model, no prompt data, no existing context, no system prompt. Tabula rasa. The model in question was a 22.6gb file.
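
For anyone who wants to reproduce the comparison, a minimal sketch of reading VRAM usage straight from the amdgpu sysfs counters (mem_info_vram_used and mem_info_vram_total, both in bytes), so that both backends are measured the same way. The function name is made up, and this is just one way to take the reading, not necessarily how the numbers above were collected.

```python
from pathlib import Path


def vram_usage(drm_root="/sys/class/drm"):
    """Return {card_name: (used_bytes, total_bytes)} for amdgpu devices."""
    usage = {}
    pattern = "card*/device/mem_info_vram_used"
    for used_file in sorted(Path(drm_root).glob(pattern)):
        total_file = used_file.with_name("mem_info_vram_total")
        card = used_file.parent.parent.name
        usage[card] = (int(used_file.read_text()), int(total_file.read_text()))
    return usage


# Prints nothing on machines without an amdgpu device.
for card, (used, total) in vram_usage().items():
    print(f"{card}: {used / 1024**3:.1f} / {total / 1024**3:.1f} GiB")
```

Taking one reading before starting the container and one after the model has loaded gives the delta for each backend; with a 22.6gb model file, whatever is left over is KV cache plus buffers and runtime overhead.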

u/Jorlen — 9 days ago

Ubuntu 24.04 - AMD - OpenAI - anyone get STT working?

I've tried just about everything I can think of to get a speech-to-text engine working with GPU. Vulkan seems to be compatible with a lot of stuff, does anyone know a good one? I've tried whisper (docker / local) and speaches. They all just work with CPU only.

Would prefer to keep it in docker stack but I don't care anymore, if it has to be installed / running local, that's fine too, so long as it works.

GPU: AMD Radeon R9700 AI PRO

u/Jorlen — 9 days ago

I'm at my wits' end... can anyone help? Ubuntu 24.04 with R9700 AI PRO - Docker ComfyUI woes

RESOLVED - SEE NOTES BELOW

I just cannot get this thing stable. It generates a few images, then a few full black images and then crashes.

I have tried so many different images, docker config yamls, you name it. Probably dozens of hours of trial and error. Note that I can run a non-stop LLM model without any issues - 100% stable. Games are fine, anything else GPU related - no problems.. It's just.. Comfyui that won't play nice.

Please share your config if you are using the same setup:

  1. Ubuntu 24.04 LTS

  2. AMD Radeon R9700 AI Pro card

  3. Docker image version of Comfyui

Thanks in advance and happy generating!


Finally found a working config. If anyone needs to borrow some of these settings, just remember this is for the Radeon R9700 AI Pro card, Ubuntu 24.04 LTS, running with Docker Comfyui and ROCM setup. Carefully use some of these settings, not all will apply to your config but the main core components, such as the image, etc. should be stable.

image: yanwk/comfyui-boot:rocm7
container_name: comfyui
restart: "no"
networks:
  - ai_network
ports:
  - "8188:8188"
shm_size: "16gb"
ipc: host
security_opt:
  - seccomp:unconfined
group_add:
  - video
  - "992"  # numeric GID, presumably the host's render group (check with: getent group render)
devices:
  - /dev/kfd:/dev/kfd
  - /dev/dri:/dev/dri
volumes:
  - ./comfyui_custom_nodes:/root/ComfyUI/custom_nodes
  - ./comfyui_models:/root/ComfyUI/models
  - ./comfyui_output:/root/ComfyUI/output
  - ./comfyui_user:/root/ComfyUI/user
environment:
  ROCM_PATH: "/opt/rocm"
  HSA_OVERRIDE_GFX_VERSION: "12.0.1"
  HSA_ENABLE_SDMA: "0"
  HSA_ENABLE_SDMA_COPY: "0"
  PYTORCH_HIP_ALLOC_CONF: "expandable_segments:True"
  
  # Removed HSA_DISABLE_CACHE and MIOPEN flags so the CPU can rest!
  
  # Removed disable-smart-memory so the GPU runs at full speed
  CLI_ARGS: "--highvram"
u/Jorlen — 10 days ago

Hi everyone! I'm new to the wonderful world of LLMs and I'm having an issue which is probably really basic, but I just can't figure out after a lot of wasted time.

I started with LM studio and I have several models already downloaded. I figured I could:

  1. Create an Ollama model from the .GGUF file that LM Studio already downloaded

  2. I am creating a file called model containing the full filename of the .GGUF, and then, in the same folder, running the command ollama create <new model name for ollama> followed by the name of that file

  3. I am on Linux if it makes any difference (probably not)
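
The import steps above can be sketched roughly as follows. This assumes the usual Ollama flow of a Modelfile with a FROM line pointing at the .GGUF, then `ollama create <name> -f <Modelfile>`. The helper names and paths are placeholders. One guess at the off-topic output is a missing or wrong chat template, which is why the sketch leaves room for an optional TEMPLATE entry; that is an assumption, not a confirmed cause.

```python
import subprocess
from pathlib import Path


def build_modelfile(gguf_path, template=None):
    """Return Modelfile text that points Ollama at a local GGUF file."""
    lines = [f"FROM {gguf_path}"]
    if template is not None:
        # Optional chat template; the right one depends on the model.
        lines.append(f'TEMPLATE """{template}"""')
    return "\n".join(lines) + "\n"


def import_gguf(name, gguf_path, workdir="."):
    """Write the Modelfile and register the model with Ollama."""
    modelfile = Path(workdir) / "Modelfile"
    modelfile.write_text(build_modelfile(gguf_path))
    subprocess.run(["ollama", "create", name, "-f", str(modelfile)], check=True)


# Hypothetical path, for illustration only.
print(build_modelfile("./my-model.gguf"), end="")
```

Calling import_gguf("my-model", "./my-model.gguf") would then run the actual import, provided the ollama binary is on the PATH.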

Every time I do this, the model that gets "imported" into Ollama is just.. broken, it responds but it generates things I didn't ask for.

Most recently I tried it with this .GGUF file: Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill-V2-FP32-i1-GGUF

Am I somehow messing up the model's base parameters? Is there a better way, other than just redownloading the models from Ollama?

Thanks and sorry if this is a stupid question! Hopefully it's a simple solution.

u/Jorlen — 19 days ago

Hey all, new to the space, having a lot of fun and I'm learning how to code with a model and realizing how impressive they are.

I have some questions, and I'm mainly just curious as to what other people do and what they like and how much of a model hoarder y'all are ;)

My use cases now are vscode and to help me with technical issues. I realize models are trained and their data is basically a container, some are a year old, etc so it's not always good for recent tech stuff.

  1. How many models do you have? Ballpark GB of storage used?

  2. What's your favorite model group? Gemma3-4, Llama, Qwen, etc.

  3. What's your main use? Development? Creative writing? Automation of home-based systems? Using it for work / business?

  4. How do you determine if a given model is good for you, personally? Do you have a series of tests you throw at it or do you just improve and take your time?

  5. For vscode - and learning to code - what's a good system or extension to leverage a local LLM to help me out? I'm pasting code just in LM studio back and forth, and it works but I know there are better ways. Would you recommend a different IDE? I am not tied to vscode; it's what was suggested to me

  6. What tools do you guys use to help local models talk to other locally hosted services? Do you build your own, use out-of-the-box stuff? Right now I have SearXNG locally hosted and I had fun getting the LLM to talk to it and return searches, just with basic python. A whole new world of possibilities awaits and I'm curious what you guys are doing!
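
For context on the SearXNG bit in item 6, this is roughly the kind of "basic python" involved: a minimal sketch, not my exact script. It assumes the instance has the json output format enabled in its settings and listens on a local address; the base URL and function names are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query):
    """Build a SearXNG search URL that asks for JSON results."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def search(base_url, query, limit=5):
    """Return (title, url) pairs for the top results."""
    with urlopen(build_search_url(base_url, query), timeout=10) as resp:
        data = json.load(resp)
    return [(r.get("title"), r.get("url")) for r in data.get("results", [])[:limit]]


# Placeholder address; adjust to wherever the instance is hosted.
print(build_search_url("http://localhost:8080", "llama.cpp offloading"))
```

The list returned by search() can be pasted into the prompt as plain text, which is enough for a model to summarize or cite results.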

Any other advice is most welcome. If there's a good guide to help just learn the fundamentals, that would be cool too. LM studio is what I'm using and the sheer amount of settings, along with jinja templates, system prompts.. there's a lot to absorb.

u/Jorlen — 22 days ago