u/HomoAgens1

NVIDIA Nemotron — does anyone actually use it?

Everyone seems to be running Gemma 4 or some version of Qwen. Nemotron gets almost no mentions. Is it just less visible because it's NVIDIA, or is there a real reason nobody talks about it?

Has anyone benchmarked it against Qwen3 or Gemma 4 on reasoning/code tasks? Is it even worth trying locally?

Also open to suggestions: if you were running something comparable to Qwen3.6-35B-A3B Q5_K_M on 12GB VRAM, what would you pick instead?

reddit.com
u/HomoAgens1 — 10 days ago

Qwen3.6-35B-A3B Q5_K_M on 12GB VRAM — working llama.cpp config

Quick config share for anyone with a 12GB card and enough system RAM who wants to run Qwen3.6-35B-A3B at Q5 quality.

Hardware

  • GPU: NVIDIA RTX A2000 12GB
  • RAM: 128GB
  • OS: Oracle Linux Server release 9.7, llama.cpp latest CUDA build (13.2), Driver: 595.71.05

Performance

  • Prompt processing: 79 tok/s
  • Generation: 35 tok/s
  • VRAM: ~10.3 GB
  • RAM: ~18.4 GB resident (~13.3 GB are MoE expert weights in CPU pinned memory, confirmed from llama.cpp load log)

The trick: -ncmoe

Qwen3.6-35B-A3B is MoE (35B total parameters, ~3B active per token). -ncmoe N offloads N expert blocks to CPU RAM. With enough system RAM this is the key to fitting a 35B model on 12GB VRAM.

Each MoE block costs ~500 MiB on GPU with Q5_K_M. Other guides suggest -ncmoe 18 but those are calibrated on IQ4_XS — a much smaller quant. On Q5_K_M, -ncmoe 18 crashes with out of memory. -ncmoe 26 fits with ~1 GB to spare, -ncmoe 28 is safer if you have other processes using VRAM.

Config

llama-server \
    -hf bartowski/Qwen_Qwen3.6-35B-A3B-GGUF \
    -hff Qwen_Qwen3.6-35B-A3B-Q5_K_M.gguf \
    -ngl 999 \
    -ncmoe 26 \
    -c 32768 \
    -ctk q8_0 \
    -ctv q8_0 \
    --flash-attn on \
    -t 16 \
    --no-mmap \
    --jinja
  • -hf / -hff: HuggingFace repo and filename — llama.cpp downloads the model automatically on first run
  • -ngl 999: put all layers on GPU; -ncmoe then overrides how many MoE expert blocks actually stay there
  • -ncmoe 26: keep 26 MoE expert blocks on CPU RAM instead of VRAM (~500 MiB saved per block)
  • -c 32768: context window in tokens (32K).
  • -ctk q8_0 -ctv q8_0: 8-bit KV cache — halves KV cache VRAM with no measurable quality loss on this GPU
  • --flash-attn on: faster attention with lower VRAM usage during inference. Write on explicitly — without the value, llama.cpp parses the next flag as the argument and crashes silently
  • -t 16: CPU threads for the offloaded MoE experts — set to your physical core count
  • --no-mmap: load the full model into RAM before serving. Slower startup, more stable inference
  • --jinja: use the chat template embedded in the GGUF. Required for Qwen3 models

Thinking mode

The model thinks by default. Use /no_think at the start of your message for quick tasks, let it think for reasoning/code. The quality difference is real.

35 tok/s on a 35B model at Q5 feels solid. In practice this config works well as a stable backend for agentic AI pipelines — the generation speed is fast enough that multi-step agents don't feel sluggish waiting for each LLM call. Happy to answer questions.

reddit.com
u/HomoAgens1 — 12 days ago
▲ 15 r/LocalLLM+2 crossposts

I built a local autonomous agent that streams every reasoning step live in the UI — no black boxes

Hey r/ollama,

I've been building Pragma, an open-source autonomous agent that runs entirely on Ollama. The thing that bothered me about most agents is that you have no idea what they're actually doing — you give them a task and wait.

Pragma shows you everything in real time: every thought, every tool call, every observation, as it happens.

What it does:

  • Runs a ReAct loop (think → act → observe → repeat) and streams each step live in the UI
  • Two models: a small reasoning model for orchestration, a coding model (Qwen 2.5 Coder) for code generation
  • Skill palette: filesystem, shell, web search, LLM calls, and more — each skill is a folder you can add to
  • Threads with persistent history, working directory per conversation
  • No API key, no cloud, everything stays local

Stack: FastAPI + Vanilla JS + WebSocket. No framework magic, every file is understandable in isolation.

Tested on: NVIDIA RTX A2000 12GB with Gemma 4 E4B (reasoning) + Qwen 2.5 Coder 7B (code). 12GB VRAM is the practical minimum for Gemma, 24GB gives more headroom.

Repo: https://github.com/homoagens/pragma

Happy to answer questions about the architecture, the skill system, or how the ReAct loop works.

u/HomoAgens1 — 11 hours ago