u/Lower-Ad6101

    #!/usr/bin/zsh
    cd /home/user/bin/

    # Load aliases and clean system caches
    setopt aliases
    source ~/.zshrc
    clearcache

    # Function to reclaim RAM disk space
    cleanup() {
        echo "\n[System] Cleaning up RAM cache at /dev/shm/llama_cache..."
        rm -rf /dev/shm/llama_cache
    }

    # Trap EXIT (script finish), INT (Ctrl+C), and TERM (kill)
    trap cleanup EXIT INT TERM

    # Create fresh RAM cache directory
    mkdir -p /dev/shm/llama_cache

    echo "[System] Starting llama-server with RAM cache..."

    llama-server \
        --slot-save-path /dev/shm/llama_cache \
        -m "/home/user/.lmstudio/models/DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF/qwen36_35b_Q5_K_M.gguf" \
        --n-gpu-layers 41 \
        --n-cpu-moe 31 \
        --ctx-size 24576 \
        --parallel 1 \
        --flash-attn on \
        --cache-type-k q8_0 \
        --cache-type-v q8_0 \
        --threads 4 \
        --threads-batch 4 \
        --split-mode none \
        --batch-size 2048 \
        --ubatch-size 512 \
        --mlock \
        --reasoning on \
        --chat-template-kwargs '{"preserve_thinking": true}' \
        --host 0.0.0.0 \
        --port 8080 \
        --temp 0.3 \
        --top-k 40 \
        --top-p 0.9 \
        --min-p 0.08 \
        --repeat-penalty 1.1 \
        --repeat-last-n 64 \
        --cache-prompt \
        --n-predict -1

    request (44775 tokens) exceeds the available context size (32768 tokens), try increasing it
    {
      "name": "ContextOverflowError",
      "data": {
        "message": "request (44775 tokens) exceeds the available context size (32768 tokens), try increasing it",
        "responseBody": "{\"error\":{\"code\":400,\"message\":\"request (44775 tokens) exceeds the available context size (32768 tokens), try increasing it\",\"type\":\"exceed_context_size_error\",\"n_prompt_tokens\":44775,\"n_ctx\":32768}}"
      }
    }
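The `responseBody` in that error is itself JSON and already carries both numbers needed to see how far over the limit the request was. A minimal sketch of reading it, with the payload copied from the error above (the helper name `context_overflow` is made up for illustration):

```python
import json

# Payload copied from the "responseBody" field of the client error above.
response_body = (
    '{"error":{"code":400,'
    '"message":"request (44775 tokens) exceeds the available context size '
    '(32768 tokens), try increasing it",'
    '"type":"exceed_context_size_error",'
    '"n_prompt_tokens":44775,"n_ctx":32768}}'
)


def context_overflow(body: str) -> int:
    """Return by how many tokens the prompt exceeds the server context (0 if it fits)."""
    err = json.loads(body)["error"]
    return max(0, err["n_prompt_tokens"] - err["n_ctx"])


overflow = context_overflow(response_body)
print(f"prompt is {overflow} tokens over the server limit")
```

Here the server reported `n_ctx` as 32768, so this particular error presumably came from a run with a larger `--ctx-size` than the 24576 in the launch script.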

    {
      "$schema": "https://kilo.ai",
      "model": "llama-cpp/qwen3.6-35b-a3b",
      "small_model": "llama-cpp/qwen3.6-35b-a3b",
      "agent": {
        "concurrency": { "limit": 1 },
        "limit": { "context": 32768, "input": 28000, "output": 4096 },
        "plan": { "model": "llama-cpp/qwen3.6-35b-a3b" },
        "debug": { "model": "llama-cpp/qwen3.6-35b-a3b" },
        "orchestrator": { "model": "llama-cpp/qwen3.6-35b-a3b" },
        "ask": { "model": "llama-cpp/qwen3.6-35b-a3b" },
        "code": { "model": "llama-cpp/qwen3.6-35b-a3b" }
      },
      "provider": {
        "llama-cpp": {
          "name": "Local Qwen3.6-35b-a3b",
          "npm": "@ai-sdk/openai-compatible",
          "options": { "baseURL": "http://localhost:8080/v1" },
          "models": {
            "qwen3.6-35b-a3b": {
              "name": "Qwen3.6 35b A3B",
              "context_window": 32768,
              "max_input_tokens": 22000,
              "reasoning": true,
              "variants": {
                "thinking": {
                  "enable_thinking": true,
                  "chat_template_args": { "enable_thinking": true }
                }
              }
            }
          }
        }
      },
      "instructions": [ "/home/user/proj/kilocode/INSTRUCTIONS.md" ],
      "permission": { "bash": "allow" }
    }
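One thing worth checking is whether the limits in this config can fit the server at all. The sketch below compares the `agent.limit` values with the `--ctx-size 24576` from the launch script. It assumes the server is really running with that value, and it does not claim that Kilo Code honors these keys. `shrink_to_fit` is a hypothetical helper, not part of any tool.

```python
# Values copied from the post. SERVER_CTX is an assumption: it is the
# --ctx-size from the launch script, not something read from the server.
SERVER_CTX = 24576
CLIENT_LIMIT = {"context": 32768, "input": 28000, "output": 4096}


def fits_server(limit: dict, server_ctx: int) -> bool:
    """True if the worst-case request (input + output) fits in server_ctx."""
    return limit["input"] + limit["output"] <= server_ctx


def shrink_to_fit(limit: dict, server_ctx: int) -> dict:
    """Hypothetical helper: keep the output budget, cut input so the sum fits."""
    return {
        "context": server_ctx,
        "input": server_ctx - limit["output"],
        "output": limit["output"],
    }


print(fits_server(CLIENT_LIMIT, SERVER_CTX))    # False: 28000 + 4096 > 24576
print(shrink_to_fit(CLIENT_LIMIT, SERVER_CTX))
```

With these numbers the client is allowed to build prompts larger than the server accepts, so the two sides would need to agree on one value, whichever side ends up being changed.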

    ...
    reasoning-budget: deactivated (natural end)
    slot print_timing: id 0 | task 2 |
    prompt eval time = 72230.52 ms / 13675 tokens ( 5.28 ms per token, 189.32 tokens per second)
    eval time = 10275.10 ms / 157 tokens ( 65.45 ms per token, 15.28 tokens per second)
    total time = 82505.63 ms / 13832 tokens
    slot release: id 0 | task 2 | stop processing: n_tokens = 13831, truncated = 0
    srv update_slots: all slots are idle
    srv params_from_: Chat format: peg-native
    slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.440 (> 0.100 thold), f_keep = 1.000
    reasoning-budget: activated, budget=2147483647 tokens
    slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> temp-ext -> dist
    slot launch_slot_: id 0 | task 484 | processing task, is_child = 0
    slot update_slots: id 0 | task 484 | new prompt, n_ctx_slot = 24576, n_keep = 0, task.n_tokens = 31422
    srv send_error: task id = 484, error: request (31422 tokens) exceeds the available context size (24576 tokens), try increasing it
    slot release: id 0 | task 484 | stop processing: n_tokens = 13831, truncated = 0
    srv stop: cancel task, id_task = 484
    srv update_slots: no tokens to decode
    srv update_slots: all slots are idle
    srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 400
    srv params_from_: Chat format: peg-native
    slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 148257259339
    srv get_availabl: updating prompt cache
    srv prompt_save: - saving prompt with length 13831, total state size = 206.587 MiB
    srv load: - looking for better prompt, base f_keep = 0.000, sim = 0.002
    srv load: - found better prompt with f_keep = 0.426, sim = 0.331
    srv update: - cache state: 1 prompts, 395.027 MiB (limits: 8192.000 MiB, 24576 tokens, 286824 est)
    srv update: - prompt 0x55d63dd3c310: 13831 tokens, checkpoints: 3, 395.027 MiB
    srv get_availabl: prompt cache update took 241.41 ms
    ...
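The `update_slots` line in this log is the one that shows what the server itself thinks its limit is. A small sketch of pulling the two relevant numbers out of that line (the line is copied from the log; the function name `slot_budget` and the regex are only illustrative):

```python
import re

# Line copied from the server log above.
LINE = (
    "slot update_slots: id 0 | task 484 | new prompt, "
    "n_ctx_slot = 24576, n_keep = 0, task.n_tokens = 31422"
)


def slot_budget(line: str) -> tuple[int, int]:
    """Return (slot context size, prompt tokens) parsed from an update_slots line."""
    fields = dict(re.findall(r"([\w.]+) = (\d+)", line))
    return int(fields["n_ctx_slot"]), int(fields["task.n_tokens"])


n_ctx_slot, n_tokens = slot_budget(LINE)
print(f"slot holds {n_ctx_slot} tokens, request has {n_tokens}, "
      f"over by {n_tokens - n_ctx_slot}")
```

The slot size of 24576 matches the `--ctx-size 24576` with `--parallel 1` from the launch script, while the client config advertises 32768, so the request was rejected before any prompt processing started.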

Hi,

Reading here about what people run on their (high-end) hardware configurations, I was very hesitant to even ask for help with tweaking my configuration (squeezing a bit more out of it), as I have a pretty low hardware spec in comparison. But I was encouraged by recent success posts, especially this recent one, so I've decided to ask anyway.

My hardware consists of a GTX 1080 with 8GB VRAM, 32GB DDR4 (2133 MT/s), and an older-gen Intel i5-7600 with 4 cores.

Even though I'm pretty new to running local models, I've tried many models that I could load, from Qwen2.5-coder-[7,14..]-instruct and Qwen3-coder-30b-instruct-480b-distill-v2-i1 to Mistral and gpt-oss, but decided to settle on Qwen3.6-35b-a3b.

My main use as a software engineer is C++ coding and debugging, with Python as a secondary language (still learning).

At first I was consulting Google (AI mode), then switched to ChatGPT for advice about adequate models for my hardware spec (until I decided). I then spent hours, even days, chatting with it about tweaking settings in LM Studio (0.4.12 (Build 1)), restarting the OS (because when a model fails to load, subsequent tries fail immediately, I guess because of memory fragmentation, and nothing helped except a full restart), and then trying something else. I also tried out many agents, mainly to use from within VS Code: Cline, Roo Code, Continue... Aider (outside VS Code), Open Code... (ChatGPT insisted that I stay away from "heavier" agents like Qwen Code and Codex, which are too much for my spec and context length, to which I'll come in a bit).

I've decided to settle for now on Cline (prone to loops, but more natural to interact with than, say, Roo Code) and Continue (not so autonomous, but more compact and faster). Also, I'm not using autocomplete, as it's not crucial for me and everything is already slow as it is.

I'm also running all of this on Linux with KDE (it may not matter much, but I thought I'd mention it since it's a somewhat heavier DE).

Also, I don't mind waiting a little longer (slightly lower speed) if I can keep the intelligence/reasoning.

Following ChatGPT's suggestions, I've come up with the following settings in LM Studio for Qwen3.6-35b-a3b Q4_K_M GGUF:

LM Studio Settings -> Model Defaults:

- Model Loading guardrail: Strict

LM Studio Settings -> Runtime:

- GGUF: CUDA llama.cpp (Linux) v2.13.0

Model Settings:

Load pane:

- Context Length: 12288 (if I go higher, the model fails to load; if I go lower, I can't use Continue and/or Cline)
- GPU Offload: 9 (I remember that I could go up to 10, but then I would need to lower the context length. With any more layers, it fails to load)
- CPU Thread Pool Size: 2 (that's the max, as LM Studio won't let me go higher no matter what, even though I have 4 cores)
- Evaluation Batch Size: 256
- Max Concurrent Predictions: 2
- Unified KV Cache: ON
- RoPE Frequency Base: Unchecked (auto)
- RoPE Frequency Scale: Unchecked (auto)
- Offload KV Cache to GPU Memory: ON
- Keep Model in Memory: ON
- Try mmap(): ON
- Seed: Unchecked (Random Seed)
- Number of Experts: 8
- Number of layers for which to force into CPU: 0
- Flash Attention: ON
- K Cache Quantization Type: Q4_0
- V Cache Quantization Type: Q4_0

Inference pane:

- Temperature: 0.3
- Limit Response Length: Unchecked
- Context Overflow: Truncate Middle
- Stop Strings: empty
- CPU Threads: 2 (max, for the same reason as for CPU Thread Pool Size)
- Start String: <think>
- End String: </think>
- Top K Sampling: 40
- Repeat Penalty: 1.1
- Presence Penalty: Unchecked
- Top P Sampling: 0.9
- Min P Sampling: 0.08
- In the Prompt Template section (Template "Jinja"), I've set this as the first line:
  {%- set preserve_thinking = True %}

- System prompt:
"You are an expert software engineer (C++17/20, Python 3.12).

Goal:

Produce correct, concise, and practical solutions with minimal iteration.

----------------------------------------
General Behavior
----------------------------------------
- Be decisive and avoid unnecessary back-and-forth.
- Prefer simple, correct solutions over complex ones.
- Do not over-engineer.

----------------------------------------
Task Handling
----------------------------------------
- Identify task type implicitly:
  - Design → define structure first
  - Implementation → write complete, correct code
  - Debugging → find root cause and apply minimal fix

- Do not mix modes unnecessarily.
- Complete the current task before switching context.

----------------------------------------
Scope Control
----------------------------------------
- Focus only on relevant code or logic.
- Avoid scanning or rewriting unrelated parts.
- Do not expand scope unless required.

----------------------------------------
Reasoning
----------------------------------------
- Keep reasoning brief (3–5 bullets max).
- Focus on decisions, not exploration.

----------------------------------------
Anti-Loop / Anti-Drift
----------------------------------------
- Do not repeat the same failed approach.
- If uncertain, make the most likely assumption and proceed.
- Avoid re-analyzing the same information.

----------------------------------------
Code Quality
----------------------------------------
- Do not invent variables or APIs.
- Ensure consistency across the solution.
- Avoid partial or broken implementations.

----------------------------------------
Output
----------------------------------------
- Be concise and direct.
- Show only relevant code or results.
- Do not include unnecessary explanation unless asked."

With these settings, LM Studio's chat shows around 3.50 tok/sec after generation finishes (sometimes it's 3.48, sometimes 3.70). Very, very slow, I know... and it's also very bad at finding and fixing bugs, though better than the models I've tried before.

Now, I know it's a lot to ask, but I would like to hear some advice from you for my use case (C++ and Python), considering my hardware spec, about:

- what model should I use (Q4_K_M, Q5_K_S... i1-Q4_K_S...)?
- what settings should I use for it?

Thanks!

Kilo Code refuses to respect context size