u/Lower-Ad6101

Kilo Code refuses to respect context size

Hi,

I had been using Roo Code until it was recently discontinued, so I switched to Kilo Code. With help from Google's AI, I've spent almost two days trying to get it working correctly, with no success: it just keeps overflowing my model's context size.
Note that I was using Roo Code with LM Studio, but have since switched to llama.cpp.
This is my `llama-qwen.zsh` script for launching `llama-server`:

#!/usr/bin/zsh

cd /home/user/bin/

# Load aliases and clean system caches
setopt aliases
source ~/.zshrc
clearcache

# Function to reclaim RAM disk space
cleanup() {
  echo "\n[System] Cleaning up RAM cache at /dev/shm/llama_cache..."
  rm -rf /dev/shm/llama_cache
}

# Trap EXIT (script finish), INT (Ctrl+C), and TERM (kill)
trap cleanup EXIT INT TERM

# Create fresh RAM cache directory
mkdir -p /dev/shm/llama_cache

echo "[System] Starting llama-server with RAM cache..."

llama-server \
  --slot-save-path /dev/shm/llama_cache \
  -m "/home/user/.lmstudio/models/DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF/qwen36_35b_Q5_K_M.gguf" \
  --n-gpu-layers 41 \
  --n-cpu-moe 31 \
  --ctx-size 24576 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --threads 4 \
  --threads-batch 4 \
  --split-mode none \
  --batch-size 2048 \
  --ubatch-size 512 \
  --mlock \
  --reasoning on \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --host 0.0.0.0 \
  --port 8080 \
  --temp 0.3 \
  --top-k 40 \
  --top-p 0.9 \
  --min-p 0.08 \
  --repeat-penalty 1.1 \
  --repeat-last-n 64 \
  --cache-prompt \
  --n-predict -1

When I was using Roo Code, it condensed the context often but got the work done. Now Kilo Code shows a red text box inside its VS Code extension GUI saying:

request (44775 tokens) exceeds the available context size (32768 tokens), try increasing it
{
  "name": "ContextOverflowError",
  "data": {
    "message": "request (44775 tokens) exceeds the available context size (32768 tokens), try increasing it",
    "responseBody": "{\"error\":{\"code\":400,\"message\":\"request (44775 tokens) exceeds the available context size (32768 tokens), try increasing it\",\"type\":\"exceed_context_size_error\",\"n_prompt_tokens\":44775,\"n_ctx\":32768}}"
  }
}
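
The error comes down to a simple budget check: the prompt has to fit into the context window the server was started with. A minimal sketch of that arithmetic in shell, using the numbers from the error above. The function name `fits_context` and the optional reserve for the reply are illustrative assumptions, not something llama-server or Kilo Code exposes.

```shell
# fits_context PROMPT_TOKENS N_CTX [RESERVED_OUTPUT]
# Hypothetical helper: succeeds only if the prompt plus the tokens
# reserved for the reply fit into the server's context window.
fits_context() {
  n_prompt=$1
  n_ctx=$2
  n_reserve=${3:-0}
  [ $((n_prompt + n_reserve)) -le "$n_ctx" ]
}

# Numbers from the error: 44775 prompt tokens against n_ctx = 32768.
if fits_context 44775 32768; then
  echo "fits"
else
  echo "overflow by $((44775 - 32768)) tokens"
fi
```

With these inputs the check fails and reports an overflow of 12007 tokens, so the client would have to condense or truncate the conversation before sending it.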

My `kilo.jsonc` looks like this:

{
  "$schema": "https://kilo.ai",
  "model": "llama-cpp/qwen3.6-35b-a3b",
  "small_model": "llama-cpp/qwen3.6-35b-a3b",
  "agent": {
    "concurrency": {
      "limit": 1
    },
    "limit": {
      "context": 32768,
      "input": 28000,
      "output": 4096
    },
    "plan": {
      "model": "llama-cpp/qwen3.6-35b-a3b"
    },
    "debug": {
      "model": "llama-cpp/qwen3.6-35b-a3b"
    },
    "orchestrator": {
      "model": "llama-cpp/qwen3.6-35b-a3b"
    },
    "ask": {
      "model": "llama-cpp/qwen3.6-35b-a3b"
    },
    "code": {
      "model": "llama-cpp/qwen3.6-35b-a3b"
    }
  },
  "provider": {
    "llama-cpp": {
      "name": "Local Qwen3.6-35b-a3b",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:8080/v1"
      },
      "models": {
        "qwen3.6-35b-a3b": {
          "name": "Qwen3.6 35b A3B",
          "context_window": 32768,
          "max_input_tokens": 22000,
          "reasoning": true,
          "variants": {
            "thinking": {
              "enable_thinking": true,
              "chat_template_args": {
                "enable_thinking": true
              }
            }
          }
        }
      }
    }
  },
  "instructions": [
    "/home/user/proj/kilocode/INSTRUCTIONS.md"
  ],
  "permission": {
    "bash": "allow"
  }
}
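
One thing that can be checked mechanically is whether the limits declared on the client side fit inside the window the server was launched with. The sketch below compares the `--ctx-size` value from the launch script with the `limit` block from `kilo.jsonc`. The helper `check_limits` is hypothetical, and whether Kilo Code actually honors these keys is not something the sketch can confirm.

```shell
# check_limits SERVER_CTX CLIENT_CONTEXT CLIENT_INPUT CLIENT_OUTPUT
# Hypothetical sanity check: prints one line per client-side limit
# that cannot be satisfied by the server's context window.
check_limits() {
  server_ctx=$1
  client_ctx=$2
  client_in=$3
  client_out=$4
  rc=0
  if [ "$client_ctx" -gt "$server_ctx" ]; then
    echo "context $client_ctx > server ctx $server_ctx"
    rc=1
  fi
  if [ $((client_in + client_out)) -gt "$server_ctx" ]; then
    echo "input + output $((client_in + client_out)) > server ctx $server_ctx"
    rc=1
  fi
  return "$rc"
}

# --ctx-size from the launch script vs. the "limit" block in kilo.jsonc
check_limits 24576 32768 28000 4096 || echo "client limits exceed the server window"
```

For the values shown, both checks fail: the client is told it has 32768 tokens while the server log reports `n_ctx_slot = 24576`.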

llama-server output:

...
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 2 | 
prompt eval time =   72230.52 ms / 13675 tokens (    5.28 ms per token,   189.32 tokens per second)
       eval time =   10275.10 ms /   157 tokens (   65.45 ms per token,    15.28 tokens per second)
      total time =   82505.63 ms / 13832 tokens
slot      release: id  0 | task 2 | stop processing: n_tokens = 13831, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.440 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 484 | processing task, is_child = 0
slot update_slots: id  0 | task 484 | new prompt, n_ctx_slot = 24576, n_keep = 0, task.n_tokens = 31422
srv    send_error: task id = 484, error: request (31422 tokens) exceeds the available context size (24576 tokens), try increasing it
slot      release: id  0 | task 484 | stop processing: n_tokens = 13831, truncated = 0
srv          stop: cancel task, id_task = 484
srv  update_slots: no tokens to decode
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 400
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 148257259339
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 13831, total state size = 206.587 MiB
srv          load:  - looking for better prompt, base f_keep = 0.000, sim = 0.002
srv          load:  - found better prompt with f_keep = 0.426, sim = 0.331
srv        update:  - cache state: 1 prompts, 395.027 MiB (limits: 8192.000 MiB, 24576 tokens, 286824 est)
srv        update:    - prompt 0x55d63dd3c310:   13831 tokens, checkpoints:  3,   395.027 MiB
srv  get_availabl: prompt cache update took 241.41 ms
...
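
To see how far each rejected request overshoots, the relevant numbers can be pulled out of the log. A small sketch, assuming the `new prompt` line keeps the format shown in the excerpt above (other llama.cpp builds may print it differently). `overflow_from_log` is a made-up name.

```shell
# overflow_from_log: reads llama-server log text on stdin and prints
# "n_ctx_slot n_tokens overshoot" for every "new prompt" line.
# The pattern is written against the log excerpt in this post only.
overflow_from_log() {
  sed -nE 's/.*n_ctx_slot = ([0-9]+),.*task\.n_tokens = ([0-9]+).*/\1 \2/p' |
    while read -r n_ctx n_tokens; do
      echo "$n_ctx $n_tokens $((n_tokens - n_ctx))"
    done
}

printf '%s\n' \
  'slot update_slots: id  0 | task 484 | new prompt, n_ctx_slot = 24576, n_keep = 0, task.n_tokens = 31422' |
  overflow_from_log
```

For task 484 this prints `24576 31422 6846`, i.e. the request was 6846 tokens larger than the slot.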

Is there any fix I could try, or should I switch to Cline until something is fixed?

Thanks in advance.

reddit.com
u/Lower-Ad6101 — 3 days ago

Hi,

Reading here about what people run on their (high-end) hardware, I was very hesitant to even ask for help with tweaking my configuration (squeezing a bit more out of it), since my hardware is pretty low-spec in comparison. But I was encouraged by recent success posts, especially this recent one, so I've decided to ask anyway.

My hardware consists of a GTX 1080 with 8 GB VRAM, 32 GB DDR4 (2133 MT/s), and an older-generation Intel i5-7600 with 4 cores.

Even though I'm pretty new to running local models, I've tried many models that I could load, from Qwen2.5-coder-[7,14..]-instruct and Qwen3-coder-30b-instruct-480b-distill-v2-i1 to Mistral and gpt-oss, but decided to settle on Qwen3.6-35b-a3b.

As a software engineer, my main use is coding and debugging, primarily in C++ and secondarily in Python (which I'm learning).

At first I consulted Google (AI mode) and then switched to ChatGPT for advice about suitable models for my hardware (until I decided). I then spent hours, even days, chatting with it about tweaking settings in LM Studio (0.4.12 (Build 1)), restarting the OS (when a model fails to load, subsequent tries fail immediately, I guess because of memory fragmentation, and nothing helped except a full restart), and then trying something else.

I also tried out many agents, mainly for use from within VS Code: Cline, Roo Code, Continue... plus Aider (outside VS Code), Open Code... (ChatGPT insisted that I stay away from "heavier" agents like Qwen Code and Codex, which are too much for my hardware and context length, which I'll come to in a bit).

I've decided to settle for now on Cline (prone to loops, but more natural to interact with than, say, Roo Code) and Continue (less autonomous, but more compact and faster). Also, I'm not using autocomplete, as it's not crucial for me and everything is already slow as it is.

I'm running all of this on Linux with KDE (it may not matter much, but I thought I'd mention it since it's a somewhat heavier DE).

Also, I don't mind waiting a little longer (slightly lower speed) if it means keeping the intelligence/reasoning quality.

Following ChatGPT's suggestions, I've come up with the following settings in LM Studio for the Qwen3.6-35b-a3b Q4_K_M GGUF:

LM Studio Settings -> Model Defaults:

- Model Loading guardrail: Strict

LM Studio Settings -> Runtime:

- GGUF: CUDA llama.cpp (Linux) v2.13.0

Model Settings:

Load pane:

- Context Length: 12288 (if I go higher, the model fails to load; if I go lower, I can't use Continue and/or Cline)
- GPU Offload: 9 (I remember that I could go up to 10, but then I would need to lower the context length. Any higher and it fails to load)
- CPU Thread Pool Size: 2 (that's the max, as LM Studio won't let me go higher no matter what, even though I have 4 cores)
- Evaluation Batch Size: 256
- Max Concurrent Predictions: 2
- Unified KV Cache: ON
- RoPE Frequency Base: Unchecked (auto)
- RoPE Frequency Scale: Unchecked (auto)
- Offload KV Cache to GPU Memory: ON
- Keep Model in Memory: ON
- Try mmap(): ON
- Seed: Unchecked (Random Seed)
- Number of Experts: 8
- Number of layers for which to force into CPU: 0
- Flash Attention: ON
- K Cache Quantization Type: Q4_0
- V Cache Quantization Type: Q4_0
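
The trade-off between context length and KV cache quantization in the list above can be estimated with a rough formula. The sketch below assumes a plain transformer in which every layer keeps full K and V tensors, which may not hold for this model, and the layer/head numbers are placeholders rather than the real Qwen3.6 values. The block sizes are the ggml storage sizes for 32 values: 64 bytes for f16, 34 for q8_0, 18 for q4_0.

```shell
# kv_cache_bytes LAYERS KV_HEADS HEAD_DIM CTX BLOCK_BYTES
# Rough KV cache size in bytes. BLOCK_BYTES is the storage needed for
# 32 cached values: 64 for f16, 34 for q8_0, 18 for q4_0.
kv_cache_bytes() {
  elements=$((2 * $1 * $2 * $3 * $4))   # 2 = one K and one V tensor
  echo $((elements * $5 / 32))
}

# Placeholder architecture (NOT the real Qwen3.6 values): 40 layers,
# 4 KV heads, head dimension 128, at the 12288-token context above.
for block in 64 34 18; do
  echo "$block bytes/block: $(( $(kv_cache_bytes 40 4 128 12288 "$block") / 1048576 )) MiB"
done
```

Under those placeholder numbers the cache shrinks from 960 MiB (f16) to 510 MiB (q8_0) and 270 MiB (q4_0), which is the kind of saving that decides whether one more layer fits into 8 GB of VRAM.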

Inference pane:

- Temperature: 0.3
- Limit Response Length: Unchecked
- Context Overflow: Truncate Middle
- Stop Strings: empty
- CPU Threads: 2 (max, for the same reason as for CPU Thread Pool Size)
- Start String: <think>
- End String: </think>
- Top K Sampling: 40
- Repeat Penalty: 1.1
- Presence Penalty: Unchecked
- Top P Sampling: 0.9
- Min P Sampling: 0.08
- In the Prompt Template section (Template "Jinja"), I've set this as the first line:
  {%- set preserve_thinking = True %}

- System prompt:
"You are an expert software engineer (C++17/20, Python 3.12).

Goal:

Produce correct, concise, and practical solutions with minimal iteration.

----------------------------------------
General Behavior
----------------------------------------
- Be decisive and avoid unnecessary back-and-forth.
- Prefer simple, correct solutions over complex ones.
- Do not over-engineer.

----------------------------------------
Task Handling
----------------------------------------
- Identify task type implicitly:
  - Design → define structure first
  - Implementation → write complete, correct code
  - Debugging → find root cause and apply minimal fix

- Do not mix modes unnecessarily.
- Complete the current task before switching context.

----------------------------------------
Scope Control
----------------------------------------
- Focus only on relevant code or logic.
- Avoid scanning or rewriting unrelated parts.
- Do not expand scope unless required.

----------------------------------------
Reasoning
----------------------------------------
- Keep reasoning brief (3–5 bullets max).
- Focus on decisions, not exploration.

----------------------------------------
Anti-Loop / Anti-Drift
----------------------------------------
- Do not repeat the same failed approach.
- If uncertain, make the most likely assumption and proceed.
- Avoid re-analyzing the same information.

----------------------------------------
Code Quality
----------------------------------------
- Do not invent variables or APIs.
- Ensure consistency across the solution.
- Avoid partial or broken implementations.

----------------------------------------
Output
----------------------------------------
- Be concise and direct.
- Show only relevant code or results.
- Do not include unnecessary explanation unless asked."

With these settings, LM Studio's chat shows around 3.50 tok/sec after generation finishes (sometimes 3.48, sometimes 3.70). Very, very slow, I know... It's also very bad at finding and fixing bugs, but better than the models I've tried before.
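
To put those speeds in perspective, the wait for a single answer is simply its length divided by the generation rate. A tiny sketch follows. The 500-token answer length is an arbitrary example, and `gen_seconds` is a made-up helper name.

```shell
# gen_seconds TOKENS TOK_PER_SEC
# Prints the time in seconds needed to generate TOKENS at the given speed.
gen_seconds() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r }'
}

# A hypothetical 500-token answer at the three speeds quoted above.
for rate in 3.48 3.50 3.70; do
  echo "$rate tok/sec: $(gen_seconds 500 "$rate") s"
done
```

At these rates a 500-token reply takes a bit over two minutes, before counting prompt processing.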

Now, I know it's a lot to ask, but I would like to hear some advice from you for my use case (C++ and Python), considering my hardware, about:

  • what model should I use (Q4_K_M, Q5_K_S...i1-Q4_K_S...)?
  • what settings should I use for it?

Thanks!

reddit.com
u/Lower-Ad6101 — 22 days ago