u/GrungeWerX

WARNING: I'm speed typing this, no time to organize/format, so if short paragraph chunks bother you, just keep it moving.

CONTEXT UPDATE: (for those interested, otherwise skip)

>For those interested in the data points, the task was building an agentic workflow inside of Rivet that included an MCP subgraph (with a list of 11 tools) that received JSON instructions from the main subgraph, so that I could shave 30K tokens off the main agent's memory. The main subgraph included context trimming and pre-injection of memory, soul, and agent .md files. The task also included testing, rigging it up with Open WebUI and llama.cpp, and creating an adapter bridge between the server and OWUI. The agent tested it using a smaller Qwen 2B model running in parallel on CPU. All of this was 100% handed off to my agent.

When Qwen 3.6 35B dropped, a lot of people were heaping praise on it and I thought they were just glazing it because of the speed. On 3.5, the 27B was objectively smarter than the 35B.

So when I got around to using the 27B version (unsloth's Q5KXL UD @ KV Q8/8), it became my daily driver without a second thought. No loops, solid speeds. And I've been mostly fine. Until the past two days.

I never gave the 35B a chance because speed (at the time) wasn't that important to me and again, the 27B is known to be smarter. But after wasting 2 days trying to debug subgraphs in Rivet and blowing HOURS of time constantly dropping quants due to context overflow and watching the model's intelligence get lobotomized, I remembered reading a post recently where someone did a test comparing the IQ4NXLs (MTP + standard) against the Q4KXL, Q5 and others.

So, I gave Qwen 3.6 35B IQ4NXL a shot, no KV cache compression since VRAM wasn't as much of an issue, and it nearly one-shotted the solution. I've since run a few more tests with it and for a minute I've just been confused - like why is the 35B better? So, I figured it must be a) Qwens are still really good at lower quants, and more importantly b) KV cache REALLY MATTERS.

The 35B still creeps when it hits high context, even worse than the 27B it seems, and the only way I can do my end-of-session routines is to switch to the Q4KXL at KV Q4/4, but then there's a risk that it'll forget a routine or miss details in the session summary. Also, I haven't spent a lot of time learning the 35Bs, so I need some time to feel them out and figure out what works best.

Anyway, the point is - the IQ4NXL w/unquanted KV cache outperformed the 27B Q5 K XL at KV Q8/8, to say nothing of the 27B Q4 at KV Q4/4. I always thought it didn't matter much because of different comments and AI saying it's only a slight decrease in intelligence. But when it comes to agentic work, it clearly makes a difference and can save you HOURS of time.
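For a rough sense of why people quantize the KV cache in the first place, here's a back-of-the-envelope sketch of cache size at different cache types. The layer count, KV head count, and head dimension are made-up placeholders, NOT the real Qwen numbers, and models with hybrid or sliding-window attention won't follow this simple formula. The bits-per-element values correspond to the f16, q8_0, and q4_0 block formats used by llama.cpp.

```python
# Rough KV cache size estimate for a plain transformer with grouped-query attention.
# ASSUMPTION: n_layers, n_kv_heads and head_dim are hypothetical placeholders,
# not the actual Qwen 3.6 configuration.

# Effective bits per element, including the per-block fp16 scale for the
# quantized formats (blocks of 32 values).
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 8.5,  # 32 int8 values + one fp16 scale = 34 bytes per block
    "q4_0": 4.5,  # 32 4-bit values + one fp16 scale = 18 bytes per block
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Bytes needed to hold keys and values for n_ctx tokens."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elements * BITS_PER_ELEMENT[cache_type] / 8


if __name__ == "__main__":
    for cache_type in BITS_PER_ELEMENT:
        size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                              n_ctx=131072, cache_type=cache_type)
        print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

This only shows the memory side of the trade. It says nothing about how much quality each cache type costs, which is the part that bit me in agentic work.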

And...it's fast. So yeah, I'm using 35B a lot more now - at least for this particular project. I still love the 27B and there's other stuff that I'd prefer even the quanted 27B to do over the 35B. And to be fair to the 27B, I haven't tried it w/no KV cache compression because I need speed, but I'm going to assume it'll probably have a leap in intelligence unquanted as well. But for now, I've got a lot of work to do, time is of the essence, and I've only got an RTX 3090 TI.

Side note: I've been using LM Studio since I started using LLMs a couple of years ago, but with this current bug where it won't overflow or compact context, it's slowing everything down: I have to start new sessions, have my agent re-read all the notes, eat all that context, summarize at the end when context is full again, rinse, repeat. So I've moved over to llama.cpp.

I hesitated on llama.cpp because I didn't feel like learning a new tool (adding to my ever-growing-and-already-too-large list of apps), but since I've gone agentic, I just had my agent compile it and it works fine, so yeah. Just let the agent do it. 😄

My setup: Windows 10/11 i7 12700K | RTX 3090 TI | 96GB RAM

Local server: LM Studio

Models: Qwen 3.5/3.6 27B|35B Q5 UD K XL + Gemma 4 31B| 26B Q4 UD K XL

Up until this point, I've only used sota models for coding. When Qwen 3.5 dropped, it was the first local model that felt sota, and I've been using it ever since, primarily as a lore master for my IP's story bible, but nothing agentic.

Last week, I "built" my first agent, giving her a custom system prompt with instructions for daily startup and end of session summaries, personality template, user preferences file, memory using redis and postgres that tracks tasks and updates any skills she learns, several mcp tools for filesystem access, her own folder in documents, and cli (stripped of the http capabilities).

Every morning, she does her startup routine, checking her notes, outstanding tasks that need to be accomplished, and updates me on where we are with projects. She handles redis/postgres memory for me, and she's helping me build a personal assistant inside of n8n - she's able to build workflows herself via mcp tool.

This whole experience has blown me away. I've heard people talking about agents, known what they can do, heard about OpenClaw, Hermes, etc. But there's a big difference between hearing other people talk about it and experiencing it yourself.

I spent a lot of time setting her up exactly how I wanted. No guides, just my own ideas. But all these posts about Pi, Hermes, etc. had me wondering if I'm missing out on something special. When I asked Claude what benefits I'd get from those harnesses, it and Gemini both told me I've already built out like 90% of what they offer and just need to give my agent the power to spawn her own agents and add dynamic tool calling for the sub-agents. I don't need context compaction because she writes summaries at the end of each session.

Is this all? I don't assume everything AI says is right, so I want to ask the enthusiasts - what do these harnesses offer that I'm overlooking?

My plan is to have my agent spawn sub-agents - the code looks pretty simple to do - and then I want to vibecode a GUI that lets me view their outputs along with the main agent's in a custom chat window or something. I'm asking Qwen now about building the dynamic tool calls, but I also know that I can just give each sub-agent designated MCP tools.
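A minimal sketch of the "designated tools per sub-agent" idea, assuming each sub-agent just gets an allowlist over one shared tool registry. Everything here is hypothetical (the `SubAgent` class, `spawn`, the tool names, the `call_tool` method). It's not how OpenClaw, Hermes, or any MCP client actually does it, and the model call itself is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List

# Hypothetical shared registry: tool name -> plain Python callable.
# In a real setup these would be wrappers around MCP tool calls.
TOOL_REGISTRY: Dict[str, Callable[..., str]] = {
    "read_file": lambda path: f"(contents of {path})",
    "write_file": lambda path, text: f"wrote {len(text)} chars to {path}",
    "run_query": lambda sql: f"(rows for: {sql})",
}


@dataclass
class SubAgent:
    name: str
    allowed_tools: FrozenSet[str]
    transcript: List[str] = field(default_factory=list)  # what a GUI could display

    def call_tool(self, tool: str, *args: str) -> str:
        # Refuse anything outside this sub-agent's designated tool set.
        if tool not in self.allowed_tools:
            raise PermissionError(f"{self.name} may not use {tool}")
        result = TOOL_REGISTRY[tool](*args)
        self.transcript.append(f"{tool}{args} -> {result}")
        return result


def spawn(name: str, tools: List[str]) -> SubAgent:
    """Create a sub-agent limited to the named tools (all must exist)."""
    unknown = set(tools) - set(TOOL_REGISTRY)
    if unknown:
        raise KeyError(f"unknown tools: {sorted(unknown)}")
    return SubAgent(name=name, allowed_tools=frozenset(tools))


if __name__ == "__main__":
    researcher = spawn("researcher", ["read_file"])
    print(researcher.call_tool("read_file", "notes.md"))
    print(researcher.transcript)
```

Keeping a per-agent `transcript` is one way a chat-window GUI could show each sub-agent's output next to the main agent's, though that part is just a guess at the design.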

What else should I be thinking about?

You guys were right - Qwen 3.6 35B IS good...and KV Cache DOES matter.

Are harnesses like OpenClaw and Hermes really necessary?