u/LocalAI_Amateur

I've been writing a no-compile, no-dependencies, node.js-based AI harness for llama.cpp as a learning exercise and could really use some help. I'm basing my code off https://github.com/av/mi and https://pi.dev/ with really basic agentic loops. It basically loops until no more tool calls are being made, then returns control to the user prompt.
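For anyone who wants the shape of that loop without opening the source, here is a minimal sketch. It is not the actual coffee.mjs code. It assumes OpenAI-style chat messages (an assistant message with `tool_calls`, answered by `tool` messages carrying a matching `tool_call_id`), and `agentLoop`, `chat`, `runTool`, and `maxTurns` are hypothetical names: `chat(messages)` is assumed to wrap the request to the model server and resolve to the assistant message, and `runTool(name, args)` is assumed to resolve to the tool's output.

```javascript
// Minimal agentic loop sketch. `chat` and `runTool` are hypothetical injected
// helpers: chat(messages) resolves to an OpenAI-style assistant message,
// runTool(name, args) resolves to the tool's output.
async function agentLoop(messages, chat, runTool, maxTurns = 20) {
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await chat(messages);
    // Keep the assistant message (including its tool_calls) in the history,
    // so the next request shows the model which calls it already made.
    messages.push(reply);

    const calls = reply.tool_calls ?? [];
    // No tool calls: the turn is over, hand control back to the user prompt.
    if (calls.length === 0) return reply.content;

    for (const call of calls) {
      let result;
      try {
        const args = JSON.parse(call.function.arguments || "{}");
        result = await runTool(call.function.name, args);
      } catch (err) {
        // Report failures to the model instead of crashing the loop.
        result = `Error: ${err.message}`;
      }
      // Each result is tied to its call through tool_call_id.
      messages.push({ role: "tool", tool_call_id: call.id, content: String(result) });
    }
  }
  throw new Error("agentLoop: maxTurns reached without a final answer");
}
```

The `maxTurns` cap is an addition for safety in the sketch, so a model that keeps repeating the same call cannot loop forever.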

My biggest problems are:

Oftentimes the LLM will ignore the tool call and its results and call the same tools again.
Or worse, sometimes its attention drifts to a previously answered question and it tries to work from there instead of working from the latest tool call or continuing its plan.

I'm using a q4 quant of qwen3.6 27b. I don't experience this problem when I run the same model under pi. I've looked at pi's agentic loop implementation and there doesn't seem to be any special sauce.

I added reminder messages after tool calls telling it to review the results before moving on, and that helps a bit, but I would like to know whether anyone has experienced the same problem in their own AI harness development and how you addressed it.

So far the reminder messages I've implemented kinda work, but they feel more like band-aids than real cures.
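To make the reminder idea concrete, here is a minimal sketch, assuming OpenAI-style `tool` messages. The function name `toolResultWithReminder`, the `REMINDER` constant, and the reminder wording are all illustrative and not taken from coffee.mjs.

```javascript
// Hypothetical reminder wording, for illustration only.
const REMINDER =
  "Review the tool result above before deciding on the next step. " +
  "Do not repeat a call whose result you already have.";

// Build the tool-result message with the reminder appended to its content.
function toolResultWithReminder(toolCallId, result, reminder = REMINDER) {
  return {
    role: "tool",
    tool_call_id: toolCallId,
    content: `${String(result)}\n\n[reminder] ${reminder}`,
  };
}
```

Appending the reminder to the tool result, rather than pushing a separate message after it, leaves the role sequence of the conversation unchanged. Whether that works better than a separate message is something to test per model.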

Edit: added bare-minimum source code.

coffee.mjs

tools/bash.mjs

If you have node.js installed, 'node coffee.mjs' will run it. No dependencies, just make sure llama-server is running. All config information is stored as variables at the top of coffee.mjs. Very basic stuff, but it should be very human-readable code.

I have more tools and skills implemented, but this is the bare minimum that forms a basic AI coding agent/harness. Like I said, it's a learning project, not competing with anything. I've been using it as my daily driver tho.

Oh, and if you have free AI resources, feel free to have them scan the code to see if they can help answer the question. Thank you!

A bit of context. I was coding up a little html tower defense game where you can alter the path by placing additional waypoints.

My setup: 32gb ram with 16gb vram 5070 ti. Using AesSedai/Qwen3.6-35B-A3B-GGUF IQ4_XS on LM Studio with OpenCode. I've graduated from one-shot vibe-coding prompts.

The spec for this game was complicated enough that it couldn't have been done in LM Studio so I tried OpenCode. The project was chugging along, Qwen3.6 35b-a3b was getting things done when 27b dropped. Naturally I had to try it. Only problem is that I couldn't use any of the Q4 models due to vram issues, so I dropped to an IQ3_M model from mradermacher/Qwen3.6-27B-i1-GGUF.

I was worried that IQ3_M would be too much compression, but it did fine and was even able to find a difficult bug that the IQ4_XS version of Qwen3.6 35b-a3b couldn't. They say dense models handle compression better than MoE models. Is that the reason for this? What are other people's experiences with the 35b-a3b vs 27b versions of Qwen3.6?

Using LM Studio:

I got 50-60 tokens per second with Qwen3.6 35b-a3b (AesSedai/Qwen3.6-35B-A3B-GGUF IQ4_XS) but the prompt processing gets real slow sometimes.

I got 40ish tokens per second with mradermacher/Qwen3.6-27B-i1-GGUF IQ3_M but it was decent speed throughout.

How are people's experiences with these two models at 16gb vram? Anyone doing actual work with IQ3 models of 27b?

Oh, the Waypoint Tower Defense game is done and can be played on htmlbin. The save/load doesn't seem to work on their site, but if you download the file and open it in a browser, it'll work fine. It's a self-contained single-file HTML game, meant to be like minesweeper but for tower defense. The path logic is simple: connect to the nearest unvisited waypoint from the starting point, and repeat until all waypoints are visited.
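The path rule can be sketched like this. It is a minimal illustration, not the game's actual code, and it makes a few assumptions: waypoints are `{x, y}` objects, "nearest" means straight-line distance measured from the current end of the path, and ties go to whichever waypoint comes first in the list. The name `buildPath` is made up for the example.

```javascript
// Build the path by repeatedly hopping to the nearest unvisited waypoint.
// Assumes points are {x, y} objects and uses straight-line distance.
function buildPath(start, waypoints) {
  const remaining = waypoints.slice(); // copy so the caller's list is untouched
  const path = [start];
  let current = start;
  while (remaining.length > 0) {
    let best = 0;
    let bestDist = Infinity;
    for (let i = 0; i < remaining.length; i++) {
      const d = Math.hypot(remaining[i].x - current.x, remaining[i].y - current.y);
      if (d < bestDist) { // strict "<" keeps the earlier waypoint on ties
        bestDist = d;
        best = i;
      }
    }
    current = remaining.splice(best, 1)[0]; // mark as visited
    path.push(current);
  }
  return path;
}

// Example: start at (0,0) with waypoints (5,0), (1,0), (2,3).
// The path visits (1,0), then (2,3), then (5,0).
const examplePath = buildPath({ x: 0, y: 0 }, [
  { x: 5, y: 0 },
  { x: 1, y: 0 },
  { x: 2, y: 3 },
]);
```

Because each hop depends on where the path currently ends, dropping one extra waypoint can reroute everything after it, which is what makes placing waypoints a way to alter the path.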

Learning to write an AI harness the old-fashioned way. Need help with attention drift and ignored tool call results!