
What I learned running a local coding agent on an RTX 4070 Super
I wanted to try out coding with local models and see if I could get them to produce complete, working projects. My hardware is decent in general, but it's not a dedicated AI setup: an RTX 4070 Super with 12GB VRAM, which limits which models I can run.
For this purpose I built an app that takes an idea, explores different options, breaks it down and then implements it. The idea is that it presents you with solutions and you can pick one or ask for new ones. Once you're happy with an approach, you have a Q&A session with the model to make final adjustments and answer any open questions, and then you let it implement.
While it's working it also collects telemetry so you can keep track of how well it's performing, which model is working, etc.
GitHub: https://github.com/goranstjepanovic/thinktank
Since this is essentially a playground project, I implemented hybrid inference: Ollama + llama.cpp + OpenVINO, controlled through a models.yaml file where you select which model to use for which purpose and with which backend. A project can also be stopped and resumed, so you can swap out the models you're using partway through.
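To make the "which model for which purpose with which backend" idea concrete, here is a minimal sketch. It is not the actual models.yaml schema from the repo: the role names, keys and the `resolve` helper are all made up for illustration, and the dict stands in for whatever the parsed YAML really looks like. The only parts taken from this post are the three backend names and the orchestrator choice; the sub-agent backend is a guess.

```python
from dataclasses import dataclass

SUPPORTED_BACKENDS = {"ollama", "llama.cpp", "openvino"}


@dataclass(frozen=True)
class ModelChoice:
    backend: str
    model: str


# Hypothetical layout, standing in for a parsed models.yaml.
# The sub_agent backend is an assumption, not something the post states.
CONFIG = {
    "orchestrator": {"backend": "openvino", "model": "qwen3-8b"},
    "sub_agent": {"backend": "ollama", "model": "qwen2.5-coder:14b"},
}


def resolve(role: str, config: dict) -> ModelChoice:
    """Look up which backend and model should handle a given role."""
    entry = config.get(role)
    if entry is None:
        raise KeyError(f"no model configured for role {role!r}")
    if entry["backend"] not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown backend {entry['backend']!r} for role {role!r}")
    return ModelChoice(entry["backend"], entry["model"])


if __name__ == "__main__":
    print(resolve("orchestrator", CONFIG))
```

Because the lookup happens per role at call time, editing the config between a stop and a resume is enough to change which model picks up the next task.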
What I learned:
- planning - the orchestrator was getting lost as the project grew and kept getting stuck on the same tasks, so I ended up introducing a dedicated planning stage before it starts and forcing it to use plan management tools to keep it on track. The plan is generated at the start but is dynamic: the orchestrator can add/remove/update tasks as it goes, which is useful because failed tasks get broken down into smaller ones during a run
- task verification - I added a verification stage after each task completes, and any issues it finds automatically trigger forced fix tasks, to make sure the models weren't making things up
- dynamic model selection - I found that not every model is best at everything, so I created a fallback chain with priority based on success rate and speed, and this seems to be working well
- tools matter - I ended up implementing a lot of tools, from web search to memory, to keep the models from constantly trying to read the entire project and getting sidetracked
- never test on well known things - I started testing the app by asking it to implement a memory game, then a snake game, and it did really well, but as soon as I gave it an original idea it fell apart :-)
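The planning and verification points above can be sketched roughly like this. This is not the code from the repo, just a minimal illustration under my own assumptions: `Plan`, `Task`, `complete` and `split` are hypothetical names, and the real plan management tools may look quite different. It shows a plan that stays editable during a run, with fix tasks queued right after a task that failed verification and failed tasks replaced by smaller ones.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    id: int
    title: str
    status: str = "pending"  # pending | done | failed


@dataclass
class Plan:
    tasks: list = field(default_factory=list)
    _next_id: int = 1

    def _index(self, task_id: int) -> int:
        return next(i for i, t in enumerate(self.tasks) if t.id == task_id)

    def add(self, title: str, after: Optional[int] = None) -> Task:
        """Append a task, or insert it right after the task with id `after`."""
        task = Task(self._next_id, title)
        self._next_id += 1
        if after is None:
            self.tasks.append(task)
        else:
            self.tasks.insert(self._index(after) + 1, task)
        return task

    def next_pending(self) -> Optional[Task]:
        return next((t for t in self.tasks if t.status == "pending"), None)

    def complete(self, task_id: int, issues: list) -> None:
        """Mark a task done and queue one forced fix task per verifier issue."""
        self.tasks[self._index(task_id)].status = "done"
        anchor = task_id
        for issue in issues:
            anchor = self.add(f"Fix: {issue}", after=anchor).id

    def split(self, task_id: int, subtasks: list) -> None:
        """Mark a task failed and queue smaller tasks in its place."""
        self.tasks[self._index(task_id)].status = "failed"
        anchor = task_id
        for title in subtasks:
            anchor = self.add(title, after=anchor).id
```

Inserting fix tasks directly after the task they belong to, rather than at the end of the plan, means the next pending task is always the fix, so later work doesn't build on something the verifier already flagged.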
I haven't settled on a list of models that work best yet. My current setup for the sub-agent is:
rnj-1:8b
gpt-oss-coder:20b
qwen2.5-coder:14b
qwen3-coder:30b
with the orchestrator being qwen3-8b running through OpenVINO
(all this depends heavily on available hardware as well so my choices are based on 12GB VRAM)
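For the dynamic model selection mentioned earlier, a fallback chain over a list like this could be ranked roughly as follows. This is a sketch under my own assumptions, not the repo's logic: the post only says priority is based on success rate and speed, so sorting by success rate first with average time as the tie-breaker is one possible reading. `ModelStats`, `fallback_chain` and `run_with_fallback` are hypothetical names, and `run` is a caller-supplied function that returns whether the task succeeded and how many seconds it took.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    attempts: int = 0
    successes: int = 0
    total_seconds: float = 0.0

    def record(self, ok: bool, seconds: float) -> None:
        self.attempts += 1
        self.successes += int(ok)
        self.total_seconds += seconds

    @property
    def success_rate(self) -> float:
        # Assumption: untried models get a neutral 0.5 so they still get a turn.
        return self.successes / self.attempts if self.attempts else 0.5

    @property
    def avg_seconds(self) -> float:
        return self.total_seconds / self.attempts if self.attempts else float("inf")


def fallback_chain(stats: list) -> list:
    """Order model names by success rate (high first), then by average time."""
    ranked = sorted(stats, key=lambda s: (-s.success_rate, s.avg_seconds))
    return [s.name for s in ranked]


def run_with_fallback(task: str, stats: list, run):
    """Try models in chain order, record each outcome, return the first winner."""
    by_name = {s.name: s for s in stats}
    for name in fallback_chain(stats):
        ok, seconds = run(name, task)
        by_name[name].record(ok, seconds)
        if ok:
            return name
    return None  # every model in the chain failed
```

Since every attempt updates the stats, the chain reorders itself over a run: a model that starts failing slides down, and a faster model with the same success rate moves ahead.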
Full transparency: the app was built 100% using Claude Code
In any case, just sharing in case it helps anyone who's currently exploring this like me