u/Glittering_Focus1538

Hey everyone, a lot of people have been interested in SmallCode and how it functions under the hood.

  1. The core problem it's solving

Most AI coding tools are built for models with 128-500k+ context windows and reliable JSON output. SmallCode starts from the opposite assumption: your model has maybe 64-128k context, it sometimes writes tool calls that aren't valid JSON, and it will forget what it was doing by step three of a five-step task. Every architectural decision flows from that constraint. It's not trying to be a smarter Cursor, it's trying to extract useful work from the kind of model that runs on a gaming pc, laptop, phone or tablet.

  1. What happens when you type a message

Before a single token goes to your model, the agent loop does a surprising amount of pre-work. It checks whether your message is too vague to act on, not with an LLM call, but with a regex classifier that costs zero tokens. If you typed "fix it" with no context, it injects a system message asking the model to request clarification rather than guessing. It also scans for dropped image files, expands `@file` references into actual content, and injects a git diff if your message implies you're talking about recent changes.

Then, before building the prompt, it runs a deterministic tool router against your message. This is a weighted regex scoring system, think of it like a confidence vote across eight categories (read, write, search, run, plan, code-intelligence, web, respond). The winning category decides which tool schemas get included in the prompt. A "respond" classification injects zero tools, saving around 800 tokens. A "write" classification gives you only write-relevant tools. This is the core bet: most tasks are obviously one thing, and sending all 20 tool definitions every single time is wasteful for small context windows.

  1. The tool routing system in more detail

The eight categories each have a set of signals, positive-weight patterns that raise the score, negative-weight anti-signals that lower it. "Explain" lowers the write score. "All uses of" raises search. "How does X work" triggers code-intelligence routing, which gives the model `graph_search` and `explain_symbol` instead of write tools.

When there's a near-tie, priority breaks it: write > run > code-intelligence > search > plan > read > web > respond. This means ambiguous action-oriented messages default toward doing something rather than just answering.

On very small context windows (under 16k), the system switches to two-stage routing: the first call just picks a category, the second gets the actual tools. This trades one extra round-trip for dramatically lower token consumption per call.

The interesting edge case is what happens when you say "yes" or "ok" to a model question mid-task. Without a special guard, the router would reclassify "ok" as a `respond` (no tools), stripping the write tools the model needs to continue. There's an explicit affirmation guard that keeps the prior category instead.

  1. The MarrowScript compiled layer — what it actually is

There's a `src/compiled/` directory full of files with headers saying "Generated by MarrowScript compiler. DO NOT EDIT." The honest answer is: some of it is real compiled output and some of it is hand-written JavaScript living in a folder called `compiled/`.

The genuine compiled artifacts are the infrastructure layer: a structured JSON logger, an in-memory metrics system (counter/histogram/gauge), a saga flow runtime that executes steps with backward compensation on failure, and a cognition cache with canonical-JSON key derivation, TTL management, and Postgres support. These have corresponding `.ts` source files and the JavaScript is clearly machine-shaped.

The `features/` subdirectory is different. It's a collection of small async functions that call the model for specific micro-tasks: repair a malformed tool call, summarize a large file, generate a commit message, analyze a bash error, classify whether a task needs clarification. Each one has an in-memory cache keyed by content hash, a timeout, and a fallback. They work as a thin prompt dispatch layer. The "compilation" here is more about the design discipline declaring what a feature does, what it returns, what happens on failure, than about literal code generation.

What matters for usage is that these features are all gracefully degrading. If the compiled module isn't available, everything falls back to regex or just returns null. None of them can break the agent loop.

  1. Planning and why small models need it

Small models drift. By turn four of a six-turn task, they've often forgotten what step three was supposed to accomplish. The plan-tracker is the mitigation: for tasks that look multi-step (long message, refactor/migrate keywords, multiple imperative sentences), the agent injects a one-shot instruction asking the model to emit a numbered plan before any tool calls. Once that plan is captured, either by an LLM-based extractor that handles prose-embedded plans, or a regex fallback, it gets re-injected as a running anchor on every subsequent turn.

The anchor looks like this:
```
ACTIVE PLAN (step 3 of 5):
✓ 1. Read the existing auth module
✓ 2. Identify the JWT validation function
→ 3. Add the refresh token handler
  4. Update the route middleware
  5. Run tests
```

The model always knows where it is. When it says "step 3 done," the tracker advances. This is the single biggest reliability improvement for multi-file tasks.

The recently added dependency graph takes the plan steps and asks a question in pure code (no LLM): do any of these steps touch the same file? If step 2 and step 5 both mention `auth.js`, step 5 depends on step 2. Topological sort produces batches of independent steps that could run concurrently. This is wired up to the parallel executor, which isn't active by default yet but is the foundation for running independent edits simultaneously.

  1. How editing actually works

The primary edit primitive is `patch`, search-and-replace where the `old_str` has to match exactly one location. This is deliberate. Small models are unreliable at reproducing whole files: they truncate, hallucinate imports, drift in indentation. A surgical patch that touches 10 lines is orders of magnitude more reliable than rewriting 300 lines, and it's cheaper on context.

When a patch fails because the model's `old_str` no longer matches the current file content — which happens when previous edits have shifted things — there's a semantic merge fallback that asks the model to merge the intended change into the current file content and return the whole corrected file. It's a last resort, not the first move.

There's also a read-before-write guard: if the model tries to write to a file it hasn't read this session, the first attempt is refused with a hint. The second attempt is allowed, because sometimes you legitimately want to fully replace a file. The guard exists because small models regularly overwrite files with incorrect content when they haven't internalized what's already there.

  1. The session memory and persistence layer

Memory is two-tier. Short-term working memory lives in the conversation history and gets evicted under context pressure. Long-term project memory lives in a SQLite database with full-text search, keyed by content type (decision, workflow, gotcha, convention, context). When you ask the model to remember something, it's written there. When a new task starts, semantically relevant entries are loaded based on keyword overlap with the message.

Each session is persisted to disk with atomic writes (write temp file, then rename). Sessions have time-descending IDs so the most recent one sorts first lexicographically. Path traversal is prevented. File permissions are set to 0600.

Snapshots are a separate mechanism for rollback: before each agent turn, a checkpoint is opened. Every write and patch records the pre-edit file content. If validation hard-fails after all retries, auto-rollback can revert all edits in the turn back to the checkpoint state. The `.smallcode/snapshots/` directory stores this metadata for manual audit.

  1. What escalation is and when it fires

Every local model run has a ceiling, some tasks are genuinely beyond what a 8B or 26B model can do reliably. Escalation is the opt-in escape hatch: if you've configured a cloud API key (Anthropic, OpenAI, or DeepSeek), then when the local model hard-fails after exhausted retries and decomposition strategies, SmallCode can fire one call to a stronger cloud model.

The escalation engine auto-detects available keys in preference order (Anthropic first, then OpenAI, then DeepSeek). It converts the full conversation history into the provider's native format — Anthropic requires alternating user/assistant turns and `tool_use`/`tool_result` blocks instead of OpenAI's `tool_calls`/`tool` format — and sends it with a framing system message: "A smaller local model failed. Fix it in as few tool calls as possible."

There's a session cap (default five escalations) to prevent runaway API costs. Without a configured key, `canEscalate()` returns false immediately and the feature is completely dormant. It's opt-in in the strongest sense.

SmallCode is genuinely purpose-built for the constraint. The router, the plan-tracker, the patch-first editing, the forgiving JSON parser, the thinking budget control. These aren't features bolted on top of a Claude Code clone. They're compensations for a specific class of model limitation, evolved through running the thing on real hardware against real tasks.

reddit.com

Hey everyone, a lot of people have been interested in SmallCode and how it functions under the hood.

  1. The core problem it's solving

Most AI coding tools are built for models with 128-500k+ context windows and reliable JSON output. SmallCode starts from the opposite assumption: your model has maybe 64-128k context, it sometimes writes tool calls that aren't valid JSON, and it will forget what it was doing by step three of a five-step task. Every architectural decision flows from that constraint. It's not trying to be a smarter Cursor, it's trying to extract useful work from the kind of model that runs on a gaming pc, laptop, phone or tablet.

  1. What happens when you type a message

Before a single token goes to your model, the agent loop does a surprising amount of pre-work. It checks whether your message is too vague to act on, not with an LLM call, but with a regex classifier that costs zero tokens. If you typed "fix it" with no context, it injects a system message asking the model to request clarification rather than guessing. It also scans for dropped image files, expands `@file` references into actual content, and injects a git diff if your message implies you're talking about recent changes.

Then, before building the prompt, it runs a deterministic tool router against your message. This is a weighted regex scoring system, think of it like a confidence vote across eight categories (read, write, search, run, plan, code-intelligence, web, respond). The winning category decides which tool schemas get included in the prompt. A "respond" classification injects zero tools, saving around 800 tokens. A "write" classification gives you only write-relevant tools. This is the core bet: most tasks are obviously one thing, and sending all 20 tool definitions every single time is wasteful for small context windows.

  1. The tool routing system in more detail

The eight categories each have a set of signals, positive-weight patterns that raise the score, negative-weight anti-signals that lower it. "Explain" lowers the write score. "All uses of" raises search. "How does X work" triggers code-intelligence routing, which gives the model `graph_search` and `explain_symbol` instead of write tools.

When there's a near-tie, priority breaks it: write > run > code-intelligence > search > plan > read > web > respond. This means ambiguous action-oriented messages default toward doing something rather than just answering.

On very small context windows (under 16k), the system switches to two-stage routing: the first call just picks a category, the second gets the actual tools. This trades one extra round-trip for dramatically lower token consumption per call.

The interesting edge case is what happens when you say "yes" or "ok" to a model question mid-task. Without a special guard, the router would reclassify "ok" as a `respond` (no tools), stripping the write tools the model needs to continue. There's an explicit affirmation guard that keeps the prior category instead.

## 4. The MarrowScript compiled layer — what it actually is

There's a `src/compiled/` directory full of files with headers saying "Generated by MarrowScript compiler. DO NOT EDIT." The honest answer is: some of it is real compiled output and some of it is hand-written JavaScript living in a folder called `compiled/`.

The genuine compiled artifacts are the infrastructure layer: a structured JSON logger, an in-memory metrics system (counter/histogram/gauge), a saga flow runtime that executes steps with backward compensation on failure, and a cognition cache with canonical-JSON key derivation, TTL management, and Postgres support. These have corresponding `.ts` source files and the JavaScript is clearly machine-shaped.

The `features/` subdirectory is different. It's a collection of small async functions that call the model for specific micro-tasks: repair a malformed tool call, summarize a large file, generate a commit message, analyze a bash error, classify whether a task needs clarification. Each one has an in-memory cache keyed by content hash, a timeout, and a fallback. They work as a thin prompt dispatch layer. The "compilation" here is more about the design discipline declaring what a feature does, what it returns, what happens on failure, than about literal code generation.

What matters for usage is that these features are all gracefully degrading. If the compiled module isn't available, everything falls back to regex or just returns null. None of them can break the agent loop.

## 5. Planning and why small models need it

Small models drift. By turn four of a six-turn task, they've often forgotten what step three was supposed to accomplish. The plan-tracker is the mitigation: for tasks that look multi-step (long message, refactor/migrate keywords, multiple imperative sentences), the agent injects a one-shot instruction asking the model to emit a numbered plan before any tool calls. Once that plan is captured, either by an LLM-based extractor that handles prose-embedded plans, or a regex fallback, it gets re-injected as a running anchor on every subsequent turn.

The anchor looks like this:
```
ACTIVE PLAN (step 3 of 5):
✓ 1. Read the existing auth module
✓ 2. Identify the JWT validation function
→ 3. Add the refresh token handler
  4. Update the route middleware
  5. Run tests
```

The model always knows where it is. When it says "step 3 done," the tracker advances. This is the single biggest reliability improvement for multi-file tasks.

The recently added dependency graph takes the plan steps and asks a question in pure code (no LLM): do any of these steps touch the same file? If step 2 and step 5 both mention `auth.js`, step 5 depends on step 2. Topological sort produces batches of independent steps that could run concurrently. This is wired up to the parallel executor, which isn't active by default yet but is the foundation for running independent edits simultaneously.

## 6. How editing actually works

The primary edit primitive is `patch`, search-and-replace where the `old_str` has to match exactly one location. This is deliberate. Small models are unreliable at reproducing whole files: they truncate, hallucinate imports, drift in indentation. A surgical patch that touches 10 lines is orders of magnitude more reliable than rewriting 300 lines, and it's cheaper on context.

When a patch fails because the model's `old_str` no longer matches the current file content — which happens when previous edits have shifted things — there's a semantic merge fallback that asks the model to merge the intended change into the current file content and return the whole corrected file. It's a last resort, not the first move.

There's also a read-before-write guard: if the model tries to write to a file it hasn't read this session, the first attempt is refused with a hint. The second attempt is allowed, because sometimes you legitimately want to fully replace a file. The guard exists because small models regularly overwrite files with incorrect content when they haven't internalized what's already there.

## 7. The session memory and persistence layer

Memory is two-tier. Short-term working memory lives in the conversation history and gets evicted under context pressure. Long-term project memory lives in a SQLite database with full-text search, keyed by content type (decision, workflow, gotcha, convention, context). When you ask the model to remember something, it's written there. When a new task starts, semantically relevant entries are loaded based on keyword overlap with the message.

Each session is persisted to disk with atomic writes (write temp file, then rename). Sessions have time-descending IDs so the most recent one sorts first lexicographically. Path traversal is prevented. File permissions are set to 0600.

Snapshots are a separate mechanism for rollback: before each agent turn, a checkpoint is opened. Every write and patch records the pre-edit file content. If validation hard-fails after all retries, auto-rollback can revert all edits in the turn back to the checkpoint state. The `.smallcode/snapshots/` directory stores this metadata for manual audit.

## 8. What escalation is and when it fires

Every local model run has a ceiling, some tasks are genuinely beyond what a 8B or 26B model can do reliably. Escalation is the opt-in escape hatch: if you've configured a cloud API key (Anthropic, OpenAI, or DeepSeek), then when the local model hard-fails after exhausted retries and decomposition strategies, SmallCode can fire one call to a stronger cloud model.

The escalation engine auto-detects available keys in preference order (Anthropic first, then OpenAI, then DeepSeek). It converts the full conversation history into the provider's native format — Anthropic requires alternating user/assistant turns and `tool_use`/`tool_result` blocks instead of OpenAI's `tool_calls`/`tool` format — and sends it with a framing system message: "A smaller local model failed. Fix it in as few tool calls as possible."

There's a session cap (default five escalations) to prevent runaway API costs. Without a configured key, `canEscalate()` returns false immediately and the feature is completely dormant. It's opt-in in the strongest sense.

SmallCode is genuinely purpose-built for the constraint. The router, the plan-tracker, the patch-first editing, the forgiving JSON parser, the thinking budget control. These aren't features bolted on top of a Claude Code clone. They're compensations for a specific class of model limitation, evolved through running the thing on real hardware against real tasks.

reddit.com
▲ 462 r/opencodeCLI+2 crossposts

Back again, many changes have taken place.

After fixing more than 90 bugs, I can now safely claim that my project when downloaded from npm or built from source is stable. As a newer dev there was a LOT of issues I had to work through, hours of troubleshooting and tui/commandline conflicts. It was a nightmare but it's finally over.
I would really appreciate if new users or those that had a bad experience could give it another shot.
https://github.com/Doorman11991/smallcode
over 50 people have made forks of my project, I hope everyone can take my code and use their own inspiration to make it 100x better.
I appreciate all of your support and kind words over the last few days. Thank you!

u/Glittering_Focus1538 — 2 days ago
▲ 1.0k r/LocalLLM+2 crossposts

I built a coding agent that gets 87% on benchmarks with a 4B parameter model, here's how

I was frustrated that every coding agent (OpenCode, Cursor, Claude Code) assumes you're running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart. I find that often tool calls fail, context overflows, multi-step tasks collapse.

So I built SmallCode. It's designed from the ground up for small local models.

The result: 87/100 benchmark tasks pass with a Gemma 4 model that only activates 4B parameters per token. OpenCode scores ~75% with 14B models. The harness does the heavy lifting, not the model size.

How it works (the tricks that make small models reliable):

  • Compound tools: Instead of making the model chain 4 tool calls (find file → read file → edit file → verify), SmallCode gives it one tool that does all 4. Small models lose coherence after 3+ sequential calls. This cuts failures in half.
  • Improvement loop: Every time the model writes code, SmallCode instantly compiles/lints it. If it fails, it feeds the errors back automatically. The model doesn't need to be smart enough to get it right first try — it just needs to fix errors when shown them.
  • Decompose on failure: If the model fails the same thing twice, SmallCode stops retrying and instead breaks the problem into smaller pieces. "Fix this 200-line file" becomes "fix line 45 only."
  • Escalation: If even decompose fails and you have a Claude/OpenAI key configured, it auto-escalates to the bigger model for just that one task. You stay local 95% of the time, cloud 5%.
  • Token budgeting: Small models have 32k-256k context. SmallCode never dumps a whole file in. It summarizes, truncates, and manages every token so the model never sees "..." truncation in the middle of important code.
  • Code graph: Instead of grep-searching your codebase, SmallCode indexes your code into a symbol graph (functions, classes, who-calls-what). When you ask "how does auth work," it walks the graph and returns just the relevant connected code — not 15 random file snippets.

What it looks like:

Full-screen terminal UI (like OpenCode/vim), scrollable chat, command palette with /, plugin system, persistent memory across sessions.

What it doesn't do:

  • No LSP integration (yet)
  • No multi-session (yet)
  • No desktop app
  • Doesn't compete with Claude Code for frontier model users

Install:

npm install -g smallcode
cd your-project
smallcode

Point it at LM Studio, Ollama, or any OpenAI-compatible endpoint.

MIT licensed, everything's on GitHub: https://github.com/Doorman11991/smallcode

Happy to answer questions about the architecture or benchmark methodology.

u/Glittering_Focus1538 — 5 days ago
▲ 8 r/opencode+3 crossposts

First Token aware MCP server.

I present budget-aware-mcp

Built on CodeGraphContext for indexing (tree-sitter, 155 languages).
Replaces their retrieval layer with hop-based graph walks.

  • Sub-millisecond queries (0.07-0.15ms in-process)
  • Token budget enforcement (agent says "max 8000 tokens" — retrieval stops there)
  • Scope check (prevents hallucinated code generation)
  • Deterministic results (same query = same output, always)
  • Session-level token accounting

If you're looking for almost perfect longterm codebase memory this is the project for you.

u/Glittering_Focus1538 — 6 days ago

Do you want quality AI images using local models?

With my TS adapter you can take any OpenAPI model and use it to finetune your AI image generation for you!

What it does, concretely:

  1. You call adapter.txt2img({ prompt: "a cat" })
  2. The adapter sends "a cat" to LM Studio with a system prompt that says "rewrite this as a detailed Stable Diffusion prompt"
  3. LM Studio (using whatever model you have loaded — Qwen, Llama, Mistral, anything) responds with something like "A regal tabby cat sitting on a windowsill at golden hour, soft natural lighting, shallow depth of field, photorealistic, 4k"
  4. The adapter takes that enhanced text and POSTs it to A1111
  5. A1111 generates the image and sends it back

What it adds over calling A1111 directly:

Without the adapter With the adapter
You type a prompt → A1111 sees that exact prompt → image quality reflects how good your prompting skills are You type a prompt → an LLM rewrites it into a detailed Stable-Diffusion-friendly prompt → A1111 sees the enhanced version → much better images
Hand-craft prompts every time Type whatever, let the LLM do the prompt-engineering work
Need to know which keywords matter ("masterpiece", "best quality", "8k", etc.) LLM has read enough SD documentation to know
github.com
u/Glittering_Focus1538 — 6 days ago

I created a compiller&DSL for Backend scaffolding used it to make an plugin for opencode

https://github.com/Doorman11991/opencode-bonescript-backend
and
https://github.com/Doorman11991/BoneScript

The main project goal of BoneScript was to ease the boredom of endless boilerplates and wasted time that backend development can be.
BoneScript is a declarative language that compiles system descriptions into complete, runnable Node.js backends.

u/Glittering_Focus1538 — 8 days ago
▲ 6 r/opencode+2 crossposts

Hey Everyone! I’ve been experimenting with OpenCode + BoneScript for structured backend generation.

I’ve been experimenting with making coding agents generate complete backends using BoneScript, and it’s working surprisingly well.

BoneScript’s structure ends up being extremely LLM-friendly:

  • declarative system layout
  • predictable architecture
  • explicit entities/capabilities/routes
  • less ambiguity than raw backend frameworks

So I built an OpenCode plugin/backend integration that pushes agents toward generating BoneScript instead of ad-hoc backend code.

The result is that the model tends to:

  • stay architecturally consistent longer
  • make fewer structural mistakes
  • generate cleaner backend flows
  • reason about systems at a higher level instead of individual files

Project:
opencode-bonescript-backend | npm package

I’d genuinely love feedback from people building agentic coding tools or experimenting with LLM-native development workflows.

u/Glittering_Focus1538 — 8 days ago
▲ 325 r/AIDeveloperNews+11 crossposts

BoneScript, a new opensource Compiler for complete backend development

I developed an LSP, VS-Code extension and NPM package, please try it out and give me your thoughts!

github.com
u/Glittering_Focus1538 — 2 days ago
▲ 2 r/coolgithubprojects+2 crossposts

I built an opensource DSL compiler for backend management.

Declare your backend. Ship production code. BoneScript is a DSL and compiler that turns a single declarative file into a full-featured TypeScript backend, complete with APIs, auth, database, realtime, state machines, and deployment. Also ships with a build script for a custom VS-Code extension. Please try it out and give me your thoughts.

github.com
u/Glittering_Focus1538 — 8 days ago

DMCA Sentinel a site that protects small business partners.

DMCA Sentinel is built as a lightweight compliance-first DMCA workflow tool for small teams, not a full-service legal firm. It combines notice parsing, manual/email intake, Cloudflare-aware automation, vault export, and secure encrypted storage in one modern app. DMCA-Sentinel

u/Glittering_Focus1538 — 10 days ago