Hey everyone, a lot of people have been interested in SmallCode and how it functions under the hood.
- The core problem it's solving
Most AI coding tools are built for models with 128k-500k+ context windows and reliable JSON output. SmallCode starts from the opposite assumption: your model has maybe 64-128k context, it sometimes writes tool calls that aren't valid JSON, and it will forget what it was doing by step three of a five-step task. Every architectural decision flows from that constraint. It's not trying to be a smarter Cursor; it's trying to extract useful work from the kind of model that runs on a gaming PC, laptop, phone, or tablet.
- What happens when you type a message
Before a single token goes to your model, the agent loop does a surprising amount of pre-work. It checks whether your message is too vague to act on, not with an LLM call, but with a regex classifier that costs zero tokens. If you typed "fix it" with no context, it injects a system message asking the model to request clarification rather than guessing. It also scans for dropped image files, expands `@file` references into actual content, and injects a git diff if your message implies you're talking about recent changes.
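To make the zero-token vagueness check concrete, here is a rough sketch of what a regex-only classifier could look like. The patterns, function names, and hint wording are made up for illustration; SmallCode's actual rules aren't shown in this post.

```javascript
// Hypothetical patterns for messages that are too generic to act on.
const VAGUE_PATTERNS = [
  /^(please\s+)?(fix|do|change|update|improve)\s+(it|this|that)[.!]?$/i,
  /^(help|continue|again|same)[.!?]?$/i,
];

// True when the message is generic AND there is no earlier conversation
// that "it" or "this" could refer to.
function isTooVague(message, hasPriorContext) {
  if (hasPriorContext) return false;
  const text = message.trim();
  return VAGUE_PATTERNS.some((re) => re.test(text));
}

// Builds the extra system message, or returns null when none is needed.
function clarificationHint(message, hasPriorContext) {
  if (!isTooVague(message, hasPriorContext)) return null;
  return {
    role: "system",
    content: "The request is ambiguous. Ask one clarifying question instead of guessing.",
  };
}
```

With these patterns, "fix it" in an empty conversation produces a hint, while "fix the null check in src/auth.js" passes straight through.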
Then, before building the prompt, it runs a deterministic tool router against your message. This is a weighted regex scoring system: think of it as a confidence vote across eight categories (read, write, search, run, plan, code-intelligence, web, respond). The winning category decides which tool schemas get included in the prompt. A "respond" classification injects zero tools, saving around 800 tokens. A "write" classification gives you only write-relevant tools. This is the core bet: most tasks are obviously one thing, and sending all 20 tool definitions every single time is wasteful for small context windows.
- The tool routing system in more detail
Each of the eight categories has a set of signals: positive-weight patterns that raise its score and negative-weight anti-signals that lower it. "Explain" lowers the write score. "All uses of" raises search. "How does X work" triggers code-intelligence routing, which gives the model `graph_search` and `explain_symbol` instead of write tools.
When there's a near-tie, priority breaks it: write > run > code-intelligence > search > plan > read > web > respond. This means ambiguous action-oriented messages default toward doing something rather than just answering.
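The scoring and tie-break described above can be sketched in a few lines. Only the priority order comes from this post; the signal table, the weights, and the near-tie margin are assumptions picked to keep the example small.

```javascript
// Hypothetical signal table: [pattern, weight]. Only four of the eight
// categories are filled in to keep the sketch short.
const SIGNALS = {
  write: [[/\b(add|fix|implement|refactor|rename)\b/i, 2], [/\bexplain\b/i, -3]],
  search: [[/\ball uses of\b/i, 3], [/\b(find|where)\b/i, 2]],
  "code-intelligence": [[/\bhow does .+ work\b/i, 3]],
  respond: [[/\b(explain|what is|why)\b/i, 1]],
};

// Tie-break order from the post: earlier entries win a near-tie.
const PRIORITY = ["write", "run", "code-intelligence", "search", "plan", "read", "web", "respond"];
const NEAR_TIE = 0.5; // assumed margin

function route(message) {
  const scores = PRIORITY.map((category) => {
    const signals = SIGNALS[category] ?? [];
    const score = signals.reduce((sum, [re, w]) => sum + (re.test(message) ? w : 0), 0);
    return { category, score };
  });
  const best = Math.max(...scores.map((s) => s.score));
  if (best <= 0) return "respond"; // nothing matched: answer without tools
  // `scores` is already in priority order, so the first near-top entry wins.
  return scores.find((s) => best - s.score <= NEAR_TIE).category;
}
```

In this sketch, "fix where the token is parsed" scores 2 for both write and search, and priority resolves it to write. "Explain how the cache works" pushes write negative and lands on respond.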
On very small context windows (under 16k), the system switches to two-stage routing: the first call just picks a category, the second gets the actual tools. This trades one extra round-trip for dramatically lower token consumption per call.
The interesting edge case is what happens when you say "yes" or "ok" to a model question mid-task. Without a special guard, the router would reclassify "ok" as a `respond` (no tools), stripping the write tools the model needs to continue. There's an explicit affirmation guard that keeps the prior category instead.
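A guard like this can be tiny. The sketch below assumes a hypothetical `classify` function (any router that maps a message to a category) and an illustrative list of affirmation phrases.

```javascript
// Illustrative list of short confirmations; the real list is not shown in the post.
const AFFIRMATION = /^(y|yes|yep|yeah|ok|okay|sure|go ahead|do it|sounds good)[.!]?$/i;

// Keeps the previous category when the user merely confirms,
// otherwise falls through to normal classification.
function routeWithGuard(message, priorCategory, classify) {
  if (priorCategory && AFFIRMATION.test(message.trim())) return priorCategory;
  return classify(message);
}
```

Anchoring the pattern to the whole message matters: "ok, but rename it first" carries new instructions, so it should go through the router again.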
- The MarrowScript compiled layer — what it actually is
There's a `src/compiled/` directory full of files with headers saying "Generated by MarrowScript compiler. DO NOT EDIT." The honest answer is: some of it is real compiled output and some of it is hand-written JavaScript living in a folder called `compiled/`.
The genuine compiled artifacts are the infrastructure layer: a structured JSON logger, an in-memory metrics system (counter/histogram/gauge), a saga flow runtime that executes steps with backward compensation on failure, and a cognition cache with canonical-JSON key derivation, TTL management, and Postgres support. These have corresponding `.ts` source files and the JavaScript is clearly machine-shaped.
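For a feel of what "canonical-JSON key derivation" and TTL handling mean in practice, here is a minimal in-memory sketch. The class and function names are hypothetical, it skips the Postgres side entirely, and it is not the generated code.

```javascript
// Serialises a value with object keys sorted, so logically equal inputs
// produce the same string regardless of property order.
// Assumes JSON-safe input (no undefined, functions, or cycles).
function canonicalJson(value) {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const body = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson(value[k])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value);
}

class TtlCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, handy for tests
    this.entries = new Map();
  }
  set(input, value) {
    this.entries.set(canonicalJson(input), { value, expiresAt: this.now() + this.ttlMs });
  }
  get(input) {
    const key = canonicalJson(input);
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (this.now() >= hit.expiresAt) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return hit.value;
  }
}
```

The point of the canonical form is that `{ x: 1, y: 2 }` and `{ y: 2, x: 1 }` hit the same cache entry.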
The `features/` subdirectory is different. It's a collection of small async functions that call the model for specific micro-tasks: repair a malformed tool call, summarize a large file, generate a commit message, analyze a bash error, classify whether a task needs clarification. Each one has an in-memory cache keyed by content hash, a timeout, and a fallback. They work as a thin prompt dispatch layer. The "compilation" here is less about literal code generation and more about design discipline: declaring what a feature does, what it returns, and what happens on failure.
What matters for usage is that these features are all gracefully degrading. If the compiled module isn't available, everything falls back to regex or just returns null. None of them can break the agent loop.
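The cache, timeout, and fallback pattern can be sketched as one wrapper. Everything here is illustrative: `makeFeature`, `callModel`, and the 2-second default are assumptions, and the cache is keyed by the raw input string rather than a content hash to keep the example dependency-free.

```javascript
// Wraps a model-backed micro-task so it can never throw into the agent loop.
// `callModel` is a hypothetical async function; `fallback` is a cheap
// alternative (for example a regex) or a function returning null.
function makeFeature(callModel, fallback, timeoutMs = 2000) {
  const cache = new Map(); // keyed by raw input here; the post describes a content hash
  return async function feature(input) {
    if (cache.has(input)) return cache.get(input);
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
    });
    try {
      const result = await Promise.race([callModel(input), timeout]);
      cache.set(input, result);
      return result;
    } catch {
      return fallback(input); // degrade instead of failing
    } finally {
      clearTimeout(timer); // don't keep the process alive after a fast answer
    }
  };
}
```

Only successful results are cached, so a transient failure doesn't pin the fallback value for the rest of the session.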
- Planning and why small models need it
Small models drift. By turn four of a six-turn task, they've often forgotten what step three was supposed to accomplish. The plan-tracker is the mitigation: for tasks that look multi-step (long message, refactor/migrate keywords, multiple imperative sentences), the agent injects a one-shot instruction asking the model to emit a numbered plan before any tool calls. Once that plan is captured, either by an LLM-based extractor that handles prose-embedded plans, or a regex fallback, it gets re-injected as a running anchor on every subsequent turn.
The anchor looks like this:
```
ACTIVE PLAN (step 3 of 5):
✓ 1. Read the existing auth module
✓ 2. Identify the JWT validation function
→ 3. Add the refresh token handler
4. Update the route middleware
5. Run tests
```
The model always knows where it is. When it says "step 3 done," the tracker advances. This is the single biggest reliability improvement for multi-file tasks.
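Rendering that anchor and advancing it is simple enough to sketch. The function names and the completion phrasing matched below are assumptions, not the tracker's real code.

```javascript
// Renders the anchor block for a plan; `current` is the 1-based active step.
function renderAnchor(steps, current) {
  const lines = steps.map((text, i) => {
    const n = i + 1;
    const mark = n < current ? "✓ " : n === current ? "→ " : "";
    return `${mark}${n}. ${text}`;
  });
  return [`ACTIVE PLAN (step ${current} of ${steps.length}):`, ...lines].join("\n");
}

// Advances when the model reports the active step as finished.
// Stays on the last step once the plan is exhausted.
function advance(current, total, modelOutput) {
  const done = new RegExp(
    `\\bstep\\s+${current}\\s+(is\\s+)?(done|complete|completed|finished)\\b`,
    "i"
  );
  return done.test(modelOutput) && current < total ? current + 1 : current;
}
```

The rendered string is what gets re-injected each turn, so it costs a handful of tokens per step rather than a replay of the original instructions.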
The recently added dependency graph takes the plan steps and asks a question in pure code (no LLM): do any of these steps touch the same file? If step 2 and step 5 both mention `auth.js`, step 5 depends on step 2. Topological sort produces batches of independent steps that could run concurrently. This is wired up to the parallel executor, which isn't active by default yet but is the foundation for running independent edits simultaneously.
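Here is one way the file-overlap check and the batching could be written. The file-detection regex and function names are assumptions; the real graph may detect files differently.

```javascript
// Extracts file-like tokens (anything ending in a short extension) from a step.
const FILE_RE = /[\w./-]+\.[a-z]{1,4}\b/gi;

// A later step depends on every earlier step that mentions one of its files.
function buildDeps(steps) {
  const files = steps.map((s) => new Set(s.match(FILE_RE) ?? []));
  return steps.map((_, i) =>
    files
      .slice(0, i)
      .flatMap((earlier, j) => ([...files[i]].some((f) => earlier.has(f)) ? [j] : []))
  );
}

// Kahn-style layering: each batch holds steps whose dependencies are all done.
// Dependencies only point at earlier steps, so there are no cycles.
function batches(deps) {
  const done = new Set();
  const out = [];
  while (done.size < deps.length) {
    const ready = deps
      .map((_, i) => i)
      .filter((i) => !done.has(i) && deps[i].every((j) => done.has(j)));
    ready.forEach((i) => done.add(i));
    out.push(ready);
  }
  return out;
}
```

For the steps "Read auth.js", "Add logger to utils.js", "Patch validateToken in auth.js", "Update README.md", this yields two batches: steps 0, 1, and 3 together, then step 2 on its own.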
- How editing actually works
The primary edit primitive is `patch`: a search-and-replace where the `old_str` has to match exactly one location. This is deliberate. Small models are unreliable at reproducing whole files: they truncate, hallucinate imports, and drift in indentation. A surgical patch that touches 10 lines is far more reliable than rewriting 300 lines, and it's cheaper on context.
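The exactly-one-match rule is easy to express in code. This is a minimal sketch with a hypothetical `applyPatch` signature and error strings, not the tool's actual handler.

```javascript
// Applies a search-and-replace edit only when `oldStr` occurs exactly once.
function applyPatch(content, oldStr, newStr) {
  if (oldStr === "") return { ok: false, error: "old_str is empty" };
  const first = content.indexOf(oldStr);
  if (first === -1) return { ok: false, error: "old_str not found" };
  if (content.indexOf(oldStr, first + 1) !== -1) {
    return { ok: false, error: "old_str matches more than one location" };
  }
  // Slice instead of String.replace so "$&"-style sequences in newStr stay literal.
  return {
    ok: true,
    content: content.slice(0, first) + newStr + content.slice(first + oldStr.length),
  };
}
```

Returning a specific error for the "more than one location" case gives the model something actionable: include more surrounding lines in `old_str` and try again.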
When a patch fails because the model's `old_str` no longer matches the current file content — which happens when previous edits have shifted things — there's a semantic merge fallback that asks the model to merge the intended change into the current file content and return the whole corrected file. It's a last resort, not the first move.
There's also a read-before-write guard: if the model tries to write to a file it hasn't read this session, the first attempt is refused with a hint. The second attempt is allowed, because sometimes you legitimately want to fully replace a file. The guard exists because small models regularly overwrite files with incorrect content when they haven't internalized what's already there.
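The refuse-once behavior amounts to two sets. The class below is a hypothetical sketch of that logic; how SmallCode treats brand-new files isn't covered in this post, so the sketch ignores that case.

```javascript
class ReadBeforeWriteGuard {
  constructor() {
    this.read = new Set();    // paths read this session
    this.refused = new Set(); // paths already refused once
  }
  markRead(path) {
    this.read.add(path);
  }
  // Returns { allowed, hint? } for a write attempt.
  checkWrite(path) {
    if (this.read.has(path) || this.refused.has(path)) return { allowed: true };
    this.refused.add(path);
    return { allowed: false, hint: `Read ${path} first, or retry to overwrite it.` };
  }
}
```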
- The session memory and persistence layer
Memory is two-tier. Short-term working memory lives in the conversation history and gets evicted under context pressure. Long-term project memory lives in a SQLite database with full-text search, keyed by content type (decision, workflow, gotcha, convention, context). When you ask the model to remember something, it's written there. When a new task starts, relevant entries are loaded based on keyword overlap with the message.
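To illustrate the keyword-overlap idea without a database, here is an in-memory sketch. The stopword list, entry shape, and `recall` function are all assumptions; the real store uses SQLite full-text search.

```javascript
const STOPWORDS = new Set(["the", "a", "an", "to", "in", "of", "and", "for", "is", "it"]);

// Lowercased word set with common filler words removed.
function keywords(text) {
  const words = text.toLowerCase().match(/[a-z0-9_]+/g) ?? [];
  return new Set(words.filter((w) => !STOPWORDS.has(w)));
}

// Ranks stored entries by how many keywords they share with the message.
function recall(entries, message, limit = 3) {
  const wanted = keywords(message);
  return entries
    .map((entry) => ({
      entry,
      overlap: [...keywords(entry.text)].filter((w) => wanted.has(w)).length,
    }))
    .filter((r) => r.overlap > 0)
    .sort((a, b) => b.overlap - a.overlap)
    .slice(0, limit)
    .map((r) => r.entry);
}
```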
Each session is persisted to disk with atomic writes (write temp file, then rename). Sessions have time-descending IDs so the most recent one sorts first lexicographically. Path traversal is prevented. File permissions are set to 0600.
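The time-descending ID trick is worth a quick sketch. The exact ID format SmallCode uses isn't given here, so the 13-digit scheme below is an assumption that just shows the idea; the atomic write and permission steps are left out.

```javascript
const MAX_MS = 9999999999999; // 13 digits; larger than any Date.now() until the year 2286

// Newer timestamps produce smaller, zero-padded numbers, so a plain
// lexicographic sort lists the most recent session first.
function sessionId(nowMs = Date.now()) {
  return String(MAX_MS - nowMs).padStart(13, "0");
}

// Rejects anything that is not a bare ID, which also rules out "../" tricks
// when the ID is later joined into a file path.
function isSafeSessionId(id) {
  return /^[0-9]{13}$/.test(id);
}
```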
Snapshots are a separate mechanism for rollback: before each agent turn, a checkpoint is opened. Every write and patch records the pre-edit file content. If validation hard-fails after all retries, auto-rollback can revert all edits in the turn back to the checkpoint state. The `.smallcode/snapshots/` directory stores this metadata for manual audit.
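The checkpoint logic boils down to remembering each file's content before its first edit in a turn. This sketch uses a `Map` as a stand-in for the disk, and the class name is hypothetical.

```javascript
class TurnCheckpoint {
  constructor(files) {
    this.files = files;      // Map of path -> content, standing in for the disk
    this.before = new Map(); // path -> content before the first edit this turn
  }
  write(path, content) {
    // Record only the first pre-edit state so rollback returns to the turn start.
    if (!this.before.has(path)) {
      this.before.set(path, this.files.has(path) ? this.files.get(path) : null);
    }
    this.files.set(path, content);
  }
  rollback() {
    for (const [path, original] of this.before) {
      if (original === null) this.files.delete(path); // file was created this turn
      else this.files.set(path, original);
    }
    this.before.clear();
  }
}
```

Recording only the first pre-edit state is the important detail: a file patched three times in one turn should roll back to how it looked before the turn, not before the last patch.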
- What escalation is and when it fires
Every local model run has a ceiling: some tasks are genuinely beyond what an 8B or 26B model can do reliably. Escalation is the opt-in escape hatch: if you've configured a cloud API key (Anthropic, OpenAI, or DeepSeek), then when the local model hard-fails after exhausted retries and decomposition strategies, SmallCode can fire one call to a stronger cloud model.
The escalation engine auto-detects available keys in preference order (Anthropic first, then OpenAI, then DeepSeek). It converts the full conversation history into the provider's native format — Anthropic requires alternating user/assistant turns and `tool_use`/`tool_result` blocks instead of OpenAI's `tool_calls`/`tool` format — and sends it with a framing system message: "A smaller local model failed. Fix it in as few tool calls as possible."
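The format conversion is the fiddly part, so here is a sketch of the OpenAI-to-Anthropic direction. The function name is hypothetical and it assumes plain string `content` on every message; it is not the engine's actual converter.

```javascript
// Converts OpenAI-style chat messages into the Anthropic Messages shape.
function toAnthropic(messages) {
  const system = [];
  const out = [];
  // Appends blocks, merging into the previous message when the role repeats,
  // so user and assistant turns keep alternating.
  const push = (role, blocks) => {
    const last = out[out.length - 1];
    if (last && last.role === role) last.content.push(...blocks);
    else out.push({ role, content: [...blocks] });
  };
  for (const m of messages) {
    if (m.role === "system") {
      system.push(m.content); // Anthropic takes the system prompt as a separate field
    } else if (m.role === "tool") {
      push("user", [{ type: "tool_result", tool_use_id: m.tool_call_id, content: m.content }]);
    } else if (m.role === "assistant") {
      const blocks = m.content ? [{ type: "text", text: m.content }] : [];
      for (const call of m.tool_calls ?? []) {
        blocks.push({
          type: "tool_use",
          id: call.id,
          name: call.function.name,
          // OpenAI sends arguments as a JSON string; this throws if it is malformed.
          input: JSON.parse(call.function.arguments),
        });
      }
      push("assistant", blocks);
    } else {
      push("user", [{ type: "text", text: m.content }]);
    }
  }
  return { system: system.join("\n\n"), messages: out };
}
```

Tool results become `user` turns on the Anthropic side, which is why a tool message followed by a user message gets merged into one turn here.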
There's a session cap (default five escalations) to prevent runaway API costs. Without a configured key, `canEscalate()` returns false immediately and the feature is completely dormant. It's opt-in in the strongest sense.
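Key detection and the session cap fit in one small class. `canEscalate()` is the name used above; the `Escalation` class, the `take()` method, and the environment variable names are assumptions for the sake of the sketch.

```javascript
// Assumed environment variable names, in the preference order from the post.
const PROVIDERS = [
  ["anthropic", "ANTHROPIC_API_KEY"],
  ["openai", "OPENAI_API_KEY"],
  ["deepseek", "DEEPSEEK_API_KEY"],
];

class Escalation {
  constructor(env, maxPerSession = 5) {
    const found = PROVIDERS.find(([, key]) => env[key]);
    this.provider = found ? found[0] : null;
    this.remaining = maxPerSession;
  }
  // False when no key is configured or the session cap is used up.
  canEscalate() {
    return this.provider !== null && this.remaining > 0;
  }
  // Call once per escalation; returns the provider to use, or null.
  take() {
    if (!this.canEscalate()) return null;
    this.remaining -= 1;
    return this.provider;
  }
}
```

In a real run you would pass `process.env`; passing a plain object keeps the sketch easy to test.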
SmallCode is genuinely purpose-built for the constraint. The router, the plan-tracker, the patch-first editing, the forgiving JSON parser, the thinking budget control: these aren't features bolted on top of a Claude Code clone. They're compensations for a specific class of model limitation, evolved through running the thing on real hardware against real tasks.