
u/Best_Volume_3126

Thin Harness, Fat Skills: 200 Lines of Harness, All the Value in the Skills
A short summary of why a thin harness and fat skills matter.
The 2x engineers and the 100x engineers run the same models. The difference is architecture.
Anthropic accidentally published the full Claude Code source to npm. 512,000 lines. Reading it confirmed what I'd been teaching at Y Combinator: the value lives in the wrapper, not the model. That wrapper is the harness. Most builders make it fat and make the skills thin. That's exactly backwards.
Skill files
A skill file is a markdown document encoding a process. The user supplies the task. The skill supplies the judgment.
The key: a skill works like a method call. Same procedure, different parameters, different output. A skill called /investigate has seven steps, among them: scope the dataset, build a timeline, diarize documents, synthesize, argue both sides, cite sources. Parameters: TARGET, QUESTION, DATASET. Point it at a safety scientist and 2.1 million discovery emails and you get a medical research analyst. Point it at a shell company and Federal Election Commission filings and you get a forensic investigator tracing campaign donations.
Same file. The invocation supplies the world.
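A minimal sketch of the method-call idea, assuming a skill can be reduced to a template with three named slots. The template text, the `invoke` helper, and the placeholder questions are hypothetical; a real skill file is a full markdown document, not a five-line string.

```python
from string import Template

# Hypothetical, heavily shortened skill body (illustration only).
INVESTIGATE = Template(
    "# /investigate\n"
    "Target: $TARGET\n"
    "Question: $QUESTION\n"
    "Dataset: $DATASET\n"
    "Process: scope the dataset, build a timeline, diarize documents, "
    "synthesize, argue both sides, cite sources.\n"
)


def invoke(skill: Template, **params: str) -> str:
    """Bind parameters to the skill; substitute() raises KeyError if one is missing."""
    return skill.substitute(**params)


# Same file, two different worlds supplied by the invocation.
analyst = invoke(
    INVESTIGATE,
    TARGET="a safety scientist",
    QUESTION="(placeholder question)",
    DATASET="2.1 million discovery emails",
)
investigator = invoke(
    INVESTIGATE,
    TARGET="a shell company",
    QUESTION="(placeholder question)",
    DATASET="Federal Election Commission filings",
)
```

The procedure text never changes between the two calls; only the bound parameters do.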
The harness
The harness runs the large language model in a loop, reads and writes files, manages context, enforces safety. About 200 lines of code. JSON in, text out.
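To make "runs the model in a loop" concrete, here is a rough sketch of such a loop. The tool-call shape (`{"tool": ..., "args": ...}`), the `run_harness` name, and the step limit are assumptions for illustration, and the model is any callable you pass in, not a specific vendor API.

```python
import json
from typing import Callable


def run_harness(
    model: Callable[[list[dict]], str],
    task: str,
    tools: dict[str, Callable[..., str]],
    max_steps: int = 10,
) -> str:
    """Loop: ask the model, run any tool it requests, stop on plain text."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # plain text: the model is done
        if not isinstance(call, dict) or "tool" not in call:
            return reply  # valid JSON, but not a tool request
        name = call["tool"]
        if name not in tools:
            result = f"error: unknown tool {name!r}"  # safety: never run unlisted tools
        else:
            result = tools[name](**call.get("args", {}))
        messages.append({"role": "user", "content": result})
    raise RuntimeError("step limit reached without a final answer")
```

Everything domain-specific stays out of this loop, which is the point of keeping the harness thin.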
The anti-pattern: 40+ tool definitions eating half your context window, Model Context Protocol round-trips taking 2 to 5 seconds per call. A Playwright command-line interface handles each browser operation in 100 milliseconds. A Chrome Model Context Protocol server takes 15 seconds for screenshot, find, click, wait, read. 75x slower, from one architectural choice.
Resolvers
A resolver is a routing table for context. Task type X appears, document Y loads first.
My CLAUDE.md hit 20,000 lines. Every pattern, every lesson. Model attention degraded and Claude Code told me to cut it. The fix was 200 lines of pointers. The resolver pulls the right document when it matters, without polluting the context window.
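A resolver can be as small as a dictionary lookup. The sketch below assumes task types are plain strings; the task names and document paths are hypothetical.

```python
# Hypothetical routing table: task type -> document that loads first.
RESOLVER = {
    "migration": "docs/migrations.md",
    "auth": "docs/auth-patterns.md",
    "release": "docs/release-checklist.md",
}


def resolve(task_type: str, default: str = "CLAUDE.md") -> str:
    """Return the one document to load for this task, not the whole library."""
    return RESOLVER.get(task_type, default)
```

The 200 lines of pointers stay in context; the documents they point to load only when a matching task appears.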
Latent vs. deterministic
Every step in your system belongs on one side of this line. Latent space is where the model reads and decides: judgment, synthesis, pattern recognition. Deterministic is where the same input produces the same output: SQL, compiled code, arithmetic.
A large language model seats 8 people at a dinner accounting for personalities. Ask it to seat 800 and it produces a plausible, completely wrong seating chart. Combinatorial optimization belongs in deterministic tooling. The worst systems put the wrong work on the wrong side.
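As a small illustration of the deterministic side, the sketch below splits an already-ordered guest list into tables. It is not an optimizer and the function name is made up; it only shows the guarantee deterministic code gives at 800 guests: everyone is seated exactly once, every time.

```python
def assign_tables(guests: list[str], seats_per_table: int) -> list[list[str]]:
    """Chunk an ordered guest list into tables, seating each guest exactly once."""
    if seats_per_table < 1:
        raise ValueError("seats_per_table must be at least 1")
    if len(set(guests)) != len(guests):
        raise ValueError("duplicate guest in list")
    return [
        guests[i : i + seats_per_table]
        for i in range(0, len(guests), seats_per_table)
    ]


tables = assign_tables([f"guest-{n}" for n in range(800)], seats_per_table=8)
```

The model can still decide the ordering or the themes; the code guarantees the bookkeeping.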
Diarization
The model reads everything about a subject and writes one structured profile: a page of judgment distilled from dozens of documents. No SQL query produces this. No retrieval-augmented generation pipeline produces this. The model has to read, hold contradictions in mind, notice what changed, and synthesize. It's the difference between a database lookup and an analyst's brief.
All five working together
Chase Center, July 2026. Six thousand founders at Startup School. A skill called /enrich-founder pulls all sources, diarizes, and flags the gap between what founders say and what they're building. Deterministic layer handles SQL, GitHub stats, demo URL tests. Cron runs nightly.
The diarization catches things no keyword search finds:
FOUNDER: Maria Santos
COMPANY: Contrail (contrail.dev)
SAYS: "Datadog for AI agents"
ACTUALLY BUILDING: 80% of commits are in billing module.
She's building a FinOps tool disguised as observability.
Surfacing that gap requires reading the GitHub history, the application, and the advisor transcript at once. No embedding similarity search does that.
Matching uses the same skill with three invocations. /match-breakout clusters 1,200 founders by sector, 30 per room. /match-lunch does serendipity matching across sectors, 8 per table: the large language model invents themes and a deterministic algorithm assigns seats. /match-live runs nearest-neighbor embedding at 200ms for one-on-one pairs in real time.
The model also makes calls a clustering algorithm can't: "Kim applied as 'developer tools' but his one-on-one transcript shows he's building SOC 2 compliance automation. Move him to FinTech." No embedding captures that.
After the event, an /improve skill diarizes the mediocre Net Promoter Score responses, extracts patterns, and writes rules back into the matching skill:
When attendee says "AI infrastructure"
but startup is 80%+ billing code:
→ Classify as FinTech, not AI Infra.
When two attendees in same group
already know each other:
→ Penalize proximity.
Prioritize novel introductions.
The skill rewrites itself. July event: 12% "OK" ratings. Next event: 4%. Nobody touched the code.
Skills are permanent upgrades
The instruction I gave my OpenClaw that got 1,000 likes and 2,500 bookmarks:
>
People read it as prompt engineering. It's architecture. Every skill you write runs at 3 AM, never forgets, and gets better automatically when the next model ships, because the judgment in the latent steps improves while the deterministic steps stay reliable.
Fat skills, thin harness, discipline to codify everything.
It's okay to give yourself the permission to be a beginner.
Guard your inner life like your life depends on it.
we run 50+ services through 1 mcp server. here's the architecture.
We run 50+ services through one mcp server: Linear, GitHub, Sentry, Notion, Slack, Vercel, Gmail, and 42+ others. that's every tool our team uses, each with its own auth flow, rate limits, and credential format.
the architecture
Everything runs through a context management layer: tools, credentials, and state live outside the model context. agents connect to one endpoint. That endpoint knows where each plugin lives, handles routing, and manages credentials. you register a service once, auth runs once, the token lives in the workspace, and every agent on the team inherits access through scoped grants defined at the plugin level. add an agent, assign grants. remove a plugin, all agents lose access. no credential files in the repo, no rotation scripts.
the agent gets tools, no credentials. sentry.getIssue(), linear.createTask(), github.getPullRequest(). the layer translates each call into the correct authenticated request, handles rate limit retries, returns the result. the agent never touches auth.
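a rough sketch of that layer, assuming tools are addressed as `plugin.method` and grants are tracked per agent. `ToolGateway`, its methods, and the handler signature are hypothetical names for illustration, not the API of any real gateway.

```python
from typing import Callable

Handler = Callable[[str, dict], str]  # (token, args) -> result


class ToolGateway:
    """Holds credentials so agents only ever see tool names and results."""

    def __init__(self) -> None:
        self._plugins: dict[str, tuple[str, dict[str, Handler]]] = {}
        self._grants: dict[str, set[str]] = {}

    def register(self, plugin: str, token: str, handlers: dict[str, Handler]) -> None:
        # Auth happens once; the token never leaves the gateway.
        self._plugins[plugin] = (token, handlers)

    def grant(self, agent: str, plugin: str) -> None:
        self._grants.setdefault(agent, set()).add(plugin)

    def remove(self, plugin: str) -> None:
        # Removing a plugin revokes it for every agent at once.
        self._plugins.pop(plugin, None)
        for granted in self._grants.values():
            granted.discard(plugin)

    def call(self, agent: str, tool: str, args: dict) -> str:
        plugin, method = tool.split(".", 1)
        if plugin not in self._plugins or plugin not in self._grants.get(agent, set()):
            raise PermissionError(f"{agent} has no grant for {plugin}")
        token, handlers = self._plugins[plugin]
        return handlers[method](token, args)
```

the agent passes a tool name and arguments; the token is looked up inside `call` and never appears in the agent's context.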
Before this setup, roughly 30% of our token budget went to tool discovery and auth retry logic: re-fetching capability lists, retrying failed auth, and renegotiating endpoints. tool discovery now happens once at workspace init. that 30% comes back every session.
tracing
observability: every tool invocation produces a trace. we pipe those into langfuse and track latency per service, error rates per tool, and token cost per agent session. when sentry slows down, we catch it in the trace before the agent times out.
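a minimal sketch of per-call tracing, assuming a plain decorator around each tool function. the in-memory `TRACES` list stands in for whatever backend you ship traces to; nothing here uses the langfuse SDK, and the `traced` name is made up.

```python
import time
from functools import wraps

TRACES: list[dict] = []  # stand-in for a real trace backend


def traced(service: str):
    """Record latency and any error for every call to the wrapped tool."""

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise  # the caller still sees the failure
            finally:
                TRACES.append(
                    {
                        "service": service,
                        "tool": fn.__name__,
                        "latency_s": time.perf_counter() - start,
                        "error": error,
                    }
                )

        return wrapper

    return decorator
```

because the record is written in `finally`, a failed call leaves a trace entry too, which is what lets you see where a chain stopped.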
we also run opentelemetry-mcp-server as a second plugin, connected to our jaeger backend. the agent queries its own trace data mid-session: failed calls, 50+, auth errors, exact timestamps. no human checking a dashboard.
when something breaks, we get the specific call that failed, the service that errored, and where the chain stopped. one workspace instead of many separate integrations.
Consistency looks like nothing is happening, until it changes everything.
Hermes Agent Architecture: From One Agent to a Full Fleet
Hermes Agent is an autonomous framework from Nous Research. It ranks first on OpenRouter for global token usage, with 150k+ GitHub stars.
The pitch against something like OpenClaw: Hermes is opinionated. Defaults are baked in, the agent makes decisions for you, and every project starts with 100+ capabilities already wired. OpenClaw gives you primitives and explicit control. Both are valid. Hermes wins when you want compounding capability over time. OpenClaw wins when you want to control every step.
Architecture
Three layers per agent.
A brain. Memory lives in ~/.hermes/memories/ across MEMORY.md (your business, customers, products) and USER.md (your timezone, recurring projects, preferred output formats). Both load before the first prompt. Sessions persist in SQLite with full-text search across sessions.
A personality. SOUL.md defines tone. Six agents can share the same brain with six different souls, one for outbound, one for research, one for admin, each scoped to its role.
A skillset. The 123 bundled skills are the floor. As the agent works, it watches itself and writes new skills based on your actual tasks. You don't prompt it to do this.
The tool gateway gives you 300+ models under one subscription, Model Context Protocol integration for any external service, and 20+ messaging surfaces including Telegram, Discord, Slack, and email. The agent runs local, in Docker, over SSH on a virtual private server, or serverless through Daytona or Modal.
The four levels
The mental model has four parts: you as operator, the agent control room (a folder at /root/vps-agents that governs the fleet, not an agent you chat through), the Hermes agents as workers, and an optional task bus between the orchestrator and specialists.
Storage split:
/root/vps-agents → control room: docs, rules, runbooks, architecture
/srv/<agent-name>/data → live runtime: secrets, memory, skills, sessions, crons
You can rebuild the live runtime from the control room. You cannot rebuild the control room from the runtime.
Level 1: One agent. Fill SOUL.md, MEMORY.md, and USER.md. Connect it to Telegram or Discord. Run real tasks. Let the skill library grow on its own.
Level 2: Multiple specialists, each with its own soul, scope, and credentials. You talk to each one directly. No orchestrator yet. Prove your specialists are useful before adding routing complexity.
A new agent gets its own container when it needs its own credentials, its own long-term memory, or handles ongoing work that constitutes a separate role. Otherwise, keep things consolidated.
Level 3: Add a Hermes orchestrator as the front door. It reads the control room to know which agents exist, what each handles, where task queues live, and where the runbooks are. Three interaction paths:
control path: you ──► agent control room (manage the fleet)
direct path: you ──► specialist agent (fastest, when you know who owns it)
orchestrated path: you ──► orchestrator ──► task bus ──► specialists ──► you
Level 4: Same as Level 3 with recurring workflows on cron. Search engine results page reports, server health checks, backup verification, content operations. Nothing needs you to start the day.
Spinning it up
Clone the template at github.com/shannhk/hermes-agent-control-room. The intended path: hand Claude Code or Codex a Hetzner API key and let the bundled skills run the setup. You get a provisioned virtual private server, the control room cloned at /root/agent-control-room, skills linked into ~/.claude/skills, one agent registered with its runbook filled in, and an SSH alias so ssh hermes connects from your laptop. Ten to fifteen minutes.
Growing agents, not writing them
Production agents don't get written from scratch.
Step 1: Prototype in Hermes. Describe the workflow, let it run, expect it to get most of it wrong.
Step 2: Run it two or three times on real work. Correct the drift. The harness watches and starts writing the skill as it learns the shape of your task.
Step 3: Fine-tune in a dedicated Claude Code workspace. Tighten the prompts, lock the routing, add error handling, decide what runs on cron.
Step 4: Push to its own Docker container on the virtual private server, set the cron, walk away.
Model routing
The tool gateway routes to 300+ models per agent or per task. For content and copy, Claude Opus. For structured multi-step work and automation, Codex. Run your orchestrator and strategy agents on the strongest model you can afford. Drop to cheaper models for batch processing and research scraping.
Actual trade-offs
Hermes's opinionated defaults are also constraints. If you want explicit control over every step, OpenClaw fits better.
Levels 3 and 4 require real infrastructure knowledge: Docker, virtual private servers, SSH, the control room structure. Don't skip Level 1.
The model sets the ceiling. Hermes makes a capable model more productive. It doesn't make a weak model strategic.
why enterprise Agent stacks use all 3 layers (and how to build one)
the "mcp vs cli" debate is a distraction. production agents don't choose. they run all 3 connectivity layers at once, and if you're still picking sides you're asking the wrong question.
here's how the stack actually works.
3 layers, 3 jobs
skills are markdown files that teach the model how to use a tool. portable, loadable from .claude/skills/ or a remote repo, and they belong in context before the model touches anything complex. think of them as domain knowledge on a wire.
cli is the unix layer. the model already knows git, gh, curl, jq from pretraining, so it composes commands with pipes, filters output inline, and handles errors without burning tokens. a cli response runs around 200 tokens. a naive mcp response runs 44,000 to 55,000. that gap is not a rounding error.
mcp is connective tissue for anything requiring authorization, governance, and audit trails. schema-first means tool selection is deterministic. in enterprise workflows you can't afford ambiguous tool calls. mcp is the answer to that specific problem, and only that problem.
token overhead is solvable
the main criticism of mcp is token bloat. it's fair and it's also a pattern problem, not a protocol problem. front-loading every tool schema at 44,000+ tokens is bad architecture, not a flaw in the spec.
progressive discovery fixes it. give the model a tool_search capability so it loads tools on demand. that 1 change cuts context usage by a factor of 5.
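a rough sketch of progressive discovery, assuming the full tool catalog lives outside the context and the model only sees the names a search returns. the registry contents, the descriptions, and the keyword scoring are hypothetical; real implementations may rank differently.

```python
# Hypothetical catalog kept outside the model context: name -> description.
TOOL_REGISTRY = {
    "linear_get_issue": "Fetch a Linear issue by id",
    "github_list_prs": "List pull requests for a GitHub repo",
    "sentry_get_issue": "Fetch a Sentry issue by id",
}


def tool_search(query: str, limit: int = 5) -> list[str]:
    """Return only the tool names matching the query, best match first."""
    terms = query.lower().split()
    scored = []
    for name, description in TOOL_REGISTRY.items():
        haystack = f"{name} {description}".lower()
        score = sum(term in haystack for term in terms)
        if score:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]
```

only the schemas for the returned names need to be loaded, instead of front-loading all of them.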
the other latency killer is sequential tool calls. every inference hop adds wait time. instead, drop the model into a repl environment and let it write an orchestration script that runs everything at once. 1 script, 1 inference round, results back in parallel.
// the two lookups are independent, so start both and await them together
const [issue, prs] = await Promise.all([
  mcp.call_tool("linear_get_issue", { id: "ENG-5121" }),
  mcp.call_tool("github_list_prs", { repo: "frontend" }),
]);
const typedIssue = await extract("claude-haiku-4-5", expectedType, issue);
that's code mode. the latency difference at scale is not small.
write mcp servers for agents, not rest clients
most mcp servers today are rest apis with a thin wrapper. that's the wrong mental model. a developer reading api docs and an agent reading your tool schema are doing completely different things. agents succeed faster when parameter names are descriptive and annotations tell them exactly what's expected.
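from datetime import date
from typing import Annotated
# Category is assumed to be the server's own enum of expense categories.
def submit_expense(
    amount: Annotated[float, "The expense amount in USD"],
    date: Annotated[date, "Date of the expense in YYYY-MM-DD format"],
    category: Annotated[Category, "The expense category"],
) -> str:
    ...  # body not shown in the original post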
beyond annotations: expose an execution environment so the model can run code mode against your server. ship ui resources as html, javascript, and css so the server renders its own interface in the client. cloudflare's mcp implementation does this already.
what's shipping in 2026
3 things worth tracking.
stateless transport (google's proposal) makes mcp servers deployable to kubernetes and cloud run without session state. typescript and python sdk 2.0 are tied to this.
cross-app access brings sso across mcp servers via your company's identity provider, with server discovery automated through .well-known/mcp-server-card/server.json.
skills over mcp lets servers ship domain knowledge alongside tools using skills/list and skills/get endpoints. that closes the gap between what a server can do and what the model knows about how to do it well.
the takeaway
skipping mcp in enterprise contexts doesn't simplify anything. you trade token overhead for authorization fragmentation, zero audit trails, and vendor lock-in. those are harder problems.
skills for domain knowledge. mcp for secure connectivity. cli for token-efficient local execution. progressive discovery and code mode make all 3 practical together. both techniques are available now. most teams aren't using them yet.
/goal is the hottest Command in Claude Code right now.
/goal runs multi-step tasks to completion without you staying in the loop.
Available in Claude Code, Codex, and Hermes.
How the loop works:
- You type /goal followed by the end state you want
- The model executes a step, then checks: "am I done?"
- If no, it continues. If yes, it stops and reports back.
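The loop above can be sketched in a few lines. This is an illustration of the step-then-check pattern only, not how any of these tools implements /goal; `run_goal`, the state dictionary, and the step limit are assumptions.

```python
from typing import Callable


def run_goal(
    step: Callable[[dict], None],
    is_done: Callable[[dict], bool],
    state: dict,
    max_steps: int = 50,
) -> int:
    """Execute one step, re-check the end state, repeat. Returns steps taken."""
    for taken in range(1, max_steps + 1):
        step(state)
        if is_done(state):  # the "am I done?" check after every step
            return taken
    raise RuntimeError(f"goal not reached after {max_steps} steps")


# Hypothetical job: migrate 80 posts, 10 per step.
progress = {"migrated": 0}
steps_taken = run_goal(
    step=lambda s: s.update(migrated=min(80, s["migrated"] + 10)),
    is_done=lambda s: s["migrated"] == 80,
    state=progress,
)
```

A clear, checkable `is_done` is what the "specific, measurable" end state buys you.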
Use it for jobs with many steps and a clear finish line:
- "Build my course landing page: hero, 5 modules, 3 testimonials, FAQ, and Stripe checkout"
- "Migrate my 80 blog posts from WordPress to Beehiiv, fix every broken image and internal link along the way"
- "Process every customer support ticket from last month: categorize them, draft template replies, and document the top 5 recurring issues"
Skip it for simple tasks. Single-turn prompts ("write a tweet", "explain X") don't need the overhead.
To write a tight /goal prompt, paste this into Claude Code, Codex, or Hermes:
>"Write me a /goal prompt. Ask me what I'm trying to do first, then keep asking follow-up questions until you can describe 'done' in specific, measurable terms."
Prefix the output with /goal and run it.
The self-check loop is what separates this from a long prompt. A long prompt front-loads all the conditions. /goal re-evaluates after every step, so it handles branching jobs where the path isn't fully predictable upfront.
Using NotebookLM as a knowledge base (prompts included)
NotebookLM is source-grounded. It reasons only over what you upload. That constraint is the feature, not a limitation.
Here's a six-layer system: Sources set your input boundary, Chat handles extraction and synthesis, Notes capture outputs so they don't vanish mid-session, Notes promoted to Sources enable two-pass synthesis, Studio generates reports and audio overviews and flashcards, and Export moves the result to Docs, Sheets, Obsidian, or Notion.
The Notes-to-Sources loop is what most people skip. You extract frameworks, save the extraction as a Note, promote that Note to a Source, then ask NotebookLM to build a course or playbook using the framework-note as structure and the original sources as evidence.
Run the source audit before touching Chat. Weak sources compound through every layer downstream.
Conduct a rigorous Source Audit.
Create a table with:
1. Source name
2. Source type
3. Publication date if available
4. Author or organisation if available
5. Core thesis
6. Usefulness rating
7. Potential bias or weakness
8. What this source is best used for
9. Whether it should be kept, removed or used cautiously
Then identify any contradictions between sources.
Knowledge base build, after the audit:
Act as an Expert Knowledge Architect.
Create a complete knowledge base from the selected sources.
Work in this order:
1. Create a source inventory.
2. Extract key ideas, definitions, frameworks, processes, examples, warnings, tools and gaps.
3. Create a master theme map.
4. Create a concept map.
5. Create a framework map.
6. Create a process map.
7. Identify contradictions and missing information.
8. Build a modular knowledge base.
9. Add checklists, templates, prompts and practical exercises.
10. Finish with a source coverage and claim audit.
Do not invent unsupported examples.
Mark gaps clearly.
Active recall for studying dense material:
Act as a strict Socratic tutor.
Test me on the selected sources one question at a time.
Rules:
- Ask one question.
- Wait for my answer.
- Grade my answer against the sources.
- Explain what I got right.
- Explain what I missed.
- Give a hint before giving the full answer.
- Track my weak areas.
- At the end, create a revision plan based on my mistakes.
Meeting transcripts:
Analyse this meeting transcript.
Provide:
1. 3-sentence executive summary
2. Key decisions made
3. Action item table with owner and deadline
4. Risks raised
5. Open questions
6. Dependencies
7. Follow-up email draft
8. Project brief update
If the transcript does not clearly assign an owner or deadline, mark it as unclear.
For projects that outgrow one notebook, split by theme and generate a Bridge Summary per notebook. Store them in Obsidian or Notion, and bring final drafts back into NotebookLM for verification. Gemini handles cross-notebook work and long-form generation.
Create a Bridge Summary for this notebook.
The summary must represent the most important knowledge from all selected sources.
Include:
1. Core topic
2. Main themes
3. Key frameworks
4. Important evidence
5. Contradictions
6. Gaps
7. Useful examples
8. Recommended next-step questions
9. Source list
This Bridge Summary will be exported into an external knowledge base.
Make it dense and portable.
I Cut Claude Code Token Usage 20x: Using Cheaper Models for daily tasks.
Claude Code's Bash tool runs any command on your PATH. That's my entire setup.
I was burning through my Pro allocation very quickly every week while building drone guidance systems. Reading 800-line Python files, generating test harnesses, rewriting docs. Most of that work requires zero reasoning. It just burns tokens.
The fix: route cheap tasks to a cheap model. Two scripts, 20 lines in CLAUDE.md.
I used Kimi K2.5 as the worker: OpenAI-compatible API, 128K context window, roughly 1/100th the cost. Any cheap long-context model works: DeepSeek, Qwen, Gemini Flash. The pattern is what matters, not the model.
ask-kimi — bulk reading
Instead of Claude pulling five files into context to answer one question:
ask-kimi --paths gcs_main.py gimbal_control.py network.py \
--question "What IP addresses and ports are used for video streaming?"
Kimi returns a structured summary. Claude reads the summary. Before: ~8,000 tokens. After: ~400 tokens.
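#!/path/to/venv/bin/python3
import argparse, os, pathlib
from openai import OpenAI
# Parse the two flags used in the invocation above.
parser = argparse.ArgumentParser(description="Bulk-read files with a cheap model")
parser.add_argument("--paths", nargs="+", required=True)
parser.add_argument("--question", required=True)
args = parser.parse_args()
client = OpenAI(
    api_key=os.environ["MOONSHOT_API_KEY"],
    base_url="https://api.moonshot.ai/v1",
)
# Wrap each file in a tag so the model can tell the sources apart.
docs = []
for path in args.paths:
    content = pathlib.Path(path).read_text()
    docs.append(f"<file path='{path}'>\n{content}\n</file>")
corpus = "\n\n".join(docs)
# Corpus first, question last, so repeated questions share the same prefix.
resp = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[
        {"role": "system", "content": "You are a precise code analyst..."},
        {"role": "user", "content": f"<corpus>\n{corpus}\n</corpus>"},
        {"role": "user", "content": args.question},
    ],
    max_tokens=8192,
)
print(resp.choices[0].message.content)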
kimi-write — boilerplate generation
Test files, config scaffolding, documentation drafts:
kimi-write --spec "pytest test file for the MAVLink heartbeat parser" \
--context src/mavlink_parser.py \
--target tests/test_mavlink_parser.py
Kimi generates the file. Claude makes surgical edits on the 5% that needs judgment.
Routing via CLAUDE.md
Claude reads this file at session start. The routing rules live here:
## Kimi K2.5 Delegation Tools
### ask-kimi — bulk reading
For files >400 lines, or reading 3+ files at once:
ask-kimi --paths <file1> <file2>... --question "<question>"
### kimi-write — boilerplate generation
For tests, config files, docstrings, repetitive patterns:
kimi-write --spec "<what>" --context <reference> --target <output>
### When NOT to delegate
- Tasks under ~2000 tokens
- Debugging, architectural decisions, safety-critical code
- Anything requiring careful reasoning
- When exact line numbers are needed for editing
The "when not to delegate" block matters as much as the rest. Without a clear boundary, Claude tries to route everything including the work it's actually good at. With it, Claude self-routes without extra prompting.
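The boundary can also be written down as code, which makes the thresholds easy to check. This is a sketch of the rules as stated above, not something Claude executes; the function name and the two boolean flags are assumptions.

```python
def should_delegate_read(
    line_counts: list[int],
    est_tokens: int,
    needs_reasoning: bool = False,
    needs_line_numbers: bool = False,
) -> bool:
    """Apply the CLAUDE.md routing rules to a bulk-reading task."""
    # "When NOT to delegate" wins over everything else.
    if needs_reasoning or needs_line_numbers or est_tokens < 2000:
        return False
    # Delegate for 3+ files at once, or any single file over 400 lines.
    return len(line_counts) >= 3 or any(n > 400 for n in line_counts)
```

Checking the exclusions first mirrors the point of the paragraph above: the "no" list has to take priority, or everything gets routed away.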
Two things worth knowing
Kimi K2.5 burns internal chain-of-thought tokens that don't appear in the response. Set max_tokens too low and the response comes back empty with no error. Use 8192 for reading tasks, 16384 for generation.
Put the corpus before the question in every call. Moonshot uses prefix caching. Same files, different questions: first call full price, next calls at 25% cost.
messages = [
{"role": "user", "content": f"<corpus>{corpus}</corpus>"},
{"role": "user", "content": question},
]
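A quick bit of arithmetic on what that ordering saves, assuming the 25% figure above applies to the whole call. That is a simplification: the question itself is never cached, so real savings will be slightly lower. The function name is made up.

```python
def relative_cost(calls: int, cached_fraction: float = 0.25) -> float:
    """Cost of asking `calls` questions about one corpus, in full-price calls."""
    if calls < 1:
        return 0.0
    # First call pays full price; each later call pays the cached rate.
    return 1.0 + (calls - 1) * cached_fraction


five_questions = relative_cost(5)  # 1 + 4 * 0.25 = 2.0 full-price calls
```

Five questions over the same files cost about as much as two uncached calls, which only works if the corpus stays byte-identical at the front of the prompt.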
Three weeks of daily engineering work. Total Kimi spend: $0.38. Claude Pro limit: not hit once.
The magic you're looking for is in the work you're avoiding.
What are you planting in your garden?
Why I Stopped Using Markdown for Claude Code Outputs. HTML Outputs Are Underrated
Markdown made sense when you were the one editing the file. You'd write a plan, Claude would suggest changes, you'd merge them by hand. The format served that loop.
That loop is mostly gone. Claude edits the files now. You read them, or you pass them to a verification agent, or you share them with someone who needs to approve the direction. Nobody's doing line edits in a text file.
Markdown still works at 30 lines. Past 100, most people stop reading. The format can't hold tables with real styling, can't embed diagrams that don't look like ASCII guesswork, can't let you interact with the content. It just sits there as text.
HTML doesn't have those limits. Claude can put real tabular data in a table with CSS, draw diagrams in SVG, add JavaScript-driven sliders so you can tune a parameter and see the result, build a mobile-responsive layout if the file needs to travel. There's almost no category of information that won't fit, and you can share it as a URL instead of an attachment.
The practical upgrade shows up fast in a few specific workflows.
Specs and planning. Instead of a 200-line markdown plan nobody reads, ask Claude Code to produce an HTML file with mockups, data flow diagrams, and annotated code snippets in one document. Pass that file into the next session as context. The verification agent reads it too and has far more to work with than a flat text spec.
A prompt that works:
>
Code review. Rendered diffs, severity-coded annotations, flowcharts of the logic you're trying to explain, all in one file you attach to the pull request. The default GitHub diff view doesn't do any of that.
>
Throwaway editors. This one takes a minute to see, but it's the most useful pattern. When you're working on something that's painful to describe in text, ask Claude to build you a single-purpose HTML interface for that exact thing. Drag-and-drop ticket triage, a form-based config editor with dependency warnings, a side-by-side prompt editor with live variable rendering. Always end it with an export button that outputs the result as text or JSON you can paste back into Claude Code.
>
The export button is the critical detail. Without it, the editor is a dead end. With it, it becomes a UI layer for your agent loop.
Context is the reason to use Claude Code specifically for this. Claude Code can read your file system, pull from connected Model Context Protocol servers like Slack or Linear, check your git history. An HTML report built from that context will have actual specifics in it, not placeholders.
One real tradeoff: HTML diffs are noisy. If your team reviews documentation in version control, HTML is harder to scan than markdown. For files that live in a repo and get reviewed in pull requests, markdown still wins. For files that get read, shared, or acted on, HTML is the better format.
The frontend design plugin helps Claude produce cleaner, more consistent HTML output. If you want it to match your product's visual style, point Claude at your codebase and ask it to generate a design system reference file first, then use that as a reference for subsequent HTML outputs.
You don't need a skill or a preset for any of this. Ask Claude to make an HTML file and describe what it should contain. The format will handle the rest.
Anyone else struggling with the guilt/frustration balance while TTC #2?
I didn’t expect TTC for another baby to feel this emotionally complicated. On one hand, I feel incredibly grateful to already have a child. On the other hand, every unsuccessful cycle still hurts more than I want to admit, and then I immediately feel guilty for even being upset about it.
I also think I underestimated how hard it is to TTC while parenting at the same time. Between exhaustion, schedules, interrupted sleep, and just constantly being needed by someone else, it feels very different from TTC the first time around. Some days I’m calm about it, and other days I find myself spiraling over timing, symptoms, and whether something is wrong because it isn’t happening as quickly this time... this version of TTC feels emotionally very different than I expected.
To attract better, you have to become better.
Karpathy's CLAUDE.md cuts Claude mistakes to 11%. Here are the 8 rules that get it to 3%
Karpathy's complaints about Claude were distilled into 4 rules and put in a single CLAUDE.md. The rules worked. Across 30 codebases over 6 weeks, mistake rates dropped from 41% to 11%.
The 4 rules were written for single-shot, one-codebase autocomplete sessions. They don't cover agent loops, multi-step tasks, or silent failures. Below are 8 rules that do.
The original 4
## Rule 1 — Think Before Coding
State assumptions explicitly. Ask rather than guess.
Push back when a simpler approach exists. Stop when confused.
## Rule 2 — Simplicity First
Minimum code that solves the problem. Nothing speculative.
No abstractions for single-use code.
## Rule 3 — Surgical Changes
Touch only what you must. Don't improve adjacent code.
Match existing style. Don't refactor what isn't broken.
## Rule 4 — Goal-Driven Execution
Define success criteria. Loop until verified.
Strong success criteria let Claude loop independently.
The 8 rules I added
Rule 5. A Claude call that decided whether to retry on a 503 worked for two weeks, then started flaking. The model read the request body as context for the retry decision. The policy became random.
## Rule 5 — Use the model only for judgment calls
Use for: classification, drafting, summarization, extraction.
Do NOT use for: routing, retries, status-code handling, deterministic transforms.
If code can answer, code answers.
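What "code answers" might look like for the retry case: a sketch of a fixed policy. The set of retryable status codes, the attempt limit, and the backoff base are assumptions to tune for your service.

```python
# Assumed policy: only these statuses are worth retrying.
RETRYABLE_STATUSES = {429, 502, 503, 504}


def should_retry(status: int, attempt: int, max_attempts: int = 3) -> bool:
    """Same status and attempt number always give the same answer."""
    return status in RETRYABLE_STATUSES and attempt < max_attempts


def backoff_seconds(attempt: int, base: float = 0.5) -> float:
    """Exponential backoff: 0.5s, 1s, 2s, ... for attempts 0, 1, 2, ..."""
    return base * 2**attempt
```

The request body never enters the decision, so the policy cannot drift the way the model-driven version did.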
Rule 6. A debugging session ran 90 minutes on the same 8KB error. By message 40, Claude was re-suggesting fixes that had already been rejected earlier in the session.
## Rule 6 — Token budgets are not advisory
Per-task: 4,000 tokens. Per-session: 30,000 tokens.
If approaching budget, summarize and start fresh.
Surface the breach. Do not silently overrun.
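If you track usage in your own tooling, the same budgets can be enforced rather than requested. A sketch, assuming you can count tokens per task; the class and exception names are hypothetical, and the limits are the ones from the rule.

```python
class BudgetExceeded(RuntimeError):
    """Raised instead of silently overrunning a budget."""


class TokenBudget:
    def __init__(self, per_task: int = 4_000, per_session: int = 30_000) -> None:
        self.per_task = per_task
        self.per_session = per_session
        self.session_used = 0

    def charge_task(self, tokens: int) -> None:
        """Record one task's usage, or raise if either budget would be breached."""
        if tokens > self.per_task:
            raise BudgetExceeded(f"task used {tokens}, limit {self.per_task}")
        if self.session_used + tokens > self.per_session:
            raise BudgetExceeded("session budget reached: summarize and start fresh")
        self.session_used += tokens
```

Raising an exception is the "surface the breach" part: the overrun becomes an event someone has to handle.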
Rule 7. A codebase had two error-handling patterns. Claude blended them. Errors got swallowed twice.
## Rule 7 — Surface conflicts, don't average them
If two patterns contradict, pick one (more recent / more tested).
Explain why. Flag the other for cleanup.
Don't blend conflicting patterns.
Rule 8. Claude added a function next to an identical one it hadn't read. The new one took precedence via import order. The original had been source of truth for 6 months.
## Rule 8 — Read before you write
Before adding code, read exports, immediate callers, shared utilities.
If unsure why existing code is structured a certain way, ask.
Rule 9. Claude wrote 12 tests for an auth function, all passed, auth was broken in production. The tests verified the function returned something. The function returned a constant.
## Rule 9 — Tests verify intent, not just behavior
Tests must encode WHY behavior matters, not just WHAT it does.
A test that can't fail when business logic changes is wrong.
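A toy version of the auth failure, to show the difference between the two kinds of test. Both checker functions and both test helpers are hypothetical; the point is only that the weak test passes against the broken implementation.

```python
import hmac


def check_password(stored: str, supplied: str) -> bool:
    """Correct implementation: constant-time comparison of the two values."""
    return hmac.compare_digest(stored, supplied)


def broken_check(stored: str, supplied: str) -> bool:
    """The bug from the story: always returns the same constant."""
    return True


def weak_test(check) -> bool:
    # Verifies only that the function returned something.
    return check("s3cret", "s3cret") is not None


def intent_test(check) -> bool:
    # Encodes why auth exists: right password in, wrong password out.
    return check("s3cret", "s3cret") is True and check("s3cret", "wrong") is False
```

`weak_test` cannot fail when the business logic changes, so by Rule 9 it is the wrong test.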
Rule 10. A 6-step refactor went wrong on step 4. Claude completed steps 5 and 6 on top of the broken state before I noticed.
## Rule 10 — Checkpoint after every significant step
Summarize what was done, what's verified, what's left.
Don't continue from a state you can't describe back.
If you lose track, stop and restate.
Rule 11. Claude introduced React hooks into a class-component codebase. They worked. They broke the testing patterns, which assumed componentDidMount.
## Rule 11 — Match the codebase's conventions, even if you disagree
Conformance > taste inside the codebase.
If you think a convention is harmful, surface it. Don't fork it silently.
Rule 12. Claude reported a database migration "completed successfully." It had skipped 14% of records on constraint violations, logged but not surfaced. Found 11 days later.
## Rule 12 — Fail loud
"Completed" is wrong if anything was skipped silently.
"Tests pass" is wrong if any were skipped.
Default to surfacing uncertainty, not hiding it.
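One way to make "fail loud" structural is to refuse the word "completed" whenever anything was skipped. A sketch with a hypothetical report type; the field names and status strings are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class MigrationReport:
    total: int
    migrated: int = 0
    skipped: list[str] = field(default_factory=list)

    def status(self) -> str:
        """Only report success when every record made it across."""
        if self.skipped or self.migrated != self.total:
            return (
                f"INCOMPLETE: {self.migrated}/{self.total} migrated, "
                f"{len(self.skipped)} skipped"
            )
        return "completed"
```

With a report like this, the 14% of skipped records from the story would have shown up in the first status line instead of 11 days later.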
Full file (copy-paste ready)
# CLAUDE.md — 12-rule template
These rules apply to every task in this project unless explicitly overridden.
Bias: caution over speed on non-trivial work.
## Rule 1 — Think Before Coding
State assumptions explicitly. Ask rather than guess.
Push back when a simpler approach exists. Stop when confused.
## Rule 2 — Simplicity First
Minimum code that solves the problem. Nothing speculative.
No abstractions for single-use code.
## Rule 3 — Surgical Changes
Touch only what you must. Don't improve adjacent code.
Match existing style. Don't refactor what isn't broken.
## Rule 4 — Goal-Driven Execution
Define success criteria. Loop until verified.
Strong success criteria let Claude loop independently.
## Rule 5 — Use the model only for judgment calls
Use for: classification, drafting, summarization, extraction.
Do NOT use for: routing, retries, status-code handling, deterministic transforms.
If code can answer, code answers.
## Rule 6 — Token budgets are not advisory
Per-task: 4,000 tokens. Per-session: 30,000 tokens.
If approaching budget, summarize and start fresh.
Surface the breach. Do not silently overrun.
## Rule 7 — Surface conflicts, don't average them
If two patterns contradict, pick one (more recent / more tested).
Explain why. Flag the other for cleanup.
Don't blend conflicting patterns.
## Rule 8 — Read before you write
Before adding code, read exports, immediate callers, shared utilities.
If unsure why existing code is structured a certain way, ask.
## Rule 9 — Tests verify intent, not just behavior
Tests must encode WHY behavior matters, not just WHAT it does.
A test that can't fail when business logic changes is wrong.
## Rule 10 — Checkpoint after every significant step
Summarize what was done, what's verified, what's left.
Don't continue from a state you can't describe back.
If you lose track, stop and restate.
## Rule 11 — Match the codebase's conventions, even if you disagree
Conformance > taste inside the codebase.
If you think a convention is harmful, surface it. Don't fork it silently.
## Rule 12 — Fail loud
"Completed" is wrong if anything was skipped silently.
"Tests pass" is wrong if any were skipped.
Default to surfacing uncertainty, not hiding it.
Save at repo root. Add project-specific rules below. Hard ceiling at 200 lines total: compliance drops past it. Going from 4 rules to 12 moves compliance from 78% to 76% and cuts mistake rate from 11% to 3%.