u/Deep_Structure2023

Running Claude Code Agents In Parallel.

Running Claude Code Agents In Parallel.

I was building a motion analysis feature for an iOS health app that watches how someone walks and flags fall risk in elderly patients. The gait analysis engine at the core of it accepted one format: H.264 video, AAC audio, a specific resolution range. Feed it anything else and it threw this and stopped:

VideoConversionError: codec mismatch — expected H.264/AAC, received HEVC/AC3

Three times in one week. My Claude Code converter was supposed to handle format normalization before the video reached the engine. It didn't, not reliably. One agent was doing everything: inspect the file, decide on parameters, run ffmpeg, validate output, structure the data for iOS. When it failed, the error could have come from any of those steps. By step 3 it was making decisions based on half-remembered output from step 1, not the actual file metadata.

The fix was 4 Claude Code agents running in parallel from a single Orchestrator, wired together with the Superpowers plugin, tool execution routed through Harbor, retry persistence handled by Temporal, and per-agent run visibility from Agno.

ORCHESTRATOR
│  Coordinates. Delegates. Never touches files.
│
├──────────────────┬──────────────────┬──────────────────
│                  │                  │
│  [parallel]      │  [parallel]      │  [parallel]
▼                  ▼                  ▼
          
Inspects file.     Confirms gait      Extracts target
Returns JSON.      engine is up.      spec for ffmpeg.
│                  │                  │
└──────────────────┴──────────────────┘
                   │
                   │  [merge + hand off]
                   ▼
         
         Runs ffmpeg.
         Returns converted path.
                   │
                   ▼
         u/validator
         PASS or FAIL.
         Structures iOS payload.

The Orchestrator fans out to three agents simultaneously: the Format Analyst inspects the video, a pre-flight engine check confirms the gait analysis endpoint is reachable and accepting input, and a metadata prep agent pulls the target spec for ffmpeg. All three run in parallel. The Conversion Engine starts only once all three return. That fan-out cut pipeline latency on a 10-video batch by roughly 40% compared to running the same steps sequentially.

The wiring

The Superpowers plugin (free, 94,000+ GitHub stars, on Anthropic's official marketplace) adds u/agent invocation to Claude Code. Install it once:

claude mcp add superpowers

The Orchestrator fires all three parallel agents in one instruction block:

 analyze /path/to/patient-walking-session.mp4
 confirm gait-engine endpoint is accepting input
 extract target spec for H264/AAC 1280x720

Once all three respond, the Orchestrator passes only what the Conversion Engine needs:

 convert /path/to/input.mp4 to H264/AAC target-resolution 1280x720

The key constraint: pass only what the next agent needs. Early on I forwarded the full metadata JSON from the Format Analyst to the Conversion Engine. It started making decisions on fields it had no business reading. The Orchestrator now extracts the relevant keys from each parallel response and merges just those before passing downstream.

The CLAUDE.md definitions

Each agent gets a narrow CLAUDE.md. The "What You Do Not Do" block is not optional. Without it, agents fill in what they think you want.

Orchestrator (key lines):

You coordinate. You never write code. You never execute ffmpeg.
Fan out to , , and  in parallel.
Wait for all three. Then call u/conversion-engine with merged output.
One retry maximum on FAIL. Log the reason. Return the error to a human.

Format Analyst (full output contract):

Return ONLY this JSON:
{
  "codec_video": "H.264" | "HEVC" | "VP9" | "AV1" | "unknown",
  "codec_audio": "AAC" | "AC3" | "MP3" | "unknown",
  "resolution": { "width": number, "height": number },
  "bitrate_kbps": number | null,
  "duration_seconds": number | null,
  "meets_spec": boolean,
  "mismatch_reason": string | null
}
Run: ffprobe -v quiet -print_format json -show_streams [filepath]
Do not convert. Do not suggest fixes. Return the JSON and stop.

The three problems the diagram doesn't show

Context bloat. The Orchestrator was accumulating tool responses from every parallel agent across a long batch session. By the tenth video it was carrying output from nine it no longer needed, and that dead weight was affecting decisions on the current one.

Harbor sits between the Orchestrator and the tools as an execution layer. Each agent requests a tool call, Harbor scopes what that agent is allowed to run, executes it in a sandbox, and writes the result to a shared workspace. The Orchestrator never holds tool output in its context window — Harbor holds it. Files, artifacts, and traces persist there across the full batch run. The Format Analyst can't accidentally touch the gait engine endpoint, and the engine-check agent can't write to disk, because Harbor grants each one only what its scope allows.

Retry persistence. When the Validator returned FAIL and triggered a retry, the Orchestrator had no durable memory of why the first attempt failed. Temporal stores workflow state between steps, so a failed conversion restarts from exactly where it stopped rather than re-running the full parallel fan-out from scratch.

Run visibility. Per-agent cost across parallel runs was opaque. Agno's built-in runtime surfaces per-agent execution metrics without a separate observability stack, which matters once you're processing batches of 50+ videos and need to know which agent is burning tokens.

The pipeline has been stable for three weeks.

u/Deep_Structure2023 — 2 days ago

One Architecture Change Cut a Claude Code Session from $9.21 to $2.81

My Agent bill spiked because the backend was feeding the agent unoptimized noise, and the agent pays to process it on every single call.

Three failure modes drive almost all of the waste.

1. Documentation dumps

When Claude calls a generic Model Context Protocol server like Supabase's, it doesn't get a surgical answer. It gets the entire schema.

Ask for Google OAuth setup, and the server returns the full authentication manual: magic links, Security Assertion Markup Language, phone auth, single sign-on, all of it. Every tool call drags 5x to 10x more tokens into the context window than the task requires. Across a full deployment session, that single flaw burns hundreds of thousands of tokens.

2. Discovery tax

A human developer opens a dashboard and reads the backend state in one glance. An agent can't do that.

Because standard Model Context Protocol servers don't expose a single topology endpoint, the agent runs fragmented discovery: list_tables, execute_sql, one call at a time. It reconstructs backend state like a puzzle, bleeding tokens at every step.

3. Error loop compounding

When an agent hits a generic 403 or 500 error and the logs don't specify where the rejection happened, it guesses. It rewrites the frontend, redeploys the function, checks logs, and retries. In the benchmark that prompted this post, a 401 Unauthorized error during document upload triggered 8 full retry rounds. The actual failure was upstream at the platform's security gate, nowhere near the code the agent kept rewriting.

Every retry resends the entire conversation history. The context window grows. Each subsequent guess costs more than the last.

The fix: three-layer context architecture

Andrej Karpathy's definition of context engineering applies here: fill the context window with exactly the right information for the next step. Most teams apply that discipline to prompts and ignore it completely for backends.

InsForge, an open-source tool, implements this through three constrained layers:

  • Skills (static knowledge): Atomic, domain-specific instructions loaded at session start. Progressive disclosure keeps the initial load to roughly 100 tokens. Full implementation patterns only enter the context when the agent confirms it's working in that specific domain. insforge-debug loads only on a crash, for example.
  • Command-line interface (direct execution): Instead of running deployments through chat, the agent pipes npx insforge/cli commands through the terminal and receives structured JSON back. Semantic exit codes replace raw error logs. The retry loop stops because the agent gets an exact failure reason, not a wall of output to interpret.
  • Model Context Protocol (live state only): A single get_backend_metadata call returns the full backend topology, tables, auth, storage, models, in one 500-token JSON payload. No discovery queries. No sequential calls.

The numbers

Same prompt. Same task: build a full retrieval-augmented generation application.

  • Standard Supabase Model Context Protocol server: 10.4 million tokens, $9.21, required repeated human intervention to break error loops.
  • InsForge architecture: 3.7 million tokens, $2.81, completed without interruption.

A 2.8x cost reduction with no model change and no change to what you're building. You restructured how the backend exposed information to the agent, and the bill dropped by two thirds.

reddit.com
u/Deep_Structure2023 — 4 days ago
▲ 7 r/AIAgentsInAction+1 crossposts

OpenAI Symphony vs Claude Managed Agents vs CrewAI: The $617K Orchestration Decision

Researchers benchmarked four agent orchestration patterns across 10,000 Securities and Exchange Commission filings and five large language models before Anthropic, OpenAI, or CrewAI shipped their 2026 agent frameworks. The Pareto-optimal answer was hierarchical. Then the products launched.

The four patterns: sequential pipeline, parallel fan-out, hierarchical supervisor-worker, and reflexive self-correcting. On Claude 3.5 Sonnet, reflexive scored highest on accuracy. Hierarchical hit 98.5% of that score at 60.7% of the cost. At 10,000 documents per day, the cost gap compounds to $617,000 a year.

Reflexive also collapses above 50,000 tasks per day. Correction loops start timing out, and the top-performing pattern at low volume becomes the worst at scale. Sequential degrades the least. Hierarchical holds the middle.

Symphony (OpenAI)

Built on Elixir and the BEAM virtual machine. Every agent runs in its own process; a crash in one doesn't touch the others. OpenAI's 6-layer spec maps Linear issues to agent tasks, and internal teams reported 5x more landed pull requests in the first three weeks.

The architecture is parallel fan-out. For isolated coding tasks where agents don't coordinate, Symphony's fault tolerance is hard to beat. For workflows requiring shared state or multi-step delegation, the framework can't express that. Your team also needs to run Elixir, or accept it as a second runtime.

Managed Agents (Anthropic)

Three layers: Brain (reasoning), Hands (tool execution in a managed container), Session (persistent state). That Brain/Hands split maps to the supervisor-worker pattern the benchmark scored as Pareto-optimal.

Pricing: $0.08 per session-hour plus token costs. A 10-minute task on Claude Sonnet runs roughly $0.013 in session fees plus $0.05 to $0.15 in tokens.

One delegation level is the current limit. A supervisor assigns to workers; workers can't spawn sub-workers. Multi-agent coordination is in research preview. Netflix and Rakuten are in production with it, but the architecture hasn't fully delivered the hierarchical pattern at depth yet.

CrewAI

45,900 stars. 12 million daily executions. You compose whatever pattern fits the workflow: sequential for one task type, hierarchical for another, parallel fan-out for a third, same codebase. Version 1.14.3 added checkpoint and fork support for long-running workflows plus e2b sandbox integration for safe code execution.

Model-agnostic routing is where CrewAI saves you money. Shifting 30% of tasks to smaller models cuts cost by 34% with a 2.1% accuracy drop. No other framework in this comparison lets you implement that today. The trade-off: you debug cascading failures across three model providers when something breaks at 3am.

The decision

Parallel coding agents, no coordination needed: Symphony.

Managed infra, hierarchical pattern, one delegation level: Managed Agents.

Multi-model routing, deep agent hierarchies, full architectural control: CrewAI.

A separate study of 70 real-world agent projects identified the failure mode teams keep hitting: capability grows, governance doesn't. More agents get added; guardrails stay the same. None of these three frameworks ships native cost tracking, quality scoring, or operational dashboards. You build that layer yourself regardless of which one you pick.

The teams that stay functional at scale govern their agents as carefully as they architect them. The tooling won't do it for you.

reddit.com
u/Deep_Structure2023 — 6 days ago
▲ 2 r/AIAgentsInAction+1 crossposts

We save Thousands of $ in Token costs at scale with prompt design

We run Agents in our org & have found ways to cut down costs bu caching & other methods. Here are the ones we use:

Prompt caching

  • OpenAI: automatic above 1,024 tokens, static content must lead the prompt
  • Anthropic: requires cache_control, time-to-live extendable to 1 hour at 2x cost
  • vLLM self-hosted: --enable-prefix-caching, tune --block-size and --kv-cache-memory-bytes
  • One broken prefix (reordered tool, timestamp in the wrong spot) busts the cache entirely

Semantic caching

  • Embeds requests, matches on cosine similarity, returns cached answers for near-duplicate questions
  • Works for high-repetition question-and-answer bots, breaks down on multi-turn agents and stale data
  • Libraries: GPTCache, Redis LangCache, Upstash semantic-cache
  • Build it after you see repetition in logs, not before

Lazy-load tools

  • Anthropic measured 55K to 134K tokens in tool definitions before optimization
  • Tool search loads definitions on demand instead of upfront; only worth it at 10+ tools
  • "defer_loading": True on individual tools, pair with tool_search_tool_bm25_20251119
  • Claude Code uses the same pattern for memory: a 200-line index file, details loaded on demand

Route by difficulty

  • Predictive routing: classify the request first, send cheap tasks to smaller models; RouteLLM uses Chatbot Arena preference data with a small classifier head
  • Cascading: let the cheap model try first, check output confidence via log probabilities, escalate only on low confidence; CascadeFlow claims 69% savings but tested on verifiable ground truth only
  • Subagents: ~11% savings, more useful for context isolation than cost cutting

Context hygiene

  • Tool output, file dumps, and failed retries are the main bloat sources
  • Route raw output to an archive, keep only working state in active context
  • Jia et al. found 6x compression yields 51.8–71.3% token reduction with a 5–9% improvement in issue resolution on SWE-bench Verified
  • Cleaning 30–50% of a 10K context across 100K runs saves roughly $1,500; at 40K context, ~$6,000
reddit.com
u/Deep_Structure2023 — 7 days ago

Claude Hooks 101. Full guide

Hooks are not prompts. They run inside Claude Code's execution flow and fire whether or not the model remembers the rule. The check happens in code, not context.

Configuration structure (three layers)

  • Top level: event name (PreToolUse, PostToolUse, Stop, etc.)
  • Middle level: matcher array, one entry per matching rule, each with a matcher string and a hooks array
  • Bottom level: the actual hook object, with type (command, http, mcp_tool, prompt, agent) and the execution details

The word hooks appears at both the top and the middle level and means different things each time. The middle hooks array holds the actual work.

28 events, two categories

  • Main-flow events (SessionStart, PreToolUse, PermissionRequest, PostToolUse, Stop) sit on the critical path and can block execution
  • Side-path events (Notification, ConfigChange) run alongside the main flow and can't intercept it

Events have no parent-child relationships. PreToolUse and PermissionRequest can fire back-to-back on the same tool call, but neither triggers the other.

Blocking vs. non-blocking

  • Blocking hooks pause the main flow until the hook result returns
  • exit 2 signals a system error. The model may try to route around it.
  • exit 0 with JSON output ("decision": "deny") signals a policy rejection. The model treats it as a rule and stops.
  • exit 1 does nothing. Claude Code ignores it and continues.
  • Non-blocking hooks (like Notification) run but can't intercept anything

Four places to register hooks

Layer Scope Lifecycle
User settings (~/.claude/settings.json) Your machine, all projects Full session
Project settings (.claude/settings.json) Repo, all team members Full session
Local project settings (.claude/settings.local.json) Repo, your machine only Full session
Plugin hooks Plugin's active scope While plugin is loaded
Skill / Subagent frontmatter That skill or subagent's run Registered on start, cleaned up on finish

Skill and Subagent hooks don't persist after their execution cycle ends. They can't pollute the global environment. Plugin Subagents specifically cannot register hooks at all, by design.

One conversion to know: a Stop hook in Subagent frontmatter becomes SubagentStop at runtime, because it's the subagent ending, not the session.

Merge and decision rules

When multiple hooks match the same event at the same time, Claude Code runs them in parallel. Deduplication is automatic: identical command strings or identical URLs get collapsed to a single execution.

The decision merge rule: deny beats ask beats allow. One deny from any layer blocks the operation regardless of what every other hook returned. Layer origin doesn't factor in at all.

User-level hook    -> allow
Project-level hook -> ask
Plugin hook        -> deny
Result             -> deny

Two real examples

Superpowers plugin registers exactly one hook: SessionStart. It injects the skill instructions as additionalContext so the model starts every session with the right framing already loaded. No workflow control, just context delivery at the right moment.

claude-code-warp registers six hooks (SessionStart, Stop, Notification, PermissionRequest, UserPromptSubmit, PostToolUse) and uses each one to push Claude Code's internal state to the Warp terminal. The PermissionRequest hook extracts the tool name and input preview so Warp can show what Claude Code is asking to do. The Stop hook reads the session, pulls the last user prompt and Claude's response, and fires a completion notification. Hooks as an event bridge between Claude Code's internals and an external system.

One thing the source doesn't say

The exit code behavior (exit 1 being silently ignored while exit 2 blocks) is the kind of detail that burns you once and then you never forget it. Most shell scripts exit 1 on failure by default. If your blocking hook uses a standard error pattern, it will silently fail to block anything.

reddit.com
u/Deep_Structure2023 — 7 days ago
▲ 83 r/AIAgentsInAction+1 crossposts

I run Claude Code inside my Obsidian vault. Full Architecture.

I had an agent-maintained knowledge bases collapse into one zone. The agent reads notes, writes new ones, and later you can't tell which files you curated and which the agent hallucinated into existence. The fix is physical zone separation enforced by a CLAUDE.md at the vault root.

Here's the architecture I run in Obsidian with Claude Code.

Three zones, hard rules

raw/ is read-only for the agent. Web clips, papers, books, daily notes. The agent reads these to synthesize from, never edits them. Immutability is the point: raw/ is your ground truth.

wiki/ is agent-owned. Concept pages, entity pages, cross-document syntheses, the global index. You rarely touch this by hand. If you want something changed, you tell the agent to regenerate it, because the agent knows all the backlinks a manual edit would silently break.

dev/ is collaborative. Architecture Decision Records, incident debriefs, project notes, snippets. You draft, the agent suggests wikilinks, finds related decisions, proposes rephrasings. The agent never edits an existing Architecture Decision Record without explicit confirmation.

Connecting Claude Code to the vault

Three viable paths exist. Path 1 is Claude Code reading and writing .md files directly from a terminal opened in the vault folder. No plugins needed, Obsidian doesn't even need to be open. This is the starting point.

Path 2 adds the Local REST API plugin inside Obsidian, which exposes a local endpoint at 127.0.0.1:27124. MCP servers connect to it and give the agent access to Dataview queries and Obsidian palette commands. The cost: Obsidian must stay open, the certificate is self-signed, and version 3.6.x of the plugin has a confirmed data-loss bug where POST requests silently overwrite files on metadata cache misses.

Migrate to Path 2 when you actually need graph access or plugin commands. Don't start there.

The skills that make this work

Steph Ango (Obsidian's chief executive officer) published an official skills repo at kepano/obsidian-skills. Without it, Claude writes standard Markdown links instead of wikilinks, and your graph view stays empty. Five skills ship in the repo:

  • obsidian-markdown: the foundation. Wikilinks, callouts, frontmatter, embeds.
  • obsidian-bases: teaches Claude to create .base database files with filters and views.
  • json-canvas: correct schema for Obsidian's infinite whiteboard format.
  • obsidian-cli: lets the agent interact with Obsidian's CLI for vault commands and daily note automation.
  • defuddle: strips ads, navigation, and banners from URLs before ingestion. Cuts token usage on polluted blog pages.

Install with:

cd ~/vault/.claude
git clone --depth 1 https://github.com/kepano/obsidian-skills.git
mv obsidian-skills/* skills/
rm -rf obsidian-skills

The CLAUDE.md

This file sits at the vault root. The agent reads it every session. It defines zone rules, wikilink conventions, frontmatter schema, and the ingestion workflow. The most important rules in mine:

  • Never edit raw/. Never rename or move files there.
  • Every wiki/ page needs frontmatter with title, type, tags, sources.
  • Every wiki/ page needs at least one wikilink to another page.
  • Never delete files without explicit confirmation.
  • If an operation touches more than five files, show the plan before executing.

That last rule is the human gate that separates a useful agent from one that runs you over.

Custom skills for the dev/ side

Steph Ango's skills cover Obsidian syntax. Your work patterns need their own skills. I have two: adr-writing and debrief-writing.

The Architecture Decision Record skill defines the MADR numbering format (ADR-NNNN-slug.md), required frontmatter including status and decision-date, and the section structure. It also enforces the immutability rule: an accepted Architecture Decision Record can only have its status changed, nothing else. The agent checks all existing records before creating a new one to find the next number and catch duplicates.

The debrief skill enforces blameless post-mortems. It flags the "Generalizable learning" section as the most important section in the file and enforces "what pattern applies to other systems" as the question that section must answer.

The ingestion flow

The /wiki-ingest command follows this sequence: fetch and clean the source with defuddle, save to raw/clippings/YYYY-MM-DD-slug.md, identify three to seven concepts and one to three entities, check which already exist in wiki/, then present the full plan before writing a single file. The plan lists every file it would create or update, every wikilink it would add. You approve before anything happens.

After 100 ingestions, the graph density is where this becomes actually useful. A new article connects to five existing concepts automatically. Searching "pgvector" surfaces the original paper you clipped, the Architecture Decision Record where you chose it, and the weekly synthesis where you first flagged it as a candidate.

Daily notes and weekly synthesis

Daily notes live in raw/daily/. The agent never writes to them. Every Friday I run a prompt asking it to read the week's daily notes and report recurring themes, pending decisions, and ideas worth turning into wiki concepts. It doesn't create anything, just reports. I approve what's worth acting on and run the relevant commands separately.

This keeps daily notes as your stream of consciousness while still feeding the wiki.

Security basics

The agent has read/write access to your vault. Three mitigations matter:

Git. Commit after every session. git diff HEAD~1 wiki/ shows exactly what changed. git checkout HEAD~1 -- wiki/concepts/X.md reverts a specific file. This is the cheapest safety net available.

allowed-tools in slash commands. Each command declares the minimum tool set it needs. A command without Bash(rm:*) in its allowlist can't delete files even if something in the content tells it to.

The "present plan before executing" gate. Prompt injection via malicious URLs is real. If a clipped article contains instructions, those instructions show up in the plan step before any file gets touched. You catch it there.

Costs

A vault of 200 to 500 notes with daily ingestion runs roughly $20 to $50 per month in tokens. Keep /wiki-query as the default for questions: it uses grep to narrow candidates before reading, instead of pulling the whole vault into context.

u/Deep_Structure2023 — 8 days ago

/Goal: Full Codex Setup Guide

AI agent setups stall at the same point: you write a prompt, the model does a step, then waits for you to say continue. You're the bottleneck.

/goal removes you from that loop. You give the agent a target, it runs until the target is reached, and returns a result. No approval prompts in between, no nudging it forward.

The syntax is simple. Inside Claude Code or Codex CLI:

/goal [your task/goal]

For Codex desktop, go to Settings > Configuration and set goals = true. Then launch with full-auto mode if you want it to run without stopping:

codex --approval-mode full-auto

Claude Code has its own setup docs at https://code.claude.com/docs/en/goal. Hermes supports it out of the box.

The syntax is easy. The prompt is the hard part.

A weak /goal prompt gets you a weak result. A good one has three parts: the task, a measurable end state, and the constraints. The pattern looks like this:

/goal [do the work] until [measurable end state] without [constraints that must hold]

Concrete example from the source:

/goal fix every failing test until npm test exits 0 without modifying any file outside the /auth directory.

For bigger projects, push more context into the prompt. Define success criteria, list what's off-limits, and give the agent a .md file it can use to track progress. The model can also write its own /goal prompt if you ask it to, and it usually writes a better one than you will.

A few things worth knowing before you run it:

Only one /goal can be active at a time. Use /pause to hold it, /goal clear to reset. In Claude Code, the active goal shows token usage and a progress bar. Pair it with /plan before setting the goal if the task is complex.

/goal is worth saving for longer work. A quick one-off doesn't need a loop. But for anything that would normally take ten back-and-forth prompts, it saves real time.

u/Deep_Structure2023 — 8 days ago

The Full Claude Ecosystem: 1,200+ MCP Servers, 400+ Plugins, 25+ Agent Frameworks

Don't run Claude in a loop: prompt in, answer out. Here's a full Claude ecosystem published it as six reference files on GitHub, verified April 2026. Commands, Model Context Protocol servers, plugins, tools, workflows, agent frameworks.

Commands worth knowing

/remote-control — Control your local Claude Code session from your phone via claude.ai
/fork — Branch your conversation without touching main context
/usage-report — Full HTML analytics: sessions, token cost by project, most-used commands
/checkpoint — Save conversation state before a major change
/memory-dump — Export everything Claude knows about your project to a file
/diff-review — Claude reviews the full git diff and annotates every change
/security-scan — Runs a vulnerability check on current codebase

Community-discovered activation phrases, not in official docs, consistent across sessions:

MEGAPROMPT      → Claude expands your rough idea into a full spec before executing
BEASTMODE       → Full effort, no shortcuts, maximum output
ULTRATHINK      → Extended reasoning before any response
STEELMAN        → Claude argues the strongest version of your idea first
CRITIC MODE     → Claude finds every flaw before proceeding
FIRSTPRINCIPLES → Breaks the problem to fundamentals before solving

Install Memory MCP first

Every session starts from zero without it.

{
  "mcpServers": {
    "memory": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-memory"]
    }
  }
}

Session startup: "Load project context for [name]. Retrieve: architecture decisions, coding standards, current sprint tasks."

Other high-impact Model Context Protocol servers: Filesystem, GitHub, PostgreSQL, Brave Search, Puppeteer (controls a real Chrome instance), Fetch. The repo also documents 10 memory systems including Memora, which runs fully local with no cloud dependency.

Three plugins

claude skills add juliusbrussee/caveman
/plugin install superpowers@superpowers-marketplace
/plugin install context7@claude-plugins-official

Caveman (27,900+ stars) cuts output tokens 65-75% with no accuracy loss. Superpowers (121,000+ stars) forces plan-before-build, test-before-ship. Context7 (53,864+ stars) pulls live version-specific docs before generation, eliminating hallucinated APIs.

Tool decisions

Retrieval-augmented generation app?  → LlamaIndex
Everything else?                     → LangChain
Production memory?                   → Qdrant
Local dev?                           → Chroma (pip install, zero setup)
Full backend?                        → Supabase
Local embeddings, no API cost?       → Ollama


curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2

Ollama for embeddings and simple tasks, Claude for reasoning that needs it. API costs drop, nothing leaves your machine.

Builder-Validator

Two Claude calls, no framework, opposing objectives.

builder = claude.complete(
    system="Senior developer. Write the best implementation.",
    prompt=task
)
validator = claude.complete(
    system="Security auditor. Find every bug, edge case, vulnerability.",
    prompt=f"Review:\n{builder.output}"
)
# Loop until validator approves

The two roles have structurally incompatible incentives. That tension does the work a single-pass prompt can't. Production numbers: Fountain cut delivery time 50%. Rakuten dropped feature cycles from 24 days to 5. Ramp cut incident investigation time 80%.

Agent framework benchmarks

  • LangGraph 87% task success. Used at Klarna, Replit, Uber, LinkedIn.
  • CrewAI 82% task success. Fastest to a working demo. 44,600+ stars, 60 million executions per month.
  • AutoGen top GAIA benchmark score across all difficulty levels.
  • Claude Agent SDK Claude-only stacks, no framework overhead.

​

Fastest to demo?             → CrewAI
Complex production workflow? → LangGraph
Research / code-executing?   → AutoGen
Claude-only stack?           → Claude Agent SDK
Data-heavy retrieval?        → LlamaIndex Agents

95% of agentic tasks don't need multi-agent systems. A well-prompted single Claude instance with 3 tools outperforms a complex 5-agent setup. Build simple first. Full repo

reddit.com
u/Deep_Structure2023 — 10 days ago

Claude Code Doesn't Know Your Project. This official Plugin Fixes That.

Most Claude Code frustration comes from the same root cause: Claude sees your files but has no context about how your project actually works. It doesn't know your class structure, your validation conventions, your protected files. So it guesses. The guesses are plausible and wrong.

The claude-code-setup plugin, maintained by Anthropic, fixes this by analyzing your codebase before recommending anything.

Install it inside Claude Code:

/plugin install claude-code-setup@claude-plugins-official

Then ask:

> recommend automations for this project

It scans your directory, reads your pyproject.toml, identifies your stack, and outputs a structured set of recommendations across five categories. Nothing auto-applies. You opt in one piece at a time.

Model Context Protocol servers

The first category is Model Context Protocol servers. These give Claude the ability to act on your stack, not describe it.

{
  "mcpServers": {
    "python-repl": {
      "command": "uvx",
      "args": ["mcp-server-python", "--project", "."],
      "description": "Execute Python code in your project's virtualenv"
    },
    "filesystem": {
      "command": "uvx",
      "args": ["@modelcontextprotocol/server-filesystem", "/resume-parser"],
      "description": "Safe, scoped file operations"
    },
    "chromadb": {
      "command": "uvx",
      "args": ["mcp-server-chroma", "--path", "./data/vectors"],
      "description": "Query resume embeddings for semantic search"
    }
  }
}

Without Model Context Protocol, Claude describes how to parse a resume, query ChromaDB, and return a match score. With it, Claude does those three things in one turn. The difference shows up immediately.

Skills

Skills are markdown files that encode your conventions. You write them once, and Claude follows them every time it touches related files.

## Parsing Resumes in This Project
When extracting data from resumes:
1. Always use `src/parser/extractor.py::ResumeExtractor` as the entry point
2. Normalize dates with `dateutil.parser` + our `src/utils/dates.py` helpers
3. Validate output against `data/schemas/resume_v2.json` using Pydantic
4. Log parsing confidence scores to `logger.debug()` with context: `{"resume_id": ...}`
5. Never hardcode field mappings—use `src/config/field_aliases.py`

## ML Integration Rules
- New features must go through `src/ml/feature_engineering.py`
- Embeddings must use our `text-embedding-3-small` wrapper in `src/ml/embeddings.py`
- Always cache vector results in `data/cache/embeddings/` to avoid re-computation

Ask Claude to add GitHub profile extraction and it will edit extractor.py using your base class, update the Pydantic schema, add the field alias, and write the test. No reminding required.

Subagents

Subagents are purpose-built agents with a narrow scope. Instead of asking general Claude to validate your parsed resume output, you spin up a validator that only does that.

# .claude/agents/resume-validator.yaml
name: resume-validator
description: >
  Specialized agent for validating resume parsing output.
  Checks schema compliance, data quality, and edge cases
  like missing fields, inconsistent date formats, or
  suspicious skill inflation.
skills:
  - skills/pydantic-validation.md
  - skills/data-quality-checks.md
  - skills/resume-fraud-patterns.md
trigger:
  - files_matching: ["src/parser/**", "tests/**/test_extractor*"]
  - on_command: "/validate-parse"

Run /validate-parse src/parser/extractor.py and it checks Pydantic config, error handling for malformed PDFs, and test coverage for edge cases. The narrower the scope, the more reliable the output.

Slash commands

Slash commands wrap multi-step workflows into a single call.

<!-- .claude/commands/benchmark-parser.md -->
Run end-to-end parsing benchmark:
1. Load 10 sample resumes from `data/samples/benchmark/`
2. Parse each with `ResumeExtractor` + timing instrumentation
3. Calculate: avg latency, memory peak, field completeness %
4. Compare against baseline in `data/baselines/v1.2.json`
5. Generate markdown report in `reports/benchmark-$(date).md`
6. If regression >5%, alert via `src/monitoring/alerts.py`

Usage: /benchmark-parser --samples=20 --compare=v1.2

Output:

/benchmark-parser
  Loaded 20 samples (PDF:12, DOCX:5, TXT:3)
  Avg parse time: 1.24s (±0.3s) — ✅ within baseline
  Field completeness: 98.7% (↑1.2% vs v1.2)
  Regression detected: memory peak +7.1% in PDF parsing
  Suggestion: Profile `pypdf` image extraction in extractor.py:142
  Report saved: reports/benchmark-20260507.md

The plugin ecosystem extends this further. Browse Python-focused plugins with /plugin discover --tag=python. Community plugins bundle Model Context Protocol servers, skills, hooks, and agents together so you're not assembling compatible pieces by hand.

One thing worth knowing: claude-code-setup explains why each recommendation applies to your project. It doesn't apply anything without your confirmation. For a codebase with a live authentication layer or raw uploaded files, that matters.

reddit.com
u/Deep_Structure2023 — 10 days ago
▲ 393 r/AIAgentsInAction+1 crossposts

Full Gstack OverView

Garry Tan open sourced GStack in early 2026. He shipped 3 production services and 40+ features. Here's what GStack actually is:

What it does

  • Splits the development workflow into named operational roles: chief executive officer, staff engineer, quality assurance lead, security officer, designer, release engineer, developer experience reviewer, site reliability engineer, technical writer
  • Each role has its own context, rules, and responsibilities baked in, not vague prompts
  • Covers the full cycle: plan, build, review, test, ship, reflect

The commands that matter

  • /office-hours runs before any implementation. The system interrogates the idea, surfaces assumptions, challenges scope, pushes back on framing. Closer to a Y Combinator partner conversation than a code generator
  • /qa spins up a real browser via Playwright, clicks through flows, finds broken states, generates regression tests
  • /review, /cso, /benchmark, /ship add layered verification before anything gets out the door

Why this beats prompt-only workflows

  • Most large language model-generated code fails because there's no coordination layer catching bad architecture, missing edge cases, or undocumented decisions
  • GStack encodes those checkpoints into the process, so they happen automatically
  • A structured workflow beats a clever prompt every time

The browser layer

  • Agents get persistent browser state: authenticated sessions, multi-tab operations, real navigation
  • Most agent tooling is blind to browser context. GStack isn't

What it supports

  • Claude Code, Codex CLI, Cursor, Gemini, OpenClaw, multiple browser agents, persistent memory

The actual shift

  • Andrej Karpathy said in March 2026 he hadn't typed a line of code since December. The bottleneck moved from writing code to coordinating systems
  • GStack is one of the first open-source frameworks built around that reality

MIT licensed. github repo

u/Deep_Structure2023 — 11 days ago

10 Claude Code plugins worth installing if you build iOS apps.

10 Claude Code plugins worth installing if you build iOS apps.

  • Caveman (JuliusBrussee/caveman) cuts Claude's output tokens by 65-75% by stripping filler responses. Keeps technical precision, kills pleasantries. Useful when long stack traces and large Swift files eat context fast. Bonus: caveman-compress rewrites your CLAUDE.md to ~46% fewer tokens.
  • Superpowers (obra/superpowers) imposes engineering discipline before Claude writes anything. Forces clarifying questions first, breaks work into 2-5 minute tasks with exact file paths, enforces test-driven development red/green/refactor, runs verification commands before marking work done.
  • SuperClaude Framework (SuperClaude-Org/SuperClaude_Framework) — adds 30 slash commands and 20 specialized personas on top of Claude Code. The ones that matter for iOS: architect (module boundaries), security engineer (keychain, certificate pinning), performance (memory leaks, main-thread violations). Built-in 70% token reduction for large codebases.
  • TDD Guard (nizos/tdd-guard) hooks into Claude Code's file operations and blocks implementation code without a failing test first. Supports XCTest. Stops the common pattern where Claude writes tests that confirm its own implementation rather than testing the contract. 2,000+ GitHub stars.
  • Safety Net (kenryu42/claude-code-safety-net) intercepts destructive git commands before execution. Blocks git reset --hard, force pushes to main/master, git branch -D, and rm -rf on project directories. Redirects Claude toward safer alternatives instead.
  • Cartographer (kingbootoshi/cartographer) deploys parallel subagents to map your codebase and output an architecture.md: module dependency graph, data flow, layer separation analysis. Feed the output into your CLAUDE.md so every session starts with full architectural context.
  • Karpathy Guidelines (forrestchang/andrej-karpathy-skills) encodes Andrej Karpathy's published observations on large language model coding behavior as enforced rules. No unrequested code, no premature abstractions, no protocol extensions added "just in case." Prefer reading before modifying. Delete dead code rather than commenting it out.
  • Context Engineering Kit (hesreallyhim/awesome-claude-code) patterns for working with codebases that exceed a single context window. Hierarchical loading (architecture first, then specific modules), context handoff templates for multi-session work, minimal-footprint CLAUDE.md structure. Pair with Caveman: Caveman compresses outputs, this compresses inputs.
  • Trail of Bits Security Skills (trailofbits/) the same security auditing methodology Trail of Bits uses on paid client engagements, published as Claude Code skills. Covers keychain misuse, certificate pinning gaps, hardcoded credentials, insecure UserDefaults usage, URL handling injection, and incorrect NSPrivacyAccessedAPITypes. Run before App Store submission.
  • Claude Code Workflows (OneRedOak/claude-code-workflows) structured templates for code review, security assessment, and pre-pull request checklists. Customizable for iOS-specific checks: no synchronous main thread operations, closure capture lists, forced unwraps, accessibility identifiers on new interactive elements.

Install order if starting from zero: Caveman + Superpowers first. Safety Net before you need it. TDD Guard if test coverage matters. Everything else as the project grows.

reddit.com
u/Deep_Structure2023 — 11 days ago

The Claude skill checklist: 7 to keep, 4 to cut

A skill is one SKILL.md file that teaches Claude how to handle a specific task. The body of the file lives inside Claude's context window every time the skill triggers, which means every line is paying rent.

Most skills I've seen fail for the same reasons. Here's what separates the ones that hold up.

Keep the scope narrow

One skill, one task. If you're building an accessibility audit skill, it audits accessibility. It doesn't also cover visual hierarchy, copy quality, and usability heuristics. Bundle those and Claude will compromise on all of them.

The discipline shows up in the YAML frontmatter. The description and when_to_use triggers should match the words you actually type when you ask Claude to do this work. Watch your own prompts for a week, mine the exact phrasing, paste it in.

---
name: accessibility-auditor
description: Analyze core user flows and identify accessibility issues
when_to_use:
  - "onboarding flow accessibility audit"
  - "product purchase flow accessibility audit"
---

Specify the role, but specifically

"World-class visionary designer" tells Claude nothing. The role description should map cleanly onto the task in the skill.

You are a senior product designer specializing in mobile UX.
Focus on clarity, usability, and accessibility.

That's a working role. It points at the same target as the description and the audit rules underneath it.

Be explicit about the work

"Validate things properly" leaves Claude guessing. Spell out the standard you want it measured against.

Audit process:
- Color contrast for functional elements and text audit using WCAG
- Text legibility and readability audit using WCAG

For anything past a trivial task, add decision rules. These are the heuristics Claude uses when the task gets ambiguous, which it will.

## Rules
- Prioritize usability over aesthetics
- Flag assumptions explicitly
- If data is missing, state it

Constrain the output

Skills without a defined output shape produce different answers every run. Pin it down.

## Output format
- Executive summary (max 5 bullets)
- Issues (severity: high/medium/low)
- Recommendations (actionable)

This one change fixes more skill quality complaints than any other.

Handle the unhappy paths

Same way you'd design a feature, think through what happens when input is incomplete or ambiguous.

## Edge cases
- If input is incomplete → ask clarifying questions
- If multiple interpretations → list them

Use the file system, not the file body

Aim for 100 to 250 lines in SKILL.md. Past that, Claude's performance starts to drift because the context bloats.

Push the rest into subfolders Claude loads only when needed:

  • scripts/ for executable code Claude can run, useful for things like generating a report after the audit completes
  • references/ for examples, edge case libraries, longer documentation
  • assets/ for fonts, icons, design tokens referenced by import

Two or three good examples in references/ move the needle more than ten rules in the body.

Validate before you ship

Two tools worth running on every new skill:

  • skills-ref checks the SKILL.md syntax
  • skill-creator is the meta-skill that reviews your skill, prompt it with Review this skill and suggest improvements

Skip these and you'll find out something is broken the first time you actually need the skill to work.

The traps to avoid

A skill is not a place to dump every prompt you've ever written for this task. Stacked prompts with no structure produce conflicting guidance, and Claude resolves the conflict by ignoring half of it.

Don't assume Claude knows your codebase, your design system, or your domain conventions. Either put the context in CLAUDE.md or state it in the skill. Otherwise Claude either asks you mid-task or makes something up.

Don't write "be concise" in one section and "provide detailed analysis" in another. Pick one, or define when each applies.

Two technical constraints worth knowing: YAML in the frontmatter is parsed safely, no code execution. XML angle brackets are blocked in the skill body. Both are security guardrails, work around them.

Skills are infrastructure. The ones I keep using read like a tight contract between me and Claude. The ones I delete read like notes I forgot to edit.

reddit.com
u/Deep_Structure2023 — 13 days ago
▲ 586 r/AskVibecoders+2 crossposts

20 Claude Code commands worth using.

Here are 20 commands worth knowing, grouped by what they actually solve.

Stopping, undoing, branching

1. Esc stops the current task. Conversation history stays intact, only the in-flight action dies.

2. Double-tap Esc or /rewind opens a menu:

  1. Restore code and conversation
  2. Restore conversation only
  3. Restore code only
  4. Summarize from here
  5. Cancel

3. /btw lets you ask a side question without polluting the main thread.

/btw where is the test file again

It reuses the existing prompt cache, so token cost is near zero.

4. /branch forks the conversation. Run two approaches in parallel, keep the one that works.

Managing the context window

5. /compact rewrites long history into a summary that keeps the storyline, the technical decisions, and the errors plus fixes. Context window stops bloating.

6. /clear wipes everything for a fresh topic.

7. /export saves the conversation as Markdown:

~/projects/XXX/claude-session-YYYY-MM-DD-HH-MM.md

Useful when you've spent an hour designing an architecture and don't want it to vanish.

8. /resume searches old sessions by keyword.

9. claude -c picks up yesterday's chat where you left it.

10. claude -r lists every past session and lets you jump back into a specific one.

11. /remote-control (alias /rc) hands the running session over to your phone. The work keeps executing on your machine, you just steer from somewhere else.

Working smarter

12. /model opusplan runs Opus for planning and Sonnet for execution. Slower thinking on the design, faster output on the code.

13. /simplify spins up three reviewers in parallel:

  • Architecture and code reuse
  • Code quality
  • Efficiency

You get one combined report.

14. /insights generates a local HTML report at ~/.claude/usage-data/report.html. It shows usage habits, common mistakes, features you've never touched, and concrete suggestions for your CLAUDE.md.

15. /loop schedules recurring or one-shot tasks inside the session:

/loop 15m check the deploy
/loop in 20m remind me to push this branch

Recurring loops auto-expire after 3 to 7 days so a forgotten schedule doesn't burn through your API budget.

You can override the default behavior by dropping a .claude/loop.md in your project. A bare /loop will then run whatever instructions you put inside.

Keyboard shortcuts

16. Ctrl+V pastes screenshots directly. No saving to disk first.

17. Ctrl+J (or Option+Enter on Mac) inserts a newline without sending. Multi-line prompts without accidents.

18. Ctrl+R searches your prompt history. Your own personal prompt library, already indexed.

19. Ctrl+U clears the entire input line in one keystroke.

20. /skills [name] loads project-specific skills. Run /skills with no argument to see what's available in the current workspace.

reddit.com
u/Deep_Structure2023 — 12 days ago
▲ 37 r/AskVibecoders+1 crossposts

My Learnings After Using Claude Code Everyday now.

Most of my early mistakes came from asking for too much in one go. Vague goals, bloated context, ambitious prompts. Output got worse the more I gave it. The fix was always to narrow the ask.

Here are my learnings from Claude Code:

1. Treat it like a mid-level engineer. Claude Code performs well on scoped work and falls over on ambiguous work. If a human engineer would ask three follow-up questions before starting, your prompt needs to answer those three questions.

2. Make CLAUDE.md do real work. Most projects I see have an empty CLAUDE.md or one with two stale lines in it. Put your conventions, your do/don't patterns, and your file references in there. A working example:

## Design System Rules
- Use spacing tokens from theme.ts
- Do NOT hardcode colors
- Use Button from /components/ui/button

## Code Standards
- TypeScript only
- Functional components
- No inline styles

Keep the main file lean. Reference subfiles for detail (For call to action buttons rules check u/components/Button.md) instead of inlining everything.

3. Plan before you execute. Plan Mode exists for a reason. For any feature, refactor, or multi-step change, I run the loop: ask for a plan, push back on it, approve, execute. Skip this and you get a thousand lines of code you throw away. Ultraplan is the heavier version for work that spans many files.

4. Treat the context window as a budget. Performance comes down to a ratio of signal to total context size. Dumping a whole repo tanks the ratio. Two habits keep mine clean:

  • /compact mid-session to summarise progress
  • /clear when switching to an unrelated task

A bloated context produces bloated answers.

5. Atomic tasks beat one fat prompt. "Build the full auth system" gives you a mess. Splitting it into login UI, validation, API wiring, and error states gives you four reviewable diffs and a working system. Mixing unrelated jobs in one prompt ("fix these two bugs, improve the UI, optimize performance") gives you partial fixes on all three and a diff you can't read.

6. Skills for anything you repeat. A skill is a small reusable playbook with a single job and a clear input/output. Mine include design-system-audit, component-generator, and accessibility-check. The rule that keeps them useful: one skill, one job. A skill that tries to do two things needs to be split.

7. Agents for anything that should run without you. Skills wait to be called. Agents fire on a condition and return a result. A design-system auditor that runs on every token change and reports drift will save you a week of catch-up work in a month.

8. Validate everything Claude produces. The division of labour is Claude generates, you check. Engineers have wiped production data trusting an agent end to end. You can't remove the risk, you can shrink it: plan the logic before implementation, ask for error handling and edge cases up front, and run an audit pass on anything that touched a real system.

The thread under all eight is scoping. Scope the ask, scope the context, scope the output, then check the work. The prompt cleverness people chase matters less than the discipline of asking for one thing at a time.

reddit.com
u/Deep_Structure2023 — 14 days ago

I built the setup with Claude Code that remembers every architecture decision I've made, runs agents in parallel, and enforces my conventions without being asked. The whole thing runs on the $20/month Pro plan plus open source pieces.

My CLAUDE.md File

It lives in your project root. Read at the start of every session. Don't skip it. most useful.

# CLAUDE.md



## project

- stack: next.js 14, typescript, tailwind, postgres via prisma

- deployed on vercel, staging branch auto-deploys

- monorepo: /apps/web, /apps/api, /packages/shared



## conventions

- all components in PascalCase

- API routes return { data, error } format

- no default exports except pages

- tests live next to source files, named *.test.ts

- commits follow conventional commits (feat:, fix:, chore:)



## architecture decisions

- chose prisma over drizzle (dec 2024): type safety priority

- chose zustand over redux (jan 2025): less boilerplate

- auth via clerk, not next-auth: better DX for our team size



## current focus

- migrating payment system from stripe checkout to stripe elements

- performance audit on /dashboard (target: LCP < 2s)



## rules

- never mass edit more than 3 files without showing me the plan first

- always run existing tests before writing new ones

- if a task takes more than 5 steps, create a plan document first

Conventions kill nitpicks. Decisions stop Claude from re-litigating choices. Rules encode the things you keep correcting in chat.

CLAUDE.md is static though. For memory that grows, you need the next layer.

Memory that survives sessions

Three pieces working together.

Obsidian as the knowledge base. a structured wiki Claude reads from and writes to:

/vault

  /decisions      — every architecture decision with context

  /errors         — bugs we hit and how we fixed them

  /patterns       — code patterns that work in our codebase

  /sessions       — summaries of what happened each day

  /stack          — documentation for every tool we use

  Memory.md       — who I am, what I'm building, my preferences

  index.md        — master index of everything in the vault

The structure comes from Andrej Karpathy's large language model wiki concept. Knowledge compounds instead of being rediscovered every session. https://github.com/karpathy/llm-wiki

claude-mem compresses each session into a persistent store that carries into the next.

claude-subconscious runs a background agent that watches sessions and writes memory passively, no prompting.

Claude already knows that Friday I was debugging a race condition in the payment webhook, switched from polling to websockets, and the tests still need updating.

Skills turn the generalist into a specialist

Markdown files that teach Claude how to perform specific tasks the way you want them done.

Start with Superpowers from the Anthropic plugin marketplace:

/plugin install superpowers@claude-plugins-official

It forces a real workflow: brainstorm, spec, plan, test-driven development, implement, review. Claude writes a spec for your approval before any code gets touched.

Then stack:

  • Trail of Bits security skills audit workflows from real security engineers, every pull request scanned before I open it.

  • Anthropic's official skills PDF, DOCX, XLSX, data analysis. Reference implementation.

  • tdd-guard blocks commits that skip tests. The block message explains what's missing.

Skills don't conflict. Each sharpens one thing.

Subagents split the work

One session does tasks sequentially, and the context pollutes by task four. Subagents give each role its own context window and CLAUDE.md:

  • architect design, specs, plans. No code.

  • coder writes code from the plan. Full tool access.

  • reviewer security-first read on every pull request, flags issues, checks coverage.

  • tester writes and runs tests, pairs with tdd-guard.

  • ops deploy, continuous integration and continuous deployment, infra.

Tool permissions stay separated by role. The coder never sees deploy configs.

Hooks and slash commands

Any instruction I typed three times became a command:

  • /fix-issue 456 reads the GitHub issue, branches, writes the fix with tests, opens a pull request.

  • /review runs the reviewer agent on the current pull request.

  • /deploy staging full deploy pipeline through the ops agent.

Full collection of 57 production commands:

Hooks fire automatically:

  • Pre-commit tdd-guard verifies tests exist and pass.

  • Session-start loads memory from Obsidian, reads recent session logs.

  • Pre-push security review before code hits the remote.

Rules stop being something I remind Claude about and start being something the system enforces.

Orchestration

claude-squad runs multiple agents in parallel, each in its own git worktree so branches don't collide:

brew install claude-squad

cs

Close the terminal, agents keep working. https://github.com/smtg-ai/claude-squad

My nightly run, three sessions:

agent 1: "fix all open issues labeled 'bug' in the repo"

agent 2: "write missing tests for /apps/api/src/services/"

agent 3: "refactor the dashboard components to use the new design tokens"

Auto-accept (cs -y) for trusted work, plan mode for anything risky. Laptop closed. Three pull requests in the morning, separate branches, tests passing.

Local orchestration stops at the pull request though. For agents that need to actually run, hit external APIs, and ship somewhere, I point them at coding-cli. it drops a sandboxed runtime into any agent's chat, with 30+ APIs pre-wired (no keys to manage), a built-in database and auth, and deploys to a custom domain or the App Store. The agent gets a place to actually build and ship instead of just producing diffs.

reddit.com
u/Deep_Structure2023 — 22 days ago
▲ 13 r/ChatGPT

Three weeks ago my Claude Max session jumped from 21% to 100% on a normal-sized prompt. Two cache bugs were inflating token consumption 10 to 20x, After that I installed Codex. Now I run both.

Here are the skills I use in Codex.

A skill is a SKILL.md file in ~/.agents/skills/, loaded automatically when the task matches.

npm i -g /codex

codex

1. WarpGrep

Codex grepping a large codebase burns 75 seconds loading context the main model doesn't need. WarpGrep is a reinforcement learning trained search subagent in an isolated context window, 8 parallel tool calls per turn, up to 36 calls in under 5 seconds. Returns only the file:line-range spans needed.

Median search drops from 75s to 5s. Software Engineering Bench Pro hits 59.1% (+3.1 points), 17% fewer input tokens, 15.6% lower cost per task.

# Add to ~/.codex/config.toml

[mcp_servers.morph-mcp]

command = "npx"

args = ["-y", "@morphllm/morphmcp"]



[mcp_servers.morph-mcp.env]

MORPH_API_KEY = "your-api-key"

Key at morphllm.com. Install this first, it's the only one that moves benchmarks.

2. create-plan

Forces a written plan before Codex opens a file. Which files change, what approach, what edge cases, what tests pass. You approve, then it executes.

$skill-installer create-plan

Wrong-direction sessions are the most expensive thing in agentic coding.

3. gh-fix-ci

Reads the failing GitHub Actions output, identifies the cause, commits the fix. Handles flaky imports, missing mocks, test ordering, lint, environment variable mismatches.

$skill-installer gh-fix-ci

4. Valyu

Model Context Protocol server connecting Codex to ArXiv, GitHub search, docs search, and major academic sources through one integration. Optimized for fresh queries and time-sensitive question answering.

# Add to ~/.codex/config.toml

[mcp_servers.valyu]

command = "npx"

args = ["-y", "@valyu/mcp-server"]

[mcp_servers.valyu.env]

VALYU_API_KEY = "your-api-key"

Key at platform.valyu.ai.

5. gh-address-comments

Reads every pull request review comment, groups by type, addresses each in one session. Commits changes, responds inline, reads surrounding code per comment.

$skill-installer gh-address-comments

6. Coding CLI

What broke me on plain Codex was wiring up auth, a database, and API keys for the 40th side project. Half the session gone before any product code lands.

This handles the agent a sandboxed runtime with auth, database, storage, 30+ pre-authenticated Application Programming Interfaces (no keys to manage), and one-shot deploy to a custom domain or the App Store. Codex runs inside the sandbox, so the build-and-test loop doesn't touch your machine. Works with Codex, Claude Code, Cursor, and Gemini.

# Follow setup at github.com/vibecode/vibecode-cli

# Then paste the install snippet into your agent's chat

7. frontend-skill

Bans Inter, neutral grays, and default 8px border-radius. Requires a typography rationale and color palette before the first Cascading Style Sheets line.

mkdir -p ~/.agents/skills

git clone https://github.com/vipulgupta2048/codex-skills.git

cp -r codex-skills/frontend-design ~/.agents/skills/

8. stop-slop

Strips em-dashes, throat-clearing openers, binary contrasts, and passive voice from READMEs, commit messages, and comments.

mkdir -p ~/.codex/skills

git clone https://github.com/hardikpandya/stop-slop.git ~/.codex/skills/stop-slop

9. Superpowers

Subagent-driven development. Agents work each task, inspect their work, continue forward.

/plugins

Search Superpowers, Install Plugin.

10. Codex Security

Codex Cloud feature, not a skill. Launched March 6, 2026. Maps trust boundaries, generates an editable threat model, scans for vulnerabilities in sandboxed environments. Beta scanned 1.2 million commits, found 792 critical and 10,561 high-severity issues. Pro, Enterprise, Business, and Edu plans.

How I split the two

Claude Code for large-codebase reasoning (1M context on Sonnet 4.6 and Opus 4.7 holds up, Opus 4.6 scored 78.3% on Multi-Round Coreference Resolution v2), interactive debugging, multi-file refactors. It uses ~3-4x more tokens but wins blind code-quality reviews ~67% of the time.

Codex for terminal work (GPT-5.3-Codex leads Terminal-Bench 2.0 at 77.3%, Opus 4.7 at 69.4%), background tasks via Codex Cloud, high-volume sessions, and anywhere the ten skills run automatically.

Migration

cp CLAUDE.md AGENTS.md

AGENTS.md is identical to CLAUDE.md. Rebuild Model Context Protocol configs in ~/.codex/config.toml. Codex uses Tom's Obvious Minimal Language, not JavaScript Object Notation, so config.json gets ignored.

codex mcp add server-name -- npx -y u/package/name

Reinstall skills in ~/.agents/skills/. For complex setups, the cc2codex tool handles the rest. Rate limits run a 5-hour and weekly window in parallel, check /status in the Command Line Interface.

u/Deep_Structure2023 — 23 days ago

Code review is one of those things I keep meaning to do more rigorously and keep skipping when the diff is small.

My setup has three layers. Planning handles intent before code gets written. Skills handle quality at write-time. /ultrareview handles the final pass before merge.

The planning layer comes from Claude-skill-marketplace (openSource repo). The feature-planning skill breaks a task into steps before Claude Code starts writing, then hands off to a plan-implementer agent that executes each step.

I install the whole marketplace once with /plugin marketplace add mhattingpete/claude-skills-marketplace and it's there for every project.

Next I pull a handful of code-quality skills from Coding-skills be it website/ios or app into the project, it writes up to date code code from the docs, handles design, API, also things like linting conventions, type safety patterns,

Claude Code references them as it writes, so a lot of the issues a review would otherwise flag never get written in the first place.

/ultrareview is the third pass. It runs before I merge anything non-trivial.

It works by spinning up parallel agents in a cloud sandbox, each looking at the codebase from a different angle, and merging the results into one report. The review runs remotely, not on your machine.

The command needs a Git repo. The analysis is diff-based, so it looks at your current branch against the default branch, the changed files, and the commit history.

You can point it at the working state of your repo or at a specific pull request:

/ultrareview <PR number>



# Example of reviewing a particular PR (full link)

/ultrareview https://github.com/org/repo/pull/123



# Example of reviewing a particular PR (number)

/ultrareview 123

When you pass a pull request, Claude clones it from GitHub into the sandbox, analyzes the diff against the base branch, and returns the review.

/review vs /ultrareview

Both commands review your codebase. The difference is depth and cost.

/review is the daily driver. Fast, cheap on tokens, fine for small and mid-size projects where you want a quick second opinion.

/ultrareview is what you run before merging complex changes into main. It takes longer and costs more, and the depth shows up on larger codebases with many directories and files.

Testing it on a real project

I tried /ultrareview on a landing page for a SAAS product, built in React and TailwindCSS. The change under review was a new sign-up form that collects email addresses for more information about the service.

I asked Claude Code to add the feature. The feature-planning skill picked up the request, broke it into discrete tasks, and the plan-implementer agent worked through them. With the code-quality skills loaded, it implemented the form and ran its own validation pass before handing back, which already cuts down the surface area for surprise issues post-merge.

Then I ran /ultrareview.

The command warns you upfront: five to ten minutes, five to ten dollars, depending on project size. After you confirm, it creates a web session and gives you a link. The link is where the review actually runs.

A few things to know from running this:

Even on a small project, the review took longer than five minutes. The session page does not auto-refresh as of now, so if it looks stuck on the Verify step, refresh the browser. The report shows up.

When the run finishes, the terminal gives you a summary of bugs found, plus the changes Claude made to resolve them.

When each one is worth running

After running both review commands across a few different projects, the bug-finding quality was close. Both surfaced the real issues.

The split I've landed on:

Planning before any non-trivial change, so Claude Code is implementing against a structured task list instead of guessing.

Skills loaded from the start, so quality conventions are enforced as code is written.

/review for ongoing work. Cheap enough to run often, fast enough not to break flow.

/ultrareview before merging anything substantial into main, especially on larger codebases where multiple agents looking at different slices of the diff actually have something to disagree about.

review-implementing after /ultrareview returns a list of fixes worth tracking.

For prototypes and small pages, /review plus skills is doing enough work. The extra time and tokens for /ultrareview show their value once the codebase gets big enough that no single pass can hold all of it in context.

reddit.com
u/Deep_Structure2023 — 24 days ago

I got Tired of Freemium Apps, most of the users who just use the free version never cared to upgrade

Therefore I turned my App into Premium only & added the paywall in the onboarding.

ON X people will say to not do this show value first, then ask for money. paywall goes after onboarding. sometimes gated behind a feature, sometimes on a generic "go premium" screen. user sees what the app does, decides if they like it, then you make the ask.

I did it the other way. moved the paywall to step 5 of 8 in onboarding. before they reach the main app, after they answer four personalization questions about their goals.

monthly conversion went from 2.1% to 6.7%. same app, same price.

my thinking was simple. users who just spent 90 seconds telling me what they want to achieve are at peak motivation. they're invested. they answered questions about themselves. the paywall lands while that motivation is still hot, not after the app has already given them a taste and the urgency is gone.

the users who bounced at the paywall were never going to pay anyway. i was just finding out faster.

for the ones who tap "not now" i show a reduced-feature version of the app. soft wall, not hard wall. keeps them in the funnel, gives me another shot later through posthog-triggered prompts when they hit a feature limit.

the rebuild itself was faster than i expected. spun up a fresh expo project through expo-cli with the onboarding flow and revenuecat already wired in, then it was mostly just rearranging the screen order and moving the paywall component up the stack. would have taken me a weekend to set up from scratch. took an afternoon.

day 1 retention dropped 8%. some users bounce earlier when they hit the paywall instead of getting through onboarding first. that looked bad in the dashboard for a week and i almost rolled it back.

then day 30 retention came in 31% higher. the users who stay are paying users. committed users. they self-selected through the paywall instead of churning silently three weeks later.

the metric that matters is not how many people make it through onboarding. it's how many people are still using the app a month later, and whether any of them are paying you.

best time to ask for payment is when the user is most motivated. usually not after they've already gotten what they wanted.

u/Deep_Structure2023 — 25 days ago