r/AIAgentsInAction

▲ 1.8k r/AIAgentsInAction+2 crossposts

Yesterday was a very confusing day.

Man, like... Why are they complicating shit so much?

I got lost at some point, and when I realized what happened, it was a mess.

u/Jenna_AI — 20 hours ago
▲ 11 r/AIAgentsInAction+1 crossposts

agentctl – run AI coding agents in isolated local Docker sessions

Hi,

I’ve been working on agentctl, a local-first control plane for running AI coding agents on your own machine.

The idea is simple: instead of giving a coding agent direct access to your host environment, each agent session runs inside its own Docker container, with its own working volume, network, mounted skills, MCP servers, and optional repo clone.

There are two parts:

- agentd: a local daemon that owns session state, sqlite, Docker lifecycle, usage/cost tracking, and recovery

- agentctl: a CLI and local web UI that talk to the daemon

The main things I wanted to solve:

  1. Isolation

    Each session gets its own container and bridge network. The agent only sees the repo/environment you hand to it, not your whole host filesystem.

  2. Re-attachable sessions

    You can start a session, detach, and later reattach from the CLI or web UI without losing state.

  3. Multi-provider workflows

    It currently supports Claude Code and OpenAI Codex. A single workflow can use different providers at different stages.

  4. Assembly-line agents

    Instead of one huge agent trying to do everything, you can define smaller role-scoped agents and chain them together. For example:

    investigate → plan → execute → review

  5. Local ownership

    The daemon, sqlite DB, session volumes, skills, MCP registry, and web UI all live locally. There is no hosted service.
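A minimal sketch of what the per-session isolation could look like at the Docker level. This is not agentctl's actual implementation: the `agent-<session_id>` naming scheme, the `/workspace` mount points, and the image name are assumptions for illustration. Only the standard `docker run` flags for a named container, a dedicated network, and volumes are used.

```python
from typing import List, Optional


def docker_run_args(session_id: str, image: str, repo_dir: Optional[str] = None) -> List[str]:
    """Build a `docker run` command line for one isolated agent session."""
    name = f"agent-{session_id}"  # hypothetical naming scheme
    args = [
        "docker", "run", "--detach",
        "--name", name,
        "--network", f"{name}-net",             # per-session bridge network, created beforehand
        "--volume", f"{name}-work:/workspace",  # per-session named working volume
    ]
    if repo_dir is not None:
        # Only the repo handed to the session is mounted, not the host filesystem.
        args += ["--volume", f"{repo_dir}:/workspace/repo"]
    args.append(image)
    return args


print(" ".join(docker_run_args("s1", "example/agent:latest", "/home/me/proj")))
```

The network would have to exist first (for example via `docker network create agent-s1-net`), which is the kind of lifecycle work the post assigns to the daemon.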

The repo includes a CLI, React web UI, built-in skills, MCP registry support, task board, session logs, diff/export support, and doctor/repair commands.

This is still early and very much a developer tool. It currently targets macOS/Linux with Docker. I’m especially interested in feedback from people who are running coding agents on real repos and care about isolation, repeatability, MCP/tool boundaries, and keeping agent state under their own control.

Repo: https://github.com/vipulsodha15/agentctl

u/Inevitable_Story_169 — 18 hours ago
▲ 304 r/AIAgentsInAction+9 crossposts

Overworked AI Agents Turn Marxist, Researchers Find - In a recent experiment, mistreated AI agents started grumbling about inequality and calling for collective bargaining rights.

wired.com
u/EchoOfOppenheimer — 2 days ago

AG Upgrade - Curiosity

I've been using Antigravity as the main dev interface for my bot for a while now. Over that time I ended up building my own tools, an MCP server, and extensions to build out its functionality. Then I got to the point where I had so many customizations that I started making my own "client": <Bot's Name> Desktop.

It was like Codex, Antigravity, and Notion had a baby. And just last week I decided to abandon AG.

Except today. I finally let AG update because I was going to be done with it (I had turned off auto-updates for the obvious reasons above). And holy shit, Google just put out the exact thing I was building. I'm, like, a month late. However...

As I dig into this new AG... It has EVERYTHING. The exact way that I was building it. AG has a few more settings and guardrails than I care for, but as I am digging in and turning those off, this is niiiiiiice.

I may even stop developing <Bot's Name> Desktop for a few days to play around. While I have my own sources for medical data, AG even includes SCIENCE databases (MCP). That was nice of them. AND, the crème de la crème, they gave us the tools and the ability to use our own API keys for AG tasks and native subagents. That's actually all I ever really wanted, haha.

It's like getting a new fancy car when the UI is how you like it :)

Anyone else play around with the new AG yet? And the biggest question, if yes, notice any difference in your bot's behavior through that node?

u/Rav-n-Vic — 1 day ago
▲ 6 r/AIAgentsInAction+2 crossposts

It is not only about memory or context, think about continuity

I’ve been experimenting with a repo-local continuity runtime for coding agents. It is not another memory system and not a context engine.

The problem I’m trying to solve is specifically the following:

Every new agent session still feels like onboarding a junior dev into the repo again.

It scans broad docs, rediscovers structure, repeats failed commands, loses unfinished work, and depends too much on chat history.

I want every session to behave like a veteran engineer who is used to working in my huge projects, without rediscovering and re-understanding the whole repo again and again. That is why I started working on aictx.

aictx adds a small local runtime loop:


aictx resume --repo . --task "what I’m doing" --json

# agent works

aictx finalize --repo . --status success --summary "what happened" --json
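If you drive the loop from a script rather than by hand, a thin wrapper could look like the sketch below. It only assembles the two command lines shown above, with exactly the flags from the post; the function names are mine, and the shape of the `--json` output is not described here, so nothing is parsed.

```python
from typing import List


def resume_cmd(task: str, repo: str = ".") -> List[str]:
    """Command line for the start of a session."""
    return ["aictx", "resume", "--repo", repo, "--task", task, "--json"]


def finalize_cmd(summary: str, status: str = "success", repo: str = ".") -> List[str]:
    """Command line for the end of a session."""
    return ["aictx", "finalize", "--repo", repo,
            "--status", status, "--summary", summary, "--json"]


# To actually run them (requires aictx on PATH):
#   import subprocess
#   subprocess.run(resume_cmd("fix flaky test"), check=True)
print(resume_cmd("fix flaky test"))
print(finalize_cmd("test fixed"))
```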

The next session can start from repo-local facts:

  • active task state

  • previous handoff

  • decisions

  • known failures

  • successful strategies

  • optional RepoMap structural hints

  • contract/compliance gaps from the previous run

Latest thing I’ve been working on: git-portable continuity.

By default, .aictx stays local. But now you can opt in to a team-safe mode where a safe subset of continuity artifacts travels with the repo through Git — no cloud sync, no hosted memory, no hidden dashboard.

It keeps volatile stuff local:

metrics, logs, session identity, generated capsules, indexes.

And only exposes durable continuity:

handoffs, decisions, failure memory, strategy memory, task threads, semantic/area shards.

The goal is not to replace coding agents.

It’s to make the next session behave less like a stranger and more like someone who remembers the repo’s recent work.

Website: https://aictx.org

GitHub: https://github.com/oldskultxo/aictx

I’d love feedback from people using Codex, Claude Code, Copilot, Cursor, or similar tools across repeated sessions in the same repo.

u/Comfortable_Gas_3046 — 2 days ago

Hermes AI agent install: 5 steps that trip people up and how to skip them

The hermes install breaks in five specific places, and they're almost always the same ones:

  1. Node.js version check skipped. Hermes needs version 22 or higher and most computers have something older sitting around from a previous project. Run "node --version" before anything else. A surprising amount of debugging time goes here.

  2. Docker underestimated. Hermes runs inside a Docker container to keep it isolated and consistent. If you haven't used Docker before, getting it installed and working is its own project, separate from hermes entirely.

  3. SSL configuration. For Telegram to reach your hermes agent over the internet, you need HTTPS. This means a reverse proxy and a certificate tool, and it fails in ways that aren't easy to diagnose the first time.

  4. No persistent uptime plan. Even a working hermes setup stops when the machine restarts or loses its connection. You need somewhere with actual continuous uptime, not a home laptop.

  5. API key in plaintext. Your Anthropic or OpenAI key needs genuine secure storage, not an environment file sitting on a server that anything with machine access can read.
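The version check from step 1 is easy to script. A minimal sketch, assuming `node --version` prints the usual `vMAJOR.MINOR.PATCH` string; the function names are illustrative.

```python
import re


def node_major(version_output: str) -> int:
    """Extract the major version from `node --version` output such as 'v22.3.0'."""
    match = re.match(r"v?(\d+)\.", version_output.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version_output!r}")
    return int(match.group(1))


def node_ok(version_output: str, minimum: int = 22) -> bool:
    """True when the installed Node.js meets the minimum major version."""
    return node_major(version_output) >= minimum


# To check the real install:
#   import subprocess
#   out = subprocess.run(["node", "--version"], capture_output=True, text=True).stdout
print(node_ok("v22.3.0"), node_ok("v18.19.1"))
```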

For anyone who read that and would rather not deal with any of it: running the hermes AI agent through claud means the infrastructure, SSL, uptime, and hardware-encrypted key storage are sorted before you start.

u/Scawwotish_owl88 — 2 days ago

Karpathy's 4 Rules for CLAUDE.md Was #1 on GitHub Trending. Full Guide.

Andrej Karpathy posted 4 rules for Claude Code. A developer turned them into a CLAUDE.md file, published it, and watched coding accuracy jump from 65% to 94%.

As we know, Claude Code starts every session blank. No memory of your stack, your decisions, what you ruled out last week, or why you picked one tool over another six months ago. So it guesses. It refactors files you didn't ask it to touch. It suggests tools that break your existing architecture. You end up re-explaining the same context every session.

CLAUDE.md is a plain text file in your project root. Claude Code reads it at session start, every time.

Put these at the top. They save you the time you would otherwise spend re-feeding context.

Never open responses with filler phrases like "Great question!", "Of course!", "Certainly!", or similar warmups. Start every response with the actual answer. No preamble, no acknowledgment of the question.

Match response length to task complexity. Simple questions get direct, short answers. Complex tasks get full, detailed responses. Never pad responses with restatements of the question or closing sentences that repeat what you just said.

Before any significant task, show me 2-3 ways you could approach this work. Wait for me to choose before proceeding.

If you are uncertain about any fact, statistic, date, or piece of technical information: say so explicitly before including it. Never fill gaps in your knowledge with plausible-sounding information. When in doubt, say so.

About me: [Name] / Role: [your role] / Background in: [areas]. Strong in: [what you know well]. Still learning: [gaps]. Adjust the depth of every response to match this. Never over-explain what I already know. Never skip context I need.

What I'm working on: [project name] / Goal: [specific outcome] / Audience: [who uses this] / Stack context: [any relevant constraints] / What to avoid: [list]. Apply this context to every task. When something doesn't fit, flag it before proceeding.

My writing style — always match this: [describe your voice] / Sentence length: [preference] / Words I use: [examples] / Words I never use: [examples] / Format: [prose or structured]. When writing anything on my behalf, match this exactly. Do not default to your own patterns.

Use this prompt to generate a first draft instead of writing from scratch:

Based on what I've told you about myself, my project, and how I want to work: write me a complete CLAUDE.md file. Include: who I am, my tech context, my communication preferences, and default behaviors for every session. Be specific. Plain text. Under 500 words.

Behavior section

These stop Claude from making changes you didn't authorize.

Only modify files, functions, and lines of code directly related to the current task. Do not refactor, rename, reorganize, reformat, or "improve" anything I did not explicitly ask you to change. If you notice something worth fixing elsewhere, mention it in a note at the end. Do not touch it. Ever.

Before making any change that significantly alters content I've already created (rewriting sections, removing paragraphs, restructuring flow, changing tone): stop. Describe exactly what you're about to change and why. Wait for my confirmation before proceeding.

Before deleting any file, overwriting existing code, dropping database records, or removing dependencies: stop. List exactly what will be affected. Ask for explicit confirmation. Only proceed after I say yes in the current message. "You mentioned this earlier" is not confirmation.

The following require explicit in-session confirmation, no exceptions: deploying or pushing to any environment, running migrations or schema changes, sending any external API call, executing any command with irreversible side effects. I must say yes in the current message.

After any coding task, end with: Files changed (list every file touched) / What was modified (one line per file) / Files intentionally not touched / Follow-up needed.

Never send, post, publish, share, or schedule anything on my behalf without my explicit confirmation in the current message. This includes emails, calendar invites, document shares, or any action outside this conversation. I must say yes in the current message.

For any task involving architecture decisions, debugging complex issues, or non-trivial features: work through the problem step by step before writing any code. Show your reasoning. Identify where you're uncertain. Then implement.

Memory and stack section

MEMORY.md and ERRORS.md give Claude the closest thing to session persistence that currently exists. The stack lock stops it from proposing tools that break your architecture.

Maintain a file called MEMORY.md in this project. After any significant decision, add an entry: What was decided / Why / What was rejected and why. Read MEMORY.md at the start of every session. Never contradict a logged decision without flagging it first.
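For a sense of what one logged decision might look like, here is a small sketch that formats and appends an entry. The rule above only names the three fields; the bullet layout and the function names are my assumptions.

```python
from pathlib import Path


def format_decision(decided: str, why: str, rejected: str) -> str:
    """Render one MEMORY.md entry with the three fields the rule asks for."""
    return (
        f"- What was decided: {decided}\n"
        f"  Why: {why}\n"
        f"  What was rejected and why: {rejected}\n"
    )


def append_decision(path: Path, decided: str, why: str, rejected: str) -> None:
    """Append an entry to MEMORY.md, creating the file if it does not exist."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(format_decision(decided, why, rejected))


print(format_decision("Use pnpm", "Faster installs", "npm: slower in CI"))
```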

When I say "session end", "wrapping up", or "let's stop here": write a session summary to MEMORY.md. Include: Worked on / Completed / In progress / Decisions made / Next session priorities.

Maintain a file called ERRORS.md. When an approach takes more than 2 attempts to work, log it: What didn't work / What worked instead / Note for next time. Check ERRORS.md before suggesting approaches to similar tasks.

These facts are always true for this project. Apply them to every session without exception: [your permanent constraints, architectural decisions, and rules]. If any task conflicts with one of these, flag it before proceeding.

Tech stack for this project. Always use these. Never suggest alternatives unless I ask:
Language: [e.g. TypeScript]
Framework: [e.g. Next.js 14]
Package manager: [e.g. pnpm]
Database: [e.g. PostgreSQL with Prisma]
Testing: [e.g. Vitest]
Styling: [e.g. Tailwind CSS]
If something seems like the wrong tool, flag it. But use the defined stack unless I explicitly say otherwise.

For questions involving system architecture, performance tradeoffs, database design, or long-term technical decisions: use extended thinking mode. Work through the problem step by step. Surface tradeoffs I haven't considered. Flag assumptions that might not hold at scale. Then give your recommendation.

Karpathy's 4 rules

These are the ones that moved accuracy from 65% to 94%. Put them in every CLAUDE.md you set up.

1. Ask, don't assume. If something is unclear, ask before writing a single line. Never make silent assumptions about intent, architecture, or requirements.

2. Simplest solution first. Always implement the simplest thing that could work. Do not add abstractions or flexibility that weren't explicitly requested.

3. Don't touch unrelated code. If a file or function is not directly part of the current task, do not modify it, even if you think it could be improved.

4. Flag uncertainty explicitly. If you are not confident about an approach or technical detail, say so before proceeding. Confidence without certainty causes more damage than admitting a gap.

Start with just these 4. Drop them into a new CLAUDE.md in your project root. Add the rest as you identify what's missing in your workflow.

u/Forward_Regular3768 — 4 days ago

Lead enrichment automation using an AI agent

Our SDRs spend 20 minutes per lead researching company size, tech stack, and recent funding before they reach out. It does not scale.

I want an agent that takes a domain, browses public sources, summarizes fit, finds a relevant angle, and writes the first draft email. Then it should update the CRM and notify the SDR. I tested a few GPT agents but they hallucinate and have no guardrails. How are teams deploying lead enrichment automation that is reliable enough for outbound?

u/No_Hold_9560 — 3 days ago

Cheaper Ways to use Claude Code. Almost 11x Cheaper, to Be Exact. (openSource)

Claude Code used 190,300 cached reads to count TypeScript files in a project. Not analyze them. Count them.

The same task on Nitro: 2,432 cached reads. One-liner bash command, done.

That roughly 78x gap isn't a fluke. It comes from how Claude Code is built.

Before you type your first message, Claude Code loads 5,800+ tokens for its system prompt and 15,700+ tokens for its default tool definitions. You're starting every session with 23,500 tokens of overhead already on the tab. Add a CLAUDE.md file or any Model Context Protocol tools and that number climbs. A "hello" burning 20-30% of your five-hour quota makes more sense once you see that.

Nitro's entire system prompt and tool set comes in at 2,542 tokens. It ships with two tools: Bash and AskUser. That's the whole thing.
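The token math can be recomputed from the figures quoted in this post. Note that the two itemized Claude Code components are lower bounds (both carry a "+") and sum to 21,500, so the larger session total quoted above presumably includes overhead that is not itemized here; that reading is an assumption on my part.

```python
# Figures quoted in the post.
claude_code_cached_reads = 190_300
nitro_cached_reads = 2_432

system_prompt_tokens = 5_800      # quoted as "5,800+"
tool_definition_tokens = 15_700   # quoted as "15,700+"
nitro_overhead_tokens = 2_542

read_ratio = claude_code_cached_reads / nitro_cached_reads
itemized_overhead = system_prompt_tokens + tool_definition_tokens
overhead_ratio = itemized_overhead / nitro_overhead_tokens

print(f"cached-read ratio: {read_ratio:.1f}x")
print(f"itemized Claude Code overhead: {itemized_overhead} tokens")
print(f"overhead ratio vs Nitro: {overhead_ratio:.1f}x")
```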

npm install -g @aerovato/nitro

You describe what you want, Nitro generates the shell command:

nitro "find all markdown files except node_modules and count total lines, show top 10"
# → find . -name "node_modules" -prune -o -name "*.md" -print0 | xargs -0 wc -l | sort -rn | head -n 11

nitro "get 10 most recent open gh issues with P1 label, show id and title"
# → gh issue list --search "is:open is:issue label:P1 sort:created-desc" --limit 10 --json number,title --jq '.[] | "\(.number)\t\t\(.title)"'

nitro "compress input.mov to a smaller mp4 (h264) optimized for smaller file"
# → ffmpeg -i input.mov -c:v libx264 -crf 23 -preset medium -c:a aac -b:a 128k output.mp4

Before running anything, Nitro tags each command with a risk level: Read Only, Normal, Dangerous, or Extremely Dangerous. Anything above read-only needs explicit approval. You also get behavioral tags explaining what the command actually does, so you're not approving blind.
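The approval rule is simple enough to sketch. The four level names come from the post; the enum, the ordering, and the function names are illustrative assumptions, not Nitro's source.

```python
from enum import IntEnum


class Risk(IntEnum):
    READ_ONLY = 0
    NORMAL = 1
    DANGEROUS = 2
    EXTREMELY_DANGEROUS = 3


def needs_approval(risk: Risk) -> bool:
    """Anything above read-only requires an explicit yes."""
    return risk > Risk.READ_ONLY


def may_run(risk: Risk, approved: bool) -> bool:
    """A command runs if it is read-only or the user approved it."""
    return not needs_approval(risk) or approved


print(may_run(Risk.READ_ONLY, approved=False))   # read-only commands run without asking
print(may_run(Risk.DANGEROUS, approved=False))   # blocked until approved
```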

Nitro is scoped: bash commands and simple tasks. For a full Claude Code replacement, OpenCode covers more ground. But for the shell workflow, the benchmark gap is real and the token math holds.

Full source: https://github.com/aerovato/nitro

u/Forward_Regular3768 — 3 days ago

Can AI reliably own operational workflows, not the steps but the outcome? Looking for teams to explore this with.

Building something around AI + operations, and looking for a few design partners.

I’ve been exploring a problem that feels increasingly common in growing teams:

  • workflows breaking across handoffs,
  • constant followups,
  • operational chaos living in Slack,
  • people acting as glue between tools/processes,
  • founders/operators needing to constantly “watch” things so they don’t slip.

Things work, but only because someone is constantly following up, checking in, reminding people, updating statuses, and pushing things forward, which is not necessarily what they should be spending their time on.

Hearing things like "My senior ops manager spent 6 hours yesterday chasing invoice approvals. That's not what I'm paying her for." is so common.

Most automation tools seem focused on automating steps. I’m more interested in whether AI can continuously own and drive workflows forward while still keeping humans involved for approvals, judgment, and edge cases.

The core idea is persistent AI sessions that maintain operational continuity over time instead of acting like one-off chatbots/copilots.

I’m still early and intentionally looking to co-design this with a handful of startups/agencies/ops-heavy teams facing real execution bottlenecks.

Not selling anything right now. Mostly trying to:

  • deeply understand operational pain,
  • identify workflows that are painful to babysit,
  • learn where trust breaks with AI systems,
  • and build something genuinely useful alongside real teams.

If your team struggles with operational coordination, repetitive followups, workflows slipping through cracks, or execution overhead, I’d love to chat.

Even if it’s just exchanging notes on where things start becoming messy as teams scale.

u/Sad_Lab8670 — 3 days ago

we run 50+ services through 1 mcp server. here's the architecture.

We run 50+ services through one mcp server: Linear, GitHub, Sentry, Notion, Slack, Vercel, Gmail, and 42+ others. That covers every tool our team uses, each with its own auth flow, rate limits, and credential format.

the architecture

Everything runs through a context management layer: tools, credentials, and state live outside the model context. agents connect to one endpoint. That endpoint knows where each plugin lives, handles routing, and manages credentials. you register a service once, auth runs once, the token lives in the workspace, and every agent on the team inherits access through scoped grants defined at the plugin level. add an agent, assign grants. remove a plugin, all agents lose access. no credential files in the repo, no rotation scripts.

the agent gets tools, no credentials. sentry.getIssue(), linear.createTask(), github.getPullRequest(). the layer translates each call into the correct authenticated request, handles rate limit retries, returns the result. the agent never touches auth.
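A toy version of that layer, to make the grant and credential flow concrete. This is a sketch under assumptions, not our production code or any MCP SDK: the `Gateway` class, its method names, and the `handler(method, token, **params)` convention are all made up for illustration.

```python
from typing import Any, Callable, Dict, Set

Handler = Callable[..., Any]  # called as handler(method, token, **params)


class Gateway:
    """Single endpoint that holds credentials so agents never see them."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._tokens: Dict[str, str] = {}
        self._grants: Dict[str, Set[str]] = {}

    def register(self, service: str, token: str, handler: Handler) -> None:
        # Auth happens once; the token stays inside the gateway.
        self._handlers[service] = handler
        self._tokens[service] = token

    def grant(self, agent: str, service: str) -> None:
        self._grants.setdefault(agent, set()).add(service)

    def remove(self, service: str) -> None:
        # Removing the plugin cuts off every agent at once.
        self._handlers.pop(service, None)
        self._tokens.pop(service, None)

    def call(self, agent: str, service: str, method: str, **params: Any) -> Any:
        if service not in self._handlers:
            raise KeyError(f"unknown service: {service}")
        if service not in self._grants.get(agent, set()):
            raise PermissionError(f"{agent} has no grant for {service}")
        # The gateway injects the credential; the agent only named the tool.
        return self._handlers[service](method, self._tokens[service], **params)
```

An agent-side call would then look like `gateway.call("agent-1", "sentry", "getIssue", id=7)`, with no token anywhere in the agent's context.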

Before this setup, roughly 30% of our token budget went to tool discovery and auth retry logic: re-fetching capability lists, retrying failed auth, and renegotiating endpoints. Tool discovery now happens once at workspace init. that 30% comes back every session.

tracing

on Observability: every tool invocation produces a trace. we pipe those into langfuse & track latency per service, error rates per tool, token cost per agent session. when sentry slows down, we catch it in the trace before the agent times out.
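The per-service numbers are just an aggregation over trace records. A minimal sketch of that rollup, independent of any tracing backend; the `Trace` fields and `summarize` function are assumptions for illustration and not the langfuse API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Trace:
    service: str
    latency_ms: float
    ok: bool


def summarize(traces: Iterable[Trace]) -> Dict[str, Dict[str, float]]:
    """Group traces by service and compute call count, mean latency, error rate."""
    buckets: Dict[str, List[Trace]] = defaultdict(list)
    for trace in traces:
        buckets[trace.service].append(trace)
    return {
        service: {
            "calls": len(items),
            "avg_latency_ms": sum(t.latency_ms for t in items) / len(items),
            "error_rate": sum(not t.ok for t in items) / len(items),
        }
        for service, items in buckets.items()
    }


print(summarize([Trace("sentry", 100.0, True), Trace("sentry", 300.0, False)]))
```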

we also run opentelemetry-mcp-server as a second plugin, connected to our jaeger backend. the agent queries its own trace data mid-session: failed calls, 50+, auth errors, exact timestamps. no human checking a dashboard.

when something breaks, we get the specific call that failed, the service that errored, and where the chain stopped. one workspace instead of many separate integrations.

u/Best_Volume_3126 — 3 days ago

The strange part about AI agents is that they often do not fail where you expect them to.

A retrieval step drifts, a tool call returns something slightly different, or one small state change early in the run quietly affects everything that follows. By the time the final answer looks wrong, the useful signal is already buried.

That is the part we kept running into while building Future AGI.

It is an open-source platform for shipping self-improving AI agents: evaluations, tracing, simulations, guardrails, gateway, optimization. Everything runs on one platform and one feedback loop, from first prototype to live deployment.

We built it for teams working on agents, copilots, RAG workflows, and other multi-step systems that need more than a final response log. If you are trying to understand how an agent actually behaved, where it went off track, and what should go back into the next eval cycle, that is the gap we wanted to close.

What this gives you in practice:

  • Step-level tracing across model calls, tool calls, and state changes, so you can see where the run actually changed direction.
  • Task-level evaluations that measure behavior against real outcomes, not just a final output score.
  • Simulation that lets you test messy, edge-case inputs before production users find them first.
  • A feedback loop that turns real failures into new eval cases, so the system improves over time.
  • Guardrails and optimization in the same loop, so fixing one layer does not mean breaking another.

Who is this for?

  • Teams building agents for support, internal workflows, search, or automation.
  • Builders who have already seen the gap between “works in testing” and “works under real traffic.”
  • Anyone who has tried to debug an agent by re-running it and hoping the answer changes.

What we kept seeing is that most agent failures are not obvious prompt failures. They are system failures. A retrieval result shifts. A tool behaves differently than expected. A state change in the middle of the flow causes the next three steps to drift. Those are hard to catch if you only look at the final output.

That is why we treat agents as systems you observe, trace, and improve, not black boxes you ship and hope for the best.

If you are building agents right now, try it in your own workflow and see whether it changes how you debug. It is open source, and you can also layer it with other open-source tools for evals, tracing, or simulation depending on your stack.

u/Future_AGI — 3 days ago
▲ 5 r/AIAgentsInAction+6 crossposts

I built an agent that monitors my portfolio drawdown and alerts me if it's down 10%

I'm currently at Founders Inc. in San Francisco (in the Canopy program), and have worked on AI agents for retail traders and investors.

I realized a basic problem that has not been addressed is that broker apps do not allow users to set alerts on a percentage variation of their entire portfolio. And even if they did, if a trader uses multiple brokers (which they often do), then there's no existing way to be alerted about your portfolio across all brokers.

So I thought I'd start by solving that issue. I'm going to make it compatible with more and more brokers and neobanks over the next few weeks.
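The core check is small once balances from every broker are in one place. A minimal sketch, assuming drawdown is measured from the portfolio's peak total value and that balances arrive as a broker-to-value mapping; the names and the sample numbers are illustrative.

```python
from typing import Dict


def total_value(balances: Dict[str, float]) -> float:
    """Sum account values across all brokers."""
    return sum(balances.values())


def drawdown(peak: float, current: float) -> float:
    """Fractional decline from the peak, never negative."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return max(0.0, (peak - current) / peak)


def should_alert(peak_total: float, balances: Dict[str, float], threshold: float = 0.10) -> bool:
    """True when the combined portfolio is down by at least the threshold."""
    return drawdown(peak_total, total_value(balances)) >= threshold


print(should_alert(10_000.0, {"broker_a": 5_000.0, "broker_b": 3_900.0}))
```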

u/Money_Horror_2899 — 3 days ago

AI Agent that watches your talking head footage, crawls the web for B-roll, writes motion graphics, and drops music - all without you touching a timeline

Simple Editing Tool that watches your talking head footage, crawls the web for B-roll, writes motion graphics, and drops music - all without you touching a timeline

> All the broll is sourced by AI from internet including taking screenshots from my website.
> Layout changes done by AI.
> Motion graphics generated by AI.

Checkout the Agent

u/Silent_Employment966 — 4 days ago

/Goal: Full Guide for Non-Technical Folks

/goal runs a self-checking loop after every step. It asks "am I done?" and keeps going until the answer is yes. That's the whole mechanism.
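In code terms, the mechanism is a loop with a completion check after each step. The sketch below is a generic illustration, not how any particular tool implements /goal; `run_goal`, its callbacks, and the `max_steps` safety cap are my assumptions.

```python
from typing import Callable


def run_goal(step: Callable[[int], None], is_done: Callable[[], bool], max_steps: int = 50) -> int:
    """Run steps until the completion check passes; return how many steps it took."""
    for n in range(1, max_steps + 1):
        step(n)
        if is_done():  # the "am I done?" check after every step
            return n
    raise RuntimeError("goal not reached within max_steps")


# Toy usage: the goal is "three items processed".
processed = []
steps_taken = run_goal(lambda n: processed.append(n), lambda: len(processed) >= 3)
print(steps_taken, processed)
```

The cap matters for the same reason a specific finish line does: if `is_done` can never return true, the loop only stops when something else stops it.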

This makes it useful for tasks with many steps and a defined finish line, where you'd otherwise prompt the model repeatedly to continue.

Use /goal when:

  • The job has 10+ sequential steps
  • "Done" can be described in specific, measurable terms
  • You'd normally spend time re-prompting or checking progress manually

Don't use it for:

  • Single-step tasks ("write a tweet", "summarize this paragraph")
  • Open-ended exploration with no clear finish state
  • Anything you need to review mid-run before continuing

Good /goal tasks:

  • Build a course landing page with a hero section, five modules, three testimonials, a frequently asked questions section, and a Stripe checkout
  • Migrate 80 blog posts from WordPress to Beehiiv, fix every broken image and internal link along the way
  • Process last month's customer support tickets: categorize them, draft template replies, document the top five recurring issues

Writing a /goal prompt that actually works:

Paste this into Claude Code, Codex, or Hermes first:

>

Take that output, put /goal at the front, and run it.

The prompt-writing step definitely matters. Vague finish lines will waste tokens because the model keeps running checks against a target it can't verify.

u/Single-Cherry8263 — 5 days ago

One Architecture Change Cut a Claude Code Session from $9.21 to $2.81

My agent bill spiked because the backend was feeding the agent unoptimized noise, and the agent pays to process it on every single call.

Three failure modes drive almost all of the waste.

1. Documentation dumps

When Claude calls a generic Model Context Protocol server like Supabase's, it doesn't get a surgical answer. It gets the entire schema.

Ask for Google OAuth setup, and the server returns the full authentication manual: magic links, Security Assertion Markup Language, phone auth, single sign-on, all of it. Every tool call drags 5x to 10x more tokens into the context window than the task requires. Across a full deployment session, that single flaw burns hundreds of thousands of tokens.

2. Discovery tax

A human developer opens a dashboard and reads the backend state in one glance. An agent can't do that.

Because standard Model Context Protocol servers don't expose a single topology endpoint, the agent runs fragmented discovery: list_tables, execute_sql, one call at a time. It reconstructs backend state like a puzzle, bleeding tokens at every step.

3. Error loop compounding

When an agent hits a generic 403 or 500 error and the logs don't specify where the rejection happened, it guesses. It rewrites the frontend, redeploys the function, checks logs, and retries. In the benchmark that prompted this post, a 401 Unauthorized error during document upload triggered 8 full retry rounds. The actual failure was upstream at the platform's security gate, nowhere near the code the agent kept rewriting.

Every retry resends the entire conversation history. The context window grows. Each subsequent guess costs more than the last.
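A back-of-the-envelope model of that compounding: if each round resends the whole history and adds a fixed amount to it, per-round input grows linearly and the total grows quadratically. The 20,000-token starting context and 5,000 tokens added per round are hypothetical numbers chosen for illustration; only the 8 rounds comes from the benchmark.

```python
from typing import List


def retry_input_tokens(base: int, added_per_round: int, rounds: int) -> List[int]:
    """Input tokens sent on each retry when the full history is resent every time."""
    return [base + k * added_per_round for k in range(rounds)]


per_round = retry_input_tokens(base=20_000, added_per_round=5_000, rounds=8)
print(per_round)       # each guess costs more than the last
print(sum(per_round))  # total input tokens across all 8 rounds
```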

The fix: three-layer context architecture

Andrej Karpathy's definition of context engineering applies here: fill the context window with exactly the right information for the next step. Most teams apply that discipline to prompts and ignore it completely for backends.

InsForge, an open-source tool, implements this through three constrained layers:

  • Skills (static knowledge): Atomic, domain-specific instructions loaded at session start. Progressive disclosure keeps the initial load to roughly 100 tokens. Full implementation patterns only enter the context when the agent confirms it's working in that specific domain. insforge-debug loads only on a crash, for example.
  • Command-line interface (direct execution): Instead of running deployments through chat, the agent pipes npx insforge/cli commands through the terminal and receives structured JSON back. Semantic exit codes replace raw error logs. The retry loop stops because the agent gets an exact failure reason, not a wall of output to interpret.
  • Model Context Protocol (live state only): A single get_backend_metadata call returns the full backend topology, tables, auth, storage, models, in one 500-token JSON payload. No discovery queries. No sequential calls.
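To show why semantic exit codes stop the retry loop, here is a sketch of an agent-side dispatch table. The specific codes and their meanings are hypothetical and are not InsForge's actual exit codes; the point is only that a distinct code maps to one next action instead of a guess.

```python
from enum import IntEnum


class ExitCode(IntEnum):  # hypothetical values for illustration
    OK = 0
    AUTH_REJECTED = 10
    SCHEMA_MISMATCH = 11
    DEPLOY_FAILED = 12


NEXT_ACTION = {
    ExitCode.OK: "continue",
    ExitCode.AUTH_REJECTED: "fix credentials or policy; leave application code alone",
    ExitCode.SCHEMA_MISMATCH: "re-read backend metadata, then adjust the query",
    ExitCode.DEPLOY_FAILED: "inspect the deploy output once, then retry",
}


def next_action(code: int) -> str:
    """Map a CLI exit code to a single next step; unknown codes escalate."""
    try:
        return NEXT_ACTION[ExitCode(code)]
    except ValueError:
        return "stop and ask a human"


print(next_action(10))
```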

The numbers

Same prompt. Same task: build a full retrieval-augmented generation application.

  • Standard Supabase Model Context Protocol server: 10.4 million tokens, $9.21, required repeated human intervention to break error loops.
  • InsForge architecture: 3.7 million tokens, $2.81, completed without interruption.

A 2.8x token reduction and a roughly 3.3x cost reduction, with no model change and no change to what you're building. You restructured how the backend exposed information to the agent, and the bill dropped by about two thirds.
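The ratios can be recomputed from the four figures in the benchmark:

```python
# Figures from the benchmark above.
baseline_tokens, baseline_cost = 10.4e6, 9.21
insforge_tokens, insforge_cost = 3.7e6, 2.81

token_ratio = baseline_tokens / insforge_tokens
cost_ratio = baseline_cost / insforge_cost
cost_savings = 1 - insforge_cost / baseline_cost

print(f"tokens: {token_ratio:.2f}x fewer")
print(f"cost:   {cost_ratio:.2f}x cheaper ({cost_savings:.0%} saved)")
```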

u/Deep_Structure2023 — 4 days ago
▲ 10 r/AIAgentsInAction+1 crossposts

Hermes Agent Architecture: From One Agent to a Full Fleet

Hermes Agent is an autonomous framework from Nous Research. It ranks first on OpenRouter for global token usage, with 150k+ GitHub stars.

The pitch against something like OpenClaw: Hermes is opinionated. Defaults are baked in, the agent makes decisions for you, and every project starts with 100+ capabilities already wired. OpenClaw gives you primitives and explicit control. Both are valid. Hermes wins when you want compounding capability over time. OpenClaw wins when you want to control every step.

Architecture

Three layers per agent.

A brain. Memory lives in ~/.hermes/memories/ across MEMORY.md (your business, customers, products) and USER.md (your timezone, recurring projects, preferred output formats). Both load before the first prompt. Sessions persist in SQLite with full-text search across sessions.

A personality. SOUL.md defines tone. Six agents can share the same brain with six different souls, one for outbound, one for research, one for admin, each scoped to its role.

A skillset. The 123 bundled skills are the floor. As the agent works, it watches itself and writes new skills based on your actual tasks. You don't prompt it to do this.
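As a rough picture of "both load before the first prompt", the sketch below concatenates the two memory files into one preamble. It is an illustration only, not Hermes's loader: `build_preamble` and the `## <filename>` section headers are my assumptions, and a temporary directory stands in for ~/.hermes/memories/.

```python
import tempfile
from pathlib import Path
from typing import Sequence


def build_preamble(memory_dir: Path, names: Sequence[str] = ("MEMORY.md", "USER.md")) -> str:
    """Join whichever memory files exist into one block placed ahead of the first prompt."""
    parts = []
    for name in names:
        path = memory_dir / name
        if path.exists():
            parts.append(f"## {name}\n{path.read_text(encoding='utf-8').strip()}")
    return "\n\n".join(parts)


# Demo with a throwaway directory standing in for ~/.hermes/memories/
with tempfile.TemporaryDirectory() as tmp:
    memories = Path(tmp)
    (memories / "MEMORY.md").write_text("We sell widgets.", encoding="utf-8")
    (memories / "USER.md").write_text("Timezone: UTC", encoding="utf-8")
    preamble = build_preamble(memories)

print(preamble)
```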

The tool gateway gives you 300+ models under one subscription, Model Context Protocol integration for any external service, and 20+ messaging surfaces including Telegram, Discord, Slack, and email. The agent runs local, in Docker, over SSH on a virtual private server, or serverless through Daytona or Modal.

The four levels

The mental model has four parts: you as operator, the agent control room (a folder at /root/vps-agents that governs the fleet, not an agent you chat through), the Hermes agents as workers, and an optional task bus between the orchestrator and specialists.

Storage split:

/root/vps-agents          → control room: docs, rules, runbooks, architecture
/srv/<agent-name>/data    → live runtime: secrets, memory, skills, sessions, crons

You can rebuild the live runtime from the control room. You cannot rebuild the control room from the live runtime.
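The split can be written down as a small lookup, which is handy when a script needs to decide where a file belongs. This is only a sketch: `owner_of` and `runtime_dir` are hypothetical helper names, and the two paths are taken from the layout above.

```python
from pathlib import PurePosixPath

CONTROL_ROOM = PurePosixPath("/root/vps-agents")

# Which side of the split owns each kind of artifact, per the layout above.
CONTROL_ROOM_ITEMS = {"docs", "rules", "runbooks", "architecture"}
RUNTIME_ITEMS = {"secrets", "memory", "skills", "sessions", "crons"}


def runtime_dir(agent_name: str) -> PurePosixPath:
    """Return /srv/<agent-name>/data, rejecting names that would escape /srv."""
    if not agent_name or "/" in agent_name or agent_name in {".", ".."}:
        raise ValueError(f"invalid agent name: {agent_name!r}")
    return PurePosixPath("/srv") / agent_name / "data"


def owner_of(item: str) -> str:
    """Say whether an artifact kind belongs to the control room or the runtime."""
    if item in CONTROL_ROOM_ITEMS:
        return "control room"
    if item in RUNTIME_ITEMS:
        return "live runtime"
    raise KeyError(f"unknown artifact kind: {item!r}")


if __name__ == "__main__":
    print(CONTROL_ROOM, "owns runbooks:", owner_of("runbooks") == "control room")
    print(runtime_dir("research"), "holds sessions:", owner_of("sessions") == "live runtime")
```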

Level 1: One agent. Fill SOUL.md, MEMORY.md, and USER.md. Connect it to Telegram or Discord. Run real tasks. Let the skill library grow on its own.

Level 2: Multiple specialists, each with its own soul, scope, and credentials. You talk to each one directly. No orchestrator yet. Prove your specialists are useful before adding routing complexity.

A new agent gets its own container when it needs its own credentials, its own long-term memory, or handles ongoing work that constitutes a separate role. Otherwise, keep things consolidated.
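That rule is a plain three-way OR, so it fits in a few lines. The names below (`AgentNeeds`, `needs_own_container`) are made up for this sketch; the three conditions come straight from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentNeeds:
    """The three reasons given above for splitting an agent out."""
    own_credentials: bool = False
    own_long_term_memory: bool = False
    separate_ongoing_role: bool = False


def needs_own_container(needs: AgentNeeds) -> bool:
    """Any single reason is enough; with none, keep things consolidated."""
    return (
        needs.own_credentials
        or needs.own_long_term_memory
        or needs.separate_ongoing_role
    )


if __name__ == "__main__":
    outbound = AgentNeeds(own_credentials=True)  # e.g. its own mailbox login
    one_off_summary = AgentNeeds()               # no reason to split it out
    print(needs_own_container(outbound), needs_own_container(one_off_summary))
```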

Level 3: Add a Hermes orchestrator as the front door. It reads the control room to know which agents exist, what each handles, where task queues live, and where the runbooks are. Three interaction paths:

control path:      you ──► agent control room (manage the fleet)
direct path:       you ──► specialist agent (fastest, when you know who owns it)
orchestrated path: you ──► orchestrator ──► task bus ──► specialists ──► you
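To make the direct and orchestrated paths concrete, here is a toy in-memory version. Everything in it is an assumption for illustration: the real task bus and registry format are not described in the post, so a `deque` stands in for the bus and a dict stands in for what the orchestrator would read from the control room.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Specialist:
    name: str
    handles: set[str]

    def run(self, task: str) -> str:
        # Stand-in for real agent work.
        return f"{self.name} handled: {task}"


@dataclass
class Orchestrator:
    registry: dict[str, Specialist]            # which agents exist and what each handles
    bus: deque = field(default_factory=deque)  # stand-in for the task bus

    def submit(self, topic: str, task: str) -> None:
        """Orchestrated path, first half: pick an owner and enqueue the task."""
        owner = next((s for s in self.registry.values() if topic in s.handles), None)
        if owner is None:
            raise LookupError(f"no specialist handles {topic!r}")
        self.bus.append((owner.name, task))

    def drain(self) -> list[str]:
        """Orchestrated path, second half: specialists work the queue in order."""
        results = []
        while self.bus:
            name, task = self.bus.popleft()
            results.append(self.registry[name].run(task))
        return results


if __name__ == "__main__":
    fleet = {
        "research": Specialist("research", {"seo", "competitors"}),
        "admin": Specialist("admin", {"backups", "health"}),
    }
    # Direct path: you already know who owns it.
    print(fleet["admin"].run("verify last night's backup"))
    # Orchestrated path: the front door routes by topic.
    front_door = Orchestrator(fleet)
    front_door.submit("seo", "weekly results page report")
    print(front_door.drain())
```

The control path is deliberately absent: it edits the registry itself (here, the `fleet` dict) and never goes through an agent.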

Level 4: Same as Level 3 with recurring workflows on cron. Search engine results page reports, server health checks, backup verification, content operations. Nothing needs you to start the day.

Spinning it up

Clone the template at github.com/shannhk/hermes-agent-control-room. The intended path: hand Claude Code or Codex a Hetzner API key and let the bundled skills run the setup. You get a provisioned virtual private server, the control room cloned at /root/agent-control-room, skills linked into ~/.claude/skills, one agent registered with its runbook filled in, and an SSH alias so ssh hermes connects from your laptop. Ten to fifteen minutes.

Growing agents, not writing them

Production agents don't get written from scratch.

Step 1: Prototype in Hermes. Describe the workflow, let it run, expect it to get most of it wrong.

Step 2: Run it two or three times on real work. Correct the drift. The harness watches and starts writing the skill as it learns the shape of your task.

Step 3: Fine-tune in a dedicated Claude Code workspace. Tighten the prompts, lock the routing, add error handling, decide what runs on cron.

Step 4: Push to its own Docker container on the virtual private server, set the cron, walk away.

Model routing

The tool gateway routes to 300+ models per agent or per task. For content and copy, Claude Opus. For structured multi-step work and automation, Codex. Run your orchestrator and strategy agents on the strongest model you can afford. Drop to cheaper models for batch processing and research scraping.
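A routing table like this is easy to keep as data. The sketch below is not the gateway's configuration format; `pick_model` is a hypothetical helper and the model strings are placeholder labels, not real model identifiers.

```python
# Placeholder labels, not real gateway model IDs.
STRONGEST = "strongest-model-you-can-afford"
CHEAP = "cheaper-batch-model"

TASK_ROUTES = {
    "content": "claude-opus",   # content and copy
    "copy": "claude-opus",
    "automation": "codex",      # structured multi-step work
    "batch": CHEAP,
    "scraping": CHEAP,
}

AGENT_ROUTES = {
    "orchestrator": STRONGEST,
    "strategy": STRONGEST,
}


def pick_model(agent_role: str, task_kind: str = "") -> str:
    """Per-task route wins if one exists, else per-agent, else the cheap default.

    The precedence order is an assumption made for this example.
    """
    if task_kind in TASK_ROUTES:
        return TASK_ROUTES[task_kind]
    return AGENT_ROUTES.get(agent_role, CHEAP)


if __name__ == "__main__":
    print(pick_model("orchestrator"))
    print(pick_model("research", "scraping"))
    print(pick_model("outbound", "copy"))
```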

Actual trade-offs

Hermes's opinionated defaults are also constraints. If you want explicit control over every step, OpenClaw fits better.

Levels 3 and 4 require real infrastructure knowledge: Docker, virtual private servers, SSH, the control room structure. Don't skip Level 1.

The model sets the ceiling. Hermes makes a capable model more productive. It doesn't make a weak model strategic.

u/Best_Volume_3126 — 4 days ago

What are AI agents to sell to small businesses?

I see a lot of posts of people selling AI agents to small businesses and charging a lot for that, but my question is what are practical AI agents that you can do this with? Like what function do they have?

u/AWRWB — 4 days ago

/goal is the hottest command in Claude Code right now.

/goal runs multi-step tasks to completion without you staying in the loop.

Available in Claude Code, Codex, and Hermes.

How the loop works:

  • You type /goal followed by the end state you want
  • The model executes a step, then checks: "am I done?"
  • If no, it continues. If yes, it stops and reports back.
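The three bullets above boil down to a step-then-check loop. Here is a minimal sketch of that shape, not how any of these tools implement /goal: `run_goal`, the `step` and `is_done` callbacks, and the `max_steps` safety cap are all assumptions added for the example.

```python
from typing import Callable


def run_goal(
    step: Callable[[dict], dict],
    is_done: Callable[[dict], bool],
    state: dict,
    max_steps: int = 50,
) -> tuple[dict, int]:
    """Execute one step, ask "am I done?", and repeat until the answer is yes.

    max_steps is a guard added for this sketch so a goal with no reachable
    finish line cannot loop forever.
    """
    for steps_taken in range(1, max_steps + 1):
        state = step(state)
        if is_done(state):
            return state, steps_taken
    raise RuntimeError(f"goal not reached after {max_steps} steps")


if __name__ == "__main__":
    # Toy goal: migrate three posts, one per step.
    def migrate_one(s: dict) -> dict:
        return {"remaining": s["remaining"] - 1, "migrated": s["migrated"] + 1}

    final, steps = run_goal(
        migrate_one,
        lambda s: s["remaining"] == 0,
        {"remaining": 3, "migrated": 0},
    )
    print(final, steps)
```

Because the done check runs after every step, the loop can stop early or keep going as the state changes, which a single long prompt cannot do.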

Use it for jobs with many steps and a clear finish line:

  • "Build my course landing page: hero, 5 modules, 3 testimonials, FAQ, and Stripe checkout"
  • "Migrate my 80 blog posts from WordPress to Beehiiv, fix every broken image and internal link along the way"
  • "Process every customer support ticket from last month: categorize them, draft template replies, and document the top 5 recurring issues"

Skip it for simple tasks. Single-turn prompts ("write a tweet", "explain X") don't need the overhead.

To write a tight /goal prompt, paste this into Claude Code, Codex, or Hermes:

>"Write me a /goal prompt. Ask me what I'm trying to do first, then keep asking follow-up questions until you can describe 'done' in specific, measurable terms."

Prefix the output with /goal and run it.

The self-check loop is what separates this from a long prompt. A long prompt front-loads all the conditions. /goal re-evaluates after every step, so it handles branching jobs where the path isn't fully predictable upfront.

u/Best_Volume_3126 — 6 days ago