u/ksrijith

We tested single-agent vs multi-agent on a real enterprise task. Single agent was 10-20x cheaper and the only one that got the right answer.

I'm building an open-source multi-agent framework and spent the last few weeks testing it against a real enterprise solution design task. This was not a toy benchmark: it was an actual enterprise ticket that required cross-referencing Jira comments, Java source code, Process Flow Config XMLs, and Confluence design docs to produce a correct technical document.

The setup:

  • 4 specialist worker agents (Jira researcher, code analyst, config analyst, docs researcher) coordinated by an architect agent, with a synthesizer combining everything
  • Each worker had focused MCP tools for their domain
  • We tried 4 different multi-agent configurations over multiple days

What happened with multi-agent (4 attempts):

Attempt | Core error
1       | Invented an attribute that doesn't exist
2       | Misclassified the ticket as a different initiative entirely
3       | Got the ticket's actual intent wrong
4       | Imported scope from a different ticket (which had a similar name)

Each attempt used 30,000-70,000 tokens across different tools and agents. Each made a different fundamental error.

What happened with single agent:

  • One agent with ALL tools (Jira + code + CDT + Confluence + output) in one context window
  • Kimi K2.6 (cheap model, $0.73/1M input)
  • Only 3,454 tokens total
  • First doc to correctly identify the actual problem, name the right code sites, quote the right Jira comments, and recommend fixes.

It wasn't perfect but only needed minor fixes to make the solution workable.
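As a rough sanity check on the "10-20x" figure, the token counts quoted above can be compared directly. This is only a sketch: it assumes every token is billed at the quoted input price and that the multi-agent runs were priced at the same rate, neither of which the post states.

```python
# Figures quoted in the post; the billing assumptions are illustrative only.
SINGLE_AGENT_TOKENS = 3_454
MULTI_AGENT_TOKENS = (30_000, 70_000)  # low and high end per attempt
PRICE_PER_MILLION_INPUT = 0.73         # USD, the Kimi input price quoted above


def token_ratio(multi: int, single: int) -> float:
    """How many times more tokens the multi-agent run used."""
    return multi / single


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost if every token were billed at the input rate (an assumption)."""
    return tokens * price_per_million / 1_000_000


low, high = (token_ratio(t, SINGLE_AGENT_TOKENS) for t in MULTI_AGENT_TOKENS)
single_cost = cost_usd(SINGLE_AGENT_TOKENS, PRICE_PER_MILLION_INPUT)

print(f"multi-agent used {low:.1f}x to {high:.1f}x the tokens")
print(f"single-agent run: about ${single_cost:.4f}")
```

On these numbers the token ratio comes out at roughly 9x to 20x, which is in line with the headline claim if cost scales with tokens.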

Based on the logs and traces captured at each agent level, here's my understanding of why multi-agent failed:

The task required connecting dots across multiple sources. A Jira comment mentions a class name -> read that class -> find it references a config XML -> fetch that config -> discover a condition that gates the behavior the ticket wants to change. This chain of reasoning needs to happen in ONE context window.

But here is what was happening with multi-agent:

  • Worker A finds the Jira comment but doesn't know about the code
  • Worker B reads the code but doesn't know which Jira comment matters
  • Worker C fetches Process Flow Config XMLs but doesn't know which code path to trace
  • The architect gets summaries from each and tries to connect them — but summaries lose the specific details that matter

Information was destroyed at every handoff. The architect was reasoning over shadows of the actual data, and a lot of information was never fetched at all, because the full picture was never in a single context to work from.

This experience showed me when multi-agent just doesn't work:

  • Anything requiring cross-source reasoning (solution design, root cause analysis, debugging)
  • When the total data fits in one context window (most enterprise tasks)
  • When coordination cost (token overhead, summarization loss) exceeds the parallelism benefit

When multi-agent DOES make sense:

  • True parallelism on independent tasks (monitoring multiple services, processing document batches)
  • Scale beyond one context window (millions of log lines need filtering before reasoning)
  • Each agent has a genuinely independent domain (home automation: lighting agent, HVAC agent, security agent)

My final takeaway from the experiment:

The value isn't in agent count — it's in good tools and skills that give the model the right context. MCP servers, structured search, code reading tools — these are what made the single agent succeed. Adding more agents just added more ways to lose information.

Multi-agent is a tool, not a goal. Use it when parallelism genuinely helps. Default to single agent with good tools for anything requiring deep reasoning.

What has been your experience with multi-agent setups? Where have they really worked, and where have they failed?

reddit.com
u/ksrijith — 3 days ago


I built a framework where multi-agent swarms are YAML files, not code.

I work on enterprise projects where you have thousands of documents, dozens of APIs, configuration dumps, and project code scattered across different systems. Last year I needed multi-agent setups to make sense of all this and kept running into the same problem: every time I wanted to change who does what (add an agent, swap a model, give someone a new tool), I was back in Python rewriting LangGraph state graphs.

So I built SwarmKit.

agents:
  root:
    role: root
    model: { provider: openrouter, name: meta-llama/llama-3.3-70b-instruct }
    children:
      - id: researcher
        role: worker
        archetype: domain-researcher
      - id: analyst
        role: worker
        archetype: code-analyst

The runtime then compiles this into a LangGraph state graph. So when you change the YAML, the graph changes. No Python to touch.
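To make "the YAML is the graph" concrete, here is a minimal sketch of how delegation edges could be derived from that topology. It is not SwarmKit's actual compiler and it does not call LangGraph; the dict simply mirrors what a YAML parser would return for the snippet above, and `delegation_edges` is a hypothetical helper.

```python
from typing import Any

# The topology above, written as the dict a YAML parser would produce.
topology: dict[str, Any] = {
    "agents": {
        "root": {
            "role": "root",
            "model": {
                "provider": "openrouter",
                "name": "meta-llama/llama-3.3-70b-instruct",
            },
            "children": [
                {"id": "researcher", "role": "worker", "archetype": "domain-researcher"},
                {"id": "analyst", "role": "worker", "archetype": "code-analyst"},
            ],
        }
    }
}


def delegation_edges(node_id: str, node: dict[str, Any]) -> list[tuple[str, str]]:
    """Walk the agent tree depth first and return (parent, child) pairs."""
    edges: list[tuple[str, str]] = []
    for child in node.get("children", []):
        edges.append((node_id, child["id"]))
        edges.extend(delegation_edges(child["id"], child))
    return edges


edges = delegation_edges("root", topology["agents"]["root"])
print(edges)
```

Adding a third child to the dict adds a third edge with no other code changes, which is the property the YAML approach is after.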

What it actually does in practice

So I've been running this on a real enterprise project. The workspace has 5 different agent topologies, 21 skills, and 9 MCP tool servers (ChromaDB for docs, config parsers, API documentation, Jira, Confluence, code search, PDF reader with vision, etc). Mostly for content ingestion and research. The project is not yet mature enough to write code.

When someone asks "how does feature X work in our project?", the root agent sends the question to both a researcher and a code analyst. The researcher searches project docs, configuration, API references, and Jira tickets. The analyst greps the source code and reads specific lines from the relevant files. Both run in parallel. The root combines both perspectives into one synthesized answer.

One question, two specialists, merged result. The topology YAML defines who can delegate to whom. The runtime handles the rest.

Things I learned the hard way

Tool names matter more than prompts. I had a tool called get-api-docs in a code analyst's list. When users asked how the code builds something, the model called that tool every time, and it returned generic documentation, not what the project's actual code did. No amount of "DO NOT use this tool for code questions" in the system prompt changed the behaviour. I ended up removing the tool from the list. Problem gone.

The lesson: shape agent behaviour through tool availability, not prompt instructions. If a tool name matches what the user asked, the model will call it regardless of what you wrote in the prompt.
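A minimal sketch of shaping behaviour through tool availability, assuming a per-archetype allowlist. Only `get-api-docs` and the two archetype names come from the post; the other tool names and the `tools_for` helper are hypothetical.

```python
# Every tool the workspace knows about. Only get-api-docs is named in the
# post; the other names are made up for illustration.
ALL_TOOLS = {"grep-code", "read-file", "get-api-docs", "search-docs"}

# Per-archetype allowlist. The code analyst simply never sees get-api-docs,
# so no prompt instruction is needed to keep it from calling that tool.
ARCHETYPE_TOOLS = {
    "code-analyst": {"grep-code", "read-file"},
    "domain-researcher": {"search-docs", "get-api-docs"},
}


def tools_for(archetype: str) -> list[str]:
    """Return the tools exposed to an archetype; unknown archetypes get none."""
    allowed = ARCHETYPE_TOOLS.get(archetype, set())
    return sorted(allowed & ALL_TOOLS)


print(tools_for("code-analyst"))
print(tools_for("domain-researcher"))
```

Defaulting unknown archetypes to an empty list is a deliberate choice in this sketch: a missing entry fails closed instead of exposing everything.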

Models say "let me look into that" and then stop. After a search returned results, the model would respond with "Let me examine the file..." without actually calling the file reader. Just planning language, no action. I added detection specifically for this case: if the response is short and contains phrases like "let me" or "I'll examine", the runtime sends it back with "you described what you plan to do but didn't do it." Small thing, but it eliminated a whole class of lazy non-answers. I call it nudging the agent. I also capped the number of nudges allowed, basically a circuit breaker, to prevent infinite loops. It works for the most part, and when it doesn't, that usually means the input prompt needed to be better.
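The nudge loop could look roughly like this. It is a sketch, not the runtime's code: the length threshold, the nudge limit, and the `call_model` signature (returning the text plus whether a tool call was made) are all assumptions.

```python
from typing import Callable

PLANNING_PHRASES = ("let me", "i'll examine")  # phrases quoted in the post
MAX_RESPONSE_CHARS = 200  # hypothetical threshold for a "short" response
MAX_NUDGES = 2            # hypothetical circuit-breaker limit
NUDGE = "you described what you plan to do but didn't do it."

# Hypothetical model interface: takes the messages so far and returns
# (response text, whether the response included a tool call).
ModelFn = Callable[[list[str]], tuple[str, bool]]


def looks_like_plan_only(text: str, made_tool_call: bool) -> bool:
    """True when a short response talks about acting but calls no tool."""
    if made_tool_call:
        return False
    lowered = text.lower()
    return len(text) <= MAX_RESPONSE_CHARS and any(p in lowered for p in PLANNING_PHRASES)


def run_with_nudges(call_model: ModelFn, prompt: str) -> tuple[str, int]:
    """Re-prompt plan-only responses, giving up after MAX_NUDGES nudges."""
    messages = [prompt]
    nudges = 0
    while True:
        text, made_tool_call = call_model(messages)
        if not looks_like_plan_only(text, made_tool_call) or nudges >= MAX_NUDGES:
            return text, nudges
        messages += [text, NUDGE]
        nudges += 1


def fake_model(messages: list[str]) -> tuple[str, bool]:
    """Stand-in model: stalls at first, acts once it has been nudged."""
    if NUDGE in messages:
        return "FlowGate.java gates the behaviour behind a config flag.", True
    return "Let me examine the file...", False


def stubborn_model(messages: list[str]) -> tuple[str, bool]:
    """Stand-in model that never acts, to exercise the circuit breaker."""
    return "Let me examine the file...", False


answer, used = run_with_nudges(fake_model, "How does feature X work?")
_, gave_up_after = run_with_nudges(stubborn_model, "How does feature X work?")
print(answer, used, gave_up_after)
```

The length check matters: a long answer that happens to contain "let me" should not be bounced back.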

Raw tool output is useless for anyone who isn't a developer. Vector search similarity scores, truncated grep lines, JSON config dumps: that's what most agents were returning as "answers." Adding one extra LLM call where the agent sees its own tool results and writes a coherent response changed everything. It costs one additional model call per turn but makes the output actually usable.

Conversation history grows fast and agents get confused. After 4-5 turns, the context was full of raw tool outputs from previous turns. The model would get confused, repeat old findings, or contradict itself. This wasted tokens and also caused hallucinations. Three things helped:

  • Tool result caching: the same search in the same conversation returns from cache instead of re-executing. This works extremely well for deterministic tool calls.
  • History compaction — only the last 3 turns stay full, older turns become one-line summaries
  • Tool result truncation — large outputs get trimmed before entering context, full result stays in cache

The cost thing

This was honestly the part that surprised me most. The runtime allows each agent to configure its own model in the YAML. For example:

  • Router: llama-3.3-70b at $0.10/M tokens, which only decides who handles the question
  • Workers: deepseek-chat at $0.32/M, doing the actual reasoning and tool use
  • Tool calls (grep, file read, vector search, config lookup): $0, all local MCP servers

Over a full working day with 507 requests and 1.9M tokens, the total cost was only $0.33. I double-checked this number because it seemed wrong. The trick is that most of the work is tool calls that run locally for free. The LLM only handles routing and synthesis.
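The quoted numbers can be cross-checked with a little arithmetic. The sketch below assumes a single flat rate per model (no separate input and output prices) and that only the two quoted models were billed. The post does not give a token breakdown, so the router share is an inference, not a measurement.

```python
# Figures quoted in the post.
TOTAL_TOKENS = 1_900_000
TOTAL_COST = 0.33    # USD for the day
REQUESTS = 507
ROUTER_RATE = 0.10   # USD per million tokens
WORKER_RATE = 0.32   # USD per million tokens

# Effective blended price across all tokens.
blended_rate = TOTAL_COST / TOTAL_TOKENS * 1_000_000

# If tokens were split only between router and workers (an assumption),
# solve share * ROUTER_RATE + (1 - share) * WORKER_RATE = blended_rate.
router_share = (WORKER_RATE - blended_rate) / (WORKER_RATE - ROUTER_RATE)

cost_per_request = TOTAL_COST / REQUESTS
tokens_per_request = TOTAL_TOKENS / REQUESTS

print(f"blended rate: ${blended_rate:.3f}/M tokens")
print(f"implied router share of tokens: {router_share:.0%}")
print(f"per request: ${cost_per_request:.5f}, about {tokens_per_request:.0f} tokens")
```

The blended rate lands between the two quoted prices, so the $0.33 total is at least internally consistent with the per-model rates.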

What's been implemented today:

  • 7 model providers — The runtime supports OpenRouter, Anthropic, OpenAI, Google, Groq, Together, Ollama. You can mix and match per agent.
  • MCP tool servers — Confluence, Jira, ChromaDB, code search, PDF reader with vision (Gemini Flash describes diagrams), filesystem
  • Conversational authoring — swarmkit init . creates a workspace through conversation. swarmkit author skill . creates new skills. The workspace I run in production grew from 11 to 21 skills this way.
  • Tool result caching — same call in the same conversation returns from a content-addressed cache
  • History compaction — old turns become summaries, raw tool output never enters conversation history
  • Parallel delegation — when the root sends to multiple workers, they run concurrently via asyncio.gather
  • Governance abstraction — policy checks on every action (honestly, this part is more designed than fully implemented — the boundaries are real, the full judicial tiering isn't wired yet). I used Microsoft's AGT as the base for governance.
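For the parallel delegation item, this is roughly what fan-out with `asyncio.gather` looks like. The worker bodies are placeholders (a sleep instead of tool and model calls) and the function names are hypothetical. Only the use of `asyncio.gather` comes from the post.

```python
import asyncio


async def researcher(question: str) -> str:
    """Placeholder worker; the sleep stands in for doc search and a model call."""
    await asyncio.sleep(0.05)
    return f"docs view of: {question}"


async def analyst(question: str) -> str:
    """Placeholder worker; the sleep stands in for grep, file reads, and a model call."""
    await asyncio.sleep(0.05)
    return f"code view of: {question}"


async def root(question: str) -> str:
    """Send the question to both workers concurrently and merge the results."""
    # gather returns results in the order the awaitables were passed in,
    # regardless of which worker finishes first.
    findings = await asyncio.gather(researcher(question), analyst(question))
    return " | ".join(findings)


answer = asyncio.run(root("how does feature X work?"))
print(answer)
```

Because both sleeps overlap, the whole call takes about as long as the slower worker, not the sum of the two.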

What's not so great yet

  • Output quality varies between runs. Same prompt, same model, but different tool call order. With temperature at 0.3, the model samples differently each time. Some runs are excellent, some miss things.
  • swarmkit eject doesn't exist yet. The design says you should be able to export standalone LangGraph code. This turned out to be more complicated than I had originally thought. It's still in the plan but hasn't been implemented yet.
  • No web UI. It's CLI only right now. Personally, that works for me and for developers in general, but it might not be great for everyone else. A web UI is planned for a future release.
  • Large files overwhelm the model. A 2,000-line source file as a single tool response can exceed context. To mitigate this I added line-range reading but the agent doesn't always use it.
  • Models hallucinate tool results. The agent sometimes says "I downloaded the file" without actually calling the download tool. We added verification, but it's not foolproof.
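For the large-file problem, a line-range reader can be quite small. This is a sketch of the idea, not the project's tool: the `read_lines` name, the 200-line cap, and the numbered output format are assumptions.

```python
import tempfile
from pathlib import Path

MAX_LINES = 200  # hypothetical cap on lines returned per call


def read_lines(path: str, start: int, end: int) -> str:
    """Return lines start..end (1-indexed, inclusive), capped at MAX_LINES."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    end = min(end, start + MAX_LINES - 1)
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    selected = lines[start - 1:end]
    # Prefix each line with its number so the agent can ask for neighbours.
    return "\n".join(f"{n}: {line}" for n, line in enumerate(selected, start=start))


with tempfile.TemporaryDirectory() as tmp:
    # Build a 2,000-line stand-in for a large source file.
    src = Path(tmp) / "Big.java"
    src.write_text("\n".join(f"// line {i}" for i in range(1, 2001)), encoding="utf-8")
    snippet = read_lines(str(src), 10, 12)
    capped = read_lines(str(src), 1, 2000)

print(snippet)
print(len(capped.splitlines()))
```

Enforcing the cap inside the tool means an agent that asks for the whole file still gets a bounded response, which helps when it doesn't pick a sensible range on its own.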

Try it

uv tool install swarmkit-runtime
swarmkit init my-swarm/

You can find the code at https://github.com/delivstat/swarmkit

The design doc is in the repo itself, and it's opinionated.

MIT license.

I'm genuinely looking for feedback, especially from people who've built multi-agent systems and hit similar problems. What patterns worked for you? What did I get wrong?

u/ksrijith — 13 days ago
