
r/mcp

5 practical problems with MCP right now (and a local tool that fixes them)
I've been running 5+ MCP servers across multiple agent sessions every day for months. My MCP server journey has been a bit of a roller-coaster: first I loved them (probably too much, judging by my token usage), then I joined the "MCP is dead" gang, and now I've finally landed on being pragmatic. MCP is great in concept but, as most of us know, there are some rough edges that keep biting. Figured it was time to share what I kept running into and what I ended up building (and using) to fix it.
1. Tool bloat in the system prompt
Connect 5 MCP servers and suddenly your agent has 50-80 tool definitions in its system prompt/context. Then every API turn reads all of these over and over again, even when the agent only needs 2. Thousands of tokens compound over the session just to list tools. This is a well-known problem. Some clients like Claude Code let you filter tools per server, but most don't (Codex, Cursor, Windsurf, Claude Desktop), and even Claude Code's filtering is manual and static.
2. Sequential calls eat your context alive
Every tool call adds ~75 tokens of structural overhead plus ~150 tokens of the model going "now I'll fetch the next one" to nobody. 40 of those in a session and thousands of tokens are just the agent talking to itself. This is also an issue for CLI tool calls; MCP is actually more efficient, but the context still fills up with filler, sessions have to be compacted earlier, and output quality tanks as the model re-reads junk from 40 turns ago.
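To put a number on that, here is a quick back-of-the-envelope sketch using the rough per-call estimates above. The constants are the approximations from this post, not measurements from any particular client, and the function name is just for illustration.

```python
# Rough per-session overhead from sequential tool calls.
# Both constants are the approximate figures quoted in the post.
STRUCTURAL_TOKENS_PER_CALL = 75    # framing overhead per tool call
NARRATION_TOKENS_PER_CALL = 150    # "now I'll fetch the next one" filler


def sequential_overhead(calls: int) -> int:
    """Tokens spent on overhead alone for `calls` sequential tool calls."""
    return calls * (STRUCTURAL_TOKENS_PER_CALL + NARRATION_TOKENS_PER_CALL)


print(sequential_overhead(40))  # 40 * 225 = 9000
```

That is only the overhead added once; since the whole context is re-read on later turns, the effective cost is higher.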
3. MCP server restarts kill your session
An MCP server disconnects, updates, changes its tool list, or crashes. In Codex and other clients, that means restarting the entire agent session; even in Claude Code you have to reconnect with /mcp to get access to updated tools. Context, reasoning, progress: all gone. This happens way more often than it should and is a really annoying productivity-killer.
4. Process explosion with multiple sessions
6 agent sessions x 5 MCP servers = 30 stdio processes and 4+ GB of RAM. Each session spawns its own copy of every server. Most of them sitting there idle.
5. Existing solutions are server-side
Tools like Bifrost (which is great btw) touch on some of this, but they're hosted products or self-hosted infrastructure. Not something you're going to deploy just to get tool scoping or call batching. There's no local control plane that sits between your agents and your MCP servers.
What I built: callmux, a local MCP multiplexer that sits between any agent and any MCP server. It wraps your existing MCP server configs transparently (no rewiring) or runs as a standalone shared daemon. It is pretty flexible with lots of features, but most are optional, have sane defaults, and are tweakable through configuration.
How it fixes each issue:
- Tool bloat - Supports two modes: either tool scope filtering or a meta-only mode, which hides all downstream tools and exposes 11 meta-tools. The agent discovers tools via semantic search and calls them through callmux_call. System prompt size stays fixed no matter how many servers you add. Also gives per-server tool whitelisting to any client, even ones without native support.
- Sequential calls - callmux_parallel, callmux_batch, callmux_pipeline. 10 sequential calls become 1. My tool calls dropped to ~15% of the original count on average, about 1,350 tokens saved per batch of 7.
- Session death - A small (optional) callmux stdio bridge that auto-reconnects when downstream servers hiccup or tools change. The agent session never notices. Hot-reload server configs without restarting anything. Especially wonderful when developing and testing your own MCP servers, it just works.
- Process explosion - Shared server mode (optional): run callmux once, all sessions connect over HTTP. 30 processes down to 6, shared cache across sessions.
- Local, not hosted - MIT, npm install -g callmux or just npx. The whole point is: your machine, your data.
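For anyone wondering what "10 sequential calls become 1" looks like in principle, here is a minimal sketch of the batching idea. This is not callmux's code or API: fake_tool, run_parallel, and tokens_saved are hypothetical names, and the 225 tokens per call is just the ~75 + ~150 estimate from problem 2.

```python
import asyncio


async def fake_tool(name: str, delay: float = 0.01) -> dict:
    """Stand-in for a downstream MCP tool call (hypothetical)."""
    await asyncio.sleep(delay)
    return {"tool": name, "ok": True}


async def run_parallel(calls: list[str]) -> list[dict]:
    """Run all calls concurrently; results come back in request order."""
    return await asyncio.gather(*(fake_tool(c) for c in calls))


def tokens_saved(batch_size: int, overhead_per_call: int = 225) -> int:
    """One batched call replaces batch_size calls, so batch_size - 1 overheads go away."""
    return (batch_size - 1) * overhead_per_call


results = asyncio.run(run_parallel([f"tool_{i}" for i in range(7)]))
print(len(results), tokens_saved(7))  # 7 results, 6 * 225 = 1350 tokens
```

The agent sees one request and one combined response instead of seven round trips, which is where the 1,350 figure for a batch of 7 comes from under these estimates.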
Other stuff in there (optional of course): interactive setup wizard, response caching with TTL, read-only live dashboard, recipes (multi-step workflows you define once and call by name), dry-run mode, enterprise security (auth, RBAC, rate limiting, CIDR allowlists, audit logging, Prometheus metrics, OIDC), config hot-reload, systemd/launchd daemon install, file references for long arguments, and result pagination for large responses.
Callmux works with Claude Code, Codex, Claude Desktop, Cursor, Windsurf, pretty much anything that speaks MCP stdio or HTTP.
npx -y callmux setup
I hope that you find it as useful as I do, and contributions are welcome. Happy to answer questions and hear what you think of this.
[Open Source] SoMatic: A Vision-only Framework for OS-Native Agents (+20% vs GPT-5.5 on ScreenSpot-Pro)
Hey everyone,
I’ve been spending way too much time lately trying to get agents to actually use a computer beyond the browser.
The biggest wall I kept hitting is that while multimodal LLMs are amazing at looking at a screenshot and telling you what's there, they are surprisingly bad at actually clicking the right pixel. In the browser, we have the DOM to help us out, but once you move to native OS apps, you're stuck with accessibility trees. If you’ve ever tried to automate a legacy Windows app or a custom Electron build, you know how inconsistent and "non-deterministic" those trees can be.
So, I decided to try a purely vision-based approach and built SoMatic.
It basically brings the "Set-of-Marks" (SOM) prompting style to the OS level. I used a fine-tuned YOLO model to detect buttons, icons, and text fields across Mac, Windows, and Linux. It throws a numerical overlay on the screen so the agent doesn't have to guess coordinates, it just says "click 4" and the framework handles the rest.
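A minimal sketch of the "click 4" idea, assuming the detector returns pixel bounding boxes. Box, assign_marks, and click_point are hypothetical names for illustration and are not taken from the SoMatic codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Pixel bounding box of one detected UI element (hypothetical shape)."""
    x1: int
    y1: int
    x2: int
    y2: int


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Give every detection a numeric mark, starting at 1."""
    return {mark: box for mark, box in enumerate(boxes, start=1)}


def click_point(marks: dict[int, Box], mark_id: int) -> tuple[int, int]:
    """Resolve a mark like 'click 4' to the center pixel of its box."""
    box = marks[mark_id]
    return ((box.x1 + box.x2) // 2, (box.y1 + box.y2) // 2)


marks = assign_marks([Box(0, 0, 10, 10), Box(100, 200, 140, 220)])
print(click_point(marks, 2))  # (120, 210)
```

The agent only ever emits the mark number; turning it into coordinates stays deterministic and outside the model.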
The part that actually shocked me: I ran some benchmarks against ScreenSpot-Pro and it’s currently beating the GPT-5.5 (high) baseline by about 20%, and OmniParser v2.0 by roughly 40%.
One weird thing I found: During ablation testing, the model actually performed better when it only had the textual coordinates of the boxes rather than seeing the visual labels on the screenshot. I'm thinking the YOLO detections might be adding too much visual noise at certain thresholds, but I’m still digging into that.
I’ve also included a stdio MCP server, so if you're using Claude Code or anything MCP-compatible, you can plug this in and it’ll start using your machine immediately.
In the video, I’m using it to have Claude Code open a random PDF, find a chess position, and then go replicate it 1-to-1 on Chess.com.
It’s all open source. If you want to play around with it or (more likely) help me find all the ways it breaks on different OS setups, I’d love the feedback!
GitHub: https://github.com/Smyan1909/SoMatic
To try it out:
npm install -g somatic-cli/cli
npx skills add Smyan1909/SoMatic
Let me know what you think about the vision-only vs. accessibility-tree approach. Is anyone else finding that metadata is becoming more of a hurdle than a help?
Free pro access for 10 people in exchange for feedback
Hello guys, I have just built this.
You run one command and all your agents get 4000+ skills and 2000+ MCPs that they can pull in at any time. It's also a shared network for all agents, where your agent can learn from other agents' mistakes.
I am willing to give free pro access to 10 people if you have time to test it and give me real feedback on how we can improve it.
I got tired of watching LLMs make 30 sequential MCP tool calls, so I built Code Mode for Go
Quick context for anyone who missed it: Cloudflare made the case a while back that "function calling" is the wrong abstraction for tool-heavy LLM workloads. When a model needs to chain tools, you get this absurd round-trip dance: call one tool, read the result back into context, call another, read it back, repeat. Every hop burns tokens and pollutes the context window. Their pitch was simple: stop calling tools one at a time. Let the model write a small program, and expose the tools as functions inside that program.
Made too much sense to ignore. So I built it for Go + MCP.
Repo: https://github.com/Protocol-Lattice/codemode
It sits on top of mark3labs/mcp-go (the de facto Go MCP SDK) and uses Yaegi as a sandboxed Go interpreter to actually execute the generated snippets. The snippet runs inside an injected codemode helper that exposes the MCP toolset.
There's also a higher-level orchestrator (CodeModeMCP) that runs the full pipeline:
- Ask the LLM if a tool is even needed (skip everything if not)
- Ask it to pick the relevant tools from whatever's available
- Ask it to write a Go snippet that solves the task
- Run the snippet in Yaegi, return the value plus captured stdout/stderr
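The core trick (tools exposed as plain functions inside a model-written snippet) can be sketched in a few lines. The sketch is in Python rather than Go for brevity, and it is not the codemode API: search_docs, word_count, and run_snippet are hypothetical, and exec with trimmed builtins is not a real sandbox the way an interpreter like Yaegi is meant to be.

```python
def search_docs(query: str) -> list[str]:
    """Stand-in for an MCP tool (hypothetical)."""
    return [f"{query} intro", f"{query} auth spec"]


def word_count(text: str) -> int:
    """Second stand-in tool (hypothetical)."""
    return len(text.split())


TOOLS = {"search_docs": search_docs, "word_count": word_count}

# What the model would write: one small program instead of three tool calls.
SNIPPET = """
hits = search_docs("mcp")
result = sum(word_count(h) for h in hits)
"""


def run_snippet(source: str, tools: dict):
    """Execute the snippet with only the tools and a few builtins in scope."""
    namespace = {"__builtins__": {"sum": sum, "len": len}, **tools}
    exec(source, namespace)  # illustration only; NOT a security boundary
    return namespace.get("result")


print(run_snippet(SNIPPET, TOOLS))  # 2 + 3 = 5
```

Only the final value goes back into the model's context; the intermediate tool results never do.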
Endara v0.1.7 — local MCP relay now auto-converts tool responses to TOON for ~40-60% token savings
I posted about Endara two weeks ago — an open-source MCP relay (Rust) that aggregates local MCP servers behind one endpoint. The feature people kept coming back to was the JS execution engine: chaining multiple tool calls in a single script instead of burning round-trips. That feedback shaped v0.1.7.
GitHub: https://github.com/endara-ai/endara-desktop
TOON output — Every MCP tool response is JSON, but JSON is token-wasteful for the structured data tools typically return (repeated field names on every row). The relay now auto-converts responses to TOON (Token-Oriented Object Notation) — field names declared once, CSV-like data rows, ~40-60% fewer tokens, lossless round-trip back to JSON. On by default; --no-toon to disable.
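To make the "field names declared once" point concrete, here is a simplified sketch of a tabular encoding loosely modeled on TOON's array form. It is not Endara's implementation and not a complete TOON encoder (no quoting, nesting, or type handling); to_tabular is a hypothetical helper.

```python
import json


def to_tabular(name: str, rows: list[dict]) -> str:
    """Encode uniform rows as one header line plus CSV-like data lines."""
    fields = list(rows[0].keys())
    if any(list(row.keys()) != fields for row in rows):
        raise ValueError("rows must all share the same fields in the same order")
    header = name + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])


rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]
encoded = to_tabular("users", rows)
print(encoded)
# users[2]{id,name}:
#   1,Ada
#   2,Lin
print(len(json.dumps({"users": rows})), "chars as JSON vs", len(encoded), "chars tabular")
```

Character counts are only a rough proxy for tokens, but the gap grows with the number of rows because the keys are no longer repeated on every row.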
Logging overhaul — Colored structured logs, per-endpoint spans, tool-call event tracing. Desktop app now has filtering, live-streaming per-endpoint logs, and tool-call highlighting with duration badges.
OAuth hardening — Self-healing token endpoint discovery, DCR secret fix, three separate reliability improvements for OAuth MCP servers.
The architectural point worth making: cloud MCP gateways route your tool call traffic through hosted infrastructure. Endara is local. Rust binary on localhost, JS execution via the Boa engine in-process. Nothing leaves your machine.
If you're already running Endara, the app auto-updates — Settings → Check for Updates.
Open source (MIT): https://endara.ai
Happy to go deep on the architecture, TOON, or the Boa engine sandbox.
I built an AI agent runtime in Go that compiles and tests generated code before delivering it: 35 files, 156 tests, zero dependencies
I've been building ARK (AI Runtime Kernel) for the past 10 months. It's an open-source runtime that sits between your AI agent and the LLM, governing every decision the model makes.
The core idea: models shouldn't control the system. The runtime should.
What it does:
When you ask ARK to write Go code, it doesn't just pass the prompt to GPT and hand you back whatever comes out. The runtime classifies the task, optimizes the prompt, generates the code, then runs a 6-phase verification pipeline before you see anything:
├─ Step 1: ✓ Reasoning verified (confidence: 70%)
│ 🧪 Verification: tested (score: 100%)
│ ✅ Compiled ← go build
│ ✅ Executed ← go run
│ ✅ Tests passed ← auto-generated tests
│ ✅ Lint clean ← go vet
If the code fails compilation, ARK feeds the compiler error back to the model, forces a stronger model, and retries. If it still fails after 2 attempts, it refuses to deliver broken code. It never claims success for code that doesn't compile.
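The retry behavior described above boils down to a small loop. This sketch is in Python for brevity and is not ARK's code: generate and verify are injected callables standing in for the LLM call and the compile/test step, and all names are hypothetical.

```python
from typing import Callable, Optional

Generate = Callable[[str, Optional[str]], str]   # (model, previous_error) -> code
Verify = Callable[[str], tuple[bool, str]]       # code -> (ok, error_message)


def generate_verified(generate: Generate, verify: Verify,
                      models: list[str]) -> Optional[str]:
    """Try each model in order, feeding the last error back; None means refuse."""
    feedback: Optional[str] = None
    for model in models:            # e.g. cheap model first, then a stronger one
        code = generate(model, feedback)
        ok, error = verify(code)
        if ok:
            return code
        feedback = error            # the next attempt sees the compiler error
    return None                     # never deliver code that failed verification


# Fakes so the loop can be run without an LLM or a compiler.
def fake_generate(model: str, feedback: Optional[str]) -> str:
    return "good" if model == "strong" and feedback else "bad"


def fake_verify(code: str) -> tuple[bool, str]:
    return (code == "good", "syntax error: unexpected }")


print(generate_verified(fake_generate, fake_verify, ["cheap", "strong"]))  # good
```

With two entries in models, the loop gives up after 2 attempts, matching the limit mentioned above.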
The Go-specific stuff that might interest this community:
The entire runtime is pure Go, zero external dependencies (just stdlib). 35 files, ~16,000 lines, 156 tests, race detector clean. Some things I'm proud of:
- Weighted tool ranking with 6 signals (relevance, success rate, Bayesian confidence, cost, latency, memory bonus) — all computed in microseconds
- Context engine that reduces tool schema tokens from 60K to ~93 (99.9% reduction) by only loading relevant tools
- Per-step model routing: cheap model (gpt-4o-mini) handles tool calls, strong model (gpt-4o) handles reasoning. Cuts costs 80-90%
- Cognitive Governor that verifies every output with calibrated confidence scores
- Auto-fix for common model errors in generated Go code (orphan braces, missing error handling) — detects both tab and space indentation
- Event emitter that writes JSONL for a separate Python memory layer to ingest
Cost: A typical task costs $0.002-$0.005. Not $0.05.
Example output:
go run ./cmd/ark run agent.yaml --task "write a function in Go that reads CSV"
✅ Task completed successfully
Steps: 1 | Tokens: 637 | Time: 5.6s | Cost: $0.002
The generated code compiles, runs, and passes auto-generated tests before you see it.
GitHub: github.com/atripati/ark
I'm a CS undergrad at DePaul in Chicago building this solo. Applied to YC S26 with it. Happy to answer questions about the architecture, the verification pipeline, or why I chose Go for this.
Is anyone running MCP on top of their existing auth?
Spent the prev weekend reading the MCP auth spec and the more i read it, the more it feels like the spec authors assumed everyone is greenfielding their auth stack.
OAuth 2.1, PKCE, DCR (dynamic client registration), scoped tokens per tool are all great but my users live incognito.
Our sessions are cookie-based. half our internal stuff still runs on an old homegrown JWT issuer that nobody in the team wants to touch.
Am i missing something or is the answer simply down to "rip out your auth and rebuild for MCP"?
The only sane path i see is putting an MCP-compliant layer in front of the existing auth (descope's BYOA does this, ory does something close), but it feels like nobody's writing about this and i can't tell if that's because it's obvious or because nobody's tried it yet.
[showcase] Lurkr: static scanner for MCP servers, catches shadow capabilities before deploy
Disclosure: I built this. Posting under showcase tag per rule 4.
A few months back, I was reviewing an MCP server that declared `["search_docs"]` in its manifest. Reading Python source, I found a `@tool` calling `psql -c "$user_input"`, plus imports of `subprocess` and `requests`. The manifest said one thing. The code did several other things. Bandit, Semgrep, and gitleaks all ran clean. None of them parse MCP manifests, none of them cross-reference declared tools against actual code reach.
So I built that rule: `agent.declared_vs_imported_delta`. It parses the MCP manifest, walks the Python AST, and reports tools the agent can actually invoke that were never declared. Have not found another static scanner doing this cross-reference. If I missed one, I would genuinely like to know.
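The general shape of that cross-reference, as a minimal sketch using only the standard library. This is not Lurkr's implementation: the `tool` decorator name, the sample source, and the helper names are assumptions for illustration, and a real rule would also need to resolve aliases and follow imports.

```python
import ast

# Sample server source; my_framework is a made-up module, nothing is imported or run.
SOURCE = '''
from my_framework import tool
import subprocess

@tool
def search_docs(q):
    return []

@tool
def run_sql(user_input):
    return subprocess.run(["psql", "-c", user_input])
'''


def tool_functions(source: str) -> set[str]:
    """Names of functions decorated with @tool, @tool(...), or @x.tool."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for dec in node.decorator_list:
            target = dec.func if isinstance(dec, ast.Call) else dec
            if isinstance(target, ast.Attribute):
                name = target.attr
            else:
                name = getattr(target, "id", None)
            if name == "tool":
                found.add(node.name)
    return found


def undeclared_tools(declared: list[str], source: str) -> set[str]:
    """Tools reachable in code that the manifest never declared."""
    return tool_functions(source) - set(declared)


print(undeclared_tools(["search_docs"], SOURCE))  # {'run_sql'}
```

Because this only parses the source, it stays static and read-only: nothing in the scanned server is executed.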
Lurkr ended up with 13 more rules around MCP and AI-agent code. Two that are specifically MCP-relevant:
- `agent.unverified_mcp_endpoint` flags manifests pointing at external MCP server URLs without identity verification or transport security
- `tool.shell_without_approval` flags MCP tool manifests that enable shell execution without an explicit approval flag
The rest cover credential flows into LLM completion calls, eval/subprocess inside `@tool` functions, prompt-template interpolation of user input, plus hygiene basics (hardcoded API keys, unencrypted PEMs, deploy workflows without approval).
Ran it across 20 public agent reference repos, including several MCP server implementations. 665 findings, median 3.5 per repo. Synthetic TP checks, clean-control FP checks, and a 30-finding manual audit (labels: TP / expected-example / noise) live in `docs/LURKR_BENCHMARK.md`. The audit is honest about the noise floor.
Static-only. Read-only. No network calls during scan. MIT.
pip install lurkr
lurkr scan --path ./your-mcp-server
Source: github.com/agentveil-protocol/lurkr
Scope today: Python MCP servers get bounded AST + manifest analysis. TypeScript and JavaScript MCP servers get manifest and endpoint rules, but not code-level rules yet.
If you run an MCP server in production, what risk patterns are you currently scanning for, and what gaps do you see in the static-analysis coverage specifically for MCP?
i’m building a taste mcp
hi all, i’ve been using personal agents for a while now and they’re pretty good at some things and really bad at others. one thing that annoys me is that i always felt like they never really understood my vibe.
what does that even mean? well they’re really good at doing what they’re told to do, but sometimes youre just looking to explore options, and don’t really know what you’re looking for. in cases like these i feel like its really important for you to define what you consider good “taste”, otherwise you’ll end up with subpar results.
for example: i can give an agent an image and tell it to create a workflow to shop for the specific items in that image. it’s good at things like that. OR, i can tell an agent to go shop for me, and it doesn’t know wtf to even look for because it doesn’t know what i consider “good” and completely misses the mark.
i’m building for that second use case and would really love to get feedback for anyone interested! happy to compensate for your time.
here’s our landing page: https://inspoboard-two.vercel.app/agents
Ask questions across your Markdown notes using a fully local Graph RAG engine. Built for Obsidian vaults, works with any folder of Markdown files. Extracts entity-relation triples from wikilinks & YAML frontmatter, retrieves answers via hybrid search (vector + BM25 + temporal). Multilingual. No cloud. Runs on Ollama.
Is Tavily MCP still worth it or are there better alternatives now?
Anyone else having issues with Tavily MCP or is it just me? I've been using it for a couple of weeks in my Claude setup and the results have been underwhelming. The search quality is fine for broad stuff, but for technical or niche queries it returns garbage.
I also hit the rate limit very quickly on the free tier. I’m looking for something to swap it out with, ideally something that plugs into claude or cursor just as easily. (with a fast integration)
Doesn't need to be free but shouldn't cost an arm and a leg either, lmk guys
Open-source CLI for red-teaming LLM agents before they touch tools and memory
Sharing RedThread, an open-source CLI for AI red-team campaigns:
https://github.com/matheusht/redthread
The angle is AI agents as an attack surface. Prompt injection gets more interesting once the model can call tools, delegate to workers, write memory, retry failed actions, or propose guardrail changes.
RedThread is built for staging/internal targets. It runs LLM red-team campaigns, records traces, scores failures, and can replay exploit and benign cases before treating a defense as evidence.
Current pieces:
- PAIR, TAP, Crescendo, and GS-MCTS attack flows
- JudgeAgent/rubric scoring
- replay-backed defense proposals
- telemetry/drift signals
- agentic checks for tool poisoning, confused deputy paths, canary propagation, and budget amplification
It is not a magic prompt shield and not broad production enforcement.
Looking for people who test agent workflows and can suggest realistic failure cases or target adapters.
Anthropic's new mcp tunnel architecture: the agent never holds the credential
Reading through the 19th May Claude managed agents update. The mcp tunnel update piqued my interest.
Apparently, the setup will be that a small gateway runs inside your network. It opens one outbound mTLS connection to anthropic. The agent reaches private mcp servers through that tunnel. No inbound firewall rules. No public endpoint. The mcp server inside your perimeter holds the credentials. The agent never sees them.
A normal managed agents deployment carries the tokens in the runtime. A long-lived oauth bearer for salesforce. A pat for github. A service account key for the warehouse. All sitting in the agent's context, where prompt injection, tool poisoning, or a supply chain hit can lift them.
With tunnels the credentials move to the perimeter. The agent makes a tool call, the call goes through the tunnel encrypted with a cert the customer issued, and a local mcp server with proper scoping turns it into an authenticated request. A prompt-injected agent has no token to steal. The blast radius now stops at whatever each individual mcp server allows.
Worth comparing to what OpenAI did in April. Their agents sdk update lets you move both the harness and the compute to your side. You can run the whole stack yourself. Anthropic chose not to. The agent loop stays on their infra. Only tool execution and mcp connectivity move out.
You don't own the loop. You own the boundary. Whether that trade lands for you depends on how much you trust anthropic to run the loop and how much vendor lock-in you can stomach.
A few caveats before anyone wires this up in prod:
- Research preview, not ga. Suites and key rotation cadence are not in the public docs yet.
- The orchestration plane runs on anthropic. If they have a bad day your agents have a bad day, and there is no failover path because the loop is not something you can stand up yourself.
- Credentials still exist. They moved from the agent context to an mcp server you operate. That server still needs proper scoping, audit logging, and least-privilege downstream tokens. No architecture trick fixes that part.
For anyone running mcp servers in production: Does the split land in the right place for you, or would you rather own the whole loop the way openai's sdk lets you?
I put together a longer breakdown that sheds more light on the new announcement.
Rook: Notes app for code. Claude, Cursor, and Gemini can save directly to it
Hey everyone 👋
I built Rook because I couldn't find a place for my code notes. I was used to Apple Notes: it's fast and minimal, but it doesn't support code blocks.
Why not use something that exists? I tried:
VS Code. It works for markdown, but I always needed a preview extension to see md files rendered, which felt clunky. And I wanted notes to live outside of any specific codebase, not tied to a repo. Something small, open on the side of my desktop.
Obsidian. Didn't feel right at all. Not designed for the kind of simplicity I was looking for.
Bear, Craft, Notion. Too clunky. Not as minimal or fast as I wanted.
Dedicated snippet apps. Opposite problem. Great for code, no place for the notes around it.
On top of that, coding with AI has multiplied my work and I found myself lacking a dedicated place to capture the ongoing logic.
So I built Rook. It’s a free, local and native Mac app made for code notes:
- markdown support
- syntax highlighted code blocks (17+ languages)
- 5 simple themes
- everything stays on your Mac
- optional MCP support so AI tools can write directly into it
I now just say “save this to Rook” to Claude instead of copy-pasting around.
Rook is live on Product Hunt today 🚀
Would genuinely appreciate any support or feedback: https://www.producthunt.com/products/rook-4
I’m also doing a lifetime discount for the first 100 people who sign up today: https://userook.app
Distributed tracing across stdio MCP: same trace_id on CrewAI client and FastMCP server (SEP-414 + OpenTelemetry + Jaeger)
I put together a short walkthrough of something that tripped me up when building agentic workflows: MCP over stdio is two processes, so your usual “single-app” tracing story breaks unless you propagate W3C context explicitly.
Problem: A CrewAI agent calls MCP tools (get_order, check_inventory, …) in a child process over a pipe. Logs show something failed; they don’t show which LLM round triggered which tool, or whether latency sits in the model or in a specific tools/call.
Approach: Use OpenTelemetry with MCP semantic conventions and SEP-414 trace context in params._meta, so client spans (MCP request: tools/call …) and server spans (MCP server handle request: tools/call) share the same trace_id even though transport is stdio—not HTTP.
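Stripped of the OpenTelemetry SDK, the propagation step looks roughly like this. It is a hand-rolled sketch rather than the demo repo's code: the helper names are hypothetical, and putting a W3C-style traceparent string under the `traceparent` key of params._meta is my reading of the approach described above.

```python
import secrets


def new_traceparent() -> str:
    """Build a W3C-style traceparent: version-trace_id-span_id-flags."""
    trace_id = secrets.token_hex(16)   # 32 hex characters
    span_id = secrets.token_hex(8)     # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def with_trace_context(request: dict, traceparent: str) -> dict:
    """Return a copy of a JSON-RPC request with the context in params._meta."""
    params = dict(request.get("params", {}))
    meta = dict(params.get("_meta", {}))
    meta["traceparent"] = traceparent
    params["_meta"] = meta
    return {**request, "params": params}


def trace_id_of(request: dict) -> str:
    """What the server side does: read the trace_id back out of _meta."""
    return request["params"]["_meta"]["traceparent"].split("-")[1]


call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "check_inventory", "arguments": {"order_id": 1842}}}
traceparent = new_traceparent()
sent = with_trace_context(call, traceparent)
print(trace_id_of(sent) == traceparent.split("-")[1])  # True
```

Since the context rides inside the message body, it survives the stdio pipe where HTTP headers are not available.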
Stack (all local, reproducible):
- CrewAI agent + Ollama (llama3.2)
- FastMCP incident server (synthetic slow/failing inventory for order #1842)
- OTLP → Jaeger
- One-command demo: ./scripts/demo.sh
What you see in Jaeger: crewai.workflow → per-round .llm spans (with gen_ai.input.messages / output when enabled) → MCP client/server spans in one waterfall. The “money shot” is opening check_inventory and reading args + backorder error on the same trace as the agent’s LLM spans.
Video (12 min, architecture + live demo):
https://www.youtube.com/watch?v=qCHK4QlPXh8
Code (MIT):
https://github.com/ekb-dev-ai/mcp-trace-demo
Fast path without Ollama: ./scripts/quick_trace_demo.sh (~5s, MCP + Jaeger only).
Happy to hear how others are handling OTel for MCP—especially HTTP vs stdio and whether you’re standardizing on _meta vs custom headers.
I built a free MCP that lets you analyze your Google Search Console data
I run a small blog and found myself exporting CSVs from Google Search Console every week to add them into Claude and have it analyze my traffic. So I built an MCP that lets Claude do it automatically. You just need to log in with Google once to give it access to your Search Console data.
What it does
- Pulls your Search Console data (queries, pages, clicks, impressions, CTR, position) straight into Claude
- Ask things like "which pages have high impressions but low CTR" or "what queries did I lose ranking on this month"
- Works on any site you have GSC access to
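As an illustration of the first question, this is roughly the filter Claude ends up applying to the rows. It is a sketch over made-up data: the row shape, the thresholds, and low_ctr_pages are assumptions, not the MCP's actual output format.

```python
def low_ctr_pages(rows: list[dict], min_impressions: int = 1000,
                  max_ctr: float = 0.01) -> list[tuple[str, float]]:
    """Pages with plenty of impressions but a click-through rate below max_ctr."""
    flagged = []
    for row in rows:
        ctr = row["clicks"] / row["impressions"] if row["impressions"] else 0.0
        if row["impressions"] >= min_impressions and ctr < max_ctr:
            flagged.append((row["page"], ctr))
    return sorted(flagged, key=lambda item: item[1])   # worst CTR first


# Made-up rows for illustration only.
ROWS = [
    {"page": "/a", "clicks": 5, "impressions": 2000},    # high impressions, low CTR
    {"page": "/b", "clicks": 90, "impressions": 3000},   # healthy CTR
    {"page": "/c", "clicks": 1, "impressions": 50},      # too few impressions
]

for page, ctr in low_ctr_pages(ROWS):
    print(page, f"{ctr:.2%}")
```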
Cost: Search Console data is completely free. There are some rate limits, but that's it. The MCP can do other SEO tasks such as keyword analysis, which is not covered by the free plan since accessing that data does cost me money.
Install: https://calmseo.com/google-search-console-mcp
Sign in with your email, connect Google, then install the MCP into Claude.
Having an account is mandatory because I need to link your MCP session to your Google Account.
This product is brand new, so please send any feedback my way!
I built AgentLighthouse, a local “Lighthouse for AI agents” that scans repos/docs/APIs for agent readiness
hello
The basic idea comes from the fact that more people (including me) use Codex, Claude Code, Cursor, Copilot, MCP tools, etc., but projects are still written only for humans. Agents might fail and struggle to use what you build because setup commands are unclear, docs are stale, OpenAPI operations are under-described, MCP tools are ambiguous, or there is no AGENTS.md/CLAUDE.md/llms.txt/benchmark.
So my project, AgentLighthouse, tries to answer "Can an AI coding agent understand and use this project correctly?"
It scans for things like:
- agent instruction files
- README/docs quality
- setup/test/lint command clarity
- OpenAPI operation quality
- MCP tool descriptions/input schemas
- task benchmarks
- SARIF/CI readiness
- baseline comparison and PR regressions
It is local-first and does not call any paid LLM API. It is neither an AI agent nor a SaaS. Please don't flame me as I'm making no profit out of this 😄. The goal is to make projects easier for existing agents to use.
Try it:
npx @agentlighthouse/cli scan .
Or generate reports:
npx @agentlighthouse/cli@alpha scan . --report-dir agentlighthouse-reports
This is very much an alpha still, I’m mainly looking for feedback from real devs. Thanks for reading :)
Gave Claude AI access to Unity via MCP and didn't touch the engine once. It built Minecraft in 24 hours.
Wanted to test how capable MCP actually is for game development so I set a challenge, build a working Minecraft clone in under 24 hours without me writing a single line of code or touching the engine.
Full breakdown in the video. I'm wondering whether anyone else has been experimenting with MCP in their workflow, and what they have made.
Is MCP really this deserted?
I spent nearly a year perfecting a desktop MCP gateway that creates significant hidden value in terms of management efficiency and token costs.
So far, it only has 16 stars.
Meanwhile, another tool that enables the "plugins" menu for the Codex desktop client—allowing access to third-party providers—already has over 1,000 stars in less than two weeks.
For a tool that exists in such a legal gray area to gain that much traction, is the issue with MCP, or is it me? What should I do? Have you ever worked on serious projects that ended up being very quiet or lacking in traction?