r/LangChain

▲ 15 r/LangChain+5 crossposts

Review my resume: AI Engineer with 3 years of experience, looking for a new job

▲ 63 r/LangChain+42 crossposts

Ask questions across your Markdown notes using a fully local Graph RAG engine. Built for Obsidian vaults, works with any folder of Markdown files. Extracts entity-relation triples from wikilinks & YAML frontmatter, retrieves answers via hybrid search (vector + BM25 + temporal). Multilingual. No cloud. Runs on Ollama.

https://github.com/benmaster82/Kwipu
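To make the "triples from wikilinks & YAML frontmatter" idea concrete, here is a minimal sketch of what such an extraction step could look like. It is not taken from the Kwipu repo: the function name `extract_triples`, the `links_to` relation, and the flat `key: value` frontmatter handling are all assumptions made for illustration.

```python
import re

# [[Target]], [[Target|alias]] and [[Target#heading]] all resolve to "Target".
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def extract_triples(note_title: str, text: str) -> list[tuple[str, str, str]]:
    """Turn frontmatter fields and wikilinks into (subject, relation, object) triples."""
    triples = []
    body = text
    # Assumption: a leading '---' block holds flat "key: value" frontmatter.
    if text.startswith("---\n"):
        end = text.find("\n---", 4)
        if end != -1:
            for line in text[4:end].splitlines():
                key, sep, value = line.partition(":")
                if sep and value.strip():
                    triples.append((note_title, key.strip(), value.strip()))
            body = text[end + 4:]
    # Every wikilink in the body becomes a hypothetical "links_to" edge.
    for match in WIKILINK.finditer(body):
        triples.append((note_title, "links_to", match.group(1).strip()))
    return triples
```

A real vault would need a proper YAML parser for nested or list-valued frontmatter; the sketch only shows where the graph edges come from.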

u/WritHerAI — 10 hours ago

▲ 0 r/LangChain

three model reviewers approved the plan. the human in one seat caught it in a sentence

i had a review chain set up in langgraph: three different models each pass over a plan before it ships, the idea being if one of them is wrong the other two catch it. worked fine until it didn't. a migration plan came through, all three reviewers approved it, and it dropped a column the nightly billing job still read from. none of them flagged it.

took me a while to see why. the three models weren't really independent reviewers, they were all reasoning from the same context i handed them, so they shared the same blind spot. adding a fourth model wouldn't have helped, it would just be a fourth read of the same framing. the miss wasn't "a model got it wrong", it was "nobody in the loop knew the billing job existed".

what actually fixed it was boring. the person who owns billing looked at the plan for ten seconds and said "that column, the nightly job reads it". not a smarter model, a different head with different context.

so i ended up building the thing i wanted out of that. you and your team plan in one live session, each holding a seat (your dba on schema, whoever owns billing on billing), and the models fill the seats nobody's in and double-check the calls the humans make. when nobody on the team actually knows the answer you pull in a verified outside expert who takes a seat too. the models are still there, just as gap-fillers and a second reader, not the whole review panel. what you get out is a versioned plan with the argument underneath, human and model both.

still rough, solo project. but the pattern i'm pretty convinced of now: model-only review chains converge because they share your framing, and the cheapest fix isn't another model in the same chair, it's a seat held by someone whose context is different from yours.

curious if anyone here has gotten genuine disagreement out of a multi-model review chain without a human or a tool forcing different context in. every time i've tried, they just converge.

reddit.com

u/Swarm-Stack — 6 hours ago

▲ 26 r/LangChain+8 crossposts

drinks-sommelier – I created an open-source skill that turns any AI agent into a personal sommelier

Every time I'm at the supermarket, at the wine shop, or at the pub I find myself in front of many types of beers and wines and I never know which one to choose based on my tastes or the food pairing.

So I created drinks-sommelier, a text-based skill for AI agents (it works with OpenClaw, Hermes Agent, OpenCode, Claude Code, Cursor, and any other agent).

⚙️ How it works

You teach your tastes once to the agent: sweet/bitter, alcohol content, preferred styles, beers and wines you already know you love or hate
You send it what you have in front of you: a written list, a photo of the supermarket shelf, a pub menu, a wine list
It searches for up-to-date info on the web for each single product (no hallucinations, no made-up data)
It tells you exactly what to get with a preference score of 0–100% explaining why
It improves on its own over time: every piece of feedback updates the taste profile and the database, making the next recommendations more and more precise

✅ What makes it special

Zero dependencies. No Docker, npm, API key, subscriptions, or external services.
MIT license, 100% open source. Free, modifiable, distributable.
Works with any AI agent. Just show the README to your agent and if needed it adapts to your agent's format.
Self-configuring and self-updating. The first time it guides you through the setup by asking you the right taste questions; then every time you give feedback (I like it / I don't like it) it automatically updates the database without you having to touch anything.
Total privacy: your tastes are stored in local text files. No data ever goes to an external server.

📦 Installation

npx skills add Johell1NS/drinks-sommelier --skill drinks-sommelier

Then ask your agent: *"Help me configure drinks-sommelier"* or simply *"What beer do you recommend?"* — it detects if it hasn't been configured yet and guides you through the initial setup.

🔗 Link

GitHub Repo: https://github.com/Johell1NS/drinks-sommelier

⭐ If you like the idea, drop a star on the repo — it helps me grow it!

Ideas, suggestions, contributions, feedback: more than welcome. 🙌

u/Ill-Tradition1362 — 11 hours ago

▲ 17 r/LangChain+8 crossposts

Chimera: an open-source, self-hostable agent that runs on local models (any OpenAI-compatible endpoint) and can fuse several at once

I've been building an open-source agent (Apache-2.0) and wanted to share it here because it's designed to be fully local and self-hostable: it talks to any OpenAI-compatible endpoint, so Ollama / llama.cpp / vLLM / LM Studio all work as the backend. No cloud lock-in, your keys and data stay yours.

The core idea is LLM-Fusion: for the hard steps it can run a panel of models on the same prompt, have a judge model cross-check them (consensus / contradictions / blind spots), and a synthesizer write the final answer. Locally this is fun because you can mix a few small local models and let them cross-check each other. A cost/latency-aware router keeps easy turns on a single model so you're not paying panel latency for everything.
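The panel, judge, and synthesizer flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Chimera's code: `Model`, `fuse`, `route`, and `is_hard` are hypothetical names, and each model is reduced to a plain "prompt in, text out" callable.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical: prompt in, completion out

def fuse(prompt: str, panel: list[Model], judge: Model, synthesizer: Model) -> str:
    """Run a panel on one prompt, let a judge compare the drafts, then synthesize."""
    drafts = [model(prompt) for model in panel]
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(drafts))
    critique = judge(
        "List consensus, contradictions and blind spots in these answers:\n" + numbered
    )
    return synthesizer(
        f"Question: {prompt}\nAnswers:\n{numbered}\nReview:\n{critique}\n"
        "Write the final answer."
    )

def route(prompt: str, is_hard: Callable[[str], bool], single: Model, **fusion) -> str:
    """Keep easy turns on one model; pay panel latency only for hard ones."""
    return fuse(prompt, **fusion) if is_hard(prompt) else single(prompt)
```

The router here is a bare predicate; a cost/latency-aware version would presumably weigh more signals than prompt difficulty alone.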

Beyond that it's a full agent: plan -> act -> verify-or-revert (it runs your tests and treats the result as ground truth), layered memory (SQLite + FTS recall, cross-session profile, consolidation), a governance kernel, cron/proactive jobs, MCP client + OpenAPI-to-tool import, and an isolated subagent/crew layer (parallel git worktrees with per-worker verify gates). Runs on a laptop or a $5 VPS via Docker.

Honest status: it's alpha - 463 tests, mypy --strict clean, but no production mileage yet. Local reasoning quality obviously depends on the models you point it at, so I'd genuinely love to hear which local models people find good enough to actually drive an agent loop (reliable tool use + self-correction) - that's the make-or-break for going fully local.

Repo: https://github.com/brcampidelli/chimera-agent

u/Federal-Teaching2800 — 17 hours ago

▲ 22 r/LangChain+2 crossposts

What if retrieval used attention instead of embeddings? I built a local retriever with SOTA results on long-memory and code benchmarks.

Embedding-based RAG is easy to demo, but high-recall production retrieval is hard.

The core issue is that embeddings lose a lot of context. Nearest-vector search can miss evidence that a model would recognize if it could actually read the surrounding memory. Once recall starts failing, retrieval often turns into a pile of compensating tricks: chunk-size tuning, overlap tuning, keyword + semantic fusion, rerankers, metadata filters, query rewriting, summaries, thresholds, and more. These pieces can help, but nearest-vector search is still not the same thing as reading the evidence.

I built Attemory, an attention-native retrieval engine for long memory, documents, and codebases.

The core idea is simple: instead of embedding chunks and searching by vector distance, Attemory indexes raw corpora into reusable KV state. At search time, a local Qwen3.5 retrieval model attends over the indexed memory and the query, then returns compact evidence: memory ids, snippets, or file + line ranges.

So the retriever is not just matching compressed vectors. It is using model attention over model-readable memory.

My current view is that attention helps for three reasons.

First, embeddings force each chunk into a fixed vector before the query is known. That is efficient, but it can lose token-level details such as names, dates, code identifiers, negation, and local relationships between facts.

Second, attention lets the query interact with the original memory text at retrieval time. The model can score evidence in context instead of relying only on distance in embedding space.

Third, the retrieval policy is promptable. The system prompt, memory-local context, and query context can define what kind of evidence should be retrieved, while the returned candidates are still the original memory items.

The key performance idea is not to generate answers during retrieval. Attemory uses a decode-free retrieval path: index the corpus into reusable KV state, then use attention signals from the query to rank candidate memories. That keeps retrieval closer to model reading while avoiding a full generation loop for every candidate.

The benchmark results are something we take seriously, not a marketing slogan. The repo includes reproducible benchmark scripts, notes, commands, and result summaries. The results below are from raw corpus + raw benchmark query runs, without benchmark-specific retrieval hacks: no query rewriting, no summarization, no agent-driven exploration, and no external cloud retrieval service for retrieval.

Current results:

LongMemEval-S: 98.72% session Recall_any@5, 92.77% session Recall_all@5, 98.94% message Recall_all@50
LongMemEval-M: 94.89% session Recall_any@5, 83.62% session Recall_all@5, 92.55% message Recall_all@50
LoCoMo: 94.52% long-conversation QA accuracy
Semble: 0.9055 file-level NDCG@10 across 63 repos and 19 languages
SWE-QA: one Attemory code-search hint reduced Claude Code token usage by 43.8%, with near-tied judge quality across 15 repos and 720 questions

One result worth highlighting is LongMemEval-M. It is around 1.5M tokens / 5k messages, and many memory systems do not evaluate on it at all. Attemory still retrieves all labeled evidence messages in the top 50 for 92.55% of answerable queries.
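For readers comparing the Recall_any@k and Recall_all@k figures, the sketch below shows the usual reading of the two metrics. It assumes the benchmark scores each query as a hit or a miss and averages over queries; the function names are illustrative and not taken from the Attemory repo.

```python
def recall_any_at_k(ranked: list[str], relevant: set[str], k: int) -> bool:
    """True if at least one labeled evidence item appears in the top k."""
    return any(item in relevant for item in ranked[:k])

def recall_all_at_k(ranked: list[str], relevant: set[str], k: int) -> bool:
    """True only if every labeled evidence item appears in the top k."""
    return relevant <= set(ranked[:k])

def mean_rate(hits: list[bool]) -> float:
    """Fraction of queries that counted as a hit."""
    return sum(hits) / len(hits)
```

Recall_all is the stricter of the two: a query with two evidence messages only counts if both are retrieved, which is why it trails Recall_any in the numbers above.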

Because the retrieval path is decode-free, query-time search remains efficient in practice. For large indexes, especially the largest tests I have run at nearly 10M tokens, retrieval still benefits significantly from GPU or Metal acceleration.

Attemory runs locally and exposes a Python / HTTP retrieval API.

I also built a repository search CLI on top of the same retrieval engine. With `atcode`, you can index a repo once, ask natural-language repository questions, and get compact file + line-range evidence back. That makes it easy to try the retrieval quality directly without wiring the API into an app first.

Attemory is still early stage, and I am working on MCP integrations for coding-agent frameworks right now.

I would love feedback from people building agents, memory systems, RAG pipelines, or code-search tools. If embeddings have become a bottleneck in your retrieval stack, please try Attemory and tell us what works, what breaks, and what you would want next.

u/langsfang — 19 hours ago

▲ 4 r/LangChain+1 crossposts

I got tired of background changes breaking my AI agents, so I built a tiny MCP server that stops them from acting on stale memory when a file updates on disk.

An agent I was using read a config file, worked for a while, and then wrote documentation describing the old values. I'd changed the config in between. It never re-read it. It finished, said it was done, and every value was wrong.

I assumed I'd done something dumb. Then I went looking, and it turns out this is filed across basically every major agent tool. Claude Code subagents reading stale file versions. Copilot overwriting its own edits because the editor state differs from its session memory. Codex restoring any file you changed while it was working, every time. There's even a name for it now, the "stale world model problem."

The core issue: the agent's cached view of a file drifts from what's actually on disk, and its own read tools sometimes serve the same stale cache, so it can't catch its own mistake.

So I built a small MCP server for it. It stamps every file the agent reads, and on the next tool call it reports which files changed on disk since the agent last looked. The agent re-reads before acting instead of writing from a stale copy. Zero dependencies, works in Claude Code, Cursor, Copilot, Antigravity.
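The stamp-and-compare idea can be sketched without any MCP plumbing. The class below is a hypothetical illustration, not the `since` implementation: `ReadStamps` and its method names are made up, and it assumes a content hash is an acceptable stamp (the real tool may use something else).

```python
import hashlib
from pathlib import Path

class ReadStamps:
    """Remember a content hash for each file at the moment it was read."""

    def __init__(self):
        self._stamps = {}  # resolved Path -> sha256 hex digest

    def read(self, path) -> str:
        """Read a file and stamp the exact version the agent just saw."""
        p = Path(path).resolve()
        data = p.read_bytes()
        self._stamps[p] = hashlib.sha256(data).hexdigest()
        return data.decode("utf-8")

    def changed_since_read(self) -> list:
        """Files whose on-disk content differs from the stamp, or that are gone."""
        stale = []
        for p, stamp in self._stamps.items():
            if not p.is_file() or hashlib.sha256(p.read_bytes()).hexdigest() != stamp:
                stale.append(p)
        return stale
```

A tool wrapper would call `changed_since_read()` before each action and ask the agent to re-read whatever it returns.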

The honest part is that I tested it across four agents, and in some cases it's redundant, because plenty of agents already re-read a file before editing it. Where it actually earns its place is when the change comes from outside the agent's view: another process, a formatter, a teammate, a parallel agent, or a session long enough that context drifted. I spent more time finding that boundary than writing the code.

It's open source and early. If you run agents across multiple tools, I'd genuinely like to know whether this happens in your setup and where it helps or doesn't.

pip install pysince

https://github.com/LNSHRIVAS/since

u/Enough-Piano-2362 — 12 hours ago

▲ 4 r/LangChain+1 crossposts

Is the causal chain of the process as important as the outcome?

In agentic systems, is the process just as valuable as the outcome? We obsess over 'what happened,' but should we care more about the 'why'? When does causality outweigh the event itself? And, crucially, are there any memory architectures that store the causal thread, not just the raw output?

u/Careful_Scarcity_678 — 20 hours ago

▲ 3 r/LangChain

How do you get meaningful observability for agentic AI systems, not just logs?

I'm trying to figure out real observability for multi-agent systems, not just single models. Flat logs don't cut it once agents are calling tools, spawning sub-agents, and hitting real systems with side effects.

I'm tracking agent decisions (what it was given, what it picked), tool and API calls (params, latency, errors), and end-to-end traces across agents. There's also a shared session where multiple agents and humans collaborate, so I tag spans with both a trace ID and a session ID.

Metrics like latency and call volume are free but don't tell you why an agent made a decision or which step caused a failure. That needs parent-child span structure plus some captured reasoning. Reasoning capture is messier than it sounds though. With reasoning models you usually get a summarized trace, not the real chain of thought, and what's exposed varies by provider.
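A minimal sketch of the span shape this implies: every span carries both a trace ID and a session ID, plus a parent link so a failure can be walked back to the step that caused it. The `Span` fields and the `path_to_root` helper are assumptions for illustration, not any particular tracing SDK's schema.

```python
from __future__ import annotations

import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    kind: str                      # e.g. "agent", "decision", "tool_call"
    trace_id: str
    session_id: str
    parent_id: str | None = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = field(default_factory=time.monotonic)
    attrs: dict = field(default_factory=dict)

    def child(self, name: str, kind: str, **attrs) -> Span:
        """Create a child span that inherits the trace and session IDs."""
        return Span(name, kind, self.trace_id, self.session_id, self.span_id, attrs=attrs)

def path_to_root(span: Span, by_id: dict[str, Span]) -> list[str]:
    """Follow parent links upward, returning span names from leaf to root."""
    names = [span.name]
    while span.parent_id is not None:
        span = by_id[span.parent_id]
        names.append(span.name)
    return names
```

Captured reasoning, where a provider exposes it, would sit in `attrs` on the decision span and not in a separate log stream.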

Outcome labeling is the part people skip. Volume and latency show up automatically. "Did this trace actually succeed" doesn't, someone has to apply that label, whether it's rules, human review, or LLM-as-judge. Judges have their own issues: inconsistent across runs, costly at scale, and prone to missing the same failure class the agent itself missed.

Biggest open question for me is where instrumentation should live. Network-level interception is framework-agnostic but you lose semantics (was that a plan step or a tool call?). SDK-level gets you semantics but means per-framework work and breaks across mixed runtimes.

Is anyone running this across multiple runtimes in prod? How are you splitting the instrumentation layer, and what's your sampling approach once trace volume gets expensive?

u/GlitteringAngle8601 — 16 hours ago

▲ 5 r/LangChain+3 crossposts

I analyzed why LangGraph agents burn $50 on infinite loops (and why recursion_limit is a blunt instrument)

We've all been there: you leave your LangGraph agent running, it hits a 403 Forbidden or a bad SQL query, and instead of failing gracefully, it asks the LLM for help. It gets stuck in a ReAct loop, burning through your API credits until the native recursion_limit finally kills it.

The worst part? The native recursion_limit is a blunt instrument. It throws a GraphRecursionError, crashes the run, and wipes your checkpointed state. You lose whatever partial data the agent did gather, and your frontend user just gets a 500 error.

I spent the last week digging into why agents do this, especially with open-weight models (Qwen/Llama) that lack native self-correction. I realized that just throwing a raw RuntimeError or a "BLOCKED" string at an agent just confuses it, and it loops again.

I ended up building an open-source pre-model intervention hook to solve this, and I wanted to share the architecture for anyone building headless agent backends.

How it works under the hood:

Instead of wrapping the whole graph, it uses LangGraph's native pre_model_hook and ToolNode APIs. It turns a fatal crash into bounded degradation. The agent returns a partial summary instead of an error, and your state is preserved.

It runs 100% locally, uses tiktoken shingling for zero-dep semantic loop detection, and adds <20µs of overhead.

Repo: https://github.com/Devaretanmay/TokenCircut

PyPI: pip install "tokencircuit[langgraph]"

Curious: what's the weirdest infinite loop you've seen your agents get stuck in? For me, it was a Databricks agent that kept retrying a REQUIRES_SINGLE_PART_NAMESPACE SQL error 20 times in a row.
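Shingle-based loop detection is easy to sketch. The version below is an illustration, not the library's code: it uses whitespace-separated words as a stand-in for tiktoken tokens so it runs on the standard library alone, and `LoopGuard`, its thresholds, and its window size are all assumed values.

```python
from collections import deque

def shingles(text: str, n: int = 3) -> set:
    """Overlapping n-word windows; a stand-in for token-level shingles."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    return len(a & b) / len(a | b) if a | b else 1.0

class LoopGuard:
    """Flag a loop when the new step nearly duplicates several recent steps."""

    def __init__(self, window: int = 6, threshold: float = 0.8, max_repeats: int = 3):
        self.recent = deque(maxlen=window)
        self.threshold = threshold
        self.max_repeats = max_repeats

    def looping(self, step_text: str) -> bool:
        current = shingles(step_text)
        repeats = sum(jaccard(current, old) >= self.threshold for old in self.recent)
        self.recent.append(current)
        # Count the current step itself as one occurrence.
        return repeats + 1 >= self.max_repeats
```

Once `looping()` returns True, the caller can stop invoking the model and return whatever partial results it has gathered, which is the bounded-degradation behaviour the post describes.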

u/Commercial2Toe — 20 hours ago

▲ 3 r/LangChain+1 crossposts

Can an agent improve itself by shutting down, handing off to a new agent that fine-tunes the old one, and then switching back to the old one to do the task?

u/Prit-P2 — 1 day ago

▲ 28 r/LangChain+2 crossposts

I curated 48 LLM observability tools (Langfuse, Phoenix, Opik, LangSmith…) + a comparison matrix

Every few weeks I end up re-comparing LLM observability/eval tools for a project, so I put it all in one place: 48 verified tools across tracing, evals, prompt mgmt, gateways, OTel instrumentation, and guardrails, each with current stars + license; plus a self-host / license / tracing / evals / OTel comparison table for the top platforms.

It also includes original agent skills (instrument tracing, add evals, debug-from-traces, PII-safe tracing for regulated apps) and a minimal OpenTelemetry GenAI tracer.

Full disclosure, it's my org's repo (CC0, contributions welcome): https://github.com/ContextJet-ai/awesome-llm-observability — what tool am I missing?

u/nishchaymahor19 — 1 day ago

▲ 7 r/LangChain

How are you guys saving/storing prompts?

No, this post isn't about A/B testing them but mainly about storing them.

How are people storing prompts right now??

YAML/txt files? If yes, then how are you guys maintaining versioning? And on the deployment end, if you're bundling those files into your system, would you need to create a new deployment for every prompt change? Isn't that redundant?

For companies using managed services such as AWS's for prompt management: how good are they?

For context we've been creating tons of projects using LangGraph lately, and to this date we're just saving prompts in YAML files without any formal versioning, just the Git history. Then in the deployment pipeline they get bundled into the FastAPI Docker application. So for each minor prompt change or enhancement; currently we have to rebuild the container.

So because of this we decided to consider prompts differently. Therefore we're thinking of creating an open source project that could be a self-hostable prompt management server, which could later also be hosted by enterprises for their own use cases.

Need to hear the perspective of others in this group coz people here might have more experience with this; Thanks

u/dyeusyt — 1 day ago

▲ 1 r/LangChain+1 crossposts

The 3-line output sanitiser I add to every LangGraph agent now

I was testing a LangGraph agent with file access tools and realized that if someone asks it to read .env, it outputs every API key in plain text.

Looked into it. OWASP ranked Sensitive Information Disclosure #2 on their LLM Top 10 (2025). LangChain itself had a CVE last year (CVE-2025-68664) for env var exfiltration.

My fix is 3 lines that scan every agent response before it reaches the user:

import re

# \b keeps ordinary words such as "task-list" from matching the sk- prefix.
SECRETS = re.compile(r'\b(sk-|AKIA|ghp_)\S+')

def sanitize(text): return SECRETS.sub('[REDACTED]', text)

Catches OpenAI (sk-/sk-proj-), AWS (AKIA), and GitHub (ghp_) key patterns.

Not exhaustive (production needs Stripe, Slack, and Anthropic patterns too), but it's a starting point most tutorials skip entirely.
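As a rough sketch of what a broader pattern set might look like, the block below adds Stripe and Slack prefixes and a few more GitHub token types. The prefixes and minimum lengths are assumptions based on commonly seen key formats, so check each provider's documentation before relying on them. Each pattern is anchored at a word boundary and requires a minimum length so that ordinary hyphenated words such as "task-list" are left alone.

```python
import re

# Assumed key shapes; verify against each provider's docs before production use.
SECRET_PATTERNS = [
    r"sk-[A-Za-z0-9_-]{16,}",                  # OpenAI-style keys, incl. sk-proj- and sk-ant-
    r"AKIA[0-9A-Z]{16}",                       # AWS access key ID
    r"gh[pousr]_[A-Za-z0-9]{20,}",             # GitHub token families
    r"[sr]k_(?:live|test)_[A-Za-z0-9]{16,}",   # Stripe secret / restricted keys
    r"xox[abprs]-[A-Za-z0-9-]{10,}",           # Slack tokens
]
SECRETS = re.compile(r"\b(?:" + "|".join(SECRET_PATTERNS) + r")")

def sanitize(text: str) -> str:
    """Replace anything that looks like a known credential with a placeholder."""
    return SECRETS.sub("[REDACTED]", text)
```

Pattern lists like this only catch known formats; they do nothing for passwords, connection strings, or private keys, so they complement access controls on the file tools and are no substitute for them.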

Made a 30-second video walkthrough: https://www.youtube.com/@CodeAgents_ai

What output sanitization patterns are you using in your agents? Curious if anyone has a more comprehensive approach.

u/Low_Edge7695 — 1 day ago

▲ 35 r/LangChain+2 crossposts

GPT-5.5 vs Claude Fable 5 vs Local Qwen: 3 AI Agents, 1 Task

I ran the same market-entry brief through three different AI models. The result was revealing.

I asked three models to independently create a client-ready market-entry brief for launching a privacy-first AI personal assistant for small businesses in the UK.

The models were:

Claude Fable 5 via Claude Subscription
GPT-5.5 via ChatGPT/Codex
qwen3.6:27b running locally via Ollama

Each got the exact same task. They could use web research. They could not see each other’s answers.

The brief was for a product that is local-first, helps with email, calendar, documents, reminders, research, and workflow automation, and positions itself around privacy, local storage, user control, and optional cloud model access.

The target market was UK small businesses, freelancers, consultants, and agencies.

The output needed to include segmentation, customer pains, competitor landscape, positioning, pricing, go-to-market strategy, risks, a 90-day launch plan, and a clear recommendation on whether the company should pursue the market.

Here’s what happened.

The winner: Claude Fable 5

Claude produced the strongest founder-ready strategy memo.

Its biggest strength was that it made a clear strategic choice.

It did not recommend launching as a generic “AI assistant for small businesses”. Instead, it recommended a focused wedge into regulated micro-practices and privacy-sensitive professional services: accountants, solicitors, bookkeepers, financial advisers, HR consultants, consultants, and agencies handling confidential client data.

That was the sharpest insight in the whole comparison.

Its positioning was also the strongest:

That works because it does not try to out-feature Microsoft Copilot or Google Workspace. It reframes the competition around data custody, client confidentiality, and trust.

Claude’s best recommendation was: don’t compete on being cheaper than Copilot. Compete on privacy, control, and workflows that cloud-first incumbents cannot credibly own.

It also had the strongest risk analysis: Microsoft bundling, local model quality gaps, hardware variability, support burden, regulatory shifts, and category confusion with free local tools.

Overall, Claude felt the most client-ready.

GPT-5.5 was the best operator

GPT-5.5 came very close.

It was less punchy than Claude on positioning, but stronger on execution.

It produced the most practical 90-day launch plan: choose two verticals, run workflow audits, recruit pilot firms, configure 3 to 5 daily automations per customer, measure admin hours saved, build case studies, then convert pilots into paid customers.

It was also more cautious around compliance claims. That matters. A privacy-first AI product should avoid saying “GDPR-compliant by design” too casually. Better language is: “designed to reduce unnecessary data transfer and support UK GDPR obligations, subject to configuration.”

GPT-5.5 was very useful for turning the strategy into an operating plan.

If Claude gave the boardroom memo, GPT-5.5 gave the launch checklist.

Local Qwen was better than expected

The local qwen3.6:27b model produced a coherent, complete, and genuinely useful first draft.

It covered all required sections. It had a competitor table, pricing hypothesis, go-to-market phases, risk table, and launch plan. For a local model, it performed well.

But it had weaknesses.

It made more unsupported claims. It was less disciplined with citations. It overclaimed in places, for example saying local-first meant “zero data-privacy risk”, which is not accurate. Local-first reduces risk, but it does not eliminate it.

It also picked freelancers and micro-agencies as the primary beachhead. That is easier to market to, but less strategically defensible than privacy-sensitive professional services.

Still, the result was good enough for internal ideation, early drafting, and private strategy work.

That is important.

Local models do not need to beat frontier cloud models at everything to be useful. They need to be good enough for the right part of the workflow.

My ranking

Claude Fable 5: Best for strategy, positioning, founder-ready narrative, and final synthesis.
GPT-5.5: Best for launch planning, pilot design, pricing experiments, and operational detail.
qwen3.6:27b (local): Best for private first drafts, brainstorming, internal notes, and cheap iteration.

The bigger takeaway

The best workflow was not “pick one model”.

The best workflow was hybrid:

Use the local model first to brainstorm privately and cheaply.

Use GPT-5.5 to turn the ideas into a practical operating plan.

Use Claude to sharpen the positioning and produce the final client-ready narrative.

That feels like where AI work is heading.

Not one model for everything.

A portfolio of models, each used where it is strongest.

For privacy-first products especially, local models have a clear role. They are not always the best final writer. They are not always the strongest strategist. But they are useful for private thinking, early drafting, and working with sensitive material before anything goes to the cloud.

In this test, local Qwen was not the winner.

But it was absolutely good enough to be part of the team.

And that may be the more important result.

GitHub

u/Acceptable-Object390 — 2 days ago

▲ 2 r/LangChain

(langchain-aws) How do you deal with very complex structured outputs

guys, I've got a LangGraph system set up. at the end of the graph there's a synthesizer node which produces structured output. the thing is, I'm mainly using AWS Bedrock, so I don't really have that many options.

I did want to use Chinese models for the cost savings, but most of the Chinese models on AWS Bedrock are quite old by today's standards. the biggest dealbreaker for me is that the majority don't support JSON Schema, and when I use the other method, function calling, it doesn't work and gives validation errors.

for reference my schema is: "The output contract is around 30+ Pydantic models; 7 major report sections; several shared type definitions; dozens of nested objects; multiple enum types; and field validators enforcing list caps and structural constraints. It's essentially a typed document specification rather than a simple JSON response."

the models I've figured out that still somewhat give correct json_schema:

most of the Anthropic models after Haiku 4.5
OpenAI models like gpt-oss are able to give structured outputs in JSON Schema; but not very complex ones
MiniMax M2; but many times it ends up in validation errors (too many objects)

other than that, when I switch providers to OpenAI and use a model like gpt-5.4-mini, it works wonderfully and gives the correct outputs. it's also much faster than those models with little to no loss in output quality. for context, it's an evaluation task.

so I am asking the community here; how do you people deal with structured outputs when stuff gets a little more complex? is this an AWS Bedrock issue?

P.S. I've got AWS Startup Credits (we're still bootstrapped); so directly using OpenAI models from "platform.openai.com" will end up with us bearing too much cost. so that's a factor as well for us.

looking forward to hearing from people who've worked with structured outputs here. thanks

u/dyeusyt — 1 day ago

▲ 5 r/LangChain

Anyone's RAG bot ever hallucinated hard in front of users? What happened?

Been building RAG stuff and starting to think a lot about failure modes. Curious if anyone's had their retrieval/RAG setup confidently give a wrong answer in prod (or a demo), and what that looked like on your end.

Specifically curious:

how'd you even catch it (user complained? you noticed manually? never did?)
did you know why it happened — bad chunk, bad retrieval, stale doc, prompt issue?
what'd you actually do to fix it

Not selling anything, just trying to understand real failure patterns instead of guessing. Would love to hear stories 🙏

u/vanilla_cappucchino — 1 day ago

▲ 6 r/LangChain

Fixing latency in sequential agent loops (parallel tool calling fix): this method worked well for me

If you are building multi-step agent loops, you have probably run into the bottleneck where your agent waits for Tool A to finish completely before it even initiates Tool B—even when the two actions don't depend on each other. Sequential execution absolutely kills the user experience. Here is a quick architectural fix using Python's asyncio to force parallel tool execution inside your agent orchestration loop:

The slow way: Sequential execution

import asyncio

# call_tool_a / call_tool_b stand for any two independent async tool functions.
async def sequential_run():
    result_a = await call_tool_a()  # Waits 2.5 seconds
    result_b = await call_tool_b()  # Waits 2.0 seconds
    return [result_a, result_b]     # Total time: 4.5 seconds

The fast way: Parallel execution

import asyncio

# call_tool_a / call_tool_b stand for any two independent async tool functions.
async def parallel_run():
    # Dispatches both tool calls concurrently
    results = await asyncio.gather(
        call_tool_a(),
        call_tool_b(),
    )
    return results  # Total time: ~2.5 seconds (bound by the slowest tool)

When you parse your LLM's tool_calls JSON array, do not just loop through the calls with a standard for loop and await each one in turn. Map them into an asyncio.gather call instead. This drops your total tool latency to roughly that of your single slowest tool, instead of the sum of every tool's response time. Please let me know about any errors or corrections in this code. 🤗
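Here is a small runnable sketch of mapping a tool_calls array onto `asyncio.gather`. The call shape (`{"name": ..., "arguments": "<JSON string>"}`) and the `registry` dict of async tool functions are assumptions for illustration; adapt them to whatever your provider or framework returns.

```python
import asyncio
import json

async def run_tool_calls(tool_calls: list, registry: dict) -> list:
    """Run every requested tool concurrently; results keep the request order."""

    async def run_one(call: dict):
        func = registry[call["name"]]              # hypothetical name -> coroutine map
        args = json.loads(call.get("arguments") or "{}")
        return await func(**args)

    # return_exceptions=True keeps one failing tool from discarding the others' results;
    # failures come back as exception objects in the matching list position.
    return await asyncio.gather(
        *(run_one(call) for call in tool_calls),
        return_exceptions=True,
    )
```

Two caveats: this only helps for I/O-bound tools written as coroutines, and it is only safe when the calls really are independent. If Tool B needs Tool A's output, they still have to run in sequence.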

I encounter these multi-agent performance bottlenecks quite a bit, so I set up a dedicated workspace at r/AI_Agentic_Devs for anyone interested in collaborating on clean agent code loops.

u/Sea-Opening-4573 — 1 day ago

▲ 18 r/LangChain

Woke up to a massive API bill. My LangGraph agent looped on a broken tool all weekend. How are you guys preventing this?

I just accidentally let a LangGraph agent loop on a broken tool over the weekend and woke up to a massive API token burn. How are you guys preventing runaway cloud bills when your autonomous agents get stuck in logic loops? Are you just hardcoding static limits, or is there a better way to catch it?

u/Strong-Site-2872 — 2 days ago

▲ 23 r/LangChain+2 crossposts

Building an AI Gateway because production LLM apps kept accumulating the same middleware (WIP, looking for feedback)

Over the past few months I've noticed a pattern while building LLM applications.

The application code stays relatively small.

But production concerns keep growing:

PII redaction
retries
provider fallback
audit logs
cost tracking
request logging
prompt inspection
rate limiting

These concerns end up being duplicated across projects.

So I've been building Gavio (work in progress), an open-source AI gateway that lets these concerns be composed as interceptors rather than scattered through application code.

Current ideas include:

• Request/response interceptor pipeline
• PII & secret detection
• Retry/backoff
• Provider abstraction
• Audit trail
• Cost tracking
• Local mock provider
• Python / Java / JavaScript SDKs
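To illustrate what "composed as interceptors" could mean in practice, here is a minimal sketch. It is not Gavio's API: `Handler`, `Interceptor`, `redact_pii`, `retry`, and `compose` are hypothetical names, and the provider is reduced to a "prompt in, text out" callable.

```python
import re
import time
from typing import Callable

Handler = Callable[[str], str]              # prompt in, completion out
Interceptor = Callable[[Handler], Handler]  # wraps one handler in another

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_pii(next_handler: Handler) -> Handler:
    """Mask email addresses before the request leaves the process."""
    return lambda prompt: next_handler(EMAIL.sub("[EMAIL]", prompt))

def retry(attempts: int = 3, backoff: float = 0.0) -> Interceptor:
    """Retry the downstream call, sleeping backoff * attempt between tries."""
    def wrap(next_handler: Handler) -> Handler:
        def handler(prompt: str) -> str:
            for attempt in range(1, attempts + 1):
                try:
                    return next_handler(prompt)
                except Exception:
                    if attempt == attempts:
                        raise
                    time.sleep(backoff * attempt)
        return handler
    return wrap

def compose(provider: Handler, *interceptors: Interceptor) -> Handler:
    """Wrap the provider so the first interceptor listed runs outermost."""
    handler = provider
    for interceptor in reversed(interceptors):
        handler = interceptor(handler)
    return handler
```

Usage would look like `gateway = compose(call_provider, redact_pii, retry(attempts=3))`, where `call_provider` is whatever function actually talks to the model. Audit logging or cost tracking would be further interceptors of the same shape.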

The goal isn't to replace LangChain, AI SDKs, or provider SDKs.

It's to provide a production layer around them.

I'm still exploring the design, so I'd genuinely appreciate feedback.

Some questions I'm thinking about:

What production problems are you solving repeatedly?
What would you expect from an AI gateway?
Would you prefer middleware, sidecar, proxy, or SDK?
What have I missed?

GitHub: https://github.com/manojmallick/gavio

Docs: https://manojmallick.github.io/gavio

u/Independent-Flow3408 — 3 days ago