u/Future_AGI

If you are shipping AI agents, do you know what one real conversation costs end to end?

If you have ever tried to estimate what an AI agent will cost in the real world, you already know the annoying part: the number changes the moment the flow starts doing real work.

A request is no longer just “one LLM call.” It can become STT, retrieval, tool calls, retries, memory, and TTS all in the same run, and each piece adds cost in a different unit. That is where the clean spreadsheet version usually breaks down.

We keep running into the same problem: the cost of an agent is easy to estimate on paper, and much harder to estimate once the flow starts behaving like a real system.

STT is usually priced per minute of audio. LLMs are priced per token. TTS can be priced per character or per second depending on the provider. Those units do not line up cleanly, so once retries or branching enter the flow, the total stops being obvious very quickly.
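To make the unit mismatch concrete, here is a minimal sketch that converts each step of one voice turn into dollars. Every price in it is a made-up placeholder rather than a real provider rate, the `Prices` and `Turn` names are ours for illustration, and it assumes a retry re-bills the full LLM input and output.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    # Placeholder numbers for illustration only, not real provider pricing.
    stt_per_minute: float = 0.006
    llm_per_1k_input_tokens: float = 0.0005
    llm_per_1k_output_tokens: float = 0.0015
    tts_per_1k_chars: float = 0.015


@dataclass
class Turn:
    audio_seconds: float   # user speech sent to STT
    input_tokens: int      # prompt plus context sent to the LLM
    output_tokens: int     # tokens generated by the LLM
    tts_chars: int         # characters spoken back to the user
    llm_attempts: int = 1  # 1 means no retry (assumption: a retry re-bills everything)


def turn_cost(turn: Turn, prices: Prices) -> dict:
    """Convert minutes, tokens, and characters into one currency per step."""
    stt = turn.audio_seconds / 60 * prices.stt_per_minute
    llm = turn.llm_attempts * (
        turn.input_tokens / 1000 * prices.llm_per_1k_input_tokens
        + turn.output_tokens / 1000 * prices.llm_per_1k_output_tokens
    )
    tts = turn.tts_chars / 1000 * prices.tts_per_1k_chars
    return {"stt": stt, "llm": llm, "tts": tts, "total": stt + llm + tts}


prices = Prices()
clean = turn_cost(Turn(30, 2000, 300, 600), prices)
retried = turn_cost(Turn(30, 2000, 300, 600, llm_attempts=2), prices)
```

Comparing `clean` and `retried` shows the point: one retry doubles the LLM share of the turn while the STT and TTS shares stay the same, so the total depends on behavior and not only on the price sheet.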

The part that makes this annoying is not just the pricing. It is that the cost is tied to behavior. A slightly longer user turn, one extra retry, a tool call that adds context, or a different response style can change the final number more than people expect.

For us, cost is not a static line item. It is part of the run. If the agent branches, retries, carries too much context, or crosses providers, the cost shifts with it. That is why we started looking at cost alongside tracing and evaluation instead of treating it as a separate spreadsheet problem.

Who this is for:

  • Teams building voice agents, copilots, RAG systems, or multimodal flows.
  • People trying to estimate cost before usage gets real and messy.
  • Technical teams that want to understand what is actually driving spend inside a run.

What you can do with it:

  • Trace which step is adding the most cost.
  • Compare runs across models, prompts, and providers.
  • See how retries and branching change total spend.
  • Simulate different conversation patterns before they hit production.
  • Tie cost back to actual behavior instead of guessing from pricing pages.

We are curious how other teams are handling this in practice.

Do you estimate cost from provider pricing first and then adjust later, or do you already have run-level cost visibility built into your agent stack? What is the part that usually surprises you most: retries, retrieval, tool use, or response length?

If you are building AI agents right now, how are you estimating this today: by provider pricing, by back-of-the-envelope assumptions, or from actual run data? And when the number surprises you, what is usually the reason: retries, retrieval, tool use, or just longer-than-expected conversations?

If you are working on this kind of system, try it on one of your own flows and see what shows up. We would genuinely like to hear what feels accurate, what feels off, and what you would want to measure better.

u/Future_AGI — 23 hours ago

When a LangChain agent starts drifting, what do you actually inspect first?

You know that awkward moment when the agent gets stuck halfway through a task, not because the model failed, but because one tool returned something slightly off and the next step quietly built on that bad state?

That is the kind of failure that is hard to catch in LangChain workflows, especially when you have retrieval, retries, branching, and memory all feeding into the same run.

A lot of teams can see the final answer is off. What is harder is finding the first step where the run started drifting. Was it a retrieval result that looked close enough? A tool output that changed the context in a small way? A state update that made the rest of the chain behave differently?
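One simple way to look for that first step, sketched under our own assumptions: keep a known-good run of the same task and walk both runs side by side. The step shape (`name`, `output`) and the `first_divergence` helper are hypothetical and not part of LangChain or any tracing library.

```python
from typing import Callable, Optional

Step = dict  # hypothetical shape: {"name": str, "output": str}


def first_divergence(
    reference: list[Step],
    observed: list[Step],
    same: Callable[[str, str], bool] = lambda a, b: a == b,
) -> Optional[int]:
    """Return the index of the first step that differs from the reference run, or None."""
    for i, (ref, obs) in enumerate(zip(reference, observed)):
        if ref["name"] != obs["name"] or not same(ref["output"], obs["output"]):
            return i
    # One run is a prefix of the other: the divergence is where the shorter one ends.
    if len(reference) != len(observed):
        return min(len(reference), len(observed))
    return None


good_run = [
    {"name": "retrieve", "output": "doc-12"},
    {"name": "tool:lookup", "output": "status=shipped"},
    {"name": "answer", "output": "Your order has shipped."},
]
drifting_run = [
    {"name": "retrieve", "output": "doc-12"},
    {"name": "tool:lookup", "output": "status=unknown"},
    {"name": "answer", "output": "I could not find your order."},
]
drift_index = first_divergence(good_run, drifting_run)
```

Exact string equality is too strict for most real outputs, which is why `same` is a parameter: you can swap in a looser comparison, such as a similarity threshold, without changing the walk itself.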

That is the part we have been focused on at Future AGI.

The open-source platform for shipping self-improving AI agents. Evaluations, tracing, simulations, guardrails, gateway, optimization. Everything runs on one platform and one feedback loop, from first prototype to live deployment.

It has also picked up 999 stars on GitHub in just a few weeks.

For LangChain users, the useful part is not just seeing what the model said. It is being able to inspect the full run in a way that matches how these workflows actually break.

Who this is for:

  • People building LangChain agents with tools, retrieval, or memory.
  • Teams that need to understand where a multi-step run went wrong.
  • Builders who want evals that reflect real behavior, not just neat examples.

What you can do with it:

  • Trace model calls, tool calls, and state changes across a run.
  • See where the workflow started drifting.
  • Run evaluations against real task behavior.
  • Simulate edge cases before they show up in production.
  • Feed failures back into the eval loop so the next run is easier to trust.

Two lines to trace a LangChain workflow:

from traceai_langchain import LangChainInstrumentor

# trace_provider is the tracer provider you have already set up for your project.
LangChainInstrumentor().instrument(tracer_provider=trace_provider)

Import, instrument, and start seeing the full run.

The pattern we keep seeing is simple: once you add real tools and state, the hard part is no longer “did the model answer well.” It becomes “what changed the run before the final answer ever showed up?”

If this is the kind of workflow you are dealing with, try the platform on one of your own LangChain runs and see what it surfaces. We would genuinely like to hear what felt useful, what felt missing, and where it changed the way you debug.

u/Future_AGI — 2 days ago

The strange part about AI agents is that they often do not fail where you expect them to.

A retrieval step drifts, a tool call returns something slightly different, or one small state change early in the run quietly affects everything that follows. By the time the final answer looks wrong, the useful signal is already buried.

That is the part we kept running into while building Future AGI.

The open-source platform for shipping self-improving AI agents. Evaluations, tracing, simulations, guardrails, gateway, optimization. Everything runs on one platform and one feedback loop, from first prototype to live deployment.

We built it for teams working on agents, copilots, RAG workflows, and other multi-step systems that need more than a final response log. If you are trying to understand how an agent actually behaved, where it went off track, and what should go back into the next eval cycle, that is the gap we wanted to close.

What this gives you in practice:

  • Step-level tracing across model calls, tool calls, and state changes, so you can see where the run actually changed direction.
  • Task-level evaluations that measure behavior against real outcomes, not just a final output score.
  • Simulation that lets you test messy, edge-case inputs before production users find them first.
  • A feedback loop that turns real failures into new eval cases, so the system improves over time.
  • Guardrails and optimization in the same loop, so fixing one layer does not mean breaking another.

Who is this for?

  • Teams building agents for support, internal workflows, search, or automation.
  • Builders who have already seen the gap between “works in testing” and “works under real traffic.”
  • Anyone who has tried to debug an agent by re-running it and hoping the answer changes.

What we kept seeing is that most agent failures are not obvious prompt failures. They are system failures. A retrieval result shifts. A tool behaves differently than expected. A state change in the middle of the flow causes the next three steps to drift. Those are hard to catch if you only look at the final output.

That is why we treat agents as systems you observe, trace, and improve, not black boxes you ship and hope for the best.

If you are building agents right now, try it in your own workflow and see whether it changes how you debug. It is open source, and you can also layer it with other open-source tools for evals, tracing, or simulation depending on your stack.

u/Future_AGI — 3 days ago

We shipped 6 prompt-optimization algorithms (GEPA, PromptWizard, ProTeGi, Bayesian, Meta-Prompt, Random) in one Apache 2.0 Python library.

If you have ever tuned a prompt by hand, you already know the pattern.

You make a small change, run the same examples again, and hope the output gets better without breaking something else. Sometimes it works. Sometimes it gets worse in a way that is hard to spot until later.

That is the problem we wanted to make more structured.

We built prompt optimization in-house and shipped it as an Apache 2.0 Python library so people can move from manual prompt edits to a repeatable improvement loop.

The idea is simple: take a prompt, run it on real data, score it with evals, and let the optimizer search for better versions instead of guessing by hand.

We support 6 optimization algorithms:

  • GEPA
  • PromptWizard
  • ProTeGi
  • Bayesian Search
  • Meta-Prompt
  • Random Search

Why 6? Because different prompts behave differently.

Some prompts need a search strategy that explores more. Some work better when the optimizer changes the wording in a more guided way. Some need a judge signal that is very clear and task-specific. In practice, the “best” optimizer depends on your data, your evals, and how messy the task is.

This is built for people who are actually shipping prompts, not just experimenting with them in notebooks.

If you are working on RAG, support flows, extraction, copilots, or any system where prompt quality changes the outcome in a measurable way, the goal is the same: make improvement repeatable instead of manual.

A typical run looks like this:

  • Start with a baseline prompt.
  • Run it against a dataset.
  • Score the outputs with your evals.
  • Generate candidate prompts with an optimizer.
  • Compare the results.
  • Keep the version that performs best.
  • Repeat when your data changes.
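The steps above can be sketched as a small greedy random search. This is an illustration of the loop only, not the library's API and not GEPA, ProTeGi, or any of the other named algorithms; `run_prompt`, `score`, and `propose` are hypothetical callables you would supply (your model call, your eval, and your candidate generator).

```python
import random
from typing import Callable


def optimize_prompt(
    baseline: str,
    dataset: list[dict],
    run_prompt: Callable[[str, dict], str],           # hypothetical: calls your model
    score: Callable[[str, dict], float],              # hypothetical: your eval, higher is better
    propose: Callable[[str, random.Random], str],     # hypothetical: makes a candidate prompt
    rounds: int = 20,
    seed: int = 0,
) -> tuple[str, float]:
    """Greedy random search: keep a candidate only if it beats the current best."""
    rng = random.Random(seed)

    def evaluate(prompt: str) -> float:
        # Mean eval score of this prompt over the whole dataset.
        total = sum(score(run_prompt(prompt, row), row) for row in dataset)
        return total / len(dataset)

    best_prompt, best_score = baseline, evaluate(baseline)
    for _ in range(rounds):
        candidate = propose(best_prompt, rng)
        candidate_score = evaluate(candidate)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score
```

Because a candidate replaces the baseline only when it scores strictly higher on the same dataset, the returned score can never be worse than where you started. When the data changes, rerun the loop with the current best prompt as the new baseline.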

What we have found is that prompt work gets much easier once the loop is clear. You stop asking, “Which wording feels better?” and start asking, “Which version actually performs better on the cases that matter?”

That is what we wanted to build.

The open-source platform for shipping self-improving AI agents. Evaluations, tracing, simulations, guardrails, gateway, optimization. Everything runs on one platform and one feedback loop, from first prototype to live deployment.

Who is this for?

  • Prompt engineers who want a repeatable optimization flow.
  • Builders shipping production prompts who need safer iteration.
  • Teams comparing different optimization methods on the same dataset.
  • Anyone who wants prompt quality to be measurable instead of subjective.

What can you do with it?

  • Optimize prompts with six different algorithms in one library.
  • Run a prompt against a dataset and compare candidates side by side.
  • Use your own evals to define what “better” means.
  • Keep optimization tied to real task performance.
  • Move from one-off edits to a loop you can actually reuse.

If you are working on any project with prompts, try it in your own workflow and see what the optimizer changes. It is open source, and you can also layer it with other open-source tools for evals, tracing, or simulation if that fits your setup.

u/Future_AGI — 4 days ago

AI agents break in ways that dashboards do not explain. Here is the stack we built to deal with that. And it's open-source.

AI agents do not fail all at once. They break step by step, and that is the part most teams still miss, or are not even aware of. Agreed?

In some cases, a retrieval result drifts, a tool call returns something unexpected, or an early state change quietly affects the rest of the run. By the time the final answer looks wrong, the useful context is already gone.

That gap is what we kept running into while building Future AGI.

The open-source platform for shipping self-improving AI agents. Evaluations, tracing, simulations, guardrails, gateway, optimization. Everything runs on one platform and one feedback loop, from first prototype to live deployment.

We built it for teams working on agents, copilots, and RAG-heavy workflows that need more than a final response log. If you are trying to understand how an agent behaved, where it went off track, and what should become part of the next eval cycle, this is the kind of system we wanted to have ourselves.

What you can do with it:

  • Trace model calls, tool calls, and state changes across an agent run.
  • Inspect failures at the step level instead of only seeing the final output.
  • Run evaluations tied to real tasks and behavior.
  • Simulate scenarios before production users find the edge cases.
  • Feed failures back into the eval loop so the system improves over time.

What we have learned so far is that teams do not just need another dashboard. They need a way to connect production failures to a trace, understand the exact point of failure, and use that signal to tighten the next version.

That is the problem we are focused on.

If you are working on agents, what is still the hardest part for you today: understanding where a run failed, building evals that stay useful, simulating real edge cases, feeding production failures back into the loop, or something similar?

Try it in your stack and tell us where it breaks, what feels off, and what you would want us to improve.

u/Future_AGI — 7 days ago

Most teams are running their production AI agents on pure vibes and a few test chats. We need to talk about what a serious evaluation stack actually looks like.

If you ask most teams “do you trust your agent in production?”, you usually get a shrug and a story, not an answer. Honestly, we keep getting that same response.

There are dashboards, a few example chats, maybe a one-off eval notebook, but very few people can point to a clear, living eval setup and say: “this is why we still trust it today, not just the week we shipped it.”

We have spent the last 18 months talking to teams running agents for support, internal copilots, RAG search, and multi-step workflows, and the same problems keep coming up:

  • When something goes wrong, it is hard to tell which step actually failed.
  • Retrieval quality drifts, but there is no way to tie a bad answer to a specific tool call or document.
  • Eval sets are written once and slowly rot while prompts, tools, and models keep changing.
  • Real failures in production rarely make it back into the test set, so the system keeps “passing” old tests.

At that point, saying “the agent is in production” does not mean “we understand its behavior.” It mostly means “nothing has burned down yet.”

The way we started thinking about it is simple: if agents are systems, not single prompts, then “evaluation” has to follow the system and cover more than the final answer.

We think a serious agent stack needs at least four things:

  1. Tracing down to the step level, so you can say “step 4 failed because retrieval returned garbage” instead of “the agent was bad here.”
  2. Evaluations that can be tied to tasks and steps, not just global thumbs up or down.
  3. Simulation so you can test agents against a wide range of scenarios before users discover the weird edge cases for you.
  4. A feedback loop where production failures become new eval cases, so the system does not just keep re-passing the same old test.
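Items 1 and 4 can be sketched together: find the first failed step in a production trace, then turn it into an eval case unless the same failure is already covered. The trace shape (`input`, `steps`, `ok`, `reason`) and both helper names are assumptions for illustration, not the format of any particular tracing tool.

```python
import hashlib
import json
from typing import Optional


def first_failed_step(trace: dict) -> Optional[tuple[int, dict]]:
    """Return (index, step) for the first step marked as failed, or None."""
    for index, step in enumerate(trace["steps"]):
        if not step["ok"]:
            return index, step
    return None


def add_failure_to_eval_set(trace: dict, eval_set: dict) -> bool:
    """Turn a failed trace into an eval case. Returns True if a new case was added."""
    found = first_failed_step(trace)
    if found is None:
        return False
    index, step = found
    case = {
        "input": trace["input"],
        "failed_step": step["name"],
        "step_index": index,
        "reason": step.get("reason", ""),
    }
    # Deduplicate on (input, failing step) so a repeated failure is stored once.
    key_source = json.dumps([case["input"], case["failed_step"]])
    key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
    if key in eval_set:
        return False
    eval_set[key] = case
    return True


eval_set: dict = {}
production_trace = {
    "input": "Where is my order 1042?",
    "steps": [
        {"name": "route", "ok": True},
        {"name": "retrieval", "ok": False, "reason": "no relevant documents"},
        {"name": "answer", "ok": False, "reason": "built on empty context"},
    ],
}
added_first_time = add_failure_to_eval_set(production_trace, eval_set)
added_second_time = add_failure_to_eval_set(production_trace, eval_set)
```

Recording the first failed step, not the last, is the design choice that matters: the final answer step also failed here, but only because retrieval had already returned nothing useful.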

We ended up building our own stack around that idea and then open-sourcing it.

The open-source platform for shipping self-improving AI agents. Evaluations, tracing, simulations, guardrails, gateway, optimization. Everything runs on one platform and one feedback loop, from first prototype to live deployment.

Who is it for?

  • People building agents, copilots, and RAG systems who want to see where the system actually fails, not just whether it “looks good” in a few test prompts.
  • Teams who want to keep eval logic and traces inside their own stack instead of pushing everything into a closed SaaS.
  • Anyone who wants to treat agents as systems to monitor and improve, not features to “fire and forget.”

What can you actually do with it?

  • Trace every call, tool use, and step in an agent flow, with enough detail to debug real failures.
  • Run evaluations with readable scoring code that you can change when your domain needs different rules.
  • Generate and run simulations so you can see how the system behaves under varied, messy inputs.
  • Close the loop by using eval results and traces to drive fixes, guardrails, and optimization.

We have open-sourced the same stack we run ourselves, and the repo has now passed 950 stars, with people starting to use it and push on it in real projects.

The reason we are sharing it here is less “launch” and more “sanity check.”

If you think about agents and evaluation seriously, what do you see as missing from most stacks right now?

Is it better task-level metrics, better traces, better simulation, a cleaner feedback loop from production, or something else entirely?

If you want to try what we built in your own setup, the links are in the first comment.

u/Future_AGI — 8 days ago

We open-sourced the platform for self-improving AI agents. Now comes the part that matters: developers building on top of it.

A few weeks ago, we shared Future AGI here as our open-source AI stack for production agents.

Since then, the project has passed 800 GitHub stars, people have started contributing, and the feedback has become much more real.

The useful part was not the launch itself. It was seeing what happened once developers started trying to use the stack in their own workflows.

Some people came in through tracing. Some cared more about evals, simulations, or guardrails. Some wanted the full loop, from prototype to production, without stitching five separate tools together.

That has been the most interesting part for us.

The open-source platform for shipping self-improving AI agents. Evaluations, tracing, simulations, guardrails, gateway, optimization. Everything runs on one platform and one feedback loop, from first prototype to live deployment.

That sounds clean on paper. Open-source gets honest very quickly once people try it in real projects.

If setup is rough, people notice. If the docs miss a step, people notice. If a workflow makes sense in theory but feels awkward in practice, people notice.

That has helped a lot.

It has pushed us to think less about what sounds good in a launch post, and more about what actually helps a developer once an agent starts failing in non-obvious ways.

A few parts of the stack seem to pull the most attention:

  • traceAI, when teams want visibility into model calls, tool calls, latency, and failures.
  • evaluations, when teams want something more concrete than “the output looked fine.”
  • simulations, when teams want to test behavior before production becomes the test environment.
  • the broader loop, when teams want tracing, evals, guardrails, gateway, and optimization to work together instead of living in separate dashboards.

Once developers start using a stack in real agent workflows, the truth shows up fast.

That is where the rough edges become obvious: setup gaps, broken assumptions, missing steps, workflow friction, and bugs that no launch post will catch. If you are building with agents, try it in your own flow, build something with it, and tell us where it breaks or feels harder than it should.

That kind of feedback is the most useful for us right now: what worked, what did not, what felt confusing, and what you would want fixed before trusting it in a real system.

If you have not tried it yet and want to explore it, the links are in the first comment.

u/Future_AGI — 9 days ago

Everyone says they have AI agents in production. Nobody can clearly answer "how do you know it's actually working?" Can you?

Once an agent is live, the next question gets surprisingly hard to answer.

How do you know it is actually working?

Not in a demo. Not on a benchmark. In production.

We have spent a lot of time looking at agent systems across support bots, internal copilots, RAG workflows, and multi-step setups.

The surprising part is that the model is usually not the main problem. The harder part is defining what “working” means, then measuring it in a way that survives real usage.

A few patterns keep coming up.

An “autonomous research agent” gets judged by thumbs-up rate, but nobody can clearly describe what the bad 20% actually looks like.

A multi-agent workflow fails, but the team cannot tell whether the issue came from retrieval, routing, tool use, or state passed between steps.

An eval set looks strong in staging, but nobody is measuring production outputs closely enough to know whether that behavior holds up under real traffic.

A team says the agent does the job well, but they have never run it enough times across varied inputs to know where consistency starts to break.

That has changed how we think about production agents.

The fix is usually not “switch to a better model.” More often it is one of a few less glamorous things.

Write down what success looks like in a form that can actually be graded.

Trace each step, not just the final output.

Run broader scenario coverage before production sees the edge cases first.

Take failures from production and push them back into the eval set so the system does not keep passing the same stale checks.

That last one feels especially important.

A lot of eval sets get written once, then stay mostly frozen while prompts, models, tools, and workflows keep changing underneath them.
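The first fix in that list, writing success down in a form that can be graded, can be as plain as a set of named checks over a run. A minimal sketch follows; the criteria, the tool name `lookup_order`, and the run shape are invented for illustration and would be replaced by whatever "working" means for your agent.

```python
from typing import Callable

Run = dict  # hypothetical shape: {"answer": str, "tool_calls": list[str], "latency_s": float}

# Each criterion is a named yes/no check, so a failure always has a label.
CRITERIA: dict[str, Callable[[Run], bool]] = {
    "cites_a_source": lambda run: "[source:" in run["answer"],
    "looked_up_order": lambda run: "lookup_order" in run["tool_calls"],
    "under_10_seconds": lambda run: run["latency_s"] < 10,
}


def grade(run: Run) -> dict[str, bool]:
    """Grade one run against every written-down criterion."""
    return {name: check(run) for name, check in CRITERIA.items()}


def pass_rate(runs: list[Run]) -> dict[str, float]:
    """Share of runs that pass each criterion, so the bad slice has a shape."""
    grades = [grade(run) for run in runs]
    return {name: sum(g[name] for g in grades) / len(grades) for name in CRITERIA}


runs = [
    {"answer": "Shipped on Monday [source: order-db]", "tool_calls": ["lookup_order"], "latency_s": 3.2},
    {"answer": "It probably shipped.", "tool_calls": [], "latency_s": 12.0},
]
rates = pass_rate(runs)
```

A per-criterion pass rate says which part of "working" is slipping, which a single thumbs-up rate cannot do.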

The underlying issue is that many teams still talk about agents like they are features, when in practice they behave more like systems.

They have state, dependencies, failure modes, and weird interactions between parts. If that system is non-deterministic, then the job is not to pretend it is deterministic. The job is to make the behavior visible enough to debug, score, and improve.

That is where evals and observability start to matter.

Not as reporting layers, but as the thing that makes non-deterministic behavior legible.

We are curious how this looks for others shipping real agents.

What was the first thing that broke once your agent hit real users?

u/Future_AGI — 10 days ago

After 6 months building AI eval tooling, here's what I keep getting wrong

Hi community.

I am Nikhil, the founder at Future AGI. We build an open-source platform for evaluating, observing & guardrailing AI agents. Posting because 6 months in, I keep noticing the same pattern across teams and want to know if it's real. Not here for signups, demos, links, or DMs; bio has the rest. What I'd love: honest pushback, especially from people who think I'm wrong.

Now the actual thing.

I thought the hard part of building this would be the engineering: wrangling LLM outputs, building scoring, making dashboards fast. That part is work, but it's tractable.

The actual hard part has been getting teams to admit out loud what "good" means for their AI.

Stuff we keep watching happen:

  • A team ships an AI feature. Three months in I ask "how do you know it's working?" The answer is some variation of: thumbs up/down counts, vibe checks from the PM, or "nobody's complained loudly enough." That's the whole evaluation.
  • Eval criteria written down on a whiteboard during kickoff. Three weeks later, half of them contradict each other once there are real outputs to grade. Nobody had forced themselves to think it through.
  • The "we need evals" request comes from engineering. The "we already have evals" pushback comes from leadership. They mean completely different things by "evals." Nobody has named the gap.
  • Teams will spend a month building a fancy eval pipeline before deciding on one metric they actually care about. The pipeline runs. Numbers go up. Nobody knows if that's good.

What I keep getting wrong as a builder:

  1. I over-invest in better measurement before the customer has decided what to measure. Slow lesson: a clear definition of "good" beats a sophisticated evaluator of "??".
  2. I assume people will read a 600-word doc explaining a methodology. They will not. They will read a screenshot of one specific failing example.
  3. I keep building features for the eval pipeline. What customers actually want is the conversation that forces their team to align on what good looks like. The pipeline is just the excuse for the conversation.

Broader pattern I think we're seeing: most AI products are running blind, not because the tooling doesn't exist, but because nobody on the team has had the uncomfortable conversation about what failure even looks like. The tool is downstream of that conversation.

The version of this that actually works, I think, is a boring closed loop: catch what breaks in production, know why it broke, fix it, then ship a better baseline next time. That's what "self-improving agents" actually means in practice: a loop that closes. Without that, it's three new dashboards and a Slack channel nobody opens.

So here's the honest ask, since I said I'd be transparent about it:

I want to know if the pattern is real for you, or if I am building for a problem only a small slice of teams actually have.

If you're shipping AI features in any form (internal copilot, customer-facing chat, agent, RAG, anything), there are three questions I'd genuinely love an answer to:

  • What does "good" actually mean for your AI, in a way that survived contact with real model outputs?
  • How are you catching regressions right now? Structured evals, golden datasets, prod logs, thumbs-up/down, or honestly just gut feel?
  • Where does your current approach hurt the most?

If your honest answer is "we don't really have one yet", please say that. That's the most useful data point I could get.

And if you think the whole framing is wrong, tell me why. That's the comment I'm hoping for most.

u/Future_AGI — 11 days ago

The same question lands on this sub a few times a week, and the standard answers (RAGAS, DeepEval) are correct but stop one layer short of what you actually need once your app leaves a notebook. Wanted to lay out the full picture for anyone learning this in 2026.

LLM evaluation tooling sits in three layers. Most learners get pointed at layer one, hit a wall, and assume the field has nothing else to offer. It does.      
  
Layer 1: Metric libraries                                                                                                                                      

RAGAS is the cleanest example. You hand it rows of (question, context, answer, ground truth) and it scores each row on faithfulness, answer relevancy, context precision/recall, noise sensitivity, plus newer agentic metrics (tool call accuracy, agent goal accuracy).

Good for: a static eval set, an offline notebook, a paper.                                                                                                 

Limit: shaped around RAG. Once your app is an agent loop or multimodal beyond images, the metric set thins out fast.

Layer 2: Test frameworks

DeepEval is the canonical one. ~50 metrics including G-Eval, hallucination, bias, toxicity, task completion, tool correctness, plus image-level metrics. Pytest-style assertions, CI hook, custom LLM-as-judge.

Good for: regression-testing prompts and chains the way you regression-test code.

Limit: mostly offline. It tells you version N+1 is worse than N on a frozen dataset. It will not tell you what is happening on real traffic at 3 AM, or which span in a 20-step agent trace produced the failure.

Layer 3: Observability and evaluation platforms

The layer most tutorials skip, and the layer most production teams end up at. Tools here include Arize Phoenix, Langfuse, Braintrust, and Future AGI's ai-evaluation. They sit on top of OpenTelemetry traces (the GenAI Semantic Conventions are now a real spec) and run evaluators against live spans, not only static datasets.                                                                                                                                               

One technical detail worth knowing about this tier: almost all of them call third-party LLM judges (GPT-4, Claude) under the hood, so eval cost scales linearly with traffic and you inherit the judge model's latency. The interesting outlier is ai-evaluation, which ships its own trained evaluation models (the TURING family, covering text, image, and audio) and runs guardrails sub-100ms on live spans. 

Different trade-off: fixed-cost, low-latency scoring vs. the flexibility of swapping judge models per metric. Whether it matters depends on your scale: an MVP doesn't care, but an app doing online evals on every request very much does.
  
Good for: real users, agent loops, multimodal inputs, drift over time.

Limit: heavier setup. You instrument your app and accept some vendor coupling.

Why this matters more in 2026

Agents are now the default architecture. A single query can fan out into 20+ LLM calls, tool invocations, and retrieval steps. Sierra Research's τ²-bench (2025) showed dual-control settings cause large drops vs. single-turn evals; SWE-bench Pro pushed top models to ~23% from 70%+ on Verified. A single faithfulness score on the final answer hides where the failure happened.                                                                                       

Multimodal is also in production. lmms-eval v0.5 added 50+ audio/vision benchmarks; Video-MME (CVPR 2025) is the de facto video MLLM benchmark. The metric libraries have not caught up, and only a couple of the platform-tier tools natively score audio or video today.

A rough decision rule

  • Static RAG dataset, offline only: RAGAS.
  • Prompt or chain regression in CI: DeepEval or promptfoo.
  • Production traffic, agents, multimodal, drift: a platform-tier tool.
  • All three together is normal. They compose.

Question for the sub

For anyone running LLM apps close to or in production: what single metric has actually caught regressions for you, and how often does your judge disagree with your own review when you spot-check? Curious whether anyone has wired their CI eval into a production observability tool, and what the integration pain points were.                                                                                                                                                          
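If you want to put a number on the judge-versus-human question, a minimal sketch is to label the same spot-check sample twice and compute the raw disagreement rate plus Cohen's kappa, which corrects for agreement you would get by chance. The function names and the pass/fail labels are ours for illustration.

```python
from collections import Counter


def disagreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of spot-checked items where the judge and the human differ."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j != h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(judge)
    observed = 1.0 - disagreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge) | set(human)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
rate = disagreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

Raw disagreement can look small when almost everything passes, so kappa is the more honest figure on imbalanced samples.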

Happy to go deeper on any layer in the comments. 

u/Future_AGI — 15 days ago
▲ 5 r/dev

Software testing has a quiet assumption underneath it. Run the same input twice, get the same output twice.

Most of the tooling and instincts we built up in our early days of building software rest on that one line.

Agents break that assumption immediately.

The same prompt, the same tool definitions, the same user message can produce a different sequence of tool calls depending on the agent's "mood" (yes, that is a real variable now), the temperature setting, or whether the user said "please".

We started by trying to force agents into our existing test framework. Pin the seed, snapshot the outputs, diff the trees.

None of it worked, because the bugs were not at the level we were testing.

The actual production failures were sequencing bugs. The agent calling tools in the wrong order under pressure, skipping a verification step when a user got emotional, or calling the same tool twice and getting confused about its own state.

You cannot find these by reading transcripts, because there are too many. You cannot find them with assertions, because there are too many valid sequences. You cannot find them with pinned seeds, because the failure mode is the variance itself.

What started working for us was treating the agent less like a function and more like a service we have to load-test continuously.

We started calling this internally a self-improving AI agent pipeline, because the agent's production failures become inputs to its next version, with no human in the middle of the loop.

The loop has three parts.

Generate synthetic users with explicit personas (mood, context, pressure tactics) and drive them at the live agent.

Score the transcripts against domain metrics. Function-call accuracy and ordering matter much more than output similarity.

Take the failures and feed them back into prompt rewrites or fine-tuning data. Run the simulation again.
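To make the scoring step concrete, here is a minimal sketch of an ordering check, not our actual scorer. The tool names (`verify_identity`, `issue_refund`), the precedence rule, and the repeat limit are all hypothetical. The idea is to check constraints on the sequence instead of asserting one exact sequence, since many orderings are valid.

```python
from collections import Counter

# Hypothetical precedence rules: (earlier_tool, later_tool) means the earlier
# tool must already have been called when the later tool is called.
MUST_PRECEDE = [("verify_identity", "issue_refund")]


def score_tool_sequence(calls, must_precede=MUST_PRECEDE, max_repeats=1):
    """Return a list of violation strings for one transcript's tool calls."""
    violations = []
    seen = set()
    for name in calls:
        for earlier, later in must_precede:
            if name == later and earlier not in seen:
                violations.append(f"{later} called before {earlier}")
        seen.add(name)
    # Flag tools called more often than the (assumed) allowed repeat count.
    for name, count in Counter(calls).items():
        if count > max_repeats:
            violations.append(f"{name} called {count} times")
    return violations


def failure_rate(transcripts):
    """Fraction of transcripts with at least one violation."""
    if not transcripts:
        return 0.0
    failed = sum(1 for calls in transcripts if score_tool_sequence(calls))
    return failed / len(transcripts)
```

The transcripts that come back with violations are the ones you would feed into the next round of prompt rewrites or fine-tuning data.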

It looks more like fuzzing than unit testing.

The point is not to catch one specific bug. It is to surface the distribution of behaviors and shrink the tail of the bad ones.

The thing we did not expect was that the prompts that came out of this loop were uglier than what we would have written by hand. Long, defensive, full of clauses our reviewer would have flagged as redundant.

They were not redundant. They were closing failure modes a human did not think to script.

We are starting to think the SDLC for agents is going to look closer to chaos engineering than to traditional testing. Less "does this assertion hold" and more "how often does the whole system stay inside its envelope across a population of users".
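One way to put a number on "how often does the system stay inside its envelope" is a pass rate over simulated users with a confidence interval around it. The sketch below uses the Wilson score interval, which is our choice for illustration and not something the pipeline prescribes. What counts as a "pass" is whatever your envelope check says.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total == 0:
        raise ValueError("need at least one simulated run")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    )
    return max(0.0, center - half), min(1.0, center + half)
```

Comparing the interval before and after a prompt change tells you whether the change moved the rate or you just got a different sample of users. It does not answer the stopping question on its own, but overlapping intervals across several iterations are at least a signal that more rounds are buying less.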

The question we keep getting stuck on: how do you decide when to stop iterating, since this kind of loop technically never converges?

If anyone wants the full breakdown of how the self-improving AI agent pipeline works end-to-end (architecture, evals, the optimizer we use, code samples), link is in the first comment.

u/Future_AGI — 16 days ago

We've been going deep on evaluation pipelines lately and wanted to share something that clicked for us: the difference between LLM Judges and Eval Agents, and why you probably need both.

Most of us start with LLM-as-a-Judge. You write some instructions, throw in your {{input}} and {{output}}, pick a model, and get a Pass/Fail back. It works, it's fast, it's cheap. For things like relevance, tone, coherence, honestly it's pretty solid.
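For reference, a bare-bones judge looks roughly like the sketch below. The template wording and the `call_model` parameter are hypothetical; `call_model` stands in for whatever client you use and just takes a prompt string and returns the model's reply.

```python
import re

# Hypothetical judge instructions with the two placeholders filled per example.
JUDGE_TEMPLATE = (
    "You are grading a response for relevance.\n"
    "User input: {{input}}\n"
    "Response: {{output}}\n"
    "Answer with Pass or Fail, then one sentence of reasoning."
)


def render_judge_prompt(template, values):
    """Fill {{name}} placeholders in a single pass from a dict of values."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)


def parse_verdict(raw):
    """Map the judge's raw reply to 'Pass', 'Fail', or None if unparseable."""
    words = raw.strip().split()
    if not words:
        return None
    first = words[0].strip(".:,*").lower()
    return {"pass": "Pass", "fail": "Fail"}.get(first)


def judge(input_text, output_text, call_model, template=JUDGE_TEMPLATE):
    prompt = render_judge_prompt(
        template, {"input": input_text, "output": output_text}
    )
    return parse_verdict(call_model(prompt))
```

Returning `None` for replies that are neither Pass nor Fail is deliberate: unparseable verdicts are worth counting separately instead of silently treating them as failures.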

But here's where it breaks down.

Say you're evaluating a RAG pipeline and you want to know if the generated answer is actually factually correct based on your internal docs. An LLM Judge can't do that, because it can only judge what's in the prompt. It has no way to go verify anything.

That's where Eval Agents come in.

Instead of a single-turn judgment, an Eval Agent reasons over multiple steps (up to 15 iterations) and has access to tools like knowledge base search, web browsing, or external APIs. So it'll go look something up, compare it, reason about it, and then give you a verdict with a detailed explanation.
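The shape of that loop, as a minimal sketch: everything here is hypothetical scaffolding, not the product's API. `decide_next_step` stands in for the model deciding what to do next, and the scripted policy and in-memory "knowledge base" below exist only so the example runs without a model.

```python
MAX_ITERATIONS = 15  # the iteration cap mentioned above


def run_eval_agent(claim, decide_next_step, tools, max_iterations=MAX_ITERATIONS):
    """Alternate between tool calls and reasoning until a verdict or the cap."""
    evidence = []
    for step in range(1, max_iterations + 1):
        action = decide_next_step(claim, evidence)
        if action["type"] == "verdict":
            return {
                "verdict": action["verdict"],
                "explanation": action["explanation"],
                "steps": step,
                "evidence": evidence,
            }
        tool = tools[action["tool"]]
        evidence.append((action["tool"], action["query"], tool(action["query"])))
    return {
        "verdict": "Inconclusive",
        "explanation": "iteration budget exhausted",
        "steps": max_iterations,
        "evidence": evidence,
    }


# Scripted stand-ins so the sketch runs without a model or a real index.
DOCS = {"refund window": "Refunds are accepted within 30 days of purchase."}


def kb_search(query):
    return DOCS.get(query, "")


def scripted_policy(claim, evidence):
    if not evidence:
        return {"type": "tool", "tool": "kb_search", "query": "refund window"}
    found = evidence[-1][2]
    ok = "30 days" in found and "30 days" in claim
    return {
        "type": "verdict",
        "verdict": "Pass" if ok else "Fail",
        "explanation": f"Knowledge base says: {found}",
    }
```

The difference from a judge is the `evidence` list: the verdict is grounded in what the tools returned, not only in what was pasted into the prompt.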

Our rough mental model for when to use each:

  • Checking if a response is helpful, on-topic, or well-written → LLM Judge (fast + cheap)
  • Verifying facts against your docs or the web → Eval Agent
  • High-volume batch scoring → LLM Judge
  • Auditing complex multi-step agent traces → Eval Agent
  • Format/regex/exact match checks → Code Eval (don't burn LLM tokens on this)
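For the last row, a Code Eval can be as small as the checks below. This is a minimal sketch using only the standard library; the function names and the example order-ID pattern are made up for illustration.

```python
import json
import re


def exact_match(output, expected):
    """True if the output equals the expected string, ignoring outer whitespace."""
    return output.strip() == expected.strip()


def matches_pattern(output, pattern):
    """True if the whole output matches the regex."""
    return re.fullmatch(pattern, output.strip()) is not None


def is_json_with_keys(output, required_keys):
    """True if the output parses as a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.keys() >= set(required_keys)
```

For example, `matches_pattern(reply, r"ORD-\d{5}")` checks that a reply is exactly a five-digit order ID. Checks like these are deterministic, so they can run on every output without spending any tokens.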

The trap we see a lot of devs fall into is trying to make LLM Judges do everything. They end up writing these massive prompt instructions trying to get the judge to "simulate" fact-checking, and it just doesn't hold up at scale.

Use the right tool for the job. LLM Judges for subjective quality. Eval Agents for anything needing real reasoning or external grounding. Code Evals for deterministic stuff.

We'd like to know what evaluation setups you all are running in prod. Are you doing anything beyond basic LLM Judges?

u/Future_AGI — 17 days ago
▲ 10 r/dev

At Future AGI, we drop an engineering leaderboard in our internal tech channel every week.

It started as a fun way to summarize the week. It turned into a better question about what engineering teams choose to reward.

Most teams already notice the obvious work. Big features, visible launches, ticket count. The less visible work is where things usually get distorted: code reviews, docs, tests, refactors, and deleting bad code before it turns into future pain.

So we built a weighted score instead of a raw output board.

PRs count. Reviews count. Tickets count. Docs count. Tests count. Deleting code counts too, because subtraction is often real engineering progress.

A few examples from last week:

  • One engineer led on ticket throughput across annotations, LiveKit migration, and RBAC.
  • Another touched six repos, shipped across the TraceAI SDK, Prism Gateway, and agentic-eval, and still did 45 reviews.
  • Another rewrote the docs end to end.
  • Another deleted 188,000 lines from the eval engine, and that counted because healthy codebases need subtraction too.

The part that made this useful was not the ranking. It was the weighting.

We gave partial credit for code deletion, because cleanup matters. We gave credit for tests, because shipping without confidence is debt with better marketing. We gave lower but explicit credit for docs, because documentation is engineering work even when the author did not write the feature.

We also reward smaller focused PRs.

That one changed behavior fast. If teams only reward volume, they get giant Friday PRs that nobody wants to review and everybody merges with half their brain turned off. If teams reward reviewable changes, they get faster feedback and fewer silent regressions.
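To show the shape of the scoring, here is a minimal sketch. Every number in it is a made-up placeholder, not our real weighting. The only things taken from the description above are the structure: explicit weights per activity, a lower weight for docs, partial credit for deleted lines, and a bonus for small PRs.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only to illustrate the structure.
WEIGHTS = {"prs": 5.0, "reviews": 3.0, "tickets": 2.0, "tests": 3.0, "docs": 1.5}
DELETION_CREDIT_PER_1K_LINES = 0.5  # partial credit for cleanup
SMALL_PR_MAX_LINES = 300            # assumed cutoff for a "small" PR
SMALL_PR_BONUS = 1.0


@dataclass
class WeeklyActivity:
    prs: int = 0
    reviews: int = 0
    tickets: int = 0
    tests: int = 0
    docs: int = 0
    lines_deleted: int = 0
    pr_sizes: tuple = ()  # changed lines per PR


def weekly_score(activity):
    """Weighted sum of counted work plus deletion credit and small-PR bonus."""
    score = sum(weight * getattr(activity, name) for name, weight in WEIGHTS.items())
    score += DELETION_CREDIT_PER_1K_LINES * activity.lines_deleted / 1000
    score += SMALL_PR_BONUS * sum(
        1 for size in activity.pr_sizes if size <= SMALL_PR_MAX_LINES
    )
    return score
```

Keeping the weights in one visible table is the useful part: when someone argues that tests should count more, the change is a one-line diff everyone can review.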

A leaderboard like this can go wrong very easily.

If it becomes performance management, people game it. If it overweights lines changed, people optimize for motion instead of outcomes. If it ignores reviews, docs, and cleanup, it teaches the team that maintenance work is second-class work.

So we treat it as a highlight reel, not a compensation system.

The goal is simple: make invisible engineering work visible enough that the team actually respects it.

That has led to better conversations than we expected. People now argue about weighting. Should test work count more? Should deletion count the same as new code? Should cross-repo changes get extra credit? Should reviews be weighted by complexity instead of count?

Those arguments are useful. They force a team to say what “good engineering” actually means in practice.

A small side effect of this culture has been the OSS response. We recently open-sourced the core Future AGI stack on GitHub as “the open-source platform for shipping self-improving AI agents,” and the repo is now past 800 stars, with people contributing across the stack.

That has been fun to watch, because the same work that improves an internal codebase (reviews, docs, cleanup, test discipline) also makes an open-source project easier for other engineers to trust and join.

For anyone curious, the repo is in the first comment.

We just want to know how other teams handle this.

When it comes to engineering work, what would you reward more than most teams do? Would it be shipping, reviewing quality, writing documentation, running tests, fixing bugs, or getting rid of code?

u/Future_AGI — 18 days ago

A lot of teams have already made the first important choice.

They picked LangChain as the orchestration layer.

That usually makes sense. It gives you a flexible way to connect models, retrievers, tools, memory, and workflows into one application. Once LangChain is in place, the next layer starts to matter more, and that is where teams begin choosing between point tools and a broader production stack.

Where Langfuse fits well

Langfuse is already a strong open-source option for teams that want observability around LLM apps. It supports self-hosting and covers tracing, prompt management, datasets, experiments, and evaluation workflows in a way that fits naturally into modern LLM app development.

If your LangChain setup mainly needs better visibility, cleaner prompt workflows, experiment tracking, and evaluation tied to traces and sessions, Langfuse already solves a meaningful part of that stack well.

That is why a lot of teams like it. It gives structure to the observability layer without forcing you into a closed product model.

Where Future AGI adds more

What we built at Future AGI starts from a different assumption.

We assumed LangChain would already handle orchestration. What many teams still need after that is the production system around the orchestration layer, not just the observability layer. So the stack we open-sourced goes beyond tracing and experiments into simulation, evaluation, protection, gateway control, prompt optimization, and the platform loop that connects them.

That matters because most production teams do not stop at visibility. They want to replay the pattern, test the fix, score the output, block unsafe responses, route traffic cleanly, and keep watching the rollout after deployment.

How the platform is structured

Future AGI is built around six platform layers:

  • Simulate, for multi-turn testing across personas, adversarial inputs, and edge cases, including text and voice workflows.
  • Evaluate, with 50+ metrics including groundedness, hallucination, tool-use correctness, PII, tone, and custom rubrics.
  • Protect, with 18 built-in scanners plus 15 vendor adapters for jailbreaks, prompt injection, privacy, and policy checks.
  • Monitor, with OpenTelemetry-native tracing across 50+ frameworks, including LangChain, plus latency, token cost, span graphs, and dashboards.
  • Agent Command Center, an OpenAI-compatible gateway with 100+ providers, routing strategies, semantic caching, virtual keys, MCP, and A2A support.
  • Optimize, with six prompt-optimization algorithms, including GEPA and PromptWizard, where production traces feed into optimization workflows.

In simple terms, Langfuse is strong on the LLM engineering and observability side, while Future AGI goes further into the full production loop around the agent.

What this means for a LangChain team

If LangChain is your orchestration layer, then the stack around it shapes what you can do next.

With an observability-first stack, you can inspect traces, compare prompts, run experiments, and score outputs more cleanly.

With a broader production stack, you can generate synthetic scenarios before rollout, run evaluation suites against those scenarios, block unsafe outputs on the live path, route requests across providers, and feed failed cases back into prompt optimization.

That means a support agent can move from “we saw a bad answer in tracing” to “we reproduced the pattern, tested candidate fixes, protected the output path, and shipped with monitoring in place.” It also means routing and cost control do not need to live as ad hoc logic inside the app layer, because the gateway can handle provider routing, caching, keys, and traffic management as part of the stack.
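To illustrate what "routing and caching live in the gateway, not the app" means, here is a toy sketch. It is not the Agent Command Center's API. The `Gateway` class and the provider callables are hypothetical, and it uses a plain exact-match cache where a real gateway would use semantic caching.

```python
class Gateway:
    """Toy gateway: ordered provider fallback plus an exact-match cache."""

    def __init__(self, providers):
        # providers: ordered list of (name, callable taking a prompt string)
        self.providers = list(providers)
        self.cache = {}

    def complete(self, prompt):
        """Return (reply, source), where source is a provider name or 'cache'."""
        if prompt in self.cache:
            return self.cache[prompt], "cache"
        errors = []
        for name, call in self.providers:
            try:
                reply = call(prompt)
            except Exception as exc:  # any provider failure triggers fallback
                errors.append(f"{name}: {exc}")
                continue
            self.cache[prompt] = reply
            return reply, name
        raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The app only ever calls `complete`. Which provider answered, whether a fallback happened, and whether the reply came from cache are gateway concerns that can change without touching application code.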

Deployment and libraries

Deployment is part of the difference too.

Langfuse is open source and supports self-hosting, which is one reason teams choose it. Future AGI is also open source, with the full platform repo live on GitHub, public documentation, and self-hosted deployment paths documented as part of the platform.

Future AGI also ships multiple client libraries that map to different production jobs:

  • traceAI for zero-config OTel tracing across Python, TypeScript, Java, and C#.
  • ai-evaluation for 50+ evaluation metrics and guardrail scanners.
  • futureagi for datasets, prompts, knowledge bases, and experiments.
  • agent-opt for prompt optimization workflows.
  • simulate-sdk for voice-agent simulation.
  • agentcc for gateway clients across Python, TypeScript, LangChain, LlamaIndex, React, and Vercel.

That makes the integration story broader than just “send traces somewhere.” Different layers can be adopted based on what the team needs first.

Repo in the first comment. Happy to answer technical questions.

u/Future_AGI — 25 days ago