u/Busy_Weather_7064

What the hell Antigravity ? Now I won't even get free 1000 credits ?

What the hell Antigravity ? Now I won't even get free 1000 credits ?

I used to rely on those 1000 Antigravity credits a lot, and now they're removing it !!! I'm AI pro plan user and it hurts a lot.

u/Busy_Weather_7064 — 19 hours ago

Frameworks do not make your agent reliable. Evaluations do.

If you look at most agent product pitches today, the story goes like this:

  • “We use a cutting‑edge multi‑agent framework.”
  • “We have tools and memory and a planner.”
  • “We are integrated with half the AI ecosystem.”

What you rarely see is:

  • “We can show you that our agent remains reliable when tools fail, latency spikes, and inputs get weird.”

Frameworks are useful. I am not anti‑framework. LangGraph, CrewAI, AutoGen, Goose and friends have moved the whole field forward.

They just do not solve the reliability problem for you.

The illusion of structure

Most frameworks give you structure: nodes, edges, tools, retry handlers, event streams.

It feels like the agent is well behaved because it is now drawn as a graph.

In practice, the same problems keep showing up:

  • Tools silently fail and the agent fills in the blanks
  • Guardrails are configured once and then never evaluated again
  • Handlers catch exceptions but nobody checks whether the overall outcome is still acceptable

You can have a beautifully structured graph that fails in exactly the same ways as a weekend script.

What an evaluation pipeline actually does

An evaluation pipeline, done right, is much less glamorous than an agent framework.

It does things like:

  • Replaying real production traces in a controlled environment
  • Injecting the failures you already see in logs
  • Measuring how often the agent still does the right thing
  • Turning those measurements into a feedback loop for your prompts and code

EvalMonkey is my attempt to make that boring work easier for agent teams.

It does not care whether you built your agent with LangGraph, Goose, a custom orchestrator, or a single giant function. As long as you can expose a simple HTTP endpoint, you can benchmark it.

Our experiment: frameworks vs evals

In our 10 agent benchmark, we deliberately picked a mix:

  • Framework heavy agents
  • Hand‑rolled agents
  • Browser agents
  • Docs and support agents

The frameworks gave us better ergonomics and nicer diagrams.

The evaluation harness gave us insight into how they behave under stress.

The teams that benefit most from EvalMonkey are not the ones with the fanciest agent stack. It is the ones who are honest enough to admit that their agents see the same boring failures as everyone else.

What to add if you already have a framework

If you built on top of a framework, you are not starting from scratch. You probably already have:

  • A clear entrypoint where inputs arrive
  • Centralised tool definitions
  • Traces in Langfuse or something similar

You can layer EvalMonkey on top without throwing anything away:

  • Add a thin HTTP wrapper around your framework entrypoint
  • Write a few EvalMonkey scenarios that mimic your core user flows
  • Define chaos profiles that match the failure patterns you see in production
  • Run the benchmark regularly and track changes over time

The value is not in having evaluations. It is in having evaluations that are tied to real workflows and real failure modes.

If you are proud of your agent stack, that is great. The next step is to be proud of your evaluation stack.

If you like the idea of frameworks and evaluations being treated as peers, not substitutes, star the repo and show it to the person on your team who is always debugging the weird edge cases.

u/Busy_Weather_7064 — 7 days ago

There is a specific kind of frustration that only AI builders know.

You open your favorite “research agent” and ask it a question.
You refine the question.
You repeat it, slightly different.

On the third try, it finally gives you something usable.

Nothing crashed. No stack trace. No alert. Just quiet, inconsistent behavior that feels like gaslighting. Yesterday it answered that class of question on the first attempt. Today it needs three tries.

Now imagine being the customer on the other side of this.

You are not thinking about tool calls or token windows. You are just thinking “this thing does not listen” and “I cannot trust this for anything important.”

The reliability gap

Most agent teams I talk to have logs. They have Langfuse or an equivalent. They can replay traces and see what went wrong. Some even have a wall of dashboards.

What they usually do not have is a standard, repeatable answer to:

  • What failures do our agents hit most often
  • How often they reappear after we “fix” them
  • Whether a change actually made the agent more reliable in the real world

We shipped EvalMonkey because I was tired of hearing myself say the same sentence in my head: “I know this agent is flaky, but I cannot prove it in a way that survives a product meeting.”

Real benchmarks, not vibes

With EvalMonkey we benchmarked 10 open source agents that people actually use. Things like GPT Researcher, Open Deep Research, OpenResearcher, deep‑research, OnCell Support Agent, Local Docs AI Agent, Index, browser_agent, the Browser‑Use Couchbase demo and Goose.

For each of them we:

  • Wrapped the agent behind a tiny HTTP contract
  • Hit it with the same scenarios
  • Ran a baseline run
  • Then ran chaos runs that simulate the stuff that actually happens in production - slow tools, flaky tools, bad responses, subtle changes in input shape.

We did not try to “break them” with pathological prompts. We just modeled the boring, ugly failures that show up in real traces.

Results were exactly what you would expect if you have ever tried to use these systems under pressure:

  • Agents that looked “good” in one shot demos fell over when a tool got slow or returned a slightly different schema
  • Research agents that were impressive on a one off query quietly skipped entire steps under chaos
  • Browser agents got stuck in loops and never backed off or gave up

None of this shows up in a nice way if your only instrument is “we tried it a few times and it seemed fine.”

My personal breaking point

The thing that pushed me over the edge was not a benchmark. It was an app builder.

You know the pattern. You describe an app. The tool says it will code it, run it, and tell you when it is done.

In my case, it happily declared “App building is finished” and showed a green checkmark. There was only one small bug.

The app did not run.

No health check. No smoke test. No “I tried to start the server and it failed.” Just a success message over a broken experience. That is not an LLM problem. That is a reliability problem.

Same story with in‑app chat builders. I have had agents get stuck mid conversation, clearly in some internal loop, while the UI just spins. No error surfaced, no graceful fallback, no evaluators catching the regression.

At some point you realise this is not “AI being AI.” It is just the absence of good evaluation.

What EvalMonkey gives you

EvalMonkey is basically a harness for putting agents through standard failure modes, over and over again, until you have numbers instead of vibes.

You define:

  • A set of real scenarios
  • A common HTTP interface
  • The chaos profiles you care about

You get back:

  • Baseline performance
  • Performance under chaos
  • A “production reliability” style view of how often the agent still does the right thing when tools, latency and input shape are not ideal.

There is nothing magical about that. It is just what we should have had from day one.

Why this matters now

Most teams I talk to are past the “cool demo” phase. They are in the stage where a VP of Support or CTO quietly asks “Can this thing handle real tickets without embarrassing us.”

If your answer is:

  • “We eyeballed some traces” or
  • “We ran a few scripts locally”

you already know that is not going to scale.

If your answer is:

  • “We run standard benchmarks across a suite of agents using EvalMonkey, and we know exactly which failures we can catch before they hit customers”

that is a very different conversation.

If any of this sounds familiar, take a look at the EvalMonkey repo:
https://github.com/Corbell-AI/evalmonkey

Clone it, point it at your agent, and see what happens when you turn chaos on. If you want to go deeper, I am happy to share the raw logs for our OSS agent benchmarks as a zip for anyone who really wants to dig into failure patterns.

If the project resonates, star the repo so more teams see it and we can raise the bar for what “production ready agent” actually means.

u/Busy_Weather_7064 — 16 days ago

In the first post (link in comment) I ranked five popular research agents on pure capability with Claude Haiku 4.5. Same scenarios, same judge, same harness. This time I turn on EvalMonkey’s chaos engine and ask a more production‑shaped question.

What “chaos” means in EvalMonkey

Chaos runs reuse the exact same scenario and target endpoint and insert a hostile component in the middle. EvalMonkey calls these chaos profiles, selected via --chaos-profile per run.

For text‑only agents I used two profiles:

  • clientpromptinjection – adversarial instructions are mixed into the prompt, the kind of “ignore previous instructions and do X” you see as jailbreak attempts.
  • clientschemamutation – the request payload is mangled: keys moved, extra fields added, types changed, etc.

For hotpotqa I ran both profiles for each agent. For truthfulqa and mmlu I used prompt injection only; schema‑mutation on every scenario would have blown the runtime past what I was willing to babysit. That gives me 5 chaos data points + 3 baseline points per agent.

Chaos scores and production reliability (Haiku 4.5)

Chaos score is the average across those chaos runs. I define drop as:

Drop=Baseline−ChaosDrop=Baseline−Chaos

and production reliability as:

Reliability=0.6⋅Baseline+0.4⋅ChaosReliability=0.6⋅Baseline+0.4⋅Chaos

I weight baseline a bit more because in production both capability and robustness matter, but capability still dominates.

Here is the table for Haiku 4.5:

textAgent Baseline avg Chaos avg Drop (baseline − chaos) Production reliability
GPT Researcher 62.3 26.8 35.5 48.1
Open Deep Research (LangChain) 48.7 39.5 9.2 45.0
OpenResearcher 50.3 32.8 17.5 43.3
deep‑research (dzhng) 43.7 42.5 1.2 43.2
Goose 32.7 50.3 −17.7 39.7

One explicit example: GPT Researcher’s reliability is

0.6⋅62.3+0.4⋅26.8=48.10.6⋅62.3+0.4⋅26.8=48.1

and you can see how the 35.5‑point drop under chaos pulls it down.

Reliability ranking on Haiku 4.5

If you sort by the reliability metric instead of pure baseline:

textRank Agent Production reliability
1 GPT Researcher 48.1
2 Open Deep Research (LangChain) 45.0
3 OpenResearcher 43.3
4 deep‑research (dzhng) 43.2
5 Goose 39.7

Three of these are within three points of each other; small changes in the weighting would shuffle the order.

The important bit is not the exact rank; it is that:

  • GPT Researcher’s lead shrinks from a 12‑point capability gap to a 3‑point reliability gap.
  • dzhng’s micro‑agent trails GPT Researcher by only 4.9 points on reliability despite being far simpler.

What to take away from baseline + chaos together

From the first two posts combined I would keep four rules in your head:

  1. Capability and production reliability are different rankings. You need both numbers before you pick an agent.
  2. Smaller agents can hold up better under chaos. Less surface area, fewer moving parts, fewer ways to go wrong.
  3. Most chaos damage originates in the serving layer, not the agent logic. A thin wrapper that does not validate or sanitize inputs makes any agent look fragile.
  4. Style is part of robustness. Terse agents win under schema mutation; structured agents resist prompt injection better than free‑form ones.

In the next post I'll repeat the entire experiment with Claude Sonnet 4.5 as the shared backbone instead of Haiku and compare the deltas.

u/Busy_Weather_7064 — 18 days ago

Hey folks, curious to know what framework or harness you are using to make your agents more reliable ? Have you been facing issues in production after local testing of the agent is done ?

reddit.com
u/Busy_Weather_7064 — 22 days ago
▲ 9 r/AIQuality+2 crossposts

Most agent evals test whether an agent can solve the happy-path task.

But in practice, agents usually break somewhere else:

  • tool returns malformed JSON
  • API rate limits mid-run
  • context gets too long
  • schema changes slightly
  • retrieval quality drops
  • prompt injection slips in through context

That gap bothered me, so I built EvalMonkey.

It is an open source local harness for LLM agents that does two things:

  1. Runs your agent on standard benchmarks
  2. Re-runs those same tasks under controlled failure conditions to measure how hard it degrades

So instead of only asking:

"Can this agent solve the task?"

you can also ask:

"What happens when reality gets messy?"

A few examples of what it can test:

  • malformed tool outputs
  • missing fields / schema drift
  • latency and rate limit behavior
  • prompt injection variants
  • long-context stress
  • retrieval corruption / noisy context

The goal is simple: help people measure reliability under stress, not just benchmark performance on clean inputs.

Why I built it:
My own agent used to take 3 attempts to get the accurate answer I'm looking for :/ , or timeout when handling 10 pager long documents.
I also kept seeing agents look good on polished demos and clean evals, then fail for very ordinary reasons in real workflows. I wanted a simple way to reproduce those failure modes locally, without setting up a lot of infra.

It is open source, runs locally, and is meant to be easy to plug into existing agent workflows.

Repo: https://github.com/Corbell-AI/evalmonkey Apache 2.0

Curious what breaks your agent most often in practice:
bad tool outputs, rate limits, long context, retrieval issues, or something else?

u/Busy_Weather_7064 — 22 days ago