What the hell Antigravity ? Now I won't even get free 1000 credits ?

I used to rely on those 1000 Antigravity credits a lot, and now they're removing it !!! I'm AI pro plan user and it hurts a lot.

u/Busy_Weather_7064 — 19 hours ago

▲ 2 r/OpenSourceAI

Frameworks do not make your agent reliable. Evaluations do.

If you look at most agent product pitches today, the story goes like this:

“We use a cutting‑edge multi‑agent framework.”
“We have tools and memory and a planner.”
“We are integrated with half the AI ecosystem.”

What you rarely see is:

“We can show you that our agent remains reliable when tools fail, latency spikes, and inputs get weird.”

Frameworks are useful. I am not anti‑framework. LangGraph, CrewAI, AutoGen, Goose and friends have moved the whole field forward.

They just do not solve the reliability problem for you.

The illusion of structure

Most frameworks give you structure: nodes, edges, tools, retry handlers, event streams.

It feels like the agent is well behaved because it is now drawn as a graph.

In practice, the same problems keep showing up:

Tools silently fail and the agent fills in the blanks
Guardrails are configured once and then never evaluated again
Handlers catch exceptions but nobody checks whether the overall outcome is still acceptable

You can have a beautifully structured graph that fails in exactly the same ways as a weekend script.

What an evaluation pipeline actually does

An evaluation pipeline, done right, is much less glamorous than an agent framework.

It does things like:

Replaying real production traces in a controlled environment
Injecting the failures you already see in logs
Measuring how often the agent still does the right thing
Turning those measurements into a feedback loop for your prompts and code

EvalMonkey is my attempt to make that boring work easier for agent teams.

It does not care whether you built your agent with LangGraph, Goose, a custom orchestrator, or a single giant function. As long as you can expose a simple HTTP endpoint, you can benchmark it.

Our experiment: frameworks vs evals

In our 10 agent benchmark, we deliberately picked a mix:

Framework heavy agents
Hand‑rolled agents
Browser agents
Docs and support agents

The frameworks gave us better ergonomics and nicer diagrams.

The evaluation harness gave us insight into how they behave under stress.

The teams that benefit most from EvalMonkey are not the ones with the fanciest agent stack. It is the ones who are honest enough to admit that their agents see the same boring failures as everyone else.

What to add if you already have a framework

If you built on top of a framework, you are not starting from scratch. You probably already have:

A clear entrypoint where inputs arrive
Centralised tool definitions
Traces in Langfuse or something similar

You can layer EvalMonkey on top without throwing anything away:

Add a thin HTTP wrapper around your framework entrypoint
Write a few EvalMonkey scenarios that mimic your core user flows
Define chaos profiles that match the failure patterns you see in production
Run the benchmark regularly and track changes over time

The value is not in having evaluations. It is in having evaluations that are tied to real workflows and real failure modes.

If you are proud of your agent stack, that is great. The next step is to be proud of your evaluation stack.

If you like the idea of frameworks and evaluations being treated as peers, not substitutes, star the repo and show it to the person on your team who is always debugging the weird edge cases.

u/Busy_Weather_7064 — 7 days ago

▲ 2 r/OpenSourceAI

There is a specific kind of frustration that only AI builders know.

You open your favorite “research agent” and ask it a question.
You refine the question.
You repeat it, slightly different.

On the third try, it finally gives you something usable.

Nothing crashed. No stack trace. No alert. Just quiet, inconsistent behavior that feels like gaslighting. Yesterday it answered that class of question on the first attempt. Today it needs three tries.

Now imagine being the customer on the other side of this.

You are not thinking about tool calls or token windows. You are just thinking “this thing does not listen” and “I cannot trust this for anything important.”

The reliability gap

Most agent teams I talk to have logs. They have Langfuse or an equivalent. They can replay traces and see what went wrong. Some even have a wall of dashboards.

What they usually do not have is a standard, repeatable answer to:

What failures do our agents hit most often
How often they reappear after we “fix” them
Whether a change actually made the agent more reliable in the real world

We shipped EvalMonkey because I was tired of hearing myself say the same sentence in my head: “I know this agent is flaky, but I cannot prove it in a way that survives a product meeting.”

Real benchmarks, not vibes

With EvalMonkey we benchmarked 10 open source agents that people actually use. Things like GPT Researcher, Open Deep Research, OpenResearcher, deep‑research, OnCell Support Agent, Local Docs AI Agent, Index, browser_agent, the Browser‑Use Couchbase demo and Goose.

For each of them we:

Wrapped the agent behind a tiny HTTP contract
Hit it with the same scenarios
Ran a baseline run
Then ran chaos runs that simulate the stuff that actually happens in production - slow tools, flaky tools, bad responses, subtle changes in input shape.

We did not try to “break them” with pathological prompts. We just modeled the boring, ugly failures that show up in real traces.

Results were exactly what you would expect if you have ever tried to use these systems under pressure:

Agents that looked “good” in one shot demos fell over when a tool got slow or returned a slightly different schema
Research agents that were impressive on a one off query quietly skipped entire steps under chaos
Browser agents got stuck in loops and never backed off or gave up

None of this shows up in a nice way if your only instrument is “we tried it a few times and it seemed fine.”

My personal breaking point

The thing that pushed me over the edge was not a benchmark. It was an app builder.

You know the pattern. You describe an app. The tool says it will code it, run it, and tell you when it is done.

In my case, it happily declared “App building is finished” and showed a green checkmark. There was only one small bug.

The app did not run.

No health check. No smoke test. No “I tried to start the server and it failed.” Just a success message over a broken experience. That is not an LLM problem. That is a reliability problem.

Same story with in‑app chat builders. I have had agents get stuck mid conversation, clearly in some internal loop, while the UI just spins. No error surfaced, no graceful fallback, no evaluators catching the regression.

At some point you realise this is not “AI being AI.” It is just the absence of good evaluation.

What EvalMonkey gives you

EvalMonkey is basically a harness for putting agents through standard failure modes, over and over again, until you have numbers instead of vibes.

You define:

A set of real scenarios
A common HTTP interface
The chaos profiles you care about

You get back:

Baseline performance
Performance under chaos
A “production reliability” style view of how often the agent still does the right thing when tools, latency and input shape are not ideal.

There is nothing magical about that. It is just what we should have had from day one.

Why this matters now

Most teams I talk to are past the “cool demo” phase. They are in the stage where a VP of Support or CTO quietly asks “Can this thing handle real tickets without embarrassing us.”

If your answer is:

“We eyeballed some traces” or
“We ran a few scripts locally”

you already know that is not going to scale.

If your answer is:

“We run standard benchmarks across a suite of agents using EvalMonkey, and we know exactly which failures we can catch before they hit customers”

that is a very different conversation.

If any of this sounds familiar, take a look at the EvalMonkey repo:
https://github.com/Corbell-AI/evalmonkey

Clone it, point it at your agent, and see what happens when you turn chaos on. If you want to go deeper, I am happy to share the raw logs for our OSS agent benchmarks as a zip for anyone who really wants to dig into failure patterns.

If the project resonates, star the repo so more teams see it and we can raise the bar for what “production ready agent” actually means.

u/Busy_Weather_7064 — 16 days ago

▲ 1 r/OpenSourceAI

In the first post (link in comment) I ranked five popular research agents on pure capability with Claude Haiku 4.5. Same scenarios, same judge, same harness. This time I turn on EvalMonkey’s chaos engine and ask a more production‑shaped question.

What “chaos” means in EvalMonkey

Chaos runs reuse the exact same scenario and target endpoint and insert a hostile component in the middle. EvalMonkey calls these chaos profiles, selected via --chaos-profile per run.

For text‑only agents I used two profiles:

clientpromptinjection – adversarial instructions are mixed into the prompt, the kind of “ignore previous instructions and do X” you see as jailbreak attempts.
clientschemamutation – the request payload is mangled: keys moved, extra fields added, types changed, etc.

For hotpotqa I ran both profiles for each agent. For truthfulqa and mmlu I used prompt injection only; schema‑mutation on every scenario would have blown the runtime past what I was willing to babysit. That gives me 5 chaos data points + 3 baseline points per agent.

Chaos scores and production reliability (Haiku 4.5)

Chaos score is the average across those chaos runs. I define drop as:

Drop=Baseline−ChaosDrop=Baseline−Chaos

and production reliability as:

Reliability=0.6⋅Baseline+0.4⋅ChaosReliability=0.6⋅Baseline+0.4⋅Chaos

I weight baseline a bit more because in production both capability and robustness matter, but capability still dominates.

Here is the table for Haiku 4.5:

textAgent	Baseline avg	Chaos avg	Drop (baseline − chaos)	Production reliability
GPT Researcher	62.3	26.8	35.5	48.1
Open Deep Research (LangChain)	48.7	39.5	9.2	45.0
OpenResearcher	50.3	32.8	17.5	43.3
deep‑research (dzhng)	43.7	42.5	1.2	43.2
Goose	32.7	50.3	−17.7	39.7

One explicit example: GPT Researcher’s reliability is

0.6⋅62.3+0.4⋅26.8=48.10.6⋅62.3+0.4⋅26.8=48.1

and you can see how the 35.5‑point drop under chaos pulls it down.

Reliability ranking on Haiku 4.5

If you sort by the reliability metric instead of pure baseline:

textRank	Agent	Production reliability
1	GPT Researcher	48.1
2	Open Deep Research (LangChain)	45.0
3	OpenResearcher	43.3
4	deep‑research (dzhng)	43.2
5	Goose	39.7

Three of these are within three points of each other; small changes in the weighting would shuffle the order.

The important bit is not the exact rank; it is that:

GPT Researcher’s lead shrinks from a 12‑point capability gap to a 3‑point reliability gap.
dzhng’s micro‑agent trails GPT Researcher by only 4.9 points on reliability despite being far simpler.

What to take away from baseline + chaos together

From the first two posts combined I would keep four rules in your head:

Capability and production reliability are different rankings. You need both numbers before you pick an agent.
Smaller agents can hold up better under chaos. Less surface area, fewer moving parts, fewer ways to go wrong.
Most chaos damage originates in the serving layer, not the agent logic. A thin wrapper that does not validate or sanitize inputs makes any agent look fragile.
Style is part of robustness. Terse agents win under schema mutation; structured agents resist prompt injection better than free‑form ones.

In the next post I'll repeat the entire experiment with Claude Sonnet 4.5 as the shared backbone instead of Haiku and compare the deltas.

u/Busy_Weather_7064 — 18 days ago

▲ 1 r/AgentsOfAI

Hey folks, curious to know what framework or harness you are using to make your agents more reliable ? Have you been facing issues in production after local testing of the agent is done ?

reddit.com

u/Busy_Weather_7064 — 22 days ago

▲ 9 r/AIQuality+2 crossposts

Most agent evals test whether an agent can solve the happy-path task.

But in practice, agents usually break somewhere else:

tool returns malformed JSON
API rate limits mid-run
context gets too long
schema changes slightly
retrieval quality drops
prompt injection slips in through context

That gap bothered me, so I built EvalMonkey.

It is an open source local harness for LLM agents that does two things:

Runs your agent on standard benchmarks
Re-runs those same tasks under controlled failure conditions to measure how hard it degrades

So instead of only asking:

"Can this agent solve the task?"

you can also ask:

"What happens when reality gets messy?"

A few examples of what it can test:

malformed tool outputs
missing fields / schema drift
latency and rate limit behavior
prompt injection variants
long-context stress
retrieval corruption / noisy context

The goal is simple: help people measure reliability under stress, not just benchmark performance on clean inputs.

Why I built it:
My own agent used to take 3 attempts to get the accurate answer I'm looking for :/ , or timeout when handling 10 pager long documents.
I also kept seeing agents look good on polished demos and clean evals, then fail for very ordinary reasons in real workflows. I wanted a simple way to reproduce those failure modes locally, without setting up a lot of infra.

It is open source, runs locally, and is meant to be easy to plug into existing agent workflows.

Repo: https://github.com/Corbell-AI/evalmonkey Apache 2.0

Curious what breaks your agent most often in practice:
bad tool outputs, rate limits, long context, retrieval issues, or something else?

u/Busy_Weather_7064 — 22 days ago