
What the hell Antigravity ? Now I won't even get free 1000 credits ?
I used to rely on those 1000 Antigravity credits a lot, and now they're removing it !!! I'm AI pro plan user and it hurts a lot.

I used to rely on those 1000 Antigravity credits a lot, and now they're removing it !!! I'm AI pro plan user and it hurts a lot.
If you look at most agent product pitches today, the story goes like this:
What you rarely see is:
Frameworks are useful. I am not anti‑framework. LangGraph, CrewAI, AutoGen, Goose and friends have moved the whole field forward.
They just do not solve the reliability problem for you.
Most frameworks give you structure: nodes, edges, tools, retry handlers, event streams.
It feels like the agent is well behaved because it is now drawn as a graph.
In practice, the same problems keep showing up:
You can have a beautifully structured graph that fails in exactly the same ways as a weekend script.
An evaluation pipeline, done right, is much less glamorous than an agent framework.
It does things like:
EvalMonkey is my attempt to make that boring work easier for agent teams.
It does not care whether you built your agent with LangGraph, Goose, a custom orchestrator, or a single giant function. As long as you can expose a simple HTTP endpoint, you can benchmark it.
In our 10 agent benchmark, we deliberately picked a mix:
The frameworks gave us better ergonomics and nicer diagrams.
The evaluation harness gave us insight into how they behave under stress.
The teams that benefit most from EvalMonkey are not the ones with the fanciest agent stack. It is the ones who are honest enough to admit that their agents see the same boring failures as everyone else.
If you built on top of a framework, you are not starting from scratch. You probably already have:
You can layer EvalMonkey on top without throwing anything away:
The value is not in having evaluations. It is in having evaluations that are tied to real workflows and real failure modes.
If you are proud of your agent stack, that is great. The next step is to be proud of your evaluation stack.
If you like the idea of frameworks and evaluations being treated as peers, not substitutes, star the repo and show it to the person on your team who is always debugging the weird edge cases.
There is a specific kind of frustration that only AI builders know.
You open your favorite “research agent” and ask it a question.
You refine the question.
You repeat it, slightly different.
On the third try, it finally gives you something usable.
Nothing crashed. No stack trace. No alert. Just quiet, inconsistent behavior that feels like gaslighting. Yesterday it answered that class of question on the first attempt. Today it needs three tries.
Now imagine being the customer on the other side of this.
You are not thinking about tool calls or token windows. You are just thinking “this thing does not listen” and “I cannot trust this for anything important.”
Most agent teams I talk to have logs. They have Langfuse or an equivalent. They can replay traces and see what went wrong. Some even have a wall of dashboards.
What they usually do not have is a standard, repeatable answer to:
We shipped EvalMonkey because I was tired of hearing myself say the same sentence in my head: “I know this agent is flaky, but I cannot prove it in a way that survives a product meeting.”
With EvalMonkey we benchmarked 10 open source agents that people actually use. Things like GPT Researcher, Open Deep Research, OpenResearcher, deep‑research, OnCell Support Agent, Local Docs AI Agent, Index, browser_agent, the Browser‑Use Couchbase demo and Goose.
For each of them we:
We did not try to “break them” with pathological prompts. We just modeled the boring, ugly failures that show up in real traces.
Results were exactly what you would expect if you have ever tried to use these systems under pressure:
None of this shows up in a nice way if your only instrument is “we tried it a few times and it seemed fine.”
The thing that pushed me over the edge was not a benchmark. It was an app builder.
You know the pattern. You describe an app. The tool says it will code it, run it, and tell you when it is done.
In my case, it happily declared “App building is finished” and showed a green checkmark. There was only one small bug.
The app did not run.
No health check. No smoke test. No “I tried to start the server and it failed.” Just a success message over a broken experience. That is not an LLM problem. That is a reliability problem.
Same story with in‑app chat builders. I have had agents get stuck mid conversation, clearly in some internal loop, while the UI just spins. No error surfaced, no graceful fallback, no evaluators catching the regression.
At some point you realise this is not “AI being AI.” It is just the absence of good evaluation.
EvalMonkey is basically a harness for putting agents through standard failure modes, over and over again, until you have numbers instead of vibes.
You define:
You get back:
There is nothing magical about that. It is just what we should have had from day one.
Most teams I talk to are past the “cool demo” phase. They are in the stage where a VP of Support or CTO quietly asks “Can this thing handle real tickets without embarrassing us.”
If your answer is:
you already know that is not going to scale.
If your answer is:
that is a very different conversation.
If any of this sounds familiar, take a look at the EvalMonkey repo:
https://github.com/Corbell-AI/evalmonkey
Clone it, point it at your agent, and see what happens when you turn chaos on. If you want to go deeper, I am happy to share the raw logs for our OSS agent benchmarks as a zip for anyone who really wants to dig into failure patterns.
If the project resonates, star the repo so more teams see it and we can raise the bar for what “production ready agent” actually means.
In the first post (link in comment) I ranked five popular research agents on pure capability with Claude Haiku 4.5. Same scenarios, same judge, same harness. This time I turn on EvalMonkey’s chaos engine and ask a more production‑shaped question.
Chaos runs reuse the exact same scenario and target endpoint and insert a hostile component in the middle. EvalMonkey calls these chaos profiles, selected via --chaos-profile per run.
For text‑only agents I used two profiles:
clientpromptinjection – adversarial instructions are mixed into the prompt, the kind of “ignore previous instructions and do X” you see as jailbreak attempts.clientschemamutation – the request payload is mangled: keys moved, extra fields added, types changed, etc.For hotpotqa I ran both profiles for each agent. For truthfulqa and mmlu I used prompt injection only; schema‑mutation on every scenario would have blown the runtime past what I was willing to babysit. That gives me 5 chaos data points + 3 baseline points per agent.
Chaos score is the average across those chaos runs. I define drop as:
Drop=Baseline−ChaosDrop=Baseline−Chaos
and production reliability as:
Reliability=0.6⋅Baseline+0.4⋅ChaosReliability=0.6⋅Baseline+0.4⋅Chaos
I weight baseline a bit more because in production both capability and robustness matter, but capability still dominates.
Here is the table for Haiku 4.5:
| textAgent | Baseline avg | Chaos avg | Drop (baseline − chaos) | Production reliability |
|---|---|---|---|---|
| GPT Researcher | 62.3 | 26.8 | 35.5 | 48.1 |
| Open Deep Research (LangChain) | 48.7 | 39.5 | 9.2 | 45.0 |
| OpenResearcher | 50.3 | 32.8 | 17.5 | 43.3 |
| deep‑research (dzhng) | 43.7 | 42.5 | 1.2 | 43.2 |
| Goose | 32.7 | 50.3 | −17.7 | 39.7 |
One explicit example: GPT Researcher’s reliability is
0.6⋅62.3+0.4⋅26.8=48.10.6⋅62.3+0.4⋅26.8=48.1
and you can see how the 35.5‑point drop under chaos pulls it down.
If you sort by the reliability metric instead of pure baseline:
| textRank | Agent | Production reliability |
|---|---|---|
| 1 | GPT Researcher | 48.1 |
| 2 | Open Deep Research (LangChain) | 45.0 |
| 3 | OpenResearcher | 43.3 |
| 4 | deep‑research (dzhng) | 43.2 |
| 5 | Goose | 39.7 |
Three of these are within three points of each other; small changes in the weighting would shuffle the order.
The important bit is not the exact rank; it is that:
From the first two posts combined I would keep four rules in your head:
In the next post I'll repeat the entire experiment with Claude Sonnet 4.5 as the shared backbone instead of Haiku and compare the deltas.
Hey folks, curious to know what framework or harness you are using to make your agents more reliable ? Have you been facing issues in production after local testing of the agent is done ?
Most agent evals test whether an agent can solve the happy-path task.
But in practice, agents usually break somewhere else:
That gap bothered me, so I built EvalMonkey.
It is an open source local harness for LLM agents that does two things:
So instead of only asking:
"Can this agent solve the task?"
you can also ask:
"What happens when reality gets messy?"
A few examples of what it can test:
The goal is simple: help people measure reliability under stress, not just benchmark performance on clean inputs.
Why I built it:
My own agent used to take 3 attempts to get the accurate answer I'm looking for :/ , or timeout when handling 10 pager long documents.
I also kept seeing agents look good on polished demos and clean evals, then fail for very ordinary reasons in real workflows. I wanted a simple way to reproduce those failure modes locally, without setting up a lot of infra.
It is open source, runs locally, and is meant to be easy to plug into existing agent workflows.
Repo: https://github.com/Corbell-AI/evalmonkey Apache 2.0
Curious what breaks your agent most often in practice:
bad tool outputs, rate limits, long context, retrieval issues, or something else?