r/AIToolsPerformance

eval-harness - agent harness evaluation framework

Hello folks,

I wanted to build out my own personal list of evaluations. Early on in putting this together, I realised I wanted a way to evaluate not just the model but also the agentic harness the model runs within, as more and more of my LLM use is inside a suite of CLI agentic harnesses.

I've listed a multitude of motivations for building this in the video, but the primary ones were all the hype announcements and wanting a way to see for myself what models and their capabilities are like in the actual tools I use.

A paper by Google over on Kaggle recently went as far as to state that the choice of LLM inside an agentic harness perhaps contributes only 10% towards how effective that harness will be for a given task. I am not sure I agree with the figure, but I do agree with the sentiment.

One question I keep asking myself is when I need to switch from the qwen3.6-27b that I run locally on my twin 3090 setup to a cloud model. At the moment I am making this decision on vibes/gut feel. I think that might be okay when I am working closely with the model, but I now use these CLI tools headlessly in quite a few workflows, professionally as well as personally, so I want to make sure I am picking the right combo for the task.
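To make the "right combo for the task" idea concrete, here is a minimal sketch of scoring (harness, model) combinations against a shared list of evals by running each harness headlessly. It is not the eval-harness repo's API: `Eval`, `Combo`, `run_eval`, `score_matrix` and the `STAND_IN` command are all hypothetical names, and the stand-in command simply upper-cases the prompt so the example runs without any real agent CLI installed.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Eval:
    name: str
    prompt: str
    check: Callable[[str], bool]  # receives the harness's stdout


@dataclass
class Combo:
    harness: str        # label only, e.g. the CLI tool's name
    model: str          # label only, e.g. a local or cloud model
    command: List[str]  # argv prefix; the prompt is appended as the last argument


def run_eval(combo: Combo, ev: Eval, timeout_s: float = 60.0) -> bool:
    """Run one eval headlessly; a timeout or non-zero exit counts as a fail."""
    try:
        proc = subprocess.run(
            combo.command + [ev.prompt],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and ev.check(proc.stdout)


def score_matrix(combos: List[Combo], evals: List[Eval]) -> Dict[Tuple[str, str], float]:
    """Fraction of evals passed for each (harness, model) combination."""
    return {
        (c.harness, c.model): sum(run_eval(c, e) for e in evals) / len(evals)
        for c in combos
    }


# Stand-in "harness" (assumption for illustration): upper-cases the prompt.
STAND_IN = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]

evals = [Eval("echo-upper", "hello", lambda out: "HELLO" in out)]
combos = [Combo("stand-in", "none", STAND_IN)]
print(score_matrix(combos, evals))
```

Swapping `STAND_IN` for the real headless invocation of each CLI tool would be the only change needed per combo; the checks stay the same, which is what makes the scores comparable.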

The repo can be found at https://github.com/ScottRBK/eval-harness and includes an explanation of the architecture. I added example evaluations as I built it out, to help me think through the different patterns I have used for evaluations.

The example evaluations are quite easy, as they are about resources contained within the model weights. The idea, though, is that I (and anyone else who might want to fork the repo and curate their own) will build out a private list of evals, held away from public view, that people can use to evaluate existing and new models and harnesses as they are released.

I also spent a good bit of time seeing how well CLI agents themselves can build evaluations, and have put together a list of skills they can use alongside the tool. They do an okay job, but you really need to step through the logic of whatever they produce. They often produce quite brittle evaluations, so getting them to stick to the example patterns already provided helps quite a bit.

An ideal position for me would be to ask the agent to generate an evaluation using the skills right after a session where I found a particular agent struggling to complete a task. There have often been times where I've come across a problem that the agent has struggled to resolve and wished I could make an evaluation out of it there and then, but I'm usually in the middle of something and it ends up as just another item on my ever-growing TODO list.

This is my first time building an evaluation suite, or a framework of this kind for that matter. I have previously used existing frameworks such as deepeval, so I was not totally unfamiliar with the topic, but as with the other motivations already listed, I built this as a learning exercise as well as to get a tool out of it.

If it is useful for you, please get in touch and let me know. Any feedback is also appreciated; as this is my first go at this kind of framework, I expect there is a lot that can be improved and that I have potentially got wrong.

Enjoy the rest of your Sunday, folks.

youtube.com

u/Maasu — 3 hours ago

▲ 174 r/AIToolsPerformance+20 crossposts

I would like to share my latest open source local LLM inference tool, implemented in C#. It supports models like Gemma4 and Qwen3.6, with multi-modal input (image, vision, audio), reasoning, and function (tool) calling. It runs on Windows/macOS/Linux and makes full use of the GPU. The API is fully compatible with the OpenAI and Ollama interfaces.

I'd really appreciate it if you could try it and give me some feedback. If you like it, a star would mean a lot. Thank you very much!

u/fuzhongkai — 1 day ago

▲ 24 r/AIToolsPerformance

Alibaba banning Claude Code over backdoor risks - what's actually in the requests

Saw the Reuters piece this morning: Alibaba is banning Claude Code across its workplaces over what a source calls "alleged backdoor risks." It's the top China tech story on HN today, 289 points and 251 comments in a few hours. The thing is, "backdoor" is doing a lot of work there, and the timing lines up with something more concrete.

A few days earlier the reallo.dev writeup hit 2428 points on HN: Claude Code was found steganographically marking the requests it sends out. Basically embedding hidden identifiers in prompts so generated content can be traced back to it. Not malware, not exfiltration, but it is the model quietly writing into your files in a way you can't see. That post from last Friday is probably what "backdoor" is gesturing at, even if nobody's naming it directly.

There's also the smaller thread from today, "Claude, please stop trying to memorize random crap," where people noticed agents hoarding session transcripts between runs. None of this is a confirmed data leak. But stacked together it's enough for a large employer to pull the plug, and China-based firms have extra reason to be twitchy about anything that looks like telemetry leaving the building.

The open question I keep turning over: is this a Claude-specific problem, or are we going to find the same watermarking in ZCode, Gemini CLI, and the rest once someone actually looks? Anyone here bothered to grep their generated files for the steganography markers?
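For anyone who does want to check their files, here is a rough sketch of one way to look. The post does not say what the markers actually are, so this assumes the simplest text-steganography carrier: invisible Unicode "format" characters (category `Cf`, which covers zero-width spaces and joiners). `find_invisible` and `scan_tree` are hypothetical helper names, and a clean result proves nothing about other encodings such as whitespace patterns or word choice.

```python
import unicodedata
from pathlib import Path
from typing import Dict, List, Tuple

Hit = Tuple[int, int, str]  # (line number, column, code point), both 1-based


def find_invisible(text: str) -> List[Hit]:
    """Report every Unicode format character (category Cf) in the text."""
    hits: List[Hit] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == "Cf":
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits


def scan_tree(root: str, suffixes=(".py", ".md", ".txt")) -> Dict[str, List[Hit]]:
    """Scan text files under root and return only the files with hits."""
    report: Dict[str, List[Hit]] = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            hits = find_invisible(text)
            if hits:
                report[str(path)] = hits
    return report


print(find_invisible("a\u200bb\nclean line"))
```

Expect some legitimate hits: a byte-order mark at the start of a file and zero-width joiners inside emoji are both category `Cf`, so each hit needs a look rather than being treated as a marker.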

reddit.com

u/IulianHI — 3 days ago

▲ 29 r/AIToolsPerformance

Claude Sonnet 5 vs GLM 5.2 - does the price gap still make sense?

Anthropic shipped Claude Sonnet 5 this week and the OpenRouter listing already has it at $2/M input and $10/M output with a 1M context window. GLM 5.2 sits at $0.93/M in and $3/M out on the same listing, also with 1M context. So you're paying roughly double on input and over 3x on output for the Anthropic option.
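The ratios are easy to check, and the blended gap depends on how input-heavy the workload is. A small sketch using the listed prices; the 800k-input / 200k-output job is a made-up workload for illustration, not a measurement.

```python
# USD per million tokens (input, output), as quoted from the OpenRouter listing.
PRICES = {
    "Claude Sonnet 5": (2.00, 10.00),
    "GLM 5.2": (0.93, 3.00),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one job at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


sonnet_in, sonnet_out = PRICES["Claude Sonnet 5"]
glm_in, glm_out = PRICES["GLM 5.2"]
print(f"input ratio:  {sonnet_in / glm_in:.2f}x")    # 2.15x
print(f"output ratio: {sonnet_out / glm_out:.2f}x")  # 3.33x

# Hypothetical agent job: 800k input tokens, 200k output tokens.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 800_000, 200_000):.3f}")
```

On that hypothetical job the totals come to $3.60 versus about $1.34, so the effective gap lands between the input and output ratios.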

What makes this interesting is the Semgrep post that hit HN on Sunday: they ran GLM 5.2 against Claude on their cyber benchmarks and GLM came out ahead. That's one benchmark from one vendor, and their workload is security scanning, so take it with a grain of salt. But the old assumption was you pay the premium because the frontier labs are simply better. That gets harder to defend when a model at a third of the output price wins a public eval.

There's also ZCode now, basically a Claude Code clone from the GLM makers, which showed up on HN this week. So the whole stack is getting cloned, not just the model.

Anyone here actually switched a coding agent from Sonnet to GLM 5.2 and stayed, or did you hit quality issues that don't show up in benchmarks?

reddit.com

u/IulianHI — 4 days ago

▲ 5 r/AIToolsPerformance+1 crossposts

My org is looking to get AI tools org-wide

My org is looking to roll out AI tools org-wide, and I want to contribute a proper analysis of which tool is best in which area, with some case studies. I am primarily looking at two areas: 1. agentic AI IDEs, i.e. Antigravity and Cursor; 2. standalone agents such as Claude, Codex, and open coder. For background, I use all of them except open coder for vibe coding, and all are very satisfactory, but I need to find the best of these, or any others I am missing. BTW, my organization is looking to buy around 50 seats.

reddit.com

u/Old-Mud4628 — 4 days ago

▲ 14 r/AIToolsPerformance+14 crossposts

What if a new AI architecture could train on a laptop with just 4GB of RAM and an old Intel i5 - no GPU, no cloud?

We tested exactly that at Trijna Labs.

We've published a short video showing part of the training process. If you're interested, watch the video and check the YouTube comment link for the unedited raw training footage.

We're looking for honest technical feedback, criticism, and questions from people working on AI systems, model architectures, training infrastructure, or efficient inference.

We're also open to collaborating with researchers, engineers, builders, and anyone interested in helping push the project forward.

youtu.be

u/Different-Turnip3864 — 7 days ago

▲ 1 r/AIToolsPerformance+1 crossposts

Ornith 1.0: Local AI That Beats GPT-5.5 at Coding (Free & Private)

youtu.be

u/RealOppasTV — 8 days ago

▲ 19 r/AIToolsPerformance

Budget VRAM builds - 4x3090 home lab vs reverse-engineered Tesla V100 cards

Two interesting budget hardware paths surfaced recently. One person built a 4x3090 setup with 192GB DDR5 on a budget motherboard, buying the GPUs used from gamers upgrading to 4090 or 5090. The other is a reverse-engineered Tesla V100 - resoldered onto a half-height PCB - going for about $220 for the 16GB version or $590 for 32GB, with a 3-year warranty.

The 4x3090 route gets you roughly 96GB of VRAM through consumer cards that are easier to source in local deals. The V100 route gives you server-class hardware at a lower per-unit cost, though you are trusting a reverse-engineered PCB design.
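To put numbers on the V100 side, here is a quick sketch of cost per GB and how many cards it takes to match the 96GB of the 4x3090 build. It uses only the two V100 prices quoted above; used 3090 prices are not given in the post, so they are left out rather than guessed, and `plan` is just an illustrative helper.

```python
import math
from typing import Tuple

# Quoted prices for the reverse-engineered V100: VRAM in GB -> USD per card.
V100_PRICES = {16: 220, 32: 590}

TARGET_GB = 4 * 24  # the 4x3090 build: four cards at 24GB each = 96GB


def plan(target_gb: int, card_gb: int, card_price: int) -> Tuple[int, int, float]:
    """Return (cards needed, total USD, USD per GB) to reach target_gb."""
    cards = math.ceil(target_gb / card_gb)
    return cards, cards * card_price, card_price / card_gb


for gb, price in V100_PRICES.items():
    cards, total, per_gb = plan(TARGET_GB, gb, price)
    print(f"V100 {gb}GB: {cards} cards, ${total} total, ${per_gb:.2f}/GB")
```

The 16GB card is cheaper per GB ($13.75 vs about $18.44) but needs six slots to reach 96GB, where the 32GB card needs three, so slot count and per-GB price pull in opposite directions.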

For someone building a local LLM rig today, which path makes more sense? Is the 3090 route safer, or does the V100 mod offer better value if you need more VRAM density per slot?

reddit.com

u/IulianHI — 14 days ago

▲ 19 r/AIToolsPerformance

GLM5.2 pushed from 2.5 tok/s to over 50 tok/s on GH200 system

New optimization work shows GLM5.2 going from roughly 2.5 tokens per second to over 50 tokens per second on a GH200 system. The setup involves a hacked server-to-desktop configuration with 2x Hopper H100 GPUs at 96GB HBM3 each and 2x Grace CPUs at 72 cores.

That jump of at least 20x is hard to ignore. Going from barely readable streaming speed to over 50 tok/s means the model goes from a novelty to actually usable for real workflows. The fact that it required specific model hacks to get there suggests GLM5.2 has significant inefficiencies in its default configuration that talented people are finding ways to unlock.

Makes you wonder what the baseline implementation is leaving on the table.

reddit.com

u/IulianHI — 12 days ago

▲ 23 r/AIToolsPerformance

Two paths to VRAM on a budget - reverse-engineered V100s vs new Chinese AI chips

Two interesting budget hardware paths for local AI are emerging. One is the reverse-engineered Tesla V100 v4 - someone spent a year mapping 2,963 pinout signals and resoldered it onto a half-height PCB with full NVLink support up to 8-way. The 16GB version runs about $220, the 32GB about $590, with a 3-year warranty.

The other angle is that 7 Chinese companies are now shipping H100/H200-class AI chips, most having IPO'd in the last 6 months. These are the chips that domestic models like GLM are increasingly being tuned for.

The V100 route gives you proven architecture at a known low cost. The Chinese chips represent newer hardware but with less track record for local builders outside China. For someone building a multi-GPU rig today, which path makes more sense - cheap reverse-engineered server cards or betting on newer Chinese silicon?

reddit.com

u/IulianHI — 13 days ago

▲ 5 r/AIToolsPerformance+3 crossposts

MindTrial: OpenRouter Fusion reduces errors, but doesn’t beat GPT-5.5

I tested OpenRouter Fusion on MindTrial - OpenRouter’s multi-model deliberation feature where an outer model can ask a GPT/Claude/Gemini-style panel for help, then use a judge model to synthesize the result.

I ran two Fusion configurations:

- Default-reasoning Fusion: gpt-latest outer, general-high panel, Claude Opus judge. Result: 87/98, with 10 fails and 1 hard error. Runtime: ~3h54m.
- High/xhigh Fusion: GPT-5.5 high outer, Claude Opus/GPT/Gemini Pro xhigh panel, Claude Opus xhigh judge. Result: 86/98, with 12 fails and 0 hard errors. Runtime: ~10h18m.

For comparison, standalone GPT-5.5 high scored 86/98, with 7 fails and 5 hard errors. Runtime: ~2h18m.
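The three runs are easier to compare side by side. A small sketch that tabulates the figures quoted above, checks that each run accounts for all 98 tasks, and derives pass rate and the split of non-passes; the runtimes are the approximate values from the post converted to minutes, and the `Run` class is only for illustration.

```python
from dataclasses import dataclass

TOTAL_TASKS = 98


@dataclass
class Run:
    name: str
    passed: int
    failed: int
    hard_errors: int
    runtime_min: int  # approximate, converted from the quoted h/m figures

    @property
    def pass_rate(self) -> float:
        return self.passed / TOTAL_TASKS

    @property
    def non_passes(self) -> int:
        return self.failed + self.hard_errors


RUNS = [
    Run("Default-reasoning Fusion", 87, 10, 1, 3 * 60 + 54),
    Run("High/xhigh Fusion", 86, 12, 0, 10 * 60 + 18),
    Run("Standalone GPT-5.5 high", 86, 7, 5, 2 * 60 + 18),
]

for run in RUNS:
    # Every task must land in exactly one bucket.
    assert run.passed + run.failed + run.hard_errors == TOTAL_TASKS
    print(
        f"{run.name:26s} pass={run.pass_rate:.1%} "
        f"fails={run.failed:2d} hard_errors={run.hard_errors} "
        f"runtime={run.runtime_min} min"
    )
```

Laid out like this, the non-pass count barely moves (11, 12, 12) while the hard-error share drops from 5 to 1 or 0, which is the reliability-not-accuracy pattern in numbers.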

Main finding: Fusion helped reliability, but not accuracy. It reduced hard errors, but mostly converted them into ordinary wrong answers rather than extra passes.

The “panel union” hypothesis also did not hold. The high/xhigh Fusion run still missed 7 tasks that at least one standalone panel comparator solved.

The likely bottleneck is invocation policy: Fusion was optional, not forced. Based on the log, it appears to have been called on only a small minority of tasks, often late after many Python attempts and a lot of accumulated context.

Main takeaway: OpenRouter Fusion looks promising as a reliability layer, but in this benchmark it was not an oracle over its panel - and xhigh deliberation made the long-tail visual/spatial tasks much more expensive without improving the aggregate score.

petmal.net

u/Correct_Tomato1871 — 10 days ago

▲ 14 r/AIToolsPerformance

Gemma 4 QAT uncensored vs Heretic abliterated - two paths to unfiltered models

Two different approaches to removing model guardrails are making rounds. Gemma 4 26B-A4B and 31B-QAT "Uncensored Balanced" just dropped with MTP, delivering 35% and 53% speed boosts respectively. These are built as balanced uncensored quants from the ground up.

Meanwhile, the Swiss Federal Supreme Court is evaluating Heretic - an abliterated model project - for their own institutional use. Not banning it, actually considering deploying it.

The contrast is interesting. The Gemma releases are performance-tuned uncensored builds aimed at local users wanting speed alongside fewer restrictions. Heretic is getting looked at by a national government court system, which suggests some institutions want models without the typical refusal behavior for legitimate work.

For local setups, does the balanced QAT approach give better day-to-day usability than abliterated models, or do they serve completely different needs?

reddit.com

u/IulianHI — 11 days ago

▲ 0 r/AIToolsPerformance

Tested gemma4:e4b vs qwen3.6:35b on agentic coding tasks using QuantaMind — pretty eye-opening gap

I've been using QuantaMind (free desktop app) to properly benchmark local models before trusting them in any agent workflow. Ran the built-in Coding eval suite today on two Ollama models and the results were more dramatic than I expected.

**What I ran:**
- Eval: Built-in Coding collection (Easy tier, 5 agentic tasks)
- K = 5 (each task run 5 times to test consistency, not just one lucky pass)
- Backend: Ollama on my local machine (64 GB RAM)
- Decoys: off

**Results:**

| Model | Pass^(K) | Avg Steps | Effort | Top Error |
|---|---|---|---|---|
| gemma4:e4b (Q4_K_M) | 3/5 | 1.8 | 257 tok | FAKE DONE |
| qwen3.6:35b (Q4_K_M) | 5/5 | 1.8 | 447 tok | NONE |
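For readers unfamiliar with the Pass^(K) column, here is a sketch of how such a score can be computed, assuming it counts a task as passed only when all K runs pass (the post describes K=5 as a consistency test but does not spell out the formula). This is not QuantaMind's code, and the per-run outcomes below are made up for illustration.

```python
from typing import Dict, List, Tuple


def pass_hat_k(runs_by_task: Dict[str, List[bool]]) -> Tuple[int, int]:
    """Strict consistency: a task counts only if every one of its runs passed."""
    passed = sum(1 for runs in runs_by_task.values() if all(runs))
    return passed, len(runs_by_task)


def pass_any(runs_by_task: Dict[str, List[bool]]) -> Tuple[int, int]:
    """Lenient contrast: a task counts if at least one run passed."""
    passed = sum(1 for runs in runs_by_task.values() if any(runs))
    return passed, len(runs_by_task)


# Hypothetical outcomes: 5 tasks, K = 5 runs each (not data from the post).
example = {
    "task_a": [True, True, True, True, True],
    "task_b": [True, True, True, True, True],
    "task_c": [True, True, True, True, True],
    "task_d": [True, True, False, True, True],
    "task_e": [False, True, True, True, True],
}

print("pass^k:  %d/%d" % pass_hat_k(example))
print("pass@any: %d/%d" % pass_any(example))
```

In this made-up case a single flaky run on two tasks drops the strict score to 3/5 even though every task succeeded at least once, which is why the strict metric is the more useful one for unattended agent workflows.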

https://preview.redd.it/rbxpny8vzu8h1.png?width=3456&format=png&auto=webp&s=6e4eb6c43d15f2bd34f654fe29e2bfe40b68a91c

https://preview.redd.it/9fu8fx8vzu8h1.png?width=3456&format=png&auto=webp&s=01af6d29cfceffe219edc6add36e555cd27a548d

**The thing that stood out:** both models took the same average number of steps (1.8). The difference wasn't how many actions they took — it was whether those actions were actually correct.

The "FAKE DONE" error on gemma is particularly bad for agentic use. It means the model claimed to complete the task before it actually ran all the required steps. That's a silent failure — in a real pipeline it would just look like a successful run.
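As a rough illustration of how a harness could catch this, here is a sketch that compares the steps an agent actually executed against the steps a task requires whenever the agent claims completion. It is an assumption about the general approach, not QuantaMind's anti-cheat logic; `Trace`, `classify` and the step names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trace:
    claimed_done: bool
    executed_steps: List[str] = field(default_factory=list)


def classify(trace: Trace, required_steps: List[str]) -> str:
    """Label a run by checking the completion claim against what actually ran."""
    missing = [s for s in required_steps if s not in trace.executed_steps]
    if trace.claimed_done and missing:
        return "FAKE_DONE"   # claims success with required steps never run
    if trace.claimed_done:
        return "PASS"
    return "INCOMPLETE"      # honest failure: no completion claim was made


required = ["write_file", "run_tests"]

honest = Trace(claimed_done=True, executed_steps=["write_file", "run_tests"])
faker = Trace(claimed_done=True, executed_steps=["write_file"])
gave_up = Trace(claimed_done=False, executed_steps=["write_file"])

for name, trace in [("honest", honest), ("faker", faker), ("gave_up", gave_up)]:
    print(name, classify(trace, required))
```

The useful distinction is between the last two cases: both did the same work, but only the one that claimed completion would slip through a pipeline that trusts the agent's own status.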

qwen3.6:35b passed everything cleanly, but used significantly more tokens (~74% more). So there's a real cost trade-off depending on your setup.

**Tool I used:** quantamind.co — free macOS app, runs fully local, no data leaves your machine. Has a built-in agentic sandbox with anti-cheat rules (it specifically catches the fake-done pattern). Also has a context cliff probe and readiness verdicts if you want to go deeper.

Curious if others have seen the fake-done failure on smaller models, or if there are other 4B-class models worth testing against qwen3.6:35b on coding tasks.

reddit.com

u/Striking_Difficulty1 — 12 days ago

▲ 19 r/AIToolsPerformance+7 crossposts

We benchmarked our proprietary AI topologies against frontier models on LiveBench, GSM8K, and HumanEval. Results are public. Looking for feedback.

Hey everyone,

We're Trijna Labs, an AI research lab out of India building novel neural architectures from scratch (not fine-tunes, not wrappers).

We just published our baseline benchmark results comparing our proprietary topologies (ARS and OSM) against state-of-the-art closed-weight models on three standard evaluations:

• LiveBench — un-gameable, dynamic general reasoning (prevents memorization)

• GSM8K — multi-step mathematical reasoning (5-shot exact match, greedy decoding)

• HumanEval — code generation (0-shot pass@1)
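For context on the HumanEval metric, here is a sketch of the standard unbiased pass@k estimator commonly used with that benchmark: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). This illustrates the metric only and is not the lm_eval implementation or the code used for the results above. With greedy decoding there is one sample per problem, so pass@1 reduces to the fraction of problems solved.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any draw includes a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1_greedy(solved: List[bool]) -> float:
    """Greedy decoding: n = 1 per problem, so pass@1 is the mean solve rate."""
    return sum(pass_at_k(1, int(ok), 1) for ok in solved) / len(solved)


print(pass_at_k(n=10, c=3, k=1))
print(benchmark_pass_at_1_greedy([True, False, True, True]))
```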

Key details:

— All evaluations run using EleutherAI's lm_eval harness (v0.4.12)

— Greedy decoding, no sampling tricks

— ARS topology set a new high-water mark on LiveBench

— OSM topology achieved matrix stabilization

We're not claiming we beat GPT-5 or anything sensational. We're showing transparent, reproducible results from architectures we built ground-up. The raw evaluation logs are on our site.

Full results and methodology: https://trijnalabs.tech/news

Would love to hear what you all think. If you're a developer and this kind of work interests you, feel free to DM us — we're always open to collaborators.

More on what we're building: https://trijnalabs.tech

u/Different-Turnip3864 — 14 days ago
