u/rohansrma1

▲ 43 r/ClaudeAI

Found this today. How correct is it? I do use 4.8 over 5.5 because it feels better

u/rohansrma1 — 8 days ago

▲ 0 r/LocalLLM+1 crossposts

Where Does an Agent Actually Start? Testing NVIDIA Nemotron's "Capability Floor"

NVIDIA recently released the open-weight Nemotron family, and we wanted to see how the different sizes perform on real agentic coding workflows instead of traditional benchmarks. For context, I work at Tessl.

We evaluated the models using around 1,000 real-world coding agent tasks derived from nearly 500 published skills, with every model running through the same agent framework and evaluation pipeline.

One pattern showed up very clearly.

The jump from Nano 30B to Super 120B wasn't just a higher benchmark score. It looked like crossing a capability threshold.

Nano 30B is a genuinely useful model for focused tasks like API integrations, documentation lookups, and smaller code changes. But once tasks became longer and required planning across multiple steps, reliability dropped off quickly.

Super 120B was the first size that consistently handled those longer agent loops while also benefiting much more from skills. In other words, once the model had enough capability, additional guidance actually translated into better execution instead of just longer runs.

We ended up describing this as an agent capability floor. Below a certain level, you don't simply get a weaker agent. You get a model that struggles to complete the act, observe, and decide loop that agentic workflows depend on.

One other takeaway was around cost. Nano is roughly half the inference cost per task, but its much higher failure rate means retries become part of the equation. Looking only at token cost can hide the real cost of getting a usable result.
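To make the retry point concrete, here is a rough sketch of cost per usable result. It assumes retries are independent and every attempt succeeds with the same probability, so the expected number of attempts is 1 / success rate. The success rates below are made-up placeholders, not numbers from our evaluation; only the "roughly half the cost" ratio comes from the post.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one usable result when you retry until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Relative cost per attempt: Nano is roughly half of Super (from the post).
NANO_COST, SUPER_COST = 0.5, 1.0

# HYPOTHETICAL success rates, for illustration only.
nano_rate, super_rate = 0.40, 0.90

nano_effective = cost_per_success(NANO_COST, nano_rate)
super_effective = cost_per_success(SUPER_COST, super_rate)

# Success rate Nano would need to match Super's cost per usable result.
break_even_rate = super_rate * NANO_COST / SUPER_COST

print(f"Nano:  {nano_effective:.2f} per usable result")
print(f"Super: {super_effective:.2f} per usable result")
print(f"Nano breaks even at a success rate of {break_even_rate:.2f}")
```

Under these assumptions, the half-price model only comes out ahead if its success rate is more than half of the larger model's.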

Full write-up: https://tessl.io/blog/how-small-can-an-agent-model-get-the-nemotron-floor

u/rohansrma1 — 7 days ago

▲ 135 r/PlacementsPrep

What do you think?

u/rohansrma1 — 10 days ago

▲ 4 r/PlacementsPrep

Don't lose hope. You are just one step away!

u/rohansrma1 — 14 days ago

▲ 64 r/ZaiGLM

GLM 5.2 is now a better model than Sonnet 4.6 on coding tasks!!!

We've been evaluating coding-agent workflows at Tessl and recently ran GLM 5.2, MiniMax M3, Kimi K2.7-code, Qwen 3.7-Plus and Sonnet 4.6 across nearly 1,000 coding-agent scenarios.

The tasks came from skills in the Tessl Registry and each scenario was evaluated both with and without the relevant skill loaded. The evaluation dataset is public through the task-evals-for-skills dataset on Hugging Face.

For transparency, I work at Tessl, and we ran the benchmark.

Model	Overall	Instruction Following	Task Completion	Cost / Task
GLM 5.2	91.9	87.4	97.8	$0.289
MiniMax M3	91.4	87.2	97.0	$0.207
Sonnet 4.6	90.8	86.1	97.1	$0.296
Kimi K2.7-code	88.7	82.5	96.9	$0.661
Qwen 3.7-Plus	82.2	77.2	88.9	$0.068

GLM ended up with the highest overall score in the benchmark while also costing slightly less per task than Sonnet.

One thing I found interesting is that Sonnet actually won more individual scenarios, but GLM had fewer low-scoring runs. The average ended up favoring GLM because it was more consistent across the full set of tasks.
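If it sounds odd that a model can win more scenarios and still lose on average, here is a toy illustration. The scores are invented purely to show the effect and are not taken from the benchmark.

```python
from statistics import mean

# HYPOTHETICAL per-scenario scores (0-100), for illustration only.
model_a = [95, 96, 94, 97, 40]  # wins most head-to-heads, but has one bad run
model_b = [93, 94, 92, 95, 90]  # slightly lower each time, no bad runs

# Count head-to-head wins scenario by scenario.
a_wins = sum(a > b for a, b in zip(model_a, model_b))
b_wins = sum(b > a for a, b in zip(model_a, model_b))

print(f"A wins {a_wins} scenarios, B wins {b_wins}")
print(f"A mean: {mean(model_a):.1f}, B mean: {mean(model_b):.1f}")
```

A single low-scoring run drags the average down more than several narrow wins pull it up, which is the kind of pattern described above.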

We will now be running additional evaluations (time to bring in Opus) because these results were much stronger than we expected going in.

Read full benchmark here: https://tessl.io/blog/open-source-coding-agents-one-ties-sonnet-one-wont-listen/

For people using GLM daily, where do you think it still trails Sonnet or Opus today? Most of the criticism I've seen is around code quality, bloating, and maintainability rather than task completion itself.

u/rohansrma1 — 14 days ago

▲ 92 r/ClaudeCode

Open models are making Sonnet comparisons a lot less ridiculous.

We benchmarked GLM 5.2, MiniMax M3, Kimi K2.7-code, Qwen 3.7-Plus and Sonnet 4.6 across nearly 1,000 coding-agent scenarios.

The tasks came from skills in the Tessl Registry and were evaluated both with and without the relevant skill loaded. The dataset is public and available through the task-evals-for-skills project on Hugging Face.

We ran this internally at Tessl while evaluating skills and agent workflows.

Model	Overall	Instruction Following	Task Completion	Cost / Task
GLM 5.2	91.9	87.4	97.8	$0.289
MiniMax M3	91.4	87.2	97.0	$0.207
Sonnet 4.6	90.8	86.1	97.1	$0.296
Kimi K2.7-code	88.7	82.5	96.9	$0.661
Qwen 3.7-Plus	82.2	77.2	88.9	$0.068

The gap at the top ended up being much smaller than I expected. GLM 5.2 finished ahead of Sonnet overall while costing slightly less per task, and MiniMax M3 wasn't far behind either. (Yes, I know MiniMax hangs and isn't something you'd want to see near the top of the list, but this is what we got.)

Given how close GLM landed to the top of the table, we will be running a separate evaluation against Opus as well. I don't know where it'll end up yet, but after seeing these results, it felt worth testing rather than assuming the gap is still there. :)

After digging through the runs, it's clear that the models weren't separated much by task completion. The larger differences showed up in instruction following. Most of them could get the work done. The question was whether they did it the way they were asked.

For more context and clarity: It's a benchmark of coding-agent tasks with and without the relevant skill/context loaded. No criticism. Only love. 🙂

Read full benchmark here: https://tessl.io/blog/open-source-coding-agents-one-ties-sonnet-one-wont-listen/

u/rohansrma1 — 15 days ago

▲ 77 r/OpenSourceAI

Open-source models outperformed Sonnet 4.6 on coding tasks!!

We recently benchmarked GLM 5.2, MiniMax M3, Kimi K2.7-code, Qwen 3.7-Plus and Sonnet 4.6 across nearly 1,000 coding-agent scenarios.

The scenarios were run twice: once normally and once with the relevant skill loaded. The skills came from the Tessl Registry, and the tasks/evals are publicly available in the task-evals-for-skills dataset on Hugging Face.

For context, I work at Tessl and we're the ones who ran the benchmark.

Model	Overall	Instruction Following	Task Completion	Cost / Task
GLM 5.2	91.9	87.4	97.8	$0.289
MiniMax M3	91.4	87.2	97.0	$0.207
Sonnet 4.6	90.8	86.1	97.1	$0.296
Kimi K2.7-code	88.7	82.5	96.9	$0.661
Qwen 3.7-Plus	82.2	77.2	88.9	$0.068

The part that surprised me wasn't that an open model got close.

It was that GLM 5.2 finished ahead of Sonnet while costing slightly less per task, and that MiniMax M3 also finished ahead of Sonnet while costing about 30% less.

Sonnet still performed extremely well. The top three models are separated by just 1.1 points overall, and Sonnet had the largest improvement when skills were added.

What stood out from the results is how different the conversation feels compared to a year ago. The question used to be whether open models could compete with frontier models on coding workloads at all.

Now the discussion is mostly about cost, consistency, and instruction-following because the performance gap at the top is becoming very small.

Read full benchmark here: https://tessl.io/blog/open-source-coding-agents-one-ties-sonnet-one-wont-listen/

u/rohansrma1 — 16 days ago

▲ 12 r/AIDiscussion+1 crossposts

GLM 5.2 and MiniMax M3 are a lot closer to (and in places better than) Sonnet 4.6 than I expected on coding-agent workloads

We benchmarked GLM 5.2, MiniMax M3, Kimi K2.7-code, Qwen 3.7-Plus and Sonnet 4.6 across nearly 1,000 coding-agent scenarios.

The scenarios were run twice. Once normally and once with the relevant skill loaded. The skills came from the Tessl Registry, and the tasks/evals are publicly available in the task-evals-for-skills dataset on Hugging Face for anyone who wants to inspect them.

Worth mentioning that I work at Tessl since we're the ones who ran the benchmark.

Model	Overall	Instruction Following	Task Completion	Skill Lift	Cost / Task
GLM 5.2	91.9	87.4	97.8	+20.2	$0.289
MiniMax M3	91.4	87.2	97.0	+20.9	$0.207
Sonnet 4.6	90.8	86.1	97.1	+24.4	$0.296
Kimi K2.7-code	88.7	82.5	96.9	+19.5	$0.661
Qwen 3.7-Plus	82.2	77.2	88.9	+19.5	$0.068

The gap at the top ended up being much smaller than I expected. GLM 5.2 finished slightly ahead of Sonnet in overall score while costing slightly less per task. MiniMax M3 finished 0.6 points ahead of Sonnet and was around 30% cheaper.

One thing that probably gets lost in model-vs-model discussions is the effect of context. Every model gained roughly 20 points when the relevant skill was provided. Sonnet actually saw the largest improvement in the group (+24.4).

The result I keep coming back to isn't that an open model edged out Sonnet on this benchmark. It's that the same skills improved every model by roughly the same amount.

Read full benchmark here: https://tessl.io/blog/open-source-coding-agents-one-ties-sonnet-one-wont-listen/

u/rohansrma1 — 16 days ago

▲ 280 r/AIDiscussion+4 crossposts

You know Gemini 3.1 Pro is actually cheaper than Gemini 3.5 Flash?

We recently benchmarked four Gemini models across ~3,300 coding-agent runs and found a surprising result.

For context, we're the team behind the Tessl Registry (https://tessl.io/registry), so the usual vendor-disclosure caveat applies: I work for Tessl.

Across the tasks we measured:

Gemini 3.1 Pro: 87.9 score @ $0.66/task
Gemini 3.5 Flash: 88.6 score @ $1.05/task

That's a 0.7-point difference in score for roughly 59% higher cost per task.
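For anyone who wants to check the arithmetic, here is a small sketch using only the two scores and per-task costs listed above. The helper name `premium_pct` is just something made up for this example.

```python
def premium_pct(expensive: float, cheap: float) -> float:
    """How much more the expensive option costs, as a percentage of the cheap one."""
    return (expensive / cheap - 1) * 100


# Figures from the post.
pro = {"score": 87.9, "cost": 0.66}
flash = {"score": 88.6, "cost": 1.05}

score_gap = round(flash["score"] - pro["score"], 1)
cost_premium = round(premium_pct(flash["cost"], pro["cost"]))

print(f"Score gap: {score_gap} points")
print(f"Cost premium: {cost_premium}%")
```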

The part we didn't expect is that Gemini 3.1 Pro's published input-token pricing is actually higher than Gemini 3.5 Flash's.

And the agent logs explain it.

- Gemini 3.1 Pro averaged 26 turns and ~650k input tokens per task.

- Gemini 3.5 Flash averaged 39 turns and ~1.4M input tokens per task.

In other words, the cheaper token price was overwhelmed by the amount of context the model chose to process while solving the task.
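A quick sketch of why token volume can swamp list price. The token counts are the averages quoted above. The per-million prices are placeholders chosen for illustration and are NOT Google's actual pricing, and the sketch ignores output tokens and caching discounts.

```python
# Average input tokens per task (from the post).
PRO_INPUT_TOKENS = 650_000
FLASH_INPUT_TOKENS = 1_400_000


def input_cost(tokens: int, price_per_million: float) -> float:
    """Input-token cost for one task at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


# Flash processed about this many times more input per task.
token_ratio = FLASH_INPUT_TOKENS / PRO_INPUT_TOKENS

# Flash's input price would have to be below this fraction of Pro's
# for its input cost per task to come out lower.
break_even_price_ratio = PRO_INPUT_TOKENS / FLASH_INPUT_TOKENS

# HYPOTHETICAL prices per million input tokens, for illustration only.
pro_price, flash_price = 2.00, 1.00

pro_cost = input_cost(PRO_INPUT_TOKENS, pro_price)
flash_cost = input_cost(FLASH_INPUT_TOKENS, flash_price)

print(f"Token ratio: {token_ratio:.2f}x")
print(f"Break-even price ratio: {break_even_price_ratio:.2f}")
print(f"Pro: ${pro_cost:.2f}, Flash: ${flash_cost:.2f} (placeholder prices)")
```

Even with a placeholder price that is half of Pro's, Flash still costs more on input in this sketch because it reads a bit over twice as many tokens.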

Another interesting result:
when we added relevant skills from the registry, Gemini 3.1 Pro's cost dropped by ~23% while its score increased substantially. The Flash models saw much smaller gains and little to no cost reduction.

The takeaway wasn't which model won.

It was that the actual cost ranking looked very different from what you'd predict by reading Google's pricing page. Turn count and token consumption ended up mattering more than list price.

Benchmark details, methodology, token breakdowns, and raw cost calculations are here: https://tessl.io/blog/why-your-gemini-bill-doesnt-match-the-model-names/

Interested to see whether others have observed the same pattern.

u/rohansrma1 — 23 days ago

▲ 128 r/AIDiscussion+3 crossposts

tested Claude Fable 5 and Opus 4.8 across 917 coding-agent scenarios. Fable won by 0.9 points.

We compared Claude Fable 5 and Opus 4.8 across 917 shared coding-agent scenarios to see what the first public Mythos-class model actually looks like on day-to-day agent workloads. Btw, small disclosure, I work at Tessl.

The tasks came from skills in https://tessl.io/registry and were evaluated both with and without the relevant skill loaded. We scored them using our task eval framework so we could measure the impact of both the model and the skill independently.

The headline result:
- Fable 5: 92.9 overall score at ~$1.25/task
- Opus 4.8: 92.0 overall score at ~$0.74/task

That works out to roughly a 69% premium (going by the rounded per-task costs above) for a 0.9-point gain on the tasks we measured.

Fable 5 refused 26 tasks that Opus completed successfully. Some were security-review tasks. Others were routine bioinformatics workflows - you can find the full list of skills here, and the evaluation approach here. Anthropic has already acknowledged that parts of the initial rollout were overly conservative, so it'll be interesting to see how this evolves.

The practical question I have is whether people are actually seeing enough additional capability to justify the extra cost.

Has anyone here switched their Claude Code workflow from Opus to Fable yet? If so, what kinds of tasks made the upgrade worth it?

Full benchmark, methodology, and all findings: https://tessl.io/blog/claude-fable-5-vs-opus-48-the-mythos-hype-meets-reality/

u/rohansrma1 — 21 days ago

▲ 0 r/codereview

frustrated with AI code reviewers? check this out

so i was deep into coding last week and ran into this moment where an AI pull request reviewer kept misclassifying some changes as security risks. it was super frustrating to see the potential it had but also the gaps in its understanding. then i found this article: https://tessl.io/blog/i-spent-a-week-fixing-the-wrong-skill-and-other-lessons-from-evaluating-an-ai-pr-reviewer/

Baruch shares how he tweaked an AI reviewer to boost its accuracy from around 70% to 97% just by refining the scoring criteria and getting more specific about the types of vulnerabilities it recognized. it got me thinking about how crucial it is to understand the domain knowledge behind these tools. also, the lesson about building developer trust is so important. it really made me reconsider how I evaluate the AI tools in my workflow and what adjustments I can make to improve their performance.

u/rohansrma1 — 1 month ago

▲ 132 r/PlacementsPrep

You have to accept me! Time to Reject Rejection. 🙂🙈

u/rohansrma1 — 2 months ago

▲ 1 r/AIDiscussion+2 crossposts

GPT-5.5 is OpenAI’s best model. But your benchmark might be measuring the judge too.

posted part 1 last week about model costs/scores and a bunch of people pushed back (fairly) that one benchmark shouldn’t be treated like a universal truth. totally agree with that btw.

but while going through the follow-up analysis from Tessl, i think the more interesting finding actually ended up being the judges themselves:
https://tessl.io/blog/your-benchmarks-are-lying-to-you-and-your-judge-is-to-blame/

same 6 models. same 11 engineering skills. same outputs.

only the judge changed. and the rankings moved around way more than i expected.

opus-4-7 stays #1 regardless of judge:
94.5 under Sonnet
89.2 under GPT-5.5
96.5 under Opus

so the top-end signal seems pretty real. but below that, things get messy fast.

gpt-5.3 goes from rank #3 under Sonnet to rank #5 under GPT-5.5. 91.9 vs 75.7 on the exact same outputs. that’s a 16 point swing caused purely by swapping the evaluator. one individual skill apparently shifted by 47 points depending on which judge graded it.

that’s the part that stood out to me most because it explains a lot of the reactions on the last post too.

some people in the comments were saying:
“there’s no way composer-2 is that good”
others were saying:
“opus 4.7 is miles ahead in practice”
others focused entirely on cost/performance

and honestly all of them can kinda be “right” depending on what the evaluator rewards.

Judge	Avg without-skill	Avg with-skill	Avg lift
Sonnet	76.1	90.3	+14.2
Opus-4-7	72.6	88.3	+15.7
GPT-5.5	70.7	83.4	+12.7

Sonnet was consistently the most lenient. GPT-5.5 was the strictest.
Almost a 7 point average gap between judges grading the same work.

the self-judge stuff is interesting too:
- Opus grading itself gets a +4.6 boost vs cross-judge average
- GPT-5.5 grading itself actually scores lower than the other judges gave it

so yeah, maybe i’m biased because i work at Tessl, but i think the takeaway here is less: “this model wins”

and more: “single-judge evals are probably noisier than most people think”

especially once the model gaps get small.
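if you want to sanity check the judge numbers yourself, here's a minimal sketch that only uses figures quoted in this post. the dictionary layout and variable names are my own, not anything from the Tessl eval framework.

```python
from statistics import mean

# Per-judge averages from the table above.
judges = {
    "Sonnet": {"without": 76.1, "with": 90.3},
    "Opus-4-7": {"without": 72.6, "with": 88.3},
    "GPT-5.5": {"without": 70.7, "with": 83.4},
}

# Lift each judge reports when the skill is added.
lift = {name: round(s["with"] - s["without"], 1) for name, s in judges.items()}

# Gap between the most lenient and strictest judge on with-skill scores.
with_scores = [s["with"] for s in judges.values()]
judge_gap = round(max(with_scores) - min(with_scores), 1)

# opus-4-7 scored by each judge (same outputs, different evaluator).
opus_by_judge = {"Sonnet": 94.5, "GPT-5.5": 89.2, "Opus": 96.5}
opus_spread = round(max(opus_by_judge.values()) - min(opus_by_judge.values()), 1)
opus_mean = round(mean(opus_by_judge.values()), 1)

print(lift)
print(f"with-skill judge gap: {judge_gap}")
print(f"opus-4-7 across judges: mean {opus_mean}, spread {opus_spread}")
```

reporting the cross-judge mean alongside the spread is one simple way to show how much of a score depends on who is grading.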

BUT THE MOST INTERESTING PART!!

The average leaderboard still looks very very very close to the previous benchmark. Check the screenshot attached.

u/rohansrma1 — 2 months ago

▲ 6 r/Agentic_Marketing+2 crossposts

the part that took us a while to realize about workspace roles

we went through this exact mess a few months ago when we were onboarding a bigger team and nobody had really thought through who should be able to do what. It sounds boring until you're four weeks in and someone has accidentally published something to the wrong workspace or a new hire has been sitting with basically no access because nobody remembered to bump their role up from the default. The default member role is pretty limited intentionally, like search and install only, which is fine for a joe-who-started-monday situation but it's easy to forget to revisit it.

what i ended up doing was just mapping out who needed what before touching any of the settings, because otherwise you end up clicking around reactively. The breakdown we landed on basically looks like this:

user type	what they actually need	role
org admin (samira)	manage all workspaces, add users, create new spaces	org admin
lead engineer (eddie)	publish in engineering workspace, read-only elsewhere	publisher in one workspace, member in others
manager (jennifer)	add users, manage workspace settings, publish	workspace admin
new hire (joe)	search and install skills, nothing else yet	member

the part that took us a while to realize is that workspace-level roles and org-level roles are separate, so you can give someone full access in one workspace without touching their permissions anywhere else. That's not obvious from just poking around the ui. We had someone set up like samira across everything when they really should have been like eddie, just scoped to their team's workspace.
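here's a tiny sketch of how i think about the separation. this is a mental model only, not tessl's actual data model or api: the class, the field names, and the "marketing" workspace are all made up for illustration, and it deliberately doesn't model what an org admin inherits inside a workspace.

```python
from dataclasses import dataclass, field

# Workspace roles that allow publishing (role names from the table above).
PUBLISHING_ROLES = {"publisher", "workspace admin"}


@dataclass
class User:
    name: str
    org_admin: bool = False  # org-level role, tracked separately
    workspace_roles: dict[str, str] = field(default_factory=dict)  # workspace -> role


def can_publish(user: User, workspace: str) -> bool:
    """Check the workspace-level role only; org-level access is a separate flag."""
    return user.workspace_roles.get(workspace) in PUBLISHING_ROLES


# Eddie: publisher in one workspace, plain member elsewhere.
eddie = User("eddie", workspace_roles={"engineering": "publisher", "marketing": "member"})
# Joe: default member role, search and install only.
joe = User("joe", workspace_roles={"engineering": "member"})

print(can_publish(eddie, "engineering"))
print(can_publish(eddie, "marketing"))
print(can_publish(joe, "engineering"))
```

the point is that eddie's publisher role lives on the workspace entry, so changing it never touches his access anywhere else.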

i'm probably biased here since i work at tessl, but this isn't something we put together internally, someone else worked through the same setup and documented it and it maps pretty closely to what our team kept running into: https://tessl.io/blog/tessl-admin-guide-organizations-workspaces-and-roles/

u/rohansrma1 — 4 days ago

▲ 3 r/AIDiscussion+1 crossposts

went through simon maple’s eval again and honestly the interesting part is not who wins, its how close everything is once you add skills.

baseline (no skills) still shows differences, sure. gpt-5.5 is ahead of the other openai models there. but the moment you give models structure and context, things compress a lot.

and then cost starts to matter way more than raw capability.

here’s the cleaner view:

Model	Baseline (no skill)	With skill	Cost/run	Time
claude-opus-4-7	80.8	93.4	$1.00	158.9s
cursor:composer-2	74.3	89.6	$0.23	152.0s
gpt-5.5	75.6	89.4	$0.49	89.5s
gpt-5.4	74.1	89.3	$0.30	135.4s
gpt-5.3-codex	65.5	83.9	$0.44	87.9s
gpt-5-codex	68.7	78.7	$1.05	136.2s

few things that stood out to me:

biggest gap is in baseline, not real usage
5.5 leads the openai models raw, but disappears into the pack with skills
5.4 almost same output for way cheaper
cursor is kind of wild on cost efficiency
opus still king on absolute score, but expensive

and then the weird one again: 5.3
lower baseline, lower final score, still costs more than 5.4
that one just doesnt make sense from any angle

also quick note, i work at tessl. we focus on agent enablement, basically helping teams run evals like this and manage skills, context, and workflows around models. so yeah i might look at this stuff more than normal people.

but takeaway feels pretty simple now:

models are getting good enough that how you use them matters more than which one you pick

skills, context, constraints thats where the real gains are.
model choice is starting to look like a pricing and latency decision more than anything else.

read the full breakdown here: https://tessl.io/blog/gpt-55-is-openais-best-model-but-paying-more-for-it-makes-no-sense/

u/rohansrma1 — 2 months ago

▲ 1 r/AI_Agents

Simon Maple just dropped a pretty clean benchmark, and the result is kinda funny

gpt-5.5 is the strongest model out of the box, no doubt. but once you give models skills (which is how people actually use them), it basically performs the same as gpt-5.4

like almost identical. same tasks, same setup, same outputs.

the only real difference is you pay a lot more for 5.5 just to get things done a bit faster.

Model	Task Scores (with skills)	Cost/run	Score per $
gpt-5.5	89.4	$0.49	182
gpt-5.4	89.3	$0.30	298
gpt-5.3	83.9	$0.44	191
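the "score per $" column is just score divided by cost per run, rounded. a minimal sketch using the numbers from the table (the variable names are mine):

```python
# (score with skills, cost per run in dollars) from the table above.
runs = {
    "gpt-5.5": (89.4, 0.49),
    "gpt-5.4": (89.3, 0.30),
    "gpt-5.3": (83.9, 0.44),
}

# Score points bought per dollar, rounded to the nearest whole number.
score_per_dollar = {model: round(score / cost) for model, (score, cost) in runs.items()}

# Model with the best value on this metric.
best_value = max(score_per_dollar, key=score_per_dollar.get)

for model, value in score_per_dollar.items():
    print(f"{model}: {value} points per $")
print(f"best value: {best_value}")
```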

so yeah:

5.5 vs 5.4 is basically 0.1 difference in score
but costs 63% more
only real win is speed

and the weird one, 5.3, is just a bad deal. costs more than 5.4 and still performs worse.

also quick disclosure: i work at tessl, which is an agent enablement platform focused on helping teams manage, evaluate, and improve the skills and context that AI agents rely on in real workflows

feels like we are hitting a point where picking a model is less about "which is smartest" and more about "what are you optimizing for, cost or latency".

u/rohansrma1 — 2 months ago

▲ 2 r/Agentic_Marketing+1 crossposts

In a new benchmark by Simon Maple, 1,742 tests across 45 scenarios and 11 real engineering skills show something surprising: GPT-5.5 is the most capable model out of the box, but once you add skills, it’s basically tied with GPT-5.4 (89.4 vs 89.3).

Now look at the cost. $0.49 per run vs $0.30. That’s a 63% premium for a 0.1 point gain.

The only clear win for GPT-5.5 is speed: ~89s vs ~135s. If latency matters, it’s defensible. If cost or value matters, it’s hard to justify.

The more interesting story is GPT-5.3. It scores worse (83.9), costs more than GPT-5.4, and burns tokens inefficiently. You’re literally paying more for less.

Check the full comparison here: https://tessl.io/blog/gpt-55-is-openais-best-model-but-paying-more-for-it-makes-no-sense/ (disclosing that i work for this org)

u/rohansrma1 — 2 months ago