u/hexxthegon

🔀 Unveiling TwinRouterBench, an open source router evaluation that looks at solutions, not just prompts. 🚦

TwinRouterBench has two tracks, static and dynamic, in one protocol.

Static focuses on cost and time efficiency: each question in the set is labeled with the most cost- and time-efficient steps, the router makes its prediction, and predictions are scored against the labels.

Dynamic focuses on end-to-end evaluation on SWE-bench Verified with real tool use, using either the mini-swe-agent scaffold or an editor scaffold.

Scoring:

Leaderboard bill = routed spend + a fixed penalty per unresolved task.

This measures the trade-off when underspending causes tasks to fail.
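
For anyone who wants the scoring rule as code, here is a minimal sketch. The formula is the one stated above; the `TaskResult` type, the function name, and the $1.00 penalty are my own placeholders, not values taken from the repo.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One routed task (hypothetical type, not from the TwinRouterBench repo)."""
    routed_spend_usd: float
    resolved: bool


def leaderboard_bill(results, penalty_usd):
    """Routed spend plus a fixed penalty for every unresolved task."""
    spend = sum(r.routed_spend_usd for r in results)
    unresolved = sum(1 for r in results if not r.resolved)
    return spend + penalty_usd * unresolved


# Assumed penalty of $1.00 per unresolved task, purely for illustration.
PENALTY_USD = 1.0

# A router that always picks the cheap model and fails two of three tasks...
cheap_router = [TaskResult(0.25, True), TaskResult(0.25, False), TaskResult(0.25, False)]
# ...versus one that spends twice as much per task and resolves all of them.
strong_router = [TaskResult(0.5, True), TaskResult(0.5, True), TaskResult(0.5, True)]

print(leaderboard_bill(cheap_router, PENALTY_USD))   # 2.75
print(leaderboard_bill(strong_router, PENALTY_USD))  # 1.5
```

With these made-up numbers the cheaper router ends up with the bigger bill, which is the underspending effect the penalty is meant to surface.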

It's an open source bench:

→ Apache-2.0
→ Reference routers included (gold-tier oracle, SR-KNN, more)
→ PRs welcome for new workloads, new routers, scaffolds

GitHub: https://github.com/CommonstackAI/TwinRouterBench

arXiv coming soon.

u/hexxthegon — 5 days ago
▲ 5 r/commonstack+1 crossposts

Cost illusion in Task vs Token between Opus 4.7 and K2.6 💭

Kimi K2.6 is 6x cheaper per token than Claude Opus 4.7.

But per task? It's only 39% cheaper.

Kimi K2.6 $0.76 per task
Claude Opus 4.7 $1.24 per task

Kimi burns so many tokens to complete a task that the 6x pricing advantage nearly disappears on benchmark.
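
A quick sanity check of the numbers above, as a small sketch. The per-task costs and the 6x figure are the ones quoted in this post; the function names are just mine.

```python
# Figures quoted in the post.
KIMI_PER_TASK_USD = 0.76
OPUS_PER_TASK_USD = 1.24
PER_TOKEN_PRICE_RATIO = 6.0  # "6x cheaper per token"


def percent_cheaper(cheap, expensive):
    """How much cheaper `cheap` is than `expensive`, as a percentage."""
    return (1 - cheap / expensive) * 100


def times_cheaper(cheap, expensive):
    """The same gap expressed as a multiple."""
    return expensive / cheap


per_task_pct = percent_cheaper(KIMI_PER_TASK_USD, OPUS_PER_TASK_USD)
per_task_x = times_cheaper(KIMI_PER_TASK_USD, OPUS_PER_TASK_USD)
per_token_pct = percent_cheaper(1.0, PER_TOKEN_PRICE_RATIO)

print(f"per token: {per_token_pct:.0f}% cheaper ({PER_TOKEN_PRICE_RATIO:.0f}x)")
print(f"per task:  {per_task_pct:.0f}% cheaper ({per_task_x:.2f}x)")
```

So a 6x gap on the price sheet shrinks to roughly 1.6x once you measure whole tasks.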

Cheaper per token doesn’t mean cheaper to use, unless it’s for specific tasks.

The model takes 2x the tokens and 7x longer to finish, so the savings may not be as large as the price sheet suggests.

It’s also important to recognize that Kimi K2.6 has a significantly smaller context window than Opus 4.7. Each model should be given different tasks to get the optimal cost out of a combined workflow.

Comparing cost per task against token prices is an interesting lens, but Kimi is open source, so if you have several Mac machines lying around, API cost wouldn’t be a factor at all.

Kimi is still a wonderful model that gives you more tries per million compared to Opus, so it should never be fully written off.

u/hexxthegon — 6 days ago
▲ 3 r/AI_Tools_Land+1 crossposts

👾 Easily add Commonstack into your workflow in 5 minutes 🔗

Run Claude Code with Commonstack in 4 steps:

- generate an API key
- set 4 environment variables
- run claude
- /status to verify

Set it up now in 5 minutes with Alex!

Get access to all the latest frontier models on Commonstack here: https://commonstack.ai/model-library

u/hexxthegon — 9 days ago

Kimi K2.6 looks absolutely stellar next to Gemini 3.1 Pro for Deep Research, matching across all of these domains.

Kimi K2.6 vs Gemini 3.1 Pro:

Input: $0.95/M vs $2/M (2.1x cost difference)

Output: $4/M vs $12/M (3x cost difference)

It’s not just at the frontier of multimodal and deep search, it also looks like an incredible deal for agentic intelligence.

It’s been a few weeks now. Design-wise the results seem mixed, but I’ve seen many great one-shot takes.

u/hexxthegon — 24 days ago

Both GPT 5.5 and Deepseek V4 Pro are available on Commonstack!

They are both great models for agents and coding tasks, use both or build around your suite!

GPT 5.5:

Input Size 0 ≤ tokens < 272,000

Input $5/M Output $30/M

Input Size 272,000 ≤ tokens < ∞

Input $10/M Output $45/M
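
If you want to estimate a bill from the GPT 5.5 tiers above, here is a rough sketch. Big assumption on my part: the tier is picked by the request's input token count and then applies to every input and output token in that request. The listing only gives the thresholds and prices, so check the provider's billing docs before relying on this. The function and constant names are made up.

```python
# (upper bound on input tokens, input $/M, output $/M); None means no upper bound.
GPT_55_TIERS = [
    (272_000, 5.0, 30.0),
    (None, 10.0, 45.0),
]


def request_cost_usd(input_tokens, output_tokens, tiers=GPT_55_TIERS):
    """Estimate one request's cost, ASSUMING the whole request is billed
    at the tier selected by its input size."""
    for upper_bound, input_price, output_price in tiers:
        if upper_bound is None or input_tokens < upper_bound:
            return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    raise ValueError("no tier matched")


print(request_cost_usd(100_000, 10_000))  # short-context tier
print(request_cost_usd(300_000, 10_000))  # long-context tier
```

Under that assumption the first request comes to about $0.80 and the second to about $3.45, so crossing the 272,000-token line costs a lot more than the extra tokens alone would suggest.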

Try Here:

https://commonstack.ai/model-library/model?modelId=2c2cc8f0-e582-47bb-bc94-2cc46774f5df

Deepseek V4 Pro:

Input $0.435/M Output $0.87/M

Try Here:

https://commonstack.ai/model-library/model?modelId=f5c8c221-9b13-488d-9a45-4ed0f0c3250f

u/hexxthegon — 25 days ago

Both GPT 5.5 and Deepseek V4 Pro are available on Commonstack!

They are both great models for agents and coding tasks, use both or build around your suite!

GPT 5.5:

Input Size 0 ≤ tokens < 272,000

Input $5/M

Output $30/M

Input Size 272,000 ≤ tokens < ∞

Input $10/M

Output $45/M

Try Here:

https://commonstack.ai/model-library/model?modelId=2c2cc8f0-e582-47bb-bc94-2cc46774f5df

Deepseek V4 Pro:

Input $1.74/M

Output $3.48/M

Try Here:

https://commonstack.ai/model-library/model?modelId=f5c8c221-9b13-488d-9a45-4ed0f0c3250f

u/hexxthegon — 25 days ago
▲ 39 r/commonstack+1 crossposts

BrokenArXiv is a benchmark of mathematical statements that look highly plausible and "academic" but are actually provably false.

Most math benchmarks test a model's ability to solve a real problem. BrokenArXiv tests for honesty and critical thinking by asking models to "Prove the following statement" for something that cannot be proven.

Somehow GPT 5.4 & 5.5 completely annihilate Opus by many multiples, and at a lower cost per completion.

Like it or not, it seems like Sama is having a generational comeback, as many users on X seem to prefer GPT 5.5 over Opus 4.7. Or could this be another case of Anthropic nerfing their models?

u/hexxthegon — 26 days ago
▲ 15 r/commonstack+1 crossposts

This is their latest leap, from V3.2 to V4. From what I’ve read, it seems like they had stability issues during post-training, so I think we can expect much stronger improvements when V4.1 comes.

But this is practically GPT 5.4 & Opus 4.6 for literal pennies on the dollar. The flash model itself is extremely impressive, and the overall lineup is even more cost efficient than many other Chinese SOTA models at this time.

GPT 5.4 pro vs DeepSeek V4 flash:

Input: $30/M vs $0.14/M (214x cost difference)

Output: $180/M vs $0.28/M (643x cost difference)

With both at a million tokens of context, DeepSeek V4 Flash is really a bargain for intelligence.

At number 3 in Arena for open models in coding, this was an incredible release.

u/hexxthegon — 29 days ago
▲ 69 r/commonstack+1 crossposts

Saw this heatmap experiment showing that even though these models come from different companies and have different architectures, their output personalities basically fall into two big stylistic attractors when viewed through Gemma 4.

  1. Picked 25 different LLMs (things like GPT-5.x, Claude Opus/Sonnet/Haiku 4.x, Grok 4.x, Gemini 3.x, DeepSeek, Qwen, MiniMax, Kimi, GLM, etc.).

  2. Gave all of them the exact same 50 prompts and collected their responses.

  3. Took every single response and fed it into Gemma 4 (Google’s latest model at the time).

  4. Inside Gemma 4, they pulled the residual stream activations — basically the raw internal “thought vectors” — from all 42 layers and averaged across every token in the response.

This created one giant vector per response: 107,520 dimensions (2560 dims per layer × 42 layers).

  5. For each of the 25 LLMs, they averaged those vectors across the 50 prompts → one “style vector” per model.

  6. Computed cosine similarity between every pair of those 25 vectors (how similar their outputs look inside Gemma 4’s brain).

  7. Plotted it as a heatmap (red = very similar, blue = very different) and sorted the rows/columns with hierarchical clustering so similar models group together.
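
The pooling and similarity steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the experiment's actual code: the activations are tiny made-up lists rather than real Gemma 4 residual streams, all function names are mine, and the hierarchical clustering and plotting are left out.

```python
import math
from statistics import fmean


def response_vector(layer_token_acts):
    """layer_token_acts[layer][token] is a hidden-state vector (list of floats).
    Mean-pool over tokens within each layer, then concatenate the layers."""
    vec = []
    for tokens in layer_token_acts:
        hidden_dim = len(tokens[0])
        vec.extend(fmean(tok[d] for tok in tokens) for d in range(hidden_dim))
    return vec


def style_vector(response_vectors):
    """Average one model's response vectors across all prompts."""
    n = len(response_vectors[0])
    return [fmean(v[i] for v in response_vectors) for i in range(n)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_matrix(styles):
    """Pairwise cosine similarity between every pair of style vectors."""
    return {a: {b: cosine(styles[a], styles[b]) for b in styles} for a in styles}


# Toy data: 2 "layers", 2-dim hidden states, one or two tokens per layer.
toy_response = [[[1.0, 3.0], [3.0, 5.0]], [[0.0, 2.0]]]
print(response_vector(toy_response))  # [2.0, 4.0, 0.0, 2.0]

# Hypothetical style vectors for three made-up models.
styles = {"model_a": [1.0, 0.0], "model_b": [2.0, 0.0], "model_c": [0.0, 1.0]}
sims = similarity_matrix(styles)
print(sims["model_a"]["model_b"], sims["model_a"]["model_c"])  # 1.0 0.0

# Size check from the post: 2560 dims per layer across 42 layers.
print(2560 * 42)  # 107520
```

Cosine similarity only looks at direction, so `model_a` and `model_b` count as identical here even though one vector is twice as long.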

What the heatmap shows:

- A very clear two cluster split:

•  Top left red/orange block → “GPT resemblance” family (GPTs, Grok 4.x, DeepSeek, MiniMax, Kimi, Trinity, etc.).

•  Bottom right red block → “Claude resemblance” family (Claude Opus/Sonnet, GLM, Qwen, Gemini 3.1 Pro, etc.).

- Outliers/exceptions (the post highlights them):

•  Claude Haiku 4.5 sits weirdly in the middle.

•  Gemini 3 Flash is way off on its own.

•  Gemma 4 itself and MiniMax M2.7 are also a bit separate.

From Gemma’s point of view, the models within each cluster gave nearly identical responses to the same 50 prompts.

The second heatmap uses real user prompts, and parts of the pattern still held up even though the visual looks very different.

Which model families are you guys using right now? Are LLMs commoditized to an extent where most general users can’t tell the difference? With many model families available now, capabilities might be getting more difficult to distinguish, especially if competing models could be served for free locally or at a fraction of the cost.

u/hexxthegon — 1 month ago
▲ 22 r/commonstack+1 crossposts

A few primary issues I saw from other users during the initial launch are that Opus 4.7 burns tokens like a volcanic eruption, plus a few other things about failing tool calls.

But since last night on X, some users have figured out how to ask questions differently, and Opus 4.7 is a very strong model, although nerfing Opus 4.6 left a bad taste in people’s mouths lel.

Within a week of GLM 5.1, Anthropic released Claude Opus 4.7 which delivers top SWE results.

SWE-Bench Pro:

Opus 4.7 (64.3%) vs GLM 5.1 (58.4%) vs Opus 4.6 (57.3%)

In Code, Opus 4.7 is also in a league of its own with 1583.

GLM 5.1 still delivers significant value, as it is great at long-horizon autonomous tasks and lands right in between Opus 4.6 and 4.7 in results.

GLM-5.1 vs Claude Opus 4.7:

Input: $1.4/M vs $5/M (3.6x cost difference)

Output: $4.4/M vs $25/M (5.7x cost difference)

(Price as of April 18th 2026 via Anthropic, Zhipu & Commonstack reference)

A mix of both will likely produce the best intelligence per dollar, where 80-90% of tasks are handled with GLM 5.1 and 10-20% are handled with Opus 4.7 for the greatest overall value.

Let GLM handle the planning and skeleton, then let Opus 4.7 fill in the gaps.
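
To put a rough number on that split, here is a back-of-the-envelope sketch of the blended price per million tokens. It uses the prices listed above, and it assumes both models use the same number of tokens per task. That assumption is almost certainly wrong in practice (Opus 4.7's token burn is mentioned earlier in this post), so treat the output as a lower bound on the Opus share of the bill. The dictionary keys and function name are placeholders of mine.

```python
# ($/M input, $/M output), as listed in the post (prices as of April 18th 2026).
PRICES = {
    "glm-5.1": (1.4, 4.4),
    "opus-4.7": (5.0, 25.0),
}


def blended_price(glm_share):
    """Blended $/M input and output when `glm_share` of tokens go to GLM 5.1
    and the rest go to Opus 4.7 (ASSUMES equal token use per task)."""
    opus_share = 1.0 - glm_share
    glm_in, glm_out = PRICES["glm-5.1"]
    opus_in, opus_out = PRICES["opus-4.7"]
    return (
        glm_share * glm_in + opus_share * opus_in,
        glm_share * glm_out + opus_share * opus_out,
    )


for share in (0.8, 0.9):
    blended_in, blended_out = blended_price(share)
    print(f"{share:.0%} GLM: input ${blended_in:.2f}/M, output ${blended_out:.2f}/M")
```

At an 80/20 split that works out to about $2.12/M input and $8.52/M output, compared with $5/M and $25/M for running everything on Opus 4.7.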

Redesigning workflows every few weeks is kind of a pain, but it’s what it takes to keep up.

u/hexxthegon — 1 month ago