u/cjami

▲ 34 r/Bard+1 crossposts

Gemini 3.5 Flash - tested on a social deduction benchmark

Hi folks, just wanted to share some fresh results on Gemini 3.5 Flash from a benchmark I've made.

This benchmark pits models against each other in autonomous games of Blood on the Clocktower - a highly complex social deduction game. If you're unfamiliar, it's like Mafia/Werewolf or The Traitors TV show.

Results:

Gemini 3.5 Flash performs strongly, hitting the top 5 and holding performance comparable to Gemini 3.1 Pro.

What's interesting here is the cost difference (based on API usage):

| Model | Cost per game |
|---|---|
| Gemini 3.1 Pro | $3.93 |
| Gemini 3.5 Flash | $2.26 |
| Gemini 3 Flash (Medium) | $0.34 |

3.5 Flash costs a little over half as much as 3.1 Pro (about 58% of the price) for comparable performance, and it is faster.

However, it also costs nearly seven times as much as 3 Flash. That's a huge difference, though it comes with a big jump in intelligence (rating change 1541 -> 1698).
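For anyone who wants to check the ratios, here is a minimal sketch using only the per-game costs quoted in the cost table above. The helper name `cost_ratio` is made up for illustration and is not part of the benchmark.

```python
# Per-game API costs in USD, copied from the cost table in this post.
cost_per_game = {
    "Gemini 3.1 Pro": 3.93,
    "Gemini 3.5 Flash": 2.26,
    "Gemini 3 Flash (Medium)": 0.34,
}

def cost_ratio(model: str, baseline: str) -> float:
    """Return how many times the baseline's per-game cost the model costs."""
    return cost_per_game[model] / cost_per_game[baseline]

flash_vs_pro = cost_ratio("Gemini 3.5 Flash", "Gemini 3.1 Pro")
flash_vs_old = cost_ratio("Gemini 3.5 Flash", "Gemini 3 Flash (Medium)")

print(f"3.5 Flash vs 3.1 Pro: {flash_vs_pro:.2f}x")  # 0.58x
print(f"3.5 Flash vs 3 Flash: {flash_vs_old:.1f}x")  # 6.6x
```

So 3.5 Flash comes in at roughly 58% of the 3.1 Pro cost and about 6.6 times the 3 Flash (Medium) cost.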

Taking a look at verbosity (this usually affects responsiveness and token consumption limits):

| Model | Average output tokens per action |
|---|---|
| Kimi K2.6 | 5,038 |
| Gemini 3.5 Flash | 1,590 |
| GPT-5.5 | 403 |

Verbosity during reasoning is moderate - almost the same as 3.1 Pro. Not the best or worst.

There is a 0% tool call error rate.

Its favourite word, extracted from recorded thoughts, is "inspect".

Notable Moves:

Notable Mistakes:

Overall this is an interesting model if you look at it as a faster, more affordable version of Gemini 3.1 Pro.

It feels a bit awkwardly named when compared to 3 Flash, given the price tier that it's in, but maybe this is just a naming shift creating space for the Flash-lite models?

Full transcripts: https://clocktower-radio.com/search?a=Gemini+3.5+Flash

How-it-works: https://clocktower-radio.com/how-it-works

u/cjami — 1 day ago

Hello! I've benched DeepSeek V4 Pro over the past few days and would like to share my results.

For context, this is based on a benchmark I've created that pits models against each other in autonomous games of Blood on the Clocktower - a highly complex social deduction game. If you're unfamiliar, it's like Mafia/Werewolf or The Traitors TV show.

Results:

DeepSeek V4 Pro has shown consistently strong performance against most models - losing out only to the top few.

It is well priced for its intelligence (based on non-discounted prices).

| Model | Cost per game |
|---|---|
| Gemini 3.1 Pro | $3.93 |
| DeepSeek V4 Pro | $1.24 |
| GLM 5.1 | $1.06 |

Its verbosity during reasoning is fairly restrained. This usually affects responsiveness and token consumption limits.

| Model | Average output tokens per action |
|---|---|
| Kimi K2.6 | 5,038 |
| DeepSeek V4 Pro | 1,199 |
| GPT-5.5 | 403 |

However, tool call reliability is a bit temperamental with a 5.0% error rate.

Notable Moves:

Overall fairly impressed - this provides strong intelligence for the price, especially when discounted, making it a great everyday model.

DeepSeek V4 Pro transcripts: https://clocktower-radio.com/search?a=DeepSeek+V4+Pro

How-it-works: https://clocktower-radio.com/how-it-works

u/cjami — 17 days ago

Following an impressive shake-up by Kimi K2.6, I've now got some results for Xiaomi's MiMo-V2.5-Pro.

For context, this is based on a benchmark I've created that pits models against each other in autonomous games of Blood on the Clocktower - a highly complex social deduction game. If you're unfamiliar, it's like Mafia/Werewolf or The Traitors TV show.

MiMo-V2.5-Pro joins Kimi K2.6 as another dominant player, both models pulling away from the crowd in their own class. Note that I have not yet benched GPT 5.5 (Xhigh) or Claude Opus 4.7 (Max), which may also land in this range.

Interestingly, its win rate is a bit lopsided (Good 88% / Evil 48%): it has an extremely high good team win rate, but a poorer evil team win rate holds it back from the top spot.
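To put the split into a single number, here is a rough sketch that combines the two win rates. It assumes games are divided evenly between the good and evil teams, which this post does not state, so `good_share=0.5` is a guess, and `overall_win_rate` is a made-up helper, not something from the benchmark.

```python
def overall_win_rate(good_rate: float, evil_rate: float, good_share: float = 0.5) -> float:
    """Weighted average of the per-team win rates.

    good_share is the fraction of games played on the good team.
    The 0.5 default is an assumption, not a figure from the benchmark.
    """
    return good_rate * good_share + evil_rate * (1.0 - good_share)

# Win rates quoted in the post: Good 88%, Evil 48%.
overall = overall_win_rate(0.88, 0.48)
print(f"Overall win rate: {overall:.0%}")  # 68%
```

Under that even-split assumption the overall win rate works out to about 68%. The benchmark's actual rating will also depend on opponent strength, which this average ignores.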

Why MiMo-V2.5-Pro over Kimi K2.6?

Kimi K2.6 has incredibly verbose reasoning at 580,000 average output tokens per game, leading to a $2.65/game cost. This also leads to long response times, with matches taking around 10-15 hours to complete. It feels a bit impractical for many use cases.

MiMo-V2.5-Pro, on the other hand, while slightly verbose at 183,639 tokens per game (similar to Gemini 3.1 Pro verbosity), costs less than half as much at a cooler $0.99/game. On the high end, Claude Opus 4.6 costs $3.76/game. Matches also usually finish in a more typical 2-3 hours (when not playing against Kimi).
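As a rough sanity check on where the saving comes from, the sketch below divides each model's per-game cost by its output tokens per game. This is a blended figure, not an API price: the per-game cost also covers input tokens, and the function name is invented for illustration.

```python
# (average output tokens per game, USD per game), as quoted in this post.
games = {
    "Kimi K2.6": (580_000, 2.65),
    "MiMo-V2.5-Pro": (183_639, 0.99),
}

def usd_per_million_output_tokens(tokens: int, usd: float) -> float:
    """Blended cost per million output tokens.

    The per-game cost includes input tokens too, so this is only a rough
    indicator and should not be read as the provider's output-token price.
    """
    return usd / tokens * 1_000_000

for name, (tokens, usd) in games.items():
    blended = usd_per_million_output_tokens(tokens, usd)
    print(f"{name}: ${blended:.2f} per 1M output tokens")
# Kimi K2.6: $4.57 per 1M output tokens
# MiMo-V2.5-Pro: $5.39 per 1M output tokens
```

On these numbers the blended per-token cost is in the same ballpark for both models, so most of the per-game saving appears to come from MiMo-V2.5-Pro emitting roughly a third as many tokens.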

It is also fairly reliable with a 0.4% tool call error rate.

This currently places it as the best value model at the top-end of the group.

Notable moves:

Notable mistakes:

MiMo-V2.5-Pro transcripts: https://clocktower-radio.com/search?a=MiMo-V2.5-Pro

How-it-works: https://clocktower-radio.com/how-it-works

u/cjami — 21 days ago
▲ 13 r/OpenAI

I've been benching GPT-5.5 for the past couple days and would like to share my findings.

This is based on a benchmark I've created that pits models against each other in autonomous games of Blood on the Clocktower - a highly complex social deduction game.

This is using GPT-5.5 on default settings, which would default to medium reasoning via the OpenAI API.

Findings:

GPT-5.5 holds decent performance over 34 matches (2 games per match) - notable wins against Kimi K2.6 account for a lot of its rating gain, but it is held back by some inconsistent performance against weaker models.

However, it is very token efficient for the amount of intelligence it's showing, at only around 52,000 tokens per game. Gemini 3.1 Pro ranks above it but uses 180,000 tokens per game, while Kimi K2.6 takes a brute-force reasoning approach with an eye-watering 570,000 tokens per game.

This results in a cost of $3.38/game, which isn't cheap. That is less than the $3.83/game for Claude Opus, while GLM 5.1 is still the value king at $0.91/game.
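To make the token gap concrete, here is a small sketch that expresses each model's tokens per game as a multiple of GPT-5.5's. The figures are the rounded ones quoted in this post, and the variable names are just illustrative.

```python
# Approximate tokens per game, as quoted in this post.
tokens_per_game = {
    "GPT-5.5": 52_000,
    "Gemini 3.1 Pro": 180_000,
    "Kimi K2.6": 570_000,
}

baseline = tokens_per_game["GPT-5.5"]

# How many times more tokens each model uses than GPT-5.5.
multiples = {model: tokens / baseline for model, tokens in tokens_per_game.items()}

for model, multiple in multiples.items():
    print(f"{model}: {multiple:.1f}x")
# GPT-5.5: 1.0x
# Gemini 3.1 Pro: 3.5x
# Kimi K2.6: 11.0x
```

In other words, Gemini 3.1 Pro uses about 3.5 times as many tokens per game as GPT-5.5, and Kimi K2.6 about 11 times as many.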

It is also fully reliable with a 0% tool call error rate.

Notable moves:

Notable mistakes:

GPT-5.5 transcripts: https://clocktower-radio.com/search?a=GPT-5.5

How-it-works: https://clocktower-radio.com/how-it-works

u/cjami — 22 days ago