u/ENT_Alam

▲ 15 r/ChatGPT

Some Notes:

  • The released benchmarks for GPT 5.5 showed marginal gains; if anything, I thought GPT 5.5 might have been more of an improvement on OpenAI's end than the consumer end (providing the same level of output with far fewer thinking tokens and less compute), but after benchmarking them here, I was pretty impressed.
    • Though again, I can see how people might interpret the results to be quite similar in quality
  • I will say, with the 5.5 family, the differences between the Pro and standard model are (in my opinion) the least pronounced they've ever been; 5.5 -> 5.5 Pro have very similar output quality
    • It's uncanny how similar their outputs are actually; I'll likely have to look into adding more difficult/technical prompts; feel free to suggest new ones on the repo
  • Total cost was $19.98 | Average inference time was: 624 seconds
    • GPT 5.4 was ~$25 in total; I don't remember the exact cost and unfortunately wasn't documenting costs like I am now
      • Despite the doubled API costs, OpenAI's claim about the model using far fewer thinking tokens and being faster is definitely true
      • I think most other benchmarks also found that GPT 5.5 came out at around the same cost, though I don't believe it's common for GPT 5.5 to end up cheaper, so this benchmark seems to be an outlier (or I'm remembering the price wrong)
    • If you enjoy these posts please feel free to help fund the benchmark
      • Thanks for all the support!! I've been able to benchmark GPT 5.5 Pro as well as a result (will post soon)
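
As a rough back-of-the-envelope check on the cost notes above, the two totals can be combined with the price change to estimate how much token usage dropped. This is only a sketch under loud assumptions: it treats "doubling the API costs" as per-token prices exactly doubling, takes the recalled ~$25 figure at face value, and assumes both runs covered the same prompts. The function name `implied_token_ratio` is made up for illustration.

```python
def implied_token_ratio(new_total: float, old_total: float, price_multiplier: float) -> float:
    """Estimate the new/old billed-token ratio from total costs and a price multiplier."""
    return (new_total / old_total) / price_multiplier


new_total = 19.98        # reported total cost for the GPT 5.5 run
old_total = 25.00        # approximate, recalled total for the GPT 5.4 run
price_multiplier = 2.0   # ASSUMPTION: per-token prices exactly doubled

cost_change = new_total / old_total - 1.0
token_ratio = implied_token_ratio(new_total, old_total, price_multiplier)

print(f"total cost change: {cost_change:.1%}")
print(f"implied billed tokens vs. before: {token_ratio:.1%}")
```

Under those assumptions the total is about 20% lower, and the run would have billed roughly 40% of the tokens it did before. If the ~$25 figure or the exact price multiplier is off, the estimate moves with it.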

Feel free to see all my thoughts on the GitHub release (thanks for the suggestion!). TL;DR:

  • GPT 5.5 Pro + DeepSeek V4 were also benchmarked
  • Made an official Twitter/X account
    • Don't really care to maintain it so probably won't be posting much, but thought it was a good suggestion
  • Added vertical gif comparison exports
    • Was doom scrolling and ran into an AI-slop post about my benchmark which was really cool lol
  • Actually optimized (well, tried to optimize) the backend
    • Still not the best, but serving 300MB JSONs isn't that easy 😭 developers please feel free to help contribute 🙏

Benchmark: https://minebench.ai/
Git Repository: https://github.com/Ammaar-Alam/minebench

Previous Posts:

Extra Information (if you're confused):

Essentially, it's a benchmark that tests how well a model can create a 3D Minecraft-like structure.

The models are given a palette of blocks (think of them like Legos) and a prompt of what to build; for example, the first prompt you see in the post was a fighter jet. The models then had to build a fighter jet by returning a JSON in which they gave the coordinates (x, y, z) of each block/Lego. It's interesting to see which model is able to create a better 3D representation of the given prompt.
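
To make that a bit more concrete, here is a minimal sketch of what such a JSON build could look like and how it might be checked. The field names (`blocks`, `x`, `y`, `z`, `type`), the tiny palette, and the helper functions are all hypothetical and made up for illustration; they are not taken from the actual MineBench schema, so check the repository for the real format.

```python
import json

# Hypothetical palette; the real benchmark's block list will differ.
PALETTE = {"stone", "glass", "iron_block"}


def parse_build(raw: str, palette: set[str]) -> dict[tuple[int, int, int], str]:
    """Parse a JSON build into a {(x, y, z): block_type} mapping."""
    data = json.loads(raw)
    grid: dict[tuple[int, int, int], str] = {}
    for entry in data["blocks"]:
        pos = (int(entry["x"]), int(entry["y"]), int(entry["z"]))
        block = entry["type"]
        if block not in palette:
            raise ValueError(f"block {block!r} is not in the palette")
        if pos in grid:
            raise ValueError(f"duplicate block at {pos}")
        grid[pos] = block
    return grid


def bounding_box(grid: dict[tuple[int, int, int], str]):
    """Return the (min corner, max corner) of all placed blocks."""
    xs, ys, zs = zip(*grid)  # iterating a dict yields its (x, y, z) keys
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


# A made-up three-block "build" in the assumed format.
example = json.dumps({
    "blocks": [
        {"x": 0, "y": 0, "z": 0, "type": "stone"},
        {"x": 1, "y": 0, "z": 0, "type": "stone"},
        {"x": 0, "y": 1, "z": 2, "type": "glass"},
    ]
})

build = parse_build(example, PALETTE)
print(len(build), bounding_box(build))
```

Keying the grid by coordinate makes duplicate placements easy to catch, and the bounding box gives a quick sense of how large a build is before rendering it.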

The smarter models tend to design much more detailed and intricate builds. The repository README might help give a better understanding.

(Disclaimer: This is a public benchmark I created, so technically self-promotion :)

u/ENT_Alam — 24 days ago
▲ 25 r/OpenAI

u/ENT_Alam — 25 days ago
