



At the end of the day, ARC-AGI 3 scores measure action efficiency relative to humans, as a squared relation.
That means quadratic penalisation for every linear multiple of inefficiency: take k times the actions a human needs and the penalty scales with k squared, not k.
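To make the squared relation concrete, here's a minimal sketch. It assumes the per-level score is the human-to-agent action ratio, capped at 1 and then squared. The function name `level_score`, the cap, and the exact form are my own illustration of the point above, not the official ARC-AGI 3 scoring code.

```python
def level_score(human_actions: int, agent_actions: int) -> float:
    """Hypothetical per-level score: (human / agent) action ratio, capped at 1, squared.

    Assumption for illustration only; not the official ARC-AGI 3 formula.
    """
    if human_actions <= 0 or agent_actions <= 0:
        raise ValueError("action counts must be positive")
    # Cap at 1 so beating the human baseline does not score above 100%.
    ratio = min(1.0, human_actions / agent_actions)
    return ratio ** 2


if __name__ == "__main__":
    human = 100  # made-up human baseline for one level
    for multiple in (1, 2, 4, 10):
        score = level_score(human, human * multiple)
        print(f"{multiple}x the human action count -> score {score:.4f}")
```

Under that assumption, 2x the human action count already drops a level to 0.25, and 4x drops it to 0.0625, even though the level was solved.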
And even if you have hours' worth of continual learning, which is absolutely not needed for something as small as ARC-AGI 3 games, you'll still score poorly if you take that many trials to figure it out. It's completely useless even if you clear 100% of the levels but take that many hours + steps to get there.
So just like with ARC-AGI and ARC-AGI 2, it has been an RL + test-time compute problem all along... now add token efficiency to the mix.
Given how massive a step change in token efficiency GPT-5.5 has been, and just the general trajectory of GPT models since "-5",
ARC-AGI 3 is destined to fall to this scale too.