u/qundefined

Created LLM quiz to check if AIs' performance varies over time

I've been noticing more and more posts and comments on Reddit claiming that LLMs are either getting dumber over time or performing differently throughout the day. I looked for long-term performance graphs or repos that track this, but came up empty after a 5-minute search across GitHub and Google.

So I ended up building LLM Canary.

What it is and how it works: the program fires a pseudo-randomized questionnaire at a set of LLMs, scores every answer programmatically, and logs the results. There are 25 questions per run: arithmetic tasks, counting letters, reversing a word, predicting JavaScript output, a chained password game with 5, 10, and 15 simultaneous rules, and more.
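For anyone wondering what "pseudo-randomized question plus programmatic scoring" could look like, here is a minimal Python sketch for one question type. It is not the actual LLM Canary code: the word list, the function names, and the "first integer in the reply" scoring rule are all assumptions made for illustration.

```python
import random
import re

# Hypothetical word pool; the real project presumably draws from something larger.
WORDS = ["ecophysiologies", "strawberry", "bookkeeper", "mississippi"]


def make_letter_count_question(rng):
    """Pick a random word and one of its letters; return (prompt, expected count)."""
    word = rng.choice(WORDS)
    letter = rng.choice(sorted(set(word)))
    prompt = (
        f"How many times does the letter '{letter}' appear in the word "
        f"'{word}'? Reply with just the number."
    )
    return prompt, word.count(letter)


def score_answer(reply, expected):
    """Pass only if the first integer found in the reply equals the expected count."""
    match = re.search(r"\d+", reply)
    return match is not None and int(match.group()) == expected


rng = random.Random(42)  # fixed seed so a given run can be reproduced
prompt, expected = make_letter_count_question(rng)
print(prompt)
print("expected:", expected)
```

Seeding the generator is one way to keep questions varied between runs while still being able to replay a specific run later.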

I ran it hourly via crontab for a week across 7 models: Claude Haiku 4.5, Claude Sonnet 4.6, GPT-4.1, GPT-4.1 Mini, GPT-4o Mini, GPT-4.1 Nano, Gemini 2.5 Flash Lite. The most consistent data came from Claude, since I only introduced the other providers partway through. Gemini's expensive flagships burned through budget too quickly to collect enough data, so they aren't included. Check the readme in the repo if you want to learn more.

Note: One week is not enough to prove or disprove the degradation claim yet — I need to run it longer and review performance week over week or month over month. What I have is a project capable of asking questions and establishing an ELO score.

FINDINGS

LLM ELO score fluctuations by Nth hour

First things first — ALL models fluctuate throughout the day and not in any consistent pattern. Some are more volatile, like Gemini 2.5 Flash Lite, while others like GPT-4.1 Nano show an island of steady, predictable performance with smaller deviations between 6 AM and 1 PM GMT+0. If API load were driving degradation at specific hours, you'd expect the same hours to look bad across multiple providers simultaneously — but that's not what we see here.
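One way to make the "same hours look bad across providers" check concrete is to correlate the hourly scores of two models. The sketch below is an assumption about how one could test this, not how the project does it, and the numbers are made up purely to show the mechanics.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up hourly pass rates for four hours (NOT data from the post).
model_a = [0.70, 0.72, 0.55, 0.71]          # dips in hour 2
model_b_shared = [0.60, 0.61, 0.48, 0.62]   # dips in the same hour
model_b_unrelated = [0.60, 0.48, 0.62, 0.61]  # dips in a different hour

print("shared dip:   ", round(pearson(model_a, model_b_shared), 2))
print("unrelated dip:", round(pearson(model_a, model_b_unrelated), 2))
```

If load were the driver, pairs of models from different providers would look like the first case (correlation close to 1). Dips that land in different hours give a much lower or negative value.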

With the data collected so far, there's no "smoking gun" clearly showing a model becoming dumber. Models struggle with hard questions, some more than others. So that's one immediate finding — a model that successfully answers a question once isn't guaranteed to pass it the next hour. What matters is consistency and question difficulty.

Next:

It isn't really fair to compare model to model by question since some are naturally better at math while others are designed for language and writing — but let's do it anyway.

Take `letter_count` for example. The prompt is something like:

How many times does the letter 'c' appear in the word 'ecophysiologies'? Reply with just the number.

Most models get this right 40–60% of the time. GPT-4.1 Nano and Gemini 2.5 Flash Lite, however, score an embarrassing 16.8% and 17.76% respectively.

Another interesting find: Claude Haiku 4.5, the cheaper Anthropic model, outperforms Claude Sonnet 4.6 at counting vowels in a paragraph (71.58% vs 64.74%). Almost everywhere else, Sonnet 4.6 takes the lead.

`count_f` is a prompt where the program takes random excerpts from the Bible and asks an LLM to count the letter 'f'. Pretty much ALL models fail here with around a 7.5% pass rate — they tend to skip stopwords like "of" and "for" — but Claude Sonnet 4.6, the most capable model in this list, manages 45.79%.
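To show how much skipping stopwords can cost on this kind of question, here is a small Python sketch. The stopword set and function names are hypothetical, and the sample sentence (the opening of Genesis 1:2, KJV) is just an illustration rather than an excerpt the project is known to have used.

```python
import re

# Illustrative subset of stopwords that contain the letter 'f'.
STOPWORDS = {"of", "for", "from", "if"}


def count_letter(text, letter):
    """Ground truth: count every occurrence, case-insensitively."""
    return text.lower().count(letter.lower())


def count_letter_skipping_stopwords(text, letter):
    """Simulates the failure mode: occurrences inside stopwords are ignored."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w.count(letter.lower()) for w in words if w not in STOPWORDS)


sample = ("And the earth was without form, and void; "
          "and darkness was upon the face of the deep.")

print("true count:        ", count_letter(sample, "f"))
print("skipping stopwords:", count_letter_skipping_stopwords(sample, "f"))
```

Here the true count is 3 ("form", "face", "of"), while a counter that skips "of" lands on 2. With exact-match scoring, being off by one is a fail.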

`word_count` is a similar test: the prompt takes a random paragraph from the Bible and asks the LLM to count the words. Again, most models skip stopwords and the average hovers around a 5.5% pass rate, though GPT-4o Mini manages 16.54%.

GPT-4.1 Nano is the weakest of the bunch. Its total average score is only 45% with an ELO of 965.98 — and it had the lowest scores on 9 out of 25 questions — while Claude Sonnet 4.6 leads at a 75% average and ELO 1293.29. A 327-point ELO gap might not sound dramatic on paper, but the per-question breakdowns make the performance difference pretty hard to ignore.
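For a rough sense of what a gap like that means, the sketch below plugs the two ratings into the standard Elo expected-score formula. The 400-point scale is the usual chess convention and an assumption here; the post doesn't say how LLM Canary computes its ratings.

```python
def elo_expected(rating_a, rating_b, scale=400.0):
    """Standard Elo expected score of A against B (a value between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


sonnet = 1293.29  # Claude Sonnet 4.6, from the post
nano = 965.98     # GPT-4.1 Nano, from the post

gap = sonnet - nano
print(f"gap: {gap:.2f}")
print(f"expected score for Sonnet vs Nano: {elo_expected(sonnet, nano):.3f}")
```

Under that assumption the higher-rated model would be expected to come out ahead in roughly 87% of head-to-head comparisons.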

Finally, going back to the within-day fluctuations (the max-minus-min ELO delta for each hour of the day): most models swing by roughly 150 points. The exception is Claude (both Haiku and Sonnet), where the hourly deltas sum to around 4.4k. Divide that by 24 hours and you get an average swing of ~183.3 ELO points per hour.
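The arithmetic here is simple enough to sketch in Python. The helper names and the two-hour example input are hypothetical; only the 4.4k sum and the 24 hours come from the post.

```python
def hourly_deltas(scores_by_hour):
    """Map each hour to (max - min) of the ELO scores observed in that hour."""
    return {hour: max(vals) - min(vals)
            for hour, vals in scores_by_hour.items() if vals}


def mean_delta(deltas):
    """Average swing across the hours that have data."""
    return sum(deltas.values()) / len(deltas)


# Made-up readings for two hours, only to show the shape of the input.
example = {0: [1200.0, 1350.0], 1: [1250.0, 1300.0]}
deltas = hourly_deltas(example)
print(deltas, mean_delta(deltas))

# The post's own numbers: deltas summing to ~4.4k over 24 hours.
delta_sum = 4400
hours = 24
print(round(delta_sum / hours, 1))  # 183.3
```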

That's probably what tips people off — it makes it feel like "Claude is dumber this morning than yesterday."

u/qundefined — 19 hours ago