
Small Model Forensics: benchmarking prefill and decode scaling across 9 models, 3 providers, 100–1M tokens
We made 2,000 API calls to nine small closed-weight models (Gemini Flash variants, GPT-4o-mini, GPT-4.1-nano, GPT-5.4-mini, Claude Haiku 4.5) across prompt sizes spanning four orders of magnitude.
Key findings:
Prefill scales sub-linearly or, at worst, roughly linearly for every model. Fitting power laws to minimum TTFT (time to first token) gives exponents ranging from 0.15 (Gemini 3.1 Flash Lite) to 1.02 (GPT-4.1-nano at the top end). No model exhibits the O(n²) prefill you'd expect from dense attention, even at 100K+ contexts where provider overhead becomes negligible.
Decode behavior varies wildly across providers. Gemini Flash Lite's per-token decode cost actually decreases at large context (from 4.6 ms/token to 3.3 ms/token). GPT-5.4-mini goes the opposite direction: from 7 ms/token at small context to 108 ms/token at 1M tokens. This suggests different inference architectures with different tradeoffs.
Model rankings invert across context sizes: GPT-4.1-nano is fastest below 1KB, while Gemini Flash Lite is fastest above 600KB. Quoting a single latency number for a model is meaningless without specifying the context size.
Gemini Flash Lite exhibits reproducible negative scaling around 100K tokens: a 144K-token input is faster than a 62K-token input. Both prefill and decode improve, which suggests a routing transition to different hardware.
Tokenizer efficiency differs by ~14% between Anthropic and OpenAI for the same English text.
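For anyone who wants to reproduce the power-law exponents from the first finding, here is a minimal sketch of one way to do the fit: take the minimum TTFT per prompt size, then run ordinary least squares in log-log space. The function names and the synthetic data are hypothetical and for illustration only. This is not necessarily the fitting procedure used in the post, and the numbers below are not from the dataset.

```python
import math

def min_ttft_by_size(samples):
    """Reduce (prompt_tokens, ttft_seconds) pairs to the minimum TTFT per prompt size."""
    best = {}
    for n, t in samples:
        best[n] = min(t, best.get(n, math.inf))
    sizes = sorted(best)
    return sizes, [best[n] for n in sizes]

def fit_power_law(tokens, ttft_seconds):
    """Fit ttft ~ a * tokens**b by least squares on (log tokens, log ttft); return (a, b)."""
    xs = [math.log(n) for n in tokens]
    ys = [math.log(t) for t in ttft_seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space = scaling exponent
    a = math.exp(mean_y - b * mean_x)   # intercept, mapped back to seconds
    return a, b

# Synthetic samples (NOT measured data): ttft = 0.05 * n**0.5, with slower
# repeats standing in for network jitter. The per-size minimum discards them.
samples = [
    (n, 0.05 * n ** 0.5 * jitter)
    for n in (100, 1_000, 10_000, 100_000)
    for jitter in (1.0, 1.3, 2.1)
]

sizes, ttfts = min_ttft_by_size(samples)
a, b = fit_power_law(sizes, ttfts)
print(f"exponent b = {b:.3f}, prefactor a = {a:.4f} s")
```

By construction the synthetic data recovers an exponent of about 0.5. An exponent near 1 would indicate roughly linear prefill and an exponent near 2 the quadratic scaling of dense attention. Using the minimum rather than the mean is meant to strip out queueing and network noise, which only ever adds latency.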
Interactive viewer, code, and raw dataset: https://blog.0xmmo.co/forensics/post.html