u/Ferozk03

▲ 45 r/LocalLLM+1 crossposts

The H100 GPU can theoretically do 62,000 tokens/sec. Production gets 200. I wrote a deep dive on why the gap is structural, with an interactive explainer.

Long story short: an 8B-parameter model in 16-bit precision is 16 GB of weights. For single-stream decoding, every generated token requires streaming the full set of weights from HBM to on-chip SRAM. With 3.35 TB/s of memory bandwidth, that works out to 3,350 GB/s / 16 GB ≈ 209, so roughly a 200 tokens/sec ceiling. The compute units, capable of about 1,000 TFLOP/sec, sit idle most of the time waiting for data.
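For anyone who wants to plug in their own numbers, here is a rough back-of-envelope sketch of the two ceilings. The function name and parameters are made up for illustration, and the compute side assumes the common rule of thumb of about 2 FLOPs per parameter per token, which is my assumption for how the 62,000 figure comes about. It ignores the KV cache, batching, and kernel overheads.

```python
def decode_ceilings(params_billions, bytes_per_param, hbm_tb_per_s, peak_tflops):
    """Back-of-envelope decode ceilings for one sequence (batch size 1).

    Hypothetical helper, not taken from the article.
    """
    # Weight footprint in GB: billions of parameters times bytes per parameter.
    weight_gb = params_billions * bytes_per_param

    # Bandwidth ceiling: each token needs one full pass over the weights.
    bandwidth_tokens_per_s = (hbm_tb_per_s * 1000) / weight_gb

    # Compute ceiling, ASSUMING ~2 FLOPs per parameter per token
    # (one multiply and one add per weight).
    flops_per_token = 2 * params_billions * 1e9
    compute_tokens_per_s = (peak_tflops * 1e12) / flops_per_token

    return weight_gb, bandwidth_tokens_per_s, compute_tokens_per_s


# Numbers from the post: 8B params, 16-bit weights, 3.35 TB/s, ~1,000 TFLOP/s.
weight_gb, bw_ceiling, compute_ceiling = decode_ceilings(8, 2, 3.35, 1000)
print(f"weights:           {weight_gb} GB")
print(f"bandwidth ceiling: {bw_ceiling:.0f} tokens/sec")
print(f"compute ceiling:   {compute_ceiling:.0f} tokens/sec")
```

With the post's numbers this lands at about 209 tokens/sec on the bandwidth side and about 62,500 on the compute side, which is the gap in the title.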

The article covers: the memory hierarchy bottleneck, KV cache tradeoffs, speculative decoding, diffusion LLMs, block diffusion, and where each sits on the roofline model.

Also built an interactive explainer with live animations for each concept: https://ferozk0333.github.io/memory-wall/

Please let me know where you think LLMs will become capable of real-time applications.

u/Ferozk03 — 5 days ago