The real LLM inference bottleneck isn't compute — it's memory bandwidth
Most people optimize for GPU utilization and wonder why inference is still slow. The issue is that transformer inference, at least in the decoding phase at typical batch sizes, is memory-bandwidth-bound, not compute-bound.
Here's what's actually happening:
During prefill, the whole prompt goes through in one forward pass, so each weight read is amortized over many tokens. That's manageable. But during autoregressive decoding, every single generated token requires re-reading the model weights plus ALL the KV cache for every active sequence from HBM. On an 80GB A100 at ~2TB/s bandwidth, a 70B model (which has to be quantized or sharded across GPUs to fit at all) with 4K context and batch size 8 can saturate that bandwidth before you've even started worrying about FLOPs.
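To put rough numbers on that, here is a back-of-envelope sketch of the HBM traffic for one decode step. The model shape is my assumption, not something stated above (80 layers, 8 KV heads with grouped-query attention, head dim 128, FP16 weights and cache), the function names are made up for illustration, and the 2 TB/s is treated as if one device held the whole model.

```python
def decode_step_traffic(n_params, weight_bytes_per_param,
                        n_layers, n_kv_heads, head_dim,
                        context_len, batch_size, kv_bytes_per_elem):
    """Return (weight_bytes, kv_bytes) read from HBM for one decode step."""
    weight_bytes = n_params * weight_bytes_per_param
    # Each cached token stores one K and one V vector per layer.
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_elem
    kv_bytes = kv_bytes_per_token * context_len * batch_size
    return weight_bytes, kv_bytes


def min_step_seconds(total_bytes, bandwidth_bytes_per_s):
    """Lower bound on step time if memory traffic were the only cost."""
    return total_bytes / bandwidth_bytes_per_s


# Assumed shape (not from the post): 70B params, 80 layers, 8 KV heads,
# head dim 128, FP16 everywhere, 4K context, batch size 8.
weights, kv = decode_step_traffic(
    n_params=70_000_000_000, weight_bytes_per_param=2,
    n_layers=80, n_kv_heads=8, head_dim=128,
    context_len=4096, batch_size=8, kv_bytes_per_elem=2,
)
step_s = min_step_seconds(weights + kv, 2e12)  # ~2 TB/s, single-device simplification
print(f"weights: {weights / 1e9:.1f} GB, KV cache: {kv / 1e9:.1f} GB")
print(f"bandwidth-bound floor: {step_s * 1e3:.1f} ms per decode step")
```

Under these assumptions the weight reads dominate the traffic. The KV cache share grows with longer contexts, larger batches, or a model without grouped-query attention.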
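The useful metric here is MFU (Model FLOPs Utilization): the ratio of achieved FLOPs to the hardware's theoretical peak. Low MFU on a bandwidth-bound workload means the compute units are sitting idle waiting on memory. Most production systems run at 30–50% MFU during decoding. If yours is higher, you're probably measuring prefill-heavy workloads.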
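If you want to compute MFU for your own stack, a minimal sketch looks like this. It assumes the common approximation of about 2 FLOPs per parameter per generated token (attention FLOPs ignored) and uses the A100's 312 TFLOPS dense FP16/BF16 peak. The function name and the 1,000 tokens/s throughput are hypothetical placeholders, not measurements.

```python
def estimate_mfu(n_params, tokens_per_second, peak_flops_per_s):
    """Approximate MFU: achieved forward-pass FLOPs over hardware peak."""
    # Assumption: ~2 FLOPs per parameter per token (one multiply, one add).
    achieved_flops_per_s = 2 * n_params * tokens_per_second
    return achieved_flops_per_s / peak_flops_per_s


A100_PEAK_FLOPS = 312e12  # dense FP16/BF16 tensor-core peak

# Hypothetical throughput; plug in your own measured tokens/s.
mfu = estimate_mfu(n_params=70_000_000_000,
                   tokens_per_second=1_000,
                   peak_flops_per_s=A100_PEAK_FLOPS)
print(f"MFU: {mfu:.1%}")
```

For multi-GPU serving, use the summed peak of all devices as the denominator so the number stays comparable.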
The three levers that actually help:
- Continuous batching (increase batch size to amortize weight reads)
- KV cache quantization (reduce the data being moved)
- Speculative decoding (verify several drafted tokens per weight read, which raises the compute-to-memory ratio)
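A toy model of how each lever changes the bytes read from HBM per generated token. It is a sketch under loud assumptions: speculative decoding is reduced to an average number of accepted tokens per step (draft-model cost and extra verification reads are ignored), and the sizes reuse an assumed 70B FP16 model with a 4K-context KV cache of about 1.34 GB per sequence. `bytes_per_token` is a made-up helper, not any library's API.

```python
def bytes_per_token(weight_bytes, kv_bytes_per_seq, batch_size,
                    tokens_per_step=1.0):
    """HBM bytes read per generated token for one decode step."""
    # Weights are read once per step; the KV cache is read once per sequence.
    step_bytes = weight_bytes + kv_bytes_per_seq * batch_size
    # Each sequence emits tokens_per_step tokens (more than 1 with speculation).
    tokens_emitted = batch_size * tokens_per_step
    return step_bytes / tokens_emitted


WEIGHTS = 140_000_000_000   # assumed: 70B params in FP16
KV_FP16 = 1_342_177_280     # assumed: per-sequence KV cache at 4K context
KV_INT8 = KV_FP16 // 2      # KV cache quantized to 8 bits

baseline = bytes_per_token(WEIGHTS, KV_FP16, batch_size=1)
batched = bytes_per_token(WEIGHTS, KV_FP16, batch_size=8)
quantized = bytes_per_token(WEIGHTS, KV_INT8, batch_size=8)
speculative = bytes_per_token(WEIGHTS, KV_FP16, batch_size=8,
                              tokens_per_step=2.5)  # hypothetical acceptance

for name, value in [("baseline", baseline), ("batched", batched),
                    ("kv int8", quantized), ("speculative", speculative)]:
    print(f"{name:12s} {value / 1e9:8.2f} GB per token")
```

In this model batching gives the biggest single drop because it divides the weight reads across sequences, while KV quantization matters more as context length and batch size grow.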
Curious what MFU numbers others are seeing in production. What's your hardware + serving stack?