Diffusion inference batching breaks when prompt embeddings vary in token length
Spent the last two weeks chasing a throughput regression on our SDXL serving stack after we enabled variable-length prompt support for the product team. Static batching with padded embeddings was giving us a clean 4.2 img/s per A100. We switched to continuous batching expecting the usual LLM-style win and got 2.8 img/s instead. The catch is that UNet cross-attention has nothing like the KV-cache reuse that autoregressive models get: every denoising step recomputes attention against the full text embedding, so padding waste is paid again on each of the 30 steps instead of being amortized. We ended up bucketing prompts by token length into three pools and running each pool as its own static batches. Boring fix, but the literature on diffusion serving assumes uniform-length prompts and nobody warns you about this.
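For anyone who wants to try the same idea, here is a minimal sketch of length bucketing. It is an illustration under assumptions, not a drop-in for any particular serving stack: the bucket bounds (32, 77, 154), the function names, and the padding cost model are all hypothetical. The only numbers taken from the setup above are the three pools and the 30 denoising steps. The cost model just counts padded key positions per step and ignores classifier-free guidance and the number of cross-attention layers.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical bucket upper bounds in tokens; pick these from your own
# prompt-length histogram.
BUCKET_BOUNDS: Tuple[int, ...] = (32, 77, 154)


def bucket_for(token_len: int, bounds: Sequence[int] = BUCKET_BOUNDS) -> int:
    """Return the smallest bucket bound that fits token_len."""
    for bound in bounds:
        if token_len <= bound:
            return bound
    raise ValueError(
        f"prompt of {token_len} tokens exceeds largest bucket {bounds[-1]}"
    )


def make_batches(
    token_lens: Sequence[int],
    batch_size: int,
    bounds: Sequence[int] = BUCKET_BOUNDS,
) -> List[Tuple[int, List[int]]]:
    """Group request indices into static batches, one pool per bucket.

    Each returned item is (pad_to, request_indices); every prompt in the
    batch is padded to pad_to tokens.
    """
    pools: Dict[int, List[int]] = defaultdict(list)
    for idx, n in enumerate(token_lens):
        pools[bucket_for(n, bounds)].append(idx)

    batches: List[Tuple[int, List[int]]] = []
    for bound in bounds:
        members = pools.get(bound, [])
        for start in range(0, len(members), batch_size):
            batches.append((bound, members[start:start + batch_size]))
    return batches


def padded_token_steps(
    token_lens: Sequence[int], pad_to: int, num_steps: int = 30
) -> int:
    """Padding tokens attended to, summed over all denoising steps."""
    return sum(pad_to - n for n in token_lens) * num_steps


if __name__ == "__main__":
    lens = [10, 20, 70, 150, 30, 77]
    for pad_to, members in make_batches(lens, batch_size=2):
        print(pad_to, members)
    # Two short prompts padded to the largest bucket vs. their own bucket.
    print(padded_token_steps([10, 20], pad_to=154))
    print(padded_token_steps([10, 20], pad_to=32))
```

A batch in a sparse pool can come out partially filled, so in practice you would probably add a max-wait timeout per pool or let short prompts spill into the next bucket up when their own pool is starved.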