▲ 0 r/mlscaling+1 crossposts

The real LLM inference bottleneck isn't compute — it's memory bandwidth

Most people optimize for GPU utilization and wonder why inference is still slow. The issue is that the decode phase of transformer inference is almost entirely memory-bandwidth-bound, not compute-bound.

Here's what's actually happening:

During prefill, you're loading model weights once per forward pass and reusing them across every prompt token, which is manageable. But during autoregressive decoding, every single token generation requires reading all the model weights plus ALL the KV cache for every active sequence from HBM. With an 80GB A100 at ~2TB/s bandwidth, a 70B model (which in FP16 has to be sharded across at least two such GPUs) with 4K context and batch size 8 can saturate that bandwidth before you've even started worrying about FLOPs.
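
A back-of-envelope version of that arithmetic, as a rough sketch. The layer count, KV-head count, and head dimension below are my assumptions for a Llama-2-70B-style model with grouped-query attention, and the 2 TB/s figure is treated as the aggregate bandwidth available to the model:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache stored per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def decode_step_seconds(n_params, n_layers, n_kv_heads, head_dim,
                        context_len, batch_size, bandwidth_bytes_per_s,
                        bytes_per_elem=2):
    """Lower bound on one decode step if it is purely bandwidth-bound."""
    weight_bytes = n_params * bytes_per_elem
    kv_bytes = (kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem)
                * context_len * batch_size)
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s


# Assumed config: 70B params, 80 layers, 8 KV heads, head_dim 128, FP16.
step_s = decode_step_seconds(70e9, 80, 8, 128,
                             context_len=4096, batch_size=8,
                             bandwidth_bytes_per_s=2e12)
print(f"bandwidth-bound floor per decode step: {step_s * 1e3:.1f} ms")
```

The floor is set by bytes moved, not FLOPs, which is why shrinking the weights or the KV cache moves it and faster math units don't.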

The useful metric here is MFU (Model FLOPs Utilization): the ratio of achieved FLOPs to theoretical peak. Decode-time MFU is usually low precisely because the step is bandwidth-bound; numbers in the 30–50% range are more typical of training or prefill. If yours is that high during serving, you're probably measuring prefill-heavy workloads.
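
If you want to compute MFU for your own stack, here's a minimal sketch. It assumes the common approximation of 2 FLOPs per parameter per generated token (attention FLOPs ignored), and the throughput number is a made-up placeholder, not a measurement:

```python
def decode_mfu(tokens_per_s, n_params, peak_flops):
    """Model FLOPs Utilization: achieved FLOPs/s over theoretical peak FLOPs/s."""
    achieved_flops_per_s = 2 * n_params * tokens_per_s  # ~2 FLOPs per param per token
    return achieved_flops_per_s / peak_flops


# Hypothetical: 500 tokens/s aggregate on a 70B model, one A100 (312 TFLOPS dense BF16).
mfu = decode_mfu(tokens_per_s=500, n_params=70e9, peak_flops=312e12)
print(f"MFU: {mfu:.1%}")
```

Use aggregate tokens/s across the whole batch, and multiply `peak_flops` by the GPU count if the model is sharded.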

The three levers that actually help:

  • Continuous batching (increase batch size to amortize weight reads)
  • KV cache quantization (reduce the data being moved)
  • Speculative decoding (change the compute/memory ratio)
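
To see why batch size is the first lever, here's a simple roofline-style sketch. It only counts weight reads (KV cache traffic is ignored, which is an assumption that flatters large batches) and uses A100-like peak numbers:

```python
def critical_batch_size(peak_flops, bandwidth_bytes_per_s, bytes_per_param=2):
    """Batch size at which weight-read time equals matmul compute time."""
    return bytes_per_param * peak_flops / (2 * bandwidth_bytes_per_s)


def decode_step_time(n_params, batch_size, peak_flops, bandwidth_bytes_per_s,
                     bytes_per_param=2):
    """Step time under a roofline model: the slower of memory and compute."""
    memory_s = n_params * bytes_per_param / bandwidth_bytes_per_s
    compute_s = 2 * n_params * batch_size / peak_flops
    return max(memory_s, compute_s)


print(critical_batch_size(312e12, 2e12))
for b in (1, 8, 64, 256):
    print(b, decode_step_time(70e9, b, 312e12, 2e12))
```

Below the critical batch size the step time doesn't change with batch size, so every extra sequence is close to free. That's the amortization continuous batching is chasing.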

Curious what MFU numbers others are seeing in production. What's your hardware + serving stack?

reddit.com
u/ArchitectingAI — 3 days ago
▲ 1 r/deeplearning+1 crossposts

The Evolution of Context Representation: From RNNs to Hybrid Memory Models

I’m working on a deeper write-up about how neural networks have evolved to learn and represent context.

In the meantime, I created this diagram mapping the progression from:

  • RNNs and gated memory
  • Convolutional sequence models
  • Attention and Transformers
  • State-space models
  • Hybrid architectures
  • Test-time learning and persistent memory

The key shift is not just toward larger context windows, but toward better mechanisms for deciding what to retrieve, compress, retain, forget, or update during inference.

Would appreciate feedback, especially on any important architectures or transitions I may have missed.

https://preview.redd.it/pp1f53eju49h1.png?width=2580&format=png&auto=webp&s=c69973a17f1589736278a78927737d690b95de26

u/ArchitectingAI — 5 days ago
▲ 30 r/MLQuestions+1 crossposts

Quantum ML for classical ML engineers — what's actually real vs. hype (and what to ignore)

After spending weeks cutting through QML research, here's my honest take for working ML engineers:

What QML will NOT do (near-term):

  • Speed up your Transformer inference
  • Make your LLMs cheaper or faster
  • Replace PyTorch or CUDA anytime soon

Where QML might actually matter:

  • Combinatorial optimization problems (logistics, scheduling) where classical heuristics plateau
  • Quantum-native sampling for certain generative model variants
  • Hybrid QPU+GPU pipelines for specific kernel computations

The actual architecture shift:
Classical ML: data → classical features → GPU → output
Hybrid QML: data → quantum feature map → QPU circuit → measurement → classical post-processing → output
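
To make the hybrid pipeline concrete, here's a toy classical simulation of the quantum-kernel version of it. The feature map (one qubit per feature, an RY rotation on |0>) is my choice for illustration; on real hardware the overlap would be estimated from repeated measurements rather than computed exactly:

```python
import math


def angle_encode(x):
    """Quantum feature map: RY(x_i)|0> = cos(x_i/2)|0> + sin(x_i/2)|1> per feature."""
    return [(math.cos(xi / 2), math.sin(xi / 2)) for xi in x]


def quantum_kernel(x, y):
    """Fidelity |<phi(x)|phi(y)>|^2 between the two encoded product states."""
    overlap = 1.0
    for (a0, a1), (b0, b1) in zip(angle_encode(x), angle_encode(y)):
        overlap *= a0 * b0 + a1 * b1  # single-qubit inner product (real amplitudes)
    return overlap ** 2


def gram_matrix(points):
    """Classical post-processing input: kernel matrix for e.g. a kernel SVM."""
    return [[quantum_kernel(p, q) for q in points] for p in points]


data = [[0.1, 0.4], [0.2, 0.5], [2.9, 3.0]]
for row in gram_matrix(data):
    print([round(v, 3) for v in row])
```

For this particular feature map the kernel is trivially computable classically, so there's no quantum advantage here. It only shows where the QPU step sits in the flow.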

The QPU isn't replacing the GPU — it's handling a narrow subproblem that classical hardware struggles with structurally.

What to watch:

  • Variational Quantum Eigensolvers (VQE) applied to molecular ML
  • Quantum kernel methods vs. classical kernel SVM at scale
  • IBM/Google error correction timelines (current QPUs are noisy — NISQ era limitations are real)

The honest answer: if you're building production ML systems today, QML is a 3-5 year horizon story. But understanding the fundamentals now puts you ahead of the curve, before the hype drowns out the signal.

Happy to discuss specific use cases or go deeper on any of these areas.

u/ArchitectingAI — 5 days ago
▲ 3 r/LocalLLM+1 crossposts

RLHF, PPO, DPO, GRPO — these aren't alternatives. They're different layers of the same stack.

I kept seeing debates online about "PPO vs DPO" or "RLHF vs GRPO" framed as if you have to pick one. It took me a while to realize they're not competing — they operate at completely different levels of abstraction.

Here's the framework that clicked for me:

Layer 1 — Problem Framework
How do you model the learning problem? Options: Multi-Armed Bandit, Contextual Bandit, MDP, POMDP. Most LLM alignment work simplifies to contextual bandit (single-turn) or MDP (multi-turn/agentic).

Layer 2 — Solution Algorithm
How do you optimize the policy? This is where Policy Gradient, Actor-Critic, Monte Carlo, TD Learning live. PPO is an algorithm at this layer — not an alignment method.

Layer 3 — LLM Alignment Method
How do you apply RL to align an LLM? RLHF, DPO, GRPO, KTO, IPO all sit here. They differ in whether they need a reward model, how they compute the gradient, and what they optimize against.
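
As one example of "how they compute the gradient", here's a minimal sketch of the DPO objective for a single preference pair. It needs no reward model, only sequence log-probabilities from the policy and a frozen reference model. The argument names and the example numbers are mine:

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Hypothetical log-probs: the policy has shifted toward the chosen response.
print(dpo_loss(logp_chosen=-10.0, logp_rejected=-14.0,
               ref_logp_chosen=-12.0, ref_logp_rejected=-12.0))
```

When the policy matches the reference the margin is zero and the loss is log 2; it drops as the policy prefers the chosen response more than the reference does.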

Layer 4 — Inference-Time Optimization
How do you squeeze more quality at inference without retraining? Best-of-N, MCTS, beam search variants. o1/o3-style reasoning models are widely understood to lean heavily on this kind of inference-time compute, though their exact methods aren't public.
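
Best-of-N is the simplest of these to write down. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for a sampling call and a reward model or verifier:

```python
def best_of_n(generate, score, prompt, n=4):
    """Sample n candidates for the prompt and keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)


# Toy stand-ins: a canned "sampler" and a scorer that prefers longer answers.
canned = iter(["4", "2 + 2 = 4", "four", "The answer is 4."])
answer = best_of_n(lambda prompt: next(canned), len, "What is 2 + 2?", n=4)
print(answer)
```

No weights change here. Quality is bought purely with extra inference compute, which is why this sits in its own layer.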

Most tutorials teach one layer. Frontier labs engineer across all four.

Curious if others have a different mental model for this — especially how you think about the MDP vs contextual bandit framing for alignment.

https://preview.redd.it/7er72kzs5p8h1.png?width=1456&format=png&auto=webp&s=df6facb7c6ee51a132b376ab780ebe3ebb1b75f9

u/ArchitectingAI — 7 days ago
▲ 1 r/deeplearning+1 crossposts

When your AI agent fails, you usually can't tell which layer broke. Here's the map.

[deleted]

u/ArchitectingAI — 8 days ago
▲ 29 r/deeplearning+1 crossposts

Staff/Principal ML System Design interviews evaluate something most candidates completely miss

After going through these interviews at multiple FAANG/top-tier companies and running enough prep sessions to see patterns, the single biggest failure mode I observe is this: strong ML engineers treating a Staff-level system design interview like a Senior-level one.

The mental shift is subtle but important. At Senior level, you're mostly showing you can build things. At Staff/Principal, you're showing you can reason about systems — under ambiguity, scale constraints, latency budgets, failure modes, and competing business objectives simultaneously.

The best analogy I've found: it's like a behind-the-wheel driving test. There's no single "correct" driving style. But the instructor is watching whether you signal, check mirrors, manage speed, handle edge cases, and complete the route with consistent judgment. You can take slightly different paths and still pass. You fail when you lose structure — spending 20 minutes on model architecture while skipping tradeoffs, monitoring, and failure modes entirely.

The five dimensions I see evaluated most consistently: (1) problem decomposition, (2) tradeoff reasoning, (3) ML + infra depth together, (4) communication clarity, and (5) engineering maturity — thinking about failure modes, retraining loops, and operational cost, not just the happy path.

One underrated prep strategy: for every ML system you study (search, recommendations, fraud, etc.), decompose it into its core components, then trace how each component evolved in waves. Retrieval alone went from BM25 → dense → hybrid → RAG → agentic. Understanding why each wave happened makes you far more credible in deep dives.

I wrote this up in much more detail — full 8-stage interview framework, canonical system patterns, and how Principal engineers think differently — if anyone's actively prepping for these rounds. Link in comments.

What are the most common gaps you've seen — either in your own prep or candidates you've interviewed?

u/ArchitectingAI — 10 days ago
▲ 19 r/aiinfra+2 crossposts

The next AI infrastructure bottleneck isn't compute — it's moving data at energy costs transistors can't sustain

We talk constantly about scaling laws and model capabilities. The constraint that doesn't get discussed enough: moving data across chips, nodes, and racks at power densities modern datacenters can't keep up with.

A single H100 SXM draws up to 700W. Eight of them in a server is about 5.6kW, just for GPUs. Scale to 1,000 servers and the GPUs alone need roughly 5.6MW, which takes dedicated power infrastructure most sites don't have.
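
The power math is easy to rerun for your own cluster size. A small sketch, GPU draw only (CPUs, NICs, and cooling overhead are deliberately left out, so real numbers will be higher):

```python
def gpu_power_kw(servers, gpus_per_server=8, watts_per_gpu=700):
    """GPU-only power draw in kilowatts, assuming every GPU runs at its rated power."""
    return servers * gpus_per_server * watts_per_gpu / 1000


print(gpu_power_kw(1), "kW for one 8-GPU server")
print(gpu_power_kw(1000) / 1000, "MW for 1,000 servers")
```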

I wrote a deep dive on why photonic computing — using light instead of electrons to move data — is the infrastructure shift that everyone building at scale will eventually have to reckon with.

Covers: why interconnect is the real bottleneck, how photonic chips work differently from silicon, which companies are building in this space, and realistic timelines.

https://pawankjha.substack.com/p/from-gpus-to-photons-the-quiet-revolution

u/ArchitectingAI — 11 days ago
▲ 9 r/Vllm+4 crossposts

I wrote a deep dive on how large-scale LLM inference actually works — from user prompt to final token

Most explanations of LLM inference stop at "it's a transformer forward pass." The production reality is a lot more interesting.

I've been working on LLM inference systems in production and wanted to write the article I wish existed when I started — a complete end-to-end mental model covering the full stack:

  • How requests actually flow: CDN → API gateway → model router → inference runtime → GPU cluster
  • Why autoregressive generation creates a fundamentally different problem than training
  • The latency breakdown (TTFT vs TPOT vs throughput) and why they pull in different directions
  • What production monitoring actually looks like — not just GPU utilization, but hallucination rate, cost per request, distribution shift
  • Where memory becomes the real bottleneck (spoiler: it's why KV cache management matters so much)
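
On the latency breakdown, here's a rough sketch of how TTFT, TPOT, and throughput relate. The TPOT values per batch size are invented for illustration, not measurements:

```python
def request_latency_s(ttft_s, tpot_s, n_output_tokens):
    """End-to-end latency: time to first token, then one TPOT per remaining token."""
    return ttft_s + (n_output_tokens - 1) * tpot_s


def batch_throughput_tok_s(batch_size, tpot_s):
    """Aggregate decode throughput when every sequence emits one token per step."""
    return batch_size / tpot_s


# Hypothetical: a bigger batch makes each step slower but serves more sequences.
for batch_size, tpot_s in [(8, 0.05), (32, 0.08)]:
    latency = request_latency_s(ttft_s=0.5, tpot_s=tpot_s, n_output_tokens=101)
    throughput = batch_throughput_tok_s(batch_size, tpot_s)
    print(f"batch={batch_size}: {latency:.1f}s per request, {throughput:.0f} tok/s")
```

That's the tension in one loop: the larger batch wins on tokens per second while every individual user waits longer.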

This is Part 1 of a series. Upcoming parts go deep on KV cache, continuous batching, vLLM internals, speculative decoding, parallelism, and quantization.

Link: Architecting LLM Inference Part 1

Happy to answer questions or go deeper on any piece of this in the comments.

u/ArchitectingAI — 11 days ago
▲ 31 r/Vllm+5 crossposts

How LLM inference actually works at scale — a breakdown for anyone learning ML systems

One thing that confused me early on: I understood how LLMs are trained, but had no idea how they actually serve millions of requests efficiently.

Here's a quick breakdown of the key concepts:

Why inference is harder than it looks

A user sends a prompt → the model returns tokens. Simple on the surface. But underneath, the system is managing GPU memory, scheduling thousands of concurrent requests, and generating tokens one at a time in a loop.

KV Cache — every time the model generates a token, it needs to remember the context of everything before it. This is stored in a KV (key-value) cache. For long conversations, this cache can consume more GPU memory than the model weights themselves.

Continuous Batching — naively, you'd process one request at a time. Modern systems batch many requests together and schedule at the token level — finished requests leave the batch, new ones enter. This keeps the GPU busy and dramatically improves throughput.
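
Here's a toy simulation of why token-level scheduling helps. It compares continuous batching against static batching (a fixed batch that runs until its longest request finishes), counting decode steps only. It's a sketch of the scheduling idea, not how any particular serving engine implements it:

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each fixed batch occupies the GPU until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """Finished requests leave after every step; waiting ones take the free slots."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step emits one token per running request
        running = [r - 1 for r in running if r > 1]
    return steps


output_lengths = [2, 10, 2, 2]  # tokens each request needs to generate
print(static_batching_steps(output_lengths, batch_size=2))
print(continuous_batching_steps(output_lengths, batch_size=2))
```

The short requests no longer wait behind the long one, so the same work finishes in fewer steps.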

Tensor Parallelism — when a model is too large for one GPU, you split it across multiple GPUs. Each GPU holds a shard of the weight matrices and they communicate during the forward pass.
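
A minimal sketch of the column-parallel case, with plain lists standing in for GPU tensors. Each "GPU" holds a slice of the weight matrix's columns, computes its part of the output, and the parts are concatenated (the all-gather step). It assumes the column count divides evenly by the shard count:

```python
def matmul_vec(x, W):
    """y = x @ W for a vector x of length d_in and W given as d_in rows."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]


def shard_columns(W, n_shards):
    """Split W column-wise into n_shards equal slices, one per device."""
    width = len(W[0]) // n_shards
    return [[row[s * width:(s + 1) * width] for row in W] for s in range(n_shards)]


def tensor_parallel_matmul(x, W, n_shards):
    """Every shard sees the full input; outputs are concatenated afterwards."""
    partials = [matmul_vec(x, shard) for shard in shard_columns(W, n_shards)]
    return [value for part in partials for value in part]  # "all-gather"


x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(tensor_parallel_matmul(x, W, n_shards=2) == matmul_vec(x, W))
```

The concatenation is the communication cost: on real hardware it's a collective over NVLink or the network on every layer.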

The most important insight: there isn't one way to "scale" inference. High traffic needs replicas. Large models need tensor parallelism. Low GPU utilization needs better scheduling. Long contexts need KV cache management. Applying a fix aimed at the wrong bottleneck wastes money and doesn't solve the problem.

I've been writing a deep-dive series on all of this — just published Part 6 on parallelism strategies with hands-on experiments and code if anyone wants to go deeper:

https://pawankjha.substack.com/p/architecting-llm-inference-part-6

Happy to answer questions on any of this in the comments!

u/ArchitectingAI — 9 days ago