u/elise_moreau_cv

Diffusion inference batching breaks when prompt embeddings vary in token length

Spent the last two weeks chasing a throughput regression on our SDXL serving stack after we enabled variable-length prompt support for the product team. Static batching with padded embeddings was giving us a clean 4.2 img/s per A100. Switched to continuous batching expecting the usual LLM-style win, got 2.8 img/s instead. The nuance here is that the UNet cross-attention cost does not scale the way KV-cache reuse does for autoregressive models. Every denoising step recomputes attention against the full text embedding, so padding waste compounds across 30 steps rather than amortizing. We ended up bucketing prompts by token length into three pools and running them as separate static batches. Boring fix, but the literature on diffusion serving assumes uniform prompts and nobody warns you about this.
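
A minimal sketch of the length-bucketing idea described above, for anyone who wants the shape of it. The bucket bounds, batch size, and function names are assumptions for illustration (the post only says "three pools"), not the actual serving code.

```python
from collections import defaultdict

# Hypothetical bucket upper bounds in tokens; the post only says "three pools".
BUCKET_BOUNDS = (32, 64, 77)


def bucket_for(token_len, bounds=BUCKET_BOUNDS):
    """Return the smallest bucket bound that fits a prompt of token_len tokens."""
    for bound in bounds:
        if token_len <= bound:
            return bound
    raise ValueError(f"prompt of {token_len} tokens exceeds largest bucket {bounds[-1]}")


def make_batches(requests, max_batch_size=8, bounds=BUCKET_BOUNDS):
    """Group (request_id, token_len) pairs into static batches, one pool per bucket.

    Each batch is (pad_to, [request_id, ...]). Prompts are padded only up to
    their own bucket bound, so short prompts never pay for the longest one.
    """
    pools = defaultdict(list)
    for request_id, token_len in requests:
        pools[bucket_for(token_len, bounds)].append(request_id)

    batches = []
    for bound in sorted(pools):
        ids = pools[bound]
        for start in range(0, len(ids), max_batch_size):
            batches.append((bound, ids[start:start + max_batch_size]))
    return batches
```

Each returned batch can then run as an ordinary static batch with embeddings padded to `pad_to`. The trade-off is that fewer buckets mean fuller batches but more padding per prompt.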

reddit.com
u/elise_moreau_cv — 1 day ago

GPU spot reclaim rate on H100s jumped from 8% to 31% in Frankfurt

Our training cluster runs on spot H100s across three EU regions for cost reasons. Frankfurt has historically been the cheapest and most stable. April reclaim rate was 7.8%, manageable with checkpointing every 400 steps. May numbers came in yesterday at 31.4%, which means our effective hourly cost went up roughly 2.7x once you factor in restart overhead and lost gradient steps. Talked to two friends at other labs seeing the same shift in Frankfurt and Stockholm, nothing in Dublin yet. The complication is that the cost dashboards still show the same spot price per hour, so finance thinks nothing changed. Real cost is hour-price divided by completion probability, and nobody's reporting that metric. We moved to a 60/40 split with on-demand for the long runs.
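
A rough sketch of the "hour-price divided by completion probability" metric. It assumes completion probability is simply one minus the reclaim rate, and the spot price is a made-up placeholder. Under that simplification alone the April to May change is only about 1.34x; the 2.7x figure above also includes restart overhead and lost gradient steps, which this sketch does not model.

```python
def effective_hourly_cost(spot_price_per_hour, completion_probability):
    """Spot price divided by the probability that an hour of work completes."""
    if not 0.0 < completion_probability <= 1.0:
        raise ValueError("completion_probability must be in (0, 1]")
    return spot_price_per_hour / completion_probability


# Hypothetical spot price; the post does not give one.
PRICE = 10.0

# Assumption: completion probability = 1 - monthly reclaim rate.
april = effective_hourly_cost(PRICE, 1 - 0.078)
may = effective_hourly_cost(PRICE, 1 - 0.314)

# Ratio of effective cost, May over April, with the sticker price unchanged.
ratio = may / april
```

The point of reporting this number next to the sticker price is that it moves when the reclaim rate moves, even if the spot price per hour stays flat.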

u/elise_moreau_cv — 2 days ago

Why does every OpenClaw tutorial on YouTube end up pushing Hostinger?

I get it, running agents locally without proper sandboxing can be risky, and VPS setups are cleaner and safer for long-running workflows. But at this point it feels like every creator magically arrived at the exact same recommendation.

Is Hostinger actually that good for OpenClaw setups, or did they just sponsor the entire ecosystem? Also, I’m new to OpenClaw. I’ve tried it, but I want to dive deeper into the technical side of things. Can anyone suggest some good resources or creators who genuinely understand how it works?

u/elise_moreau_cv — 2 days ago

Diffusion model inference latency dropped 31% after rewriting our scheduler in custom CUDA

We had a 4-step distilled SDXL variant running at 340ms per image on A100s, which sounded fine until we measured the actual GPU utilization. Roughly 22% of wall time was the scheduler doing tensor copies between denoising steps, not compute. The nuance here is that torch.compile handles the UNet beautifully but cannot fuse across the scheduler boundary because the noise prediction depends on a Python-level conditional we wrote two years ago. Rewriting the DPM++ step as a single CUDA graph capture brought it to 234ms and pushed SM utilization above 80%. The lesson I keep relearning: profiling tells you where time goes, not where the architecture forces time to be wasted. Those are different problems and only one of them is fixable with better kernels.

u/elise_moreau_cv — 8 days ago

Three years of failed Notion second-brain setups. PARA, Zettelkasten, every popular template. None of them lasted past 4-6 weeks.

The thing that finally worked was the opposite of what most productivity content suggests. Instead of trying to capture everything I might need, I let the system stay deliberately incomplete. Three principles:

  1. Inbox first, organization later. Daily capture in one place, organize on weekends if at all. The friction of "where does this go" was killing capture rate.
  2. No templates beyond two pages. Daily note, weekly review. Everything else is just pages with whatever structure makes sense in the moment. Tried databases for everything for two years and the maintenance overhead was the real issue.
  3. Linked mentions over folders. Tags and backlinks let me find things without committing to a hierarchy. Folder hierarchies always end up wrong six months later.

The bigger insight was that productivity systems fail because they require ongoing maintenance to stay valuable. The system that wins is the one that has the lowest maintenance cost while still being good enough. "Good enough and used" beats "perfect and abandoned."

Three months in and this is the longest I've stuck with any setup. Wondering if anyone else has had similar experiences with the "less structure" approach.

u/elise_moreau_cv — 18 days ago
r/Notion

Was looking at non-NVIDIA inference options last week and stumbled on xLLM. It's an LLM inference engine open-sourced by JD.com, optimized for diverse AI accelerators. The repo has a technical report on arXiv from October 2025 that's worth reading.

A few things that stood out:

Day-zero support for GLM-5 in February, GLM-4.7 in December, GLM-4.6V in late 2025. Their release cadence tracking Chinese model launches is faster than vLLM's tracking of Llama or Qwen, which makes sense given who builds it.

Hybrid KV cache management built on top of Mooncake, with intelligent offloading and prefetching for global cache state. This is one of the more thoughtful KV cache designs I've seen outside of vLLM's PagedAttention. They've published actual benchmarks on the cache hit ratio impact.

Built specifically for Chinese AI accelerators (Huawei Ascend 910B, etc.) which are basically a black box from a tooling perspective for most Western teams. If you're looking at hardware diversification or running inference outside of CUDA, this is one of the few production-grade options.

The thing I haven't been able to verify is real-world latency vs vLLM on equivalent hardware. The technical report has internal benchmarks but I'd want third-party numbers before betting an inference layer on it.

Repo: github.com/jd-opensource/xllm. If anyone here has actually deployed it, I'd be curious what the gotchas are.

u/elise_moreau_cv — 24 days ago

Datadog dropped their State of AI Engineering report this week. The numbers reframed how I think about LLM reliability.

February 2026: 5% of all LLM call spans across their customer base reported an error. 60% of those errors were rate limits.

March 2026: 2% of spans returned errors, but rate limits were still ~30% of the total. That works out to 8.4 million rate limit failures across their telemetry in a single month.

The takeaway is that the dominant production failure mode for LLM apps is not hallucinations, not bad context, not flaky tools. It's plain capacity exhaustion. 429s and 529s, the boring kind of failure that classical infra engineers have known how to handle for 20 years.

What's making it worse is the architectural pattern most teams use. Variable ReAct loops and multi-agent collaboration produce concurrency spikes that exhaust shared org-level quotas in unpredictable bursts. Your p50 throughput looks fine and your p99 falls off a cliff.
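
One classical way to handle this kind of failure is a shared concurrency cap plus retries with jittered exponential backoff. The sketch below is an assumption about how that could look, not something taken from the report: `call_with_limits`, `backoff_delay`, and the `(status, body)` return shape of `call` are all hypothetical.

```python
import asyncio
import random

# Status codes treated as capacity errors, per the 429s and 529s mentioned above.
RETRYABLE_STATUSES = {429, 529}


class RateLimited(Exception):
    """Raised when every attempt ended in a retryable status."""

    def __init__(self, status):
        super().__init__(f"gave up after retryable status {status}")
        self.status = status


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


async def call_with_limits(call, semaphore, max_attempts=5, base=0.5, cap=30.0):
    """Run the coroutine function `call` under a shared concurrency cap.

    `call` is assumed to return a (status, body) tuple. The semaphore slot is
    released before sleeping, so a backing-off task does not block others.
    """
    for attempt in range(max_attempts):
        async with semaphore:
            status, body = await call()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt + 1 < max_attempts:
            await asyncio.sleep(backoff_delay(attempt, base, cap))
    raise RateLimited(status)
```

Sharing one `asyncio.Semaphore` across every agent loop in a process is what flattens the bursts; the jitter keeps retries from re-synchronizing into a second spike.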

The other line in the report that I keep thinking about: context quality, not volume, is the new limiting factor. Most teams aren't even close to using the full context window of their model. The 1M token capability is wasted if your retrieval pipeline can't pick the right 10K tokens.

Capacity engineering and context engineering are quietly becoming the two skills that move the needle in 2026 production LLM systems. Prompt engineering as a discipline is increasingly downstream of these.

u/elise_moreau_cv — 24 days ago

Diffusion inference batching breaks when classifier-free guidance scales differ per request

Spent two weeks chasing a throughput regression on our SDXL serving stack after we exposed CFG scale as a per-request parameter. Throughput dropped 38% even though batch sizes looked identical in the logs. To be precise, the issue was that CFG requires running conditional and unconditional passes, and when guidance scale is the same across a batch you can fuse them into one forward pass with a doubled batch dimension. The moment scales diverge, that fusion silently breaks and you fall back to sequential passes per item, but no metric flagged it because GPU util stayed at 94%. The nuance here is that GPU util is a useless health signal for diffusion serving once you have any per-request conditioning. We now bucket requests by CFG scale before scheduling, which got us back to 2.1x throughput. Compile traces caught it eventually but only after I stopped trusting dashboards.
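
A minimal sketch of bucketing by CFG scale before scheduling, so that every batch shares one scale and the doubled-batch pass stays possible. The function names, the batch size, and rounding scales to one decimal are assumptions for illustration. `cfg_combine` is the standard guidance formula, written on plain lists to keep the example dependency-free.

```python
from collections import defaultdict


def group_by_cfg_scale(requests, max_batch_size=8, precision=1):
    """Group (request_id, cfg_scale) pairs so each batch shares a single scale.

    Assumption: scales are rounded to `precision` decimals so near-identical
    values land in the same pool. Returns [(scale, [request_id, ...]), ...].
    """
    pools = defaultdict(list)
    for request_id, cfg_scale in requests:
        pools[round(cfg_scale, precision)].append(request_id)

    batches = []
    for scale in sorted(pools):
        ids = pools[scale]
        for start in range(0, len(ids), max_batch_size):
            batches.append((scale, ids[start:start + max_batch_size]))
    return batches


def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Rounding trades a small deviation from the requested scale for fuller batches; with `precision` set high enough it degenerates to exact matching.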

u/elise_moreau_cv — 26 days ago

torch.compile recompiled silently on 11 different input shapes and killed our p95 latency

Running inference on our product photography diffusion model, we used torch.compile and saw a 23% latency improvement in benchmarks. Moved it to prod and after two days the p95 latency had drifted up 40% from baseline. No alerts fired because mean latency stayed flat. The issue: compile creates separate cached graphs per input shape, and our images were arriving at 11 different aspect ratios in the wild, not the 3 we tested. Each new shape triggers a recompilation that blocks the thread for roughly 4 seconds. The nuance here is that compile's shape specialization is not documented as a footgun for variable-resolution workloads, but it is. Bucketing inputs to 4-5 fixed shapes at the API boundary dropped our recompilation events from around 300 per day to zero.
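
A minimal sketch of snapping incoming sizes to a fixed set of shapes at the API boundary, so the compiled model only ever sees shapes it has already traced. The bucket list and the function name are hypothetical; the post only says "4-5 fixed shapes".

```python
import math

# Hypothetical (width, height) buckets; the post does not list the real ones.
SHAPE_BUCKETS = (
    (1024, 1024),
    (1152, 896),
    (896, 1152),
    (1216, 832),
    (832, 1216),
)


def nearest_bucket(width, height, buckets=SHAPE_BUCKETS):
    """Return the bucket whose aspect ratio is closest to width/height.

    Distance is measured in log space so that landscape and portrait
    ratios (e.g. 16:9 and 9:16) are treated symmetrically.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)
    return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))
```

The caller would then resize, crop, or pad the input to the returned shape before it reaches the compiled model. Warming up each bucket once at startup moves the compile cost out of the request path.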

u/elise_moreau_cv — 1 month ago