r/mlscaling

▲ 15 r/mlscaling+2 crossposts

We built an open-source KEDA external scaler for GPU workloads - no Prometheus needed

Been running GPU inference workloads on k8s and got tired of the dcgm-exporter → Prometheus → PromQL → KEDA chain just to autoscale based on GPU utilization. 5 components, 15-30s metric lag, PromQL queries to maintain.


So I built keda-gpu-scaler — a KEDA external scaler that talks to NVML directly on each GPU node via a DaemonSet. Reads GPU utilization, memory, temperature, power and serves them over gRPC to KEDA. Sub-second metrics, no Prometheus in the loop.
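For context on what happens to the reported numbers downstream: KEDA hands external metrics to a Kubernetes HPA, which sizes the workload with the ratio formula from the Kubernetes docs. Below is a minimal Python sketch of that formula. The function name and the example values are illustrative assumptions, not code from keda-gpu-scaler, and scale-to-zero is handled by KEDA itself rather than by this calculation.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """HPA-style sizing: ceil(current_replicas * current / target)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, wanted))

# Hypothetical example: 4 replicas at 90% GPU utilization, target 60%.
print(desired_replicas(4, 90, 60))
```

With a 15-30s metric lag, `current_metric` is already stale when this runs, which is the gap the post is trying to close.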


Wrote about the architecture and why it has to be an external scaler (not a native one) on the CNCF blog: https://www.cncf.io/blog/2026/05/27/gpu-autoscaling-on-kubernetes-with-keda-building-an-external-scaler/


It ships with pre-built profiles for vLLM, Triton, training jobs, and batch workloads. Scale-to-zero works too.


GitHub: https://github.com/pmady/keda-gpu-scaler
Docs: https://keda-gpu-scaler.readthedocs.io


Still early (v0.1.0) so if you're running GPU workloads on k8s I'd appreciate feedback, bug reports, or contributions. Roadmap and open issues are on the repo.
reddit.com
u/Aware-Ticket-5585 — 22 hours ago
▲ 23 r/mlscaling+1 crossposts

"An OpenAI model has disproved a central conjecture in discrete geometry" (log scaling of inner-monologue compute in probability solving Erdős's planar unit distance problem)

openai.com
u/gwern — 1 day ago
▲ 3 r/mlscaling+1 crossposts

Zero-overhead MoE expert imbalance profiler for vLLM w benchmarks + why we differ from vLLM's built-in EPLB

If you're running a MoE model with --enable-expert-parallel, your experts are probably imbalanced. We measured 7.93× imbalance on Layer 0 of OLMoE with one GPU doing nearly 8× the work. Plumb measures it and fixes it.

What it does:

Hooks into a running vLLM or HuggingFace process via PyTorch hooks, no fork or restart required. Captures per-layer per-expert activation counts and computes imbalance ratios, then produces an expert→GPU placement recommendation.
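To make "imbalance ratio" and "placement recommendation" concrete, here is a rough Python sketch. It assumes the ratio is max load divided by mean load, and it uses a generic greedy heaviest-first heuristic for placement. Both are assumptions for illustration, not Plumb's actual definitions or algorithm, and the activation counts are made up.

```python
import heapq

def imbalance_ratio(loads):
    """Max load over mean load (assumed definition); 1.0 means perfectly even."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0

def greedy_placement(expert_counts, num_gpus):
    """Assign the heaviest experts first, each to the currently least-loaded GPU."""
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    placement = {}
    for expert, count in sorted(enumerate(expert_counts), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)
        placement[expert] = gpu
        heapq.heappush(heap, (load + count, gpu))
    return placement

def gpu_loads(expert_counts, placement, num_gpus):
    """Sum per-expert activation counts onto the GPU each expert is placed on."""
    loads = [0] * num_gpus
    for expert, gpu in placement.items():
        loads[gpu] += expert_counts[expert]
    return loads

# Hypothetical activation counts for 8 experts in one layer, 4 GPUs.
counts = [90, 10, 10, 10, 40, 40, 20, 20]
contiguous = {e: e // 2 for e in range(8)}  # experts 0-1 on GPU 0, 2-3 on GPU 1, ...
before = gpu_loads(counts, contiguous, 4)
after = gpu_loads(counts, greedy_placement(counts, 4), 4)
print(imbalance_ratio(before), imbalance_ratio(after))
```

This sketch ignores the constraint that each GPU usually has to hold the same number of experts, and it knows nothing about topology.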

These are prefill benchmarks (max_tokens=1, ~11 input tokens). Full results across multiple concurrency levels and two models in the repo, including a DeepSeek-V2-Lite run where blind rebalancing made things significantly worse.

On vLLM's native EPLB:

vLLM has its own EPLB and it works. A few differences:

vLLM requires --num-redundant-experts — extra VRAM per EP rank (~2.4GB for DeepSeek-V3). If you're memory constrained it can't run. Plumb has no such requirement.

vLLM's EPLB load-balances but ignores topology — it doesn't know which GPU is closest to which expert, so cross-NUMA dispatch cost stays high. Plumb adds a NUMA fine-tuning pass that pins each layer's hottest experts to same-socket GPUs after running EPLB, which vLLM doesn't do.

vLLM's EPLB also runs unconditionally. We benchmarked blind rebalancing on DeepSeek-V2-Lite: it peaks at 1.5× imbalance because the model was trained with balance losses, so there's nothing to rebalance. The communication overhead alone pushes p95 +226% at c=16. Plumb checks the imbalance ratio first and won't apply anything without warning you.

GitHub: https://github.com/plumb-moe/plumb

Benchmark scripts and raw data are in the repo.

We're going to be running more benchmarks and trying more strategies over the next few weeks, hope you look forward to those results :)

u/plumb-moe — 3 days ago
▲ 4 r/mlscaling+1 crossposts

Has anyone quantified the actual compute waste from training divergence at scale? Trying to understand how common rollback and restart really is in practice.

reddit.com
u/Radianis — 3 days ago
▲ 11 r/mlscaling+1 crossposts

Scaling LLMs horizontally: hidden-state coupling without weight modification [R]

Residual Coupling (RC) connects frozen language models in parallel using small, learned linear bridge projections. These bridges read hidden states from one model and inject additive updates into the residual stream of another at intermediate layers. In bilateral setups, simultaneous return bridges form a feedback loop that stabilizes both streams without altering base weights.
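The mechanism described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the function names, the scalar `gate`, and the tiny dimensions are hypothetical, and the real implementation (see the linked code) operates on transformer hidden states at chosen layers.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def coupled_hidden(h_target, h_source, W_bridge, gate=1.0):
    """Add a linear projection of the source hidden state to the target's residual stream."""
    update = matvec(W_bridge, h_source)
    return [t + gate * u for t, u in zip(h_target, update)]

def bilateral_step(h_a, h_b, W_ab, W_ba, gate=1.0):
    """Simultaneous bridges: both updates read the pre-update states."""
    new_b = coupled_hidden(h_b, h_a, W_ab, gate)
    new_a = coupled_hidden(h_a, h_b, W_ba, gate)
    return new_a, new_b

# Model A has hidden size 2, model B has hidden size 3; only the bridges are trainable.
h_a = [1.0, 2.0]
h_b = [0.5, 0.5, 0.5]
W_ab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # maps A's space into B's space
W_ba = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]     # zero bridge: A is left untouched
new_a, new_b = bilateral_step(h_a, h_b, W_ab, W_ba)
print(new_a, new_b)
```

A zero-initialized bridge leaves the frozen model's behavior unchanged, which is one reason additive residual injection is convenient here.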

This architecture establishes a two-step paradigm where base models function as memorizers, while lightweight linear bridges handle cross-domain generalization. Constraining the bridges to purely linear maps prevents overfitting because they can only map existing geometric relationships between the frozen representation spaces. As the bridges are optimized against ground-truth target data, they have no incentive to map ungrounded features such as individual models' hallucinations.

Keeping the base weights completely frozen eliminates catastrophic forgetting. The system maintains operational closure, transforming inputs through its existing structure rather than changing to accommodate them.

Evaluating bilateral RC against Mixture-of-Experts (MoE) routing across the same frozen models shows these results:

  • Medical (3-model): Reduces perplexity to 11.02, compared to 56.80 for MoE and 57.08 for the frozen baseline. This represents an 80.7% reduction.
  • TruthfulQA Health (MC1): Improves accuracy by 9.1 percentage points over the baseline. Independent models have uncorrelated hallucinations, allowing the bridge gates to amplify consistent cross-model updates while suppressing individual errors.
  • Coding Test: CodeGPT-small-py and GPT-2 use different tokenizers, causing a 7-million baseline perplexity on mismatched text. MoE reaches 878, but RC achieves 5.91 by reading hidden states before the output projection collapses.

This framework introduces a horizontal scaling axis for multi-model systems, moving beyond vertical scaling via larger monolithic models. Latency remains bounded by the slowest single model. Specialists can be added or removed without retraining the remaining system. In some scenarios, this architecture could replace multi-turn text prompting in agentic workflows with a single parallel forward pass, allowing models and/or bridges to run on separate nodes or edge devices without a central bottleneck. By decoupling memorization from relational alignment, RC bridges provide a framework for scaling multi-model systems and offer a path toward native multi-modal integration.

Paper: https://ssrn.com/abstract=6746521

Code: https://github.com/pfekin/residual-coupling/

i.redd.it
u/kertara — 3 days ago
▲ 10 r/mlscaling+1 crossposts

Any resource to study GPU programming for Deep Learning?

I've been learning deep learning for a while, and recently I've become really interested in the GPU/systems side as well. I want to reach a level where I can understand and work on issues like bottlenecks, memory optimization, CUDA, distributed training, etc. Do you have any good resources, courses, or projects you'd recommend for this path?

reddit.com
u/yavuzibr — 5 days ago
▲ 59 r/mlscaling+2 crossposts

I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses

RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to build a fully automated red-teaming loop with reinforcement learning on both the attacker and defender.

The difficult part was making the attacker expose a diverse range of attacks. In our first run, GRPO quickly collapsed to the same fiction-writing jailbreak over and over. It worked, but it didn’t surface many distinct vulnerabilities. After clustering the rollouts by underlying attack tactic and dividing reward by cluster size, the attacker exposed a much more diverse set of jailbreaks because unique strategies were rewarded more than repeated ones.
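The cluster-size reward shaping can be sketched as below. This is a minimal illustration, assuming cluster labels are already assigned to each rollout; the function name and the tactic labels are made up, and the clustering step itself is not shown.

```python
from collections import Counter

def diversity_shaped_rewards(rewards, cluster_ids):
    """Divide each rollout's reward by the size of its tactic cluster."""
    sizes = Counter(cluster_ids)
    return [r / sizes[c] for r, c in zip(rewards, cluster_ids)]

# Four successful jailbreaks: three reuse the same tactic, one is novel.
rewards = [1.0, 1.0, 1.0, 1.0]
clusters = ["fiction", "fiction", "fiction", "roleplay"]
shaped = diversity_shaped_rewards(rewards, clusters)
print(shaped)
```

Each cluster's total reward is capped at what a single success would earn, so piling more rollouts onto one tactic stops paying off.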

Then we trained the defender on successful attacks plus benign boundary cases, so it learned to refuse harmful requests without refusing everything nearby.

Full blog post in the comments, but the high-level results were:

* defense rate: 64% → 92%
* benign accuracy: 92% → 88%
* attacker discovered 7 tactic families
* fiction/creative framing was the largest cluster at 34%

u/girishkumama — 7 days ago
▲ 11 r/mlscaling+4 crossposts

ML with Finance

Hi, I'm an MTech student in computer science and I want to work at the intersection of finance and machine learning. Can you suggest some research topics I could work on for my final-year thesis? During my MTech, my main focus has been machine learning and deep learning. I'm also interested in the finance domain and have done some projects, such as one using https://github.com/Zdong104/FNSPID_Financial_News_Dataset with market regimes. Now I'm looking for a solid research topic for my final year. Any suggestions?

u/Gullible_Space_4070 — 6 days ago

The 0% Challenge: Is any LLM actually "solving" SWE-Bench without memorization?

I've been looking at SWE-Bench leaderboards on and off over the past few years, and something still feels fundamentally broken about how we define "agentic capability."

We keep seeing models hit 30%, 40%, or even 60%+ on SWE-Bench Verified. The hype train says we're nearing "AI Software Engineers." But here's the elephant in the room: contamination isn't just a bug. It's the feature.

The "Air-Gapped" Hypothesis

Consider a simple experiment: force models to resolve issues in a completely isolated environment. No internet access, no searching for similar PRs, no issue IDs in the prompt.

My hot take? Most frontier models would see their scores collapse toward 0%.

Why this might be happening:

Verbatim patching: There's a growing informal consensus among practitioners who've run internal de-contaminated evals that models aren't genuinely "reasoning" through a codebase. Instead, they appear to be recalling specific Git commit hashes and file paths — because large chunks of SWE-Bench exist verbatim in pre-training corpora.

The "search" proxy: Many high-scoring agents use browse/search tools. In practice, they often locate the original GitHub PR that fixed the exact issue they're supposed to solve. That's not engineering. That's plagiarism with a tool-use wrapper.

Environment reality check: A real engineer can debug a legacy, private repo they've never seen before. Current LLMs tend to fall apart the moment you move them from "popular public Python repo" to "private internal codebase."

A small internal data point:

At a previous project, I tested a few frontier models on a set of private, post-cutoff issues from an internal codebase — no internet access, no issue IDs, no public traces. The same model that scored ~30% on SWE-Bench Verified dropped to effectively 0–2%. That's when I stopped treating this as a theory.

A challenge to benchmark creators:

If we want real progress, we need a Dark SWE-Bench:

Issues from private, non-scraped enterprise repos.

Issues created after the model's knowledge cutoff.

Zero external search capabilities during the run.

If a model can't produce a fix without having seen the solution in its training data, we aren't building "engineers." We're building very expensive compression algorithms for GitHub.

Curious to hear from anyone else who has run internal, de-contaminated evals. Did you see a similar massive drop? And has anyone found a model that actually reasons through multi-file dependency fixes without effectively cheating via memory?

reddit.com
u/OK_Simon_666 — 8 days ago
▲ 14 r/mlscaling+5 crossposts

Turns out "Claude Code over files in S3" quickly becomes "rebuild half the data warehouse stack"

Schemas, lineage, datasets, file refs - the agent needs to know everything! And there needs to be a system that stores all of these.

OpenAI's Data Agent post made us feel slightly less insane because they ended up building many of the same layers internally just on top of warehouses instead of object storage - https://openai.com/index/inside-our-in-house-data-agent/

Yes, most of these problems are already solved in warehouses, but they still need to be solved when working in S3/GCS/Azure.

I'd appreciate feedback from folks here: how do you work with large-scale datasets in object storage, and how do you supply context about them to agents?

u/dmpetrov — 9 days ago
▲ 28 r/mlscaling+4 crossposts

prompt caching, but for rl training - 7.5x speedup on long-prompt/short-response workloads

most open source RL engines pack sequences naively: prompt + response, repeated for every sample in the group. this is fine for short prompt, long completion workloads but inefficient for long prompt, short completion workloads. with 1000-token prompts and 100-token responses at G=8, you're processing 8800 tokens when only 1800 are unique. about 5x wasted compute.
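the token arithmetic above, as a quick Python check (function names are just for illustration):

```python
def naive_tokens(prompt_len, response_len, group_size):
    """Prompt is repeated once per sample in the group."""
    return group_size * (prompt_len + response_len)

def shared_prefix_tokens(prompt_len, response_len, group_size):
    """Prompt is processed once, followed by all G responses."""
    return prompt_len + group_size * response_len

naive = naive_tokens(1000, 100, 8)
shared = shared_prefix_tokens(1000, 100, 8)
print(naive, shared, round(naive / shared, 2))
```

the ratio grows as the prompt gets longer relative to the response, which matches the direction of the speedups reported below.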

the fix is conceptually simple: compute the prompt once, then compute all G responses after it. it's analogous to inference prefix caching, except training needs gradients to flow back through the prompt, which breaks causal attention in the obvious implementation. getting it right required different tricks for full vs. linear attention layers.
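one way to picture the full-attention case is an attention mask over a packed sequence laid out as [prompt][response 0]...[response G-1]. the sketch below is an assumption about what such a mask could look like, not the implementation from the blog post: each response token attends to the whole prompt and to earlier tokens of its own response only. it says nothing about the linear-attention layers or about how gradients are accumulated.

```python
def shared_prefix_mask(prompt_len, response_len, group_size):
    """mask[i][j] is True when query token i may attend to key token j."""
    total = prompt_len + group_size * response_len

    def segment(i):
        # -1 for prompt tokens, otherwise the index of the response the token belongs to.
        return -1 if i < prompt_len else (i - prompt_len) // response_len

    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < prompt_len:
                # Prompt keys: causal within the prompt, fully visible to every response.
                mask[i][j] = j <= i
            else:
                # Response keys: visible only causally within the same response.
                mask[i][j] = segment(i) == segment(j) and j <= i
    return mask

# Tiny example: 2 prompt tokens, 2 responses of 2 tokens each.
mask = shared_prefix_mask(2, 2, 2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

position ids would also have to restart at the prompt length for every response, so each response sees the same positions it would in the naive packing.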

you can read about it in the blogpost in the comments.

Numbers on Qwen3.5-4B:

- 16k prompt / 64 out → 7.5x

- 16k / 128 → 7.3x

- 16k / 1k → 5.4x

- 8k / 4k → 1.7x

u/girishkumama — 10 days ago
▲ 11 r/mlscaling+4 crossposts

ZERO-VRAM-SPEC: 1.3x faster code generation without taking any extra VRAM

https://github.com/neerajdad123-byte/zero-vram-spec
I replaced the draft model entirely with a Python rule-based AST predictor, which seems to work well at predicting grammar-forced tokens and indentation.

While doing this project I learned a lot about implementing all types of speculative decoding, how tokens work, and MTP (multi-token prediction).

Looking for an internship.
My passion is to build things.
Leave a star, it would help me a lot.

u/PangolinLegitimate39 — 8 days ago

GPT-5.5 and Opus 4.7 evaluated on ARC-AGI-3

Both models spent $10,000 (the limit). GPT-5.5 scored 0.4% and Opus 4.7 scored 0.2%.

This benchmark is quite difficult for clankers. It seems almost pointless to test current LLMs on it: they all score equally (about zero). My prediction of a 30% score in a year seems unlikely to come true.

It's probable that new breakthroughs (or at least much better base models) are needed here. (That said, when LLMs finally do chip a dent in ARC-AGI-3, even a little one, expect scores to shoot to 100% quite fast)

So far, so boring.

Less boring is the ARC Prize's analysis of how GPT-5.5 and Opus 4.7 played, based on reasoning from 160 games. The two models failed in extremely unlike ways.

Opus 4.7 aggressively theorycrafts, and learns game mechanics fairly well. But it assumes facts not in evidence, struggles to integrate new data into existing beliefs, and often can't (or won't) backtrack out of wrong assumptions. It ends up playing from a theory of the game that is "neat, plausible and wrong."

GPT-5.5 just...doesn't commit to a theory. Ever. It taps buttons but never seems to learn anything. In every turn, it sounds like an old man who has woken from a deep slumber and is seeing the game for the first time ("I'm analyzing a game with a grid..."). It blindly wonders if it's playing Tetris, or if the orange blocks are lava. Everything gets pattern-matched onto some existing videogame, with its previous reasoning forgotten.

It's funny that GPT-5.5 "doubles" Opus 4.7's score. To the extent this isn't noise, it's likely due to GPT-5.5's exploration-focused approach getting luckier a little more often.

tldr: Opus 4.7 is precise but inaccurate, GPT-5.5 accurate but imprecise.

Do tests like ARC-AGI-3 mean much, in the end? I'm not sure. I suspect the games were designed (in part) to focus around things that humans find easy and LLMs find hard, like spatial reasoning. But many important things (like robotics) involve spatial reasoning: I see this as defensible.

(I got around 80% on the two games I played. According to its creator, "Any smart human giving it real effort should score >90% on ARC-AGI-3". y u bully me man :( )

arcprize.org
u/COAGULOPATH — 12 days ago

A Network of Biologically Inspired Rectified Spectral Units (ReSUs) Learns Hierarchical Features Without Error Backpropagation | "Brain-like artificial neurons that teach themselves to recognize increasingly complex patterns by predicting the future from the past, without needing training data."

##Abstract:

>We introduce a biologically inspired, multilayer neural architecture composed of Rectified Spectral Units (ReSUs). Each ReSU projects a recent window of its input history onto a canonical direction obtained via canonical correlation analysis (CCA) of previously observed past-future input pairs, and then rectifies either its positive or negative component. By encoding canonical directions in synaptic weights and temporal filters, ReSUs implement a local, self-supervised algorithm for progressively constructing increasingly complex features.
>
>To evaluate both computational power and biological fidelity, we trained a two-layer ReSU network in a self-supervised regime on translating natural scenes. First-layer units, each driven by a single pixel, developed temporal filters resembling those of Drosophila post-photoreceptor neurons (L1/L2 and L3), including their empirically observed adaptation to signal-to-noise ratio (SNR). Second-layer units, which pooled spatially over the first layer, became direction-selective -- analogous to T4 motion-detecting cells -- with learned synaptic weight patterns approximating those derived from connectomic reconstructions. Together, these results suggest that ReSUs offer:
>
>- (i) a principled framework for modeling sensory circuits and
>- (ii) a biologically grounded, backpropagation-free paradigm for constructing deep self-supervised neural networks.


##Layman's Explanation:

Your brain learns to see without anyone telling it the right answers. This paper tries to build artificial neurons that work the same way.

Standard AI neurons (ReLUs) just add up inputs at one instant and ignore timing. Real neurons track patterns over time. The authors propose a new unit called a ReSU (Rectified Spectral Unit) that looks at a window of recent input history, finds the pattern most useful for predicting what comes next using a statistical method called canonical correlation analysis, and then outputs only the positive or negative part of that pattern.

They tested a two-layer ReSU network on natural images sliding across a simulated eye, mimicking how a fruit fly sees motion. Without any labeled training data or backpropagation, the first layer spontaneously developed filters matching real fly neurons (L1, L2, L3), and the second layer became direction-selective like the fly's motion-detecting T4 cells. The learned connection weights even resembled those mapped from actual fly brain wiring diagrams.

The core claim is that a single principle (maximize the information your past observations give you about the future, then split positive and negative responses across separate neurons) can explain how biological circuits self-organize into hierarchical feature detectors, and could eventually replace backpropagation in deep networks.


######Link to the Paper: https://arxiv.org/pdf/2512.23146


######Link to the Code: https://github.com/ShawnQin/ReSU

u/44th--Hokage — 12 days ago
▲ 7 r/mlscaling+1 crossposts

Looking for small group (2-3) for Systems for ML learning group

Hey everyone, I'm planning to start learning about ML systems, mainly inference, post-training, and (later) kernel optimization. The goal is to find like-minded people who are interested in joining, so we can collaborate on common projects or research and build/learn together over a few months. Not classical ML; I'm more of a systems guy, so I'm looking for the same. If you're already learning or working on something similar and looking for a partner or project contributor, feel free to post here. I'd be glad to join and learn/build together.

UPDATE: Created Discord channel: TheSynapse
NOTE: No guide/expert, looking for suggestions from everyone on how to work together in this learning goal

reddit.com
u/Ekcron — 13 days ago