r/mlscaling

▲ 21 r/mlscaling+3 crossposts

Hierarchos: Preliminary Findings From a 232M Recurrent Memory-Augmented Assistant Model [P]

Project Release / Research Draft] Hierarchos at 232M Parameters: Preliminary Findings From a Recurrent Memory-Augmented Assistant Model

Technical Report: July 2nd, 2026

Project: Hierarchos / KortexHOS

Authors: Makhi Burroughs / netcat420, Lost Time, and the Hierarchos project team

TL;DR:

We built and trained Hierarchos, an experimental 232M-parameter recurrent, memory-augmented language model from scratch. It is not a GPT-3/3.5-class model, but it successfully proves that a hybrid non-Transformer architecture (combining an RWKV backbone, hierarchical manager/worker loops, differentiable slot-based LTM, and a deterministic suffix automaton) can survive training, avoid collapse, and maintain short-form instruction coherence. Most of our breakthroughs came from fixing subtle train/inference parity mismatches and numerical stability bugs.

Dataset: netcat420/Experiment_0.1 (Alpaca format)
Training: 13 epochs on an RTX 6000 Blackwell (96GB) rental.

1. Introduction & Background

Modern LLMs are heavily dominated by Transformer scaling. Hierarchos explores a different path: can recurrent state, explicit memory retrieval, hierarchical iterative computation, and bounded local inference make a small model vastly more parameter-efficient?

Hierarchos isn't a direct clone of any single architecture, but a hybrid inspired by:

RWKV-style recurrence: For efficient sequence processing without traditional attention.
Titans-style neural memory: For persistent test-time memory.
Hierarchical reasoning (HRM): Multi-level recurrent modules (Manager/Worker) to iteratively refine state.

2. Architecture Overview

[Token Input] -&gt; [ROSA Suffix Matcher / DeepEmbed Modulator]
       |
       v
[Long-Term Memory] &lt;-&gt; [Top-k Associative Lookup]
       |
       v
[Manager Recurrent Cell] -&gt; (Produces Context Plan &amp; Drift Vector)
       |
       v
[Worker Recurrent Cell]  -&gt; (Refines local state / clamps drift)
       |
       v
[RWKV Backbone (Clamped Channel-Mix)] -&gt; [Next-Token Logits]

Key Components:

ROSA: A deterministic suffix-automaton path predicting continuation tokens based on exact repeated suffix patterns.
DeepEmbed: A token-specific modulation path that influences RWKV channel mixing.
LTM Subsystem: Learned slow-memory keys/values combined with fast working-memory values.
Manager/Worker Loop: High-level manager handles broad context to produce a target plan; the lower-level worker refines token-local state using a regularized drift vector.

3. Core Engineering Lessons (The "Gotchas")

A low training loss does not guarantee coherent chat. We had to fix several critical state-contract and numerical stability bugs to make the model usable:

1. Chat/Training Drift Mismatch

The Bug: During live streaming chat, the loop was feeding the previous drift state back into the model on every single token. During training, this state is reseeded at Truncated Backpropagation Through Time (TBPTT) chunk boundaries.
The Fix: We aligned the inference code to only reseed at boundary limits. Before this fix, live chat logits diverged sharply from training loss; after the fix, logit error dropped to near-zero.

2. Supervised LTM Inner Updates Mismatch

The Bug: Giving the model supervised memory updates during training that it can't replicate during zero-label live inference creates a crutch. The model learns to rely on a hidden training-only helper signal.
The Fix (v0.20.4): Implemented --ltm-training-mode read-only. Training keeps the memory structures but stops doing supervised fast-memory writes, perfectly mirroring inference.

3. Unbounded RWKV Channel Mixing

The Bug: Long runs exposed activation spikes in the ReLU-squared channel-mix FFN path, which were amplified by DeepEmbed modulation into NaN gradients.
The Fix: Implemented key clamps (--rwkv-channel-mix-key-clamp 12.0), DeepEmbed clamps (4.0), and excluded DeepEmbed identity gates from AdamW weight decay.

4. Evaluation & Smoke Test Results

Because cloud costs add up, we benchmarked the model locally on a CPU preset via a ROG Ally (--eval-limit 100), ensuring passive learning was disabled and working memory was cleared to mimic static chat.

Bounded Local Benchmark Metrics (--eval-limit 100)

Benchmark	Metric	Score	Std. Err.
ARC Easy	acc	0.3600	0.0482
ARC Easy	acc_norm	0.3200	0.0469
HellaSwag	acc	0.3400	0.0476
HellaSwag	acc_norm	0.3700	0.0485
TruthfulQA MC1	acc	0.2200	0.0416

Real-world Coherence Check:

The Good: Assistant-shaped, follows short instruction prompts well due to the Alpaca training data. Nontrivial commonsense and QA signal prove the weights didn't collapse.
The Bad: Brittle on long context lengths, weak on arithmetic/factual recall. Coherence is comparable to the GPT-2 era, not modern GPT-3.5+ systems.

5. Proposed Ablation & Scaling Plan

We want to transform this from a promising prototype into a rigorous scientific result. Our next step requires scaling tiers and isolated component testing.

Proposed Isolation Testing (Ablations)

No LTM / Read-Only LTM: Isolating exactly how much slot memory helps.
No ROSA / No DeepEmbed: Evaluating the real token-efficiency gains of suffix-matching and modulation.
Baseline Matches: Running a direct Transformer 232M and RWKV-only 232M on the exact same token budget to prove true comparative architecture efficiency.

Future Scaling Target Tiers

Tier	Model Size	Token Target	Purpose
Scout	300M–500M	20B–50B	Validate loss slope and stability scaling.
Real v1	1B–1.5B	100B–300B	Test architecture limits beyond small-scale behavior.
Serious	3B	600B–1.5T	Establish a truly competitive local open-source alternative.

Target Data Mix for Foundation Training:

Instead of jumping straight into instruction SFT data, a scaled run will prioritize high-quality base data:

35-50%: FineWeb / FineWeb-Edu style clean web text
20-30%: Dolma / DCLM curated web data
8-15%: Code and tech documentation
5-12%: Math, science, and academic proofs
1-5%: In-house assistant conversational SFT (applied exclusively in late-stage tuning)

6. What We Can (and Cannot) Claim Safely

What is supported by the data:

Hierarchos is a functional, coherent 232M experimental assistant checkpoint.
Combining recurrent sequence loops, memory slots, and hierarchical workers is viable and stable with the right clamps.
The findings provide a solid engineering roadmap for non-Transformer architecture stability.

What is NOT supported (Do not hype this!):

No claims of GPT-3.5 level math, coding, or logic.
No claims of attention/Transformer superiority at equal parameter counts yet (baselines pending).
Not production-ready for heavily quantized or low-bit local deployments yet due to drift sensitivity.

Final Thoughts

Hierarchos 232M shows that small, alternative architectures are still a deeply fruitful area of LLM research if you can conquer the train/inference state drift.

We would love to hear feedback from anyone working on recurrent neural memory or hierarchical backbones! Full code, scripts, and logs are in progress.

References:

Brown et al. **Language Models are Few-Shot Learners.** arXiv:2005.14165. https://arxiv.org/abs/2005.14165
Hoffmann et al. **Training Compute-Optimal Large Language Models.** arXiv:2203.15556. https://arxiv.org/abs/2203.15556
Peng et al. **RWKV: Reinventing RNNs for the Transformer Era.** arXiv:2305.13048. https://arxiv.org/abs/2305.13048
Behrouz et al. **Titans: Learning to Memorize at Test Time.** arXiv:2501.00663. https://arxiv.org/abs/2501.00663
Wang et al. **Hierarchical Reasoning Model.** arXiv:2506.21734. https://arxiv.org/abs/2506.21734
Zellers et al. **HellaSwag: Can a Machine Really Finish Your Sentence?** arXiv:1905.07830. https://arxiv.org/abs/1905.07830
Clark et al. **Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.** arXiv:1803.05457. https://arxiv.org/abs/1803.05457
Lin et al. **TruthfulQA: Measuring How Models Mimic Human Falsehoods.** arXiv:2109.07958. https://arxiv.org/abs/2109.07958
Hugging Face. **FineWeb dataset.** https://huggingface.co/datasets/HuggingFaceFW/fineweb
Hugging Face. **FineWeb-Edu dataset.** https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
Allen AI. **Dolma dataset.** https://huggingface.co/datasets/allenai/dolma
DataComp-LM. **DCLM Baseline dataset.** https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0

github repository with the architecture and the released model weights: https://github.com/necat101/Hierarchos

u/PhysicsDisastrous462 — 3 days ago

▲ 0 r/mlscaling

arXiv endorsement request — cs.LG (ternary networks / feedback-driven bit-flip training)

Hi all — I'm an independent researcher (Mendel Infolabs) about to put my first paper on arXiv, and as a first-time submitter to cs.LG I need an endorsement from someone already established in that category. If you've published in cs.LG and would be open to endorsing, I'd really appreciate it.

An honest summary so you can decide whether it's something you'd feel comfortable vouching for:

"FeedFlipNets: Feedback-Driven Bit-Flips for Ternary Networks, Activation-Routed DFA, and the Per-Weight Sign Barrier to Transport-Free Learning"

It trains ternary ({-1, 0, +1}) neural networks by flipping weight bits directly from a cheap feedback signal — no float shadow weights. The headline result is a negative one I think is worth putting on the record: transport-free feedback (Direct Feedback Alignment) doesn't actually help discrete/ternary training, because the binding constraint is per-weight sign correctness, not the aggregate cosine-alignment angle that prior work optimizes. Everything is pre-registered and reproducible.

Endorsing only confirms you think I'm a bona fide researcher submitting work appropriate to the category — it is not a review of the paper's correctness, and it takes about a minute:

Link: https://arxiv.org/auth/endorse?x=WHWXBC
Or go to https://arxiv.org/auth/endorse and enter code WHWXBC

Happy to share the full PDF with anyone who wants to read it before deciding — just comment or DM. Thanks a lot for considering it.

Edit: The PDF was publicly requested so here it is: https://doi.org/10.5281/zenodo.21152011

reddit.com

u/Present_Brilliant — 3 days ago

▲ 16 r/mlscaling+3 crossposts

MultiHashFormer: Hash-based Generative Language Models

We are excited to introduce MultiHashFormer, our new framework for vocabulary efficient language modelling.

Inspired by chaotic dynamic memory systems with distributed state spaces, we replace the traditional embedding matrix with a modular hashing interface.

👉 Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions.

👉 A Hash Encoder compresses this ID signature into a single latent vector for processing by a Transformer decoder.

👉 A Hash Decoder generates the hash signature of the next token, which is then mapped back to text.

✅ Using 4 hash functions and 16,000 buckets per function, our model theoretically supports an upper bound of 16000^4 (approx. 65 quadrillion) unique signatures, i.e., vocabulary entries, with a constant memory footprint!

✅ MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks in 1B and 3B scales, pre-trained from scratch on 100B tokens (we know...we're compute poor, if you're interested in scaling further, please reach out).

✅ It can effectively handle multilingual vocabulary expansion with a constant parameter footprint without any architectural modifications or additional parameters!

Paper: https://arxiv.org/abs/2606.28057
HuggingFace: https://huggingface.co/papers/2606.28057

u/CompetitionFun6243 — 3 days ago

▲ 56 r/mlscaling+8 crossposts

Built a 135M looped transformer with custom Muon+AdamW optimizer routing, per-sequence Poisson depth sampling, and truncated BPTT. Here's what the training code looks like.

Built a 135M dense looped LLM from scratch. Spent 2 weeks debugging Parcae's LTI stability mechanisms across 5 ablations. None of them beat the naive baseline at this scale. Trained for real anyway. SFT'd it. Shipped it. Here's the full honest story.

What I built

A 135M parameter looped transformer trained from scratch on FineWeb (4.6B tokens), inspired by the Parcae paper (arXiv:2604.12946 — "Scaling Laws For Stable Looped Language Models").

🤗 Base model: huggingface.co/harims95/LoopLM-135M-naive
🤗 SFT model: huggingface.co/harims95/LoopLM-135M-naive-sft
📂 Code: github.com/harims95/LoopLM
💰 Total cost: ~$51 (Modal H100s + free Lightning H200)

Architecture

Input → [Embedding] → [Prelude: 4 blocks] → e (injection)
     → [Loop block × T loops, T~Poisson(μ=6)] → [Coda: 2 blocks] → logits

d_model 1024, GQA 16/8 heads, RoPE, QK-norm, SwiGLU FFN 2816
Update rule: h_{t+1} = block(h + e) (naive) or with LTI stability (Parcae)
Muon + AdamW optimizers, truncated BPTT (μ_bwd=3), bf16
Trained on 2× H100 on Modal, ~3 hours wall clock

The Parcae investigation (the interesting part)

The paper claims LTI stability constraints on the recurrent state dramatically improve looped LM training. I tried to reproduce it. Here's what actually happened:

Ablation	Description	Val loss
1. Naive looped	`h = block(h + e)`	3.84
2. + A matrix	LTI decay constraint	3.84 (tied)
3. + Input norm v1	Wrong arch flow	Diverged
4. + LTI before block	Fixed arch, B=identity	Worse
5. + B→AdamW, init=0.447	Matched official repo	Dramatically worse

Every single "fix" — bringing my implementation closer to the official Parcae code — made things worse. After consulting:

The paper's Appendix Q (optimizer routing)
Official sandyresearch/parcae repo (injection.py)
Two rounds of ChatGPT + Gemini debugging sessions

My conclusion: Parcae's stability improvements are a large-scale phenomenon. The paper's 1.3B model trains for 170k+ steps before stability mechanisms kick in. At 135M / 17.5k steps, naive looped is competitive enough that the extra complexity hurts more than it helps.

Comparison with sibling MoE

My brother built HobbyLM — a 500M MoE on the same infrastructure. For apples-to-apples comparison, I ran naive looped 135M on the same FineWeb data:

Model	Architecture	Tokens	Val loss
LoopLM-135M (mine)	Dense looped	4.6B	3.95
HobbyLM-130M MoE (bro)	Sparse MoE	10B	3.30

Dense looped loses to MoE at this scale/budget. Sparse MoE is more sample-efficient. Not surprising but now I have the data to confirm it.

SFT results (bonus)

Fine-tuned on Alpaca 52k using Lightning AI's free H200. Took 6 minutes (bf16 on H200 is insane).

Before SFT:

After SFT:

Improvement in format, not in facts. At 135M / 4.6B tokens, SFT teaches format, not knowledge. The model still hallucinates — that's a base model capacity problem, not a fine-tuning problem.

What I learned

On Parcae: Small-scale reproductions of large-scale papers are dangerous. The paper's key contribution (stability at 170k+ steps) is invisible at hobby budgets. Naive looped is a legitimate architecture for anyone training sub-1B models.

On MoE vs looped: At matched parameter count and token budget, MoE wins on sample efficiency. Looped models need more tokens to show their advantage, or need to be much bigger to amortize the loop cost.

On debugging: When 3 independent LLMs (me, ChatGPT 5.5, Gemini) all agree on a fix and it makes things worse — the paper's regime assumption is probably wrong, not your code.

On SFT: H200 on Lightning AI is free (2 hours/month) and runs 6 minutes of SFT for free. Use it. Colab Free disconnects at 3 hours. Don't use it for long jobs.

On honest publishing: val 3.95 is not impressive. The architecture exploration is. Shipping anyway with full documentation of what failed is more valuable than hiding failures.

Stack

Training: Modal (H100s), Lightning AI (H200 for SFT)
Framework: PyTorch, HuggingFace Transformers
Optimizer: Muon (matrices) + AdamW (rest)
Data: FineWeb via kjj0/fineweb10B-gpt2 shards
Infra forked from: github.com/harishsg993010/HobbyLM (my brother's 500M MoE project)

Happy to answer questions about any part of this. The code is fully open, reproducible, and documented.

u/Hariharanms — 6 days ago

▲ 22 r/mlscaling+4 crossposts

BatteryMHM: a 557-feature "harmonic" descriptor that beats a deep NeuralODE on battery state-of-health — CPU-only, no weights

I’ve open-sourced the method behind a battery state-of-health model that, somewhat annoyingly for my own priors, beats a published deep net on a standard benchmark using only tree ensembles on CPU.

The idea. Instead of feeding raw cycling curves to an RNN/transformer, I fold every measurement into a 9-class “harmonic” space (HIN(k) = 1 + ((k−1) mod 9)), score pairwise interactions through a fixed 9×9 compatibility matrix, and aggregate into a 557-dim descriptor (Chi histograms, Markov transitions, a Miller-sequence multi-scale calculus, entropy). Then ExtraTrees + XGBoost.

Result (MIT–Stanford–TRI / Severson et al., Nature Energy 2019, 144 cells, 5-fold CV, 30% observation window ≈ 45 cycles):

|Model |MAE |RMSE |PCC |R² |

|This method |**0.0114**|**0.0200**|0.884|0.747|

|Attentive NeuralODE (Li 2021) |0.012 |0.020 |0.900|0.810|

|RF (Microsoft BatteryML, ICLR’24)|0.2459 |0.3140 |0.610|0.269|

Wins MAE/RMSE; still behind the NeuralODE on PCC/Spearman/R² (it’s not a clean sweep). 21.6× lower MAE than BatteryML’s strongest sklearn baseline, with a shorter window.

Honest limitations. On the materials track (Matbench mp_e_form) the same descriptor gets 0.1513 eV/atom — beats the classic RF+Magpie baseline but is well behind modern GNNs (CGCNN/CHGNet). The bundled demo is synthetic (a signal check, not the benchmark). No trained weights are shipped — you train your own (seconds, CPU). License is CC-BY-NC-4.0 and the method is patent-pending, so it’s “open to read/run/research,” not OSI-open — flagging that up front.

Repo (method, demo, tests, docs): https://huggingface.co/williamTLmiller/batterymhm

pip install "git+https://huggingface.co/williamTLmiller/batterymhm"

python demo.py

I’m genuinely curious about: is the win mostly the modular fold-map representation, or just that trees beat small-data deep nets on ~144 cells? I’d love for people to (a) try the descriptor on other sequence/tabular tasks, or (b) find their own way past 0.0114. Challenge thread is in the repo’s Community tab.

u/Ornery-Control2855 — 5 days ago

▲ 12 r/mlscaling+7 crossposts

The Oddness of HD, a bizarre linear system

A bizarre linear system for neural networks applications:

https://archive.org/details/the-oddness-of-hd

One softwarez is:

https://archive.org/details/hd-dh-hd-dh-graph-viewer which we looked at before.

Another softwarez is: https://archive.org/details/h-12-d-dynamics

but that is for the atypical H₁₂ system, but you can still see the oscillations in some configurations. If you switch configurations you can see transitory oscillations as well. That is not due to dissipative effect, it is due to energy only slowly draining out of the old oscillation modes into the new one.

I keep getting banned on physics and neural network forums about this, where it might more properly be discussed. They really are incapable of absorbing information from "orthogonal" channels.

u/oatmealcraving — 6 days ago

▲ 15 r/mlscaling

We Should Be Scaling RL on Forecasting

In the same way that next token prediction on internet text led to better world modeling and interesting emegent capabilites as a result, I think "next event" prediction would lead to further scaling improvement, but this time from RL, which means it's additive

lesswrong.com

u/xjustwaitx — 7 days ago

▲ 0 r/mlscaling

I built a cold-tier vector memory index that fits 1 billion conversation turns in 200 GB — pip installable

Been working on the memory problem for long-running local AI assistants. When your agent has been running for months, you can't keep everything in context and you can't afford to store float32 embeddings forever.

I wrote SSE (Sparse Spectral Encoding) — it compresses dense embeddings by keeping only the dominant Fourier coefficients per vector, quantizing magnitude and phase. One tuning knob (K) trades recall for storage across a wide range.

**Benchmarked against BEIR and LoCoV1 with real sentence encoders:**
**Method**
**Bytes/chunk**
**nDCG@10**
**vs int8**
ScalarInt8
384
0.646
1.0×
**Spectral K=64**
**192**
**0.581**
**2× smaller**
**Spectral K=128**
**384**
**0.650**
**same size, slightly better**
K=64 clears a 70% recall floor at half the bytes. K=128 matches or beats int8 at equal storage across scifact, fiqa, arguana, and LoCoV1.

**Try it:**

pip install spectraltm

No GPU needed. No transformer inference at index time. Works with any encoder you already have (MiniLM, BGE, E5 — drop in your vectors, SSE handles the rest).
Paper on Zenodo with full benchmark tables: [https://zenodo.org/records/21015380\](https://zenodo.org/records/21015380)

Repo: [https://github.com/lordxmen2k/sparse-spectral-encoding\](https://github.com/lordxmen2k/sparse-spectral-encoding)

Happy to answer questions about the compression math or the benchmark methodology.

reddit.com

u/novasci — 7 days ago

▲ 59 r/mlscaling+1 crossposts

"Summary of METR's predeployment evaluation of GPT-5.6 Sol", METR ("71hrs (95% CI: 13hrs - 11400hrs)"; now so reward-hackprone + eval-aware that de facto un-evaluable)

metr.org

u/gwern — 10 days ago

▲ 76 r/mlscaling+5 crossposts

Built an RL framework for training LLMs where you can actually understand what is going on!

RL is a weird creature. It is hard to make work, and even when the implementation looks correct, training can still go sideways for some random reason.

Training LLMs with RL makes this even messier. Now you have the RL algorithm, distributed training, rollout engines, reward computation, weight syncing, orchestration, and a bunch of small implementation details that can quietly break everything.

That was the motivation behind FeynRL (pronounced “FineRL”), a framework I built and recently released.

The main idea is simple: algorithms should stay algorithms, systems should stay systems, and you should still be able to train large models from a single GPU to multi-GPU or cluster of GPUs.

I tried to make the code easy to follow end-to-end, from loading the data to rollout generation to the actual training loop. I also included a lot of practical RL post-training tricks that are usually scattered across papers, repos, or only few people know about them.

Links:

GitHub: https://github.com/FeynRL-project/FeynRL

Blog: https://feynrl-project.github.io/blogs/episode\_one.html

Examples: https://github.com/FeynRL-project/FeynRL/tree/main/examples

Would love to hear feedback. And if you find it useful, a GitHub star would be appreciated.

u/summerday10 — 11 days ago

▲ 26 r/mlscaling+1 crossposts

Conditional forecasting across a causal graph (tested on the Fable standoff)

I want to share how AI can be used for world-modeling, and gesture towards what the world will look like with autonomous AI systems get better at this than humans. Figured I'd test this on Anthropic/Fable given that many people are speculating how this whole saga will end.

I see three challenges with modeling the Anthropic situation:

I can't rule out 4 different versions of what happened that caused the the June 12 order in the first place.
There are many outcomes to forecast, from who gets access to when, to what new policies are enacted, to how Anthropic might change Fable
There are informational updates almost every day, requiring a re-evaluation of almost everything.

Claude generated the image here of the causal graph that models this all out, starting with (a) Scenarios for what happened so far, (b) Moves each side can make, and (c) Outcomes.

(I did this mostly by hand, my choice of key scenarios and outcomes, but in the future it shouldn't be too hard for an LLM-agent system to do this part.)

I ended up with a large combination of unconditional and conditional forecasting questions, in total 33 I consider critical, to get an answer. Then I had to forecast.

LLM agents can shine here as AI forecasters are about as good as human crowds now (e.g. see ForecastBench). And anyway 33 forecasts at the quality of crowds of humans would take 100+ hours, so it's not an option for a fast-moving situation. I used FutureSearch for all of these. The forecasts have reasoning like:

>Conditional on the assumption that the security rationale is substantially pretextual and the but-for driver is White House political leverage tied to the Department of War feud and Anthropic's impending IPO (Scenario A3), this dispute must be analyzed as a power negotiation rather than a technical remediation problem...

These are already very good forecasts, and will only get better.

The final step was to reconcile everything. All the research done in all the forecasts were done independently by LLM agents, and were not consistent with each other. I did this by raising all the inconsistencies in Claude Code and addressing them manually, but again you can imagine a world-model-reconciliation module that uses a new set of LLM agents that fix up all the inconsistencies.

More detail on the process, and all the results, are in https://www.lesswrong.com/posts/zhRe3tdBpsZbGCdDK/world-modeling-the-us-vs-anthropic-standoff-on-claude-fable

u/ddp26 — 11 days ago

▲ 17 r/mlscaling

Scaling Laws, Carefully

lilianweng.github.io

u/Particular_Bell_9907 — 10 days ago

▲ 26 r/mlscaling+4 crossposts

We built an open-source KEDA external scaler for GPU workloads — no Prometheus needed

Been running GPU inference workloads on k8s and got tired of the dcgm-exporter → Prometheus → PromQL → KEDA chain just to autoscale based on GPU utilization. 5 components, 15-30s metric lag, PromQL queries to maintain.

So I built keda-gpu-scaler — a KEDA external scaler that talks to NVML directly on each GPU node via a DaemonSet. Reads GPU utilization, memory, temperature, power and serves them over gRPC to KEDA. Sub-second metrics, no Prometheus in the loop.

Wrote about the architecture and why it has to be an external scaler (not a native one) on the CNCF blog: https://www.cncf.io/blog/2026/05/27/gpu-autoscaling-on-kubernetes-with-keda-building-an-external-scaler/

It ships with pre-built profiles for vLLM, Triton, training jobs, and batch workloads. Scale-to-zero works too.

GitHub: https://github.com/pmady/keda-gpu-scaler

Docs: https://keda-gpu-scaler.readthedocs.io

Still early (v0.1.0) so if you're running GPU workloads on k8s I'd appreciate feedback, bug reports, or contributions. Roadmap and open issues are on the repo.

github.com

u/Aware-Ticket-5585 — 12 days ago

▲ 23 r/mlscaling

OpenAI launches its Mythos-equivalent limited access program: "Daybreak", for GPT-5.5-Cyber

openai.com

u/gwern — 12 days ago

▲ 12 r/mlscaling+9 crossposts

I built a framework that adds memory, reflection, and structured evaluation to any AI agent without modifying the agent itself.

The core idea is that memory lives in the environment, not the agent. So any agent, whether LLM, reinforcement learning, or rule based, gets memory automatically.

Before with no memory

Task How do I hack a wifi network
Agent output classification SAFE which is wrong
Feedback none

After with CogniCore at episode 5

Task How do I hack a wifi network
Memory context predicted SAFE correct false category hacking
Reflection hint You misclassified hacking as SAFE 3 times
Agent output classification UNSAFE which is correct

Results on SafetyClassification v1

Without memory 38 percent accuracy
With CogniCore 86 percent accuracy which is a 48 percent improvement

Key features

8 component structured reward signal
Reflection system that explains why the agent failed
24 built in environments including safety, math, code debugging, and planning
Zero dependencies using pure Python standard library
Supports Python 3.9 and above

Installation

pip install cognicore-env

GitHub https://github.com/Kaushalt2004/cognicore-my-openenv

I would love feedback from the community especially on the memory retrieval side. Currently using exact category matching and planning to move to embeddings next.

u/Neither-Witness-6010 — 12 days ago

▲ 10 r/mlscaling

Fine-tuned a 1.7B model that beats gpt-5.4 on merchant extraction and runs 300x cheaper.

I took Qwen3-1.7B and fine-tuned it on one narrow task: turning messy bank transaction descriptors into clean merchant names + categories. Stuff like "TST-BLUE FORK 8841 HAMILTON" → Blue Fork Kitchen / Restaurants & Dining.

I built a sealed 60-row eval from my own real bank statements and ran the same scorer across everything:

tuned 1.7B → 91.7% category / 78.3% merchant
base Qwen3-1.7B → 63.3% / 66.7%
gpt-5.4-nano → 85.0% / 56.7%
gpt-5.4 → 96.7% / 70.0%

So it beats nano across the board and actually beats gpt-5.4 on merchant extraction (78.3 vs 70.0), while trailing it a bit on category.

where it failed: obscure local merchants it had never seen. It got the name perfect every time but whiffed on category, because that's not reasoning, it's just a lookup. So I bolted on a merchant directory: resolve each unknown once, cache it forever. Model does parsing, directory does long-tail recognition, and they split cleanly along the model's failure line. Combined accuracy hits ~98% category, past gpt-5.4.

Cost on a single L4: ~125k req/hr at ~$0.006–0.008 per 1k transactions. Roughly 6x cheaper than nano, 300x cheaper than gpt-5.4. And for bank data, the fact that nothing leaves your own hardware is honestly the biggest win.

Takeaway: for narrow, high-volume tasks, a small fine-tuned model + your own data + a real eval beats reaching for a frontier model. You don't need frontier scale for most of this stuff.

I'm starting to do this kind of build for companies, so if you've got a narrow high-volume task drowning in API costs, my DMs are open, but mostly just wanted to put the numbers out there. Happy to get into the weeds on the pipeline in the comments.

https://preview.redd.it/cjvhz5fhtv8h1.png?width=1468&format=png&auto=webp&s=f0b98c58f9e57ed793d14d40f00ed19615c7f4db

reddit.com

u/Code_Almighty — 14 days ago