r/reinforcementlearning

▲ 10 r/reinforcementlearning+2 crossposts

Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL [R]

Autoregressive (AR) LLM world models factorize next-state generation left-to-right, which prevents them from conditioning on globally interdependent anchors (tool schemas, trailing status fields, expected outcomes) and yields rollouts that are prefix-consistent but globally incoherent. The any-order denoising objective of masked diffusion language models (MDLMs) sidesteps this by learning every conditional direction from the same training signal. Empirically, fine-tuned MDLMs (SDAR-8B, WeDLM-8B) surpass AR baselines with up to 4x their total parameter count on BLEU-1, ROUGE-L, and MAUVE across in- and out-of-domain splits, and their lower Self-BLEU and higher Distinct-N indicate reduced prefix mode collapse. GRPO training on MDLM-generated rollouts yields up to +15% absolute task-success gains over training on AR-generated rollouts on held-out ScienceWorld, ALFWorld, and AppWorld across 1.2B–7B backbones (LFM2.5, Qwen3, Mistral) in a zero-shot transfer setting.

zenodo.org
u/Megixist — 18 hours ago
▲ 2 r/reinforcementlearning+1 crossposts

Maxing out two P40s

Yes, I know they're not the best out there... But it's still nice to see the system using them both for learning.

u/redfoxkiller — 19 hours ago
▲ 23 r/reinforcementlearning+1 crossposts

"An OpenAI model has disproved a central conjecture in discrete geometry" (log scaling of inner-monologue compute in probability solving Erdős's planar unit distance problem)

openai.com
u/gwern — 1 day ago

When would you prefer DMPO over SAC for continuous control if real-world deployment is not the issue?

Hi everyone,

I have been reading about Distributional Maximum a Posteriori Policy Optimization (DMPO), especially in the context of the DeepMind bipedal robot soccer paper, and I am trying to understand when one would practically prefer it over SAC.

My current understanding is:

  • SAC is a strong off-policy continuous-control baseline.
  • It directly optimizes the actor using an entropy-regularized objective.
  • It is widely implemented, easier to find baselines for, and generally very strong in simulation.

On the other hand, DMPO seems to use a more structured actor update.

So my interpretation is that DMPO is more like: update the actor conservatively by constraining its KL divergence from the old policy

whereas SAC is more like: maintain entropy and make more aggressive actor updates
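
To make the contrast concrete, here is a minimal sketch of the two actor updates as I understand them, not a reference implementation. It assumes the MPO-style E-step reweights actions sampled from the old policy by softmax(Q/eta), with the temperature eta held fixed (in MPO it is normally found by a dual optimization), and that the SAC actor loss is the sample average of alpha * log pi(a|s) - Q(s, a). The function names and all numbers are made up for illustration.

```python
import numpy as np

def mpo_estep_weights(q_values, eta):
    """MPO-style non-parametric target: softmax(Q / eta) over actions
    that were sampled from the old policy (eta is assumed fixed here)."""
    z = (q_values - q_values.max()) / eta  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def sac_actor_loss(q_values, log_probs, alpha):
    """Sample estimate of the SAC actor loss E[alpha * log pi(a|s) - Q(s, a)]."""
    return float(np.mean(alpha * log_probs - q_values))

# Hypothetical critic values and log-probabilities for three sampled actions.
q = np.array([1.0, 2.0, 4.0])
log_pi = np.log(np.array([0.5, 0.3, 0.2]))

weights_soft = mpo_estep_weights(q, eta=10.0)   # large eta: close to uniform over samples
weights_sharp = mpo_estep_weights(q, eta=0.1)   # small eta: nearly all mass on the best action
loss = sac_actor_loss(q, log_pi, alpha=0.2)
```

With a large eta the target stays close to the old policy's samples (conservative); with a small eta it concentrates on the best-scoring action. The M-step would then fit the parametric policy to these weighted samples under a KL bound, which is omitted here.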

I understand why DMPO might be attractive for real-world robotics, since conservative policy updates can reduce dangerous or unstable behavior. But suppose real-world deployment is not the issue, and all trials are in simulation.

In that case, when would you still prefer DMPO over SAC?

For example, would DMPO be more attractive in tasks where:

  • the policy is very sensitive to sudden changes?
  • the critic is noisy or easy to exploit?
  • the task involves contact-rich dynamics?
  • the return distribution is multi-modal?
  • preserving partially learned behaviors matters?
  • coordination between multiple agents is fragile?

Or would you generally just use SAC unless DMPO clearly performs better in ablations?

I am especially interested in practical opinions from people who have tried MPO/DMPO-style algorithms. In what kinds of environments did they outperform SAC, and where did SAC remain the better choice?

Thanks

reddit.com

Multi-armed Bandits

Hi all, I wanted to get some insights on a problem that I'm trying to model as a bandit. I'm fairly new to the subject, so if I'm saying nonsensical things, please explain. The idea is that pulling an arm gets you a reward, but that reward depends on factors that change over time, so pulling the same arm again won't give the same reward. I tried epsilon-greedy, and the results sort of make sense. But I'm unsure whether UCB or Gaussian Thompson sampling would be appropriate, because I don't want to keep pulling an arm whose reward was low the first few times it was tried. Given how the reward is designed, a low reward already indicates that the arm need not be pulled again, so such arms should only be visited occasionally (as in epsilon-greedy). So, would this behavior just be a cold-start problem, and would Thompson sampling eventually learn not to pick those arms? If so, how soon would that be? I would appreciate any insights, and I can clarify more if needed. Thanks!
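
One way to get a feel for "how soon" is to simulate it. Below is a minimal sketch of Gaussian Thompson sampling that counts how often each arm gets pulled. It assumes a N(0, prior_var) prior on each arm's mean, a known noise standard deviation, and stationary rewards; the arm means, the step count, and the function name are all made up for illustration.

```python
import numpy as np

def gaussian_thompson(true_means, steps=2000, noise_std=1.0, prior_var=1.0, seed=0):
    """Run Gaussian Thompson sampling and return how often each arm was pulled."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    for _ in range(steps):
        # Conjugate posterior over each arm's mean (Gaussian prior, known noise variance).
        precision = 1.0 / prior_var + counts / noise_std**2
        post_mean = (sums / noise_std**2) / precision
        post_std = np.sqrt(1.0 / precision)
        # Draw one plausible mean per arm and play the arm with the largest draw.
        arm = int(np.argmax(rng.normal(post_mean, post_std)))
        reward = rng.normal(true_means[arm], noise_std)
        counts[arm] += 1
        sums[arm] += reward
    return counts

pulls = gaussian_thompson([0.0, 0.5, 1.0])  # hypothetical arm means
print(pulls)
```

Each pull of a bad arm shrinks its posterior, so it is sampled as the maximum less and less often, but it is never ruled out completely. If the rewards really do drift over time, one common adjustment (again an assumption about your setup) is to decay `counts` and `sums` a little every step so that old evidence fades.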

reddit.com
u/Leather_Amount_2268 — 2 days ago
▲ 11 r/reinforcementlearning+1 crossposts

I built a backprop-free RL agent using Hebbian plasticity + Predictive Coding: it nearly matches standard deep RL on Pong (57% vs. 59%)

The neuroscience question that motivated this: can the kinds of learning rules we actually see in the brain (Hebbian plasticity, predictive coding, distributional dopamine signals) be sufficient for a real control task?

I tested this on Pong with a fully backprop-free agent:

  • Predictive Coding (Rao & Ballard 1999) for visual feature learning
  • Distributional Hebbian plasticity for value estimation, inspired by Dabney et al. 2020 (the finding that dopamine neurons encode a full distribution over future reward, not just a scalar)

Results: BioAgent reaches 57% vs. PPO's 59%. Close, but self-play training exposed a hard problem: Hebbian rules that adapt fast also forget fast under non-stationary opponent dynamics. The plasticity–stability dilemma shows up immediately.

The dopamine-inspired distributional encoding helped stability compared to a scalar baseline, which I found interesting because it suggests the distributional coding might have a functional role beyond just representing uncertainty.
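
For anyone unfamiliar with the distributional idea, here is a minimal sketch of the mechanism described in Dabney et al. 2020: a population of value units that scale positive and negative prediction errors differently, so each unit settles at a different level of optimism. This is not the code from the repo above. It is a reward-only toy with no states, and the function name, learning rate, and reward distribution are assumptions for illustration.

```python
import numpy as np

def asymmetric_value_update(rewards, taus, lr=0.02):
    """Each unit i scales positive errors by taus[i] and negative errors by 1 - taus[i]."""
    values = np.zeros(len(taus))
    for r in rewards:
        delta = r - values                             # per-unit prediction error
        scale = np.where(delta > 0, taus, 1.0 - taus)  # asymmetric scaling of the error
        values += lr * scale * delta                   # local update, no backprop
    return values

rng = np.random.default_rng(0)
rewards = rng.choice([0.0, 1.0], size=5000)  # hypothetical bimodal reward
taus = np.array([0.1, 0.5, 0.9])             # pessimistic, neutral, optimistic units
values = asymmetric_value_update(rewards, taus)
print(values)
```

The units spread out across the reward distribution instead of all collapsing onto the mean, which is the sense in which the population encodes a distribution and not just a scalar.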

Code: github.com/nilsleut/Biologically-Plausible-RL-Plays-Pong

Curious what people think about the plasticity–stability angle: Is there a biological mechanism for stabilising Hebbian rules under non-stationarity that I'm missing?

reddit.com
u/ConfusionSpiritual19 — 2 days ago

Agent Systems - Discussion

What do y'all think of the new "agentic" era, where you pay $200 to Anthropic to automate a simple task? I really like the idea of automation with reasoning models, but it seems that now everyone can build one. I don't feel comfortable in the current market; it feels like a dystopia.

As a reinforcement learning enthusiast in this sub, do you think this is the lowest moment of humanity? (I do.)
How long do you think this "era" is going to last? Is it forever?

Honestly, I am really sad about 2026. I just keep thinking of the line from "The Incredibles":

And when everyone is super... no one will be!

reddit.com
u/Volta-5 — 2 days ago

Isaac Lab GPU recommendation

Hey guys, I’m new to this whole subject. As the title says, I need help upgrading my GPU.

I’m working on my mechanical engineering capstone project, a quadrupedal robot. I decided a few weeks ago that it needs to be trained using Isaac Lab. Currently I have Isaac Sim 6 and Isaac Lab 3 running in a container on my laptop with a 2070.

I’m switching to a desktop, but which do you guys think is the better GPU for this software: a 3060 12GB or a 3080 10GB?

reddit.com
u/EstateMinimum — 2 days ago
▲ 9 r/reinforcementlearning+4 crossposts

self-promotion thread

I’m working on a small open repo focused on physics-informed AI for manufacturing.

The goal is not to release a production model, but to create lightweight templates for deciding whether a manufacturing workflow is actually AI-ready: clear inputs/outputs, controllable variables, feedback loops, sparse-data constraints, and where physics priors may help.

Would appreciate feedback from people working on ML for physical systems, scientific ML, or industrial AI.

Repo: https://github.com/programmablemanufacturing/programmable-manufacturing-lab

u/Consistent_Scene3887 — 3 days ago
▲ 17 r/reinforcementlearning+1 crossposts

Remote MuJoCo / Robotics RL opportunity — contractor role

I recently joined Alignerr for a different technical role and noticed they’re looking for people with hands-on MuJoCo / robotics simulation / reinforcement learning experience.

The role seems best suited for people who have worked with MuJoCo, MJCF/XML, Gymnasium/dm_control, reward shaping, PPO/SAC/TD3, physics debugging, and robot control.

It’s remote contractor work. I don’t want to oversell it because project availability can vary, but the listed rate is high and it may be worth checking out if you already have this background.

I have a referral link, but only reach out if you genuinely have MuJoCo/RL experience — this probably isn’t a beginner-friendly role.

reddit.com
u/Asimpleyoungkid — 3 days ago

Looking for an RL study/project accountability partner

Hey folks,

I'm in the midst of some interview prep and learning RL more or less from scratch (right now working through Spinning Up, trying to code/derive some algos myself, and building a few example projects). I've found that having accountability really helps make sure progress gets made.

Anyone in the same boat who wants an accountability partner? I imagine daily/regular check-ins, progress updates on learning/projects (aka a mini "build in public"), feedback on each other's plans, and even some collaboration.

If so, DM me. Thanks!

reddit.com
u/temp12345124124 — 4 days ago

When Chaos Wins: noisy net eval with noise off gave wildly inconsistent results. Turning it back on fixed everything.

Running a Rainbow DQN ablation on Snake (C51 + dueling + noisy nets). When I evaluated checkpoints with noise off (mean weights, sigma zeroed out, the standard approach), the scores were all over the place. Some checkpoints averaged 78, others averaged 18. Training curve at those same points was perfectly stable.

First instinct was a bug. Checked everything. It wasn't.

The worst case was at ep450K. Deterministic eval produced a bimodal distribution: ~25% of episodes scored near zero, ~75% scored above 80. The average was 59 but that number is meaningless with two separate peaks and nothing in between.

What's happening: the mean-weight policy has traps. Game states where Q-values for two actions are nearly identical. Without noise, the agent picks the same action every time. If it's the wrong one, it loops and dies. 25% of starting states consistently hit these traps.

Same checkpoint, same seeds, noise turned back on: bimodal failure mode vanished entirely. p25 jumped from 2 to 59. Average went from 59 to 73. Std dropped from 42 to 26. This held at every checkpoint from ep50K through ep450K. Stochastic eval beat deterministic eval across the board.

The noise isn't residual exploration overhead. The agent learned a policy where the sigma values are functional. They provide just enough Q-value perturbation to prevent degenerate action loops. Zero them out and you get a policy that's strictly worse than what the agent actually learned.
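
A toy illustration of the trap mechanism, under a big simplification: noisy nets perturb the weights, whereas this sketch perturbs the Q-values directly. The Q-values, the sigma, and the function name are made up and are not taken from the run described here.

```python
import numpy as np

def pick_actions(q_mean, sigma, n_visits, seed=0):
    """Greedy action on each of n_visits to the same state, with Gaussian noise on Q."""
    rng = np.random.default_rng(seed)
    noisy_q = q_mean + sigma * rng.standard_normal((n_visits, len(q_mean)))
    return noisy_q.argmax(axis=1)

# Hypothetical trap state: actions 0 and 1 are nearly tied, action 2 is clearly bad.
q_trap = np.array([1.0000, 1.0001, 0.2])

deterministic = pick_actions(q_trap, sigma=0.0, n_visits=1000)
noisy = pick_actions(q_trap, sigma=0.05, n_visits=1000)

print(np.bincount(deterministic, minlength=3))  # every visit picks the same action
print(np.bincount(noisy, minlength=3))          # the near-tie is split, the bad action stays unused
```

With sigma at zero the same action wins on every visit, so if that action leads back to the same state the agent loops. A small perturbation splits the near-tie without ever making the clearly worse action competitive.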

Snake makes this especially acute because a single wrong turn at length 100+ is immediately fatal. The deterministic traps are lethal in a way they wouldn't be in more forgiving environments.

One caveat: at one very late checkpoint where sigma had grown extremely large, stochastic eval finally dropped below deterministic. There's a productive zone for noise magnitude, and past it the noise becomes destructive. So it's not "always evaluate with noise." It's "don't assume deterministic eval is automatically the ground truth."

Has anyone else seen this kind of eval divergence with noisy nets? Curious whether it's specific to tight spatial environments like Snake or shows up more broadly.

reddit.com
u/statphantom — 5 days ago
▲ 78 r/reinforcementlearning+2 crossposts

github: https://github.com/amathislab/musclemimic

MuscleMimic is a JAX-based motion imitation learning research benchmark specifically designed for biomechanically accurate muscle-actuated models. It focuses on advancing research in muscle-driven locomotion and manipulation through high-performance neural policy training. 

u/CharlieLee666 — 6 days ago

How should I plan my learning path for reinforcement learning courses?

Hi everyone, I have a question about planning my reinforcement learning studies.

I'm currently a sophomore majoring in a non-CS field. My math background includes calculus, probability and statistics, linear algebra, and some mathematical analysis. I want to start learning reinforcement learning, but according to many recommendations, it seems I may also need additional math courses such as ODEs, real analysis, stochastic processes, etc.

Is that really necessary at my current stage? Or would it be better to learn those topics along the way?

I'd also appreciate any suggestions about how to study reinforcement learning itself (courses, prerequisites, learning path, etc.). So far, the only programming language I’m comfortable with is Python.

reddit.com
u/AddressFancy3675 — 5 days ago
▲ 11 r/reinforcementlearning+4 crossposts

ML with Finance

Hi, I am an MTech student in computer science, and I want to work on machine learning in the finance domain. Can you suggest some research topics that would be suitable for a final-year thesis? During my MTech, my main focus has been on machine learning and deep learning topics, but I am also interested in finance. I have done a project with https://github.com/Zdong104/FNSPID_Financial_News_Dataset involving market regimes. Now I am looking for a solid research topic for my final year. Are there any suggestions?

u/Gullible_Space_4070 — 6 days ago

Teaching Humans using Expert RL Policies

RL is powerful enough to train superhuman policies, especially in video games. But is there any research on how to leverage RL's policy/value networks to improve human training speed? How can we apply behavioral cloning to humans?

Past research has shown that simply providing a human with optimal moves doesn't improve their pattern recognition or performance; it only increases their reliance on the feedback, making them worse.

Humans use some form of RL to learn motor skills and are more sample-efficient than algorithms. So, using guidance from expert policies, we can teach humans to learn along optimal trajectories, reducing time wasted in exploration.

Surely, with the help of value predictions, one can determine whether an action was suboptimal, helping solve the credit assignment problem. But what are the optimal ways to signal that to a human (e.g., showing a number on the screen, displaying red/green colors, or perhaps electrocuting them)?
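
As a rough sketch of the value-based signal, assuming the expert exposes per-action Q-values for the current state: compare the human's action to the best available one and turn the gap into a color. The function name, the tolerance, and the example numbers are hypothetical.

```python
def feedback_color(q_values, action, tolerance=0.05):
    """Return 'green' if the chosen action is within `tolerance` of the best Q-value, else 'red'."""
    gap = max(q_values) - q_values[action]  # how much value the human gave up
    return "green" if gap <= tolerance else "red"

q_values = [1.0, 0.98, 0.2]  # hypothetical expert Q-values for three actions

near_optimal = feedback_color(q_values, action=1)
clear_mistake = feedback_color(q_values, action=2)
print(near_optimal, clear_mistake)
```

The tolerance matters for the over-reliance problem mentioned above: flagging only large gaps tells the human that something went wrong without handing them the optimal move.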

reddit.com
u/MaxedUPtrevor — 5 days ago
▲ 525 r/reinforcementlearning+1 crossposts

Bimo’s walking model now runs natively on a Raspberry Pi Pico at 5ms inference time!

This is Bimo walking completely standalone: no data cable, no external compute, just a battery and an RP2040 (custom board) running the walking policy natively at ~5.2ms inference time.

The main walking model trains on thousands of parallel environments in Isaac Lab. That policy gets distilled down to a tiny student network and compiled directly into the MCU firmware.

Here's the pipeline:

  1. Train a standard 256×128×64 teacher model in Isaac Lab (~5min on an RTX 4080)
  2. Distill it into a 64×32 student network (~30s, yep, I was surprised too)
  3. Export as pure C using onnx2c
  4. Compile into the RP2040 firmware via Arduino IDE
  5. Inference runs at 5.0-5.2ms, comfortably within the 50ms control loop
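
For a rough sense of scale, here is a sketch that counts parameters for the two network shapes and the share of the control loop taken by inference. The hidden sizes and timings come from the steps above; the observation and action dimensions are hypothetical placeholders, since they are not stated here.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

OBS_DIM, ACT_DIM = 32, 8  # hypothetical input and output sizes

teacher = mlp_param_count([OBS_DIM, 256, 128, 64, ACT_DIM])
student = mlp_param_count([OBS_DIM, 64, 32, ACT_DIM])

loop_fraction = 5.2 / 50.0  # worst-case inference time over the control period

print(teacher, student, round(teacher / student, 1), loop_fraction)
```

Under those assumed dimensions the student is roughly an order of magnitude smaller than the teacher, and inference uses about a tenth of each 50ms control period.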

The full distillation pipeline, the standalone MCU inference code, and the Bimo API ported to ROS2 nodes are all coming in the next update (v1.1). ROS2 was a direct request from the last Reddit post, so that's in.

Has anyone else run RL locomotion policies natively on an MCU? How small have you made the student network before significantly degrading performance?

If you want to follow the development, join the Discord server; all updates go there first. The v1.1 code update will be available on GitHub soon.

u/mishaurus — 10 days ago