r/reinforcementlearning

What do you think of Yann LeCun's opinion that RL is the cherry on top of the ML cake?

Title says it all. I'm not an expert in pure RL research; I have mainly worked on foundation models so far.

I'm curious to hear from experts what they think of the role of modern RL, in particular:

- Will it be just the very last fine-tuning layer of bigger foundation models? If so, what kind of RL approaches do you think are most prominent?

- Will there be (or are there already) models that use RL more as a core layer in the whole model?

My gut feeling is that RL is very cool, but the hype has gone down in recent years because diffusion/foundation models perform and scale much better, and a lot of RL is perceived in practice as mainly "reward engineering".

Please correct me as I might be very wrong :)

u/Amazing-Coat5160 — 7 hours ago

▲ 1 r/reinforcementlearning+1 crossposts

Open-sourced an RL model that gives LLMs sales strategies

The main problem with using these LLMs for sales is that they are perfect, smooth, polite, and always accepting: they agree with whatever I say, even with strict prompting, and things always drift toward acceptance in the long run. The same goes for Fable 5 (handicapped now), Opus 4.8, Gemini 3.1 Pro, and the GPT 5 series. I had always thought about augmenting these responses with a trained RL policy that understands sales nuances.

We don't need a large sales dataset to train these models, nor a synthetic dataset or any sentence/word dataset. The trick is model-free RL with self-generated interactions! We just need numbers that represent the sales features or customer values, like trust, interest, budget fit, etc. You give them numbers, then train a PPO model with revenue as the reward on millions of environments, each with different numbers. The idea is to predict an action; the actions are things like close, pitch, rapport, etc. Say the trust value rises above a threshold: then the interest value should also increase. If many of these conditions are above a certain threshold, the revenue (aka the reward) becomes a larger number, else zero. So, without words, we train an RL model with just numbers and sales rules.

Now this RL model has to be bridged with the residual streams of the LLM, so that we can add the hidden features and the action states from the RL model to the LLM to augment its final response. So we train a bridge MLP layer, with the Gemma 4 E4B layers frozen and the RL layers frozen. The whole idea is to perfectly bridge the hidden features from the trained RL model to the LLM. During inference, one LLM generates a JSON with features like trust and interest, the RL model uses this to create the hidden features, and the hidden features and action states are injected into the LLM's residual flow. Both steps use two instances of the same LLM, btw.

We could just take the JSON from the first LLM response and use it directly in the second LLM, but that LLM doesn't know the future and hasn't played 40 million sales games. The policy makes it more interesting: the 1024 hidden layers from the RL model during inference give the reason why it made the decision, and the 8 action head says which move is best to take.
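The "numbers and sales rules" reward described above could look roughly like the sketch below. This is only an illustration: the three feature names come from the post, but the threshold value, the `min_met` rule, the `deal_value` scale, and paying out only on a `close` action are all assumptions, not the package's actual logic.

```python
# Hypothetical sketch of a threshold-based revenue reward.
# THRESHOLD, min_met and deal_value are made-up illustration values.
FEATURES = ("trust", "interest", "budget_fit")
THRESHOLD = 0.7


def revenue_reward(state, action, deal_value=100.0, min_met=2):
    """Return revenue when enough customer features clear the threshold, else zero."""
    if action != "close":
        # Assumption: only a closing action can produce revenue.
        return 0.0
    met = sum(state[name] > THRESHOLD for name in FEATURES)
    if met < min_met:
        return 0.0
    # More satisfied conditions give a larger reward.
    return deal_value * met / len(FEATURES)
```

A PPO agent would then see this value as its per-step reward, with the customer state sampled differently in each environment.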

TLDR: An RL model trained on 22 customer states like trust, interest, etc., predicts which action to take, like pitch or close, and is injected into an open-source LLM's residual flow to augment the final response. For LLM APIs, we don't need to inject anything: just a system prompt after the RL output to augment the final response.

Pypi package at: https://pypi.org/project/rl-sales-augment/

GitHub repo at: https://github.com/NandhaKishorM/rl-sales-augment

Built on top of my arXiv paper from a year ago: https://arxiv.org/abs/2510.01237

A new arXiv submission has just been submitted. I will share the paper once it's accepted.

reddit.com

u/Nandakishor_ml — 9 hours ago

▲ 3 r/reinforcementlearning

quadruped training in reinforcement learning

Hi, I'm trying to train a custom quadruped robot in mjlab using RL, but I'm stuck after the robot definition phase. Can someone point me to resources for doing this? A GitHub repo or any YouTube videos about this would be really helpful.
Thank you so much :)

u/blueberries_jpeg — 11 hours ago

▲ 18 r/reinforcementlearning

Sutton Barto vs Mathematical Foundations of Reinforcement Learning vs others

I want to improve my RL foundations so I can understand research papers better and eventually do research myself. I’m also looking to buy a physical book since I find it much easier to study that way.

Which would you recommend: Sutton Barto, Mathematical Foundations of Reinforcement Learning, or a different one?

I know Sutton Barto is considered the RL bible, but I started with the Mathematical Foundations YouTube course and really liked how well the professor explains the math.

I’m mainly interested in robotics applications, with games as a secondary interest.

u/LeCholax — 1 day ago

▲ 16 r/reinforcementlearning+1 crossposts

Barto and Sutton book

Is this book still relevant in 2026? Do the concepts in this book help you understand the recent developments in RL like GRPO, DPO, PPO, etc?

u/maryal01 — 2 days ago

▲ 3 r/reinforcementlearning+3 crossposts

Should I do more training for the Number guessing model?

I did a project on making and training a number-guessing reinforcement learning model.

After 140k episodes, it started to show degradation in success rate, due to the model being a standard DQN and not a Double DQN. Should I train it more to see the maximum success rate the model can achieve? What do you think, and how long should I train it for? Number Guessing RL Model
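For anyone comparing the two, here is a minimal sketch of how the bootstrap targets differ. It assumes batched Q-values as NumPy arrays; the function and argument names are made up for illustration and are not taken from the project. Double DQN is designed to reduce the overestimation bias of the max operator, but whether that explains the degradation here is not something this sketch can confirm.

```python
import numpy as np


def dqn_target(reward, done, q_next_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * (1.0 - done) * q_next_target.max(axis=1)


def double_dqn_target(reward, done, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    best = q_next_online.argmax(axis=1)
    evaluated = q_next_target[np.arange(len(best)), best]
    return reward + gamma * (1.0 - done) * evaluated
```

The only change is which network picks the next action, so switching an existing DQN training loop over is usually a small edit to the target computation.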

u/Kooky_Golf2367 — 1 day ago

▲ 9 r/reinforcementlearning+4 crossposts

Made a semantic search over accepted AI/ML conference papers (search by meaning, not keywords)

I kept losing papers because I remember what they're about, not what they're called, and keyword search on conference sites needs the exact title words. So I built a search that works by meaning instead: https://aiconfpaper.com

It covers accepted papers from the main AI/ML/CV/NLP/robotics conferences (NeurIPS, ICML, ICLR, CVPR, ACL, CoRL, and more), 2015-2026. You describe the idea in a sentence and it finds matching papers, then "similar papers" lets you walk outward into related work.

It's been genuinely useful for my own related-work scoping, so figured I'd share. There's also an API if you'd rather have an agent search it (docs are on the site). One-person project, so if a search gives you something off, tell me the query and I'll take a look.

u/kyowoon — 2 days ago

▲ 7 r/reinforcementlearning+2 crossposts

Training an Agent to walk using Procedural Animation in my Game Engine

I have been building my own game/simulation engine for the past year. Currently I am working on procedural animation and I am having some trouble with it.

The agent can learn to balance itself easily, but when I try to teach it to walk, it just can't do it. It moves only about 0.5 along the x-axis and then falls down or the episode ends (due to the maximum time limit). I am fairly new to procedural animation, but I've seen some videos of it.

Can anyone tell me what the problem with my agent is? The max reward stops rising after a few episodes. I am using Box2D for physics and LibTorch to train the network. The renderer is my own, written with OpenGL, and I am trying to train the agent to walk from scratch.

I don't think the problem is in the physics or other parts of my engine, because I've already done pendulum and double pendulum balancing, and trained the agent to stand without falling down. But I can't get it to walk. I've tried different reward functions but those did not work, so I added a very simple reward that tells the agent to always move forward. Here is my current reward-related code:

// Forward velocity reward
float reward = vel.x * 0.1f;

// Penalize falling - if root body angle is too large
float angle = rootBody->GetAngle();
if (std::abs(angle) > 1.2f) // ~70 degrees
{
    reward -= 1.0f;
    brainComponent.done = true; // end episode on fall
}

Btw, I am using the PPO algorithm here. If this much info is not enough, feel free to ask me. It would be nice to hear your suggestions if you've worked on this kind of problem before.
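For comparison, locomotion rewards are often written as a sum of several terms rather than velocity alone. The Python sketch below shows that general pattern only; every weight in it is an illustrative guess, the function name is hypothetical, and it is not a diagnosis of the engine or a claim that it fixes the walking problem.

```python
def walking_reward(vel_x, root_angle, action, alive_bonus=0.05,
                   vel_weight=1.0, upright_weight=0.1, effort_weight=0.001,
                   fall_angle=1.2):
    """Return (reward, done) for one step; all weights are illustrative guesses."""
    if abs(root_angle) > fall_angle:
        # Same fall condition as the original snippet: penalize and end the episode.
        return -1.0, True
    effort = sum(a * a for a in action)          # squared joint commands
    reward = (vel_weight * vel_x                 # progress along x
              + alive_bonus                      # small bonus for not falling
              - upright_weight * abs(root_angle) # prefer an upright torso
              - effort_weight * effort)          # discourage violent actions
    return reward, False
```

Logging each term separately during training makes it easier to see which one dominates the return.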

https://reddit.com/link/1umh562/video/jsw7otnv31bh1/player

u/ZealousidealDesk3261 — 3 days ago

▲ 242 r/reinforcementlearning+3 crossposts

I built Reinforcement Learning Map

I built a free handbook where the entire field is laid out as an interactive map — ~25 algorithms grouped into branches (value-based, policy-based, model-based, planning), and clicking any node takes you to a full chapter with the intuition, math, and runnable code.

Site: rl-handbook.com
Code: github.com/lubludrova/rl-handbook

Would really appreciate feedback — especially where explanations are unclear or where you'd want more depth. What topics should I prioritize next?

u/Savings-Shoulder-976 — 5 days ago

▲ 3 r/reinforcementlearning+2 crossposts

Agent Behavior Lab — a self-hosted lab for studying how tool-using LLM agents behave (MIT, React/TS/Prisma)

Sharing a project I've been building: Agent Behavior Lab, a self-hosted platform for running reproducible experiments on tool-using LLM agents.

You define an agent's context (model, tools, persona, prior conversation), vary one factor at a time, run repeated trials, and get grouped metrics + heatmaps + effect sizes to see what actually changed the behavior. Works with any OpenAI-compatible provider; ships with seed data so it's populated on first run.

Stack: React 19, Vite, TypeScript, TanStack Query, Express, Prisma, PostgreSQL, Docker Compose
License: MIT
Safety: doesn't execute tools — records whether a model attempted a call

https://github.com/Null-Square/agent-behavior-lab

Contributions welcome (there's a CONTRIBUTING guide). Happy to answer questions about the architecture.

github.com

u/IcyPop8985 — 3 days ago

▲ 8 r/reinforcementlearning+1 crossposts

PPO agent to do load balancing + autoscaling for a Docker cluster (honest writeup + code)

I built a system where a single PPO agent simultaneously handles L7 load balancing and horizontal autoscaling for a Docker based microservice cluster, instead of the usual combo of Round Robin routing plus static CPU thresholds.

Setup

The agent observes per container CPU, RAM, latency, error rate and queue depth, plus a global workload signal, and outputs both continuous routing weights and a scale up/down/hold decision every step.
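To make the combined action concrete, here is a rough sketch of how an (N+1)-dimensional action could be split into routing weights and a scaling decision. The split into N routing logits plus one scaling signal is inferred from the 26-dimensional action at N=25 mentioned further down; the softmax, the threshold value, and the function name are assumptions and may not match the repo.

```python
import numpy as np


def decode_action(action, scale_threshold=0.33):
    """Split an (N+1)-d action into N routing weights and a scale decision."""
    action = np.asarray(action, dtype=float)
    logits, scale_signal = action[:-1], action[-1]
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    weights = exp / exp.sum()             # routing weights sum to 1
    if scale_signal > scale_threshold:
        decision = "up"
    elif scale_signal < -scale_threshold:
        decision = "down"
    else:
        decision = "hold"
    return weights, decision
```

The routing weights could then be written into the load balancer's backend weights, with the decision applied to the container count.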

Training happens in two phases. Phase 1 pretrains on a mathematical M/M/1 queueing simulation (fast, no Docker needed). Phase 2 fine tunes on a real cluster with Docker, HAProxy for routing, and Locust generating traffic.
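For readers unfamiliar with M/M/1, the standard steady-state formulas are short enough to sketch. This is textbook queueing theory, not the repo's simulator; the function name and return layout are made up for illustration.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state M/M/1 quantities; requires arrival_rate < service_rate."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable when arrival_rate >= service_rate")
    rho = arrival_rate / service_rate                          # utilisation
    mean_in_system = rho / (1.0 - rho)                         # average jobs in system
    mean_time_in_system = 1.0 / (service_rate - arrival_rate)  # average latency
    return rho, mean_in_system, mean_time_in_system
```

Latency blows up as utilisation approaches 1, which is the behaviour an autoscaler is trying to stay away from.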

Evaluation

I benchmarked the trained policy against two baselines across five cluster sizes (N = 5, 10, 15, 20, 25), in both the simulated environment and the real Docker cluster:

- A static CPU threshold scaler with Round Robin routing (the common production default)
- A PID controller regulating CPU to a 60 percent setpoint, also with Round Robin
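As a reference point for the second baseline, a textbook discrete PID loop around a CPU setpoint might look like this. The gains and the class itself are hypothetical; the repo's controller may be tuned and structured differently.

```python
class PID:
    """Minimal discrete PID controller; gains here are illustrative, not tuned."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt=1.0):
        # Positive output means CPU is above the setpoint, i.e. scale up.
        error = measurement - self.setpoint
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

For example, `PID(kp=0.1, ki=0.0, kd=0.0, setpoint=60.0).update(80.0)` asks for roughly two extra containers; rounding the output gives the change in fleet size.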

Results, the short version

The PID and threshold baselines actually beat PPO on cost efficiency (users served per active container) in most cluster sizes, both simulated and real. PPO does generalize across cluster sizes with no retraining, and it keeps latency well under the SLA ceiling everywhere, but it is not consistently better than classical control here, and its routing precision degrades noticeably at N=25 where the action space becomes 26 dimensional.

I also found that the anti-chattering term in the reward is not doing its job well in practice: PPO changes fleet size in over 70 percent of steps versus under 10 percent for the threshold baseline, so it ends up more reactive and "twitchy" than intended.
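The "changes fleet size in X percent of steps" figure can be computed from a logged trace of container counts. Here is a small sketch of that metric together with one possible form of an anti-chattering penalty; the penalty form, its weight, and both function names are assumptions, not the reward used in the repo.

```python
def fleet_change_rate(fleet_sizes):
    """Fraction of steps where the number of active containers changed."""
    if len(fleet_sizes) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(fleet_sizes, fleet_sizes[1:]))
    return changes / (len(fleet_sizes) - 1)


def chatter_penalty(prev_size, new_size, weight=0.1):
    """Hypothetical anti-chattering term: penalize the size of each fleet change."""
    return -weight * abs(new_size - prev_size)
```

If the change rate stays high even with such a term, the weight may simply be too small relative to the other reward components.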

I wrote this up with the full derivations, per agent metric tables, and a section that's specifically about where the learned policy falls short, rather than only the wins. Repo has the code, the report, and the result plots.

Repo: https://github.com/MartinFarres/LoadBalancerAutoScaler-DRL

Where I'd take this next

A lot of the real cluster numbers should be read with a grain of salt. The real training and eval runs were short (2k steps), mostly because of hardware constraints: I was running everything on a single machine and couldn't afford longer iteration counts there. So the sim-to-real comparison is probably hiding real differences between agents rather than showing they're actually tied.

The change I'm most interested in for a future version is moving from a homogeneous cluster to a heterogeneous one: containers with different CPU/RAM specs instead of identical replicas. Right now the agent implicitly assumes every node is interchangeable, which is a pretty unrealistic assumption for real infra and probably where a learned policy could actually start to beat static rules, since a PID controller or threshold scaler has a much harder time reasoning about per-node capacity differences than a policy that observes them directly.

Happy to get pushback on the reward shaping or the evaluation methodology. This was very much a learning project and I'm sure there are things to improve, especially around the sim-to-real gap, given the real cluster eval window was short.

u/TheGrilla_04 — 4 days ago

▲ 10 r/reinforcementlearning

Research on RL

Hello all.

I currently work at an aerial robotics startup.

I want to start research on RL world model algorithms.

Is anyone interested?

DM me.

We can discuss the problem statement.

u/Chemical_Bonus4471 — 5 days ago

▲ 10 r/reinforcementlearning

High variance returns, are they normal?

I'm using SAC with curriculum learning to advance training gradually. Training advances when the moving average of returns plateaus. However, the plateau often isn't the optimal solution yet: when I look at the variance, there are many instances where an episode returns an optimal solution. How can I converge to this optimum instead? Or should I accept that this is inherent to RL?
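One way to express the idea of not advancing on a plateau alone is to also require that enough recent episodes reach a target return. The sketch below is just that idea in code; the window size, thresholds, and the function name are assumptions, and it is not tied to any particular SAC implementation.

```python
import numpy as np


def should_advance(returns, window=50, min_improvement=0.01,
                   target=None, min_success=0.8):
    """Advance when the moving average has stalled and, if a target is given,
    enough recent episodes actually reach it."""
    if len(returns) < 2 * window:
        return False
    recent = np.asarray(returns[-window:], dtype=float)
    previous = np.asarray(returns[-2 * window:-window], dtype=float)
    stalled = recent.mean() - previous.mean() < min_improvement
    if target is None:
        return bool(stalled)
    success_rate = (recent >= target).mean()
    return bool(stalled and success_rate >= min_success)
```

With a target set, a stage where the policy only occasionally hits the optimal return keeps training instead of being promoted early.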

u/Markovvy — 5 days ago

▲ 8 r/reinforcementlearning

I am training RL agents in team pursuit (MAPPO) with only a capture reward and a time penalty. After the first 5000 training iterations the agents have only learnt to travel a little bit and camp. Do I need a harder training environment for other effective strategies to emerge?
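With a reward this sparse, one commonly discussed option is potential-based shaping, which in the single-agent case is known to leave the optimal policy unchanged. Below is a minimal sketch, assuming the potential is the negative distance from a pursuer to the evader; that choice of potential, the parameter names, and the function name are all illustrative assumptions rather than part of the setup described here.

```python
def shaped_reward(base_reward, prev_dist, new_dist, gamma=0.99, scale=1.0):
    """Potential-based shaping with potential phi(s) = -scale * distance to evader."""
    phi_prev = -scale * prev_dist
    phi_new = -scale * new_dist
    # Shaping term F = gamma * phi(s') - phi(s), added to the environment reward.
    return base_reward + gamma * phi_new - phi_prev
```

Closing the distance yields a small positive signal each step, while the capture reward and time penalty still define the actual objective.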

u/d13maxx — 5 days ago

▲ 33 r/reinforcementlearning

11 months of building a robotics simulator taught me one thing: talk to users more than your code

Almost 11 months ago, I launched RoboSpace, a browser-based robotics simulator for quickly prototyping robot behaviors.

Looking back at the analytics, one metric stood out more than reaching 600+ users.

try robospace.app

June 29th was the ONLY day since launch with zero sign-ups.

Some of the best features in RoboSpace weren't my ideas—they came directly from researchers, students, and robotics developers who told me what was slowing them down.

Eleven months later, the biggest lesson I've learned is:

Listen. Ship. Repeat.

I'm now starting conversations with universities and robotics labs to understand how people build and iterate on robot simulations today.

If you're doing robotics research or teaching robotics:

- What simulator do you use most?
- What's the biggest pain point in your workflow?
- If you could fix one thing about your simulation tools, what would it be?

I'd genuinely love to hear your experiences.

Sharable posts: Twitter/X (trying to grow my X; I wish I had started sooner lol)

u/keivalya2001 — 5 days ago

▲ 1 r/reinforcementlearning

I really need to install Isaac Lab...

Hello. I am working on an assignment exploring Isaac Sim/Lab.

Installing Isaac Sim was a breeze, but Isaac Lab is just not working for me.

My setup is 16 GB of RAM with an RTX 4060 laptop GPU with 8 GB of VRAM, running on Windows. I know this is well below the minimum spec requirements, but this is all I have. Since I am not adding any textures or materials and only using simple grey moving boxes, Isaac Sim itself worked fine on this setup.

The problem is that no matter how I try to install Isaac Lab, I can successfully install Isaac Sim in a separate repository, but whenever I try to run a test RL task, the program crashes.

I tried two installation methods: the Isaac Sim pip package and the Isaac Lab pip packages, but the result was the same.

When I launch:

.\isaaclab.bat -p scripts\reinforcement_learning\rsl_rl\train.py --task=Isaac-Ant-v0 --headless --num_envs 16

this is what I get.

Is the issue really Windows, and should I move to Ubuntu?

|---------------------------------------------------------------------------------------------|
| Driver Version: 572.16 | Graphics API: D3D12
|=============================================================================================|
|---------------------------------------------------------------------------------------------|
| 0 | NVIDIA GeForce RTX 4060 Laptop.. | Yes: 0 | | 7957 MB | 10de | 3b3b0100.. |
| | | | | | 28e0 | 0 |
| | | | | | 1 | |
|---------------------------------------------------------------------------------------------|
| 1 | AMD Radeon 780M Graphics | | | 418 MB | 1002 | 204c0100.. |
| | | | | | 1900 | 0 |
| | | | | | N/A | |
|=============================================================================================|
| OS: Windows 11 Pro, Version: 10.0 (25H2), Build: 26200, Kernel: 10.0.26100.8655
| Processor: AMD Ryzen 7 8845HS w/ Radeon 780M Graphics
| Cores: 8 | Logical Cores: 16
|---------------------------------------------------------------------------------------------|
| Total Memory (MB): 15658 | Free Memory: 4679
| Total Page/Swap (MB): 32042 | Free Page/Swap: 8576
|---------------------------------------------------------------------------------------------|

Additional error log:

2026-06-30T04:40:23Z [1,182ms] [Fatal] [carb.crashreporter-breakpad.plugin] 105: python311.dll!PyImport_ImportModuleLevelObject+0x595

2026-06-30T04:40:23Z [1,183ms] [Fatal] [carb.crashreporter-breakpad.plugin] 106: python311.dll!PyException_GetTraceback+0xd3

2026-06-30T04:40:23Z [1,183ms] [Fatal] [carb.crashreporter-breakpad.plugin] 107: python311.dll!PyEval_EvalFrameDefault+0x7339

2026-06-30T04:40:23Z [1,183ms] [Fatal] [carb.crashreporter-breakpad.plugin] 108: python311.dll!PyType_CalculateMetaclass+0xfb

2026-06-30T04:40:23Z [1,184ms] [Fatal] [carb.crashreporter-breakpad.plugin] 109: python311.dll!PyEval_EvalCode+0x97

2026-06-30T04:40:23Z [1,184ms] [Fatal] [carb.crashreporter-breakpad.plugin] 110: python311.dll!PyEval_GetBuiltins+0x1e8

2026-06-30T04:40:23Z [1,185ms] [Fatal] [carb.crashreporter-breakpad.plugin] 111: python311.dll!PyEval_GetBuiltins+0xb8

2026-06-30T04:40:23Z [1,185ms] [Fatal] [carb.crashreporter-breakpad.plugin] 112: python311.dll!PyArg_UnpackTuple+0xe4

2026-06-30T04:40:23Z [1,185ms] [Fatal] [carb.crashreporter-breakpad.plugin] 113: python311.dll!PyObject_Call+0x5b

2026-06-30T04:40:23Z [1,186ms] [Fatal] [carb.crashreporter-breakpad.plugin] 114: python311.dll!PyThread_tss_is_created+0x35e30

2026-06-30T04:40:23Z [1,186ms] [Fatal] [carb.crashreporter-breakpad.plugin] 115: python311.dll!PyEval_EvalFrameDefault+0x535f

2026-06-30T04:40:23Z [1,187ms] [Fatal] [carb.crashreporter-breakpad.plugin] 116: python311.dll!PyNumber_Add+0x13f1

2026-06-30T04:40:23Z [1,187ms] [Fatal] [carb.crashreporter-breakpad.plugin] 117: python311.dll!PyObject_CallMethodObjArgs+0x123

2026-06-30T04:40:23Z [1,187ms] [Fatal] [carb.crashreporter-breakpad.plugin] 118: python311.dll!PyObject_CallMethodObjArgs+0x5e

2026-06-30T04:40:23Z [1,188ms] [Fatal] [carb.crashreporter-breakpad.plugin] 119: python311.dll!PyConfig_FromDict+0xad9

2026-06-30T04:40:23Z [1,188ms] [Fatal] [carb.crashreporter-breakpad.plugin] 120: python311.dll!PyImport_ImportModuleLevelObject+0x595

2026-06-30T04:40:23Z [1,189ms] [Fatal] [carb.crashreporter-breakpad.plugin] 121: python311.dll!PyException_GetTraceback+0xd3

2026-06-30T04:40:23Z [1,189ms] [Fatal] [carb.crashreporter-breakpad.plugin] 122: python311.dll!PyEval_EvalFrameDefault+0x7339

2026-06-30T04:40:23Z [1,189ms] [Fatal] [carb.crashreporter-breakpad.plugin] 123: python311.dll!PyType_CalculateMetaclass+0xfb

2026-06-30T04:40:23Z [1,190ms] [Fatal] [carb.crashreporter-breakpad.plugin] 124: python311.dll!PyEval_EvalCode+0x97

2026-06-30T04:40:23Z [1,190ms] [Fatal] [carb.crashreporter-breakpad.plugin] 125: python311.dll!PyEval_GetBuiltins+0x1e8

2026-06-30T04:40:23Z [1,191ms] [Fatal] [carb.crashreporter-breakpad.plugin] 126: python311.dll!PyEval_GetBuiltins+0xb8

2026-06-30T04:40:23Z [1,191ms] [Fatal] [carb.crashreporter-breakpad.plugin] 127: python311.dll!PyArg_UnpackTuple+0xe4

2026-06-30T04:40:23Z [1,192ms] [Fatal] [carb.crashreporter-breakpad.plugin] 128: python311.dll!PyObject_Call+0x5b

2026-06-30T04:40:23Z [1,192ms] [Fatal] [carb.crashreporter-breakpad.plugin] 129: python311.dll!PyThread_tss_is_created+0x35e30

2026-06-30T04:40:23Z [1,192ms] [Fatal] [carb.crashreporter-breakpad.plugin] 130: python311.dll!PyEval_EvalFrameDefault+0x535f

2026-06-30T04:40:23Z [1,193ms] [Fatal] [carb.crashreporter-breakpad.plugin] 131: python311.dll!PyNumber_Add+0x13f1

2026-06-30T04:40:23Z [1,193ms] [Fatal] [carb.crashreporter-breakpad.plugin] 132: python311.dll!PyObject_CallMethodObjArgs+0x123

2026-06-30T04:40:23Z [1,194ms] [Fatal] [carb.crashreporter-breakpad.plugin] 133: python311.dll!PyObject_CallMethodObjArgs+0x5e

2026-06-30T04:40:23Z [1,194ms] [Fatal] [carb.crashreporter-breakpad.plugin] 134: python311.dll!PyConfig_FromDict+0xad9

2026-06-30T04:40:23Z [1,194ms] [Fatal] [carb.crashreporter-breakpad.plugin] 135: python311.dll!PyImport_ImportModuleLevelObject+0x595

2026-06-30T04:40:23Z [1,195ms] [Fatal] [carb.crashreporter-breakpad.plugin] 136: python311.dll!PyException_GetTraceback+0xd3

2026-06-30T04:40:23Z [1,195ms] [Fatal] [carb.crashreporter-breakpad.plugin] 137: python311.dll!PyEval_EvalFrameDefault+0x7339

2026-06-30T04:40:23Z [1,195ms] [Fatal] [carb.crashreporter-breakpad.plugin] 138: python311.dll!PyType_CalculateMetaclass+0xfb

2026-06-30T04:40:23Z [1,196ms] [Fatal] [carb.crashreporter-breakpad.plugin] 139: python311.dll!PyEval_EvalCode+0x97

2026-06-30T04:40:23Z [1,196ms] [Fatal] [carb.crashreporter-breakpad.plugin] 140: python311.dll!PyEval_GetBuiltins+0x1e8

2026-06-30T04:40:23Z [1,197ms] [Fatal] [carb.crashreporter-breakpad.plugin] 141: python311.dll!PyEval_GetBuiltins+0xb8

2026-06-30T04:40:23Z [1,197ms] [Fatal] [carb.crashreporter-breakpad.plugin] 142: python311.dll!PyArg_UnpackTuple+0xe4

2026-06-30T04:40:23Z [1,197ms] [Fatal] [carb.crashreporter-breakpad.plugin] 143: python311.dll!PyObject_Call+0x5b

2026-06-30T04:40:23Z [1,198ms] [Fatal] [carb.crashreporter-breakpad.plugin] 144: python311.dll!PyThread_tss_is_created+0x35e30

2026-06-30T04:40:23Z [1,198ms] [Fatal] [carb.crashreporter-breakpad.plugin] 145: python311.dll!PyEval_EvalFrameDefault+0x535f

2026-06-30T04:40:23Z [1,199ms] [Fatal] [carb.crashreporter-breakpad.plugin] 146: python311.dll!PyNumber_Add+0x13f1

2026-06-30T04:40:23Z [1,199ms] [Fatal] [carb.crashreporter-breakpad.plugin] 147: python311.dll!PyObject_CallMethodObjArgs+0x123

2026-06-30T04:40:23Z [1,200ms] [Fatal] [carb.crashreporter-breakpad.plugin] 148: python311.dll!PyObject_CallMethodObjArgs+0x5e

2026-06-30T04:40:23Z [1,200ms] [Fatal] [carb.crashreporter-breakpad.plugin] 149: python311.dll!PyConfig_FromDict+0xad9

2026-06-30T04:40:23Z [1,200ms] [Fatal] [carb.crashreporter-breakpad.plugin] 150: python311.dll!PyImport_ImportModuleLevelObject+0x595

2026-06-30T04:40:23Z [1,201ms] [Fatal] [carb.crashreporter-breakpad.plugin] 151: python311.dll!PyException_GetTraceback+0xd3

2026-06-30T04:40:23Z [1,201ms] [Fatal] [carb.crashreporter-breakpad.plugin] 152: python311.dll!PyEval_EvalFrameDefault+0x7339

2026-06-30T04:40:23Z [1,201ms] [Fatal] [carb.crashreporter-breakpad.plugin] 153: python311.dll!PyType_CalculateMetaclass+0xfb

2026-06-30T04:40:23Z [1,202ms] [Fatal] [carb.crashreporter-breakpad.plugin] 154: python311.dll!PyEval_EvalCode+0x97

2026-06-30T04:40:23Z [1,202ms] [Fatal] [carb.crashreporter-breakpad.plugin] 155: python311.dll!PyEval_GetBuiltins+0x1e8

2026-06-30T04:40:23Z [1,203ms] [Fatal] [carb.crashreporter-breakpad.plugin] 156: python311.dll!PyEval_GetBuiltins+0xb8

2026-06-30T04:40:23Z [1,203ms] [Fatal] [carb.crashreporter-breakpad.plugin] 157: python311.dll!PyArg_UnpackTuple+0xe4

2026-06-30T04:40:23Z [1,204ms] [Fatal] [carb.crashreporter-breakpad.plugin] 158: python311.dll!PyObject_Call+0x5b

2026-06-30T04:40:23Z [1,204ms] [Fatal] [carb.crashreporter-breakpad.plugin] 159: python311.dll!PyThread_tss_is_created+0x35e30

2026-06-30T04:40:23Z [1,204ms] [Fatal] [carb.crashreporter-breakpad.plugin] 160: python311.dll!PyEval_EvalFrameDefault+0x535f

2026-06-30T04:40:23Z [1,205ms] [Fatal] [carb.crashreporter-breakpad.plugin] 161: python311.dll!PyNumber_Add+0x13f1

2026-06-30T04:40:23Z [1,205ms] [Fatal] [carb.crashreporter-breakpad.plugin] 162: python311.dll!PyObject_CallMethodObjArgs+0x123

2026-06-30T04:40:23Z [1,206ms] [Fatal] [carb.crashreporter-breakpad.plugin] 163: python311.dll!PyObject_CallMethodObjArgs+0x5e

2026-06-30T04:40:23Z [1,206ms] [Fatal] [carb.crashreporter-breakpad.plugin] 164: python311.dll!PyConfig_FromDict+0xad9

2026-06-30T04:40:23Z [1,206ms] [Fatal] [carb.crashreporter-breakpad.plugin] 165: python311.dll!PyImport_ImportModuleLevelObject+0x595

2026-06-30T04:40:23Z [1,207ms] [Fatal] [carb.crashreporter-breakpad.plugin] 166: python311.dll!PyException_GetTraceback+0xd3

2026-06-30T04:40:23Z [1,207ms] [Fatal] [carb.crashreporter-breakpad.plugin] 167: python311.dll!PyEval_EvalFrameDefault+0x7339

2026-06-30T04:40:23Z [1,208ms] [Fatal] [carb.crashreporter-breakpad.plugin] 168: python311.dll!PyType_CalculateMetaclass+0xfb

2026-06-30T04:40:23Z [1,208ms] [Fatal] [carb.crashreporter-breakpad.plugin] 169: python311.dll!PyEval_EvalCode+0x97

2026-06-30T04:40:23Z [1,208ms] [Fatal] [carb.crashreporter-breakpad.plugin] 170: python311.dll!PyEval_GetBuiltins+0x1e8

2026-06-30T04:40:23Z [1,209ms] [Fatal] [carb.crashreporter-breakpad.plugin] 171: python311.dll!PyEval_GetBuiltins+0xb8

2026-06-30T04:40:23Z [1,209ms] [Fatal] [carb.crashreporter-breakpad.plugin] 172: python311.dll!PyArg_UnpackTuple+0xe4

2026-06-30T04:40:23Z [1,209ms] [Fatal] [carb.crashreporter-breakpad.plugin] 173: python311.dll!PyObject_Call+0x5b

2026-06-30T04:40:23Z [1,210ms] [Fatal] [carb.crashreporter-breakpad.plugin] 174: python311.dll!PyThread_tss_is_created+0x35e30

2026-06-30T04:40:23Z [1,210ms] [Fatal] [carb.crashreporter-breakpad.plugin] 175: python311.dll!PyEval_EvalFrameDefault+0x535f

2026-06-30T04:40:23Z [1,211ms] [Fatal] [carb.crashreporter-breakpad.plugin] 176: python311.dll!PyNumber_Add+0x13f1

2026-06-30T04:40:23Z [1,211ms] [Fatal] [carb.crashreporter-breakpad.plugin] 177: python311.dll!PyObject_CallMethodObjArgs+0x123

2026-06-30T04:40:23Z [1,211ms] [Fatal] [carb.crashreporter-breakpad.plugin] 178: python311.dll!PyObject_CallMethodObjArgs+0x5e

2026-06-30T04:40:23Z [1,212ms] [Fatal] [carb.crashreporter-breakpad.plugin] 179: python311.dll!PyConfig_FromDict+0xad9

2026-06-30T04:40:23Z [1,212ms] [Fatal] [carb.crashreporter-breakpad.plugin] 180: python311.dll!PyImport_ImportModuleLevelObject+0x595

2026-06-30T04:40:23Z [1,213ms] [Fatal] [carb.crashreporter-breakpad.plugin] 181: python311.dll!PyException_GetTraceback+0xd3

2026-06-30T04:40:23Z [1,213ms] [Fatal] [carb.crashreporter-breakpad.plugin] 182: python311.dll!PyEval_EvalFrameDefault+0x7339

2026-06-30T04:40:23Z [1,214ms] [Fatal] [carb.crashreporter-breakpad.plugin] 183: python311.dll!PyType_CalculateMetaclass+0xfb

2026-06-30T04:40:23Z [1,214ms] [Fatal] [carb.crashreporter-breakpad.plugin] 184: python311.dll!PyEval_EvalCode+0x97

2026-06-30T04:40:23Z [1,214ms] [Fatal] [carb.crashreporter-breakpad.plugin] 185: python311.dll!PyEval_EvalCode+0x32e

2026-06-30T04:40:23Z [1,215ms] [Fatal] [carb.crashreporter-breakpad.plugin] 186: python311.dll!PyEval_EvalCode+0x2aa

2026-06-30T04:40:23Z [1,215ms] [Fatal] [carb.crashreporter-breakpad.plugin] 187: python311.dll!PyThread_tss_is_created+0x550ae

2026-06-30T04:40:23Z [1,216ms] [Fatal] [carb.crashreporter-breakpad.plugin] 188: python311.dll!PyRun_SimpleFileObject+0x11d

2026-06-30T04:40:23Z [1,216ms] [Fatal] [carb.crashreporter-breakpad.plugin] 189: python311.dll!PyRun_AnyFileObject+0x54

2026-06-30T04:40:23Z [1,216ms] [Fatal] [carb.crashreporter-breakpad.plugin] 190: python311.dll!PyDict_Values+0xcd7

2026-06-30T04:40:23Z [1,217ms] [Fatal] [carb.crashreporter-breakpad.plugin] 191: python311.dll!PyDict_Values+0xb93

2026-06-30T04:40:23Z [1,217ms] [Fatal] [carb.crashreporter-breakpad.plugin] 192: python311.dll!Py_RunMain+0x184

2026-06-30T04:40:23Z [1,218ms] [Fatal] [carb.crashreporter-breakpad.plugin] 193: python311.dll!Py_RunMain+0x15

2026-06-30T04:40:23Z [1,218ms] [Fatal] [carb.crashreporter-breakpad.plugin] 194: python311.dll!Py_Main+0x25

2026-06-30T04:40:23Z [1,220ms] [Fatal] [carb.crashreporter-breakpad.plugin] 195: python.exe!+0x1230

2026-06-30T04:40:23Z [1,220ms] [Fatal] [carb.crashreporter-breakpad.plugin] 196: KERNEL32.DLL!BaseThreadInitThunk+0x17

2026-06-30T04:40:23Z [1,221ms] [Fatal] [carb.crashreporter-breakpad.plugin] 197: ntdll.dll!RtlUserThreadStart+0x2c

u/Complex-Cover5875 — 6 days ago

▲ 186 r/reinforcementlearning

Spot walking procedural terrain: Isaac base policy + transfer learning, all driven from Python on my own Vulkan renderer

Short clip of Spot crossing procedural terrain. The base policy came from Isaac Lab, everything else is mine.

Policy: chain of PPO transfers, not zero-shot. Isaac Lab flat walker → fine-tuned for rough terrain → fine-tuned again for discrete stairs.

Keeping the Isaac gait while learning new terrain: I keep a frozen copy of the original Isaac policy as a teacher and add a scan-gated imitation reward. On flat ground the policy is penalized for drifting from the teacher's actions, so it only deviates where the terrain actually demands it. Plus a per-env adaptive curriculum (1024 parallel envs, each promotes/demotes its own step height).
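A rough sketch of what a scan-gated imitation term could look like is below. The hard flatness gate based on the height-scan range, the tolerance value, the mean-squared action difference, and the function name are all assumptions made for illustration; the actual reward in this project may be computed differently.

```python
import numpy as np


def imitation_penalty(student_action, teacher_action, height_scan,
                      flat_tolerance=0.02, weight=1.0):
    """Penalize deviation from the frozen teacher only where the scan looks flat."""
    scan = np.asarray(height_scan, dtype=float)
    roughness = scan.max() - scan.min()          # crude flatness measure
    gate = 1.0 if roughness < flat_tolerance else 0.0
    diff = (np.asarray(student_action, dtype=float)
            - np.asarray(teacher_action, dtype=float))
    return -weight * gate * float(np.mean(diff ** 2))
```

On rough terrain the gate closes and the policy is free to deviate from the teacher; a smooth gate instead of a hard threshold would be a natural variation.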

Obs/action: 94-d obs (48-d Isaac proprio + base height + a 45-cell forward height-scan), 12-d joint-target actions. Training is my own PPO loop over batched PhysX on GPU.
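The 94-d figure is just 48 + 1 + 45. As a small sketch, the observation could be assembled like this; the ordering of the three parts and the function name are assumptions.

```python
import numpy as np


def build_observation(proprio, base_height, height_scan):
    """Concatenate 48-d proprioception, base height, and a 45-cell scan into 94-d."""
    proprio = np.asarray(proprio, dtype=np.float32)
    scan = np.asarray(height_scan, dtype=np.float32)
    if proprio.shape != (48,) or scan.shape != (45,):
        raise ValueError("expected 48-d proprio and 45-cell height scan")
    return np.concatenate([proprio, [np.float32(base_height)], scan])
```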

The stack (all mine, driven from Python): C++ engine with Python bindings (threepp). A Vulkan deferred renderer with procedural terrain, PhysX physics, and the live SLAM map. No Isaac Sim / Omniverse at runtime; the base policy is the only imported piece.

The current policy largely ignores the depth sensor data. I'm working on that, but it has very good stability nonetheless.

UPDATE: Due to license issues, I have now generated a new, similar gait that is not warm-started from an Isaac policy.

u/laht1 — 8 days ago

▲ 18 r/reinforcementlearning

Planet Fitness 🤝 DeepSeek

u/vafaii — 6 days ago

▲ 5 r/reinforcementlearning+1 crossposts

Number Guessing RL Model

Built a number-guessing RL model with the help of Claude. Got 40% success 🤯

Here is the project: PantherHale/number-guessing-rl: Number Guessing RL Model

Your support would be very helpful. 😀

#rlmodel #numberguessing #MachineLearning #Claude #success

u/Kooky_Golf2367 — 7 days ago

▲ 7 r/reinforcementlearning+1 crossposts

Released X5‑Lite: a lightweight reasoning controller for LLM agents (Lo Shu 3×3 cycle)

Hey everyone,
I’ve been experimenting with structured reasoning loops for local LLM agents and put together a small project called X5‑Lite.

It uses a simple 3×3 Lo Shu cycle as a deterministic controller to stabilize multi‑step reasoning.
The goal is to reduce chaotic drift during long chains of thought and give agents a more predictable evaluation rhythm.

It’s lightweight, backend‑agnostic, and works with any local model.

Code is here:
https://github.com/hkyuyingli-spec/X5lite

If anyone has ideas for improving the cycle logic or integrating it with local inference pipelines, I’d love to hear your thoughts.

u/SuccessfulBand8088 — 6 days ago