u/ConfusionSpiritual19

Wanted to see how close a fully bio-plausible agent could get to PPO on Pong.

Setup

Custom Pong environment (pygame, no gym)
PPO baseline: paper-faithful, from scratch → 59%
Hebbian agent: PPO policy replaced with Hebbian value estimation, engineered features → 61%
BioAgent: Predictive Coding for feature learning + distributional Hebbian plasticity for value (Dabney et al. 2020) → 57%. Zero backprop anywhere in the pipeline.
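
For readers who haven't seen predictive coding as a learning rule, here is a minimal single-layer sketch in the spirit of Rao & Ballard. It is an illustration, not the code from the repo: the function names, layer sizes, learning rates, and toy input patterns are all assumptions. Latent activity settles by reducing the prediction error, and the weight update is a local product of error and latent activity, so no gradient is propagated backward through layers.

```python
import random

def predict(W, r):
    """Top-down prediction of the input: x_hat = W r."""
    return [sum(w * rj for w, rj in zip(row, r)) for row in W]

def infer(W, x, n_steps=50, lr_r=0.1):
    """Settle latent activity r by descending the squared prediction error."""
    r = [0.0] * len(W[0])
    for _ in range(n_steps):
        e = [xi - pi for xi, pi in zip(x, predict(W, r))]
        for j in range(len(r)):
            # Each latent unit sums the errors it predicts, weighted by W.
            r[j] += lr_r * sum(W[i][j] * e[i] for i in range(len(x)))
    return r

def learn(W, x, lr_w=0.05):
    """One local weight update; returns the squared error before the update."""
    r = infer(W, x)
    e = [xi - pi for xi, pi in zip(x, predict(W, r))]
    for i in range(len(x)):
        for j in range(len(r)):
            # Hebbian-style: error-unit activity times latent-unit activity.
            W[i][j] += lr_w * e[i] * r[j]
    return sum(ei * ei for ei in e)

# Toy data (assumed for illustration): two 4-pixel patterns, two latent units.
random.seed(0)
patterns = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
W = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(4)]

errors = [sum(learn(W, x) for x in patterns) for _ in range(100)]
print(f"squared error, first epoch: {errors[0]:.3f}, last epoch: {errors[-1]:.3f}")
```

The settled latent vector `r` is what a downstream value estimator would consume as features in a setup like this one.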

Key observations

The 2-point gap (57% vs. 59%) is real but small. The bottleneck wasn't the lack of backprop; it was catastrophic forgetting under non-stationary opponent dynamics during self-play.
Distributional value encoding (à la Dabney) helped stability vs. a scalar Hebbian baseline, but not enough to match PPO under self-play.
Self-play exposed the plasticity–stability dilemma hard: Hebbian rules that adapt fast forget fast. This is the real wall for bio-plausible RL in non-stationary settings.
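
A toy illustration of that trade-off, under heavy simplifying assumptions: a single scalar estimate trained with an error-driven update, and a hypothetical schedule in which the target flips when the opponent changes. None of this is taken from the repo; it only shows why one learning rate cannot be both fast and stable.

```python
def run_phases(lr, phases):
    """Error-driven tracking of a target that changes between phases."""
    v = 0.0
    for target, n_steps in phases:
        for _ in range(n_steps):
            v += lr * (target - v)  # same update rule in every phase
    return v

# Hypothetical schedule: 50 updates against opponent A (target +1),
# then 20 updates against opponent B (target -1).
schedule = [(+1.0, 50), (-1.0, 20)]

for name, lr in [("fast", 0.5), ("slow", 0.02)]:
    v = run_phases(lr, schedule)
    print(f"{name}: v={v:+.3f}  "
          f"error on A={abs(1.0 - v):.3f}  error on B={abs(-1.0 - v):.3f}")
```

The fast learner ends up essentially at B's target and has lost everything it learned about A; the slow learner keeps more of A but is still far from B.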

Not claiming novelty in the architecture; this is a from-scratch exploration of whether bio-plausible rules can handle a real RL task. Short answer: yes, mostly, with one clear failure mode.

Code: github.com/nilsleut/Biologically-Plausible-RL-Plays-Pong

Happy to answer questions about the PC implementation, the Hebbian value estimator, or the self-play setup.

Neuroscience question that motivated this: can the kind of learning rules we actually see in the brain (Hebbian plasticity, predictive coding, distributional dopamine signals) be sufficient for a real control task?

I tested this on Pong with a fully backprop-free agent:

Predictive Coding (Rao & Ballard 1999) for visual feature learning
Distributional Hebbian plasticity for value estimation, inspired by Dabney et al. 2020 (the finding that dopamine neurons encode a full distribution over future reward, not just a scalar)
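
To make the distributional idea concrete, here is a minimal sketch of the mechanism described in Dabney et al. 2020: a population of value units, each scaling positive and negative reward prediction errors differently. It is not necessarily the exact rule used in the repo; the function name, the `taus` asymmetry values, the base learning rate, and the ±1 reward stream are assumptions for illustration. Optimistic units (high `tau`) settle above the mean and pessimistic units below it, so together the population encodes the spread of outcomes rather than a single scalar.

```python
import random

def train_value_population(rewards, taus, base_lr=0.02):
    """Each unit i tracks reward with an asymmetric learning rate.

    tau close to 1: positive errors count more (optimistic unit).
    tau close to 0: negative errors count more (pessimistic unit).
    """
    values = [0.0] * len(taus)
    for r in rewards:
        for i, tau in enumerate(taus):
            delta = r - values[i]  # reward prediction error of unit i
            lr = base_lr * (tau if delta > 0 else 1.0 - tau)
            values[i] += lr * delta
    return values

# Assumed Pong-like reward stream: a point is won (+1) or lost (-1) with equal odds.
random.seed(0)
rewards = [random.choice([-1.0, 1.0]) for _ in range(5000)]

taus = [0.1, 0.5, 0.9]
values = train_value_population(rewards, taus)
for tau, v in zip(taus, values):
    print(f"tau={tau:.1f}  learned value={v:+.2f}  expectile 2*tau-1={2 * tau - 1:+.2f}")
```

For this reward stream the unit with asymmetry `tau` should hover around the expectile 2·tau − 1, so the three units spread out across the range of outcomes instead of all collapsing onto the mean of 0.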

Results: BioAgent reaches 57% vs. PPO's 59%. Close, but self-play training exposed a hard problem: Hebbian rules that adapt fast also forget fast under non-stationary opponent dynamics. The plasticity–stability dilemma shows up immediately.

The dopamine-inspired distributional encoding helped stability compared to a scalar baseline, which I found interesting because it suggests the distributional coding might have a functional role beyond just representing uncertainty.

Code: github.com/nilsleut/Biologically-Plausible-RL-Plays-Pong

Curious what people think about the plasticity–stability angle: Is there a biological mechanism for stabilising Hebbian rules under non-stationarity that I'm missing?

Backprop-free Pong: PC + distributional Hebbian plasticity vs. PPO: 57% vs. 59%, ~1500 lines from scratch [P]

I built a backprop-free RL agent using Hebbian plasticity + Predictive Coding: it nearly matches standard deep RL on Pong (57% vs. 59%)