
Is MountainCar really an exploration or reward function problem?
Hi everyone,
I recently finished my master’s degree, and I’m interested in reinforcement learning. Since I had some free time, I ran MountainCar as a toy project. My original interest was plasticity loss, but along the way I started to wonder whether MountainCar is really a problem of exploration or reward function design, as it is usually described.
Briefly, plasticity loss is the phenomenon where a network’s ability to fit new data degrades over the course of training, typically when the data distribution keeps changing. Dormant neuron ratio and effective rank are often used as indicators of it. In simple terms, the dormant neuron ratio is the proportion of hidden-layer neurons whose activations are close to zero relative to the rest of their layer, so they contribute very little to learning, somewhat like dead neurons. Effective rank, on the other hand, can be interpreted as the number of dimensions effectively used by the penultimate layer’s representation.
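To make the two indicators concrete, here is a minimal NumPy sketch of how they could be computed from a batch of activations. It is an illustration, not necessarily the code behind the numbers in this post: the function names, the threshold `tau=0.1`, and the cutoff `delta=0.01` are assumptions, and effective rank has more than one definition in the literature (this one counts the singular values needed to cover a `1 - delta` share of the total).

```python
import numpy as np


def dormant_ratio(activations, tau=0.1):
    """Fraction of neurons whose normalized mean |activation| is <= tau.

    activations: array of shape (batch, num_neurons), e.g. the post-ReLU
    outputs of one hidden layer. tau is an assumed threshold.
    """
    mean_abs = np.abs(activations).mean(axis=0)    # per-neuron average magnitude
    scores = mean_abs / (mean_abs.mean() + 1e-9)   # normalize by the layer average
    return float((scores <= tau).mean())


def effective_rank(features, delta=0.01):
    """Smallest k such that the top-k singular values hold a (1 - delta) share.

    features: array of shape (batch, feature_dim), e.g. penultimate-layer
    outputs. delta is an assumed cutoff. Assumes features is not all zeros.
    """
    s = np.linalg.svd(features, compute_uv=False)  # singular values, descending
    cumulative = np.cumsum(s) / s.sum()
    # index of the first cumulative share >= 1 - delta, converted to a count
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)
```

Both functions take a plain 2D array, so with a PyTorch model you would first collect the layer outputs on a batch of states (for example with a forward hook) and convert them with `.detach().cpu().numpy()`.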
I used CleanRL’s code and hyperparameters almost as-is, and ran experiments with 5 seeds. I compared the baseline with a method known to be effective against plasticity loss: adding Layer Normalization between the linear layer and ReLU.
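For readers who want to see what that change looks like, here is a hedged PyTorch sketch. It assumes a DQN-style Q-network with two hidden layers of 120 and 84 units and LayerNorm after every hidden linear layer; the post does not state which CleanRL script or layer sizes were used, so the class name, the sizes, and the `use_layernorm` flag are illustrative. The input and output sizes match MountainCar-v0 (2 observation dimensions, 3 discrete actions).

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """MLP Q-network with optional LayerNorm between each Linear and its ReLU."""

    def __init__(self, obs_dim=2, n_actions=3, hidden=(120, 84), use_layernorm=True):
        super().__init__()
        layers = []
        in_dim = obs_dim
        for width in hidden:
            layers.append(nn.Linear(in_dim, width))
            if use_layernorm:
                # normalize pre-activations across the feature dimension
                layers.append(nn.LayerNorm(width))
            layers.append(nn.ReLU())
            in_dim = width
        # output head: one Q-value per action, no normalization
        layers.append(nn.Linear(in_dim, n_actions))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)
```

With `use_layernorm=False` this reduces to a plain Linear-ReLU MLP, so the baseline and the LayerNorm variant differ only in that one flag.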
Surprisingly, simply adding LayerNorm reduced the model’s dormant neuron ratio, noticeably increased the effective rank, and also made learning much smoother. Those familiar with this environment will know that a return of around -110 can be considered very strong performance.
Based on this experiment, I would like to decide what direction to take next. To summarize, my thoughts and questions are:
- The MountainCar environment may be solvable simply by adding LayerNorm, without changing the reward function or the exploration strategy.
- However, even if LayerNorm solved the problem, I don’t think this necessarily proves that the issue was plasticity loss. What other possible explanations could there be? Why did LayerNorm solve this problem?
- I would appreciate any thoughts or feedback on how I could further develop this result.