u/Hairy-Foundation-963

When would you prefer DMPO over SAC for continuous control if real-world deployment is not the issue?

Hi everyone,

I have been reading about Distributional Maximum a Posteriori Policy Optimization (DMPO), especially in the context of the DeepMind bipedal robot soccer paper, and I am trying to understand when one would practically prefer it over SAC.

My current understanding is:

  • SAC is a strong off-policy continuous-control baseline.
  • It directly optimizes the actor using an entropy-regularized objective.
  • It is widely implemented, easier to find baselines for, and generally very strong in simulation.
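To make the SAC side concrete, here is a minimal sketch of the entropy-regularized actor objective for a single state and a 1-D tanh-squashed Gaussian policy. The critic `toy_q`, the function names, and all parameter values are hypothetical, chosen only for illustration. It evaluates the loss by sampling and does not compute gradients.

```python
import math
import random


def sample_squashed_gaussian(mean, log_std, rng):
    """Reparameterized sample a = tanh(mean + std * eps) and its log-probability."""
    std = math.exp(log_std)
    eps = rng.gauss(0.0, 1.0)
    u = mean + std * eps
    a = math.tanh(u)
    # Gaussian log-density of u, then the tanh change-of-variables correction.
    log_prob_u = -0.5 * eps ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    log_prob_a = log_prob_u - math.log(1.0 - a ** 2 + 1e-6)
    return a, log_prob_a


def toy_q(action):
    """Hypothetical critic for one fixed state: prefers actions near 0.5."""
    return -(action - 0.5) ** 2


def sac_actor_loss(mean, log_std, alpha, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[alpha * log pi(a|s) - Q(s, a)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        a, log_prob = sample_squashed_gaussian(mean, log_std, rng)
        total += alpha * log_prob - toy_q(a)
    return total / n_samples


if __name__ == "__main__":
    # A policy centred on the critic's preferred action should have lower loss.
    print(sac_actor_loss(mean=math.atanh(0.5), log_std=-1.0, alpha=0.1))
    print(sac_actor_loss(mean=-1.0, log_std=-1.0, alpha=0.1))
```

In a real implementation the loss would be differentiated through the sampled action into the policy parameters, which is what makes the actor follow the critic directly.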

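On the other hand, DMPO seems to use a more structured, two-stage actor update: it first reweights actions sampled from the current policy by their critic values, then fits the policy to those reweighted actions under a KL constraint.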

So my interpretation is that DMPO is more like: conservatively update the actor while keeping the KL divergence from the old policy small,

whereas SAC is more like: maintain entropy and allow more aggressive actor updates.
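As a rough illustration of the MPO-style side of that contrast, the sketch below reweights actions sampled from the old policy with a softmax over Q / eta, scores a candidate policy by weighted negative log-likelihood, and measures its KL divergence from the old policy. It is heavily simplified and rests on assumptions: 1-D Gaussian policies, a fixed temperature `eta` (MPO optimizes it through a dual objective), and a KL term that is only computed, not enforced through a Lagrange multiplier. All names and values are hypothetical.

```python
import math


def e_step_weights(q_values, eta):
    """Softmax of Q / eta over actions sampled from the old policy."""
    q_max = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - q_max) / eta) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_log_prob(x, mean, std):
    """Log-density of a 1-D Gaussian."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def gaussian_kl(mean_old, std_old, mean_new, std_new):
    """KL(old || new) between two 1-D Gaussians."""
    return (math.log(std_new / std_old)
            + (std_old ** 2 + (mean_old - mean_new) ** 2) / (2.0 * std_new ** 2)
            - 0.5)


def m_step_loss(actions, weights, mean_new, std_new):
    """Weighted negative log-likelihood of the sampled actions."""
    return -sum(w * gaussian_log_prob(a, mean_new, std_new)
                for a, w in zip(actions, weights))


if __name__ == "__main__":
    actions = [-0.5, 0.0, 0.5]   # hypothetical samples from the old policy
    q_values = [0.0, 1.0, 2.0]   # hypothetical critic values for those samples
    weights = e_step_weights(q_values, eta=1.0)
    print(weights)
    print(m_step_loss(actions, weights, mean_new=0.3, std_new=0.5))
    print(gaussian_kl(0.0, 0.5, 0.3, 0.5))  # compare against a KL budget
```

The point of the contrast is that the critic enters only through the weights, and the size of the policy change is bounded by the KL budget rather than by the critic's gradient.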

I understand why DMPO might be attractive for real-world robotics, since conservative policy updates can reduce dangerous or unstable behavior. But suppose real-world deployment is not the issue, and all trials are in simulation.

In that case, when would you still prefer DMPO over SAC?

For example, would DMPO be more attractive in tasks where:

  • the policy is very sensitive to sudden changes?
  • the critic is noisy or easy to exploit?
  • the task involves contact-rich dynamics?
  • the return distribution is multi-modal?
  • preserving partially learned behaviors matters?
  • coordination between multiple agents is fragile?

Or would you generally just use SAC unless DMPO clearly performs better in ablations?

I am especially interested in practical opinions from people who have tried MPO/DMPO-style algorithms. In what kinds of environments did they outperform SAC, and where did SAC remain the better choice?

Thanks

reddit.com
u/Hairy-Foundation-963 — 3 days ago