Trust Region Policy Optimization

Authors: John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel
Source: https://arxiv.org/pdf/1502.05477.pdf

Trust Region Policy Optimization (TRPO) is a key paper in the progression of on-policy algorithms. Its guarantee of monotonic policy improvement and its better sample efficiency than its predecessors made it a powerful and stable training algorithm for reinforcement learning.

Problems Addressed

When using policy gradients, small changes in parameter space can sometimes cause very large changes in performance. This makes large step sizes dangerous, as a single bad step can collapse policy performance.

Key Ideas

The trust region policy optimization algorithm aims to update policies by taking the largest step possible while satisfying a constraint on how close the new and old policies are.

\displaystyle \underset{\theta}{\text{maximize}} \ \mathcal{L}(\theta, \theta_{old}) \quad \text{subject to} \quad D_{KL}(\theta\ ||\ \theta_{old}) \leq \delta

The surrogate advantage \mathcal{L} measures how the new policy \pi_{\theta} performs relative to the old policy \pi_{\theta_{old}}.

\displaystyle \mathcal{L}(\theta, \theta_{old}) = \mathbb{E}_{s, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta}(a | s)}{\pi_{\theta_{old}}(a | s)} A_{\theta_{old}}(s,a) \right]
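
As a concrete illustration, here is a minimal numpy sketch of estimating this surrogate objective from a sampled batch. The log-probability and advantage arrays, and the `surrogate_loss` name, are illustrative assumptions rather than code from the paper.

```python
import numpy as np

def surrogate_loss(new_log_probs, old_log_probs, advantages):
    """Monte Carlo estimate of L(theta, theta_old).

    new_log_probs: log pi_theta(a|s) for sampled (s, a) pairs
    old_log_probs: log pi_theta_old(a|s) for the same pairs
    advantages:    advantage estimates A_theta_old(s, a)
    """
    new_log_probs = np.asarray(new_log_probs)
    old_log_probs = np.asarray(old_log_probs)
    advantages = np.asarray(advantages)
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s)
    ratio = np.exp(new_log_probs - old_log_probs)
    # Expectation approximated by the sample mean over the batch
    return np.mean(ratio * advantages)
```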

The constraint on the update uses the KL divergence, which measures how different a probability distribution P is from a reference distribution Q.

\displaystyle D_{KL}(P\ ||\ Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right)
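
For discrete distributions this is a one-line sum. A small numpy sketch, with made-up example distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return np.sum(p * np.log(p / q))

# Example: KL between two 3-outcome distributions (illustrative values)
print(kl_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))  # ~0.085
```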

In our case, the KL divergence measures the difference between the new and old policies, averaged over states visited under the old policy.

\displaystyle D_{KL}(\theta\ ||\ \theta_{old}) = \mathbb{E}_{s \sim \pi_{\theta_{old}}} \left[ D_{KL}\left(\pi_{\theta_{old}}(\cdot | s)\ ||\ \pi_{\theta}(\cdot | s)\right) \right]
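
With a discrete action space, this expectation can be estimated by averaging the per-state KL over a batch of states sampled under the old policy. A minimal sketch, assuming each policy is given as an array of per-state action probabilities (the `mean_policy_kl` helper is hypothetical):

```python
import numpy as np

def mean_policy_kl(old_probs, new_probs):
    """Estimate E_s[ D_KL( pi_theta_old(.|s) || pi_theta(.|s) ) ].

    old_probs, new_probs: arrays of shape (num_states, num_actions) holding
    action probabilities for states sampled under the old policy.
    """
    old_probs = np.asarray(old_probs, dtype=np.float64)
    new_probs = np.asarray(new_probs, dtype=np.float64)
    per_state_kl = np.sum(old_probs * np.log(old_probs / new_probs), axis=1)
    return np.mean(per_state_kl)
```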

The bound \delta on the KL divergence constraint is a hyperparameter; it was set to 0.01 for all experiments in the paper.
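
In the paper the update direction is computed with a conjugate gradient approximation to the natural gradient and then shrunk by a backtracking line search until the constraint holds. The sketch below shows only that acceptance check in simplified form; `eval_surrogate` and `eval_kl` are hypothetical callables that evaluate the quantities above on a sampled batch.

```python
import numpy as np

def trust_region_step(theta_old, full_step, eval_surrogate, eval_kl,
                      delta=0.01, backtrack_coeff=0.5, max_backtracks=10):
    """Backtracking line search that enforces the trust-region constraint.

    theta_old:      current parameter vector
    full_step:      proposed update to the parameters (e.g. a gradient-based step)
    eval_surrogate: callable theta -> surrogate advantage L(theta, theta_old)
    eval_kl:        callable theta -> mean KL between old and new policies
    delta:          trust-region radius (0.01 in the paper's experiments)
    """
    theta_old = np.asarray(theta_old, dtype=np.float64)
    full_step = np.asarray(full_step, dtype=np.float64)
    old_surrogate = eval_surrogate(theta_old)
    for i in range(max_backtracks):
        step_frac = backtrack_coeff ** i
        theta_new = theta_old + step_frac * full_step
        # Accept the largest step that improves the surrogate objective
        # while keeping the new policy inside the KL trust region.
        if eval_kl(theta_new) <= delta and eval_surrogate(theta_new) > old_surrogate:
            return theta_new
    return theta_old  # no acceptable step found; keep the old parameters
```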

Related