Proximal Policy Optimization

Authors: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

Source:

Problems Addressed

How can we take the biggest possible improvement step on a policy using the data we currently have without stepping so far that we accidentally cause performance collapse?

Key Ideas

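Proximal Policy Optimization uses a clipped surrogate objective. The clipped objective is a pessimistic (lower) bound on the unclipped surrogate objective, so the policy gains nothing from moving the probability ratio outside the interval [1 - \epsilon, 1 + \epsilon].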

\displaystyle \mathcal{L}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta)\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\ 1 - \epsilon,\ 1 + \epsilon\right)\hat{A}_t \right) \right]

\displaystyle r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
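To make the two formulas above concrete, here is a minimal sketch of the clipped objective in plain Python. It assumes the inputs are per-timestep log-probabilities under the new and old policies plus advantage estimates \hat{A}_t. The function name `clipped_surrogate`, its argument names, and the default `epsilon=0.2` are illustrative choices, not something these notes specify.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch of timesteps.

    logp_new, logp_old: log pi_theta(a_t | s_t) and log pi_theta_old(a_t | s_t)
    advantages: advantage estimates A_hat_t
    epsilon: clip range (0.2 is an assumed default for illustration)
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio r_t(theta), computed from log-probabilities.
        ratio = math.exp(lp_new - lp_old)
        # clip(r_t, 1 - epsilon, 1 + epsilon)
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a ratio of 1.5 and a positive advantage of 1, the clipped term (1.2) is the smaller one, so the objective stops rewarding further increases in the ratio. With the same ratio and a negative advantage of -1, the unclipped term (-1.5) is smaller, so the penalty for a harmful move is kept in full.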

To optimize the policy, PPO alternates between sampling data from the current policy and performing several epochs of minibatch optimization on the sampled data.
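The alternation between sampling and multi-epoch optimization can be sketched on a toy problem. The example below is an illustration under several assumptions that are not in these notes: a two-armed bandit where arm 1 pays reward 1 and arm 0 pays 0, a one-parameter sigmoid policy, the batch mean reward as a baseline for the advantage estimate, full-batch gradient ascent with a hand-derived gradient, and hypothetical names (`ppo_bandit`, `prob`, `sigmoid`) and hyperparameters.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def prob(theta, action):
    """Probability of `action` under the policy pi_theta(1) = sigmoid(theta)."""
    p1 = sigmoid(theta)
    return p1 if action == 1 else 1.0 - p1

def ppo_bandit(iterations=50, batch_size=64, epochs=4, lr=0.5,
               epsilon=0.2, seed=0):
    rng = random.Random(seed)
    theta = 0.0  # start with both arms equally likely
    for _ in range(iterations):
        # Step 1: sample a batch with the current policy, then freeze it
        # as theta_old for the ratio denominator.
        theta_old = theta
        p1_old = sigmoid(theta_old)
        actions = [1 if rng.random() < p1_old else 0 for _ in range(batch_size)]
        rewards = [float(a) for a in actions]  # toy reward: arm 1 pays 1
        baseline = sum(rewards) / batch_size
        advantages = [r - baseline for r in rewards]

        # Step 2: several epochs of gradient ascent on the same batch.
        for _ in range(epochs):
            grad = 0.0
            for a, adv in zip(actions, advantages):
                ratio = prob(theta, a) / prob(theta_old, a)
                clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
                # The gradient is nonzero only when the unclipped term is
                # the active branch of the min.
                if ratio * adv <= clipped * adv:
                    # d/dtheta log pi_theta(a) = a - sigmoid(theta)
                    grad += adv * ratio * (a - sigmoid(theta))
            theta += lr * grad / batch_size
    return theta
```

Calling `sigmoid(ppo_bandit())` returns the learned probability of pulling the rewarding arm, which should end up above the initial 0.5. Once a sample's ratio leaves the clip range in the direction the advantage favors, that sample contributes no gradient, which is what keeps the extra epochs from pushing the policy too far from the one that collected the data.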
