Proximal Policy Optimization

Authors: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

Source: https://arxiv.org/pdf/1707.06347.pdf

Problems Addressed

How can we take the biggest possible improvement step on a policy using the data we currently have without stepping so far that we accidentally cause performance collapse?

Key Ideas

Proximal Policy Optimization uses a clipped surrogate objective that is a pessimistic (lower) bound on the unclipped surrogate objective, which removes the incentive to move the probability ratio far outside the interval [1 − ϵ, 1 + ϵ] and so discourages destructively large policy updates.

$\displaystyle \mathcal{L}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\ 1 - \epsilon,\ 1 + \epsilon)\hat{A}_t\right)\right]$

$\displaystyle r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$
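A minimal sketch of this objective in PyTorch (the function name, tensor layout, and default ϵ = 0.2 here are illustrative assumptions, not prescribed by the notes above): it computes the probability ratio from stored log-probabilities and negates the result so a standard optimizer can minimize it.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped surrogate objective, negated so it can be minimized.

    log_probs_new: log pi_theta(a_t | s_t) under the current policy parameters
    log_probs_old: log pi_theta_old(a_t | s_t) from the sampling policy (detached)
    advantages:    advantage estimates A_hat_t
    epsilon:       clip range (0.2 is a common choice; illustrative default)
    """
    # Probability ratio r_t(theta), computed in log space for numerical stability.
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Unclipped and clipped surrogate terms; the elementwise minimum gives
    # a pessimistic (lower) bound on the unclipped objective.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # Negate because L(theta) is maximized while optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
```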

To optimize the policy, alternate between sampling data with the current policy and performing several epochs of minibatch optimization on that sampled data; a sketch of this loop follows.
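A sketch of the outer update step under some assumptions: `policy` is assumed to expose a hypothetical `log_prob(obs, actions)` method, `rollout` is a dict of tensors gathered with the current policy (observations, actions, detached old log-probabilities, advantages), and `ppo_clip_loss` is the function sketched above. The hyperparameter values are illustrative, not from the paper.

```python
import torch

def ppo_update(policy, optimizer, rollout, num_epochs=10, batch_size=64, epsilon=0.2):
    """One PPO update: several epochs of minibatch SGD on a fixed batch of rollout data."""
    dataset = torch.utils.data.TensorDataset(
        rollout["observations"], rollout["actions"],
        rollout["log_probs_old"], rollout["advantages"],
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

    for _ in range(num_epochs):
        for obs, actions, log_probs_old, advantages in loader:
            # Re-evaluate the sampled actions under the current policy parameters.
            log_probs_new = policy.log_prob(obs, actions)  # hypothetical interface
            loss = ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Because the ratio is clipped, the same batch of rollout data can be reused for multiple optimization epochs without the policy drifting far from the one that collected it.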

Related