Policy Gradient Methods
Authors: Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour
Source: https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf
This paper is a classic in the field of reinforcement learning.
Problems Addressed
Prior to this work, the standard approach to reinforcement learning was to approximate a value function and then derive a policy greedily, selecting in each state the action with the highest estimated value. This approach has worked well for many applications but has several limitations.
The value-function-based approach lends itself to deterministic policies, but sometimes the optimal policy is stochastic.
An arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. These discontinuous changes make convergence hard to guarantee (see the sketch below).
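As a concrete illustration of this discontinuity (an assumed toy example, not from the paper), the snippet below shows a tiny change in an estimated action value flipping the greedy choice, while a softmax (stochastic) policy's action probabilities shift only slightly:

```python
# Toy illustration (assumed, not from the paper): greedy selection is
# discontinuous in the value estimates, while a softmax policy is smooth.
import numpy as np

q_values = np.array([1.000, 1.001])               # estimated action values
q_perturbed = q_values + np.array([0.002, 0.0])   # tiny change in one estimate

def softmax(q, temperature=1.0):
    z = np.exp((q - q.max()) / temperature)
    return z / z.sum()

print("greedy action before:", q_values.argmax())     # -> 1
print("greedy action after: ", q_perturbed.argmax())  # -> 0 (choice flips)
print("softmax probs before:", softmax(q_values))     # nearly unchanged
print("softmax probs after: ", softmax(q_perturbed))
```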
Key Ideas
A function approximator (e.g., a neural network) is used to represent a stochastic policy $\pi(s, a; \theta)$ directly, and the policy parameters are updated by gradient ascent on the performance measure $\rho$:

$$\theta_{k+1} = \theta_k + \alpha_k \sum_s d^{\pi}(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta}\, f_w(s, a)$$

where $d^{\pi}(s)$ is the stationary distribution of states under the policy $\pi$, $f_w(s, a)$ is an approximation of the advantage function, and $\alpha_k$ is a positive step size.
Let $\{\alpha_k\}_{k=0}^{\infty}$ be any step-size sequence such that $\lim_{k \to \infty} \alpha_k = 0$ and $\sum_k \alpha_k = \infty$. Then, for any Markov decision process with bounded rewards, the sequence of policies $\{\pi_k\}$ produced by this update is guaranteed to converge to a locally optimal policy, i.e., $\lim_{k \to \infty} \frac{\partial \rho(\pi_k)}{\partial \theta} = 0$.
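The following is a minimal sketch of this kind of update in Monte Carlo (REINFORCE-style) form: sampled discounted returns stand in for the advantage approximation $f_w$, and a tabular softmax policy stands in for a neural network. The toy two-state MDP, the hyperparameters, and the parameterization are illustrative assumptions, not details from the paper.

```python
# Minimal REINFORCE-style sketch of the policy-gradient update above.
# The toy two-state MDP, softmax parameterization, and hyperparameters are
# illustrative assumptions, not from the paper; the paper's update uses a
# learned approximation f_w of the advantage, whereas this sketch uses
# sampled Monte Carlo returns in its place.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 2, 2
theta = np.zeros((N_STATES, N_ACTIONS))  # policy parameters


def softmax_policy(state):
    """Stochastic policy pi(s, a) parameterized by theta."""
    prefs = theta[state]
    exp = np.exp(prefs - prefs.max())
    return exp / exp.sum()


def step(state, action):
    """Toy MDP: action 0 keeps the state, action 1 flips it.
    Reward 1 is given only for flipping out of state 0."""
    reward = 1.0 if (state == 0 and action == 1) else 0.0
    next_state = state if action == 0 else 1 - state
    return next_state, reward


alpha, gamma, horizon = 0.1, 0.9, 20

for episode in range(500):
    state = 0
    trajectory = []  # (state, action, reward) tuples
    for _ in range(horizon):
        probs = softmax_policy(state)
        action = rng.choice(N_ACTIONS, p=probs)
        next_state, reward = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state

    # Walk the episode backwards, computing the return G_t and applying
    # theta[s] += alpha * G_t * grad log pi(a_t | s_t) at each step.
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + gamma * G
        probs = softmax_policy(s)
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0  # gradient of log softmax w.r.t. theta[s]
        theta[s] += alpha * G * grad_log_pi

print("learned policy in state 0:", softmax_policy(0))
```

Because the parameters move by small gradient steps, the action probabilities change smoothly over training rather than flipping discontinuously as in the greedy value-based case.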