Policy Gradient Methods
Authors: Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour Source: https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf
This paper is a classic in the field of reinforcment learning.
Problems Addressed
Prior to this work, the standard approach to reinforcement learning was to approximate a value function and use that to greedily build a policy by selecting the actions that maximize the value function. The value function approach has worked well for many applications but has several limitations.
The value function based approach lends itself to deterministic policies. Sometimes the most optimal policy is a stochastic one.
A small change in the estimated value of an action can cause it to be or not be selected. These discontinuous changes make it hard to assure convergence.
Key Ideas
A function approximator (neural network) is used to represent a stochastic policy directly and the parameters \theta of the policy \pi are updated as follows:
\displaystyle \theta_{k+1} = \theta_k + \alpha_k \sum_s d^{\pi_k}(s) \sum_a \frac{\partial \pi_k(s,a)}{\partial \theta} f_{w_k}(s,a)
where d^{\pi}(s) is the stationary distribution of states, f_{w_k}(s,a) is an approximation of the advantage function, and \alpha is a positive-definite step size.
Let \{\alpha_k\}_{k=0}^{\infty} be any step size sequence such that lim_{k \to \infty} \alpha_k = 0 and \sum_k \alpha_k = \infty. Then, for any markov decision process with bounded rewards such that \{\rho(\pi_k)\}_{k=0}^{\infty} < \infty is guaranteed to converge to a locally optimal policy such that lim_{k \to \infty} \frac{\partial \rho(\pi_k)}{\partial \theta} = 0.