Policy Gradient Methods

Authors: Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour
Source: https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf

This paper is a classic in the field of reinforcement learning.

Problems Addressed

Prior to this work, the standard approach to reinforcement learning was to approximate a value function and then derive a policy from it greedily, selecting in each state the action with the highest estimated value. This value-function approach has worked well in many applications, but it has several limitations.

  1. The value-function-based approach is oriented toward finding deterministic policies, but in some problems the optimal policy is stochastic.

  2. An arbitrarily small change in the estimated value of an action can cause that action to be selected or not selected. These discontinuous changes make it difficult to establish convergence guarantees.

Key Ideas

A function approximator (e.g., a neural network) is used to represent a stochastic policy directly, and the parameters \theta of the policy \pi are updated as follows:

\displaystyle \theta_{k+1} = \theta_k + \alpha_k \sum_s d^{\pi_k}(s) \sum_a \frac{\partial \pi_k(s,a)}{\partial \theta} f_{w_k}(s,a)

where d^{\pi}(s) is the stationary distribution of states under \pi, f_{w_k}(s,a) is a learned approximation of the advantage function, and \alpha_k is a positive step size.
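
In practice the sums over states and actions are replaced by samples from trajectories generated by the current policy. Below is a minimal, sample-based sketch of this kind of update: a REINFORCE-style Monte Carlo estimator in which the discounted return stands in for f_{w_k}(s,a), applied to a toy two-state MDP with a softmax policy. The environment, parameterization, and hyperparameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 2, 2
theta = np.zeros((N_STATES, N_ACTIONS))  # policy parameters: one preference per state-action pair

# Toy dynamics: (state, action) -> (next_state, reward). Illustrative only.
P = {
    (0, 0): (0, 0.0), (0, 1): (1, 1.0),
    (1, 0): (0, 1.0), (1, 1): (1, 0.0),
}

def pi(s):
    """Softmax policy pi(s, .) over actions, parameterized by theta."""
    prefs = theta[s] - theta[s].max()          # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_pi(s, a):
    """Gradient of log pi(s, a) with respect to theta (nonzero only in row s)."""
    g = np.zeros_like(theta)
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g

gamma, alpha = 0.9, 0.1
for episode in range(500):
    s, trajectory = 0, []
    for _ in range(20):                        # fixed-length episodes for simplicity
        a = rng.choice(N_ACTIONS, p=pi(s))
        s_next, r = P[(s, a)]
        trajectory.append((s, a, r))
        s = s_next
    # The discounted return G_t stands in for f_{w_k}(s, a) in the update.
    G = 0.0
    for s_t, a_t, r_t in reversed(trajectory):
        G = r_t + gamma * G
        theta += alpha * G * grad_log_pi(s_t, a_t)

print("Learned policy:", [np.round(pi(s), 2) for s in range(N_STATES)])
```

Using the Monte Carlo return instead of a learned f_w keeps the sketch short; in the paper, f_w is a separate, compatible function approximator fit against the policy's action values rather than raw sampled returns.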

Let \{\alpha_k\}_{k=0}^{\infty} be any step-size sequence such that \lim_{k \to \infty} \alpha_k = 0 and \sum_k \alpha_k = \infty. Then, for any Markov decision process with bounded rewards, the sequence \{\rho(\pi_k)\}_{k=0}^{\infty} produced by this update converges such that \lim_{k \to \infty} \frac{\partial \rho(\pi_k)}{\partial \theta} = 0; that is, the policy converges to a locally optimal policy.
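
As a concrete example (my own, not one prescribed in the paper), the harmonic step sizes satisfy both conditions:

\displaystyle \alpha_k = \frac{1}{k+1}, \qquad \lim_{k \to \infty} \alpha_k = 0, \qquad \sum_{k=0}^{\infty} \frac{1}{k+1} = \infty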

Related