Mathematical Psychology

Policy Gradient Methods

Policy gradient methods directly optimize the policy (mapping from states to action probabilities) through gradient ascent on expected reward, enabling learning in continuous action spaces.

∇J(θ) = E[∇log π_θ(a|s) · R]
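This identity can be checked numerically. The sketch below is a hypothetical toy setup, not from the source: a two-action softmax policy in a single state with a fixed reward per action. It compares the analytic gradient of expected reward against the score-function (Monte Carlo) estimate E[∇log π_θ(a|s) · R].

```python
import numpy as np

# Two-action softmax policy over a single state; theta are logits.
rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1])
rewards = np.array([1.0, 0.0])      # deterministic reward per action

def pi(theta):
    e = np.exp(theta - theta.max())  # numerically stable softmax
    return e / e.sum()

p = pi(theta)

# Analytic gradient of J(theta) = sum_a pi(a) * R(a):
# dJ/dtheta_k = p_k * R_k - p_k * (p . R)
grad_true = p * rewards - p * (p @ rewards)

# Score-function estimate: average grad log pi(a) * R(a) over samples.
n = 200_000
a = rng.choice(2, size=n, p=p)
onehot = np.eye(2)[a]                # grad_theta log softmax = onehot - p
grad_est = ((onehot - p) * rewards[a][:, None]).mean(axis=0)

print(grad_true, grad_est)           # the two should agree closely
```

With enough samples the Monte Carlo estimate converges to the analytic gradient, which is the property REINFORCE exploits.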

Policy gradient methods, unlike value-based approaches like Q-learning, directly parameterize and optimize the policy — the mapping from states to action probabilities. The REINFORCE algorithm (Williams, 1992) provides a simple, elegant way to estimate the gradient of expected reward with respect to policy parameters using only sampled trajectories.

The Policy Gradient Theorem

The policy gradient theorem (Sutton et al., 1999) states that ∇J(θ) = E_π[∇log π_θ(a|s) · Q^π(s,a)]: the gradient of expected return is itself an expectation over the policy's own trajectories, so it can be estimated from samples. REINFORCE replaces Q^π with the sampled return.

REINFORCE Algorithm

Policy: π_θ(a|s) = P(action a in state s; parameters θ)
Gradient: ∇J(θ) = E_τ[Σₜ ∇log π_θ(aₜ|sₜ) · Rₜ], where Rₜ is the return from step t

Update: θ ← θ + α · ∇log π_θ(a|s) · (R − b)
b = baseline (e.g., a running average of reward); subtracting b reduces variance without biasing the gradient estimate
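The update rule above can be sketched on a hypothetical two-armed bandit (arm 0 pays 1.0, arm 1 pays 0.2), using a running average of reward as the baseline b:

```python
import numpy as np

# Minimal REINFORCE with a running-average baseline on a two-armed bandit.
rng = np.random.default_rng(1)
theta = np.zeros(2)                  # policy logits
arm_rewards = np.array([1.0, 0.2])
alpha, baseline = 0.1, 0.0

def pi(theta):
    e = np.exp(theta - theta.max())  # numerically stable softmax
    return e / e.sum()

for _ in range(2000):
    p = pi(theta)
    a = rng.choice(2, p=p)           # sample an action from the policy
    R = arm_rewards[a]
    grad_log = -p.copy()
    grad_log[a] += 1.0               # grad log pi_theta(a) for a softmax policy
    theta += alpha * grad_log * (R - baseline)   # the update rule above
    baseline += 0.05 * (R - baseline)            # running average of reward

print(pi(theta))                     # probability mass concentrates on arm 0
```

Because both arms pay a positive reward, the baseline matters: without it, every sampled action is reinforced, and only the difference in reinforcement magnitudes drives learning.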

Actor-Critic Methods

Actor-critic methods combine policy gradient (the actor, which selects actions) with value function estimation (the critic, which evaluates states). The critic provides a learned baseline that reduces the variance of the policy gradient estimate, dramatically improving learning speed. The actor-critic architecture has been proposed as a model of corticostriatal interactions, with the prefrontal cortex implementing the policy (actor) and the basal ganglia computing value estimates (critic) via dopaminergic prediction errors.
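A one-step actor-critic can be sketched on a hypothetical two-state chain MDP (not from the source): in state 0, action 1 moves to a terminal state with reward 1, while action 0 stays in state 0 with reward 0. The critic learns state values by TD(0), and its prediction error weights the actor's update:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.zeros(2)          # actor: action logits for state 0
V = np.zeros(2)              # critic: state-value estimates
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.9

def pi(theta):
    e = np.exp(theta - theta.max())  # numerically stable softmax
    return e / e.sum()

for _ in range(1000):        # episodes
    s = 0
    for _ in range(20):      # cap episode length
        p = pi(theta)
        a = rng.choice(2, p=p)
        if a == 1:
            r, s_next, done = 1.0, 1, True
        else:
            r, s_next, done = 0.0, 0, False
        target = r + (0.0 if done else gamma * V[s_next])
        td_error = target - V[s]          # dopamine-like prediction error
        V[s] += alpha_critic * td_error   # critic: TD(0) update
        grad_log = -p.copy()
        grad_log[a] += 1.0
        theta += alpha_actor * td_error * grad_log  # actor: TD-error-weighted
        if done:
            break
        s = s_next

print(pi(theta))             # the actor comes to prefer the rewarded action
```

Using the TD error in place of the raw return is what makes the critic a learned baseline: actions are reinforced only to the extent that outcomes are better than predicted, mirroring the dopaminergic prediction-error account.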

References

  1. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256. https://doi.org/10.1007/BF00992696
  2. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1057–1063.
  3. Joel, D., Niv, Y., & Ruppin, E. (2002). Actor–critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks, 15(4–6), 535–547. https://doi.org/10.1016/S0893-6080(02)00047-3
  4. O’Doherty, J. P., Dayan, P., Schultz, J., Deichmann, R., Friston, K., & Dolan, R. J. (2004). Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science, 304(5669), 452–454. https://doi.org/10.1126/science.1094285
