Mathematical Psychology

Q-Learning

Q-learning is a model-free reinforcement learning algorithm that learns the value of state-action pairs through temporal difference updating, converging to optimal behavior without a model of the environment.

Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]

Q-learning, introduced by Christopher Watkins in 1989, is a model-free reinforcement learning algorithm that learns action values — the expected cumulative reward for taking action a in state s and then following the optimal policy. It is "off-policy" because it learns the optimal Q-values regardless of the exploration strategy used to collect data.

The Q-Learning Update

Q(s,a) ← Q(s,a) + α · [r + γ · max_a' Q(s', a') − Q(s,a)]

α = learning rate
γ = discount factor
r = immediate reward
max_a' Q(s', a') = estimated optimal future value

The key term is max_a' Q(s', a') — the maximum Q-value over all possible next actions. This "max" operation is what makes Q-learning off-policy: it always uses the estimated optimal action for the update, even if the agent actually took a different (exploratory) action.
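The update above can be sketched as a small tabular implementation. This is a minimal illustration, not the site's calculator code; the function name and the toy 2-state, 2-action table are assumptions for the example:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    The max over Q[s_next] is what makes this off-policy: it backs up the
    estimated optimal next action, regardless of the action actually taken next.
    """
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s, a]              # the bracketed term in the update
    Q[s, a] += alpha * td_error
    return td_error

# Example: two states, two actions, all Q-values initialized to zero.
Q = np.zeros((2, 2))
delta = q_update(Q, s=0, a=1, r=1.0, s_next=1)
# td_target = 1.0 + 0.9 * 0 = 1.0, so td_error = 1.0 and Q[0, 1] = 0.1
```

With an all-zero table the future-value term vanishes, so the first update simply moves Q(s,a) a fraction α toward the immediate reward.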

Applications in Psychology

Q-learning has been widely used to model habitual and goal-directed behavior in psychology and neuroscience. The Q-values correspond to cached action values learned through experience, paralleling habitual behavior controlled by the dorsolateral striatum. Q-learning models have been fit to data from bandit tasks, probabilistic learning tasks, and social decision-making paradigms, with the learning rate α and discount factor γ serving as individual difference parameters that correlate with clinical and personality measures.
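Fitting such a model to choice data typically means maximizing the likelihood of a participant's choices under a Q-learning rule combined with a softmax choice function. A hedged sketch for a two-armed bandit follows; the function name, the inverse-temperature parameter β, and the bandit simplification γ = 0 (no future term, so the update reduces to a delta rule) are illustrative assumptions, not details from the article:

```python
import math

def nll_q_softmax(choices, rewards, alpha, beta):
    """Negative log-likelihood of bandit choices under Q-learning + softmax.

    choices: sequence of chosen arm indices (0 or 1)
    rewards: sequence of observed outcomes, same length
    alpha:   learning rate; beta: softmax inverse temperature
    """
    Q = [0.0, 0.0]
    nll = 0.0
    for c, r in zip(choices, rewards):
        # Softmax choice probability (max-subtraction for numerical stability).
        z = [beta * q for q in Q]
        m = max(z)
        p = math.exp(z[c] - m) / sum(math.exp(v - m) for v in z)
        nll -= math.log(p)
        # Bandit task: no successor state, so the max-future term drops out.
        Q[c] += alpha * (r - Q[c])
    return nll
```

Minimizing this quantity over (α, β), e.g. with `scipy.optimize.minimize`, yields per-participant parameter estimates of the kind correlated with clinical and personality measures.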

Interactive Calculator

Each row records a state transition: state (integer), reward (numeric), next_state (integer). The calculator applies temporal-difference (TD) learning to state values rather than full action values: δ = r + γ·V(s′) − V(s), then V(s) ← V(s) + α·δ. Parameters: α = 0.1 (learning rate), γ = 0.9 (discount factor).

Click Calculate to see results, or Animate to watch the statistics update one record at a time.
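The per-row computation the calculator performs can be sketched as follows; the function name and the list-of-tuples record format are assumptions based on the description above:

```python
def td_state_values(records, alpha=0.1, gamma=0.9):
    """Apply TD(0) updates over (state, reward, next_state) rows.

    Returns the final state-value table V and the per-row TD errors delta.
    """
    n_states = 1 + max(max(s, s_next) for s, _, s_next in records)
    V = [0.0] * n_states
    deltas = []
    for s, r, s_next in records:
        delta = r + gamma * V[s_next] - V[s]  # TD error for this row
        V[s] += alpha * delta                 # move V(s) a fraction alpha toward the target
        deltas.append(delta)
    return V, deltas

# Example: a single row (state 0, reward 1.0, next_state 1) with V initialized to zero
# gives delta = 1.0 + 0.9 * 0 - 0 = 1.0 and V[0] = 0.1.
V, deltas = td_state_values([(0, 1.0, 1)])
```

Animating this loop one record at a time is exactly what the Animate button displays: each row contributes one TD error δ and one value update.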


References

  1. Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292. https://doi.org/10.1007/BF00992698
  2. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press. https://doi.org/10.1109/TNN.1998.712192
  3. Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095), 876–879. https://doi.org/10.1038/nature04766
