Q-learning, introduced by Christopher Watkins in 1989, is a model-free reinforcement learning algorithm that learns action values — the expected cumulative discounted reward for taking action a in state s and then following the optimal policy. It is "off-policy" because it learns the optimal Q-values regardless of the exploration strategy used to collect the data.
The Q-Learning Update

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]

where:

α = learning rate
γ = discount factor
r = immediate reward
max_a' Q(s', a') = estimated optimal future value
The key term is max_a' Q(s', a') — the maximum Q-value over all possible next actions. This "max" operation is what makes Q-learning off-policy: the update bootstraps from the estimated optimal next action, even if the behavior policy (e.g., ε-greedy exploration) goes on to take a different, exploratory action in s'.
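The update rule can be sketched in a few lines. The following is a minimal illustration on a hypothetical two-state chain (the states, rewards, and hyperparameter values are invented for the example), showing how repeated updates converge toward the discounted optimal values:

```python
# Tabular Q-learning update on a toy deterministic chain:
# state 0 --"go"--> state 1 (reward 0), state 1 --"go"--> terminal "T" (reward 1).
ALPHA, GAMMA = 0.5, 0.9  # learning rate, discount factor (illustrative values)

Q = {0: {"go": 0.0}, 1: {"go": 0.0}, "T": {}}

def q_update(Q, s, a, r, s_next):
    # Off-policy target: max over next-state actions (0.0 at a terminal state),
    # regardless of which action the behavior policy will actually take next.
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s][a] += ALPHA * (r + GAMMA * best_next - Q[s][a])

# Replay the two transitions many times; Q converges to the optimal values
# Q(1,"go") = 1.0 and Q(0,"go") = GAMMA * 1.0 = 0.9.
for _ in range(50):
    q_update(Q, 0, "go", 0.0, 1)
    q_update(Q, 1, "go", 1.0, "T")
```

Because each state here has a single action, the max is trivial; in a real task `best_next` is where the off-policy character of the algorithm shows up.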
Applications in Psychology
Q-learning has been widely used to model habitual and goal-directed behavior in psychology and neuroscience. The Q-values correspond to cached action values learned through experience, paralleling habitual behavior controlled by the dorsolateral striatum. Q-learning models have been fit to data from bandit tasks, probabilistic learning tasks, and social decision-making paradigms, with the learning rate α and discount factor γ serving as individual difference parameters that correlate with clinical and personality measures.
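The model-fitting setup described above can be sketched for the simplest case, a two-armed bandit. The arm probabilities, parameter values, and function names below are illustrative assumptions, not from any particular study; note that in a one-step bandit the discounted future term drops out of the update, so such fits typically estimate the learning rate α together with a softmax inverse temperature β:

```python
import math
import random

def softmax_choice(q_values, beta, rng):
    # Softmax (Luce) choice rule: higher beta -> more deterministic choices.
    exps = [math.exp(beta * q) for q in q_values]
    total = sum(exps)
    r, cum = rng.random(), 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return i
    return len(exps) - 1

def simulate_bandit(alpha=0.3, beta=5.0, p_reward=(0.8, 0.2),
                    trials=200, seed=0):
    # Simulate a Q-learning agent on a two-armed bandit (hypothetical setup).
    rng = random.Random(seed)
    q = [0.0, 0.0]
    choices = []
    for _ in range(trials):
        a = softmax_choice(q, beta, rng)
        reward = 1.0 if rng.random() < p_reward[a] else 0.0
        q[a] += alpha * (reward - q[a])  # delta rule: no next state to discount
        choices.append(a)
    return q, choices

q, choices = simulate_bandit()
```

Fitting the model to a participant's choices would then amount to finding the (α, β) that maximize the likelihood of the observed choice sequence under the softmax probabilities.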