Q-learning, introduced by Christopher Watkins in 1989, is a model-free reinforcement learning algorithm that learns action values — the expected cumulative discounted reward for taking action a in state s and then following the optimal policy. It is "off-policy" because it learns the optimal Q-values regardless of the exploration strategy used to collect the data.
The Q-Learning Update

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]

where:

α = learning rate
γ = discount factor
r = immediate reward
max_a' Q(s', a') = estimated optimal future value
The key term is max_a' Q(s', a') — the maximum Q-value over all possible next actions. This "max" operation is what makes Q-learning off-policy: the update bootstraps from the estimated optimal next action, even if the behavior policy (e.g., ε-greedy exploration) goes on to take a different, exploratory action in s'.
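The update rule can be sketched in a few lines. The following is a minimal illustration on a hypothetical two-state chain (the states, rewards, and hyperparameter values are invented for the example), showing how repeated updates converge toward the discounted optimal values:

```python
# Tabular Q-learning update on a toy deterministic chain:
# state 0 --"go"--> state 1 (reward 0), state 1 --"go"--> terminal "T" (reward 1).
ALPHA, GAMMA = 0.5, 0.9  # learning rate, discount factor (illustrative values)

Q = {0: {"go": 0.0}, 1: {"go": 0.0}, "T": {}}

def q_update(Q, s, a, r, s_next):
    # Off-policy target: max over next-state actions (0.0 at a terminal state),
    # regardless of which action the behavior policy will actually take next.
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s][a] += ALPHA * (r + GAMMA * best_next - Q[s][a])

# Replay the two transitions many times; Q converges to the optimal values
# Q(1,"go") = 1.0 and Q(0,"go") = GAMMA * 1.0 = 0.9.
for _ in range(50):
    q_update(Q, 0, "go", 0.0, 1)
    q_update(Q, 1, "go", 1.0, "T")
```

Because each state here has a single action, the max is trivial; in a real task `best_next` is where the off-policy character of the algorithm shows up.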
Applications in Psychology
Q-learning has been widely used to model habitual and goal-directed behavior in psychology and neuroscience. The Q-values correspond to cached action values learned through experience, paralleling habitual behavior controlled by the dorsolateral striatum. Q-learning models have been fit to data from bandit tasks, probabilistic learning tasks, and social decision-making paradigms, with the learning rate α and discount factor γ serving as individual difference parameters that correlate with clinical and personality measures.
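The model-fitting setup described above can be sketched for the simplest case, a two-armed bandit. The arm probabilities, parameter values, and function names below are illustrative assumptions, not from any particular study; note that in a one-step bandit the discounted future term drops out of the update, so such fits typically estimate the learning rate α together with a softmax inverse temperature β:

```python
import math
import random

def softmax_choice(q_values, beta, rng):
    # Softmax (Luce) choice rule: higher beta -> more deterministic choices.
    exps = [math.exp(beta * q) for q in q_values]
    total = sum(exps)
    r, cum = rng.random(), 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return i
    return len(exps) - 1

def simulate_bandit(alpha=0.3, beta=5.0, p_reward=(0.8, 0.2),
                    trials=200, seed=0):
    # Simulate a Q-learning agent on a two-armed bandit (hypothetical setup).
    rng = random.Random(seed)
    q = [0.0, 0.0]
    choices = []
    for _ in range(trials):
        a = softmax_choice(q, beta, rng)
        reward = 1.0 if rng.random() < p_reward[a] else 0.0
        q[a] += alpha * (reward - q[a])  # delta rule: no next state to discount
        choices.append(a)
    return q, choices

q, choices = simulate_bandit()
```

Fitting the model to a participant's choices would then amount to finding the (α, β) that maximize the likelihood of the observed choice sequence under the softmax probabilities.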