Mathematical Psychology

Kullback-Leibler Divergence

The Kullback-Leibler divergence measures the information lost when one probability distribution is used to approximate another, providing an asymmetric, non-negative measure fundamental to model comparison and Bayesian inference.

D_KL(P‖Q) = Σ P(x) · log(P(x)/Q(x))

The Kullback-Leibler (KL) divergence, introduced by Solomon Kullback and Richard Leibler in 1951, quantifies the difference between two probability distributions. Originally termed "discrimination information," it measures the expected number of extra bits needed to code samples from distribution P using a code optimized for distribution Q. In mathematical psychology, KL divergence is central to model selection, Bayesian updating, and the analysis of information processing.

Definition and Properties

D_KL(P‖Q) = Σ_x P(x) · log(P(x) / Q(x))

For continuous distributions:
D_KL(P‖Q) = ∫ p(x) · log(p(x) / q(x)) dx

Properties:
D_KL(P‖Q) ≥ 0 (Gibbs' inequality)
D_KL(P‖Q) = 0 iff P = Q
D_KL(P‖Q) ≠ D_KL(Q‖P) in general (asymmetric)
D_KL(P‖Q) is finite only if Q(x) = 0 implies P(x) = 0 (absolute continuity)

The non-negativity of KL divergence follows from Jensen's inequality applied to the convex function −log(x). The asymmetry is a crucial property: D_KL(P‖Q) measures the cost of using Q to approximate P, which differs from the cost of using P to approximate Q. This asymmetry has practical consequences — using the wrong direction of KL divergence in model fitting leads to qualitatively different approximations, a distinction exploited in variational inference.
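These properties are easy to check numerically. The following sketch (with arbitrary illustrative distributions) computes the discrete KL divergence in bits and exhibits both the zero case and the asymmetry:

```python
import math

def kl_divergence(p, q):
    """Discrete D_KL(P||Q) in bits (base-2 log); terms with P(x)=0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Arbitrary example distributions over three events
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, p))  # 0 when P = Q
print(kl_divergence(p, q))  # cost of coding P with a code built for Q
print(kl_divergence(q, p))  # differs from the line above: asymmetry
```

Note the `pi > 0` guard: events with zero probability under P contribute nothing to the sum, consistent with the convention 0 · log 0 = 0.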

Relation to Likelihood Ratio and Bayesian Inference

KL divergence is intimately related to the log-likelihood ratio. If P and Q are the probability models for two hypotheses, then D_KL(P‖Q) equals the expected value of the log-likelihood ratio under P — the expected evidence per observation favoring P over Q. This connection makes KL divergence fundamental to hypothesis testing: Stein's lemma shows that the best achievable exponent of the type II error probability in the Neyman-Pearson setting equals the KL divergence between the hypotheses.
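The identity between KL divergence and the expected log-likelihood ratio can be verified by simulation. In this sketch (illustrative distributions, not from the source), samples are drawn from P and the sample mean of log(P(x)/Q(x)) is compared with the closed-form divergence:

```python
import math
import random

p = [0.6, 0.3, 0.1]  # "true" hypothesis P
q = [0.4, 0.4, 0.2]  # competing hypothesis Q

def kl(p, q):
    """Discrete D_KL(P||Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

random.seed(0)
n = 100_000
samples = random.choices(range(len(p)), weights=p, k=n)

# Sample mean of the log-likelihood ratio under P
llr_mean = sum(math.log(p[x] / q[x]) for x in samples) / n

print(kl(p, q), llr_mean)  # the two values agree up to Monte Carlo error
```

The Monte Carlo average converges to D_KL(P‖Q) by the law of large numbers, which is exactly the "expected evidence per observation" reading in the text.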

AIC and Model Selection

Akaike's Information Criterion (AIC) is grounded in KL divergence. Akaike (1974) showed that the expected KL divergence between the true data-generating distribution and a fitted model is asymptotically estimated by −2·log(L) + 2k, where L is the maximum likelihood and k is the number of parameters. Thus AIC selects the model that minimizes expected information loss — a direct application of KL divergence to the everyday practice of model comparison in psychology.
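A minimal worked example of the AIC formula, using hypothetical coin-flip data: a zero-parameter "fair coin" model is compared with a one-parameter model whose bias is estimated by maximum likelihood.

```python
import math

# Hypothetical data: 100 Bernoulli trials, 62 successes
n, s = 100, 62

def log_lik(p):
    """Bernoulli log-likelihood of the data under success probability p."""
    return s * math.log(p) + (n - s) * math.log(1 - p)

# Model A: fair coin, no free parameters (k = 0)
aic_fair = -2 * log_lik(0.5) + 2 * 0

# Model B: bias fitted by maximum likelihood (k = 1)
p_hat = s / n
aic_biased = -2 * log_lik(p_hat) + 2 * 1

print(aic_fair, aic_biased)  # the lower AIC wins
```

The extra parameter costs 2 AIC points; it is retained only when the likelihood gain outweighs that penalty, which is how AIC operationalizes "minimum expected information loss."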

Applications in Psychology

In cognitive modeling, KL divergence provides a principled measure for comparing competing models of human behavior. When fitting models to response distributions, minimizing KL divergence is equivalent to maximum likelihood estimation. In Bayesian models of cognition, the KL divergence between prior and posterior quantifies the information gained from observed data — a formalization of "learning" as the reduction of divergence between belief and reality.
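The "information gained from observed data" reading can be made concrete with a small discrete Bayesian update. This sketch (hypothetical numbers) places a uniform prior over three candidate coin biases, updates on observed flips, and reports D_KL(posterior‖prior) in bits:

```python
import math

# Hypothetical hypothesis space: three candidate coin biases, uniform prior
biases = [0.3, 0.5, 0.7]
prior = [1/3, 1/3, 1/3]

# Observe 8 heads in 10 flips; likelihood of the data under each hypothesis
heads, flips = 8, 10
lik = [b**heads * (1 - b)**(flips - heads) for b in biases]

# Posterior via Bayes' rule
evidence = sum(l * pr for l, pr in zip(lik, prior))
posterior = [l * pr / evidence for l, pr in zip(lik, prior)]

# Information gained = D_KL(posterior || prior), in bits
info_gain = sum(po * math.log2(po / pr)
                for po, pr in zip(posterior, prior) if po > 0)

print(posterior, info_gain)
```

The divergence is zero only if the data leave the beliefs unchanged; here the data shift mass toward the high-bias hypothesis, so the gain is strictly positive.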

In the free energy principle, variational free energy is defined as a KL divergence between an approximate posterior and the true posterior plus a constant. The brain, under this framework, minimizes KL divergence by updating its internal model to match the statistics of the environment. KL divergence also appears in information-theoretic analyses of neural coding, where it quantifies how well a neural population discriminates between stimuli.
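The decomposition stated above — variational free energy equals the KL divergence from the true posterior plus a constant (−log evidence) — can be checked exactly in a toy discrete model. All numbers below are illustrative:

```python
import math

# Hypothetical discrete model: latent z in {0, 1}, one observed datum x
p_z = [0.7, 0.3]          # prior p(z)
p_x_given_z = [0.2, 0.9]  # likelihood p(x | z) for the observed x

p_joint = [pz * lx for pz, lx in zip(p_z, p_x_given_z)]
p_x = sum(p_joint)                        # evidence p(x)
posterior = [pj / p_x for pj in p_joint]  # true posterior p(z | x)

q = [0.4, 0.6]  # an arbitrary approximate posterior

# Variational free energy F(q) = E_q[log q(z) - log p(z, x)]
free_energy = sum(qz * math.log(qz / pj) for qz, pj in zip(q, p_joint))

# Identity: F(q) = D_KL(q || p(z|x)) - log p(x)
kl_term = sum(qz * math.log(qz / po) for qz, po in zip(q, posterior))
print(free_energy, kl_term - math.log(p_x))  # identical
```

Because −log p(x) does not depend on q, minimizing free energy over q is exactly minimizing the KL divergence to the true posterior, which is the sense in which the framework casts inference as divergence minimization.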



References

  1. Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86. doi:10.1214/aoms/1177729694
  2. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. doi:10.1109/TAC.1974.1100705
  3. Burnham, K. P., & Anderson, D. R. (2002). Model Selection and Multimodel Inference (2nd ed.). Springer. doi:10.1007/b97636
  4. Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience. doi:10.1002/047174882X
