Information theory was born in 1948 with the publication of A Mathematical Theory of Communication by Claude Shannon (1916 to 2001). Shannon was inspired in part by earlier work by Boltzmann and Gibbs in thermodynamics and by Hartley and Nyquist at Bell. Most of the theory and applications of information theory (compression, coding schemes, data transfer over noisy channels) are outside the scope of this thesis, but there are certain information theoretic quantities used regularly in machine learning, so it is useful to discuss them now.
The information we talk about is restricted to the information about the probability distribution over the elementary outcomes, not information about the content of the outcomes. The significance of probability is that it tells us how certain we can be when making inference. The most important information in this regard is found in the probability distribution over the possible outcomes.
Information theory is quite useful for deep learning. If we think of neural nets as noisy channels, the need for this theory becomes even more obvious. David Mackay said "brains are the ultimate compression and communication systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning". Furthermore, "we might anticipate that the best data compression algorithms will result from the development of artificial intelligence methods".
The most fundamental quantity in information theory is entropy. Before we state the formal definition of entropy, we will motivate it as a measure of uncertainty by walking through its derivation. We will define a function $\eta$ as a measure of uncertainty and we will derive entropy as a function based on the requirements it must satisfy using $\eta$ as a starting point.
Let $(\mathcal{X}, p)$ be a discrete probability space. We define uncertainty to be a real-valued function $\eta(\cdot): \mathcal{X} \mapsto \mathbb{R}^+$ which depends only on the probabilities of the elementary outcomes and satisfies the following:
If an outcome $x$ is guaranteed to occur, then there is no uncertainty about it and $\eta(x)=0$;
For any two outcomes $x$, $x^\prime$, we have $p(x)\leq p(x^\prime)\iff \eta(x)\geq\eta(x^\prime)$;
For any two independent outcomes, $x$, $x^\prime$, the uncertainty of their joint occurrence, is the sum of their uncertainties, i.e. $\eta(x\cdot x^\prime)=\eta(x)+\eta(x^\prime)$.
It should not be a surprise that it is new information we are interested in, since that is what reduces uncertainty. Common outcomes provide less information than rare outcomes, which means $\eta$ should be inversely proportional to the probability of the outcome. $$ \begin{align} \eta(x) \propto {1 \over p(x)} \end{align} $$ Since $\eta$ must satisfy $\eta(x\cdot x^\prime)=\eta(x)+\eta(x^\prime)$, we must define $\eta$ in terms of the logarithm. This is because the probability of two independent outcomes is the product of their probabilities whereas we want information to be additive. Thus, $$ \begin{align} \eta(x) \approx \log{1 \over p(x)}. \end{align} $$ For probability distributions, we need a measure of uncertainty that says, on average, how much uncertainty is contained in $(\mathcal{X}, p)$. We need to weight the calculation by the probability of observing each outcome. This means what we are really seeking is a measure on the probability distribution over $\mathcal{X}$. We adjust the notation, using the capital eta, which resembles the Latin H. Thus, $$ \begin{align} H(p) = \sum_{x \in \mathcal{X}} p(x) \log{1 \over p(x)}. \end{align} $$ This is what we will call entropy, a measure on the average amount of surprise associated to outcomes from $(\mathcal{X}, p)$. Entropy is maximized when we cannot say with any confidence if an outcome will occur. This upper bound occurs when the probabilities over the set of possible outcomes are uniformly distributed. $$ \begin{align} H(p) \leq \log{|\mathcal{X}|} \end{align} $$ We can also think of entropy as how much information, measured in binary information units (bits), is required to describe outcomes drawn from $(\mathcal{X}, p)$. The way to understand this last part is the logarithm tells us how many bits we need to describe this uncertainty, since $$ \begin{align} \log_2{1 \over p(x)} = n \iff 2^n = {1 \over p(x)}. \end{align} $$ However, any logarithm can be used. Base $e$ and base 10 are also commonly used.
Let $(\mathcal{X}, p)$ be any discrete probability space. The entropy of a probability distribution $p$ with mass function $p$, denoted by $H(p)$, is the average amount of uncertainty found in elementary outcomes from $(\mathcal{X}, p)$. We write $$ \begin{align} H(p) = - \mathbb{E}_{x \sim p} \left[ \log{p(x)} \right]. \end{align} $$ The entropy of a probability distribution tells us how much variation we should expect to see in samples drawn from $(\mathcal{X}, p)$. The probability distribution with maximum entropy is the uniform distribution since all outcomes are equally surprising.
The following figure
depicts the entropy of a probability distribution over two states as a function of the symmetry of the distribution. As the probability of heads $p(H)$ approaches 0 or 1, we see the uncertainty vanishes, and uncertainty is maximized when probability is equally distributed over heads and tails.
The entropy of the probability distribution corresponding to a fair coin toss is 1 bit, and the entropy of $m$ tosses is $m$ bits. If there are two states of equal probability, then we need 1 bit and if we have 3 states of equal probability, we need 1.584963 bits.
We include a definition of a metric below in order to make clear the distinction between it and a divergence, which will be defined afterwards.
A metric $d$ on $\mathcal{X}$ is a function $d(\cdot, \cdot): \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}^+$ such that $\forall x,y,z\in\mathcal{X}$:
$d(x,y))\geq 0$, and $d(x,y)=0\iff x=y$.
$d(x,y)=d(x,y)$
$d(x,z))\leq d(x, y) + d(y, z)$
A divergence is a weaker notion than that of distance. A divergence need not be symmetric nor satisfy the triangle inequality.
Let $\mathcal{P}$ be any space of probability distributions over any finite set $\mathcal{X}$ such that all $p \in \mathcal{P}$ have the same support. A divergence on $\mathcal{P}$ is a function, $\mathcal{D}(\cdot||\cdot):\mathcal{P}\times\mathcal{P}\mapsto \mathbb{R}^+$, such that $\forall p, q, \in \mathcal{P}$ the following conditions are satisfied
$\mathcal{D}(p||q) \geq 0$
$\mathcal{D}(p||q) = 0 \iff p = q$.
The Kullback-Leibler divergence is a measure of how different a probability distribution is from a second, reference probability distribution. It is also known by the following names: relative entropy, directed divergence, information gain and discrimination information. It is defined by $$ \begin{align} KL(p||q)=\sum_{x\in\mathcal{X}} p(x) \log{p(x) \over q(x)}. \end{align} $$ If $p$ and $q$ have the same support, then $KL(p||q) = 0$ if and only if $p = q$.
The Kullback-Leibler divergence is defined only if $p$ is absolutely continuous with respect to $q$, i.e. $\forall x\ q(x) = 0 \implies p(x) = 0$, When $p(x) = 0$, $KL(p||q) = 0$ since $\lim_{x \to 0} x \log{x} = 0$.
For a closed convex set $E \subset \mathcal{P}$, where $\mathcal{P}$ is the space of all probability distributions over a finite set $\mathcal{X}$, and for a distribution $Q \not \in E$ , let $p^* \in E$ be defined by $p^* = \text{argmin}_{P \in E} KL(p||q)$ , then
$$ \begin{align} KL(p||q)\geq KL(p||p^*)+KL(p^*||q). \end{align} $$
The interested reader can consult Theorem 11.6.1 in Cover and Thomas.
The log-likelihood ratio test is used in comparing the goodness-of-fit of one statistical model over another. The Kullback-Leibler divergence of $p$ and $q$ is the average of the log-likelihood ratio test with respect to probabilities defined by $p$. For two models $p(x) = f(x|\theta)$ and $q(x) = f(x | \phi)$ the log-likelihood ratio test is
$$ \begin{align} \lambda(x) & = \log{\prod_{x\in\mathcal{X}}p(x)\over\prod_{x\in\mathcal{X}}q(x)} \\ & = \log \prod_{x\in\mathcal{X}}{p(x)\over q(x)} \\ & = \sum_{x \in \mathcal{X}} \log {p(x) \over q(x)} \end{align} $$
and the average with respect to $p$ is
$$ \begin{align} \mathbb{E}_{p}\left[\lambda(x)\right]=\sum_{x\in\mathcal{X}}p(x)\log{p(x)\over q(x)}. \end{align} $$
One way to think of Generative Adversarial Net (GAN) training is as fitting the Discriminator $D$ and the Generator $G$ to the data via optimizing a goodness-of-fit test since
$$ \begin{align} \min_\phi\max_\theta{1 \over n} \sum_{i=1}^n \log{D_\theta(x_i)} + {1 \over n} \sum_{i=1}^n\log{(1 - D_\theta(G_\phi(z_i)))} \end{align} $$
has the same fixed point as
$$ \begin{align} & \min_\phi\max_\theta{1 \over n} \sum_{i=1}^n \left[\log{D_\theta(x_i)} - \log{(D_\theta(G_\phi(z_i)))}\right] \\ & = \min_\phi\max_\theta{1 \over n} \sum_{i=1}^n \log{\left({D_\theta(x_i) \over D_\theta(G_\phi(z_i))}\right)} \end{align} $$
which is the Kullback-Leibler divergence or the average log-likelihood ratio test. Since $\forall x$ $D_\theta(x) \in [0, 1]$ we can infer that when $D_\theta$ is optimized, it will place a larger amount of mass on $x$ than on $G_\phi(z)$.
The term information gain refers to one interpretation of the Kullback-Leibler divergence. Specifically $KL(p||q)$ is the amount of information gained about the data when $q$ is used to model the data, rather than $p$. Equivalently, the amount of information lost when $q$ is used to approximate $p$.
The reverse Kullback-Leibler divergence is the asymmetrical counterpart.
$$ \begin{align} KL(q||p)=\sum_{x\in\mathcal{X}}q(x)\log{q(x)\over p(x)}. \end{align} $$
The reverse Kullback-Leibler divergence is the average of the log-likelihood ratio test taken with respect to the model $q(x)$, $$ \begin{align} \mathbb{E}_{q} \left[\lambda(x)\right] = \sum_{x \in \mathcal{X}} q(x) \log {q(x) \over p(x)}. \end{align} $$
Minimizing the reverse Kullback-Leibler divergence is not equivalent to maximum likelihood methods.
The Kullback-Leibler divergence is related to another quantity used quite often in machine learning: cross entropy.
The cross entropy of $p$ and $q$ (for a given data set) is the total amount of uncertainty incurred by modelling the data with $q$ rather than $p$.
$$ \begin{align} H(p,q) &=-\sum_{x\in\mathcal{X}}p(x)\log{q(x)}\\ &=-\mathbb{E}_{p}\left[\log{q(x)}\right]. \end{align} $$
The cross entropy of $p$ and $q$ is the sum of the entropy of $p$ and the Kullback-Leibler divergence of $p$ and $q$.
$$ \begin{align} H(p,q) &=-\sum_{x\in\mathcal{X}}p(x)\log q(x)\\ &=-\sum_{x\in\mathcal{X}}p(x)\log p(x)+\sum_{x\in\mathcal{X}}p(x)\log p(x)-\sum_{x\in\mathcal{X}}p(x)\log q(x)\\ &=-\sum_{x\in \mathcal{X}}p(x)\log p(x)+\sum_{x\in\mathcal{X}}p(x)\log{p(x)\over q(x)} \\ &=H(p)+KL(p||q) \end{align} $$
This tells us the lower bound for cross entropy must be the entropy of the probability distribution $p$ over $\mathcal{X}$. Thus, cross entropy is the uncertainty induced by assuming the wrong probability distribution over the data. The additional uncertainty is captured by the Kullback-Leibler divergence. Cross entropy is not symmetric since $H(q,p)=H(q)+KL(q||p)$.
As shown in Goodfellow et al., the generator minimizes an approximation of the Jensen-Shannon divergence.
Let $p$ and $q$ be any two probability distributions over any space $\mathcal{X}$. The Jenson-Shannon divergence of $p$ and $q$ is a symmetrization of the Kullback-Leibler divergence of $p$ and $q$ over $\mathcal{X}$. $$ \begin{align} JSD(p||q) = {1 \over 2} KL\left(p\mid\mid{p+q\over 2}\right) + {1 \over 2} KL\left(q\mid\mid{p+q\over 2}\right) \end{align} $$
The Jensen-Shannon divergence is the average of the Kullback-Leibler divergence and the reverse Kullback-Leibler divergence.
The square root of the Jensen-Shannon divergence is a metric.
Information theory provides us with a measure of dependency, or at least how much information about one probability distribution is contained in another distribution. The following measure are defined in terms of random variables denoted by upper case letters such as $X$ and $Y$.
Let $(\mathcal{X}, p)$ and $(\mathcal{Y}, q)$ and be any two finite probability spaces ($\mathcal{X}$ and $\mathcal{Y}$ need not be distinct) and consider two random variables $X \sim p$ and $Y\sim q$ with joint probability mass function $\gamma$ and marginal probability mass functions $\pi_p \circ \gamma = p$ and $\pi_q \circ \gamma = q$. The mutual information $I(X; Y)$ is the Kullback-Leibler divergence of $\gamma$ and the product of $p$ and $q$ , in other words $$ \begin{align} I(X;Y)=\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}\gamma(x,y)\log{{\gamma(x, y) \over p(x)q(y)}} \end{align} $$
If the random variables $X$ and $Y$ are independent, then $\gamma(x, y) = p(x)\cdot q(y)$ and $I(X; Y) = 0$
Mutual information is a measure of the amount of information contained in one probability distribution about another and makes for a useful measure of statistical dependence.
Mutual information can also be defined in terms of conditional entropy, defined in terms of random variables $X$ and $Y$,
$$ \begin{align} I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X) \end{align} $$where $H(X|Y)$ is the conditional entropy of $X$ given that $Y$ has occurred. In this form the mutual information can be interpreted as the information contained in one probability distribution minus the information contained in the distribution when the other distribution is known.
The relationship different information theoretic quantities is depicted in the following Venn diagram in