Category Archives: Information Theory

Shannon Entropy

Let $X$ be a discrete random variable taking the values $x_1,\cdots,x_n$, and let $p$ be the probability distribution function (pdf for short) of $X$. For each $i=1,\cdots,n$, write $p_i=p(x_i)$, i.e.
$$p_i=P\{X=x_i\}$$
Here is a question: is there a unique way to measure the amount, or degree, of uncertainty represented by the probability distribution? The answer is affirmative, and Claude E. Shannon, in his landmark paper [2], obtained such a measure of uncertainty. It is denoted by $H(p_1,\cdots,p_n)$ and called the entropy. Despite the name, there is no a priori reason to expect this notion of entropy to be related to thermodynamic entropy, because it is obtained independently of any physical hypotheses. Remarkably, however, it coincides with thermodynamic entropy except for the presence of the Boltzmann constant. To distinguish it from thermodynamic entropy, it is often called the information entropy or the Shannon entropy. Shannon imposed the following three conditions on $H$:

  1. $H$ is a continuous function of the $p_i$.
  2. If all the $p_i$ are equal, the quantity $$A(n)=H\left(\frac{1}{n},\cdots,\frac{1}{n}\right)$$ is a monotone increasing function of $n$. Increasing $n$ decreases each probability $\frac{1}{n}$, so the uncertainty (entropy) should increase.
  3. Instead of considering the events $x_1,\cdots,x_n$ with the respective probabilities $p_1,\cdots,p_n$, we may group the first $k$ of them, $x_1,\cdots,x_k$, into a single composite event with probability $w_1=p_1+\cdots+p_k$. The next $m$ events $x_{k+1},\cdots,x_{k+m}$ are grouped into a single composite event with probability $w_2=p_{k+1}+\cdots+p_{k+m}$, and so on. The entropy of these composite events is $H(w_1,\cdots,w_r)$. In addition, given that a composite event has occurred, the events $x_1,\cdots,x_k$ should be assigned the conditional probabilities $\frac{p_1}{w_1},\cdots,\frac{p_k}{w_1}$, the events $x_{k+1},\cdots,x_{k+m}$ the conditional probabilities $\frac{p_{k+1}}{w_2},\cdots,\frac{p_{k+m}}{w_2}$, and so on. In order for our measure of information to be consistent, the ultimate state of knowledge obtained this way should be the same as the one obtained by considering the events $x_1,\cdots,x_n$ with probabilities $p_1,\cdots,p_n$. The argument is summed up as \begin{equation}\begin{aligned}H(p_1,\cdots,p_n)=&H(w_1,\cdots,w_r)+w_1H\left(\frac{p_1}{w_1},\cdots,\frac{p_k}{w_1}\right)\\&+w_2H\left(\frac{p_{k+1}}{w_2},\cdots,\frac{p_{k+m}}{w_2}\right)+\cdots\end{aligned}\label{eq:composition}\end{equation}The appearance of the weighting factors $w_i$ is due to the fact that, for example, we run into the additional uncertainty $H\left(\frac{p_1}{w_1},\cdots,\frac{p_k}{w_1}\right)$ only with probability $w_1$. \eqref{eq:composition} is called the composition law.

Example. $H\left(\frac{1}{2},\frac{1}{3},\frac{1}{6}\right)=H\left(\frac{1}{2},\frac{1}{2}\right)+\frac{1}{2}H\left(\frac{2}{3},\frac{1}{3}\right)$ by grouping the events with probabilities $\frac{1}{3}$ and $\frac{1}{6}$ as a composite event.
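
As a quick numerical sanity check of this identity (using the explicit formula $H=-\sum_ip_i\ln p_i$ that will be derived below, so this is a consistency check rather than part of the derivation), here is a minimal Python sketch; the helper name `H` is my own:

```python
import math

def H(*p):
    # Shannon entropy -sum p_i ln(p_i) with natural logarithm (K = 1)
    return -sum(q * math.log(q) for q in p if q > 0)

# Left-hand side: entropy of the three original events
lhs = H(1/2, 1/3, 1/6)

# Right-hand side: entropy of the two (composite) events plus the
# weighted entropy of the conditional distribution inside the group
rhs = H(1/2, 1/2) + (1/2) * H(2/3, 1/3)

print(lhs, rhs)                 # both are approximately 1.0114
assert math.isclose(lhs, rhs)
```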

Example. Let us consider three events $x_1,x_2,x_3$ with respective probabilities $p_1=\frac{3}{9}$, $p_2=\frac{4}{9}$, $p_3=\frac{2}{9}$. We may view these as three composite events arising from nine equally likely events (each with probability $\frac{1}{9}$) by grouping the first 3 events, the next 4 events, and the remaining 2 events together. Applying \eqref{eq:composition} to the nine equally likely events, and noting that the conditional distribution within each group is uniform, we obtain
\begin{equation}
\label{eq:entropy}
H\left(\frac{3}{9},\frac{4}{9},\frac{2}{9}\right)+\frac{3}{9}A(3)+\frac{4}{9}A(4)+\frac{2}{9}A(2)=A(9)
\end{equation}

In exactly the same manner, if we write $p_i=\frac{n_i}{\sum_in_i}$ with positive integers $n_i$, \eqref{eq:entropy} is readily generalized to
\begin{equation}
\label{eq:entropy2}
H(p_1,\cdots,p_n)+\sum_ip_iA(n_i)=A\left(\sum_in_i\right)
\end{equation}
As a special case, if we choose $n_i=m$ for every $i$, then $p_i=\frac{1}{n}$ and $H(p_1,\cdots,p_n)=A(n)$, so \eqref{eq:entropy2} becomes
\begin{equation}
\label{eq:entropy3}
A(m)+A(n)=A(mn)
\end{equation}
Obviously,
\begin{equation}
\label{eq:entropy4}
A(n)=K\ln n
\end{equation}
(where $K>0$ due to condition 2) is a solution. But is it the unique solution? To answer this, let us consider a continuum extension $A(x)$ of $A(n)$ that is at least twice differentiable. Then \eqref{eq:entropy3} can be written as the functional equation
\begin{equation}
\label{eq:entropy5}
A(x)+A(y)=A(xy)
\end{equation}
Differentiating \eqref{eq:entropy5} with respect to $x$, we obtain
\begin{equation}
\label{eq:entropy6}
A'(x)=yA'(xy)
\end{equation}
Differentiating \eqref{eq:entropy6} with respect to $y$, we obtain
\begin{equation}
\label{eq:entropy7}
0=A'(xy)+xyA^{\prime\prime}(xy)
\end{equation}
By introducing a new variable $z=xy$, \eqref{eq:entropy7} can be written as the second-order linear differential equation
$$zA^{\prime\prime}(z)+A'(z)=0$$
whose solution is
$$A(z)=K\ln(Cz)$$
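
Here is a small computer-algebra check of this step, a sketch using sympy (the symbol and function names are my own choices):

```python
import sympy as sp

z, x, y = sp.symbols('z x y', positive=True)
K, C = sp.symbols('K C', positive=True)
A = sp.Function('A')

# Solve the ODE z*A''(z) + A'(z) = 0
ode = sp.Eq(z * A(z).diff(z, 2) + A(z).diff(z), 0)
print(sp.dsolve(ode, A(z)))   # A(z) = C1 + C2*log(z), i.e. of the form K*ln(C*z)

# Residual of the functional equation A(x) + A(y) - A(x*y)
# for the trial solution A(z) = K*ln(C*z)
trial = K * sp.log(C * z)
residual = trial.subs(z, x) + trial.subs(z, y) - trial.subs(z, x * y)
print(sp.expand(sp.expand_log(residual)))   # K*log(C)
```

The residual $K\ln C$ shows that the functional equation is satisfied only when $C=1$, which is used below.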
That is, we have $A(n)=K\ln(Cn)$, where $K>0$ (by condition 2) and $C>0$ are constants. Substituting this back into \eqref{eq:entropy5} (or simply setting $x=y=1$, which gives $A(1)=0$) forces $C=1$. Hence $A(n)=K\ln n$ is the unique solution up to the constant multiple $K$. Since $K$ merely fixes the unit in which uncertainty is measured, we may choose $K=1$, but for now we will keep it. Substituting \eqref{eq:entropy4} into \eqref{eq:entropy2}, we finally obtain the Shannon entropy
\begin{equation}
\label{eq:shannon}
\begin{aligned}
H(p_1,\cdots,p_n)&=K\ln\left(\sum_in_i\right)-K\sum_ip_i\ln n_i\\
&=-K\sum_ip_i\ln\left(\frac{n_i}{\sum_in_i}\right)\\
&=-K\sum_ip_i\ln p_i
\end{aligned}
\end{equation}
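
The derivation can also be checked numerically. The sketch below (assuming natural logarithms and $K=1$) computes the Shannon entropy and verifies the grouping identity \eqref{eq:entropy2} for the earlier example $n=(3,4,2)$:

```python
import math

def shannon_entropy(p, K=1.0):
    # H(p) = -K * sum p_i ln(p_i); terms with p_i = 0 contribute nothing
    return -K * sum(q * math.log(q) for q in p if q > 0)

def A(n, K=1.0):
    # Entropy of n equally likely outcomes, A(n) = K ln(n)
    return K * math.log(n)

n = [3, 4, 2]
N = sum(n)
p = [ni / N for ni in n]

# H(p) + sum_i p_i A(n_i) should equal A(sum_i n_i)
lhs = shannon_entropy(p) + sum(pi * A(ni) for pi, ni in zip(p, n))
assert math.isclose(lhs, A(N))

print(shannon_entropy(p))   # approximately 1.0609
```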

Definition. There is a generalization of Shannon entropy called the Rényi entropy
\begin{equation}
\label{eq:renyi}
H_r(p_1,\cdots,p_n)=\frac{1}{1-r}\ln\left(\sum_ip_i^r\right)
\end{equation}
for $0<r<\infty$ and $r\ne 1$.

Proposition 1. $$\lim_{r\to 1}H_r(p_1,\cdots,p_n)=-\sum_ip_i\ln p_i$$
i.e., Shannon entropy (with $K=1$) is the limit of the Rényi entropy as $r\to 1$.

Proof. Let $f(r)=\ln\left(\sum_ip_i^r\right)$. Since $f(1)=\ln\left(\sum_ip_i\right)=0$, we have $$\lim_{r\to 1}H_r(p_1,\cdots,p_n)=\lim_{r\to 1}\frac{f(r)-f(1)}{1-r}=-f'(1).$$ Since $f'(r)=\frac{\sum_ip_i^r\ln p_i}{\sum_ip_i^r}$, it follows that $-f'(1)=-\sum_ip_i\ln p_i$.
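
Proposition 1 is easy to illustrate numerically; the following sketch (with an arbitrarily chosen distribution) shows the Rényi entropy approaching the Shannon entropy as $r\to 1$:

```python
import math

def shannon(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def renyi(p, r):
    # Rényi entropy of order r (r > 0, r != 1)
    return math.log(sum(q ** r for q in p)) / (1 - r)

p = [0.5, 0.3, 0.2]
for r in (0.5, 0.9, 0.99, 0.999, 1.001, 1.01, 1.1, 2.0):
    print(r, renyi(p, r))

print("Shannon:", shannon(p))   # approximately 1.0297; the values above approach this as r -> 1
```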

Definition. Tsallis entropy is another generalization of Shannon entropy. It is defined by
\begin{equation}
\label{eq:tsallis}
S_r(p_1,\cdots,p_n)=\frac{1}{r-1}\left(1-\sum_ip_i^r\right)
\end{equation}

Here again $0<r<\infty$ and $r\ne 1$. One can easily show, similarly to the proof of Proposition 1, that the limit of the Tsallis entropy as $r\to 1$ is the Shannon entropy (with $K=1$).
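
As with the Rényi entropy, the limit can be illustrated numerically; a minimal sketch with an arbitrarily chosen distribution:

```python
import math

def shannon(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def tsallis(p, r):
    # Tsallis entropy of order r (r != 1)
    return (1 - sum(q ** r for q in p)) / (r - 1)

p = [0.5, 0.3, 0.2]
for r in (0.9, 0.99, 0.999, 1.001, 1.01, 1.1):
    print(r, tsallis(p, r))

print("Shannon:", shannon(p))   # the Tsallis values approach this as r -> 1
```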

References:

  1. E. T. Jaynes, Information Theory and Statistical Mechanics, The Physical Review, Vol. 106, No. 4, 620-630, May 15, 1957
  2. C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal, Vol. 27, 379-423, 623-656, 1948