An example of using information-theoretic quantities for outcomes

Following MacKay (2003), we will derive Stirling's approximation for binomial coefficients using a neat notation for information-theoretic quantities.

A quick revision of the notation in "A Practical & Unified Notation for Information-Theoretic Quantities in ML":

All quantities we will use below can be consistently defined as follows:

(For random variables \(R\) and \(B\) with outcomes \(r\) and \(b\), respectively, we shorten “\(R=r\)” and “\(B=b\)” to “\(r\)” and “\(b\)” in expressions.)

- \(H[r] := h(p(r)) = -\log p(r),\)
- \(H[b,r] := h(p(b,r)) = -\log p(b, r),\)
- \(H[b \mid r ] := h(p(b \mid r)) = -\log p(b \mid r),\)
- \(H[B,r] := \mathbb{E}_{p(b|r)} H[b, r] = -\mathbb{E}_{p(b|r)} \log p(b, r),\) and
- \(H[B \mid r] := \mathbb{E}_{p(b|r)} H[b \mid r] = -\mathbb{E}_{p(b|r)} \log p(b \mid r).\)

\(\Hof{r}\) denotes the number of nats needed to encode the event \(R=r\), given the probability distribution \(\pof{r}\). Similarly, \(\Hof{B, r}\) denotes the number of nats needed to encode both \(B\) and \(R\) when \(R=r\), unbeknownst to the encoding scheme.

The main idea behind this notation is that it is consistent with taking expectations over the quantities to obtain the regular entropies, and so on. Most rules for information-theoretic quantities also apply directly to quantities that mix random variables and outcomes. More information and intuitions can be found in the paper.

We will start by introducing a simple *probabilistic model* that will allow us to formulate the approximation as an upper bound. We will then show how the upper bound follows from dropping one term, and finally pick the success probability to tighten the bound as much as possible. This sounds more complicated than it is, and it is a fun way to show some tools that are often employed in mathematics.

We will relate the quantities we examine to information theory (as a means of quantifying optimal communication) by asking: *what message does the quantity represent?* This will hopefully provide additional intuitions to the reader.

**Setup.** Let \(B_1, \ldots, B_N\) be \(N\) Bernoulli random variables with success probability \(\rho\), and let \(B\) be the joint of these random variables.

Further, let \(R\) be the random variable that counts the number of successes in \(B\). \(R\) follows a Binomial distribution with success probability \(\rho\) and \(N\) trials.
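As a quick sanity check of this setup, we can enumerate all \(2^N\) outcomes of \(B\) for a small \(N\) and confirm that the induced distribution of \(R\) is indeed binomial (a minimal Python sketch; `N`, `rho`, and the helper names are our own):

```python
import math
from itertools import product

N, rho = 6, 0.3

def p_joint(b):
    """p(b) for a joint outcome b of N independent Bernoulli(rho) variables."""
    r = sum(b)
    return rho**r * (1 - rho)**(N - r)

def p_count(r):
    """p(R = r), obtained by summing p(b) over all b with exactly r successes."""
    return sum(p_joint(b) for b in product([0, 1], repeat=N) if sum(b) == r)

# The induced distribution of R matches the Binomial(N, rho) pmf.
for r in range(N + 1):
    binom_pmf = math.comb(N, r) * rho**r * (1 - rho)**(N - r)
    assert abs(p_count(r) - binom_pmf) < 1e-12
```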

**Main Idea.** For a given outcome \(r\) of \(R\), we have: \[\begin{equation}
\Hof{B, r} = \Hof{B \given r} + \Hof{r} \ge \Hof{B \given r},
\end{equation}\] as \(\Hof{\cdot}\) is non-negative for discrete random variables. We will examine this inequality to obtain the approximation.

Note that \(\Hof{B \given r}\) is the additional number of nats needed to encode \(B\) when the number of successes is already known. Similarly, \(\Hof{B, r}\) is the number of nats needed to encode both \(B\) and \(R\) under the circumstance that \(R=r\).
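For small \(N\), the decomposition \(\Hof{B, r} = \Hof{B \given r} + \Hof{r}\) and the resulting inequality can be verified by brute force (a sketch assuming the same Bernoulli model; the helper names are illustrative):

```python
import math
from itertools import product

N, rho, r = 6, 0.3, 2

def p_joint(b):
    """p(b) = rho^{#successes} * (1 - rho)^{#failures}."""
    k = sum(b)
    return rho**k * (1 - rho)**(N - k)

# All configurations of B consistent with R = r, and p(R = r).
outcomes = [b for b in product([0, 1], repeat=N) if sum(b) == r]
p_r = sum(p_joint(b) for b in outcomes)

H_r = -math.log(p_r)  # H[r]
# p(b | r) = p(b) / p(r) for consistent b; p(b, r) = p(b) since R is determined by B.
H_B_given_r = -sum((p_joint(b) / p_r) * math.log(p_joint(b) / p_r) for b in outcomes)
H_B_joint_r = -sum((p_joint(b) / p_r) * math.log(p_joint(b)) for b in outcomes)

assert abs(H_B_joint_r - (H_B_given_r + H_r)) < 1e-12  # chain rule holds
assert H_B_joint_r >= H_B_given_r                      # the inequality we exploit
```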

**Determining \(\Hof{B, r}\).** \(R\) is fully determined by \(B\), and thus we have \(\Hof{B, R}=\Hof{B}\) and hence also \[\begin{equation}
\Hof{B, r} = \simpleE{\pof{b \given r}} \Hof{b} = -r \log \rho - (N - r) \log (1-\rho),
\end{equation}\] since \(\pof{b, r} = \pof{b}\) and every outcome \(b\) with exactly \(r\) successes has probability \(\pof{b} = \rho^r \, (1-\rho)^{N-r}\).

**Alternative Argument.** We can also look at the terms \(\Hof{B \given r} + \Hof{r}\) separately. We have \[\begin{equation}
\Hof{r} = -\log \pof{r} = - \log \left ( \binom{N}{r} \, \rho^r \,(1-\rho)^{N-r}\right ),
\end{equation}\] and \[\begin{equation}
\Hof{B \given r} = -\simpleE{\pof{b \given r}} \log \pof{b \given r} =
\log \binom{N}{r}.
\end{equation}\] The former follows from \(R\) being binomially distributed.

For the latter, observe that we need to encode \(B\) when \(r\) is already known. Given \(r\), \(\pof{b \given r}\) is constant for all valid \(b\), and there are \(\binom{N}{r}\) such configurations. Hence, we can simply create a table of all configurations with \(r\) successes and encode an index into this table. Since each configuration with \(r\) successes is equally likely, we have a uniform discrete distribution with entropy \(\log \binom{N}{r}\) and obtain the same result.
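We can confirm by enumeration that \(\pof{b \given r}\) is indeed uniform over the \(\binom{N}{r}\) consistent configurations and that its entropy equals \(\log \binom{N}{r}\) (an illustrative sketch with our own variable names):

```python
import math
from itertools import product

N, rho, r = 6, 0.3, 2

def p_joint(b):
    k = sum(b)
    return rho**k * (1 - rho)**(N - k)

outcomes = [b for b in product([0, 1], repeat=N) if sum(b) == r]
p_r = sum(p_joint(b) for b in outcomes)
cond = [p_joint(b) / p_r for b in outcomes]  # p(b | r) for each consistent b

# p(b | r) is the same for every configuration with r successes ...
assert all(abs(q - cond[0]) < 1e-12 for q in cond)
# ... so encoding an index into a table of comb(N, r) configurations is optimal.
assert len(outcomes) == math.comb(N, r)
entropy = -sum(q * math.log(q) for q in cond)
assert abs(entropy - math.log(math.comb(N, r))) < 1e-12
```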

**Determining \(\rho\).** With this, we are almost done. We already have \[\begin{align}
\Hof{B \given r} + \Hof{r} &= -r \log \rho - (N - r) \log (1-\rho) \notag \\
&\ge
\log \binom{N}{r} = \Hof{B \given r}.
\end{align}\] How do we make this inequality as tight as possible?

We need to minimize the gap \(\Hof{r}\), which creates the inequality in the first place. Since \(\Hof{r}=-\log \pof{r}\), it is minimized exactly when \(\pof{r}\) is maximal.

Hence, we choose the success probability \(\rho\) to do so: the maximum-likelihood solution \(\argmax_\rho \pof{r \given \rho}\) is \(\rho = \frac{r}{N}\). The Binomial distribution of \(R\) then has its mode, mean, and median at \(r\).
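As a quick numerical check of the maximum-likelihood solution, we can scan a grid of success probabilities and confirm that the maximizer of \(\pof{r \given \rho}\) lands at \(\rho = \frac{r}{N}\) (a sketch with illustrative values for `N` and `r`):

```python
import math

N, r = 20, 7

def log_p(rho):
    """log p(r | rho) under the Binomial(N, rho) distribution."""
    return math.log(math.comb(N, r)) + r * math.log(rho) + (N - r) * math.log(1 - rho)

# Scan a fine grid of success probabilities; the maximizer should be rho = r / N.
grid = [i / 10000 for i in range(1, 10000)]
best = max(grid, key=log_p)
assert abs(best - r / N) < 1e-3
```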

Altogether, after substituting \(\rho = \frac{r}{N}\) and rearranging, we obtain the desired approximation: \[\begin{align} \log \binom{N}{r} &\le -r \log \rho - (N - r) \log (1-\rho) \\ & = r \log \frac{N}{r} + (N - r) \log \frac{N}{N-r}. \end{align}\]
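The bound can also be checked numerically for large \(N\) by evaluating the exact \(\log\) binomial coefficient via log-gamma functions (a sketch; the function names are our own):

```python
import math

def log_binom(N, r):
    """Exact log of the binomial coefficient via log-gamma."""
    return math.lgamma(N + 1) - math.lgamma(r + 1) - math.lgamma(N - r + 1)

def stirling_bound(N, r):
    """r log(N/r) + (N - r) log(N/(N - r))."""
    return r * math.log(N / r) + (N - r) * math.log(N / (N - r))

# The upper bound holds across a wide range of N and r.
for N in [10, 100, 10_000, 1_000_000]:
    for r in [1, N // 3, N // 2, N - 1]:
        assert log_binom(N, r) <= stirling_bound(N, r) + 1e-9
```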

**Approximation Error \(\Hof{r}\).** The approximation error is exactly \(\Hof{r}\). We can easily upper-bound it by \(\Hof{r} \le \log (N+1)\):

First, \(\Hof{R} \le \log (N+1)\), because the uniform distribution over the \(N+1\) possible outcomes \(\{0, \ldots, N\}\) is the maximum-entropy distribution in this case (a discrete random variable with finite support), and it has entropy \(\log (N+1)\).

Second, \(\Hof{R}\) is the expectation over the individual \(\Hof{r'}\). For the \(\rho\) we have chosen, \(\pof{r}\) is maximal among the \(\pof{r'}\), so \(\Hof{r}\) is minimal among the \(\Hof{r'}\). A minimum never exceeds the average, and hence \(\Hof{r} \le \Hof{R} \le \log (N+1)\).
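We can also check numerically that the approximation error stays below the entropy of a uniform distribution over the \(N+1\) possible counts (a sketch with illustrative values; the function name is our own):

```python
import math

def approx_error(N, r):
    """H[r] = -log p(r) under Binomial(N, rho) with rho = r / N."""
    rho = r / N
    log_pmf = (math.lgamma(N + 1) - math.lgamma(r + 1) - math.lgamma(N - r + 1)
               + r * math.log(rho) + (N - r) * math.log(1 - rho))
    return -log_pmf

# The error is bounded by the maximum entropy log(N + 1) of the N + 1 counts.
for N in [10, 100, 10_000]:
    for r in [1, N // 4, N // 2, N - 1]:
        assert approx_error(N, r) <= math.log(N + 1) + 1e-9
```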

In Figure 2 below, we show an empirical evaluation of our approximation against the true \(\log\) binomial coefficients for \(N\) up to \(10^5\), with \(r \approx \tfrac{N}{2}\).

We have derived Stirling's approximation for binomial coefficients and upper-bounded its approximation error, gaining new intuitions along the way, using only simple probability theory and information theory.

We would like to thank Yarin Gal, Tim Rudner, Ravid Shwartz-Ziv, Clare Lyle, Joost van Amersfoort, as well as the members of OATML in general for their feedback.

For more blog posts by OATML in Oxford, check out our group’s blog https://oatml.cs.ox.ac.uk/blog.html.