Unifying Approaches in Active Learning and Active Sampling

Our paper “Unifying Approaches in Active Learning and Active Sampling via Fisher Information and Information-Theoretic Quantities” was recently published in TMLR.

It connects several contemporary “SotA” data subset selection methods (i.e. active learning & active sampling) in deep learning to information(-theoretic) quantities from Bayesian deep learning: BALD (expected information gain), EPIG (expected predictive information gain) for active learning; and the information gain and joint predictive information gain for active sampling. Specifically, we relate these quantities to BADGE, BAIT, PRISM, SIMILAR, and GraNd.

Our paper also provides a gentle tutorial on Fisher information and its Hessian-based approximations in deep neural networks, focusing on how to compute various information-theoretic quantities with it. We also show how similarity matrices fit into this picture; they are often used by recent methods such as SIMILAR and PRISM, mentioned above.

This paper is the first (theoretical) part of a hopefully two-part work: in a follow-up project, we will look at empirical validation and answer some of the research questions we pose throughout the paper.

Outline. We first explain what active learning and active sampling are about. Then, we introduce a unified notation for Fisher information, which we think is quite nice and hopefully useful, before looking at approximations for common information quantities and how they relate to the recent and existing literature.

Data Subset Selection: Active Learning and Active Sampling

Active learning and active sampling (under the umbrella of data subset selection) are important topics in deep learning because they promise to do more with less: active learning reduces the amount of data we need to annotate by focusing on the most “informative” samples, and active sampling reduces the amount of labeled data we need to train on by sub-selecting from the already-labeled samples. Thus, active learning is done without knowing the labels, while active sampling has access to them. Together, the two are aptly referred to as data subset selection.

Over the years, many papers have been published on this, including in deep learning and Bayesian deep learning. This is not a new problem: we can trace the origins of determining the most informative samples back to Lindley (1956) in Bayesian optimal experimental design and, more recently, MacKay (1992).

Our work explicitly connects “non-Bayesian” methods to these original Bayesian works and, we hope, shows that some intuitions and fuzzy definitions of “informativeness” are approximations of the same quantities.

$$\require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand\MidSymbol[1][]{\:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\w}{\boldsymbol{\omega}} \newcommand{\W}{\boldsymbol{\Omega}} \newcommand{\y}{y} \newcommand{\Y}{Y} \newcommand{\x}{\boldsymbol{x}} \newcommand{\X}{\boldsymbol{X}} \newcommand{\wstar}{\boldsymbol{\omega}^*} \newcommand{\HofHessian}[1]{\opEntropy''[#1]} \newcommand{\HofJacobian}[1]{\opEntropy'[#1]} \DeclareMathOperator{\tr}{tr} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\Ypred}{\Y} \newcommand{\ypred}{\y} \newcommand{\Ytrue}{\Y} \newcommand{\ytrue}{\y} \newcommand{\logits}{{\hat z}} \newcommand{\Dtrain}{{\mathcal{D}^\text{train}}} \newcommand{\Dtest}{{\mathcal{D}^\text{test}}} \newcommand{\Dacq}{{\mathcal{D}^\text{acq}}} \newcommand{\Dpool}{{\mathcal{D}^\text{pool}}} \newcommand{\Deval}{{\mathcal{D}^\text{eval}}} \newcommand{\Dany}{{\mathcal{D}}} \newcommand{\xacq}{{\x^\text{acq}}} \newcommand{\Yacq}{{\Y^\text{acq}}} \newcommand{\yacq}{{\y^\text{acq}}} \newcommand{\xacqs}{{\{\x^\text{acq}_i\}}} \newcommand{\yacqs}{{\{\y^\text{acq}_i\}}} \newcommand{\Yacqs}{{\{\Y^\text{acq}_i\}}} \newcommand{\yacqsstar}{{\{\y^\text{acq,*}_i\}}} \newcommand{\xs}{{\{\x_i\}}} \newcommand{\ys}{{\{\y_i\}}} \newcommand{\Ys}{{\{\Y_i\}}} \newcommand{\Xeval}{{\X^\text{eval}}} \newcommand{\xeval}{{\x^\text{eval}}} \newcommand{\Yeval}{{\Y^\text{eval}}} \newcommand{\yeval}{{\y^\text{eval}}} \newcommand{\xevals}{{\{\x^\text{eval}_i\}}} \newcommand{\yevals}{{\{\y^\text{eval}_i\}}} \newcommand{\Yevals}{{\{\Y^\text{eval}_i\}}} \newcommand{\xtrain}{{\x^\text{train}}} \newcommand{\ytrain}{{\y^\text{train}}} \newcommand{\Ytrain}{{\Y^\text{train}}} \newcommand{\pdata}[1]{\hpcof{}{#1}} \newcommand{\encoder}[1]{{\hat f(#1)}} \newcommand{\realencoder}[1]{{f (#1)}} \newcommand{\embedding}{{z}} $$

A Unified Notation for Fisher Information and Information Quantities

Our paper presents a unified notation for information quantities, loss gradients, and Hessians, which directly connects Fisher information to information quantities and allows for expressive symbolic manipulation.

Probabilistic Model. We use a Bayesian perspective for this exposition. What makes it Bayesian is that we also use a distribution over the parameters \(\W\). Concretely, we assume the following factorization into likelihood and parameter prior: \[\pof{\y, \w \mid \x} = \pof{\y \mid \x, \w} \, \pof{\w}.\] Finally, to make a prediction, we marginalize over the parameters (Bayesian model average): \[\pof{\y \mid \x} = \E{\pof{\w}}{\pof{\y \mid \x, \w}}.\]
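
To make the Bayesian model average concrete, here is a minimal numpy sketch (the toy model in `predict_probs` and the weight sampler are hypothetical stand-ins, not code from the paper): we approximate the expectation over \(\pof{\w}\) by averaging the predictive distributions of sampled parameters.

```python
import numpy as np

def predict_probs(x, w):
    """Hypothetical likelihood p(y | x, w): a softmax over linear logits."""
    logits = x @ w
    logits = logits - logits.max()  # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def bayesian_model_average(x, weight_samples):
    """p(y | x) = E_{p(w)}[p(y | x, w)], approximated by Monte Carlo."""
    return np.mean([predict_probs(x, w) for w in weight_samples], axis=0)

rng = np.random.default_rng(0)
weight_samples = rng.normal(size=(100, 5, 3))  # 100 draws of w ~ p(w)
x = rng.normal(size=5)
print(bayesian_model_average(x, weight_samples))  # a distribution over 3 classes
```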

Notation for Information Quantities. A central quantity in information theory is the information content: \(-\log p\) of the probability \(p\) of an event: \[\xICof{p} := -\log p.\] We commonly use this information content as a minimization objective in deep learning (as the negative log-likelihood or via the cross-entropy).

The entropy is just the expectation over the information content: \[\Hof{X} = \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{\xICof{\pof{x}}} = \E{\pof{x}}{-\log \pof{x}},\] where we use the following convention (introduced in “A Practical & Unified Notation for Information-Theoretic Quantities in ML” (Kirsch et al., 2021)): when we have an expression with (upper-case) random variables and (lower-case) observations, we take an expectation over all the random variables conditional on all the given observations. For example, the joint entropy \(\Hof{X, y \given Z}\) is: \[\Hof{X, y \given Z}= \E{\pof{x,z \given y}}{\Hof{x,y \given z}}.\]
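
As a small sanity check of this convention, here is a numpy sketch with a hypothetical discrete joint distribution (the distribution and the choice \(y = 1\) are made up): we compute \(\Hof{X, y \given Z}\) by taking the expectation over \(\pof{x, z \given y}\) of the information content \(-\log \pof{x, y \given z}\).

```python
import numpy as np

rng = np.random.default_rng(0)
p_xyz = rng.random((2, 3, 4))
p_xyz /= p_xyz.sum()  # a random joint distribution p(x, y, z)

y = 1  # the given observation
p_z = p_xyz.sum(axis=(0, 1))                          # p(z)
p_xy_given_z = p_xyz / p_z                            # p(x, y | z)
p_xz_given_y = p_xyz[:, y, :] / p_xyz[:, y, :].sum()  # p(x, z | y)

# H[X, y | Z] = E_{p(x, z | y)}[-log p(x, y | z)]
H = -(p_xz_given_y * np.log(p_xy_given_z[:, y, :])).sum()
print(H)
```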

Notation for Loss Gradients and Hessians. We also define the first-order (gradient) and second-order derivatives (Hessian) of \(H\) for a given observation as follows: \[\begin{aligned} \HofJacobian{x} &:= -\nabla_\w \log \pof{x}, \\ \HofHessian{x} &:= -\nabla_\w^2 \log \pof{x}. \end{aligned}\] We use the same convention as above for random variables by taking the expectation. For example: \[\HofHessian{\Y \given \x, \w} = \E{\pof{\y \given \x, \w}}{\HofHessian{\y \given \x, \w}}.\]

If we fix a \(\wstar\), then we have:

Observed Information and Fisher Information. The observed information of \(\y \given \x\) at \(\wstar\) is the Hessian: \[\HofHessian{\y \given \x, \wstar},\] and the Fisher information is the expectation over \(\y \given \x, \wstar\) of the observed information: \[\HofHessian{\Y \given \x, \wstar} = \E{\pof{\y \given \x, \wstar}}{\HofHessian{\y \given \x, \wstar}}.\]

Note: the observed information is sometimes defined with the opposite sign in prior literature, but our definition here makes the connection between the two clearer: the Fisher information is just the observed information without knowledge of the label \(\y\).
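
To illustrate the difference, here is a toy numpy sketch (a contrived one-parameter model, not from the paper): for a likelihood whose logit is nonlinear in the parameter, the observed information depends on the observed label \(y\), while the Fisher information averages it out under \(\pof{\y \given \x, \wstar}\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(y, x, w):
    """H[y | x, w]: negative log-likelihood with a logit nonlinear in w."""
    f = x * w + 0.5 * w**3  # nonlinear in the parameter, so not a GLM
    p = sigmoid(f)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def observed_information(y, x, w, eps=1e-4):
    """Finite-difference second derivative of the NLL at w: H''[y | x, w]."""
    return (nll(y, x, w + eps) - 2 * nll(y, x, w) + nll(y, x, w - eps)) / eps**2

x, w_star = 1.5, 0.7
p1 = sigmoid(x * w_star + 0.5 * w_star**3)  # p(y=1 | x, w*)
# Fisher information: expectation of the observed information over y.
fisher = (1 - p1) * observed_information(0, x, w_star) + p1 * observed_information(1, x, w_star)
print(observed_information(0, x, w_star), observed_information(1, x, w_star), fisher)
```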

Generally, we can take the properties and rules of the entropy and use the linearity of differentiation to transfer them to the Fisher information and observed information. Thus, both Fisher information and observed information are additive. For example: \[\begin{aligned} &\Hof{\wstar \given \y, \x} = \Hof{\y \given \x, \wstar} + \Hof{\wstar} - \Hof{\y \given \x}\\ \implies &\HofHessian{\wstar \given \y, \x} = \HofHessian{\y \given \x, \wstar} + \HofHessian{\wstar} - \HofHessian{\y \given \x} \\ \iff &\HofHessian{\wstar \given \y, \x} = \HofHessian{\y \given \x, \wstar} + \HofHessian{\wstar}, \end{aligned}\] where we can drop \(\HofHessian{\y \given \x}\) because \(\pof{\y \given \x}\) does not depend on \(\w\), so its Jacobian and Hessian are both zero. This shows that the Fisher information and observed information only depend on the likelihood and prior.

The benefit of this notation is that we can use intuitions for information theory to operate on the Fisher information and observed information, while it also makes clear how these quantities are connected to the entropy (as second-order derivatives).

We can use the Fisher information to approximate the entropy:

When we use a Gaussian approximation of the posterior distribution \(\pof{\w \given \y, \x}\) around a fixed \(\wstar\), we have the following relationship between entropy and Hessian: \[\Hof{\W} \approx -\frac{1}{2} \log \det \HofHessian{\wstar} + C_k,\] where \(C_k = \frac{k}{2} \log (2 \pi e)\) is a constant that depends only on the dimensionality \(k\) of \(\W\).

Note: when \(\wstar\) is a (global) optimum, this is also known as the Laplace approximation.
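
As a numerical sanity check (a sketch, not the paper's code): for an exact Gaussian with precision matrix \(\HofHessian{\wstar}\), the relationship above holds with equality, which we can verify against the standard Gaussian entropy formula.

```python
import numpy as np

def entropy_from_hessian(hessian):
    """H[W] of a Gaussian with precision matrix H'': -1/2 log det H'' + k/2 log(2 pi e)."""
    k = hessian.shape[0]
    sign, logdet = np.linalg.slogdet(hessian)
    assert sign > 0, "Hessian must be positive definite"
    return -0.5 * logdet + 0.5 * k * np.log(2 * np.pi * np.e)

# Cross-check against the standard entropy of N(mu, Sigma) with Sigma = H''^{-1} (k = 2).
H = np.array([[2.0, 0.3], [0.3, 1.0]])
sign, logdet_cov = np.linalg.slogdet(np.linalg.inv(H))
print(entropy_from_hessian(H), 0.5 * logdet_cov + np.log(2 * np.pi * np.e))  # identical
```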

Generalized Gauss-Newton Approximation & Last-Layer Approximations

Computing second-order derivatives of neural networks is usually not feasible because of the memory and computational requirements. Instead, approximations are used.

We will use our notation to explain and introduce the generalized Gauss-Newton approximation and discuss last-layer approximations.

Note that to compute the Fisher information, we need to take an expectation. This can be expensive, particularly in the batch acquisition case, where we might need to iterate over many \(\y\)s.

For some models, however, the Fisher information and observed information are independent of the specific \(\y\):

Generalized Linear Model. A generalized linear model (GLM) is a model \(\pof{y \given \logits=\encoder{x ; \w}}\) such that \(\log \pof{y \given \logits} = \logits^T T(y) - A(\logits) + \log h(y)\) is a distribution of the exponential family, independent of \(\w\), and \(\encoder{x ; \w} = \w^T \, x\) is linear in the parameters \(\w\).

For such a GLM, we have: \[\HofHessian{\Y \given \x, \wstar} = \HofHessian{\y \given \x, \wstar},\] where \(\y\) can be any target.
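
As a quick numerical check (a toy logistic-regression sketch, analogous to the earlier one but with a logit that is linear in the parameter): the Hessian of the NLL is now the same for every label, so the observed information equals the Fisher information.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(y, x, w):
    """Negative log-likelihood of logistic regression, a GLM."""
    p = sigmoid(x * w)  # logit is linear in the parameter w
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def observed_information(y, x, w, eps=1e-4):
    return (nll(y, x, w + eps) - 2 * nll(y, x, w) + nll(y, x, w - eps)) / eps**2

x, w_star = 1.5, 0.7
h0 = observed_information(0, x, w_star)
h1 = observed_information(1, x, w_star)
print(np.isclose(h0, h1))  # True: the observed information does not depend on y
```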

If we use this equality for other models where it might not hold, we essentially apply the generalized Gauss-Newton approximation:

Generalized Gauss-Newton Approximation. In the case of an exponential family that is not a GLM, the equality above is often applied regardless, using the respective Fisher information as an approximation of the observed information: \[\HofHessian{\y \given \x, \wstar} \approx \HofHessian{\Y \given \x, \wstar}.\] This is known as the generalized Gauss-Newton (GGN) approximation (Kunstner et al., 2019; Immer et al., 2020). This approximation has the advantage that it is always positive semi-definite, unlike the true Hessian.
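
For softmax outputs, the GGN/Fisher takes a particularly simple form. A minimal numpy sketch (the shapes and the Jacobian are made up, not real network quantities): with logits \(\logits\) and \(p = \mathrm{softmax}(\logits)\), the Hessian of the NLL with respect to the logits is \(\mathrm{diag}(p) - p p^T\) for every label, so the GGN/Fisher is \(J^T (\mathrm{diag}(p) - p p^T) J\) with \(J = \partial \logits / \partial \w\).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ggn_fisher(jacobian, logits):
    """GGN/Fisher: J^T (diag(p) - p p^T) J with J = d logits / d w at w*."""
    p = softmax(logits)
    output_hessian = np.diag(p) - np.outer(p, p)  # Hessian of the NLL w.r.t. the logits
    return jacobian.T @ output_hessian @ jacobian

rng = np.random.default_rng(0)
J = rng.normal(size=(3, 4))  # 3 classes, 4 parameters (a made-up Jacobian)
fisher = ggn_fisher(J, rng.normal(size=3))
print(np.all(np.linalg.eigvalsh(fisher) >= -1e-12))  # always PSD, unlike the true Hessian
```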

Last-Layer Approaches. GLMs are often used in deep active learning (Ash et al., 2019; 2021; Kothawade et al., 2021; 2022). If we split the model into \(\pof{y \given x, \w} = \pof{y \given \logits = \w^T \, \realencoder{x}}\), where \(\embedding = \realencoder{x}\) are the embeddings, and treat the encoder \(\realencoder{x}\) as fixed, we obtain a GLM based on the weights of the last layer, which uses the embeddings as input.
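
Here is a small numpy sketch of the last-layer case (assumed shapes, not the papers' code): with logits \(\logits = V \embedding\) for a fixed embedding \(\embedding = \realencoder{\x}\), the Jacobian of the logits with respect to \(\mathrm{vec}(V)\) is \(I \otimes \embedding^T\), so the last-layer Fisher factorizes as a Kronecker product.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
num_classes, emb_dim = 3, 8
emb = rng.normal(size=emb_dim)               # embedding z = f(x), encoder kept fixed
V = rng.normal(size=(num_classes, emb_dim))  # last-layer weights
p = softmax(V @ emb)

# Fisher w.r.t. vec(V): kron(diag(p) - p p^T, emb emb^T), by the mixed-product
# property, since the Jacobian of the logits w.r.t. vec(V) is kron(I, emb^T).
fisher = np.kron(np.diag(p) - np.outer(p, p), np.outer(emb, emb))
print(fisher.shape)  # (num_classes * emb_dim, num_classes * emb_dim)
```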

The GGN approximation and last-layer approaches feature heavily in the literature to make computing these quantities more tractable as they reduce computational requirements and memory usage.

Information Quantities in Active Learning and Active Sampling and Their Approximations

Using this, we can approximate information quantities with the Fisher information and observed information. For the mutual information, we use the decomposition \(\MIof{A; B} = \Hof{A} - \Hof{A \given B}\).

Expected Information Gain (EIG). For example, the expected information gain \(\MIof{\W ; \Y \given \x} = \Hof{\W} - \Hof{\W \given \Y, \x}\) can thus be approximated around some \(\wstar\) as: \[ \begin{aligned} \MIof{\W ; \Y \given \x} &\approx -\frac{1}{2} \log \det \HofHessian{\wstar} + \frac{1}{2} \log \det \HofHessian{\wstar \given \Y, \x} \\ &= \frac{1}{2} \log \det (\HofHessian{\Y \given \x, \wstar} + \HofHessian{ \wstar} ) -\frac{1}{2} \log \det \HofHessian{\wstar} \\ &= \frac{1}{2} \log \det ((\HofHessian{\Y \given \x, \wstar} + \HofHessian{ \wstar}) \HofHessian{\wstar}^{-1}) \\ &= \frac{1}{2} \log \det (\HofHessian{\Y \given \x, \wstar} \HofHessian{\wstar}^{-1} + Id) \\ &\le \frac{1}{2} \tr (\HofHessian{\Y \given \x, \wstar} \HofHessian{\wstar}^{-1}). \end{aligned} \] The last step follows from the inequality \(\log \det (X + Id) \le \tr X\). Note that the first step is abridged as we need to make some assumptions and approximations (which we look at in the paper), but overall this is roughly what is done in the literature.
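
A numerical sketch of the last two lines (with random PSD stand-ins for the Fisher information and the prior Hessian, not real network quantities): we compute the log-det form of the EIG and confirm that the trace expression upper-bounds it.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
A = rng.normal(size=(k, k))
fisher = A @ A.T             # stands in for H''[Y | x, w*]
B = rng.normal(size=(k, k))
prior = B @ B.T + np.eye(k)  # stands in for H''[w*]

M = fisher @ np.linalg.inv(prior)
eig_logdet = 0.5 * np.linalg.slogdet(M + np.eye(k))[1]  # 1/2 log det (M + Id)
eig_trace = 0.5 * np.trace(M)                           # 1/2 tr M
print(eig_logdet, eig_trace, eig_logdet <= eig_trace)   # the trace upper-bounds the log det
```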

Information Quantities. The EIG is just one of the information quantities that are used for active learning and active sampling. The following taxonomy is from the paper, which also explains the details:

| | Active Learning | Active Sampling |
| --- | --- | --- |
| Non-Transductive | EIG/BALD \(\MIof{\W; \Yacq \given \xacq}\) | IG \(\MIof{\W; \yacq \given \xacq}\) |
| Transductive using \(\Deval\): Expectation | EPIG \(\simpleE{\pdata{\xeval}} \MIof{\Yeval ; \Yacq \given \xeval, \xacq}\) | PIG \(\simpleE{\pdata{\yeval,\xeval}}\MIof{\yeval; \yacq \given \xeval, \xacq}\) |
| Transductive using \(\Deval\): Joint | JEPIG \(\MIof{\Yevals ; \Yacq \given \xevals, \xacq}\) | JPIG \(\MIof{\yevals; \yacq \given \xevals, \xacq}\) |
Table 1: Taxonomy of Information Quantities for Data Subset Selection. In general, information quantities can be split into ones for active learning or active sampling, into non-transductive and transductive ones, and, in the transductive case, into taking an expectation or the joint over (additional) evaluation samples. Here, we show the information quantities for individual acquisition. For batch acquisition, \(\Yacqs\), \(\yacqs\), \(\xacqs\) can be substituted.

Using the steps from the previous section, we develop the following approximations using both log determinants and matrix traces. We provide them here without much further explanation because it might be useful to see the “shape” of the approximations to recognize them in the wild (then you can always come back and have a look at the paper):
[Figure: (a) Log Determinant Approximations; (b) Trace Approximations. See the paper for the full expressions.]
Comparison of the Approximations/Proxy Objectives. The difference between the active learning and active sampling objectives lies in using either the Fisher information, which is label-independent, or the observed information, which uses label information. JEPIG and EPIG have equivalent proxy objectives when using the matrix trace; see the paper. The notation makes it obvious that in the GLM case, or when the GGN approximation is used, the active learning and active sampling approximations match, as \(\HofHessian{\y \given \x, \wstar}\) equals (respectively, approximates) \(\HofHessian{\Y \given \x, \wstar}\).

Similarity Matrices

Instead of computing the Fisher information, other literature often uses similarity matrices. When the similarity matrices are based on inner products of loss gradients, we find that the respective log determinant terms match the ones above via the matrix-determinant lemma: \[\det (A B + M) = \det M \det (Id + B M^{-1} A).\] Similarity matrices can be written as a matrix product (similar to how we can compute a covariance matrix from the data matrix), and the Fisher information is, in essence, just the transposed product of that. Again, the details are in our paper.
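
A quick numerical check of the matrix-determinant lemma with random conformable matrices (a sketch; the low-rank factors \(A\) and \(B\) play the role of stacked loss gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 3
A = rng.normal(size=(n, r))
B = rng.normal(size=(r, n))
M = rng.normal(size=(n, n)) + 10 * np.eye(n)  # keep M well-conditioned and invertible

lhs = np.linalg.det(A @ B + M)
rhs = np.linalg.det(M) * np.linalg.det(np.eye(r) + B @ np.linalg.inv(M) @ A)
print(np.isclose(lhs, rhs))  # True: det(AB + M) = det(M) det(Id + B M^{-1} A)
```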

Relationship to Non-Bayesian Literature

So far, we have included a prior \(\pof{\w}\), but a lot of the related literature is not Bayesian by itself. It uses the maximum likelihood estimate (MLE) or regularization, which could be interpreted as using a maximum a posteriori (MAP) estimate.

We can see that for an uninformative prior, we have \(\HofHessian{\w} \to 0\) (and \(\Hof{\W} \to \infty\)). We analyze this limit in the paper and draw connections to the non-Bayesian methods mentioned above.

In particular, we see that BADGE approximates the expected information gain and BAIT approximates the expected predictive information gain, while the different acquisition functions using similarity matrices in SIMILAR and PRISM approximate different information quantities (depending on the acquisition function).

Some Insights

We want to highlight three insights from our paper in this blog post:

  1. Trace approximations of the EIG are no better than top-k batch acquisition using individual EIG scores. The matrix trace is additive, which immediately implies this: greedy selection using a trace approximation leads to top-k acquisition (but looks fancier). See the sketch after this list.
  2. Last-layer approximations perform data subset selection on embeddings only, despite feature learning arguably being the most important strength of deep neural networks. However, these approaches can still find great use with large pre-trained models, which are often only fine-tuned on new data domains anyway (Tran et al., 2022).
  3. BADGE, BAIT, and the LogDet objectives of PRISM and SIMILAR (Ash et al., 2019; 2021; Kothawade et al., 2021; 2022) approximate information quantities in weight space, while (Batch)BALD, (J)EPIG, and (J)PIG (Houlsby et al., 2011; Kirsch et al., 2019; 2021b; Mindermann et al., 2022) approximate the information quantities in prediction space.
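
To illustrate the first insight, here is a small sketch (with hypothetical per-point scores): when the batch objective is a sum of per-point trace scores, greedy selection and top-k selection pick exactly the same batch.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=20)  # hypothetical per-point trace scores, e.g. tr(F_i H^{-1})
k = 5

topk = set(np.argsort(scores)[-k:].tolist())

batch = set()
for _ in range(k):  # greedy selection under the additive batch objective
    remaining = set(range(len(scores))) - batch
    batch.add(max(remaining, key=lambda i: scores[i]))

print(batch == topk)  # True: greedy == top-k when the objective is additive
```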

Conclusion

Taking a step back, we have seen that a Bayesian perspective using information quantities connects seemingly disparate literature. Although Bayesian methods are often seen as separate from (non-Bayesian) active learning and active sampling, the sometimes fuzzy notion of “informativeness” expressed through various objectives in non-Bayesian settings collapses to the same few information quantities, which were, in principle, already known to Lindley (1956) and MacKay (1992).

This was a whirlwind tour through parts of the paper. We pose several research questions based on the unified perspective and provide (hopefully) neat insights. Our paper also provides some initial empirical evaluations in the appendix, but we will keep more of that for a follow-up paper. While the proposed notation might seem like a gimmick, notation is important for structuring ideas, and we hope it proves useful for others as well.

Acknowledgements

I would like to thank Yarin Gal, as well as the members of OATML in general for their continued feedback. Special thanks also to the anonymous TMLR reviewers, who provided incredibly helpful reviews.

More from OATML

For more blog posts by OATML in Oxford, check out our group's blog https://oatml.cs.ox.ac.uk/blog.html.

Follow me on Twitter @blackhc