Black-Box Batch Active Learning for Regression (B3AL)

Training machine learning models can require massive labeled datasets. Active learning aims to reduce labeling costs by selecting the most informative samples for labeling. But how well do prediction-focused black-box techniques compare to parameter-focused white-box methods?

This post summarizes the main contributions of my paper “Black-Box Batch Active Learning for Regression” (B3AL), now published in TMLR. The paper introduces a general black-box batch active learning approach for regression that is competitive with white-box methods while using only model predictions.

This approach is applicable to a wide range of models, including both differentiable models like neural networks and non-differentiable models like random forests, with promising results.

The experiments show that the black-box prediction-based active learning in B3AL matches and sometimes even improves on white-box gradient-based active learning for neural networks, which arguably has access to more information about the models.

B3AL builds on top of “A Framework and Benchmark for Deep Batch Active Learning for Regression” by Holzmüller et al (2022), which rephrases many contemporary active learning techniques using a unified kernel-based framework. By extending this to black-box models, we can apply recent active learning techniques to a wide range of models, including neural networks, random forests, and gradient boosted trees.

The post also discusses the benefits and limitations of black-box batch active learning, the theory behind it via the empirical predictive covariance kernel, and its connection to gradient-based methods.

$$\require{mathtools} \newcommand{\indicator}[1]{\,{\Large \mathbb{1}}\{#1\}} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \DeclareMathOperator{\opCovariance}{\mathrm{Cov}} \newcommand{\Cov}[2]{\opCovariance{#1} \left [ #2 \right ]} \newcommand{\implicitCov}[1]{\opCovariance \left [ #1 \right ]} \newcommand{\emCov}[2]{\widehat{\opCovariance}_{#1} \left [ #2 \right ]} \newcommand\MidSymbol[1][]{\:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\w}{\boldsymbol{\omega}} \newcommand{\W}{\boldsymbol{\Omega}} \newcommand{\HofHessian}[1]{\opEntropy''[#1]} \newcommand{\HofJacobian}[1]{\opEntropy'[#1]} \newcommand{\specialHofHessian}[2]{\opEntropy''_{#1}[#2]} \newcommand{\specialHofJacobian}[2]{\opEntropy'_{#1}[#2]} \newcommand{\FisherInfo}[1]{\HofHessian{#1}} \newcommand{\specialFisherInfo}[2]{\specialHofHessian{#1}{#2}} \newcommand{\Yhat}{\hat{Y}} \newcommand{\yhat}{\hat{y}} \newcommand{\Ypred}{Y} \newcommand{\ypred}{y} \newcommand{\Ytrue}{Y} \newcommand{\ytrue}{y} \newcommand{\logits}{{\hat z}} \newcommand{\Dtrain}{{\mathcal{D}^\text{train}}} \newcommand{\Dtest}{{\mathcal{D}^\text{test}}} \newcommand{\Dacq}{{\mathcal{D}^\text{acq}}} \newcommand{\Dpool}{{\mathcal{D}^\text{pool}}} \newcommand{\Deval}{{\mathcal{D}^\text{eval}}} \newcommand{\Dany}{{\mathcal{D}}} \newcommand{\x}{\mathbf{x}} \newcommand{\Y}{{Y}} \newcommand{\y}{{y}} \newcommand{\xs}{{\{\x_i\}}} \newcommand{\ys}{{\{\y_i\}}} \newcommand{\Ys}{{\{\Y_i\}}} \newcommand{\xpools}{{\{\x^\text{pool}_i\}}} \newcommand{\xacq}{{\x^\text{acq}}} \newcommand{\Yacq}{{\Y^\text{acq}}} \newcommand{\yacq}{{\y^\text{acq}}} \newcommand{\xacqs}{{\{\x^\text{acq}_i\}}} \newcommand{\yacqs}{{\{\y^\text{acq}_i\}}} \newcommand{\Yacqs}{{\{\Y^\text{acq}_i\}}} \newcommand{\xyacqs}{\{(\xacq_i,\yacq_i)\}} \newcommand{\yacqsstar}{{\{\y^\text{acq,*}_i\}}} \newcommand{\Xeval}{{\X^\text{eval}}} \newcommand{\xeval}{{\x^\text{eval}}} \newcommand{\Yeval}{{\Y^\text{eval}}} \newcommand{\yeval}{{\y^\text{eval}}} \newcommand{\xevals}{{\{\x^\text{eval}_i\}}} \newcommand{\yevals}{{\{\y^\text{eval}_i\}}} \newcommand{\Yevals}{{\{\Y^\text{eval}_i\}}} \newcommand{\xtrain}{{\x^\text{train}}} \newcommand{\ytrain}{{\y^\text{train}}} \newcommand{\Ytrain}{{\Y^\text{train}}} \newcommand{\xytrains}{{\{(\x^\text{train}_i,\y^\text{train}_i)\}}} \newcommand{\xtrains}{{\{\x^\text{train}_i\}}} \newcommand{\w}{\boldsymbol{\omega}} \newcommand{\W}{\boldsymbol{\Omega}} \newcommand{\wstar}{\w^*} 
\newcommand{\N}{\mathcal{N}} \newcommand{\normaldist}[2]{\N({#1},\,{#2})} \newcommand{\normaldistpdf}[3]{\N(#1;\,{#2},\,{#3})} \newcommand{\acqf}{\mathcal{A}} \DeclareMathOperator{\Categorical}{Categorical} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Multinomial}{Multinomial} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\softmax}{softmax} \newcommand{\noiseobs}{\sigma_N} \newcommand{\Sigmapred}[2]{\noiseobs(#1; #2)} \newcommand{\mupred}[2]{{\mu}(#1; #2)} \newcommand{\mupredshort}[2]{{\mu}^{#1}_{#2}} \newcommand{\gradmupred}[2]{\nabla_{\w} \mu(#1; #2)} \newcommand{\mushort}[2]{\mu_{#1}^{#2}} \newcommand{\cmushort}[2]{\bar{\mu}_{#1}^{#2}} \newcommand{\gradmushort}[2]{{\nabla_{\w} \mu_{#1}^{#2}}} \newcommand{\predkernel}[1]{{k_\mathrm{pred}(#1)}} \newcommand{\emOmega}{\hat{\W}} \newcommand{\empredkernel}[1]{{k_{\widehat{\mathrm{pred}}}(#1)}} \newcommand{\gradkernel}[1]{k_\mathrm{grad}(#1)} \newcommand{\xpostgradkernel}[2]{{k_{\mathrm{grad}\to \mathrm{post}(#1)}(#2)}} \newcommand{\postgradkernel}[1]{{k_{\mathrm{grad}\to \mathrm{post}(\Dtrain)}(#1)}} \newcommand{\xHessian}[1]{\nabla^2_{\w} #1} \newcommand{\pdata}[1]{\hpcof{}{#1}} \newcommand{\realnum}{\mathbb{R}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \newcommand{\eqdef}{=\vcentcolon} \newcommand{\targetdim}{D} \newcommand{\batchvar}{B} \newcommand{\poolsize}{P} \newcommand{\trainsize}{N} \newcommand{\evalsize}{E} \newcommand{\testsize}{M} \newcommand{\numclasses}{K} \newcommand{\bmcparam}{{\pmb{\alpha}}} \newcommand{\bmch}{\psi} \newcommand{\Bmch}{\Psi} \newcommand{\bmccatprob}{\pmb{q}} \newcommand{\bmcmupred}[2]{{\tilde{\mu}}(#1; #2)} \newcommand{\bmcgradmupred}[2]{{\nabla_{\bmch} \bmcmupred{#1}{#2}}} \newcommand{\bmcpredkernel}[1]{k_{\mathrm{pred}, \bmch}(#1)} \newcommand{\bmcgradkernel}[1]{k_{\mathrm{grad}, \bmccatprob}(#1)} \newcommand{\bmcpostgradkernel}[1]{{k_{\mathrm{grad}, \bmch \to \mathrm{post}(\Dtrain)}(#1)}} $$

Holzmüller et al (2022) present kernel-based versions of contemporary batch active learning methods for neural networks and regression tasks. They unify the following methods using gradient-based kernels and the experiments show strong active learning performance on regression tasks: BALD, BatchBALD, BAIT, BADGE, ACS-FW, and Core-Set/FF-Active. They also propose a new method, LCMD (largest cluster maximum distance).

B3AL examines an empirical predictive covariance kernel \(\predkernel{\x_i; \x_j}\) based on ensemble predictions and connects this prediction-based kernel to gradient methods for neural networks. This opens up using the kernel-based versions of above contemporary batch active learning methods with non-differentiable black-box models like random forests and shows strong experimental performance on regression tasks.

Empirical Evaluation

We follow the evaluation from Holzmüller et al (2022) and use their framework to ease comparison. This allows us to directly compare to several SotA methods (or rather, their kernel-based analogues) in a regression setting. We compare to the popular deep active learning methods mentioned above and a random selection baseline (‘Uniform’). We use 15 large tabular datasets from the UCI Machine Learning Repository and the OpenML benchmark suite to compare white-box and black-box batch active learning methods.

Black-Box vs White-Box Deep Active Learning

(a) Active Learning Performance · (b) Batch Size Ablation · (c) Ensemble Size Ablation
Figure 1. Deep Neural Networks: Mean logarithmic RMSE over 15 regression datasets. (a) We see that black-box \(\blacksquare\) methods (10 ensemble members) work as well as white-box \(\square\) methods, and in most cases better, with the exception of ACS-FW and BAIT. (b) As expected, performance decreases with acquisition batch size. (c) As expected, performance increases with ensemble size.

In Figures 1 and 2, we see that B3AL is competitive with white-box active learning when using BALD, BatchBALD, BAIT, BADGE, and Core-Set. On average over the 15 datasets we analyzed, B3AL outperforms the white-box methods (excluding ACS-FW and BAIT). We hypothesize that this is because the implicit Fisher information approximation in the white-box methods is not as accurate in the low-data regime as the more explicit approximation via ensembling in B3AL.

Figure 2. Average Logarithmic RMSE by regression datasets for DNNs: \(\blacksquare\) vs \(\square\) (vs Uniform). Across acquisition functions, the performance of black-box methods is highly correlated with the performance of white-box methods, even though black-box methods make fewer assumptions about the model. We plot the improvement of the white-box \(\square\) method over the uniform baseline on the x-axis (so for datasets with markers right of the dashed vertical lines, the white-box method performs better than uniform) and the improvement of the black-box \(\blacksquare\) method over the uniform baseline on the y-axis (so for datasets with markers above the dashed horizontal lines, the black-box method performs better than uniform). For datasets with markers in the \(\blacksquare\) diagonal half, the black-box method performs better than the white-box method. The average over all datasets is marked with a star \(\star\). Surprisingly, on average over all acquisition rounds, the black-box methods perform slightly better than the white-box methods for all but ACS-FW and BAIT.

Why Can Black-Box Methods Outperform White-Box Methods? Both white-box and black-box methods are based on kernels, which can be viewed as different approximations of the predictive covariance kernel (as we will examine below). White-box methods implicitly assume that the predictive covariance kernel is well approximated by the Fisher information kernel and the gradient kernel. However, Long (2022) demonstrated that this assumption does not always hold, particularly in low data regimes, where a Gaussian might not approximate the parameter distribution well. Instead, they suggest using a multimodal distribution. In these situations, methods that employ ensembling, such as B3AL, to approximate the predictive covariance kernel can be more robust. The different ensemble members can reside in different modes of the parameter distribution, allowing black-box methods to outperform their white-box counterparts.

Non-Differentiable Models

Random Forests

(a) Active Learning Performance · (b) Batch Size Ablation · (c) Ensemble Size Ablation

Random Forests (With Bagging)

(a) Active Learning Performance · (b) Batch Size Ablation · (c) Ensemble Size Ablation

Gradient-Boosted Trees

(a) Active Learning Performance · (b) Batch Size Ablation · (c) Ensemble Size Ablation
Figure 3. Mean logarithmic RMSE over 15 regression datasets. For random forests (100 estimators) with the default hyperparameters from scikit-learn, we see that black-box methods perform better than the uniform baseline, with the exception of BALD (using top-k selection). For random forests using bagging with 10 bootstrapped training sets, and for gradient-boosted trees with a virtual ensemble of 20 members, we see that only a few of the black-box methods perform better than the uniform baseline: LCMD, BADGE, and Core-Set. We hypothesize that the virtual ensembles and a bagged ensemble of random forests do not express as much predictive disagreement, which leads to worse performance for active learning.

In Figure 3, we observe that B3AL is effective for non-differentiable models, including random forests and gradient-boosted decision trees. BALD for non-differentiable models can be considered equivalent to Query-by-Committee (QbC), while BatchBALD for non-differentiable models can be viewed as QbC with batch acquisition. For random forests, all methods except BALD (using top-k selection) outperform uniform acquisition. However, for random forests with bagging and gradient-boosted decision trees, B3AL surpasses random acquisition only when employing LCMD and BADGE. This may be attributed to the reduced disagreement within a virtual ensemble for gradient-boosted decision trees and between distinct random forests. The results for random forests with bagging appear to support this explanation: a single random forest seems to exhibit more disagreement among its individual trees than an ensemble of random forests with bagging does between different forests. This is evident in the superior overall active learning performance of the single random forest compared to the ensemble of random forests with bagging.
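
To make the QbC connection concrete, the snippet below is a minimal sketch (not the code used in the paper) of how per-member predictions can be read off a single scikit-learn random forest via its `estimators_` attribute; the synthetic data and variable names are purely illustrative. The resulting prediction matrix is exactly the input to the empirical predictive covariance kernel discussed in the theory section below, and its per-point variance is a simple QbC-style disagreement measure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))
y_train = X_train[:, 0] - 2.0 * X_train[:, 1] + 0.1 * rng.normal(size=200)
X_pool = rng.normal(size=(1000, 8))

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# One prediction per tree and pool point: shape (n_trees, n_pool).
member_predictions = np.stack([tree.predict(X_pool) for tree in forest.estimators_])

# Predictive variance across trees: a simple measure of ensemble disagreement.
disagreement = member_predictions.var(axis=0)
```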

Ablations

Panels (b) and (c) of Figures 1 and 3 show the effect of ablating the acquisition batch size and the ensemble size, respectively.

Ensemble Size. Increasing the ensemble size generally improves the performance of all acquisition approaches except LCMD, for which performance first improves and then degrades. The performance increase with uniform sampling is due to better model predictions alone. This shows that the gains from more informative sample acquisition as the ensemble grows can be significantly larger than the gains from better model predictions.

Acquisition Batch Size. Performance generally decreases with increasing acquisition batch size for all acquisition approaches. Comparing white-box and black-box methods, we see that only at the largest batch size of 4096 do the white-box methods perform as well as the black-box methods for neural networks.

Best-performing Kernels and Selection Methods. Holzmüller et al (2022) evaluate different kernels and variants of the selection methods based on the SotA methods; these variants do not exactly match prior art. In the B3AL paper, we show the results of an ablation comparing black-box methods with the best-performing white-box kernels. For those variants, too, B3AL performs on par with or better than the best-performing white-box kernels. See the appendix and the paper for details.

Benefits & Limitations

B3AL shows that a simple extension of kernel-based methods to utilize empirical predictions rather than gradient kernels can be surprisingly effective and enable black-box batch active learning with good performance. Importantly, B3AL also generalizes to non-differentiable models, an area that has received limited attention as of late.

The main limitation of B3AL lies in obtaining a sufficient number of empirical predictions. This can be a challenge, particularly when using deep ensembles with larger models or non-differentiable models that cannot be parallelized efficiently. The experiments using virtual ensembles indicate that the diversity of the ensemble members plays a crucial role in determining performance. The main limitation of the empirical comparisons is that we only consider regression tasks. Extending the results to classification is an important direction for future work.

In summary, B3AL offers competitive performance with limited assumptions. The empirical kernel is simple to implement. However, generating sufficient predictions for the empirical covariance estimation can be expensive, especially for large neural networks. Carefully tuning the ensemble method and size is key.

Theory

Here, we describe the problem setting, motivate the proposed method, and look at the theory behind B3AL.

Problem Setting. The proposed method is inspired by the BALD-family of active learning frameworks and its extension to batch active learning. The derivation makes use of a Bayesian model in the narrow sense that we require some stochastic parameters \(\W\)—the model parameters or bootstrapped training data in the case of non-differentiable models like random forests, for example—with a distribution¹ \(\pof{\w}\): \[\begin{align} \pof{\y, \w \given \x} = \pof{\y \given \x, \w} \pof{\w}. \end{align}\] Bayesian model averaging (BMA) is performed by marginalizing over \(\pof{\w}\) to obtain the predictive distribution \(\pof{\y \given \x}\). Importantly, the choice of \(\pof{\w}\) covers ensembles as well as models with additional stochastic inputs or randomized training data by subsampling of the training set, e.g., bagging.

Pool-based Active Learning assumes access to a pool set of unlabeled data \(\Dpool=\xpools\) and a small initially labeled training set \(\Dtrain=\xytrains\), or \(\Dtrain = \emptyset\). In the batch acquisition setting, we want to repeatedly acquire labels for a subset \(\xacqs\) of the pool set of a given acquisition batch size \(\batchvar\) and add them to the training set \(\Dtrain\). Ideally, we want to select samples that are highly ‘informative’ for the model. For example, these could be samples that are likely to be misclassified or have a large prediction uncertainty for models trained on the currently available training set \(\Dtrain\). Once we have chosen such an acquisition batch \(\xacqs\) of unlabeled data, we acquire labels \(\yacqs\) for these samples and train a new model on the combined training set \(\Dtrain \cup \xyacqs\) and repeat the process. Crucial to the success of active learning is the choice of acquisition function \(\acqf(\xacqs; \pof{\w})\) which is a function of the acquisition batch \(\xacqs\) and the distribution \(\pof{\w}\) and which we try to maximize in each acquisition round. It measures the informativeness of an acquisition batch for the current model.
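
For readers who prefer code, here is a minimal sketch of this pool-based batch acquisition loop. It is not the B3AL framework code: `train_ensemble`, `acquisition_scorer`, and `label_oracle` are hypothetical callables standing in for model (re)training, the acquisition function \(\acqf\), and the labeling step.

```python
import numpy as np

def active_learning_loop(x_pool, x_train, y_train, batch_size, num_rounds,
                         train_ensemble, acquisition_scorer, label_oracle):
    x_pool = list(x_pool)
    x_train, y_train = list(x_train), list(y_train)
    for _ in range(num_rounds):
        # Retrain the (ensemble) model on the current labeled training set.
        models = train_ensemble(np.asarray(x_train), np.asarray(y_train))
        # Score the pool and pick the indices of an acquisition batch of size B.
        acq_indices = acquisition_scorer(models, np.asarray(x_pool), batch_size)
        # Acquire labels for the selected samples and move them to the training set.
        for i in sorted(acq_indices, reverse=True):
            x_acq = x_pool.pop(i)
            x_train.append(x_acq)
            y_train.append(label_oracle(x_acq))
    return x_train, y_train
```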

Univariate Regression is a common task in machine learning. We assume that the target \(\y\) is real-valued (\(\in \realnum\)) with homoscedastic Gaussian noise: \[\begin{align} \Y \given \x, \w &\sim \normaldist{\mupred{\x}{\w}}{\noiseobs^2}. \label{eq:gaussian_noise} \end{align}\] Equivalently, \(\Y \given \x, \w \sim \mupred{\x}{\w}+\varepsilon\) with \(\varepsilon \sim \normaldist{0}{\noiseobs^2}\). As usual, we assume that the noise is independent for different inputs \(\x\) and parameters \(\w\). Homoscedastic noise is a special case of the general heteroscedastic setting: the noise variance is simply a constant. Our approach can be extended to heteroscedastic noise by substituting a function \(\Sigmapred{\x}{\w}\) for \(\noiseobs\), but for this work we limit ourselves to the simplest case.

Kernel-based Methods

While a full treatment is naturally beyond the scope of this post, we briefly review some key ideas of Holzmüller et al (2022) here.

Gaussian Processes are one way to introduce kernel-based methods. A simple way to think about a Gaussian Process is as a Bayesian linear regression model with an implicit, potentially infinite-dimensional feature space (depending on the covariance kernel) that uses the kernel trick to abstract away the feature map from input space to feature space.

Multivariate Gaussian Distribution. The distinctive property of a Gaussian Process is that all predictions are jointly Gaussian distributed. We can then write the joint distribution for a univariate regression model as: \[ \begin{align} &\Y_1, \ldots, \Y_n \given \x_1, \ldots, \x_n \sim \normaldist{\mathbf{0}}{\Cov{}{\mu(\x_1), \ldots, \mu(\x_n)} + \noiseobs^2 \mathbf{I}}, \label{eq:gp_joint} \end{align} \] where \(\mu(\x){}\) are the observation-noise free predictions as random variables and \(\Cov{}{\mu(\x_1), \ldots, \mu(\x_n)}\) is the covariance matrix of the predictions. The covariance matrix is defined via the kernel function \(k(\x, \x')\): \[ \begin{align} \Cov{}{\mu(\x_1), \ldots, \mu(\x_n)} = \begin{bmatrix} k(\x_i, \x_j) \end{bmatrix}_{i,j=1}^{n,n}. \end{align} \] The kernel function \(k(\x, \x')\) can be chosen almost arbitrarily, e.g. see §4 in Williams & Rasmussen, 2006. The linear kernel \(k(\x, \x') = \langle \x, \x' \rangle\) and the radial basis function kernel \(k(\x, \x') = \exp(-\frac{1}{2} \lvert {\x-\x'} \rvert^2)\) are common examples, as is the gradient kernel, which we examine next.
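
As a small illustration (a numpy sketch, not tied to any particular library), the joint covariance of the noisy observations can be assembled directly from a kernel function, here the RBF kernel from above:

```python
import numpy as np

def rbf_kernel(X, Xp):
    # k(x, x') = exp(-0.5 * ||x - x'||^2)
    sq_dists = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists)

X = np.random.default_rng(0).normal(size=(5, 3))
sigma_n = 0.1

K = rbf_kernel(X, X)                         # Cov[mu(x_1), ..., mu(x_n)]
joint_cov = K + sigma_n**2 * np.eye(len(X))  # covariance of Y_1, ..., Y_n
```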

Fisher Information & Linearization. When using neural networks for regression, the gradient kernel \[ \begin{align} \gradkernel{\x; \x' \given \wstar} &\triangleq \gradmupred{\x}{\wstar} \xHessian{[-\log \pof{\wstar}]}^{-1} \gradmupred{\x'}{\wstar}^\top \\ &= \langle \gradmupred{\x}{\wstar}, \gradmupred{\x'}{\wstar} \rangle_{\xHessian{[-\log \pof{\wstar}]}^{-1}} \end{align} \] is the canonical choice, where \(\wstar\) is a maximum likelihood or maximum a posteriori estimate (MLE, MAP) and \(\xHessian{[-\log \pof{\wstar}]}\) is the Hessian of the negative log prior at \(\wstar\). Note that \(\gradmupred{\x}{\wstar}\) is a row vector. Commonly, the prior is a Gaussian distribution with an identity covariance matrix, and thus \(\xHessian{[-\log \pof{\wstar}]} = \mathbf{I}\).

The significance of this kernel lies in its relationship with the Fisher information matrix at \(\wstar\), or equivalently, with the linearization of the loss function around \(\wstar\). This leads to a Gaussian approximation, which results in a Gaussian predictive posterior distribution when combined with a Gaussian likelihood. The use of the finite-dimensional gradient kernel thus results in an implicit Bayesian linear regression in the context of regression models.
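
As a hedged sketch, the gradient kernel can be computed in closed form for the simplest case of a model that is linear in its parameters, \(\mupred{\x}{\w} = \langle \w, \phi(\x) \rangle\), where the parameter gradient is just the feature map \(\phi(\x)\). The feature map `features` below is a made-up example; for a neural network, the rows of `G` would instead be per-sample output gradients at \(\wstar\), e.g., obtained with autograd.

```python
import numpy as np

def features(X):
    # Hypothetical feature map phi(x): the raw inputs plus a bias feature.
    return np.concatenate([X, np.ones((len(X), 1))], axis=1)

def gradient_kernel(X, Xp, prior_hessian_inv=None):
    G, Gp = features(X), features(Xp)            # rows play the role of grad_w mu(x; w*)
    if prior_hessian_inv is None:
        prior_hessian_inv = np.eye(G.shape[1])   # Gaussian prior with identity covariance
    return G @ prior_hessian_inv @ Gp.T

X = np.random.default_rng(1).normal(size=(4, 3))
K_grad = gradient_kernel(X, X)                   # 4 x 4 Gram matrix
```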

Posterior Gradient Kernel. We can use the well-known properties of multivariate normal distributions to marginalize or condition the joint distribution above. Following Holzmüller et al (2022), this allows us to explicitly obtain the posterior gradient kernel given additional \(\x_1, \ldots, \x_n\) as: \[\begin{align} \label{eq:gp_postgradkernel} &\xpostgradkernel{\x_1,\ldots,\x_n}{\x; \x' \given \wstar} \\ &\quad \triangleq \nabla_{\w} \mupred{\x}{\wstar} \left ( \noiseobs^{-2} \begin{pmatrix} \gradmupred{\x_1}{\wstar} \\ \vdots \\ \gradmupred{\x_n}{\wstar} \end{pmatrix}^\top \begin{pmatrix} \gradmupred{\x_1}{\wstar} \\ \vdots \\ \gradmupred{\x_n}{\wstar} \end{pmatrix} + \xHessian{[-\log \pof{\wstar}]} \right )^{-1} \nabla_{\w} \mupred{\x'}{\wstar}^\top. \notag \end{align}\] The factor \(\noiseobs^{-2}\) originates from implicitly conditioning on \(\Y_i \given \x_i\), which include observation noise.
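
Continuing the linear-model sketch from the previous snippet (with the same hypothetical `features` map), the posterior gradient kernel can be written down directly from the formula above:

```python
import numpy as np

def features(X):
    return np.concatenate([X, np.ones((len(X), 1))], axis=1)

def posterior_gradient_kernel(X, Xp, X_cond, sigma_n, prior_hessian=None):
    G_cond = features(X_cond)                    # stacked gradients of the conditioning points
    if prior_hessian is None:
        prior_hessian = np.eye(G_cond.shape[1])
    precision = G_cond.T @ G_cond / sigma_n**2 + prior_hessian
    cov = np.linalg.inv(precision)               # Gaussian approximation of Cov[w | data]
    return features(X) @ cov @ features(Xp).T

rng = np.random.default_rng(2)
X_train, X_pool = rng.normal(size=(20, 3)), rng.normal(size=(5, 3))
K_post = posterior_gradient_kernel(X_pool, X_pool, X_train, sigma_n=0.1)
```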

Importantly for active learning, the multivariate normal distribution is the maximum entropy distribution for a given covariance matrix, so its entropy is an upper bound on the entropy of any distribution with the same covariance matrix. The entropy is given by the log-determinant of the covariance matrix: \[\begin{align} \Hof{\Y_1, \ldots, \Y_n \given \x_1, \ldots, \x_n} &= \frac{1}{2} \log \det (\Cov{}{\mu(\x_1), \ldots, \mu(\x_n)} + \noiseobs^2 \mathbf{I}) + C_n, \end{align}\] where \(C_n \triangleq \frac{n}{2} \log(2 \, \pi \, e)\) is a constant that only depends on the number of samples \(n\). Connecting kernel-based methods to information-theoretic quantities like the expected information gain, we then know that the respective acquisition scores are upper bounds on the actual expected information gain.
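
In code, this entropy upper bound for a candidate acquisition batch is a single log-determinant; the sketch below assumes `K_batch` is the Gram matrix of the batch under any of the kernels above:

```python
import numpy as np

def gaussian_joint_entropy(K_batch, sigma_n):
    # H[Y_1, ..., Y_n | x_1, ..., x_n] for a jointly Gaussian model with observation noise.
    n = K_batch.shape[0]
    cov = K_batch + sigma_n**2 * np.eye(n)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * logdet + 0.5 * n * np.log(2 * np.pi * np.e)
```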

Predictive Covariance Kernel

The predictive covariance kernel \(\predkernel{\x_i, \x_j}\) is the covariance of the predicted means: \[ \begin{align} \predkernel{\x_i; \x_j} \triangleq \Cov{\W}{\mushort{\x_i}{\w}; \mushort{\x_j}{\w}}. \end{align} \]

This is also simply known as the covariance kernel in the literature. We use the prefix ‘predictive’ to make clear that we look at the covariance of the predictions. The resulting Gram matrix equals the covariance matrix of the predictions; it is positive definite for positive observation noise (and positive semi-definite otherwise) and thus a valid kernel.

Empirical Predictive Covariance Kernel

For \(K\) sampled model parameters \(\w_1, \ldots, \w_K \sim \pof{\w}\)—for example, the members of a deep ensemble—the empirical predictive covariance kernel \(\empredkernel{\x_i; \x_j}\) is the empirical estimate: \[ \begin{align} \empredkernel{\x_i; \x_j} &\triangleq \emCov{\W}{\mushort{\x_i}{\w}; \mushort{\x_j}{\w}}\\ &= \frac{1}{K} \sum_{k=1}^K \left (\mushort{\x_i}{\w_k} - \frac{1}{K} \sum_{l=1}^K \mushort{\x_i}{\w_l} \right )\, \left (\mushort{\x_j}{\w_k} - \frac{1}{K} \sum_{l=1}^K \mushort{\x_j}{\w_l} \right ) \\ &= \left \langle \frac{1}{\sqrt{K}} (\cmushort{\x_i}{\w_1}, \ldots, \cmushort{\x_i}{\w_K}), \frac{1}{\sqrt{K}} (\cmushort{\x_j}{\w_1}, \ldots, \cmushort{\x_j}{\w_K}) \right \rangle, \label{eq:empirical_pred_kernel} \end{align} \] with centered predictions \(\cmushort{\x}{\w_k} \triangleq \mushort{\x}{\w_k} - \frac{1}{K} \sum_{l=1}^K \mushort{\x}{\w_l}\). As we can write this kernel as an inner product, it also immediately follows that the empirical predictive covariance kernel is a valid kernel and positive semi-definite.
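
In code, the empirical kernel is just a centered inner product over ensemble predictions. The following numpy sketch (not the paper's exact implementation) computes the full Gram matrix from a prediction matrix with one row per ensemble member and one column per input:

```python
import numpy as np

def empirical_predictive_covariance_kernel(predictions):
    # predictions: shape (K, n) -- K ensemble members, n inputs.
    K = predictions.shape[0]
    centered = predictions - predictions.mean(axis=0, keepdims=True)
    return centered.T @ centered / K             # (n, n) Gram matrix, PSD by construction

# Example with a hypothetical ensemble of 10 members and 100 pool points:
preds = np.random.default_rng(3).normal(size=(10, 100))
K_pred = empirical_predictive_covariance_kernel(preds)
```

Any of the kernel-based acquisition functions from Holzmüller et al (2022) can then operate on this Gram matrix in place of a gradient-based one.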

Differentiable Models

Similar to §C.1 in Holzmüller et al (2022), we show that the posterior gradient kernel is a first-order approximation of the (predictive) covariance kernel. The result is simple but instructive:

Proposition. The posterior gradient kernel \(\postgradkernel{\x_i; \x_j \given \wstar}\) is an approximation of the predictive covariance kernel \(\predkernel{\x_i; \x_j}\).

Proof. We use a first-order Taylor expansion of the mean function \(\mupred{\x}{\w}\) around \(\wstar\): \[ \begin{align} \mupred{\x}{\w} \approx \mupred{\x}{\wstar} + \gradmupred{\x}{\wstar}\, \underbrace{(\w - \wstar)}_{\triangleq \Delta \w}. \end{align} \] Choose \(\wstar = \E{\w \sim \pof{\w \given \Dtrain}}{\w}\) (BMA). Then we have \(\E{\pof{\w \given \Dtrain}}{\mupred{\x}{\w}} \approx \mupred{\x}{\wstar}\). Overall this yields: \[ \begin{align} \predkernel{\x_i; \x_j} &= \Cov{\w \sim \pof{\w \given \Dtrain}}{\mupred{\x_i}{\w}; \mupred{\x_j}{\w}} \\ &\approx \E{\wstar + \Delta \w \sim \pof{\w \given \Dtrain}}{\langle \gradmushort{\x_i}{\wstar}\, \Delta \w, \gradmushort{\x_j}{\wstar}\, \Delta \w \rangle} \\ &= \gradmushort{\x_i}{\wstar} \, \E{\wstar + \Delta \w \sim \pof{\w \given \Dtrain}}{ \Delta \w \Delta \w^\top} \, {\gradmushort{\x_j}{\wstar}}^\top \\ &= \gradmupred{\x_i}{\wstar} \, \Cov{}{\W \given \Dtrain} \, \gradmupred{\x_j}{\wstar}^\top \\ &\approx \postgradkernel{\x_i; \x_j \given \wstar}. \end{align} \] The intermediate expectation is the model covariance \(\Cov{}{\W \given \Dtrain}\) as \(\wstar\) is the BMA.

For the last step, we use the generalized Gauss-Newton (GGN) approximation and approximate the inverse of the covariance using the Hessian of the negative log likelihood at \(\wstar\): \[ \begin{align} &\Cov{}{\W \given \Dtrain}^{-1} \approx \xHessian{[- \log \pof{\wstar \given \Dtrain}]} \\ &\quad = \xHessian{[- \log \pof{\Dtrain \given \wstar} - \log \pof{\wstar}]} \\ &\quad \overset{(GGN)}{\approx} \noiseobs^{-2} \textstyle \sum_i \gradmupred{\xtrain_i}{\wstar}^\top \gradmupred{\xtrain_i}{\wstar} \\ &\quad \quad - \xHessian{\log \pof{\wstar}}, \end{align} \] where we have first used Bayes’ theorem and that \(\pof{\Dtrain}\) vanishes under differentiation—it is constant in \(\w\).

Secondly, the Hessian of the negative log likelihood is just the outer product of the gradients divided by the noise variance in the homoscedastic regression case. \(\xHessian{[-\log \pof{\wstar}]}\) is the prior term. This matches the posterior gradient kernel above. \[\tag*{\(\blacksquare\)}\]
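
A quick numerical sanity check for the special case where the first approximation step is exact, namely a model that is linear in its parameters: sampling parameters from a Gaussian posterior and computing the empirical covariance of the predictions recovers \(\gradmupred{\x_i}{\wstar} \, \Cov{}{\W \given \Dtrain} \, \gradmupred{\x_j}{\wstar}^\top\) up to sampling noise. All quantities below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, S = 5, 3, 200_000
phi = rng.normal(size=(n, d))              # rows play the role of grad_w mu(x_i; w*)
w_star = rng.normal(size=d)                # BMA / expansion point
L = 0.1 * rng.normal(size=(d, d))
param_cov = L @ L.T                        # Cov[W | D_train]

w_samples = rng.multivariate_normal(w_star, param_cov, size=S)   # (S, d)
preds = w_samples @ phi.T                                         # mu(x_i; w) per sample

empirical = np.cov(preds, rowvar=False, bias=True)   # empirical predictive covariance
analytic = phi @ param_cov @ phi.T                    # gradient-kernel form
print(np.abs(empirical - analytic).max())             # small; shrinks with more samples
```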

Non-Differentiable Models

How can we apply the above result to non-differentiable models?

In the following, we use a Bayesian view on the hypothesis space to show that we can connect the empirical predictive covariance kernel to a gradient kernel in this case, too. With \(\emOmega \triangleq (\w_1, \ldots, \w_K)\) fixed—e.g. these could be the different parameters of the members of an ensemble—we introduce a latent \(\Bmch\) to represent the index of the ‘true’ hypothesis \(\w_\bmch \in \emOmega\) from this empirical hypothesis space \(\emOmega\), which we want to identify. This is similar to QbC. In essence, the latent \(\Bmch\) takes on the role of \(\W\) from the previous section, and we are interested in learning the ‘true’ \(\Bmch\) from additional data. We, thus, examine the kernels for \(\Bmch\), as opposed to \(\W\).

Specifically, we model \(\Bmch\) using a one-hot categorical distribution, that is, a multinomial distribution from which we draw one sample: \(\Bmch \sim \Multinomial(\bmccatprob, 1)\), with \(\bmccatprob \in S^{K-1}\) parameterizing the distribution, where \(S^{K-1}\) denotes the \((K-1)\)-simplex in \(\realnum^K\). Then, \(\bmccatprob_k = \pof{\Bmch = e_k}\), where \(e_k\) denotes the \(k\)-th unit vector; and \(\sum_{k=1}^K \bmccatprob_k = 1\). For the corresponding predictive mean \(\bmcmupred{\x}{\Bmch}\), we have: \[ \begin{align} \bmcmupred{\x}{\Bmch} &\triangleq \mupred{\x}{\w_\Bmch} = \langle \mupred{\x}{\cdot}, \Bmch \rangle, \end{align} \] where we use \(\w_\bmch\) to denote the \(\w_k\) when we have \(\bmch=e_k\) in slight abuse of notation, and \(\mupred{\x}{\cdot} \in \realnum^K\) is a column vector of the predictions \(\mupred{\x}{\w_k}\) for \(\x\) for all \(\w_k\). This follows from \(\bmch\) being a one-hot vector.

We now examine this model and its kernels. The BMA of \(\bmcmupred{\x}{\Bmch}\) matches the previous empirical mean if we choose \(\bmccatprob\) to be the uninformative uniform distribution over the hypotheses (\(\bmccatprob_k \triangleq \frac{1}{K}\)):

\[ \begin{align} \bmcmupred{\x}{\bmccatprob} &\triangleq \E{\pof{\bmch}}{\mupred{\x}{\w_\bmch}} \\ &= \langle \mupred{\x}{\cdot}, \bmccatprob \rangle \label{eq:empirical_bma_linearization} \\ &= \sum_{\bmch=1}^K \bmccatprob_\bmch \mupred{\x}{\w_\bmch} \\ &= \sum_{\bmch=1}^K \frac{1}{K} \mupred{\x}{\w_\bmch}. \end{align} \] What is the predictive covariance kernel of this model? And what is the posterior gradient kernel for \(\bmccatprob\)?

Proposition.

  1. The predictive covariance kernel \(\bmcpredkernel{\x_i, \x_j}\) for \(\emOmega\) using uniform \(\bmccatprob\) is equal to the empirical predictive covariance kernel \(\empredkernel{\x_i; \x_j}\).
  2. The ‘posterior’ gradient kernel \(\bmcpostgradkernel{\x_i ; \x_j}\) for \(\emOmega\) with respect to \(\Bmch\) using uniform \(\bmccatprob\) is equal to the empirical predictive covariance kernel \(\empredkernel{\x_i; \x_j}\).
Proof. As for the previous differentiable model, the BMA of the model parameters \(\Bmch\) is just \(\bmccatprob\): \(\simpleE{}{\Bmch} = \bmccatprob\). The first statement immediately follows: \[ \begin{align} \bmcpredkernel{\x_i; \x_j} &= \Cov{\bmch}{\bmcmupred{\x_i}{\bmch}; \bmcmupred{\x_j}{\bmch}} \\ &= \E{\pof{\bmch}}{ \cmushort{\x_i}{\w_\bmch} \, \cmushort{\x_j}{\w_\bmch}} \\ &= {\textstyle \frac{1}{K}} \sum_\bmch \cmushort{\x_i}{\w_\bmch} \cmushort{\x_j}{\w_\bmch} \\ &= \empredkernel{\x_i; \x_j}. \end{align} \] For the second statement, we will show that we can express the predictive covariance kernel as a linearization around \(\bmccatprob\). We can read off the gradient \(\bmcgradmupred{\x_i}{\bmch}\) from the inner product in the BMA expression above: \[ \begin{align} \bmcgradmupred{\x_i}{\bmch} = \mupred{\x_i}{\cdot}^\top. \end{align} \] This allows us to write the predictive covariance kernel as a linearization around \(\bmccatprob\): \[ \begin{align} \bmcpredkernel{\x_i; \x_j} &= \Cov{\bmch \sim \pof{\bmch}}{\bmcmupred{\x_i}{\bmch}; \bmcmupred{\x_j}{\bmch}} \\ &= \E{\bmccatprob + \Delta \bmch \sim \pof{\bmch}}{\langle \bmcgradmupred{\x_i}{\bmccatprob} \, \Delta \bmch, \bmcgradmupred{\x_j}{\bmccatprob} \, \Delta \bmch \rangle} \\ &= \bmcgradmupred{\x_i}{\bmccatprob} \, \Cov{}{\Bmch} \, \bmcgradmupred{\x_j}{\bmccatprob}^\top\\ &=\bmcpostgradkernel{\x_i ; \x_j}. \end{align} \] \[\tag*{\(\blacksquare\)}\]

The above gradient kernel is only the posterior gradient kernel in the sense that we have sampled \(\w_\bmch\) from the non-differentiable model after inference on training data. The samples themselves are drawn uniformly.

The covariance of the multinomial \(\Bmch\) is: \[ \begin{align} \Cov{}{\Bmch} = \diag(\bmccatprob) - \bmccatprob \bmccatprob^\top. \end{align} \] Thus, substituting, we can once more verify explicitly that the posterior gradient kernel is indeed equal to the empirical predictive covariance kernel: \[ \begin{align} &\bmcpostgradkernel{\x_i ; \x_j} \\ &\quad= \bmcgradmupred{\x_i}{\bmccatprob} \, (\diag(\bmccatprob) - \bmccatprob \bmccatprob^\top) \, \bmcgradmupred{\x_j}{\bmccatprob}^\top\\ &\quad= \mupred{\x_i}{\cdot}^\top \diag(\bmccatprob) \, \mupred{\x_j}{\cdot} - (\mupred{\x_i}{\cdot}^\top \, \bmccatprob) \, (\bmccatprob^\top \, \mupred{\x_j}{\cdot}) \\ &\begin{split}\quad= &\frac{1}{K} \sum_\bmch \mupred{\x_i}{\w_\bmch} \, \mupred{\x_j}{\w_\bmch}^\top \\ &- \left (\frac{1}{K} \sum_\bmch \mupred{\x_i}{\w_\bmch} \right ) \, \left (\frac{1}{K} \sum_\bmch \mupred{\x_j}{\w_\bmch}\right )\end{split}\\ &\quad= \empredkernel{\x_i; \x_j}. \end{align} \] This demonstrates that a Bayesian model can be constructed on top of a non-differentiable ensemble model. Bayesian inference in this context aims to identify the most suitable member of the ensemble. Given the limited number of samples and the likelihood of model misspecification, it is likely that none of the members accurately represents the true model. However, for active learning purposes, the main focus is solely on quantifying the degree of disagreement among the ensemble members.
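
The identity can also be checked numerically in a few lines; in this sketch, `preds` holds synthetic values of \(\mupred{\x_j}{\w_k}\) with one row per ensemble member:

```python
import numpy as np

rng = np.random.default_rng(5)
K, n = 10, 6
preds = rng.normal(size=(K, n))                  # mu(x_j; w_k), one column per input

q = np.full(K, 1.0 / K)                          # uniform q over the K hypotheses
multinomial_cov = np.diag(q) - np.outer(q, q)    # Cov[Psi]
kernel_via_psi = preds.T @ multinomial_cov @ preds

centered = preds - preds.mean(axis=0, keepdims=True)
empirical_kernel = centered.T @ centered / K

assert np.allclose(kernel_via_psi, empirical_kernel)
```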

A similar Bayesian model using Bayesian Model Combination (BMC) could be set up which allows for arbitrary convex mixtures of the ensemble members. This would entail using a Dirichlet distribution \(\Bmch \sim \Dirichlet(\bmcparam)\) instead of the multinomial distribution. Assuming an uninformative prior (\(\bmcparam_k \triangleq \frac{1}{K}\)), this leads to the same results up to a constant factor of \(1+\sum_k \bmcparam_k = 2\). This is pleasing because it does not matter whether we use a multinomial or a Dirichlet distribution: whether we take a hypothesis-space view with a ‘true’ hypothesis or accept that our model is likely misspecified and we are dealing with a mixture of models, the results are the same up to a constant factor.
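
For completeness, the constant factor can be read off from the standard Dirichlet covariance, \(\Cov{}{\Bmch} = (\diag(\bar{\bmcparam}) - \bar{\bmcparam}\bar{\bmcparam}^\top)/(1 + \sum_k \bmcparam_k)\) with \(\bar{\bmcparam} = \bmcparam / \sum_k \bmcparam_k\). A two-line numpy check under the uninformative choice \(\bmcparam_k = \frac{1}{K}\):

```python
import numpy as np

K = 10
alpha = np.full(K, 1.0 / K)                # uninformative Dirichlet parameters
alpha0 = alpha.sum()                       # = 1
q = alpha / alpha0                         # Dirichlet mean (uniform)
dirichlet_cov = (np.diag(q) - np.outer(q, q)) / (alpha0 + 1)
multinomial_cov = np.diag(q) - np.outer(q, q)
assert np.allclose(dirichlet_cov, multinomial_cov / 2)   # same kernel up to a factor of 2
```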

Application to DNNs, BNNs, and Other Models. The proposed approach has relevance due to its versatility, as it can be applied to a wide range of models that can be consistently queried for prediction, including deep ensembles, Bayesian neural networks (BNNs), and non-differentiable models. The kernel used in this approach is simple to implement and scales in the number of empirical predictions per sample, rather than in the parameter space, as seen in other methods such as Ash et al (2022).

Conclusion

This blog post has introduced the main points of the TMLR paper. We have discussed the results that show that black-box variants of common active learning approaches work as well as and sometimes even better than the white-box variants for regression on the datasets we have examined. We have also dived into the details of the predictive covariance kernel and its relation to the posterior gradient kernel. For non-differentiable models, we have considered a Bayesian model that gives another perspective on query-by-committee.

Overall, the results partially answer one of the research questions posed by my previous paper Kirsch & Gal, 2022: how do prediction-based methods compare to parameter-based ones? We have found that, for regression at least, the black-box prediction-based methods are competitive with the white-box parameter-based methods in batch active learning.

Acknowledgements

Thanks to the anonymous TMLR reviewers for their patience and kind feedback during the review process. Their feedback has significantly improved the paper. Likewise, many thanks to David Holzmüller for his constructive and helpful feedback, as well as for making the framework and results from their paper easily available.


Appendix: Novelty

The empirical predictive covariance kernel itself is not novel: prior works have explored prediction-focused uncertainty estimation. However, presenting it as a general technique for black-box active learning and empirically showing strong performance across model types is a novel contribution:

  1. Connecting the empirical kernel to gradient-based methods via approximation for neural networks has been explored before but provides useful insight. The novelty here only exists in clearly framing the empirical and gradient techniques under one unifying lens.

  2. Extending the empirical kernel to non-differentiable models via Bayesian modeling is perhaps the most significant novelty claim here. While query-by-committee is an old idea, explicitly connecting it to modern kernel viewpoints and showing its effectiveness for current model types like random forests is novel to the best of my knowledge.

  3. The experimental results are a key piece of the contribution. Demonstrating the empirical kernel matches and exceeds gradient methods empirically across datasets makes a stronger statement than just the theoretical connections. And the results for non-differentiable models are especially compelling.

Appendix: Strongest White-Box \(\square\) Variants

(a) Active Learning Performance · (b) Batch Size Ablation · (c) Ensemble Size Ablation
Figure A.1. Deep Neural Networks: Mean logarithmic RMSE over 15 regression datasets. (a) We see that black-box \(\blacksquare\) methods (10 ensemble members) work as well as strongest variants of the white-box \(\square\) methods, and in most cases better, with the exception of ACS-FW and BAIT. (b) As expected, performance decreases with acquisition batch size. (c) As expected, performance increases with ensemble size.
Figure A.2. Average Logarithmic RMSE by regression datasets for DNNs: \(\blacksquare\) vs \(\square\) (vs Uniform). Across acquisition functions, the performance of black-box methods is highly correlated with the performance of white-box methods, even though black-box methods make fewer assumptions about the model. We plot the improvement of the white-box \(\square\) method over the uniform baseline on the x-axis (so for datasets with markers right of the dashed vertical lines, the white-box method performs better than uniform) and the improvement of the black-box \(\blacksquare\) method over the uniform baseline on the y-axis (so for datasets with markers above the dashed horizontal lines, the black-box method performs better than uniform). For datasets with markers in the \(\blacksquare\) diagonal half, the black-box method performs better than the white-box method. The average over all datasets is marked with a star \(\star\). Surprisingly, on average over all acquisition rounds, the black-box methods perform slightly better than the white-box methods for all but ACS-FW and BAIT.

  1. We do not require a prior distribution as active learning is not concerned with how we arrive at the model we want to acquire labels for. We can define a prior and perform, e.g., variational inference, but we do not need to. Hence, we use \(\pof{\w}\).
