Training machine learning models can require massive labeled datasets. Active learning aims to reduce labeling costs by selecting the most informative samples for labeling. But how well do prediction-focused *black-box* techniques compare to parameter-focused *white-box* methods?

This post summarizes the main contributions of my paper “Black-Box Batch Active Learning for Regression” (**B³AL**), now published in TMLR, which introduces a general black-box batch active learning approach for regression that is competitive with white-box methods while using only model predictions.

This approach is applicable to a wide range of models, including both differentiable models like neural networks and non-differentiable models like random forests, with promising results.

The experiments show that the black-box prediction-based active learning in **B³AL** matches and sometimes even improves on white-box gradient-based active learning for neural networks, which arguably has access to more information about the models.

**B³AL** builds on top of “A Framework and Benchmark for Deep Batch Active Learning for Regression” by Holzmüller et al. (2022).

The post also discusses the benefits and limitations of black-box batch active learning, the theory behind it via the empirical predictive covariance kernel, and its connection to gradient-based methods.

Holzmüller et al. (2022) propose *kernel-based versions* of contemporary batch active learning methods for neural networks and regression tasks. They unify methods such as BALD, BatchBALD, BAIT, BADGE, and Core-Set using gradient-based kernels, and their experiments show strong active learning performance on regression tasks.

We follow the evaluation protocol from Holzmüller et al. (2022).

In Figures 1 and 2, we see that **B³AL** is competitive with white-box active learning when using BALD, BatchBALD, BAIT, BADGE, and Core-Set.

**Why Can Black-Box Methods Outperform White-Box Methods?** Both white-box and black-box methods are based on kernels, which can be viewed as different *approximations* of the predictive covariance kernel (as we will examine below). White-box methods implicitly assume that the predictive covariance kernel is well approximated by the Fisher information kernel and the gradient kernel. Using empirical predictions, as in **B³AL**, to approximate the predictive covariance kernel can be more robust: the different ensemble members can reside in different modes of the parameter distribution, allowing black-box methods to outperform their white-box counterparts.

In Figure 3, we observe that **B³AL** is effective for non-differentiable models, including random forests and gradient-boosted decision trees. BALD for non-differentiable models can be considered equivalent to query-by-committee (QbC).

Figures 1 and 3, panels (b) and (c), respectively, show ablations of the ensemble size and the acquisition batch size.

**Ensemble Size.** Increasing the ensemble size generally improves performance for all acquisition approaches except LMCD, for which performance first improves and then degrades. The performance increase with uniform sampling is due to better model predictions alone. This shows that the improvements stemming from more informative sample acquisition at larger ensemble sizes can be significantly larger than the improvements due to better model predictions.

**Acquisition Batch Size.** Performance generally decreases with increasing acquisition batch size for all acquisition approaches. Comparing white-box and black-box methods, we see that only at the largest batch size of 4096 do the white-box methods perform as well as the black-box methods for neural networks.

**Best-performing Kernels and Selection Methods.** Holzmüller et al. (2022) identify the best-performing white-box kernels and selection methods. For **B³AL**, we show the results of an ablation comparing black-box methods with these best-performing white-box kernels.

**B³AL** shows that a simple extension of kernel-based methods to use empirical predictions rather than gradient kernels can be surprisingly effective and enables black-box batch active learning with good performance.

The main limitation of **B³AL** lies in obtaining a sufficient number of empirical predictions. This can be challenging, particularly when using deep ensembles with larger models or non-differentiable models that cannot be parallelized efficiently. The experiments using virtual ensembles indicate that the diversity of the ensemble members plays a crucial role in determining performance. The main limitation of the empirical comparisons is that we only consider regression tasks. Extending the results to classification is an important direction for future work.

In summary, **B³AL** offers competitive performance with limited assumptions. The empirical kernel is simple to implement. However, generating sufficient predictions for the empirical covariance estimation can be expensive, especially for large neural networks. Carefully tuning the ensemble method and size is key.

Here, we describe the problem setting, motivate the proposed method, and look at the theory behind **B³AL**.

**Problem Setting.** The proposed method is inspired by the BALD family of active learning frameworks.^{1} We assume a distribution \(\pof{\w}\) over the model parameters \(\w\). *Bayesian model averaging (BMA)* is performed by marginalizing over \(\pof{\w}\) to obtain the predictive distribution \(\pof{\y \given \x}\). Importantly, the choice of \(\pof{\w}\) covers ensembles as well.

**Pool-based Active Learning** assumes access to a pool set of unlabeled data \(\Dpool=\xpools\) and a small initially labeled training set \(\Dtrain=\xytrains\), or \(\Dtrain = \emptyset\). In the batch acquisition setting, we want to repeatedly acquire labels for a subset \(\xacqs\) of the pool set of a given acquisition batch size \(\batchvar\) and add them to the training set \(\Dtrain\). Ideally, we want to select samples that are highly ‘informative’ for the model. For example, these could be samples that are likely to be misclassified or have a large prediction uncertainty for models trained on the currently available training set \(\Dtrain\). Once we have chosen such an *acquisition batch* \(\xacqs\) of unlabeled data, we acquire labels \(\yacqs\) for these samples and train a new model on the combined training set \(\Dtrain \cup \xyacqs\) and repeat the process. Crucial to the success of active learning is the choice of acquisition function \(\acqf(\xacqs; \pof{\w})\) which is a function of the acquisition batch \(\xacqs\) and the distribution \(\pof{\w}\) and which we try to maximize in each acquisition round. It measures the informativeness of an acquisition batch for the current model.
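To make the loop concrete, here is a minimal sketch in Python. The helper names `train_fn` and `acquisition_fn` are illustrative placeholders (not from the paper), and greedy top-\(k\) selection stands in for maximizing the batch acquisition function \(\acqf(\xacqs; \pof{\w})\):

```python
import numpy as np

def active_learning_loop(x_pool, y_oracle, acquisition_fn, train_fn,
                         batch_size=2, num_rounds=3):
    """Generic pool-based batch active learning loop (sketch).

    acquisition_fn(model, x_candidates) scores each candidate; we greedily
    take the top batch_size as a simple proxy for batch acquisition.
    train_fn(x, y) returns a freshly trained model.
    """
    pool_idx = np.arange(len(x_pool))
    train_idx = np.array([], dtype=int)
    model = train_fn(x_pool[train_idx], y_oracle[train_idx])
    for _ in range(num_rounds):
        scores = acquisition_fn(model, x_pool[pool_idx])
        # Select the acquisition batch: the highest-scoring pool samples.
        chosen = pool_idx[np.argsort(scores)[-batch_size:]]
        train_idx = np.concatenate([train_idx, chosen])
        pool_idx = np.setdiff1d(pool_idx, chosen)
        # Acquire labels for the batch and retrain on the enlarged set.
        model = train_fn(x_pool[train_idx], y_oracle[train_idx])
    return model, train_idx
```

In practice, `train_fn` would fit an ensemble and `acquisition_fn` would score candidates with one of the kernel-based methods discussed below.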

**Univariate Regression** is a common task in machine learning. We assume that the target \(\y\) is real-valued (\(\in \realnum\)) with homoscedastic Gaussian noise: \[\begin{align}
\Y \given \x, \w &\sim \normaldist{\mupred{\x}{\w}}{\noiseobs^2}. \label{eq:gaussian_noise}
\end{align}\] Equivalently, \(\Y \given \x, \w \sim \mupred{\x}{\w}+\varepsilon\) with \(\varepsilon \sim \normaldist{0}{\noiseobs^2}\). As usual, we assume that the noise is independent for different inputs \(\x\) and parameters \(\w\). Homoscedastic noise is a special case of the general heteroscedastic setting: the noise variance is simply a constant. Our approach can be extended to heteroscedastic noise by substituting a function \(\Sigmapred{\x}{\w}\) for \(\noiseobs\), but for this work we limit ourselves to the simplest case.
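As a quick sketch (the function and argument names are mine, not the paper's), generating labels under this homoscedastic noise model is a one-liner:

```python
import numpy as np

def sample_targets(mu_fn, w, xs, noise_std, rng):
    """Draw y ~ N(mu(x; w), noise_std^2), independently per input."""
    # Homoscedastic: the same noise_std applies to every input x.
    return mu_fn(xs, w) + noise_std * rng.normal(size=len(xs))
```

The heteroscedastic extension would simply replace `noise_std` with a per-input function of `x` and `w`.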

While a full treatment is naturally beyond the scope of this post, we briefly review some key ideas of Holzmüller et al. (2022).

**Gaussian Processes** are one way to introduce kernel-based methods. A simple way to think about Gaussian Processes is as distributions over functions whose predictions at any finite set of inputs are jointly Gaussian.

**Multivariate Gaussian Distribution.** The distinctive property of a Gaussian Process is that all predictions are jointly Gaussian distributed. We can then write the joint distribution for a univariate regression model as: \[
\begin{align}
&\Y_1, \ldots, \Y_n \given \x_1, \ldots, \x_n \sim
\normaldist{\mathbf{0}}{\Cov{}{\mu(\x_1), \ldots, \mu(\x_n)} + \noiseobs^2 \mathbf{I}}, \label{eq:gp_joint}
\end{align}
\] where \(\mu(\x)\) are the observation-noise-free predictions as random variables and \(\Cov{}{\mu(\x_1), \ldots, \mu(\x_n)}\) is the covariance matrix of the predictions. The covariance matrix is defined via the kernel function \(k(\x, \x')\): \[
\begin{align}
\Cov{}{\mu(\x_1), \ldots, \mu(\x_n)} =
\begin{bmatrix}
k(\x_i, \x_j)
\end{bmatrix}_{i,j=1}^{n,n}.
\end{align}
\] The kernel function \(k(\x, \x')\) can be chosen almost arbitrarily; see, e.g., §4 in Williams & Rasmussen (2006).
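As a small numerical sketch (the RBF kernel here is just one example choice of mine), we can build the Gram matrix and the joint covariance of the noisy observations:

```python
import numpy as np

def rbf_kernel(x, xp, lengthscale=1.0):
    # One example of a valid kernel; many other choices are possible.
    return np.exp(-0.5 * (x - xp) ** 2 / lengthscale ** 2)

def gram_matrix(kernel, xs):
    """Covariance matrix [k(x_i, x_j)]_{i,j} of the noise-free predictions."""
    return np.array([[kernel(xi, xj) for xj in xs] for xi in xs])

xs = np.array([0.0, 0.5, 1.0])
noise_std = 0.1
# Covariance of (Y_1, ..., Y_n): prediction covariance plus observation noise.
joint_cov = gram_matrix(rbf_kernel, xs) + noise_std ** 2 * np.eye(len(xs))
```

The resulting `joint_cov` is exactly the covariance matrix appearing in the joint Gaussian above.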

**Fisher Information & Linearization.** When using neural networks for regression, the gradient kernel \[
\begin{align}
\gradkernel{\x; \x' \given \wstar} &\triangleq
\gradmupred{\x}{\wstar} \xHessian{[-\log \pof{\wstar}]}^{-1} \gradmupred{\x'}{\wstar}^\top \\
&= \langle \gradmupred{\x}{\wstar}, \gradmupred{\x'}{\wstar} \rangle_{\xHessian{[-\log \pof{\wstar}]}^{-1}}
\end{align}
\] is the canonical choice, where \(\wstar\) is a *maximum likelihood* or *maximum a posteriori* estimate (MLE, MAP) and \(\xHessian{[-\log \pof{\wstar}]}\) is the Hessian of the negative log likelihood at \(\wstar\). Note that \(\gradmupred{\x}{\wstar}\) is a *row* vector. Commonly, the prior is a Gaussian distribution with an identity covariance matrix, and thus \(\xHessian{[-\log \pof{\wstar}]} = \mathbf{I}\).
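To illustrate with a toy model of my own choosing (and taking the Hessian to be the identity, as in the common case above), the gradient kernel can be computed numerically:

```python
import numpy as np

def grad_kernel(mu, w_star, x1, x2, eps=1e-6):
    """<grad_w mu(x1; w*), grad_w mu(x2; w*)>, assuming an identity Hessian.

    Gradients are taken by central differences purely for illustration.
    """
    def grad(x):
        g = np.zeros_like(w_star)
        for i in range(len(w_star)):
            dw = np.zeros_like(w_star)
            dw[i] = eps
            g[i] = (mu(x, w_star + dw) - mu(x, w_star - dw)) / (2 * eps)
        return g
    return float(grad(x1) @ grad(x2))

# Toy linear model mu(x; w) = w_0 + w_1 * x, so grad_w mu = (1, x)
# and the gradient kernel is exactly 1 + x * x'.
mu = lambda x, w: w[0] + w[1] * x
w_star = np.array([0.3, -1.2])
```

For this linear model, the kernel is independent of \(\wstar\); for a neural network, the Jacobian (and hence the kernel) depends on the trained parameters.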

The significance of this kernel lies in its relationship with the Fisher information matrix at \(\wstar\).

**Posterior Gradient Kernel.** We can use the well-known properties of multivariate normal distributions to marginalize or condition the joint distribution above. Following Holzmüller et al. (2022), conditioning on the training data \(\Dtrain\) yields the *posterior gradient kernel* \(\postgradkernel{\x; \x' \given \wstar}\).

Importantly for active learning, the multivariate normal distribution is the maximum-entropy distribution for a given covariance matrix, and its entropy is thus an upper bound for the entropy of any distribution with the same covariance matrix. The entropy is given by the log-determinant of the covariance matrix: \[\begin{align} \Hof{\Y_1, \ldots, \Y_n \given \x_1, \ldots, \x_n} &= \frac{1}{2} \log \det (\Cov{}{\mu(\x_1), \ldots, \mu(\x_n)} + \sigma^2 \mathbf{I}) + C_n, \end{align}\] where \(C_n \triangleq \frac{n}{2} \log(2 \, \pi \, e)\) is a constant that only depends on the number of samples \(n\). Connecting kernel-based methods to information-theoretic quantities like the expected information gain, we then know that the respective acquisition scores are upper bounds on the actual expected information gain.
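A minimal sketch of this log-determinant score (using numpy's `slogdet` for numerical stability; the function name is mine):

```python
import numpy as np

def gaussian_entropy_upper_bound(pred_cov, noise_std):
    """1/2 logdet(K + sigma^2 I) + n/2 log(2 pi e).

    The joint entropy of the Gaussian with this covariance, and hence an
    upper bound for any distribution with the same covariance matrix.
    """
    n = pred_cov.shape[0]
    sign, logdet = np.linalg.slogdet(pred_cov + noise_std ** 2 * np.eye(n))
    assert sign > 0, "covariance plus noise must be positive definite"
    return 0.5 * logdet + 0.5 * n * np.log(2 * np.pi * np.e)
```

Maximizing this score over candidate batches is what determinant-based batch selection methods effectively do.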

The *predictive covariance kernel* \(\predkernel{\x_i; \x_j}\) is the covariance of the predicted means: \[
\begin{align}
\predkernel{\x_i; \x_j} \triangleq \Cov{\W}{\mushort{\x_i}{\w}; \mushort{\x_j}{\w}}.
\end{align}
\]

This is also simply known as the *covariance kernel* in the literature; we add *predictive* to make clear that we look at the covariance of the predictions. The resulting Gram matrix equals the covariance matrix of the predictions and is positive definite (for positive observation noise, and positive semi-definite otherwise), and is thus a valid kernel.

For \(K\) sampled model parameters \(\w_1, \ldots, \w_K \sim \pof{\w}\)—for example, the members of a deep ensemble—the *empirical predictive covariance kernel* \(\empredkernel{\x_i; \x_j}\) is the empirical estimate: \[
\begin{align}
\empredkernel{\x_i; \x_j} &\triangleq \emCov{\W}{\mushort{\x_i}{\w}; \mushort{\x_j}{\w}}\\
&= \frac{1}{K} \sum_{k=1}^K \left (\mushort{\x_i}{\w_k} - \frac{1}{K} \sum_{l=1}^K \mushort{\x_i}{\w_l} \right )\, \left (\mushort{\x_j}{\w_k} - \frac{1}{K} \sum_{l=1}^K \mushort{\x_j}{\w_l} \right )
\\
&= \left \langle \frac{1}{\sqrt{K}} (\cmushort{\x_i}{\w_1}, \ldots, \cmushort{\x_i}{\w_K}),
\frac{1}{\sqrt{K}} (\cmushort{\x_j}{\w_1}, \ldots, \cmushort{\x_j}{\w_K}) \right \rangle, \label{eq:empirical_pred_kernel}
\end{align}
\] with centered predictions \(\cmushort{\x}{\w_k} \triangleq \mushort{\x}{\w_k} - \frac{1}{K} \sum_{l=1}^K \mushort{\x}{\w_l}\). As we can write this kernel as an inner product, it also immediately follows that the empirical predictive covariance kernel is a valid kernel and positive semi-definite.
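In code, the empirical predictive covariance kernel is just the (biased, \(1/K\)-scaled) covariance of the stacked ensemble predictions. A sketch:

```python
import numpy as np

def empirical_pred_kernel(preds):
    """Gram matrix of the empirical predictive covariance kernel.

    preds: (K, n) array with preds[k, i] = mu(x_i; w_k) for ensemble
    member k. Centering over members and taking the 1/K-scaled inner
    product matches the definition above.
    """
    centered = preds - preds.mean(axis=0, keepdims=True)
    return centered.T @ centered / preds.shape[0]
```

Note the \(1/K\) (rather than \(1/(K-1)\)) scaling, matching the definition above; `np.cov(..., bias=True)` computes the same quantity.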

Similar to §C.1 in Holzmüller et al. (2022), we can show that the posterior gradient kernel approximates the predictive covariance kernel via linearization.

**Proof.** We use a first-order Taylor expansion of the mean function \(\mupred{\x}{\w}\) around \(\wstar\): \[
\begin{align}
\mupred{\x}{\w} \approx \mupred{\x}{\wstar} + \gradmupred{\x}{\wstar}\, \underbrace{(\w - \wstar)}_{\triangleq \Delta \w}.
\end{align}
\] Choose \(\wstar = \E{\w \sim \pof{\w \given \Dtrain}}{\w}\) (BMA). Then we have \(\E{\pof{\w \given \Dtrain}}{\mupred{\x}{\w}} \approx \mupred{\x}{\wstar}\). Overall this yields: \[
\begin{align}
\predkernel{\x_i; \x_j} &= \Cov{\w \sim \pof{\w \given \Dtrain}}{\mupred{\x_i}{\w}; \mupred{\x_j}{\w}} \\
&\approx \E{\wstar + \Delta \w \sim \pof{\w \given \Dtrain}}{\langle \gradmushort{\x_i}{\wstar}\, \Delta \w, \gradmushort{\x_j}{\wstar}\, \Delta \w \rangle} \\
&= \gradmushort{\x_i}{\wstar} \, \E{\wstar + \Delta \w \sim \pof{\w \given \Dtrain}}{ \Delta \w \Delta \w^\top} \, {\gradmushort{\x_j}{\wstar}}^\top \\
&= \gradmupred{\x_i}{\wstar} \, \Cov{}{\W \given \Dtrain} \, \gradmupred{\x_j}{\wstar}^\top \\
&\approx \postgradkernel{\x_i; \x_j \given \wstar}.
\end{align}
\] The intermediate expectation is the model covariance \(\Cov{}{\W \given \Dtrain}\) as \(\wstar\) is the BMA.

For the last step, we use the *generalized Gauss-Newton (GGN) approximation*, which connects the parameter covariance \(\Cov{}{\W \given \Dtrain}\) to the inverse Hessian \(\xHessian{[-\log \pof{\wstar}]}^{-1}\) in the definition of the gradient kernel.
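A quick numerical sanity check of the linearization argument (the setup is my own): for a linear model the Taylor expansion is exact, so the empirical predictive covariance kernel and the gradient kernel with the true parameter covariance should agree up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, n = 2, 10_000, 3
phi = rng.normal(size=(n, d))               # features phi(x_i); mu(x; w) = <phi(x), w>
w_cov = np.array([[0.5, 0.1], [0.1, 0.3]])  # parameter covariance Cov[W]
w_samples = rng.multivariate_normal(np.zeros(d), w_cov, size=K)
preds = w_samples @ phi.T                   # preds[k, i] = mu(x_i; w_k)

# Empirical predictive covariance kernel (1/K estimator).
centered = preds - preds.mean(axis=0, keepdims=True)
emp = centered.T @ centered / K

# Gradient kernel with the true parameter covariance: phi Cov[W] phi^T.
grad = phi @ w_cov @ phi.T
```

For a neural network, the same comparison would only hold approximately, as the proof above makes precise.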

How can we apply the above result to non-differentiable models?

In the following, we use a Bayesian view on the hypothesis space to show that we can connect the empirical predictive covariance kernel to a gradient kernel in this case, too. With \(\emOmega \triangleq (\w_1, \ldots, \w_K)\) fixed, e.g. the parameters of the members of an ensemble, we introduce a latent \(\Bmch\) to represent the index of the ‘true’ hypothesis \(\w_\bmch \in \emOmega\) from this empirical hypothesis space \(\emOmega\), which we want to identify. This is similar to query-by-committee (QbC).

Specifically, we model \(\Bmch\) using a one-hot categorical distribution, that is a multinomial distribution from which we draw one sample: \(\Bmch \sim \Multinomial(\bmccatprob, 1)\), with \(\bmccatprob \in S^{K-1}\) parameterizing the distribution, where \(S^{K-1}\) denotes the \(K-1\) simplex in \(\realnum^K\). Then, \(\bmccatprob_k = \pof{\Bmch = e_k}\), where \(e_k\) denotes the \(k\)-th unit vector; and \(\sum_{k=1}^K \bmccatprob_k = 1\). For the corresponding predictive mean \(\bmcmupred{\x}{\Bmch}\), we have: \[ \begin{align} \bmcmupred{\x}{\Bmch} &\triangleq \mupred{\x}{\w_\Bmch} = \langle \mupred{\x}{\cdot}, \Bmch \rangle, \end{align} \] where we use \(\w_\bmch\) to denote the \(\w_k\) when we have \(\bmch=e_k\) in slight abuse of notation, and \(\mupred{\x}{\cdot} \in \realnum^K\) is a column vector of the predictions \(\mupred{\x}{\w_k}\) for \(\x\) for all \(\w_k\). This follows from \(\bmch\) being a one-hot vector.

We now examine this model and its kernels. The BMA of \(\bmcmupred{\x}{\Bmch}\) matches the previous empirical mean if we choose \(\bmccatprob\) to have an uninformative uniform distribution over the hypotheses (\(\bmccatprob_k \triangleq \frac{1}{K}\)):

\[ \begin{align} \bmcmupred{\x}{\bmccatprob} &\triangleq \E{\pof{\bmch}}{\mupred{\x}{\w_\bmch}} \\ &= \langle \mupred{\x}{\cdot}, \bmccatprob \rangle \label{eq:empirical_bma_linearization} \\ &= \sum_{\bmch=1}^K \bmccatprob_\bmch \mupred{\x}{\w_\bmch} \\ &= \sum_{\bmch=1}^K \frac{1}{K} \mupred{\x}{\w_\bmch}. \end{align} \] What is the predictive covariance kernel of this model? And what is the posterior gradient kernel for \(\bmccatprob\)?

- The predictive covariance kernel \(\bmcpredkernel{\x_i, \x_j}\) for \(\emOmega\) using uniform \(\bmccatprob\) is equal to the empirical predictive covariance kernel \(\empredkernel{\x_i; \x_j}\).
- The ‘posterior’ gradient kernel \(\bmcpostgradkernel{\x_i ; \x_j}\) for \(\emOmega\) *with respect to* \(\Bmch\) using uniform \(\bmccatprob\) is equal to the empirical predictive covariance kernel \(\empredkernel{\x_i; \x_j}\).

The above gradient kernel is only the posterior gradient kernel in the sense that we have sampled \(\w_\bmch\) from the non-differentiable model after inference on training data. The samples themselves are drawn uniformly.

The covariance of the multinomial \(\Bmch\) is: \[ \begin{align} \Cov{}{\Bmch} = \diag(\bmccatprob) - \bmccatprob \bmccatprob^\top. \end{align} \] Thus, substituting, we can verify explicitly once more that the posterior gradient kernel is indeed equal to the predictive covariance kernel: \[ \begin{align} &\bmcpostgradkernel{\x_i ; \x_j} \\ &\quad= \bmcgradmupred{\x_i}{\bmccatprob} \, (\diag(\bmccatprob) - \bmccatprob \bmccatprob^\top) \, \bmcgradmupred{\x_j}{\bmccatprob}^\top\\ &\quad= \mupred{\x_i}{\cdot}^\top \diag(\bmccatprob) \, \mupred{\x_j}{\cdot} - (\mupred{\x_i}{\cdot}^\top \, \bmccatprob) \, (\bmccatprob^\top \, \mupred{\x_j}{\cdot}) \\ &\begin{split}\quad= &\frac{1}{K} \sum_\bmch \mupred{\x_i}{\w_\bmch} \, \mupred{\x_j}{\w_\bmch}^\top \\ &- \left (\frac{1}{K} \sum_\bmch \mupred{\x_i}{\w_\bmch} \right ) \, \left (\frac{1}{K} \sum_\bmch \mupred{\x_j}{\w_\bmch}\right )\end{split}\\ &\quad= \empredkernel{\x_i; \x_j}. \end{align} \] This demonstrates that a Bayesian model can be constructed on top of a non-differentiable ensemble model. Bayesian inference in this context aims to identify the most suitable member of the ensemble. Given the limited number of samples and the likelihood of model misspecification, it is likely that none of the members accurately represents the true model. However, for active learning purposes, the main focus is solely on quantifying the degree of disagreement among the ensemble members.
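This identity is easy to verify numerically. A sketch with a random toy ensemble (names are mine):

```python
import numpy as np

K, n = 4, 3
rng = np.random.default_rng(0)
preds = rng.normal(size=(K, n))  # preds[k, i] = mu(x_i; w_k)
p = np.full(K, 1.0 / K)          # uniform categorical over the members

# The gradient of <mu(x, .), b> w.r.t. b is the prediction vector mu(x, .),
# so the 'posterior' gradient kernel is mu_i^T (diag(p) - p p^T) mu_j.
cov_B = np.diag(p) - np.outer(p, p)
grad_kernel_gram = preds.T @ cov_B @ preds

# Empirical predictive covariance kernel (1/K estimator), for comparison.
centered = preds - preds.mean(axis=0, keepdims=True)
emp_kernel_gram = centered.T @ centered / K
```

The two Gram matrices coincide exactly, as the derivation above predicts.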

A similar Bayesian model using Bayesian Model Combination (BMC) could be set up which allows for arbitrary convex mixtures of the ensemble members. This would entail using a Dirichlet distribution \(\Bmch \sim \Dirichlet(\bmcparam)\) instead of the multinomial distribution. Assuming an uninformative prior (\(\bmcparam_k \triangleq \frac{1}{K}\)), this leads to the same results up to a constant factor of \(1+\sum_k \bmcparam_k = 2\). This is pleasing because it does not matter whether we use a multinomial or Dirichlet distribution, that is: whether we take a hypothesis space view with a ‘true’ hypothesis or accept that our model is likely misspecified and we are dealing with a mixture of models, the results are the same up to a constant factor.

**Application to DNNs, BNNs, and Other Models.** The proposed approach is notable for its versatility: it can be applied to a wide range of models that can be consistently queried for predictions, including deep ensembles, Bayesian neural networks, and non-differentiable models such as random forests and gradient-boosted decision trees.

This blog post has introduced the main points of the TMLR paper. We have discussed the results showing that *black-box* variants of common active learning approaches work as well as, and sometimes even better than, the *white-box* variants for regression on the datasets we have examined. We have also dived into the details of the predictive covariance kernel and its relation to the posterior gradient kernel. For non-differentiable models, we have considered a Bayesian model that gives another perspective on query-by-committee.

Overall, the results partially answer one of the research questions posed by my previous paper (Kirsch & Gal, 2022): *black-box* prediction-based methods are competitive with *white-box* parameter-based methods in batch active learning.

Thanks to the anonymous TMLR reviewers for their patience and kind feedback during the review process. Their feedback has significantly improved the paper. Likewise, many thanks to David Holzmüller for his constructive and helpful feedback, as well as for making the framework and results from their paper easily available.

The empirical predictive covariance kernel itself is not novel: prior works have explored prediction-focused uncertainty estimation. However, presenting it as a general technique for black-box active learning and empirically showing strong performance across model types is a novel contribution.

Connecting the empirical kernel to gradient-based methods via approximation for neural networks has been explored before but provides useful insight. The novelty here lies only in clearly framing the empirical and gradient techniques under one unifying lens.

Extending the empirical kernel to non-differentiable models via Bayesian modeling is perhaps the most significant novelty claim here. While query-by-committee is an old idea, explicitly connecting it to modern kernel viewpoints and showing its effectiveness for current model types like random forests is novel to the best of my knowledge. The experimental results are a key piece of the contribution. Demonstrating that the empirical kernel matches and exceeds gradient methods empirically across datasets makes a stronger statement than just the theoretical connections. And the results for non-differentiable models are especially compelling.

We do not require a prior distribution as active learning is not concerned with how we arrive at the model we want to acquire labels for. We can define a prior and perform, e.g., variational inference, but we do not *need* to. Hence, we use \(\pof{\w}\).↩︎