All Models are Wrong, Some are Useful: Model Selection with Limited Labels

The machine learning landscape is rich with pretrained models (or LoRas) available through platforms like Hugging Face, PyTorch Hub, and AutoML services. While this abundance of models is amazing, it poses an important challenge: how do we select the best model for a specific downstream task and dataset, especially when creating evaluation datasets by labeling new data is expensive?

Our paper “All models are wrong, some are useful: Model Selection with Limited Labels” accepted at AISTATS 2025 introduces Model Selector, a method designed to tackle this challenge head-on. It addresses the research question:

Given a pool of unlabeled data, how can we identify the most informative examples to label in order to select the best classifier for this data, both in a model-agnostic and label-efficient manner?

This seems similar to active learning and/or active testing, but active model selection is a distinct task with its own unique challenges. Active learning focuses on improving a single model’s performance through strategic data labeling, while active testing aims to efficiently evaluate a model’s performance on a test set.

In contrast, active model selection aims to identify the best model from a pool of pre-existing candidates with minimal labeling effort. Rather than iteratively training or thoroughly testing a model, we’re trying to efficiently determine which already-trained model performs best on our target data distribution through selective and strategic labeling of informative examples.

Model Selector offers a principled approach to minimize the labeling effort required to identify the optimal pretrained classifier from a given collection for a specific target dataset. It operates in a pool-based setting, assuming access to a large set of unlabeled target data but only a small budget for acquiring labels.

Figure 1 (from the paper). An overview of the Model Selector pipeline. Given unlabeled data and multiple pretrained models, Model Selector identifies a small, informative subset of examples to label (\(b \ll n\)). These labels are used to efficiently select the best model for the target data.

Information-Theoretic Selection

The core idea behind Model Selector is to treat the identity of the actually best model (\(H\)) for the target data as an unknown random variable. The goal is to reduce the uncertainty about \(H\) as quickly as possible by strategically selecting which data points to label.

Model Selector frames this as an information-gathering problem. It aims to select the example \(x\) from the unlabeled pool \(\mathcal{U}_t\) at step \(t\) that maximizes the mutual information between the (unknown) label \(Y\) of that example and the identity of the best model \(H\), given the already labeled data \(\mathcal{L}_t\):

\[x_t = \arg \max_{x\in \mathcal{U}_{t}}\ \mathbb{I}[H; Y \mid x, \mathcal{L}_{t}]\]

This is equivalent to selecting the example that minimizes the expected posterior entropy of the models after observing the label:

\[x_t = \arg \min_{x\in \mathcal{U}_{t}} \ \mathbb{E}_{Y} [\mathbb{H}[H \mid \mathcal{L}_{t}\cup \{(x, Y)\}]]\]

Essentially, Model Selector asks: “Which unlabeled data point, if labeled, would be most informative in helping us distinguish between the candidate models and identify the best one?”

A Simplified Model and Parameter Estimation

Now the above model is fairly standard in the Bayesian optimal experiment design literature (and also the Bayesian active learning literature). The crucial insight of our work, in my opinion, is that we can use a very simple probabilistic model to capture the relationship between the candidate models and the true labels, which in turn is incredibly data efficient, despite being obviously a simplification and thus misspecified.

To compute this information gain, Model Selector makes a simplifying assumption. It models the relationship between any candidate model \(h_j\) and the true labels using a single parameter \(\epsilon,\) representing the probability that the model’s prediction is incorrect:

\[\mathbb{P}(h_j(x) \neq y \mid H=h_j) = \epsilon\] \[\mathbb{P}(h_j(x) = y \mid H=h_j) = 1 - \epsilon\]

This parameter captures the expected error rate or disagreement. Importantly, we find that \(\epsilon\) can be estimated without requiring any ground truth labels. Instead, we uses pseudo-labels generated from the collective ensemble predictions of the candidate models themselves. A grid search is performed using these pseudo-labels to find the \(\epsilon\) that yields the best model selection performance under this pseudo-oracle, and this value is then used for the actual selection process with the real labeling oracle. This self-supervised estimation makes Model Selector highly practical as we have to compute these pseudo-labels only once, and they are needed for the actual selection process anyway (to compute the expected entropy).

Key Properties of Model Selector:

Model-Agnostic: It treats models as black boxes, requiring only their predicted labels (hard predictions) for a given input. It doesn’t need access to model internals, architectures, or confidence scores (soft predictions).
Label-Efficient: It aims to minimize the number of labels needed to find the best (or a near-best) model.
Principled: Based on maximizing information gain about the identity of the best model.
Practical: The key parameter \(\epsilon\) can be estimated without ground truth labels.

The Algorithm

The algorithm proceeds iteratively:

Initialize with a pool of unlabeled data \(\mathcal{U}_0\) and an empty labeled set \(\mathcal{L}_0\). Estimate \(\epsilon\) using pseudo-labels from the predictions of the ensemble of candidate models.
For each step \(t\) up to the budget \(b\):
1. For each unlabeled example \(x \in \mathcal{U}_t\), calculate the expected posterior entropy \(\mathbb{E}_{Y} [\mathbb{H}(H| \mathcal{L}_{t}\cup \{(x, Y)\})]\) using the current hypothesis posterior and the estimated \(\epsilon\).
2. Select the example \(x_t\) that minimizes this expected entropy.
3. Query the true label \(y_t\) for \(x_t\) from an oracle.
4. Update the labeled set \(\mathcal{L}_{t+1} = \mathcal{L}_t \cup \{(x_t, y_t)\}\), remove \(x_t\) from \(\mathcal{U}_t\), and update the posterior probability distribution over the best model hypothesis \(\mathbb{P}(H|\mathcal{L}_{t+1})\) using Bayes’ rule and the observed label \(y_t\).
After \(b\) labels are acquired, select the model \(h_j\) with the highest accuracy on the labeled set \(\mathcal{L}_b\).

Experimental Results

The paper presents extensive experiments across 18 model collections (over 1,500 pretrained models in total) on 16 diverse datasets (including ImageNet variants, PACS for domain adaptation, GLUE tasks, CIFAR-10, etc.). Model Selector is compared against random sampling and adapted versions of standard active learning strategies like Uncertainty Sampling, Margin Sampling, Active Model Comparison (AMC), and Variance Minimization Approach (VMA).

The results demonstrate Model Selector’s effectiveness:

Best Model Identification: Model Selector consistently identifies the true best model with significantly fewer labels compared to baselines. The paper reports label cost reductions of up to 94.15% compared to the strongest baseline to achieve 100% identification probability (see Figure 2 in the paper, also reproduced below, for detailed plots). For instance, on ImageNet collections, label costs were reduced by ~70% compared to baselines like Uncertainty or Margin sampling.
Near-Best Model Identification: Even when aiming for a model within a small accuracy margin (e.g., 1%, 0.5%, 0.1%) of the true best, Model Selector remains highly efficient. It achieves label cost reductions of up to 72.41% compared to the best baseline for finding a near-best model (within 1% accuracy). See Table 1 in the paper for details.
Robustness: Model Selector shows robust performance even in worst-case scenarios. The 95th percentile accuracy gap (the gap between the selected model’s accuracy and the best model’s accuracy, exceeded in only 5% of runs) is consistently lower for Model Selector than for baselines across various identification probability thresholds (see Table 2 in the paper). This indicates that even when Model Selector doesn’t pick the absolute best model, it tends to pick one that is very close in performance.
Adaptability: The method performs well even in challenging scenarios like the Drift dataset (significant distribution shift) by automatically adjusting \(\epsilon\) based on model disagreement inferred from noisy labels.

Figure 2 (from the paper). Best model identification probability across various datasets and model collections. Model Selector (blue line) consistently reaches high identification probability with fewer labels (horizontal axis) compared to baselines.

Conclusion

Model Selector provides a practical, theoretically grounded, and highly efficient solution for selecting the best pretrained classifier when labeling costs are a constraint. Its model-agnostic nature and reliance only on hard predictions make it broadly applicable. By intelligently choosing which examples to label based on information gain, it dramatically reduces the effort needed for effective model selection in real-world applications.

The work opens avenues for future research, including extensions to settings with soft predictions or selecting generative models based on limited demonstrations.

Code

The code for Model Selector is available at:

https://github.com/RobustML-Lab/model-selector