Our paper “A Note on ‘Assessing Generalization of SGD via Disagreement’”

**TL;DR.** We simplify the theoretical statements and proofs, showing them to be straightforward within a probabilistic context (unlike the original hypothesis space view). Empirically, we ask whether the suggested theory might be impractical under distribution shift because model calibration can deteriorate as prediction disagreement increases, and we show this deterioration on CIFAR-10/CINIC-10 and ImageNet/PACS. Distribution shift is precisely when the proposed coupling of test error and disagreement is of the most interest, yet labels are needed to estimate the calibration on new datasets. The authors of “Assessing Generalization (of SGD)^{1} via Disagreement” seem to agree with this: for their camera-ready version, they adjusted the paper to state more clearly that it only covers in-distribution data.

Several recent papers connect the generalization error of a model to the model’s prediction disagreement.

Prediction disagreement also approximates a model’s predictive entropy (uncertainty). This can be seen via a first-order Taylor expansion of Shannon’s information content around \(p(y|x) = 1\): \(H(p(y|x)) = -\log p(y|x) \ge 1 - p(y|x)\), followed by taking the expectation over \(p(y|x)\). Predictive entropy and maximum class confidence are well-known metrics in the OOD detection literature.
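The bound can be spelled out in one line. This is a sketch that uses the standard inequality \(\log p \le p - 1\); the symbol \(H[Y \mid x]\) for the predictive entropy is notation assumed here for illustration:

\[
H[Y \mid x] = \mathbb{E}_{p(y|x)} \, [-\log p(y|x)] \ge \mathbb{E}_{p(y|x)} \, [1 - p(y|x)] = 1 - \sum_y p(y|x)^2 .
\]

The right-hand side is the probability that two labels drawn independently from \(p(y|x)\) differ, so the predictive entropy upper-bounds this per-sample disagreement.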

If we were to summarize the major (empirical) claim of these works, it would be that prediction disagreement is a good proxy for the generalization error and can be used to estimate it. On average, we can trust a model’s predictive uncertainty to tell us when the model will be wrong for a given sample. That is, after using the prediction disagreement to identify samples that are likely to be wrong, the fraction of such samples will likely be close to the generalization error that we could measure if we had the actual labels.

Please note that this is very simplified, and the various papers differ in their claims and approaches. It does capture the gist, however.

However, it is easy to misunderstand the claims of these recent papers:

It is tempting to extend this claim to data under distribution shift. Then we could use the prediction disagreement to estimate the generalization error on a new dataset, even when we do not have access to the labels of the new dataset and do not know how that dataset differs from the training data.

We have to be careful with this, however. The prediction disagreement can be a good proxy for the generalization error on data from the training distribution when the model is well-calibrated, but it is not necessarily a good proxy for the generalization error on new data. The reason is that the gap between prediction disagreement and generalization error is bounded by the model’s calibration.

Jiang et al. (2022)

We empirically validate on CIFAR-10/CINIC-10 and ImageNet/PACS that the proposed calibration metrics worsen as prediction disagreement increases, which is also in line with Ovadia et al. (2019).

This takeaway likely also applies to other works: as a model gets more uncertain about its predictions, e.g., higher prediction disagreement, it likely becomes less reliable and calibrated.

Another takeaway of our paper is that a probabilistic notation (and modeling the parameters as a parameter distribution) is easier to work with than a hypothesis space: we look at the proposed calibration metrics and theoretical results, and we simplify them quite a bit.

Jiang et al. (2022) introduce the **Generalization Disagreement Equality (GDE)**: a model satisfies *GDE* when its predicted error (prediction disagreement) equals the actual generalization error.

The predicted error is the error if the model’s predictions were true for a sample: e.g., if the model predicts 80% probability for class A and 20% for class B, then the predicted accuracy \(\mathbb{E}_{p(y|x)} \, [p(y|x)]\) is 80% * 80% + 20% * 20% = 68% and the predicted error (prediction disagreement) is 1 - 0.68 = 32%.
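The arithmetic above can be reproduced in a few lines. This is a minimal sketch; the function names `predicted_accuracy` and `predicted_error` are hypothetical and chosen only for this illustration:

```python
def predicted_accuracy(probs):
    """E_{p(y|x)}[p(y|x)]: accuracy if labels were drawn from the model's own prediction."""
    return sum(p * p for p in probs)


def predicted_error(probs):
    """Prediction disagreement for a single sample: 1 - predicted accuracy."""
    return 1.0 - predicted_accuracy(probs)


probs = [0.8, 0.2]  # class A and class B from the example in the text
print(f"{predicted_accuracy(probs):.2f} {predicted_error(probs):.2f}")  # prints: 0.68 0.32
```

A fully confident prediction such as `[1.0, 0.0]` has a predicted error of 0, and a uniform prediction over two classes has a predicted error of 0.5.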

We show that GDE immediately follows from the proposed calibration metrics (class-wise and class-aggregated calibration error). We also look at prior art and find that class-wise and class-aggregated calibration error have been implemented previously by Nixon et al. (2019).
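As a rough illustration of the two quantities that GDE compares, the sketch below estimates both from an ensemble's hard (one-hot) predictions. It assumes that the ensemble members stand in for samples from the parameter distribution and that pairs of members are drawn with replacement. The function names and the toy data are hypothetical, and this is not the evaluation code of any of the papers.

```python
from collections import Counter


def mean_predictive(hard_predictions, num_classes):
    """Fraction of ensemble members voting for each class on one sample."""
    counts = Counter(hard_predictions)
    n = len(hard_predictions)
    return [counts[k] / n for k in range(num_classes)]


def disagreement_and_error(member_predictions, labels, num_classes):
    """Return (mean prediction disagreement, mean single-model test error).

    member_predictions[i] lists the class predicted by each member for sample i.
    """
    disagreements, errors = [], []
    for preds, y in zip(member_predictions, labels):
        p = mean_predictive(preds, num_classes)
        # Probability that two members drawn with replacement predict different classes.
        disagreements.append(1.0 - sum(q * q for q in p))
        # Probability that a randomly drawn member misclassifies the sample.
        errors.append(1.0 - p[y])
    n = len(labels)
    return sum(disagreements) / n, sum(errors) / n


# Hypothetical toy data: 4 samples, 4 ensemble members, 2 classes.
member_predictions = [[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]]
labels = [0, 1, 1, 1]
dis, err = disagreement_and_error(member_predictions, labels, num_classes=2)
print(f"disagreement={dis:.3f} error={err:.3f} gap={abs(dis - err):.3f}")
```

Computing the error term requires labels, whereas the disagreement term does not. GDE holds when the two averages coincide.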

For additional context, the recently accepted NeurIPS 2022 paper by Gruber and Büttner (2022)

From a Bayesian perspective, we usually look at epistemic uncertainty to tell us how reliable a model might be for new data. We look at these connections in our paper a bit as well and see that the proposed theory does not disentangle aleatoric and epistemic uncertainty. (Looking at the consequences is future work and an interesting research question.)

Epistemic uncertainty tells us when the model “knows” that there are multiple interpretations, and it is not sure which one is correct. In contrast, aleatoric uncertainty tells us when the model “knows” the data is noisy.

In other words, epistemic uncertainty is the uncertainty that we can reduce by training with more data while aleatoric uncertainty is the uncertainty we cannot reduce by training with more data (e.g., data noise).

We cannot trust the model’s predictions for samples with high epistemic uncertainty. If we want to be conservative, we need to assume the model’s prediction will be wrong under high epistemic uncertainty. Likewise, for high aleatoric uncertainty (data noise), the model’s (one-hot) prediction will also likely be wrong. Using predictive uncertainty (prediction disagreement) as the sum of the two can give us a lower bound on the “worst-case generalization error” but not necessarily more. Even when the model is confident, it could still be wrong.
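One common way to make "the sum of the two" concrete is the entropy decomposition used in Bayesian deep learning: the predictive entropy equals the expected per-member entropy (aleatoric) plus the mutual information between prediction and parameters (epistemic). The sketch below assumes an ensemble whose members stand in for samples from the parameter distribution; the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(member_probs):
    """Split predictive entropy into (total, aleatoric, epistemic) for one sample.

    member_probs: one probability vector per ensemble member / parameter sample.
    """
    n = len(member_probs)
    mean_probs = [sum(col) / n for col in zip(*member_probs)]
    total = entropy(mean_probs)                            # H[E_theta p(y|x, theta)]
    aleatoric = sum(entropy(p) for p in member_probs) / n  # E_theta H[p(y|x, theta)]
    epistemic = total - aleatoric                          # mutual information I[Y; theta | x]
    return total, aleatoric, epistemic


# Members agree that the sample is noisy: the uncertainty is aleatoric.
print(decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]]))
# Members are individually confident but contradict each other: mostly epistemic.
print(decompose_uncertainty([[0.99, 0.01], [0.01, 0.99]]))
```

Both toy inputs have nearly the same total predictive entropy (about log 2), which illustrates why predictive uncertainty alone does not tell the two sources apart.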

For more details, see our full paper, which is available on OpenReview and arXiv.

I would like to thank Yarin Gal, as well as the members of OATML in general for their continued feedback.

For more blog posts by OATML in Oxford, check out our group’s blog https://oatml.cs.ox.ac.uk/blog.html.

^{1} While the paper’s official title is “Assessing Generalization of SGD via Disagreement” on arXiv and OpenReview, the paper itself is aptly titled “Assessing Generalization via Disagreement” because the results do not depend on SGD itself.