Pre-training language models requires massive amounts of data, yet not all data contributes equally to model performance. As models and datasets continue to grow in size, identifying and selecting the most valuable training examples has become increasingly critical.
“CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-Training”, published at NeurIPS 2024, introduces a simple yet effective method for selecting high-quality data to improve language model pre-training, addressing this critical challenge in developing more efficient and task-targeted language models. The central question driving this work is how to identify, cheaply and at scale, the training data that matters most for a given set of downstream tasks.

The proposed method, CoLoR-Filter (Conditional Loss Reduction Filtering), is a data selection method inspired by empirical Bayes. By comparing the losses that two auxiliary models assign to each candidate, it identifies the data points most beneficial for a specific target distribution. This approach is motivated by two key assumptions:

- Training language models on all available data is extremely expensive, and much of the data may be low-quality or irrelevant to target tasks.
- A principled approach to data selection could dramatically improve both efficiency and performance for specific downstream tasks.
The rest of this blog post presents the key ideas, methodology, and results of this paper. For all the details, please refer to the paper.
CoLoR-Filter is a data selection method derived from a Bayesian perspective. The key insight is to select data points that show the largest reduction in loss (or increase in likelihood) when moving from a “prior” model to a “conditional” model that has been fine-tuned on target task data.
A crucial aspect of CoLoR-Filter is its computational efficiency. Unlike online methods that require continuous model updates, CoLoR-Filter only needs two fixed models to score all data points, which makes the selection process highly parallelizable and well suited to large-scale corpora.
The method works as follows:
1. Train a “prior” model \(\theta_{\text{prior}}\) on a dataset \(D_{\text{prior}}\) (which can be a subset of the large training corpus).
2. Train a “conditional” model \(\theta_{\text{prior}+\text{down}}\) by fine-tuning the prior model on a small dataset \(D_{\text{down}}\) from the downstream task(s) of interest.
3. For each data point \(x\) in the large corpus \(D_{\text{train}}\), compute the CoLoR score:
   \[\text{CoLoR}(x) = -\log \Pr(x \mid \theta_{\text{prior}+\text{down}}) - \left(-\log \Pr(x \mid \theta_{\text{prior}})\right)\]
4. Select the points with the smallest CoLoR scores (the largest loss reduction).
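As a minimal sketch of steps 3–4, the snippet below scores candidate sequences with two frozen causal language models loaded through the Hugging Face `transformers` API. The checkpoint paths and helper names are hypothetical, not the paper's actual code (which builds on OLMo).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoints standing in for the two auxiliary models.
PRIOR_CKPT = "path/to/prior-model"        # trained on D_prior
COND_CKPT = "path/to/conditional-model"   # prior fine-tuned on D_down

tokenizer = AutoTokenizer.from_pretrained(PRIOR_CKPT)
prior = AutoModelForCausalLM.from_pretrained(PRIOR_CKPT).eval()
conditional = AutoModelForCausalLM.from_pretrained(COND_CKPT).eval()

@torch.no_grad()
def neg_log_likelihood(model, text: str) -> float:
    """Summed negative log-likelihood of a sequence under a causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # .loss is the mean token-level cross-entropy; scale by the number of
    # predicted tokens to recover the summed NLL of the sequence.
    return model(input_ids=ids, labels=ids).loss.item() * (ids.shape[1] - 1)

def color_score(text: str) -> float:
    """CoLoR(x) = -log Pr(x | prior+down) + log Pr(x | prior); lower is better."""
    return neg_log_likelihood(conditional, text) - neg_log_likelihood(prior, text)

def select(candidates: list[str], n: int) -> list[str]:
    """Keep the n candidates with the smallest CoLoR scores."""
    return sorted(candidates, key=color_score)[:n]
```

Because both models are frozen, the scoring loop can be sharded across arbitrarily many workers, which is where the parallelism mentioned above comes from.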
Opinion: An alternative but equivalent perspective is to consider maximizing the negative of the CoLoR score:
\[-\text{CoLoR}(x) = \log \Pr(x \mid \theta_{\text{prior}+\text{down}}) - \log \Pr(x \mid \theta_{\text{prior}})\]
This is equivalent to the pointwise mutual information (PMI) between the sample \(x\) and the downstream data (conditioned on the prior model):
\[\mathrm{I}[x ; D_{\text{down}} \mid D_{\text{prior}}] = - \log \Pr(x \mid \theta_{\text{prior}}) - (-\log \Pr(x \mid \theta_{\text{prior}+\text{down}})) = -\text{CoLoR}(x)\]
From this view, we are selecting points that have high mutual information with the downstream task distribution, which provides an information-theoretic interpretation of why the method works: maximizing this quantity directly selects the examples that are most informative about the downstream tasks.

Intuitively, CoLoR-Filter selects points that the conditional model finds more likely than the prior model, indicating that these points are more relevant to the downstream tasks. The approach is simple yet theoretically motivated, derived from Bayes’ rule and empirical Bayes principles.
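As a quick check of this Bayesian reading (a sketch using idealized posterior predictive distributions in place of the two trained models), Bayes’ rule gives the usual symmetry of pointwise mutual information:

\[-\text{CoLoR}(x) \approx \log \frac{\Pr(x \mid D_{\text{down}}, D_{\text{prior}})}{\Pr(x \mid D_{\text{prior}})} = \log \frac{\Pr(D_{\text{down}} \mid x, D_{\text{prior}})}{\Pr(D_{\text{down}} \mid D_{\text{prior}})}\]

So, to the extent that the two trained models approximate these predictive distributions, selecting points with the lowest CoLoR scores is the same as selecting the points under which the downstream data \(D_{\text{down}}\) becomes most likely.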
Concretely, we use a hyperparameter \(\tau\) to control how aggressively we subsample data: to select \(n\) points, we first sample \(\tau n\) points randomly from the training dataset to form a candidate pool. We then compute the CoLoR score for each point in this pool and select the \(n\) points with the lowest scores. This allows us to be more selective with higher values of \(\tau\), as we consider more candidates for each selected point, but comes with increased computational cost for scoring the larger candidate pool.
A comparison between global and batch-wise selection strategies reveals an interesting property of CoLoR-Filter. When we select the best points within each random batch of \(\tau n\) candidates rather than globally across all candidates, we achieve essentially the same performance with batch-wise selection outperforming global selection by a small margin. The batch-wise approach might naturally introduce more diversity into the selected dataset compared to greedily selecting the globally best points, since each batch represents a random subset of the data. This finding aligns with results from active learning literature showing that stochastic batch acquisition methods often perform as well as or better than greedy selection approaches.
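The two selection strategies can be sketched roughly as follows. This is an illustrative Python outline (function and variable names are made up here), with `score_fn` standing in for the CoLoR scoring function from the earlier snippet.

```python
import heapq
import random

def global_select(dataset, n, tau, score_fn):
    """Global top-n: score a pool of tau*n random candidates, keep the best n."""
    pool = random.sample(dataset, tau * n)
    return heapq.nsmallest(n, pool, key=score_fn)

def batchwise_select(dataset, n, tau, score_fn, batch_size=2048):
    """Batch-wise top-k: keep the best 1/tau fraction of every random batch.

    Scores the same tau*n candidates overall, but the per-batch cutoff
    tends to preserve more diversity than a single global threshold.
    """
    pool = random.sample(dataset, tau * n)
    keep_per_batch = batch_size // tau
    selected = []
    for start in range(0, len(pool), batch_size):
        batch = pool[start:start + batch_size]
        selected += heapq.nsmallest(keep_per_batch, batch, key=score_fn)
    return selected[:n]
```

The batch-wise variant applies the same \(1/\tau\) cutoff within each batch, so the total scoring cost is unchanged; only the granularity of the top-k selection differs.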
The paper evaluates CoLoR-Filter on two main target tasks: language modeling on Books, and a suite of downstream multiple-choice question-answering tasks.
The results are impressive. Data selected by CoLoR-Filter significantly outperforms randomly selected data, achieving the same performance with far less data:
| Task | Random data needed | CoLoR-Filter data needed | Reduction factor |
|---|---|---|---|
| Books | 25B tokens | 1B tokens | 25× |
| Downstream Tasks | 25B tokens | 2.3B tokens | 11× |
Beyond these headline numbers, one of the most significant findings is CoLoR-Filter’s computational efficiency. The paper provides a detailed analysis of the computational costs involved.
For example, when matching the performance of a model trained on 25 billion randomly selected tokens with only 1.5 billion filtered tokens (\(\tau=16\)), the total computational cost is reduced by more than 5×.
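For intuition, here is a back-of-the-envelope version of that accounting. The model sizes (roughly 150M-parameter auxiliary models and a 1.2B-parameter target) and the standard 6ND training / 2ND inference FLOP rules of thumb are assumptions for illustration, and auxiliary-model training is left out, so this only shows why the savings land in this ballpark rather than reproducing the paper's exact numbers.

```python
# Back-of-the-envelope FLOP accounting (all sizes and the 6ND / 2ND
# approximations are assumptions; auxiliary-model training is omitted).
AUX_PARAMS = 150e6        # each of the two small auxiliary models (assumed)
TARGET_PARAMS = 1.2e9     # target model (assumed)
TAU = 16
SELECTED_TOKENS = 1.5e9   # filtered tokens used to train the target
RANDOM_TOKENS = 25e9      # random tokens needed for the same performance

def train_flops(params, tokens):
    return 6 * params * tokens    # ~6ND FLOPs per training token

def score_flops(params, tokens):
    return 2 * params * tokens    # ~2ND FLOPs per forward-only token

baseline = train_flops(TARGET_PARAMS, RANDOM_TOKENS)
color = (
    train_flops(TARGET_PARAMS, SELECTED_TOKENS)            # train target on selected data
    + 2 * score_flops(AUX_PARAMS, TAU * SELECTED_TOKENS)   # score tau*n candidates with both models
)
print(f"speedup ~ {baseline / color:.1f}x")  # ~7x here; counting auxiliary training moves it toward 5x
```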
The paper also positions CoLoR-Filter relative to other data selection methods, where it distinguishes itself through its simplicity, theoretical grounding, and strong empirical performance across different tasks and model scales.
CoLoR-Filter demonstrates that a theoretically motivated yet simple approach to data selection can dramatically improve the efficiency of language model pre-training. The method is particularly appealing due to its computational efficiency and favorable scaling properties.
The paper opens several promising directions for future research, including: extending CoLoR-Filter to fine-tuning, continual pre-training, and general domain pre-training; applying the method to other domains like code generation or other modalities; and further improving the algorithm’s efficiency and testing its limits of scale generalization.
HZ is supported by an Eric and Susan Dunn Graduate Fellowship. SK acknowledges support from the Office of Naval Research under award N00014-22-1-2377 and the National Science Foundation Grant under award #IIS 2229881. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence.
The code for CoLoR-Filter is available at:
https://github.com/davidbrandfonbrener/color-filter-olmo.
The filtered data can be accessed at:
https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4.