Pre-training language models requires massive amounts of data, yet not all data contributes equally to model performance. As models and datasets continue to grow in size, identifying and selecting the most valuable training examples has become increasingly critical.
“CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-Training”, published at NeurIPS 2024, introduces a simple yet effective method for selecting high-quality data to improve language model pre-training, addressing this critical challenge in developing more efficient and task-targeted language models. The central question driving this work is how to identify, cheaply and at scale, the training data that matters most for a given set of downstream tasks.

The proposed method, CoLoR-Filter (Conditional Loss Reduction Filtering), is a data selection method inspired by empirical Bayes. By comparing the losses that two auxiliary models assign to each candidate, it identifies the data points most beneficial for a specific target distribution. This approach is motivated by two key assumptions:

- Training language models on all available data is extremely expensive, and much of the data may be low-quality or irrelevant to target tasks.
- A principled approach to data selection could dramatically improve both efficiency and performance for specific downstream tasks.
The rest of this blog post presents the key ideas, methodology, and results of this paper. For all the details, please refer to the paper.
CoLoR-Filter is a data selection method derived from a Bayesian perspective. The key insight is to select data points that show the largest reduction in loss (or increase in likelihood) when moving from a “prior” model to a “conditional” model that has been fine-tuned on target task data.
A crucial aspect of CoLoR-Filter is its computational efficiency. Unlike online methods that require continuous model updates, CoLoR-Filter only needs two fixed models to score all data points, which makes the selection process highly parallelizable and well suited to large-scale corpora.
The method works as follows:
1. Train a “prior” model \(\theta_{\text{prior}}\) on a dataset \(D_{\text{prior}}\) (which can be a subset of the large training corpus).
2. Train a “conditional” model \(\theta_{\text{prior}+\text{down}}\) by fine-tuning the prior model on a small dataset \(D_{\text{down}}\) from the downstream task(s) of interest.
3. For each data point \(x\) in the large corpus \(D_{\text{train}}\), compute the CoLoR score:
   \[\text{CoLoR}(x) = -\log \Pr(x \mid \theta_{\text{prior}+\text{down}}) - \left(-\log \Pr(x \mid \theta_{\text{prior}})\right)\]
4. Select the points with the smallest CoLoR scores (the largest loss reduction).
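As a minimal sketch of steps 3–4, the snippet below scores candidate sequences with two frozen causal language models loaded through the Hugging Face `transformers` API. The checkpoint paths and helper names are hypothetical, not the paper's actual code (which builds on OLMo).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoints standing in for the two auxiliary models.
PRIOR_CKPT = "path/to/prior-model"        # trained on D_prior
COND_CKPT = "path/to/conditional-model"   # prior fine-tuned on D_down

tokenizer = AutoTokenizer.from_pretrained(PRIOR_CKPT)
prior = AutoModelForCausalLM.from_pretrained(PRIOR_CKPT).eval()
conditional = AutoModelForCausalLM.from_pretrained(COND_CKPT).eval()

@torch.no_grad()
def neg_log_likelihood(model, text: str) -> float:
    """Summed negative log-likelihood of a sequence under a causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # .loss is the mean token-level cross-entropy; scale by the number of
    # predicted tokens to recover the summed NLL of the sequence.
    return model(input_ids=ids, labels=ids).loss.item() * (ids.shape[1] - 1)

def color_score(text: str) -> float:
    """CoLoR(x) = -log Pr(x | prior+down) + log Pr(x | prior); lower is better."""
    return neg_log_likelihood(conditional, text) - neg_log_likelihood(prior, text)

def select(candidates: list[str], n: int) -> list[str]:
    """Keep the n candidates with the smallest CoLoR scores."""
    return sorted(candidates, key=color_score)[:n]
```

Because both models are frozen, the scoring loop can be sharded across arbitrarily many workers, which is where the parallelism mentioned above comes from.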
Opinion: An alternative but equivalent perspective is to consider maximizing the negative of the CoLoR score:
\[-\text{CoLoR}(x) = \log \Pr(x \mid \theta_{\text{prior}+\text{down}}) - \log \Pr(x \mid \theta_{\text{prior}})\]
This is equivalent to the pointwise mutual information (PMI) between the sample \(x\) and the downstream data (conditioned on the prior model):
\[\mathrm{I}[x ; D_{\text{down}} \mid D_{\text{prior}}] = - \log \Pr(x \mid \theta_{\text{prior}}) - (-\log \Pr(x \mid \theta_{\text{prior}+\text{down}})) = -\text{CoLoR}(x)\]
From this view, we are selecting points that have high mutual information with the downstream task distribution, which provides an information-theoretic interpretation of why the method works: maximizing this quantity directly selects the examples that are most informative about the downstream tasks.

Intuitively, CoLoR-Filter selects points that the conditional model finds more likely than the prior model, indicating that these points are more relevant to the downstream tasks. The approach is simple yet theoretically motivated, derived from Bayes’ rule and empirical Bayes principles.
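As a quick check of this Bayesian reading (a sketch using idealized posterior predictive distributions in place of the two trained models), Bayes’ rule gives the usual symmetry of pointwise mutual information:

\[-\text{CoLoR}(x) \approx \log \frac{\Pr(x \mid D_{\text{down}}, D_{\text{prior}})}{\Pr(x \mid D_{\text{prior}})} = \log \frac{\Pr(D_{\text{down}} \mid x, D_{\text{prior}})}{\Pr(D_{\text{down}} \mid D_{\text{prior}})}\]

So, to the extent that the two trained models approximate these predictive distributions, selecting points with the lowest CoLoR scores is the same as selecting the points under which the downstream data \(D_{\text{down}}\) becomes most likely.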
Concretely, we use a hyperparameter \(\tau\) to control how aggressively we subsample data: to select \(n\) points, we first sample \(\tau n\) points randomly from the training dataset to form a candidate pool. We then compute the CoLoR score for each point in this pool and select the \(n\) points with the lowest scores. This allows us to be more selective with higher values of \(\tau\), as we consider more candidates for each selected point, but comes with increased computational cost for scoring the larger candidate pool.
A comparison between global and batch-wise selection strategies reveals an interesting property of CoLoR-Filter. When we select the best points within each random batch of \(\tau n\) candidates rather than globally across all candidates, we achieve essentially the same performance with batch-wise selection outperforming global selection by a small margin. The batch-wise approach might naturally introduce more diversity into the selected dataset compared to greedily selecting the globally best points, since each batch represents a random subset of the data. This finding aligns with results from active learning literature showing that stochastic batch acquisition methods often perform as well as or better than greedy selection approaches.
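The two selection strategies can be sketched roughly as follows. This is an illustrative Python outline (function and variable names are made up here), with `score_fn` standing in for the CoLoR scoring function from the earlier snippet.

```python
import heapq
import random

def global_select(dataset, n, tau, score_fn):
    """Global top-n: score a pool of tau*n random candidates, keep the best n."""
    pool = random.sample(dataset, tau * n)
    return heapq.nsmallest(n, pool, key=score_fn)

def batchwise_select(dataset, n, tau, score_fn, batch_size=2048):
    """Batch-wise top-k: keep the best 1/tau fraction of every random batch.

    Scores the same tau*n candidates overall, but the per-batch cutoff
    tends to preserve more diversity than a single global threshold.
    """
    pool = random.sample(dataset, tau * n)
    keep_per_batch = batch_size // tau
    selected = []
    for start in range(0, len(pool), batch_size):
        batch = pool[start:start + batch_size]
        selected += heapq.nsmallest(keep_per_batch, batch, key=score_fn)
    return selected[:n]
```

The batch-wise variant applies the same \(1/\tau\) cutoff within each batch, so the total scoring cost is unchanged; only the granularity of the top-k selection differs.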
The paper evaluates CoLoR-Filter on two main target tasks: language modeling on Books, and a suite of downstream multiple-choice question-answering tasks.
The results are impressive. Data selected by CoLoR-Filter significantly outperforms randomly selected data, achieving the same performance with far less data:
| Task | Random data needed | CoLoR-Filter data needed | Reduction factor |
|---|---|---|---|
| Books | 25B tokens | 1B tokens | 25× |
| Downstream Tasks | 25B tokens | 2.3B tokens | 11× |
Beyond these headline numbers, one of the most significant findings is CoLoR-Filter’s computational efficiency. The paper provides a detailed analysis of the computational costs involved.
For example, when matching the performance of a model trained on 25 billion randomly selected tokens with only 1.5 billion filtered tokens (\(\tau=16\)), the total computational cost is reduced by more than 5×.
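For intuition, here is a back-of-the-envelope version of that accounting. The model sizes (roughly 150M-parameter auxiliary models and a 1.2B-parameter target) and the standard 6ND training / 2ND inference FLOP rules of thumb are assumptions for illustration, and auxiliary-model training is left out, so this only shows why the savings land in this ballpark rather than reproducing the paper's exact numbers.

```python
# Back-of-the-envelope FLOP accounting (all sizes and the 6ND / 2ND
# approximations are assumptions; auxiliary-model training is omitted).
AUX_PARAMS = 150e6        # each of the two small auxiliary models (assumed)
TARGET_PARAMS = 1.2e9     # target model (assumed)
TAU = 16
SELECTED_TOKENS = 1.5e9   # filtered tokens used to train the target
RANDOM_TOKENS = 25e9      # random tokens needed for the same performance

def train_flops(params, tokens):
    return 6 * params * tokens    # ~6ND FLOPs per training token

def score_flops(params, tokens):
    return 2 * params * tokens    # ~2ND FLOPs per forward-only token

baseline = train_flops(TARGET_PARAMS, RANDOM_TOKENS)
color = (
    train_flops(TARGET_PARAMS, SELECTED_TOKENS)            # train target on selected data
    + 2 * score_flops(AUX_PARAMS, TAU * SELECTED_TOKENS)   # score tau*n candidates with both models
)
print(f"speedup ~ {baseline / color:.1f}x")  # ~7x here; counting auxiliary training moves it toward 5x
```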
The paper also positions CoLoR-Filter relative to other data selection methods, where it distinguishes itself through its simplicity, theoretical grounding, and strong empirical performance across different tasks and model scales.
CoLoR-Filter demonstrates that a theoretically motivated yet simple approach to data selection can dramatically improve the efficiency of language model pre-training. The method is particularly appealing due to its computational efficiency and favorable scaling properties.
The paper opens several promising directions for future research, including: extending CoLoR-Filter to fine-tuning, continual pre-training, and general domain pre-training; applying the method to other domains like code generation or other modalities; and further improving the algorithm’s efficiency and testing its limits of scale generalization.
HZ is supported by an Eric and Susan Dunn Graduate Fellowship. SK acknowledges support from the Office of Naval Research under award N00014-22-1-2377 and the National Science Foundation Grant under award #IIS 2229881. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence.
The code for CoLoR-Filter is available at:
https://github.com/davidbrandfonbrener/color-filter-olmo.
The filtered data can be accessed at:
https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4.