
Adaptive Data Optimization

This repository contains the code for the paper "Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws".

The main algorithmic contribution of the paper is contained in src/data_selectors/ado.py. The rest is a standard language model training codebase and can be replaced by your favorite training framework, as long as it supports sampling from different domains with different weights and has access to the per-domain loss.
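To make that interface concrete, here is a minimal sketch of what such a selector could look like. The class and method names are hypothetical and do not mirror the actual API in src/data_selectors/ado.py, and the weighting policy below is a placeholder for illustration; ADO itself derives weights from per-domain scaling-law fits.

import numpy as np

class DomainWeightSelector:
    """Hypothetical selector: consumes per-domain losses, produces sampling weights."""

    def __init__(self, num_domains: int, smoothing: float = 0.9):
        self.smoothing = smoothing
        self.ema_loss = np.zeros(num_domains)

    def update(self, per_domain_loss: np.ndarray) -> None:
        # Track a smoothed estimate of each domain's training loss.
        self.ema_loss = self.smoothing * self.ema_loss + (1.0 - self.smoothing) * per_domain_loss

    def sampling_weights(self) -> np.ndarray:
        # Placeholder policy: sample domains in proportion to their smoothed loss.
        # The real algorithm in src/data_selectors/ado.py instead fits per-domain
        # scaling laws and derives the sampling weights from them.
        w = np.maximum(self.ema_loss, 1e-8)
        return w / w.sum()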

Launching experiments

This codebase is designed to run on GCP TPUs, with data stored in GCS buckets.

Processing data (The Pile)

Note: we recommend you run the commands in this section on a GCP VM with at least 2TB of disk space.

By default, The Pile comes as .jsonl.zst files which contain examples from all domains mixed together. We use procdata/process_dataset.py to separate examples out by domain:

python procdata/process_dataset.py --path /path/to/the/pile --output_path /outputs/go/here

This will produce a folder with 22 subdirectories, one per domain in The Pile. Each subdirectory should contain .jsonl.zst files with examples for the corresponding domain.
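For reference, the core of that grouping step looks roughly like the following. This is a simplified sketch, not the actual procdata/process_dataset.py: it assumes the standard Pile record format with a meta.pile_set_name field, and it writes uncompressed .jsonl per domain, whereas the real outputs keep the .jsonl.zst compression.

import io
import json
import os
import zstandard as zstd

def split_by_domain(pile_file: str, output_path: str) -> None:
    """Route each record in one .jsonl.zst shard into a per-domain file."""
    handles = {}
    with open(pile_file, "rb") as fh:
        reader = io.TextIOWrapper(zstd.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
        for line in reader:
            record = json.loads(line)
            domain = record["meta"]["pile_set_name"]
            if domain not in handles:
                os.makedirs(os.path.join(output_path, domain), exist_ok=True)
                handles[domain] = open(os.path.join(output_path, domain, "data.jsonl"), "a")
            handles[domain].write(line)
    for f in handles.values():
        f.close()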

Once that is done, we materialize the data as a TensorFlow dataset in a GCS bucket. In procdata/build_tfds_configs.sh, set local_base_path to point to the outputs from your previous command and remote_base_path to point to the desired location in your bucket, e.g., gs://your-bucket/tfds/the_pile_grouped. Run:

./procdata/build_tfds_configs.sh

This should materialize the TF dataset at gs://your-bucket/tfds/the_pile_grouped.
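To sanity-check the result, you can load one of the materialized builders directly from the bucket. The <domain_config>/1.0.0 segment below is an assumption about the directory layout; adjust it to whatever build_tfds_configs.sh actually wrote.

import tensorflow_datasets as tfds

# Point at one materialized builder directory inside the bucket
# (the <domain_config>/1.0.0 segment is an assumption about the layout).
builder = tfds.builder_from_directory("gs://your-bucket/tfds/the_pile_grouped/<domain_config>/1.0.0")
ds = builder.as_dataset(split="train")
for example in ds.take(1):
    print(example)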

Running experiments on TPUs

We assume you already have a TPU named node1 running in the same region as your GCS bucket; see below for instructions on launching TPU VMs.

First, modify scripts/tpu_commands.sh to update the ado_project entry with your GCP project information: overwrite the existing definitions of tpu_project, tpu_zone, and tpu_gen. Next, specify the path to the dataset in configs/experiments/*.yaml by setting data_dir=gs://your-bucket/tfds/the_pile_grouped.

Then, from your local machine, you can launch a run with:

echo '[your wandb key]' >> .wandbkey  # store your API key, which you can find at wandb.ai/authorize
source scripts/tpu_commands.sh
tpu ado_project reset_key node1  # may be necessary if you get weird SSH key issues
tpu ado_project setup node1  # copy code over, install deps, etc.

tpu ado_project cl "python ado/launch.py [OPTIONS]"  # copy and launch

Runs are configured with Hydra, for example:

python ado/launch.py +experiment=124_ado debug=true

See the available experiment configurations in configs/experiment. Set debug=false to enable checkpointing and a larger sample size during evaluation.

Creating/deleting TPUs

To create a new node called node1:

# Make sure you're in the right project
gcloud projects list
gcloud config set project [PROJECT]

ZONE=europe-west4-a  # change if your TPUs are in a different zone.
# Create
gcloud compute tpus tpu-vm create node1 \
--zone=$ZONE \
--accelerator-type=v3-128 \
--version=tpu-ubuntu2204-base \
--preemptible  # only if you want it to be preemptible

To delete the instance:

# Delete
gcloud compute tpus tpu-vm delete node1 --zone=$ZONE

Adding your own dataset

  1. Implement a new dataloader. See src/dataloader/pile_loader.py for an example of a new dataloader; a rough sketch follows this list.
  2. Add a new task configuration to configs/loader/tasks_sampler_cfg (see configs/loader/tasks_sampler_cfg/pile_tasks.yaml for an example).
  3. Add a loader configuration to configs/loader (see configs/loader/pile_natural.yaml for an example).
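As a rough sketch of what step 1 involves (hypothetical names; the real interface to copy is src/dataloader/pile_loader.py), a loader only needs to expose one stream per domain so the sampler can mix them according to the current weights:

import numpy as np
import tensorflow_datasets as tfds

class MyDatasetLoader:
    """Hypothetical loader: one iterator per domain, mixed by the sampler's weights."""

    def __init__(self, data_dir: str, domains: list, split: str = "train"):
        self.domains = domains
        # Assumes each domain was materialized as its own TFDS config named
        # "my_dataset/<domain>"; both names are placeholders.
        self.iterators = {
            d: iter(tfds.load(f"my_dataset/{d}", data_dir=data_dir, split=split))
            for d in domains
        }

    def sample_batch(self, weights: np.ndarray, batch_size: int):
        # Draw a domain for each slot according to the current sampling weights,
        # then pull the next example from that domain's stream.
        picks = np.random.choice(len(self.domains), size=batch_size, p=weights)
        return [next(self.iterators[self.domains[i]]) for i in picks]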
