This is the shared codebase for the Stanford BabyLM team. It is basically a dressed-up version of Huggingface's standard run_clm.py
script, which does causal (decoder-only) language modeling on a set of training files and evaluates perplexity on a set of validation files.
It is "dressed up" in the sense that
- we have switched to Hydra for experiment configuration (see below),
- we have refactored some of the code into separate files for cleanliness,
- we have slightly improved Weights and Biases logging.
Please follow these instructions for setup so we can have a consistent development environment and consistent package versions.
I recommend creating a new conda environment. For consistency let's all use python3.8.
conda create --name babylm python=3.8
conda activate babylm
Now, first install a version of PyTorch that is compatible with your system from pytorch.org. Try to use at least 1.12.0.
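For example (the exact command depends on your OS and CUDA version, so check the selector on pytorch.org for your machine), a plain default install might look like:

pip install "torch>=1.12.0"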
Then install extra requirements from requirements.txt. Torch is not included since installation differs depending on the machine.
pip install -r requirements.txt
Next, make a hidden .cache dir in the root directory.
mkdir .cache
Make sure this is somewhere that has a lot of space. If needed, symlink the directory.
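For example, if the repo lives on a small partition, you could put the cache on a bigger disk and symlink it instead of the plain mkdir above (the /path/to/big/disk path here is just a placeholder):

mkdir -p /path/to/big/disk/babylm_cache
ln -s /path/to/big/disk/babylm_cache .cache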
Run ./download_data.sh in home dir to download BabyLM data to ./babylm_data.
Follow instructions here: https://docs.wandb.ai/quickstart
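In practice, once the wandb package is installed, authenticating usually just means running the following and pasting in your API key when prompted:

wandb login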
We're going to use black and isort, python formatters, to automatically format code whenever we push to maintain consistent style, as well as flake8, a Python linter. pre-commit should have been installed from requirements.txt. Now do:
pre-commit install
Whenever you make a commit, if your code is not formatted properly or has issues, pre-commit should automatically format your code. Then you need to re-add the modified files and repeat the commit.
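An illustrative flow (the file name here is a placeholder, and the exact hook output will vary):

git add my_module.py
git commit -m "Add new feature"   # hooks run; black/isort may reformat files and abort the commit
git add my_module.py              # stage the reformatted version
git commit -m "Add new feature"   # commit again; hooks now pass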
If flake8 returns errors, please review them and determine whether they're
significant. Note that if flake8 is complaining and you truly don't think it's
an error, feel free to ask flake8 to ignore the errors by using the # noqa
comment (https://flake8.pycqa.org/en/3.1.1/user/ignoring-errors.html).
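For example, to silence a specific check on a single line (E401, multiple imports on one line, is just an example error code):

import os, sys  # noqa: E401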
If you use VSCode for development, I also highly recommend using the black and isort plugins and auto-formatting on save. You can read a tutorial here: https://cereblanco.medium.com/setup-black-and-isort-in-vscode-514804590bf9
Example command:
python -m src.run_clm +model=gpt2-small +dataset=babylm_10M
This runs GPT2-Small architecture on all of the 10M training files, with eval on all of the dev files. It will log a run to wandb.ai/<YOUR_USERNAME>/stanford-babylm/groups/debug-gpt2-small-babylm_10M.
If you use VSCode, there is a launch config in .vscode/launch.json which runs this in the VSCode debugger.
This codebase uses Hydra to manage and configure experiments. Hydra is a little more complicated than argparse, which you may be familiar with, but offers much more flexibility for running and managing complex configurations.
The key motivation for Hydra is that, with argparse, we rarely ever need to change just a single flag in isolation. For example, imagine your command was
python -m src.run_clm --model=gpt2
And now you want to use GPT2-Medium instead. But GPT2-Medium is bigger, so you might need to change the batch size to fit it on your GPU. Maybe you want to add gradient accumulation steps; maybe you want to save less often since the amount of data per step has changed. So now your command looks like this:
python -m src.run_clm --model=gpt2-medium \
--per_device_train_batch_size=8 \
--per_device_eval_batch_size=8 \
--eval_steps=200 \
--gradient_accumulation_steps=2
This gets super annoying to type out and hard to remember. The idea of Hydra is that you not only have the option of changing individual arguments via the CLI as above; you can also save groups of changes into a YAML configuration file and apply them all at once. So, for example, you might make a file, src/conf/model/gpt2-medium.yaml, which looks like:
# @package _global_
model: gpt2-medium
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
eval_steps: 200
gradient_accumulation_steps: 2
And now you can run your GPT2-Medium experiment by just writing
python -m src.run_clm +model=gpt2-medium
You'll see that the example run above uses two config groups: a dataset group and a model group. You can access the corresponding YAML files in src/conf/dataset and src/conf/model, respectively. There is a default config as well: src/conf/config.yaml. The values in the default config get read first; then the additional configs (applied via +model=) override the default values. If you change the default config, make sure it is a change that should sensibly apply to all further experiment runs (e.g. a default setting in case we add some new architecture). Otherwise, add new changes in a separate config group.
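For reference, a config-group file for a dataset looks structurally similar to the model example above. The following is only a hypothetical sketch; the real src/conf/dataset/babylm_10M.yaml in this repo may use different key names and paths:

# @package _global_
# Hypothetical sketch -- check the actual file for the real keys.
train_file: babylm_data/babylm_10M
validation_file: babylm_data/babylm_dev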
Another benefit is that, unlike argparse, Hydra supports a rich hierarchical/nested config structure. So you can have a set of training arguments in your YAML file like
model:
  model_name_or_path: gpt2-small
training:
  per_device_train_batch_size: 4
  per_device_eval_batch_size: 4
and these are kept in separate namespaces, and you can use args.model.model_name_or_path or args.training.per_device_train_batch_size to refer to the relevant args in your script.
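As a minimal sketch of what that looks like in code (assuming a standard @hydra.main entry point; the actual wiring in src/run_clm.py may differ):

import hydra
from omegaconf import DictConfig


@hydra.main(config_path="conf", config_name="config")
def main(args: DictConfig) -> None:
    # Nested keys are accessed with attribute-style lookups.
    print(args.model.model_name_or_path)
    print(args.training.per_device_train_batch_size)


if __name__ == "__main__":
    main()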
Feel free to add more YAML files or even create new config groups. This will come in handy as we do a lot of experimentation!
Lastly, Hydra also supports running sweeps over arguments automatically, via command line. I'll talk about this later if we get to it.
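For a quick taste, Hydra's built-in multirun mode sweeps over comma-separated values; for example, a sweep over three seeds might look like:

python -m src.run_clm +model=gpt2-small +dataset=babylm_10M training.seed=0,1,2 --multirun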
We use Weights and Biases for (super convenient!) experiment logging and management. Follow the setup instructions above to get WandB set up.
Our Weights and Biases config is slightly different from the Huggingface default. Take a look at the WandBArguments class in arguments.py and, correspondingly, the wandb: set of arguments in src/conf/config.yaml, which looks like:
wandb:
  # Set this to false here or via the command line (e.g. `wandb.log=false`) to disable wandb logging.
  log: true
  # The name of the wandb project under your account.
  project: stanford-babylm
  # The name of the group that a run will fall under. This intimidating syntax
  # just interpolates values from other config settings set during your run. For
  # example, ${hydra:runtime.choices.model} is the +model= option you specified;
  # likewise for dataset. `wandb.tag` is a key that is not set here but by
  # default is set to "debug". You might tag experiments with the current date,
  # or a name, to remind yourself of which experiments are which. This lets you
  # change the name of the group while still keeping the auto-generated group
  # name parts (model/dataset).
  group: ${wandb.tag}-${hydra:runtime.choices.model}-${hydra:runtime.choices.dataset}
  # A group contains multiple runs; this is the name of the individual run, which is the group plus a seed.
  name: ${wandb.group}-run-${training.seed}
The comments are self-explanatory, but basically, if you want to tag an experiment so that it shows up as a separate group, you might do something like
python -m src.run_clm +model=gpt2-small +dataset=babylm_10M wandb.tag=test-new-architecture
which generates a run at wandb.ai/<YOUR_USERNAME>/stanford-babylm/groups/test-new-architecture-gpt2-small-babylm_10M.
Note the lack of + when setting individual keys, versus the + needed when applying an entire YAML file to override configs.
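For example, combining both styles in one command (training.seed is the key interpolated into the run name above):

python -m src.run_clm +model=gpt2-small +dataset=babylm_10M training.seed=42 wandb.tag=seed-sweep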
To add new arguments, you will want to (1) add the corresponding argument, with its type, to the corresponding dataclass object (e.g. DataTrainingArguments or TrainingArguments) in arguments.py; then (2) you can specify it via command line or in a config file. This uses Hydra's Structured Configs settings to enable static typechecking.
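As a rough sketch of step (1), a new typed field might look like the following. The field name and help text here are made up for illustration; the actual DataTrainingArguments in arguments.py has its own fields:

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataTrainingArguments:
    # ... existing fields ...
    # Hypothetical new argument, for illustration only.
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={"help": "If set, truncate the number of training examples for quick debugging."},
    )

You could then set it from a config YAML or the command line, under whatever namespace arguments.py registers the dataclass in.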