Update checkpointing to use fsspec #39

Merged
merged 1 commit into from
Feb 6, 2025

EntilZha
Contributor

@EntilZha EntilZha commented Feb 4, 2025

Summary:

  • Make the data/checkpoint code fsspec compatible
  • Saving to s3 still does not work, because torch.distributed.checkpoint.save does not work with fsspec out of the box. This will be implemented in a follow-up PR
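
For readers unfamiliar with the pattern, here is a minimal sketch of what "fsspec compatible" checkpoint I/O can look like. It is not the code from this PR: the `save_state` and `load_state` helpers, the `train_state.json` file name, and the zero-padded step directory are all hypothetical, chosen only for illustration. The one library call relied on is `fsspec.core.url_to_fs`, which resolves a path or URL to a filesystem object plus a path on that filesystem.

```python
import json

from fsspec.core import url_to_fs


def save_state(dump_dir: str, step: int, state: dict) -> str:
    """Write `state` as JSON under dump_dir (hypothetical helper).

    dump_dir may be a local path or any URL fsspec understands,
    for example memory://... or, with s3fs installed, s3://...
    """
    # Resolve the URL once; all later calls go through `fs`.
    fs, root = url_to_fs(dump_dir)
    step_dir = f"{root.rstrip('/')}/{step:010d}"
    fs.makedirs(step_dir, exist_ok=True)
    path = f"{step_dir}/train_state.json"
    with fs.open(path, "w") as f:
        json.dump(state, f)
    return path


def load_state(dump_dir: str, step: int) -> dict:
    """Read back what save_state wrote (hypothetical helper)."""
    fs, root = url_to_fs(dump_dir)
    path = f"{root.rstrip('/')}/{step:010d}/train_state.json"
    with fs.open(path, "r") as f:
        return json.load(f)


if __name__ == "__main__":
    # The in-memory filesystem stands in for a remote store here.
    written = save_state("memory://blt-demo", 100, {"step": 100})
    print(written)
    print(load_state("memory://blt-demo", 100))
```

The sketch only covers small metadata files written through `fs.open`. The sharded model and optimizer state written by torch.distributed.checkpoint.save is the part the summary says still needs follow-up work.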

Test Plan:

Run unit tests and the commands below

python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100

The commands below currently fail because of the torch distributed save, but they should be tested at a later date

python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
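
The commands above pass settings as dotted `key=value` overrides (`eval=null`, `checkpoint.dump.every=100`, `dump_dir=...`). As a rough illustration of how such overrides are typically merged into a nested config, here is a standard-library sketch. It assumes OmegaConf-style semantics and is not the bytelatent argument parser; `apply_overrides` and `_parse_value` are hypothetical names.

```python
def _parse_value(raw: str):
    """Convert an override value: 'null' becomes None, digits become int."""
    if raw == "null":
        return None
    try:
        return int(raw)
    except ValueError:
        return raw  # leave everything else (paths, URLs) as a string


def apply_overrides(config: dict, overrides: list) -> dict:
    """Apply dotted key=value overrides to a nested dict in place."""
    for item in overrides:
        # Split on the first '=' only, so values may contain '='.
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(part, {})
        node[leaf] = _parse_value(raw)
    return config


if __name__ == "__main__":
    base = {"eval": {"enabled": True}, "checkpoint": {"dump": {"every": 1000}}}
    print(
        apply_overrides(
            base,
            [
                "eval=null",
                "checkpoint.dump.every=100",
                "dump_dir=s3://blt/scratch/checkpoint-test/",
            ],
        )
    )
```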

@facebook-github-bot added the "CLA Signed" label Feb 4, 2025
@EntilZha force-pushed the pr39 branch 2 times, most recently from ab399e9 to bc39591 February 4, 2025 18:19
@EntilZha changed the title from "Several changes to enable entropy model training/eval" to "Update checkpointing to use fsspec" Feb 5, 2025
@EntilZha force-pushed the pr39 branch 5 times, most recently from b6e53f1 to 1450464 February 5, 2025 22:10
@EntilZha changed the title from "Update checkpointing to use fsspec" to "Add bpb and n_bytes to metric logging" Feb 5, 2025
@EntilZha changed the title from "Add bpb and n_bytes to metric logging" to "Update checkpointing to use fsspec" Feb 5, 2025
Contributor

@sriniiyer sriniiyer left a comment

lgtm

@EntilZha EntilZha merged commit afedb16 into main Feb 6, 2025
3 of 5 checks passed