Add bpb and n_bytes to metric logging #41
Conversation
bytelatent/metrics.py
Outdated
if self.fs is None:
    self.jsonl_writer = open(self.outdir, "a")
else:
    self.jsonl_writer = self.fs.open(self.outdir, "a")
What functionality does this add?
Eventually (once we fix torch.distributed.checkpointing.save to work with fsspec), all of dump_dir should be compatible with writing to NFS or S3/blob store.
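For context, a minimal sketch of the pattern being discussed, assuming the logger is handed an optional fsspec filesystem when dump_dir points at a remote store (class and method names here are illustrative, not the actual bytelatent API):

import fsspec
from typing import Optional


class JsonlMetricWriter:
    def __init__(self, outdir: str, fs: Optional[fsspec.AbstractFileSystem] = None):
        self.outdir = outdir
        self.fs = fs

    def open(self):
        if self.fs is None:
            # Local path: the builtin open in append mode is enough.
            self.jsonl_writer = open(self.outdir, "a")
        else:
            # Remote path (NFS mount, s3://... bucket, blob store): route the
            # same append through the fsspec filesystem handle.
            self.jsonl_writer = self.fs.open(self.outdir, "a")
        return self.jsonl_writer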
bytelatent/train.py
Outdated
@@ -403,6 +419,24 @@ def train(args: TrainArgs):
        batch_patch_lengths = torch.from_numpy(batch.patch_lengths).cuda()
        mask = None if batch.mask is None else torch.from_numpy(batch.mask).cuda()

        if args.data.tokenizer_args.name in ["bytes", "blt"]:
            if mask is None:
                n_bytes += batch_y.numel()
n_bytes += mask.sum() if mask else batch_y.numel()
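One nuance with the one-liner above: truth-testing a multi-element tensor ("if mask") raises a RuntimeError in PyTorch, so the check presumably still needs to be against None. A small self-contained sketch of the intended counting (names mirror the diff, but the helper itself is illustrative):

import torch
from typing import Optional


def count_loss_bytes(batch_y: torch.Tensor, mask: Optional[torch.Tensor]) -> int:
    # All target elements count as bytes when there is no mask;
    # otherwise only the unmasked positions do.
    # Note: "if mask" would raise for a multi-element tensor, so test None explicitly.
    if mask is None:
        return batch_y.numel()
    return int(mask.sum().item())


batch_y = torch.randint(0, 256, (2, 8))
mask = torch.ones_like(batch_y, dtype=torch.bool)
n_bytes = 0
n_bytes += count_loss_bytes(batch_y, mask)  # 16
n_bytes += count_loss_bytes(batch_y, None)  # 32 total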
bytelatent/train.py
Outdated
f" grad: {grad_norm:.2e}" | ||
f" flops: {FLOPS:.2e}" | ||
f" wps: {wps:.2e}" | ||
f" iter: {curr_iter_time:>7}" | ||
f" data: {data_load_time:>5}" | ||
f" lr: {curr_lr:.2e}" | ||
f" n_bytes={total_n_bytes}" |
This is confusing; we need a way to indicate per-GPU vs. across all GPUs.
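One possible way to make that distinction explicit (a sketch only, not what this PR does) is to keep the per-GPU counter and all-reduce a copy just for logging, assuming torch.distributed is initialized in the usual way:

import torch
import torch.distributed as dist


def bytes_local_and_global(n_bytes_local: int, device: torch.device) -> tuple[int, int]:
    # Per-GPU count stays as-is; the global count is the sum across ranks.
    t = torch.tensor([n_bytes_local], dtype=torch.long, device=device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return n_bytes_local, int(t.item())


# n_bytes_gpu, n_bytes_total = bytes_local_and_global(total_n_bytes, device)
# logger.info(f" n_bytes/gpu: {n_bytes_gpu} n_bytes/total: {n_bytes_total}")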
bytelatent/train.py
Outdated
logger.info(
    f"step: {train_state.step}"
    f" acc: {train_state.acc_step}"
-   f" loss: {round(loss.item(),4):>7}"
+   f" loss: step={round(loss.item(),4):>7} avg={avg_loss}"
+   f" bpb: {avg_bpb:3f}"
This is inconsistent; we normally report bpb per GPU.
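For reference, a hedged sketch of how bits-per-byte would be computed per GPU, assuming the accumulated loss is summed cross-entropy in nats over the same interval as the byte count (the PR's exact accumulation isn't shown here):

import math


def bits_per_byte(total_loss_nats: float, n_bytes: int) -> float:
    # Convert nats to bits (divide by ln 2) and normalize by the bytes predicted.
    return total_loss_nats / (n_bytes * math.log(2))


# Per-GPU bpb uses this rank's loss and byte count; a global bpb would
# all-reduce both the numerator and the denominator before dividing.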
Branch updated from 4e2ed0a to b6396eb.
bytelatent/train.py
Outdated
logger.info(
    f"step: {train_state.step}"
    f" acc: {train_state.acc_step}"
-   f" loss: {round(loss.item(),4):>7}"
+   f" loss: [step_local={round(step_loss_per_gpu, 4):>7} interval_local={round(interval_loss_per_gpu, 4):>7} step_global={round(step_loss_across_gpus, 4):>7} interval_global={round(interval_loss_across_gpus, 4):>7}]"
This is too many things; I think interval_local and interval_global are good enough.
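A sketch of what the trimmed log line could look like if only the two interval averages are kept (variable names follow the diff above, but the helper itself is illustrative):

import logging

logger = logging.getLogger(__name__)


def log_interval_loss(step: int, acc_step: int, interval_local: float, interval_global: float) -> None:
    # Keep only the per-GPU and across-GPU interval averages.
    logger.info(
        f"step: {step}"
        f" acc: {acc_step}"
        f" loss: [interval_local={round(interval_local, 4):>7}"
        f" interval_global={round(interval_global, 4):>7}]"
    )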
Summary:
Test Plan: