
Current batch indexing does not interact nicely with varying mixtures #876

Open · kothasuhas opened this issue Feb 3, 2025 · 3 comments

kothasuhas commented Feb 3, 2025

#868 implements varying the data mixture over the course of training, relying on user-provided step counts to specify where each transition should happen. When the dataset is asked for a batch, it compares the requested indices against this threshold.
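
For reference, the comparison amounts to something like the following minimal sketch (the class and field names here are hypothetical, not the actual #868 code):

```python
class TwoStageMixtureSketch:
    """Minimal sketch of the index-threshold approach; names are hypothetical."""

    def __init__(self, stage1, stage2, transition_step, batch_size):
        self.stage1 = stage1
        self.stage2 = stage2
        # the user-provided step count becomes a sequence-index threshold
        self.transition_seq_index = transition_step * batch_size

    def __getitem__(self, seq_index):
        # the dataset never sees the trainer's actual step, only the
        # requested index, so the stage decision happens here
        if seq_index < self.transition_seq_index:
            return self.stage1[seq_index]
        return self.stage2[seq_index]
```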

Unfortunately, I have now realized that the indices supplied to the dataset might not correspond exactly to the steps of the specific training run. In a two-stage curriculum, this results in data mixture 2 starting slightly before the requested transition point. This can be observed by running two experiments with the exact same mixture for both stages but different transition points: the runs are completely identical except right before the transition point. Picture attached for two runs where the mixtures are the same but the transition point is either 80% or 90% of the way through training.

[Image: comparison of the two runs, identical except just before the transition point (80% vs. 90% of training)]

Two solutions come to mind:

  1. Change the requested indices so that they arrive in training order
  2. Look into which indices are being requested and manually account for this within the varying mixture dataset (a rough sketch of what that inspection could look like is below)
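
For option 2, a thin wrapper could surface exactly which indices the trainer asks for. This is a hypothetical sketch (IndexLoggingDataset is not an existing class; names are made up), not the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)

class IndexLoggingDataset:
    """Hypothetical wrapper that records every index the trainer requests,
    so the requests can be compared against the expected step * batch_size + j."""

    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size

    def __getitem__(self, seq_index):
        implied_step, j = divmod(seq_index, self.batch_size)
        logger.info("requested index %d (implied step %d, element %d)",
                    seq_index, implied_step, j)
        return self.dataset[seq_index]
```
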
kothasuhas self-assigned this Feb 3, 2025

dlwh commented Feb 3, 2025

hrm that isn't good but I don't really understand how this happens unless the mixture block sizes aren't lining up?

kothasuhas commented Feb 3, 2025

The mixture block sizes are lining up (I assert this inside the code):

    assert start_seq_index % block_size == 0

In some initial debugging, I think I found that, during a training run, the indices being requested by the trainer are not perfectly aligned with the actual step count. I should make a minimal script to share.

Is it possible that some intermediate steps are being thrown out, or used for eval, when the trainer iterates? If batch i, element j always requests index i * batch_size + j, then I can't understand why this bug would occur.
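
Concretely, if that invariant held, the switch would land exactly on the requested step. A quick sanity check with made-up numbers (batch_size = 8, transition at step 100, so the index threshold is 800):

```python
batch_size = 8
transition_step = 100  # made-up numbers, purely for illustration
threshold = transition_step * batch_size  # first index that should come from mixture 2

def expected_index(step, j):
    # the invariant in question: batch `step`, element `j` requests this global index
    return step * batch_size + j

assert expected_index(transition_step - 1, batch_size - 1) == threshold - 1  # last mixture-1 sample
assert expected_index(transition_step, 0) == threshold  # first mixture-2 sample
```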


dlwh commented Feb 3, 2025

so each machine only loads the data it needs, but it should just be batch_size * step + index_within_batch:

for bn in batch_numbers:
    indices_this_batch = range(bn * self.dl.batch_size, (bn + 1) * self.dl.batch_size, 1)
    indices_this_batch_this_process = [indices_this_batch[i] for i in self.dl._local_indices]
    indices_for_this_batch_of_batches.extend(indices_this_batch_this_process)
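
For illustration, here is a self-contained version of that loop with made-up values (batch_size = 4, two processes owning two slots each; not the real loader config):

```python
batch_size = 4
local_indices_by_process = {0: [0, 1], 1: [2, 3]}  # hypothetical process -> slots
batch_numbers = [5, 6]  # e.g. a prefetched "batch of batches"

for process, local_indices in local_indices_by_process.items():
    indices_for_this_batch_of_batches = []
    for bn in batch_numbers:
        indices_this_batch = range(bn * batch_size, (bn + 1) * batch_size)
        indices_this_batch_this_process = [indices_this_batch[i] for i in local_indices]
        indices_for_this_batch_of_batches.extend(indices_this_batch_this_process)
    print(process, indices_for_this_batch_of_batches)

# prints:
# 0 [20, 21, 24, 25]
# 1 [22, 23, 26, 27]
```

Every index is still batch_size * step + index_within_batch, just partitioned across processes.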
