
Current batch indexing does not interact nicely with varying mixtures #876

Open · kothasuhas opened this issue Feb 3, 2025 · 3 comments

kothasuhas commented Feb 3, 2025

#868 implements varying the data mixture over the course of training, relying on user-provided step counts to specify where each transition should happen. When the dataset is asked for a batch, it compares the requested indices against this threshold.
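
For reference, the comparison amounts to something like the following minimal sketch (the class and field names here are hypothetical, not the actual #868 code):

```python
class TwoStageMixtureSketch:
    """Minimal sketch of the index-threshold approach; names are hypothetical."""

    def __init__(self, stage1, stage2, transition_step, batch_size):
        self.stage1 = stage1
        self.stage2 = stage2
        # the user-provided step count becomes a sequence-index threshold
        self.transition_seq_index = transition_step * batch_size

    def __getitem__(self, seq_index):
        # the dataset never sees the trainer's actual step, only the
        # requested index, so the stage decision happens here
        if seq_index < self.transition_seq_index:
            return self.stage1[seq_index]
        return self.stage2[seq_index]
```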

Unfortunately, I have now realized that the indices supplied to the dataset might not correspond exactly to the steps of the specific training run. In a two-stage curriculum, this results in data mixture 2 starting slightly before the requested transition point. This can be observed by running two experiments with the exact same mixture for both stages but different transition points: the runs are completely identical except right before the transition point. Picture attached for two runs where the mixtures are the same but the transition point is either 80% or 90% of the way through training.

[Image: comparison of the two runs, identical except just before the transition point (80% vs. 90% of training)]

Two solutions come to mind:

  1. Change the requested indices so that they arrive in training order
  2. Look into which indices are being requested and manually account for this within the varying mixture dataset (a rough sketch of what that inspection could look like is below)
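
For option 2, a thin wrapper could surface exactly which indices the trainer asks for. This is a hypothetical sketch (IndexLoggingDataset is not an existing class; names are made up), not the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)

class IndexLoggingDataset:
    """Hypothetical wrapper that records every index the trainer requests,
    so the requests can be compared against the expected step * batch_size + j."""

    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size

    def __getitem__(self, seq_index):
        implied_step, j = divmod(seq_index, self.batch_size)
        logger.info("requested index %d (implied step %d, element %d)",
                    seq_index, implied_step, j)
        return self.dataset[seq_index]
```
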
kothasuhas self-assigned this Feb 3, 2025

dlwh commented Feb 3, 2025

hrm that isn't good but I don't really understand how this happens unless the mixture block sizes aren't lining up?

kothasuhas commented Feb 3, 2025

The mixture block sizes are lining up (I assert this inside the code):

    assert start_seq_index % block_size == 0

In some initial debugging, I think I found that, during a training run, the indices being requested by the trainer are not perfectly aligned with the actual step count. I should make a minimal script to share.

Is it possible that some intermediate steps are being thrown out, or used for eval, when the trainer iterates? If batch i, element j always requests index i * batch_size + j, then I can't understand why this bug would occur.
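
Concretely, if that invariant held, the switch would land exactly on the requested step. A quick sanity check with made-up numbers (batch_size = 8, transition at step 100, so the index threshold is 800):

```python
batch_size = 8
transition_step = 100  # made-up numbers, purely for illustration
threshold = transition_step * batch_size  # first index that should come from mixture 2

def expected_index(step, j):
    # the invariant in question: batch `step`, element `j` requests this global index
    return step * batch_size + j

assert expected_index(transition_step - 1, batch_size - 1) == threshold - 1  # last mixture-1 sample
assert expected_index(transition_step, 0) == threshold  # first mixture-2 sample
```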


dlwh commented Feb 3, 2025

so each machine only loads the data it needs, but it should just be batch_size * step + index_within_batch:

for bn in batch_numbers:
    indices_this_batch = range(bn * self.dl.batch_size, (bn + 1) * self.dl.batch_size, 1)
    indices_this_batch_this_process = [indices_this_batch[i] for i in self.dl._local_indices]
    indices_for_this_batch_of_batches.extend(indices_this_batch_this_process)
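
For illustration, here is a self-contained version of that loop with made-up values (batch_size = 4, two processes owning two slots each; not the real loader config):

```python
batch_size = 4
local_indices_by_process = {0: [0, 1], 1: [2, 3]}  # hypothetical process -> slots
batch_numbers = [5, 6]  # e.g. a prefetched "batch of batches"

for process, local_indices in local_indices_by_process.items():
    indices_for_this_batch_of_batches = []
    for bn in batch_numbers:
        indices_this_batch = range(bn * batch_size, (bn + 1) * batch_size)
        indices_this_batch_this_process = [indices_this_batch[i] for i in local_indices]
        indices_for_this_batch_of_batches.extend(indices_this_batch_this_process)
    print(process, indices_for_this_batch_of_batches)

# prints:
# 0 [20, 21, 24, 25]
# 1 [22, 23, 26, 27]
```

Every index is still batch_size * step + index_within_batch, just partitioned across processes.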
