Support MoE for GPTModelPipe #373
Conversation
Propagate the aux loss along GPTModelPipe layers by forwarding the aggregated loss from each transformer layer to the next. In addition, add a layer to GPTModelPipe, after the last transformer layer, that catches the final aggregated aux loss and caches it for use in the loss function. Signed-off-by: Moshe Island <[email protected]>
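A minimal sketch of this mechanism (class names here are hypothetical, not the actual Megatron-DeepSpeed implementation): each pipeline layer receives and re-emits a (hidden_states, aggregated_aux_loss) pair, and a final module caches the aggregated loss for the loss function.

```python
import torch


class TransformerLayerWithAux(torch.nn.Module):
    """Pipeline stage that receives (hidden_states, aggregated_aux_loss) and
    re-emits the pair, adding its own MoE aux loss to the running total."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, inputs):
        hidden_states, aggregated_aux_loss = inputs
        hidden_states = self.proj(hidden_states)
        # A real MoE layer would compute a load-balancing aux loss here;
        # a zero placeholder keeps the sketch self-contained.
        layer_aux_loss = torch.zeros_like(aggregated_aux_loss)
        return hidden_states, aggregated_aux_loss + layer_aux_loss


class AuxLossCatcher(torch.nn.Module):
    """Placed after the last transformer layer: caches the final aggregated
    aux loss (still carrying grad) for later use in the loss function and
    passes the activations through unchanged."""

    def __init__(self):
        super().__init__()
        self.aggregated_aux_loss = None

    def forward(self, inputs):
        hidden_states, aggregated_aux_loss = inputs
        self.aggregated_aux_loss = aggregated_aux_loss
        return hidden_states
```

The loss function can then add the cached aux loss to the LM loss, so the gating gradients still flow back through the pipeline.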
Currently, PipelineEngine supports partitioning only a single tensor with grad. The MoE model requires forwarding both the activations and the aux_loss with grad. Therefore, until this PipelineEngine limitation is removed, verify that no partitioning is used with MoE. Signed-off-by: Moshe Island <[email protected]>
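A minimal sketch of such a guard, assuming hypothetical argument names (a per-layer num_experts list and a partition_activations flag), not the actual Megatron-DeepSpeed code:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrainingArgs:
    # Hypothetical stand-ins for the relevant training arguments.
    num_experts: List[int] = field(default_factory=lambda: [1])
    partition_activations: bool = False


def verify_no_partitioning_with_moe(args: TrainingArgs) -> None:
    """Reject activation partitioning when MoE is enabled, since PipelineEngine
    currently partitions only a single grad-carrying tensor while MoE must
    also forward the aux_loss with grad."""
    moe_enabled = any(n > 1 for n in args.num_experts)
    if moe_enabled and args.partition_activations:
        raise ValueError(
            "Activation partitioning is not supported with MoE in GPTModelPipe "
            "until PipelineEngine can partition multiple grad tensors."
        )
```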
Thank you @mosheisland, this work is truly amazing! I also deeply appreciate the effort you put into the thorough verification.
Moshe: For the non-MoE runs, the plotted losses are the LM loss.
Moshe: Sure, below is the total loss (LM + Aux) with a color legend. Note there are also two runs called gpt_3d_moe_with_pr and gpt_3d_moe_with_pr_new.
Moshe: I am mainly interested in running with BF16_Optimizer.
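For context, a minimal illustration of the plotted quantities (the coefficient name and default value are assumptions; the scaling may also be applied inside the MoE gate): non-MoE runs report the LM loss alone, while MoE runs report the LM loss plus the weighted auxiliary loss.

```python
def total_loss(lm_loss: float, aux_loss: float, moe_loss_coeff: float = 0.01) -> float:
    # Hypothetical combination of the two plotted quantities.
    return lm_loss + moe_loss_coeff * aux_loss
```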
@mosheisland Thank you for sharing the results! They all look good to me.
@tohtana, deepspeedai/DeepSpeed#5338 is merged, so I think we can now proceed with this one.
Thank you @mosheisland, merged now. |
Main changes:
- Propagate the aux loss along GPTModelPipe layers and add a layer after the last transformer layer to cache the final aggregated aux loss for use in the loss function.
- Verify that activation partitioning is not used together with MoE, until the PipelineEngine limitation (a single grad-carrying partitioned tensor) is removed.
NOTE: this PR depends on DeepSpeed PR #5338 (https://github.com/microsoft/DeepSpeed/pull/5338).
Below are tensorboard captures of tests that verify MoE support for the pipeline and confirm there are no regressions.
Testing was done with the following configurations:
Training runs with and without this PR:
Training loss curve of a GPTModel with fp16, no MoE (i.e. a dense network).

Scaling: 8xA100, DP=4, TP=2, PP=1, ZeRO=0, using GPTModel
Comparing without vs with this PR.
Training loss curve of a GPTModel with fp16, with MoE (4 experts, top-2).

Scaling: 8xA100, DP=4, TP=2, PP=1, ZeRO=0, using GPTModel
Comparing without vs with this PR.
Training loss curve of a GPTModelPipe with BF16_Optimizer, no MoE (i.e. a dense network).

Scaling: 8xA100, DP=2, TP=2, PP=2, ZeRO=0, using GPTModelPipe
Comparing without vs with this PR.
Comparing the configurations above, both with this PR:
At the beginning of training, the GPTModel fp16 run lags slightly behind due to a few steps of loss-scale adjustment. However, both configurations end up with very close loss.
