
Support MoE for GPTModelPipe #373

Merged: 4 commits into deepspeedai:main on Apr 9, 2024

Conversation

mosheisland

Main changes:

  • Support creation of MoE layers.
  • Support MoE aux loss for GPTModelPipe by propagating the aux loss along the layers (a sketch of this is shown after the note below).
  • Support display of MoE aux loss.

NOTE that this PR depends on DeepSpeed PR #5338: https://github.com/microsoft/DeepSpeed/pull/5338
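Below is a minimal sketch of the propagation idea from the second bullet: each pipeline layer receives a tuple of (activations, running aux loss), adds its own MoE aux loss, and forwards the tuple to the next layer. Class and attribute names are illustrative placeholders, not the actual Megatron-DeepSpeed code.

```python
import torch.nn as nn

class MoETransformerLayerSketch(nn.Module):
    """Hypothetical pipeline-friendly transformer layer, illustrating how the
    aggregated MoE aux loss can ride along with the activations between
    pipeline stages. Not the actual Megatron-DeepSpeed implementation."""

    def __init__(self, moe_block: nn.Module):
        super().__init__()
        # moe_block is assumed to return (hidden_states, layer_aux_loss).
        self.moe_block = moe_block

    def forward(self, inputs):
        # GPTModelPipe layers pass tuples between stages; here the tuple is
        # (activations, running aggregated aux loss).
        hidden_states, aggregated_aux_loss = inputs
        hidden_states, layer_aux_loss = self.moe_block(hidden_states)
        # Accumulate this layer's aux loss and forward the pair to the next layer.
        return hidden_states, aggregated_aux_loss + layer_aux_loss
```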

Below are TensorBoard captures of tests that verify MoE support for the pipeline model and check for regressions.
Testing was done with the following configurations:

  • Dense model: a LLaMA-like model with 8 layers.
  • MoE models: the dense model with --num-experts 4 --topk 2 --disable-moe-token-dropping --expert-interval 1

Training runs with and without this PR:

  1. GPTModel Dense (No MoE) using DP, TP with fp16
  2. GPTModelPipe Dense (No MoE) using DP, TP, PP with BF16_Optimizer
  3. GPTModel MoE using DP, TP, EP with fp16
  4. GPTModelPipe MoE using DP, TP, PP, EP with BF16_Optimizer (only with this PR)

Training loss curve of a GPTModel with fp16, no MoE (i.e. a dense network).
Scaling: 8xA100, DP=4 TP=2 PP=1, ZeRO=0, using GPTModel.
Comparing without vs. with this PR.
GPTModel_2D_Dense_fp16_with_vs_without_PR

Training loss curve of a GPTModel with fp16, with MoE (4 experts, top-2).
Scaling: 8xA100, DP=4 TP=2 PP=1, ZeRO=0, using GPTModel.
Comparing without vs. with this PR.
GPTModel_2D_MoE_fp16_with_vs_without_PR

Training loss curve of a GPTModelPipe model with BF16_Optimizer, no MoE (i.e. a dense network).
Scaling: 8xA100, DP=2 TP=2 PP=2, ZeRO=0, using GPTModelPipe.
Comparing without vs. with this PR.
GPTModelPipe_3D_Dense_bf16_with_vs_without_PR

Comparing the following runs, both using this PR:

  • GPTModel DP=4 TP=2 PP=1 fp16
  • GPTModelPipe DP=2 TP=2 PP=2 BF16_Optimizer

At the beginning of training, GPTModel fp16 lags a little due to a few steps of loss-scale adjustment; however, both configurations end up with very close loss.
GPTModel_fp16_vs_GPTModelPipePipe_bf16_MOE

Propagate aux loss along GPTModelPipe layers by forwarding the aggregated loss
from each transformer layer to the next transformer layer.

In addition, add a layer to GPTModelPipe, after the last transformer layer, to
catch the final aggregated aux loss and cache it for use in the loss function.

Signed-off-by: Moshe Island <[email protected]>
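As a rough illustration of this commit (names are hypothetical, not the actual code): a small layer appended after the last transformer layer strips the aggregated aux loss out of the forwarded tuple and caches it, and the loss function then adds it to the LM loss.

```python
import torch
import torch.nn as nn

# Hypothetical names; a sketch of the "catch and cache" idea described above.
_cached_aux_loss = None

class AuxLossCatcherSketch(nn.Module):
    """Placed after the last transformer layer of the pipeline (sketch only):
    strips the aggregated aux loss from the forwarded tuple, caches it, and
    passes just the activations on to the remaining layers."""

    def forward(self, inputs):
        global _cached_aux_loss
        hidden_states, aggregated_aux_loss = inputs
        _cached_aux_loss = aggregated_aux_loss
        return hidden_states

def loss_fn_sketch(lm_loss: torch.Tensor) -> torch.Tensor:
    # The training loss adds the cached aux loss to the LM loss so that the
    # backward pass reaches the MoE routers on every pipeline stage.
    return lm_loss if _cached_aux_loss is None else lm_loss + _cached_aux_loss
```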
Currently, PipelineEngine supports partitioning only a single tensor with grad.
The MoE model requires forwarding both the activations and the aux_loss with grad.
Therefore, until this PipelineEngine limitation is removed, verify that no
partitioning is used when using MoE.

Signed-off-by: Moshe Island <[email protected]>
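A hedged sketch of the kind of guard this commit describes (argument names are placeholders, not actual Megatron-DeepSpeed flags):

```python
def verify_no_pipe_partitioning_sketch(moe_enabled: bool, pipe_partitioned: bool) -> None:
    """Hypothetical guard; the argument names are illustrative placeholders."""
    if moe_enabled and pipe_partitioned:
        raise RuntimeError(
            "MoE with GPTModelPipe requires pipeline tensor partitioning to be "
            "disabled: PipelineEngine currently partitions only a single "
            "grad-requiring tensor, while MoE forwards both the activations "
            "and the aux_loss."
        )
```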
@tohtana

tohtana commented Apr 3, 2024

Thank you @mosheisland, this work is truly amazing! I deeply appreciate your effort for the thorough verification as well.
Can you clarify a few things for reference in the future?

  • Are the losses shown here only LM losses or do they include aux losses?
  • Is it possible to share a chart of aux losses? We would like to make sure that the backprop worked properly.
  • You verified with Z0. Is there a reason you didn't use Z1?

@mosheisland
Author

mosheisland commented Apr 3, 2024

Thank you @mosheisland, this work is truly amazing! I deeply appreciate your effort for the thorough verification as well. Can you clarify a few things for reference in the future?

  • Are the losses shown here only LM losses or do they include aux losses?

Moshe: For non-MoE runs, the losses shown are the LM loss only.
For MoE runs, "lm loss" is the LM loss only and "loss" is LM loss + aux loss.
However, I see that the chart titles are hard to read.

  • Is it possible to share a chart of aux losses? We would like to make sure that the backprop worked properly.

Moshe: sure, below:

Aux loss:
[chart image]

LM loss:
[chart image]

Total loss (LM + aux), displayed only for GPTModelPipe:
[chart image]

Color legend:
[legend image]
with_pr = with both the required DeepSpeed PR and this PR
before = without them

Also there are two runs called gpt_3d_moe_with_pr and gpt_3d_moe_with_pr_new.
Both are the same (the "new" one is after the last rebase).

  • You verified with Z0. Is there a reason you didn't use Z1?

Moshe: I am mainly interested in running with BF16_Optimizer.
BF16_Optimizer internally uses ZeRO=1.
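For reference, a minimal DeepSpeed config sketch in Python dict form; the keys are standard DeepSpeed config sections, but the values are illustrative and not the exact configuration used for these runs, and the optimizer DeepSpeed selects also depends on the version:

```python
# Illustrative DeepSpeed config (Python dict form). "bf16" and
# "zero_optimization" are standard DeepSpeed config sections; the values are
# examples only, not the exact settings used in these experiments.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},          # typically engages the BF16_Optimizer path
    "zero_optimization": {"stage": 0},  # BF16_Optimizer shards optimizer state ZeRO-1 style internally
}
```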

@tohtana

tohtana commented Apr 3, 2024

@mosheisland Thank you for sharing the results! They all look good to me.
Let me take a bit more time to review the PR on DS side.

@mosheisland
Author

@tohtana, deepspeedai/DeepSpeed#5338 is merged, so I think we can now proceed with this one.

@tohtana tohtana merged commit bcedecd into deepspeedai:main Apr 9, 2024
1 check passed
@tohtana

tohtana commented Apr 9, 2024

Thank you @mosheisland, merged now.
