Support MoE for GPTModelPipe #373
Conversation
Propagate the aux loss along GPTModelPipe layers by forwarding the aggregated loss from each transformer layer to the next. In addition, add a layer to GPTModelPipe, after the last transformer layer, that catches the final aggregated aux loss and caches it for use in the loss function. Signed-off-by: Moshe Island <[email protected]>
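A minimal sketch of this mechanism (class names here are hypothetical, not the actual Megatron-DeepSpeed implementation): each pipeline layer receives and re-emits a (hidden_states, aggregated_aux_loss) pair, and a final module caches the aggregated loss for the loss function.

```python
import torch


class TransformerLayerWithAux(torch.nn.Module):
    """Pipeline stage that receives (hidden_states, aggregated_aux_loss) and
    re-emits the pair, adding its own MoE aux loss to the running total."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, inputs):
        hidden_states, aggregated_aux_loss = inputs
        hidden_states = self.proj(hidden_states)
        # A real MoE layer would compute a load-balancing aux loss here;
        # a zero placeholder keeps the sketch self-contained.
        layer_aux_loss = torch.zeros_like(aggregated_aux_loss)
        return hidden_states, aggregated_aux_loss + layer_aux_loss


class AuxLossCatcher(torch.nn.Module):
    """Placed after the last transformer layer: caches the final aggregated
    aux loss (still carrying grad) for later use in the loss function and
    passes the activations through unchanged."""

    def __init__(self):
        super().__init__()
        self.aggregated_aux_loss = None

    def forward(self, inputs):
        hidden_states, aggregated_aux_loss = inputs
        self.aggregated_aux_loss = aggregated_aux_loss
        return hidden_states
```

The loss function can then add the cached aux loss to the LM loss, so the gating gradients still flow back through the pipeline.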
Currently, PipelineEngine supports partitioning only a single tensor with grad. The MoE model requires forwarding both the activations and the aux_loss with grad. Therefore, until this PipelineEngine limitation is removed, verify that no partitioning is used with MoE. Signed-off-by: Moshe Island <[email protected]>
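A minimal sketch of such a guard, assuming hypothetical argument names (a per-layer num_experts list and a partition_activations flag), not the actual Megatron-DeepSpeed code:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrainingArgs:
    # Hypothetical stand-ins for the relevant training arguments.
    num_experts: List[int] = field(default_factory=lambda: [1])
    partition_activations: bool = False


def verify_no_partitioning_with_moe(args: TrainingArgs) -> None:
    """Reject activation partitioning when MoE is enabled, since PipelineEngine
    currently partitions only a single grad-carrying tensor while MoE must
    also forward the aux_loss with grad."""
    moe_enabled = any(n > 1 for n in args.num_experts)
    if moe_enabled and args.partition_activations:
        raise ValueError(
            "Activation partitioning is not supported with MoE in GPTModelPipe "
            "until PipelineEngine can partition multiple grad tensors."
        )
```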
Thank you @mosheisland, this work is truly amazing! I also deeply appreciate the effort you put into the thorough verification.
Moshe: For the non-MoE runs, the plotted losses are the LM loss.
Moshe: Sure, below is the total loss (LM + Aux) with a color legend. Note there are also two runs called gpt_3d_moe_with_pr and gpt_3d_moe_with_pr_new.
Moshe: I am mainly interested in running with BF16_Optimizer.
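For context, a minimal illustration of the plotted quantities (the coefficient name and default value are assumptions; the scaling may also be applied inside the MoE gate): non-MoE runs report the LM loss alone, while MoE runs report the LM loss plus the weighted auxiliary loss.

```python
def total_loss(lm_loss: float, aux_loss: float, moe_loss_coeff: float = 0.01) -> float:
    # Hypothetical combination of the two plotted quantities.
    return lm_loss + moe_loss_coeff * aux_loss
```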
@mosheisland Thank you for sharing the results! They all look good to me.
@tohtana, deepspeedai/DeepSpeed#5338 is merged, so I think we can now proceed with this one.
Thank you @mosheisland, merged now. |
Main changes:
- Propagate the aux loss along GPTModelPipe layers and add a layer after the last transformer layer to cache the final aggregated aux loss for use in the loss function.
- Verify that activation partitioning is not used together with MoE, until the PipelineEngine limitation (a single grad-carrying partitioned tensor) is removed.
NOTE: this PR depends on DeepSpeed PR #5338 (https://github.com/microsoft/DeepSpeed/pull/5338).
Below are tensorboard captures of tests that verify MoE support for the pipeline and confirm there are no regressions.
Testing was done with the following configurations:
Training runs with and without this PR:
Training loss curve of a GPTModel with fp16, no MoE (i.e. a dense network).

Scaling: 8xA100, DP=4, TP=2, PP=1, ZeRO=0, using GPTModel
Comparing without vs with this PR.
Training loss curve of a GPTModel with fp16, with MoE (4 experts, top-2).

Scaling: 8xA100, DP=4, TP=2, PP=1, ZeRO=0, using GPTModel
Comparing without vs with this PR.
Training loss curve of a GPTModelPipe with BF16_Optimizer, no MoE (i.e. a dense network).

Scaling: 8xA100, DP=2, TP=2, PP=2, ZeRO=0, using GPTModelPipe
Comparing without vs with this PR.
Comparing the configurations above, both with this PR:
At the beginning of training, the GPTModel fp16 run lags slightly behind due to a few steps of loss-scale adjustment. However, both configurations end up with very close loss.
