Hi there, thanks for the amazing work!
I set up an experiment with 64 experts split across 2 devices using expert parallelism, with both MegaBlocks and the distributed optimizer enabled. However, I found that the saved experts across devices are the same: the 32 experts on rank 0 have the same weights as the 32 experts on rank 1.
But when the distributed optimizer is disabled, there seems to be no problem. So I'm wondering if there is still a potential incompatibility with the latest Megatron-LM.
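In case it helps to reproduce, here is a minimal sketch (not part of MegaBlocks or Megatron-LM) of the kind of check I mean: compute a checksum of each rank's expert parameters and all-gather the results. The `"experts"` name filter, the NCCL/CUDA setup, and the use of the default process group are assumptions and would need to be adapted to the actual model and parallel groups.

```python
# Sketch only: compare a checksum of "expert" parameters across ranks.
# Assumes torch.distributed is initialized with the NCCL backend and
# that expert parameter names contain "experts" (an assumption).
import torch
import torch.distributed as dist

def expert_weight_fingerprints(model, name_filter="experts"):
    """Return every rank's checksum of the parameters matching name_filter."""
    total = 0.0
    for name, param in model.named_parameters():
        if name_filter in name:
            total += param.detach().double().sum().item()
    checksum = torch.tensor([total], dtype=torch.float64,
                            device=torch.cuda.current_device())
    gathered = [torch.zeros_like(checksum) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, checksum)
    return [t.item() for t in gathered]

# Usage after the model is built:
#   sums = expert_weight_fingerprints(model)
#   if dist.get_rank() == 0:
#       print("per-rank expert checksums:", sums)
# Identical checksums on the expert-parallel ranks match the symptom above
# (experts replicated instead of split).
```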
I also found that expert parallelism is asserted to be incompatible with the distributed optimizer in the fork version of Megatron-LM here:
https://github.com/stanford-futuredata/Megatron-LM/blob/85f95aef3b648075fe6f291c86714fdcbd9cd1f5/megatron/arguments.py#L352-L356
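For reference, the linked check amounts to a guard of roughly this shape (paraphrased, not the fork's exact code; the argument names below are assumptions):

```python
# Paraphrase of the kind of validation linked above, not the fork's exact
# code. The argument names (moe_expert_model_parallelism,
# use_distributed_optimizer) are assumptions.
def validate_args(args):
    if args.moe_expert_model_parallelism:
        # Refuse to combine expert parallelism with the distributed
        # (sharded) optimizer.
        assert not args.use_distributed_optimizer, \
            "Expert parallelism is not supported with the distributed optimizer."
```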
But there's no such validation in the open PR to Megatron-LM: NVIDIA/Megatron-LM#288
Does that mean the assertion is redundant and that the current version of MegaBlocks is compatible with the distributed optimizer under expert parallelism?
Thanks very much.