Hi there, thanks for the amazing work!
I set up an experiment with 64 experts split across 2 devices using expert parallelism, with both MegaBlocks and the distributed optimizer enabled. However, I found that the saved experts across devices are the same: the 32 experts on rank 0 have the same weights as the 32 experts on rank 1.
But when the distributed optimizer is disabled, there seems to be no problem. So I'm wondering if there is still a potential incompatibility with the latest Megatron-LM.
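In case it helps to reproduce, here is a minimal sketch (not part of MegaBlocks or Megatron-LM) of the kind of check I mean: compute a checksum of each rank's expert parameters and all-gather the results. The `"experts"` name filter, the NCCL/CUDA setup, and the use of the default process group are assumptions and would need to be adapted to the actual model and parallel groups.

```python
# Sketch only: compare a checksum of "expert" parameters across ranks.
# Assumes torch.distributed is initialized with the NCCL backend and
# that expert parameter names contain "experts" (an assumption).
import torch
import torch.distributed as dist

def expert_weight_fingerprints(model, name_filter="experts"):
    """Return every rank's checksum of the parameters matching name_filter."""
    total = 0.0
    for name, param in model.named_parameters():
        if name_filter in name:
            total += param.detach().double().sum().item()
    checksum = torch.tensor([total], dtype=torch.float64,
                            device=torch.cuda.current_device())
    gathered = [torch.zeros_like(checksum) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, checksum)
    return [t.item() for t in gathered]

# Usage after the model is built:
#   sums = expert_weight_fingerprints(model)
#   if dist.get_rank() == 0:
#       print("per-rank expert checksums:", sums)
# Identical checksums on the expert-parallel ranks match the symptom above
# (experts replicated instead of split).
```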
I also found that expert parallelism is asserted to be incompatible with the distributed optimizer in the fork version of Megatron-LM here:
https://github.com/stanford-futuredata/Megatron-LM/blob/85f95aef3b648075fe6f291c86714fdcbd9cd1f5/megatron/arguments.py#L352-L356
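For reference, the linked check amounts to a guard of roughly this shape (paraphrased, not the fork's exact code; the argument names below are assumptions):

```python
# Paraphrase of the kind of validation linked above, not the fork's exact
# code. The argument names (moe_expert_model_parallelism,
# use_distributed_optimizer) are assumptions.
def validate_args(args):
    if args.moe_expert_model_parallelism:
        # Refuse to combine expert parallelism with the distributed
        # (sharded) optimizer.
        assert not args.use_distributed_optimizer, \
            "Expert parallelism is not supported with the distributed optimizer."
```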
But there's no such validation in the open PR to Megatron-LM: NVIDIA/Megatron-LM#288
Does that mean the assertion is redundant and that the current version of MegaBlocks is compatible with the distributed optimizer under expert parallelism?
Thanks very much.