
Support all the softmax extensions and cherry-pick transformer-related commits #101

Open · wants to merge 25 commits into master

Conversation


@hubertlu-tw hubertlu-tw commented Dec 30, 2022

  • Cherry-pick transformer-related commits from the upstream
  • Support generic_scaled_masked_softmax_cuda
  • Support scaled_softmax_cuda
  • Support fused_weight_gradient_mlp_cuda for ROCm (an import smoke test for these extensions is sketched at the end of this description)
  • Add optimizers and clip_grad to run_rocm_extensions.py
  • Fix a bug in run_rocm_extensions.py

To run the extension unit tests, run the following commands:

    cd apex/contrib/test
    APEX_TEST_WITH_ROCM=1 APEX_SKIP_FLAKY_TEST=1 python3 run_rocm_extensions.py

To run the transformer unit tests, run the following command:

    python tests/L0/run_test.py --include run_transformer

Current result:

    Ran 120 tests in 506.928s
    FAILED (errors=7, skipped=55)

TODO: Work on an IFU PR and investigate the failing tests so that they can be skipped on ROCm.
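
A quick way to confirm that the new extensions were actually built is to try importing them. The sketch below assumes apex was installed with `--cpp_ext --cuda_ext` and that the extension modules keep their default names from setup.py:

```python
# Import smoke test for the extensions enabled by this PR (sketch only;
# module names are the default extension names from apex's setup.py).
import importlib

for ext in (
    "generic_scaled_masked_softmax_cuda",
    "scaled_softmax_cuda",
    "fused_weight_gradient_mlp_cuda",
):
    try:
        importlib.import_module(ext)
        print(f"{ext}: OK")
    except ImportError as exc:
        print(f"{ext}: not available ({exc})")
```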

yidong72 and others added 25 commits December 28, 2022 00:10
* new kernel

Signed-off-by: Yi Dong <[email protected]>

* added the unit tests

Signed-off-by: Yi Dong <[email protected]>

* clean up unittest

Signed-off-by: Yi Dong <[email protected]>

* use float

Signed-off-by: Yi Dong <[email protected]>

* more clean up

Signed-off-by: Yi Dong <[email protected]>

* remove the long seq test case
…DIA#1448)

* less mem consumption by fused generic softmax tests

ran with RTX 3070 Ti

Signed-off-by: Masaki Kozuki <[email protected]>

* Deduplicate qlen of 1234
…VIDIA#1451)

* Use xmlrunner.XMLTestRunner accordingly

TODO:
- [x] Remove `subTest` because it's not compatible with the current way
of running L0 tests

Signed-off-by: Masaki Kozuki <[email protected]>

* use `torch.testing` more to enable xmlrunner

Signed-off-by: Masaki Kozuki <[email protected]>

* Remove `subTest` for xmlrunner

Signed-off-by: Masaki Kozuki <[email protected]>

* removing subTest

Signed-off-by: Masaki Kozuki <[email protected]>

* not depend on an env var

Signed-off-by: Masaki Kozuki <[email protected]>

* fix syntax errors

* open with `"wb"`

* xml file per dir

Signed-off-by: Masaki Kozuki <[email protected]>

* remove comment-out

Signed-off-by: Masaki Kozuki <[email protected]>

* Refactor `TestTransformer`: define member methods (#5)

* setUpClass to define `test_` methods

Signed-off-by: Masaki Kozuki <[email protected]>

* manually define

Signed-off-by: Masaki Kozuki <[email protected]>

Signed-off-by: Masaki Kozuki <[email protected]>

* add a missing test

Signed-off-by: Masaki Kozuki <[email protected]>

* remove print

Signed-off-by: Masaki Kozuki <[email protected]>

* remove ext

Signed-off-by: Masaki Kozuki <[email protected]>

Signed-off-by: Masaki Kozuki <[email protected]>
to use `torch.testing.assert_close` instead of
`numpy.testing.assert_allclose`. The former uses a bit looser threshold
values.

Signed-off-by: Masaki Kozuki <[email protected]>
…VIDIA#1451)

* apex.amp migration to torch.cuda.amp

Signed-off-by: Masaki Kozuki <[email protected]>

* add autocast tests

Signed-off-by: Masaki Kozuki <[email protected]>

* split with and without autocast

Signed-off-by: Masaki Kozuki <[email protected]>

Signed-off-by: Masaki Kozuki <[email protected]>
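
For context, the autocast tests exercise the standard `torch.cuda.amp` pattern sketched below; the tiny model and optimizer are placeholders for illustration, not code from the test suite:

```python
# Standard torch.cuda.amp training-step pattern that the tests migrate to.
# The model and optimizer here are placeholders, not from the test suite.
import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(4, 16, device="cuda")).sum()
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()
```
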
* Label smoothing in vocab parallel cross entropy

Signed-off-by: MaximumEntropy <[email protected]>

* Fix context saving

Signed-off-by: MaximumEntropy <[email protected]>

* Remove .item() calls

Signed-off-by: MaximumEntropy <[email protected]>

* Update tests

Signed-off-by: MaximumEntropy <[email protected]>

Signed-off-by: MaximumEntropy <[email protected]>
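
For reference, label smoothing on a plain single-rank cross entropy looks like the sketch below; the vocab-parallel version additionally shards the vocabulary dimension across tensor-parallel ranks, which this sketch does not attempt to reproduce:

```python
# Label-smoothed cross entropy on a single rank (illustrative only; the
# vocab-parallel implementation shards the vocab dimension across ranks).
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, target, label_smoothing=0.1):
    log_probs = F.log_softmax(logits.float(), dim=-1)
    nll = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.mean(dim=-1)  # loss against the uniform distribution
    return ((1.0 - label_smoothing) * nll + label_smoothing * smooth).mean()
```
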
…atron pipeline parallelism (NVIDIA#1475)

* Refactor how dist Adam handles overlapped grad sync

Each grad bucket independently keeps track of grads that have been generated. Add helper function to create callback functions. Change default param arg in grad norm functions to None. Perform communication for checkpointing in main stream to avoid memory pool overheads.

* Support Megatron pipeline parallelism with async grad reduction

Enables async grad reduction in first pipeline stage during last backward pass, and disables async grad reduction in all other pipeline stages.

* Review suggestions from crcrpar

Add unit test for pipeline parallelism with custom sync context. Style tweaks.

* Use unittest assert functions in pipeline parallelism test

Review suggestion from crcrpar
* Optionally disable stream synchronization after batched p2p communication

* Add test cases with `sync_batch_comm=False`

only when pytorch/pytorch#82450 is included in
pytorch.

Signed-off-by: Masaki Kozuki <[email protected]>

* utilize existing test methods

Signed-off-by: Masaki Kozuki <[email protected]>

* consistent naming

Signed-off-by: Masaki Kozuki <[email protected]>
Co-authored-by: Aidyn-A <[email protected]>

* silly boy, to skip the sync, set False

Signed-off-by: Masaki Kozuki <[email protected]>

* cosmetic

Signed-off-by: Masaki Kozuki <[email protected]>

* Test with async_pipelining w/o sync after batch_isend_irecv

Signed-off-by: Masaki Kozuki <[email protected]>

* again, set sync_batch_comm to False

Signed-off-by: Masaki Kozuki <[email protected]>
Co-authored-by: Aidyn-A <[email protected]>

* Remove `torch.testing._internal.common_cuda`

Signed-off-by: Masaki Kozuki <[email protected]>

Signed-off-by: Masaki Kozuki <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Aidyn-A <[email protected]>
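
The `sync_batch_comm=False` path amounts to skipping the device-wide synchronization after `torch.distributed.batch_isend_irecv`, roughly as in the sketch below (a simplified stand-in for apex's p2p communication helpers, assuming an initialized NCCL process group and a valid `peer` rank):

```python
# Simplified sketch of the optional sync after batched p2p communication.
# Not apex's actual p2p_communication code.
import torch
import torch.distributed as dist

def exchange(send_tensor, recv_tensor, peer, sync_batch_comm=True):
    ops = [
        dist.P2POp(dist.isend, send_tensor, peer),
        dist.P2POp(dist.irecv, recv_tensor, peer),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    if sync_batch_comm:
        # Full device sync; skippable once pytorch/pytorch#82450 is available.
        torch.cuda.synchronize()
```
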
…lelism (NVIDIA#1514)

* Add option to use no_sync context with interleaved pipeline parallelism

* Add unit test for no_sync context with interleaved pipeline parallelism

* Debug no_sync context support in interleaved pipeline parallelism
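
The general shape of this optimization is to suppress gradient synchronization for every micro-batch except the last one; the sketch below uses DDP's `no_sync()` as a stand-in for the custom sync context that the pipeline schedules accept:

```python
# Generic no_sync pattern: all-reduce gradients only on the last micro-batch.
# DDP's no_sync() is a stand-in for apex's custom sync context; with DDP the
# forward pass must also run inside the context for the skip to take effect.
def run_microbatches(ddp_model, microbatches, loss_fn):
    *head, last = microbatches
    with ddp_model.no_sync():                # gradients only accumulate locally
        for batch in head:
            loss_fn(ddp_model(batch)).backward()
    loss_fn(ddp_model(last)).backward()      # gradient all-reduce happens here
```
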
…nstead of torch_ucc (NVIDIA#1495)

* update HAS_TORCH_UCC to TORCH_UCC

* add comments for failing tests

* move HAS_UCC to _ucc_utils.py

* whitespace

* small changes

* newline

* updated list of failing tests

* update failing tests list
Signed-off-by: Masaki Kozuki <[email protected]>

Signed-off-by: Masaki Kozuki <[email protected]>
* Update megatron fused softmax follow megatron-lm

Signed-off-by: Yu Yao <[email protected]>

* Add mask=None support in scaled_masked_softmax

Signed-off-by: Yu Yao <[email protected]>

* Update setup.py for scaled_softmax_cuda

Signed-off-by: Yu Yao <[email protected]>

* Add tests for fused_scale_softmax (mask=None)

Signed-off-by: Yu Yao <[email protected]>

* Assert grad equal in fused softmax test

Signed-off-by: Yu Yao <[email protected]>

* Revert "Assert grad equal in fused softmax test"

Signed-off-by: Yu Yao <[email protected]>

Signed-off-by: Yu Yao <[email protected]>
Co-authored-by: Yu Yao <[email protected]>
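
The mask=None support effectively adds an unmasked kernel next to the masked one and dispatches between them. A minimal sketch of that dispatch is below; the `forward(input, scale)` and `forward(input, mask, scale)` signatures are assumptions based on the commit titles, not verbatim apex code:

```python
# Sketch of the mask=None dispatch between the two fused softmax kernels.
# The extension entry points and their signatures are assumptions here.
import scaled_softmax_cuda
import scaled_masked_softmax_cuda

def fused_scale_softmax(inputs, mask, scale):
    if mask is None:
        return scaled_softmax_cuda.forward(inputs, scale)
    return scaled_masked_softmax_cuda.forward(inputs, mask, scale)
```
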
)

* working test_bert_minimal.py

* remove some debugging statements

* working test_gpt_minimal.py

* test_dynamic_batchsize.py having issues with torch.backends.cudnn.allow_tf32

* working test_dynamic_batchsize.py

* refactor test_bert_minimal.py, need to investigate rng of MANUAL_SEED for nccl only pipeline with virtual_pipeline_model_parallel_size = 2

* add test_bert_minimal_alt.py for visibility

* update test_gpt_minimal.py

* lint

* update loss cutoff for bert test

* split with / without interleaving tests for bert

* use skipTest

* remove ONCE

* add ignore_unknown_args=True

* remove old testing files

* add num_devices logic to override_args
Signed-off-by: Masaki Kozuki <[email protected]>
@hubertlu-tw hubertlu-tw self-assigned this Dec 30, 2022