forked from NVIDIA/apex
Enable --fast_layer_norm for ROCm #94
Open
hubertlu-tw wants to merge 23 commits into master from fastlayernorm
Conversation
Reopening
…acy tests to fail in the fwd and bwd pass. Making CTAS_PER_ROW=1 for all cases doesn't seem to affect performance and allows all fwd pass tests to pass. In addition, bwd pass tests fail for hidden_size >= 8192; with CTAS_PER_ROW=1 for all cases in bwd, tests fail when hidden_size >= 16K. This commit makes fast_layer_norm capable of being used for inference cases. Additional debugging is required for fast_layer_norm bwd.
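The CTAS_PER_ROW tuning described above can be sketched as a launch-parameter heuristic: how many thread blocks (CTAs) cooperate on a single row, with a switch that forces one CTA per row as the workaround does. The function name and the multi-CTA thresholds below are illustrative assumptions, not the actual kernel configuration:

```python
def ctas_per_row(hidden_size, force_single_cta=False):
    """Pick how many thread blocks (CTAs) cooperate on one row.

    force_single_cta=True mirrors the workaround described above:
    always use one CTA per row, trading potential parallelism on
    very wide rows for correct results on ROCm.
    (Hypothetical heuristic for illustration only.)
    """
    if force_single_cta:
        return 1
    # Illustrative multi-CTA schedule for wide rows.
    if hidden_size >= 16384:
        return 4
    if hidden_size >= 8192:
        return 2
    return 1

# With the workaround, every hidden size maps to a single CTA per row.
assert all(ctas_per_row(h, force_single_cta=True) == 1
           for h in (1024, 8192, 16384, 65536))
```

The point of the sketch is only the shape of the trade-off: flattening the schedule to one CTA per row removes the cross-CTA reduction path that appears to misbehave on ROCm for wide rows.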
…rm/apex into fastlayernorm
What is this blocked by?
…eed to revert this change later if needed.
… helps with decoding impact of kernel parameters.
… helps with decoding impact of kernel parameters.
… helps with decoding impact of kernel parameters.
… helps with decoding impact of kernel parameters.
…eported performances.
…ed_layer_norm. Updated the framework for fast_layer_norm to be the same as fused_layer_norm. Renamed the variables gamma and beta to the unique names gamma_ and beta_ to address some runtime errors. Added a patch to call fused_layer_norm bwd when hidden_size > 12K, since the tests fail for these hidden sizes. This is currently only a patch and needs to be debugged; there is potential for even better fast_layer_norm performance. NOTE that fast_layer_norm fwd is still called for hidden_sizes > 12K.
…n_sizes > 12K, since the fast_layer_norm patch calls fused_layer_norm for bwd when hidden_size > 12K
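The patch described in the commits above amounts to a dispatch rule: use the fast backward kernel up to a hidden-size threshold and fall back to the fused implementation beyond it. A minimal sketch of that rule, with placeholder names rather than the actual apex API:

```python
# 12K threshold taken from the commit message above; the function and
# implementation names are hypothetical placeholders for illustration.
HIDDEN_SIZE_BWD_THRESHOLD = 12 * 1024

def select_backward_impl(hidden_size):
    """Return which backward kernel the patched dispatcher would call."""
    if hidden_size > HIDDEN_SIZE_BWD_THRESHOLD:
        # Known-good fallback while the fast bwd kernel is debugged
        # for very wide rows.
        return "fused_layer_norm_bwd"
    return "fast_layer_norm_bwd"

assert select_backward_impl(8192) == "fast_layer_norm_bwd"
assert select_backward_impl(16384) == "fused_layer_norm_bwd"
```

Note that, as the commit message says, the forward pass is not gated this way: fast_layer_norm fwd runs at every hidden size.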
Hi all, I've updated all files with the appropriate changes, and ALL fast_layer_norm tests now pass for fwd and bwd. Please let me know if there are any concerns. Thanks! The performance uplift can be found on the "fwd + bwd all" sheet [here]. Let me know if you need access to the above link.
To run the --fast_layer_norm unit tests,
As a reference, the following results were obtained on CUDA systems (with nvcr.io/nvidia/pytorch:22.08-py3 on an A100 node):
On ROCm, the run failed two checks in test_fast_layer_norm.py with NaN outputs when wtype or ctype is bf16. However, even when we skip the tests with a bf16 wtype or ctype, it still fails starting from hidden_size=8192 due to relerr > tol. Please find the results of the unit tests here: apex_fastlayernorm_unittest.txt
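The relerr > tol failures mentioned above come from comparing the kernel output against a reference layer norm under a relative-error tolerance. A minimal, self-contained sketch of that style of check; the error formula and tolerance here are assumptions for illustration, not the test suite's exact values:

```python
def layer_norm_ref(x, gamma, beta, eps=1e-5):
    """Plain-Python reference layer norm over one row."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    inv = (var + eps) ** -0.5
    return [(v - mu) * inv * g + b for v, g, b in zip(x, gamma, beta)]

def relerr(out, ref):
    """Relative L2 error between a kernel output and the reference."""
    num = sum((a - b) ** 2 for a, b in zip(out, ref)) ** 0.5
    den = sum(b * b for b in ref) ** 0.5
    return num / max(den, 1e-30)

x = [0.5, -1.0, 2.0, 0.0]
ref = layer_norm_ref(x, [1.0] * 4, [0.0] * 4)
assert relerr(ref, ref) == 0.0          # identical output passes trivially
assert relerr([v * 1.001 for v in ref], ref) < 0.01  # small drift passes
```

A NaN anywhere in the kernel output makes relerr NaN, so both failure modes reported above (NaN outputs on bf16, and growing error at hidden_size >= 8192) surface through this one comparison.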