[NaN check] Add NaN check to support bfloat16. #5879

Open
wants to merge 4 commits into
base: master

Conversation

ys950902
Contributor

@ys950902 ys950902 commented Aug 8, 2024

No description provided.

@tjruwase
Contributor

tjruwase commented Aug 9, 2024

@ys950902, thanks for helping with this. The problem is a bit more involved and there was a previous attempt that was abandoned. Can you please take a look at #5252

Do you think you can incorporate the learnings into your PR?

@ys950902
Contributor Author

> @ys950902, thanks for helping with this. The problem is a bit more involved and there was a previous attempt that was abandoned. Can you please take a look at #5252
>
> Do you think you can incorporate the learnings into your PR?

Sorry for the late response; I have been busy with other work these days. My understanding is that the NaN check applies not only to float16 but also to bfloat16. This PR won't add any extra log info on the DeepSpeed side, so users can only call the was_step_applied() API to check whether the update succeeded.

@QingtaoLi1

@ys950902 This check only lets training continue without an error, but grad_norm and loss stop decreasing. Would it be better to turn this into an error, or to provide a way to resolve the overflow problem, either in code or through a warning?
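One way to picture the suggestion above is a user-side guard that warns on each non-finite gradient norm and raises once too many occur in a row. This is only a sketch under assumptions: `OverflowGuard` and `max_consecutive_skips` are hypothetical names, the gradient norm is assumed to be available as a plain Python float, and nothing here reflects what DeepSpeed or this PR actually implements.

```python
import math
import warnings


class OverflowGuard:
    """Hypothetical helper: warn on a NaN/Inf grad norm, raise if it persists."""

    def __init__(self, max_consecutive_skips=3):
        self.max_consecutive_skips = max_consecutive_skips
        self.consecutive_skips = 0

    def update(self, grad_norm):
        # A finite norm means the step can be applied; reset the streak.
        if math.isfinite(grad_norm):
            self.consecutive_skips = 0
            return True
        self.consecutive_skips += 1
        if self.consecutive_skips >= self.max_consecutive_skips:
            raise RuntimeError(
                f"grad norm was NaN/Inf for {self.consecutive_skips} consecutive steps"
            )
        warnings.warn("non-finite grad norm; this step should be skipped")
        return False


guard = OverflowGuard(max_consecutive_skips=2)
results = []
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    results.append(guard.update(0.5))           # finite: step applied
    results.append(guard.update(float("nan")))  # first bad step: warn only
    try:
        guard.update(float("inf"))              # second bad step in a row: error
    except RuntimeError:
        results.append("raised")

print(results)
```

The threshold trades off tolerance for occasional bad batches against failing fast when training has stalled on persistent overflow.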


4 participants