
Special remat for Neuron #898

Merged: 1 commit merged into apple:main from mainline-upstream-remat, Jan 14, 2025

Conversation

apoorvtintin (Contributor)

@apoorvtintin apoorvtintin commented Dec 17, 2024

This PR adds a special remat configuration for TRN2 and Fuji-70B. It introduces a new remat policy that matches remat names against multiple regex patterns, which allows more flexibility in remat configurations across backends and device types.

Misc: Enable remat for StackedTransformer
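The regex-matching policy the description refers to can be sketched as follows. This is a minimal, hypothetical illustration of the idea rather than the PR's actual API: the helper name `save_names_matching` and the example remat names are assumptions. A value tagged with a remat name is kept (saved) if its name matches any configured pattern, and rematerialized otherwise.

```python
import re
from typing import Callable

def save_names_matching(*patterns: str) -> Callable[[str], bool]:
    """Returns a predicate deciding whether a named activation is saved.

    An activation whose remat name fully matches any of the given regex
    patterns is saved (i.e. not rematerialized in the backward pass);
    all other activations are recomputed.
    """
    compiled = [re.compile(p) for p in patterns]

    def policy(name: str) -> bool:
        return any(p.fullmatch(name) for p in compiled)

    return policy

# Hypothetical usage: save attention outputs and the first FFN linear
# across all layers, regardless of how the layer prefix is spelled.
policy = save_names_matching(r".*attention\.o_proj", r".*linear1_0")
print(policy("decoder.layer0.attention.o_proj"))  # matches first pattern
print(policy("decoder.layer0.input_norm"))        # matches neither pattern
```

Matching against patterns (rather than a fixed list of exact names) is what lets one config express per-backend choices: a TRN2 config can save a different subset of activations than a TPU config by swapping the pattern list instead of re-tagging the model.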

@apoorvtintin apoorvtintin requested review from ruomingp, markblee and a team as code owners December 17, 2024 21:08
@apoorvtintin apoorvtintin changed the title Special Remat for Neuron Special remat for Neuron Dec 17, 2024
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-remat branch from a08fb3c to 9715aef Compare December 17, 2024 21:52
Review threads: axlearn/common/utils.py (outdated), axlearn/common/attention.py (outdated, two threads), axlearn/common/utils.py (outdated, two threads), axlearn/experiments/text/gpt/common.py (outdated)
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-remat branch from 9715aef to b8c8fa5 Compare December 18, 2024 02:26
@apoorvtintin (Contributor, PR author)

Thanks for the reviews. Most comments are resolved and the PR looks clean; let me know if more changes are needed.

Review threads: axlearn/experiments/text/gpt/fuji.py (outdated, three threads; one thread current)
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-remat branch from b8c8fa5 to 9b9bba5 Compare December 18, 2024 08:13
@ruomingp (Contributor) left a comment


A suggestion to address the concern from @apghml. WDYT?

Review thread: axlearn/experiments/text/gpt/common.py (outdated)
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-remat branch from 9b9bba5 to bf4a016 Compare December 18, 2024 19:50
@apoorvtintin (Contributor, PR author)

Thanks for the guidance; I addressed all comments. Let me know if more changes are needed.

Review threads: axlearn/audio/evaler_asr_test.py (outdated), axlearn/experiments/text/gpt/fuji.py (outdated)
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-remat branch 2 times, most recently from 45342e7 to 4d4e32e Compare December 19, 2024 01:50
@kelvin-zou (Contributor) left a comment


High-level comment: let's probably not add more activation checkpoints in the attention transformer layer class unless we must. Otherwise the change looks good to me; I will approve once that part is reverted.

Review threads: axlearn/common/attention.py (outdated, two threads)
@ruomingp (Contributor) left a comment


Please address @kelvin-zou 's comments. Otherwise LGTM.

@apoorvtintin apoorvtintin force-pushed the mainline-upstream-remat branch from 4d4e32e to c720878 Compare January 6, 2025 22:07
@apoorvtintin (Contributor, PR author)

I removed the extra remat save points as discussed. Please let me know if more changes are needed. Thank you!

@apoorvtintin apoorvtintin requested a review from ruomingp January 6, 2025 22:09
Review threads: axlearn/common/attention_test.py (outdated), axlearn/experiments/text/gpt/fuji.py (outdated)
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-remat branch from c720878 to cad26b0 Compare January 6, 2025 22:48
@apoorvtintin (Contributor, PR author)

Hello @ruomingp, it looks like the merge is blocked on your requested changes. Could you please mark them as resolved so this PR can be merged? Thank you!

@apoorvtintin (Contributor, PR author)

Updated the PR to avoid merge conflicts. Let's merge this soon; thanks everyone!

Review threads: axlearn/common/attention_test.py (outdated), axlearn/experiments/text/gpt/fuji.py (outdated)
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-remat branch from ed58dd2 to 345e4a9 Compare January 13, 2025 20:02
@apoorvtintin (Contributor, PR author)

@kelvin-zou thanks for the review; I resolved your comments.

Review threads: axlearn/common/attention_test.py (outdated, two threads)
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-remat branch 3 times, most recently from cfc5b88 to 4ba8a00 Compare January 14, 2025 00:56
@apoorvtintin (Contributor, PR author)

apoorvtintin commented Jan 14, 2025

Addressed all comments. Thanks for the review @kelvin-zou @hanzhi713. Let's try to merge this soon.

@hanzhi713 (Member)

Thanks. In the meantime, could you address comments in #884?

@hanzhi713 (Member)

@apoorvtintin Can you fix the pre-commit errors?

@apoorvtintin apoorvtintin force-pushed the mainline-upstream-remat branch from 4ba8a00 to 21f362a Compare January 14, 2025 15:26
@apoorvtintin (Contributor, PR author)

apoorvtintin commented Jan 14, 2025

@hanzhi713 Fixed the pre-commit errors, thank you!

@hanzhi713 hanzhi713 enabled auto-merge January 14, 2025 17:43
@apoorvtintin (Contributor, PR author)

I see the test TfIteratorTest.test_no_save_input_iterator failed. It passes in my development environment; any idea why it fails in the workflow?

@hanzhi713 (Member)

hanzhi713 commented Jan 14, 2025

The test itself is poorly written, @markblee: it didn't wait for the async future to finish. I will send a PR to fix it.

@hanzhi713 (Member)

hanzhi713 commented Jan 14, 2025

See #924. I enabled retry; let's see if that works. If not, you can rebase onto #924 once it's merged.

@hanzhi713 hanzhi713 added this pull request to the merge queue Jan 14, 2025
Merged via the queue into apple:main with commit f93be34 Jan 14, 2025
6 checks passed
6 participants