-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update PyTorch to 2.4.1 #268
Update PyTorch to 2.4.1 #268
Conversation
45ea711
to
9e21726
Compare
a916401
to
f9f0531
Compare
f9f0531
to
00a11aa
Compare
E2E (NVIDIA L40S x4) workflow launched on this PR: View run |
E2E (NVIDIA L40S x4) workflow launched on this PR: View run |
e2e workflow failed on this PR: View run, please investigate. |
This pull request has merge conflicts that must be resolved before it can be |
@@ -4,7 +4,7 @@ pyyaml | |||
py-cpuinfo | |||
# we set this to be above 0a0 so that it doesn't | |||
# replace custom pytorch images with the 2.3.0 | |||
torch>=2.3.0a0 | |||
torch>=2.3.0,<2.5.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this min version here be bumped to 2.4.1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When Gaudi 1.18 is supported, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gaudi software 1.18.0 has 2.4.0a0. The 1.18.x series will not get a newer version.
Why are you changing the version range for Torch? I don't see any changes to Python code or other dependencies. There is no apparent reason why instructlab-training
should no longer work with Torch 2.3.x or 2.5.x.
I would prefer to keep the version ranges for dependencies of instructlab-training as open as possible and only restrict upper versions in instructlab
package.
@fabiendupont can you rebase this pr? |
This change the default PyTorch 2.4.1. Intel Gaudi (HPU) remains on Intel fork of PyTorch 2.3.1. Fixes instructlab/instructlab#2437 Signed-off-by: Fabien Dupont <[email protected]>
00a11aa
to
6e0395d
Compare
Holding for E2E test results |
E2E (NVIDIA L40S x4) workflow launched on this PR: View run |
e2e workflow succeeded on this PR: View run, congrats! |
@@ -0,0 +1,11 @@ | |||
# # Extra dependencies for Intel Gaudi / Habana Labs HPU devices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Christian's PR merged and changed these things as well, I don't want to overwrite whatever he updated. Would it be possible to check which is the correct source of truth?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep requirements/hpu.txt
empty. There is no need to restrict torch
here or to pull any of the habana extensions. instructlab.training
does not import them.
@tiran, commented on a different channel that capping the torch version might not be such a good idea. And the Intel Gaudi requirements don't exist at the moment, so they are not really related to this PR either. If we don't do these 2 things, then this PR is basically just moving requirements file into a @tiran, @nathan-weinberg, @JamesKunstle, do we still want to cap torch version? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a very, VERY strong aversion against constraining upper bound versions. It's a major pain in the back. In most cases, it is causing more trouble than it's worth.
To give you some numbers: the upper bound limit of Torch in instructlab
has delayed Gaudi software 1.18.0 update by almost three weeks. Intel releases 1.18.0 in Oct 10. I had implemented the downstream changes by Oct 11. Last night (Oct 29) we had most instructlab fixes in place and were able to build all packages with Gaudi 1.18.0. #309 has to be solved before we have a full run. The blog post https://iscinumpy.dev/post/bound-version-constraints/ explains in great detail why upper bound versions are bad.
IMHO instructlab-training
should not have any upper or lower version constraint unless they are really necessary. Instead it should rely on the instructlab
core package to restrict versions. Any lower version bound should be as relaxed as possible and documented in the requirements file. Any upper version bound should have an explanation plus a ticket.
For internal testing, you could use a constraint file with known good versions and consume it with pip install -c constraints.txt ...
in tests.
@@ -4,7 +4,7 @@ pyyaml | |||
py-cpuinfo | |||
# we set this to be above 0a0 so that it doesn't | |||
# replace custom pytorch images with the 2.3.0 | |||
torch>=2.3.0a0 | |||
torch>=2.3.0,<2.5.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gaudi software 1.18.0 has 2.4.0a0. The 1.18.x series will not get a newer version.
Why are you changing the version range for Torch? I don't see any changes to Python code or other dependencies. There is no apparent reason why instructlab-training
should no longer work with Torch 2.3.x or 2.5.x.
I would prefer to keep the version ranges for dependencies of instructlab-training as open as possible and only restrict upper versions in instructlab
package.
@@ -0,0 +1,11 @@ | |||
# # Extra dependencies for Intel Gaudi / Habana Labs HPU devices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep requirements/hpu.txt
empty. There is no need to restrict torch
here or to pull any of the habana extensions. instructlab.training
does not import them.
@@ -0,0 +1,2 @@ | |||
# Extra dependencies for AMD ROCm | |||
flash-attn>=2.6.2,<2.7.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason to limit the upper bound of flash-attn
?
This pull request has merge conflicts that must be resolved before it can be |
We've bumped to 2.4.1 and are in the process of bumping to >=2.5.1 |
This change the default PyTorch 2.4.1.
Intel Gaudi (HPU) remains on Intel fork of PyTorch 2.3.1.
Fixes #293