Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update PyTorch to 2.4.1 #268

Closed

Conversation

fabiendupont
Copy link
Contributor

@fabiendupont fabiendupont commented Oct 14, 2024

This change the default PyTorch 2.4.1.
Intel Gaudi (HPU) remains on Intel fork of PyTorch 2.3.1.

Fixes #293

@mergify mergify bot added dependencies Pull requests that update a dependency file ci-failure labels Oct 14, 2024
@fabiendupont fabiendupont force-pushed the 2437-update-torch-2.4.1 branch 2 times, most recently from 45ea711 to 9e21726 Compare October 14, 2024 12:48
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 14, 2024
@fabiendupont fabiendupont force-pushed the 2437-update-torch-2.4.1 branch 2 times, most recently from a916401 to f9f0531 Compare October 22, 2024 08:16
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 22, 2024
@fabiendupont fabiendupont force-pushed the 2437-update-torch-2.4.1 branch from f9f0531 to 00a11aa Compare October 22, 2024 08:43
@mergify mergify bot removed the ci-failure label Oct 22, 2024
@ktam3 ktam3 linked an issue Oct 22, 2024 that may be closed by this pull request
Copy link

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

Copy link

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

Copy link

e2e workflow failed on this PR: View run, please investigate.

Copy link
Contributor

mergify bot commented Oct 25, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @fabiendupont please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 25, 2024
@@ -4,7 +4,7 @@ pyyaml
py-cpuinfo
# we set this to be above 0a0 so that it doesn't
# replace custom pytorch images with the 2.3.0
torch>=2.3.0a0
torch>=2.3.0,<2.5.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this min version here be bumped to 2.4.1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When Gaudi 1.18 is supported, yes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gaudi software 1.18.0 has 2.4.0a0. The 1.18.x series will not get a newer version.

Why are you changing the version range for Torch? I don't see any changes to Python code or other dependencies. There is no apparent reason why instructlab-training should no longer work with Torch 2.3.x or 2.5.x.

I would prefer to keep the version ranges for dependencies of instructlab-training as open as possible and only restrict upper versions in instructlab package.

@Maxusmusti
Copy link
Contributor

@fabiendupont can you rebase this pr?

This change the default PyTorch 2.4.1.
Intel Gaudi (HPU) remains on Intel fork of PyTorch 2.3.1.

Fixes instructlab/instructlab#2437

Signed-off-by: Fabien Dupont <[email protected]>
@fabiendupont fabiendupont force-pushed the 2437-update-torch-2.4.1 branch from 00a11aa to 6e0395d Compare October 29, 2024 15:13
@mergify mergify bot removed the needs-rebase label Oct 29, 2024
@nathan-weinberg
Copy link
Member

Holding for E2E test results

Copy link

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

Copy link

e2e workflow succeeded on this PR: View run, congrats!

@@ -0,0 +1,11 @@
# # Extra dependencies for Intel Gaudi / Habana Labs HPU devices
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Christian's PR merged and changed these things as well, I don't want to overwrite whatever he updated. Would it be possible to check which is the correct source of truth?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep requirements/hpu.txt empty. There is no need to restrict torch here or to pull any of the habana extensions. instructlab.training does not import them.

@fabiendupont
Copy link
Contributor Author

@tiran, commented on a different channel that capping the torch version might not be such a good idea. And the Intel Gaudi requirements don't exist at the moment, so they are not really related to this PR either. If we don't do these 2 things, then this PR is basically just moving requirements file into a requirements directory, which is fine but also not related to PyTorch 2.4.

@tiran, @nathan-weinberg, @JamesKunstle, do we still want to cap torch version?

Copy link
Contributor

@tiran tiran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a very, VERY strong aversion against constraining upper bound versions. It's a major pain in the back. In most cases, it is causing more trouble than it's worth.

To give you some numbers: the upper bound limit of Torch in instructlab has delayed Gaudi software 1.18.0 update by almost three weeks. Intel releases 1.18.0 in Oct 10. I had implemented the downstream changes by Oct 11. Last night (Oct 29) we had most instructlab fixes in place and were able to build all packages with Gaudi 1.18.0. #309 has to be solved before we have a full run. The blog post https://iscinumpy.dev/post/bound-version-constraints/ explains in great detail why upper bound versions are bad.

IMHO instructlab-training should not have any upper or lower version constraint unless they are really necessary. Instead it should rely on the instructlab core package to restrict versions. Any lower version bound should be as relaxed as possible and documented in the requirements file. Any upper version bound should have an explanation plus a ticket.

For internal testing, you could use a constraint file with known good versions and consume it with pip install -c constraints.txt ... in tests.

@@ -4,7 +4,7 @@ pyyaml
py-cpuinfo
# we set this to be above 0a0 so that it doesn't
# replace custom pytorch images with the 2.3.0
torch>=2.3.0a0
torch>=2.3.0,<2.5.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gaudi software 1.18.0 has 2.4.0a0. The 1.18.x series will not get a newer version.

Why are you changing the version range for Torch? I don't see any changes to Python code or other dependencies. There is no apparent reason why instructlab-training should no longer work with Torch 2.3.x or 2.5.x.

I would prefer to keep the version ranges for dependencies of instructlab-training as open as possible and only restrict upper versions in instructlab package.

@@ -0,0 +1,11 @@
# # Extra dependencies for Intel Gaudi / Habana Labs HPU devices
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep requirements/hpu.txt empty. There is no need to restrict torch here or to pull any of the habana extensions. instructlab.training does not import them.

@@ -0,0 +1,2 @@
# Extra dependencies for AMD ROCm
flash-attn>=2.6.2,<2.7.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason to limit the upper bound of flash-attn?

Copy link
Contributor

mergify bot commented Nov 13, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @fabiendupont please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 13, 2024
@JamesKunstle
Copy link
Contributor

We've bumped to 2.4.1 and are in the process of bumping to >=2.5.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file needs-rebase one-approval
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Torch 2.4 (required for Intel Gaudi 1.18.0)
5 participants