Shampoo conformer workload hangs #389
Comments
For completeness, the conformer run that did not get stuck printed these warnings:
I strongly believe this is a memory issue. The workload runs fine with a smaller model or a smaller batch size.
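A minimal sketch of how the memory hypothesis could be probed from Python, using the two XLA client allocator variables named in the issue description (XLA_PYTHON_CLIENT_PREALLOCATE and XLA_PYTHON_CLIENT_MEM_FRACTION). The helper name `xla_memory_env` is hypothetical and not part of the repository. The sketch assumes the variables are applied before JAX is first imported in the process.

```python
import os
from typing import Dict, Optional


def xla_memory_env(preallocate: bool,
                   mem_fraction: Optional[float] = None) -> Dict[str, str]:
    """Build the XLA client memory variables as a dict (hypothetical helper)."""
    env = {"XLA_PYTHON_CLIENT_PREALLOCATE": "true" if preallocate else "false"}
    if mem_fraction is not None:
        # The fraction is a share of total GPU memory, so keep it in (0, 1].
        if not 0.0 < mem_fraction <= 1.0:
            raise ValueError("mem_fraction must be in (0, 1]")
        env["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"
    return env


# Assumption: apply these before the first `import jax`, so the backend
# sees them when it initializes.
settings = xla_memory_env(preallocate=False, mem_fraction=0.80)
os.environ.update(settings)
```

Varying these settings alongside the batch size, one change at a time, may help separate an allocator limit from a genuine hang.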
Marking as obsolete.
Reopening to explore feasibility of submission.
Won't fix.
The conformer workload hangs when run with the Shampoo training algorithm.
Description
Traceback
Steps to Reproduce
Pull the docker image:
Run the container and entrypoint script which will launch a submission runner:
To see the output of submission_runner.py, monitor the logs of the container:
$ docker logs -f <container_id printed by previous command>
Source or Possible Fix
I think this may be an XLA memory issue. On a different VM the runs got a little further along and errored out with a seemingly memory-related issue. I restarted all the VMs, and they don't get any further along than the above message. I may have changed some environment flags on the VM that got further along. I tried setting XLA_PYTHON_CLIENT_PREALLOCATE=false, which didn't do anything, and setting XLA_PYTHON_CLIENT_MEM_FRACTION=.80, which made it error out sooner.
For reference, the output of the run that got further:
To debug in container:
Run the container without starting the submission runner (not passing in a value for the -s flag):
Start an interactive bash session in the running container:
Run submission_runner.py in the container:
You can also pull the code to the host VM and mount the local repo so that you can make changes to the code without losing them.
Run the container with the mounted dir: