CI timeout (test-llava-runner-linux) since #7922 #8180
Comments
The first log gap is also strange -- the "internal" timestamp of the log message after the gap is from before the gap!
definitely wondering if we are swapping.
Found another 43-minute gap in the logs before the first one I identified previously.
There are also timestamp gaps in the "good" logs!
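For anyone who wants to reproduce the gap measurements, something like the following sketch works against the raw logs. This is an illustration, not the script actually used; it assumes GitHub Actions raw logs where each line starts with an ISO-8601 timestamp, and the 5-minute threshold and file name are placeholders.

```python
# Sketch: report large gaps between consecutive timestamps in a GitHub Actions
# raw log. Assumes each line starts with an ISO-8601 timestamp such as
# "2025-02-05T12:34:56.7890123Z"; the 5-minute threshold is arbitrary.
import re
import sys
from datetime import datetime, timedelta

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")
THRESHOLD = timedelta(minutes=5)

def find_gaps(path):
    prev_time, prev_line = None, None
    with open(path) as f:
        for line in f:
            match = TIMESTAMP.match(line)
            if not match:
                continue
            current = datetime.fromisoformat(match.group(1))
            if prev_time is not None and current - prev_time > THRESHOLD:
                print(f"{current - prev_time} gap between:")
                print(f"  {prev_line.rstrip()}")
                print(f"  {line.rstrip()}")
            prev_time, prev_line = current, line

if __name__ == "__main__":
    find_gaps(sys.argv[1])  # e.g. python find_log_gaps.py raw_logs.txt
```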
These jobs are timing out. I found big gaps in the timestamps of the logs even before the blame PR, which to me suggests swapping. Let's try a bigger runner and see what happens on this PR. Fixes #8180 (if it lands)

ghstack-source-id: 1648a66a3ed17701b2049481dffeb342d5c6f6c6
ghstack-comment-id: 2634545665
Pull Request resolved: #8181
c5.24xlarge instances, which these jobs run on, have 192 GB of RAM. If we are swapping, something is wrong.
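If we want to confirm or rule out swapping directly instead of inferring it from log gaps, a background monitor along these lines could be run on the runner. This is a rough sketch only, assuming psutil is available; it is not part of the current CI scripts.

```python
# Sketch: periodically log RAM and swap usage on the runner while a job runs.
# Assumes psutil is installed (pip install psutil); the interval is arbitrary.
import time

import psutil

def log_memory(interval_sec=30):
    while True:
        vm = psutil.virtual_memory()
        sw = psutil.swap_memory()
        print(
            f"RAM used: {vm.used / 2**30:.1f} GiB ({vm.percent}%), "
            f"swap used: {sw.used / 2**30:.1f} GiB, "
            f"cumulative swap in/out: {sw.sin / 2**30:.1f}/{sw.sout / 2**30:.1f} GiB",
            flush=True,
        )
        time.sleep(interval_sec)

if __name__ == "__main__":
    log_memory()
```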
After we're done here, we should revert #8173, which raised the timeout to 180 minutes so that the job can at least succeed despite now taking 150 minutes.
Llava export was taking longer than advertised even before the job started timing out. In the last good log, the time from https://github.com/pytorch/executorch/blame/main/.ci/scripts/test_llava.sh#L105 ("Starting to export Llava. This will take about 6 mins") to the export finishing (prepare_image_tensor) is actually around 15 minutes. The bad log never starts the second export.
Just watching.
My theory is that we're swapping during export both before and after the change to --use-pt-pinned-commit, and something about the change made our memory usage worse. So I've spent the last hour or so trying out a couple of Python heap profiling tools (guppy and tracemalloc), but they only show Python-interpreter allocations, which is not helpful for PyTorch, where I would very much also like to know how much memory is allocated in Tensors.
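For what it's worth, one thing that does capture native allocations (including Tensor storage from PyTorch's allocator) is the process's resident set size, e.g. via the standard-library resource module. A minimal sketch of that idea follows, with a placeholder for the actual export step and assuming Linux's kilobyte units for ru_maxrss.

```python
# Sketch: measure process-level peak memory, which (unlike tracemalloc/guppy)
# includes Tensor storage allocated by PyTorch's native allocator.
# ru_maxrss is reported in kilobytes on Linux; export_llava() is a placeholder.
import resource

def peak_rss_gib():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 2**20

print(f"peak RSS before export: {peak_rss_gib():.1f} GiB")
# export_llava()  # placeholder for the actual export/quantization step
print(f"peak RSS after export: {peak_rss_gib():.1f} GiB")
```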
And, to be clear, if true that would be a big engineering problem with export/lowering/quantization, since the model safetensors size is only 15 GB and the CI machine in question has 192 GB of RAM. We have some anecdotal evidence that this phase uses "too much" memory (machines can't quantize models they should be able to run), but I believe swapping would require "way way too much" memory to be in use...
@atalman from PyTorch dev infra agreed offhand that the --use-pt-pinned-commit build should not be worse.
@huydhn, I think you worked on putting --use-pt-pinned-commit in place; is there any reason you're aware of why that build should perform worse or use more memory than a nightly wheel?
#8193 finished test-llava-runner-linux in 67 minutes, so we can be reasonably confident that --use-pt-pinned-commit is causing the excessive running times.
Nothing obvious comes to mind. Let me compare the two wheels to see if I can spot anything there.
#8192 seems to indicate that we're not swapping. I'm hesitant to continue to speculate-and-test about possible sources of the gaps in timestamps, since that could consume a lot of time, but I'm not sure how else to make progress here...
Assigning to @huydhn for now because he found the likely source of the regression.
#8173 raised these timeouts. Now that #8248 has landed to fix #8180, we should be able to lower them again. (I'm sending this early so I don't forget; double-check llava-runner running time.)

ghstack-source-id: cb4c1691907b8bb46a504a2d8cbc00d12b1ef4a4
ghstack-comment-id: 2648474106
Pull Request resolved: #8339
🐛 Describe the bug
Changing from the nightly wheel to --use-pt-pinned-commit (a from-source build of the PyTorch pinned commit, which matches the nightly) caused CI timeouts for long jobs, apparently with large timestamp "gaps" in the logs.
In the raw logs for the first test-llava-runner-linux timeout on main, there are almost 40 minutes (EDIT: actually 83 minutes; see the 43-minute gap below) of "gaps" in the logs with no timestamps. Specifically:
@metascroy found that increasing the timeout to 180 minutes causes the job in question to succeed after 150 minutes.
I've ruled out the safetensors download as the cause; it took about 6.5 minutes in the last good run and about 6 minutes in the first bad run.
Versions
N/A