
CI timeout (test-llava-runner-linux) since #7922 #8180

Closed
swolchok opened this issue Feb 4, 2025 · 14 comments · Fixed by #8248 or #8339
Labels: module: ci (Issues related to continuous integration), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


swolchok commented Feb 4, 2025

🐛 Describe the bug

Changing from the nightly wheel to --use-pt-pinned-commit (a from-source build of the pinned PyTorch commit, which matches the nightly) caused CI timeouts for long jobs, apparently with large timestamp "gaps" in the logs.

In the raw logs for the first test-llava-runner-linux timeout on main, there are almost 40 minutes (EDIT: actually 83 minutes; see the 43-minute gap below) of "gaps" in the logs with no timestamps. Specifically:

  • 14-minute "gap" in the logs: a jump from 2025-01-31T23:11:50.6702881Z to 2025-01-31T23:25:21.0383351Z, during export.
  • 25-minute "gap" in the logs from 2025-01-31T23:25:21.2243143Z to 2025-01-31T23:42:40.6914293Z, where the second message is just the job timeout. This also seems to be during export; not sure offhand why we are exporting multiple times, but that's a separate problem regardless.

@metascroy found that increasing the timeout to 180 minutes causes the job in question to succeed after 150 minutes.

I've ruled out safetensors download being the cause; it took about 6.5 minutes in the last good run and about 6 minutes in the first bad run.
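For reference, gaps like the ones above can be located mechanically from the raw Actions log. A minimal sketch (the `job-logs.txt` filename and the 5-minute threshold are placeholders):

```python
import re
from datetime import datetime, timedelta

# GitHub Actions prefixes every raw log line with an ISO-8601 timestamp.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z")

def find_gaps(path, threshold=timedelta(minutes=5)):
    prev = None
    with open(path) as f:
        for line in f:
            m = TS.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
            if prev is not None and ts - prev > threshold:
                print(f"{ts - prev} gap: {prev} -> {ts}")
            prev = ts

find_gaps("job-logs.txt")
```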

Versions

N/A

swolchok self-assigned this Feb 4, 2025
digantdesai added the module: ci (Issues related to continuous integration) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Feb 4, 2025

swolchok commented Feb 4, 2025

The first log gap is also strange -- the "internal" timestamp of the log message after the gap is from before the gap!

2025-01-31T23:11:50.6701775Z [INFO 2025-01-31 23:11:50,659 builder.py:211] Exporting with:
2025-01-31T23:11:50.6702881Z [INFO 2025-01-31 23:11:50,660 builder.py:212] inputs: (tensor([0]), tensor([[[ 0.0045, -0.0038,  0.0017,  ..., -0.0088,  0.0025, -0.0025],
2025-01-31T23:25:21.0383351Z          [-0.0112, -0.0129, -0.0121,  ...,  0.0090,  0.0118, -0.0081],
2025-01-31T23:25:21.0384234Z          [ 0.0195, -0.0058,  0.0061,  ...,  0.0171, -0.0052, -0.0212],
2025-01-31T23:25:21.0384805Z          ...,
2025-01-31T23:25:21.0386617Z          [-0.0187, -0.0017,  0.0177,  ...,  0.0238,  0.0052,  0.0101],
2025-01-31T23:25:21.0387267Z          [ 0.0066, -0.0161,  0.0117,  ..., -0.0103,  0.0148,  0.0073],
2025-01-31T23:25:21.0388147Z          [ 0.0039,  0.0015,  0.0055,  ..., -0.0042,  0.0151,  0.0024]]],
2025-01-31T23:25:21.0388845Z        grad_fn=<CatBackward0>))
2025-01-31T23:25:21.0389468Z [INFO 2025-01-31 23:11:50,660 builder.py:213] kwargs: None
2025-01-31T23:25:21.0390782Z [INFO 2025-01-31 23:11:50,660 builder.py:214] dynamic shapes: ({0: 1}, {1: <class 'executorch.examples.models.llava.model.token_dim'>})
2025-01-31T23:25:21.0392301Z [INFO 2025-01-31 23:12:18,280 builder.py:321] Using pt2e [] to quantizing the model...
2025-01-31T23:25:21.0393555Z [INFO 2025-01-31 23:12:18,280 builder.py:366] No quantizer provided, passing...

definitely wondering if we are swapping.
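One way to confirm or rule that out on a Linux runner (a sketch, not necessarily what was used here) is to poll the cumulative swap counters in /proc/vmstat while the job runs; any increase means pages are actually being swapped:

```python
import time

def swap_counters():
    # pswpin/pswpout are cumulative counts of pages swapped in/out since boot.
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

# Sample every 30 seconds; stop with Ctrl-C.
prev = swap_counters()
while True:
    time.sleep(30)
    cur = swap_counters()
    delta = {k: cur[k] - prev[k] for k in cur}
    if any(delta.values()):
        print("swap activity:", delta)
    prev = cur
```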


swolchok commented Feb 4, 2025

Found another 43-minute gap in the logs before the first one I identified previously:

2025-01-31T22:25:17.0525537Z [WARNING 2025-01-31 22:25:17,042 quantized_kv_cache.py:273] Replacing KVCache with CustomKVCache. This modifies the model in place.
2025-01-31T23:08:22.3657932Z [WARNING 2025-01-31 22:25:17,042 quantized_kv_cache.py:273] Replacing KVCache with CustomKVCache. This modifies the model in place.


swolchok commented Feb 4, 2025

There are also timestamp gaps in the "good" logs!

2025-01-31T22:29:01.0922046Z [INFO 2025-01-31 22:29:00,968 GPTQ.py:1139] linear: text_model.layers.29.feed_forward.w2, in=11008, out=4096
2025-01-31T22:44:55.4846123Z [INFO 2025-01-31 22:29:01,082 GPTQ.py:1139] linear: text_model.layers.29.feed_forward.w3, in=4096, out=11008

swolchok added a commit that referenced this issue Feb 4, 2025
These jobs are timing out. I found big gaps in the timestamps of the logs even before the blame PR, which to me suggests swapping. Let's try a bigger runner and see what happens on this PR.

Fixes #8180 (if it lands)

ghstack-source-id: 1648a66a3ed17701b2049481dffeb342d5c6f6c6
ghstack-comment-id: 2634545665
Pull Request resolved: #8181

swolchok commented Feb 4, 2025

> swapping

c5.24xlarge instances, which these jobs run in, have 192 GB of RAM. If we are swapping, something is wrong.


swolchok commented Feb 4, 2025

After we're done here, we should revert #8173, which raised the timeout to 180 minutes so that the job can at least succeed despite now taking 150 minutes.


swolchok commented Feb 4, 2025

Llava export was taking longer than advertised even before the job started timing out. In the last good log, the time from https://github.com/pytorch/executorch/blame/main/.ci/scripts/test_llava.sh#L105 ("Starting to export Llava. This will take about 6 mins") to the export finishing (prepare_image_tensor) is actually around 15 minutes. The bad log never starts the second export.


swolchok commented Feb 4, 2025

Just watching `free` while running test_llava, it appears to take at least 80 GB of RAM, which seems like a lot (but not enough to exhaust the 192 GB of RAM in a 24xlarge runner...)
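Roughly the same measurement, scripted with psutil instead of eyeballing `free` (a sketch; the launched command is the CI script referenced later in this thread and may need the CI environment set up):

```python
import subprocess
import time

import psutil  # third-party: pip install psutil

# Launch the job under test and record peak system memory use while it runs.
proc = subprocess.Popen(["bash", ".ci/scripts/test_llava.sh"])

peak_used = 0
while proc.poll() is None:
    peak_used = max(peak_used, psutil.virtual_memory().used)
    time.sleep(5)

print(f"peak system RAM in use: {peak_used / 2**30:.1f} GiB")
```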


swolchok commented Feb 4, 2025

My theory is that we're swapping during export both before and after the change to --use-pt-pinned-commit, and that something about the change made our memory usage worse. So I've spent the last hour or so trying out a couple of Python heap-profiling tools (guppy and tracemalloc), but they only show Python-interpreter allocations, which is not helpful for PyTorch, where I would very much also like to know how much memory is allocated in Tensors.

> we're swapping

And, to be clear, if true, that would be a big engineering problem with export/lowering/quantization, since the model's safetensors size is only 15 GB and the CI machine in question has 192 GB of RAM. We have some anecdotal evidence that this phase uses "too much" memory (machines can't quantize models they should be able to run), but I believe swapping would require "way way too much" memory to be in use...
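On the tooling point above, a minimal illustration of why tracemalloc doesn't help here: it only hooks Python's own allocator, so storage allocated by PyTorch's native CPU allocator never shows up in its totals.

```python
import tracemalloc

import torch

tracemalloc.start()

# ~1 GiB of tensor storage, allocated by PyTorch's native CPU allocator.
big = torch.zeros(256, 1024, 1024, dtype=torch.float32)

current, peak = tracemalloc.get_traced_memory()
# Expect the tracemalloc peak to be tiny: the tensor's storage bypasses
# Python's allocator, so tracemalloc never sees it.
print(f"tracemalloc peak: {peak / 2**20:.2f} MiB")
print(f"tensor storage:   {big.element_size() * big.nelement() / 2**30:.2f} GiB")
```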


swolchok commented Feb 4, 2025

@atalman from PyTorch dev infra agreed that the --use-pt-pinned-commit build should not be worse offhand.


swolchok commented Feb 4, 2025

@huydhn , I think you worked on putting --use-pt-pinned-commit in place, any reason you're aware of that that build should perform worse or use more memory than a nightly wheel?

swolchok added a commit that referenced this issue Feb 4, 2025
Attempting to test my theory that the timestamp gaps in #8180 are caused by swapping.

ghstack-source-id: e2aa5b03d999423756923b0619e52b29a0dfbe74
ghstack-comment-id: 2634969596
Pull Request resolved: #8192

swolchok commented Feb 4, 2025

#8193 finished test-llava-runner-linux in 67 minutes, so we can be reasonably confident that --use-pt-pinned-commit is causing the excessive running times

swolchok added a commit that referenced this issue Feb 4, 2025
Attempting to test my theory that the timestamp gaps in #8180 are caused by swapping.

ghstack-source-id: 45e7e64de3b231a87a9a9e97212d0c66456aec31
ghstack-comment-id: 2634969596
Pull Request resolved: #8192

huydhn commented Feb 5, 2025

> @huydhn , I think you worked on putting --use-pt-pinned-commit in place, any reason you're aware of that that build should perform worse or use more memory than a nightly wheel?

Nothing obvious comes to mind. Let me compare the two wheels to see if I can spot anything there.


swolchok commented Feb 5, 2025

#8192 seems to indicate that we're not swapping. I'm hesitant to continue to speculate-and-test about possible sources of the gaps in timestamps, since that could consume a lot of time, but I'm not sure how else to make progress here...

swolchok added a commit that referenced this issue Feb 6, 2025
Attempting to test my theory that the timestamp gaps in #8180 are caused by swapping.

ghstack-source-id: 0bc03bc7bcefdad887e7ebf5e894fad6362a398c
ghstack-comment-id: 2634969596
Pull Request resolved: #8192
swolchok added a commit that referenced this issue Feb 6, 2025
Attempting to test my theory that the timestamp gaps in #8180 are caused by swapping.

ghstack-source-id: 7582788e769145f75800de5eee0d024c0966a98a
ghstack-comment-id: 2634969596
Pull Request resolved: #8192

swolchok commented Feb 7, 2025

Assigning to @huydhn for now because he found the likely source of the regression.

swolchok removed their assignment Feb 8, 2025
swolchok added a commit that referenced this issue Feb 10, 2025
#8173 raised these timeouts. Now that #8248 has landed to fix #8180, we should be able to lower them again. (I'm sending this early so I don't forget; double-check llava-runner running time)

ghstack-source-id: cb4c1691907b8bb46a504a2d8cbc00d12b1ef4a4
ghstack-comment-id: 2648474106
Pull Request resolved: #8339