
CI timeout (test-llava-runner-linux) since #7922 #8180

Closed
swolchok opened this issue Feb 4, 2025 · 14 comments · Fixed by #8248 or #8339
Labels: module: ci (Issues related to continuous integration), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


swolchok commented Feb 4, 2025

🐛 Describe the bug

Changing from the nightly wheel to --use-pt-pinned-commit (a from-source build of the pinned PyTorch commit, which matches the nightly) caused CI timeouts for long jobs, apparently with large timestamp "gaps" in the logs.

In the raw logs for the first test-llava-runner-linux timeout on main, there are almost 40 minutes (EDIT: actually 83 minutes; see the 43-minute gap below) of "gaps" in the logs with no timestamps. Specifically:

  • 14-minute "gap" in the logs: a jump from 2025-01-31T23:11:50.6702881Z to 2025-01-31T23:25:21.0383351Z, during export.
  • 25-minute "gap" in the logs from 2025-01-31T23:25:21.2243143Z to 2025-01-31T23:42:40.6914293Z, where the second message is just the job timeout. This also seems to be during export; not sure offhand why we are exporting multiple times, but that's a separate problem regardless.

@metascroy found that increasing the timeout to 180 minutes causes the job in question to succeed after 150 minutes.

I've ruled out safetensors download being the cause; it took about 6.5 minutes in the last good run and about 6 minutes in the first bad run.
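For reference, gaps like the ones above can be located mechanically from the raw Actions log. A minimal sketch (the `job-logs.txt` filename and the 5-minute threshold are placeholders):

```python
import re
from datetime import datetime, timedelta

# GitHub Actions prefixes every raw log line with an ISO-8601 timestamp.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z")

def find_gaps(path, threshold=timedelta(minutes=5)):
    prev = None
    with open(path) as f:
        for line in f:
            m = TS.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
            if prev is not None and ts - prev > threshold:
                print(f"{ts - prev} gap: {prev} -> {ts}")
            prev = ts

find_gaps("job-logs.txt")
```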

Versions

N/A

swolchok self-assigned this Feb 4, 2025
digantdesai added the module: ci (Issues related to continuous integration) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Feb 4, 2025

swolchok commented Feb 4, 2025

The first log gap is also strange -- the "internal" timestamp of the log message after the gap is from before the gap!

2025-01-31T23:11:50.6701775Z [INFO 2025-01-31 23:11:50,659 builder.py:211] Exporting with:
2025-01-31T23:11:50.6702881Z [INFO 2025-01-31 23:11:50,660 builder.py:212] inputs: (tensor([0]), tensor([[[ 0.0045, -0.0038,  0.0017,  ..., -0.0088,  0.0025, -0.0025],
2025-01-31T23:25:21.0383351Z          [-0.0112, -0.0129, -0.0121,  ...,  0.0090,  0.0118, -0.0081],
2025-01-31T23:25:21.0384234Z          [ 0.0195, -0.0058,  0.0061,  ...,  0.0171, -0.0052, -0.0212],
2025-01-31T23:25:21.0384805Z          ...,
2025-01-31T23:25:21.0386617Z          [-0.0187, -0.0017,  0.0177,  ...,  0.0238,  0.0052,  0.0101],
2025-01-31T23:25:21.0387267Z          [ 0.0066, -0.0161,  0.0117,  ..., -0.0103,  0.0148,  0.0073],
2025-01-31T23:25:21.0388147Z          [ 0.0039,  0.0015,  0.0055,  ..., -0.0042,  0.0151,  0.0024]]],
2025-01-31T23:25:21.0388845Z        grad_fn=<CatBackward0>))
2025-01-31T23:25:21.0389468Z [INFO 2025-01-31 23:11:50,660 builder.py:213] kwargs: None
2025-01-31T23:25:21.0390782Z [INFO 2025-01-31 23:11:50,660 builder.py:214] dynamic shapes: ({0: 1}, {1: <class 'executorch.examples.models.llava.model.token_dim'>})
2025-01-31T23:25:21.0392301Z [INFO 2025-01-31 23:12:18,280 builder.py:321] Using pt2e [] to quantizing the model...
2025-01-31T23:25:21.0393555Z [INFO 2025-01-31 23:12:18,280 builder.py:366] No quantizer provided, passing...

definitely wondering if we are swapping.
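One way to confirm or rule that out on a Linux runner (a sketch, not necessarily what was used here) is to poll the cumulative swap counters in /proc/vmstat while the job runs; any increase means pages are actually being swapped:

```python
import time

def swap_counters():
    # pswpin/pswpout are cumulative counts of pages swapped in/out since boot.
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

# Sample every 30 seconds; stop with Ctrl-C.
prev = swap_counters()
while True:
    time.sleep(30)
    cur = swap_counters()
    delta = {k: cur[k] - prev[k] for k in cur}
    if any(delta.values()):
        print("swap activity:", delta)
    prev = cur
```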


swolchok commented Feb 4, 2025

Found another 43-minute gap in the logs before the first one I identified previously:

2025-01-31T22:25:17.0525537Z [WARNING 2025-01-31 22:25:17,042 quantized_kv_cache.py:273] Replacing KVCache with CustomKVCache. This modifies the model in place.
2025-01-31T23:08:22.3657932Z [WARNING 2025-01-31 22:25:17,042 quantized_kv_cache.py:273] Replacing KVCache with CustomKVCache. This modifies the model in place.


swolchok commented Feb 4, 2025

There are also timestamp gaps in the "good" logs!

2025-01-31T22:29:01.0922046Z [INFO 2025-01-31 22:29:00,968 GPTQ.py:1139] linear: text_model.layers.29.feed_forward.w2, in=11008, out=4096
2025-01-31T22:44:55.4846123Z [INFO 2025-01-31 22:29:01,082 GPTQ.py:1139] linear: text_model.layers.29.feed_forward.w3, in=4096, out=11008

swolchok added a commit that referenced this issue Feb 4, 2025
These jobs are timing out. I found big gaps in the timestamps of the logs even before the blame PR, which to me suggests swapping. Let's try a bigger runner and see what happens on this PR.

Fixes #8180 (if it lands)

ghstack-source-id: 1648a66a3ed17701b2049481dffeb342d5c6f6c6
ghstack-comment-id: 2634545665
Pull Request resolved: #8181

swolchok commented Feb 4, 2025

> swapping

c5.24xlarge instances, which these jobs run in, have 192 GB of RAM. If we are swapping, something is wrong.


swolchok commented Feb 4, 2025

After we're done here, we should revert #8173, which raised the timeout to 180 minutes so that the job can at least succeed despite now taking 150 minutes.


swolchok commented Feb 4, 2025

Llava export was taking longer than advertised even before the job started timing out. In the last good log, the time from https://github.com/pytorch/executorch/blame/main/.ci/scripts/test_llava.sh#L105 ("Starting to export Llava. This will take about 6 mins") to the export finishing (prepare_image_tensor) is actually around 15 minutes. The bad log never starts the second export.


swolchok commented Feb 4, 2025

Just watching `free` while running test_llava, it appears to take at least 80 GB of RAM, which seems like a lot (but not enough to exhaust the 192 GB of RAM in a 24xlarge runner...)
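Roughly the same measurement, scripted with psutil instead of eyeballing `free` (a sketch; the launched command is the CI script referenced later in this thread and may need the CI environment set up):

```python
import subprocess
import time

import psutil  # third-party: pip install psutil

# Launch the job under test and record peak system memory use while it runs.
proc = subprocess.Popen(["bash", ".ci/scripts/test_llava.sh"])

peak_used = 0
while proc.poll() is None:
    peak_used = max(peak_used, psutil.virtual_memory().used)
    time.sleep(5)

print(f"peak system RAM in use: {peak_used / 2**30:.1f} GiB")
```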


swolchok commented Feb 4, 2025

My theory is that we're swapping during export both before and after the change to --use-pt-pinned-commit, and that something about the change made our memory usage worse. So I've spent the last hour or so trying out a couple of Python heap-profiling tools (guppy and tracemalloc), but they only show Python-interpreter allocations, which is not helpful for PyTorch, where I would very much also like to know how much memory is allocated in Tensors.

> we're swapping

And, to be clear, if true, that would be a big engineering problem with export/lowering/quantization, since the model's safetensors size is only 15 GB and the CI machine in question has 192 GB of RAM. We have some anecdotal evidence that this phase uses "too much" memory (machines can't quantize models they should be able to run), but I believe swapping would require "way way too much" memory to be in use...
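On the tooling point above, a minimal illustration of why tracemalloc doesn't help here: it only hooks Python's own allocator, so storage allocated by PyTorch's native CPU allocator never shows up in its totals.

```python
import tracemalloc

import torch

tracemalloc.start()

# ~1 GiB of tensor storage, allocated by PyTorch's native CPU allocator.
big = torch.zeros(256, 1024, 1024, dtype=torch.float32)

current, peak = tracemalloc.get_traced_memory()
# Expect the tracemalloc peak to be tiny: the tensor's storage bypasses
# Python's allocator, so tracemalloc never sees it.
print(f"tracemalloc peak: {peak / 2**20:.2f} MiB")
print(f"tensor storage:   {big.element_size() * big.nelement() / 2**30:.2f} GiB")
```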


swolchok commented Feb 4, 2025

@atalman from PyTorch dev infra agreed that the --use-pt-pinned-commit build should not be worse offhand.


swolchok commented Feb 4, 2025

@huydhn , I think you worked on putting --use-pt-pinned-commit in place, any reason you're aware of that that build should perform worse or use more memory than a nightly wheel?

swolchok added a commit that referenced this issue Feb 4, 2025
Attempting to test my theory that the timestamp gaps in #8180 are caused by swapping.

ghstack-source-id: e2aa5b03d999423756923b0619e52b29a0dfbe74
ghstack-comment-id: 2634969596
Pull Request resolved: #8192

swolchok commented Feb 4, 2025

#8193 finished test-llava-runner-linux in 67 minutes, so we can be reasonably confident that --use-pt-pinned-commit is causing the excessive running times

swolchok added a commit that referenced this issue Feb 4, 2025
Attempting to test my theory that the timestamp gaps in #8180 are caused by swapping.

ghstack-source-id: 45e7e64de3b231a87a9a9e97212d0c66456aec31
ghstack-comment-id: 2634969596
Pull Request resolved: #8192

huydhn commented Feb 5, 2025

> @huydhn , I think you worked on putting --use-pt-pinned-commit in place, any reason you're aware of that that build should perform worse or use more memory than a nightly wheel?

Nothing obvious comes to mind. Let me compare the two wheels to see if I can spot anything there.


swolchok commented Feb 5, 2025

#8192 seems to indicate that we're not swapping. I'm hesitant to continue to speculate-and-test about possible sources of the gaps in timestamps, since that could consume a lot of time, but I'm not sure how else to make progress here...

swolchok added a commit that referenced this issue Feb 6, 2025
Attempting to test my theory that the timestamp gaps in #8180 are caused by swapping.

ghstack-source-id: 0bc03bc7bcefdad887e7ebf5e894fad6362a398c
ghstack-comment-id: 2634969596
Pull Request resolved: #8192
swolchok added a commit that referenced this issue Feb 6, 2025
Attempting to test my theory that the timestamp gaps in #8180 are caused by swapping.

ghstack-source-id: 7582788e769145f75800de5eee0d024c0966a98a
ghstack-comment-id: 2634969596
Pull Request resolved: #8192

swolchok commented Feb 7, 2025

Assigning to @huydhn for now because he found the likely source of the regression.

swolchok removed their assignment Feb 8, 2025
swolchok added a commit that referenced this issue Feb 10, 2025
#8173 raised these timeouts. Now that #8248 has landed to fix #8180, we should be able to lower them again. (I'm sending this early so I don't forget; double-check llava-runner running time)

ghstack-source-id: cb4c1691907b8bb46a504a2d8cbc00d12b1ef4a4
ghstack-comment-id: 2648474106
Pull Request resolved: #8339