code will get stuck when using ddp #1922

cunangjiang · 2025-01-16T11:47:59Z

cunangjiang
Jan 16, 2025

When I run diff_model_train.py with DDP, training hangs at a certain epoch. When I reduce the dataset size, the model runs for more epochs before it hangs. How can I solve this problem? As shown in the figure, the code stays stuck at this point without raising any errors.

cunangjiang · 2025-01-16T11:54:34Z

cunangjiang
Jan 16, 2025
Author

0 replies

cunangjiang · 2025-01-16T23:56:25Z

cunangjiang
Jan 16, 2025
Author

The GPUs still have free memory, but GPU utilization is 0%.

0 replies

KumoLiu · 2025-01-17T10:52:44Z

KumoLiu
Jan 17, 2025
Maintainer

Hi @cunangjiang, did you check the RAM?
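For anyone following along, here is a minimal sketch of one way to watch host RAM from inside the training script instead of running a one-off check. It assumes a Linux host, where /proc/meminfo is available, and the helper names (parse_meminfo, available_gib, log_host_memory) are made up for this example and are not part of the tutorial code. The idea is that if available memory shrinks epoch after epoch, a hang that arrives later with a smaller dataset could be related to host memory pressure, though that is only a hypothesis here.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(meminfo: Dict[str, int]) -> float:
    """Return MemAvailable converted from kB to GiB (0.0 if the field is missing)."""
    return meminfo.get("MemAvailable", 0) / (1024 ** 2)


def log_host_memory(tag: str, path: str = "/proc/meminfo") -> None:
    """Print available RAM and free swap; `tag` is any label, e.g. rank and epoch."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    swap_gib = info.get("SwapFree", 0) / (1024 ** 2)
    print(
        f"[{tag}] MemAvailable: {available_gib(info):.2f} GiB, "
        f"SwapFree: {swap_gib:.2f} GiB",
        flush=True,
    )
```

Calling something like log_host_memory(f"rank {rank} epoch {epoch}") at the start of every epoch (rank and epoch being whatever variables the training loop already has) gives a per-epoch trace that can be compared against the point where training stops.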

0 replies

cunangjiang · 2025-01-17T13:14:45Z

cunangjiang
Jan 17, 2025
Author

I used free -h to check the RAM. Here is the result.

0 replies

cunangjiang · 2025-01-17T13:15:15Z

cunangjiang
Jan 17, 2025
Author

@KumoLiu

0 replies

KumoLiu · 2025-01-17T14:25:02Z

KumoLiu
Jan 17, 2025
Maintainer

Could you please share the command you are using to run the training? Are you using the latest version? Without that information, it is hard to reproduce and diagnose the issue.
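One way to collect more detail for a report like this is to have each process print its Python stack traces when a step takes far longer than expected, since a silent hang leaves nothing in the logs otherwise. The sketch below uses only the standard-library faulthandler module. The hang_watchdog name and the timeout value are assumptions made for this example, not something that exists in diff_model_train.py.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(timeout_s: float, file=None):
    """Dump the stack traces of all threads if the body runs longer than timeout_s.

    The dump repeats every timeout_s seconds until the body finishes, and the
    process keeps running (exit=False), so this only observes the hang.
    `file` must be backed by a real file descriptor; it defaults to sys.stderr.
    """
    out = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out, exit=False)
    try:
        yield
    finally:
        # Disarm the timer as soon as the guarded block completes.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the per-epoch training call, for example `with hang_watchdog(600): train_one_epoch(...)` where train_one_epoch stands in for whatever the script actually calls, would show which line each rank is waiting on. If the ranks turn out to be blocked inside a collective operation, rerunning with the environment variables NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL set may add backend-level detail, assuming the job uses the NCCL backend.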

0 replies
