distributed upgrade from 2022.03.0 to 2024.2.0 has performance issues. #8646

yiershanxll · 2024-05-09T01:27:56Z

Problem:
We tested five times, and each time the problem occurred after 25 minutes.
The following error message is displayed: "distributed.comm.core.CommClosedError: in <TLS (closed) Scheduler Broadcast local=tls://182.10.4.6:58090 remote=tls://182.10.2.6:18715>: Stream is closed".
This problem does not exist in the earlier version, dask==2022.03.0.
Although the task reports an error, the background workers keep executing it properly until the calculation is complete.

Environment information:

Dask version: 2024.2.0
distributed version: 2024.2.0
pandas: 2.0.3
pyarrow: 14.0.1
Python version: 3.9.11
Operating System: suse12.5

Number of nodes: Four containers are deployed, each with 16 vCPUs and 32 GB of memory.
Number of workers: Four workers are started using the dask command. Each worker has ten processes with one thread each. Memory usage is limited to 90%. A total of 10 processes run in the background.
Distributed computing: We use the client.run method to submit tasks to each worker for processing. The input processed by each worker is a file. Processing is done with pandas; dask.dataframe is not used. The output is also a file.

We have been using the earlier version 2022.03.0 for distributed computing. In that version, however, the scheduler process fails to allocate tasks when CPU usage is saturated while a task is processing big data, and as a result the scheduler stops responding. We therefore do not use Dask's native scheduling. Instead, we wrap the task submission ourselves: if there are four workers and each worker has five processes, all tasks are divided evenly into 20 large task lists, and each worker is assigned one large task list to run. We submit with the run method to avoid the scheduling problem. The problem reported here occurs occasionally now that production needs to be upgraded to 2024.2.0 and the data volume has increased again.
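The even split described above could be sketched roughly as follows. This is only an illustration of the idea, not our production code: the helper names `split_evenly` and `assign_chunks`, the worker addresses, and the file names are all hypothetical.

```python
from typing import Dict, List, Sequence


def split_evenly(items: Sequence[str], n_chunks: int) -> List[List[str]]:
    """Split items into n_chunks lists whose sizes differ by at most one."""
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(len(items), n_chunks)
    chunks: List[List[str]] = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def assign_chunks(
    worker_addresses: Sequence[str], items: Sequence[str]
) -> Dict[str, List[str]]:
    """Map each worker address to exactly one task list."""
    chunks = split_evenly(items, len(worker_addresses))
    return dict(zip(worker_addresses, chunks))
```

With a real cluster, each list could then be sent to its worker with something like `client.run(process_files, file_list, workers=[address])`, where `process_files` stands in for the pandas processing function.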

yiershanxll · 2024-05-09T06:01:25Z

The worker-ttl parameter in distributed.yaml needs to be set to null.

fjetter · 2024-05-21T11:06:24Z

Just driving by: Client.run is not necessarily meant for users to run their computations. It is mostly used for diagnostic purposes, debugging, and occasionally for more exotic things. As the docs for Client.run already suggest, this function runs outside of the task scheduling system.

Users should instead use Client.submit to schedule individual functions.

You will also notice that with Client.run the dashboard does not actually work, and many other features will not work either.
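A minimal sketch of what the Client.submit approach could look like for a one-file-per-task workload. It assumes `client` is a connected `distributed.Client`; the helper name `process_with_submit` and the `process_file` callable are hypothetical placeholders for the user's own code.

```python
from typing import Any, Callable, Iterable, List


def process_with_submit(
    client: Any, process_file: Callable[[str], Any], paths: Iterable[str]
) -> List[Any]:
    """Schedule one task per input file and collect the results in order."""
    # pure=False because the function is assumed to write an output file
    # (a side effect), so identical calls should not share one task key.
    futures = [client.submit(process_file, path, pure=False) for path in paths]
    # gather blocks until all futures finish and re-raises task errors.
    return client.gather(futures)
```

Because these are regular tasks, the scheduler decides where each file is processed, and the work shows up on the dashboard.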

yiershanxll · 2024-09-23T03:35:05Z

The problem appears occasionally when processing 100 million records per day over 180 days, 4,000 files in total. Near the end of the task, Dask scheduling closes for unknown reasons, but the task keeps running and can be completed.

hendrikmakait · 2024-09-23T07:07:12Z

@yiershanxll: Have you switched from Client.run to Client.submit as recommended earlier?

yiershanxll · 2024-09-23T07:40:27Z

I'm trying it today, but there is no result yet. The job has to run for 10 hours before something can go wrong.

hendrikmakait · 2024-09-23T09:42:29Z

I recommend updating to the latest version. Dask development moves fast and chances are that your problem may already be fixed. If that doesn't work, please provide more information about the exact problem you're seeing:

The full traceback of the exception and relevant logs.
A self-contained copy-pasteable example that generates the issue. You can find some guidance on how to generate a reproducer at http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

This helps us reproduce the issue you're having and resolve the issue more quickly.

yiershanxll · 2024-09-23T09:48:42Z

Thank you so much, I will try it!

yiershanxll · 2024-09-26T02:57:58Z

We still submit with Client.run, but we now call client.run with on_error='return' and run the command one more time if it fails. This successfully avoids the problem.
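A rough sketch of that workaround, for anyone who lands here. It assumes `client` is a connected `distributed.Client` and that `func` is safe to run a second time; the helper name `run_with_retry` is hypothetical. It treats any `Exception` instance in the returned dict as a failure on that worker.

```python
from typing import Any, Callable, Dict


def run_with_retry(
    client: Any, func: Callable[[], Any], retries: int = 1
) -> Dict[str, Any]:
    """Run func on every worker via Client.run, rerunning only where it failed."""
    # on_error="return" puts the exception object into the result dict
    # for the affected worker instead of raising it on the client.
    results = client.run(func, on_error="return")
    for _ in range(retries):
        failed = [
            addr for addr, value in results.items() if isinstance(value, Exception)
        ]
        if not failed:
            break
        # Rerun only on the workers whose previous call returned an exception.
        results.update(client.run(func, workers=failed, on_error="return"))
    return results
```

Any worker that still fails after the last retry keeps its exception object in the returned dict, so the caller should check the values before using them.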

github-actions bot added the needs triage label May 9, 2024

yiershanxll changed the title ~~dask upgrade from 2022.03.0 to 2024.2.0 has performance issues.~~ distributed upgrade from 2022.03.0 to 2024.2.0 has performance issues. May 9, 2024

yiershanxll closed this as completed May 9, 2024

yiershanxll reopened this Sep 23, 2024

yiershanxll closed this as completed Sep 26, 2024

jacobtomlinson removed the needs triage label Oct 17, 2024
