distributed upgrade from 2022.03.0 to 2024.2.0 has performance issues. #8646

Closed
yiershanxll opened this issue May 9, 2024 · 8 comments

yiershanxll commented May 9, 2024

Problem:
We ran the job five times, and each time the problem occurred at around the 25-minute mark.
The error message "distributed.comm.core.CommClosedError: in <TLS (closed) Scheduler Broadcast local=tls://182.10.4.6:58090 remote=tls://182.10.2.6:18715>: Stream is closed" is displayed.
This problem does not occur with the earlier version, dask==2022.03.0.
Although the task reports an error, the background worker keeps executing it until the calculation is complete.

Environment information:

  • Dask version: 2024.2.0
  • distributed version: 2024.2.0
  • pandas: 2.0.3
  • pyarrow: 14.0.1
  • Python version: 3.9.11
  • Operating System: suse12.5

Number of nodes: Four containers with 16 vCPUs and 32 GB memory are deployed.
Number of workers: Four workers are started with the dask command. Each worker has ten processes with one thread each. Memory usage is limited to 90%. A total of 10 processes run in the background.
Distributed computing: We use the client.run method to submit tasks to each worker. Each worker takes a file as input, processes it with pandas (dask.dataframe is not used), and writes a file as output.

We have been using the earlier version, 2022.03.0, for distributed computing. In that version, however, the scheduler process fails to allocate tasks when tasks processing large data saturate the CPU, and it then stops responding. For that reason we do not use Dask's native scheduling and instead wrap task submission ourselves. For example, with four workers of five processes each, all tasks are divided evenly into 20 large task lists, and each worker process is assigned one list. The lists are submitted with the run method to avoid the scheduling problem. The current problem occurs occasionally now that production needs to be upgraded to 2024.2.0 and the data volume has increased again.
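
As a rough illustration of the splitting described above, the sketch below deals a list of tasks round-robin into one list per worker process. The helper name `split_evenly` and the file names are hypothetical and not taken from the reporter's code.

```python
def split_evenly(tasks, n_lists):
    """Deal tasks round-robin into n_lists lists whose lengths differ by at most one."""
    return [tasks[i::n_lists] for i in range(n_lists)]


# Assumed layout from the description: 4 workers x 5 processes = 20 task lists.
n_workers, procs_per_worker = 4, 5
# Hypothetical input files; the real job would supply its own file list.
files = [f"input_{i:04d}.csv" for i in range(4000)]
task_lists = split_evenly(files, n_workers * procs_per_worker)
```

Round-robin dealing keeps the lists the same size, but it does not balance by file size, so lists can still differ a lot in run time.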

@yiershanxll yiershanxll changed the title dask upgrade from 2022.03.0 to 2024.2.0 has performance issues. distributed upgrade from 2022.03.0 to 2024.2.0 has performance issues. May 9, 2024
@yiershanxll
Author

The worker-ttl parameter in distributed.yaml needs to be set to null.
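
For reference, this setting lives under the `distributed.scheduler.worker-ttl` config key. Below is a minimal sketch of building the override in Python. The helper `worker_ttl_override` is hypothetical, and the commented `dask.config.set` call assumes it runs in the scheduler process before the scheduler is created.

```python
# Config key of the scheduler's worker time-to-live setting.
WORKER_TTL_KEY = "distributed.scheduler.worker-ttl"


def worker_ttl_override(ttl=None):
    """Build a config override for the worker TTL.

    None corresponds to `null` in distributed.yaml (TTL check disabled);
    a string such as "30 minutes" would relax the limit instead.
    """
    return {WORKER_TTL_KEY: ttl}


# Applying the override (assumption: executed where the scheduler starts):
#   import dask
#   dask.config.set(worker_ttl_override())
```

Disabling the TTL means the scheduler no longer removes workers that stop sending heartbeats, so a longer timeout may be the safer choice when that check is still useful.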

@fjetter
Member

fjetter commented May 21, 2024

Just driving by: Client.run is not necessarily meant for users to run their computations. It is mostly used for diagnostic purposes, debugging, and occasionally for more exotic things. As the docs for Client.run already suggest, this function runs outside of the task scheduling system.

Users should instead use Client.submit to schedule individual functions.

You will also notice that with Client.run the dashboard does not actually work, and neither do many other features.
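
A minimal sketch of the Client.submit pattern recommended here, assuming one task per input file. `process_all`, `process_file`, and the scheduler address are placeholders, not names from this issue.

```python
def process_all(client, process_file, paths):
    """Submit one task per input path and return the results in input order."""
    # Each submit call goes through the scheduler, so the tasks are tracked there.
    futures = [client.submit(process_file, path) for path in paths]
    # gather blocks until every future has finished and re-raises task errors.
    return client.gather(futures)


# Usage against a real cluster (placeholder address):
#   from distributed import Client
#   client = Client("<scheduler-address>")
#   results = process_all(client, process_file, paths)
```

Because these are regular tasks, the scheduler decides where each one runs, which is the main difference from pinning one large task list to each worker with Client.run.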

@yiershanxll yiershanxll reopened this Sep 23, 2024
@yiershanxll
Author

The problem appears occasionally when processing about 100 million data records per day over 180 days, 4,000 files in total. Near the end of the job, Dask scheduling shuts down for unknown reasons, but the task keeps running and can still complete.

@hendrikmakait
Member

@yiershanxll: Have you switched from Client.run to Client.submit as recommended earlier?

@yiershanxll
Author

yiershanxll commented Sep 23, 2024

I'm trying it today, but there is no result yet. The job has to run for about 10 hours before the problem can show up.

@hendrikmakait
Member

I recommend updating to the latest version. Dask development moves fast and chances are that your problem may already be fixed. If that doesn't work, please provide more information about the exact problem you're seeing:

This helps us reproduce the issue you're having and resolve the issue more quickly.

@yiershanxll
Author

Thank you so much, I will try it!

@yiershanxll
Author

We still use run for submission, but we now call client.run(on_error='return') and run the command one more time, which successfully avoids the problem.
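
A rough sketch of that workaround, assuming the retry targets only the workers whose call returned an exception. `run_with_retry` is a hypothetical helper; `on_error` and `workers` are keyword arguments of Client.run. Since the original call may still have completed on the worker even though an error came back, the function passed in should be safe to run twice.

```python
def run_with_retry(client, func, max_retries=1):
    """Run func on every worker; re-run it on workers whose call returned an exception."""
    # With on_error="return", exceptions come back as values in the result dict
    # (keyed by worker address) instead of being raised in the client.
    results = client.run(func, on_error="return")
    for _ in range(max_retries):
        failed = [addr for addr, res in results.items() if isinstance(res, Exception)]
        if not failed:
            break
        # Retry only on the workers that reported an error.
        results.update(client.run(func, workers=failed, on_error="return"))
    return results
```

Any exception still present in the returned dict after the retries marks a worker whose output should be checked by hand.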
