ConnectionRefusedError #614

Open
mens-artis opened this issue Sep 14, 2023 · 2 comments

Comments

mens-artis commented Sep 14, 2023

2023-09-14 14:24:39,905 - distributed.core - INFO - Starting established connection to tcp://...166.214:46011
slurmstepd-dlcgpu16: error: *** JOB 9277005 ON dlcgpu16 CANCELLED AT 2023-09-14T14:24:41 ***
2023-09-14 14:24:41,045 - distributed.worker - INFO - Stopping worker at tcp://...166.176:40901. Reason: scheduler-close
2023-09-14 14:24:41,046 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://...166.176:58820 remote=tcp://...5.166.214:46011>
Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 316, in write
    raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
    value = future.result()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 327, in write
    convert_stream_closed_error(self, e)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 143, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Worker->Scheduler local=tcp://...166.176:58820 remote=tcp://...166.214:46011>: Stream is closed
2023-09-14 14:24:41,051 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://...166.176:33175'. Reason: scheduler-close
2023-09-14 14:24:41,053 - distributed.core - INFO - Received 'close-stream' from tcp://...166.214:46011; closing.
2023-09-14 14:24:41,053 - distributed.nanny - INFO - Worker closed

I had inserted the following code at the top of submit_trial() to avoid a timeout from the scheduler. This may be quite central, because SMAC3 apparently expects the scheduler to launch the compute nodes instantly:

import asyncio

and

try:
    self._client.wait_for_workers(n_workers=1, timeout=1200)
except asyncio.exceptions.TimeoutError as error:
    logger.debug(f"No worker could be scheduled in time after {self._worker_timeout}s on the cluster. "
                 "Try increasing `worker_timeout`.")
    raise error
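
For context, here is a minimal sketch of where that patch sits. The class name, the trial argument, and the way the client and timeout are stored are assumptions made for illustration only, not SMAC3's actual internals:

import asyncio
import logging

from distributed import Client

logger = logging.getLogger(__name__)

class DaskRunner:
    # Illustrative stand-in for the runner that submits trials to a Dask cluster.

    def __init__(self, client: Client, worker_timeout: float = 1200.0):
        self._client = client
        self._worker_timeout = worker_timeout

    def submit_trial(self, trial_info):
        # Block until at least one worker has joined the cluster, so nothing is
        # dispatched while the SLURM jobs are still queued.
        try:
            self._client.wait_for_workers(n_workers=1, timeout=self._worker_timeout)
        except asyncio.exceptions.TimeoutError:
            logger.debug(f"No worker could be scheduled in time after {self._worker_timeout}s "
                         "on the cluster. Try increasing `worker_timeout`.")
            raise
        # ... the actual submission of trial_info to the cluster would follow here ...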

mens-artis commented Sep 16, 2023

It seems that the program just ends with an error, as if one end of the TCP connection keeps running after the job is finished. It does cause a problem, notably that my GPUinfo cannot be run, but that seems to be a side effect of the disconnection, or possibly a different problem altogether. So I guess I will close this here; I don't have time to study it further and am taking a different approach.

@guillaumeeb
Member

Hi @mens-artis,

I'm not sure about the context here; you are talking about SMAC3 and GPUinfo, which I don't know about.

Anyway, yes, there are sometimes error messages when shutting down a Dask cluster, but as you've noticed, this does not cause a problem for your computation.
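
If the shutdown-time tracebacks are a concern, one generic pattern (just the standard dask.distributed / dask-jobqueue calls, sketched with hypothetical resource settings) is to close the Client before the cluster, so workers disconnect from the scheduler before the SLURM jobs are cancelled:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Hypothetical resources; adjust cores, memory and walltime for your site.
cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=1)

client = Client(cluster)
try:
    # ... run the computation / optimization here ...
    pass
finally:
    # Closing the client first lets workers disconnect cleanly, which can
    # reduce the CommClosedError noise when the jobs are torn down.
    client.close()
    cluster.close()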
