2023-09-14 14:24:39,905 - distributed.core - INFO - Starting established connection to tcp://...166.214:46011
slurmstepd-dlcgpu16: error: *** JOB 9277005 ON dlcgpu16 CANCELLED AT 2023-09-14T14:24:41 ***
2023-09-14 14:24:41,045 - distributed.worker - INFO - Stopping worker at tcp://...166.176:40901. Reason: scheduler-close
2023-09-14 14:24:41,046 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://...166.176:58820 remote=tcp://...5.166.214:46011>
Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 316, in write
    raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
    value = future.result()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 327, in write
    convert_stream_closed_error(self, e)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 143, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Worker->Scheduler local=tcp://...166.176:58820 remote=tcp://...166.214:46011>: Stream is closed
2023-09-14 14:24:41,051 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://...166.176:33175'. Reason: scheduler-close
2023-09-14 14:24:41,053 - distributed.core - INFO - Received 'close-stream' from tcp://...166.214:46011; closing.
2023-09-14 14:24:41,053 - distributed.nanny - INFO - Worker closed
I had inserted the following code at the top of submit_trial() to avoid a timeout from the scheduler. This may be quite central, because SMAC3 apparently expects the scheduler to launch the compute nodes instantly:
import asyncio

and

try:
    # Block until at least one worker has joined, so the first trial is not
    # submitted into an empty cluster. The 1200s literal should ideally match
    # self._worker_timeout, which the log message below reports.
    self._client.wait_for_workers(n_workers=1, timeout=1200)
except asyncio.exceptions.TimeoutError as error:
    logger.debug(
        f"No worker could be scheduled in time after {self._worker_timeout}s on the cluster. "
        "Try increasing `worker_timeout`."
    )
    raise error
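For reference, here is the same wait-before-submit pattern in standalone form. This is only a sketch: the SLURMCluster settings are illustrative assumptions, not values taken from this issue.

import asyncio

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Illustrative SLURM settings (assumptions, not from this issue).
cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=1)  # ask SLURM to start one worker job

client = Client(cluster)
try:
    # Block until at least one worker has registered with the scheduler,
    # instead of submitting work into an empty cluster.
    client.wait_for_workers(n_workers=1, timeout=1200)
except asyncio.exceptions.TimeoutError:
    print("No worker joined within 1200s; check the SLURM queue.")
    raise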
It seems that the program just ends with an error: one end of the TCP connection keeps running after the job has finished. This does cause a problem, notably that my GPUinfo cannot run, but that seems to be a side effect of the disconnection, or possibly a different problem altogether. So I guess I will close this here; I don't have time to study it further and am taking a different approach.
I'm not sure about the context here; you're talking about SMAC3 and GPUinfo, which I don't know about.
Anyway, yes, there are sometimes error messages when shutting down a Dask cluster, but as you've noticed, this does not cause a problem for your computation.
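If the shutdown noise is a concern, closing the client before the cluster usually tears the connections down in an orderly way. A minimal sketch, assuming `cluster` is an existing deployment object such as a dask_jobqueue.SLURMCluster:

from dask.distributed import Client

client = Client(cluster)  # `cluster` is assumed to already exist
try:
    pass  # run the computation here
finally:
    # Close the client first so the Worker->Scheduler streams are closed
    # deliberately, rather than by the SLURM job being cancelled.
    client.close()
    cluster.close()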