recent dask changes #509
Replies: 10 comments
-
Additional reading of the new documentation for dask (link above) showed three things that are important:
All of this says we need to change our container's startup script when dask is set as the scheduler. I was going to suggest how we could do that, but I'm a bit mystified by how it works, in spite of the fact that I wrote a user manual page on the topic several months ago. The conclusion I take from that is that what we have is way too complicated as a standard setup. Some ideas to consider:
A side issue to the potential design changes here is this one: we REALLY need a clean mechanism on HPC to just run a python file as the job script. We should not encourage interactive use of a system made for batch submissions. The model should be to use a small local system to set up your workflow on a test data set, so that you can release the same job on a bigger data set on a large cluster or cloud system. Jupyter notebook files should be only an option for such submissions, not the required form.
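For concreteness, here is a minimal sketch of what the "plain python file as the job script" model could look like. This is not MsPASS code; the environment variable WORKFLOW_SCHEDULER_ADDRESS and the run_workflow function are placeholders made up for illustration.

```python
# Hypothetical job-script sketch: the same file runs on a laptop against a
# test data set or gets submitted as a batch job on a cluster.
import os
from dask.distributed import Client


def run_workflow(client: Client) -> None:
    # Stand-in for the real processing; here just a trivial parallel map.
    futures = client.map(lambda x: x * 2, range(10))
    print(client.gather(futures))


if __name__ == "__main__":
    # Connect to whatever scheduler the job environment provides, falling
    # back to a throwaway local cluster for small test runs.
    address = os.environ.get("WORKFLOW_SCHEDULER_ADDRESS")
    client = Client(address) if address else Client()
    run_workflow(client)
    client.close()
```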
-
During our meeting today @wangyinz pointed out that the dask diagnostics dashboard should always be available if one connects to port 8787. I checked and he is indeed right. My memory and/or understanding of how dask operated earlier was faulty. I had thought the dashboard was only visible if you launched dask distributed. I haven't checked, but my error may have been in running a local copy of dask during development outside the mspass container. I think if you just invoke dask that way, without instantiating an instance of dask distributed, you may not get the diagnostics dashboard. So, the port conflict problem that initiated this discussion was absolutely an error on my part in launching dask. Apparently one does NOT want to do what I did, which is launch LocalCluster within a notebook running in the container. The topic of this discussion, on the other hand, does remain open. I am very, very sure that when we get a conda package in working order the best way to advise people to get started with MsPASS is to do a
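As a point of reference, here is a minimal sketch contrasting the two modes being discussed. Using one of the plain dask schedulers (threads/processes) starts no dashboard at all; it is creating a distributed LocalCluster/Client that serves the dashboard, by default on 8787.

```python
# Minimal sketch: plain dask vs. dask distributed and the diagnostics dashboard.
import dask.bag as db
from dask.distributed import Client, LocalCluster

# Plain dask: the computation runs, but there is no dashboard to connect to.
print(db.from_sequence(range(10)).map(lambda x: x + 1).compute(scheduler="threads"))

# dask distributed: the dashboard is served as soon as the cluster exists.
cluster = LocalCluster(n_workers=1)
client = Client(cluster)
print(cluster.dashboard_link)  # typically http://127.0.0.1:8787/status
```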
-
I decided to do a controlled test with LocalCluster, running a notebook I've been developing for the MsPASS class we are now planning to teach in July 2024. The notebook compares the same tasks done serially and in parallel. It runs fine provided I don't try to explicitly start a new LocalCluster. I tried running this on my Ubuntu development machine. I first updated dask with conda. I launched the docker container in "daemon" mode with a run line like the one in the wiki; the point is that it port-maps only 27017. I was trying to use our container as only a database server. I then tried to run the notebook as a local process. I launched it from the shell using:
In that mode, of course, jupyter automatically opens in a browser window. I knew there would be trouble when I got this message using the usual "get_database" method call for the MsPASS client:
I added this incantation from the dask documentation to a box at the top of the notebook and it gave a similar warning:
When I ran the parallel workflow box it crashed almost immediately. I won't give the whole error stack as it is long but it is clear the following is the key issue:
I am not at all convinced the error message is pointing to the real problem (it may be the "frequent" cause, but not necessarily the problem here). I am guessing the jupyter server is confused about which instance of the scheduler it should be talking to. I think the way I launched the container it is actually running in "all-in-one" mode, so there is an instance of dask and a jupyter notebook server running in the container. One or the other or both are likely confusing dask. The link the error message gives is also informative and something all development team members should read. It definitely points out the dark side of using dask in this mode and why the container remains a wise choice. Any suggestions? Maybe I should launch the container running in db mode. @wangyinz could you remind me how I might do that? Maybe you should edit the wiki page on running mspass with docker and change the line that shows how to run in daemon mode so the incantation runs the container strictly as a db server.
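One way to remove the ambiguity about which scheduler a notebook talks to is to give the Client an explicit address instead of letting it create or discover one. A minimal sketch follows; the address shown is an assumption (8786 is dask's default scheduler port), not something verified against the container setup.

```python
# Sketch: attach to a specific, already-running scheduler by address so there
# is no question which instance the notebook is using.
from dask.distributed import Client

client = Client("tcp://localhost:8786")  # explicit address; no new cluster created
print(client.scheduler_info()["address"])
```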
-
In an email @wangyinz suggested I try running the container with MSPASS_ROLE set to db. This doesn't work and I do not know why. First, let me note that the following is the line I cut and paste from the mspass wiki all the time to run mongodb with a local instance of mspass:
Note that with this configuration I can run dask locally IF I don't instantiate an instance of LocalCluster. Based on the email suggestion of @wangyinz I tried the following relatively small variant of the above:
Note the only difference is adding the --env argument to set MSPASS_ROLE. The problem is that the above line appears to exit almost immediately with no trace of any output anywhere that I can find; i.e. the command line echoes nothing. There is nothing in the logs directory and I don't see any other unknown file in "pwd". The only hint that I did anything is that if I try the same command again I get the common error you see when reusing --name with docker:
Why that is important is that it shows the container run script was executed but appears to have silently exited immediately. I tried to see how this might happen in the startup script, but it has me stumped. Any ideas?
-
hmmm... I am a bit confused about what exactly is going on here. So, for the first command, since it only binds the 27017 port, only the database is accessible from both inside and outside of the container. Because of how the startup script is written, it will also start a dask cluster inside the container, but that cluster won't be accessible from the outside. This means, if you run a ... For the second command, setting MSPASS_ROLE, the relevant part of the startup script is here: mspass/scripts/start-mspass.sh, line 320 (commit 9740939).
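To make the "not accessible from the outside" point concrete, here is a minimal sketch of how one could probe, from the host, whether the container's scheduler is reachable. It assumes the scheduler is on dask's default port 8786; whether the MsPASS container actually uses that port is an assumption here.

```python
# Sketch: from outside the container, try to reach a dask scheduler on the
# default port. With only 27017 published by docker run, this should fail fast.
from dask.distributed import Client

try:
    client = Client("tcp://localhost:8786", timeout="5s")
    print("connected to a scheduler at", client.scheduler_info()["address"])
except OSError as err:
    print("no scheduler reachable on localhost:8786:", err)
```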
-
First, when I tried this earlier today, before I wrote that last comment, I definitely was instantiating LocalCluster outside the container. You stated exactly why I did not understand what I was getting. The way I understand docker works, the fact that there is something inside the container listening on 8787 should not be detectable outside. That security feature is one of the main selling points for containerization. I wondered if maybe dask has some configuration file it is reading because I did a bind. Indeed, I now realize dask is creating a "work" directory. I thought that might be the issue, BUT I just tried docker run -p 27017:27017 without the bind argument and I still get the same warning about 8787 being in use. I'm going to try one more thing, but it will require ending this comment and returning: I want to test the hypothesis that some instance of dask I created earlier has done something to the networking on this machine. I'll use the all-purpose equivalent of ctrl-alt-del and try this again without any hacking in between. BTW - the search mentioned above led me to this site, which has some information very relevant to the overall topic of this discussion page.
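A quick way to test, from outside the container, whether anything on the host is actually holding 8787 is a plain socket probe. This is just a sketch of the kind of check involved, not the code dask itself uses.

```python
# Sketch: check whether something on this machine already accepts connections
# on the default dask dashboard port.
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 only if something accepted the connection
        return s.connect_ex((host, port)) == 0


print("8787 in use:", port_in_use(8787))
```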
-
Well, rebooting did not solve this problem. I did, however, poke at the code base that the error message indicated was throwing the exception that was handled and generated that error message about the port conflict. It seems most likely that there is some hidden state issue related to the browser history. It references line 182 of the file distributed/node.py. Here is a section of code around line 182:
I don't know what that section of code is doing, but it definitely is interrogating some system service that is saying 8787 is in use. Did you push that configuration change? Will I get it if I pull the latest container?
-
OK, I just tried on my Ubuntu system. First, I opened a new terminal and ran:
Then, in another terminal, I ran the following:
So, there is no conflict in the ports, although the LocalCluster is using a random port instead of 8787.
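If pinning the dashboard to 8787 is the goal, LocalCluster accepts an explicit dashboard_address; a short sketch is below. When the port is genuinely free this works, and when it is not, dask still warns and falls back to a random port.

```python
# Sketch: request the dashboard on 8787 explicitly instead of taking whatever
# random port LocalCluster falls back to.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, dashboard_address=":8787")
client = Client(cluster)
print(client.dashboard_link)  # http://127.0.0.1:8787/status if the port was free
```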
-
@wangyinz I pulled the revised container after your change and it seems using
-
So, I guess that means some other service has been running and using that port then. Maybe you can try
-
It is pretty clear to me that there have been some fairly significant changes in dask distributed. All of us should peruse the new documentation I found here. (I think you will get that as the top hit if you just google "dask distributed")
There are two things that make me post this.
The problem is that this throws the following warning message:
The job runs, BUT I can't connect to the diagnostic port that is normally on 8787, even when I use the "-p 8787:8787" option with docker run. The message says why: the LocalCluster instance changed the port. I cannot, of course, connect to 42047 in the above example because docker won't let me. I presume this is happening because our container prelaunches dask in the background and it apparently grabs 8787. My LocalCluster instance is colliding with the version already running. It is somewhat surprising it works at all.
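For anyone who wants to see the behavior outside the container, here is a small sketch that reproduces it with nothing but dask distributed: when the requested dashboard port is already taken, the second cluster warns and moves its dashboard to a random port, which a docker -p 8787:8787 mapping would not expose.

```python
# Sketch: two LocalClusters asking for the same dashboard port. The second one
# warns that 8787 is in use and picks a random port instead.
from dask.distributed import LocalCluster

first = LocalCluster(n_workers=1, dashboard_address=":8787")
second = LocalCluster(n_workers=1, dashboard_address=":8787")  # triggers the warning
print(first.dashboard_link)   # .../8787/status
print(second.dashboard_link)  # some other port, unreachable through -p 8787:8787
```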
I didn't put this in issues because I think it belongs more here, as a design discussion on how we need to make potentially bigger revisions to deal with this problem.