recent dask changes #509
Replies: 10 comments
-
Additional reading of the new documentation for dask (link above) showed three things that are important:
All of this says we need to change our container's startup script when dask is set as the scheduler. I was going to suggest how we could do that, but I'm a bit mystified by how it works, in spite of the fact that I wrote a user manual page on the topic several months ago. The conclusion I take from that is that what we have is way too complicated as a standard setup. Some ideas to consider:
A side issue to the potential design changes here is this one: we REALLY need a clean mechanism on HPC to just run a python file as the job script. We should not encourage interactive use of a system made for batch submissions. The model should be to use a small local system to set up your workflow on a test data set, so that you can release the same job on a bigger data set on a large cluster or cloud system. Jupyter notebook files should be only an option for such submissions, not the required form.
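For concreteness, here is a minimal sketch of what the "plain python file as the job script" model could look like. This is not MsPASS code; the environment variable WORKFLOW_SCHEDULER_ADDRESS and the run_workflow function are placeholders made up for illustration.

```python
# Hypothetical job-script sketch: the same file runs on a laptop against a
# test data set or gets submitted as a batch job on a cluster.
import os
from dask.distributed import Client


def run_workflow(client: Client) -> None:
    # Stand-in for the real processing; here just a trivial parallel map.
    futures = client.map(lambda x: x * 2, range(10))
    print(client.gather(futures))


if __name__ == "__main__":
    # Connect to whatever scheduler the job environment provides, falling
    # back to a throwaway local cluster for small test runs.
    address = os.environ.get("WORKFLOW_SCHEDULER_ADDRESS")
    client = Client(address) if address else Client()
    run_workflow(client)
    client.close()
```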
-
During our meeting today @wangyinz pointed out that the dask diagnostics dashboard should always be available if one connects to port 8787. I checked and he is indeed right. My memory and/or understanding of how dask operated earlier was faulty. I had thought the dashboard was only visible if you launched dask distributed. I haven't checked, but my error may have been in running a local copy of dask during development outside the mspass container. I think if you just invoke dask that way, without instantiating an instance of dask distributed, you may not get the diagnostics dashboard. So, the port conflict problem that initiated this discussion was absolutely an error on my part in launching dask. Apparently one does NOT want to do what I did, which is launch LocalCluster within a notebook running in the container. The topic of this discussion, on the other hand, does remain open. I am very, very sure that when we get a conda package in working order the best way to advise people to get started with MsPASS is to do a
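As a point of reference, here is a minimal sketch contrasting the two modes being discussed. Using one of the plain dask schedulers (threads/processes) starts no dashboard at all; it is creating a distributed LocalCluster/Client that serves the dashboard, by default on 8787.

```python
# Minimal sketch: plain dask vs. dask distributed and the diagnostics dashboard.
import dask.bag as db
from dask.distributed import Client, LocalCluster

# Plain dask: the computation runs, but there is no dashboard to connect to.
print(db.from_sequence(range(10)).map(lambda x: x + 1).compute(scheduler="threads"))

# dask distributed: the dashboard is served as soon as the cluster exists.
cluster = LocalCluster(n_workers=1)
client = Client(cluster)
print(cluster.dashboard_link)  # typically http://127.0.0.1:8787/status
```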
-
I decided to do a controlled test with LocalCluster, running a notebook I've been developing for the MsPASS class we are now planning to teach in July 2024. The notebook compares the same tasks done serially and in parallel. It runs fine provided I don't try to explicitly start a new LocalCluster. I tried running this on my Ubuntu development machine. I first updated dask with conda. I launched the docker container in "daemon" mode with a run line like the one in the wiki; the point is that it port-maps only 27017. I was trying to use our container as only a database server. I then tried to run the notebook as a local process. I launched it from the shell using:
In that mode, of course, jupyter automatically opens in a browser window. I knew there would be trouble when I got this message using the usual "get_database" method call for the MsPASS client:
I added this incantation from the dask documentation to a box at the top of the notebook and it gave a similar warning:
When I ran the parallel workflow box it crashed almost immediately. I won't give the whole error stack as it is long but it is clear the following is the key issue:
I am not at all convinced the error message is pointing to the real problem (it may be the "frequent" cause, but not necessarily the problem here). I am guessing the jupyter server is confused about which instance of the scheduler it should be talking to. I think the way I launched the container it is actually running in "all-in-one" mode, so there is an instance of dask and a jupyter notebook server running in the container. One or the other or both are likely confusing dask. The link the error message gives is also informative and something all development team members should read. It definitely points out the dark side of using dask in this mode and why the container remains a wise choice. Any suggestions? Maybe I should launch the container running in db mode. @wangyinz could you remind me how I might do that? Maybe you should edit the wiki page on running mspass with docker and change the line that shows how to run in daemon mode so the incantation runs the container strictly as a db server.
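One way to remove the ambiguity about which scheduler a notebook talks to is to give the Client an explicit address instead of letting it create or discover one. A minimal sketch follows; the address shown is an assumption (8786 is dask's default scheduler port), not something verified against the container setup.

```python
# Sketch: attach to a specific, already-running scheduler by address so there
# is no question which instance the notebook is using.
from dask.distributed import Client

client = Client("tcp://localhost:8786")  # explicit address; no new cluster created
print(client.scheduler_info()["address"])
```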
-
In an email @wangyinz suggested I try running the container with MSPASS_ROLE set to db. This doesn't work and I do not know why. First, let me note that the following is the line I cut and paste from the mspass wiki all the time to run mongodb with a local instance of mspass:
Note that with this configuration I can run dask locally IF I don't instantiate an instance of LocalCluster. Based on the email suggestion of @wangyinz I tried the following relatively small variant of the above:
Note the only difference is adding the --env argument to set MSPASS_ROLE. The problem is that the above line appears to exit almost immediately with no trace of any output anywhere that I can find; i.e. the command line echoes nothing. There is nothing in the logs directory and I don't see any other unknown file in "pwd". The only hint that I did anything is that if I try the same command again I get the common error you see when reusing --name with docker:
Why that is important is that it shows the container run script was executed but appears to have silently exited immediately. I tried to see how this might happen in the startup script, but it has me stumped. Any ideas?
-
hmmm... I am a bit confused about what exactly is going on here. So, for the first command, since it only binds the 27017 port, only the database is accessible from both inside and outside of the container. Because of how the startup script is written, it will also start a dask cluster inside the container, but that cluster won't be accessible from the outside. This means, if you run a ... For the second command, setting MSPASS_ROLE, the relevant part of the startup script is here: mspass/scripts/start-mspass.sh, line 320 (commit 9740939).
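To make the "not accessible from the outside" point concrete, here is a minimal sketch of how one could probe, from the host, whether the container's scheduler is reachable. It assumes the scheduler is on dask's default port 8786; whether the MsPASS container actually uses that port is an assumption here.

```python
# Sketch: from outside the container, try to reach a dask scheduler on the
# default port. With only 27017 published by docker run, this should fail fast.
from dask.distributed import Client

try:
    client = Client("tcp://localhost:8786", timeout="5s")
    print("connected to a scheduler at", client.scheduler_info()["address"])
except OSError as err:
    print("no scheduler reachable on localhost:8786:", err)
```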
-
First, when I tried this earlier today, before I wrote that last comment, I definitely was instantiating LocalCluster outside the container. You stated exactly why I did not understand what I was getting. The way I understand docker works, the fact that there is something inside the container listening on 8787 should not be detectable outside. That security feature is one of the main selling points for containerization. I wondered if maybe dask has some configuration file it is reading because I did a bind. Indeed, I now realize dask is creating a "work" directory. I thought that might be the issue, BUT I just tried docker run -p 27017:27017 without the bind argument and I still get the same warning about 8787 being in use. I'm going to try one more thing, but it will require ending this comment and returning: I want to test the hypothesis that some instance of dask I created earlier has done something to the networking on this machine. I'll use the all-purpose equivalent of ctrl-alt-del and try this again without any hacking in between. BTW - the search mentioned above led me to this site, which has some information very relevant to the overall topic of this discussion page.
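A quick way to test, from outside the container, whether anything on the host is actually holding 8787 is a plain socket probe. This is just a sketch of the kind of check involved, not the code dask itself uses.

```python
# Sketch: check whether something on this machine already accepts connections
# on the default dask dashboard port.
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 only if something accepted the connection
        return s.connect_ex((host, port)) == 0


print("8787 in use:", port_in_use(8787))
```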
-
Well, rebooting did not solve this problem. I did, however, poke at the code base that the error message indicated was throwing the exception that was handled and generated that error message about the port conflict. It seems most likely that there is some hidden state issue related to the browser history. It references line 182 of the file distributed/node.py. Here is a section of code around line 182:
I don't know what that section of code is doing, but it definitely is interrogating some system service that is saying 8787 is in use. Did you push that configuration change? Will I get it if I pull the latest container?
-
OK, I just tried on my Ubuntu system. First, I opened a new terminal and ran:
Then, in another terminal, I ran the following:
So, there is no conflict in the ports, although the LocalCluster is using a random port instead of 8787.
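If pinning the dashboard to 8787 is the goal, LocalCluster accepts an explicit dashboard_address; a short sketch is below. When the port is genuinely free this works, and when it is not, dask still warns and falls back to a random port.

```python
# Sketch: request the dashboard on 8787 explicitly instead of taking whatever
# random port LocalCluster falls back to.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, dashboard_address=":8787")
client = Client(cluster)
print(client.dashboard_link)  # http://127.0.0.1:8787/status if the port was free
```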
-
@wangyinz I pulled the revised container after your change and it seems using
-
So, I guess that means some other service has been running and using that port then. Maybe you can try
-
It is pretty clear to me that there have been some fairly significant changes in dask distributed. All of us should peruse the new documentation I found here. (I think you will get that as the top hit if you just google "dask distributed")
There are two things that make me post this.
The problem is that this throws the following warning message:
The job runs, BUT I can't connect to the diagnostic port that is normally on 8787, even when I use the "-p 8787:8787" option with docker run. The message says why: the LocalCluster instance changed the port. I cannot, of course, connect to 42047 in the above example because docker won't let me. I presume this is happening because our container prelaunches dask in the background and it apparently grabs 8787. My LocalCluster instance is colliding with the version already running. It is somewhat surprising it works at all.
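For anyone who wants to see the behavior outside the container, here is a small sketch that reproduces it with nothing but dask distributed: when the requested dashboard port is already taken, the second cluster warns and moves its dashboard to a random port, which a docker -p 8787:8787 mapping would not expose.

```python
# Sketch: two LocalClusters asking for the same dashboard port. The second one
# warns that 8787 is in use and picks a random port instead.
from dask.distributed import LocalCluster

first = LocalCluster(n_workers=1, dashboard_address=":8787")
second = LocalCluster(n_workers=1, dashboard_address=":8787")  # triggers the warning
print(first.dashboard_link)   # .../8787/status
print(second.dashboard_link)  # some other port, unreachable through -p 8787:8787
```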
I didn't put this in issues because I think it belongs more here, as a design discussion on how we need to make potentially bigger revisions to deal with this problem.