
mongodb version skew error #477

Open
pavlis opened this issue Dec 9, 2023 · 8 comments

@pavlis
Collaborator

pavlis commented Dec 9, 2023

I was trying to get mspass running on a new cluster at Indiana. As usual I had to do some customization to launch the mspass container. That actually only involved two things:

  1. I used the newer version of singularity, now called "apptainer", which required some name changes.
  2. I wanted the container to access more than my home directory.

If we need to dig into my changes later we can do so, but I don't think that is the issue I'm facing here. For the record, I found that any attempt to do anything with mongodb in the container would fail with a timeout error. Running a terminal in jupyter-lab, I quickly learned that mongod was not running in the container. I found the mongo log file (in the usual logs directory). There is the usual pile of stuff, but I cut out and pretty-formatted the line that shows the error that caused mongod to exit:

{
  "t": {
    "$date": "2023-12-09T05:46:03.889-05:00"
  },
  "s": "F",
  "c": "CONTROL",
  "id": 20573,
  "ctx": "initandlisten",
  "msg": "Wrong mongod version",
  "attr": {
    "error": "UPGRADE PROBLEM: Found an invalid featureCompatibilityVersion document (ERROR: Location4926900: Invalid featureCompatibilityVersion document in admin.system.version: { _id: \"featureCompatibilityVersion\", version: \"4.4\" }. See https://docs.mongodb.com/master/release-notes/5.0-compatibility/#feature-compatibility. :: caused by :: Invalid feature compatibility version value, expected '5.0' or '5.3' or '6.0. See https://docs.mongodb.com/master/release-notes/5.0-compatibility/#feature-compatibility.). If the current featureCompatibilityVersion is below 5.0, see the documentation on upgrading at https://docs.mongodb.com/master/release-notes/5.0/#upgrade-procedures."
  }
}

Is there a problem with the current container? I just pulled this version yesterday.
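(For reference, and not part of the original report: the log above means a 6.0-series mongod refused to start because the existing data files still carry featureCompatibilityVersion 4.4. MongoDB's documented remedy is to raise the FCV one major version at a time under a matching server binary, or to start from a fresh dbpath if the old database is disposable. A minimal shell sketch of the check-and-raise step, assuming a mongod that does start and mongosh on the PATH:)

# report the FCV recorded in the data files
mongosh --quiet --eval 'db.adminCommand({getParameter: 1, featureCompatibilityVersion: 1})'
# raise it one major version at a time, e.g. while running a 5.0 binary
mongosh --quiet --eval 'db.adminCommand({setFeatureCompatibilityVersion: "5.0"})'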

@wangyinz
Member

wangyinz commented Dec 9, 2023

There shouldn't be, as @Aristoeu has been using the container on an HPC system recently. @Aristoeu, could you please confirm that the latest container works? Also, maybe provide some details on how you've been running it.

btw, I wonder if this error is due to using the newer MongoDB with an old database set up by an older version.

@pavlis
Collaborator Author

pavlis commented Dec 9, 2023

Having just done this, there is a strong possibility @Aristoeu is using an older version of the container. With apptainer (aka singularity) you normally build a local image file that apptainer runs, rather than referring to the image by name as is done with docker. He could easily be using a version he had built earlier. That is a caution for anyone who assumes they are running the latest version of any container.

@wangyinz
Member

wangyinz commented Dec 9, 2023

I actually just tested pulling the latest mspass container on Frontera with apptainer pull docker://mspass/mspass and ran it using the single-node script. I was able to get everything started and connect to the database with mongosh.
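(For anyone repeating that test, the essential commands are roughly the following; the run line is a simplification, since the actual single-node script also sets several MSPASS environment variables:)

# pull the latest image; apptainer writes mspass_latest.sif to the current directory
apptainer pull docker://mspass/mspass
# launch it (the real single-node script wraps this with site-specific settings)
apptainer run mspass_latest.sif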

@Aristoeu
Collaborator

Aristoeu commented Dec 9, 2023

Yes, I was using an older version. I just pulled the latest container and ran it on distributed nodes using the script distributed_node.txt; it also started and connected with mongosh.

@pavlis
Collaborator Author

pavlis commented Dec 10, 2023

In an email on this topic @wangyinz noted there were tricky issues with running containers on HPC that could be a factor here. I am suspicious there is some file system issue. I found mongod dies when I run the container this way:

apptainer run --home $WORK_DIR $MSPASS_CONTAINER

where the two shell variables point to what their names imply. However, if I run this without the --home argument like this:

apptainer run $MSPASS_CONTAINER

and then connect to the jupyter server, I see that mongod is running and will accept a connection with mongosh. That doubly confirms the problem is not with the container, but strongly suggests it is something about the system configuration on this new cluster. Note the same script used to run on the ancestor cluster that this one is replacing.

It is not something as simple as a write permission error, since from the terminal in jupyter I can create and delete files in $WORK_DIR.
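(A rough diagnostic sketch, not from the original report, that may help narrow this down: compare where $HOME resolves inside the container under the two invocations, since --home changes what the start script sees:)

# how does $HOME resolve, and what does it contain, with --home set?
apptainer exec --home $WORK_DIR $MSPASS_CONTAINER bash -c 'echo $HOME; ls -la $HOME'
# and with the default home?
apptainer exec $MSPASS_CONTAINER bash -c 'echo $HOME; ls -la $HOME'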

Seems I need to get the cluster sysadmins involved. @wangyinz any ideas I could give them to help sort this out?

@wangyinz
Member

The first thing to look at is what directories are mounted into the container by default. At TACC, the default behavior is to mount $HOME, $WORK, and $SCRATCH, so that inside the container the user has access to all the files just as on the host. However, IU might have that set differently. A simple test you can do is to run a bash shell within the container and see what directories are mounted. Then, you may use the bind option (see here) to explicitly add those that aren't there by default.
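(A minimal sketch of that test; the paths are the ones that appear later in this thread:)

# list what the container actually sees mounted
apptainer exec $MSPASS_CONTAINER df -h
# check whether the data file system is visible at all
apptainer exec $MSPASS_CONTAINER ls /N/slate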

@pavlis
Collaborator Author

pavlis commented Dec 11, 2023

Seems pretty clear you are right: the issue is getting the right combination of the bind option, to mount the right file systems, with the --home option. The only combination I've found that works is to not define --home at all, in which case home is literally my home directory on the cluster (~ in shell lingo), and to use -B to mount the file system where I have my data (see the sketch after this paragraph). Notebooks and a terminal running in the container can access files in the file system mounted by -B, but if I try anything with --home, mongod doesn't run. Note I've also tried reversing the order of -B and --home and that makes no difference.
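(Concretely, with the bind path from the final comment below, the two cases look like this:)

# works: default home (~) plus an explicit bind of the data file system
apptainer run -B /N/slate/pavlis/usarray $MSPASS_CONTAINER
# fails: any variant that adds --home, regardless of option order
apptainer run --home $WORK_DIR -B /N/slate/pavlis/usarray $MSPASS_CONTAINER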

It may be telling that I have an older singularity container I built a few months ago that runs fine on this same cluster with apptainer, even though it was created with singularity on a different machine. Makes me think we need to heed the error message about a version problem that I always find in the mongod log file when it crashes.

@pavlis
Collaborator Author

pavlis commented Dec 16, 2023

I'm not yet sure why, but I have a solid workaround for the problem I encountered on this new IU cluster. It turned out that for some reason running the container with the --home option caused problems. If I omitted the --home option, used the -B option to mount the lustre file system (here called /N/slate), AND made sure the container was launched following a cd to the directory passed as APPTAINERENV_MSPASS_DB_PATH, things worked. I'm guessing the behavior is that --home defaults to the current directory. Why using --home aborts mongod remains a mystery to me, but this is a simple fix that actually simplified the run line from the script I used before anyway.

For the record, here is the job script that ran (omitting the SBATCH directives and comments, which are a side issue here).

# file system and container locations
WORK2=/N/slate/pavlis
WORK_DIR=$WORK2/usarray
MSPASS_CONTAINER=~/containers/mspass.sif
DB_PATH=/N/slate/pavlis/usarray
# run command: bind the lustre data directory explicitly, no --home option
SING_COM="apptainer run -B /N/slate/pavlis/usarray $MSPASS_CONTAINER"
module load apptainer
# launch from the working directory so the container's defaults resolve there
cd $WORK_DIR
# pass the db and work paths into the container via APPTAINERENV_ variables
APPTAINERENV_MSPASS_DB_PATH=$DB_PATH \
    APPTAINERENV_MSPASS_WORK_DIR=$WORK_DIR $SING_COM

I think you can mark this issue closed for now.
