Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integration of nextsim.dg module in NEDAS workflow #7

Open
wants to merge 31 commits into
base: develop
Choose a base branch
from

Conversation

myying
Copy link
Collaborator

@myying myying commented Dec 4, 2024

  • added a JobSubmitter class to handle different ways to run jobs; replacing the job_submit_cmd approach (@yumengch could you help test this?)
  • moved nextsimdg restart and forcing file logic from top-level config file to its own default.yml config file. Adding a namelist() function.
  • moved perturb.py under models/nextsim/dg, keeping it as a nextsimdg specific perturbation scheme, along with the standalone scheme in scripts/perturb.py
  • added the missing bits in nextsim/dg/model.py methods, to integrate with core assimilate algorithm

aydogduali and others added 21 commits July 3, 2024 16:34
…ing yml file. The perturbation follows the original HYCOM approach.
… time units instead of giving it explicitly.
Adding random perturbation to forcing - an initial implementation
…ll perturbation variance for cycled testing;updating model configuration files for each cycle; put generated restart file to next cycle;
saved the old copy of config.py

moved perturb/randoim_perturb.py to ./perturb.py temporarily, will try to merge with utils/random_perturb.py later

output_dir is no longer needed, makedir function using os.makedirs but
catching FileExistsError from race conditions, adding nens to the
job_opts for batch mode

generate_init_ensemble.py is removed and the function is merged into
preprocess.py as time==time_start the restart_dir is where the initial
ensemble will be obtained

run_exp.py renamed into run_expt.py and made more clean
replacing the job_submit_cmd approach in the config file

 # Changes not staged for commit:
@myying myying requested review from yumengch and aydogduali December 4, 2024 11:16
implemented OAR/SLURM submitters

updated nextsim.dg.model.run to use the submitter
previously I have the perturb settings under the model_def section,
defined for each model; but now the perturb section is made stand alone.
The reason doing so is that I can now run perturbation scheme involving
multiple model components. I'm keeping the model-specific perturb
settings so that the code is still compatible. Note that
model.preprocess method will run the model-specific perturb settings
while the scripts/perturb.py will run the standalone perturb settings.

Moving the perturb.py inside nextsim.dg module as well

prepare_forcing and prepare_restart are included in the preprocess step
combined prepare_restart and prepare_forcing into preprocess

since nextsim.dg config_file is holding the files and perturb options,
they are initialized directly in model class, instead of passing through
kwargs everytime.
-initialize grid inside __init__, can we define model grid parameters in
default.yml?

-include restart forcing modules read_var and write_var in model class

-in preprocess, there is a bit issue with copying previous cycle restart
files to the current cycle. In config file files:restart:format, there
seems to be no ensemble member index?
since only ensemble forecast gets to use this job array functionality,
moving it out of job_submit section, so that other scripts like
assimilate.py and perturb.py will not get accidently assigned
use_job_array=True. Note; if not specified, use_job_array is defaulted
to False
like the ensemble_forecast script, these utility scripts are all run in a separate job, so need to enter python environment before running.
@myying myying marked this pull request as draft December 17, 2024 08:26
@myying myying marked this pull request as ready for review December 17, 2024 08:26
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aydogduali had a question about the "ssh " part and about the temporary file used in submit_job_and_monitor.

I remembered Yumeng mentioning that ssh is necessary when the job is submitted from a compute node, so that they are accessing the login node to submit additional jobs. If it is unnecessary, I suggest making additional logic:
if self.job_submit_node is None: then don't add the "ssh self.job_submit_node" part in the commands

For the temporary job script files. I used NamedTemporaryFile to create a file with guaranteed unique name. I imagine that if the self.run_dir is the same when submitting multiple jobs with the same job_name their job_script can collide (all members run in the same directory, can happen sometimes). The default name from NamedTemporaryFile is "/tmp/tmpXXXXXX.sh" if this is not a good location. I suggest using NamedTemporaryFile(mode='w+', delete=False, dir=self.run_dir, prefix=self.job_name, suffix='.sh') so that it is created in the run_dir. We can also keep the job_script (don't os.remove it) so that later we can check it.

@tdcwilliams
Copy link

tdcwilliams commented Dec 17, 2024 via email

-temporary files generated inside run_dir
-make ssh node command optional
@myying
Copy link
Collaborator Author

myying commented Dec 17, 2024

Hi @tdcwilliams, Thanks for chiming in.

I have a separate job_submitters/slurm.py to handle the sigma2 computers. As you see in slurm.py I didn't implement any ssh before sbatch command, since we don't need this additional step. The ssh stuff is only for oar.py for a different machine.

The temporary file is indeed not safe to rely on /tmp partition. So in the new commit I changed it to be written in self.run_dir which is specified when making the JobSubmitter instance.

@aydogduali aydogduali mentioned this pull request Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants