Integration of nextsim.dg module in NEDAS workflow #7
base: develop
Conversation
myying commented Dec 4, 2024 (edited)
- Added a JobSubmitter class to handle the different ways of running jobs, replacing the job_submit_cmd approach (see the sketch after this list; @yumengch, could you help test this?)
- Moved the nextsimdg restart and forcing file logic from the top-level config file into its own default.yml config file, and added a namelist() function.
- Moved perturb.py under models/nextsim/dg, keeping it as a nextsimdg-specific perturbation scheme alongside the standalone scheme in scripts/perturb.py.
- Added the missing pieces in the nextsim/dg/model.py methods so they integrate with the core assimilation algorithm.
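As a rough illustration of the JobSubmitter idea (class and method names here are assumptions for this sketch, not the actual NEDAS API), the scheduler-specific wrapping lives in subclasses instead of a job_submit_cmd string in the config:

```python
import subprocess

class JobSubmitter:
    """Base class: run commands directly on the current node."""
    def __init__(self, nproc=1, run_dir='.', job_name='job', **kwargs):
        self.nproc = nproc
        self.run_dir = run_dir
        self.job_name = job_name

    def submit_command(self, commands: str) -> str:
        # no scheduler: just run the commands as given
        return commands

    def run(self, commands: str) -> None:
        subprocess.run(self.submit_command(commands), shell=True,
                       check=True, cwd=self.run_dir)

class SLURMJobSubmitter(JobSubmitter):
    """Wrap commands with srun so they run inside the SLURM allocation."""
    def submit_command(self, commands: str) -> str:
        return f"srun -n {self.nproc} {commands}"
```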
…ing yml file. The perturbation follows the original HYCOM approach.
…perturb function;
…ction needs attention
… time units instead of giving it explicitly.
Adding random perturbation to forcing - an initial implementation
…g job arrays for the forecast phase
…ll perturbation variance for cycled testing; updating model configuration files for each cycle; put generated restart file into the next cycle
- saved the old copy of config.py
- moved perturb/random_perturb.py to ./perturb.py temporarily; will try to merge it with utils/random_perturb.py later
- output_dir is no longer needed
- makedir function uses os.makedirs but catches FileExistsError from race conditions (see the sketch after this list)
- adding nens to the job_opts for batch mode
- generate_init_ensemble.py is removed and its function is merged into preprocess.py; at time==time_start the restart_dir is where the initial ensemble will be obtained
- run_exp.py renamed to run_expt.py and cleaned up
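A minimal sketch of the makedir helper mentioned above, assuming only that it wraps os.makedirs and ignores the race where another ensemble member creates the directory first:

```python
import os

def makedir(path: str) -> None:
    # tolerate the race condition where another ensemble member creates
    # the directory between our existence check and the mkdir call
    try:
        os.makedirs(path)
    except FileExistsError:
        pass
```

(os.makedirs(path, exist_ok=True) achieves the same thing in Python 3.)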
replacing the job_submit_cmd approach in the config file
implemented OAR/SLURM submitters; updated nextsim.dg.model.run to use the submitter
Previously I had the perturb settings under the model_def section, defined for each model, but now the perturb section is standalone. The reason for doing so is that I can now run a perturbation scheme involving multiple model components. I'm keeping the model-specific perturb settings so that the code remains compatible. Note that the model.preprocess method runs the model-specific perturb settings, while scripts/perturb.py runs the standalone perturb settings. perturb.py is moved inside the nextsim.dg module as well. prepare_forcing and prepare_restart are included in the preprocess step.
Combined prepare_restart and prepare_forcing into preprocess. Since the nextsim.dg config_file holds the file and perturb options, they are initialized directly in the model class instead of being passed through kwargs every time.
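A sketch of that initialization pattern, with assumed attribute and key names: the model class reads its own default.yml once, so the file definitions and perturb options become attributes instead of kwargs on every method call.

```python
import os
import yaml

class Model:
    def __init__(self, config_file=None):
        # default.yml sits next to the module; an explicit path can override it
        if config_file is None:
            config_file = os.path.join(os.path.dirname(__file__), 'default.yml')
        with open(config_file) as f:
            cfg = yaml.safe_load(f)
        # file definitions (restart, forcing) and perturb options become
        # attributes, so methods like preprocess() no longer need them as kwargs
        self.files = cfg.get('files', {})
        self.perturb = cfg.get('perturb', None)
```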
- Initialize the grid inside __init__; can we define the model grid parameters in default.yml?
- Include the restart/forcing modules' read_var and write_var in the model class.
- In preprocess there is a small issue with copying the previous cycle's restart files to the current cycle: in the config file under files:restart:format, there seems to be no ensemble member index?
Since only the ensemble forecast uses this job array functionality, it is moved out of the job_submit section, so that other scripts such as assimilate.py and perturb.py do not accidentally get assigned use_job_array=True. Note: if not specified, use_job_array defaults to False.
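A small sketch of that default (the key names follow the discussion above; the helper itself is hypothetical): only ensemble_forecast sets the option explicitly, everything else falls back to False.

```python
def wants_job_array(job_opts: dict) -> bool:
    # absent key -> False, so assimilate.py / perturb.py never get job arrays
    return job_opts.get('use_job_array', False)

assert wants_job_array({'nens': 20, 'use_job_array': True}) is True   # ensemble_forecast
assert wants_job_array({'nproc': 128}) is False                       # other scripts
```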
Like the ensemble_forecast script, these utility scripts all run in a separate job, so they need to enter the Python environment before running.
@aydogduali had a question about the "ssh" part and about the temporary file used in submit_job_and_monitor.
I remember Yumeng mentioning that ssh is necessary when the job is submitted from a compute node, so that the submitter reaches the login node to submit additional jobs. If it is unnecessary, I suggest adding some logic: if self.job_submit_node is None, then don't add the "ssh self.job_submit_node" part to the commands.
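Something like the following sketch (the parameter mirrors self.job_submit_node in the submitter; the oarsub invocation is only illustrative):

```python
def build_submit_command(job_script, job_submit_node=None):
    # when no submit node is configured, run the scheduler command directly;
    # otherwise prepend the ssh hop to that node (e.g. a login node)
    submit = f"oarsub -S {job_script}"   # illustrative OAR submission
    if job_submit_node is None:
        return submit
    return f"ssh {job_submit_node} {submit}"
```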
For the temporary job script files, I used NamedTemporaryFile to create a file with a guaranteed unique name. I imagine that if self.run_dir is the same when submitting multiple jobs with the same job_name, their job_script files could collide (all members running in the same directory can happen sometimes). The default name from NamedTemporaryFile is "/tmp/tmpXXXXXX.sh"; if this is not a good location, I suggest using NamedTemporaryFile(mode='w+', delete=False, dir=self.run_dir, prefix=self.job_name, suffix='.sh') so that it is created in the run_dir. We can also keep the job_script (don't os.remove it) so that we can check it later.
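A self-contained sketch of that suggestion (run_dir and job_name stand in for self.run_dir and self.job_name):

```python
import tempfile

def write_job_script(run_dir: str, job_name: str, script_text: str) -> str:
    # unique but recognisable name inside run_dir; delete=False keeps the
    # file around after submission so it can be inspected later
    with tempfile.NamedTemporaryFile(mode='w', delete=False, dir=run_dir,
                                     prefix=job_name + '_', suffix='.sh') as f:
        f.write(script_text)
        return f.name
```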
Hi @myying,
- I use the sbatch command from the compute node all the time without needing ssh (on fram, but maybe other servers are different?). Perhaps it is better to simplify the logic and not have the ssh option?
- Space in /tmp is limited since many people can be saving there (I've had "no disk space" crashes from this), so I would agree that the default location for temp files should be in the run directory.
- temporary files generated inside run_dir
- make ssh node command optional
Hi @tdcwilliams, thanks for chiming in. I have a separate job_submitters/slurm.py to handle the sigma2 computers. As you can see in slurm.py, I didn't add any ssh before the sbatch command, since we don't need that additional step there. The ssh logic is only in oar.py, for a different machine. As for the temporary files, it is indeed not safe to rely on the /tmp partition, so in the new commit I changed them to be written in self.run_dir, which is specified when creating the JobSubmitter instance.