Integration of nextsim.dg module in NEDAS workflow #7
base: develop
Conversation
myying commented Dec 4, 2024 (edited)
- Added a JobSubmitter class to handle the different ways of running jobs, replacing the job_submit_cmd approach (see the sketch after this list; @yumengch, could you help test this?)
- Moved the nextsimdg restart and forcing file logic from the top-level config file into its own default.yml config file, and added a namelist() function.
- Moved perturb.py under models/nextsim/dg, keeping it as a nextsimdg-specific perturbation scheme alongside the standalone scheme in scripts/perturb.py.
- Added the missing pieces in the nextsim/dg/model.py methods so they integrate with the core assimilation algorithm.
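As a rough illustration of the JobSubmitter idea (class and method names here are assumptions for this sketch, not the actual NEDAS API), the scheduler-specific wrapping lives in subclasses instead of a job_submit_cmd string in the config:

```python
import subprocess

class JobSubmitter:
    """Base class: run commands directly on the current node."""
    def __init__(self, nproc=1, run_dir='.', job_name='job', **kwargs):
        self.nproc = nproc
        self.run_dir = run_dir
        self.job_name = job_name

    def submit_command(self, commands: str) -> str:
        # no scheduler: just run the commands as given
        return commands

    def run(self, commands: str) -> None:
        subprocess.run(self.submit_command(commands), shell=True,
                       check=True, cwd=self.run_dir)

class SLURMJobSubmitter(JobSubmitter):
    """Wrap commands with srun so they run inside the SLURM allocation."""
    def submit_command(self, commands: str) -> str:
        return f"srun -n {self.nproc} {commands}"
```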
…ing yml file. The perturbation follows the original HYCOM approach.
…perturb function;
…ction needs attention
… time units instead of giving it explicitly.
Adding random perturbation to forcing - an initial implementation
…g job arrays for the forecast phase
…ll perturbation variance for cycled testing; updating model configuration files for each cycle; put generated restart file into the next cycle
- saved the old copy of config.py
- moved perturb/random_perturb.py to ./perturb.py temporarily; will try to merge it with utils/random_perturb.py later
- output_dir is no longer needed
- makedir function uses os.makedirs but catches FileExistsError from race conditions (see the sketch after this list)
- adding nens to the job_opts for batch mode
- generate_init_ensemble.py is removed and its function is merged into preprocess.py; at time==time_start the restart_dir is where the initial ensemble will be obtained
- run_exp.py renamed to run_expt.py and cleaned up
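A minimal sketch of the makedir helper mentioned above, assuming only that it wraps os.makedirs and ignores the race where another ensemble member creates the directory first:

```python
import os

def makedir(path: str) -> None:
    # tolerate the race condition where another ensemble member creates
    # the directory between our existence check and the mkdir call
    try:
        os.makedirs(path)
    except FileExistsError:
        pass
```

(os.makedirs(path, exist_ok=True) achieves the same thing in Python 3.)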
replacing the job_submit_cmd approach in the config file
implemented OAR/SLURM submitters; updated nextsim.dg.model.run to use the submitter
Previously I had the perturb settings under the model_def section, defined for each model, but now the perturb section is standalone. The reason for doing so is that I can now run a perturbation scheme involving multiple model components. I'm keeping the model-specific perturb settings so that the code remains compatible. Note that the model.preprocess method runs the model-specific perturb settings, while scripts/perturb.py runs the standalone perturb settings. perturb.py is moved inside the nextsim.dg module as well. prepare_forcing and prepare_restart are included in the preprocess step.
Combined prepare_restart and prepare_forcing into preprocess. Since the nextsim.dg config_file holds the file and perturb options, they are initialized directly in the model class instead of being passed through kwargs every time.
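A sketch of that initialization pattern, with assumed attribute and key names: the model class reads its own default.yml once, so the file definitions and perturb options become attributes instead of kwargs on every method call.

```python
import os
import yaml

class Model:
    def __init__(self, config_file=None):
        # default.yml sits next to the module; an explicit path can override it
        if config_file is None:
            config_file = os.path.join(os.path.dirname(__file__), 'default.yml')
        with open(config_file) as f:
            cfg = yaml.safe_load(f)
        # file definitions (restart, forcing) and perturb options become
        # attributes, so methods like preprocess() no longer need them as kwargs
        self.files = cfg.get('files', {})
        self.perturb = cfg.get('perturb', None)
```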
- Initialize the grid inside __init__; can we define the model grid parameters in default.yml?
- Include the restart/forcing modules' read_var and write_var in the model class.
- In preprocess there is a small issue with copying the previous cycle's restart files to the current cycle: in the config file under files:restart:format, there seems to be no ensemble member index?
Since only the ensemble forecast uses this job array functionality, it is moved out of the job_submit section, so that other scripts such as assimilate.py and perturb.py do not accidentally get assigned use_job_array=True. Note: if not specified, use_job_array defaults to False.
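A small sketch of that default (the key names follow the discussion above; the helper itself is hypothetical): only ensemble_forecast sets the option explicitly, everything else falls back to False.

```python
def wants_job_array(job_opts: dict) -> bool:
    # absent key -> False, so assimilate.py / perturb.py never get job arrays
    return job_opts.get('use_job_array', False)

assert wants_job_array({'nens': 20, 'use_job_array': True}) is True   # ensemble_forecast
assert wants_job_array({'nproc': 128}) is False                       # other scripts
```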
Like the ensemble_forecast script, these utility scripts all run in a separate job, so they need to enter the Python environment before running.
@aydogduali had a question about the "ssh" part and about the temporary file used in submit_job_and_monitor.
I remember Yumeng mentioning that ssh is necessary when the job is submitted from a compute node, so that the submitter reaches the login node to submit additional jobs. If it is unnecessary, I suggest adding some logic: if self.job_submit_node is None, then don't add the "ssh self.job_submit_node" part to the commands.
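Something like the following sketch (the parameter mirrors self.job_submit_node in the submitter; the oarsub invocation is only illustrative):

```python
def build_submit_command(job_script, job_submit_node=None):
    # when no submit node is configured, run the scheduler command directly;
    # otherwise prepend the ssh hop to that node (e.g. a login node)
    submit = f"oarsub -S {job_script}"   # illustrative OAR submission
    if job_submit_node is None:
        return submit
    return f"ssh {job_submit_node} {submit}"
```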
For the temporary job script files, I used NamedTemporaryFile to create a file with a guaranteed unique name. I imagine that if self.run_dir is the same when submitting multiple jobs with the same job_name, their job_script files could collide (all members running in the same directory can happen sometimes). The default name from NamedTemporaryFile is "/tmp/tmpXXXXXX.sh"; if this is not a good location, I suggest using NamedTemporaryFile(mode='w+', delete=False, dir=self.run_dir, prefix=self.job_name, suffix='.sh') so that it is created in the run_dir. We can also keep the job_script (don't os.remove it) so that we can check it later.
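A self-contained sketch of that suggestion (run_dir and job_name stand in for self.run_dir and self.job_name):

```python
import tempfile

def write_job_script(run_dir: str, job_name: str, script_text: str) -> str:
    # unique but recognisable name inside run_dir; delete=False keeps the
    # file around after submission so it can be inspected later
    with tempfile.NamedTemporaryFile(mode='w', delete=False, dir=run_dir,
                                     prefix=job_name + '_', suffix='.sh') as f:
        f.write(script_text)
        return f.name
```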
Hi @myying,
- I use the sbatch command from the compute node all the time without needing ssh (on fram, but maybe other servers are different?). Perhaps it is better to simplify the logic and not have the ssh option?
- Space in /tmp is limited since many people can be saving there (I've had "no disk space" crashes from this), so I would agree that the default location for temp files should be in the run directory.
- temporary files generated inside run_dir
- make ssh node command optional
Hi @tdcwilliams, thanks for chiming in. I have a separate job_submitters/slurm.py to handle the sigma2 computers. As you can see in slurm.py, I didn't add any ssh before the sbatch command, since we don't need that additional step there. The ssh logic is only in oar.py, for a different machine. As for the temporary files, it is indeed not safe to rely on the /tmp partition, so in the new commit I changed them to be written in self.run_dir, which is specified when creating the JobSubmitter instance.