
"Fix" for the increment handler application #738

Merged
merged 2 commits into from
Nov 16, 2023

Conversation

guillaumevernieres
Contributor

Not really a fix; I just removed the linear variable change from the increment handler application. The option is not used when cycling and probably shouldn't be anyway.

@CoryMartin-NOAA added the hera-GW-RT (Queue for automated testing with global-workflow on Hera) and orion-GW-RT (Queue for automated testing with global-workflow on Orion) labels on Nov 16, 2023
@emcbot added the orion-GW-RT-Running (Automated testing with global-workflow running on Orion) and hera-GW-RT-Running (Automated testing with global-workflow running on Hera) labels, and removed the orion-GW-RT and hera-GW-RT labels, on Nov 16, 2023

emcbot commented Nov 16, 2023

Automated Global-Workflow GDASApp Testing Results:
Machine: orion

Start: Thu Nov 16 10:01:46 CST 2023 on Orion-login-1.HPC.MsState.Edu
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Thu Nov 16 10:58:44 CST 2023
---------------------------------------------------
Tests:                                 *SUCCESS*
Tests: Completed at Thu Nov 16 11:28:13 CST 2023
Tests: 100% tests passed, 0 tests failed out of 53

@emcbot added the orion-GW-RT-Passed (Automated testing with global-workflow successful on Orion) label and removed the orion-GW-RT-Running label on Nov 16, 2023

emcbot commented Nov 16, 2023

Automated Global-Workflow GDASApp Testing Results:
Machine: hera

Start: Thu Nov 16 16:17:31 UTC 2023 on hfe07
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Thu Nov 16 17:02:18 UTC 2023
---------------------------------------------------
Tests:                                  *Failed*
Tests: Failed at Thu Nov 16 18:08:38 UTC 2023
Tests: 98% tests passed, 1 tests failed out of 53
Tests: see output at /scratch1/NCEPDEV/da/Cory.R.Martin/CI/GDASApp/workflow/PR/738/global-workflow/sorc/gdas.cd/build/log.ctest

@emcbot added the hera-GW-RT-Failed (Automated testing with global-workflow failed on Hera) label and removed the hera-GW-RT-Running label on Nov 16, 2023
@guillaumevernieres
Contributor Author

Looks like one of the ctests timed out:

51/53 Test #1686: test_gdasapp_atm_jjob_ens_run .........................***Timeout 1500.37 sec

Can we ignore this?

@guillaumevernieres merged commit edbe844 into develop on Nov 16, 2023
17 checks passed
@RussTreadon-NOAA
Contributor

According to /scratch1/NCEPDEV/da/Cory.R.Martin/CI/GDASApp/workflow/PR/738/global-workflow/sorc/gdas.cd/build/log.ctest, ctest test_gdasapp_atm_jjob_ens_run was submitted as batch job 51900427:

+ '[' HERA = HERA ']'
+ sbatch --nodes=1 --ntasks=36 --account=da-cpu --qos=batch --time=00:30:00 --export=ALL --wait /scratch1/NCEPDEV/da/Cory.R.Martin/CI/GDAS\
App/workflow/PR/738/global-workflow/sorc/gdas.cd/../..//jobs/JGLOBAL_ATMENS_ANALYSIS_RUN
Submitted batch job 51900427

If we look at /scratch1/NCEPDEV/da/Cory.R.Martin/CI/GDASApp/workflow/PR/738/global-workflow/sorc/gdas.cd/build/test/atm/global-workflow/testrun/slurm-51900427.out, we see that fv3jedi_letkf.x finished in just over 4 minutes (249.72 seconds).

0: OOPS_STATS ---------------------------------- Parallel Timing Statistics (   6 MPI tasks) -----------------------------------
0:
0: OOPS_STATS Run end                                  - Runtime:    249.72 sec,  Memory: total:    10.08 Gb, per task: min =     1.28 Gb, max =     1.97 Gb
0: Run: Finishing oops::LocalEnsembleDA<FV3JEDI, UFO and IODA observations> with status = 0
0: OOPS Ending   2023-11-16 18:07:37 (UTC+0000)
2023-11-16 18:07:38,955 - INFO     - atmens_analysis:   END: pygfs.task.atmens_analysis.execute
2023-11-16 18:07:38,956 - DEBUG    - atmens_analysis:  returning: None
+ slurm_script[21]: status=0

The total execution time for the job was under 5 minutes:

End slurm_script at 18:07:39 with error code 0 (time elapsed: 00:04:40)
_______________________________________________________________
Start Epilog on node h5c24 for job 51900427 :: Thu Nov 16 18:07:40 UTC 2023
Job 51900427 finished for user Cory.R.Martin in partition hera with exit code 0:0
_______________________________________________________________
End Epilogue Thu Nov 16 18:07:40 UTC 2023

Why did ctest flag this job as exceeding 30 minutes run time?

51/53 Test #1686: test_gdasapp_atm_jjob_ens_run .........................***Timeout 1500.37 sec
...
The following tests FAILED:
        1686 - test_gdasapp_atm_jjob_ens_run (Timeout)

The ctest printout is wrong: test_gdasapp_atm_jjob_ens_run passed in 280 seconds.

@CoryMartin-NOAA
Contributor

I think the issue, @RussTreadon-NOAA, is that ctest counts the time the job spends in the queue. So if the queue is slow, the test will time out. I wonder if there is a fix for this?
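
One possible mitigation (a sketch only, nothing that was applied in this PR) would be to raise the per-test timeout so that queue wait time is less likely to trip it; 1500 s is ctest's default per-test timeout, and the value below is purely illustrative:

# Sketch: raise the per-test timeout when running the suite so that batch
# queue wait does not count against the default 1500 s limit.
ctest --timeout 7200 -R test_gdasapp_atm_jjob_ens_run
# Alternatively, give the test an explicit TIMEOUT property where it is
# registered in CMake, e.g.:
#   set_tests_properties(test_gdasapp_atm_jjob_ens_run PROPERTIES TIMEOUT 7200)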

@RussTreadon-NOAA
Contributor

Hmm, this is a problem. We don't want false negatives!

When I run ctests interactively, I source a setup script we used during the Dec 2020 JEDI training. In this script we set

export SLURM_ACCOUNT=da-cpu
export SALLOC_ACCOUNT=$SLURM_ACCOUNT
export SBATCH_ACCOUNT=$SLURM_ACCOUNT
export SLURM_QOS=debug

Job 51900427 was submitted to the batch queue. Can we change the queue to debug to reduce queue wait time?
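
For illustration only (nothing in this PR changes it): the submission recorded in the log above hard-codes --qos=batch on the sbatch command line, so switching to debug would mean changing that flag (or whatever script builds it), along the lines of:

# Hypothetical variant of the sbatch call from the log, switching the QOS from
# batch to debug to shorten queue wait time (debug typically caps wall time at
# 30 minutes, which matches the requested --time=00:30:00).
sbatch --nodes=1 --ntasks=36 --account=da-cpu --qos=debug --time=00:30:00 --export=ALL --wait \
  /scratch1/NCEPDEV/da/Cory.R.Martin/CI/GDASApp/workflow/PR/738/global-workflow/sorc/gdas.cd/../..//jobs/JGLOBAL_ATMENS_ANALYSIS_RUN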

@CoryMartin-NOAA
Contributor

We did that at one point, @RussTreadon-NOAA, but there is a limit on the number of debug jobs a user can submit. The CI runs under my account, so I was unable to do any independent work on the debug queue, and we changed it back. Perhaps we should work on moving the testing to role.jedipara on Hera?

@RussTreadon-NOAA
Contributor

Good point. There's a limit of two concurrent debug jobs per user. Moving automated CI to a role account seems reasonable; automated CI is for the group, so we should use a group account to run it.

@CoryMartin-NOAA deleted the bugfix/incr_handler branch on November 17, 2023 at 20:00