
Post processing errors resulting from GFS HR4 test run #3019

Open
ChristopherHill-NOAA opened this issue Oct 18, 2024 · 20 comments · May be fixed by #3187
Labels
bug Something isn't working

Comments

@ChristopherHill-NOAA
Contributor

What is wrong?

A test run of GFS HR4 performed by @RuiyuSun included execution of the post-processing package, and the following log files indicated these errors:

gfsarch.log:  FATAL ERROR: Required file, directory, or glob gfs.20201030/00/products/atmos/wmo/gfs_collective1.postsnd_00 not found!
gfsawips_20km_1p0deg_f###-f###.log:  End exgfs_atmos_awips_20km_1p0deg.sh ... with error code 30
gfswaveawipsbulls.log:  FATAL ERROR: Job waveawipsbulls.95602 failed RETURN CODE 4
gfswaveawipsgridded.log: End exgfs_wave_prdgen_gridded.sh ... with error code 1

What should have happened?

The post processing scripts should all have run to completion without error or interruption.

What machines are impacted?

WCOSS2

What global-workflow hash are you using?

4ad9695

Steps to reproduce

Clone and build the workflow code from the indicated hash on WCOSS, then execute a single cycle (2020103000) that points to datasets available from the HR4 test case. Please consult @RuiyuSun for more specifications.

Additional information

Errors from gfswaveawipsbulls and gfswaveawipsgridded appear to result from a misread of Alaska buoy station information, and these scripts within the workflow may require reference to an updated table that links stations to expected data bulletins. @AminIlia-NOAA is added here for reference to the issue and the attached log files.
gfswaveawipsbulls.log
gfswaveawipsgridded.log

Errors resulting from gfsfbwind and gfsgempakncdcupapgif are being assessed for potential inclusion with this issue.

Do you have a proposed solution?

The GETGB2P errors generated from gfsawips_20km_1p0deg - as seen in OUTPUT70005.txt - result from the absence of GRIB variable 5WAVH from the GFS control file, and appear related to those resolved through #2652; the relevant parameter tables will be modified in a similar manner. Resolution of errors from gfsawips_20km_1p0deg may subsequently resolve the absence of files causing the gfsarch error.

Once developed, the modifications resolving all errors described here will be bundled into one or two pull requests.

@ChristopherHill-NOAA added the bug (Something isn't working) and triage (Issues that are triage) labels Oct 18, 2024
@ChristopherHill-NOAA
Contributor Author

Additional errors from the post-processing scripts:

  1. gfsfbwind
    gfsfbwind.log: ERROR config.resources must be sourced before sourcing WCOSS2.env

This error is triggered by the following logic within WCOSS2.env:

if [[ -n "${ntasks:-}" && -n "${max_tasks_per_node:-}" && -n "${tasks_per_node:-}" ]]; then
    max_threads_per_task=$((max_tasks_per_node / tasks_per_node))
    NTHREADSmax=${threads_per_task:-${max_threads_per_task}}
    NTHREADS1=${threads_per_task:-1}
    [[ ${NTHREADSmax} -gt ${max_threads_per_task} ]] && NTHREADSmax=${max_threads_per_task}
    [[ ${NTHREADS1} -gt ${max_threads_per_task} ]] && NTHREADS1=${max_threads_per_task}
    APRUN_default="${launcher} -n ${ntasks}"
else
    echo "ERROR config.resources must be sourced before sourcing WCOSS2.env"
    exit 2
fi

The variable $tasks_per_node is currently absent from the 'fbwind' case in config.resources and will be added.
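
A minimal sketch of the kind of case entry involved, using placeholder resource values rather than the settings that will actually be committed:

"fbwind")
    walltime="00:05:00"
    ntasks=1
    tasks_per_node=1       # previously missing; needed by the WCOSS2.env check above
    threads_per_task=1
    memory="4GB"
    ;;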

  2. gfsgempakncdcupapgif
    gfsgempakncdcupapgif.log: End exgfs_atmos_gempak_gif_ncdc_skew_t.sh ... with error code 1

Preceding this error:

  • the ImageMagick command (convert) invoked during the execution of make_tif.sh is not recognized
  • the location of the file make_NTC_file.pl cannot be resolved during the execution of make_tif.sh
  • the subdirectory defined by $COM_OBS was found to be absent from the $ROTDIRS filespace

The file make_tif.sh relies on a static reference to an outdated version of ImageMagick and will need to be modified to reference a currently available system module. Further, make_tif.sh uses the $UTILgfs variable to construct the path to make_NTC_file.pl; $UTILgfs is defined only within the job file $HOMEgfs/sorc/upp.fd/jobs/J_NCEPPOST, which references an antiquated 'util' directory that is no longer present within the UPP directory tree. One or both of J_NCEPPOST and make_tif.sh will be modified to reflect the current path to make_NTC_file.pl.
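
A rough sketch of that direction, assuming a usable convert is provided by a loadable system module and that make_NTC_file.pl ends up under $HOMEgfs/ush (both the module availability and the destination path are assumptions, not confirmed locations):

# Sketch only: resolve ImageMagick from the environment rather than from a
# hard-coded path to an outdated install.
if ! command -v convert > /dev/null 2>&1; then
    echo "FATAL ERROR: ImageMagick 'convert' not found in PATH"
    exit 1
fi

# Sketch only: build the path to make_NTC_file.pl from HOMEgfs instead of
# UTILgfs, which pointed at a UPP 'util' tree that no longer exists.
make_ntc="${HOMEgfs}/ush/make_NTC_file.pl"    # assumed location
if [[ ! -x "${make_ntc}" ]]; then
    echo "FATAL ERROR: ${make_ntc} not found or not executable"
    exit 1
fi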

Specific to the HR4 test of the global workflow, there may be a need to ensure proper staging of the observational data directory, defined by $COM_OBS. The workflow code will be reviewed for any need of revision.

@WalterKolczynski-NOAA
Contributor

These run to completion in CI tests (C96), although we are not validating the results.

@WalterKolczynski-NOAA removed the triage (Issues that are triage) label Oct 31, 2024
@WalterKolczynski-NOAA
Contributor

@DavidHuber-NOAA Can you look at the env issues here?

@DavidHuber-NOAA self-assigned this Oct 31, 2024
@DavidHuber-NOAA
Contributor

@WalterKolczynski-NOAA @ChristopherHill-NOAA is correct that the variable is missing from config.resources for the fbwind job. We do not run that job in the C96_atm3DVar_extended CI test; we should probably add it so it is tested.

For the gfsgempakncdcupapgif job, I see the same make_tif.sh and make_NTC_file.pl errors. However, the COM_OBS directory was present for our C96 test case. I suspect COM_OBS is not generated for a forecast-only experiment (assuming that is how this was run), in which case the gfsgempakncdcupapgif job should probably not be part of the task mesh for forecast-only cases.

Going back to the original post, the C96_atm3DVar_extended test does not attempt to create the gfs_downstream.tar tarball, in which the WMO collective files would be stored. It's not immediately clear to me why not, as it should be triggered when DO_BUFRSND == YES, which it is. I will look into this.

We do not run the gfs_awips* or gfs_waveawips* jobs, so I don't have anything to compare against there. I wonder if the gfs_awips jobs should be run in the C96_atm3DVar_extended test. I know it adds a significant number of jobs.

@WalterKolczynski-NOAA
Contributor

WalterKolczynski-NOAA commented Oct 31, 2024

AWIPS jobs should be running in the extended test. That's the main point of it. I believe fbwind is gated behind the AWIPS switch as well.

@WalterKolczynski-NOAA
Contributor

WalterKolczynski-NOAA commented Oct 31, 2024

Traced back why it was off. When I added the test in #2567, there was an issue with tocgrib2 and the convective precip fields, so AWIPS was turned off. That has since been fixed (#2652), so the test should be turned on now (it should have been turned on then).
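
For reference, a sketch of the switches involved on the experiment-configuration side (the exact file carrying them for the CI case is not shown in this thread, so treat the stanza as illustrative):

# Sketch: AWIPS products (and, per the note above, fbwind) are gated behind DO_AWIPS
export DO_AWIPS="YES"
export DO_BUFRSND="YES"    # already YES in the extended test, per the discussion above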

@DavidHuber-NOAA
Contributor

Looking into the gfs_downstream.tar failure, I see that WCOSS2 does not enable HPSS archiving by default. Turning this feature on and attempting to run gfs_arch in the C96_atm3DVar_extended test produces a failure identical to the one Chris reported. HPSS archiving should probably be enabled during CI testing on WCOSS2.
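
A minimal sketch of what enabling that looks like in an experiment's config.base, assuming HPSSARCH/LOCALARCH are the switches consulted by the archive job (exact placement for CI may differ):

# Sketch: enable HPSS archiving so gfs_arch actually builds and stores the tarballs
export HPSSARCH="YES"
export LOCALARCH="NO"    # assumption: local archiving is turned off when HPSS archiving is on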

@DavidHuber-NOAA
Contributor

I will address the gfs_downstream.tar issue in an upcoming PR.

@DavidHuber-NOAA
Contributor

The missing postsnd files trace back to #1929. This PR reworked the way data is sent to COM_ATMOS_WMO: now, data is only sent to that directory if SENDDBN == "YES". However, the postsnd collective files are also (always) sent to COM_ATMOS_BUFR, though they are named differently. I will update the archive yaml to point to this location instead.

@ChristopherHill-NOAA
Contributor Author

Pertaining to errors found within the gfsawips_20km_1p0deg_f###-f###.log files:

According to a relevant SCN from 2021, the 500-hPa 5-wave height (5WAVH) atmospheric field was removed from all GFS GRIB2 products. Because the script exgfs_atmos_awips_20km_1p0deg.sh invokes TOCGRIB2 with parameters referenced from the parm/wmo/grib2_awpgfs* files, the absence of the expected 5WAVH field from the GRIB files results in "error code 30".
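
As a quick check of that diagnosis, the field's absence can be confirmed directly against a pgrb2 output file (the file name below is illustrative; an empty result means no matching record):

# Illustrative check: list any 5WAVH records in a GFS pressure-level GRIB2 file.
# No output is consistent with the field having been dropped per the 2021 SCN.
wgrib2 gfs.t00z.pgrb2.0p25.f024 -match ":5WAVH:"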

Given the above, the 5WAVH entries will need to be removed from the grib2_awpgfs_20km_[ak,conus,pac,prico]f000 and grib2_awpgfs[000-240].003 parameter files, which should remedy the error.

Additionally, 5WAVH is erroneously associated with the operation of the wave model through a block of code (appearing twice) within exgfs_atmos_awips_20km_1p0deg.sh:

      if [[ ${DO_WAVE} != "YES" ]]; then
         # Remove wave field it not running wave model
         grep -vw "5WAVH" "parm_list" > "parm_list_temp"
         mv "parm_list_temp" "parm_list"
      fi

The GFS HR4 test run had DO_WAVE=YES and therefore did not remove 5WAVH from parm_list. Nonetheless, the above code block was added in error and will need to be removed from exgfs_atmos_awips_20km_1p0deg.sh (lines 177-181 and 211-215).

ChristopherHill-NOAA pushed a commit to ChristopherHill-NOAA/global-workflow that referenced this issue Dec 6, 2024
@DavidHuber-NOAA
Contributor

> The missing postsnd files trace back to #1929. This PR reworked the way data is sent to COM_ATMOS_WMO: now, data is only sent to that directory if SENDDBN == "YES". However, the postsnd collective files are also (always) sent to COM_ATMOS_BUFR, though they are named differently. I will update the archive yaml to point to this location instead.

These issues were fixed in #3053.

@DavidHuber-NOAA linked a pull request Dec 20, 2024 that will close this issue
@DavidHuber-NOAA
Contributor

@BoCui-NOAA @WalterKolczynski-NOAA @aerorahul
Should the fbwind executable be able to run on Hera? It compiled, but I am getting errors about being unable to read the index file. I will switch to WCOSS2 if not.

@BoCui-NOAA
Contributor

> @BoCui-NOAA @WalterKolczynski-NOAA @aerorahul Should the fbwind executable be able to run on Hera? It compiled, but I am getting errors about being unable to read the index file. I will switch to WCOSS2 if not.

@DavidHuber-NOAA Are you referring to the executable fbwind from code sorc/gfs_utils.fd/src/fbwndgfs.fd?

@DavidHuber-NOAA
Contributor

@BoCui-NOAA Yes, that's correct.

@DavidHuber-NOAA
Contributor

If you would like to see a run log, it can be found here: /scratch1/NCEPDEV/stmp2/David.Huber/RUNDIRS/C96_atm3DVar_extended_extended/gfs.2021122100/fbwind.684074/OUTPUT.684369.
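
In case it helps with the index-file error, a sketch of how a GRIB2 index is typically generated for these legacy executables, assuming the grib_util grb2index tool is the one expected here (that assumption is not confirmed by the log):

# Sketch (assumption: fbwind expects a grb2index-style index alongside the GRIB2 file)
gribfile="gfs.t00z.pgrb2.1p00.f006"        # illustrative input file name
grb2index "${gribfile}" "${gribfile}.idx"  # writes the index file read by the executable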

@BoCui-NOAA
Contributor

@DavidHuber-NOAA Unfortunately, I don't have experience with this code and compiling it on Hera.

@DavidHuber-NOAA
Contributor

Ah, OK, thank you.

@aerorahul Do you know who I should ping on this?

@aerorahul
Contributor

I think these are best debugged on wcoss2. These programs run into issues on RDHPCS platforms.

@AminIlia-NOAA

The buoy issue seems to be related to a missing gfswave.46001.cbull file that should have existed or been created upstream.

@JessicaMeixner-NOAA
Contributor

> The buoy issue seems to be related to a missing gfswave.46001.cbull file that should have existed or been created upstream.

This is a known issue that we're working on a fix for from the model side: ufs-community/ufs-weather-model#2546 - I'll run a test w/this fix in the model so downstream can be re-tested.

DavidHuber-NOAA added a commit that referenced this issue Jan 7, 2025
# Description
As described in #3019, the variable 5WAVH is being removed from each of
the files `parm/wmo/grib2_awpgfs[000-240].003` and
`parm/wmo/grib2_awpgfs_20km_[ak,conus,pac,prico]f000` to remedy the
"error code 30" generated during execution of
`exgfs_atmos_awips_20km_1p0deg.sh` in the GFSv17 HR4 test run.
Obsolete code is also being removed from the script
`exgfs_atmos_awips_20km_1p0deg.sh`.

No other errors mentioned in #3019 are addressed in this PR.

# Type of change
- [x] Bug fix (fixes something broken)
- [ ] New feature (adds functionality)
- [ ] Maintenance (code refactor, clean-up, new CI test, etc.)

# Change characteristics
- Is this a breaking change (a change in existing functionality)? NO
- Does this change require a documentation update? NO
- Does this change require an update to any of the following submodules?
NO
  (If YES, please add a link to any PRs that are pending.)
  - [ ] EMC verif-global
  - [ ] GDAS
  - [ ] GFS-utils
  - [ ] GSI
  - [ ] GSI-monitor
  - [ ] GSI-utils
  - [ ] UFS-utils
  - [ ] UFS-weather-model
  - [ ] wxflow

# How has this been tested?
Removal of the 5WAVH entries from the GRIB2 parameter files should allow
TOCGRIB2 processing of the GRIB2 files (within
`exgfs_atmos_awips_20km_1p0deg.sh`) to complete. @RuiyuSun, or the GW
team, may wish to include the requested modifications for future GFSv17
tests that include post-processing jobs.

# Checklist
- [ ] Any dependent changes have been merged and published
- [ ] My code follows the style guidelines of this project
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have documented my code, including function, input, and output
descriptions
- [ ] My changes generate no new warnings
- [ ] New and existing tests pass with my changes
- [ ] This change is covered by an existing CI test or a new one has
been added
- [ ] Any new scripts have been added to the .github/CODEOWNERS file
with owners
- [ ] I have made corresponding changes to the system documentation if
necessary

Co-authored-by: christopher hill <[email protected]>
Co-authored-by: Rahul Mahajan <[email protected]>
Co-authored-by: David Huber <[email protected]>