Define test suite for remap_restarts.py #265
I should add that wherever possible this suite should use coarse resolutions. For input this is difficult, but hopefully for output we can even do things like C24. (Speaking from ignorance.) Of course some direct comparison with regrid.pl results is necessary, but ... hopefully those can also be at coarse resolution. We also need to distinguish between crashing and getting wrong results. Testing and fixing crashes is a relatively fast process and should take priority. Then, when things are more stable, we can invest the larger effort and resources in checking/fixing numerical results.
@yvikhlya, @zhaobin74, we should consider creating a test restart set that originates from MERRA-2 ocean/S2S-v3. Ignoring the ocean restarts, we should think of all else, incl. sea ice restarts.
@tclune: Thanks for initiating the discussion. Having a workable set of tests for regrid (both the pl and py scripts) has been a concern from the beginning of PR #238 (and related PR GEOS-ESM/GEOSgcm_GridComp#571). In a nutshell, regrid is used by many different users in many different configurations. It is very difficult to assert that things are ok, and the impact on users of breaking regrid is considerable.

Our plan has been to first make sure the Catchment restarts are ok (testing by the land group) and then broaden the test users to include @sdrabenh, @sanAkel, and maybe also someone from the ADAS group, perhaps Joe Stassi. We're in the middle of this now. It's impossible and not desirable to be exhaustive in our list of tests, but we do need to make sure we capture the most frequent use cases.

I'm not sure that coarse-resolution output restarts are good tests. The land is very heterogeneous, and when land restarts are regridded to, say, c24, nothing will make scientific sense. I don't think regrid is prohibitively expensive, and I would advocate not using anything coarser than 0.5 deg. We also need to test CatchmentCN regridding (@gmao-jkolassa can help with testing).

Ultimately, before the two PRs can be merged, @sdrabenh will have to agree that we did not break the (old) perl regrid script. It now looks like this may be trivially 0-diff because the PR no longer touches anything that has to do with regrid.pl (to be confirmed by someone other than me). It has taken some time to agree that we should simply add the py capability and not touch the perl script and the existing F90 programs in the GEOSgcm_GridComp.
I'm not sure it's a good idea to merge a branch that runs but produces wrong results. If power users such as @sanAkel are eager to help with testing, they can use the branch. Once we merge into develop, any old user might employ the py script without realizing that it has not been fully tested. Again, regrid is a special case because it's widely used and any error might have a big negative impact on general users.
@gmao-rreichle I generally agree with what you wrote above, but a few responses. I would prioritize the easiest use cases, which might well not include catchments? (I'm outside my domain knowledge here.) I.e., what can we get working first with some confidence to use as a backbone, while remaining aware of the other complexities that must be allowed for?
Yes - certainly all common cases must be tested at some point before mainstream use. They may or may not all be part of this basic test suite I'm proposing. Just like we can't test all GEOS use cases with every PR. We'll need to prioritize based on frequency and/or orthogonality of the tests for coverage.
Software verification tests are not about doing science. Simply the ability to generate c24 results without crashing is an amazingly useful test. It runs fast and does not require a domain expert to evaluate. Such tests are of course limited, but why waste the time of a domain expert (or computing resources) unless such basic things are functioning? Of course, fast is a relative term. Can someone give me a sense of how long a typical run takes and on how many nodes? And then can we estimate how much faster a coarse test might run? Ultimately the verification tests can aid in validation by checking 0-diff from established results. These can have some overhead to manage though as intentional diffs enter the implementation over time.
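For the 0-diff part, here is a minimal sketch of what such a check could look like, assuming both tools have written their restarts into separate directories; the paths and the restart file-name pattern below are assumptions, not the actual layout:

```python
# Hypothetical 0-diff check: byte-compare restarts produced by regrid.pl
# and remap_restarts.py for the same inputs. Paths are placeholders.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def compare_restart_dirs(perl_dir: Path, py_dir: Path) -> list[str]:
    """Return the names of restart files that differ or are missing."""
    diffs = []
    for ref in sorted(perl_dir.glob("*_rst*")):  # file-name pattern is an assumption
        cand = py_dir / ref.name
        if not cand.exists() or sha256(ref) != sha256(cand):
            diffs.append(ref.name)
    return diffs

if __name__ == "__main__":
    bad = compare_restart_dirs(Path("regrid_pl_output"), Path("remap_py_output"))
    print("0-diff" if not bad else f"non-0-diff files: {bad}")
```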
To be fair, essentially every PR on GEOS produces wrong results. Do not discount the ability to manage multiple tickets against this tool instead of the one long linear thread that we have in that other PR. I agree that some minimal functionality should exist. From what @sdrabenh told me yesterday, I don't think we are ready for validation-type tests. Ordinary uses apparently fail completely (crash), and we need to focus on those first.
@sdrabenh wrote: @tclune @weiyuan-jiang defining a minimal test suite is a bit nebulous with an all-encompassing tool like this. From my point of view, the new python script should be able to demonstrate comparable results to the perl script for current (Icarus/Jason) datasets and resolutions. Regarding the functionality of the tool, it should be able to:
From an AGCM point of view, the most important atmospheric resolutions are C90, C180, C360, and C720. I would think that regridding between higher horizontal resolutions would better demonstrate whether the script is working correctly since they inherently contain more information. We frequently regrid between 72, 91, 137, and 181 levels. In fact, Bill's feature branch is aimed toward non-72-level resolutions, thus all those experiments. Lastly, it is common to regrid between the 4 data ocean resolutions based on whether we are running an OPS-like experiment, EMIPs, AMIPs, etc. Again, just seeing a few cases here demonstrating 0-diffness/equivalency with the perl script would go a long way in establishing confidence that this script is working correctly. My wish list for this script (after meeting minimal requirements) would be to:
[removed ">" characters that made the comment look like it was only a quote from an earlier comment; @gmao-rreichle, 21 July 2022]
Many of the things that you list above can be captured in a very small number of tests. Some are feature requests that should be handled after the initial PR. Some specific comments:
OK - let me be more concrete. Can someone suggest/specify a single set of parameters that exercises regrid.py in a minimal fashion? Fewest features used, lowest resolution allowed. Verification that the script completes without crashing on these inputs becomes the 1st test in the suite. Variations on the above then become additional tests:
And so on. Once we have this list and the script completes on it, we can then turn back to validation issues. As everyone likes to say, we cannot validate everything. But surely there are just a handful of canonical cases that could be run to give confidence in the answers. Please name them. These should not have variants for different resolutions and such. Just concrete use cases. At that point we should merge the PR. And then start opening issues for additional features and bug reports on additional cases.
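One possible way to organize such a list so that each new variation is a one-line addition is sketched below; every option name and value is a placeholder, not the tool's actual command-line interface, and the only assertion is "the script did not crash":

```python
# Sketch of a data-driven smoke suite for remap_restarts.py.
# All flags/values are placeholders; substitute the real options.
import subprocess
import sys

SMOKE_CASES = {
    "minimal_c24":     ["--in-res", "C90", "--out-res", "C24"],                    # hypothetical flags
    "minimal_c24_L91": ["--in-res", "C90", "--out-res", "C24", "--levels", "91"],  # hypothetical flags
    "bootstrap_land":  ["--in-res", "C90", "--out-res", "C24", "--bootstrap"],     # hypothetical flags
}

def run_case(args: list[str]) -> bool:
    """Run one case; success simply means the script exits with status 0."""
    result = subprocess.run([sys.executable, "remap_restarts.py", *args])
    return result.returncode == 0

if __name__ == "__main__":
    failed = [name for name, args in SMOKE_CASES.items() if not run_case(args)]
    print("all smoke cases passed" if not failed else f"crashed: {failed}")
```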
👋 all! @yvikhlya, @zhaobin74 and myself need to huddle up reg
@tclune I am actively working on testing this with @weiyuan-jiang. I believe he is working on adding some of the options (more like adding options for what the user can run) and fixing a few bugs we found (the default wemin, for example). It has been a process because, during testing, some options work and are zero-diff and others do not. It helps that the cases Scott was running are different from mine, so that revealed a few more issues.
Also, I still haven't run/tested any of the DAS tags...
I should have rephrased this statement. You are correct. Technically nothing is broken at the moment since we have not updated the gcm scripts to use this yet. However, this is an important feature to retain and will need to be resolved moving ahead. I tend to lean towards releasing a script that resembles its final form with minor bugs rather than adding features like this after the fact.
That is your call, but it would certainly help streamline the testing of multiple configurations without waiting in the queue.
This should be handled via |
What I meant is that regrid.py should be smart enough to modify the job directives according to the current interactive session and execute the job scripts immediately. No need for a queue or waiting. That should be relatively straightforward.
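A minimal sketch of that idea, assuming SLURM and using only its standard environment variables; how remap_restarts.py actually generates and submits its job scripts may well differ:

```python
# Sketch: run a generated job script in place when already inside a SLURM
# allocation (interactive salloc/srun session), otherwise submit via sbatch.
import os
import subprocess

def run_or_submit(job_script: str) -> None:
    if os.environ.get("SLURM_JOB_ID"):
        # Already inside an allocation: execute the script directly,
        # ignoring its #SBATCH directives, so there is no queue wait.
        subprocess.run(["/bin/bash", job_script], check=True)
    else:
        # No allocation: hand the script to the batch system as usual.
        subprocess.run(["sbatch", job_script], check=True)
```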
Would it be difficult to add an option to specify a path to the input/output tile files and clsm directory explicitly, instead of using code names? This would be handy, because it would allow us to work with testing grids that are not in the default locations. I have this possibility in my little script, just using symbolic links:
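For illustration, a symlink setup along those lines might look roughly like this; all paths are made up and the bcs/clsm layout is an assumption, not the actual directory structure:

```python
# Sketch: point the expected tile/clsm locations at a custom test directory
# via symlinks, so non-default grids can be used without code changes.
import os

LINKS = {
    "/discover/nobackup/myuser/test_bcs/tile_files": "bcs/tiles",  # hypothetical paths
    "/discover/nobackup/myuser/test_bcs/clsm":       "bcs/clsm",   # hypothetical paths
}

for target, link_name in LINKS.items():
    os.makedirs(os.path.dirname(link_name) or ".", exist_ok=True)
    if not os.path.lexists(link_name):   # skip if the link (or a file) already exists
        os.symlink(target, link_name)
```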
Please start with a list from @sdrabenh. If anything breaks for uncoupled, it's a no go! |
In my opinion, the best way to test if your script is working correctly is to run
and regrid the atmosphere to any other resolution and levels, and regrid the ocean to something different. It really doesn't matter what you choose as long as the results from the perl script and the python script are consistent. If things are non-zero-diff you should use higher resolutions to interrogate how different the result is - and most importantly whether that is expected or not. That would be my approach to testing the AGCM/dataocean.
@mathomp4, @weiyuan-jiang,
The bcs are likely going to be there for a few more years. I suggest putting the restarts into a dir controlled by the SI Team. (We will likely remove Vv6032 restarts from the dir once we get to Version 8 of L4_SM sometime in 2023.) |
This is in reference to PR #238.
@weiyuan-jiang @gmao-rreichle @sanAkel @sdrabenh @biljanaorescanin
We should define a small suite of test cases that regrid.py must pass each time the tool is updated. Nothing automated for the moment. But if possible it could be a short script (shell or python) that drives this tool for a few resolutions and such. The point is not to be exhaustive, but to perhaps aim to toggle each major option to get some coverage. True corner cases should be left for subsequent work.
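As a starting point, such a driver could be as small as the sketch below, which reads one invocation per line from a plain-text case file and reports which ones complete without crashing; the case-file name and its contents are left open and are not tied to the tool's actual interface:

```python
# Sketch of a minimal test driver: each line of the case file is a full
# command (e.g. an invocation of remap_restarts.py with chosen options).
import shlex
import subprocess
import sys

def main(case_file: str = "test_cases.txt") -> int:   # file name is a placeholder
    failures = []
    with open(case_file) as f:
        for line in f:
            cmd = shlex.split(line.strip())
            if not cmd or cmd[0].startswith("#"):      # skip blanks and comments
                continue
            if subprocess.run(cmd).returncode != 0:
                failures.append(" ".join(cmd))
    print("all cases completed" if not failures else "failed:\n" + "\n".join(failures))
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```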
Preference is for discussion to be here. But reach out on Teams if you have concerns.