Issue with 2 moment cloud microphysics, code shows grid or fails layout regression (but does not show grid pattern) #847
Comments
This is scary. It does seem to eliminate the suggestion by @wmputman that this is possibly due to an errant formula applying to the whole array. I see 2 possibilities from this:
(2) is more plausible in my estimation. |
I see some odd code in the interface: Lines 1886 to 1914 in b60ee1f
From my reading of that,
but then later on this:
occurs for all pressures which would just replace the sub 500mb value. But since Still, I can't see how this could cause the MPI layout to appear in the output. And I have no idea of It does look like |
@bena-nasa Tested setting |
Another data point. I tried building this model version (v11.1.1 with Donifan's branch) with gfortran. A c90 stock experiment (the 1 moment moist) runs fine. However, if I turn on the 2 moment physics, it crashes in the first run step with either the debug or optimized gfortran build, so apparently something in 2 moment is so screwy that even gfortran with no optimization can't handle it. |
relvarr8 and qcvarr8 are not diagnostic; however, the line qcvarr8 = min(max(qcvarr8, 0.5), 50.0) is a limiter. It might be that, without a value being set for SINST, qcvarr8 becomes NaN and bypasses the limiter. However, I haven't seen any weirdness in the QCVAR_EXP export. Initializing SINST to zero should fix it. |
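The NaN hypothesis in the comment above can be illustrated with a small Python analog. This is only a sketch: the real limiter is Fortran, where the result of min/max with a NaN argument is processor-dependent, and `clamp_checked` is a hypothetical helper, not GEOS code. It shows that a min/max clamp cannot be relied on to sanitize a NaN, so an explicit check is needed.

```python
import math

def clamp(x, lo=0.5, hi=50.0):
    # Same shape as the limiter: min(max(x, lo), hi).
    # Every comparison against NaN is False, so Python's max/min keep
    # their first argument and the NaN passes straight through.
    return min(max(x, lo), hi)

def clamp_checked(x, lo=0.5, hi=50.0):
    # Hypothetical guarded variant: replace NaN before limiting.
    if math.isnan(x):
        x = lo
    return min(max(x, lo), hi)

nan = float("nan")
results = {
    "in_range": clamp(2.0),
    "too_small": clamp(0.0),
    "too_large": clamp(1.0e3),
    "nan_unchecked": clamp(nan),
    "nan_checked": clamp_checked(nan),
}
```

In this analog the unchecked clamp returns NaN unchanged, while the guarded variant returns the lower bound.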
Summary of facts to the best of my knowledge as of November 2st, 2023: Ways to fix this: The debug build appears to pass regression and not show these weird patterns. What is confusing is that I took the debug build and selectively compiled just the moist library with the release flags, and that did not show the bad behavior. So it is as if memory is just being moved around; it is not JUST that one file being built with optimization that can trigger this. gfortran (using gcc 12.1 and openmpi 4.1.3): For some reason, if you use this 1800 second timestep, the model crashes in the dynamics right away, so to do the gfortran tests I had to set the timestep back to 450 seconds; note that Intel has no trouble with this timestep. I have made a separate issue for that: |
@wmputman @dbarahon I've been trying it outside of the 11.1.1 + donifan's 2 moment branch to see if that also shows this, but at least in 11.3.1, if you choose 2 moment, it seems not to run. |
My earlier observation that there was an out of bounds issue with c24 was not correct (I had done something to disable the callback when I was futzing with the component testing frame, and that was causing the erroneous error). I was able to set up a c24 experiment following Donifan's c90 using v11.1.1 + donifan 2 moment branch, and that also does not pass layout regression. So at least the "2 moment problem" seems to be reproducible at lower resolution with the specific model release/branch version I have been testing with. The 2 moment in v11.3.1 is still broken though, it seems... |
A few more data points. I redid the test that builds only moist with optimization and builds the rest of GEOS with the debug flags. Unfortunately, I must not have done the test right before. The rationale for this test is that, if this is a compiler optimization problem, this executable would also not pass layout regression. This is with v11.1.1 and the donifan branch. Unfortunately, this crashes with this error even before reaching the 2 moment.
which implies there are even more issues here. 2nd data point. Apparently all these changes to the 2 moment have NOT been merged into any stable tag, to the best of my knowledge, and the 2 moment option in any v11.x.y tag should be considered non-functional. One must either use the donifan branch referenced in the first post OR one must use the I am concerned that no one can give me a stable tag, only branches to reproduce this, which of course can change. Hence why I have been referencing the SHA codes in this issue. |
And yet another disturbing thing. As I reported in the previous post, with @wmputman's branch I can reproduce the problem we are after using the release build. BUT, if I use the DEBUG build of the model configuration as described in the previous post, the model crashes like so with a floating point exception rather than running and passing layout regression like Donifan's branch does and crashing in ConvPar_GF2020.F90 |
I tried cloning v11.3.2. Running MG1 out of the box without updates, setting CLDMICR_OPTION: MGB2_2M, MGVERSION: 1, CONVPAR_OPTION: GF, USE_GF2020: 1 and USE_FCT: 0., does not show the weird behavior. This seems to conflict with @bena-nasa results. |
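When two people compare results from "the same" configuration, it can help to confirm mechanically that the settings really match. The following is a minimal sketch, assuming AGCM.rc uses `LABEL: value` lines with `#` starting a comment. `parse_rc` and `diff_settings` are hypothetical helpers, and the second experiment's text is made up for illustration; only the five settings in `exp_a` come from the comment above.

```python
def parse_rc(text):
    """Parse 'LABEL: value' lines into a dict, ignoring '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings

def diff_settings(a, b):
    """Return {label: (value_in_a, value_in_b)} for every label that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }

# Settings quoted in the comment above.
exp_a = parse_rc("""
CLDMICR_OPTION: MGB2_2M
MGVERSION: 1
CONVPAR_OPTION: GF
USE_GF2020: 1
USE_FCT: 0
""")

# Hypothetical second experiment that is missing one setting.
exp_b = parse_rc("""
CLDMICR_OPTION: MGB2_2M   # two moment microphysics
MGVERSION: 1
CONVPAR_OPTION: GF
USE_GF2020: 1
""")

differences = diff_settings(exp_a, exp_b)
```

Here `differences` reports only `USE_FCT`, present in one experiment and absent in the other.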
@dbarahon @wmputman so I thought that Bill's branch could reproduce the problem; well, I was using layout reproducibility as a proxy. Donifan's branch, when showing these bad patterns, was not layout reproducible, and I assumed they were one and the same. But in the experiment run with Bill's branch, I'm not seeing the "grid" pattern when I actually looked. So yay, but even after a step, 2 runs at different layouts diverged, and they were diverging after GF runs, not the 2 moment code. Likewise, I also cloned stock v11.3.2 and used the same AGCM.rc settings as @dbarahon used in the experiment in the previous post in my own c90 experiment. That also did not show any "grid" pattern in the moist fields, but it also did not pass layout regression, so it is absolutely showing weird behavior. So clearly there is another issue with this code, since it absolutely should pass layout regression. The question is, is this related to the "grid pattern" problem, another manifestation of a deeper issue? |
@wmputman this 2 moment code that I'm testing with your branch (and that I guess Scott is making a release of) is not right. Besides the fact that it doesn't layout reproduce, the debug build of the model just crashes in GF and, worse, if I take the release build of the model but build just the moist library with debug, it fails too, but fails differently, also in GF! I'll pursue these, but in my mind I would consider the 2 moment option to be non-functional... |
Update of situation as of November 22nd, 2023 as I see it: Scott has released a new version of the GEOSgcm fixture, v11.3.3, that has all of the changes Bill and Donifan wanted to see get into a release, so that we can have a stable release to serve as the baseline to track the issue. (Bill had broken layout regression in GF2020 too in his development branch, which confused me for a while, but he fixed that in what was passed to Scott, and GF2020 passes layout regression for me in this release.) In the discussion below I will refer to the file GEOS_MGB2_2M_InterfaceMod.F90 as the "MG2 interface" for brevity. My test to find where the code was diverging in different layouts was to add multiple calls to "MAPL_CheckpointState" and dump out the internal state of moist at various strategic points in the MG2 interface the first time the run method is executed. I simply ran the code at 2 layouts and found the first dumped checkpoint in the sequence that differed. Using this release, the following can be stated when MG2 is enabled in a C90 experiment. All these tests were performed on either Skylake or Cascade Lake nodes. I have not done this experiment on SCU17.
With gfortran (openmpi 4.1.3, gcc 12.1.0, same baselibs version as in stock tag)
In summary, the new release does not pass layout regression when MG2 is turned on, because of the call to either hystpdf or mmicro_pcond, depending on the optimization/compiler. I cannot reproduce the grid imprint Donifan was seeing, but in my mind this is irrelevant. The code is not passing layout regression with 2 different compilers. As far as where to investigate, mmicro_pcond seems to be the place (although the gfortran results suggest that perhaps the memory corruption may move). I am concerned simply that mmicro_pcond and micro_mg_tend_interface (an alternative to mmicro_pcond) have a ridiculous number of arguments. We have seen an Intel bug that was related to the number of arguments, and the sheer number, plus the fact that it is called in a per-column i-j loop, makes this incredibly hard to debug/investigate. |
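The checkpoint comparison described in the November 22nd update (dump the state at several points, run at 2 layouts, find the first dump that differs) can be sketched as follows. This is an illustration only: the real dumps are files written by MAPL_CheckpointState, whereas here each checkpoint is a plain dict of lists, and the labels, field name, and values are hypothetical.

```python
def first_divergence(run_a, run_b):
    """Return (label, changed_fields) for the first differing checkpoint, or None.

    run_a / run_b: ordered lists of (label, {field_name: list_of_floats})
    holding the same sequence of checkpoints from runs at two layouts.
    """
    for (label_a, fields_a), (label_b, fields_b) in zip(run_a, run_b):
        if label_a != label_b:
            raise ValueError(f"checkpoint order mismatch: {label_a} vs {label_b}")
        # Layout regression demands bitwise identity, so compare exactly.
        if fields_a != fields_b:
            changed = sorted(
                name for name in fields_a
                if fields_a[name] != fields_b.get(name)
            )
            return label_a, changed
    return None

# Hypothetical checkpoints from two layouts of the same experiment.
layout_1 = [
    ("before_hystpdf", {"Q": [1.0, 2.0]}),
    ("after_hystpdf", {"Q": [1.5, 2.5]}),
    ("after_mmicro_pcond", {"Q": [1.7, 2.9]}),
]
layout_2 = [
    ("before_hystpdf", {"Q": [1.0, 2.0]}),
    ("after_hystpdf", {"Q": [1.5, 2.5]}),
    ("after_mmicro_pcond", {"Q": [1.7, 2.9 + 1e-12]}),
]

culprit = first_divergence(layout_1, layout_2)
```

With this made-up data the first difference shows up in the checkpoint written after the last call, which points the search at the code between that checkpoint and the one before it.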
Update with v11.5.2 of the GEOSgcm fixture |
I guess the question is does @dbarahon have a newer MG code? Perhaps the v12 candidate has it? Or is there a new branch? |
I think we should confirm this with @dbarahon before doing anything further. |
Hi,
Just merged my latest cleaned up version of MG. It must work with v11.5.2 and it is very close to what is implemented in Bill’s v12_rc1 tag.
Donifan
On Wednesday, June 12, 2024 at 10:58 AM, Ben Auer wrote:
Update with v11.5.2
The stock tag at c24 still fails layout regression when MGB2_2M is selected as the microphysics using gcm_setup.
Compiling GEOS_MGB2_2M_InterfaceMod.F90 at O0 still fixes the regression failure.
|
Just merged my latest updates. The stock code is very outdated. Instead, check out feature/donifan/mg3clean |
@dbarahon my initial test seemed to say that your branch is now passing layout reproducibility but failing start/stop |
Yes, that's what I am finding as well. Setting NCPL and NCPI to constant values makes it pass start-stop. Those are variables we want to be able to predict however. |
@dbarahon that's weird, those are in the moist_internal_checkpoint. Also, they are internal, so I would have thought something you want to "predict" means you would want to write it in History which means it should be an export. |
I just meant that they can't be constant. They are part of the internal. |
@dbarahon what did you do to make NCPL and NCPI constant? I'm not sure what you mean by that. I tried setting them to 0 in the run method of the mg2 code but that didn't fix the start/stop regression |
For some reason zero does not seem to be a good value. Uncomment Lines 1162 and 1163 in the run method. |
Well, I'm not sure it means much, but And it looks like in the Thompson microphysics they have defaults of 50e6 and 1e3 in there. Do we know if the Thompson microphysics regresses? |
Ugh, so my first idea just makes things more confusing. Starting at 21z, I'll do a 6 hour run and 3+3 hour runs. I'll write the import and internal state of moist at 0z in both at the start of run and see if they are different, since start/stop is usually something missing in the restart. I then tried also dumping these states at the END of the run method at 0. Now they have gone 0 diff. So despite seeing the same import and internal state coming in, they end up at a different place. |
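The usual cause mentioned above, something missing from the restart, can be shown with a toy model. This is a sketch under stated assumptions: `step`, the `scratch` field, and the restart helper are all hypothetical and have nothing to do with the actual moist code. It only shows why a 6-step run and a 3+3 run disagree when a carried field is not written to the restart.

```python
def step(state):
    # Toy "physics": the tendency depends on a scratch field that is
    # carried between steps.
    new_scratch = 0.5 * state["scratch"] + 1.0
    return {"q": state["q"] + 0.1 * new_scratch, "scratch": new_scratch}

def run(state, nsteps):
    for _ in range(nsteps):
        state = step(state)
    return state

def write_and_read_restart(state, fields):
    # Only the listed fields survive; everything else is reinitialised to 0.
    return {name: (state[name] if name in fields else 0.0) for name in state}

initial = {"q": 1.0, "scratch": 0.0}

straight = run(initial, 6)  # one 6-step run

half = run(initial, 3)      # first half of the 3+3 run
incomplete = run(write_and_read_restart(half, {"q"}), 3)
complete = run(write_and_read_restart(half, {"q", "scratch"}), 3)
```

In the toy, `complete` matches `straight` exactly and `incomplete` does not. The thread's case is harder than this, since the dumped import and internal states matched at the start of the run method.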
@dbarahon
Since NCPL and NCPI are in the internal restart, the only logical conclusion is that DNDCNV is different which I see is computed a few lines above. I need to leave for today but that's where I will investigate on Monday |
I think I found it. DNDCNV is computed with these lines:
But |
@dbarahon says CNV_MFD is filled in by GF. So either GF didn't fill it in or CNV_NDROP/NCV_NICE has gone non-zero diff. As an aside, the fact that you are using the export state here, which someone else had to fill, is just incredibly confusing... |
Dang. I was hoping it was found. I'm still trying to follow the code around, but no luck. But at least there's a point to focus on! So far, one interesting thing is that |
It is not supposed to be. I was testing whether the issue was in the aerosol_activate code. Must have forgotten to remove it. |
You were right. All that was needed was to verify that CNV_MFD and CNV_FICE were associated. I just updated my branch. It is passing both start-stop and layout regression for me. |
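The kind of guard described in the comment above can be sketched in Python, with `None` standing in for an unassociated Fortran pointer. Everything here is an assumption for illustration: the function name, its arguments, and especially the formula are placeholders, not the DNDCNV expression from the MG2 interface.

```python
def convective_number_source(cnv_mfd, cnv_number, ncol):
    """Hypothetical stand-in for a calculation that reads optional inputs.

    cnv_mfd and cnv_number are lists of length ncol, or None when the
    field was never allocated (the analog of an unassociated pointer).
    """
    if cnv_mfd is None or cnv_number is None:
        # An input is missing: fall back to a defined value (zero source)
        # rather than computing from memory nobody filled.
        return [0.0] * ncol
    # Placeholder formula, NOT the real expression from the MG2 interface.
    return [m * n for m, n in zip(cnv_mfd, cnv_number)]

with_inputs = convective_number_source([0.5, 0.25], [100.0, 200.0], ncol=2)
without_inputs = convective_number_source(None, [100.0, 200.0], ncol=2)
```

The point of the guard is that the result is well defined in both cases, so it cannot depend on whatever happens to be in memory after a restart.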
Wooo! Congrats to both @bena-nasa and @dbarahon for figuring this out. |
Hmm. I just tried a merge of
Looks like I might need @dbarahon help to do this right. I must have screwed up the merge. |
I concur, I pulled Donifan's |
Just needs to add that new structure to aer_cloud.F90 |
Might need you to try that. MG code is complex to me! :) |
Just making an issue for this problem since it was not done (sorry if the title is not right, but I'm not an atmospheric scientist; all these different "moist" schemes are just gobbledygook to me). This is not a new issue; user @atrayano has been looking at it with a fine-tooth comb, I'm told.
Two users, @dbarahon and @wmputman, have been seeing "odd" patterns in some of the moist exports when running the 2 moment moist physics (I assume it is two moment because they have this
CLDMICR_OPTION: MGB2_2M
in the AGCM.rc file). Both of them are seeing things like this, it seems, when examining output; this is "IWP", for example (this picture was made with ncview, using the "low" option for the scale; @dbarahon said this is the easiest way to spot this):
Where the domain decomposition is clearly appearing in the field and just not physical (I don't know how else to describe this).
In order to reproduce, here is what you have to do.
First check out v11.1.1 of the model. Then update the GEOSgcm_GridComp repo to the feature/donifan/KrokMG3 branch. When I did the test, the SHA for GEOSgcm_GridComp I used was 41d6f34f035193794b454c807e8a6a61bc7f9610. Once built, clone the experiment on discover at /gpfsm/dnb78s1/bmauer/donifan_weirdness. The AGCM.rc has this for the moist options, which I assume is critical:
Note: I'm also told that @wmputman has a feature branch that works off of the develop branches of the GEOSgcm fixture, feature/wmputman/hwt_spring_exp, but I don't want to record anything here for reproducibility that depends on a branch, since branches change...
I thought user @dbarahon claimed that downgrading the optimization to O1 in the moist component fixes this (maybe I misunderstood), but that's not what my testing showed, so if there was a workaround, that was not it. Whether I compiled those files in moist with O3 or O1, I got these weird patterns clearly showing the domain.