Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexplained performance degradation on first job of case compared to subsequent jobs with pm-cpu/muller-cpu (using v3 HR F case) #6733

Open
ndkeen opened this issue Nov 4, 2024 · 4 comments
Labels
pm-cpu Perlmutter at NERSC (CPU-only nodes)

Comments

@ndkeen
Copy link
Contributor

ndkeen commented Nov 4, 2024

When running some longer cases, I noticed that the first job (say 3 months), had a much lower SYPD than all of the subsequent jobs in the same case, where the only difference is that the job was reading from local restart. The timing differences are in each daily time stamp, so not confined to init (or writing restarts, etc). Further, it looks like the additional time is in CAM_run3 which is just fairly odd.

This is with a E3SM v3 HR F cases on CPU. In this case it's muller-cpu, but other than it being slower in IO due to smaller scratch resources, it's always the same as pm-cpu. We might expect some differences in timing for init: maybe first job initializes differently than subsequent jobs all reading from local restarts -- but that diff would show in init timers. And here, I'm seeing the timing diff in every day timestamp.

Below are 2 examples of straight copies of a cases i ran on muller-cpu.

/global/cfs/cdirs/e3sm/ndk/tmpdata/20241009.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.test3.n256t120x1.coilr4.i2.dvsro

muller-login01% zgrep CAM_run3 case_scripts/timing/e3sm_timing_stats.868178.241029-164244.gz case_scripts/timing/e3sm_timing_stats.868249.241029-175657.gz 
case_scripts/timing/e3sm_timing_stats.868178.241029-164244.gz:"a:CAM_run3"                                     -      30720    30720 2.654208e+08   5.398307e+07  1831.666 (  6239      0)  1717.178 ( 22048      0)
case_scripts/timing/e3sm_timing_stats.868249.241029-175657.gz:"a:CAM_run3"                                        -      30720    30720 2.683699e+08   4.721440e+07  1619.878 (  6239      0)  1487.644 (  7904      0)

And i have another longer-running case that shows similar behavior.

muller-login01% pwd
/global/cfs/cdirs/e3sm/ndk/tmpdata/20241025.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.NoZMenh.n256t120x1.coilr4.i2.dvsro.hy
muller-login01% zgrep CAM_run3 case_scripts/timing/e3sm_timing_stats.867570.241027-121511.gz case_scripts/timing/e3sm_timing_stats.867574.241027-133107.gz
case_scripts/timing/e3sm_timing_stats.867570.241027-121511.gz:"a:CAM_run3"                                     -      30720    30720 2.654208e+08   5.840219e+07  2020.120 (  1698      0)  1782.316 ( 13686      0)
case_scripts/timing/e3sm_timing_stats.867574.241027-133107.gz:"a:CAM_run3"                                        -      30720    30720 2.683699e+08   4.781989e+07  1633.105 (  6239      0)  1508.410 (  7904      0)

muller-login01% zgrep a:SLMM_csl case_scripts/timing/e3sm_timing_stats.867570.241027-121511.gz case_scripts/timing/e3sm_timing_stats.867574.241027-133107.gz
case_scripts/timing/e3sm_timing_stats.867570.241027-121511.gz:"a:SLMM_csl"                                     -      30720    30720 5.308416e+08   1.004908e+07   698.403 (     8      0)   116.957 ( 29517      0)
case_scripts/timing/e3sm_timing_stats.867574.241027-133107.gz:"a:SLMM_csl"                                        -      30720    30720 5.367398e+08   1.015562e+07   494.818 (    17      0)   117.130 ( 27117      0)

I can also reproduce with even 5 day cases on only 90 nodes (above is on 256 nodes), but that's still with HR F case.

Sure, if the first job is first 3 months, second job is a diff 3 months -- but I can repeat this behavior over the same set of days. And, of course, we would not expect one time span to have such consistently significant performance diffs.

Note that for this launch script, it is a branch run with REF_CASE:

# Run options                                                                                                                                                                                                                                                                                                                                      
readonly MODEL_START_TYPE="branch"  # 'initial', 'continue', 'branch', 'hybrid'                                                                                                                                                                                                                                                                    
readonly START_DATE="0001-01-01"

# Additional options for 'branch' and 'hybrid'                                                                                                                                                                                                                                                                                                     
readonly GET_REFCASE=TRUE
#readonly RUN_REFDIR="/pscratch/sd/t/terai/E3SMv3_dev/20240823.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.Nomassfluxadj.pm-cpu/archive/rest/0038-01-01-00000"                                                                                                                                                                                   
readonly RUN_REFDIR="/dvs_ro/cfs/cdirs/e3sm/terai/E3SM_restarts/20240823.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.Nomassfluxadj.pm-cpu/0038-01-01-00000"
readonly RUN_REFCASE="20240823.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.Nomassfluxadj.pm-cpu"
readonly RUN_REFDATE="0038-01-01"
@ambrad
Copy link
Member

ambrad commented Nov 4, 2024

How often do you see this? Below are some timers of interest for one pair of your runs. I see two main things, both related to the various boundary exchanges:

  1. The slower run is slower almost entirely because of comm taking longer (CEDR_local is a strange exception, but in any case its timer disparity is only a small part of the gap).
  2. The max vs min times for the boundary exchanges, regardless of run, seems wildly bad, or else I'm misinterpreting or misremembering something about this setup.
e3sm_timing_stats.867570.241027-121511
"CPL:RUN_LOOP"              30720 2.654208e+08   1.303592e+08  4244.343 (     0      0)  4243.153 ( 22407      0)
"CPL:ICE_RUN"                7680 6.635520e+07   3.033537e+05    43.880 (  8036      0)    38.151 ( 28920      0)
"CPL:LND_RUN"                7680 6.635520e+07   2.027057e+06   305.028 ( 19200      0)   239.313 ( 18768      0)
"CPL:OCNT_RUN"               7680 6.635520e+07   1.461513e+04     2.794 (     0      0)     1.783 (  1000      0)
"CPL:ATM_RUN"               30720 2.654208e+08   1.074550e+08  3583.485 (  8038      0)  3427.263 ( 23695      0)
"a:CAM_run2"                30720 2.654208e+08   2.038929e+07   799.999 ( 13687      0)   526.398 ( 28858      0)
"a:CAM_run3"                30720 2.654208e+08   5.840219e+07  2020.120 (  1698      0)  1782.316 ( 13686      0)
"a:prim_run_subcycle"       30720 5.308416e+08   5.839981e+07  2020.068 (  1698      0)  1782.253 ( 13686      0)
"a:compute_andor_apply_rhs" 30720 1.592525e+10   9.402567e+06   554.237 ( 27953      0)   191.621 (     4      0)
"a:caar_bexchV"             30720 1.592525e+10   5.719391e+06   477.866 ( 27953      0)    62.222 ( 24014      0)
"a:advance_hypervis"        30720 3.185050e+09   3.514140e+06   332.999 ( 29531      0)    64.734 (   149      0)
"a:vertical_remap"          30720 1.592525e+09   7.493347e+05    34.611 ( 11355      0)    16.241 ( 26824      0)
"a:PAT_remap"               30720 5.308416e+08   3.997493e+07  1398.849 (    15      0)  1136.026 ( 28318      0)
"a:SLMM_reconstruct"        30720 5.308416e+08   1.290741e+05     5.676 ( 21596      0)     2.754 ( 27208      0)
"a:SLMM_v2x"                30720 5.308416e+08   1.199458e+05     8.162 ( 18360      0)     2.668 ( 25962      0)
"a:SLMM_csl"                30720 5.308416e+08   1.004908e+07   698.403 (     8      0)   116.957 ( 29517      0)
"a:advance_hypervis_scalar" 30720 5.308416e+08   3.254983e+06   509.355 (  1438      0)    22.302 ( 19034      0)
"a:ah_scalar_bexchV"        30720 3.185050e+09   1.143516e+06   405.546 (   116      0)     6.085 (  9356      0)
"a:CEDR"                    30720 5.308416e+08   1.876471e+07   804.927 ( 27125      0)   279.160 ( 16476      0)
"a:CEDR_local"              30720 5.308416e+08   2.346945e+06   196.466 (     4      0)    15.948 ( 28016      0)
"a:SLMM_bexchV"             30720 5.308416e+08   2.095142e+06   189.647 ( 27659      0)    14.297 ( 26513      0)
"a:SLMM vertical remap"     30720 5.308416e+08   3.655281e+06   146.044 (   838      0)    76.343 ( 27948      0)
"a:CAM_run4"                30720 2.654208e+08   4.065067e+06   157.024 ( 26513      0)   113.838 (    15      0)
"a:CAM_run1"                30720 2.654208e+08   2.454508e+07   924.043 ( 28679      0)   732.546 ( 21353      0)
e3sm_timing_stats.867574.241027-133107
"CPL:RUN_LOOP"              30720 2.683699e+08   1.209881e+08  3939.252 ( 16384      0)  3938.133 (  3825      0)
"CPL:ICE_RUN"                7680 6.709248e+07   2.961514e+05    43.217 ( 15488      0)    37.141 ( 30360      0)
"CPL:LND_RUN"                7680 6.709248e+07   2.267965e+06   335.182 ( 19200      0)   268.936 ( 18768      0)
"CPL:OCNT_RUN"               7680 6.709248e+07   1.499465e+04     2.905 (     0      0)     1.830 (  7612      0)
"CPL:ATM_RUN"               30720 2.683699e+08   9.698622e+07  3246.981 (  8039      0)  3089.476 ( 15589      0)
"a:CAM_run2"                30720 2.683699e+08   2.045468e+07   723.949 ( 20687      0)   533.660 ( 28318      0)
"a:CAM_run3"                30720 2.683699e+08   4.781989e+07  1633.105 (  6239      0)  1508.410 (  7904      0)
"a:prim_run_subcycle"       30720 5.367398e+08   4.781749e+07  1633.059 (  6239      0)  1508.322 (  7904      0)
"a:compute_andor_apply_rhs" 30720 1.610220e+10   8.445844e+06   507.389 ( 29934      0)   202.657 ( 15675      0)
"a:caar_bexchV"             30720 1.610220e+10   4.772122e+06   429.949 ( 29934      0)    60.410 (  1575      0)
"a:advance_hypervis"        30720 3.220439e+09   3.025493e+06   278.846 ( 25436      0)    69.987 ( 10018      0)
"a:vertical_remap"          30720 1.610220e+09   7.612946e+05    33.287 ( 12838      0)    16.446 ( 29450      0)
"a:PAT_remap"               30720 5.367398e+08   3.081427e+07  1074.296 (    31      0)   846.176 ( 25438      0)
"a:SLMM_reconstruct"        30720 5.367398e+08   1.310209e+05     5.900 ( 12839      0)     2.788 ( 29449      0)
"a:SLMM_v2x"                30720 5.367398e+08   1.227579e+05     9.158 ( 20400      0)     2.698 ( 26865      0)
"a:SLMM_csl"                30720 5.367398e+08   1.015562e+07   494.818 (    17      0)   117.130 ( 27117      0)
"a:advance_hypervis_scalar" 30720 5.367398e+08   2.922591e+06   323.855 ( 28557      0)    17.472 ( 17726      0)
"a:ah_scalar_bexchV"        30720 3.220439e+09   9.652018e+05   240.438 ( 30353      0)     5.553 ( 24823      0)
"a:CEDR"                    30720 5.367398e+08   9.763367e+06   503.563 ( 28964      0)   214.416 (   117      0)
"a:CEDR_local"              30720 5.367398e+08   2.373188e+06   123.821 (    31      0)    16.209 ( 30356      0)
"a:SLMM_bexchV"             30720 5.367398e+08   2.117387e+06   188.598 ( 27957      0)    13.811 ( 25553      0)
"a:SLMM vertical remap"     30720 5.367398e+08   3.722375e+06   152.013 ( 16919      0)    77.841 ( 30412      0)
"a:CAM_run4"                30720 2.683699e+08   3.938087e+06   152.999 ( 28433      0)   117.487 (    31      0)
"a:CAM_run1"                30720 2.683699e+08   2.472444e+07   928.463 ( 28679      0)   741.893 ( 12785      0)

@ambrad
Copy link
Member

ambrad commented Nov 4, 2024

Separately, have you looked at the variation across multiple initial runs? Is it possible the initial-vs-restart discrepancy is an artifact of undersampling the initial case?

It's of course always possible something really subtle related to restart is responsible, but the above timers really shouldn't be affected by the type of run.

@rljacob
Copy link
Member

rljacob commented Nov 4, 2024

To clarify, the first run is an initial run and the rest are all restart runs?

@ambrad
Copy link
Member

ambrad commented Nov 5, 2024

More thoughts:

It looks like these jobs run on 256 nodes (120 ranks/node -- leaving 8 free for system reasons, I'm guessing -- times 256 = 30720). That leaves a subset of nodes with 2 elements/rank instead of 3.

For studying performance, I would choose the number of nodes to be such that every rank gets the same number of elements, if possible. For 120 ranks per node and ne 120, that would be something like this sequence of number of nodes: 60 (12 elem/rank), 90 (8), 180 (4), 240 (3), 360 (2).

When studying a possible anomaly, I retreat to fewer nodes. It might be worth looking at this issue again on just 60 nodes.

@rljacob rljacob added the pm-cpu Perlmutter at NERSC (CPU-only nodes) label Dec 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pm-cpu Perlmutter at NERSC (CPU-only nodes)
Projects
None yet
Development

No branches or pull requests

3 participants