Failures in ensemble GETKF linear observer: obs types, ens members, memory #455

Open
metdyn (Collaborator) opened this issue Nov 12, 2024 · 0 comments

Finding 1:

When testing observations for GETKF with the linear observer, the following obs types cause the code (fv3jedi_letkf.x) to hang for a 2-member C12 case (Lev=72), on either 1 node (120 cores) or 2 nodes. [JEDI version: 08/31/24] A timeout-based isolation sketch follows the list.

Obs types that fail with the LGETKF code:
   - iasi_metop-b
   - iasi_metop-c
   - cris-fsr_n20
   - cris-fsr_npp
   - gmi_gpm
   - gps
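
For reference, below is a minimal isolation sketch of the kind used to narrow this down: run fv3jedi_letkf.x once per obs type under a wall-clock timeout so a hang surfaces as a non-zero exit instead of a stuck job. It assumes a per-obs-type YAML config already exists for each instrument; the file names, task count, and timeout value are placeholders, not the configs used for the runs reported here.

```bash
#!/usr/bin/env bash
# Hypothetical isolation loop: one fv3jedi_letkf.x run per obs type, capped by
# a timeout so a hanging obs type shows up as a failure rather than a stall.
set -u

OBS_TYPES="iasi_metop-b iasi_metop-c cris-fsr_n20 cris-fsr_npp gmi_gpm gps"

for obs in ${OBS_TYPES}; do
    cfg="getkf_linear_obs_${obs}.yaml"   # placeholder per-obs-type config
    log="getkf_linear_obs_${obs}.log"
    echo "=== testing ${obs} ==="
    # 30-minute cap; adjust to the expected single-obs-type runtime.
    if timeout 30m srun -n 120 ./fv3jedi_letkf.x "${cfg}" > "${log}" 2>&1; then
        echo "${obs}: completed"
    else
        echo "${obs}: failed or hung (see ${log})"
    fi
done
```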

Finding 2:

Even after excluding the obs types listed above, I have difficulty running GETKF with the linear observer for 32 members in the C90 case (LM=72).

------------------------------------------------------------------------------------------
C90 + L72 + GETKF + Linear_obs + SCU17 + letkf.x [08/31/24] + 120 cores/node
------------------------------------------------------------------------------------------
#ob_tot   #memb  #node  timing (s)  per_task_mem (GB)  final_tot_mem (GB)
------------------------------------------------------------------------------------------
ob_1_3       3      6     698.        1.35 / 3.49         1322.34
ob_1_6       3      6     775.96      1.51 / 1.89         1799.64
ob_1_12      3      6     1202.10     3.55 / 3.84         2828.68
ob_1_18      3      6     Fail
ob_1_18      3     12     Fail        4.21  (stops at "starting ensemble member 1" / OOPS_STATS GETKFSolver calculate hofx)
ob_1_24                   NA

ob_1_3      32     12     11887.78    1.57 / 2.79         3538.89
ob_1_6      32     12     12146.14    2.52 / 3.25         4755.65
ob_1_12     32     12     Fail
ob_1_18
------------------------------------------------------------------------------------------

Obs types that work with the LGETKF code:
   1       - sondes
   2       - amsua_aqua
   3       - amsua_metop-b
   4       - amsua_metop-c
   5       - amsua_n15
   6       - amsua_n18
   7       - amsua_n19
   8       - amsr2_gcom-w1
   9       - atms_n20
  10       - atms_npp
  11       - avhrr3_metop-b
  12       - avhrr3_n18
  13       - avhrr3_n19
  14       - scatwind
  15       - sfcship
  16       - sfc
  17       - mhs_metop-b
  18       - mhs_metop-c
  19       - mhs_n19
  20       - mls55_aura
  21       - omi_aura
  22       - ompsnm_npp
  23       - pibal
  24       - ssmis_f17
  25       - airs_aqua

Reasoning:

For contrast, a standalone HofX calculation is shown in the table below. The pattern seems to be:

  • The HofX calculation is memory intensive. An SCU17 node has about 480 GB of memory, so I should populate only half of each node to leave more memory per task (a half-node job-layout sketch follows the table).
  • For the ensemble LGETKF, memory per task does not scale with the number of members, which is the expected behavior.
------------------------------------------------------------------------------------------
C90 + L72 + HofX + SCU17 + fv3-jedi.x [08/31/24] + 24 cores
------------------------------------------------------------------------------------------
#ob_tot   #node              timing (s)  per_task_mem (GB)  final_tot_mem (GB)
------------------------------------------------------------------------------------------
ob_1_33   1 node [24 cores]  356.45      5.91 / 7.36         154.75
------------------------------------------------------------------------------------------
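
Below is a minimal sketch of the half-node job layout suggested above, assuming the cluster uses SLURM and that fv3jedi_letkf.x takes a single YAML config on the command line. The config name, task counts, and wall time are placeholders, and partition/account directives a real SCU17 job would need are omitted.

```bash
#!/usr/bin/env bash
# Sketch of a half-populated node layout. Requesting 60 tasks on each
# 120-core SCU17 node leaves roughly 480 GB / 60 ~= 8 GB per task instead
# of ~4 GB when the node is fully packed.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=60
#SBATCH --cpus-per-task=2     # each task owns two cores' share of the node memory
#SBATCH --time=02:00:00

# 2 nodes x 60 tasks = 120 MPI tasks total; config file name is a placeholder.
srun -n 120 ./fv3jedi_letkf.x getkf_linear_obs.yaml
```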