Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the automated max_blocks calculation #954

Merged
merged 7 commits into from
May 17, 2024

Conversation

apcraig
Copy link
Contributor

@apcraig apcraig commented May 13, 2024

PR checklist

  • Short (1 sentence) summary of your PR:
    Update the automated max_blocks calculation

  • Developer(s):
    apcraig, anton

  • Suggest PR reviewers from list in the column to the right.

  • Please copy the PR test results link or provide a summary of testing completed below.
    Testing still underway. Expect full test suite on derecho with intel, gnu, cray to be bit-for-bit.

  • How much do the PR code changes differ from the unmodified code?

    • bit for bit
    • different at roundoff level
    • more substantial
  • Does this PR create or have dependencies on Icepack or any other models?

    • Yes
    • No
  • Does this PR update the Icepack submodule? If so, the Icepack submodule must point to a hash on Icepack's main branch.

    • Yes
    • No
  • Does this PR add any new test cases?

    • Yes, some test cases changed to leverage and test max_blocks=-1 implementation
    • No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/. A test build of the technical docs will be performed as part of the PR testing.)

    • Yes
    • No, does the documentation need to be updated at a later time?
      • Yes
      • No
  • Please document the changes in detail, including why the changes are made. This will become part of the PR commit log.

    Update support for max_blocks=-1. This update computes the blocks required on
    each MPI task and then sets that as max_blocks if max_blocks=-1 in namelist.
    This is done in ice_distribution and is a function of the decomposition among
    other things. Refactor the decomposition computation to defer usage of max_blocks
    and eliminate the blockIndex array. Update some indentation formatting in
    ice_distribution.F90.

    Modify cice.setup and cice_decomp.csh to set max_blocks=-1 unless it's explicitly
    defined by the cice.setup -p setting.

    Fix a bug in ice_gather_scatter related to zero-ing out of the halo with the
    field_loc_noupdate setting. This was zero-ing out the blocks extra times and
    there were no problems as long as max_blocks was the same value on all MPI tasks.
    With the new implementation of max_blocks=-1, max_blocks can be different values
    on different MPI tasks. An error was generated and then the implementation
    was fixed so each block on each task is now zeroed out exactly once.

    Update diagnostics related to max_block information. Write out the min and max
    max_blocks values across MPI tasks.

    Add extra allocation/deallocation checks in ice_distribution.F90 and add
    a function, ice_memusage_allocErr, to ice_memusage.F90 that checks the
    alloc/dealloc return code, writes an error message, and aborts. This
    function could be used in other parts of the code as well.

    Fix a bug in the io_binary restart output where each task was writing some
    output when it should have just been the master task.

    Update test cases

    Update documentation

apcraig added 3 commits May 13, 2024 11:38
…uired on

each MPI task and then sets that as max_blocks is max_blocks=-1 in namelist.
This is done in ice_distribution and is a function of the decomposition among
other things.  Refactor the decomposition computation to defer usage of max_blocks
and eliminate the blockIndex array.  Update some indentation formatting in
ice_distribution.F90.

Modify cice.setup and cice_decomp.csh to set max_blocks=-1 unless it's explicitly
defined by the cice.setup -p setting.

Fix a bug in ice_gather_scatter related to zero-ing out of the halo with the
field_loc_noupdate setting.  This was zero-ing out the blocks extra times and
there were no problems as long as max_blocks was the same value on all MPI tasks.
With the new implementation of max_blocks=-1, max_blocks can be different values
on different MPI tasks.  An error was generated and then the implementation
was fixed so each block on each task is now zeroed out exactly once.

Update diagnostics related to block numbers.  Write out the min and max
max_blocks values across MPI tasks.

Add extra allocation/deallocation checks in ice_distribution.F90 and add
a function, ice_memusage_allocErr, to ice_memusage.F90 that checks the
alloc/dealloc return code, writes an error message, and aborts.  This
function could be used in other parts of the code as well.

Fix a bug in the io_binary restart output where each task was writing some
output when it should have just been the master task.

Add sectcart test case.
@apcraig
Copy link
Contributor Author

apcraig commented May 13, 2024

I still testing, refining the test suite, and updating documentation. But this should represent the code changes I'm proposing. Things are running well. The max_blocks=-1 setting now computes the maximum required blocks on each task and sets the internal max_blocks variable to that value. That means that it uses exactly the amount of memory required and the max_blocks can vary per task. Users can still manually set max_blocks in namelist as before.

@apcraig
Copy link
Contributor Author

apcraig commented May 14, 2024

Testing results look good. https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#7402dc7f04f98d840890f29f8f02a59f956a8fc2.

This is ready for review and merge.

@apcraig apcraig marked this pull request as ready for review May 14, 2024 15:38
@apcraig
Copy link
Contributor Author

apcraig commented May 15, 2024

Could someone do a review on this PR? Would love to get this merged. Then I can start comprehensively testing in preparation for a release. Thanks!

@dabail10
Copy link
Contributor

There is a lot here, so I might have missed something. I'm not going to get a chance to test this out until later (after the workshop). I will approve, but just know I might find stuff later once I have tested.

@eclare108213
Copy link
Contributor

@anton-seaice do you have time to look at this? It's probably after hours there...

Copy link
Contributor

@anton-seaice anton-seaice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good Tony.

It looks like there is coverage in the tests for both setting max_blocks automatically and from the namelist, and the tests are passing.

It looks like the test cases might be using slightly less memory when max_blocks is set automatically, but the resolution is probably too low to be significant.

@@ -227,8 +228,7 @@ but the user can overwrite the defaults by manually changing the values in
information to the log file, and if the block size or max blocks is
inconsistent with the task and thread size, the model will abort. The
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
inconsistent with the task and thread size, the model will abort. The
inconsistent with the task and thread size, the model will abort. If ``max_blocks``=-1, the model will calculate the number of blocks needed for each task. ``max_blocks`` can also be set by the user, although this may use extra memory. The

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the documentation here.

and chooses a block size, ``block_size_x`` :math:`\times`\ ``block_size_y``,
and decomposition information ``distribution_type``, ``processor_shape``,
and ``distribution_type`` in **ice_in**. ``max_blocks`` is computed
automatically if set to a value of -1, but it can also be set by the user.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
automatically if set to a value of -1, but it can also be set by the user.

I think this sentence should be at the end of the paragraph. If I understand correctly, max_blocks does not impact how the blocks are distributed, the how depends on all the other parameters.

Copy link
Contributor

@anton-seaice anton-seaice May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this paragraph is out of order:

The user sets the NTASKS and NTHRDS settings in cice.settings and chooses a block size, block_size_x block_size_y, and decomposition information distribution_type, processor_shape, and distribution_wgt in ice_in. If max_blocks=-1, the model will calculate the number of blocks needed for each task. max_blocks can also be set by the user, although this may use extra memory and the model will abort if max_blocks is set too small for the decomposition. This information is used to determine how the blocks are distributed across the processors, and how the processors are distributed across the grid domain.

"This information" should refer to the information in the first sentence but at the moment it reads like "This information" refers to "max_blocks"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, thanks for checking again. I have refactored that paragraph a bit. I think it's better now. Let me know if you have any concerns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thankyou

character(*),parameter :: subname = '(ice_memusage_allocErr)'

ice_memusage_allocErr = .false.
if (istat > 0) then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the value of istat have a meaning? Should we return istat or interpret the error in some way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a look at some documentation. I am changing this to istat /= 0 to be more correct. Couldn't find any info about return codes other than 0 is success.

!===============================================================================
! check memory alloc/dealloc return code

logical function ice_memusage_allocErr(istat, errstr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this ... unhandled errors spook me :)

! set/check max_blocks
if (lmax_blocks_calc) then
if (max_blocks < 0) then
max_blocks = newDistrb%numLocalBlocks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is calling this variable max_blocks confusing? Isn't it just the number of blocks for this task ? Where max_blocks means the maximum number for all the tasks ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_blocks is used throughout the code as the number of blocks to allocate on each task. It used to be the same value on all tasks, set by the user. Now it's computed internally and set uniquely for each task. Setting max_blocks in line 683 is doing exactly that. Because the rake decomp calls the cartesian decomp, I needed to add an option where I could turn off the setting of max_blocks in cart because I calculate it in rake. But if cart is called directly, then max_blocks is set. Every decomp option sets the max_blocks variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused here. I thought nblocks was meant to be the number of blocks on each task and this could be different. The parameter max_blocks should be the maximum number of blocks across all tasks, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dabail10, you are correct. In general, nblocks is the number of active blocks on each task. max_blocks is the number of blocks that are allocated on each task. Historically, max_blocks was used to allocate memory, was the same value on all tasks, and was set by a CPP at build time. While nblocks was used to loop over only active blocks at run time. The code still uses those two variables as they've always been defined.

But in the last few years, we moved to dynamic allocation of memory (moving max_blocks to namelist) and with the current PR, we are able to compute max_blocks internally on each task BEFORE we need to use it to allocate memory. So in that case, nblocks and max_blocks can overlap in function.

However, we still support user defined max_blocks (although we probably don't need to) which means we still want to differentiate max_blocks and nblocks.

Maybe the next step is to ignore the max_blocks namelist setting and always set it internally. Once we do that, we could, in theory, unify nblocks and max_blocks in the code to a single variable. But we're not quite there yet, and I think we could debate whether all that refactoring would be worth the effort. max_blocks and nblocks still play different roles (memory allocation vs active block count), and we're all pretty comfortable with that scheme.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a "max_blocks" in the namelist and a "max_blocks" in the code with different meanings is confusing. I see that it may not be worth it to unify max_blocks and nblocks (there are ~1000 uses of max_blocks), so I guess we just run with it as is.

This comment in ice_domain_size is now wrong:

    max_blocks  , & ! max number of blocks per processor

and maybe could be number of blocks memory is allocated for or similar?

call abort_ice(subname//'ERROR: processors left with no blocks')
newDistrb%numLocalBlocks = newDistrb%blockCnt(my_task+1)
if (newDistrb%numLocalBlocks < 0) then
call abort_ice(subname//'ERROR: processors left with no blocks', &
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we abort here, but not say if there are no processors left with no blocks for a cartesian distribution ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left the implementation as it was. Looks like rake is the only one that checks that there has to be a block on each task. I think I'll remove that constraint now. There is already a test that verifies the model runs fine with zero blocks on a task, so we've got that covered in the test suite. Good catch.

call abort_ice(subname//'ERROR: processors left with no blocks')
newDistrb%numLocalBlocks = newDistrb%blockCnt(my_task+1)
if (newDistrb%numLocalBlocks < 0) then
call abort_ice(subname//'ERROR: processors left with no blocks', &
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
call abort_ice(subname//'ERROR: processors left with no blocks', &
call abort_ice(subname//'ERROR: tasks left with no blocks', &

I am not totally on top of threading, but should this be tasks ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code has been removed.

enddo
! set/check max_blocks
if (max_blocks < 0) then
max_blocks = newDistrb%numLocalBlocks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its kind of silly for roundobin, but should we check the processor has work (i.e. numLocalBlocks /= 0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having no work on a task is allowed. In my opinion, it's up to the user to properly tune the number of processors and decomposition.

@apcraig
Copy link
Contributor Author

apcraig commented May 16, 2024

I have update the PR based on feedback from @anton-seaice and am running a set of tests just to make sure nothing is broken. Will report results when the testing is done. Thanks @anton-seaice for the comments.

@apcraig
Copy link
Contributor Author

apcraig commented May 16, 2024

I reran a portion of the test suite with the latest code changes and I think everything is OK. I'll merge once github actions passes and @anton-seaice is happy with the current implementation. Please let me know if anything else needs to be fixed. Thanks!

@anton-seaice
Copy link
Contributor

There just a couple if lines in ice_domain_size that are not totally consistent now:

    max_blocks  , & ! max number of blocks per processor

Could be updated

   !*** The model will inform the user of the correct
   !*** values for the parameter below.  A value higher than
   !*** necessary will not cause the code to fail, but will
   !*** allocate more memory than is necessary.  A value that
   !*** is too low will cause the code to exit.
   !*** A good initial guess is found using
   !*** max_blocks = (nx_global/block_size_x)*(ny_global/block_size_y)/
   !***               num_procs

Can probably be removed because its covered in the docs ?

@apcraig
Copy link
Contributor Author

apcraig commented May 16, 2024

There just a couple if lines in ice_domain_size that are not totally consistent now:

    max_blocks  , & ! max number of blocks per processor

Could be updated

   !*** The model will inform the user of the correct
   !*** values for the parameter below.  A value higher than
   !*** necessary will not cause the code to fail, but will
   !*** allocate more memory than is necessary.  A value that
   !*** is too low will cause the code to exit.
   !*** A good initial guess is found using
   !*** max_blocks = (nx_global/block_size_x)*(ny_global/block_size_y)/
   !***               num_procs

Can probably be removed because its covered in the docs ?

good catch, fixed these.

@anton-seaice
Copy link
Contributor

good catch, fixed these.

I think you still need to push the commit

Copy link
Contributor

@anton-seaice anton-seaice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Tony!

@apcraig apcraig merged commit 969a76d into CICE-Consortium:main May 17, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants