Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: SegFault when a MPI rank has no data of a sub-region #3529

Open
wants to merge 18 commits into
base: develop
Choose a base branch
from

Conversation

MelReyCG
Copy link
Contributor

@MelReyCG MelReyCG commented Jan 29, 2025

Closing #3528

My goal here is to solve the crash, and to propose a way to reduce (Min/Max) pairs lexicographically over MPI ranks. That can typically be useful to get the min/max value (pressure, temperature...) along with its globalIndex in the mesh (while ensuring that the globalIndex will be stable).
It could be generalized to tuples, let me know if that could be useful.

@MelReyCG MelReyCG added type: bug Something isn't working ci: run integrated tests Allows to run the integrated tests in GEOS CI labels Jan 29, 2025
@MelReyCG MelReyCG self-assigned this Jan 29, 2025
@MelReyCG MelReyCG linked an issue Jan 29, 2025 that may be closed by this pull request
@MelReyCG MelReyCG requested a review from arng40 January 29, 2025 16:36
@corbett5
Copy link
Contributor

I'm sure there's a reason I haven't thought about, but why not use std::pair? And those mpiPairType::get functions can be more succinctly written as templated variables.

Example: https://godbolt.org/z/3sxTeGrzs

@MelReyCG
Copy link
Contributor Author

Hi @corbett5!

  1. Yes, I indeed started with std::pair, but I did not continue with it because is not trivially copiable. As it seem for the least that it could be used in performance critical places, it seems like a reasonable requirement.
    I wonder if this is why the original code here did not go for std::pair.
    https://godbolt.org/z/KE7TEMx7d
    https://stackoverflow.com/questions/64978278/what-is-the-performance-problem-with-stdpair-assignment-operator

  2. IIUC, you suggest the following?

/* no default get() implementation, please add a template specialization and add it in the "testMpiWrapper" unit test. */
template< typename FIRST, typename SECOND >
MPI_Datatype const mpiPairType;

template<> MPI_Datatype const mpiPairType< float, int > = MPI_FLOAT_INT;
template<> MPI_Datatype const mpiPairType< double, int > = MPI_DOUBLE_INT;
template<> MPI_Datatype const mpiPairType< int, int > = MPI_2INT;
template<> MPI_Datatype const mpiPairType< long int, int > = MPI_LONG_INT;
template<> MPI_Datatype const mpiPairType< long int, long int > = getMpiCustomPairType< long int, long int >();
template<> MPI_Datatype const mpiPairType< long long int, long long int > = getMpiCustomPairType< long long int, long long int >();
template<> MPI_Datatype const mpiPairType< double, long int > = getMpiCustomPairType< double, long int >();
template<> MPI_Datatype const mpiPairType< double, long long int > = getMpiCustomPairType< double, long long int >();
template<> MPI_Datatype const mpiPairType< double, double > = getMpiCustomPairType< double, double >();

@corbett5
Copy link
Contributor

@MelReyCG

  1. Gotcha, yeah I think using std::pair would just work, but I think using the custom struct is not a bad idea.
  2. I forgot about the getMpiCustomPairType function, on second thought would it be cleaner if we eliminated the mpiPairType variable completely and just put all the logic in getMpiCustomPairType? You can't template specialize functions, but it could be done with if constexpr checks of the types I believe.

@MelReyCG
Copy link
Contributor Author

MelReyCG commented Jan 31, 2025

@corbett5 I added a new proposal in the "proposal #1" commit.
For now, I did not go for the if constexpr solution, but it would easily be feasible if you guys think it would be cleaner than this proposal.

@CusiniM
Copy link
Collaborator

CusiniM commented Jan 31, 2025

@MelReyCG

1. Gotcha, yeah I think using `std::pair` would just work, but I think using the custom struct is not a bad idea.

I remember an old discussion about the std::pair layout not being guaranteed to be contiguous in memory which I think is the reason why it was not used in the first place. I don't recall the details of how it was being used though.

2. I forgot about the `getMpiCustomPairType` function, on second thought would it be cleaner if we eliminated the `mpiPairType` variable completely and just put all the logic in `getMpiCustomPairType`? You can't template specialize functions, but it could be done with `if constexpr` checks of the types I believe.

Personally I don't have a strong preference between using if constexpr in a single function or using template specializations as @MelReyCG has done. They way it's written now seems pretty tidy and easy to read but something equally good could probably be obtained using if constexpr. Maybe this second option is a bit more modern?

@corbett5
Copy link
Contributor

Personally I don't have a strong preference between using if constexpr in a single function or using template specializations as @MelReyCG has done. They way it's written now seems pretty tidy and easy to read but something equally good could probably be obtained using if constexpr. Maybe this second option is a bit more modern?

I think the single function is cleaner, but this is in my opinion a small improvement to a minor part of this PR and I think the current implementation is fine.

@MelReyCG
Copy link
Contributor Author

MelReyCG commented Feb 3, 2025

@CusiniM

I remember an old discussion about the std::pair layout not being guaranteed to be contiguous in memory which I think is the reason why it was not used in the first place. I don't recall the details of how it was being used though.

This is exactly why I chose the struct, it seems that there is no guarantee that the compilator will be allowed to optimize std::pair<> copies as much as structures. Another source: https://www.reddit.com/r/cpp/comments/ar4ghs/stdpair_disappointing_performance/?rdt=48069

About the if constexpr, you can see that I used it in getMpiPairReductionOp(), I tend to use it more when it concerns implementation details, or when it can simplify overcomplicated std::enable_if.
For such "switch on type T" usage, I prefer the good old template. I think that's a matter of style / tastes. If the code readability is not affected, I think we can keep template specializations.

@MelReyCG
Copy link
Contributor Author

MelReyCG commented Feb 3, 2025

I finished to write the last comments I think, we may be good to go?

Copy link
Collaborator

@CusiniM CusiniM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @MelReyCG

GEOS_ERROR_IF_NE( MPI_Op_create( customOpFunc, 1, &mpiOp ), MPI_SUCCESS );
return mpiOp;
};
static MPI_Op mpiOp{ createOpHolder() };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be static? mpiOp is returned by value, so it shouldn't matter if it gets destroyed after the function returns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The static storage here does matter as MPI_Op (or MPI_Datatype for custom types) represents a persistent MPI resource that must:

  • be initialized only once to avoid mem-leaks (MPI_X_create() functions allocates internal resources),
  • remain valid for the entire MPI lifetime (MPI_Init->MPI_Finalize),
  • be shared across all calls to this function.

The static construction I used here is a variation of the "Meyer's Singleton", which lazily constructs a unique instance of an object when the resource is first requested (by calling the lambda in this case).

At first I did not mind calling MPI_Op_free() since we already call MPI_Finalize() but I found the following in the MPI documentation:

The call to MPI_FINALIZE does not free objects created by MPI calls; these objects are freed using MPI_ XXX_FREE, MPI_COMM_DISCONNECT, or MPI_FILE_CLOSE calls.

To completely manage these resources lifetime I will work a bit more on this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh god, yeah I get it. If performance isn't a concern (and I doubt it is) I would simply createa new op every time and free it every time as well. You could do this pretty easily with a RAII object that had a move constructor and deleted copy constructor.

Copy link
Contributor Author

@MelReyCG MelReyCG Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@corbett5 Indeed, a RAII object would do the job.
Yesterday I finished with another approach in which I kept using the MPI operation / data-type caching and stored their trace to release them on MPI_Finalize(). Let me know if my approach is right (commit).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer my approach (in which these objects are create and destroyed on the fly) since it would eliminate the need for static variables, but this is still quite the improvement and I'm not sure it's worth your time to rewrite it.

src/coreComponents/common/MpiWrapper.hpp Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci: run integrated tests Allows to run the integrated tests in GEOS CI flag: ready for review type: bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SegFault when a MPI rank has no data of a sub-region
3 participants