fix: SegFault when a MPI rank has no data of a sub-region #3529

MelReyCG · 2025-01-29T15:08:51Z

Closing #3528

My goal here is to fix the crash, and to propose a way to reduce (min/max) pairs lexicographically over MPI ranks. This is typically useful to get the min/max value (pressure, temperature...) along with its globalIndex in the mesh, while ensuring that the returned globalIndex is stable.
It could be generalized to tuples; let me know if that would be useful.
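
For illustration, here is a minimal sketch of a lexicographic min reduction over (value, globalIndex) pairs, emulated serially so that it runs without MPI. All names (`ValueIndex`, `lexLess`, `minIdentity`, `localMin`, `reduceMin`) are hypothetical and are not the ones used in this PR; the assumption is only that a rank with no local data contributes a neutral element instead of reading from an empty set.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical stand-in for the pair struct discussed in this PR.
struct ValueIndex
{
  double value;
  long long globalIndex;
};

// Lexicographic "less": compare values first, then break ties on the index,
// so the winner does not depend on the order in which ranks are combined.
inline bool lexLess( ValueIndex const & a, ValueIndex const & b )
{
  return a.value < b.value || ( a.value == b.value && a.globalIndex < b.globalIndex );
}

// Neutral element of the min reduction: what an empty rank contributes.
inline ValueIndex minIdentity()
{
  return { std::numeric_limits< double >::max(), std::numeric_limits< long long >::max() };
}

// Local reduction on one rank; safe when the rank owns no data.
inline ValueIndex localMin( std::vector< ValueIndex > const & data )
{
  ValueIndex best = minIdentity();
  for( ValueIndex const & p : data )
  {
    if( lexLess( p, best ) ) best = p;
  }
  return best;
}

// Serial emulation of the cross-rank step (in a real run, an MPI reduction
// with a custom operation would play this role).
inline ValueIndex reduceMin( std::vector< std::vector< ValueIndex > > const & perRank )
{
  ValueIndex best = minIdentity();
  for( auto const & rankData : perRank )
  {
    ValueIndex const candidate = localMin( rankData );
    if( lexLess( candidate, best ) ) best = candidate;
  }
  return best;
}

inline void exampleMinReduction()
{
  // The second "rank" owns no element of the sub-region.
  std::vector< std::vector< ValueIndex > > const perRank = {
    { { 3.0, 12 }, { 1.5, 40 } },
    {},
    { { 1.5, 7 } }
  };
  ValueIndex const result = reduceMin( perRank );
  assert( result.value == 1.5 );
  assert( result.globalIndex == 7 ); // tie on the value, smallest index wins
}
```

Because ties on the value are broken by the index, the result is the same whatever the order in which the per-rank candidates are combined.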


corbett5 · 2025-01-29T17:00:45Z

I'm sure there's a reason I haven't thought about, but why not use std::pair? And those mpiPairType::get functions can be more succinctly written as templated variables.

Example: https://godbolt.org/z/3sxTeGrzs

src/coreComponents/physicsSolvers/fluidFlow/CompositionalMultiphaseFVM.cpp

MelReyCG · 2025-01-30T16:28:46Z

Hi @corbett5!

Yes, I indeed started with std::pair, but I did not continue with it because it is not trivially copyable. Since this could at least potentially be used in performance-critical places, trivial copyability seems like a reasonable requirement.
I wonder if this is why the original code here did not go for std::pair.
https://godbolt.org/z/KE7TEMx7d
https://stackoverflow.com/questions/64978278/what-is-the-performance-problem-with-stdpair-assignment-operator
IIUC, you suggest the following?

/* no default get() implementation, please add a template specialization and add it in the "testMpiWrapper" unit test. */
template< typename FIRST, typename SECOND >
MPI_Datatype const mpiPairType;

template<> MPI_Datatype const mpiPairType< float, int > = MPI_FLOAT_INT;
template<> MPI_Datatype const mpiPairType< double, int > = MPI_DOUBLE_INT;
template<> MPI_Datatype const mpiPairType< int, int > = MPI_2INT;
template<> MPI_Datatype const mpiPairType< long int, int > = MPI_LONG_INT;
template<> MPI_Datatype const mpiPairType< long int, long int > = getMpiCustomPairType< long int, long int >();
template<> MPI_Datatype const mpiPairType< long long int, long long int > = getMpiCustomPairType< long long int, long long int >();
template<> MPI_Datatype const mpiPairType< double, long int > = getMpiCustomPairType< double, long int >();
template<> MPI_Datatype const mpiPairType< double, long long int > = getMpiCustomPairType< double, long long int >();
template<> MPI_Datatype const mpiPairType< double, double > = getMpiCustomPairType< double, double >();

corbett5 · 2025-01-30T18:40:04Z

@MelReyCG

Gotcha, yeah I think using std::pair would just work, but I think using the custom struct is not a bad idea.
I forgot about the getMpiCustomPairType function, on second thought would it be cleaner if we eliminated the mpiPairType variable completely and just put all the logic in getMpiCustomPairType? You can't template specialize functions, but it could be done with if constexpr checks of the types I believe.
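
A rough sketch of what the single-function `if constexpr` variant could look like. To stay compilable without MPI, it returns a hypothetical `PairTypeTag` enum instead of an `MPI_Datatype`; `getPairTypeTag`, `isOneOf` and `alwaysFalse` are illustrative names, not part of this PR. The branches mirror the type combinations listed in the specializations above.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the MPI_Datatype handles.
enum class PairTypeTag { FloatInt, DoubleInt, TwoInt, LongInt, Custom };

// Always false, but dependent on the template arguments, so the
// static_assert below only fires for unsupported instantiations.
template< typename... >
inline constexpr bool alwaysFalse = false;

// True if T is one of CANDIDATES.
template< typename T, typename... CANDIDATES >
inline constexpr bool isOneOf = ( std::is_same_v< T, CANDIDATES > || ... );

template< typename FIRST, typename SECOND >
constexpr PairTypeTag getPairTypeTag()
{
  if constexpr( std::is_same_v< FIRST, float > && std::is_same_v< SECOND, int > )
  {
    return PairTypeTag::FloatInt;   // would map to MPI_FLOAT_INT
  }
  else if constexpr( std::is_same_v< FIRST, double > && std::is_same_v< SECOND, int > )
  {
    return PairTypeTag::DoubleInt;  // would map to MPI_DOUBLE_INT
  }
  else if constexpr( std::is_same_v< FIRST, int > && std::is_same_v< SECOND, int > )
  {
    return PairTypeTag::TwoInt;     // would map to MPI_2INT
  }
  else if constexpr( std::is_same_v< FIRST, long > && std::is_same_v< SECOND, int > )
  {
    return PairTypeTag::LongInt;    // would map to MPI_LONG_INT
  }
  else if constexpr( ( std::is_same_v< FIRST, double > && isOneOf< SECOND, long, long long, double > ) ||
                     ( std::is_same_v< FIRST, SECOND > && isOneOf< FIRST, long, long long > ) )
  {
    return PairTypeTag::Custom;     // would build a custom MPI datatype
  }
  else
  {
    static_assert( alwaysFalse< FIRST, SECOND >, "Unsupported pair type: add a branch and a unit test." );
  }
}

inline void examplePairTypeDispatch()
{
  static_assert( getPairTypeTag< double, int >() == PairTypeTag::DoubleInt );
  static_assert( getPairTypeTag< double, long long >() == PairTypeTag::Custom );
  assert( ( getPairTypeTag< int, int >() == PairTypeTag::TwoInt ) );
}
```

With this layout an unsupported combination fails at compile time with the message of the `static_assert`, much like the missing specialization does in the variable-template version.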

MelReyCG · 2025-01-31T11:07:09Z

@corbett5 I added a new proposal in the "proposal #1" commit.
For now, I did not go for the if constexpr solution, but it would easily be feasible if you guys think it would be cleaner than this proposal.

CusiniM · 2025-01-31T15:54:12Z

@MelReyCG

1. Gotcha, yeah I think using `std::pair` would just work, but I think using the custom struct is not a bad idea.

I remember an old discussion about the std::pair layout not being guaranteed to be contiguous in memory which I think is the reason why it was not used in the first place. I don't recall the details of how it was being used though.

2. I forgot about the `getMpiCustomPairType` function, on second thought would it be cleaner if we eliminated the `mpiPairType` variable completely and just put all the logic in `getMpiCustomPairType`? You can't template specialize functions, but it could be done with `if constexpr` checks of the types I believe.

Personally I don't have a strong preference between using if constexpr in a single function or using template specializations as @MelReyCG has done. The way it's written now seems pretty tidy and easy to read, but something equally good could probably be obtained using if constexpr. Maybe this second option is a bit more modern?

corbett5 · 2025-01-31T17:02:31Z

Personally I don't have a strong preference between using if constexpr in a single function or using template specializations as @MelReyCG has done. The way it's written now seems pretty tidy and easy to read, but something equally good could probably be obtained using if constexpr. Maybe this second option is a bit more modern?

I think the single function is cleaner, but this is in my opinion a small improvement to a minor part of this PR and I think the current implementation is fine.

MelReyCG · 2025-02-03T16:36:45Z

@CusiniM

I remember an old discussion about the std::pair layout not being guaranteed to be contiguous in memory which I think is the reason why it was not used in the first place. I don't recall the details of how it was being used though.

This is exactly why I chose the struct: it seems there is no guarantee that the compiler will be allowed to optimize std::pair<> copies as much as copies of plain structures. Another source: https://www.reddit.com/r/cpp/comments/ar4ghs/stdpair_disappointing_performance/?rdt=48069
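
The property can be checked at compile time. The sketch below uses a hypothetical `PairStruct` aggregate (not the name used in this PR). The result for `std::pair` is only recorded, not asserted, because it depends on the standard library implementation; it is typically false, since `std::pair` declares its own assignment operators.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical plain aggregate, in the spirit of the custom struct of this PR.
template< typename FIRST, typename SECOND >
struct PairStruct
{
  FIRST first;
  SECOND second;
};

// A plain aggregate of arithmetic types is trivially copyable and has a
// standard layout, so it can be copied bytewise and sent as a raw buffer.
static_assert( std::is_trivially_copyable_v< PairStruct< double, long long > > );
static_assert( std::is_standard_layout_v< PairStruct< double, long long > > );

// Recorded rather than asserted: implementation-dependent.
inline constexpr bool stdPairIsTriviallyCopyable =
  std::is_trivially_copyable_v< std::pair< double, long long > >;

inline void examplePairTraits()
{
  assert( ( std::is_trivially_copyable_v< PairStruct< double, int > > ) );
  assert( ( std::is_standard_layout_v< PairStruct< int, int > > ) );
}
```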

About the if constexpr, you can see that I used it in getMpiPairReductionOp(), I tend to use it more when it concerns implementation details, or when it can simplify overcomplicated std::enable_if.
For such "switch on type T" usage, I prefer the good old template. I think that's a matter of style / tastes. If the code readability is not affected, I think we can keep template specializations.

MelReyCG · 2025-02-03T16:43:22Z

I think I have finished addressing the last comments, so we may be good to go?

CusiniM

LGTM. Thanks @MelReyCG

corbett5 · 2025-02-04T19:19:47Z

src/coreComponents/common/MpiWrapper.hpp

+    GEOS_ERROR_IF_NE( MPI_Op_create( customOpFunc, 1, &mpiOp ), MPI_SUCCESS );
+    return mpiOp;
+  };
+  static MPI_Op mpiOp{ createOpHolder() };


Why does this need to be static? mpiOp is returned by value, so it shouldn't matter if it gets destroyed after the function returns.

MelReyCG · 2025-02-05

The static storage here does matter, as MPI_Op (or MPI_Datatype for custom types) represents a persistent MPI resource that must:

- be initialized only once to avoid memory leaks (the MPI_X_create() functions allocate internal resources),
- remain valid for the entire MPI lifetime (MPI_Init -> MPI_Finalize),
- be shared across all calls to this function.

The static construction I used here is a variation of the "Meyers Singleton", which lazily constructs a unique instance of an object when the resource is first requested (by calling the lambda in this case).
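
A minimal sketch of that pattern, with stand-ins so that it compiles without MPI: `FakeOp` plays the role of `MPI_Op` and `createFakeOp()` the role of `MPI_Op_create()`. These names, and the counter used to observe the behavior, are assumptions for illustration only.

```cpp
#include <cassert>

// Hypothetical stand-ins for MPI_Op and MPI_Op_create().
using FakeOp = int;

// Counts how many times the resource was actually created.
inline int & creationCount()
{
  static int count = 0;
  return count;
}

inline FakeOp createFakeOp()
{
  ++creationCount();
  return 42;
}

inline FakeOp getCachedOp()
{
  // Function-local static: the lambda runs on the first call only
  // (this initialization is thread-safe since C++11). Later calls
  // return the same cached handle.
  static FakeOp const op = []()
  {
    return createFakeOp();
  }();
  return op;
}

inline void exampleLazyInit()
{
  FakeOp const a = getCachedOp();
  FakeOp const b = getCachedOp();
  assert( a == b );
  assert( creationCount() == 1 ); // created once, shared by both calls
}
```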

At first I did not bother calling MPI_Op_free() since we already call MPI_Finalize(), but I found the following in the MPI documentation:

The call to MPI_FINALIZE does not free objects created by MPI calls; these objects are freed using MPI_XXX_FREE, MPI_COMM_DISCONNECT, or MPI_FILE_CLOSE calls.

To fully manage the lifetime of these resources, I will work a bit more on this PR.

corbett5 · 2025-02-05

Oh god, yeah I get it. If performance isn't a concern (and I doubt it is) I would simply create a new op every time and free it every time as well. You could do this pretty easily with a RAII object that had a move constructor and a deleted copy constructor.
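
A sketch of such a RAII wrapper, under the assumption that the handle can be modelled by a plain integer: `OpHandle`, `createOp()` and `freeOp()` are hypothetical stand-ins for `MPI_Op`, `MPI_Op_create()` and `MPI_Op_free()`, and `ScopedOp` is an illustrative name.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-ins for MPI_Op, MPI_Op_create() and MPI_Op_free().
using OpHandle = int;
constexpr OpHandle nullOp = -1;

inline int & liveOps()
{
  static int n = 0;
  return n;
}
inline OpHandle createOp() { ++liveOps(); return 1; }
inline void freeOp( OpHandle & h ) { --liveOps(); h = nullOp; }

// Owns one handle: created in the constructor, freed in the destructor.
class ScopedOp
{
public:
  ScopedOp(): m_op( createOp() ) {}
  ~ScopedOp() { reset(); }

  // Copying would lead to a double free, so it is disabled.
  ScopedOp( ScopedOp const & ) = delete;
  ScopedOp & operator=( ScopedOp const & ) = delete;

  // Moving transfers ownership and leaves the source empty.
  ScopedOp( ScopedOp && other ) noexcept: m_op( std::exchange( other.m_op, nullOp ) ) {}
  ScopedOp & operator=( ScopedOp && other ) noexcept
  {
    if( this != &other )
    {
      reset();
      m_op = std::exchange( other.m_op, nullOp );
    }
    return *this;
  }

  OpHandle get() const { return m_op; }

private:
  void reset() { if( m_op != nullOp ) freeOp( m_op ); }
  OpHandle m_op;
};

inline void exampleScopedOp()
{
  assert( liveOps() == 0 );
  {
    ScopedOp op;
    ScopedOp moved( std::move( op ) );
    assert( liveOps() == 1 );      // ownership moved, nothing duplicated
    assert( op.get() == nullOp );  // the moved-from wrapper is empty
  }
  assert( liveOps() == 0 );        // freed exactly once at scope exit
}
```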

MelReyCG · 2025-02-06

@corbett5 Indeed, a RAII object would do the job.
Yesterday I finished a different approach: I kept the caching of MPI operations and data types, and recorded each created object so that it can be released at MPI_Finalize() time. Let me know if my approach is right (commit).
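
One way to picture that caching approach is a registry of cleanup callbacks that is flushed once, right before finalization. This is only a sketch of the idea, not the code of the commit: `ResourceRegistry`, `registry()`, `getCachedHandle()` and the counter are hypothetical, and the increments and decrements stand in for `MPI_Op_create()` and `MPI_Op_free()`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical registry of cleanup actions for cached resources.
class ResourceRegistry
{
public:
  void registerCleanup( std::function< void() > cleanup )
  {
    m_cleanups.push_back( std::move( cleanup ) );
  }

  // Meant to be called once, just before the real MPI_Finalize().
  void releaseAll()
  {
    // Release in reverse order of creation.
    for( auto it = m_cleanups.rbegin(); it != m_cleanups.rend(); ++it )
    {
      ( *it )();
    }
    m_cleanups.clear();
  }

  std::size_t pending() const { return m_cleanups.size(); }

private:
  std::vector< std::function< void() > > m_cleanups;
};

inline ResourceRegistry & registry()
{
  static ResourceRegistry r;
  return r;
}

inline int & liveHandles()
{
  static int n = 0;
  return n;
}

// Cached resource: created on first use, cleanup recorded at the same time.
inline int getCachedHandle()
{
  static int const handle = []()
  {
    ++liveHandles();                                          // stand-in for MPI_Op_create()
    registry().registerCleanup( []() { --liveHandles(); } );  // stand-in for MPI_Op_free()
    return 7;
  }();
  return handle;
}

inline void exampleRegistry()
{
  getCachedHandle();
  getCachedHandle();
  assert( liveHandles() == 1 );  // created once, even if requested twice
  registry().releaseAll();
  assert( liveHandles() == 0 );  // released before finalization
}
```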

corbett5 · 2025-02-06

I prefer my approach (in which these objects are created and destroyed on the fly) since it would eliminate the need for static variables, but this is still quite the improvement and I'm not sure it's worth your time to rewrite it.


MelReyCG added 7 commits January 29, 2025 11:07

🎨 formatting details

d89f059

🚑 Adding MpiWrapper::minPair()/maxPair() to safely find the min/max in sets of pairs over ranks. If a rank has no element in its set, it no longer results in a crash.

77e48a6

✅ Adding a unit test for MPI pair reduction. Could be extended later.

e89330d

📦 schema

175b09c

Merge commit 'ac45ee2a112dca04a2ecddacfc8252d2a486c29e' into bugfix/rey/system-solution-scaling-crash-fix

a466974

🔇 remoging debug logs

a51fbae

🐛 changing primitive types

367836f

MelReyCG added the type: bug and ci: run integrated tests labels Jan 29, 2025

MelReyCG self-assigned this Jan 29, 2025

MelReyCG requested review from CusiniM, paveltomin, ryar9534, rrsettgast, corbett5 and wrtobin as code owners January 29, 2025 15:08

MelReyCG linked an issue Jan 29, 2025 that may be closed by this pull request

SegFault when a MPI rank has no data of a sub-region #3528

Open

MelReyCG added 2 commits January 29, 2025 16:32

♻️ minPair()/maxPair()->min()/max()

f763763

✅ unit test update

fbb84fe

MelReyCG requested a review from arng40 January 29, 2025 16:36

♻️ minPair()/maxPair()->min()/max()

a49208e

CusiniM reviewed Jan 29, 2025

src/coreComponents/physicsSolvers/fluidFlow/CompositionalMultiphaseFVM.cpp (outdated)

⚰️ removing dead code (valueAndLocation struct & debug variable)

03977cc

🎨 code style

2ccefa7

MelReyCG added 2 commits January 31, 2025 12:09

🎨 proposal #1

cfb41e3

🎨 uncrustify

a3d29ba

Merge branch 'develop' into bugfix/rey/system-solution-scaling-crash-fix

1dd7fd7

MelReyCG added the DO NOT MERGE ! label Feb 3, 2025

finished last comments

bf9ef33

MelReyCG added flag: ready for review and removed DO NOT MERGE ! labels Feb 3, 2025

MelReyCG requested a review from CusiniM February 4, 2025 08:49

CusiniM approved these changes Feb 4, 2025

corbett5 reviewed Feb 4, 2025

MelReyCG added the DO NOT MERGE ! label Feb 5, 2025

MelReyCG and others added 2 commits February 5, 2025 17:07

🐛 Mem-leak fix: Added MPI resource management for mpi-types and mpi-operation

d49b850

Merge branch 'develop' into bugfix/rey/system-solution-scaling-crash-fix

79acdb8

MelReyCG removed the DO NOT MERGE ! label Feb 5, 2025
