Handle crashing of micro simulations in a proper way #74

Closed
IshaanDesai opened this issue Jan 29, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@IshaanDesai
Member

Currently, when the Micro Manager is controlling and running micro simulations and one of them crashes or exits improperly, the Micro Manager run just hangs. It is nearly impossible to tell which micro simulation crashed or why the overall execution has hung.

The Micro Manager should be able to handle a simulation crash by ...

  • ... continuing to run the rest of the micro simulations.
  • ... logging the simulation crash in the log file.
  • ... providing the macro location at which the simulation crashed.
  • ... creating a new log file and writing the error output of the crashed simulation to it.
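
The four behaviours listed above could look roughly like the following sketch. It is not the Micro Manager's actual code: the function `solve_all`, the file name pattern `micro_sim_<i>_crash.log`, and the assumption that each micro simulation object exposes a `solve(macro_data, dt)` method are all illustrative. It also only covers failures that surface as Python exceptions; a segmentation fault or a genuine hang in compiled code would not be caught this way.

```python
import logging
import traceback
from pathlib import Path


def solve_all(micro_sims, macro_coords, macro_data, dt, log_dir="."):
    """Solve every micro simulation, recording crashes instead of stopping.

    Returns (results, crashed): results[i] is None where simulation i failed,
    and crashed lists the indices of the failed simulations.
    All names here are hypothetical, chosen for illustration only.
    """
    logger = logging.getLogger("micro_manager_sketch")
    results = [None] * len(micro_sims)
    crashed = []
    for i, sim in enumerate(micro_sims):
        try:
            results[i] = sim.solve(macro_data[i], dt)
        except Exception as exc:  # hard crashes (e.g. segfaults) are not caught
            crashed.append(i)
            # Main log: which simulation failed and at which macro location.
            logger.error(
                "Micro simulation %d at macro location %s crashed: %s",
                i, macro_coords[i], exc,
            )
            # Separate log file with the full error output of this simulation.
            crash_file = Path(log_dir) / f"micro_sim_{i}_crash.log"
            crash_file.write_text(traceback.format_exc())
    return results, crashed
```

The loop keeps going after a failure, so the caller receives both the partial results and the list of failed indices and can decide what to do with the gaps.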

It is not yet clear whether all of these can be done, so some investigation is needed first.

IshaanDesai added the enhancement and bug labels and then removed the enhancement label on Jan 29, 2024
@uekerman
Member

uekerman commented Feb 7, 2024

Ideally, the complete simulation could continue to run by replacing the result of the failing micro simulation with that of a similar simulation.
Technically, this should be handled by catching exceptions.
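
As a minimal sketch of the substitution idea, assuming "similar" means the smallest Euclidean distance between macro input vectors (an assumption, the thread does not fix a similarity measure), a crashed simulation's missing result could be filled in like this. The function name `substitute_from_similar` and the data layout are hypothetical.

```python
import math


def substitute_from_similar(results, inputs, crashed):
    """Fill each crashed entry with the result of the most similar healthy one.

    inputs[i] is a sequence of floats describing the macro input of
    simulation i. Euclidean distance as the similarity measure is an
    assumption made for this sketch.
    """
    crashed_set = set(crashed)
    healthy = [i for i in range(len(results)) if i not in crashed_set]
    if not healthy:
        raise RuntimeError("No successful micro simulation to substitute from")
    filled = list(results)
    for i in crashed:
        # Pick the healthy simulation whose macro input is closest to this one.
        nearest = min(healthy, key=lambda j: math.dist(inputs[i], inputs[j]))
        filled[i] = results[nearest]
    return filled
```

The original `results` list is left untouched, so the caller can still tell which entries were computed and which were substituted.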

@tjwsch
Collaborator

tjwsch commented Mar 5, 2024

How would we decide which simulation is similar enough? Especially in the case of adaptivity, as I understand it, simulations that are similar to each other rely on one to run, and if it hangs or crashes, these inactive simulations would not help resolve the crash/hang.
Should this crashed simulation rely on data from another simulation from the point of failure until the end of the simulation, or should it attempt to restart somehow?
As the macro simulation can rely on data from all micro simulations, it only makes sense to continue a run if all micro simulations provide data or the necessary data is provided by other means, right? Continuing an incomplete simulation would run into more problems. Thus, the simulation could also run into cases when it might be best to abort the entire solving process if continuing the complete simulation is unreasonable due to a lack of a similar simulation to replace the crashed one with. In such a case, is there a way to let the other participant know that the simulation has been stopped early and finalize the incomplete simulation properly?
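
One way to make the "continue or abort" decision concrete is a distance tolerance: continue only if every crashed simulation has a healthy neighbour within that tolerance, otherwise abort. The sketch below assumes Euclidean distance between macro input vectors and a user-chosen `tolerance`; the exception class `MicroSimulationAbort` and the function `check_substitutes` are hypothetical names. It does not address how the other coupling participant would be told about the abort.

```python
import math


class MicroSimulationAbort(RuntimeError):
    """Raised when a crashed micro simulation has no acceptable substitute."""


def check_substitutes(inputs, crashed, tolerance):
    """Map each crashed index to a substitute index, or raise to abort.

    A substitute is acceptable only if its macro input lies within
    `tolerance` (Euclidean distance, an assumption) of the crashed one.
    """
    crashed_set = set(crashed)
    healthy = [i for i in range(len(inputs)) if i not in crashed_set]
    mapping = {}
    for i in crashed:
        # (distance, index) pairs, so min() returns the closest candidate.
        candidates = [(math.dist(inputs[i], inputs[j]), j) for j in healthy]
        if not candidates or min(candidates)[0] > tolerance:
            raise MicroSimulationAbort(
                f"No similar simulation for micro simulation {i}"
            )
        mapping[i] = min(candidates)[1]
    return mapping
```

Raising a dedicated exception lets the calling code distinguish a deliberate abort from an unexpected failure and shut down in a controlled way.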

@uekerman
Member

uekerman commented Mar 7, 2024

> Especially in the case of adaptivity, as I understand it, simulations that are similar to each other rely on one to run, and if it hangs or crashes, these inactive simulations would not help resolve the crash/hang.

It could be the next similar one.

> Should this crashed simulation rely on data from another simulation from the point of failure until the end of the simulation, or should it attempt to restart somehow?

No restart. What we had in mind were cases that fail deterministically, so a restart would fail again.

> Thus, the simulation could also run into cases when it might be best to abort the entire solving process if continuing the complete simulation is unreasonable due to a lack of a similar simulation to replace the crashed one with.

Agree.

> is there a way to let the other participant know that the simulation has been stopped early and finalize the incomplete simulation properly?

Not currently, but this is under discussion in precice/precice#1118.

@IshaanDesai
Member Author

Resolved via #85
