SIGTERM Handling #21
Conversation
☂️ Python Coverage
Overall Coverage
New Files: No new covered files...
Modified Files:
Test Results: 6 files, 6 suites, 1h 3m 56s ⏱️. Results for commit 1659d57.
```diff
@@ -56,7 +90,7 @@ def _upload_logs(self) -> None:

     def _trainer_fit(self, *args, **kwargs):
         try:
-            self.pl_trainer.fit(*args, **kwargs)
+            self.pl_trainer.fit(*args, ckpt_path=self.resume, **kwargs)
```
Unimportant for this PR, but when we resume from a checkpoint, should we also consider resuming the state of some hyperparameters? I'm not sure whether that is already abstracted and/or a feature of PyTorch Lightning.
Yes, we should probably store all states (optimizer, scheduler, ...) so we can truly continue with the training.
Yeah, PyTorch Lightning does this automatically.
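For context, here is a minimal sketch of how that automatic restoration looks when `ckpt_path` is passed to PyTorch Lightning's `Trainer.fit`; the module, data, and checkpoint path below are hypothetical placeholders, not code from this repository:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(pl.LightningModule):
    """Placeholder module used only to illustrate resuming; not from this repo."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


loader = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)
trainer = pl.Trainer(max_epochs=10)

# Passing ckpt_path restores the model weights, optimizer and LR-scheduler
# state, and the epoch/step counters, so training continues where it left off.
trainer.fit(TinyModel(), loader, ckpt_path="path/to/last.ckpt")  # placeholder path
```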
* handling SIGTERM signal
* resume argument takes path
Added handling of the SIGTERM signal. The current state of the training is saved and can later be resumed using the --resume flag.
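A rough sketch of this kind of SIGTERM handling, written as a standalone PyTorch Lightning callback; the callback name, checkpoint filename, and wiring are assumptions for illustration and not the code in this PR:

```python
import signal

import pytorch_lightning as pl


class SaveOnSIGTERM(pl.Callback):
    """Hypothetical sketch: persist training state when SIGTERM arrives."""

    def __init__(self, ckpt_path: str = "sigterm.ckpt"):  # placeholder filename
        self.ckpt_path = ckpt_path
        self._sigterm_received = False

    def setup(self, trainer, pl_module, stage=None):
        # Only record that the signal arrived; defer the actual save to a
        # safe point between batches.
        signal.signal(
            signal.SIGTERM,
            lambda signum, frame: setattr(self, "_sigterm_received", True),
        )

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if self._sigterm_received:
            # Save the full state (weights, optimizer, scheduler, counters) so
            # training can be resumed later, then stop the Trainer cleanly.
            trainer.save_checkpoint(self.ckpt_path)
            trainer.should_stop = True
```

On restart, the saved checkpoint can then be handed back through the --resume flag, which this PR forwards to `ckpt_path` in `_trainer_fit`.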