Implement checkpointing for bmm models #130
base: develop
I suggest we bump to 0.4.0 when merging this. I can then make a new release with all the recent additions and take care of setting up an R-universe repo.
Maybe we can discuss the naming of the arguments. `checkpoints` currently takes the number of iterations after which to save. E.g., `checkpoints = 100` means it will create a checkpoint every 100 iterations. But the `stop_after` argument takes a number of checkpoints. So `stop_after = 2` means stop after making two checkpoints (e.g., 200 iterations). This now seems inconsistent.
Alternatives 1
2
What do you think?
Versions with code:

```r
fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints_at = 25, # always N iter
                 stop_at = 50, # always N iter
                 checkpoints_folder = 'local/checkpoints2')
```

```r
fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints_at = 25, # always N iter
                 stop_at_iter = 50, # iter
                 checkpoints_folder = 'local/checkpoints2')
```

```r
fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints_at = 25, # always N iter
                 stop_at_checkpoint = 2, # N checkpoints
                 checkpoints_folder = 'local/checkpoints2')
```

```r
fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints = 4, # always N checkpoints (therefore saving at 25, 50, 75 and 100 iter)
                 stop_at_checkpoint = 2,
                 checkpoints_folder = 'local/checkpoints2')
```
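To make the two units in this naming discussion concrete, here is a minimal base R sketch of how a stopping point counted in iterations relates to one counted in checkpoints. The helper `checkpoint_schedule()` is hypothetical and exists only for this illustration; it is not part of bmm or chkptstanr.

```r
# Hypothetical helper (not part of bmm or chkptstanr): the iterations at
# which a checkpoint would be written when saving every `every` iterations.
checkpoint_schedule <- function(iter, every) {
  stopifnot(every >= 1, iter >= every)
  seq(from = every, to = iter, by = every)
}

# 100 iterations with a checkpoint every 25 iterations
schedule <- checkpoint_schedule(iter = 100, every = 25)
print(schedule)

# A stopping point given as a number of checkpoints ...
stop_at_checkpoint <- 2
stop_at_iter <- schedule[stop_at_checkpoint]
print(stop_at_iter)

# ... and the same stopping point given in iterations, converted back
print(which(schedule == stop_at_iter))

# Asking for a fixed number of checkpoints implies the spacing instead
n_checkpoints <- 4
print(100 / n_checkpoints)
```

With these values, stopping at checkpoint 2 and stopping at iteration 50 describe the same point, which is why the argument names need to signal which unit they expect.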
So, I have thought a bit about the naming of the arguments. For me, it depends on what should be specified. I will just share my thoughts:
for the stop argument:
So maybe for the stop argument:
I hope my thoughts make sense and help you decide on the naming of the arguments.
Looks good to me. So, once you have decided on a naming scheme, you can merge with develop.
@GidonFrischkorn just to update you on this. I tested it with our reference models and it worked great on my machine, but when I tested it on the server it failed, and I discovered a few more bugs that occur when not running on Windows. I've fixed most of them, and after I fix the final one (hopefully today), I will update and merge.
I can also test this in depth tomorrow on my Mac to see if there are additional problems on Mac vs. Ubuntu and Windows.
Ok, everything works on all platforms, but I am shelving this for now because I discovered that the adaptation stage is not implemented properly. I will come back to this at some point when the checkpointing adaptation is fixed. For now we can proceed without this feature.
Summary
This PR adds an option to use checkpointing for bmm models. This is available via the new "checkpoints" argument of `fit_model()`.
This uses the `chkptstanr` package as a backend to save the sampling results every "checkpoints" iterations. This is useful for long sampling runs, as it allows you to resume sampling from the last checkpoint in case of a crash or other interruption (#129).
This option should be considered experimental. It works only with the `cmdstanr` backend, and it requires you to install a forked version of `chkptstanr` from GitHub, which implements a number of bugfixes. To install the forked version, run `remotes::install_github("venpopov/chkptstanr")`. See `?fit_model` for more information on how to use the `checkpoints` argument, and see the `chkptstanr` package documentation for the motivation and benefits of using checkpoints. Closes #129.
Example usage
This will save the samples every 25 iterations. If you stop it manually and rerun, it will pick up where it left off.
Alternatively you can specify a stopping point:
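As a package-independent illustration of the stop-and-resume behaviour described in this summary, here is a toy base R sketch. It is not how chkptstanr or cmdstanr work internally: `run_with_checkpoints()`, its arguments, and the `state.rds` file name are all hypothetical, and `rnorm()` merely stands in for real sampling.

```r
# Toy stand-in for a sampler that writes its state to disk at regular
# intervals and resumes from the saved state on the next call.
run_with_checkpoints <- function(iter, every, folder, stop_after = NULL) {
  dir.create(folder, showWarnings = FALSE, recursive = TRUE)
  state_file <- file.path(folder, "state.rds")

  # Resume from the last saved state if one exists, otherwise start fresh
  state <- if (file.exists(state_file)) {
    readRDS(state_file)
  } else {
    list(done = 0, draws = numeric(0))
  }

  saved <- 0  # checkpoints written during this call
  while (state$done < iter) {
    n <- min(every, iter - state$done)
    state$draws <- c(state$draws, rnorm(n))  # placeholder for real sampling
    state$done <- state$done + n
    saveRDS(state, state_file)               # write the checkpoint
    saved <- saved + 1
    # Optional early stop, counted in checkpoints written by this call
    if (!is.null(stop_after) && saved >= stop_after) break
  }
  state
}

folder <- file.path(tempdir(), "checkpoints_demo")
unlink(folder, recursive = TRUE)  # start clean for the demo

# First call stops after two checkpoints, i.e. after 50 iterations
first <- run_with_checkpoints(iter = 100, every = 25, folder = folder,
                              stop_after = 2)
print(first$done)

# Second call finds the saved state and completes the remaining iterations
second <- run_with_checkpoints(iter = 100, every = 25, folder = folder)
print(second$done)
```

The draws from the first call are kept and extended by the second call rather than recomputed, which is the property that makes checkpointing worthwhile for long runs.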
Tests
tests/internal/test_checkpointing.R
[X] Confirm that all tests passed
[X] Confirm that devtools::check() produces no errors
Release notes