
Implement checkpointing for bmm models #130

Draft
venpopov wants to merge 6 commits into develop

Conversation

@venpopov (Owner) commented Feb 24, 2024

Summary

This PR adds an option to use checkpointing for bmm models, available via the new `checkpoints` argument of `fit_model()`.

This uses the chkptstanr package as a backend to save the sampling results every `checkpoints` iterations. This is useful for long sampling runs, as it allows you to resume sampling from the last checkpoint in case of a crash or other interruption (#129). This option should be considered experimental. It works only with the cmdstanr backend, and it requires you to install a forked version of chkptstanr from GitHub, which implements a number of bugfixes. To install the forked version, run `remotes::install_github("venpopov/chkptstanr")`. See `?fit_model` for more information on how to use the `checkpoints` argument, and see the chkptstanr package documentation for the motivation and benefits of using checkpoints. Closes #129.
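
One-time setup before running the examples below (a minimal sketch; assumes the remotes package is already installed):

# install the forked chkptstanr with the bugfixes mentioned above
remotes::install_github("venpopov/chkptstanr")
library(bmm)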

Example usage

data <- OberauerLin_2017
formula <- bmf(c ~ set_size, kappa ~ 1)
model <- sdmSimple('dev_rad')

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints = 25,                          # save samples every 25 iterations
                 checkpoints_folder = 'local/checkpoints2') # where checkpoints are stored

This will save the samples every 25 iterations. If you stop it manually and rerun, it will pick up where it left off.
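
For example, after an interruption you can simply rerun the identical call (a sketch, assuming the same `checkpoints_folder` as above); the existing checkpoints are detected and sampling resumes from the last one:

# rerun the same call after an interruption; sampling picks up
# from the last checkpoint saved in 'local/checkpoints2'
fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints = 25,
                 checkpoints_folder = 'local/checkpoints2')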

Alternatively, you can specify a stopping point:

data <- OberauerLin_2017
formula <- bmf(c ~ set_size, kappa ~ 1)
model <- sdmSimple('dev_rad')

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints = 25,
                 stop_after = 2,  # stops after 2 checkpoints
                 checkpoints_folder = 'local/checkpoints2')

Tests

  • New local tests for this in `tests/internal/test_checkpointing.R`

[X] Confirm that all tests passed
[X] Confirm that devtools::check() produces no errors

Release notes

@venpopov venpopov added PR - minor Pull-request should update minor version enhancement - new feature New user or developer feature labels Feb 24, 2024
@venpopov venpopov added this to the 1.0.0 milestone Feb 24, 2024
@venpopov (Owner, Author) commented:

I suggest we bump to 0.4.0 when merging this, and then I can make a new release with all the recent additions and take care of setting up an R-universe repo.

@venpopov (Owner, Author) commented Feb 24, 2024

Maybe we can discuss the naming of the arguments.

`checkpoints` currently takes a number of iterations after which to save. E.g., `checkpoints = 100` means it will create a checkpoint every 100 iterations.

But the `stop_after` argument takes a number of checkpoints. So `stop_after = 2` means stop after making two checkpoints (e.g., 200 iterations).

This now seems inconsistent.

Alternatives

1

  • rename `checkpoints` to `checkpoints_at`. It still takes a number of iterations, but now the name is more intuitive
  • rename `stop_after` to `stop_at_checkpoint` or `stop_at_iter`, where it takes either a number of checkpoints or a number of iterations before interrupting sampling

2

  • keep the name `checkpoints`, but have it take the number of checkpoints. E.g., `checkpoints = 4` means it will make a checkpoint every `iter/4` iterations. If we have 2000 iterations total, this gives checkpoints at 500, 1000, 1500 and 2000
  • keep `stop_after`, or rename it to `stop_at` or `stop_at_checkpoint`

What do you think?

Versions with code

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints_at = 25,      # always N iter
                 stop_at = 50,             # always N iter
                 checkpoints_folder = 'local/checkpoints2')

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints_at = 25,      # always N iter
                 stop_at_iter = 50,        # iter
                 checkpoints_folder = 'local/checkpoints2')

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints_at = 25,      # always N iter
                 stop_at_checkpoint = 2,   # N checkpoints
                 checkpoints_folder = 'local/checkpoints2')

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints = 4,          # always N checkpoints (therefore saving at 25, 50, 75 and 100 iter)
                 stop_at_checkpoint = 2,
                 checkpoints_folder = 'local/checkpoints2')

@GidonFrischkorn (Collaborator) commented Feb 24, 2024

So, I have thought a bit about the naming of the arguments. For me it depends on what should be specified. I will just share my thoughts:

  • if a number of iterations is provided, meaning samples should be saved after every N iterations, then I would use `checkpoint_each`, clarifying that a checkpoint is made as soon as N iterations have finished
  • if an absolute iteration number is passed, maybe even a vector of them, then `checkpoint_at` makes more sense to me, as you are giving the absolute value(s) at which the checkpoints should be made
  • if a number of checkpoints is passed, then I would name the argument `checkpoints` or, even more explicitly, `n_checkpoints`

For the stop argument:

  • if the number of iterations after which sampling should stop is passed, then `stop_after` or `stop_after_iter` makes sense to me (though you could also argue for naming it after the iteration at which sampling stops, speaking for `stop_at_iter`)
  • if the checkpoint at which sampling should stop is passed, then `stop_at` or `stop_at_checkpoint` makes sense to me

So maybe for the stop argument, `stop_at_iter` and `stop_at_checkpoint` are the most consistent and explicit.
For the checkpoints argument, I find it more challenging to find a coherent naming scheme. Generally, I like having the option to specify either the number of checkpoints or a checkpoint each time a certain number of iterations has been reached. So, if it is not too complicated, maybe we could have both `checkpoints` and `checkpoints_each` (if you do not like the "each", then I am also fine with `checkpoints_at`).
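
For illustration, a sketch of how calls might look under this scheme (these argument names are proposals only, not implemented in this PR):

# proposal: checkpoint every 25 iterations, stop at a given checkpoint
fit <- fit_model(formula, data, model,
                 backend = 'cmdstanr',
                 iter = 100,
                 checkpoints_each = 25,
                 stop_at_checkpoint = 2,
                 checkpoints_folder = 'local/checkpoints2')

# proposal: four checkpoints in total (every iter/4), stop at a given iteration
fit <- fit_model(formula, data, model,
                 backend = 'cmdstanr',
                 iter = 100,
                 checkpoints = 4,
                 stop_at_iter = 50,
                 checkpoints_folder = 'local/checkpoints2')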

I hope my thoughts make sense and help you in making a decision about the naming of the arguments.

@GidonFrischkorn (Collaborator) left a review comment

Looks good to me. Once you have decided on a naming scheme, you are good to merge into develop.

@venpopov (Owner, Author) commented:

@GidonFrischkorn just to update you on this: I tested it with our reference models and it worked great on my machine, but when I tested it on the server it failed, and I discovered a few more bugs when not running on Windows. I've fixed most of them, and after I fix the final one (hopefully today) I will update and merge.

@GidonFrischkorn (Collaborator) commented:

I can also get to testing this in depth tomorrow on my Mac to see if there are additional problems on Mac vs. Ubuntu & Windows.

@venpopov (Owner, Author) commented:

OK, everything works on all platforms, but I am shelving this for now because I discovered that the adaptation stage is not implemented properly. I will come back to this at some point when the checkpointing adaptation is fixed. For now we can proceed without this feature.

@venpopov venpopov marked this pull request as draft February 28, 2024 04:03
@venpopov venpopov linked an issue Feb 28, 2024 that may be closed by this pull request
@venpopov venpopov removed this from the 1.0.0 milestone May 22, 2024
Successfully merging this pull request may close these issues: Implement checkpointing functionality; Extend checkpoint functionality.