
Implement checkpointing for bmm models #130

Draft
venpopov wants to merge 6 commits into develop

Conversation

@venpopov (Owner) commented Feb 24, 2024

Summary

This PR adds an option to use checkpointing for bmm models, available via the new `checkpoints` argument of `fit_model()`.

This uses the chkptstanr package as a backend to save the sampling results every `checkpoints` iterations. This is useful for long sampling runs, as it allows you to resume sampling from the last checkpoint in case of a crash or other interruption (#129). This option should be considered experimental. It works only with the cmdstanr backend, and it requires you to install a forked version of chkptstanr from GitHub, which implements a number of bugfixes. To install the forked version, run `remotes::install_github("venpopov/chkptstanr")`. See `?fit_model` for more information on how to use the `checkpoints` argument, and see the chkptstanr package documentation for the motivation and benefits of using checkpoints. Closes #129.
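
One-time setup before running the examples below (a minimal sketch; assumes the remotes package is already installed):

# install the forked chkptstanr with the bugfixes mentioned above
remotes::install_github("venpopov/chkptstanr")
library(bmm)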

Example usage

data <- OberauerLin_2017
formula <- bmf(c ~ set_size, kappa ~ 1)
model <- sdmSimple('dev_rad')

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints = 25,                          # save samples every 25 iterations
                 checkpoints_folder = 'local/checkpoints2') # where checkpoints are stored

This will save the samples every 25 iterations. If you stop it manually and rerun, it will pick up where it left off.
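
For example, after an interruption you can simply rerun the identical call (a sketch, assuming the same `checkpoints_folder` as above); the existing checkpoints are detected and sampling resumes from the last one:

# rerun the same call after an interruption; sampling picks up
# from the last checkpoint saved in 'local/checkpoints2'
fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints = 25,
                 checkpoints_folder = 'local/checkpoints2')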

Alternatively, you can specify a stopping point:

data <- OberauerLin_2017
formula <- bmf(c ~ set_size, kappa ~ 1)
model <- sdmSimple('dev_rad')

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints = 25,
                 stop_after = 2,  # stops after 2 checkpoints
                 checkpoints_folder = 'local/checkpoints2')

Tests

  • New local tests for this in `tests/internal/test_checkpointing.R`

[X] Confirm that all tests passed
[X] Confirm that devtools::check() produces no errors

Release notes

@venpopov venpopov added PR - minor Pull-request should update minor version enhancement - new feature New user or developer feature labels Feb 24, 2024
@venpopov venpopov added this to the 1.0.0 milestone Feb 24, 2024
@venpopov (Owner, Author) commented:

I suggest we bump to 0.4.0 when merging this, and then I can make a new release with all the recent additions and take care of setting up an R-universe repo.

@venpopov (Owner, Author) commented Feb 24, 2024

Maybe we can discuss the naming of the arguments.

`checkpoints` currently takes a number of iterations after which to save. E.g., `checkpoints = 100` means it will create a checkpoint every 100 iterations.

But the `stop_after` argument takes a number of checkpoints. So `stop_after = 2` means stop after making two checkpoints (e.g., 200 iterations).

This now seems inconsistent.

Alternatives

1

  • rename `checkpoints` to `checkpoints_at`. It still takes a number of iterations, but now the name is more intuitive
  • rename `stop_after` to `stop_at_checkpoint` or `stop_at_iter`, where it takes either a number of checkpoints or a number of iterations before interrupting sampling

2

  • keep the name `checkpoints`, but have it take the number of checkpoints. E.g., `checkpoints = 4` means it will make a checkpoint every `iter/4` iterations. If we have 2000 iterations total, this gives checkpoints at 500, 1000, 1500 and 2000
  • keep `stop_after`, or rename it to `stop_at` or `stop_at_checkpoint`

What do you think?

Versions with code

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints_at = 25,      # always N iter
                 stop_at = 50,             # always N iter
                 checkpoints_folder = 'local/checkpoints2')

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints_at = 25,      # always N iter
                 stop_at_iter = 50,        # iter
                 checkpoints_folder = 'local/checkpoints2')

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints_at = 25,      # always N iter
                 stop_at_checkpoint = 2,   # N checkpoints
                 checkpoints_folder = 'local/checkpoints2')

fit <- fit_model(formula, data, model,
                 parallel = TRUE,
                 backend = 'cmdstanr',
                 sort_data = TRUE,
                 iter = 100,
                 checkpoints = 4,          # always N checkpoints (therefore saving at 25, 50, 75 and 100 iter)
                 stop_at_checkpoint = 2,
                 checkpoints_folder = 'local/checkpoints2')

@GidonFrischkorn (Collaborator) commented Feb 24, 2024

So, I have thought a bit about the naming of the arguments. For me it depends on what should be specified. I will just share my thoughts:

  • if a number of iterations is provided, meaning samples should be saved after every N iterations, then I would use `checkpoint_each`, clarifying that a checkpoint is made as soon as N iterations have finished
  • if an absolute iteration number is passed, maybe even a vector of them, then `checkpoint_at` makes more sense to me, as you are giving the absolute value(s) at which the checkpoints should be made
  • if a number of checkpoints is passed, then I would name the argument `checkpoints` or, even more explicitly, `n_checkpoints`

For the stop argument:

  • if the number of iterations after which sampling should stop is passed, then `stop_after` or `stop_after_iter` makes sense to me (though you could also argue for naming it after the iteration at which sampling stops, speaking for `stop_at_iter`)
  • if the checkpoint at which sampling should stop is passed, then `stop_at` or `stop_at_checkpoint` makes sense to me

So maybe for the stop argument, `stop_at_iter` and `stop_at_checkpoint` are the most consistent and explicit.
For the checkpoints argument, I find it more challenging to find a coherent naming scheme. Generally, I like having the option to specify either the number of checkpoints or a checkpoint each time a certain number of iterations has been reached. So, if it is not too complicated, maybe we could have both `checkpoints` and `checkpoints_each` (if you do not like the "each", then I am also fine with `checkpoints_at`).
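
For illustration, a sketch of how calls might look under this scheme (these argument names are proposals only, not implemented in this PR):

# proposal: checkpoint every 25 iterations, stop at a given checkpoint
fit <- fit_model(formula, data, model,
                 backend = 'cmdstanr',
                 iter = 100,
                 checkpoints_each = 25,
                 stop_at_checkpoint = 2,
                 checkpoints_folder = 'local/checkpoints2')

# proposal: four checkpoints in total (every iter/4), stop at a given iteration
fit <- fit_model(formula, data, model,
                 backend = 'cmdstanr',
                 iter = 100,
                 checkpoints = 4,
                 stop_at_iter = 50,
                 checkpoints_folder = 'local/checkpoints2')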

I hope my thoughts make sense and help you in making a decision about the naming of the arguments.

@GidonFrischkorn (Collaborator) left a review comment

Looks good to me. Once you have decided on a naming scheme, you are good to merge into develop.

@venpopov (Owner, Author) commented:

@GidonFrischkorn just to update you on this: I tested it with our reference models and it worked great on my machine, but when I tested it on the server it failed, and I discovered a few more bugs when not running on Windows. I've fixed most of them, and after I fix the final one (hopefully today) I will update and merge.

@GidonFrischkorn (Collaborator) commented:

I can also get to testing this in depth tomorrow on my Mac to see if there are additional problems on Mac vs. Ubuntu & Windows.

@venpopov (Owner, Author) commented:

OK, everything works on all platforms, but I am shelving this for now because I discovered that the adaptation stage is not implemented properly. I will come back to this at some point when the checkpointing adaptation is fixed. For now we can proceed without this feature.

@venpopov venpopov marked this pull request as draft February 28, 2024 04:03
@venpopov venpopov linked an issue Feb 28, 2024 that may be closed by this pull request
@venpopov venpopov removed this from the 1.0.0 milestone May 22, 2024
Successfully merging this pull request may close these issues: Implement checkpointing functionality; Extend checkpoint functionality.