
add .recipes_estimate_sparsity() #1410

Merged
merged 5 commits into main from estimate-sparsity on Jan 15, 2025

Conversation

@EmilHvitfeldt (Member) commented on Jan 15, 2025

To close #1397

I think I found a fairly elegant solution. It adds another S3 generic, with methods for the whole recipe and for individual steps. This way, steps that produce sparsity can report how much sparsity they produce.
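
As a rough sketch, the dispatch shape looks something like the following. The method bodies here are illustrative guesses, not the actual PR code:

# the generic dispatches on recipes and on individual steps
.recipes_estimate_sparsity <- function(x, ...) {
  UseMethod(".recipes_estimate_sparsity")
}

# default: a step contributes no sparsity information
.recipes_estimate_sparsity.default <- function(x, ...) {
  NULL
}

# a sparsity-producing step reports how many columns it expects to
# create and how sparse it expects them to be (illustrative shape)
.recipes_estimate_sparsity.step_dummy <- function(x, ...) {
  list(n_cols = 10L, sparsity = 0.9)
}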

It uses the newly added sparsity() function from {sparsevctrs} to estimate the initial sparsity of the input data, downsampling to 1000 rows for performance reasons.
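
For illustration, the downsampling idea looks roughly like this (estimate_initial_sparsity() is a hypothetical helper, not an exported function):

library(sparsevctrs)

# hypothetical helper: measure sparsity on at most 1000 rows
estimate_initial_sparsity <- function(data, max_rows = 1000) {
  if (nrow(data) > max_rows) {
    data <- data[sample.int(nrow(data), max_rows), , drop = FALSE]
  }
  sparsevctrs::sparsity(data)
}

estimate_initial_sparsity(mtcars)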

Then each step that will produce sparse predictors returns, via its S3 method, how many columns it expects to create and how sparse they will be. This is used to update the estimate, and the process repeats for each step.
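
One plausible way to fold a step's report into the running estimate is a column-weighted average. This is an assumption about the combining logic, not necessarily the PR's exact math:

# running estimate: `sparsity` over `n_cols` columns; a step reports
# `step_n_cols` new columns at `step_sparsity`
update_sparsity_estimate <- function(sparsity, n_cols, step_sparsity, step_n_cols) {
  total <- n_cols + step_n_cols
  list(
    sparsity = (sparsity * n_cols + step_sparsity * step_n_cols) / total,
    n_cols = total
  )
}

# e.g. 70 mostly dense columns plus 40 dummy columns at 95% sparsity
update_sparsity_estimate(0.05, 70, 0.95, 40)
#> sparsity ~0.377 over 110 columns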

It obviously overestimates for recipe() |> step_other(col) |> step_dummy(col), since step_other() collapses infrequent levels and step_dummy() therefore creates fewer columns than the raw data suggests, but that is the best we can do, IMO, with unprepped recipes. A user still has the option to override the estimate anyway.

It is quite a bit faster than prepping the recipe. Profiling shows that most of the time is spent in sparsevctrs::sparsity(), which has room for improvement if needed.

library(recipes)
library(modeldata)

# dummy step that can create sparse columns
rec_sparse <- recipe(~., ames) |>
  step_dummy(all_nominal())

# same dummy step, but forced to stay dense
rec_sparse_no <- recipe(~., ames) |>
  step_dummy(all_nominal(), sparse = "no")

# no steps at all
rec_dense_0 <- recipe(~., ames)

# ten steps, none of which affect sparsity
rec_dense_10 <- recipe(~., ames) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric())

bench::mark(
  check = FALSE,
  prep_sparse = prep(rec_sparse),
  est_sparse = .recipes_estimate_sparsity(rec_sparse),
  est_sparse_no = .recipes_estimate_sparsity(rec_sparse_no),
  est_dense_0 = .recipes_estimate_sparsity(rec_dense_0),
  est_dense_10 = .recipes_estimate_sparsity(rec_dense_10)
)
#> # A tibble: 5 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 prep_sparse     64.8ms   67.3ms      13.3    56.8MB     28.5
#> 2 est_sparse     916.6µs  978.3µs     960.    790.7KB     18.0
#> 3 est_sparse_no  305.4µs  335.9µs    2692.      507KB     46.0
#> 4 est_dense_0    302.5µs  335.4µs    2713.      507KB     46.0
#> 5 est_dense_10   305.3µs  340.6µs    2387.      507KB     36.0

And it is decently accurate for recipes it understands.

.recipes_estimate_sparsity(rec_sparse)
#> [1] 0.7993268
rec_sparse |> prep() |> bake(NULL) |> sparsevctrs::sparsity()
#> [1] 0.7953081

@EmilHvitfeldt requested a review from topepo on January 15, 2025 01:29
@topepo (Member) left a comment:

Looking good! My comments are mostly just curious questions.

Review thread on R/sparsevctrs.R (outdated):
#'
#' @details
#' Takes an untrained recipe and provides a rough estimate of the sparsity of the
@topepo (Member) commented:

This would require some data though, right? Someone wouldn't be able to use

recipe(mpg ~ ., data = mtcars[0, ])

(which is legal and not a horrible idea).

So we should document that a reasonable amount of training data should be in the recipe and check to make sure that some minimal amount is there.
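
A guard along those lines could be as simple as the following (check_min_rows() is hypothetical, not part of this PR):

# hypothetical check: refuse to estimate sparsity from an empty template
check_min_rows <- function(data, min_rows = 1) {
  if (nrow(data) < min_rows) {
    rlang::abort(
      "The recipe's training data must contain at least one row to estimate sparsity."
    )
  }
  invisible(data)
}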

@EmilHvitfeldt (Member, Author) replied:

If you pass in a 0-row data frame, then the initial estimated sparsity is set to 0. That is an underestimate, which I don't think is unreasonable.

But you are correct that it should be documented.

Three further review threads on R/sparsevctrs.R and one on recipes.Rproj were resolved.
@EmilHvitfeldt merged commit eed3579 into main on Jan 15, 2025 (13 checks passed)
@EmilHvitfeldt deleted the estimate-sparsity branch on January 15, 2025 18:58
Successfully merging this pull request may close these issues.

Add workflows helper functions for sparse data