
add .recipes_estimate_sparsity() #1410

Merged
merged 5 commits into main from estimate-sparsity on Jan 15, 2025

Conversation

@EmilHvitfeldt (Member) commented on Jan 15, 2025

To close #1397

I think I found a fairly elegant solution. It adds another S3 generic, with methods for the whole recipe and for individual steps. This way, steps that produce sparsity can report how much sparsity they produce.
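
As a rough sketch, the dispatch shape looks something like the following. The method bodies here are illustrative guesses, not the actual PR code:

# the generic dispatches on recipes and on individual steps
.recipes_estimate_sparsity <- function(x, ...) {
  UseMethod(".recipes_estimate_sparsity")
}

# default: a step contributes no sparsity information
.recipes_estimate_sparsity.default <- function(x, ...) {
  NULL
}

# a sparsity-producing step reports how many columns it expects to
# create and how sparse it expects them to be (illustrative shape)
.recipes_estimate_sparsity.step_dummy <- function(x, ...) {
  list(n_cols = 10L, sparsity = 0.9)
}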

It uses the newly added sparsity() function from {sparsevctrs} to estimate the initial sparsity of the input data, downsampling to 1000 rows for performance reasons.
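
For illustration, the downsampling idea looks roughly like this (estimate_initial_sparsity() is a hypothetical helper, not an exported function):

library(sparsevctrs)

# hypothetical helper: measure sparsity on at most 1000 rows
estimate_initial_sparsity <- function(data, max_rows = 1000) {
  if (nrow(data) > max_rows) {
    data <- data[sample.int(nrow(data), max_rows), , drop = FALSE]
  }
  sparsevctrs::sparsity(data)
}

estimate_initial_sparsity(mtcars)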

Then each step that will produce sparse predictors returns, via its S3 method, how many columns it expects to create and how sparse they will be. This is used to update the estimate, and the process repeats for each step.
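
One plausible way to fold a step's report into the running estimate is a column-weighted average. This is an assumption about the combining logic, not necessarily the PR's exact math:

# running estimate: `sparsity` over `n_cols` columns; a step reports
# `step_n_cols` new columns at `step_sparsity`
update_sparsity_estimate <- function(sparsity, n_cols, step_sparsity, step_n_cols) {
  total <- n_cols + step_n_cols
  list(
    sparsity = (sparsity * n_cols + step_sparsity * step_n_cols) / total,
    n_cols = total
  )
}

# e.g. 70 mostly dense columns plus 40 dummy columns at 95% sparsity
update_sparsity_estimate(0.05, 70, 0.95, 40)
#> sparsity ~0.377 over 110 columns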

It obviously overestimates for recipe() |> step_other(col) |> step_dummy(col), since step_other() collapses infrequent levels and step_dummy() therefore creates fewer columns than the raw data suggests, but that is the best we can do, IMO, with unprepped recipes. A user still has the option to override the estimate anyway.

It is quite a bit faster than prepping the recipe. Profiling shows that most of the time is spent in sparsevctrs::sparsity(), which has room for improvement if needed.

library(recipes)
library(modeldata)

# dummy step that can create sparse columns
rec_sparse <- recipe(~., ames) |>
  step_dummy(all_nominal())

# same dummy step, but forced to stay dense
rec_sparse_no <- recipe(~., ames) |>
  step_dummy(all_nominal(), sparse = "no")

# no steps at all
rec_dense_0 <- recipe(~., ames)

# ten steps, none of which affect sparsity
rec_dense_10 <- recipe(~., ames) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_normalize(all_numeric())

bench::mark(
  check = FALSE,
  prep_sparse = prep(rec_sparse),
  est_sparse = .recipes_estimate_sparsity(rec_sparse),
  est_sparse_no = .recipes_estimate_sparsity(rec_sparse_no),
  est_dense_0 = .recipes_estimate_sparsity(rec_dense_0),
  est_dense_10 = .recipes_estimate_sparsity(rec_dense_10)
)
#> # A tibble: 5 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 prep_sparse     64.8ms   67.3ms      13.3    56.8MB     28.5
#> 2 est_sparse     916.6µs  978.3µs     960.    790.7KB     18.0
#> 3 est_sparse_no  305.4µs  335.9µs    2692.      507KB     46.0
#> 4 est_dense_0    302.5µs  335.4µs    2713.      507KB     46.0
#> 5 est_dense_10   305.3µs  340.6µs    2387.      507KB     36.0

And it is decently accurate for recipes it understands.

.recipes_estimate_sparsity(rec_sparse)
#> [1] 0.7993268
rec_sparse |> prep() |> bake(NULL) |> sparsevctrs::sparsity()
#> [1] 0.7953081

@EmilHvitfeldt requested a review from topepo on January 15, 2025 01:29
@topepo (Member) left a comment:

Looking good! My comments are mostly just curious questions.

Review thread on R/sparsevctrs.R (outdated):
#'
#' @details
#' Takes an untrained recipe and provides a rough estimate of the sparsity of the
@topepo (Member) commented:

This would require some data though, right? Someone wouldn't be able to use

recipe(mpg ~ ., data = mtcars[0, ])

(which is legal and not a horrible idea).

So we should document that a reasonable amount of training data should be in the recipe and check to make sure that some minimal amount is there.
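
A guard along those lines could be as simple as the following (check_min_rows() is hypothetical, not part of this PR):

# hypothetical check: refuse to estimate sparsity from an empty template
check_min_rows <- function(data, min_rows = 1) {
  if (nrow(data) < min_rows) {
    rlang::abort(
      "The recipe's training data must contain at least one row to estimate sparsity."
    )
  }
  invisible(data)
}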

@EmilHvitfeldt (Member, Author) replied:

If you pass in a 0-row data frame, then the initial estimated sparsity is set to 0. That is an underestimate, which I don't think is unreasonable.

But you are correct that it should be documented.

Three further review threads on R/sparsevctrs.R and one on recipes.Rproj were resolved.
@EmilHvitfeldt merged commit eed3579 into main on Jan 15, 2025 (13 checks passed)
@EmilHvitfeldt deleted the estimate-sparsity branch on January 15, 2025 18:58
Successfully merging this pull request may close these issues.

Add workflows helper functions for sparse data