add .recipes_estimate_sparsity()
#1410
Conversation
Looking good! My comments are mostly just curious questions.
#'
#' @details
#' Takes an untrained recipe and provides a rough estimate of the sparsity of the
This would require some data though, right? Someone wouldn't be able to use
recipe(mpg ~ ., data = mtcars[0, ])
(which is legal and not a horrible idea).
So we should document that a reasonable amount of training data should be in the recipe, and check to make sure that some minimal amount is there.
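A minimal sketch of the kind of guard being suggested; check_enough_rows() and its min_rows threshold are hypothetical names, not code from this PR:

```r
# Hypothetical guard, assuming a bare-minimum row count is all that is checked.
check_enough_rows <- function(data, min_rows = 1) {
  if (nrow(data) < min_rows) {
    stop(
      "The recipe must contain at least ", min_rows,
      " row(s) of training data to estimate sparsity from.",
      call. = FALSE
    )
  }
  invisible(data)
}
```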
If you pass in a 0-row data frame, then the initial estimated sparsity is set to 0. That is an under-estimate that I don't think is reasonable.
But you are correct that it should be documented.
Co-authored-by: Max Kuhn <[email protected]>
To close #1397
I think I found a fairly elegant solution. It adds another S3 generic, with methods for the whole recipe and for individual steps. This way, steps that produce sparsity can report how much sparsity they produce, as sketched below.
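A rough sketch of how that dispatch could look; the default method and the return shape (column count plus expected sparsity) are my assumptions, not necessarily what the PR implements:

```r
# New generic: dispatched on the recipe and on each individual step.
.recipes_estimate_sparsity <- function(x, ...) {
  UseMethod(".recipes_estimate_sparsity")
}

# Assumed default: a step that creates no sparse columns contributes nothing.
.recipes_estimate_sparsity.default <- function(x, ...) {
  NULL
}

# A sparsity-producing step reports how many sparse columns it expects to
# create and how sparse they will be (numbers here are illustrative only).
.recipes_estimate_sparsity.step_dummy <- function(x, ...) {
  data.frame(n_cols = 2L, sparsity = 0.9)
}
```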
It uses the newly added sparsity() function from {sparsevctrs} to estimate the initial sparsity of the input data, downsampling to 1000 rows for performance reasons. Then each step that will produce sparse predictors returns, via its S3 method, how many columns it expects to create and how sparse they will be. This is used to update the estimate, and the process repeats for each step.
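As a sketch of that bookkeeping (the column-weighted mean is my reading of the description, not quoted from the PR): if the current estimate covers n columns at sparsity s, and a step expects to add k columns at sparsity s_k, the pooled estimate is (n * s + k * s_k) / (n + k).

```r
# Hypothetical helper: fold one step's expected output into the running
# estimate as a column-weighted mean.
update_sparsity <- function(n_cols, sparsity, new_cols, new_sparsity) {
  total <- n_cols + new_cols
  list(
    n_cols = total,
    sparsity = (n_cols * sparsity + new_cols * new_sparsity) / total
  )
}

# 10 dense columns (sparsity 0) plus 5 dummy columns at sparsity 0.8:
update_sparsity(10, 0, 5, 0.8)
#> 15 columns at sparsity 4/15, roughly 0.27
```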
It obviously over-estimates for recipe() |> step_other(col) |> step_dummy(col), but it is the best we can do IMO with unprepped recipes, and the user still has the option to override the estimate anyway. It is quite a bit faster than prepping the recipe, and profiling shows that most of the time is spent in sparsevctrs::sparsity(), which has room for improvement if needed. And it is decently good for recipes it understands.
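For illustration, a hypothetical call pattern; whether the function is exported, and its exact signature, are assumptions on my part:

```r
library(recipes)

rec <- recipe(Sepal.Length ~ ., data = iris) |>
  step_other(Species, threshold = 0.1) |>
  step_dummy(Species)

# Assumed interface: estimate from the untrained recipe, no prep() needed.
recipes:::.recipes_estimate_sparsity(rec)

# Versus measuring the exact sparsity after a full prep(), which is slower:
sparsevctrs::sparsity(bake(prep(rec), new_data = NULL))
```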