Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make recipe() works with long formulas #1283

Merged
merged 14 commits into from
May 24, 2024
4 changes: 4 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,10 @@

* `NA` levels in factors aren't dropped when passed to `recipe()`. (#1291)

* `recipe()` no longer crashes when given long formula expression (#1283).

* `recipe()` will now show better error when columns are misspelled in formula (#1283).

# recipes 1.0.10

## Bug Fixes
Expand Down
48 changes: 25 additions & 23 deletions R/misc.R
Original file line number Diff line number Diff line change
Expand Up @@ -21,13 +21,30 @@ get_rhs_vars <- function(formula, data, no_lhs = FALSE) {
## or should it? what about Y ~ log(x)?
## Answer: when called from `form2args`, the function
## `inline_check` stops when in-line functions are used.
data_info <- attr(model.frame(formula, data[1, ]), "terms")
simonpcouch marked this conversation as resolved.
Show resolved Hide resolved
response_info <- attr(data_info, "response")
predictor_names <- names(attr(data_info, "dataClasses"))
if (length(response_info) > 0 && all(response_info > 0)) {
predictor_names <- predictor_names[-response_info]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are allowed to do this refactor because get_rhs_vars() is only ever called after inline_check() in form2args()

https://github.com/tidymodels/recipes/blob/main/R/recipe.R#L233-L241

the the question in essence because:

  1. if formula contains a ., use "all names" otherwise use all.names() to pull them out.
  2. pull names out of lhs
  3. take the diff if needed

outcomes_names <- all.names(
rlang::f_lhs(formula),
functions = FALSE,
unique = TRUE
)

predictors_names <- all.names(
rlang::f_rhs(formula),
functions = FALSE,
unique = TRUE
)

if (any(predictors_names == ".")) {
predictors_names <- predictors_names[predictors_names != "."]
predictors_names <- c(predictors_names, colnames(data))
predictors_names <- unique(predictors_names)
}

if (length(predictors_names) > 0 && length(outcomes_names) > 0) {
predictors_names <- setdiff(predictors_names, outcomes_names)
}
predictor_names

predictors_names
}

#' Naming Tools
Expand Down Expand Up @@ -136,25 +153,10 @@ dummy_extract_names <- function(var, lvl, ordinal = FALSE, sep = "_") {
nms
}


## As suggested by HW, brought in from the `pryr` package
## https://github.com/hadley/pryr
fun_calls <- function(f) {
if (is.function(f)) {
fun_calls(body(f))
} else if (is_quosure(f)) {
fun_calls(quo_get_expr(f))
} else if (is.call(f)) {
fname <- as.character(f[[1]])
# Calls inside .Internal are special and shouldn't be included
if (identical(fname, ".Internal")) {
return(fname)
}
unique(c(fname, unlist(lapply(f[-1], fun_calls), use.names = FALSE)))
}
fun_calls <- function(f, data) {
setdiff(all.names(f), colnames(data))
}


get_levels <- function(x) {
if (!is.factor(x) & !is.character(x)) {
return(list(values = NA, ordered = NA))
Expand Down
26 changes: 14 additions & 12 deletions R/recipe.R
Original file line number Diff line number Diff line change
Expand Up @@ -193,18 +193,20 @@ recipe.data.frame <-
#' @rdname recipe
#' @export
recipe.formula <- function(formula, data, ...) {

if (rlang::is_missing(data)) {
cli::cli_abort("{.arg data} is missing with no default.")
}

# check for minus:
f_funcs <- fun_calls(formula)
f_funcs <- fun_calls(formula, data)
if (any(f_funcs == "-")) {
cli::cli_abort(c(
"x" = "{.code -} is not allowed in a recipe formula.",
"i" = "Use {.help [{.fun step_rm}](recipes::step_rm)} instead."
))
}

if (rlang::is_missing(data)) {
cli::cli_abort("Argument {.var data} is missing, with no default.")
}
# Check for other in-line functions
args <- form2args(formula, data, ...)
obj <- recipe.data.frame(
Expand All @@ -230,7 +232,7 @@ form2args <- function(formula, data, ..., call = rlang::caller_env()) {
}

## check for in-line formulas
inline_check(formula)
inline_check(formula, data, call)

if (!is_tibble(data)) {
data <- as_tibble(data)
Expand Down Expand Up @@ -271,20 +273,20 @@ form2args <- function(formula, data, ..., call = rlang::caller_env()) {
list(x = data, vars = vars, roles = roles)
}

inline_check <- function(x) {
funs <- fun_calls(x)
funs <- funs[!(funs %in% c("~", "+", "-"))]
inline_check <- function(x, data, call) {
funs <- fun_calls(x, data)
funs <- funs[!(funs %in% c("~", "+", "-", "."))]

if (length(funs) > 0) {
cli::cli_abort(c(
x = "No in-line functions should be used here.",
i = "{cli::qty(length(funs))}The following function{?s} {?was/were} \\
found: {.and {.code {funs}}}.",
x = "Misspelled variable name or in-line functions detected.",
i = "{cli::qty(length(funs))}The following function{?s}/misspelling{?s} \\
{?was/were} found: {.and {.code {funs}}}.",
i = "Use steps to do transformations instead.",
i = "If your modeling engine uses special terms in formulas, pass \\
that formula to workflows as a \\
{.help [model formula](parsnip::model_formula)}."
))
), call = call)
}

invisible(x)
Expand Down
48 changes: 35 additions & 13 deletions tests/testthat/_snaps/basics.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,9 +3,9 @@
Code
recipe(HHV ~ log(nitrogen), data = biomass)
Condition
Error in `inline_check()`:
x No in-line functions should be used here.
i The following function was found: `log`.
Error in `recipe()`:
x Misspelled variable name or in-line functions detected.
i The following function/misspelling was found: `log`.
i Use steps to do transformations instead.
i If your modeling engine uses special terms in formulas, pass that formula to workflows as a model formula (`?parsnip::model_formula()`).

Expand All @@ -14,9 +14,9 @@
Code
recipe(HHV ~ (.)^2, data = biomass)
Condition
Error in `inline_check()`:
x No in-line functions should be used here.
i The following functions were found: `^` and `(`.
Error in `recipe()`:
x Misspelled variable name or in-line functions detected.
i The following functions/misspellings were found: `^` and `(`.
i Use steps to do transformations instead.
i If your modeling engine uses special terms in formulas, pass that formula to workflows as a model formula (`?parsnip::model_formula()`).

Expand All @@ -25,9 +25,9 @@
Code
recipe(HHV ~ nitrogen + sulfur + nitrogen:sulfur, data = biomass)
Condition
Error in `inline_check()`:
x No in-line functions should be used here.
i The following function was found: `:`.
Error in `recipe()`:
x Misspelled variable name or in-line functions detected.
i The following function/misspelling was found: `:`.
i Use steps to do transformations instead.
i If your modeling engine uses special terms in formulas, pass that formula to workflows as a model formula (`?parsnip::model_formula()`).

Expand All @@ -36,9 +36,31 @@
Code
recipe(HHV ~ nitrogen^2, data = biomass)
Condition
Error in `inline_check()`:
x No in-line functions should be used here.
i The following function was found: `^`.
Error in `recipe()`:
x Misspelled variable name or in-line functions detected.
i The following function/misspelling was found: `^`.
i Use steps to do transformations instead.
i If your modeling engine uses special terms in formulas, pass that formula to workflows as a model formula (`?parsnip::model_formula()`).

# Recipe on missspelled variables in formulas

Code
recipe(HHV ~ not_nitrogen, data = biomass)
Condition
Error in `recipe()`:
x Misspelled variable name or in-line functions detected.
i The following function/misspelling was found: `not_nitrogen`.
i Use steps to do transformations instead.
i If your modeling engine uses special terms in formulas, pass that formula to workflows as a model formula (`?parsnip::model_formula()`).

---

Code
recipe(not_HHV ~ nitrogen, data = biomass)
Condition
Error in `recipe()`:
x Misspelled variable name or in-line functions detected.
i The following function/misspelling was found: `not_HHV`.
i Use steps to do transformations instead.
i If your modeling engine uses special terms in formulas, pass that formula to workflows as a model formula (`?parsnip::model_formula()`).

Expand Down Expand Up @@ -179,5 +201,5 @@
recipe(mpg ~ .)
Condition
Error in `recipe()`:
! Argument `data` is missing, with no default.
! `data` is missing with no default.

44 changes: 43 additions & 1 deletion tests/testthat/test-basics.R
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,18 @@ test_that("Recipe fails on in-line functions", {
)
})

test_that("Recipe on missspelled variables in formulas", {
EmilHvitfeldt marked this conversation as resolved.
Show resolved Hide resolved
expect_snapshot(
error = TRUE,
recipe(HHV ~ not_nitrogen, data = biomass)
)

expect_snapshot(
error = TRUE,
recipe(not_HHV ~ nitrogen, data = biomass)
)
})

test_that("return character or factor values", {
raw_recipe <- recipe(HHV ~ ., data = biomass)
centered <- raw_recipe %>%
Expand Down Expand Up @@ -357,4 +369,34 @@ test_that("NAs aren't dropped in strings2factor() (#1291)", {
bake(new_data = NULL)

expect_identical(rec_res, ex_data)
})
})

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is simply here to demonstrate that we doesn't get a Error: C stack usage 7954280 is too close to the limit error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, that's a lot of columns. Hell yeah!

test_that("recipe() can handle very long formulas (#1283)", {
df <- matrix(1:10000, ncol = 10000)
df <- as.data.frame(df)
names(df) <- c(paste0("x", 1:10000))

long_formula <- as.formula(paste("~ ", paste(names(df), collapse = " + ")))

expect_no_error(
rec <- recipe(long_formula, df)
)
})

test_that("recipe() works with odd formula usage (#1283)", {

expect_identical(
sort(recipe(mpg ~ ., data = mtcars)$var_info$variable),
sort(colnames(mtcars))
)

expect_identical(
sort(recipe(mpg ~ . + disp, data = mtcars)$var_info$variable),
sort(colnames(mtcars))
)

expect_identical(
sort(recipe(mpg ~ disp + disp, mtcars)$var_info$variable),
c("disp", "mpg")
)
})
Loading