fct_na_value_to_level does not work in recipes #1290

Closed
joranE opened this issue Mar 16, 2024 · 4 comments

joranE commented Mar 16, 2024

I can't tell why, but it seems as though forcats::fct_na_value_to_level does not work when applied in recipes, specifically in step_mutate. I discovered this when updating some code in response to the deprecation warning generated by forcats::fct_explicit_na that says to use fct_na_value_to_level instead.

Reprex demonstrating the issue below. In short, both fct_na_value_to_level & fct_explicit_na work as expected in isolation, but the newer version seems to do nothing when run inside step_mutate.

I tried debugging fct_na_value_to_level while baking the recipe: it enters the function, appears to generate a factor with a new level, and then exits fine, so I presume the result is being discarded somewhere else in the recipe machinery.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(forcats)

ex_data <- 
  data.frame(
    y = 1:3,
    x = factor(c("a",NA,"c"))
  )

rec <- 
  recipe(y~x,data = ex_data) |> 
  step_mutate(
    x_old = fct_explicit_na(x),
    x_new = fct_na_value_to_level(x),
  )

rec_baked <- 
  rec |> 
  prep() |> 
  bake(new_data = NULL)
#> Warning: There was 1 warning in `dplyr::mutate()`.
#> ℹ In argument: `x_old = fct_explicit_na(x)`.
#> Caused by warning:
#> ! `fct_explicit_na()` was deprecated in forcats 1.0.0.
#> ℹ Please use `fct_na_value_to_level()` instead.

# Deprecated fct_explicit_na generates new level
levels(rec_baked$x_old)
#> [1] "a"         "c"         "(Missing)"

# New fct_na_value_to_level does not
levels(rec_baked$x_new)
#> [1] "a" "c"

# Outside of recipe both work as expected
new_fct <- fct_na_value_to_level(ex_data$x)
levels(new_fct)
#> [1] "a" "c" NA

old_fct <- fct_explicit_na(ex_data$x)
levels(old_fct)
#> [1] "a"         "c"         "(Missing)"

Created on 2024-03-16 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       macOS Sonoma 14.2.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Denver
#>  date     2024-03-16
#>  pandoc   3.1.11 @ /private/var/folders/8b/t1hk0l7j51xfz229wq7v5c_r0000gn/T/AppTranslocation/6B027E9E-D29F-42A6-B9CD-FF8F7EC148CD/d/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  class          7.3-22     2023-05-03 [2] CRAN (R 4.3.2)
#>  cli            3.6.2      2023-12-11 [1] CRAN (R 4.3.1)
#>  codetools      0.2-19     2023-02-01 [2] CRAN (R 4.3.2)
#>  data.table     1.15.2     2024-02-29 [1] CRAN (R 4.3.1)
#>  digest         0.6.34     2024-01-11 [1] CRAN (R 4.3.1)
#>  dplyr        * 1.1.4      2023-11-17 [1] CRAN (R 4.3.1)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.3.0)
#>  evaluate       0.23       2023-11-01 [1] CRAN (R 4.3.1)
#>  fansi          1.0.6      2023-12-08 [1] CRAN (R 4.3.1)
#>  fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
#>  forcats      * 1.0.0      2023-01-29 [1] CRAN (R 4.3.0)
#>  fs             1.6.3      2023-07-20 [1] CRAN (R 4.3.0)
#>  future         1.33.1     2023-12-22 [1] CRAN (R 4.3.1)
#>  future.apply   1.11.1     2023-12-21 [1] CRAN (R 4.3.1)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
#>  globals        0.16.2     2022-11-21 [1] CRAN (R 4.3.0)
#>  glue           1.7.0      2024-01-09 [1] CRAN (R 4.3.1)
#>  gower          1.0.1      2022-12-22 [1] CRAN (R 4.3.0)
#>  hardhat        1.3.0      2023-03-30 [1] CRAN (R 4.3.0)
#>  htmltools      0.5.7      2023-11-03 [1] CRAN (R 4.3.1)
#>  ipred          0.9-14     2023-03-09 [1] CRAN (R 4.3.0)
#>  knitr          1.45       2023-10-30 [1] CRAN (R 4.3.1)
#>  lattice        0.21-9     2023-10-01 [2] CRAN (R 4.3.2)
#>  lava           1.7.3      2023-11-04 [1] CRAN (R 4.3.1)
#>  lifecycle      1.0.4      2023-11-07 [1] CRAN (R 4.3.1)
#>  listenv        0.9.0      2022-12-16 [1] CRAN (R 4.3.0)
#>  lubridate      1.9.3      2023-09-27 [1] CRAN (R 4.3.1)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
#>  MASS           7.3-60     2023-05-04 [2] CRAN (R 4.3.2)
#>  Matrix         1.6-1.1    2023-09-18 [2] CRAN (R 4.3.2)
#>  nnet           7.3-19     2023-05-03 [2] CRAN (R 4.3.2)
#>  parallelly     1.36.0     2023-05-26 [1] CRAN (R 4.3.0)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
#>  prodlim        2023.08.28 2023-08-28 [1] CRAN (R 4.3.0)
#>  purrr          1.0.2      2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils        2.12.3     2023-11-18 [1] CRAN (R 4.3.1)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
#>  Rcpp           1.0.12     2024-01-09 [1] CRAN (R 4.3.1)
#>  recipes      * 1.0.9      2023-12-13 [1] CRAN (R 4.3.1)
#>  reprex         2.0.2      2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang          1.1.3      2024-01-10 [1] CRAN (R 4.3.2)
#>  rmarkdown      2.25       2023-09-18 [1] CRAN (R 4.3.1)
#>  rpart          4.1.21     2023-10-09 [2] CRAN (R 4.3.2)
#>  rstudioapi     0.15.0     2023-07-07 [2] CRAN (R 4.3.0)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
#>  styler         1.10.2     2023-08-29 [1] CRAN (R 4.3.0)
#>  survival       3.5-7      2023-08-14 [2] CRAN (R 4.3.2)
#>  tibble         3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.3.0)
#>  timechange     0.3.0      2024-01-18 [1] CRAN (R 4.3.1)
#>  timeDate       4032.109   2023-12-14 [1] CRAN (R 4.3.1)
#>  utf8           1.2.4      2023-10-22 [1] CRAN (R 4.3.1)
#>  vctrs          0.6.5      2023-12-01 [1] CRAN (R 4.3.1)
#>  withr          3.0.0      2024-01-16 [1] CRAN (R 4.3.1)
#>  xfun           0.41       2023-11-01 [1] CRAN (R 4.3.1)
#>  yaml           2.3.8      2023-12-11 [2] CRAN (R 4.3.1)
#> 
#>  [1] /Users/joranelias/rlibs
#>  [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

EmilHvitfeldt (Member) commented

Hello @joranE 👋

For this task, we recommend using step_unknown() instead of relying on step_mutate().

library(recipes)

ex_data <- 
  data.frame(
    y = 1:3,
    x = factor(c("a",NA,"c"))
  )

rec <- 
  recipe(y~x,data = ex_data) |> 
  step_unknown(x)

rec_baked <- 
  rec |> 
  prep() |> 
  bake(new_data = NULL)

levels(rec_baked$x)
#> [1] "a"       "c"       "unknown"

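If downstream code expects the "(Missing)" label that fct_explicit_na() produced, one possibility is the new_level argument of step_unknown(). A minimal sketch, reusing the example data from this thread; rec_named and baked_named are names chosen here purely for illustration:

```r
library(recipes)

ex_data <- data.frame(
  y = 1:3,
  x = factor(c("a", NA, "c"))
)

# new_level replaces the default "unknown" label for missing values
rec_named <- recipe(y ~ x, data = ex_data) |>
  step_unknown(x, new_level = "(Missing)")

baked_named <- rec_named |>
  prep() |>
  bake(new_data = NULL)

levels(baked_named$x)
```

The label is then fixed in the recipe itself rather than depending on a forcats default.
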
joranE commented Mar 18, 2024

Thanks @EmilHvitfeldt! I did actually know about step_unknown but sort of had blinders on, as I was updating existing code from someone else and was surprised that fct_explicit_na worked but fct_na_value_to_level did not.

Good to know that step_unknown should be preferred more generally.

My discovery does make me sort of nervous now, though, that step_mutate is silently unreliable, since I can't tell when it will work "just like mutate" and when it might not. So if there's something predictable in step_mutate where it can tell when something like fct_na_value_to_level won't work and warn/error, I still think that would be a useful improvement.

EmilHvitfeldt (Member) commented

There are two things happening here:

First, this uncovered a bug in strings2factors() that would wrongly drop NA levels in factors. strings2factors() was being invoked after step_mutate() had run.
This is being dealt with in #1291.
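
As a rough illustration of how an NA level can be lost without any warning, here is a base R sketch. It is not the strings2factors() code, which is not shown in this thread; addNA() is used only as a base R stand-in for fct_na_value_to_level(), and the variable names are made up for the example. Rebuilding a factor with factor() uses the default exclude = NA, which removes an NA level, while exclude = NULL keeps it.

```r
# Base R only; addNA() stands in for fct_na_value_to_level()
x <- factor(c("a", NA, "c"))
x_na <- addNA(x)  # NA becomes an explicit level
levels(x_na)

# Re-creating the factor with the defaults (exclude = NA) drops the NA level
rebuilt_default <- factor(x_na)
levels(rebuilt_default)

# Passing exclude = NULL keeps the NA level
rebuilt_keep <- factor(x_na, exclude = NULL)
levels(rebuilt_keep)
```

Any post-processing step that re-creates factors with default arguments would show the same symptom as the reprex above: the NA level disappears and the values go back to being plain missing values.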

Secondly, step_mutate() does work like mutate(). IMO the problem you ran into in this case is that fct_explicit_na() and fct_na_value_to_level() do similar things, but not the same thing.

fct_explicit_na() takes the missing values and turns them into a regular, non-missing level in the factor ("(Missing)" by default).
fct_na_value_to_level() adds NA itself to the levels.

This can be seen here:

library(forcats)

x <- factor(c("a",NA,"c"))

x_old <- fct_explicit_na(x)
x_new <- fct_na_value_to_level(x)

x_old
#> [1] a         (Missing) c        
#> Levels: a c (Missing)
x_new
#> [1] a    <NA> c   
#> Levels: a c <NA>

This distinction is useful because you can reverse fct_na_value_to_level() with fct_na_level_to_value() if you need to do calculations that work on missing values.

is.na(x_old)
#> [1] FALSE FALSE FALSE
is.na(x_new)
#> [1] FALSE FALSE FALSE

x_old_rev <- fct_na_level_to_value(x_old)
x_new_rev <- fct_na_level_to_value(x_new)

is.na(x_old_rev)
#> [1] FALSE FALSE FALSE
is.na(x_new_rev)
#> [1] FALSE  TRUE FALSE

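If the goal is to reproduce the old fct_explicit_na() result with the new function, one option is the level argument of fct_na_value_to_level(). This is a sketch assuming forcats 1.0.0 or later; x_named is a name introduced here for illustration.

```r
library(forcats)

x <- factor(c("a", NA, "c"))

# level defaults to NA; passing a string turns missing values
# into an ordinary level with that label
x_named <- fct_na_value_to_level(x, level = "(Missing)")

levels(x_named)
is.na(x_named)
```

Because the new level is an ordinary string rather than NA, it would presumably not be affected by code that drops NA levels, although that was not verified inside a recipe here.
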
Regardless, you should be using step_unknown(), as its behavior will stay the same 😄

github-actions bot commented Apr 3, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 3, 2024