Fix spurious warnings in guess_dates #75
I wouldn’t call this spurious. These dates are all beyond last_date. What behavior do you expect?
If you want to get rid of the warning, then set last_date = Sys.Date() + 365.
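As a rough illustration of why widening `last_date` silences the warning, here is a minimal base R sketch of a timeframe check. It is not the linelist implementation: `outside()` is a hypothetical helper, the bounds are copied from the warning quoted in this thread, and the 2018 date is made up for contrast.

```r
# Bounds copied from the warning text (1969-05-19 -- 2019-05-19).
first_date <- as.Date("1969-05-19")
last_date  <- as.Date("2019-05-19")

# Three dates from the reported warnings plus one illustrative in-range date.
parsed <- as.Date(c("2019-07-04", "2019-10-21", "2019-12-16", "2018-03-01"))

# Hypothetical helper: TRUE for dates outside the closed interval [first, last].
outside <- function(d, first, last) d < first | d > last

# With the original bounds, the three 2019 dates fall after last_date.
parsed[outside(parsed, first_date, last_date)]

# Pushing last_date one year ahead brings every date inside the window.
parsed[outside(parsed, first_date, last_date + 365)]
```

The dates are flagged because they lie after `last_date`, not because they were parsed incorrectly, which is why `original` and `parsed` are identical in the warning.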
On May 19, 2019, at 12:34, Thibaut Jombart ***@***.***> wrote:
Sorry that I cannot share data to reproduce this, but I get the following on the latest version of linelist:
> x <- x %>%
+ mutate_at(.vars = vars(contains("date")),
+ .funs = guess_dates)
Warning messages:
1: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) :
The following 1 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):
original | parsed
-------- | ------
2019-12-16 | 2019-12-16
2: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) :
The following 2 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):
original | parsed
-------- | ------
2019-07-04 | 2019-07-04
2019-10-21 | 2019-10-21
3: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) :
The following 9 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):
original | parsed
-------- | ------
2019-08-08 | 2019-08-08
2019-08-22 | 2019-08-22
2019-09-02 | 2019-09-02
2019-09-03 | 2019-09-03
2019-09-11 | 2019-09-11
2019-10-20 | 2019-10-20
2019-11-02 | 2019-11-02
2019-11-20 | 2019-11-20
2019-12-12 | 2019-12-12
4: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) :
The following 27 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):
original | parsed
-------- | ------
2019-08-08 | 2019-08-08
2019-08-15 | 2019-08-15
2019-08-16 | 2019-08-16
2019-08-18 | 2019-08-18
2019-08-19 | 2019-08-19
2019-08-22 | 2019-08-22
2019-08-30 | 2019-08-30
2019-09-06 | 2019-09-06
2019-09-14 | 2019-09-14
2019-09-16 | 2019-09-16
2019-09-17 | 2019-09-17
2019-09-19 | 2019-09-19
2019-09-21 | 2019-09-21
2019-09-27 | 2019-09-27
2019-10-04 | 2019-10-04
2019-10-08 | 2019-10-08
2019-10-10 | 2019-10-10
2019-10-12 | 2019-10-12
2019-10-13 | 2019-10-13
2019-10-24 | 2019-10-24
2019-10-25 | 2019-10-25
2019-10-30 | 2019-10-30
2019-10-31 | 2019-10-31
2019-11-02 | 2019-11-02
2019-11-09 | 2019-11-09
2019-11-13 | 2019-11-13
2019-12-14 | 2019-12-14
5: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) :
The following 16 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):
original | parsed
-------- | ------
2019-08-17 | 2019-08-17
2019-08-19 | 2019-08-19
2019-09-04 | 2019-09-04
2019-09-19 | 2019-09-19
2019-09-20 | 2019-09-20
2019-09-22 | 2019-09-22
2019-09-28 | 2019-09-28
2019-09-29 | 2019-09-29
2019-10-13 | 2019-10-13
2019-10-14 | 2019-10-14
2019-10-16 | 2019-10-16
2019-10-30 | 2019-10-30
2019-11-03 | 2019-11-03
2019-11-15 | 2019-11-15
2019-11-17 | 2019-11-17
2019-12-18 | 2019-12-18
>
Problem is having long lists of dates original / parsed that are identical.
Is the problem the length or the fact that they appear to be identical?
To give a bit of background as to what is happening: Because
library("linelist")
x <- c("04 Feb 1982", "19 Sep 2018", "2001-01-01", "2011.12.13",
"ba;abb;a: 03:11:2012!", "haha... 2013-12-13..",
"that's a NA", "gender", "not a date", "01__Feb__1999___",
"19/09/18", "09/08/18", "2018-08-09")
last_date <- as.Date("2012-11-05")
first_date <- as.Date("1962-11-05")
res <- guess_dates(x, error_tolerance = 1, last_date = last_date)
#> Warning in guess_dates(x, error_tolerance = 1, last_date = last_date):
#> The following 5 dates were not in the correct timeframe (1962-11-05 -- 2012-11-05):
#>
#> original | parsed
#> -------- | ------
#> 09/08/18 | 2018-08-09
#> 09/08/18 | 2018-09-08
#> 19 Sep 2018 | 2018-09-19
#> 19/09/18 | 2018-09-19
#> 2018-08-09 | 2018-08-09
#> haha... 2013-12-13.. | 2013-12-13
res
#> [1] "1982-02-04" NA "2001-01-01" "2011-12-13" "2012-11-03"
#> [6] NA NA NA NA "1999-02-01"
#> [11] NA NA NA
Created on 2019-05-20 by the reprex package (v0.3.0)
Do you want me to get rid of this warning altogether?
I think it would already be a step forward if the warning could say which column is concerned. If guess_dates is called by clean_data you don't necessarily know which part of the warning comes from which column.
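One way to attach a column name to each warning is to catch warnings per column with `withCallingHandlers()`. The following is only a sketch in base R under stated assumptions: `collect_warnings()` and `parse_with_limit()` are hypothetical names, the toy parser stands in for `guess_dates`, and none of this is taken from the linelist source.

```r
# Hypothetical helper: apply `f` to every column and record which column warned.
collect_warnings <- function(df, f) {
  messages <- list()
  for (col in names(df)) {
    df[[col]] <- withCallingHandlers(
      f(df[[col]]),
      warning = function(w) {
        # Store the message under the column name, then silence the warning.
        messages[[col]] <<- c(messages[[col]], conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    )
  }
  list(data = df, warnings = messages)
}

# Stand-in parser (an assumption): warns when dates fall after `last_date`.
parse_with_limit <- function(x, last_date = as.Date("2019-05-19")) {
  out <- as.Date(x)
  late <- !is.na(out) & out > last_date
  if (any(late)) {
    warning(sum(late), " date(s) after ", format(last_date))
  }
  out
}

# Illustrative data: only date_onset contains a date beyond the limit.
x <- data.frame(
  date_onset  = c("2019-01-02", "2019-12-16"),
  date_report = c("2019-02-03", "2019-03-04"),
  stringsAsFactors = FALSE
)

res <- collect_warnings(x, parse_with_limit)
res$warnings  # a list keyed by the column that produced each warning
```

Because the messages are keyed by column, a caller such as `clean_data` could emit a single combined warning that names each column.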
Thank you for adding this clarification, @ffinger, and I agree with you. Collecting warnings in a loop is not a straightforward problem, but luckily, I've already written some code to handle this situation in I think adopting the warning pattern that readr::parse_date() uses will be helpful: https://readr.tidyverse.org/reference/parse_datetime.html
library("linelist")
my_data_frame <- data.frame(
raboof = c(letters[1:5], "foubar", "foobr", "fubar", "", "unknown", "fumar"),
treatment = c(letters[5:1], "Y", "Yes", "N", NA, "No", "yes"),
region = state.name[1:11]
)
corrections <- data.frame(
bad = c("foubar", "foobr", "fubar", ".missing", "unknown", "Yes", "Y", "No", "N", ".missing"),
good = c("foobar", "foobar", "foobar", "missing", "missing", "yes", "yes", "no", "no", "missing"),
column = c(rep("raboof", 5), rep("treatment", 5)),
orders = c(1:5, 5:1),
stringsAsFactors = FALSE
)
corr <- data.frame(bad = c(".default", ".default"),
good = c("check data", "check data"),
column = c("raboof", "treatment"),
orders = Inf,
stringsAsFactors = FALSE
)
corr <- rbind(corrections, corr)
clean_variable_spelling(my_data_frame, corr, warn = TRUE)
#> Warning in clean_variable_spelling(my_data_frame, corr, warn = TRUE): The following warnings were found...
#> raboof_____:
#> .... 'a', 'b', 'c', 'd', 'e', 'fumar' were changed to the default value ('check data')
#> treatment__:
#> .... 'a', 'b', 'c', 'd', 'e' were changed to the default value ('check data')
#>        raboof  treatment      region
#> 1  check data check data     Alabama
#> 2  check data check data      Alaska
#> 3  check data check data     Arizona
#> 4  check data check data    Arkansas
#> 5  check data check data  California
#> 6      foobar        yes    Colorado
#> 7      foobar        yes Connecticut
#> 8      foobar         no    Delaware
#> 9     missing    missing     Florida
#> 10    missing         no     Georgia
#> 11 check data        yes      Hawaii
Created on 2019-10-28 by the reprex package (v0.3.0)
I am getting warnings which look like they may not be appropriate. Example below:
dates <- c("18_03_2020", "19_03_2020", "20_03_2020", "21_03_2020", "22_03_2020",
"23_03_2020", "24_03_2020", "25_03_2020", "26_03_2020", "27_03_2020",
"28_03_2020", "29_03_2020", "30_03_2020", "31_03_2020", "01_04_2020",
"02_04_2020", "03_04_2020", "04_04_2020", "05_04_2020", "06_04_2020",
"07_04_2020", "08_04_2020")
res <- linelist::guess_dates(dates)
gives the following warning:
Warning message:
In linelist::guess_dates(dates) :
The following 4 dates were not in the correct timeframe (1970-04-10 -- 2020-04-10):
original | parsed
-------- | ------
05_04_2020 | 2020-05-04
06_04_2020 | 2020-06-04
07_04_2020 | 2020-07-04
08_04_2020 | 2020-08-04
Which would suggest conversion did not go as planned, but it is actually not the case:
> res
[1] "2020-03-18" "2020-03-19" "2020-03-20" "2020-03-21" "2020-03-22"
[6] "2020-03-23" "2020-03-24" "2020-03-25" "2020-03-26" "2020-03-27"
[11] "2020-03-28" "2020-03-29" "2020-03-30" "2020-03-31" "2020-04-01"
[16] "2020-04-02" "2020-04-03" "2020-04-04" "2020-04-05" "2020-04-06"
[21] "2020-04-07" "2020-04-08"
> range(res)
[1] "2020-03-18" "2020-04-08"
The warnings come from the fact that it's trying out both the "mdy" and "dmy" versions of the dates. If you only expect dmy versions of dates, then set orders = "dmy".
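The effect of trying both orders can be reproduced with base R's `as.Date()`. This sketch is not how linelist parses dates internally; it only shows why the month-first reading of `05_04_2020` through `08_04_2020` lands after the upper bound of 2020-04-10 reported in the warning, while `01_04_2020` through `04_04_2020` do not.

```r
# A subset of the reported input, plus one date with a day value above 12.
dates <- c("01_04_2020", "04_04_2020", "05_04_2020", "08_04_2020", "18_03_2020")
last_date <- as.Date("2020-04-10")  # upper bound shown in the warning

# Day-first and month-first readings of the same strings.
dmy <- as.Date(dates, format = "%d_%m_%Y")
mdy <- as.Date(dates, format = "%m_%d_%Y")  # NA when the first field is not a valid month

# Month-first candidates that fall after the allowed timeframe.
flagged <- !is.na(mdy) & mdy > last_date

data.frame(original = dates, dmy = dmy, mdy = mdy, mdy_after_last_date = flagged)
```

Read month-first, `04_04_2020` is 4 April and still within the window, whereas `05_04_2020` becomes 4 May and is reported. Restricting parsing to a single order, as suggested with `orders = "dmy"`, removes the second candidate.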