-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update TADA_CheckRequiredFields and TADA_AutoClean #557
Conversation
return names of missing cols with error, update autoclean to add ActivityStartDateTime if missing
@@ -272,6 +271,17 @@ TADA_AutoClean <- function(.data) { | |||
.data$TADA.ResultMeasure.MeasureUnitCode <- toupper(.data$ResultMeasure.MeasureUnitCode) | |||
} | |||
|
|||
if ("ActivityStartDateTime" %in% colnames(.data)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to add ActivityStartDateTime and convert times zones in our autoclean function here in the same format/way as the DR functions do. This is needed in cases where in the TADAShiny app, a user brings their own data file that was not downloaded using either EPATADA or dataRetrieval, but has all other required cols. It should replicate any additional things dataRetrieval readWQPdata does to a df.
Data retrieval produces "2022-06-08 16:00:00" while my code below produces "2023-05-11 11:45:00 UTC". DR also converts times and time zones to UTC, but that's not included in the value here. The time zone conversion and addition of ActivityStartDateTime occurs even when attributes are ignored (which is fine, we need that col):
results.DR <- dataRetrieval::readWQPdata(WQPquery,
dataProfile = "resultPhysChem",
ignore_attributes = TRUE
)
I'd like to understand this piece of dataretreival better. If there is a stand along function that can change all date/time fields to UTC and also creates this ActivityStartDate col (and does anything else that is happening behind the scenes), we could leverage that here instead.
@ldecicco-USGS do you have any advice on this topic? Can you please point us to the location in your code where this occurs? I did a quick search of your repo but couldn't find it. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The internal code is here;
https://github.com/DOI-USGS/dataRetrieval/blob/main/R/importWQP.R#L223
(you can call it, you'd just need to do a triple colon: dataRetrieval:::create_dateTime
)
offsetLibrary is a dataframe saved in sysdata.rda
You can see where and how it gets called here:
https://github.com/DOI-USGS/dataRetrieval/blob/main/R/importWQP.R#L160
Let me know if there's something unclear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This really helped me understand how to interpret both the ActivityStartDateTime and offset cols (these starting showing up in TADA df's for the first time last month: ActivityStartTime.TimeZoneCode_offset and ActivityEndTime.TimeZoneCode_offset) that are returned from dataRetrieval. For dataRetrieval users, I think it might be more user friendly to include a column titled ActivityStartDateTime.TimeZoneCode (UTC in this case) instead of the ActivityStartTime.TimeZoneCode_offset (which includes number of hours). As is, the target time zone for ActivityStartDateTime (a function input here) is not documented anywhere in the returned df (see review_TADAProfile1 below). Alternatively, UTC could potentially be included in ActivityStartDateTime but that might break people workflows (e.g. "2023-05-11 11:45:00 UTC"). I am going to create a separate issue in TADA repo to document this issue and discuss how to address it. Is this something you are potentially interested in updating in dataRetrieval?
# Find web service URLs for each Profile using WQP User Interface (https://www.waterqualitydata.us/)
# Example WQP URL: https://www.waterqualitydata.us/#statecode=US%3A09&characteristicType=Nutrient&startDateLo=04-01-2023&startDateHi=11-01-2023&mimeType=csv&providers=NWIS&providers=STEWARDS&providers=STORET
# Use TADA_ReadWQPWebServices to load the Station, Project, and Phys-Chem Result profiles
stationProfile <- TADA_ReadWQPWebServices("https://www.waterqualitydata.us/data/Station/search?statecode=US%3A09&characteristicType=Nutrient&startDateLo=04-01-2023&startDateHi=11-01-2023&mimeType=csv&zip=yes&providers=NWIS&providers=STEWARDS&providers=STORET")
physchemProfile <- TADA_ReadWQPWebServices("https://www.waterqualitydata.us/data/Result/search?statecode=US%3A09&characteristicType=Nutrient&startDateLo=04-01-2023&startDateHi=11-01-2023&mimeType=csv&zip=yes&dataProfile=resultPhysChem&providers=NWIS&providers=STEWARDS&providers=STORET")
projectProfile <- TADA_ReadWQPWebServices("https://www.waterqualitydata.us/data/Project/search?statecode=US%3A09&characteristicType=Nutrient&startDateLo=04-01-2023&startDateHi=11-01-2023&mimeType=csv&zip=yes&providers=NWIS&providers=STEWARDS&providers=STORET")
# Join all three profiles using TADA_JoinWQPProfiles
TADAProfile <- TADA_JoinWQPProfiles(FullPhysChem = physchemProfile, Sites = stationProfile, Projects = projectProfile)
# Run TADA_CheckRequiredFields, returns error message, 'The dataframe does not contain the required fields: ActivityStartDateTime'
TADA_CheckRequiredFields(TADAProfile)
# Add missing col
TADAProfile1 <- dataRetrieval:::create_dateTime(df = TADAProfile,
date_col = "ActivityStartDate",
time_col = "ActivityStartTime.Time",
tz_col = "ActivityStartTime.TimeZoneCode",
tz = "UTC")
review_TADAProfile1 = TADAProfile1 %>% dplyr::select(c("ActivityStartDate",
"ActivityStartTime.Time",
"ActivityStartTime.TimeZoneCode",
"ActivityStartDateTime",
"ActivityStartTime.TimeZoneCode_offset"))
# re-run TADA_CheckRequiredFields, returns TRUE
TADA_CheckRequiredFields(TADAProfile1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created a separate issue here to continue the conversation: #558
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just making sure we're both on the same page: In dataRetrieval, the default timezone is UTC set here:
https://github.com/DOI-USGS/dataRetrieval/blob/main/R/readWQPdata.R#L200
You can read about changing timezones here:
https://doi-usgs.github.io/dataRetrieval/reference/readWQPdata.html#arg-tz
This sets the time zone attribute of the POSIX object.
Like this:
library(dataRetrieval)
nameToUse <- "pH"
pHData <- readWQPdata(siteid = "USGS-04024315",
characteristicName = nameToUse,
service = "ResultWQX")
attr(pHData$Activity_StartDateTime, "tzone")
[1] "UTC"
pHData$Activity_StartDateTime[1]
[1] "1975-09-27 15:50:00 UTC"
pHData2 <- readWQPdata(siteid = "USGS-04024315",
characteristicName = nameToUse,
tz = "America/Chicago",
service = "ResultWQX")
attr(pHData2$Activity_StartDateTime, "tzone")
[1] "America/Chicago"
pHData2$Activity_StartDateTime[1]
[1] "1975-09-27 10:50:00 CDT"
So what you are asking for is a column that converts the offset number of hours to the timezone it was converted to?
Note there's also the link in the help to the OlsonNames() base R function which talks about how R handles timezones. The issue is that different operating systems and depending on where in the world the computer things you are will want different abbreviations for timezones (that's why using the OlsonNames is what has been working best for dataRetrieval).
https://rdrr.io/r/base/timezones.html
@hillarymarler this is ready for review |
A few suggested formatting changes to get lines in documentation < 100 chars (for CRAN related checks). I did not attempt to shorten/format the URLs - those will still trigger that warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I committed a few small formatting changes that might be useful (reduce # of chars per line as that's a common warning we see in checks, though I did not attempt to do anything about the URLs.
Only other minor suggestion be to consider if the function name should be included in the stop message. We're inconsistent about this with our printed messages or maybe I'm unclear on the "rules" for when to include it or not.
All updates look good!
Now TADA_RandomTestingData will re-run until it returns at least 10 results
…o-results' into TADA_CheckRequiredFields
TADA_CheckRequiredFields now returns error message with names of missing cols included, updates TADA_AutoClean to add ActivityStartDateTime if missing