Merge pull request #625 from jhudsl/winter25cleaning

Winter25cleaning
jhudsl · Dec 26, 2024 · 3d4b413
2 parents e996296 + be75ba6
commit 3d4b413
Showing 2 changed files with 70 additions and 26 deletions.
diff --git a/modules/Data_Cleaning/Data_Cleaning.Rmd b/modules/Data_Cleaning/Data_Cleaning.Rmd
@@ -155,10 +155,10 @@ miss_var_summary(bike)
 ## `naniar` plots
 
 The `gg_miss_var()` function creates a nice plot about the number of
-missing values for each variable, (need a data frame).
+missing values for each variable (it needs a data frame). Using `show_pct = TRUE` shows the percent missing.
 
 ```{r, fig.height=4, warning=FALSE, fig.align='center'}
-gg_miss_var(bike)
+gg_miss_var(bike, show_pct = TRUE)
 ```
 
 
@@ -361,6 +361,20 @@ Pay attention to your data and your `NA` values!
 knitr::include_graphics("images/debug.png")
 ```
 
+## GUT CHECK: What function can be used to remove NA values from a full dataframe or for an individual column?
+
+A. `drop_nulls()`
+
+B. `drop_na()`
+
+C. `rem_na()`
+
+## GUT CHECK: How can you keep NA values when using `filter`?
+
+A. include `| is.na()` 
+
+B. include `& is.na()`
+
 ## Summary
 
 -   `is.na()`,`any(is.na())`, `all(is.na())`,`count()`, and functions from `naniar` like `gg_miss_var()` and `miss_var_summary` can help determine if we have `NA` values
@@ -376,9 +390,11 @@ knitr::include_graphics("images/debug.png")
 
 ## Lab Part 1
 
-🏠 [Class Website](https://jhudatascience.org/intro_to_r/)    
+🏠 [Class Website](https://jhudatascience.org/intro_to_r/)  
+
 💻[Lab](https://jhudatascience.org/intro_to_r/modules/Data_Cleaning/lab/Data_Cleaning_Lab.Rmd)
 
+📃 [Day 5 Cheatsheet](https://jhudatascience.org/intro_to_r/modules/cheatsheets/Day-5.pdf)
 
 
 # Recoding Variables
@@ -602,6 +618,13 @@ data_ginger_mint %>%
   count(Group, Effect)
 ```
 
+## GUT CHECK: If we want all unspecified values to remain the same with `case_when()`, how should we complete the `TRUE ~` statement?
+
+A. With the name of the variable we are modifying or using as source
+
+B. With the word "same"
+
+
 # Working with strings
 
 ## Strings in R
@@ -718,14 +741,23 @@ data_ginger_mint %>%
   count(Treatment, Treatment_recoded)
 ```
 
-This is a more robust solution! It will catch typos as long as first letter is correct or there is part of the word mint.
+This is a more robust solution! It will catch typos as long as the first letter is correct or there is part of the word mint.
 
 ## That's better!
 
 ```{r, echo = FALSE, fig.align='center', out.width= "30%"}
 knitr::include_graphics("https://media1.giphy.com/media/S9ZK4mmi3u3jdc5dek/200w.webp?cid=ecf05e47h7myga959jwvek6s9x1tkog135g7pxu8vvjz2yqb&rid=200w.webp&ct=g")
 ```
 
+
+## GUT CHECK: What `stringr` function helps us find a string pattern?
+
+A. `str_replace()`
+
+B. `str_find()`
+
+C. `str_detect()`
+
 # Separating and uniting data
 
 ## Uniting columns 
@@ -784,9 +816,14 @@ knitr::include_graphics("images/case_when.png")
 
 ## Lab Part 2
 
-🏠 [Class Website](https://jhudatascience.org/intro_to_r/)    
+🏠 [Class Website](https://jhudatascience.org/intro_to_r/)  
+
 💻[Lab](https://jhudatascience.org/intro_to_r/modules/Data_Cleaning/lab/Data_Cleaning_Lab.Rmd)
 
+📃 [Day 5 Cheatsheet](https://jhudatascience.org/intro_to_r/modules/cheatsheets/Day-5.pdf)
+
+📃 [Posit's `stringr` Cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf)
+
 ```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
 knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
 ```

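Review note: the gut-check questions added to `Data_Cleaning.Rmd` above can be checked against a small example. The sketch below is illustrative only; it assumes `dplyr` and `tidyr` are installed, and the `toy` and `answers` tibbles are hypothetical stand-ins rather than the course's `bike` or `BloodType` data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the course's `bike` dataset)
toy <- tibble(
  id = 1:5,
  length = c(10, NA, 30, 5, NA)
)

# drop_na() removes rows with NA in the named column
# (with no column named, it checks every column)
no_missing <- toy %>% drop_na(length)

# filter() drops rows where the condition evaluates to NA...
strict <- toy %>% filter(length > 8)

# ...so `| is.na()` is needed to keep the NA rows
kept <- toy %>% filter(length > 8 | is.na(length))

# Hypothetical responses, loosely modeled on the lab's `exposure` column
answers <- tibble(exposure = c("Y", "no", "Other", NA))

# `TRUE ~ exposure` leaves every unmatched value (including NA) unchanged
recoded <- answers %>%
  mutate(exposure = case_when(
    exposure %in% c("N", "n", "No", "no") ~ "No",
    exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
    TRUE ~ exposure
  ))

count(recoded, exposure)
```

Comparing `nrow(strict)` with `nrow(kept)` shows how many rows a plain `filter()` condition discards because of `NA` values.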
diff --git a/modules/Data_Cleaning/lab/Data_Cleaning_Lab_Key.Rmd b/modules/Data_Cleaning/lab/Data_Cleaning_Lab_Key.Rmd
@@ -15,18 +15,13 @@ The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gy
 You can Download as a CSV in your current working directory.  Note its also available at: 	http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv 
 
 ```{r}
-library(readr)
 library(tidyverse)
-library(dplyr)
-library(lubridate)
 library(jhur)
-library(tidyverse)
-library(broom)
 # install.packages("naniar")
 library(naniar)
 ```
 
-Read in the bike data, you can use the URL or download the data.
+Read in the bike data (you can use the URL or download the data) and save it as an object called `bike`.
 
 Bike Lanes Dataset: BikeBaltimore is the Department of Transportation's bike program. 
 The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms
@@ -67,10 +62,10 @@ have_rout <- bike %>% drop_na(route)
 
 ### 1.3
 
-Use `naniar` to make a visual of the amount of data missing for each variable of `bike` (use `gg_miss_var()`). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/
+Use `naniar` to make a visual of the amount of data missing for each variable of `bike` (use `gg_miss_var()` with `show_pct = TRUE` as an argument). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/
 
 ```{r 1.3response}
-gg_miss_var(bike)
+gg_miss_var(bike, show_pct = TRUE)
 ```
 
 
@@ -85,6 +80,18 @@ pull(bike, subType) %>% pct_complete() # this
 miss_var_summary(bike) # or this
 ```
 
+## P.2
+
+Use the `na_if` function to replace values of 0 in the `dateInstalled` variable with `NA`. Check your work using the `count` function.
+
+```{r}
+bike <- bike %>% 
+  mutate(dateInstalled = na_if(dateInstalled, 0))
+count(bike, dateInstalled)
+```
+
+
+
 
 # Part 2
 
@@ -96,7 +103,7 @@ Let's say we the data like so:
 
 ```{r}
 BloodType <- tibble(
-  weight_loss =
+  exposure =
     c(
       "Y", "No", "Yes", "y", "no",
       "n", "No", "N", "yes", "Yes",
@@ -121,20 +128,20 @@ There are some issues with this data that we need to figure out!
 
 ### 2.1
 
-Determine how many `NA` values there are for `weight_loss` (assume you know that`N` and `n` is for no).
+Determine how many `NA` values there are for `exposure` (assume you know that `N` and `n` are for no).
 
 ```{r 2.1response}
-count(BloodType, weight_loss) # the simple way
-sum(is.na(pull(BloodType, weight_loss))) # another way
+count(BloodType, exposure) # the simple way
+sum(is.na(pull(BloodType, exposure))) # another way
 BloodType %>% # another way
-  pull(weight_loss) %>%
+  pull(exposure) %>%
   is.na() %>%
   sum()
 ```
 
 ### 2.2
 
-Recode the `weight_loss` variable of the `BloodType` data so that it is consistent. Use `case_when()`. Keep "Other" as "Other". Don't forget to use quotes!
+Recode the `exposure` variable of the `BloodType` data so that it is consistent. Use `case_when()`. Keep "Other" as "Other". Don't forget to use quotes!
 
 ```
 # General format
@@ -149,21 +156,21 @@ NEW_TIBBLE <- OLD_TIBBLE %>%
 ```{r 2.2response}
 
 BloodType <- BloodType %>%
-  mutate(weight_loss = case_when(
-    weight_loss %in% c("N", "n", "No", "no") ~ "No",
-    weight_loss %in% c("Y", "y", "Yes", "yes") ~ "Yes",
-    TRUE ~ weight_loss # the only other value is an NA so we could include this or we don't need to (it's generally good practice unless we want to create NAs)
+  mutate(exposure = case_when(
+    exposure %in% c("N", "n", "No", "no") ~ "No",
+    exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
+    TRUE ~ exposure # the only other value is an NA so we could include this or we don't need to (it's generally good practice unless we want to create NAs)
   ))
 
-count(BloodType, weight_loss)
+count(BloodType, exposure)
 ```
 
 ### 2.3
 
-Check to see how many values `weight_loss` has for each category (hint: use `count`). It's good practice to regularly check your data throughout the data wrangling process.
+Check to see how many values `exposure` has for each category (hint: use `count`). It's good practice to regularly check your data throughout the data wrangling process.
 
 ```{r 2.3response}
-BloodType %>% count(weight_loss)
+BloodType %>% count(exposure)
 ```
 
 ### 2.4