
Commit

Merge pull request #625 from jhudsl/winter25cleaning
Winter25cleaning
carriewright11 authored Dec 26, 2024
2 parents e996296 + be75ba6 commit 3d4b413
Showing 2 changed files with 70 additions and 26 deletions.
47 changes: 42 additions & 5 deletions modules/Data_Cleaning/Data_Cleaning.Rmd
@@ -155,10 +155,10 @@ miss_var_summary(bike)
## `naniar` plots

The `gg_miss_var()` function creates a nice plot about the number of
missing values for each variable, (need a data frame).
missing values for each variable (it needs a data frame). Using `show_pct = TRUE` shows the percent missing.

```{r, fig.height=4, warning=FALSE, fig.align='center'}
gg_miss_var(bike)
gg_miss_var(bike, show_pct = TRUE)
```
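
As a rough sketch of what those percentages represent, the share of missing values per column can also be computed by hand with `dplyr`. The `toy` data frame below is made up for illustration and is not the `bike` data.

```{r}
library(dplyr)

# Hypothetical toy data (not the bike dataset)
toy <- tibble(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z")
)

# mean(is.na(x)) is the fraction of missing values in a column;
# multiplying by 100 turns it into a percent
pct_missing <- toy %>%
  summarize(across(everything(), ~ mean(is.na(.x)) * 100))

pct_missing
```

Each column of `pct_missing` holds the percent of `NA` values for the column of the same name in `toy`.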


@@ -361,6 +361,20 @@ Pay attention to your data and your `NA` values!
knitr::include_graphics("images/debug.png")
```

## GUT CHECK: What function can be used to remove NA values from a full dataframe or for an individual column?

A. `drop_nulls()`

B. `drop_na()`

C. `rem_na()`

## GUT CHECK: How can you keep NA values when using `filter`?

A. include `| is.na()`

B. include `& is.na()`
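
A small sketch of both ideas on made-up data (the `toy` tibble and its column names are hypothetical): `drop_na()` removes rows with `NA`, while `filter()` drops rows where the condition evaluates to `NA` unless they are kept explicitly.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing score
toy <- tibble(id = 1:4, score = c(10, NA, 3, 8))

# drop_na() on one column removes rows where that column is NA
no_na <- toy %>% drop_na(score)

# filter() drops rows where the condition evaluates to NA...
high_only <- toy %>% filter(score > 5)

# ...unless the NA rows are requested as well
high_or_na <- toy %>% filter(score > 5 | is.na(score))
```

Comparing `nrow(high_only)` with `nrow(high_or_na)` is a quick way to see how many rows a filter removed only because of missing values.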

## Summary

- `is.na()`,`any(is.na())`, `all(is.na())`,`count()`, and functions from `naniar` like `gg_miss_var()` and `miss_var_summary` can help determine if we have `NA` values
@@ -376,9 +390,11 @@ knitr::include_graphics("images/debug.png")

## Lab Part 1

🏠 [Class Website](https://jhudatascience.org/intro_to_r/)
🏠 [Class Website](https://jhudatascience.org/intro_to_r/)

💻[Lab](https://jhudatascience.org/intro_to_r/modules/Data_Cleaning/lab/Data_Cleaning_Lab.Rmd)

📃 [Day 5 Cheatsheet](https://jhudatascience.org/intro_to_r/modules/cheatsheets/Day-5.pdf)


# Recoding Variables
@@ -602,6 +618,13 @@ data_ginger_mint %>%
count(Group, Effect)
```

## GUT CHECK: If we want all unspecified values to remain the same with `case_when()`, how should we complete the `TRUE ~` statement?

A. With the name of the variable we are modifying or using as source

B. With the word "same"
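
A minimal sketch of the fallback behavior, using a made-up `toy` tibble (not the `data_ginger_mint` data): values that match no earlier condition are passed through unchanged when the `TRUE ~` branch returns the source variable.

```{r}
library(dplyr)

# Hypothetical toy data with inconsistent coding
toy <- tibble(effect = c("N", "no", "Yes", "Unknown"))

recoded <- toy %>%
  mutate(effect = case_when(
    effect %in% c("N", "no") ~ "No",
    TRUE ~ effect # everything not matched above keeps its original value
  ))

count(recoded, effect)
```

Without the `TRUE ~ effect` line, the unmatched values would become `NA`.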


# Working with strings

## Strings in R
@@ -718,14 +741,23 @@ data_ginger_mint %>%
count(Treatment, Treatment_recoded)
```

This is a more robust solution! It will catch typos as long as first letter is correct or there is part of the word mint.
This is a more robust solution! It will catch typos as long as the first letter is correct or there is part of the word mint.

## That's better!

```{r, echo = FALSE, fig.align='center', out.width= "30%"}
knitr::include_graphics("https://media1.giphy.com/media/S9ZK4mmi3u3jdc5dek/200w.webp?cid=ecf05e47h7myga959jwvek6s9x1tkog135g7pxu8vvjz2yqb&rid=200w.webp&ct=g")
```


## GUT CHECK: What `stringr` function helps us find a string pattern?

A. `str_replace()`

B. `str_find()`

C. `str_detect()`
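
A short sketch on a made-up character vector (the `treatments` values are hypothetical): `str_detect()` returns `TRUE` or `FALSE` for each string depending on whether the pattern is found.

```{r}
library(stringr)

# Hypothetical values, including a typo ("Mnt")
treatments <- c("Ginger", "mint", "Mnt", "peppermint", "Other")

# TRUE where "mint" appears anywhere in the string (matching is case-sensitive)
has_mint <- str_detect(treatments, pattern = "mint")

has_mint
```

Because the result is a logical vector, it can be used directly inside `filter()` or as a condition in `case_when()`.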

# Separating and uniting data

## Uniting columns
@@ -784,9 +816,14 @@ knitr::include_graphics("images/case_when.png")

## Lab Part 2

🏠 [Class Website](https://jhudatascience.org/intro_to_r/)
🏠 [Class Website](https://jhudatascience.org/intro_to_r/)

💻[Lab](https://jhudatascience.org/intro_to_r/modules/Data_Cleaning/lab/Data_Cleaning_Lab.Rmd)

📃 [Day 5 Cheatsheet](https://jhudatascience.org/intro_to_r/modules/cheatsheets/Day-5.pdf)

📃 [Posit's `stringr` Cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf)

```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```
49 changes: 28 additions & 21 deletions modules/Data_Cleaning/lab/Data_Cleaning_Lab_Key.Rmd
@@ -15,18 +15,13 @@ The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gy
You can download it as a CSV to your current working directory. Note it's also available at: http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv

```{r}
library(readr)
library(tidyverse)
library(dplyr)
library(lubridate)
library(jhur)
library(tidyverse)
library(broom)
# install.packages("naniar")
library(naniar)
```

Read in the bike data, you can use the URL or download the data.
Read in the bike data and save it as an object called `bike`. You can use the URL or download the data.

Bike Lanes Dataset: BikeBaltimore is the Department of Transportation's bike program.
The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms
@@ -67,10 +62,10 @@ have_rout <- bike %>% drop_na(route)

### 1.3

Use `naniar` to make a visual of the amount of data missing for each variable of `bike` (use `gg_miss_var()`). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/
Use `naniar` to make a visual of the amount of data missing for each variable of `bike` (use `gg_miss_var()` with `show_pct = TRUE` as an argument). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/

```{r 1.3response}
gg_miss_var(bike)
gg_miss_var(bike, show_pct = TRUE)
```


@@ -85,6 +80,18 @@ pull(bike, subType) %>% pct_complete() # this
miss_var_summary(bike) # or this
```

## P.2

Use the `na_if()` function to replace values of 0 in the `dateInstalled` variable with `NA`. Check your work using the `count()` function.

```{r}
bike <- bike %>%
mutate(dateInstalled = na_if(dateInstalled, 0))
count(bike, dateInstalled)
```
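
The same pattern on a tiny made-up tibble makes the effect easy to see (the `toy` data and the `year_installed` column are hypothetical, not part of the bike data):

```{r}
library(dplyr)

# Hypothetical toy data where 0 stands in for "unknown"
toy <- tibble(year_installed = c(2010, 0, 2012, 0))

# na_if() replaces every value equal to 0 with NA and leaves the rest alone
toy <- toy %>%
  mutate(year_installed = na_if(year_installed, 0))

count(toy, year_installed)
```

After the recode, `count()` should show an `NA` row and no row for 0.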




# Part 2

@@ -96,7 +103,7 @@ Let's say we have the data like so:

```{r}
BloodType <- tibble(
weight_loss =
exposure =
c(
"Y", "No", "Yes", "y", "no",
"n", "No", "N", "yes", "Yes",
@@ -121,20 +128,20 @@ There are some issues with this data that we need to figure out!

### 2.1

Determine how many `NA` values there are for `weight_loss` (assume you know that`N` and `n` is for no).
Determine how many `NA` values there are for `exposure` (assume you know that `N` and `n` stand for no).

```{r 2.1response}
count(BloodType, weight_loss) # the simple way
sum(is.na(pull(BloodType, weight_loss))) # another way
count(BloodType, exposure) # the simple way
sum(is.na(pull(BloodType, exposure))) # another way
BloodType %>% # another way
pull(weight_loss) %>%
pull(exposure) %>%
is.na() %>%
sum()
```

### 2.2

Recode the `weight_loss` variable of the `BloodType` data so that it is consistent. Use `case_when()`. Keep "Other" as "Other". Don't forget to use quotes!
Recode the `exposure` variable of the `BloodType` data so that it is consistent. Use `case_when()`. Keep "Other" as "Other". Don't forget to use quotes!

```
# General format
@@ -149,21 +156,21 @@ NEW_TIBBLE <- OLD_TIBBLE %>%
```{r 2.2response}
BloodType <- BloodType %>%
mutate(weight_loss = case_when(
weight_loss %in% c("N", "n", "No", "no") ~ "No",
weight_loss %in% c("Y", "y", "Yes", "yes") ~ "Yes",
TRUE ~ weight_loss # the only other value is an NA so we could include this or we don't need to (it's generally good practice unless we want to create NAs)
mutate(exposure = case_when(
exposure %in% c("N", "n", "No", "no") ~ "No",
exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
TRUE ~ exposure # the only other value is an NA so we could include this or we don't need to (it's generally good practice unless we want to create NAs)
))
count(BloodType, weight_loss)
count(BloodType, exposure)
```

### 2.3

Check to see how many values `weight_loss` has for each category (hint: use `count`). It's good practice to regularly check your data throughout the data wrangling process.
Check to see how many values `exposure` has for each category (hint: use `count`). It's good practice to regularly check your data throughout the data wrangling process.

```{r 2.3response}
BloodType %>% count(weight_loss)
BloodType %>% count(exposure)
```

### 2.4
