Winter25cleaning #625

Merged
merged 8 commits into from
Dec 26, 2024
47 changes: 42 additions & 5 deletions modules/Data_Cleaning/Data_Cleaning.Rmd
@@ -155,10 +155,10 @@ miss_var_summary(bike)
## `naniar` plots

The `gg_miss_var()` function creates a nice plot about the number of
missing values for each variable, (need a data frame).
missing values for each variable, (need a data frame). Using `show_pct = TRUE` shows the percent missing.

```{r, fig.height=4, warning=FALSE, fig.align='center'}
gg_miss_var(bike)
gg_miss_var(bike, show_pct = TRUE)
```


@@ -361,6 +361,20 @@ Pay attention to your data and your `NA` values!
knitr::include_graphics("images/debug.png")
```

## GUT CHECK: What function can be used to remove NA values from a full dataframe or for an individual column?

A. `drop_nulls()`

B. `drop_na()`

C. `rem_na()`

## GUT CHECK: How can you keep NA values when using `filter`?

A. include `| is.na()`

B. include `& is.na()`
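
The two gut checks above can be tried out on a small example. This is a minimal sketch, assuming `dplyr` and `tidyr` are installed; the `toy` tibble and its column names are made up for illustration and are not the course's `bike` data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the course's `bike` dataset)
toy <- tibble(id = 1:4, length_m = c(120, NA, 45, 300))

# drop_na() with no columns removes rows that have an NA anywhere;
# naming a column only checks that column
no_missing <- toy %>% drop_na(length_m)

# filter() drops rows where the condition evaluates to NA
long_only <- toy %>% filter(length_m > 100)

# Adding `| is.na()` keeps the rows with missing values
long_or_missing <- toy %>% filter(length_m > 100 | is.na(length_m))
```

Comparing `nrow(long_only)` with `nrow(long_or_missing)` shows how many rows `filter()` would otherwise have dropped because of `NA` values.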

## Summary

- `is.na()`,`any(is.na())`, `all(is.na())`,`count()`, and functions from `naniar` like `gg_miss_var()` and `miss_var_summary` can help determine if we have `NA` values
@@ -376,9 +390,11 @@ knitr::include_graphics("images/debug.png")

## Lab Part 1

🏠 [Class Website](https://jhudatascience.org/intro_to_r/)
🏠 [Class Website](https://jhudatascience.org/intro_to_r/)

💻[Lab](https://jhudatascience.org/intro_to_r/modules/Data_Cleaning/lab/Data_Cleaning_Lab.Rmd)

📃 [Day 5 Cheatsheet](https://jhudatascience.org/intro_to_r/modules/cheatsheets/Day-5.pdf)


# Recoding Variables
@@ -602,6 +618,13 @@ data_ginger_mint %>%
count(Group, Effect)
```

## GUT CHECK: If we want all unspecified values to remain the same with `case_when()`, how should we complete the `TRUE ~` statement?

A. With the name of the variable we are modifying or using as source

B. With the word "same"
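
A small sketch of the `TRUE ~` fallback, assuming `dplyr` is installed. The `toy` tibble is hypothetical and only loosely modeled on the `Treatment` column used in the slides.

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- tibble(Treatment = c("Ginger", "ginger", "Mint", "Other"))

recoded <- toy %>%
  mutate(Treatment_recoded = case_when(
    Treatment == "ginger" ~ "Ginger", # fix one known inconsistency
    TRUE ~ Treatment                  # everything else keeps its original value
  ))
```

Without the `TRUE ~ Treatment` line, every value not matched by an earlier condition would become `NA`.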


# Working with strings

## Strings in R
@@ -718,14 +741,23 @@ data_ginger_mint %>%
count(Treatment, Treatment_recoded)
```

This is a more robust solution! It will catch typos as long as first letter is correct or there is part of the word mint.
This is a more robust solution! It will catch typos as long as the first letter is correct or there is part of the word mint.

## That's better!

```{r, echo = FALSE, fig.align='center', out.width= "30%"}
knitr::include_graphics("https://media1.giphy.com/media/S9ZK4mmi3u3jdc5dek/200w.webp?cid=ecf05e47h7myga959jwvek6s9x1tkog135g7pxu8vvjz2yqb&rid=200w.webp&ct=g")
```


## GUT CHECK: What `stringr` function helps us find a string pattern?

A. `str_replace()`

B. `str_find()`

C. `str_detect()`
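
To see how pattern detection behaves on messy values, here is a minimal sketch assuming `stringr` is installed. The `treatments` vector and the two patterns (`"^M"` and `"int"`) are assumptions chosen for illustration; they are not necessarily the patterns used in the slides.

```r
library(stringr)

# Hypothetical messy values, including a typo ("Mnt")
treatments <- c("Mint", "mint", "Mnt", "Ginger", "peppermint")

# TRUE where the value contains part of the word "mint"
has_int <- str_detect(treatments, "int")

# TRUE where the value starts with a capital M
starts_m <- str_detect(treatments, "^M")

# Either condition flags the value as a mint treatment
is_mint <- has_int | starts_m
```

Combining the two checks flags the typo "Mnt" as well, while "Ginger" stays unflagged.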

# Separating and uniting data

## Uniting columns
@@ -784,9 +816,14 @@ knitr::include_graphics("images/case_when.png")

## Lab Part 2

🏠 [Class Website](https://jhudatascience.org/intro_to_r/)
🏠 [Class Website](https://jhudatascience.org/intro_to_r/)

💻[Lab](https://jhudatascience.org/intro_to_r/modules/Data_Cleaning/lab/Data_Cleaning_Lab.Rmd)

📃 [Day 5 Cheatsheet](https://jhudatascience.org/intro_to_r/modules/cheatsheets/Day-5.pdf)

📃 [Posit's `stringr` Cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf)

```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```
49 changes: 28 additions & 21 deletions modules/Data_Cleaning/lab/Data_Cleaning_Lab_Key.Rmd
@@ -15,18 +15,13 @@ The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gy
You can Download as a CSV in your current working directory. Note its also available at: http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv

```{r}
library(readr)
library(tidyverse)
library(dplyr)
library(lubridate)
library(jhur)
library(tidyverse)
library(broom)
# install.packages("naniar")
library(naniar)
```

Read in the bike data, you can use the URL or download the data.
Read in the bike data. You can use the URL or download the data. Save the data as an object called `bike`.

Bike Lanes Dataset: BikeBaltimore is the Department of Transportation's bike program.
The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms
@@ -67,10 +62,10 @@ have_rout <- bike %>% drop_na(route)

### 1.3

Use `naniar` to make a visual of the amount of data missing for each variable of `bike` (use `gg_miss_var()`). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/
Use `naniar` to make a visual of the amount of data missing for each variable of `bike` (use `gg_miss_var()` and use `show_pct = TRUE` as an argument). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/

```{r 1.3response}
gg_miss_var(bike)
gg_miss_var(bike, show_pct = TRUE)
```


@@ -85,6 +80,18 @@ pull(bike, subType) %>% pct_complete() # this
miss_var_summary(bike) # or this
```

## P.2

Use the `na_if` function to replace values of 0 in the `dateInstalled` variable with `NA`. Check your work using the `count` function.

```{r}
bike <- bike %>%
  mutate(dateInstalled = na_if(dateInstalled, 0))
count(bike, dateInstalled)
```
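
The same `na_if()` pattern can be checked on data that needs no download. This is a minimal sketch assuming `dplyr` is installed; the `toy` tibble and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the bike data: 0 marks an unknown install year
toy <- tibble(dateInstalled = c(2010, 0, 2012, 0))

# na_if() turns every value equal to 0 into NA and leaves the rest alone
toy <- toy %>%
  mutate(dateInstalled = na_if(dateInstalled, 0))

# count() lists NA as its own row, which makes the replacement easy to verify
count(toy, dateInstalled)
```

After the replacement, `count()` should no longer show a row for 0.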




# Part 2

@@ -96,7 +103,7 @@ Let's say we the data like so:

```{r}
BloodType <- tibble(
weight_loss =
exposure =
c(
"Y", "No", "Yes", "y", "no",
"n", "No", "N", "yes", "Yes",
@@ -121,20 +128,20 @@ There are some issues with this data that we need to figure out!

### 2.1

Determine how many `NA` values there are for `weight_loss` (assume you know that`N` and `n` is for no).
Determine how many `NA` values there are for `exposure` (assume you know that `N` and `n` are for no).

```{r 2.1response}
count(BloodType, weight_loss) # the simple way
sum(is.na(pull(BloodType, weight_loss))) # another way
count(BloodType, exposure) # the simple way
sum(is.na(pull(BloodType, exposure))) # another way
BloodType %>% # another way
pull(weight_loss) %>%
pull(exposure) %>%
is.na() %>%
sum()
```

### 2.2

Recode the `weight_loss` variable of the `BloodType` data so that it is consistent. Use `case_when()`. Keep "Other" as "Other". Don't forget to use quotes!
Recode the `exposure` variable of the `BloodType` data so that it is consistent. Use `case_when()`. Keep "Other" as "Other". Don't forget to use quotes!

```
# General format
@@ -149,21 +156,21 @@ NEW_TIBBLE <- OLD_TIBBLE %>%
```{r 2.2response}

BloodType <- BloodType %>%
mutate(weight_loss = case_when(
weight_loss %in% c("N", "n", "No", "no") ~ "No",
weight_loss %in% c("Y", "y", "Yes", "yes") ~ "Yes",
TRUE ~ weight_loss # the only other value is an NA so we could include this or we don't need to (it's generally good practice unless we want to create NAs)
mutate(exposure = case_when(
exposure %in% c("N", "n", "No", "no") ~ "No",
exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
TRUE ~ exposure # the only other value is an NA so we could include this or we don't need to (it's generally good practice unless we want to create NAs)
))

count(BloodType, weight_loss)
count(BloodType, exposure)
```

### 2.3

Check to see how many values `weight_loss` has for each category (hint: use `count`). It's good practice to regularly check your data throughout the data wrangling process.
Check to see how many values `exposure` has for each category (hint: use `count`). It's good practice to regularly check your data throughout the data wrangling process.

```{r 2.3response}
BloodType %>% count(weight_loss)
BloodType %>% count(exposure)
```

### 2.4