diff --git a/Index.Rmd b/Index.Rmd index c50f467..9ae9631 100644 --- a/Index.Rmd +++ b/Index.Rmd @@ -53,11 +53,11 @@ knitr::write_bib(c(.packages(), ```{r child_Intro, child="Introduction.Rmd"} ``` - - +```{r child_Wrangle1, child="Data_Wrangling_1.Rmd"} +``` - - +```{r child_Wrangle2, child="Data_Wrangling_2.Rmd"} +``` diff --git a/Index.html b/Index.html index a4b91ff..4f3ced6 100644 --- a/Index.html +++ b/Index.html @@ -198,10 +198,739 @@

Bonus RStudio settings

[] or curly braces {}. You can adjust these settings under Global Options > Code > Display.

- - - - + + +
+

Data Wrangling Part 1

+

+

The video below gives a brief overview of the content covered in this +section of the tutorial. All the Data +Wrangling slides are also available as a web page.

+ + +
+

Notes on R

+ +

+
+
+

Notes on R: About process

+ +
+
+

Notes on R: Focus

+

+
+
+

Notes on R: Keeping track of work

+

+

Keep it tidy

+

When writing .R files, use the # symbol to annotate your code and +to keep lines from running. This is a great way to make notes for others and for +future you.

+

We will talk later about using other file types like +Rmarkdown for organizing R script and other associated +languages. There it will be possible to add a lot more information and +text, citations etc. An .R file is intended for code but we can still +keep it organized in sections by ending headers with ---- +or #### annotation.

+

# Section 1 ----

+

# Section 2 ####

+

# Section 3 ####

+

Look for the table of contents in the upper right +corner of the RStudio scripting pane (next to the Run +button).

+

+

Read more tips in the Tidyverse +Style guide.

+
+
+
+

Data Wrangling Part 2

+

+

The video below offers a brief overview of the content covered in +this section of the tutorial. Feel free to watch the video and follow +along or simply work through the tutorial.

+ + +
+

Notes on tidy R

+

+

Keep it tidy

+

If you are following this tutorial by running the code on your local +machine (recommended) then it may make sense to check your R version by +running the following code in your R console:

+
+
version
+ +
+

At the time of writing this I am using R version 4.0.4 +Lost Library Book (R Core Team +2023). If you do not have this version or something newer it may +make sense to update so that you can follow along without pesky version +issues.

+

The easiest way to get libraries for today is to install the whole +tidyverse (Wickham 2023) by typing +install.packages("tidyverse") in the R console and then +running library(tidyverse):

+
install.packages("tidyverse")
+library(tidyverse)
+

If you save your work to an .R file (recommended) be sure to annotate +any code that you do not intend to run each time with the # +symbol. You should only need to install tidyverse once and should be +sure to either change that line of code to +#install.packages("tidyverse") or remove it from your +script.
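
As an alternative to commenting the install line out, the script can install the library only when it is missing. This is a minimal sketch, not part of the original tutorial code: the helper name `is_installed` is made up for illustration, while `requireNamespace` and `install.packages` are standard R functions.

```r
# Hypothetical helper: TRUE when a library is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Install the tidyverse only if it is missing, then load it
if (!is_installed("tidyverse")) {
  install.packages("tidyverse")
}
library(tidyverse)
```

With this pattern the script can be re-run from top to bottom without triggering a fresh installation each time.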

+

Read more style tips in the tidyverse style guide (Wickham 2023).

+
+
+

Notes on tidy R browseVignettes

+

+

Keep it tidy

+

Get a lot of examples and details about the tidyverse by running the +following code in the R console: +browseVignettes(package = "tidyverse"). Nearly every R +library has a collection of vignettes that walk through examples and +show, often in explicit detail, the authors' intended use of the +library.

+
+
+

The tidy tools manifesto

+

In this tutorial we will be following the basic ideas behind the +tidyverse.

+

+

Read the full +tidyverse manifesto here.

+
+
+

Notes on R: tidyR process

+

+

Keep it tidy

+

+ +

+

Read more style tips in the tidyverse style guide (Wickham 2023).

+
+
+

Notes on R: Tidy Data

+

Three things make a dataset tidy:

+ +

+

Read more about this from Wickham’s paper in the +Journal of Statistical Software.
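
To make the idea concrete, here is a small sketch of turning an untidy table into a tidy one with `pivot_longer` from tidyr. The data frame `untidy` and its column names are invented for illustration and are not part of the course data.

```r
library(tidyr)

# Hypothetical untidy data: one column per year, so the variable
# "year" is hidden in the column names
untidy <- data.frame(
  name = c("A", "B"),
  hours_2022 = c(6, 8),
  hours_2023 = c(7, 9)
)

# Reshape so that each variable is a column and each observation is a row
tidy <- pivot_longer(
  untidy,
  cols = c(hours_2022, hours_2023),
  names_to = "year",
  names_prefix = "hours_",
  values_to = "working_hours_per_day"
)

tidy
```

After reshaping, `year` and `working_hours_per_day` are ordinary columns that the dplyr functions can work with directly.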

+
+
+

Wrangling: transform

+ +

+

www.codeastar.com/data-wrangling/

+
+
+

Wrangling: dplyr arguments

+

Format of dplyr

+

+

Arguments start with a data frame

+ +

https://dplyr.tidyverse.org/
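
The shared call pattern can be sketched with a small made-up data frame: the first argument is the data frame, the following arguments describe what to do with it using bare column names, and the result is a new data frame. The object `toy` is hypothetical and only stands in for participants_data.

```r
library(dplyr)

# Hypothetical toy data, standing in for participants_data
toy <- data.frame(
  age = c(25, 31, 40),
  working_hours_per_day = c(6, 9, 11)
)

# First argument: the data frame; later arguments: what to do with it
long_days <- filter(toy, working_hours_per_day > 8)
ages_only <- select(toy, age)

# The input is not modified; each function returns a new data frame
nrow(toy)
nrow(long_days)
```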

+
+
+

Getting your data in R

+

Load data

+

+

The data we will use for this course is on +Github and you can save it as a .csv to your local folder.

+ +
# Use this on your machine
+participants_data <- read.csv("participants_data.csv")
+

Learn more about what this function does by typing +?read.csv in the R console.

+

You can also get the data from this Github repository by using the +read_csv function from the readr library (Wickham, Hester, and Bryan 2023) and +url function from base R. In this case you will want to use +the ‘save as’ option for the webpage so that you can have it stored +locally as a comma separated values (.csv) file on your +machine.

+
+
library(readr)
+
+urlfile = "https://raw.githubusercontent.com/CWWhitney/teaching_R/master/participants_data.csv"
+
+participants_data <- read_csv(url(urlfile))
+ +
+ +
+
+

Looking at the data

+ +
+
participants_data
+ +
+ +
+
# Change the number of rows displayed to 7
+head(participants_data, 
+     n = 4)
+ +
+
+
# use the ?head option to learn the details of the function
+?head
+
+
+
# look at the 'Arguments' section for the 'n' argument
+?head
+
+
+
# The 'n' argument should be changed from 'n = 4' to 'n = 7'
+head(participants_data, 
+     n = 7)
+
+
+
head(participants_data, 
+     n = 7)
+
+ +
+
names(participants_data)
+ +
+ +
+
str(participants_data)
+ +
+ +
+
# Change the variable to gender
+participants_data$age
+ +
+
+
participants_data$gender
+
+

Follow these steps to see the result of the rest of the +transformations we perform with tidyverse.

+
+
+

Wrangling: dplyr library

+

Using dplyr

+

Load the dplyr library by running library(dplyr) in the +R console. Do the same for the other libraries we need today: +library(tidyr) and library(magrittr) (Wickham, François, et al. 2023).

+

Inspiration for many of the following materials comes from Roger +Peng’s dplyr +tutorial.

+

+

Read more about the dplyr +library (Wickham, François, et al. +2023).

+
+
+

Wrangling: dplyr::select aca_work_set

+

Subsetting

+

+

Select

+

Create a subset of the data with the select +function:

+
+
# Change the selection to batch and age
+select(participants_data, 
+       academic_parents,
+       working_hours_per_day)
+ +
+
+
select(participants_data,
+       batch,
+       age)
+
+

https://dplyr.tidyverse.org/

+
+
+

Wrangling: dplyr::select non_aca_work_filter

+

Subsetting

+

Select

+

Try creating a subset of the data with the select +function:

+
+
# Change the selection 
+# without batch and age
+select(participants_data,
+       -academic_parents,
+       -working_hours_per_day)
+ +
+
+
select(participants_data, 
+       -batch, 
+       -age)
+
+

https://dplyr.tidyverse.org/

+
+
+

Wrangling: dplyr::filter work_filter

+

Subsetting

+

Filter

+

Try creating a subset of the data with the filter +function:

+
+
# Change the selection to 
+# those who work more than 5 hours a day
+filter(participants_data, 
+       working_hours_per_day >10)
+ +
+
+
filter(participants_data, 
+       working_hours_per_day >5)
+
+

https://dplyr.tidyverse.org/

+
+
+

Wrangling: dplyr::filter work_name_filter

+

Subsetting

+

Filter

+

Create a subset of the data with multiple options in the +filter function:

+
+
# Change the filter to those who 
+# work more than 5 hours a day and 
+# names are longer than three letters
+filter(participants_data, 
+       working_hours_per_day >10 & 
+         letters_in_first_name >6)
+ +
+
+
filter(participants_data, 
+       working_hours_per_day >5 & 
+         letters_in_first_name >3)
+
+
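
As a side note, `filter` also accepts several conditions separated by commas, which are combined in the same way as `&`. A minimal sketch with a made-up data frame (`toy` is hypothetical; the column names follow participants_data):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  working_hours_per_day = c(4, 6, 9),
  letters_in_first_name = c(5, 3, 7)
)

# Both calls keep only rows where both conditions hold
with_ampersand <- filter(toy,
                         working_hours_per_day > 5 &
                           letters_in_first_name > 3)
with_comma <- filter(toy,
                     working_hours_per_day > 5,
                     letters_in_first_name > 3)

with_comma
```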

https://dplyr.tidyverse.org/

+
+
+

Wrangling: dplyr::rename name_length

+

Rename

+

Change the names of the variables in the data with the +rename function:

+
+
# Rename the variable km_home_to_office as commute
+rename(participants_data, 
+       name_length = letters_in_first_name)
+ +
+
+
rename(participants_data, 
+       commute = km_home_to_office)
+
+

https://dplyr.tidyverse.org/

+
+
+

Wrangling: dplyr::mutate

+

Mutate

+
+
# Mutate a new column named age_mean that is a function of the age multiplied by the mean of all ages in the group
+mutate(participants_data, 
+       labor_mean = working_hours_per_day*
+              mean(working_hours_per_day))
+ +
+
+
mutate(participants_data, 
+       age_mean = age*
+         mean(age))
+
+

https://dplyr.tidyverse.org/

+
+
+

Wrangling: dplyr::mutate

+

Mutate

+

Create a commute category with the mutate function:

+
+
# Mutate new column named response_speed 
+# populated by 'slow' if it took you 
+# more than a day to answer my email and 
+# 'fast' for others
+mutate(participants_data, 
+       commute = 
+         ifelse(km_home_to_office > 10, 
+                 "commuter", "local"))
+ +
+
+
mutate(participants_data, 
+       response_speed = 
+         ifelse(days_to_email_response > 1, 
+                        "slow", "fast"))
+
+

https://dplyr.tidyverse.org/

+
+
+

Wrangling: dplyr::summarize

+

Summarize

+

+

Get a summary of selected variables with summarize

+
+
# Create a summary of participants_data 
+# with the mean number of siblings 
+# and median years of study
+summarize(participants_data,
+          mean(years_of_study),
+          median(letters_in_first_name))
+ +
+
+
summarize(participants_data,
+          mean(number_of_siblings),
+          median(years_of_study))
+
+
+
+
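
Summary columns can also be given explicit names, which makes the output easier to read than the default mean(...) headers. A minimal sketch with made-up values; the data frame `toy` and the names `mean_siblings` and `median_study` are chosen for illustration only.

```r
library(dplyr)

# Hypothetical toy data using column names from participants_data
toy <- data.frame(
  number_of_siblings = c(0, 2, 4),
  years_of_study = c(3, 5, 10)
)

# name = expression gives each summary column a readable name
toy_summary <- summarize(toy,
                         mean_siblings = mean(number_of_siblings),
                         median_study = median(years_of_study))

toy_summary
```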

Wrangling: magrittr use

+

Pipeline %>%

+ +
+
# Use the magrittr pipe to summarize 
+# the mean days to email response, 
+# median letters in first name, 
+# and maximum years of study by gender
+participants_data %>% 
+  group_by(research_continent) %>% 
+  summarize(mean(days_to_email_response), 
+            median(letters_in_first_name), 
+            max(years_of_study))
+ +
+
+
participants_data %>% 
+  group_by(gender) %>% 
+  summarize(mean(days_to_email_response), 
+            median(letters_in_first_name), 
+            max(years_of_study))
+
+

Now use the mutate function to create a new grouping variable and use +the group_by function to get these results for comparisons +between groups.

+
+
# Use the magrittr pipe to create a new column 
+# called commute, where those who travel 
+# more than 10km to get to the office 
+# are called "commuter" and others are "local". 
+# Summarize the mean days to email response, 
+# median letters in first name, 
+# and maximum years of study. 
+participants_data %>% 
+   mutate(response_speed = ifelse(
+     days_to_email_response > 1, 
+     "slow", "fast")) %>% 
+  group_by(response_speed) %>% 
+  summarize(mean(number_of_siblings), 
+            median(years_of_study), 
+            max(letters_in_first_name))
+ +
+
+
participants_data %>% 
+   mutate(commute = ifelse(
+     km_home_to_office > 10, 
+     "commuter", "local")) %>% 
+  group_by(commute) %>% 
+  summarize(mean(days_to_email_response), 
+            median(letters_in_first_name), 
+            max(years_of_study))
+
+
+
+

purrr: Apply a function to each element of a vector

+

+

We will use the purrr library to run a regression (Wickham and Henry 2023). Run the code +library(purrr) in your local R console to load the +library.

+

Now we will use the +purrr library for a simple linear regression (Wickham and Henry 2023). Note that when using +base R functions with the magrittr pipeline we use ‘.’ to +refer to the data. The functions split and lm +are from base R and stats (R Core Team +2023).

+

Use purrr to solve: split a data frame into pieces, fit a model to +each piece, compute the summary, then extract the R^2.

+
+
# Split the data frame by batch, 
+# fit a linear model formula 
+# (days to email response as dependent 
+# and working hours as independent) 
+# to each batch, compute the summary, 
+# then extract the R^2.
+    participants_data %>%
+      split(.$gender) %>% 
+        map(~ 
+          lm(number_of_publications ~ 
+                number_of_siblings, 
+                 data = .)) %>%
+  map(summary) %>%
+  map_dbl("r.squared")
+ +
+
+
    participants_data %>%
+      split(.$batch) %>% # from base R
+        map(~ 
+          lm(days_to_email_response ~ 
+                working_hours_per_day, 
+                 data = .)) %>%
+  map(summary) %>%
+  map_dbl("r.squared")
+
+
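
The same split, fit, summarize, extract sequence can be written one step at a time, which may help to see what each call returns. This sketch uses the built-in mtcars data rather than participants_data, and the object names (`pieces`, `models`, `summaries`, `r_squared`) are made up for illustration.

```r
library(purrr)

# mtcars ships with R; it is used here instead of participants_data
# 1. Split the data frame into a list with one data frame per cylinder group
pieces <- split(mtcars, mtcars$cyl)

# 2. Fit a linear model to each piece
models <- map(pieces, function(d) lm(mpg ~ wt, data = d))

# 3. Compute the summary of each model
summaries <- map(models, summary)

# 4. Extract the R^2 from each summary as a named numeric vector
r_squared <- map_dbl(summaries, "r.squared")

r_squared
```

Writing the steps out like this trades brevity for intermediate objects that can be inspected individually, for example with `str(pieces)`.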

Learn more about purrr from the tidyverse and from varianceexplained.

+

Check +out the purrr Cheatsheet

+

Test your new +skills

+

Your turn to perform

+

Up until this point the code has been provided for you to work on. +Now it is time for you to apply your newfound skills. Please work +through the wrangling tasks we just went through. Use the +diamonds data and write the steps in long format +(i.e. assigning each step to an object) and in short format (i.e. with +the magrittr pipeline):

+ +

The diamonds data ships with the ggplot2 library, so it is +available in your R environment once ggplot2 (or the whole tidyverse) is loaded. Look at the help file +with ?diamonds to learn more about it.

+
+ +
+
+
diamonds %>%
+  # - select: carat and price
+  select(carat, price) %>%
+  # - filter: only where carat is > 0.5
+  filter(carat > 0.5) %>%
+  # - rename: rename price as cost
+  rename(cost = price) %>%
+  # - mutate: create a variable 'cheap_expensive' with 'expensive' if greater than mean of cost and 'cheap' otherwise
+  mutate(cheap_expensive = ifelse(
+    cost > mean(cost), 
+    "expensive", "cheap")) %>%
+  # - group_by: split into cheap and expensive
+  group_by(cheap_expensive) %>%
+  # - summarize: give some summary statistics of your choice
+  summarize(mean(cost), mean(carat))
+
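
For the long format, the same steps could be written by assigning each result to its own object. This is one possible sketch rather than the official solution; the object names (`diamonds_selected` and so on) and the summary column names are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# Each step takes the previous object as its first argument
diamonds_selected <- select(diamonds, carat, price)
diamonds_filtered <- filter(diamonds_selected, carat > 0.5)
diamonds_renamed  <- rename(diamonds_filtered, cost = price)
diamonds_mutated  <- mutate(diamonds_renamed,
                            cheap_expensive = ifelse(cost > mean(cost),
                                                     "expensive", "cheap"))
diamonds_grouped  <- group_by(diamonds_mutated, cheap_expensive)
diamonds_summary  <- summarize(diamonds_grouped,
                               mean_cost = mean(cost),
                               mean_carat = mean(carat))

diamonds_summary
```

The long format leaves every intermediate object in the environment for inspection, while the pipeline avoids the extra names.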
@@ -341,55 +1070,997 @@

References

engine = "r", version = "4"), class = c("r", "tutorial_exercise" ))) -

- - - - - - -
-
-Aden-Buie, Garrick, Barret Schloerke, JJ Allaire, and Alexander Rossell -Hayes. 2023. Learnr: Interactive Tutorials for r. https://rstudio.github.io/learnr/. -
-
-Bache, Stefan Milton, and Hadley Wickham. 2022. Magrittr: A -Forward-Pipe Operator for r. https://magrittr.tidyverse.org. -
-
-Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, -Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara -Borges. 2023. Shiny: Web Application Framework for r. https://shiny.posit.co/. -
-
-R Core Team. 2023. R: A Language and Environment for Statistical -Computing. Vienna, Austria: R Foundation for Statistical Computing. -https://www.R-project.org/. -
-
-Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the -Tidyverse. https://tidyverse.tidyverse.org. -
-
-Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, -Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey -Dunnington. 2023. Ggplot2: Create Elegant Data Visualisations Using -the Grammar of Graphics. https://ggplot2.tidyverse.org. -
-
-Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis -Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

+ + + + + + +
+
+Aden-Buie, Garrick, Barret Schloerke, JJ Allaire, and Alexander Rossell +Hayes. 2023. Learnr: Interactive Tutorials for r. https://rstudio.github.io/learnr/. +
+
+Bache, Stefan Milton, and Hadley Wickham. 2022. Magrittr: A +Forward-Pipe Operator for r. https://magrittr.tidyverse.org. +
+
+Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, +Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara +Borges. 2023. Shiny: Web Application Framework for r. https://shiny.posit.co/. +
+
+R Core Team. 2023. R: A Language and Environment for Statistical +Computing. Vienna, Austria: R Foundation for Statistical Computing. +https://www.R-project.org/. +
+
+Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the +Tidyverse. https://tidyverse.tidyverse.org. +
+
+Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, +Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey +Dunnington. 2023. Ggplot2: Create Elegant Data Visualisations Using +the Grammar of Graphics. https://ggplot2.tidyverse.org. +
+
+Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis +Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org. +
+
+Wickham, Hadley, and Lionel Henry. 2023. Purrr: Functional +Programming Tools. https://purrr.tidyverse.org/. +
+
+Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2023. Readr: Read +Rectangular Text Data. https://readr.tidyverse.org. +
+
+Wickham, Hadley, Davis Vaughan, and Maximilian Girlich. 2023. Tidyr: +Tidy Messy Data. https://tidyr.tidyverse.org.