[] or curly braces {}. You can adjust these settings under Global Options > Code > Display.
The video below gives a brief overview of the content covered in this section of the tutorial. All the Data Wrangling slides are also available as a web page.
“[…] writing R code is a hedonistically artistic, left-brained, paint-in-your-hair sort of experience […] learn how to code the same way we learned how to catch salamanders as children – trial and error, flipping over rocks till we get a reward […] once the ecstasy of creation has swept over us, we awake late the next morning to find our canvas covered with 2100 lines of R code […] Heads throbbing with a statistical absinthe hangover, we trudge through it slowly over days, trying to figure out what we did.”
– Andrew MacDonald
Keep it tidy
When writing .R files, use the # symbol to annotate your code and to comment out lines you do not want to run. This is a great way to make notes for others and for future you. We will talk later about using other file types like Rmarkdown for organizing R scripts together with other associated languages; there it will be possible to add a lot more information, text, citations etc. An .R file is intended for code, but we can still keep it organized in sections by ending section headers with ---- or #### annotation.
# Section 1 ----
# Section 2 ####
# Section 3 ####
Look for the Table of contents button in the upper right of the RStudio scripting pane (next to the Run button).
Read more tips in the Tidyverse Style guide.
The video below offers a brief overview of the content covered in this section of the tutorial. Feel free to watch the video and follow along, or simply work through the tutorial.
Keep it tidy
If you are following this tutorial by running the code on your local machine (recommended), it may make sense to check your R version by running the following code in your R console:

version
At the time of writing I am using R version 4.0.4 ‘Lost Library Book’ (R Core Team 2023). If you do not have this version or something newer, it may make sense to update so that you can follow along without pesky version issues.
The easiest way to get the libraries we need for today is to install the whole tidyverse (Wickham 2023) by typing install.packages("tidyverse") in the R console and then running library(tidyverse):

install.packages("tidyverse")
library(tidyverse)
If you save your work to an .R file (recommended), be sure to annotate any code that you do not intend to run each time with the # symbol. You should only need to install the tidyverse once, so be sure to either change that line of code to #install.packages("tidyverse") or remove it from your script.
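A minimal sketch of how the top of such a script might look after the first run (assuming the tidyverse is already installed):

```r
# install.packages("tidyverse") # only needed once, so it is commented out
library(tidyverse)              # load the tidyverse packages each session
```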
Read more style tips in the tidyverse style guide (Wickham 2023).
Keep it tidy
Get a lot of examples and details about the tidyverse by running the following code in the R console: browseVignettes(package = "tidyverse"). Nearly every R library has a collection of vignettes that walk through examples and show, often in explicit detail, the author's intended use of the library.
In this tutorial we will be following the basic ideas behind the tidyverse. Read the full tidyverse manifesto here.
Keep it tidy
Read more style tips in the tidyverse style guide (Wickham 2023).
Three things make a dataset tidy:
1. every variable has its own column,
2. every observation has its own row,
3. every value has its own cell.
Read more about this in Wickham’s paper in the Journal of Statistical Software.
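As a small illustration (this example data is made up and is not part of the course data), tidyr’s pivot_longer can rearrange an untidy table into tidy form:

```r
library(tidyr)

# Untidy: the 'year' variable is spread across the column headers
cases <- data.frame(country = c("A", "B"),
                    `2019` = c(10, 20),
                    `2020` = c(12, 25),
                    check.names = FALSE)

# Tidy: each variable (country, year, cases) gets its own column,
# each observation its own row, each value its own cell
pivot_longer(cases,
             cols = -country,
             names_to = "year",
             values_to = "cases")
```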
Image: www.codeastar.com/data-wrangling/
Format of dplyr
Arguments start with a data frame: every dplyr function takes a data frame as its first argument and returns a new data frame.
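A minimal sketch of this pattern, using the built-in mtcars data as a stand-in (not the course data):

```r
library(dplyr)

# Each dplyr verb takes a data frame as its first argument
# and returns a new data frame
filter(mtcars, mpg > 25)           # rows where mpg is greater than 25
select(mtcars, mpg, cyl)           # only the mpg and cyl columns
mutate(mtcars, kpl = mpg * 0.425)  # add an approximate km-per-liter column
```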
Load data
The data we will use for this course is on Github and you can save it as a .csv to your local folder.
Import the data with the read.csv function:

# Use this on your machine
participants_data <- read.csv("participants_data.csv")

Learn more about what this function does by typing ?read.csv in the R console.
You can also get the data from this Github repository by using the read_csv function from the readr library (Wickham, Hester, and Bryan 2023) and the url function from base R. In this case you will want to use the ‘save as’ option for the webpage so that you can have it stored locally as a comma separated values (.csv) file on your machine.
library(readr)

urlfile <- "https://raw.githubusercontent.com/CWWhitney/teaching_R/master/participants_data.csv"

participants_data <- read_csv(url(urlfile))

Look at the data (use the View function to see it in the RStudio ‘Environment’):

participants_data

Check the top of the data with the head function. The default of the head function is to show 6 rows. This can be changed with the n argument.

# Change the number of rows displayed to 7
head(participants_data,
     n = 4)

# use the ?head option to learn the details of the function
?head
# look at the 'Arguments' section for the 'n' argument
# The 'n' argument should be changed from 'n = 4' to 'n = 7'
head(participants_data,
     n = 7)

Check the names of the variables in the data with the names function:

names(participants_data)

Check the structure of the data with the str function:

str(participants_data)

Select a single variable with the $ operator:

# Change the variable to gender
participants_data$age

participants_data$gender

Follow these steps to see the result of the rest of the transformations we perform with tidyverse.
Using dplyr
Load the dplyr library by running library(dplyr) in the R console. Do the same for the other libraries we need today: library(tidyr) and library(magrittr) (Wickham, François, et al. 2023).
Inspiration for many of the following materials comes from Roger Peng’s dplyr tutorial. Read more about the dplyr library (Wickham, François, et al. 2023).
Subsetting
Select
Create a subset of the data with the select function:
# Change the selection to batch and age
select(participants_data,
       academic_parents,
       working_hours_per_day)

select(participants_data,
       batch,
       age)
Subsetting
Select
Try creating a subset of the data with the select function:
# Change the selection
# without batch and age
select(participants_data,
       -academic_parents,
       -working_hours_per_day)

select(participants_data,
       -batch,
       -age)
Subsetting
Filter
Try creating a subset of the data with the filter function:
# Change the selection to
# those who work more than 5 hours a day
filter(participants_data,
       working_hours_per_day > 10)

filter(participants_data,
       working_hours_per_day > 5)
Subsetting
Filter
Create a subset of the data with multiple options in the filter function:
# Change the filter to those who
# work more than 5 hours a day and
# names are longer than three letters
filter(participants_data,
       working_hours_per_day > 10 &
         letters_in_first_name > 6)

filter(participants_data,
       working_hours_per_day > 5 &
         letters_in_first_name > 3)
Rename
Change the names of the variables in the data with the rename function:
# Rename the variable km_home_to_office as commute
rename(participants_data,
       name_length = letters_in_first_name)

rename(participants_data,
       commute = km_home_to_office)
Mutate
# Mutate a new column named age_mean that is
# the age multiplied by the mean of all ages in the group
mutate(participants_data,
       labor_mean = working_hours_per_day *
         mean(working_hours_per_day))

mutate(participants_data,
       age_mean = age *
         mean(age))
Mutate
Create a commute category with the mutate function:
# Mutate new column named response_speed
# populated by 'slow' if it took you
# more than a day to answer my email and
# 'fast' for others
mutate(participants_data,
       commute =
         ifelse(km_home_to_office > 10,
                "commuter", "local"))

mutate(participants_data,
       response_speed =
         ifelse(days_to_email_response > 1,
                "slow", "fast"))
Summarize
Get a summary of selected variables with the summarize function:
# Create a summary of the participants_data
# with the mean number of siblings
# and median years of study
summarize(participants_data,
          mean(years_of_study),
          median(letters_in_first_name))

summarize(participants_data,
          mean(number_of_siblings),
          median(years_of_study))
Pipeline %>%
Get the same results with the magrittr pipeline %>%. Use the group_by function to get these results for comparison between groups.

# Use the magrittr pipe to summarize
# the mean days to email response,
# median letters in first name,
# and maximum years of study by gender
participants_data %>%
  group_by(research_continent) %>%
  summarize(mean(days_to_email_response),
            median(letters_in_first_name),
            max(years_of_study))

participants_data %>%
  group_by(gender) %>%
  summarize(mean(days_to_email_response),
            median(letters_in_first_name),
            max(years_of_study))
Now use the mutate function to create a new grouping variable and use the group_by function to get these results for comparisons between groups.
# Use the magrittr pipe to create a new column
# called commute, where those who travel
# more than 10km to get to the office
# are called "commuter" and others are "local".
# Summarize the mean days to email response,
# median letters in first name,
# and maximum years of study.
participants_data %>%
  mutate(response_speed = ifelse(
    days_to_email_response > 1,
    "slow", "fast")) %>%
  group_by(response_speed) %>%
  summarize(mean(number_of_siblings),
            median(years_of_study),
            max(letters_in_first_name))

participants_data %>%
  mutate(commute = ifelse(
    km_home_to_office > 10,
    "commuter", "local")) %>%
  group_by(commute) %>%
  summarize(mean(days_to_email_response),
            median(letters_in_first_name),
            max(years_of_study))
We will use the purrr library to run a regression (Wickham and Henry 2023). Run the code library(purrr) in your local R console to load the library.
Now we will use the purrr library for a simple linear regression (Wickham and Henry 2023). Note that when using base R functions with the magrittr pipeline we use ‘.’ to refer to the data. The functions split and lm are from base R and stats (R Core Team 2023).
Use purrr to solve: split a data frame into pieces, fit a model to each piece, compute the summary, then extract the R^2.
# Split the data frame by batch,
# fit a linear model formula
# (days to email response as dependent
# and working hours as independent)
# to each batch, compute the summary,
# then extract the R^2.
participants_data %>%
  split(.$gender) %>%
  map(~
    lm(number_of_publications ~
         number_of_siblings,
       data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")

participants_data %>%
  split(.$batch) %>% # from base R
  map(~
    lm(days_to_email_response ~
         working_hours_per_day,
       data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
Learn more about purrr in the tidyverse and from varianceexplained. Check out the purrr cheatsheet.
Test your new skills
Your turn to perform
Up until this point the code has been provided for you to work on. Now it is time for you to apply your newfound skills. Please work through the wrangling tasks we just went through. Use the diamonds data and make the steps in long format (i.e. assigning each step to an object) and in short format (i.e. with the magrittr pipeline):
The diamonds data is built in with the ggplot2 library. It is already available in your R environment. Look at the help file with ?diamonds to learn more about it.
diamonds %>%
  # - select: carat and price
  select(carat, price) %>%
  # - filter: only where carat is > 0.5
  filter(carat > 0.5) %>%
  # - rename: rename price as cost
  rename(cost = price) %>%
  # - mutate: create a variable 'cheap_expensive' with 'expensive'
  #   if greater than mean of cost and 'cheap' otherwise
  mutate(cheap_expensive = ifelse(
    cost > mean(cost),
    "expensive", "cheap")) %>%
  # - group_by: split into cheap and expensive
  group_by(cheap_expensive) %>%
  # - summarize: give some summary statistics of your choice
  summarize(mean(cost), mean(carat))
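The long format version might look like the following sketch, assigning each step to an intermediate object (the object names here are illustrative, not prescribed by the course):

```r
library(ggplot2) # provides the diamonds data
library(dplyr)

diamonds_selected <- select(diamonds, carat, price)
diamonds_filtered <- filter(diamonds_selected, carat > 0.5)
diamonds_renamed  <- rename(diamonds_filtered, cost = price)
diamonds_mutated  <- mutate(diamonds_renamed,
                            cheap_expensive = ifelse(cost > mean(cost),
                                                     "expensive", "cheap"))
diamonds_grouped  <- group_by(diamonds_mutated, cheap_expensive)
summarize(diamonds_grouped, mean(cost), mean(carat))
```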