diff --git a/01-rstudio-intro.md b/01-rstudio-intro.md
new file mode 100644
index 000000000..a144f0e60
--- /dev/null
+++ b/01-rstudio-intro.md
@@ -0,0 +1,893 @@
+---
+title: Introduction to R and RStudio
+teaching: 45
+exercises: 10
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Describe the purpose and use of each pane in RStudio
+- Locate buttons and options in RStudio
+- Define a variable
+- Assign data to a variable
+- Manage a workspace in an interactive R session
+- Use mathematical and comparison operators
+- Call functions
+- Manage packages
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How to find your way around RStudio?
+- How to interact with R?
+- How to manage your environment?
+- How to install packages?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+## Before Starting The Workshop
+
+Please ensure you have the latest version of R and RStudio installed on your machine. This is important, as some packages used in the workshop may not install correctly (or at all) if R is not up to date.
+
+- [Download and install the latest version of R here](https://www.r-project.org/)
+- [Download and install RStudio here](https://www.rstudio.com/products/rstudio/download/#download)
+
+
+## Why use R and RStudio?
+
+Welcome to the R portion of the Software Carpentry workshop!
+
+Science is a multi-step process: once you've designed an experiment and collected
+data, the real fun begins with analysis! Throughout this lesson, we're going to teach you some of the fundamentals of the R language as well as some best practices for organizing code for scientific projects that will make your life easier.
+
+Although we could use a spreadsheet in Microsoft Excel or Google Sheets to analyze our data, these tools are limited in their flexibility and accessibility. Critically, they also make it difficult to share the steps used to explore and change the raw data, which is key to ["reproducible" research](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285).
+
+Therefore, this lesson will teach you how to begin exploring your data using R and RStudio. The R program is available for Windows, Mac, and Linux operating systems, and is freely available from where you downloaded it above. To run R, all you need is the R program.
+
+However, to make using R easier, we will use the program RStudio, which we also downloaded above. RStudio is a free, open-source, Integrated Development
+Environment (IDE). It provides a built-in editor, works on all platforms (including
+on servers) and provides many advantages such as integration with version
+control and project management.
+
+## Overview
+
+We will begin with raw data, perform exploratory analyses, and learn how to plot results graphically. This example starts with a dataset from [gapminder.org](https://www.gapminder.org) containing population information for many
+countries through time. Can you read the data into R? Can you plot the population for
+Senegal? Can you calculate the average income for countries on the continent of Asia?
+By the end of these lessons you will be able to do things like plot the populations
+for all of these countries in under a minute!
+
+
+**Basic layout**
+
+When you first open RStudio, you will be greeted by three panels:
+
+- The interactive R console/Terminal (entire left)
+- Environment/History/Connections (tabbed in upper right)
+- Files/Plots/Packages/Help/Viewer (tabbed in lower right)
+
+![](fig/01-rstudio.png){alt='RStudio layout'}
+
+Once you open files, such as R scripts, an editor panel will also open
+in the top left.
+
+![](fig/01-rstudio-script.png){alt='RStudio layout with .R file open'}
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## R scripts
+
+Any commands that you write in the R console can be saved to a file
+to be run again. Files containing R code to be run in this way are
+called R scripts. R scripts have `.R` at the end of their names to
+let you know what they are.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Workflow within RStudio
+
+There are two main ways one can work within RStudio:
+
+1. Test and play within the interactive R console then copy code into
+ a .R file to run later.
+ - This works well when doing small tests and initially starting off.
+ - It quickly becomes laborious
+2. Start writing in a .R file and use RStudio's shortcut keys for the Run command
+ to push the current line, selected lines or modified lines to the
+ interactive R console.
+ - This is a great way to start; all your code is saved for later
+ - You will be able to run the file you create from within RStudio
+ or using R's `source()` function.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Running segments of your code
+
+RStudio offers you great flexibility in running code from within the editor
+window. There are buttons, menu choices, and keyboard shortcuts. To run the
+current line, you can
+
+1. click on the `Run` button above the editor panel, or
+2. select "Run Lines" from the "Code" menu, or
+3. hit Ctrl\+Return in Windows or Linux
+ or ⌘\+Return on macOS.
+ (This shortcut can also be seen by hovering
+ the mouse over the button).
+
+To run a block of code, select it and then `Run`.
+If you have modified a line of code within a block of code you have just run,
+there is no need to reselect the section and `Run`; you can use the next button
+along, `Re-run the previous region`. This will run the previous code block
+including the modifications you have made.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction to R
+
+Much of your time in R will be spent in the R interactive
+console. This is where you will run all of your code, and can be a
+useful environment to try out ideas before adding them to an R script
+file. This console in RStudio is the same as the one you would get if
+you typed in `R` in your command-line environment.
+
+The first thing you will see in the R interactive session is a bunch
+of information, followed by a ">" and a blinking cursor. In many ways
+this is similar to the shell environment you learned about during the
+shell lessons: it operates on the same idea of a "Read, evaluate,
+print loop": you type in commands, R tries to execute them, and then
+returns a result.
+
+## Using R as a calculator
+
+The simplest thing you could do with R is to do arithmetic:
+
+
+``` r
+1 + 100
+```
+
+``` output
+[1] 101
+```
+
+And R will print out the answer, with a preceding "[1]". [1] is the index of
+the first element of the line being printed in the console. For more information
+on indexing vectors, see [Episode 6: Subsetting Data](https://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting/index.html).
+
+If you type in an incomplete command, R will wait for you to
+complete it. If you are familiar with the Unix shell (bash), you may recognize this behavior.
+
+```r
+> 1 +
+```
+
+```output
++
+```
+
+Any time you hit return and the R session shows a "+" instead of a ">", it
+means it's waiting for you to complete the command. If you want to cancel
+a command you can hit Esc and RStudio will give you back the ">" prompt.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Canceling commands
+
+If you're using R from the command line instead of from within RStudio,
+you need to use Ctrl\+C instead of Esc
+to cancel the command. This applies to Mac users as well!
+
+Canceling a command isn't only useful for killing incomplete commands:
+you can also use it to tell R to stop running code (for example if it's
+taking much longer than you expect), or to get rid of the code you're
+currently writing.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+When using R as a calculator, the order of operations is the same as you
+would have learned back in school.
+
+From highest to lowest precedence:
+
+- Parentheses: `(`, `)`
+- Exponents: `^` or `**`
+- Multiply and divide: `*`, `/` (equal precedence, evaluated left to right)
+- Add and subtract: `+`, `-` (equal precedence, evaluated left to right)
+
+
+``` r
+3 + 5 * 2
+```
+
+``` output
+[1] 13
+```
+
+Use parentheses to group operations in order to force the order of
+evaluation if it differs from the default, or to make clear what you
+intend.
+
+
+``` r
+(3 + 5) * 2
+```
+
+``` output
+[1] 16
+```
+
+This can get unwieldy when not needed, but clarifies your intentions.
+Remember that others may later read your code.
+
+
+``` r
+(3 + (5 * (2 ^ 2))) # hard to read
+3 + 5 * 2 ^ 2 # clear, if you remember the rules
+3 + 5 * (2 ^ 2) # if you forget some rules, this might help
+```
+
+The text after each line of code is called a
+"comment". Anything that follows after the hash (or octothorpe) symbol
+`#` is ignored by R when it executes code.
+
+Really small or large numbers get a scientific notation:
+
+
+``` r
+2/10000
+```
+
+``` output
+[1] 2e-04
+```
+
+The `e` is shorthand for "multiplied by `10^XX`". So `2e-4`
+is shorthand for `2 * 10^(-4)`.
+
+You can write numbers in scientific notation too:
+
+
+``` r
+5e3 # Note the lack of minus here
+```
+
+``` output
+[1] 5000
+```
+
+## Mathematical functions
+
+R has many built-in mathematical functions. To call a function,
+we type its name, followed by opening and closing parentheses.
+Functions take arguments as inputs; anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. For example:
+
+
+``` r
+getwd() #returns an absolute filepath
+```
+
+doesn't require an argument, whereas for the next set of mathematical functions we will need to supply the function a value in order to compute the result.
+
+
+``` r
+sin(1) # trigonometry functions
+```
+
+``` output
+[1] 0.841471
+```
+
+
+``` r
+log(1) # natural logarithm
+```
+
+``` output
+[1] 0
+```
+
+
+``` r
+log10(10) # base-10 logarithm
+```
+
+``` output
+[1] 1
+```
+
+
+``` r
+exp(0.5) # e^(1/2)
+```
+
+``` output
+[1] 1.648721
+```
+
+Don't worry about trying to remember every function in R. You
+can look them up on Google, or if you can remember the
+start of the function's name, use the tab completion in RStudio.
+
+This is one advantage that RStudio has over R on its own: it
+has auto-completion abilities that allow you to more easily
+look up functions, their arguments, and the values that they
+take.
+
+Typing a `?` before the name of a command will open the help page
+for that command. When using RStudio, this will open the 'Help' pane;
+if using R in the terminal, the help page will typically open in the terminal's pager or in your browser, depending on your setup.
+The help page will include a detailed description of the command and
+how it works. Scrolling to the bottom of the help page will usually
+show a collection of code examples which illustrate command usage.
+We'll go through an example later.
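+
+As a small illustration of the help system, the sketch below asks for help on `log()`, a function used earlier in this episode. The choice of `log` is only an example; any function name works the same way.
+
+``` r
+?log          # opens the help page for log()
+help("log")   # the same request, written as an ordinary function call
+args(log)     # prints only the argument list of log()
+```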
+
+## Comparing things
+
+We can also do comparisons in R:
+
+
+``` r
+1 == 1 # equality (note two equals signs, read as "is equal to")
+```
+
+``` output
+[1] TRUE
+```
+
+
+``` r
+1 != 2 # inequality (read as "is not equal to")
+```
+
+``` output
+[1] TRUE
+```
+
+
+``` r
+1 < 2 # less than
+```
+
+``` output
+[1] TRUE
+```
+
+
+``` r
+1 <= 1 # less than or equal to
+```
+
+``` output
+[1] TRUE
+```
+
+
+``` r
+1 > 0 # greater than
+```
+
+``` output
+[1] TRUE
+```
+
+
+``` r
+1 >= -9 # greater than or equal to
+```
+
+``` output
+[1] TRUE
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Comparing Numbers
+
+A word of warning about comparing numbers: you should
+never use `==` to compare two numbers unless they are
+integers (a data type which can specifically represent
+only whole numbers).
+
+Computers may only represent decimal numbers with a
+certain degree of precision, so two numbers which look
+the same when printed out by R, may actually have
+different underlying representations and therefore be
+different by a small margin of error (called Machine
+numeric tolerance).
+
+Instead you should use the `all.equal` function.
+
+Further reading: [http://floating-point-gui.de/](https://floating-point-gui.de/)
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
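+
+To see why this matters, here is a minimal sketch using two decimal numbers that cannot be stored exactly. The values are chosen only for illustration.
+
+``` r
+0.1 + 0.2 == 0.3             # FALSE: the two sides differ by a tiny rounding error
+all.equal(0.1 + 0.2, 0.3)    # TRUE: equal within a small tolerance
+isTRUE(all.equal(1, 1.1))    # FALSE: this difference is far larger than the tolerance
+```
+
+When two values differ, `all.equal` returns a description of the difference rather than `FALSE`, which is why the last line wraps it in `isTRUE()`.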
+
+## Variables and assignment
+
+We can store values in variables using the assignment operator `<-`, like this:
+
+
+``` r
+x <- 1/40
+```
+
+Notice that assignment does not print a value. Instead, we stored it for later
+in something called a **variable**. `x` now contains the **value** `0.025`:
+
+
+``` r
+x
+```
+
+``` output
+[1] 0.025
+```
+
+More precisely, the stored value is a *decimal approximation* of
+this fraction called a [floating point number](https://en.wikipedia.org/wiki/Floating_point).
+
+Look for the `Environment` tab in the top right panel of RStudio, and you will see that `x` and its value
+have appeared. Our variable `x` can be used in place of a number in any calculation that expects a number:
+
+
+``` r
+log(x)
+```
+
+``` output
+[1] -3.688879
+```
+
+Notice also that variables can be reassigned:
+
+
+``` r
+x <- 100
+```
+
+`x` used to contain the value 0.025 and now it has the value 100.
+
+Assignment values can contain the variable being assigned to:
+
+
+``` r
+x <- x + 1 #notice how RStudio updates its description of x on the top right tab
+y <- x * 2
+```
+
+The right hand side of the assignment can be any valid R expression.
+The right hand side is *fully evaluated* before the assignment occurs.
+
+Variable names can contain letters, numbers, underscores and periods but no spaces. They
+must start with a letter or a period followed by a letter (they cannot start with a number nor an underscore).
+Variables beginning with a period are hidden variables.
+Different people use different conventions for long variable names. These include:
+
+- periods.between.words
+- underscores\_between\_words
+- camelCaseToSeparateWords
+
+What you use is up to you, but **be consistent**.
+
+It is also possible to use the `=` operator for assignment:
+
+
+``` r
+x = 1/40
+```
+
+But this is much less common among R users. The most important thing is to
+**be consistent** with the operator you use. There are occasionally places
+where it is less confusing to use `<-` than `=`, and it is the most common
+symbol used in the community. So the recommendation is to use `<-`.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Which of the following are valid R variable names?
+
+
+``` r
+min_height
+max.height
+_age
+.mass
+MaxLength
+min-length
+2widths
+celsius2kelvin
+```
+
+::::::::::::::: solution
+
+## Solution to challenge 1
+
+The following can be used as R variables:
+
+
+``` r
+min_height
+max.height
+MaxLength
+celsius2kelvin
+```
+
+The following creates a hidden variable:
+
+
+``` r
+.mass
+```
+
+The following cannot be used to create a variable:
+
+
+``` r
+_age
+min-length
+2widths
+```
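+
+One possible way to check these answers in R itself is sketched below. It relies on `make.names()`, which rewrites a string into a syntactically valid name, so a name that comes back unchanged was already valid. The helper variables `candidates` and `is_valid` are names made up for this example, and they are removed at the end so the environment stays tidy.
+
+``` r
+candidates <- c("min_height", "max.height", "_age", ".mass",
+                "MaxLength", "min-length", "2widths", "celsius2kelvin")
+
+# A valid name is returned unchanged by make.names()
+is_valid <- make.names(candidates) == candidates
+
+candidates[is_valid]    # names that can be used as they are
+candidates[!is_valid]   # names that R would have to alter
+
+rm(candidates, is_valid)   # tidy up the helper variables
+```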
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Vectorization
+
+One final thing to be aware of is that R is *vectorized*, meaning that
+variables and functions can have vectors as values. In contrast to physics and
+mathematics, a vector in R describes a set of values in a certain order of the
+same data type. For example:
+
+
+``` r
+1:5
+```
+
+``` output
+[1] 1 2 3 4 5
+```
+
+``` r
+2^(1:5)
+```
+
+``` output
+[1] 2 4 8 16 32
+```
+
+``` r
+x <- 1:5
+2^x
+```
+
+``` output
+[1] 2 4 8 16 32
+```
+
+This is incredibly powerful; we will discuss this further in an
+upcoming lesson.
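+
+The same idea applies to other operators and functions. The short sketch below applies an arithmetic operator, a comparison, and a mathematical function to every element of a vector at once. No variables are assigned, so your environment is left unchanged.
+
+``` r
+(1:5) * 2          # 2 4 6 8 10
+(1:5) > 2          # FALSE FALSE TRUE TRUE TRUE
+sqrt(c(1, 4, 9))   # 1 2 3
+```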
+
+## Managing your environment
+
+There are a few useful commands you can use to interact with the R session.
+
+`ls` will list all of the variables and functions stored in the global environment
+(your working R session):
+
+
+``` r
+ls()
+```
+
+
+``` output
+[1] "x" "y"
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: hidden objects
+
+Like in the shell, `ls` will hide any variables or functions starting
+with a "." by default. To list all objects, type `ls(all.names=TRUE)`
+instead.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
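+
+A quick way to see this behaviour is sketched below. The variable `.secret` is a hypothetical name used only for this example, and it is deleted again at the end.
+
+``` r
+.secret <- 42                          # hypothetical hidden variable
+".secret" %in% ls()                    # FALSE: hidden by default
+".secret" %in% ls(all.names = TRUE)    # TRUE: listed once hidden objects are included
+rm(.secret)                            # tidy up
+```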
+
+Note here that we didn't give any arguments to `ls`, but we still
+needed to give the parentheses to tell R to call the function.
+
+If we type `ls` by itself, R prints a bunch of code instead of a listing of objects.
+
+
+``` r
+ls
+```
+
+``` output
+function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
+ pattern, sorted = TRUE)
+{
+ if (!missing(name)) {
+ pos <- tryCatch(name, error = function(e) e)
+ if (inherits(pos, "error")) {
+ name <- substitute(name)
+ if (!is.character(name))
+ name <- deparse(name)
+ warning(gettextf("%s converted to character string",
+ sQuote(name)), domain = NA)
+ pos <- name
+ }
+ }
+ all.names <- .Internal(ls(envir, all.names, sorted))
+ if (!missing(pattern)) {
+ if ((ll <- length(grep("[", pattern, fixed = TRUE))) &&
+ ll != length(grep("]", pattern, fixed = TRUE))) {
+ if (pattern == "[") {
+ pattern <- "\\["
+ warning("replaced regular expression pattern '[' by '\\\\['")
+ }
+ else if (length(grep("[^\\\\]\\[<-", pattern))) {
+ pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+ warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+ }
+ }
+ grep(pattern, all.names, value = TRUE)
+ }
+ else all.names
+}
+
+
+```
+
+What's going on here?
+
+Like everything in R, `ls` is the name of an object, and entering the name of
+an object by itself prints the contents of the object. The object `x` that we
+created earlier contains 1, 2, 3, 4, 5:
+
+
+``` r
+x
+```
+
+``` output
+[1] 1 2 3 4 5
+```
+
+The object `ls` contains the R code that makes the `ls` function work! We'll talk
+more about how functions work and start writing our own later.
+
+You can use `rm` to delete objects you no longer need:
+
+
+``` r
+rm(x)
+```
+
+If you have lots of things in your environment and want to delete all of them,
+you can pass the results of `ls` to the `rm` function:
+
+
+``` r
+rm(list = ls())
+```
+
+Here we've combined the two functions. Like the order of operations, anything
+inside the innermost parentheses is evaluated first, and so on.
+
+We've specified that the results of `ls` should be used for the
+`list` argument in `rm`. When assigning values to arguments by name, you *must*
+use the `=` operator!
+
+If instead we use `<-`, there will be unintended side effects, or you may get an error message:
+
+
+``` r
+rm(list <- ls())
+```
+
+``` error
+Error in rm(list <- ls()): ... must contain names or character strings
+```
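+
+The side effect is easier to see with a function that does not raise an error. The sketch below uses `round()` and its `digits` argument purely as an illustration, and assumes you do not already have a variable called `digits`.
+
+``` r
+round(3.14159, digits = 2)    # 3.14; `=` names the argument and creates nothing
+"digits" %in% ls()            # FALSE if no such variable existed before
+round(3.14159, digits <- 2)   # also 3.14, but only because the value lands in the second position
+"digits" %in% ls()            # TRUE: a variable called digits now exists
+rm(digits)                    # tidy up
+```
+
+With `<-`, R first creates a variable in your environment and then passes its value by position, which gives the intended result only when the positions happen to line up.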
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Warnings vs. Errors
+
+Pay attention when R does something unexpected! Errors, like above,
+are thrown when R cannot proceed with a calculation. Warnings on the
+other hand usually mean that the function has run, but it probably
+hasn't worked as expected.
+
+In both cases, the message that R prints out usually gives you clues
+on how to fix a problem.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
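+
+The difference can be seen in a short sketch. The first two calls produce warnings but still return a value; the third raises an error. The error is wrapped in `try()` here only so that the example keeps running when pasted as a whole.
+
+``` r
+log(-1)              # warning: NaNs produced; the call still returns NaN
+as.numeric("ten")    # warning: NAs introduced by coercion; the call still returns NA
+try(log("ten"))      # error: non-numeric argument; no value is computed
+```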
+
+## R Packages
+
+It is possible to add functions to R by writing a package, or by
+obtaining a package written by someone else. As of this writing, there
+are over 10,000 packages available on CRAN (the Comprehensive R Archive
+Network). R and RStudio have functionality for managing packages:
+
+- You can see what packages are installed by typing
+ `installed.packages()`
+- You can install packages by typing `install.packages("packagename")`,
+ where `packagename` is the package name, in quotes.
+- You can update installed packages by typing `update.packages()`
+- You can remove a package with `remove.packages("packagename")`
+- You can make a package available for use with `library(packagename)`
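+
+As a small sketch of these commands in action, the example below uses `stats`, a package that ships with R, so nothing needs to be downloaded. The package choice is only for illustration.
+
+``` r
+"stats" %in% rownames(installed.packages())   # TRUE: stats is installed with R
+library(stats)                                # makes its functions available (stats is attached by default)
+packageVersion("stats")                       # version of the installed package
+```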
+
+Packages can also be viewed, loaded, and detached in the Packages tab of the lower right panel in RStudio. Clicking on this tab will display all of the installed packages with a checkbox next to them. If the box next to a package name is checked, the package is loaded and if it is empty, the package is not loaded. Click an empty box to load that package and click a checked box to detach that package.
+
+Packages can be installed and updated from the Packages tab with the Install and Update buttons at the top of the tab.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+What will be the value of each variable after each
+statement in the following program?
+
+
+``` r
+mass <- 47.5
+age <- 122
+mass <- mass * 2.3
+age <- age - 20
+```
+
+::::::::::::::: solution
+
+## Solution to challenge 2
+
+
+``` r
+mass <- 47.5
+```
+
+This will give a value of 47.5 for the variable mass
+
+
+``` r
+age <- 122
+```
+
+This will give a value of 122 for the variable age
+
+
+``` r
+mass <- mass * 2.3
+```
+
+This will multiply the existing value of 47.5 by 2.3 to give a new value of
+109.25 to the variable mass.
+
+
+``` r
+age <- age - 20
+```
+
+This will subtract 20 from the existing value of 122 to give a new value
+of 102 to the variable age.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Run the code from the previous challenge, and write a command to
+compare mass to age. Is mass larger than age?
+
+::::::::::::::: solution
+
+## Solution to challenge 3
+
+One way of answering this question in R is to use the `>` to set up the following:
+
+
+``` r
+mass > age
+```
+
+``` output
+[1] TRUE
+```
+
+This should yield a boolean value of TRUE since 109.25 is greater than 102.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 4
+
+Clean up your working environment by deleting the mass and age
+variables.
+
+::::::::::::::: solution
+
+## Solution to challenge 4
+
+We can use the `rm` command to accomplish this task
+
+
+``` r
+rm(age, mass)
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 5
+
+Install the following packages: `ggplot2`, `plyr`, `gapminder`
+
+::::::::::::::: solution
+
+## Solution to challenge 5
+
+We can use the `install.packages()` command to install the required packages.
+
+
+``` r
+install.packages("ggplot2")
+install.packages("plyr")
+install.packages("gapminder")
+```
+
+An alternative solution, which installs multiple packages with a single `install.packages()` command, is:
+
+
+``` r
+install.packages(c("ggplot2", "plyr", "gapminder"))
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::: instructor
+
+When installing ggplot2, some users may need to use the `dependencies` flag as a result of lazy loading affecting the install. This suggestion is not tied to any known bug discussion; it is based on instructor feedback and experience in resolving sporadic occurrences of errors encountered while delivering this workshop:
+
+
+``` r
+install.packages("ggplot2", dependencies = TRUE)
+```
+
+:::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use RStudio to write and run R programs.
+- R has the usual arithmetic operators and mathematical functions.
+- Use `<-` to assign values to variables.
+- Use `ls()` to list the variables in a program.
+- Use `rm()` to delete objects in a program.
+- Use `install.packages()` to install packages (libraries).
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/02-project-intro.md b/02-project-intro.md
new file mode 100644
index 000000000..0571866f4
--- /dev/null
+++ b/02-project-intro.md
@@ -0,0 +1,284 @@
+---
+title: Project Management With RStudio
+teaching: 20
+exercises: 10
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Create self-contained projects in RStudio
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I manage my projects in R?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+## Introduction
+
+The scientific process is naturally incremental, and many projects
+start life as random notes, some code, then a manuscript, and
+eventually everything is a bit mixed together.
+
+
+Managing your projects in a reproducible fashion doesn't just make your science reproducible, it makes your life easier.
+
+
+Most people tend to organize their projects like this:
+
+![](fig/bad_layout.png){alt='Screenshot of file manager demonstrating bad project organisation'}
+
+There are many reasons why we should *ALWAYS* avoid this:
+
+1. It is really hard to tell which version of your data is
+ the original and which is the modified;
+2. It gets really messy because it mixes files with various
+ extensions together;
+3. It probably takes you a lot of time to actually find
+ things, and relate the correct figures to the exact code
+ that has been used to generate them.
+
+A good project layout will ultimately make your life easier:
+
+- It will help ensure the integrity of your data;
+- It makes it simpler to share your code with someone else
+ (a lab-mate, collaborator, or supervisor);
+- It allows you to easily upload your code with your manuscript submission;
+- It makes it easier to pick the project back up after a break.
+
+## A possible solution
+
+Fortunately, there are tools and packages which can help you manage your work effectively.
+
+One of the most powerful and useful aspects of RStudio is its project management
+functionality. We'll be using this today to create a self-contained, reproducible
+project.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1: Creating a self-contained project
+
+We're going to create a new project in RStudio:
+
+1. Click the "File" menu button, then "New Project".
+2. Click "New Directory".
+3. Click "New Project".
+4. Type in the name of the directory to store your project, e.g. "my\_project".
+5. If available, select the checkbox for "Create a git repository."
+6. Click the "Create Project" button.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+The simplest way to open an RStudio project once it has been created is to click
+through your file system to get to the directory where it was saved and double
+click on the `.Rproj` file. This will open RStudio and start your R session in the
+same directory as the `.Rproj` file. All your data, plots and scripts will now be
+relative to the project directory. RStudio projects have the added benefit of
+allowing you to open multiple projects at the same time each open to its own
+project directory. This allows you to keep multiple projects open without them
+interfering with each other.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2: Opening an RStudio project through the file system
+
+1. Exit RStudio.
+2. Navigate to the directory where you created a project in Challenge 1.
+3. Double click on the `.Rproj` file in that directory.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Best practices for project organization
+
+Although there is no "best" way to lay out a project, there are some general
+principles to adhere to that will make project management easier:
+
+### Treat data as read only
+
+This is probably the most important goal of setting up a project. Data are
+typically time consuming and/or expensive to collect. Working with them
+interactively (e.g., in Excel) where they can be modified means you are never
+sure of where the data came from, or how they have been modified since collection.
+It is therefore a good idea to treat your data as "read-only".
+
+### Data Cleaning
+
+In many cases your data will be "dirty": it will need significant preprocessing
+to get into a format R (or any other programming language) will find useful.
+This task is sometimes called "data munging". Storing the scripts that do this cleaning in a
+separate folder, and creating a second "read-only" data folder to hold the
+"cleaned" data sets can prevent confusion between the two sets.
+
+### Treat generated output as disposable
+
+Anything generated by your scripts should be treated as disposable: it should
+all be able to be regenerated from your scripts.
+
+There are lots of different ways to manage this output. Having an output folder
+with different sub-directories for each separate analysis makes it easier later,
+since many analyses are exploratory and don't end up being used in the final
+project, and some of the analyses get shared between projects.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Good Enough Practices for Scientific Computing
+
+[Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf) gives the following recommendations for project organization:
+
+1. Put each project in its own directory, which is named after the project.
+2. Put text documents associated with the project in the `doc` directory.
+3. Put raw data and metadata in the `data` directory, and files generated during cleanup and analysis in a `results` directory.
+4. Put source for the project's scripts and programs in the `src` directory, and programs brought in from elsewhere or compiled locally in the `bin` directory.
+5. Name all files to reflect their content or function.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
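+
+The layout recommended above can be sketched in R as follows. This is only an illustration: it builds the folders inside R's temporary directory, and `my_project` is a placeholder name, so nothing in your real project is touched.
+
+``` r
+# Hypothetical project location inside the temporary directory
+project_dir <- file.path(tempdir(), "my_project")
+
+# Create one folder per recommendation
+for (folder in c("data", "doc", "results", "src", "bin")) {
+  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
+}
+
+list.files(project_dir)   # lists the five folders that were just created
+```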
+
+### Separate function definition and application
+
+One of the more effective ways to work with R is to start by writing the code you want to run directly in a .R script, and then running the selected lines (either using the keyboard shortcuts in RStudio or clicking the "Run" button) in the interactive R console.
+
+When your project is in its early stages, the initial .R script file usually contains many lines
+of directly executed code. As it matures, reusable chunks get pulled into their
+own functions. It's a good idea to separate these into two folders: one
+to store useful functions that you'll reuse across analyses and projects, and
+one to store the analysis scripts.
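+
+A minimal sketch of this separation is shown below. The folder name `functions`, the file name `temperature.R`, and the function `celsius_to_kelvin` are all hypothetical, and the helper file is written to R's temporary directory so the example can be run anywhere.
+
+``` r
+# Function definitions live in their own file (hypothetical name and location)
+functions_dir <- file.path(tempdir(), "functions")
+dir.create(functions_dir, showWarnings = FALSE)
+helper_file <- file.path(functions_dir, "temperature.R")
+writeLines("celsius_to_kelvin <- function(temp_c) temp_c + 273.15", helper_file)
+
+# The analysis script loads the definitions, then applies them
+source(helper_file)
+celsius_to_kelvin(c(0, 25, 100))   # 273.15 298.15 373.15
+```
+
+Keeping the definition in one file means several analysis scripts can load the same function with `source()` instead of each carrying its own copy.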
+
+### Save the data in the data directory
+
+Now that we have a good directory structure, we will save the data file in the `data/` directory.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Download the gapminder data from [this link to a csv file](data/gapminder_data.csv).
+
+1. Download the file (right mouse click on the link above -> "Save link as" / "Save file as", or click on the link and after the page loads, press Ctrl\+S or choose File -> "Save page as")
+2. Make sure it's saved under the name `gapminder_data.csv`
+3. Save the file in the `data/` folder within your project.
+
+We will load and inspect these data later.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 4
+
+It is useful to get some general idea about the dataset, directly from the
+command line, before loading it into R. Understanding the dataset better
+will come in handy when making decisions on how to load it in R. Use the command-line
+shell to answer the following questions:
+
+1. What is the size of the file?
+2. How many rows of data does it contain?
+3. What kinds of values are stored in this file?
+
+::::::::::::::: solution
+
+## Solution to Challenge 4
+
+By running these commands in the shell:
+
+
+``` sh
+ls -lh data/gapminder_data.csv
+```
+
+``` output
+-rw-r--r-- 1 runner docker 80K Nov 19 00:20 data/gapminder_data.csv
+```
+
+The file size is 80K.
+
+
+``` sh
+wc -l data/gapminder_data.csv
+```
+
+``` output
+1705 data/gapminder_data.csv
+```
+
+There are 1705 lines. The data looks like:
+
+
+``` sh
+head data/gapminder_data.csv
+```
+
+``` output
+country,year,pop,continent,lifeExp,gdpPercap
+Afghanistan,1952,8425333,Asia,28.801,779.4453145
+Afghanistan,1957,9240934,Asia,30.332,820.8530296
+Afghanistan,1962,10267083,Asia,31.997,853.10071
+Afghanistan,1967,11537966,Asia,34.02,836.1971382
+Afghanistan,1972,13079460,Asia,36.088,739.9811058
+Afghanistan,1977,14880372,Asia,38.438,786.11336
+Afghanistan,1982,12881816,Asia,39.854,978.0114388
+Afghanistan,1987,13867957,Asia,40.822,852.3959448
+Afghanistan,1992,16317921,Asia,41.674,649.3413952
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: command line in RStudio
+
+The Terminal tab in the console pane provides a convenient place
+within RStudio to interact directly with the command line.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Working directory
+
+Knowing R's current working directory is important because when you need to access other files (for example, to import a data file), R will look for them relative to the current working directory.
+
+Each time you create a new RStudio Project, it will create a new directory for that project. When you open an existing `.Rproj` file, it will open that project and set R's working directory to the folder that file is in.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 5
+
+You can check the current working directory with the `getwd()` command, or by using the menus in RStudio.
+
+1. In the console, type `getwd()` ("wd" is short for "working directory") and hit Enter.
+2. In the Files pane, double click on the `data` folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click "More" and then select "Go To Working Directory".
+
+You can change the working directory with `setwd()`, or by using RStudio menus.
+
+1. In the console, type `setwd("data")` and hit Enter. Type `getwd()` and hit Enter to see the new working directory.
+2. In the menus at the top of the RStudio window, click the "Session" menu, then select "Set Working Directory" and then "Choose Directory". Next, in the file browser window that opens, navigate back to the project directory and click "Open". Note that a `setwd` command will automatically appear in the console.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: File does not exist errors
+
+When you're attempting to reference a file in your R code and you're getting errors saying the file doesn't exist, it's a good idea to check your working directory.
+You need to either provide an absolute path to the file, or you need to make sure the file is saved in the working directory (or a subfolder of the working directory) and provide a relative path.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
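+
+As a rough illustration of the tip above, the following sketch checks whether a
+relative path resolves before trying to read the file. The path is the one used
+in Challenge 3; the variable name `data_path` is only an assumption made for
+this example.
+
+``` r
+# Build the relative path used in Challenge 3 (file.path() joins parts with "/")
+data_path <- file.path("data", "gapminder_data.csv")
+
+# Relative paths are resolved against the current working directory
+if (file.exists(data_path)) {
+  message("Found ", data_path)
+} else {
+  message("Cannot find ", data_path, " from ", getwd())
+}
+```
+
+If the file is reported as missing, compare the directory printed by `getwd()`
+with the location of your project folder.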
+
+### Version Control
+
+It is important to use version control with projects. See [this lesson, which describes using Git with RStudio](https://swcarpentry.github.io/git-novice/14-supplemental-rstudio.html).
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use RStudio to create and manage projects with consistent layout.
+- Treat raw data as read-only.
+- Treat generated output as disposable.
+- Separate function definition and application.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/03-seeking-help.md b/03-seeking-help.md
new file mode 100644
index 000000000..3b2bbb168
--- /dev/null
+++ b/03-seeking-help.md
@@ -0,0 +1,345 @@
+---
+title: Seeking Help
+teaching: 10
+exercises: 10
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- To be able to read R help files for functions and special operators.
+- To be able to use CRAN task views to identify packages to solve a problem.
+- To be able to seek help from your peers.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I get help in R?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+## Reading Help Files
+
+R, and every package, provides help files for functions. The general syntax to look up help for any
+function, "function\_name", that is in a package loaded into your
+namespace (your interactive R session) is:
+
+
+``` r
+?function_name
+help(function_name)
+```
+
+For example, take a look at the help file for `write.table()`. We will be using a similar function in an upcoming episode.
+
+
+``` r
+?write.table()
+```
+
+This will load up a help page in RStudio (or as plain text in R itself).
+
+Each help page is broken down into sections:
+
+- Description: An extended description of what the function does.
+- Usage: The arguments of the function and their default values (which can be changed).
+- Arguments: An explanation of the data each argument is expecting.
+- Details: Any important details to be aware of.
+- Value: The data the function returns.
+- See Also: Any related functions you might find useful.
+- Examples: Some examples for how to use the function.
+
+Different functions might have different sections, but these are the main ones you should be aware of.
+
+Notice how related functions might call for the same help file:
+
+
+``` r
+?write.table()
+?write.csv()
+```
+
+This is because these functions have very similar applicability and often share the same arguments as inputs to the function, so package authors often choose to document them together in a single help file.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Running Examples
+
+From within the function help page, you can highlight code in the
+Examples section and hit Ctrl\+Return to run it in the
+RStudio console. This gives you a quick way to get a feel for
+how a function works.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Reading Help Files
+
+One of the most daunting aspects of R is the large number of functions
+available. It would be prohibitive, if not impossible, to remember the
+correct usage for every function you use. Luckily, using the help files
+means you don't have to remember that!
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Special Operators
+
+To seek help on special operators, use quotes or backticks:
+
+
+``` r
+?"<-"
+?`<-`
+```
+
+## Getting Help with Packages
+
+Many packages come with "vignettes": tutorials and extended example documentation.
+Without any arguments, `vignette()` will list all vignettes for all installed packages;
+`vignette(package="package-name")` will list all available vignettes for
+`package-name`, and `vignette("vignette-name")` will open the specified vignette.
+
+If a package doesn't have any vignettes, you can usually find help by typing
+`help("package-name")`.
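+
+As a small sketch of these calls, the example below lists the vignettes of one
+installed package and opens a package index. The choice of the `utils` package
+(which ships with R) and the variable name `utils_vignettes` are assumptions
+made for illustration; `help(package = ...)` is shown as an alternative way to
+reach a package's index of help pages.
+
+``` r
+# List the vignettes that come with one installed package
+utils_vignettes <- vignette(package = "utils")
+utils_vignettes$results[, c("Item", "Title")]
+
+# Show the index of help pages for the same package
+help(package = "utils")
+```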
+
+RStudio also has a set of excellent
+[cheatsheets](https://rstudio.com/resources/cheatsheets/) for many packages.
+
+## When You Remember Part of the Function Name
+
+If you're not sure what package a function is in or how it's specifically spelled, you can do a fuzzy search:
+
+
+``` r
+??function_name
+```
+
+A fuzzy search is when you search for an approximate string match. For example, you may remember that the function
+to set your working directory includes "set" in its name. You can do a fuzzy search to help you identify the function:
+
+
+``` r
+??set
+```
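+
+A complementary approach, sketched below, is `apropos()`, which matches a
+pattern against the names of objects that are currently available in your
+session rather than searching the help pages. The variable name `matches` is
+only an assumption for this example.
+
+``` r
+# Find all available objects whose name starts with "set"
+matches <- apropos("^set")
+head(matches)
+
+# Check whether the function we were trying to remember is among them
+"setwd" %in% matches
+```
+
+Because `apropos()` only looks at loaded packages, `??` remains the better
+choice when the function may live in a package you have not loaded yet.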
+
+## When You Have No Idea Where to Begin
+
+If you don't know what function or package you need to use,
+[CRAN Task Views](https://cran.at.r-project.org/web/views)
+is a specially maintained list of packages grouped into
+fields. This can be a good starting point.
+
+## When Your Code Doesn't Work: Seeking Help from Your Peers
+
+If you're having trouble using a function, 9 times out of 10
+the question you have has already been answered on
+[Stack Overflow](https://stackoverflow.com/). You can search using
+the `[r]` tag. Please make sure to see their page on
+[how to ask a good question.](https://stackoverflow.com/help/how-to-ask)
+
+If you can't find the answer, there are a few useful functions to
+help you ask your peers:
+
+
+``` r
+?dput
+```
+
+The `dput()` function will dump the data you're working with into a format that can
+be copied and pasted by others into their own R session.
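+
+A minimal sketch of how this might look in practice is given below. The small
+data frame `small_cats` and the name `rebuilt` are made up for this example;
+the round trip through `capture.output()` and `parse()` simply stands in for a
+colleague pasting the printed text into their own session.
+
+``` r
+# A small, made-up data frame to share with a colleague
+small_cats <- data.frame(coat = c("calico", "black"), weight = c(2.1, 5.0))
+
+# Print R code that recreates the object; this text can be pasted into a question
+dput(small_cats)
+
+# Simulate the colleague's side: evaluate the printed code to rebuild the object
+rebuilt <- eval(parse(text = capture.output(dput(small_cats))))
+all.equal(rebuilt, small_cats)
+```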
+
+
+``` r
+sessionInfo()
+```
+
+``` output
+R version 4.4.2 (2024-10-31)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.5 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+
+locale:
+ [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
+ [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
+ [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
+[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
+
+time zone: UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+loaded via a namespace (and not attached):
+[1] compiler_4.4.2 tools_4.4.2 yaml_2.3.10 knitr_1.48 xfun_0.49
+[6] renv_1.0.11 evaluate_1.0.1
+```
+
+The `sessionInfo()` function will print out your current version of R, as well as any packages you
+have loaded. This can be useful for others to help reproduce and debug
+your issue.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Look at the help page for the `c` function. What kind of vector do you
+expect will be created if you evaluate the following:
+
+
+``` r
+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+```
+
+::::::::::::::: solution
+
+## Solution to Challenge 1
+
+The `c()` function creates a vector, in which all elements are of the
+same type. In the first case, the elements are numeric, in the
+second, they are characters, and in the third they are also characters:
+the numeric values are "coerced" to be characters.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+Look at the help for the `paste` function. You will need to use it later.
+What's the difference between the `sep` and `collapse` arguments?
+
+::::::::::::::: solution
+
+## Solution to Challenge 2
+
+To look at the help for the `paste()` function, use:
+
+
+``` r
+help("paste")
+?paste
+```
+
+The difference between `sep` and `collapse` is a little
+tricky. The `paste` function accepts any number of arguments, each of which
+can be a vector of any length. The `sep` argument specifies the string
+used between concatenated terms — by default, a space. The result is a
+vector as long as the longest argument supplied to `paste`. In contrast,
+`collapse` specifies that after concatenation the elements are *collapsed*
+together using the given separator, the result being a single string.
+
+It is important to pass these arguments explicitly by typing out the argument
+name, e.g. `sep = ","`, so the function understands to use the "," as a
+separator and not as a term to concatenate.
+For example:
+
+
+``` r
+paste(c("a","b"), "c")
+```
+
+``` output
+[1] "a c" "b c"
+```
+
+``` r
+paste(c("a","b"), "c", ",")
+```
+
+``` output
+[1] "a c ," "b c ,"
+```
+
+``` r
+paste(c("a","b"), "c", sep = ",")
+```
+
+``` output
+[1] "a,c" "b,c"
+```
+
+``` r
+paste(c("a","b"), "c", collapse = "|")
+```
+
+``` output
+[1] "a c|b c"
+```
+
+``` r
+paste(c("a","b"), "c", sep = ",", collapse = "|")
+```
+
+``` output
+[1] "a,c|b,c"
+```
+
+(For more information,
+scroll to the bottom of the `?paste` help page and look at the
+examples, or try `example('paste')`.)
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Use help to find a function (and its associated parameters) that you could
+use to load data from a tabular file in which columns are delimited with "\\t"
+(tab) and the decimal point is a "." (period). This check for decimal
+separator is important, especially if you are working with international
+colleagues, because different countries have different conventions for the
+decimal point (i.e. comma vs period).
+Hint: use `??"read table"` to look up functions related to reading in tabular data.
+
+::::::::::::::: solution
+
+## Solution to Challenge 3
+
+The standard R function for reading tab-delimited files with a period
+decimal separator is `read.delim()`. You can also do this with
+`read.table(file, sep="\t")` (the period is the *default* decimal
+separator for `read.table()`), although you may have to change
+the `comment.char` argument as well if your data file contains
+hash (#) characters.
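+
+The two approaches can be compared with a short sketch such as the one below.
+The temporary file and its contents are made up for this example, and
+`header = TRUE` is added to the `read.table()` call because this example file
+starts with a header row (`read.delim()` assumes one by default).
+
+``` r
+# Write a tiny tab-delimited file to a temporary location
+tab_file <- tempfile(fileext = ".tsv")
+writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tab_file)
+
+# read.delim() assumes tab separators, a header row, and "." as the decimal point
+from_delim <- read.delim(tab_file)
+
+# The same result with read.table(), spelling out the separator and header
+from_table <- read.table(tab_file, sep = "\t", header = TRUE)
+
+all.equal(from_delim, from_table)
+```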
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Other Resources
+
+- [Quick R](https://www.statmethods.net/)
+- [RStudio cheat sheets](https://www.rstudio.com/resources/cheatsheets/)
+- [Cookbook for R](https://www.cookbook-r.com/)
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use `help()` to get online help in R.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/04-data-structures-part1.md b/04-data-structures-part1.md
new file mode 100644
index 000000000..3847908a7
--- /dev/null
+++ b/04-data-structures-part1.md
@@ -0,0 +1,1614 @@
+---
+title: Data Structures
+teaching: 40
+exercises: 15
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- To be able to identify the 5 main data types.
+- To begin exploring data frames, and understand how they are related to vectors and lists.
+- To be able to ask questions from R about the type, class, and structure of an object.
+- To understand the information of the attributes "names", "class", and "dim".
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I read data in R?
+- What are the basic data types in R?
+- How do I represent categorical information in R?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+One of R's most powerful features is its ability to deal with tabular data -
+such as you may already have in a spreadsheet or a CSV file. Let's start by
+making a toy dataset in your `data/` directory, called `feline-data.csv`:
+
+
+``` r
+cats <- data.frame(coat = c("calico", "black", "tabby"),
+ weight = c(2.1, 5.0, 3.2),
+ likes_catnip = c(1, 0, 1))
+```
+
+We can now save `cats` as a CSV file. It is good practice to call the argument
+names explicitly so the function knows what default values you are changing. Here we
+are setting `row.names = FALSE`. Recall you can use `?write.csv` to pull
+up the help file to check out the argument names and their default values.
+
+
+``` r
+write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)
+```
+
+The contents of the new file, `feline-data.csv`:
+
+
+``` r
+coat,weight,likes_catnip
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Tip: Editing Text files in R
+
+Alternatively, you can create `data/feline-data.csv` using a text editor (Nano),
+or within RStudio with the **File -> New File -> Text File** menu item.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+We can load this into R via the following:
+
+
+``` r
+cats <- read.csv(file = "data/feline-data.csv")
+cats
+```
+
+``` output
+ coat weight likes_catnip
+1 calico 2.1 1
+2 black 5.0 0
+3 tabby 3.2 1
+```
+
+The `read.table` function is used for reading in tabular data stored in a text
+file where the columns of data are separated by punctuation characters, as in
+CSV files (csv = comma-separated values). Tabs and commas are the most common
+punctuation characters used to separate or delimit data points in such files.
+For convenience R provides 2 other versions of `read.table`: `read.csv`
+for files where the data are separated with commas and `read.delim` for files
+where the data are separated with tabs. Of these three functions `read.csv` is
+the most commonly used. If needed, it is possible to override the default
+delimiter for both `read.csv` and `read.delim`.
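+
+As a sketch of overriding the delimiter, the example below reads a file whose
+columns are separated by semicolons. The temporary file, its contents, and the
+name `cats_semicolon` are assumptions made for illustration only.
+
+``` r
+# Write a tiny semicolon-separated file to a temporary location
+semicolon_file <- tempfile(fileext = ".csv")
+writeLines(c("coat;weight", "calico;2.1", "black;5.0"), semicolon_file)
+
+# Override the default comma separator of read.csv() with the sep argument
+cats_semicolon <- read.csv(file = semicolon_file, sep = ";")
+str(cats_semicolon)
+```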
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Check your data for factors
+
+The default way R handles textual data changed with R 4.0.0. Previously, text
+data was automatically converted by R into a format called "factors". But
+there is a simpler format called "character". We will hear about
+factors later, and what to use them for. For now, remember that in most cases
+they are not needed and only complicate your life, which is why newer R
+versions read in text as "character". Check now if your version of R has
+automatically created factors and convert them to "character" format:
+
+1. Check the data types of your input by typing `str(cats)`
+2. In the output, look at the short type codes after the colons: if you see
+   only "num", "int" and "chr", you can continue with the lesson and skip this box.
+   If you find "Factor", continue to step 3.
+3. Prevent R from automatically creating "factor" data. That can be done by
+ the following code: `options(stringsAsFactors = FALSE)`. Then, re-read
+ the cats table for the change to take effect.
+4. You must set this option every time you restart R. To not forget this,
+ include it in your analysis script before you read in any data, for example
+ in one of the first lines.
+5. From R version 4.0.0 onwards, text data is no longer converted to
+   factors by default. So you can install this or a newer version to avoid this
+   problem. If you are working on an institute or company computer, ask your
+   administrator to do it.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
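+
+Instead of changing the global option, the behaviour can also be set for a
+single call through the `stringsAsFactors` argument of `read.csv()`. The sketch
+below uses a made-up two-row table passed in via the `text` argument, so it
+does not depend on any file; the variable names are assumptions for this
+example.
+
+``` r
+# A tiny CSV table given directly as text
+csv_text <- "coat,weight\ncalico,2.1\nblack,5.0"
+
+# Keep text columns as character (the default since R 4.0.0)
+as_character <- read.csv(text = csv_text, stringsAsFactors = FALSE)
+str(as_character)
+
+# Explicitly ask for factors (the default before R 4.0.0)
+as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
+str(as_factor)
+```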
+
+We can begin exploring our dataset right away, pulling out columns by specifying
+them using the `$` operator:
+
+
+``` r
+cats$weight
+```
+
+``` output
+[1] 2.1 5.0 3.2
+```
+
+``` r
+cats$coat
+```
+
+``` output
+[1] "calico" "black" "tabby"
+```
+
+We can do other operations on the columns:
+
+
+``` r
+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2
+```
+
+``` output
+[1] 4.1 7.0 5.2
+```
+
+``` r
+paste("My cat is", cats$coat)
+```
+
+``` output
+[1] "My cat is calico" "My cat is black" "My cat is tabby"
+```
+
+But what about
+
+
+``` r
+cats$weight + cats$coat
+```
+
+``` error
+Error in cats$weight + cats$coat: non-numeric argument to binary operator
+```
+
+Understanding what happened here is key to successfully analyzing data in R.
+
+### Data Types
+
+If you guessed that the last command will return an error because `2.1` plus
+`"black"` is nonsense, you're right - and you already have some intuition for an
+important concept in programming called *data types*. We can ask what type of
+data something is:
+
+
+``` r
+typeof(cats$weight)
+```
+
+``` output
+[1] "double"
+```
+
+There are 5 main types: `double`, `integer`, `complex`, `logical` and `character`.
+For historic reasons, `double` is also called `numeric`.
+
+
+``` r
+typeof(3.14)
+```
+
+``` output
+[1] "double"
+```
+
+``` r
+typeof(1L) # The L suffix forces the number to be an integer, since by default R stores numbers as doubles
+```
+
+``` output
+[1] "integer"
+```
+
+``` r
+typeof(1+1i)
+```
+
+``` output
+[1] "complex"
+```
+
+``` r
+typeof(TRUE)
+```
+
+``` output
+[1] "logical"
+```
+
+``` r
+typeof('banana')
+```
+
+``` output
+[1] "character"
+```
+
+No matter how
+complicated our analyses become, all data in R is interpreted as one of these
+basic data types. This strictness has some really important consequences.
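+
+The relationship between `double` and `numeric` mentioned above can be seen in
+a short sketch: `typeof()` reports the underlying storage type, while `class()`
+and `is.numeric()` use the older, broader term.
+
+``` r
+typeof(3.14)     # "double": the underlying storage type
+class(3.14)      # "numeric": the historic name for the same data
+
+is.numeric(1L)   # TRUE: integers also count as numeric
+is.double(1L)    # FALSE: but an integer is not stored as a double
+```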
+
+A user has added details of another cat. This information is in the file
+`data/feline-data_v2.csv`.
+
+
+``` r
+file.show("data/feline-data_v2.csv")
+```
+
+
+``` r
+coat,weight,likes_catnip
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+tabby,2.3 or 2.4,1
+```
+
+Load the new cats data like before, and check what type of data we find in the
+`weight` column:
+
+
+``` r
+cats <- read.csv(file="data/feline-data_v2.csv")
+typeof(cats$weight)
+```
+
+``` output
+[1] "character"
+```
+
+Oh no, our weights aren't the double type anymore! If we try to do the same math
+we did on them before, we run into trouble:
+
+
+``` r
+cats$weight + 2
+```
+
+``` error
+Error in cats$weight + 2: non-numeric argument to binary operator
+```
+
+What happened?
+The `cats` data we are working with is something called a *data frame*. Data frames
+are one of the most common and versatile types of *data structures* we will work with in R.
+A given column in a data frame cannot be composed of different data types.
+In this case, R cannot read every value in the data frame column `weight` as a *double*, so the data type of the entire
+column changes to one that is suitable for everything in the column.
+
+When R reads a csv file, it reads it in as a *data frame*. Thus, when we loaded the `cats`
+csv file, it was stored as a data frame. We can recognize data frames by the first row that
+is written by the `str()` function:
+
+
+``` r
+str(cats)
+```
+
+``` output
+'data.frame': 4 obs. of 3 variables:
+ $ coat : chr "calico" "black" "tabby" "tabby"
+ $ weight : chr "2.1" "5" "3.2" "2.3 or 2.4"
+ $ likes_catnip: int 1 0 1 1
+```
+
+*Data frames* are composed of rows and columns, where each column has the
+same number of rows. Different columns in a data frame can be made up of different
+data types (this is what makes them so versatile), but everything in a given
+column needs to be the same type (e.g., all double, all character, or all logical).
+
+Let's explore more about different data structures and how they behave.
+For now, let's remove that extra line from our cats data and reload it,
+while we investigate this behavior further:
+
+feline-data.csv:
+
+```
+coat,weight,likes_catnip
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+```
+
+And back in RStudio:
+
+
+``` r
+cats <- read.csv(file="data/feline-data.csv")
+```
+
+
+
+### Vectors and Type Coercion
+
+To better understand this behavior, let's meet another of the data structures:
+the *vector*.
+
+
+``` r
+my_vector <- vector(length = 3)
+my_vector
+```
+
+``` output
+[1] FALSE FALSE FALSE
+```
+
+A vector in R is essentially an ordered list of things, with the special
+condition that *everything in the vector must be the same basic data type*. If
+you don't choose the datatype, it'll default to `logical`; or, you can declare
+an empty vector of whatever type you like.
+
+
+``` r
+another_vector <- vector(mode='character', length=3)
+another_vector
+```
+
+``` output
+[1] "" "" ""
+```
+
+You can check if something is a vector:
+
+
+``` r
+str(another_vector)
+```
+
+``` output
+ chr [1:3] "" "" ""
+```
+
+The somewhat cryptic output from this command indicates the basic data type
+found in this vector - in this case `chr`, character; an indication of the
+number of things in the vector - actually, the indexes of the vector, in this
+case `[1:3]`; and a few examples of what's actually in the vector - in this case
+empty character strings. If we similarly do
+
+
+``` r
+str(cats$weight)
+```
+
+``` output
+ num [1:3] 2.1 5 3.2
+```
+
+we see that `cats$weight` is a vector, too - *the columns of data we load into R
+data.frames are all vectors*, and that's the root of why R forces everything in
+a column to be the same basic data type.
+
+:::::::::::::::::::::::::::::::::::::: discussion
+
+### Discussion 1
+
+Why is R so opinionated about what we put in our columns of data?
+How does this help us?
+
+::::::::::::::: solution
+
+### Discussion 1
+
+By keeping everything in a column the same, we allow ourselves to make simple
+assumptions about our data; if you can interpret one entry in the column as a
+number, then you can interpret *all* of them as numbers, so we don't have to
+check every time. This consistency is what people mean when they talk about
+*clean data*; in the long run, strict consistency goes a long way to making
+our lives easier in R.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+#### Coercion by combining vectors
+
+You can also make vectors with explicit contents with the combine function:
+
+
+``` r
+combine_vector <- c(2,6,3)
+combine_vector
+```
+
+``` output
+[1] 2 6 3
+```
+
+Given what we've learned so far, what do you think the following will produce?
+
+
+``` r
+quiz_vector <- c(2,6,'3')
+```
+
+This is something called *type coercion*, and it is the source of many surprises
+and the reason why we need to be aware of the basic data types and how R will
+interpret them. When R encounters a mix of types (here double and character) to
+be combined into a single vector, it will force them all to be the same
+type. Consider:
+
+
+``` r
+coercion_vector <- c('a', TRUE)
+coercion_vector
+```
+
+``` output
+[1] "a" "TRUE"
+```
+
+``` r
+another_coercion_vector <- c(0, TRUE)
+another_coercion_vector
+```
+
+``` output
+[1] 0 1
+```
+
+#### The type hierarchy
+
+The coercion rules go: `logical` -> `integer` -> `double` ("`numeric`") ->
+`complex` -> `character`, where -> can be read as *are transformed into*. For
+example, combining `logical` and `character` transforms the result to
+`character`:
+
+
+``` r
+c('a', TRUE)
+```
+
+``` output
+[1] "a" "TRUE"
+```
+
+A quick way to recognize `character` vectors is by the quotes that enclose them
+when they are printed.
+
+You can try to force
+coercion against this flow using the `as.` functions:
+
+
+``` r
+character_vector_example <- c('0','2','4')
+character_vector_example
+```
+
+``` output
+[1] "0" "2" "4"
+```
+
+``` r
+character_coerced_to_double <- as.double(character_vector_example)
+character_coerced_to_double
+```
+
+``` output
+[1] 0 2 4
+```
+
+``` r
+double_coerced_to_logical <- as.logical(character_coerced_to_double)
+double_coerced_to_logical
+```
+
+``` output
+[1] FALSE TRUE TRUE
+```
+
+As you can see, some surprising things can happen when R forces one basic data
+type into another! Nitty-gritty of type coercion aside, the point is: if your
+data doesn't look like what you thought it was going to look like, type coercion
+may well be to blame; make sure everything is the same type in your vectors and
+your columns of data.frames, or you will get nasty surprises!
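+
+Two such surprises are sketched below, using made-up values similar to the
+ones in this episode. Text that cannot be read as a number becomes `NA` (with a
+warning), and converting character data directly to `logical` does not give the
+same result as going through `double` first.
+
+``` r
+# Text that does not look like a number is turned into NA, with a warning
+mixed_weights <- as.double(c("2.1", "2.3 or 2.4"))
+mixed_weights
+
+# "0", "2" and "4" are not recognised as TRUE/FALSE strings, so all become NA
+direct_to_logical <- as.logical(c("0", "2", "4"))
+direct_to_logical
+```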
+
+But coercion can also be very useful! For example, in our `cats` data
+`likes_catnip` is numeric, but we know that the 1s and 0s actually represent
+`TRUE` and `FALSE` (a common way of representing them). We should use the
+`logical` datatype here, which has two states: `TRUE` or `FALSE`, which is
+exactly what our data represents. We can 'coerce' this column to be `logical` by
+using the `as.logical` function:
+
+
+``` r
+cats$likes_catnip
+```
+
+``` output
+[1] 1 0 1
+```
+
+``` r
+cats$likes_catnip <- as.logical(cats$likes_catnip)
+cats$likes_catnip
+```
+
+``` output
+[1] TRUE FALSE TRUE
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 1
+
+An important part of every data analysis is cleaning the input data. If you
+know that the input data is all of the same format (e.g. numbers), your
+analysis is much easier! Clean the cat data set from the section about
+type coercion.
+
+#### Copy the code template
+
+Create a new script in RStudio and copy and paste the following code. Then
+move on to the tasks below, which help you to fill in the gaps (\_\_\_\_\_\_).
+
+```
+# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+
+# 1. Print the data
+_____
+
+# 2. Show an overview of the table with all data types
+_____(cats)
+
+# 3. The "weight" column has the incorrect data type __________.
+# The correct data type is: ____________.
+
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+# print the data again to see the effect
+cats
+
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+
+# Calculate the mean to test yourself
+mean(cats$weight)
+
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!
+```
+
+### Instructions for the tasks
+
+#### 1\. Print the data
+
+Execute the first statement (`read.csv(...)`). Then print the data to the
+console.
+
+::::::::::::::: solution
+
+### Tip 1.1
+
+Show the content of any variable by typing its name.
+
+
+### Solution to Challenge 1.1
+
+Two correct solutions:
+
+```
+cats
+print(cats)
+```
+
+:::::::::::::::::::::::::
+
+#### 2\. Overview of the data types
+
+The data type of your data is as important as the data itself. Use a
+function we saw earlier to print out the data types of all columns of the
+`cats` table.
+
+::::::::::::::: solution
+
+### Tip 1.2
+
+In the chapter "Data types" we saw two functions that can show data types.
+One printed just a single word, the data type name. The other printed
+a short form of the data type, and the first few values. We need the second
+here.
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::: solution
+
+### Solution to Challenge 1.2
+
+```
+str(cats)
+```
+
+:::::::::::::::::::::::::
+
+#### 3\. Which data type do we need?
+
+The shown data type is not the right one for this data (weight of
+a cat). Which data type do we need?
+
+- Why did the `read.csv()` function not choose the correct data type?
+- Fill in the gap in the comment with the correct data type for cat weight!
+
+::::::::::::::: solution
+
+### Tip 1.3
+
+Scroll up to the section about the [type hierarchy](#the-type-hierarchy)
+to review the available data types.
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::: solution
+
+### Solution to Challenge 1.3
+
+- Weight is expressed on a continuous scale (real numbers). The R
+  data type for this is "double" (also known as "numeric").
+- The fourth row has the value "2.3 or 2.4". That is not one number
+  but two, plus an English word. Therefore, the "character" data type
+  is chosen. The whole column is now text, because all values in the same
+  column have to be the same data type.
+
+
+:::::::::::::::::::::::::
+
+#### 4\. Correct the problematic value
+
+The code to assign a new weight value to the problematic fourth row is given.
+Think first and then execute it: What will be the data type after assigning
+a number like in this example?
+You can check the data type after executing to see if you were right.
+
+::::::::::::::: solution
+
+### Tip 1.4
+
+Revisit the hierarchy of data types when two different data types are
+combined.
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::: solution
+
+### Solution to Challenge 1.4
+
+The data type of the column "weight" is "character". The assigned data
+type is "double". Combining two data types yields the data type that is
+higher in the following hierarchy:
+
+```
+logical < integer < double < complex < character
+```
+
+Therefore, the column is still of type character! We need to manually
+convert it to "double".
+
+:::::::::::::::::::::::::
+
+#### 5\. Convert the column "weight" to the correct data type
+
+Cat weights are numbers. But the column does not have this data type yet.
+Coerce the column to floating point numbers.
+
+::::::::::::::: solution
+
+### Tip 1.5
+
+The functions to convert data types start with `as.`. You can look
+for the function further up in this episode or use the RStudio
+auto-complete function: Type "`as.`" and then press the TAB key.
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::: solution
+
+### Solution to Challenge 1.5
+
+There are two functions that are synonymous for historic reasons:
+
+```
+cats$weight <- as.double(cats$weight)
+cats$weight <- as.numeric(cats$weight)
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Some basic vector functions
+
+The combine function, `c()`, will also append things to an existing vector:
+
+
+``` r
+ab_vector <- c('a', 'b')
+ab_vector
+```
+
+``` output
+[1] "a" "b"
+```
+
+``` r
+combine_example <- c(ab_vector, 'SWC')
+combine_example
+```
+
+``` output
+[1] "a" "b" "SWC"
+```
+
+You can also make series of numbers:
+
+
+``` r
+mySeries <- 1:10
+mySeries
+```
+
+``` output
+ [1] 1 2 3 4 5 6 7 8 9 10
+```
+
+``` r
+seq(10)
+```
+
+``` output
+ [1] 1 2 3 4 5 6 7 8 9 10
+```
+
+``` r
+seq(1,10, by=0.1)
+```
+
+``` output
+ [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4
+[16] 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
+[31] 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4
+[46] 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9
+[61] 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4
+[76] 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9
+[91] 10.0
+```
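+
+`seq()` can also be told how many values to produce instead of the step size.
+A brief sketch using the `length.out` argument, together with `seq_len()`,
+which counts from 1 up to a given length:
+
+``` r
+# Five evenly spaced values from 0 to 1
+even_steps <- seq(0, 1, length.out = 5)
+even_steps
+
+# The integers 1 to 4; unlike 1:n, seq_len(0) gives an empty vector
+seq_len(4)
+length(seq_len(0))
+```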
+
+We can ask a few questions about vectors:
+
+
+``` r
+sequence_example <- 20:25
+head(sequence_example, n=2)
+```
+
+``` output
+[1] 20 21
+```
+
+``` r
+tail(sequence_example, n=4)
+```
+
+``` output
+[1] 22 23 24 25
+```
+
+``` r
+length(sequence_example)
+```
+
+``` output
+[1] 6
+```
+
+``` r
+typeof(sequence_example)
+```
+
+``` output
+[1] "integer"
+```
+
+We can get individual elements of a vector by using the bracket notation:
+
+
+``` r
+first_element <- sequence_example[1]
+first_element
+```
+
+``` output
+[1] 20
+```
+
+To change a single element, use the bracket on the other side of the arrow:
+
+
+``` r
+sequence_example[1] <- 30
+sequence_example
+```
+
+``` output
+[1] 30 21 22 23 24 25
+```
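+
+One detail is easy to miss here: `20:25` is an integer vector, but the `30` we
+assigned is a double, so R coerces the whole vector. A short sketch (the name
+`int_example` is only for illustration); writing `30L` keeps the integer type:
+
+``` r
+int_example <- 20:25
+typeof(int_example)      # "integer"
+
+int_example[1] <- 30L    # the L suffix marks an integer
+typeof(int_example)      # still "integer"
+
+int_example[1] <- 30     # a plain number is a double
+typeof(int_example)      # now "double"
+```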
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 2
+
+Start by making a vector with the numbers 1 through 26.
+Then, multiply the vector by 2.
+
+::::::::::::::: solution
+
+### Solution to Challenge 2
+
+
+``` r
+x <- 1:26
+x <- x * 2
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Lists
+
+Another data structure you'll want in your bag of tricks is the `list`. A list
+is simpler in some ways than the other types, because you can put anything you
+want in it. Remember *everything in the vector must be of the same basic data type*,
+but a list can have different data types:
+
+
+``` r
+list_example <- list(1, "a", TRUE, 1+4i)
+list_example
+```
+
+``` output
+[[1]]
+[1] 1
+
+[[2]]
+[1] "a"
+
+[[3]]
+[1] TRUE
+
+[[4]]
+[1] 1+4i
+```
+
+When printing the object structure with `str()`, we see the data types of all
+elements:
+
+
+``` r
+str(list_example)
+```
+
+``` output
+List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i
+```
+
+What is the use of lists? They can **organize data of different types**. For
+example, you can organize different tables that belong together, similar to
+spreadsheets in Excel. But there are many other uses, too.
+
+We will see another example, one that may surprise you, later in this episode.
+
+To retrieve one of the elements of a list, use the **double bracket**:
+
+
+``` r
+list_example[[2]]
+```
+
+``` output
+[1] "a"
+```
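+
+The single bracket also works on lists, but it returns a smaller *list* rather
+than the element itself. A quick comparison, rebuilding `list_example` so the
+snippet runs on its own:
+
+``` r
+list_example <- list(1, "a", TRUE, 1+4i)
+
+typeof(list_example[2])     # "list": a list of length one
+typeof(list_example[[2]])   # "character": the element itself
+```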
+
+The elements of a list can also have **names**. You assign them by writing each
+name before its value, separated by an equals sign:
+
+
+``` r
+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
+another_list
+```
+
+``` output
+$title
+[1] "Numbers"
+
+$numbers
+ [1] 1 2 3 4 5 6 7 8 9 10
+
+$data
+[1] TRUE
+```
+
+This results in a **named list**. Names give our object a new capability: we can
+now access single elements in an additional way, by name with the `$` operator.
+
+
+``` r
+another_list$title
+```
+
+``` output
+[1] "Numbers"
+```
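+
+The double bracket accepts a name as well, which is handy when the name is
+stored in a variable. A small sketch (`wanted` is a hypothetical variable):
+
+``` r
+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)
+
+another_list[["title"]]    # same element as another_list$title
+
+wanted <- "numbers"
+another_list[[wanted]]     # look up by a name held in a variable
+```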
+
+## Names
+
+With names, we can give meaning to elements. This is the first time that we have
+not only the **data** but also information that explains it. It is *metadata*
+that can be stuck to the object like a label. In R, this is called an
+**attribute**. Some attributes enable us to do more with our
+object, for example, as here, accessing an element by a self-defined name.
+
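+To look at the attributes attached to an object, base R provides
+`attributes()`. A minimal sketch with a made-up named vector:
+
+``` r
+heights <- c(anna = 1.68, ben = 1.81)   # hypothetical example data
+attributes(heights)                     # a list with one entry, $names
+
+plain <- c(1.68, 1.81)
+attributes(plain)                       # NULL: no attributes attached
+```
+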
+### Accessing vectors and lists by name
+
+We have already seen how to generate a named list. The way to generate a named
+vector is very similar. You have seen this function before:
+
+
+``` r
+pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )
+```
+
+The way to retrieve elements is different, though:
+
+
+``` r
+pizza_price["pizzasubito"]
+```
+
+``` output
+pizzasubito
+ 5.64
+```
+
+The approach used for the list does not work:
+
+
+``` r
+pizza_price$pizzafresh
+```
+
+``` error
+Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors
+```
+
+It pays to remember this error message, because you will meet it in your own
+analyses. It means that you have just tried to access an element as if it were in
+a list, when it is actually in an atomic vector.
+
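+With a named vector, the double bracket returns the bare value without its
+name, and the single bracket accepts several names at once. A short sketch,
+recreating `pizza_price` so it runs on its own:
+
+``` r
+pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)
+
+pizza_price[["pizzafresh"]]                    # just the number, name dropped
+pizza_price[c("pizzasubito", "callapizza")]    # two elements, names kept
+```
+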
+### Accessing and changing names
+
+If you are only interested in the names, use the `names()` function:
+
+
+``` r
+names(pizza_price)
+```
+
+``` output
+[1] "pizzasubito" "pizzafresh" "callapizza"
+```
+
+We have seen how to access and change single elements of a vector. The same is
+possible for names:
+
+
+``` r
+names(pizza_price)[3]
+```
+
+``` output
+[1] "callapizza"
+```
+
+``` r
+names(pizza_price)[3] <- "call-a-pizza"
+pizza_price
+```
+
+``` output
+ pizzasubito pizzafresh call-a-pizza
+ 5.64 6.60 4.50
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 3
+
+- What is the data type of the names of `pizza_price`? You can find out
+ using the `str()` or `typeof()` functions.
+
+::::::::::::::: solution
+
+### Solution to Challenge 3
+
+You get the names of an object by wrapping the object name inside
+`names(...)`. Similarly, you get the data type of the names by again
+wrapping the whole code in `typeof(...)`:
+
+```
+typeof(names(pizza_price))
+```
+
+Alternatively, use a new variable if this is easier for you to read:
+
+```
+n <- names(pizza_price)
+typeof(n)
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 4
+
+Instead of just changing some of the names a vector/list already has, you can
+also set all names of an object by writing code like (replace ALL CAPS text):
+
+```
+names( OBJECT ) <- CHARACTER_VECTOR
+```
+
+Create a vector that gives the number for each letter in the alphabet!
+
+1. Generate a vector called `letter_no` with the sequence of numbers from 1
+ to 26!
+2. R has a built-in object called `LETTERS`. It is a 26-character vector, from
+ A to Z. Set the names of the number sequence to these 26 letters.
+3. Test yourself by calling `letter_no["B"]`, which should give you the number
+ 2!
+
+::::::::::::::: solution
+
+### Solution to Challenge 4
+
+```
+letter_no <- 1:26 # or seq(1,26)
+names(letter_no) <- LETTERS
+letter_no["B"]
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Data frames
+
+We met data frames at the very beginning of this lesson; they represent
+a table of data. We didn't go into much detail with our example cat
+data frame:
+
+
+``` r
+cats
+```
+
+``` output
+ coat weight likes_catnip
+1 calico 2.1 TRUE
+2 black 5.0 FALSE
+3 tabby 3.2 TRUE
+```
+
+We can now understand something a bit surprising in our data.frame; what happens
+if we run:
+
+
+``` r
+typeof(cats)
+```
+
+``` output
+[1] "list"
+```
+
+We see that data.frames look like lists 'under the hood'. Think again about what we
+learned lists can be used for:
+
+> Lists organize data of different types
+
+Columns of a data frame are vectors of different types, that are organized
+by belonging to the same table.
+
+A data.frame is really a list of vectors. It is a special list in which all the
+vectors must have the same length.
+
+How is this "special"-ness written into the object, so that R does not treat it
+like any other list, but as a table?
+
+
+``` r
+class(cats)
+```
+
+``` output
+[1] "data.frame"
+```
+
+A **class**, just like names, is an attribute attached to the object. It tells
+us what this object means for humans.
+
+You might wonder: Why do we need another what-type-of-object-is-this-function?
+We already have `typeof()`? That function tells us how the object is
+**constructed in the computer**. The `class` is the **meaning of the object for
+humans**. Consequently, what `typeof()` returns is *fixed* in R (mainly the
+five data types), whereas the output of `class()` is *diverse* and *extendable*
+by R packages.
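+
+To see that the class really is just an attribute, we can list the attributes
+of a data frame and then strip the class away. A sketch using a hand-built
+stand-in for the cat data (`cats_demo` is a hypothetical name):
+
+``` r
+cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
+                        weight = c(2.1, 5.0, 3.2),
+                        likes_catnip = c(TRUE, FALSE, TRUE))
+
+names(attributes(cats_demo))     # the attributes: names, class, row.names
+
+bare <- unclass(cats_demo)       # remove the class attribute
+class(bare)                      # "list": R no longer treats it as a table
+typeof(bare)                     # "list": the underlying construction is unchanged
+```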
+
+In our `cats` example, we have a character, a double and a logical variable. As
+we have seen already, each column of a data.frame is a vector.
+
+
+``` r
+cats$coat
+```
+
+``` output
+[1] "calico" "black" "tabby"
+```
+
+``` r
+cats[,1]
+```
+
+``` output
+[1] "calico" "black" "tabby"
+```
+
+``` r
+typeof(cats[,1])
+```
+
+``` output
+[1] "character"
+```
+
+``` r
+str(cats[,1])
+```
+
+``` output
+ chr [1:3] "calico" "black" "tabby"
+```
+
+Each row is an *observation* of different variables, itself a data.frame, and
+thus can be composed of elements of different types.
+
+
+``` r
+cats[1,]
+```
+
+``` output
+ coat weight likes_catnip
+1 calico 2.1 TRUE
+```
+
+``` r
+typeof(cats[1,])
+```
+
+``` output
+[1] "list"
+```
+
+``` r
+str(cats[1,])
+```
+
+``` output
+'data.frame': 1 obs. of 3 variables:
+ $ coat : chr "calico"
+ $ weight : num 2.1
+ $ likes_catnip: logi TRUE
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 5
+
+There are several subtly different ways to call variables, observations and
+elements from data.frames:
+
+- `cats[1]`
+- `cats[[1]]`
+- `cats$coat`
+- `cats["coat"]`
+- `cats[1, 1]`
+- `cats[, 1]`
+- `cats[1, ]`
+
+Try out these examples and explain what is returned by each one.
+
+*Hint:* Use the function `typeof()` to examine what is returned in each case.
+
+::::::::::::::: solution
+
+### Solution to Challenge 5
+
+
+``` r
+cats[1]
+```
+
+``` output
+ coat
+1 calico
+2 black
+3 tabby
+```
+
+We can think of a data frame as a list of vectors. The single bracket `[1]`
+returns the first slice of the list, as another list. In this case it is the
+first column of the data frame.
+
+
+``` r
+cats[[1]]
+```
+
+``` output
+[1] "calico" "black" "tabby"
+```
+
+The double bracket `[[1]]` returns the contents of the list item. In this case
+it is the contents of the first column, a *vector* of type *character*.
+
+
+``` r
+cats$coat
+```
+
+``` output
+[1] "calico" "black" "tabby"
+```
+
+This example uses the `$` character to address items by name. *coat* is the
+first column of the data frame, again a *vector* of type *character*.
+
+
+``` r
+cats["coat"]
+```
+
+``` output
+ coat
+1 calico
+2 black
+3 tabby
+```
+
+Here we are using a single bracket `["coat"]`, replacing the index number with
+the column name. Like example 1, the returned object is a *list*.
+
+
+``` r
+cats[1, 1]
+```
+
+``` output
+[1] "calico"
+```
+
+This example uses a single bracket, but this time we provide row and column
+coordinates. The returned object is the value in row 1, column 1. The object
+is a *vector* of type *character*.
+
+
+``` r
+cats[, 1]
+```
+
+``` output
+[1] "calico" "black" "tabby"
+```
+
+Like the previous example, we use single brackets and provide row and column
+coordinates. The row coordinate is not specified, so R interprets this missing
+value as all the elements in this *column* and returns them as a *vector*.
+
+
+``` r
+cats[1, ]
+```
+
+``` output
+ coat weight likes_catnip
+1 calico 2.1 TRUE
+```
+
+Again we use the single bracket with row and column coordinates. The column
+coordinate is not specified. The return value is a *list* (more precisely, a
+one-row data frame) containing all the values in the first row.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Tip: Renaming data frame columns
+
+Data frames have column names, which can be accessed with the `names()` function.
+
+
+``` r
+names(cats)
+```
+
+``` output
+[1] "coat" "weight" "likes_catnip"
+```
+
+If you want to rename the second column of `cats`, you can assign a new name to the second element of `names(cats)`.
+
+
+``` r
+names(cats)[2] <- "weight_kg"
+cats
+```
+
+``` output
+ coat weight_kg likes_catnip
+1 calico 2.1 TRUE
+2 black 5.0 FALSE
+3 tabby 3.2 TRUE
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+### Matrices
+
+Last but not least is the matrix. We can declare a matrix full of zeros:
+
+
+``` r
+matrix_example <- matrix(0, ncol=6, nrow=3)
+matrix_example
+```
+
+``` output
+ [,1] [,2] [,3] [,4] [,5] [,6]
+[1,] 0 0 0 0 0 0
+[2,] 0 0 0 0 0 0
+[3,] 0 0 0 0 0 0
+```
+
+What makes it special is the `dim` attribute, which we can read with `dim()`:
+
+
+``` r
+dim(matrix_example)
+```
+
+``` output
+[1] 3 6
+```
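+
+Because the dimensions are only an attribute, a plain vector can be turned into
+a matrix by assigning to `dim()`. A brief sketch with a made-up vector:
+
+``` r
+v <- 1:6
+class(v)             # "integer"
+
+dim(v) <- c(2, 3)    # 2 rows, 3 columns, filled column by column
+is.matrix(v)         # TRUE
+v[2, 1]              # 2: second row of the first column
+```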
+
+And similar to other data structures, we can ask things about our matrix:
+
+
+``` r
+typeof(matrix_example)
+```
+
+``` output
+[1] "double"
+```
+
+``` r
+class(matrix_example)
+```
+
+``` output
+[1] "matrix" "array"
+```
+
+``` r
+str(matrix_example)
+```
+
+``` output
+ num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
+```
+
+``` r
+nrow(matrix_example)
+```
+
+``` output
+[1] 3
+```
+
+``` r
+ncol(matrix_example)
+```
+
+``` output
+[1] 6
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 6
+
+What do you think will be the result of
+`length(matrix_example)`?
+Try it.
+Were you right? Why / why not?
+
+::::::::::::::: solution
+
+### Solution to Challenge 6
+
+
+``` r
+matrix_example <- matrix(0, ncol=6, nrow=3)
+length(matrix_example)
+```
+
+``` output
+[1] 18
+```
+
+Because a matrix is a vector with added dimension attributes, `length`
+gives you the total number of elements in the matrix.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 7
+
+Make another matrix, this time containing the numbers 1:50,
+with 5 columns and 10 rows.
+Did the `matrix` function fill your matrix by column, or by
+row, as its default behaviour?
+See if you can figure out how to change this.
+(hint: read the documentation for `matrix`!)
+
+::::::::::::::: solution
+
+### Solution to Challenge 7
+
+
+``` r
+x <- matrix(1:50, ncol=5, nrow=10)
+x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 8
+
+Create a list of length two containing a character vector for each of the sections in this part of the workshop:
+
+- Data types
+- Data structures
+
+Populate each character vector with the names of the data types and data
+structures we've seen so far.
+
+::::::::::::::: solution
+
+### Solution to Challenge 8
+
+
+``` r
+dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
+dataStructures <- c('data.frame', 'vector', 'list', 'matrix')
+answer <- list(dataTypes, dataStructures)
+```
+
+Note: it's nice to make a list in big writing on the board or taped to the wall
+listing all of these types and structures - leave it up for the rest of the workshop
+to remind people of the importance of these basics.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 9
+
+Consider the R output of the matrix below:
+
+
+``` output
+ [,1] [,2]
+[1,] 4 1
+[2,] 9 5
+[3,] 10 7
+```
+
+What was the correct command used to write this matrix? Examine
+each command and try to figure out the correct one before typing them.
+Think about what matrices the other commands will produce.
+
+1. `matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)`
+2. `matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)`
+3. `matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)`
+4. `matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)`
+
+::::::::::::::: solution
+
+### Solution to Challenge 9
+
+
+``` r
+matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use `read.csv` to read tabular data in R.
+- The basic data types in R are double, integer, complex, logical, and character.
+- Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/05-data-structures-part2.md b/05-data-structures-part2.md
new file mode 100644
index 000000000..c450668c8
--- /dev/null
+++ b/05-data-structures-part2.md
@@ -0,0 +1,592 @@
+---
+title: Exploring Data Frames
+teaching: 20
+exercises: 10
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Add and remove rows or columns.
+- Append two data frames.
+- Display basic properties of data frames including size and class of the columns, names, and first few rows.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I manipulate a data frame?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+At this point, you've seen it all: in the last lesson, we toured all the basic
+data types and data structures in R. Everything you do will be a manipulation of
+those tools. But most of the time, the star of the show is the data frame—the table that we created by loading information from a csv file. In this lesson, we'll learn a few more things
+about working with data frames.
+
+## Adding columns and rows in data frames
+
+We already learned that the columns of a data frame are vectors, so that our
+data are consistent in type throughout the columns. As such, if we want to add a
+new column, we can start by making a new vector:
+
+
+
+
+``` r
+age <- c(2, 3, 5)
+cats
+```
+
+``` output
+ coat weight likes_catnip
+1 calico 2.1 1
+2 black 5.0 0
+3 tabby 3.2 1
+```
+
+We can then add this as a column via:
+
+
+``` r
+cbind(cats, age)
+```
+
+``` output
+ coat weight likes_catnip age
+1 calico 2.1 1 2
+2 black 5.0 0 3
+3 tabby 3.2 1 5
+```
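+
+`cbind()` returns a new data frame and leaves `cats` unchanged until we assign
+the result. An alternative that some people find more readable is to assign
+to a new column name with `$`. A sketch on a hand-built copy of the data
+(`cats_demo` is a hypothetical name, so your `cats` is not touched):
+
+``` r
+cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
+                        weight = c(2.1, 5.0, 3.2),
+                        likes_catnip = c(1, 0, 1))
+
+cats_demo$age <- c(2, 3, 5)    # adds the column in place
+names(cats_demo)
+ncol(cats_demo)                # 4
+```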
+
+Note that if we tried to add a vector of ages with a different number of entries than the number of rows in the data frame, it would fail:
+
+
+``` r
+age <- c(2, 3, 5, 12)
+cbind(cats, age)
+```
+
+``` error
+Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4
+```
+
+``` r
+age <- c(2, 3)
+cbind(cats, age)
+```
+
+``` error
+Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2
+```
+
+Why didn't this work? Of course, R wants to see one element in our new column
+for every row in the table:
+
+
+``` r
+nrow(cats)
+```
+
+``` output
+[1] 3
+```
+
+``` r
+length(age)
+```
+
+``` output
+[1] 2
+```
+
+So for it to work we need to have `nrow(cats)` = `length(age)`. Let's overwrite the content of cats with our new data frame.
+
+
+``` r
+age <- c(2, 3, 5)
+cats <- cbind(cats, age)
+```
+
+Now how about adding rows? We already know that the rows of a
+data frame are lists:
+
+
+``` r
+newRow <- list("tortoiseshell", 3.3, TRUE, 9)
+cats <- rbind(cats, newRow)
+```
+
+Let's confirm that our new row was added correctly.
+
+
+``` r
+cats
+```
+
+``` output
+ coat weight likes_catnip age
+1 calico 2.1 1 2
+2 black 5.0 0 3
+3 tabby 3.2 1 5
+4 tortoiseshell 3.3 1 9
+```
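+
+A list works here because its elements line up with the columns by position.
+If you would rather match by column name, the new row can be a one-row data
+frame instead. A sketch, again on a hypothetical stand-in called `cats_demo`:
+
+``` r
+cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
+                        weight = c(2.1, 5.0, 3.2),
+                        likes_catnip = c(1, 0, 1),
+                        age = c(2, 3, 5))
+
+# Columns are given in a different order; rbind() matches them by name
+new_cat <- data.frame(age = 9, coat = "tortoiseshell",
+                      likes_catnip = 1, weight = 3.3)
+cats_demo <- rbind(cats_demo, new_cat)
+cats_demo[4, ]
+```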
+
+
+## Removing rows
+
+We now know how to add rows and columns to our data frame in R. Now let's learn to remove rows.
+
+
+``` r
+cats
+```
+
+``` output
+ coat weight likes_catnip age
+1 calico 2.1 1 2
+2 black 5.0 0 3
+3 tabby 3.2 1 5
+4 tortoiseshell 3.3 1 9
+```
+
+We can ask for a data frame minus the last row:
+
+
+``` r
+cats[-4, ]
+```
+
+``` output
+ coat weight likes_catnip age
+1 calico 2.1 1 2
+2 black 5.0 0 3
+3 tabby 3.2 1 5
+```
+
+Notice the comma with nothing after it to indicate that we want to drop the entire fourth row.
+
+Note: we could also remove several rows at once by putting the row numbers
+inside of a vector, for example: `cats[c(-3,-4), ]`
+
+
+## Removing columns
+
+We can also remove columns from our data frame. What if we want to remove the column "age"? We can remove it in two ways: by column number (index) or by column name.
+
+
+``` r
+cats[,-4]
+```
+
+``` output
+ coat weight likes_catnip
+1 calico 2.1 1
+2 black 5.0 0
+3 tabby 3.2 1
+4 tortoiseshell 3.3 1
+```
+
+Notice the comma with nothing before it, indicating we want to keep all of the rows.
+
+Alternatively, we can drop the column by using the column name and the `%in%` operator. The `%in%` operator goes through each element of its left argument, in this case the names of `cats`, and asks, "Does this element occur in the second argument?"
+
+
+``` r
+drop <- names(cats) %in% c("age")
+cats[,!drop]
+```
+
+``` output
+ coat weight likes_catnip
+1 calico 2.1 1
+2 black 5.0 0
+3 tabby 3.2 1
+4 tortoiseshell 3.3 1
+```
+
+We will cover subsetting with logical operators like `%in%` in more detail in the next episode. See the section [Subsetting through other logical operations](06-data-subsetting.Rmd).
+
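+To see what `%in%` produces, it helps to print the intermediate logical
+vector. A sketch using a plain character vector in place of `names(cats)`:
+
+``` r
+col_names <- c("coat", "weight", "likes_catnip", "age")
+
+drop <- col_names %in% c("age")
+drop                 # FALSE FALSE FALSE  TRUE
+col_names[!drop]     # the names that would be kept
+```
+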
+## Appending to a data frame
+
+The key to remember when adding data to a data frame is that *columns are
+vectors and rows are lists.* We can also glue two data frames
+together with `rbind`:
+
+
+``` r
+cats <- rbind(cats, cats)
+cats
+```
+
+``` output
+ coat weight likes_catnip age
+1 calico 2.1 1 2
+2 black 5.0 0 3
+3 tabby 3.2 1 5
+4 tortoiseshell 3.3 1 9
+5 calico 2.1 1 2
+6 black 5.0 0 3
+7 tabby 3.2 1 5
+8 tortoiseshell 3.3 1 9
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+You can create a new data frame right from within R with the following syntax:
+
+
+``` r
+df <- data.frame(id = c("a", "b", "c"),
+ x = 1:3,
+ y = c(TRUE, TRUE, FALSE))
+```
+
+Make a data frame that holds the following information for yourself:
+
+- first name
+- last name
+- lucky number
+
+Then use `rbind` to add an entry for the people sitting beside you.
+Finally, use `cbind` to add a column with each person's answer to the question, "Is it time for coffee break?"
+
+::::::::::::::: solution
+
+## Solution to Challenge 1
+
+
+``` r
+df <- data.frame(first = c("Grace"),
+ last = c("Hopper"),
+ lucky_number = c(0))
+df <- rbind(df, list("Marie", "Curie", 238) )
+df <- cbind(df, coffeetime = c(TRUE,TRUE))
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Realistic example
+
+So far, you have seen the basics of manipulating data frames with our cat data;
+now let's use those skills to digest a more realistic dataset. Let's read in the
+`gapminder` dataset that we downloaded previously:
+
+
+``` r
+gapminder <- read.csv("data/gapminder_data.csv")
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Miscellaneous Tips
+
+- Another type of file you might encounter is the tab-separated value file (.tsv). To specify a tab as the separator, pass `sep = "\t"` to `read.csv()`, or use `read.delim()`.
+
+- Files can also be downloaded directly from the Internet into a local
+ folder of your choice onto your computer using the `download.file` function.
+ The `read.csv` function can then be executed to read the downloaded file from the download location, for example,
+
+
+``` r
+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv("data/gapminder_data.csv")
+```
+
+- Alternatively, you can also read in files directly into R from the Internet by replacing the file paths with a web address in `read.csv`. One should note that in doing this no local copy of the csv file is first saved onto your computer. For example,
+
+
+``` r
+gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv")
+```
+
+- You can read directly from excel spreadsheets without
+ converting them to plain text first by using the [readxl](https://cran.r-project.org/package=readxl) package.
+
+- The argument `stringsAsFactors` tells R whether to read strings as factors or as character strings. From R 4.0.0 onward, all strings are read in as characters by default, but in earlier versions of R, strings are read in as factors by default. For more information, see the call-out in [the previous episode](https://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html#check-your-data-for-factors).
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
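+
+The tab-separated case from the tips above can be tried without any extra
+files. This sketch writes a tiny made-up table to a temporary file and reads
+it back in both ways:
+
+``` r
+tsv_file <- tempfile(fileext = ".tsv")
+writeLines(c("name\tvalue", "a\t1", "b\t2"), tsv_file)
+
+from_delim <- read.delim(tsv_file)             # tab is the default separator
+from_csv   <- read.csv(tsv_file, sep = "\t")   # read.csv with an explicit tab
+
+identical(from_delim, from_csv)                # TRUE
+```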
+
+Let's investigate gapminder a bit; the first thing we should always do is check
+out what the data looks like with `str`:
+
+
+``` r
+str(gapminder)
+```
+
+``` output
+'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
+```
+
+An additional method for examining the structure of gapminder is to use the `summary` function. This function can be used on various objects in R. For data frames, `summary` yields a numeric, tabular, or descriptive summary of each column. Numeric or integer columns are described by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode.
+
+
+``` r
+summary(gapminder)
+```
+
+``` output
+ country year pop continent
+ Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
+ Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
+ Mode :character Median :1980 Median :7.024e+06 Mode :character
+ Mean :1980 Mean :2.960e+07
+ 3rd Qu.:1993 3rd Qu.:1.959e+07
+ Max. :2007 Max. :1.319e+09
+ lifeExp gdpPercap
+ Min. :23.60 Min. : 241.2
+ 1st Qu.:48.20 1st Qu.: 1202.1
+ Median :60.71 Median : 3531.8
+ Mean :59.47 Mean : 7215.3
+ 3rd Qu.:70.85 3rd Qu.: 9325.5
+ Max. :82.60 Max. :113523.1
+```
+
+Along with the `str` and `summary` functions, we can examine individual columns of the data frame with our `typeof` function:
+
+
+``` r
+typeof(gapminder$year)
+```
+
+``` output
+[1] "integer"
+```
+
+``` r
+typeof(gapminder$country)
+```
+
+``` output
+[1] "character"
+```
+
+``` r
+str(gapminder$country)
+```
+
+``` output
+ chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+```
+
+We can also interrogate the data frame for information about its dimensions;
+remembering that `str(gapminder)` said there were 1704 observations of 6
+variables in gapminder, what do you think the following will produce, and why?
+
+
+``` r
+length(gapminder)
+```
+
+``` output
+[1] 6
+```
+
+A fair guess would have been to say that the length of a data frame would be the
+number of rows it has (1704), but this is not the case; remember, a data frame
+is a *list of vectors and factors*:
+
+
+``` r
+typeof(gapminder)
+```
+
+``` output
+[1] "list"
+```
+
+When `length` gave us 6, it's because gapminder is built out of a list of 6
+columns. To get the number of rows and columns in our dataset, try:
+
+
+``` r
+nrow(gapminder)
+```
+
+``` output
+[1] 1704
+```
+
+``` r
+ncol(gapminder)
+```
+
+``` output
+[1] 6
+```
+
+Or, both at once:
+
+
+``` r
+dim(gapminder)
+```
+
+``` output
+[1] 1704 6
+```
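+
+The relationship between these functions can be checked on any data frame. A
+sketch with a small made-up one:
+
+``` r
+df_demo <- data.frame(id = c("a", "b", "c"), x = 1:3)
+
+length(df_demo)                               # 2: the number of columns
+identical(length(df_demo), ncol(df_demo))     # TRUE
+identical(dim(df_demo), c(nrow(df_demo), ncol(df_demo)))   # TRUE
+```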
+
+We'll also likely want to know what the titles of all the columns are, so we can
+ask for them later:
+
+
+``` r
+colnames(gapminder)
+```
+
+``` output
+[1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
+```
+
+At this stage, it's important to ask ourselves if the structure R is reporting
+matches our intuition or expectations; do the basic data types reported for each
+column make sense? If not, we need to sort any problems out now before they turn
+into bad surprises down the road, using what we've learned about how R
+interprets data, and the importance of *strict consistency* in how we record our
+data.
+
+Once we're happy that the data types and structures seem reasonable, it's time
+to start digging into our data proper. Check out the first few lines:
+
+
+``` r
+head(gapminder)
+```
+
+``` output
+ country year pop continent lifeExp gdpPercap
+1 Afghanistan 1952 8425333 Asia 28.801 779.4453
+2 Afghanistan 1957 9240934 Asia 30.332 820.8530
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+4 Afghanistan 1967 11537966 Asia 34.020 836.1971
+5 Afghanistan 1972 13079460 Asia 36.088 739.9811
+6 Afghanistan 1977 14880372 Asia 38.438 786.1134
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+It's good practice to also check the last few lines of your data and some in the middle. How would you do this?
+
+Searching for ones specifically in the middle isn't too hard, but we could ask for a few lines at random. How would you code this?
+
+::::::::::::::: solution
+
+## Solution to Challenge 2
+
+To check the last few lines it's relatively simple as R already has a function for this:
+
+```r
+tail(gapminder)
+tail(gapminder, n = 15)
+```
+
+What about a few arbitrary rows just in case something is odd in the middle?
+
+## Tip: There are several ways to achieve this.
+
+The solution here presents one form of using nested functions, i.e. a function passed as an argument to another function. This might sound like a new concept, but you are already using it!
+Remember that `my_dataframe[rows, cols]` prints the rows and columns of your data frame that you asked for (you might have asked for a range or for named columns, for example). How would you get the last row if you don't know how many rows your data frame has? R has a function for this. What about getting a (pseudorandom) sample? R also has a function for this.
+
+```r
+gapminder[sample(nrow(gapminder), 5), ]
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
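+
+Because `sample()` draws pseudorandom numbers, a different set of rows comes
+back on every call. If you want the same rows again later, fix the seed first
+with `set.seed()`. A sketch on a made-up data frame (the seed value 42 is
+arbitrary):
+
+``` r
+df_demo <- data.frame(id = 1:100, value = (1:100)^2)
+
+set.seed(42)
+first_draw <- df_demo[sample(nrow(df_demo), 5), ]
+
+set.seed(42)
+second_draw <- df_demo[sample(nrow(df_demo), 5), ]
+
+identical(first_draw, second_draw)   # TRUE: same seed, same rows
+```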
+
+To make sure our analysis is reproducible, we should put the code
+into a script file so we can come back to it later.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Go to File -> New File -> R Script, and write an R script
+to load in the gapminder dataset. Put it in the `scripts/`
+directory and add it to version control.
+
+Run the script using the `source` function, using the file path
+as its argument (or by pressing the "source" button in RStudio).
+
+::::::::::::::: solution
+
+## Solution to Challenge 3
+
+The `source` function can be used to use a script within a script.
+Assume you would like to load the same type of file over and over
+again and therefore you need to specify the arguments to fit the
+needs of your file. Instead of writing the necessary argument again
+and again you could just write it once and save it as a script. Then,
+you can use `source("Your_Script_containing_the_load_function")` in a new
+script to use the function of that script without writing everything again.
+Check out `?source` to find out more.
+
+
+``` r
+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv(file = "data/gapminder_data.csv")
+```
+
+To run the script and load the data into the `gapminder` variable:
+
+
+``` r
+source(file = "scripts/load-gapminder.R")
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
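+
+If you want to try `source()` without touching your project, the same idea
+works with a throwaway script. This sketch writes a one-line script to a
+temporary file (the variable name `loaded_value` is made up) and runs it:
+
+``` r
+script_file <- tempfile(fileext = ".R")
+writeLines("loaded_value <- 1:3", script_file)
+
+source(script_file)    # runs the script; its variables appear in our session
+loaded_value
+```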
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 4
+
+Read the output of `str(gapminder)` again;
+this time, use what you've learned about lists and vectors,
+as well as the output of functions like `colnames` and `dim`
+to explain what everything that `str` prints out for gapminder means.
+If there are any parts you can't interpret, discuss with your neighbors!
+
+::::::::::::::: solution
+
+## Solution to Challenge 4
+
+The object `gapminder` is a data frame with 1704 observations (rows) of 6 variables (columns):
+
+- `country` and `continent` are character vectors.
+- `year` is an integer vector.
+- `pop`, `lifeExp`, and `gdpPercap` are numeric vectors.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use `cbind()` to add a new column to a data frame.
+- Use `rbind()` to add a new row to a data frame.
+- Remove rows from a data frame.
+- Use `str()`, `summary()`, `nrow()`, `ncol()`, `dim()`, `colnames()`, `head()`, and `typeof()` to understand the structure of a data frame.
+- Read in a csv file using `read.csv()`.
+- Understand what `length()` of a data frame represents.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/06-data-subsetting.md b/06-data-subsetting.md
new file mode 100644
index 000000000..37ce85487
--- /dev/null
+++ b/06-data-subsetting.md
@@ -0,0 +1,1285 @@
+---
+title: Subsetting Data
+teaching: 35
+exercises: 15
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- To be able to subset vectors, factors, matrices, lists, and data frames
+- To be able to extract individual and multiple elements: by index, by name, using comparison operations
+- To be able to skip and remove elements from various data structures.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I work with subsets of data in R?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+R has many powerful subset operators. Mastering them will allow you to
+easily perform complex operations on any kind of dataset.
+
+There are six different ways we can subset any kind of object, and three
+different subsetting operators for the different data structures.
+
+Let's start with the workhorse of R: a simple numeric vector.
+
+
+``` r
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+x
+```
+
+``` output
+ a b c d e
+5.4 6.2 7.1 4.8 7.5
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Atomic vectors
+
+In R, simple vectors containing character strings, numbers, or logical values are called *atomic* vectors because they can't be further simplified.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+So now that we've created a dummy vector to play with, how do we get at its
+contents?
+
+## Accessing elements using their indices
+
+To extract elements of a vector we can give their corresponding index, starting
+from one:
+
+
+``` r
+x[1]
+```
+
+``` output
+ a
+5.4
+```
+
+
+``` r
+x[4]
+```
+
+``` output
+ d
+4.8
+```
+
+It may look different, but the square brackets operator is a function. For vectors
+(and matrices), it means "get me the nth element".
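+
+As a minimal sketch of that claim, the bracket operator can be called like
+any other function by wrapping its name in backticks. The vector is recreated
+here only so that the snippet runs on its own:
+
+``` r
+x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)
+# Calling `[` by name does the same thing as writing x[4]
+`[`(x, 4)
+identical(`[`(x, 4), x[4])  # TRUE
+```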
+
+We can ask for multiple elements at once:
+
+
+``` r
+x[c(1, 3)]
+```
+
+``` output
+ a c
+5.4 7.1
+```
+
+Or slices of the vector:
+
+
+``` r
+x[1:4]
+```
+
+``` output
+ a b c d
+5.4 6.2 7.1 4.8
+```
+
+The `:` operator creates a sequence of numbers from the left element to the right.
+
+
+``` r
+1:4
+```
+
+``` output
+[1] 1 2 3 4
+```
+
+``` r
+c(1, 2, 3, 4)
+```
+
+``` output
+[1] 1 2 3 4
+```
+
+We can ask for the same element multiple times:
+
+
+``` r
+x[c(1,1,3)]
+```
+
+``` output
+ a a c
+5.4 5.4 7.1
+```
+
+If we ask for an index beyond the length of the vector, R will return a missing value:
+
+
+``` r
+x[6]
+```
+
+``` output
+
+ NA
+```
+
+This is a vector of length one containing an `NA`, whose name is also `NA`.
+
+If we ask for the 0th element, we get an empty vector:
+
+
+``` r
+x[0]
+```
+
+``` output
+named numeric(0)
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Vector numbering in R starts at 1
+
+In many programming languages (C and Python, for example), the first
+element of a vector has an index of 0. In R, the first element is 1.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Skipping and removing elements
+
+If we use a negative number as the index of a vector, R will return
+every element *except* for the one specified:
+
+
+``` r
+x[-2]
+```
+
+``` output
+ a c d e
+5.4 7.1 4.8 7.5
+```
+
+We can skip multiple elements:
+
+
+``` r
+x[c(-1, -5)] # or x[-c(1,5)]
+```
+
+``` output
+ b c d
+6.2 7.1 4.8
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Order of operations
+
+A common trip up for novices occurs when trying to skip
+slices of a vector. It's natural to try to negate a
+sequence like so:
+
+
+``` r
+x[-1:3]
+```
+
+This gives a somewhat cryptic error:
+
+
+``` error
+Error in x[-1:3]: only 0's may be mixed with negative subscripts
+```
+
+But remember the order of operations. `:` is really a function.
+It takes its first argument as -1, and its second as 3,
+so it generates the sequence of numbers `c(-1, 0, 1, 2, 3)`.
+
+The correct solution is to wrap that function call in parentheses, so
+that the `-` operator applies to the result:
+
+
+``` r
+x[-(1:3)]
+```
+
+``` output
+ d e
+4.8 7.5
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+To remove elements from a vector, we need to assign the result back
+into the variable:
+
+
+``` r
+x <- x[-4]
+x
+```
+
+``` output
+ a b c e
+5.4 6.2 7.1 7.5
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Given the following code:
+
+
+``` r
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)
+```
+
+``` output
+ a b c d e
+5.4 6.2 7.1 4.8 7.5
+```
+
+Come up with at least 2 different commands that will produce the following output:
+
+
+``` output
+ b c d
+6.2 7.1 4.8
+```
+
+After you find 2 different commands, compare notes with your neighbour. Did you have different strategies?
+
+::::::::::::::: solution
+
+## Solution to challenge 1
+
+
+``` r
+x[2:4]
+```
+
+``` output
+ b c d
+6.2 7.1 4.8
+```
+
+
+``` r
+x[-c(1,5)]
+```
+
+``` output
+ b c d
+6.2 7.1 4.8
+```
+
+
+``` r
+x[c(2,3,4)]
+```
+
+``` output
+ b c d
+6.2 7.1 4.8
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Subsetting by name
+
+We can extract elements by using their name, instead of extracting by index:
+
+
+``` r
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
+x[c("a", "c")]
+```
+
+``` output
+ a c
+5.4 7.1
+```
+
+This is usually a much more reliable way to subset objects: the
+position of various elements can often change when chaining together
+subsetting operations, but the names will always remain the same!
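+
+A small sketch of why names are more robust than positions. The intermediate
+vector `y` is a hypothetical name used only for this illustration:
+
+``` r
+x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)
+y <- x[-1]   # drop the first element; every later element shifts left by one
+y[3]         # position 3 now holds the element named "d" (4.8)
+y["c"]       # the name "c" still refers to the same value (7.1)
+```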
+
+## Subsetting through other logical operations {#logical-operations}
+
+We can also use any logical vector to subset:
+
+
+``` r
+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
+```
+
+``` output
+ c e
+7.1 7.5
+```
+
+Since comparison operators (e.g. `>`, `<`, `==`) evaluate to logical vectors, we can also
+use them to succinctly subset vectors: the following statement gives
+the same result as the previous one.
+
+
+``` r
+x[x > 7]
+```
+
+``` output
+ c e
+7.1 7.5
+```
+
+Breaking it down, this statement first evaluates `x>7`, generating
+a logical vector `c(FALSE, FALSE, TRUE, FALSE, TRUE)`, and then
+selects the elements of `x` corresponding to the `TRUE` values.
+
+We can use `==` to mimic the previous method of indexing by name
+(remember you have to use `==` rather than `=` for comparisons):
+
+
+``` r
+x[names(x) == "a"]
+```
+
+``` output
+ a
+5.4
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Combining logical conditions
+
+We often want to combine multiple logical
+criteria. For example, we might want to find all the countries that are
+located in Asia **or** Europe **and** have life expectancies within a certain
+range. Several operations for combining logical vectors exist in R:
+
+- `&`, the "logical AND" operator: returns `TRUE` if both the left and right
+ are `TRUE`.
+- `|`, the "logical OR" operator: returns `TRUE`, if either the left or right
+ (or both) are `TRUE`.
+
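+You may sometimes see `&&` and `||` instead of `&` and `|`. These two-character operators
+are not vectorized: they expect a single `TRUE` or `FALSE` on each side.
+Older versions of R silently looked only at the first element of each vector and ignored the
+remaining elements; since R 4.3.0, giving them a vector longer than one element is an error.
+In general you should not use the two-character
+operators in data analysis; save them
+for programming, i.e. deciding whether to execute a statement.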
+
+- `!`, the "logical NOT" operator: converts `TRUE` to `FALSE` and `FALSE` to
+  `TRUE`. It can negate a single logical condition (e.g. `!TRUE` becomes
+  `FALSE`), or a whole vector of conditions (e.g. `!c(TRUE, FALSE)` becomes
+  `c(FALSE, TRUE)`).
+
+Additionally, you can compare the elements within a single vector using the
+`all` function (which returns `TRUE` if every element of the vector is `TRUE`)
+and the `any` function (which returns `TRUE` if one or more elements of the
+vector are `TRUE`).
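+
+The operators and functions above can be sketched on a small vector. The
+vector `v` and its names are hypothetical, chosen only for this illustration:
+
+``` r
+v <- c(p = 2, q = 9, r = 5, s = 12)
+v[v > 3 & v < 10]   # AND: keeps q and r
+v[v < 3 | v > 10]   # OR: keeps p and s
+v[!(v > 5)]         # NOT: keeps p and r
+any(v > 10)         # TRUE, because s is greater than 10
+all(v > 5)          # FALSE, because p and r are not
+```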
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+Given the following code:
+
+
+``` r
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)
+```
+
+``` output
+ a b c d e
+5.4 6.2 7.1 4.8 7.5
+```
+
+Write a subsetting command to return the values in x that are greater than 4 and less than 7.
+
+::::::::::::::: solution
+
+## Solution to challenge 2
+
+
+``` r
+x_subset <- x[x<7 & x>4]
+print(x_subset)
+```
+
+``` output
+ a b d
+5.4 6.2 4.8
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Non-unique names
+
+You should be aware that it is possible for multiple elements in a
+vector to have the same name. (For a data frame, columns can have
+the same name --- although R tries to avoid this --- but row names
+must be unique.) Consider these examples:
+
+
+``` r
+x <- 1:3
+x
+```
+
+``` output
+[1] 1 2 3
+```
+
+``` r
+names(x) <- c('a', 'a', 'a')
+x
+```
+
+``` output
+a a a
+1 2 3
+```
+
+``` r
+x['a'] # only returns first value
+```
+
+``` output
+a
+1
+```
+
+``` r
+x[names(x) == 'a'] # returns all three values
+```
+
+``` output
+a a a
+1 2 3
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Getting help for operators
+
+Remember you can search for help on operators by wrapping them in quotes:
+`help("%in%")` or `?"%in%"`.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Skipping named elements
+
+Skipping or removing named elements is a little harder. If we try to skip one named element by negating the string, R complains (slightly obscurely) that it doesn't know how to take the negative of a string:
+
+
+``` r
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
+x[-"a"]
+```
+
+``` error
+Error in -"a": invalid argument to unary operator
+```
+
+However, we can use the `!=` (not-equals) operator to construct a logical vector that will do what we want:
+
+
+``` r
+x[names(x) != "a"]
+```
+
+``` output
+ b c d e
+6.2 7.1 4.8 7.5
+```
+
+Skipping multiple named indices is a little bit harder still. Suppose we want to drop the `"a"` and `"c"` elements, so we try this:
+
+
+``` r
+x[names(x)!=c("a","c")]
+```
+
+``` warning
+Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+```
+
+``` output
+ b c d e
+6.2 7.1 4.8 7.5
+```
+
+R did *something*, but it gave us a warning that we ought to pay attention to - and it apparently *gave us the wrong answer* (the `"c"` element is still included in the vector)!
+
+So what does `!=` actually do in this case? That's an excellent question.
+
+### Recycling
+
+Let's take a look at the comparison component of this code:
+
+
+``` r
+names(x) != c("a", "c")
+```
+
+``` warning
+Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+```
+
+``` output
+[1] FALSE TRUE TRUE TRUE TRUE
+```
+
+Why does R give `TRUE` as the third element of this vector, when `names(x)[3] != "c"` is obviously false?
+When you use `!=`, R tries to compare each element
+of the left argument with the corresponding element of its right
+argument. What happens when you compare vectors of different lengths?
+
+![](fig/06-rmd-inequality.1.png){alt='Inequality testing'}
+
+When one vector is shorter than the other, it gets *recycled*:
+
+![](fig/06-rmd-inequality.2.png){alt='Inequality testing: results of recycling'}
+
+In this case R **repeats** `c("a", "c")` as many times as necessary to match `names(x)`, i.e. we get `c("a","c","a","c","a")`. Since the recycled `"a"`
+doesn't match the third element of `names(x)`, the value of `!=` is `TRUE`.
+Because in this case the longer vector length (5) isn't a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and `names(x)` had contained six elements, R would *silently* have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce subtle, hard-to-find bugs!
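+
+To make the silent case concrete, here is a small sketch with a hypothetical
+six-element vector `x6` (not part of the lesson data). Six is a multiple of
+two, so the comparison is recycled without any warning:
+
+``` r
+x6 <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)
+# c("a", "c") is recycled to c("a", "c", "a", "c", "a", "c")
+keep <- names(x6) != c("a", "c")
+keep        # FALSE TRUE TRUE TRUE TRUE TRUE
+x6[keep]    # "c" is still present, and R gave no warning
+```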
+
+The way to get R to do what we really want (match *each* element of the left argument with *all* of the elements of the right argument) is to use the `%in%` operator. The `%in%` operator goes through each element of its left argument, in this case the names of `x`, and asks, "Does this element occur in the second argument?". Here, since we want to *exclude* values, we also need a `!` operator to change "in" to "not in":
+
+
+``` r
+x[! names(x) %in% c("a","c") ]
+```
+
+``` output
+ b d e
+6.2 4.8 7.5
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Selecting elements of a vector that match any of a list of components
+is a very common data analysis task. For example, the gapminder data set
+contains `country` and `continent` variables, but no information between
+these two scales. Suppose we want to pull out information from southeast
+Asia: how do we set up an operation to produce a logical vector that
+is `TRUE` for all of the countries in southeast Asia and `FALSE` otherwise?
+
+Suppose you have these data:
+
+
+``` r
+seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))
+```
+
+There's a wrong way (using only `==`), which will give you a warning;
+a clunky way (using the logical operators `==` and `|`); and
+an elegant way (using `%in%`). See whether you can come up with all three
+and explain how they (don't) work.
+
+::::::::::::::: solution
+
+## Solution to challenge 3
+
+- The **wrong** way to do this problem is `countries==seAsia`. This
+ gives a warning (`"In countries == seAsia : longer object length is not a multiple of shorter object length"`) and the wrong answer (a vector of all
+ `FALSE` values), because none of the recycled values of `seAsia` happen
+  to line up correctly with matching values in `countries`.
+- The **clunky** (but technically correct) way to do this problem is
+
+
+``` r
+ (countries=="Myanmar" | countries=="Thailand" |
+ countries=="Cambodia" | countries == "Vietnam" | countries=="Laos")
+```
+
+(or `countries==seAsia[1] | countries==seAsia[2] | ...`). This
+gives the correct values, but hopefully you can see how awkward it
+is (what if we wanted to select countries from a much longer list?).
+
+- The best way to do this problem is `countries %in% seAsia`, which
+ is both correct and easy to type (and read).
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Handling special values
+
+At some point you will encounter functions in R that cannot handle missing, infinite,
+or undefined data.
+
+There are a number of special functions you can use to filter out this data:
+
+- `is.na` will return all positions in a vector, matrix, or data.frame
+ containing `NA` (or `NaN`)
+- likewise, `is.nan`, and `is.infinite` will do the same for `NaN` and `Inf`.
+- `is.finite` will return all positions in a vector, matrix, or data.frame
+ that do not contain `NA`, `NaN` or `Inf`.
+- `na.omit` will filter out all missing values from a vector
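+
+A brief sketch of these functions on a hypothetical vector `v` that contains
+one of each special value:
+
+``` r
+v <- c(1.5, NA, Inf, NaN, -2)
+is.na(v)          # TRUE for both NA and NaN
+is.nan(v)         # TRUE only for NaN
+is.infinite(v)    # TRUE only for Inf
+v[is.finite(v)]   # keeps only 1.5 and -2
+na.omit(v)        # drops NA and NaN, but keeps Inf
+```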
+
+## Factor subsetting
+
+Now that we've explored the different ways to subset vectors, how
+do we subset the other data structures?
+
+Factor subsetting works the same way as vector subsetting.
+
+
+``` r
+f <- factor(c("a", "a", "b", "c", "c", "d"))
+f[f == "a"]
+```
+
+``` output
+[1] a a
+Levels: a b c d
+```
+
+``` r
+f[f %in% c("b", "c")]
+```
+
+``` output
+[1] b c c
+Levels: a b c d
+```
+
+``` r
+f[1:3]
+```
+
+``` output
+[1] a a b
+Levels: a b c d
+```
+
+Skipping elements will not remove the level
+even if no more of that category exists in the factor:
+
+
+``` r
+f[-3]
+```
+
+``` output
+[1] a a c c d
+Levels: a b c d
+```
+
+## Matrix subsetting
+
+Matrices are also subsetted using the `[` function. In this case
+it takes two arguments: the first applying to the rows, the second
+to its columns:
+
+
+``` r
+set.seed(1)
+m <- matrix(rnorm(6*4), ncol=4, nrow=6)
+m[3:4, c(3,1)]
+```
+
+``` output
+ [,1] [,2]
+[1,] 1.12493092 -0.8356286
+[2,] -0.04493361 1.5952808
+```
+
+You can leave the first or second arguments blank to retrieve all the
+rows or columns respectively:
+
+
+``` r
+m[, c(3,4)]
+```
+
+``` output
+ [,1] [,2]
+[1,] -0.62124058 0.82122120
+[2,] -2.21469989 0.59390132
+[3,] 1.12493092 0.91897737
+[4,] -0.04493361 0.78213630
+[5,] -0.01619026 0.07456498
+[6,] 0.94383621 -1.98935170
+```
+
+If we only access one row or column, R will automatically convert the result
+to a vector:
+
+
+``` r
+m[3,]
+```
+
+``` output
+[1] -0.8356286 0.5757814 1.1249309 0.9189774
+```
+
+If you want to keep the output as a matrix, you need to specify a *third* argument;
+`drop = FALSE`:
+
+
+``` r
+m[3, , drop=FALSE]
+```
+
+``` output
+ [,1] [,2] [,3] [,4]
+[1,] -0.8356286 0.5757814 1.124931 0.9189774
+```
+
+Unlike vectors, if we try to access a row or column outside of the matrix,
+R will throw an error:
+
+
+``` r
+m[, c(3,6)]
+```
+
+``` error
+Error in m[, c(3, 6)]: subscript out of bounds
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Higher dimensional arrays
+
+When dealing with multi-dimensional arrays, each argument to `[`
+corresponds to a dimension. For example, for a 3D array, the first three
+arguments correspond to the rows, columns, and depth dimension.
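+
+A minimal sketch with a hypothetical 2 x 3 x 4 array `a`:
+
+``` r
+a <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers
+a[2, 3, 4]      # a single element: 24
+a[1, , 2]       # row 1 of layer 2: 7 9 11
+dim(a[, , 1])   # the whole first layer is a 2 x 3 matrix
+```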
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Because matrices are vectors, we can
+also subset using only one argument:
+
+
+``` r
+m[5]
+```
+
+``` output
+[1] 0.3295078
+```
+
+This usually isn't useful, and is often confusing to read. However, it is useful to note that matrices
+are laid out in *column-major format* by default. That is, the elements of the
+vector are arranged column-wise:
+
+
+``` r
+matrix(1:6, nrow=2, ncol=3)
+```
+
+``` output
+ [,1] [,2] [,3]
+[1,] 1 3 5
+[2,] 2 4 6
+```
+
+If you wish to populate the matrix by row, use `byrow=TRUE`:
+
+
+``` r
+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
+```
+
+``` output
+ [,1] [,2] [,3]
+[1,] 1 2 3
+[2,] 4 5 6
+```
+
+Matrices can also be subsetted using their rownames and column names
+instead of their row and column indices.
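+
+For instance, a sketch with a small matrix whose row and column names
+(`r1`, `r2`, `A`, `B`, `C`) are made up for this illustration:
+
+``` r
+named_m <- matrix(1:6, nrow = 2, ncol = 3,
+                  dimnames = list(c("r1", "r2"), c("A", "B", "C")))
+named_m["r2", c("A", "C")]   # values 2 and 6, labelled A and C
+```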
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 4
+
+Given the following code:
+
+
+``` r
+m <- matrix(1:18, nrow=3, ncol=6)
+print(m)
+```
+
+``` output
+ [,1] [,2] [,3] [,4] [,5] [,6]
+[1,] 1 4 7 10 13 16
+[2,] 2 5 8 11 14 17
+[3,] 3 6 9 12 15 18
+```
+
+1. Which of the following commands will extract the values 11 and 14?
+
+A. `m[2,4,2,5]`
+
+B. `m[2:5]`
+
+C. `m[4:5,2]`
+
+D. `m[2,c(4,5)]`
+
+::::::::::::::: solution
+
+## Solution to challenge 4
+
+D
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## List subsetting
+
+Now we'll introduce some new subsetting operators. There are three functions
+used to subset lists. We've already seen these when learning about atomic vectors and matrices: `[`, `[[`, and `$`.
+
+Using `[` will always return a list. If you want to *subset* a list, but not
+*extract* an element, then you will likely use `[`.
+
+
+``` r
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]
+```
+
+``` output
+$a
+[1] "Software Carpentry"
+```
+
+This returns a *list with one element*.
+
+We can subset elements of a list exactly the same way as atomic
+vectors using `[`. Comparison operations, however, won't work because
+they're not recursive: they will try to condition on the data structures
+in each element of the list, not the individual elements within those
+data structures.
+
+
+``` r
+xlist[1:2]
+```
+
+``` output
+$a
+[1] "Software Carpentry"
+
+$b
+ [1] 1 2 3 4 5 6 7 8 9 10
+```
+
+To extract individual elements of a list, you need to use the double-square
+bracket function: `[[`.
+
+
+``` r
+xlist[[1]]
+```
+
+``` output
+[1] "Software Carpentry"
+```
+
+Notice that now the result is a vector, not a list.
+
+You can't extract more than one element at once:
+
+
+``` r
+xlist[[1:2]]
+```
+
+``` error
+Error in xlist[[1:2]]: subscript out of bounds
+```
+
+Nor use it to skip elements:
+
+
+``` r
+xlist[[-1]]
+```
+
+``` error
+Error in xlist[[-1]]: invalid negative subscript in get1index
+```
+
+But you can use names to both subset and extract elements:
+
+
+``` r
+xlist[["a"]]
+```
+
+``` output
+[1] "Software Carpentry"
+```
+
+The `$` function is a shorthand way for extracting elements by name:
+
+
+``` r
+xlist$data
+```
+
+``` output
+ mpg cyl disp hp drat wt qsec vs am gear carb
+Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
+Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
+Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
+Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
+Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
+Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 5
+
+Given the following list:
+
+
+``` r
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+```
+
+Using your knowledge of both list and vector subsetting, extract the number 2 from xlist.
+Hint: the number 2 is contained within the "b" item in the list.
+
+::::::::::::::: solution
+
+## Solution to challenge 5
+
+
+``` r
+xlist$b[2]
+```
+
+``` output
+[1] 2
+```
+
+
+``` r
+xlist[[2]][2]
+```
+
+``` output
+[1] 2
+```
+
+
+``` r
+xlist[["b"]][2]
+```
+
+``` output
+[1] 2
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 6
+
+Given a linear model:
+
+
+``` r
+mod <- aov(pop ~ lifeExp, data=gapminder)
+```
+
+Extract the residual degrees of freedom (hint: `attributes()` will help you)
+
+::::::::::::::: solution
+
+## Solution to challenge 6
+
+
+``` r
+attributes(mod) ## `df.residual` is one of the names of `mod`
+```
+
+
+``` r
+mod$df.residual
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Data frames
+
+Remember that data frames are lists under the hood, so similar rules
+apply. However, they are also two-dimensional objects:
+
+`[` with one argument will act the same way as for lists, where each list
+element corresponds to a column. The resulting object will be a data frame:
+
+
+``` r
+head(gapminder[3])
+```
+
+``` output
+ pop
+1 8425333
+2 9240934
+3 10267083
+4 11537966
+5 13079460
+6 14880372
+```
+
+Similarly, `[[` will act to extract *a single column*:
+
+
+``` r
+head(gapminder[["lifeExp"]])
+```
+
+``` output
+[1] 28.801 30.332 31.997 34.020 36.088 38.438
+```
+
+And `$` provides a convenient shorthand to extract columns by name:
+
+
+``` r
+head(gapminder$year)
+```
+
+``` output
+[1] 1952 1957 1962 1967 1972 1977
+```
+
+With two arguments, `[` behaves the same way as for matrices:
+
+
+``` r
+gapminder[1:3,]
+```
+
+``` output
+ country year pop continent lifeExp gdpPercap
+1 Afghanistan 1952 8425333 Asia 28.801 779.4453
+2 Afghanistan 1957 9240934 Asia 30.332 820.8530
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+```
+
+If we subset a single row, the result will be a data frame (because
+the elements are mixed types):
+
+
+``` r
+gapminder[3,]
+```
+
+``` output
+ country year pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+```
+
+But for a single column the result will be a vector (this can
+be changed with the third argument, `drop = FALSE`).
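+
+A short sketch of the difference, using a small made-up data frame `df`
+rather than `gapminder` so that it runs on its own:
+
+``` r
+df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
+class(df[, "score"])                 # "numeric": simplified to a vector
+class(df[, "score", drop = FALSE])   # "data.frame": dimensions are kept
+class(df["score"])                   # "data.frame": list-style subsetting
+```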
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 7
+
+Fix each of the following common data frame subsetting errors:
+
+1. Extract observations collected for the year 1957
+
+
+ ``` r
+ gapminder[gapminder$year = 1957,]
+ ```
+
+2. Extract all columns except 1 through to 4
+
+
+ ``` r
+ gapminder[,-1:4]
+ ```
+
+3. Extract the rows where the life expectancy is longer than 80 years
+
+
+ ``` r
+ gapminder[gapminder$lifeExp > 80]
+ ```
+
+4. Extract the first row, and the fourth and fifth columns
+ (`continent` and `lifeExp`).
+
+
+ ``` r
+ gapminder[1, 4, 5]
+ ```
+
+5. Advanced: extract rows that contain information for the years 2002
+ and 2007
+
+
+ ``` r
+ gapminder[gapminder$year == 2002 | 2007,]
+ ```
+
+::::::::::::::: solution
+
+## Solution to challenge 7
+
+Fix each of the following common data frame subsetting errors:
+
+1. Extract observations collected for the year 1957
+
+
+ ``` r
+ # gapminder[gapminder$year = 1957,]
+ gapminder[gapminder$year == 1957,]
+ ```
+
+2. Extract all columns except 1 through to 4
+
+
+ ``` r
+ # gapminder[,-1:4]
+ gapminder[,-c(1:4)]
+ ```
+
+3. Extract the rows where the life expectancy is longer than 80 years
+
+
+ ``` r
+ # gapminder[gapminder$lifeExp > 80]
+ gapminder[gapminder$lifeExp > 80,]
+ ```
+
+4. Extract the first row, and the fourth and fifth columns
+ (`continent` and `lifeExp`).
+
+
+ ``` r
+ # gapminder[1, 4, 5]
+ gapminder[1, c(4, 5)]
+ ```
+
+5. Advanced: extract rows that contain information for the years 2002
+ and 2007
+
+
+ ``` r
+ # gapminder[gapminder$year == 2002 | 2007,]
+ gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
+ gapminder[gapminder$year %in% c(2002, 2007),]
+ ```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 8
+
+1. Why does `gapminder[1:20]` return an error? How does it differ from `gapminder[1:20, ]`?
+
+2. Create a new `data.frame` called `gapminder_small` that only contains rows 1 through 9
+ and 19 through 23. You can do this in one or two steps.
+
+::::::::::::::: solution
+
+## Solution to challenge 8
+
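+1. `gapminder[1:20]` uses a single index, so R treats the data.frame as a list and tries to select columns 1 through 20; `gapminder` has only six columns, hence the error. `gapminder[1:20, ]` subsets on two dimensions, giving the first 20 rows and all columns.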
+
+2.
+
+``` r
+gapminder_small <- gapminder[c(1:9, 19:23),]
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Indexing in R starts at 1, not 0.
+- Access individual values by location using `[]`.
+- Access slices of data using `[low:high]`.
+- Access arbitrary sets of data using `[c(...)]`.
+- Use logical operations and logical vectors to access subsets of data.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/07-control-flow.md b/07-control-flow.md
new file mode 100644
index 000000000..fd43a75ef
--- /dev/null
+++ b/07-control-flow.md
@@ -0,0 +1,675 @@
+---
+title: Control Flow
+teaching: 45
+exercises: 20
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Write conditional statements with `if...else` statements and `ifelse()`.
+- Write and understand `for()` loops.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I make data-dependent choices in R?
+- How can I repeat operations in R?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+Often when we're coding we want to control the flow of our actions. This can be done
+by setting actions to occur only if a condition or a set of conditions are met.
+Alternatively, we can also set an action to occur a particular number of times.
+
+There are several ways you can control flow in R.
+For conditional statements, the most commonly used approaches are the constructs:
+
+
+``` r
+# if
+if (condition is true) {
+ perform action
+}
+
+# if ... else
+if (condition is true) {
+ perform action
+} else { # that is, if the condition is false,
+ perform alternative action
+}
+```
+
+Say, for example, that we want R to print a message if a variable `x` has a particular value:
+
+
+``` r
+x <- 8
+
+if (x >= 10) {
+ print("x is greater than or equal to 10")
+}
+
+x
+```
+
+``` output
+[1] 8
+```
+
+The print statement does not appear in the console because `x` is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an `else` statement.
+
+
+``` r
+x <- 8
+
+if (x >= 10) {
+ print("x is greater than or equal to 10")
+} else {
+ print("x is less than 10")
+}
+```
+
+``` output
+[1] "x is less than 10"
+```
+
+You can also test multiple conditions by using `else if`.
+
+
+``` r
+x <- 8
+
+if (x >= 10) {
+ print("x is greater than or equal to 10")
+} else if (x > 5) {
+ print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than or equal to 5")
+}
+```
+
+``` output
+[1] "x is greater than 5, but less than 10"
+```
+
+**Important:** when R evaluates the condition inside `if()` statements, it is
+looking for a logical element, i.e., `TRUE` or `FALSE`. This can cause some
+headaches for beginners. For example:
+
+
+``` r
+x <- 4 == 3
+if (x) {
+ "4 equals 3"
+} else {
+ "4 does not equal 3"
+}
+```
+
+``` output
+[1] "4 does not equal 3"
+```
+
+As we can see, the "not equal" message was printed because the vector `x` is `FALSE`:
+
+
+``` r
+x <- 4 == 3
+x
+```
+
+``` output
+[1] FALSE
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Use an `if()` statement to print a suitable message
+reporting whether there are any records from 2002 in
+the `gapminder` dataset.
+Now do the same for 2012.
+
+::::::::::::::: solution
+
+## Solution to Challenge 1
+
+We will first see a solution to Challenge 1 which does not use the `any()` function.
+We first use the logical vector `gapminder$year == 2002` to select the rows of `gapminder` recorded in `2002`:
+
+
+``` r
+gapminder[(gapminder$year == 2002),]
+```
+
+Then, we count the number of rows of the data.frame `gapminder` that correspond to the year 2002:
+
+
+``` r
+rows2002_number <- nrow(gapminder[(gapminder$year == 2002),])
+```
+
+Asking whether any record exists for the year 2002 is equivalent to asking whether `rows2002_number` is one or more:
+
+
+``` r
+rows2002_number >= 1
+```
+
+Putting it all together, we obtain:
+
+
+``` r
+if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){
+ print("Record(s) for the year 2002 found.")
+}
+```
+
+All this can be done more quickly with `any()`. The logical condition can be expressed as:
+
+
+``` r
+if(any(gapminder$year == 2002)){
+ print("Record(s) for the year 2002 found.")
+}
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Did anyone get an error message like this?
+
+
+``` error
+Error in if (gapminder$year == 2012) {: the condition has length > 1
+```
+
+The `if()` function only accepts singular (of length 1) inputs, and therefore
+returns an error when you use it with a vector. (Versions of R before 4.2.0
+instead issued a warning and evaluated only the first element of the vector.)
+Therefore, to use the `if()` function, you need to make sure your input is
+singular (of length 1).
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Built in `ifelse()` function
+
+`R` accepts both `if()` and `else if()` statements structured as outlined above,
+but also statements using `R`'s built-in `ifelse()` function. This
+function accepts both singular and vector inputs and is structured as
+follows:
+
+
+``` r
+# ifelse function
+ifelse(condition is true, perform action, perform alternative action)
+```
+
+where the first argument is the condition or a set of conditions to be met, the
+second argument is the value returned where the condition is `TRUE`,
+and the third argument is the value returned where the condition
+is `FALSE`.
+
+
+``` r
+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")
+```
+
+``` output
+[1] "y is a negative number"
+```
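+
+Because `ifelse()` is vectorized, it can also test every element of a vector in one call. A minimal sketch, where `temperatures` is a made-up vector used only for illustration:
+
+``` r
+# hypothetical example data, not part of the gapminder dataset
+temperatures <- c(-3, 0, 12)
+# returns one label per element of `temperatures`
+labels <- ifelse(temperatures < 0, "below zero", "zero or above")
+labels
+```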
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: `any()` and `all()`
+
+The `any()` function will return `TRUE` if at least one
+`TRUE` value is found within a vector, otherwise it will return `FALSE`.
+This can be used in a similar way to the `%in%` operator.
+The function `all()`, as the name suggests, will only return `TRUE` if all values in
+the vector are `TRUE`.
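+
+As a small sketch of both functions (the vector `years` is invented for illustration):
+
+``` r
+years <- c(1997, 2002, 2007)   # hypothetical values
+any(years == 2002)             # TRUE: at least one element matches
+all(years == 2002)             # FALSE: not every element matches
+all(years > 1990)              # TRUE: every element satisfies the condition
+```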
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Repeating operations
+
+If you want to iterate over
+a set of values, when the order of iteration is important, and perform the
+same operation on each, a `for()` loop will do the job.
+We saw `for()` loops in the [shell lessons earlier](https://swcarpentry.github.io/shell-novice/05-loop.html). This is the most
+flexible of looping operations, but therefore also the hardest to use
+correctly. In general, the advice of many `R` users would be to learn about
+`for()` loops, but to avoid using `for()` loops unless the order of iteration is
+important: i.e. the calculation at each iteration depends on the results of
+previous iterations. If the order of iteration is not important, then you
+should learn about vectorized operations and functional alternatives, such as
+the `apply()` family or the `purrr` package, as they often pay off in
+conciseness and computational efficiency.
+
+The basic structure of a `for()` loop is:
+
+
+``` r
+for (iterator in set of values) {
+ do a thing
+}
+```
+
+For example:
+
+
+``` r
+for (i in 1:10) {
+ print(i)
+}
+```
+
+``` output
+[1] 1
+[1] 2
+[1] 3
+[1] 4
+[1] 5
+[1] 6
+[1] 7
+[1] 8
+[1] 9
+[1] 10
+```
+
+The `1:10` bit creates a vector on the fly; you can iterate
+over any other vector as well.
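+
+For instance, a loop can walk over a character vector directly, or over its positions with `seq_along()`. A short sketch, using an invented vector `continents`:
+
+``` r
+continents <- c("Africa", "Asia", "Europe")  # hypothetical values
+
+# iterate over the elements themselves
+for (continent in continents) {
+  print(continent)
+}
+
+# iterate over the positions, which is handy when the index is also needed
+for (k in seq_along(continents)) {
+  print(paste(k, continents[k]))
+}
+```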
+
+We can use a `for()` loop nested within another `for()` loop to iterate over two things at
+once.
+
+
+``` r
+for (i in 1:5) {
+ for (j in c('a', 'b', 'c', 'd', 'e')) {
+ print(paste(i,j))
+ }
+}
+```
+
+``` output
+[1] "1 a"
+[1] "1 b"
+[1] "1 c"
+[1] "1 d"
+[1] "1 e"
+[1] "2 a"
+[1] "2 b"
+[1] "2 c"
+[1] "2 d"
+[1] "2 e"
+[1] "3 a"
+[1] "3 b"
+[1] "3 c"
+[1] "3 d"
+[1] "3 e"
+[1] "4 a"
+[1] "4 b"
+[1] "4 c"
+[1] "4 d"
+[1] "4 e"
+[1] "5 a"
+[1] "5 b"
+[1] "5 c"
+[1] "5 d"
+[1] "5 e"
+```
+
+We notice in the output that when the first index (`i`) is set to 1, the second
+index (`j`) iterates through its full set of indices. Once the indices of `j`
+have been iterated through, then `i` is incremented. This process continues
+until the last index has been used for each `for()` loop.
+
+Rather than printing the results, we could write the loop output to a new object.
+
+
+``` r
+output_vector <- c()
+for (i in 1:5) {
+ for (j in c('a', 'b', 'c', 'd', 'e')) {
+ temp_output <- paste(i, j)
+ output_vector <- c(output_vector, temp_output)
+ }
+}
+output_vector
+```
+
+``` output
+ [1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a" "3 b"
+[13] "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a" "5 b" "5 c" "5 d"
+[25] "5 e"
+```
+
+This approach can be useful, but 'growing your results' (building
+the result object incrementally) is computationally inefficient, so avoid
+it when you are iterating through a lot of values.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: don't grow your results
+
+One of the biggest things that trips up novices and
+experienced R users alike is building a results object
+(vector, list, matrix, data frame) as your for loop progresses.
+Computers are very bad at handling this, so your calculations
+can very quickly slow to a crawl. It's much better to define
+an empty results object of appropriate dimensions beforehand, rather
+than initializing an empty object without dimensions.
+So if you know the end result will be stored in a matrix like the one below,
+create an empty matrix with 5 rows and 5 columns, then at each iteration
+store the results in the appropriate location.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+A better way is to define your (empty) output object before filling in the values.
+For this example, it looks more involved, but is still more efficient.
+
+
+``` r
+output_matrix <- matrix(nrow = 5, ncol = 5)
+j_vector <- c('a', 'b', 'c', 'd', 'e')
+for (i in 1:5) {
+ for (j in 1:5) {
+ temp_j_value <- j_vector[j]
+ temp_output <- paste(i, temp_j_value)
+ output_matrix[i, j] <- temp_output
+ }
+}
+output_vector2 <- as.vector(output_matrix)
+output_vector2
+```
+
+``` output
+ [1] "1 a" "2 a" "3 a" "4 a" "5 a" "1 b" "2 b" "3 b" "4 b" "5 b" "1 c" "2 c"
+[13] "3 c" "4 c" "5 c" "1 d" "2 d" "3 d" "4 d" "5 d" "1 e" "2 e" "3 e" "4 e"
+[25] "5 e"
+```
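+
+The same idea applies when the result is a plain vector: it can be created at its final length first and then filled by position. A minimal sketch, where `n_values` and `squares` are names chosen only for this illustration:
+
+``` r
+n_values <- 10
+squares <- numeric(n_values)   # pre-allocated vector of ten zeros
+for (i in seq_len(n_values)) {
+  squares[i] <- i^2            # fill by position instead of growing with c()
+}
+squares
+```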
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: While loops
+
+Sometimes you will find yourself needing to repeat an operation as long as a certain
+condition is met. You can do this with a `while()` loop.
+
+
+``` r
+while(this condition is true){
+ do a thing
+}
+```
+
+R will interpret a condition being met as "TRUE".
+
+As an example, here's a while loop
+that generates random numbers from a uniform distribution (the `runif()` function)
+between 0 and 1 until it gets one that's less than 0.1.
+
+``` r
+z <- 1
+while(z > 0.1){
+ z <- runif(1)
+ cat(z, "\n")
+}
+```
+
+`while()` loops will not always be appropriate. You have to be particularly careful
+that you don't end up stuck in an infinite loop because your condition is always met and hence the while statement never terminates.
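+
+One way to guard against this is to cap the number of iterations. The sketch below is only an illustration; `max_iterations` and `n_iterations` are hypothetical names, not part of the lesson's code:
+
+``` r
+set.seed(42)            # make the random draws repeatable
+max_iterations <- 1000  # hypothetical safety limit
+n_iterations <- 0
+z <- 1
+# stop when z is small enough OR when the safety limit is reached
+while (z > 0.1 && n_iterations < max_iterations) {
+  z <- runif(1)
+  n_iterations <- n_iterations + 1
+}
+n_iterations
+```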
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+Compare the objects `output_vector` and
+`output_vector2`. Are they the same? If not, why not?
+How would you change the last block of code to make `output_vector2`
+the same as `output_vector`?
+
+::::::::::::::: solution
+
+## Solution to Challenge 2
+
+We can check whether the two vectors are identical using the `all()` function:
+
+
+``` r
+all(output_vector == output_vector2)
+```
+
+This returns `FALSE`, so the two vectors are not identical. However, all the elements of `output_vector` can be found in `output_vector2`:
+
+
+``` r
+all(output_vector %in% output_vector2)
+```
+
+and vice versa:
+
+
+``` r
+all(output_vector2 %in% output_vector)
+```
+
+Therefore, the elements in `output_vector` and `output_vector2` are just sorted in a different order.
+This is because `as.vector()` outputs the elements of an input matrix column by column.
+Taking a look at `output_matrix`, we can see that we want its elements row by row.
+The solution is to transpose the `output_matrix`. We can do it either by calling the transpose function
+`t()` or by inputting the elements in the right order.
+The first solution requires changing the original
+
+
+``` r
+output_vector2 <- as.vector(output_matrix)
+```
+
+into
+
+
+``` r
+output_vector2 <- as.vector(t(output_matrix))
+```
+
+The second solution requires changing
+
+
+``` r
+output_matrix[i, j] <- temp_output
+```
+
+into
+
+
+``` r
+output_matrix[j, i] <- temp_output
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Write a script that loops through the `gapminder` data by continent and prints out
+whether the mean life expectancy is smaller or larger than 50
+years.
+
+::::::::::::::: solution
+
+## Solution to Challenge 3
+
+**Step 1**: We want to make sure we can extract all the unique values of the continent vector
+
+
+``` r
+gapminder <- read.csv("data/gapminder_data.csv")
+unique(gapminder$continent)
+```
+
+**Step 2**: We also need to loop over each of these continents and calculate the average life expectancy for each `subset` of data.
+We can do that as follows:
+
+1. Loop over each of the unique values of 'continent'
+2. For each value of continent, create a temporary variable storing that subset
+3. Return the calculated life expectancy to the user by printing the output:
+
+
+``` r
+for (iContinent in unique(gapminder$continent)) {
+ tmp <- gapminder[gapminder$continent == iContinent, ]
+ cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n")
+ rm(tmp)
+}
+```
+
+**Step 3**: The exercise asks us to report whether the average life expectancy is smaller or larger than 50 years.
+So we need to add an `if()` condition before printing, which evaluates whether the calculated average life expectancy is above or below a threshold, and prints an output conditional on the result.
+We need to amend (3) from above:
+
+3a. If the calculated life expectancy is less than some threshold (50 years), return the continent and a statement that life expectancy is less than threshold, otherwise return the continent and a statement that life expectancy is greater than threshold:
+
+
+``` r
+thresholdValue <- 50
+
+for (iContinent in unique(gapminder$continent)) {
+ tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+
+ if (tmp < thresholdValue){
+ cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+ } else {
+ cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+ } # end if else condition
+ rm(tmp)
+} # end for loop
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 4
+
+Modify the script from Challenge 3 to loop over each
+country. This time print out whether the life expectancy is
+smaller than 50, between 50 and 70, or greater than 70.
+
+::::::::::::::: solution
+
+## Solution to Challenge 4
+
+We modify our solution to Challenge 3 by now adding two thresholds, `lowerThreshold` and `upperThreshold`, and extending our if-else statements:
+
+
+``` r
+lowerThreshold <- 50
+upperThreshold <- 70
+
+for (iCountry in unique(gapminder$country)) {
+ tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+ if(tmp < lowerThreshold) {
+ cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
+ } else if(tmp >= lowerThreshold && tmp < upperThreshold) {
+ cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
+ } else {
+ cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
+ }
+ rm(tmp)
+}
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 5 - Advanced
+
+Write a script that loops over each country in the `gapminder` dataset,
+tests whether the country starts with a 'B', and graphs life expectancy
+against time as a line graph if the mean life expectancy is under 50 years.
+
+::::::::::::::: solution
+
+## Solution to Challenge 5
+
+We will use the `grep()` command that was introduced in the [Unix Shell lesson](https://swcarpentry.github.io/shell-novice/07-find.html)
+to find countries that start with "B".
+Let's understand how to do this first.
+Following from the Unix shell section we may be tempted to try the following:
+
+
+``` r
+grep("^B", unique(gapminder$country))
+```
+
+But when we evaluate this command it returns the indices of the elements of `unique(gapminder$country)` that start with "B", not the country names themselves.
+To get the values, we must add the `value=TRUE` option to the `grep()` command:
+
+
+``` r
+grep("^B", unique(gapminder$country), value = TRUE)
+```
+
+We will now store these countries in a variable called `candidateCountries`, and then loop over each entry in the variable.
+Inside the loop, we evaluate the average life expectancy for each country, and if the average life expectancy is less than 50 we use base graphics to plot the evolution of life expectancy over time using `with()` and `subset()`:
+
+
+``` r
+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+
+for (iCountry in candidateCountries) {
+ tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+ if (tmp < thresholdValue) {
+ cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+
+ with(subset(gapminder, country == iCountry),
+ plot(year, lifeExp,
+ type = "o",
+ main = paste("Life Expectancy in", iCountry, "over time"),
+ ylab = "Life Expectancy",
+ xlab = "Year"
+ ) # end plot
+ ) # end with
+ } # end if
+ rm(tmp)
+} # end for loop
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use `if` and `else` to make choices.
+- Use `for` to repeat operations.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/08-plot-ggplot2.md b/08-plot-ggplot2.md
new file mode 100644
index 000000000..b36eb75ff
--- /dev/null
+++ b/08-plot-ggplot2.md
@@ -0,0 +1,555 @@
+---
+title: Creating Publication-Quality Graphics with ggplot2
+teaching: 60
+exercises: 20
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- To be able to use ggplot2 to generate publication-quality graphics.
+- To apply geometry, aesthetic, and statistics layers to a ggplot plot.
+- To manipulate the aesthetics of a plot using different colors, shapes, and lines.
+- To improve data visualization through transforming scales and paneling by group.
+- To save a plot created with ggplot to disk.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I create publication-quality graphics in R?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+Plotting our data is one of the best ways to
+quickly explore it and the various relationships
+between variables.
+
+There are three main plotting systems in R,
+the [base plotting system][base], the [lattice]
+package, and the [ggplot2] package.
+
+Today we'll be learning about the ggplot2 package, because
+it is the most effective for creating publication-quality
+graphics.
+
+ggplot2 is built on the grammar of graphics, the idea that any plot can be
+built from the same set of components: a **data set**,
+**mapping aesthetics**, and graphical **layers**:
+
+- **Data sets** are the data that you, the user, provide.
+
+- **Mapping aesthetics** are what connect the data to the graphics.
+ They tell ggplot2 how to use your data to affect how the graph looks,
+ such as changing what is plotted on the X or Y axis, or the size or
+ color of different data points.
+
+- **Layers** are the actual graphical output from ggplot2. Layers
+ determine what kinds of plot are shown (scatterplot, histogram, etc.),
+ the coordinate system used (rectangular, polar, others), and other
+ important aspects of the plot. The idea of layers of graphics may
+ be familiar to you if you have used image editing programs
+ like Photoshop, Illustrator, or Inkscape.
+
+Let's start off building an example using the gapminder data from earlier.
+The most basic function is `ggplot`, which lets R know that we're
+creating a new plot. Any of the arguments we give the `ggplot`
+function are the *global* options for the plot: they apply to all
+layers on the plot.
+
+
+``` r
+library("ggplot2")
+ggplot(data = gapminder)
+```
+
+
+
+Here we called `ggplot` and told it what data we want to show on
+our figure. This is not enough information for `ggplot` to actually
+draw anything. It only creates a blank slate for other elements
+to be added to.
+
+Now we're going to add in the **mapping aesthetics** using the
+`aes` function. `aes` tells `ggplot` how variables in the **data**
+map to *aesthetic* properties of the figure, such as which columns
+of the data should be used for the **x** and **y** locations.
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
+```
+
+
+
+Here we told `ggplot` we want to plot the "gdpPercap" column of the
+gapminder data frame on the x-axis, and the "lifeExp" column on the
+y-axis. Notice that we didn't need to explicitly pass `aes` these
+columns (e.g. `x = gapminder[, "gdpPercap"]`); this is because
+`ggplot` is smart enough to know to look in the **data** for that column!
+
+The final part of making our plot is to tell `ggplot` how we want to
+visually represent the data. We do this by adding a new **layer**
+to the plot using one of the **geom** functions.
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point()
+```
+
+
+
+Here we used `geom_point`, which tells `ggplot` we want to visually
+represent the relationship between **x** and **y** as a scatterplot of points.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Modify the example so that the figure shows how life expectancy has
+changed over time:
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()
+```
+
+Hint: the gapminder dataset has a column called "year", which should appear
+on the x-axis.
+
+::::::::::::::: solution
+
+## Solution to challenge 1
+
+Here is one possible solution:
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
+```
+
+
+
+
+Figure: binned scatterplot of life expectancy versus year, showing how life expectancy has increased over time.
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+In the previous examples and challenge we've used the `aes` function to tell
+the scatterplot **geom** about the **x** and **y** locations of each point.
+Another *aesthetic* property we can modify is the point *color*. Modify the
+code from the previous challenge to **color** the points by the "continent"
+column. What trends do you see in the data? Are they what you expected?
+
+::::::::::::::: solution
+
+## Solution to challenge 2
+
+The solution presented below adds `color=continent` to the call of the `aes`
+function. The general trend seems to indicate an increased life expectancy
+over the years. On continents with stronger economies we find a longer life
+expectancy.
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) +
+ geom_point()
+```
+
+
+
+
+Figure: binned scatterplot of life expectancy vs year with color-coded continents, showing the value of the `aes` function.
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Layers
+
+Using a scatterplot probably isn't the best for visualizing change over time.
+Instead, let's tell `ggplot` to visualize the data as a line plot:
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) +
+ geom_line()
+```
+
+
+
+Instead of adding a `geom_point` layer, we've added a `geom_line` layer.
+
+However, the result doesn't look quite as we might have expected: it seems to be jumping around a lot in each continent. Let's try to separate the data by country, plotting one line for each country:
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+ geom_line()
+```
+
+
+
+We've added the **group** *aesthetic*, which tells `ggplot` to draw a line for each
+country.
+
+But what if we want to visualize both lines and points on the plot? We can
+add another layer to the plot:
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+ geom_line() + geom_point()
+```
+
+
+
+It's important to note that each layer is drawn on top of the previous layer. In
+this example, the points have been drawn *on top of* the lines. Here's a
+demonstration:
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+ geom_line(mapping = aes(color=continent)) + geom_point()
+```
+
+
+
+In this example, the *aesthetic* mapping of **color** has been moved from the
+global plot options in `ggplot` to the `geom_line` layer so it no longer applies
+to the points. Now we can clearly see that the points are drawn on top of the
+lines.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Setting an aesthetic to a value instead of a mapping
+
+So far, we've seen how to use an aesthetic (such as **color**) as a *mapping* to a variable in the data. For example, when we use `geom_line(mapping = aes(color=continent))`, ggplot will give a different color to each continent. But what if we want to change the color of all lines to blue? You may think that `geom_line(mapping = aes(color="blue"))` should work, but it doesn't. Since we don't want to create a mapping to a specific variable, we can move the color specification outside of the `aes()` function, like this: `geom_line(color="blue")`.
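+
+The difference can be checked on a tiny data frame. The sketch below assumes ggplot2 is installed; `toy` is invented data used only for illustration:
+
+``` r
+library("ggplot2")
+
+# hypothetical data: two countries observed in three years
+toy <- data.frame(
+  year = rep(c(1952, 1957, 1962), times = 2),
+  lifeExp = c(40, 42, 45, 60, 62, 65),
+  country = rep(c("A", "B"), each = 3)
+)
+
+# setting: every line is drawn in blue and no legend is created
+p_set <- ggplot(data = toy, mapping = aes(x = year, y = lifeExp, group = country)) +
+  geom_line(color = "blue")
+
+# mapping: "blue" is treated as a data value, so ggplot2 picks a color
+# from its default palette and adds a legend labelled "blue"
+p_mapped <- ggplot(data = toy, mapping = aes(x = year, y = lifeExp, group = country)) +
+  geom_line(mapping = aes(color = "blue"))
+
+p_set
+p_mapped
+```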
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Switch the order of the point and line layers from the previous example. What
+happened?
+
+::::::::::::::: solution
+
+## Solution to challenge 3
+
+The lines now get drawn over the points!
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+ geom_point() + geom_line(mapping = aes(color=continent))
+```
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Transformations and statistics
+
+ggplot2 also makes it easy to overlay statistical models over the data. To
+demonstrate we'll go back to our first example:
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point()
+```
+
+
+
+Currently it's hard to see the relationship between the points due to some strong
+outliers in GDP per capita. We can change the scale of units on the x axis using
+the *scale* functions. These control the mapping between the data values and
+visual values of an aesthetic. We can also modify the transparency of the
+points, using the *alpha* argument, which is especially helpful when you have
+a large amount of data which is very clustered.
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(alpha = 0.5) + scale_x_log10()
+```
+
+
+
+
+Figure: scatterplot of GDP vs life expectancy, showing the data spread along a logarithmic x-axis.
+
+
+The `scale_x_log10` function applied a log transformation to the x-axis scale of the plot, so that each factor of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis.
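+
+The even spacing follows from the logarithm itself, as a quick check in base R suggests (the values are illustrative):
+
+``` r
+gdp_values <- c(1000, 10000, 100000)  # illustrative values
+log10(gdp_values)        # 3 4 5
+diff(log10(gdp_values))  # 1 1: equal steps on the log scale
+```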
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip Reminder: Setting an aesthetic to a value instead of a mapping
+
+Notice that we used `geom_point(alpha = 0.5)`. As the previous tip mentioned, using a setting outside of the `aes()` function will cause this value to be used for all points, which is what we want in this case. But just like any other aesthetic setting, *alpha* can also be mapped to a variable in the data. For example, we can give a different transparency to each continent with `geom_point(mapping = aes(alpha = continent))`.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+We can fit a simple relationship to the data by adding another layer,
+`geom_smooth`:
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm")
+```
+
+``` output
+`geom_smooth()` using formula = 'y ~ x'
+```
+
+
+
+We can make the line thicker by *setting* the **linewidth** aesthetic in the
+`geom_smooth` layer:
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", linewidth=1.5)
+```
+
+``` output
+`geom_smooth()` using formula = 'y ~ x'
+```
+
+
+
+There are two ways an *aesthetic* can be specified. Here we *set* the **linewidth** aesthetic by passing it as an argument to `geom_smooth` and it is applied the same to the whole `geom`. Previously in the lesson we've used the `aes` function to define a *mapping* between data variables and their visual representation.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 4a
+
+Modify the color and size of the points on the point layer in the previous
+example.
+
+Hint: do not use the `aes` function.
+
+Hint: the equivalent of `linewidth` for points is `size`.
+
+::::::::::::::: solution
+
+## Solution to challenge 4a
+
+Here is a possible solution:
+Notice that the `color` argument is supplied outside of the `aes()` function.
+This means that it applies to all data points on the graph and is not related to
+a specific variable.
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(size=3, color="orange") + scale_x_log10() +
+ geom_smooth(method="lm", linewidth=1.5)
+```
+
+``` output
+`geom_smooth()` using formula = 'y ~ x'
+```
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 4b
+
+Modify your solution to Challenge 4a so that the
+points are now a different shape and are colored by continent with new
+trendlines. Hint: The color argument can be used inside the aesthetic.
+
+::::::::::::::: solution
+
+## Solution to challenge 4b
+
+Here is a possible solution:
+Notice that supplying the `color` argument inside the `aes()` function enables you to
+connect it to a certain variable. The `shape` argument, as you can see, modifies all
+data points the same way (it is outside the `aes()` call) while the `color` argument which
+is placed inside the `aes()` call modifies a point's color based on its continent value.
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+ geom_point(size=3, shape=17) + scale_x_log10() +
+ geom_smooth(method="lm", linewidth=1.5)
+```
+
+``` output
+`geom_smooth()` using formula = 'y ~ x'
+```
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Multi-panel figures
+
+Earlier we visualized the change in life expectancy over time across all
+countries in one plot. Alternatively, we can split this out over multiple panels
+by adding a layer of **facet** panels.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip
+
+We start by making a subset of data including only countries located
+in the Americas. This includes 25 countries, which will begin to
+clutter the figure. Note that we apply a "theme" definition to rotate
+the x-axis labels to maintain readability. Nearly everything in
+ggplot2 is customizable.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+``` r
+americas <- gapminder[gapminder$continent == "Americas",]
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+ geom_line() +
+ facet_wrap( ~ country) +
+ theme(axis.text.x = element_text(angle = 45))
+```
+
+
+
+The `facet_wrap` layer took a "formula" as its argument, denoted by the tilde
+(~). This tells R to draw a panel for each unique value in the country column
+of the gapminder dataset.
+
+## Modifying text
+
+To clean this figure up for a publication we need to change some of the text
+elements. The x-axis is too cluttered, and the y axis should read
+"Life expectancy", rather than the column name in the data frame.
+
+We can do this by adding a couple of different layers. The **theme** layer
+controls the axis text, and overall text size. Labels for the axes, plot
+title and any legend can be set using the `labs` function. Legend titles
+are set using the same names we used in the `aes` specification. Thus below
+the color legend title is set using `color = "Continent"`, while the title
+of a fill legend would be set using `fill = "MyTitle"`.
+
+
+``` r
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+ geom_line() + facet_wrap( ~ country) +
+ labs(
+ x = "Year", # x axis title
+ y = "Life expectancy", # y axis title
+ title = "Figure 1", # main title of figure
+ color = "Continent" # title of legend
+ ) +
+ theme(axis.text.x = element_text(angle = 90, hjust = 1))
+```
+
+
+
+## Exporting the plot
+
+The `ggsave()` function allows you to export a plot created with ggplot. You can specify the dimension and resolution of your plot by adjusting the appropriate arguments (`width`, `height` and `dpi`) to create high quality graphics for publication. In order to save the plot from above, we first assign it to a variable `lifeExp_plot`, then tell `ggsave` to save that plot in `png` format to a directory called `results`. (Make sure you have a `results/` folder in your working directory.)
+
+
+
+
+``` r
+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+ geom_line() + facet_wrap( ~ country) +
+ labs(
+ x = "Year", # x axis title
+ y = "Life expectancy", # y axis title
+ title = "Figure 1", # main title of figure
+ color = "Continent" # title of legend
+ ) +
+ theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")
+```
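+
+If the `results/` folder might not exist yet, it can be created from R before saving. A minimal sketch using base R:
+
+``` r
+# create the output folder only when it is missing
+if (!dir.exists("results")) {
+  dir.create("results")
+}
+```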
+
+There are two nice things about `ggsave`. First, it defaults to the last plot, so if you omit the `plot` argument it will automatically save the last plot you created with `ggplot`. Second, it tries to determine the format you want to save your plot in from the file extension you provide for the filename (for example `.png` or `.pdf`). If you need to, you can specify the format explicitly in the `device` argument.
+
+This is a taste of what you can do with ggplot2. RStudio provides a
+really useful [cheat sheet][cheat] of the different layers available, and more
+extensive documentation is available on the [ggplot2 website][ggplot-doc]. All RStudio cheat sheets are available from the [RStudio website][cheat_all].
+Finally, if you have no idea how to change something, a quick Google search will
+usually send you to a relevant question and answer on Stack Overflow with reusable
+code to modify!
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 5
+
+Generate boxplots to compare life expectancy between the different continents during the available years.
+
+Advanced:
+
+- Rename y axis as Life Expectancy.
+- Remove x axis labels.
+
+::::::::::::::: solution
+
+## Solution to Challenge 5
+
+Here is a possible solution:
+`xlab()` and `ylab()` set labels for the x and y axes, respectively.
+The axis title, text and ticks are attributes of the theme and must be modified within a `theme()` call.
+
+
+``` r
+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+ geom_boxplot() + facet_wrap(~year) +
+ ylab("Life Expectancy") +
+ theme(axis.title.x=element_blank(),
+ axis.text.x = element_blank(),
+ axis.ticks.x = element_blank())
+```
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+[base]: https://www.statmethods.net/graphs/index.html
+[lattice]: https://www.statmethods.net/advgraphs/trellis.html
+[ggplot2]: https://www.statmethods.net/advgraphs/ggplot2.html
+[cheat]: https://www.rstudio.org/links/data_visualization_cheat_sheet
+[cheat_all]: https://www.rstudio.com/resources/cheatsheets/
+[ggplot-doc]: https://ggplot2.tidyverse.org/reference/
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use `ggplot2` to create plots.
+- Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/09-vectorization.md b/09-vectorization.md
new file mode 100644
index 000000000..308ddea97
--- /dev/null
+++ b/09-vectorization.md
@@ -0,0 +1,489 @@
+---
+title: Vectorization
+teaching: 10
+exercises: 15
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- To understand vectorized operations in R.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I operate on all the elements of a vector at once?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+Most of R's functions are vectorized, meaning that the function will
+operate on all elements of a vector without needing to loop through
+and act on each element one at a time. This makes writing code more
+concise, easy to read, and less error prone.
+
+
+``` r
+x <- 1:4
+x * 2
+```
+
+``` output
+[1] 2 4 6 8
+```
+
+The multiplication happened to each element of the vector.
+
+We can also add two vectors together:
+
+
+``` r
+y <- 6:9
+x + y
+```
+
+``` output
+[1] 7 9 11 13
+```
+
+Each element of `x` was added to its corresponding element of `y`:
+
+
+``` r
+x: 1 2 3 4
+ + + + +
+y: 6 7 8 9
+---------------
+ 7 9 11 13
+```
+
+Here is how we would add two vectors together using a for loop:
+
+
+``` r
+output_vector <- c()
+for (i in 1:4) {
+ output_vector[i] <- x[i] + y[i]
+}
+output_vector
+```
+
+``` output
+[1] 7 9 11 13
+```
+
+Compare this to the output using vectorised operations.
+
+
+``` r
+sum_xy <- x + y
+sum_xy
+```
+
+``` output
+[1] 7 9 11 13
+```
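+
+A quick way to confirm that the two approaches agree is to compare the results directly. The sketch below repeats both computations so that it runs on its own; the names `loop_sum` and `vec_sum` are illustrative only.
+
+``` r
+x <- 1:4
+y <- 6:9
+
+# Element-by-element addition with an explicit loop
+loop_sum <- integer(length(x))
+for (i in seq_along(x)) {
+  loop_sum[i] <- x[i] + y[i]
+}
+
+# The same addition as a single vectorized expression
+vec_sum <- x + y
+
+# TRUE when both vectors hold exactly the same values
+identical(loop_sum, vec_sum)
+```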
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Let's try this on the `pop` column of the `gapminder` dataset.
+
+Make a new column in the `gapminder` data frame that
+contains population in units of millions of people.
+Check the head or tail of the data frame to make sure
+it worked.
+
+::::::::::::::: solution
+
+## Solution to challenge 1
+
+Let's try this on the `pop` column of the `gapminder` dataset.
+
+Make a new column in the `gapminder` data frame that
+contains population in units of millions of people.
+Check the head or tail of the data frame to make sure
+it worked.
+
+
+``` r
+gapminder$pop_millions <- gapminder$pop / 1e6
+head(gapminder)
+```
+
+``` output
+ country year pop continent lifeExp gdpPercap pop_millions
+1 Afghanistan 1952 8425333 Asia 28.801 779.4453 8.425333
+2 Afghanistan 1957 9240934 Asia 30.332 820.8530 9.240934
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007 10.267083
+4 Afghanistan 1967 11537966 Asia 34.020 836.1971 11.537966
+5 Afghanistan 1972 13079460 Asia 36.088 739.9811 13.079460
+6 Afghanistan 1977 14880372 Asia 38.438 786.1134 14.880372
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+On a single graph, plot population, in
+millions, against year, for all countries. Do not worry about
+identifying which country is which.
+
+Repeat the exercise, graphing only for China, India, and
+Indonesia. Again, do not worry about which is which.
+
+::::::::::::::: solution
+
+## Solution to challenge 2
+
+Refresh your plotting skills by plotting population in millions against year.
+
+
+``` r
+ggplot(gapminder, aes(x = year, y = pop_millions)) +
+ geom_point()
+```
+
+
+
+``` r
+countryset <- c("China","India","Indonesia")
+ggplot(gapminder[gapminder$country %in% countryset,],
+ aes(x = year, y = pop_millions)) +
+ geom_point()
+```
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Comparison operators, logical operators, and many functions are also
+vectorized:
+
+**Comparison operators**
+
+
+``` r
+x > 2
+```
+
+``` output
+[1] FALSE FALSE TRUE TRUE
+```
+
+**Logical operators**
+
+
+``` r
+a <- x > 3 # or, for clarity, a <- (x > 3)
+a
+```
+
+``` output
+[1] FALSE FALSE FALSE TRUE
+```
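+
+The example above stores the result of a comparison. The logical operators `&` (and), `|` (or) and `!` (not) are vectorized in the same way and combine logical vectors element by element. A minimal sketch, using the same values of `x`:
+
+``` r
+x <- 1:4
+
+x > 1 & x < 4   # TRUE only where both comparisons hold
+x < 2 | x > 3   # TRUE where at least one comparison holds
+!(x > 2)        # flips each element of the comparison
+```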
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: some useful functions for logical vectors
+
+`any()` will return `TRUE` if *any* element of a vector is `TRUE`.
+`all()` will return `TRUE` if *all* elements of a vector are `TRUE`.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
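+
+As a short illustration of the tip above, `any()` and `all()` each reduce a logical vector to a single value:
+
+``` r
+x <- 1:4
+
+any(x > 3)  # TRUE: at least one element is greater than 3
+all(x > 3)  # FALSE: not every element is greater than 3
+all(x > 0)  # TRUE: every element is positive
+```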
+
+Most functions also operate element-wise on vectors:
+
+**Functions**
+
+
+``` r
+x <- 1:4
+log(x)
+```
+
+``` output
+[1] 0.0000000 0.6931472 1.0986123 1.3862944
+```
+
+Vectorized operations work element-wise on matrices:
+
+
+``` r
+m <- matrix(1:12, nrow=3, ncol=4)
+m * -1
+```
+
+``` output
+ [,1] [,2] [,3] [,4]
+[1,] -1 -4 -7 -10
+[2,] -2 -5 -8 -11
+[3,] -3 -6 -9 -12
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: element-wise vs. matrix multiplication
+
+Very important: the operator `*` gives you element-wise multiplication!
+To do matrix multiplication, we need to use the `%*%` operator:
+
+
+``` r
+m %*% matrix(1, nrow=4, ncol=1)
+```
+
+``` output
+ [,1]
+[1,] 22
+[2,] 26
+[3,] 30
+```
+
+``` r
+matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1)
+```
+
+``` output
+ [,1]
+[1,] 30
+```
+
+For more on matrix algebra, see the [Quick-R reference
+guide](https://www.statmethods.net/advstats/matrix.html)
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Given the following matrix:
+
+
+``` r
+m <- matrix(1:12, nrow=3, ncol=4)
+m
+```
+
+``` output
+ [,1] [,2] [,3] [,4]
+[1,] 1 4 7 10
+[2,] 2 5 8 11
+[3,] 3 6 9 12
+```
+
+Write down what you think will happen when you run:
+
+1. `m ^ -1`
+2. `m * c(1, 0, -1)`
+3. `m > c(0, 20)`
+4. `m * c(1, 0, -1, 2)`
+
+Did you get the output you expected? If not, ask a helper!
+
+::::::::::::::: solution
+
+## Solution to challenge 3
+
+Given the following matrix:
+
+
+``` r
+m <- matrix(1:12, nrow=3, ncol=4)
+m
+```
+
+``` output
+ [,1] [,2] [,3] [,4]
+[1,] 1 4 7 10
+[2,] 2 5 8 11
+[3,] 3 6 9 12
+```
+
+Write down what you think will happen when you run:
+
+1. `m ^ -1`
+
+
+``` output
+ [,1] [,2] [,3] [,4]
+[1,] 1.0000000 0.2500000 0.1428571 0.10000000
+[2,] 0.5000000 0.2000000 0.1250000 0.09090909
+[3,] 0.3333333 0.1666667 0.1111111 0.08333333
+```
+
+2. `m * c(1, 0, -1)`
+
+
+``` output
+ [,1] [,2] [,3] [,4]
+[1,] 1 4 7 10
+[2,] 0 0 0 0
+[3,] -3 -6 -9 -12
+```
+
+3. `m > c(0, 20)`
+
+
+``` output
+ [,1] [,2] [,3] [,4]
+[1,] TRUE FALSE TRUE FALSE
+[2,] FALSE TRUE FALSE TRUE
+[3,] TRUE FALSE TRUE FALSE
+```
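+
+4. `m * c(1, 0, -1, 2)`
+
+For the fourth expression, the vector has four elements while `m` has three rows, so recycling runs down the columns of `m` (R stores matrices column by column) and the multipliers no longer line up with the rows. No warning is issued, because the 12 elements of `m` are a multiple of 4. The sketch below makes the recycling explicit; `recycled` is an illustrative name.
+
+``` r
+m <- matrix(1:12, nrow=3, ncol=4)
+
+# Repeat the vector to length 12 and lay it out like m (column by column)
+recycled <- matrix(rep(c(1, 0, -1, 2), length.out=12), nrow=3, ncol=4)
+
+# Both expressions give the same matrix
+all.equal(m * c(1, 0, -1, 2), m * recycled)
+m * c(1, 0, -1, 2)
+```
+
+The first column of the result is 1, 0, -3 and the second is 8, 5, 0: the pattern of multipliers shifts from one column to the next.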
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 4
+
+We're interested in looking at the sum of the
+following sequence of fractions:
+
+
+``` r
+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)
+```
+
+This would be tedious to type out, and impossible for high values of
+n. Use vectorisation to compute x when n=100. What is the sum when
+n=10,000?
+
+::::::::::::::: solution
+
+## Solution to challenge 4
+
+We're interested in looking at the sum of the
+following sequence of fractions:
+
+
+``` r
+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)
+```
+
+This would be tedious to type out, and impossible for
+high values of n.
+Can you use vectorisation to compute x, when n=100?
+How about when n=10,000?
+
+
+``` r
+sum(1/(1:100)^2)
+```
+
+``` output
+[1] 1.634984
+```
+
+``` r
+sum(1/(1:1e04)^2)
+```
+
+``` output
+[1] 1.644834
+```
+
+``` r
+n <- 10000
+sum(1/(1:n)^2)
+```
+
+``` output
+[1] 1.644834
+```
+
+We can also obtain the same results using a function:
+
+
+``` r
+inverse_sum_of_squares <- function(n) {
+ sum(1/(1:n)^2)
+}
+inverse_sum_of_squares(100)
+```
+
+``` output
+[1] 1.634984
+```
+
+``` r
+inverse_sum_of_squares(10000)
+```
+
+``` output
+[1] 1.644834
+```
+
+``` r
+n <- 10000
+inverse_sum_of_squares(n)
+```
+
+``` output
+[1] 1.644834
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Operations on vectors of unequal length
+
+Operations can also be performed on vectors of unequal length, through
+a process known as *recycling*. This process automatically repeats the shorter vector
+until it matches the length of the longer vector. R will provide a warning
+if the length of the longer vector is not a multiple of the length of the shorter vector.
+
+
+``` r
+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y
+```
+
+``` warning
+Warning in x + y: longer object length is not a multiple of shorter object
+length
+```
+
+``` output
+[1] 2 4 6 5 7 9 8
+```
+
+Vector `x` was recycled to match the length of vector `y`:
+
+
+``` r
+x: 1 2 3 1 2 3 1
+ + + + + + + +
+y: 1 2 3 4 5 6 7
+-----------------------
+ 2 4 6 5 7 9 8
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
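+
+Recycling is silent when the longer length is an exact multiple of the shorter one, which makes it easy to miss. A small sketch with the illustrative vectors `short` and `long`:
+
+``` r
+short <- c(1, 2)
+long <- c(10, 20, 30, 40)
+
+# short is used twice (1 2 1 2), and no warning is given
+short + long
+
+# Equivalent to spelling the repetition out with rep()
+rep(short, times = 2) + long
+```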
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use vectorized operations instead of loops.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/10-functions.md b/10-functions.md
new file mode 100644
index 000000000..a831b0b0d
--- /dev/null
+++ b/10-functions.md
@@ -0,0 +1,685 @@
+---
+title: Functions Explained
+teaching: 45
+exercises: 15
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Define a function that takes arguments.
+- Return a value from a function.
+- Check argument conditions with `stopifnot()` in functions.
+- Test a function.
+- Set default values for function arguments.
+- Explain why we should divide programs into small, single-purpose functions.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I write a new function in R?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+If we only had one data set to analyze, it would probably be faster to load the
+file into a spreadsheet and use that to plot simple statistics. However, the
+gapminder data is updated periodically, and we may want to pull in that new
+information later and re-run our analysis again. We may also obtain similar data
+from a different source in the future.
+
+In this lesson, we'll learn how to write a function so that we can repeat
+several operations with a single command.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## What is a function?
+
+Functions gather a sequence of operations into a whole, preserving it for
+ongoing use. Functions provide:
+
+- a name we can remember and invoke it by
+- relief from the need to remember the individual operations
+- a defined set of inputs and expected outputs
+- rich connections to the larger programming environment
+
+As the basic building block of most programming languages, user-defined
+functions constitute "programming" as much as any single abstraction can. If
+you have written a function, you are a computer programmer.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Defining a function
+
+Let's open a new R script file in the `functions/` directory and call it
+functions-lesson.R.
+
+The general structure of a function is:
+
+
+``` r
+my_function <- function(parameters) {
+ # perform action
+ # return value
+}
+```
+
+Let's define a function `fahr_to_kelvin()` that converts temperatures from
+Fahrenheit to Kelvin:
+
+
+``` r
+fahr_to_kelvin <- function(temp) {
+ kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+ return(kelvin)
+}
+```
+
+We define `fahr_to_kelvin()` by assigning it to the output of `function`. The
+list of argument names is contained within parentheses. Next, the
+[body](../learners/reference.md#body) of the function--the
+statements that are executed when it runs--is contained within curly braces
+(`{}`). The statements in the body are indented by two spaces. This makes the
+code easier to read but does not affect how the code operates.
+
+It is useful to think of creating functions like writing a recipe in a cookbook. First you define the "ingredients" that your function needs. In this case, we only need one ingredient to use our function: `temp`. After we list our ingredients, we say what we will do with them. Here, we take our ingredient and apply a set of mathematical operators to it.
+
+When we call the function, the values we pass to it as arguments are assigned to
+those variables so that we can use them inside the function. Inside the
+function, we use a [return
+statement](../learners/reference.md#return-statement) to send a result back to
+whoever asked for it.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip
+
+One feature of R is that the return statement is not required.
+R automatically returns the value of the last expression evaluated in the body
+of the function. But for clarity, we will explicitly define the
+return statement.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
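+
+To make the tip concrete, here is a sketch of the same conversion written with and without `return()`. The name `fahr_to_kelvin_implicit` is made up for this comparison.
+
+``` r
+# Explicit return()
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+# Implicit return: the value of the last expression is returned
+fahr_to_kelvin_implicit <- function(temp) {
+  ((temp - 32) * (5 / 9)) + 273.15
+}
+
+# Both versions give the same answer
+all.equal(fahr_to_kelvin(212), fahr_to_kelvin_implicit(212))
+```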
+
+Let's try running our function.
+Calling our own function is no different from calling any other function:
+
+
+``` r
+# freezing point of water
+fahr_to_kelvin(32)
+```
+
+``` output
+[1] 273.15
+```
+
+
+``` r
+# boiling point of water
+fahr_to_kelvin(212)
+```
+
+``` output
+[1] 373.15
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Write a function called `kelvin_to_celsius()` that takes a temperature in
+Kelvin and returns that temperature in Celsius.
+
+Hint: To convert from Kelvin to Celsius, you subtract 273.15.
+
+::::::::::::::: solution
+
+## Solution to challenge 1
+
+Write a function called `kelvin_to_celsius()` that takes a temperature in Kelvin
+and returns that temperature in Celsius.
+
+
+``` r
+kelvin_to_celsius <- function(temp) {
+ celsius <- temp - 273.15
+ return(celsius)
+}
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Combining functions
+
+The real power of functions comes from mixing, matching and combining them
+into ever-larger chunks to get the effect we want.
+
+Let's define two functions that will convert temperature from Fahrenheit to
+Kelvin, and Kelvin to Celsius:
+
+
+``` r
+fahr_to_kelvin <- function(temp) {
+ kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+ return(kelvin)
+}
+
+kelvin_to_celsius <- function(temp) {
+ celsius <- temp - 273.15
+ return(celsius)
+}
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+Define the function to convert directly from Fahrenheit to Celsius,
+by reusing the two functions above (or using your own functions if you
+prefer).
+
+::::::::::::::: solution
+
+## Solution to challenge 2
+
+Define the function to convert directly from Fahrenheit to Celsius,
+by reusing these two functions above
+
+
+``` r
+fahr_to_celsius <- function(temp) {
+ temp_k <- fahr_to_kelvin(temp)
+ result <- kelvin_to_celsius(temp_k)
+ return(result)
+}
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
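+
+Because a function call evaluates to its return value, the two conversions can also be chained by nesting one call inside the other. The sketch below repeats the two definitions so that it runs on its own; `fahr_to_celsius_nested` is an illustrative name.
+
+``` r
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+
+# The inner call runs first and its result becomes the argument of the outer call
+fahr_to_celsius_nested <- function(temp) {
+  return(kelvin_to_celsius(fahr_to_kelvin(temp)))
+}
+
+# Freezing and boiling points of water, converted in one vectorized call
+fahr_to_celsius_nested(c(32, 212))
+```
+
+The version with the intermediate variable `temp_k` is easier to step through when debugging; the nested version is more compact.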
+
+## Interlude: Defensive Programming
+
+Now that we've begun to appreciate how writing functions provides an efficient
+way to make R code re-usable and modular, we should note that it is important
+to ensure that functions only work in their intended use-cases. Checking
+function parameters is related to the concept of *defensive programming*.
+Defensive programming encourages us to frequently check conditions and throw an
+error if something is wrong. These checks are referred to as assertion
+statements because we want to assert some condition is `TRUE` before proceeding.
+They make it easier to debug because they give us a better idea of where the
+errors originate.
+
+### Checking conditions with `stopifnot()`
+
+Let's start by re-examining `fahr_to_kelvin()`, our function for converting
+temperatures from Fahrenheit to Kelvin. It was defined like so:
+
+
+``` r
+fahr_to_kelvin <- function(temp) {
+ kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+ return(kelvin)
+}
+```
+
+For this function to work as intended, the argument `temp` must be a `numeric`
+value; otherwise, the mathematical procedure for converting between the two
+temperature scales will not work. To create an error, we can use the function
+`stop()`. For example, since the argument `temp` must be a `numeric` vector, we
+could check for this condition with an `if` statement and throw an error if the
+condition was violated. We could augment our function above like so:
+
+
+``` r
+fahr_to_kelvin <- function(temp) {
+ if (!is.numeric(temp)) {
+ stop("temp must be a numeric vector.")
+ }
+ kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+ return(kelvin)
+}
+```
+
+If we had multiple conditions or arguments to check, it would take many lines
+of code to check all of them. Luckily R provides the convenience function
+`stopifnot()`. We can list as many requirements as we like, each of which should evaluate to `TRUE`;
+`stopifnot()` throws an error if it finds one that is `FALSE`. Listing these
+conditions also serves a secondary purpose as extra documentation for the
+function.
+
+Let's try out defensive programming with `stopifnot()` by adding assertions to
+check the input to our function `fahr_to_kelvin()`.
+
+We want to assert the following: `temp` is a numeric vector. We may do that like
+so:
+
+
+``` r
+fahr_to_kelvin <- function(temp) {
+ stopifnot(is.numeric(temp))
+ kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+ return(kelvin)
+}
+```
+
+It still works when given proper input.
+
+
+``` r
+# freezing point of water
+fahr_to_kelvin(temp = 32)
+```
+
+``` output
+[1] 273.15
+```
+
+But fails instantly if given improper input.
+
+
+``` r
+# temp is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))
+```
+
+``` error
+Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
+```
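+
+To illustrate checking several conditions at once, here is a sketch that also requires the temperature to be at or above absolute zero (-459.67 degrees Fahrenheit). The function name `fahr_to_kelvin_checked` and the second condition are illustrative additions, not part of the lesson's function. The named form, which supplies a custom error message, assumes R 4.0.0 or later.
+
+``` r
+fahr_to_kelvin_checked <- function(temp) {
+  stopifnot(
+    is.numeric(temp),
+    # A named condition: the name is used as the error message (R >= 4.0.0)
+    "temp must not be below absolute zero" = all(temp >= -459.67)
+  )
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+# Passes both checks
+fahr_to_kelvin_checked(c(32, 212))
+
+# Fails the second check; try() lets the script continue after the error
+try(fahr_to_kelvin_checked(-500))
+```
+
+The conditions are checked in order, so a non-numeric `temp` is reported by the first check before the comparison is attempted.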
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Use defensive programming to ensure that our `fahr_to_celsius()` function
+throws an error immediately if the argument `temp` is specified
+inappropriately.
+
+::::::::::::::: solution
+
+## Solution to challenge 3
+
+Extend our previous definition of the function by adding in an explicit call
+to `stopifnot()`. Since `fahr_to_celsius()` is a composition of two other
+functions, checking inside here makes adding checks to the two component
+functions redundant.
+
+
+``` r
+fahr_to_celsius <- function(temp) {
+ stopifnot(is.numeric(temp))
+ temp_k <- fahr_to_kelvin(temp)
+ result <- kelvin_to_celsius(temp_k)
+ return(result)
+}
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## More on combining functions
+
+Now, we're going to define a function that calculates the Gross Domestic Product
+of a nation from the data available in our dataset:
+
+
+``` r
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+ gdp <- dat$pop * dat$gdpPercap
+ return(gdp)
+}
+```
+
+We define `calcGDP()` by assigning it to the output of `function`. The list of
+argument names is contained within parentheses. Next, the body of the function
+\-- the statements executed when you call the function -- is contained within
+curly braces (`{}`).
+
+We've indented the statements in the body by two spaces. This makes the code
+easier to read but does not affect how it operates.
+
+When we call the function, the values we pass to it are assigned to the
+arguments, which become variables inside the body of the function.
+
+Inside the function, we use the `return()` function to send back the result.
+This `return()` function is optional: R will automatically return the results of
+whatever command is executed on the last line of the function.
+
+
+``` r
+calcGDP(head(gapminder))
+```
+
+``` output
+[1] 6567086330 7585448670 8758855797 9648014150 9678553274 11697659231
+```
+
+That's not very informative. Let's add some more arguments so we can extract
+that per year and country.
+
+
+``` r
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+ if(!is.null(year)) {
+ dat <- dat[dat$year %in% year, ]
+ }
+ if (!is.null(country)) {
+ dat <- dat[dat$country %in% country,]
+ }
+ gdp <- dat$pop * dat$gdpPercap
+
+ new <- cbind(dat, gdp=gdp)
+ return(new)
+}
+```
+
+If you've been writing these functions down into a separate R script
+(a good idea!), you can load the functions into your R session by using the
+`source()` function:
+
+
+``` r
+source("functions/functions-lesson.R")
+```
+
+Ok, so there's a lot going on in this function now. In plain English, the
+function now subsets the provided data by year if the year argument isn't empty,
+then subsets the result by country if the country argument isn't empty. Then it
+calculates the GDP for whatever subset emerges from the previous two steps. The
+function then adds the GDP as a new column to the subsetted data and returns
+this as the final result. You can see that the output is much more informative
+than a vector of numbers.
+
+Let's take a look at what happens when we specify the year:
+
+
+``` r
+head(calcGDP(gapminder, year=2007))
+```
+
+``` output
+ country year pop continent lifeExp gdpPercap gdp
+12 Afghanistan 2007 31889923 Asia 43.828 974.5803 31079291949
+24 Albania 2007 3600523 Europe 76.423 5937.0295 21376411360
+36 Algeria 2007 33333216 Africa 72.301 6223.3675 207444851958
+48 Angola 2007 12420476 Africa 42.731 4797.2313 59583895818
+60 Argentina 2007 40301927 Americas 75.320 12779.3796 515033625357
+72 Australia 2007 20434176 Oceania 81.235 34435.3674 703658358894
+```
+
+Or for a specific country:
+
+
+``` r
+calcGDP(gapminder, country="Australia")
+```
+
+``` output
+ country year pop continent lifeExp gdpPercap gdp
+61 Australia 1952 8691212 Oceania 69.120 10039.60 87256254102
+62 Australia 1957 9712569 Oceania 70.330 10949.65 106349227169
+63 Australia 1962 10794968 Oceania 70.930 12217.23 131884573002
+64 Australia 1967 11872264 Oceania 71.100 14526.12 172457986742
+65 Australia 1972 13177000 Oceania 71.930 16788.63 221223770658
+66 Australia 1977 14074100 Oceania 73.490 18334.20 258037329175
+67 Australia 1982 15184200 Oceania 74.740 19477.01 295742804309
+68 Australia 1987 16257249 Oceania 76.320 21888.89 355853119294
+69 Australia 1992 17481977 Oceania 77.560 23424.77 409511234952
+70 Australia 1997 18565243 Oceania 78.830 26997.94 501223252921
+71 Australia 2002 19546792 Oceania 80.370 30687.75 599847158654
+72 Australia 2007 20434176 Oceania 81.235 34435.37 703658358894
+```
+
+Or both:
+
+
+``` r
+calcGDP(gapminder, year=2007, country="Australia")
+```
+
+``` output
+ country year pop continent lifeExp gdpPercap gdp
+72 Australia 2007 20434176 Oceania 81.235 34435.37 703658358894
+```
+
+Let's walk through the body of the function:
+
+
+``` r
+calcGDP <- function(dat, year=NULL, country=NULL) {
+```
+
+Here we've added two arguments, `year`, and `country`. We've set
+*default arguments* for both as `NULL` using the `=` operator
+in the function definition. This means that those arguments will
+take on those values unless the user specifies otherwise.
+
+
+``` r
+ if(!is.null(year)) {
+ dat <- dat[dat$year %in% year, ]
+ }
+ if (!is.null(country)) {
+ dat <- dat[dat$country %in% country,]
+ }
+```
+
+Here, we check whether each additional argument is set to `NULL`, and whenever
+they're not `NULL` overwrite the dataset stored in `dat` with a subset given by
+the non-`NULL` argument.
+
+Building these conditionals into the function makes it more flexible for later.
+Now, we can use it to calculate the GDP for:
+
+- The whole dataset;
+- A single year;
+- A single country;
+- A single combination of year and country.
+
+By using `%in%` instead of `==`, we can also give multiple years or countries to those
+arguments.
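+
+To see why `%in%` matters here, compare it with `==` when more than one value is supplied. The sketch below uses a small made-up vector of years rather than the gapminder data; `years` and `wanted` are illustrative names.
+
+``` r
+years <- c(1952, 1957, 1962, 1967)
+wanted <- c(1967, 1952)
+
+# %in% asks, for each element of years, whether it appears anywhere in wanted
+years %in% wanted
+
+# == compares element by element and recycles wanted (1967, 1952, 1967, 1952),
+# so a match is found only where the positions happen to line up
+years == wanted
+```
+
+Here `%in%` flags 1952 and 1967, whereas `==` flags nothing, so subsetting with `==` would silently drop rows we asked for.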
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Pass by value
+
+Functions in R almost always make copies of the data to operate on
+inside of a function body. When we modify `dat` inside the function
+we are modifying the copy of the gapminder dataset stored in `dat`,
+not the original variable we gave as the first argument.
+
+This is called "pass-by-value" and it makes writing code much safer:
+you can always be sure that whatever changes you make within the
+body of the function, stay inside the body of the function.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
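+
+A minimal sketch of this behaviour, using a made-up data frame and a hypothetical helper `add_column()`:
+
+``` r
+add_column <- function(dat) {
+  # This changes the function's own copy of dat, not the caller's object
+  dat$doubled <- dat$value * 2
+  return(dat)
+}
+
+original <- data.frame(value = 1:3)
+modified <- add_column(original)
+
+names(original)  # still only "value"
+names(modified)  # "value" and "doubled"
+```
+
+To keep a change made inside a function, assign the returned value to a name, as done with `modified` here.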
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Function scope
+
+Another important concept is scoping: any variables (or functions!) you
+create or modify inside the body of a function only exist for the lifetime
+of the function's execution. When we call `calcGDP()`, the variables `dat`,
+`gdp` and `new` only exist inside the body of the function. Even if we
+have variables of the same name in our interactive R session, they are
+not modified in any way when executing a function.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
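+
+The scoping rule can be checked with a short sketch. The function `scope_demo()` and the variable `local_only` are hypothetical names used only for this illustration.
+
+``` r
+gdp <- "defined in the session"
+
+scope_demo <- function() {
+  gdp <- 42         # a new, local gdp that hides the session variable
+  local_only <- 1   # exists only while the function runs
+  return(gdp)
+}
+
+scope_demo()          # returns the local value
+gdp                   # the session variable is unchanged
+exists("local_only")  # FALSE: the local variable is gone after the call
+```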
+
+
+``` r
+ gdp <- dat$pop * dat$gdpPercap
+ new <- cbind(dat, gdp=gdp)
+ return(new)
+}
+```
+
+Finally, we calculated the GDP on our new subset, and created a new data frame
+with that column added. This means when we call the function later we can see
+the context for the returned GDP values, which is much better than in our first
+attempt where we got a vector of numbers.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 4
+
+Test out your GDP function by calculating the GDP for New Zealand in 1987. How
+does this differ from New Zealand's GDP in 1952?
+
+::::::::::::::: solution
+
+## Solution to challenge 4
+
+
+``` r
+ calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand")
+```
+
+GDP for New Zealand in 1987: 65050008703
+
+GDP for New Zealand in 1952: 21058193787
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 5
+
+The `paste()` function can be used to combine text together, e.g.:
+
+
+``` r
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+paste(best_practice, collapse=" ")
+```
+
+``` output
+[1] "Write programs for people not computers"
+```
+
+Write a function called `fence()` that takes two vectors as arguments, called
+`text` and `wrapper`, and prints out the text wrapped with the `wrapper`:
+
+
+``` r
+fence(text=best_practice, wrapper="***")
+```
+
+*Note:* the `paste()` function has an argument called `sep`, which specifies
+the separator between text. The default is a space: " ". `paste0()` is a
+shortcut for `paste()` with no separator: "".
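+
+The difference between `sep` and `collapse` is easy to mix up, so here is a brief sketch of each:
+
+``` r
+# sep goes between the separate arguments that are pasted together
+paste("Write", "programs", sep = "-")
+
+# paste0() behaves like paste() with sep = ""
+paste0("Write", "programs")
+
+# collapse joins the elements of a single vector into one string
+paste(c("Write", "programs"), collapse = "-")
+```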
+
+::::::::::::::: solution
+
+## Solution to challenge 5
+
+Write a function called `fence()` that takes two vectors as arguments,
+called `text` and `wrapper`, and prints out the text wrapped with the
+`wrapper`:
+
+
+``` r
+fence <- function(text, wrapper){
+ text <- c(wrapper, text, wrapper)
+ result <- paste(text, collapse = " ")
+ return(result)
+}
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+fence(text=best_practice, wrapper="***")
+```
+
+``` output
+[1] "*** Write programs for people not computers ***"
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip
+
+R has some unique aspects that can be exploited when performing more
+complicated operations. We will not be writing anything that requires
+knowledge of these more advanced concepts. In the future when you are
+comfortable writing functions in R, you can learn more by reading the
+[R Language Manual][man] or this [chapter] from
+[Advanced R Programming][adv-r] by Hadley Wickham.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Testing and documenting
+
+It's important to both test functions and document them:
+documentation helps you, and others, understand what the
+purpose of your function is, and how to use it, and it's
+important to make sure that your function actually does
+what you think.
+
+When you first start out, your workflow will probably look a lot
+like this:
+
+1. Write a function
+2. Comment parts of the function to document its behaviour
+3. Load in the source file
+4. Experiment with it in the console to make sure it behaves
+ as you expect
+5. Make any necessary bug fixes
+6. Rinse and repeat.
+
+Formal documentation for functions, written in separate `.Rd`
+files, gets turned into the documentation you see in help
+files. The [roxygen2] package allows R coders to write documentation
+alongside the function code and then process it into the appropriate `.Rd`
+files. You will want to switch to this more formal method of writing
+documentation when you start writing more complicated R projects. In fact,
+packages are, in essence, bundles of functions with this formal documentation.
+Loading your own functions through `source("functions.R")` is similar to
+loading someone else's functions (or your own one day!) through
+`library("package")`.
+
+Formal automated tests can be written using the [testthat] package.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
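+
+As a rough sketch of what documenting and testing can look like before moving to a full package, the example below writes roxygen2-style comments (lines starting with `#'`) above a function and follows it with a few informal checks in base R. Outside a package these comments are ordinary comments; the checks use `stopifnot()` rather than testthat so that nothing extra needs to be installed.
+
+``` r
+#' Convert Fahrenheit to Kelvin
+#'
+#' @param temp Numeric vector of temperatures in degrees Fahrenheit.
+#' @return Numeric vector of temperatures in Kelvin.
+#' @examples
+#' fahr_to_kelvin(32)
+fahr_to_kelvin <- function(temp) {
+  stopifnot(is.numeric(temp))
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+# Informal tests: known reference points should be reproduced.
+# all.equal() allows for tiny floating-point differences.
+stopifnot(
+  isTRUE(all.equal(fahr_to_kelvin(32), 273.15)),
+  isTRUE(all.equal(fahr_to_kelvin(212), 373.15))
+)
+```
+
+If every check passes, `stopifnot()` returns silently; a failing check stops with an error naming the condition that was not met.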
+
+[man]: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Environment-objects
+[chapter]: https://adv-r.had.co.nz/Environments.html
+[adv-r]: https://adv-r.had.co.nz/
+[roxygen2]: https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html
+[testthat]: https://r-pkgs.had.co.nz/tests.html
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use `function` to define a new function in R.
+- Use parameters to pass values into functions.
+- Use `stopifnot()` to flexibly check function arguments in R.
+- Load functions into programs using `source()`.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/11-writing-data.md b/11-writing-data.md
new file mode 100644
index 000000000..d28fcdf9c
--- /dev/null
+++ b/11-writing-data.md
@@ -0,0 +1,217 @@
+---
+title: Writing Data
+teaching: 10
+exercises: 10
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- To be able to write out plots and data from R.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I save plots and data created in R?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+## Saving plots
+
+You have already seen how to save the most recent plot you create in `ggplot2`,
+using the command `ggsave`. As a refresher:
+
+
+``` r
+ggsave("My_most_recent_plot.pdf")
+```
+
+You can save a plot from within RStudio using the 'Export' button
+in the 'Plot' window. This will give you the option of saving as a
+.pdf or as .png, .jpg or other image formats.
+
+Sometimes you will want to save plots without creating them in the
+'Plot' window first. Perhaps you want to make a pdf document with
+multiple pages: each one a different plot, for example. Or perhaps
+you're looping through multiple subsets of a file, plotting data from
+each subset, and you want to save each plot, but obviously can't stop
+the loop to click 'Export' for each one.
+
+In this case you can use a more flexible approach. The function
+`pdf` creates a new pdf device. You can control the size and resolution
+using the arguments to this function.
+
+
+``` r
+pdf("Life_Exp_vs_time.pdf", width=12, height=4)
+ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) +
+ geom_line() +
+ theme(legend.position = "none")
+
+# You then have to make sure to turn off the pdf device!
+
+dev.off()
+```
+
+Open up this document and have a look.
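+
+The looping scenario mentioned above can be sketched as follows. To keep the example self-contained, it uses base R graphics and a tiny made-up data frame (`toy`) instead of ggplot2 and gapminder, and it writes to a temporary file; all of these are assumptions for illustration. Note that inside a loop or a function, a ggplot object has to be shown with an explicit `print()` call, because automatic printing only happens at the top level.
+
+``` r
+# Made-up values, for illustration only
+toy <- data.frame(
+  group = rep(c("A", "B"), each = 3),
+  year  = rep(c(2000, 2005, 2010), times = 2),
+  value = c(1, 2, 3, 2, 4, 8)
+)
+
+out_file <- tempfile(fileext = ".pdf")
+pdf(out_file, width = 6, height = 4)
+for (g in unique(toy$group)) {
+  group_rows <- toy[toy$group == g, ]
+  # Each call to plot() starts a new page in the pdf device
+  plot(group_rows$year, group_rows$value, type = "b",
+       xlab = "year", ylab = "value", main = paste("Group", g))
+}
+dev.off()  # close the device so the file is written
+
+file.exists(out_file)
+```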
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Rewrite your 'pdf' command to print a second
+page in the pdf, showing a facet plot (hint: use `facet_grid`)
+of the same data with one panel per continent.
+
+::::::::::::::: solution
+
+## Solution to challenge 1
+
+
+``` r
+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+ geom_line() +
+ theme(legend.position = "none")
+p
+p + facet_grid(~continent)
+dev.off()
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+The commands `jpeg`, `png` etc. are used similarly to produce
+documents in different formats.
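+
+As a minimal sketch of the same pattern with `png()`, the example below draws a base R plot into a temporary file. For `png()` the `width` and `height` are in pixels by default, not inches as in `pdf()`. The `capabilities("png")` guard is there because some R installations are built without png support.
+
+``` r
+out_file <- tempfile(fileext = ".png")
+
+if (capabilities("png")) {
+  png(out_file, width = 800, height = 400)  # size in pixels
+  plot(1:10, (1:10)^2, type = "b", main = "Example plot")
+  dev.off()  # close the device so the file is written
+}
+```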
+
+## Writing data
+
+At some point, you'll also want to write out data from R.
+
+We can use the `write.table` function for this, which is
+very similar to `read.table` from before.
+
+Let's create a data-cleaning script. For this analysis, we
+only want to focus on the gapminder data for Australia:
+
+
+``` r
+aust_subset <- gapminder[gapminder$country == "Australia",]
+
+write.table(aust_subset,
+ file="cleaned-data/gapminder-aus.csv",
+ sep=","
+)
+```
+
+Let's switch back to the shell to take a look at the data to make sure it looks
+OK:
+
+
+``` bash
+head cleaned-data/gapminder-aus.csv
+```
+
+``` output
+"country","year","pop","continent","lifeExp","gdpPercap"
+"61","Australia",1952,8691212,"Oceania",69.12,10039.59564
+"62","Australia",1957,9712569,"Oceania",70.33,10949.64959
+"63","Australia",1962,10794968,"Oceania",70.93,12217.22686
+"64","Australia",1967,11872264,"Oceania",71.1,14526.12465
+"65","Australia",1972,13177000,"Oceania",71.93,16788.62948
+"66","Australia",1977,14074100,"Oceania",73.49,18334.19751
+"67","Australia",1982,15184200,"Oceania",74.74,19477.00928
+"68","Australia",1987,16257249,"Oceania",76.32,21888.88903
+"69","Australia",1992,17481977,"Oceania",77.56,23424.76683
+```
+
+Hmm, that's not quite what we wanted. Where did all these
+quotation marks come from? Also the row numbers are
+meaningless.
+
+Let's look at the help file to work out how to change this
+behaviour.
+
+
+``` r
+?write.table
+```
+
+By default R will wrap character vectors with quotation marks
+when writing out to file. It will also write out the row and
+column names.
+
+Let's fix this:
+
+
+``` r
+write.table(
+ gapminder[gapminder$country == "Australia",],
+ file="cleaned-data/gapminder-aus.csv",
+ sep=",", quote=FALSE, row.names=FALSE
+)
+```
+
+Now let's look at the data again using our shell skills:
+
+
+``` bash
+head cleaned-data/gapminder-aus.csv
+```
+
+``` output
+country,year,pop,continent,lifeExp,gdpPercap
+Australia,1952,8691212,Oceania,69.12,10039.59564
+Australia,1957,9712569,Oceania,70.33,10949.64959
+Australia,1962,10794968,Oceania,70.93,12217.22686
+Australia,1967,11872264,Oceania,71.1,14526.12465
+Australia,1972,13177000,Oceania,71.93,16788.62948
+Australia,1977,14074100,Oceania,73.49,18334.19751
+Australia,1982,15184200,Oceania,74.74,19477.00928
+Australia,1987,16257249,Oceania,76.32,21888.88903
+Australia,1992,17481977,Oceania,77.56,23424.76683
+```
+
+That looks better!
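+
+One way to convince yourself that the file is usable is to read it back in and
+compare it with the original. The sketch below assumes a small made-up data
+frame (`toy`) and a hypothetical file name in a temporary directory, so it does
+not touch your `cleaned-data/` folder.
+
+``` r
+# Toy data frame (invented subset-like values, for illustration only)
+toy <- data.frame(
+  country = c("Australia", "Australia"),
+  year    = c(1952L, 1957L),
+  lifeExp = c(69.12, 70.33),
+  stringsAsFactors = FALSE
+)
+
+# Hypothetical output file in a temporary directory
+out_file <- file.path(tempdir(), "toy-aus.csv")
+
+# Same arguments as in the lesson: no quotes, no row names
+write.table(toy, file = out_file, sep = ",", quote = FALSE, row.names = FALSE)
+
+# Read the file back and compare it with the original data frame
+back <- read.csv(out_file, stringsAsFactors = FALSE)
+all.equal(toy, back)
+```
+
+Keep in mind that `quote=FALSE` is only safe when none of the character values
+contain the separator (here, a comma); otherwise the columns will not line up
+when the file is read again.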
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+Write a data-cleaning script file that subsets the gapminder
+data to include only data points collected since 1990.
+
+Use this script to write out the new subset to a file
+in the `cleaned-data/` directory.
+
+::::::::::::::: solution
+
+## Solution to challenge 2
+
+
+``` r
+write.table(
+ gapminder[gapminder$year > 1990, ],
+ file = "cleaned-data/gapminder-after1990.csv",
+ sep = ",", quote = FALSE, row.names = FALSE
+)
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Save plots from RStudio using the 'Export' button.
+- Use `write.table` to save tabular data.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/12-dplyr.md b/12-dplyr.md
new file mode 100644
index 000000000..8d7dfa4b2
--- /dev/null
+++ b/12-dplyr.md
@@ -0,0 +1,672 @@
+---
+title: Data Frame Manipulation with dplyr
+teaching: 40
+exercises: 15
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- To be able to use the six main data frame manipulation 'verbs' with pipes in `dplyr`.
+- To understand how `group_by()` and `summarize()` can be combined to summarize datasets.
+- Be able to analyze a subset of data using logical filtering.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I manipulate data frames without repeating myself?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+Manipulation of data frames means many things to many researchers: we often
+select certain observations (rows) or variables (columns), we often group the
+data by a certain variable(s), or we even calculate summary statistics. We can
+do these operations using the normal base R operations:
+
+
+``` r
+mean(gapminder$gdpPercap[gapminder$continent == "Africa"])
+```
+
+``` output
+[1] 2193.755
+```
+
+``` r
+mean(gapminder$gdpPercap[gapminder$continent == "Americas"])
+```
+
+``` output
+[1] 7136.11
+```
+
+``` r
+mean(gapminder$gdpPercap[gapminder$continent == "Asia"])
+```
+
+``` output
+[1] 7902.15
+```
+
+But this isn't very *nice* because there is a fair bit of repetition. Repeating
+yourself will cost you time, both now and later, and potentially introduce some
+nasty bugs.
+
+## The `dplyr` package
+
+Luckily, the [`dplyr`](https://cran.r-project.org/package=dplyr)
+package provides a number of very useful functions for manipulating data frames
+in a way that will reduce the above repetition, reduce the probability of making
+errors, and probably even save you some typing. As an added bonus, you might
+even find the `dplyr` grammar easier to read.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Tidyverse
+
+The `dplyr` package belongs to a broader family of opinionated R packages
+designed for data science called the "Tidyverse". These
+packages are specifically designed to work harmoniously together.
+Some of these packages will be covered during this course, but you can find more
+complete information here: [https://www.tidyverse.org/](https://www.tidyverse.org/).
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Here we're going to cover 5 of the most commonly used functions as well as using
+pipes (`%>%`) to combine them.
+
+1. `select()`
+2. `filter()`
+3. `group_by()`
+4. `summarize()`
+5. `mutate()`
+
+If you have not installed this package earlier, please do so:
+
+
+``` r
+install.packages('dplyr')
+```
+
+Now let's load the package:
+
+
+``` r
+library("dplyr")
+```
+
+## Using select()
+
+If, for example, we wanted to move forward with only a few of the variables in
+our data frame we could use the `select()` function. This will keep only the
+variables you select.
+
+
+``` r
+year_country_gdp <- select(gapminder, year, country, gdpPercap)
+```
+
+![](fig/13-dplyr-fig1.png){alt='Diagram illustrating use of select function to select two columns of a data frame'}
+
+If we want to remove only one column from the `gapminder` data, for example
+the `continent` column, we can put a minus sign in front of its name:
+
+
+``` r
+smaller_gapminder_data <- select(gapminder, -continent)
+```
+
+If we open up `year_country_gdp` we'll see that it only contains the year,
+country and gdpPercap. Above we used 'normal' grammar, but the strengths of
+`dplyr` lie in combining several functions using pipes. Since the pipes grammar
+is unlike anything we've seen in R before, let's repeat what we've done above
+using pipes.
+
+
+``` r
+year_country_gdp <- gapminder %>% select(year, country, gdpPercap)
+```
+
+To help you understand why we wrote that in that way, let's walk through it step
+by step. First we summon the gapminder data frame and pass it on, using the pipe
+symbol `%>%`, to the next step, which is the `select()` function. In this case
+we don't specify which data object we use in the `select()` function since it
+gets that from the previous pipe. **Fun Fact**: There is a good chance you have
+encountered pipes before in the shell. In R, a pipe symbol is `%>%` while in the
+shell it is `|` but the concept is the same!
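+
+To make the walk-through above concrete, the following sketch compares the
+'normal' call with the piped call on a small made-up data frame (`toy` and its
+values are assumptions for illustration, not part of the gapminder data).
+
+``` r
+library(dplyr)
+
+# Toy data frame with invented values
+toy <- data.frame(
+  country   = c("A", "B"),
+  year      = c(2002L, 2002L),
+  gdpPercap = c(1000, 2000),
+  continent = c("X", "Y")
+)
+
+# 'Normal' grammar: the data frame is the first argument
+nested <- select(toy, year, country)
+
+# Pipe grammar: %>% passes `toy` in as the first argument of select()
+piped <- toy %>% select(year, country)
+
+# Both forms give the same result
+identical(nested, piped)
+```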
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Renaming data frame columns in dplyr
+
+In Chapter 4 we covered how you can rename columns with base R by assigning a value to the output of the `names()` function.
+Just like select, this is a bit cumbersome, but thankfully dplyr has a `rename()` function.
+
+Within a pipeline, the syntax is `rename(new_name = old_name)`.
+For example, we may want to rename the gdpPercap column name from our `select()` statement above.
+
+
+``` r
+tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap)
+
+head(tidy_gdp)
+```
+
+``` output
+ year country gdp_per_capita
+1 1952 Afghanistan 779.4453
+2 1957 Afghanistan 820.8530
+3 1962 Afghanistan 853.1007
+4 1967 Afghanistan 836.1971
+5 1972 Afghanistan 739.9811
+6 1977 Afghanistan 786.1134
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Using filter()
+
+If we now want to move forward with the above, but only with European
+countries, we can combine `select` and `filter`:
+
+
+``` r
+year_country_gdp_euro <- gapminder %>%
+ filter(continent == "Europe") %>%
+ select(year, country, gdpPercap)
+```
+
+If we now want to show life expectancy of European countries but only
+for a specific year (e.g., 2007), we can do as below.
+
+
+``` r
+europe_lifeExp_2007 <- gapminder %>%
+ filter(continent == "Europe", year == 2007) %>%
+ select(country, lifeExp)
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Write a single command (which can span multiple lines and includes pipes) that
+will produce a data frame that has the African values for `lifeExp`, `country`
+and `year`, but not for other Continents. How many rows does your data frame
+have and why?
+
+::::::::::::::: solution
+
+## Solution to Challenge 1
+
+
+``` r
+year_country_lifeExp_Africa <- gapminder %>%
+ filter(continent == "Africa") %>%
+ select(year, country, lifeExp)
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+As with last time, first we pass the gapminder data frame to the `filter()`
+function, then we pass the filtered version of the gapminder data frame to the
+`select()` function. **Note:** The order of operations is very important in this
+case. If we used 'select' first, filter would not be able to find the variable
+continent since we would have removed it in the previous step.
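+
+The note on ordering can be checked with a short sketch. The `toy` data frame
+below is made up for illustration, and `tryCatch()` is used only so that the
+failing pipeline does not stop the script.
+
+``` r
+library(dplyr)
+
+# Toy data frame with invented values
+toy <- data.frame(
+  country   = c("A", "B", "C"),
+  continent = c("X", "X", "Y"),
+  lifeExp   = c(50, 60, 70)
+)
+
+# filter() first, while the `continent` column is still available
+ok <- toy %>%
+  filter(continent == "X") %>%
+  select(country, lifeExp)
+
+# select() first drops `continent`, so the later filter() fails;
+# tryCatch() returns the error object instead of halting
+bad <- tryCatch(
+  toy %>%
+    select(country, lifeExp) %>%
+    filter(continent == "X"),
+  error = function(e) e
+)
+
+nrow(ok)
+inherits(bad, "error")
+```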
+
+## Using group\_by()
+
+Now, we were supposed to be reducing the error-prone repetitiveness of what can
+be done with base R, but up to now we haven't done that since we would have to
+repeat the above for each continent. Instead of `filter()`, which will only pass
+observations that meet your criteria (in the above: `continent=="Europe"`), we
+can use `group_by()`, which will essentially use every unique criterion that you
+could have used in filter.
+
+
+``` r
+str(gapminder)
+```
+
+``` output
+'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
+```
+
+``` r
+str(gapminder %>% group_by(continent))
+```
+
+``` output
+gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
+ $ country : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
+ - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
+ ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
+ ..$ .rows : list [1:5]
+ .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
+ .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
+ .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
+ .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
+ .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
+ .. ..@ ptype: int(0)
+ ..- attr(*, ".drop")= logi TRUE
+```
+
+You will notice that the structure of the data frame where we used `group_by()`
+(`grouped_df`) is not the same as the original `gapminder` (`data.frame`). A
+`grouped_df` can be thought of as a `list` where each item in the `list` is a
+`data.frame` which contains only the rows that correspond to a particular
+value of `continent` (at least in the example above).
+
+![](fig/13-dplyr-fig2.png){alt='Diagram illustrating how the group by function organizes a data frame into groups'}
+
+## Using summarize()
+
+The above was a bit on the uneventful side but `group_by()` is much more
+exciting in conjunction with `summarize()`. This will allow us to create new
+variable(s) by using functions that repeat for each of the continent-specific
+data frames. That is to say, using the `group_by()` function, we split our
+original data frame into multiple pieces, then we can run functions
+(e.g. `mean()` or `sd()`) within `summarize()`.
+
+
+``` r
+gdp_bycontinents <- gapminder %>%
+ group_by(continent) %>%
+ summarize(mean_gdpPercap = mean(gdpPercap))
+```
+
+![](fig/13-dplyr-fig3.png){alt='Diagram illustrating the use of group by and summarize together to create a new variable'}
+
+
+``` output
+continent mean_gdpPercap
+
+1 Africa 2193.755
+2 Americas 7136.110
+3 Asia 7902.150
+4 Europe 14469.476
+5 Oceania 18621.609
+```
+
+That allowed us to calculate the mean gdpPercap for each continent, but it gets
+even better.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+Calculate the average life expectancy per country. Which has the longest average life
+expectancy and which has the shortest average life expectancy?
+
+::::::::::::::: solution
+
+## Solution to Challenge 2
+
+
+``` r
+lifeExp_bycountry <- gapminder %>%
+ group_by(country) %>%
+ summarize(mean_lifeExp = mean(lifeExp))
+lifeExp_bycountry %>%
+ filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))
+```
+
+``` output
+# A tibble: 2 × 2
+ country mean_lifeExp
+
+1 Iceland 76.5
+2 Sierra Leone 36.8
+```
+
+Another way to do this is to use the `dplyr` function `arrange()`, which
+arranges the rows in a data frame according to the order of one or more
+variables from the data frame. It has similar syntax to other functions from
+the `dplyr` package. You can use `desc()` inside `arrange()` to sort in
+descending order.
+
+
+``` r
+lifeExp_bycountry %>%
+ arrange(mean_lifeExp) %>%
+ head(1)
+```
+
+``` output
+# A tibble: 1 × 2
+ country mean_lifeExp
+
+1 Sierra Leone 36.8
+```
+
+``` r
+lifeExp_bycountry %>%
+ arrange(desc(mean_lifeExp)) %>%
+ head(1)
+```
+
+``` output
+# A tibble: 1 × 2
+ country mean_lifeExp
+
+1 Iceland 76.5
+```
+
+Alphabetical order works too:
+
+
+``` r
+lifeExp_bycountry %>%
+ arrange(desc(country)) %>%
+ head(1)
+```
+
+``` output
+# A tibble: 1 × 2
+ country mean_lifeExp
+
+1 Zimbabwe 52.7
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+The function `group_by()` allows us to group by multiple variables. Let's group by `year` and `continent`.
+
+
+``` r
+gdp_bycontinents_byyear <- gapminder %>%
+ group_by(continent, year) %>%
+ summarize(mean_gdpPercap = mean(gdpPercap))
+```
+
+``` output
+`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+```
+
+That is already quite powerful, but it gets even better! You're not limited to defining 1 new variable in `summarize()`.
+
+
+``` r
+gdp_pop_bycontinents_byyear <- gapminder %>%
+ group_by(continent, year) %>%
+ summarize(mean_gdpPercap = mean(gdpPercap),
+ sd_gdpPercap = sd(gdpPercap),
+ mean_pop = mean(pop),
+ sd_pop = sd(pop))
+```
+
+``` output
+`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+```
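+
+The message above is informational: after summarizing by two variables, the
+result is still grouped by the first one. The sketch below, which assumes a
+small made-up data frame (`toy`), shows one way to silence the message and get
+an ungrouped result by setting the `.groups` argument of `summarize()`.
+
+``` r
+library(dplyr)
+
+# Toy data frame with invented values
+toy <- data.frame(
+  continent = c("X", "X", "Y", "Y"),
+  year      = c(2002L, 2007L, 2002L, 2007L),
+  gdpPercap = c(1000, 1200, 3000, 3300)
+)
+
+# Default behaviour: the result stays grouped by `continent`
+still_grouped <- toy %>%
+  group_by(continent, year) %>%
+  summarize(mean_gdpPercap = mean(gdpPercap))
+
+# .groups = "drop" returns an ungrouped result and suppresses the message
+ungrouped <- toy %>%
+  group_by(continent, year) %>%
+  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")
+
+group_vars(still_grouped)
+is_grouped_df(ungrouped)
+```
+
+Dropping the grouping explicitly can help avoid surprises when later steps,
+such as `mutate()`, would otherwise be applied group by group.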
+
+## count() and n()
+
+A very common operation is to count the number of observations for each
+group. The `dplyr` package comes with two related functions that help with this.
+
+For instance, if we wanted to check the number of countries included in the
+dataset for the year 2002, we can use the `count()` function. It takes the name
+of one or more columns that contain the groups we are interested in, and we can
+optionally sort the results in descending order by adding `sort=TRUE`:
+
+
+``` r
+gapminder %>%
+ filter(year == 2002) %>%
+ count(continent, sort = TRUE)
+```
+
+``` output
+ continent n
+1 Africa 52
+2 Asia 33
+3 Europe 30
+4 Americas 25
+5 Oceania 2
+```
+
+If we need to use the number of observations in calculations, the `n()` function
+is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of the life expectancy per continent:
+
+
+``` r
+gapminder %>%
+ group_by(continent) %>%
+ summarize(se_le = sd(lifeExp)/sqrt(n()))
+```
+
+``` output
+# A tibble: 5 × 2
+ continent se_le
+
+1 Africa 0.366
+2 Americas 0.540
+3 Asia 0.596
+4 Europe 0.286
+5 Oceania 0.775
+```
+
+You can also chain together several summary operations; in this case calculating the `minimum`, `maximum`, `mean` and `se` of each continent's per-country life-expectancy:
+
+
+``` r
+gapminder %>%
+ group_by(continent) %>%
+ summarize(
+ mean_le = mean(lifeExp),
+ min_le = min(lifeExp),
+ max_le = max(lifeExp),
+ se_le = sd(lifeExp)/sqrt(n()))
+```
+
+``` output
+# A tibble: 5 × 5
+ continent mean_le min_le max_le se_le
+
+1 Africa 48.9 23.6 76.4 0.366
+2 Americas 64.7 37.6 80.7 0.540
+3 Asia 60.1 28.8 82.6 0.596
+4 Europe 71.9 43.6 81.8 0.286
+5 Oceania 74.3 69.1 81.2 0.775
+```
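+
+As a sanity check on how `count()` and `n()` relate, the sketch below uses a
+made-up data frame (`toy`, with invented values) to show that `count()` gives
+the same group sizes as `n()` inside `summarize()`, and that the standard error
+computed with `n()` matches a base R calculation for one group.
+
+``` r
+library(dplyr)
+
+# Toy data frame with invented values
+toy <- data.frame(
+  continent = c("X", "X", "X", "Y", "Y"),
+  lifeExp   = c(50, 60, 70, 80, 82)
+)
+
+# Group sizes and standard errors using n()
+by_n <- toy %>%
+  group_by(continent) %>%
+  summarize(n = n(), se_le = sd(lifeExp) / sqrt(n()))
+
+# Group sizes using count()
+by_count <- toy %>% count(continent)
+
+# Base R calculation of the standard error for group "X"
+x <- toy$lifeExp[toy$continent == "X"]
+manual_se <- sd(x) / sqrt(length(x))
+
+all.equal(by_n$n, by_count$n)
+all.equal(by_n$se_le[1], manual_se)
+```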
+
+## Using mutate()
+
+We can also create new variables prior to (or even after) summarizing information using `mutate()`.
+
+
+``` r
+gdp_pop_bycontinents_byyear <- gapminder %>%
+ mutate(gdp_billion = gdpPercap*pop/10^9) %>%
+ group_by(continent,year) %>%
+ summarize(mean_gdpPercap = mean(gdpPercap),
+ sd_gdpPercap = sd(gdpPercap),
+ mean_pop = mean(pop),
+ sd_pop = sd(pop),
+ mean_gdp_billion = mean(gdp_billion),
+ sd_gdp_billion = sd(gdp_billion))
+```
+
+``` output
+`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+```
+
+## Connect mutate with logical filtering: ifelse
+
+When creating new variables, we can combine this with a logical condition. A simple combination of
+`mutate()` and `ifelse()` facilitates filtering right where it is needed: at the moment of creating something new.
+This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension
+of the data frame will not change) or of updating values depending on the given condition.
+
+
+``` r
+## keeping all data but "filtering" after a certain condition
+# calculate GDP only for observations with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+ mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+ group_by(continent, year) %>%
+ summarize(mean_gdpPercap = mean(gdpPercap),
+ sd_gdpPercap = sd(gdpPercap),
+ mean_pop = mean(pop),
+ sd_pop = sd(pop),
+ mean_gdp_billion = mean(gdp_billion),
+ sd_gdp_billion = sd(gdp_billion))
+```
+
+``` output
+`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+```
+
+``` r
+## updating only if a certain condition is fulfilled
+# for life expectancies above 40 years, the GDP to be expected in the future is scaled
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+ mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+ group_by(continent, year) %>%
+ summarize(mean_gdpPercap = mean(gdpPercap),
+ mean_gdpPercap_expected = mean(gdp_futureExpectation))
+```
+
+``` output
+`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+```
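+
+One consequence of replacing values with `NA` is worth keeping in mind: `mean()`
+and `sd()` return `NA` for any group that contains a missing value unless you
+pass `na.rm = TRUE`. The sketch below illustrates this on a made-up data frame
+(`toy`, with invented values); whether dropping the missing values is
+appropriate depends on your analysis.
+
+``` r
+library(dplyr)
+
+# Toy data frame with invented values; the first row has lifeExp below 25
+toy <- data.frame(
+  continent = c("X", "X", "Y", "Y"),
+  lifeExp   = c(20, 50, 60, 70),
+  gdpPercap = c(500, 1000, 2000, 3000),
+  pop       = c(1e6, 2e6, 3e6, 4e6)
+)
+
+res <- toy %>%
+  # rows that fail the condition get NA instead of a GDP value
+  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+  group_by(continent) %>%
+  summarize(
+    mean_with_na = mean(gdp_billion),               # NA if any value is NA
+    mean_na_rm   = mean(gdp_billion, na.rm = TRUE)  # ignores the NA values
+  )
+
+res
+```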
+
+## Combining `dplyr` and `ggplot2`
+
+First install and load ggplot2:
+
+
+``` r
+install.packages('ggplot2')
+```
+
+
+``` r
+library("ggplot2")
+```
+
+In the plotting lesson we looked at how to make a multi-panel figure by adding
+a layer of facet panels using `ggplot2`. Here is the code we used (with some
+extra comments):
+
+
+``` r
+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+ geom_line() +
+ facet_wrap( ~ country) +
+ theme(axis.text.x = element_text(angle = 45))
+```
+
+
+
+This code makes the right plot but it also creates an intermediate variable
+(`americas`) that we might not have any other uses for. Just as we used
+`%>%` to pipe data along a chain of `dplyr` functions we can use it to pass data
+to `ggplot()`. Because `%>%` replaces the first argument in a function we don't
+need to specify the `data =` argument in the `ggplot()` function. By combining
+`dplyr` and `ggplot2` functions we can make the same figure without creating any
+new variables or modifying the data.
+
+
+``` r
+gapminder %>%
+ # Filter countries located in the Americas
+ filter(continent == "Americas") %>%
+ # Make the plot
+ ggplot(mapping = aes(x = year, y = lifeExp)) +
+ geom_line() +
+ facet_wrap( ~ country) +
+ theme(axis.text.x = element_text(angle = 45))
+```
+
+
+
+More examples of using the function `mutate()` and the `ggplot2` package.
+
+
+``` r
+gapminder %>%
+ # extract first letter of country name into new column
+ mutate(startsWith = substr(country, 1, 1)) %>%
+ # only keep countries starting with A or Z
+ filter(startsWith %in% c("A", "Z")) %>%
+ # plot lifeExp into facets
+ ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+ geom_line() +
+ facet_wrap(vars(country)) +
+ theme_minimal()
+```
+
+
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Advanced Challenge
+
+Calculate the average life expectancy in 2002 of 2 randomly selected countries
+for each continent. Then arrange the continent names in reverse order.
+**Hint:** Use the `dplyr` functions `arrange()` and `sample_n()`; they have
+similar syntax to other dplyr functions.
+
+::::::::::::::: solution
+
+## Solution to Advanced Challenge
+
+
+``` r
+lifeExp_2countries_bycontinents <- gapminder %>%
+ filter(year==2002) %>%
+ group_by(continent) %>%
+ sample_n(2) %>%
+ summarize(mean_lifeExp=mean(lifeExp)) %>%
+ arrange(desc(continent))
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Other great resources
+
+- [R for Data Science](https://r4ds.hadley.nz/) (online book)
+- [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) (pdf file)
+- [Introduction to dplyr](https://dplyr.tidyverse.org/) (online documentation)
+- [Data wrangling with R and RStudio](https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/) (online video)
+- [Tidyverse Skills for Data Science](https://jhudatascience.org/tidyversecourse/) (online book)
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use the `dplyr` package to manipulate data frames.
+- Use `select()` to choose variables from a data frame.
+- Use `filter()` to choose data based on values.
+- Use `group_by()` and `summarize()` to work with subsets of data.
+- Use `mutate()` to create new variables.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/13-tidyr.md b/13-tidyr.md
new file mode 100644
index 000000000..9c49eb6b3
--- /dev/null
+++ b/13-tidyr.md
@@ -0,0 +1,605 @@
+---
+title: Data Frame Manipulation with tidyr
+teaching: 30
+exercises: 15
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- To understand the concepts of 'longer' and 'wider' data frame formats and be able to convert between them with `tidyr`.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I change the layout of a data frame?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+Researchers often want to reshape their data frames from 'wide' to 'longer'
+layouts, or vice-versa. The 'long' layout or format is where:
+
+- each column is a variable
+- each row is an observation
+
+In the purely 'long' (or 'longest') format, you usually have 1 column for the observed variable and the other columns are ID variables.
+
+For the 'wide' format each row is often a site/subject/patient and you have
+multiple observation variables containing the same type of data. These can be
+either repeated observations over time, or observation of multiple variables (or
+a mix of both). You may find data input may be simpler or some other
+applications may prefer the 'wide' format. However, many of `R`'s functions have
+been designed assuming you have 'longer' formatted data. This tutorial will help you
+efficiently transform your data shape regardless of original format.
+
+![](fig/14-tidyr-fig1.png){alt='Diagram illustrating the difference between a wide versus long layout of a data frame'}
+
+Long and wide data frame layouts mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due
+to its shape. However, the long format is more machine readable and is closer
+to the formatting of databases. The ID variables in our data frames are similar to
+the fields in a database and observed variables are like the database values.
+
+## Getting started
+
+First install the packages if you haven't already done so (you probably
+installed dplyr in the previous lesson):
+
+
+``` r
+#install.packages("tidyr")
+#install.packages("dplyr")
+```
+
+Load the packages:
+
+
+``` r
+library("tidyr")
+library("dplyr")
+```
+
+First, let's look at the structure of our original gapminder data frame:
+
+
+``` r
+str(gapminder)
+```
+
+``` output
+'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Is gapminder a purely long, purely wide, or some intermediate format?
+
+::::::::::::::: solution
+
+## Solution to Challenge 1
+
+The original gapminder data.frame is in an intermediate format. It is not
+purely long since it has multiple observation variables
+(`pop`,`lifeExp`,`gdpPercap`).
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Sometimes, as with the gapminder dataset, we have multiple types of observed
+data. It is somewhere in between the purely 'long' and 'wide' data formats. We
+have 3 "ID variables" (`continent`, `country`, `year`) and 3 "Observation
+variables" (`pop`,`lifeExp`,`gdpPercap`). This intermediate format can be
+preferred despite not having ALL observations in 1 column given that all 3
+observation variables have different units. There are few operations that would
+need us to make this data frame any longer (i.e. 4 ID variables and 1
+Observation variable).
+
+While using many of the functions in R, which are often vector based, you
+usually do not want to do mathematical operations on values with different
+units. For example, using the purely long format, a single mean for all of the
+values of population, life expectancy, and GDP would not be meaningful since it
+would return the mean of values with 3 incompatible units. The solution is that
+we first manipulate the data either by grouping (see the lesson on `dplyr`), or
+we change the structure of the data frame. **Note:** Some plotting functions in
+R actually work better in the wide format data.
+
+## From wide to long format with pivot\_longer()
+
+Until now, we've been using the nicely formatted original gapminder dataset, but
+'real' data (i.e. our own research data) will never be so well organized. Here
+let's start with the wide formatted version of the gapminder dataset.
+
+> Download the wide version of the gapminder data from [this link to a csv file](data/gapminder_wide.csv)
+> and save it in your data folder.
+
+We'll load the data file and look at it. Note: we don't want our continent and
+country columns to be factors, so we use the stringsAsFactors argument for
+`read.csv()` to disable that.
+
+
+``` r
+gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
+str(gap_wide)
+```
+
+``` output
+'data.frame': 142 obs. of 38 variables:
+ $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
+ $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002 : int 31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
+ $ pop_2007 : int 33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
+```
+
+![](fig/14-tidyr-fig2.png){alt='Diagram illustrating the wide format of the gapminder data frame'}
+
+To change this very wide data frame layout back to our nice, intermediate (or longer) layout, we will use one of the two available `pivot` functions from the `tidyr` package. To convert from wide to a longer format, we will use the `pivot_longer()` function. `pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns, or 'lengthening' your observation variables into a single variable.
+
+![](fig/14-tidyr-fig3.png){alt='Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format'}
+
+
+``` r
+gap_long <- gap_wide %>%
+ pivot_longer(
+ cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
+ names_to = "obstype_year", values_to = "obs_values"
+ )
+str(gap_long)
+```
+
+``` output
+tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
+ $ obs_values : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...
+```
+
+Here we have used piping syntax which is similar to what we were doing in the
+previous lesson with dplyr. In fact, these are compatible and you can use a mix
+of tidyr and dplyr functions by piping them together.
+
+We first provide to `pivot_longer()` a vector of column names that will be
+pivoted into longer format. We could type out all the observation variables, but
+as in the `select()` function (see `dplyr` lesson), we can use the `starts_with()`
+helper to select all variables that start with the desired character string.
+`pivot_longer()` also allows the alternative syntax of using the `-` symbol to
+identify which variables are not to be pivoted (i.e. ID variables).
+
+The next arguments to `pivot_longer()` are `names_to` for naming the column that
+will contain the new ID variable (`obstype_year`) and `values_to` for naming the
+new amalgamated observation variable (`obs_values`). We supply these new column
+names as strings.
+
+![](fig/14-tidyr-fig4.png){alt='Diagram illustrating the long format of the gapminder data'}
+
+
+``` r
+gap_long <- gap_wide %>%
+ pivot_longer(
+ cols = c(-continent, -country),
+ names_to = "obstype_year", values_to = "obs_values"
+ )
+str(gap_long)
+```
+
+``` output
+tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values : num [1:5112] 2449 3014 2551 3247 4183 ...
+```
+
+That may seem trivial with this particular data frame, but sometimes you have 1
+ID variable and 40 observation variables with irregular variable names. The
+flexibility is a huge time saver!
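+
+As a minimal sketch of that situation, the hypothetical `toy_wide` data frame below (its column names and values are invented for illustration) has one ID variable and three irregularly named observation variables. Excluding the ID column with `-` pivots all the others without having to name them:
+
+``` r
+library(tidyr)
+
+# Invented toy data: one ID column and irregularly named observation columns
+toy_wide <- data.frame(
+  site = c("A", "B"),
+  temp.morning = c(11.2, 9.8),
+  TempNoon = c(18.5, 17.1),
+  evening_t = c(14.0, 12.6)
+)
+
+# Pivot every column except the ID column `site`
+toy_long <- pivot_longer(
+  toy_wide,
+  cols = -site,
+  names_to = "measurement",
+  values_to = "value"
+)
+toy_long
+```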
+
+Now `obstype_year` actually contains 2 pieces of information, the observation
+type (`pop`, `lifeExp`, or `gdpPercap`) and the `year`. We can use the
+`separate()` function to split the character strings into multiple variables.
+
+
+``` r
+gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
+```
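+
+As an aside, `pivot_longer()` can also split the column names while pivoting, which would fold the `separate()` and `as.integer()` steps into a single call. The sketch below uses a small invented `toy_wide` data frame rather than `gap_wide`, and assumes a version of `tidyr` recent enough to provide the `names_sep` and `names_transform` arguments:
+
+``` r
+library(tidyr)
+
+# Invented toy data that mimics the naming pattern of gap_wide
+toy_wide <- data.frame(
+  country = c("X", "Y"),
+  pop_1952 = c(100, 200),
+  pop_1957 = c(110, 210),
+  lifeExp_1952 = c(40.5, 50.2),
+  lifeExp_1957 = c(42.1, 51.9)
+)
+
+# Split each column name at "_" into two new columns while pivoting,
+# and convert the year part to integer on the way
+toy_long <- pivot_longer(
+  toy_wide,
+  cols = -country,
+  names_to = c("obs_type", "year"),
+  names_sep = "_",
+  names_transform = list(year = as.integer),
+  values_to = "obs_values"
+)
+str(toy_long)
+```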
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+Using `gap_long`, calculate the mean life expectancy, population, and gdpPercap for each continent.
+**Hint:** use the `group_by()` and `summarize()` functions we learned in the `dplyr` lesson.
+
+::::::::::::::: solution
+
+## Solution to Challenge 2
+
+
+``` r
+gap_long %>% group_by(continent, obs_type) %>%
+ summarize(means=mean(obs_values))
+```
+
+``` output
+`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+```
+
+``` output
+# A tibble: 15 × 3
+# Groups: continent [5]
+ continent obs_type means
+
+ 1 Africa gdpPercap 2194.
+ 2 Africa lifeExp 48.9
+ 3 Africa pop 9916003.
+ 4 Americas gdpPercap 7136.
+ 5 Americas lifeExp 64.7
+ 6 Americas pop 24504795.
+ 7 Asia gdpPercap 7902.
+ 8 Asia lifeExp 60.1
+ 9 Asia pop 77038722.
+10 Europe gdpPercap 14469.
+11 Europe lifeExp 71.9
+12 Europe pop 17169765.
+13 Oceania gdpPercap 18622.
+14 Oceania lifeExp 74.3
+15 Oceania pop 8874672.
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## From long to intermediate format with pivot\_wider()
+
+It is always good to check work. So, let's use the second `pivot` function, `pivot_wider()`, to 'widen' our observation variables back out. `pivot_wider()` is the opposite of `pivot_longer()`, making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use `pivot_wider()` to pivot or reshape our `gap_long` to the original intermediate format or the widest format. Let's start with the intermediate format.
+
+The `pivot_wider()` function takes `names_from` and `values_from` arguments.
+
+To `names_from` we supply the column name whose contents will be pivoted into new
+output columns in the widened data frame. The corresponding values will be added
+from the column named in the `values_from` argument.
+
+
+``` r
+gap_normal <- gap_long %>%
+ pivot_wider(names_from = obs_type, values_from = obs_values)
+dim(gap_normal)
+```
+
+``` output
+[1] 1704 6
+```
+
+``` r
+dim(gapminder)
+```
+
+``` output
+[1] 1704 6
+```
+
+``` r
+names(gap_normal)
+```
+
+``` output
+[1] "continent" "country" "year" "gdpPercap" "lifeExp" "pop"
+```
+
+``` r
+names(gapminder)
+```
+
+``` output
+[1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
+```
+
+Now we've got an intermediate data frame `gap_normal` with the same dimensions as
+the original `gapminder`, but the order of the variables is different. Let's fix
+that before checking if they are `all.equal()`.
+
+
+``` r
+gap_normal <- gap_normal[, names(gapminder)]
+all.equal(gap_normal, gapminder)
+```
+
+``` output
+[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"
+[3] "Component \"country\": 1704 string mismatches"
+[4] "Component \"pop\": Mean relative difference: 1.634504"
+[5] "Component \"continent\": 1212 string mismatches"
+[6] "Component \"lifeExp\": Mean relative difference: 0.203822"
+[7] "Component \"gdpPercap\": Mean relative difference: 1.162302"
+```
+
+``` r
+head(gap_normal)
+```
+
+``` output
+# A tibble: 6 × 6
+ country year pop continent lifeExp gdpPercap
+
+1 Algeria 1952 9279525 Africa 43.1 2449.
+2 Algeria 1957 10270856 Africa 45.7 3014.
+3 Algeria 1962 11000948 Africa 48.3 2551.
+4 Algeria 1967 12760499 Africa 51.4 3247.
+5 Algeria 1972 14760787 Africa 54.5 4183.
+6 Algeria 1977 17152804 Africa 58.0 4910.
+```
+
+``` r
+head(gapminder)
+```
+
+``` output
+ country year pop continent lifeExp gdpPercap
+1 Afghanistan 1952 8425333 Asia 28.801 779.4453
+2 Afghanistan 1957 9240934 Asia 30.332 820.8530
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+4 Afghanistan 1967 11537966 Asia 34.020 836.1971
+5 Afghanistan 1972 13079460 Asia 36.088 739.9811
+6 Afghanistan 1977 14880372 Asia 38.438 786.1134
+```
+
+We're almost there. The original was sorted by `country`, then
+`year`, so let's sort `gap_normal` the same way.
+
+
+``` r
+gap_normal <- gap_normal %>% arrange(country, year)
+all.equal(gap_normal, gapminder)
+```
+
+``` output
+[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"
+```
+
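+That's great! The only remaining differences are in the `class` attribute:
+`gap_normal` is a tibble, while `gapminder` is a plain data frame. We've gone
+from the longest format back to the intermediate and we
+didn't introduce any errors in our code.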
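+
+To see that this kind of message is only about the container type, here is a small sketch with invented data (`df` and `tb` are hypothetical names): a tibble and a plain data frame holding the same values differ in their `class` attribute, and converting the tibble with `as.data.frame()` removes that difference.
+
+``` r
+library(tibble)
+
+# The same invented values stored as a plain data frame and as a tibble
+df <- data.frame(x = 1:3, y = c("a", "b", "c"))
+tb <- as_tibble(df)
+
+class(df)  # a single class: "data.frame"
+class(tb)  # three classes: "tbl_df", "tbl", "data.frame"
+
+# Compare again after converting the tibble back to a plain data frame
+same <- all.equal(as.data.frame(tb), df)
+same
+```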
+
+Now let's convert the long all the way back to the wide. In the wide format, we
+will keep country and continent as ID variables and pivot the observations
+across the 3 metrics (`pop`,`lifeExp`,`gdpPercap`) and time (`year`). First we
+need to create appropriate labels for all our new variables (time\*metric
+combinations) and we also need to unify our ID variables to simplify the process
+of defining `gap_wide`.
+
+
+``` r
+gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_")
+str(gap_temp)
+```
+
+``` output
+tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ var_ID : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ obs_type : chr [1:5112] "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
+ $ year : int [1:5112] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...
+```
+
+``` r
+gap_temp <- gap_long %>%
+ unite(ID_var, continent, country, sep = "_") %>%
+ unite(var_names, obs_type, year, sep = "_")
+str(gap_temp)
+```
+
+``` output
+tibble [5,112 × 3] (S3: tbl_df/tbl/data.frame)
+ $ ID_var : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ var_names : chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...
+```
+
+Using `unite()` we now have a single ID variable which is a combination of
+`continent` and `country`, and we have defined variable names. We're now ready to
+pipe in `pivot_wider()`.
+
+
+``` r
+gap_wide_new <- gap_long %>%
+ unite(ID_var, continent, country, sep = "_") %>%
+ unite(var_names, obs_type, year, sep = "_") %>%
+ pivot_wider(names_from = var_names, values_from = obs_values)
+str(gap_wide_new)
+```
+
+``` output
+tibble [142 × 37] (S3: tbl_df/tbl/data.frame)
+ $ ID_var : chr [1:142] "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952 : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957 : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962 : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967 : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972 : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977 : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982 : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987 : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992 : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997 : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002 : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007 : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952 : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957 : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962 : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967 : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972 : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977 : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982 : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987 : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992 : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997 : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002 : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007 : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Take this one step further and create a `gap_ludicrously_wide` data frame by pivoting over countries, years and the 3 metrics.
+**Hint** this new data frame should only have 5 rows.
+
+::::::::::::::: solution
+
+## Solution to Challenge 3
+
+
+``` r
+gap_ludicrously_wide <- gap_long %>%
+ unite(var_names, obs_type, year, country, sep = "_") %>%
+ pivot_wider(names_from = var_names, values_from = obs_values)
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Now we have a great 'wide' format data frame, but the `ID_var` could be more
+usable. Let's separate it into 2 variables with `separate()`.
+
+
+``` r
+# Option 1: split the ID column of the wide data frame we already built
+gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep = "_")
+
+# Option 2: do everything in a single pipeline, starting from gap_long
+gap_wide_betterID <- gap_long %>%
+  unite(ID_var, continent, country, sep = "_") %>%
+  unite(var_names, obs_type, year, sep = "_") %>%
+  pivot_wider(names_from = var_names, values_from = obs_values) %>%
+  separate(ID_var, c("continent", "country"), sep = "_")
+str(gap_wide_betterID)
+```
+
+``` output
+tibble [142 × 38] (S3: tbl_df/tbl/data.frame)
+ $ continent : chr [1:142] "Africa" "Africa" "Africa" "Africa" ...
+ $ country : chr [1:142] "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952 : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957 : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962 : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967 : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972 : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977 : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982 : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987 : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992 : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997 : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002 : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007 : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952 : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957 : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962 : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967 : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972 : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977 : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982 : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987 : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992 : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997 : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002 : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007 : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
+```
+
+``` r
+all.equal(gap_wide, gap_wide_betterID)
+```
+
+``` output
+[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"
+```
+
+There and back again!
+
+## Other great resources
+
+- [R for Data Science](https://r4ds.hadley.nz/) (online book)
+- [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) (pdf file)
+- [Introduction to tidyr](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) (online documentation)
+- [Data wrangling with R and RStudio](https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/) (online video)
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use the `tidyr` package to change the layout of data frames.
+- Use `pivot_longer()` to go from wide to longer layout.
+- Use `pivot_wider()` to go from long to wider layout.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/14-knitr-markdown.md b/14-knitr-markdown.md
new file mode 100644
index 000000000..aebe2b467
--- /dev/null
+++ b/14-knitr-markdown.md
@@ -0,0 +1,475 @@
+---
+title: Producing Reports With knitr
+teaching: 60
+exercises: 15
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Understand the value of writing reproducible reports
+- Learn how to recognise and compile the basic components of an R Markdown file
+- Become familiar with R code chunks, and understand their purpose, structure and options
+- Demonstrate the use of inline chunks for weaving R outputs into text blocks, for example when discussing the results of some calculations
+- Be aware of alternative output formats to which an R Markdown file can be exported
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I integrate software and reports?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+## Data analysis reports
+
+Data analysts tend to write a lot of reports, describing their
+analyses and results, for their collaborators or to document their
+work for future reference.
+
+Many new users begin by first writing a single R script containing all of their
+work, and then share the analysis by emailing the script and various graphs
+as attachments. But this can be cumbersome, requiring a lengthy discussion to
+explain which attachment was which result.
+
+Writing formal reports with Word or [LaTeX](https://www.latex-project.org/)
+can simplify this process by incorporating both the analysis report and output graphs
+into a single document. But tweaking formatting to make figures look correct
+and fixing obnoxious page breaks can be tedious and lead to a lengthy "whack-a-mole"
+game of fixing new mistakes resulting from a single formatting change.
+
+Creating a report as a web page (which is an html file) using R Markdown makes things easier.
+The report can be one long stream, so tall figures that wouldn't ordinarily fit on
+one page can be kept at full size and easier to read, since the reader can simply
+keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend
+more time on your analyses instead of writing reports.
+
+## Literate programming
+
+Ideally, such analysis reports are *reproducible* documents: If an
+error is discovered, or if some additional subjects are added to the
+data, you can just re-compile the report and get the new or corrected
+results rather than having to reconstruct figures, paste them into
+a Word document, and hand-edit various detailed results.
+
+The key R package here is [`knitr`](https://yihui.name/knitr/). It allows you
+to create a document that is a mixture of text and chunks of
+code. When the document is processed by `knitr`, chunks of code will
+be executed, and graphs or other results will be inserted into the final document.
+
+This sort of idea has been called "literate programming".
+
+`knitr` allows you to mix basically any type of text with code from different programming languages, but we recommend that you use `R Markdown`, which mixes Markdown
+with R. [Markdown](https://www.markdownguide.org/) is a light-weight mark-up language for creating web
+pages.
+
+## Creating an R Markdown file
+
+Within RStudio, click File → New File → R Markdown and
+you'll get a dialog box like this:
+
+![](fig/New_R_Markdown.png){alt='Screenshot of the New R Markdown file dialogue box in RStudio'}
+
+You can stick with the default (HTML output), but give it a title.
+
+## Basic components of R Markdown
+
+The initial chunk of text (header) contains instructions for R to specify what kind of document will be created, and the options chosen. You can use the header to give your document a title, author, date, and tell it what type of output you want
+to produce. In this case, we're creating an html document.
+
+```
+---
+title: "Initial R Markdown document"
+author: "Karl Broman"
+date: "April 23, 2015"
+output: html_document
+---
+```
+
+You can delete any of those fields if you don't want them
+included. The double-quotes aren't strictly *necessary* in this case.
+They're mostly needed if you want to include a colon in the title.
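+
+For instance, a title containing a colon (this title is made up for illustration) needs its quotes, because YAML would otherwise try to read the second colon as another `key: value` separator:
+
+```
+---
+title: "Gapminder: an initial report"
+output: html_document
+---
+```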
+
+RStudio creates the document with some example text to get you
+started. Note below that there are chunks like
+
+
+```{r}
+summary(cars)
+```
+
+
+These are chunks of R code that will be executed by `knitr` and replaced
+by their results. More on this later.
+
+## Markdown
+
+Markdown is a system for writing web pages by marking up the text much
+as you would in an email rather than writing html code. The marked-up
+text gets *converted* to html, replacing the marks with the proper
+html code.
+
+For now, let's delete all of the stuff that's there and write a bit of
+markdown.
+
+You make things **bold** using two asterisks, like this: `**bold**`,
+and you make things *italics* by using underscores, like this:
+`_italics_`.
+
+You can make a bulleted list by writing a list with hyphens or
+asterisks, leaving a blank line between the list and any preceding text, like this:
+
+```
+A list:
+
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks
+```
+
+or like this:
+
+```
+A second list:
+
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks
+```
+
+Each will appear as:
+
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks
+
+You can use whatever method you prefer, but *be consistent*. This maintains the
+readability of your code.
+
+You can make a numbered list by just using numbers. You can even use the
+same number over and over if you want:
+
+```
+1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks
+```
+
+This will appear as:
+
+1. bold with double-asterisks
+2. italics with underscores
+3. code-type font with backticks
+
+You can make section headers of different sizes by initiating a line
+with some number of `#` symbols:
+
+```
+# Title
+## Main section
+### Sub-section
+#### Sub-sub section
+```
+
+You *compile* the R Markdown document to an html webpage by clicking
+the "Knit" button in the upper-left.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Create a new R Markdown document. Delete all of the R code chunks
+and write a bit of Markdown (some sections, some italicized
+text, and an itemized list).
+
+Convert the document to a webpage.
+
+::::::::::::::: solution
+
+## Solution to Challenge 1
+
+In RStudio, select File > New file > R Markdown...
+
+Delete the placeholder text and add the following:
+
+```
+# Introduction
+
+## Background on Data
+
+This report uses the *gapminder* dataset, which has columns that include:
+
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+
+## Background on Methods
+
+```
+
+Then click the 'Knit' button on the toolbar to generate an html document (webpage).
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## A bit more Markdown
+
+You can make a hyperlink like this:
+`[Carpentries Home Page](https://carpentries.org/)`.
+
+You can include an image file like this: `![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)`
+
+You can do subscripts (e.g., F~2~) with `F~2~` and superscripts (e.g.,
+F^2^) with `F^2^`.
+
+If you know how to write equations in
+[LaTeX](https://www.latex-project.org/), you can use `$ $` and `$$ $$` to insert math equations, like
+`$E = mc^2$` and
+
+```
+$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$
+```
+
+You can review Markdown syntax by navigating to the
+"Markdown Quick Reference" under the "Help" menu in the
+toolbar at the top of RStudio.
+
+## R code chunks
+
+The real power of Markdown comes from
+mixing markdown with chunks of code. This is R Markdown. When
+processed, the R code will be executed; if it produces figures, the
+figures will be inserted in the final document.
+
+The main code chunks look like this:
+
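+A minimal illustration, using the built-in `cars` dataset and a hypothetical chunk name (`summarize_cars`):
+
+```{r summarize_cars}
+summary(cars)
+```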
+
+
+That is, you place a chunk of R code between \`\`\`{r chunk\_name}
+and \`\`\`. You should give each chunk
+a unique name: names help you to locate errors and, if any graphs are
+produced, their file names are based on the name of the code chunk that
+produced them. You can create code chunks quickly in RStudio using the shortcuts
+Ctrl\+Alt\+I on Windows and Linux, or Cmd\+Option\+I on Mac.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+Add code chunks to:
+
+- Load the ggplot2 package
+- Read the gapminder data
+- Create a plot
+
+::::::::::::::: solution
+
+## Solution to Challenge 2
+
+
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## How things get compiled
+
+When you press the "Knit" button, the R Markdown document is
+processed by [`knitr`](https://yihui.name/knitr) and a plain Markdown
+document is produced (as well as, potentially, a set of figure files): the R code is executed
+and replaced by both the input and the output; if figures are
+produced, links to those figures are included.
+
+The Markdown and figure documents are then processed by the tool
+[`pandoc`](https://pandoc.org/), which converts the Markdown file into an
+html file, with the figures embedded.
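+
+The first of these two steps can be tried by hand. The sketch below is an illustration under stated assumptions, not what the "Knit" button literally runs: it writes a tiny, made-up R Markdown file to a temporary location and calls `knitr::knit()` on it to produce the intermediate Markdown file.
+
+``` r
+library(knitr)
+
+# Temporary input and output files (hypothetical, created only for this demo)
+rmd_file <- tempfile(fileext = ".Rmd")
+md_file  <- tempfile(fileext = ".md")
+
+# A tiny R Markdown document with a single code chunk
+writeLines(c(
+  "# A tiny report",
+  "",
+  "```{r}",
+  "summary(cars$speed)",
+  "```"
+), rmd_file)
+
+# Step 1: knitr runs the chunk and writes plain Markdown
+knit(rmd_file, output = md_file, quiet = TRUE)
+md_lines <- readLines(md_file)
+md_lines
+
+# Step 2 (not run here): pandoc converts the Markdown to html.
+# rmarkdown::render(rmd_file) performs both steps in one call.
+```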
+
+
+
+## Chunk options
+
+There are a variety of options to affect how the code chunks are
+treated. Here are some examples:
+
+- Use `echo=FALSE` to avoid having the code itself shown.
+- Use `results="hide"` to avoid having any results printed.
+- Use `eval=FALSE` to have the code shown but not evaluated.
+- Use `warning=FALSE` and `message=FALSE` to hide any warnings or
+ messages produced.
+- Use `fig.height` and `fig.width` to control the size of the figures
+ produced (in inches).
+
+So you might write:
+
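+A sketch of one possibility (the chunk name `global_options` and the particular values are illustrative assumptions) sets several of these options for all subsequent chunks with `knitr::opts_chunk$set()`:
+
+```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path = "Figs/", message = FALSE, warning = FALSE,
+                      echo = FALSE, results = "hide", fig.width = 11)
+```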
+
+
+The `fig.path` option defines where the figures will be saved. The `/`
+here is really important; without it, the figures would be saved in
+the standard place but just with names that begin with `Figs`.
+
+If you have multiple R Markdown files in a common directory, you might
+want to use `fig.path` to define separate prefixes for the figure file
+names, like `fig.path="Figs/cleaning-"` and `fig.path="Figs/analysis-"`.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Use chunk options to control the size of a figure and to hide the
+code.
+
+::::::::::::::: solution
+
+## Solution to Challenge 3
+
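+One possible approach, sketched with the built-in `cars` dataset and a hypothetical chunk name, hides the code with `echo=FALSE` and sets the figure size in inches:
+
+```{r small_plot, echo=FALSE, fig.width=3, fig.height=3}
+plot(cars)
+```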
+
diff --git a/fig/New_R_Markdown.png b/fig/New_R_Markdown.png
new file mode 100644
index 000000000..8542fe9bd
Binary files /dev/null and b/fig/New_R_Markdown.png differ
diff --git a/fig/bad_layout.png b/fig/bad_layout.png
new file mode 100644
index 000000000..fcfda0c5a
Binary files /dev/null and b/fig/bad_layout.png differ
diff --git a/fig/rmd-06-equality.0.svg b/fig/rmd-06-equality.0.svg
new file mode 100644
index 000000000..9671b0b3e
--- /dev/null
+++ b/fig/rmd-06-equality.0.svg
@@ -0,0 +1,288 @@
+
+
+
+
diff --git a/fig/rmd-06-equality.1.png b/fig/rmd-06-equality.1.png
new file mode 100644
index 000000000..f4152a338
Binary files /dev/null and b/fig/rmd-06-equality.1.png differ
diff --git a/fig/rmd-06-equality.2.png b/fig/rmd-06-equality.2.png
new file mode 100644
index 000000000..e33f4cf4f
Binary files /dev/null and b/fig/rmd-06-equality.2.png differ
diff --git a/fig/software-carpentry-banner.png b/fig/software-carpentry-banner.png
new file mode 100644
index 000000000..746a9c53c
Binary files /dev/null and b/fig/software-carpentry-banner.png differ
diff --git a/fig/visual_mode_icon.png b/fig/visual_mode_icon.png
new file mode 100644
index 000000000..d224e3cee
Binary files /dev/null and b/fig/visual_mode_icon.png differ
diff --git a/index.md b/index.md
new file mode 100644
index 000000000..4606f0759
--- /dev/null
+++ b/index.md
@@ -0,0 +1,37 @@
+---
+site: sandpaper::sandpaper_site
+---
+
+*an introduction to R for non-programmers using gapminder data*
+
+The goal of this lesson is to teach novice programmers to write modular code
+and to follow best practices when using R for data analysis. R is commonly used in many
+scientific disciplines for statistical analysis, and it offers a large array of third-party
+packages. We find that many scientists who come to Software Carpentry workshops
+use R and want to learn more. The emphasis of these materials is to give
+attendees a strong foundation in the fundamentals of R, and to teach best
+practices for scientific computing: breaking down analyses into modular units,
+task automation, and encapsulation.
+
+Note that this workshop will focus on teaching the fundamentals of the
+programming language R, and will not teach statistical analysis.
+
+The lesson contains more material than can be taught in a day. The [instructor notes page](instructors/instructor-notes.md) has some suggested lesson plans suitable for a one-day or half-day workshop.
+
+A variety of third party packages are used throughout this workshop. These
+are not necessarily the best, nor are they comprehensive, but they are
+packages we find useful, and have been chosen primarily for their
+usability.
+
+:::::::::::::::::::::::::::::::::::::::::: prereq
+
+## Prerequisites
+
+Understand that computers store data and instructions (programs, scripts etc.) in files.
+Files are organised in directories (folders).
+Know how to access files not in the working directory by specifying the path.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/instructor-notes.md b/instructor-notes.md
new file mode 100644
index 000000000..83d1925ef
--- /dev/null
+++ b/instructor-notes.md
@@ -0,0 +1,135 @@
+---
+title: Instructor Notes
+---
+
+## Timing
+
+Leave about 30 minutes at the start of each workshop and another 15 mins
+at the start of each session for technical difficulties like WiFi and
+installing things (even if you asked students to install in advance, longer if
+not).
+
+## Lesson Plans
+
+The lesson contains much more material than can be taught in a day.
+Instructors will need to pick an appropriate subset of episodes to use
+in a standard one day course.
+
+Some suggested paths through the material are:
+
+(suggested by [@liz-is](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-276529213))
+
+- 01 Introduction to R and RStudio
+- 04 Data Structures
+- 05 Exploring Data Frames ("Realistic example" section onwards)
+- 08 Creating Publication-Quality Graphics with ggplot2
+- 10 Functions Explained
+- 13 Dataframe Manipulation with dplyr
+- 15 Producing Reports With knitr
+
+(suggested by [@naupaka](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-312547509))
+
+- 01 Introduction to R and RStudio
+- 02 Project Management With RStudio
+- 03 Seeking Help
+- 04 Data Structures
+- 05 Exploring Data Frames
+- 06 Subsetting Data
+- 09 Vectorization
+- 08 Creating Publication-Quality Graphics with ggplot2 *OR*
+ 13 Dataframe Manipulation with dplyr
+- 15 Producing Reports With knitr
+
+A half day course could consist of (suggested by [@karawoo](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-277599864)):
+
+- 01 Introduction to R and RStudio
+- 04 Data Structures (only creating vectors with `c()`)
+- 05 Exploring Data Frames ("Realistic example" section onwards)
+- 06 Subsetting Data (excluding factor, matrix and list subsetting)
+- 08 Creating Publication-Quality Graphics with ggplot2
+
+## Setting up git in RStudio
+
+There can be difficulties linking git to RStudio depending on the
+operating system and the version of the operating system. To make sure
+Git is properly installed and configured, the learners should go to
+the Options window in the RStudio application.
+
+- **Mac OS X:**
+ - Go to RStudio -> Preferences... -> Git/SVN
+ - Check and see whether there is a path to a file in the "Git executable" window. If not, the next challenge is figuring out where Git is located.
+ - In the terminal enter `which git` and you will get a path to the git executable. In the "Git executable" window you may have difficulties finding the directory since OS X hides many of the operating system files. While the file selection window is open, pressing "Command-Shift-G" will pop up a text entry box where you will be able to type or paste in the full path to your git executable: e.g. /usr/bin/git or whatever else it might be.
+- **Windows:**
+ - Go to Tools -> Global options... -> Git/SVN
+ - If you use the Software Carpentry Installer, then 'git.exe' should be installed at `C:/Program Files/Git/bin/git.exe`.
+
+To prevent the learners from having to re-enter their password each time they push a commit to GitHub, this command (which can be run from a bash prompt) will make it so they only have to enter their password once:
+
+```bash
+$ git config --global credential.helper 'cache --timeout=10000000'
+```
+
+## RStudio Color Preview
+
+RStudio has a feature to preview the color for certain named colors and hexadecimal colors. This may confuse or distract learners (and instructors) who are not expecting it.
+
+Mainly, this is likely to come up during the episode on "Data Structures" with the following code block:
+
+```r
+cats <- data.frame(coat = c("calico", "black", "tabby"),
+ weight = c(2.1, 5.0, 3.2),
+ likes_string = c(1, 0, 1))
+```
+
+This option can be turned off and on in the following menu setting:
+Tools -> Global Options -> Code -> Display -> Enable preview of named and hexadecimal colors (under "Syntax")
+
+## Pulling in Data
+
+The easiest way to get the data used in this lesson during a workshop is to have
+attendees download the raw data from [gapminder-data] and
+[gapminder-data-wide].
+
+Attendees can use the `File - Save As` dialog in their browser to save the file.
+
+## Overall
+
+Make sure to emphasize good practices: put code in scripts, and make
+sure they're version controlled. Encourage students to create script
+files for challenges.
+
+If you're working in a cloud environment, get them to upload the
+gapminder data after the second lesson.
+
+Make sure to emphasize that matrices are vectors underneath the hood
+and data frames are lists underneath the hood: this will explain a
+lot of the esoteric behaviour encountered in basic operations.
+
+Vector recycling and function stacks are probably best explained
+with diagrams on a whiteboard.
+
+Be sure to actually go through examples of an R help page: help files
+can be intimidating at first, but knowing how to read them is tremendously
+useful.
+
+Be sure to show the CRAN task views and look at one of the topics.
+
+There's a lot of content: move quickly through the earlier lessons. They are
+extensive mostly so that learners absorb the material by osmosis and recall it
+later when they encounter a problem or some esoteric behaviour.
+
+Key lessons to take time on:
+
+- Data subsetting - conceptually difficult for novices
+- Functions - learners especially struggle with this
+- Data structures - worth being thorough, but you can go through it quickly.
+
+Don't worry about being correct or knowing the material back-to-front. Use
+mistakes as teaching moments: the most vital skill you can impart is how to
+debug and recover from unexpected errors.
+
+[gapminder-data]: data/gapminder_data.csv
+[gapminder-data-wide]: data/gapminder_wide.csv
+
+
+
diff --git a/learner-profiles.md b/learner-profiles.md
new file mode 100644
index 000000000..434e335aa
--- /dev/null
+++ b/learner-profiles.md
@@ -0,0 +1,5 @@
+---
+title: FIXME
+---
+
+This is a placeholder file. Please add content here.
diff --git a/md5sum.txt b/md5sum.txt
new file mode 100644
index 000000000..f846e9521
--- /dev/null
+++ b/md5sum.txt
@@ -0,0 +1,26 @@
+"file" "checksum" "built" "date"
+"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2024-11-19"
+"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2024-11-19"
+"config.yaml" "8b9d63dd3c46d5b4d5fa4219b51a0dfc" "site/built/config.yaml" "2024-11-19"
+"index.md" "86c8fb559b13d1695d55b52dd6cbf574" "site/built/index.md" "2024-11-19"
+"episodes/01-rstudio-intro.Rmd" "04f6b758558750cef962768d78dd63b0" "site/built/01-rstudio-intro.md" "2024-11-19"
+"episodes/02-project-intro.Rmd" "cd60cc3116d4f6be92f03f5cc51bcc3b" "site/built/02-project-intro.md" "2024-11-19"
+"episodes/03-seeking-help.Rmd" "d24c310b8f36930e70379458f3c93461" "site/built/03-seeking-help.md" "2024-11-19"
+"episodes/04-data-structures-part1.Rmd" "afc6c3ced3677ab088457152f8d84b54" "site/built/04-data-structures-part1.md" "2024-11-19"
+"episodes/05-data-structures-part2.Rmd" "95c5dd30b8288090ce89ecbf2d3072bd" "site/built/05-data-structures-part2.md" "2024-11-19"
+"episodes/06-data-subsetting.Rmd" "5d4ce8731ab37ddea81874d63ae1ce86" "site/built/06-data-subsetting.md" "2024-11-19"
+"episodes/07-control-flow.Rmd" "6a8691c8668737e4202f49b52aeb8ac6" "site/built/07-control-flow.md" "2024-11-19"
+"episodes/08-plot-ggplot2.Rmd" "d694904459c32b9e35acd872830ee75c" "site/built/08-plot-ggplot2.md" "2024-11-19"
+"episodes/09-vectorization.Rmd" "e229eb061b3f072a132c4b31bbc2fdb0" "site/built/09-vectorization.md" "2024-11-19"
+"episodes/10-functions.Rmd" "14edd4cf50edb8fefeb987a17d740e1a" "site/built/10-functions.md" "2024-11-19"
+"episodes/11-writing-data.Rmd" "8b26e062dddd2394d00c6847ff0b7505" "site/built/11-writing-data.md" "2024-11-19"
+"episodes/12-dplyr.Rmd" "99b53f3fcaf96a394b950f19f4d5e118" "site/built/12-dplyr.md" "2024-11-19"
+"episodes/13-tidyr.Rmd" "1c59c3bea4cec5e0c47654a546294f07" "site/built/13-tidyr.md" "2024-11-19"
+"episodes/14-knitr-markdown.Rmd" "0c63ce92263a32f19fbec9f7b619b682" "site/built/14-knitr-markdown.md" "2024-11-19"
+"episodes/15-wrap-up.Rmd" "c5ce0d34a37b7a99624ad1d6ac482256" "site/built/15-wrap-up.md" "2024-11-19"
+"instructors/instructor-notes.md" "e61e7587564a6c4c11dbb6beea127764" "site/built/instructor-notes.md" "2024-11-19"
+"learners/discuss.md" "42ad66ab1907e030914dbb2a94376a47" "site/built/discuss.md" "2024-11-19"
+"learners/reference.md" "1cd851cc85adc26ea172da91e8c564f7" "site/built/reference.md" "2024-11-19"
+"learners/setup.md" "f888f8a54b071715c0cf56896e650c00" "site/built/setup.md" "2024-11-19"
+"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2024-11-19"
+"renv/profiles/lesson-requirements/renv.lock" "d4a067fadca2975fc084ccfca6bdae6c" "site/built/renv.lock" "2024-11-19"
diff --git a/reference.md b/reference.md
new file mode 100644
index 000000000..186affe03
--- /dev/null
+++ b/reference.md
@@ -0,0 +1,344 @@
+---
+title: 'Reference'
+---
+
+## Reference
+
+## [Introduction to R and RStudio](episodes/01-rstudio-intro.Rmd)
+
+- Use the escape key to cancel incomplete commands or running code
+ (Ctrl+C if you're using R from the shell).
+- Basic arithmetic operations follow standard order of precedence:
+ - Brackets: `(`, `)`
+ - Exponents: `^` or `**`
+ - Divide: `/`
+ - Multiply: `*`
+ - Add: `+`
+ - Subtract: `-`
+- Scientific notation is available, e.g: `2e-3`
+- Anything to the right of a `#` is a comment; R will ignore it.
+- Functions are denoted by `function_name()`. Expressions inside the
+ brackets are evaluated before being passed to the function, and
+ functions can be nested.
+- Mathematical functions: `exp`, `sin`, `log`, `log10`, `log2` etc.
+- Comparison operators: `<`, `<=`, `>`, `>=`, `==`, `!=`
+- Use `all.equal` to compare numbers!
+- `<-` is the assignment operator. Anything to the right is evaluated, then
+ stored in a variable named to the left.
+- `ls` lists all variables and functions you've created
+- `rm` can be used to remove them
+- When assigning values to function arguments, you *must* use `=`.
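+
+As a minimal sketch of the points above (the variable names are illustrative and not taken from the episode), the following shows why `all.equal` is safer than `==` for numbers, how operator precedence applies, and how `=` names function arguments:
+
+```r
+# `<-` evaluates the right-hand side, then stores the result in `x`.
+x <- 0.1 + 0.2
+
+# Floating-point arithmetic is inexact, so `==` can be misleading.
+exact_match <- x == 0.3                     # FALSE on typical hardware
+near_match <- isTRUE(all.equal(x, 0.3))     # TRUE: equal within a tolerance
+
+# Exponents are evaluated first, then multiplication, then addition.
+result <- 2 + 3 * 2^2                       # 2 + (3 * 4) = 14
+
+# Function arguments are named with `=`.
+rounded <- round(x = 3.14159, digits = 2)
+```
+
+Wrapping `all.equal` in `isTRUE` is a common idiom because `all.equal` returns a description of the difference, rather than `FALSE`, when the values differ.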
+
+## [Project management with RStudio](episodes/02-project-intro.Rmd)
+
+- To create a new project, go to File -> New Project
+- Install the `packrat` package to create self-contained projects
+- `install.packages` to install packages from CRAN
+- `library` to load a package into R
+- `packrat::status` to check whether all packages referenced in your
+ scripts have been installed.
+
+## [Seeking help](episodes/03-seeking-help.Rmd)
+
+- To access help for a function type `?function_name` or `help(function_name)`
+- Use quotes for special operators e.g. `?"+"`
+- Use fuzzy search if you can't remember a name: `??search_term`
+- [CRAN task views](https://cran.at.r-project.org/web/views) are a good starting point.
+- [Stack Overflow](https://stackoverflow.com/) is a good place to get help with your code.
+ - `?dput` will dump data you are working from so others can load it easily.
+ - `sessionInfo()` will give details of your setup that others may need for debugging.
+
+## [Data structures](episodes/04-data-structures-part1.Rmd)
+
+Individual values in R must be one of 5 **data types**; multiple values can be grouped in **data structures**.
+
+**Data types**
+
+- `typeof(object)` gives information about an item's data type.
+
+- There are 5 main data types:
+
+ - `?numeric` real (decimal) numbers
+ - `?integer` whole numbers only
+ - `?character` text
+ - `?complex` complex numbers
+ - `?logical` TRUE or FALSE values
+
+ **Special types:**
+
+ - `?NA` missing values
+ - `?NaN` "not a number" for undefined values (e.g. `0/0`).
+ - `?Inf`, `-Inf` infinity.
+ - `?NULL` a data structure that doesn't exist
+
+ `NA` can occur in any atomic vector. `NaN` and `Inf` can only
+ occur in numeric or complex type vectors. Atomic vectors
+ are the building blocks for all other data structures. A `NULL` value
+ will occur in place of an entire data structure (but can occur as list
+ elements).
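+
+A short sketch of how the special values behave, using made-up variable names; it assumes nothing beyond base R:
+
+```r
+# NA marks a missing value and can sit inside any atomic vector.
+measurements <- c(1.5, NA, 3)
+missing_flags <- is.na(measurements)        # FALSE TRUE FALSE
+
+# NaN and Inf arise from numeric operations.
+undefined <- 0 / 0                          # NaN
+positive_infinity <- 1 / 0                  # Inf
+undefined_is_nan <- is.nan(undefined)       # TRUE
+infinity_is_finite <- is.finite(positive_infinity)  # FALSE
+
+# NULL stands in for a whole structure and has length zero,
+# but it is allowed as an element of a list.
+null_length <- length(NULL)                 # 0
+items <- list(a = 1, b = NULL)
+items_length <- length(items)               # 2
+```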
+
+**Basic data structures in R:**
+
+- atomic `?vector` (can only contain one type)
+- `?list` (containers for other objects)
+- `?data.frame` two dimensional objects whose columns can contain different types of data
+- `?matrix` two dimensional objects that can contain only one type of data.
+- `?factor` vectors that contain predefined categorical data.
+- `?array` multi-dimensional objects that can only contain one type of data
+
+Remember that matrices are really atomic vectors underneath the hood, and that
+data.frames are really lists underneath the hood (this explains some of the weirder
+behaviour of R).
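+
+The claim above can be checked directly. This is a small illustration with hypothetical objects `m` and `df`, not code from the episodes:
+
+```r
+# A matrix is an atomic vector with a `dim` attribute, stored column by column.
+m <- matrix(1:6, nrow = 2)
+matrix_is_atomic <- is.atomic(m)   # TRUE
+matrix_length <- length(m)         # 6: counts every element
+fourth_element <- m[4]             # 4: row 2 of column 2
+
+# A data frame is a list whose elements are equal-length columns.
+df <- data.frame(id = 1:3, name = c("a", "b", "c"))
+frame_is_list <- is.list(df)       # TRUE
+frame_length <- length(df)         # 2: counts columns, not rows
+frame_type <- typeof(df)           # "list"
+```
+
+This is why `length()` on a data frame returns the number of columns, which often surprises learners.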
+
+**[Vectors](episodes/04-data-structures-part1.Rmd)**
+
+- `?vector()` All items in a vector must be the same type.
+- Items can be converted from one type to another using *coercion*.
+- The concatenate function `c()` will append items to a vector.
+- `seq(from=0, to=1, by=1)` will create a sequence of numbers.
+- Items in a vector can be named using the `names()` function.
+
+**[Factors](episodes/04-data-structures-part1.Rmd)**
+
+- `?factor()` Factors are a data structure designed to store categorical data.
+- `levels()` shows the valid values that can be stored in a vector of type factor.
+
+**[Lists](episodes/04-data-structures-part1.Rmd)**
+
+- `?list()` Lists are a data structure designed to store data of different types.
+
+**[Matrices](episodes/04-data-structures-part1.Rmd)**
+
+- `?matrix()` Matrices are a data structure designed to store 2-dimensional data.
+
+**[Data Frames](episodes/05-data-structures-part2.Rmd)**
+
+- `?data.frame` is a key data structure. It is a `list` of `vectors`.
+- `cbind()` will add a column (vector) to a data.frame.
+- `rbind()` will add a row (list) to a data.frame.
+
+**Useful functions for querying data structures:**
+
+- `?str` structure, prints out a summary of the whole data structure
+- `?typeof` tells you the type inside an atomic vector
+- `?class` what is the data structure?
+- `?head` print the first `n` elements (rows for two-dimensional objects)
+- `?tail` print the last `n` elements (rows for two-dimensional objects)
+- `?rownames`, `?colnames`, `?dimnames` retrieve or modify the row names
+ and column names of an object.
+- `?names` retrieve or modify the names of an atomic vector or list (or
+ columns of a data.frame).
+- `?length` get the number of elements in an atomic vector
+- `?nrow`, `?ncol`, `?dim` get the dimensions of an n-dimensional object
+ (won't work on atomic vectors or lists).
+
+## [Exploring Data Frames](episodes/05-data-structures-part2.Rmd)
+
+- `read.csv` to read in data in a regular structure
+ - `sep` argument to specify the separator
+ - "," for comma separated
+ - "\\t" for tab separated
+ - Other arguments:
+ - `header=TRUE` if there is a header row
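+
+A minimal sketch of the `sep` and `header` arguments. It reads from in-memory strings through the `text` argument of `read.csv` so that it runs without any file; the table contents are toy values invented for illustration:
+
+```r
+# Comma-separated text with a header row (toy values).
+csv_text <- "group,count\nA,1\nB,2"
+from_csv <- read.csv(text = csv_text, header = TRUE, sep = ",")
+
+# The same table, tab separated.
+tsv_text <- "group\tcount\nA\t1\nB\t2"
+from_tsv <- read.csv(text = tsv_text, header = TRUE, sep = "\t")
+
+csv_columns <- names(from_csv)     # "group" "count"
+csv_rows <- nrow(from_csv)         # 2
+```
+
+In a workshop you would pass a file path such as `data/gapminder_data.csv` as the first argument instead of `text`.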
+
+## [Subsetting data](episodes/06-data-subsetting.Rmd)
+
+- Elements can be accessed by:
+
+ - Index
+ - Name
+ - Logical vectors
+
+- `[` single square brackets:
+
+ - *extract* single elements or *subset* vectors
+ - e.g. `x[1]` extracts the first item from vector x.
+ - *extract* single elements of a list. The returned value will be another `list()`.
+ - *extract* columns from a data.frame
+
+- `[` with two arguments to:
+
+ - *extract* rows and/or columns of
+ - matrices
+ - data.frames
+ - e.g. `x[1,2]` will extract the value in row 1, column 2.
+ - e.g. `x[,2]` will extract the entire second column of values.
+
+- `[[` double square brackets to extract items from lists.
+
+- `$` to access columns or list elements by name
+
+- negative indices skip elements
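+
+The subsetting forms listed above, sketched on small hypothetical objects (`x`, `lst` and `m` are illustrative names):
+
+```r
+x <- c(a = 10, b = 20, c = 30)
+by_index <- x[1]            # first element, keeps its name "a"
+by_name <- x["b"]           # element named "b"
+by_logical <- x[x > 15]     # elements where the condition is TRUE
+without_first <- x[-1]      # a negative index skips that element
+
+lst <- list(num = 1:3, txt = "hello")
+still_a_list <- lst["num"]  # `[` on a list returns another list
+the_vector <- lst[["num"]]  # `[[` extracts the item itself
+the_text <- lst$txt         # `$` accesses an element by name
+
+m <- matrix(1:6, nrow = 2)
+single_value <- m[1, 2]     # row 1, column 2
+second_column <- m[, 2]     # every row of column 2
+```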
+
+## [Control flow](episodes/07-control-flow.Rmd)
+
+- Use `if` condition to start a conditional statement, `else if` condition to provide
+ additional tests, and `else` to provide a default
+- The bodies of the branches of conditional statements are enclosed in curly braces `{}`; indent them for readability.
+- Use `==` to test for equality.
+- `%in%` will return a `TRUE`/`FALSE` indicating if there is a match between an element and a vector.
+- `X && Y` is only true if both X and Y are `TRUE`.
+- `X || Y` is true if either X or Y, or both, are `TRUE`.
+- Zero is considered `FALSE`; all other numbers are considered `TRUE`
+- Nest loops to operate on multi-dimensional data.
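+
+A brief sketch that combines these constructs; the year values and labels are made up for illustration:
+
+```r
+year <- 2012
+if (year %in% c(2002, 2007)) {
+  label <- "sampled year"
+} else if (year > 2007 && year < 2020) {
+  label <- "recent year"
+} else {
+  label <- "other year"
+}
+
+# Zero is treated as FALSE; any other number is treated as TRUE.
+zero_is_true <- if (0) TRUE else FALSE
+
+# Nested loops visit every cell of a two-dimensional object.
+m <- matrix(1:4, nrow = 2)
+total <- 0
+for (i in seq_len(nrow(m))) {
+  for (j in seq_len(ncol(m))) {
+    total <- total + m[i, j]
+  }
+}
+```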
+
+## [Creating publication quality graphics](episodes/08-plot-ggplot2.Rmd)
+
+- figures can be created with the grammar of graphics:
+ - `library(ggplot2)`
+ - `ggplot` to create the base figure
+ - `aes`thetics specify the data axes, shape, color, and data size
+ - `geom`etry functions specify the type of plot, e.g. `point`, `line`, `density`, `box`
+ - `geom`etry functions also add statistical transforms, e.g. `geom_smooth`
+ - `scale` functions change the mapping from data to aesthetics
+ - `facet` functions stratify the figure into panels
+ - `aes`thetics apply to individual layers, or can be set for the whole plot
+ inside `ggplot`.
+ - `theme` functions change the overall look of the plot
+ - order of layers matters!
+ - `ggsave` to save a figure.
+
+## [Vectorization](episodes/09-vectorization.Rmd)
+
+- Most functions and operations apply to each element of a vector
+- `*` applies element-wise to matrices
+- `%*%` for true matrix multiplication
+- `any()` will return `TRUE` if any element of a vector is `TRUE`
+- `all()` will return `TRUE` if *all* elements of a vector are `TRUE`
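+
+To make the difference between `*` and `%*%` concrete, here is a small sketch with an illustrative 2×2 matrix `a`:
+
+```r
+v <- c(1, 2, 3)
+doubled <- v * 2                 # 2 4 6: applied to each element
+
+a <- matrix(1:4, nrow = 2)       # columns are (1, 2) and (3, 4)
+elementwise <- a * a             # squares each entry
+matrix_product <- a %*% a        # rows times columns
+
+any_large <- any(v > 2)          # TRUE: one element exceeds 2
+all_large <- all(v > 2)          # FALSE: not every element does
+```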
+
+## [Functions explained](episodes/10-functions.Rmd)
+
+- `?"function"`
+- Put code whose parameters change frequently in a function, then call it with
+ different parameter values to customize its behavior.
+- The value of the last expression evaluated in a function is returned, or you can use `return` explicitly
+- Any code written in the body of the function will preferably look for variables defined inside the function.
+- Document Why, then What, then lastly How (if the code isn't self explanatory)
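+
+A minimal sketch of these points; the function and variable names are illustrative assumptions rather than a fixed part of the lesson:
+
+```r
+# Convert a temperature from Fahrenheit to Kelvin.
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  kelvin  # the last evaluated expression is the return value
+}
+
+scale_factor <- 10  # a variable defined outside the function
+
+scale_value <- function(value, scale_factor = 2) {
+  # Inside the body, the argument `scale_factor` is found first,
+  # so the outer `scale_factor` is not used here.
+  return(value * scale_factor)
+}
+
+freezing_point <- fahr_to_kelvin(32)             # 273.15
+default_scaled <- scale_value(3)                 # 6, not 30
+custom_scaled <- scale_value(3, scale_factor = 5)  # 15
+```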
+
+## [Writing data](episodes/11-writing-data.Rmd)
+
+- `write.table` to write out objects in regular format
+- set `quote=FALSE` so that text isn't wrapped in `"` marks
+
+## [Dataframe manipulation with dplyr](episodes/12-dplyr.Rmd)
+
+- `library(dplyr)`
+- `?select` to extract variables by name.
+- `?filter` return rows with matching conditions.
+- `?group_by` group data by one or more variables.
+- `?summarize` summarize multiple values to a single value.
+- `?mutate` add new variables to a data.frame.
+- Combine operations using the `?"%>%"` pipe operator.
+
+## [Dataframe manipulation with tidyr](episodes/13-tidyr.Rmd)
+
+- `library(tidyr)`
+- `?pivot_longer` convert data from *wide* to *long* format.
+- `?pivot_wider` convert data from *long* to *wide* format.
+- `?separate` split a single value into multiple values.
+- `?unite` merge multiple values into a single value.
+
+## [Producing reports with knitr](episodes/14-knitr-markdown.Rmd)
+
+- Value of reproducible reports
+- Basics of Markdown
+- R code chunks
+- Chunk options
+- Inline R code
+- Other output formats
+
+## [Best practices for writing good code](episodes/15-wrap-up.Rmd)
+
+- Program defensively, i.e., assume that errors are going to arise, and write code to detect them when they do.
+- Write tests before writing code in order to help determine exactly what that code is supposed to do.
+- Know what code is supposed to do before trying to debug it.
+- Make it fail every time.
+- Make it fail fast.
+- Change one thing at a time, and for a reason.
+- Keep track of what you've done.
+- Be humble.
+
+## Glossary
+
+[argument]{#argument}
+: A value given to a function or program when it runs.
+The term is often used interchangeably (and inconsistently) with [parameter](#parameter).
+
+[assign]{#assign}
+: To give a value a name by associating a variable with it.
+
+[body]{#body}
+: (of a function): the statements that are executed when a function runs.
+
+[comment]{#comment}
+: A remark in a program that is intended to help human readers understand what is going on,
+but is ignored by the computer.
+Comments in Python, R, and the Unix shell start with a `#` character and run to the end of the line;
+comments in SQL start with `--`,
+and other languages have other conventions.
+
+[comma-separated values]{#comma-separated-values}
+: (CSV) A common textual representation for tables
+in which the values in each row are separated by commas.
+
+[delimiter]{#delimiter}
+: A character or characters used to separate individual values,
+such as the commas between columns in a [CSV](#comma-separated-values) file.
+
+[documentation]{#documentation}
+: Human-language text written to explain what software does,
+how it works, or how to use it.
+
+[floating-point number]{#floating-point-number}
+: A number containing a fractional part and an exponent.
+See also: [integer](#integer).
+
+[for loop]{#for-loop}
+: A loop that is executed once for each value in some kind of set, list, or range.
+See also: [while loop](#while-loop).
+
+[index]{#index}
+: A subscript that specifies the location of a single value in a collection,
+such as a single pixel in an image.
+
+[integer]{#integer}
+: A whole number, such as -12343. See also: [floating-point number](#floating-point-number).
+
+[library]{#library}
+: In R, the directory(ies) where [packages](#package) are stored.
+
+[package]{#package}
+: A collection of R functions, data and compiled code in a well-defined format. Packages are stored in a [library](#library) and loaded using the library() function.
+
+[parameter]{#parameter}
+: A variable named in the function's declaration that is used to hold a value passed into the call.
+The term is often used interchangeably (and inconsistently) with [argument](#argument).
+
+[return statement]{#return-statement}
+: A statement that causes a function to stop executing and return a value to its caller immediately.
+
+[sequence]{#sequence}
+: A collection of information that is presented in a specific order.
+
+[shape]{#shape}
+: An array's dimensions, represented as a vector.
+For example, a 5×3 array's shape is `(5,3)`.
+
+[string]{#string}
+: Short for "character string",
+a [sequence](#sequence) of zero or more characters.
+
+[syntax error]{#syntax-error}
+: A programming error that occurs when statements are in an order or contain characters
+not expected by the programming language.
+
+[type]{#type}
+: The classification of something in a program (for example, the contents of a variable)
+as a kind of number (e.g. [floating-point number](#floating-point-number), [integer](#integer)), [string](#string),
+or something else. In R, the function `typeof()` is used to query a variable's type.
+
+[while loop]{#while-loop}
+: A loop that keeps executing as long as some condition is true.
+See also: [for loop](#for-loop).
+
+
diff --git a/renv.lock b/renv.lock
new file mode 100644
index 000000000..8f35e8965
--- /dev/null
+++ b/renv.lock
@@ -0,0 +1,1022 @@
+{
+ "R": {
+ "Version": "4.4.2",
+ "Repositories": [
+ {
+ "Name": "carpentries",
+ "URL": "https://carpentries.r-universe.dev"
+ },
+ {
+ "Name": "carpentries_archive",
+ "URL": "https://carpentries.github.io/drat"
+ },
+ {
+ "Name": "CRAN",
+ "URL": "https://cran.rstudio.com"
+ }
+ ]
+ },
+ "Packages": {
+ "DiagrammeR": {
+ "Package": "DiagrammeR",
+ "Version": "1.0.11",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "RColorBrewer",
+ "cli",
+ "dplyr",
+ "glue",
+ "htmltools",
+ "htmlwidgets",
+ "igraph",
+ "magrittr",
+ "purrr",
+ "readr",
+ "rlang",
+ "rstudioapi",
+ "scales",
+ "stringr",
+ "tibble",
+ "tidyr",
+ "viridisLite",
+ "visNetwork"
+ ],
+ "Hash": "584c1e1cbb6f9b6c3b0f4ef0ad960966"
+ },
+ "MASS": {
+ "Package": "MASS",
+ "Version": "7.3-61",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "grDevices",
+ "graphics",
+ "methods",
+ "stats",
+ "utils"
+ ],
+ "Hash": "0cafd6f0500e5deba33be22c46bf6055"
+ },
+ "Matrix": {
+ "Package": "Matrix",
+ "Version": "1.7-1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "grDevices",
+ "graphics",
+ "grid",
+ "lattice",
+ "methods",
+ "stats",
+ "utils"
+ ],
+ "Hash": "5122bb14d8736372411f955e1b16bc8a"
+ },
+ "R6": {
+ "Package": "R6",
+ "Version": "2.5.1",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "R"
+ ],
+ "Hash": "470851b6d5d0ac559e9d01bb352b4021"
+ },
+ "RColorBrewer": {
+ "Package": "RColorBrewer",
+ "Version": "1.1-3",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R"
+ ],
+ "Hash": "45f0398006e83a5b10b72a90663d8d8c"
+ },
+ "base64enc": {
+ "Package": "base64enc",
+ "Version": "0.1-3",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "R"
+ ],
+ "Hash": "543776ae6848fde2f48ff3816d0628bc"
+ },
+ "bit": {
+ "Package": "bit",
+ "Version": "4.5.0",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R"
+ ],
+ "Hash": "5dc7b2677d65d0e874fc4aaf0e879987"
+ },
+ "bit64": {
+ "Package": "bit64",
+ "Version": "4.5.2",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "bit",
+ "methods",
+ "stats",
+ "utils"
+ ],
+ "Hash": "e84984bf5f12a18628d9a02322128dfd"
+ },
+ "bslib": {
+ "Package": "bslib",
+ "Version": "0.8.0",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "base64enc",
+ "cachem",
+ "fastmap",
+ "grDevices",
+ "htmltools",
+ "jquerylib",
+ "jsonlite",
+ "lifecycle",
+ "memoise",
+ "mime",
+ "rlang",
+ "sass"
+ ],
+ "Hash": "b299c6741ca9746fb227debcb0f9fb6c"
+ },
+ "cachem": {
+ "Package": "cachem",
+ "Version": "1.1.0",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "fastmap",
+ "rlang"
+ ],
+ "Hash": "cd9a672193789068eb5a2aad65a0dedf"
+ },
+ "cli": {
+ "Package": "cli",
+ "Version": "3.6.3",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "utils"
+ ],
+ "Hash": "b21916dd77a27642b447374a5d30ecf3"
+ },
+ "clipr": {
+ "Package": "clipr",
+ "Version": "0.8.0",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "utils"
+ ],
+ "Hash": "3f038e5ac7f41d4ac41ce658c85e3042"
+ },
+ "colorspace": {
+ "Package": "colorspace",
+ "Version": "2.1-1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "grDevices",
+ "graphics",
+ "methods",
+ "stats"
+ ],
+ "Hash": "d954cb1c57e8d8b756165d7ba18aa55a"
+ },
+ "cpp11": {
+ "Package": "cpp11",
+ "Version": "0.5.0",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R"
+ ],
+ "Hash": "91570bba75d0c9d3f1040c835cee8fba"
+ },
+ "crayon": {
+ "Package": "crayon",
+ "Version": "1.5.3",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "grDevices",
+ "methods",
+ "utils"
+ ],
+ "Hash": "859d96e65ef198fd43e82b9628d593ef"
+ },
+ "digest": {
+ "Package": "digest",
+ "Version": "0.6.37",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "utils"
+ ],
+ "Hash": "33698c4b3127fc9f506654607fb73676"
+ },
+ "dplyr": {
+ "Package": "dplyr",
+ "Version": "1.1.4",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "R6",
+ "cli",
+ "generics",
+ "glue",
+ "lifecycle",
+ "magrittr",
+ "methods",
+ "pillar",
+ "rlang",
+ "tibble",
+ "tidyselect",
+ "utils",
+ "vctrs"
+ ],
+ "Hash": "fedd9d00c2944ff00a0e2696ccf048ec"
+ },
+ "evaluate": {
+ "Package": "evaluate",
+ "Version": "1.0.1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R"
+ ],
+ "Hash": "3fd29944b231036ad67c3edb32e02201"
+ },
+ "fansi": {
+ "Package": "fansi",
+ "Version": "1.0.6",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "grDevices",
+ "utils"
+ ],
+ "Hash": "962174cf2aeb5b9eea581522286a911f"
+ },
+ "farver": {
+ "Package": "farver",
+ "Version": "2.1.2",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Hash": "680887028577f3fa2a81e410ed0d6e42"
+ },
+ "fastmap": {
+ "Package": "fastmap",
+ "Version": "1.2.0",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Hash": "aa5e1cd11c2d15497494c5292d7ffcc8"
+ },
+ "fontawesome": {
+ "Package": "fontawesome",
+ "Version": "0.5.2",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "htmltools",
+ "rlang"
+ ],
+ "Hash": "c2efdd5f0bcd1ea861c2d4e2a883a67d"
+ },
+ "fs": {
+ "Package": "fs",
+ "Version": "1.6.5",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "methods"
+ ],
+ "Hash": "7f48af39fa27711ea5fbd183b399920d"
+ },
+ "generics": {
+ "Package": "generics",
+ "Version": "0.1.3",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "R",
+ "methods"
+ ],
+ "Hash": "15e9634c0fcd294799e9b2e929ed1b86"
+ },
+ "ggplot2": {
+ "Package": "ggplot2",
+ "Version": "3.5.1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "MASS",
+ "R",
+ "cli",
+ "glue",
+ "grDevices",
+ "grid",
+ "gtable",
+ "isoband",
+ "lifecycle",
+ "mgcv",
+ "rlang",
+ "scales",
+ "stats",
+ "tibble",
+ "vctrs",
+ "withr"
+ ],
+ "Hash": "44c6a2f8202d5b7e878ea274b1092426"
+ },
+ "glue": {
+ "Package": "glue",
+ "Version": "1.8.0",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "methods"
+ ],
+ "Hash": "5899f1eaa825580172bb56c08266f37c"
+ },
+ "gtable": {
+ "Package": "gtable",
+ "Version": "0.3.6",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "cli",
+ "glue",
+ "grid",
+ "lifecycle",
+ "rlang",
+ "stats"
+ ],
+ "Hash": "de949855009e2d4d0e52a844e30617ae"
+ },
+ "highr": {
+ "Package": "highr",
+ "Version": "0.11",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "xfun"
+ ],
+ "Hash": "d65ba49117ca223614f71b60d85b8ab7"
+ },
+ "hms": {
+ "Package": "hms",
+ "Version": "1.1.3",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "lifecycle",
+ "methods",
+ "pkgconfig",
+ "rlang",
+ "vctrs"
+ ],
+ "Hash": "b59377caa7ed00fa41808342002138f9"
+ },
+ "htmltools": {
+ "Package": "htmltools",
+ "Version": "0.5.8.1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "base64enc",
+ "digest",
+ "fastmap",
+ "grDevices",
+ "rlang",
+ "utils"
+ ],
+ "Hash": "81d371a9cc60640e74e4ab6ac46dcedc"
+ },
+ "htmlwidgets": {
+ "Package": "htmlwidgets",
+ "Version": "1.6.4",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "grDevices",
+ "htmltools",
+ "jsonlite",
+ "knitr",
+ "rmarkdown",
+ "yaml"
+ ],
+ "Hash": "04291cc45198225444a397606810ac37"
+ },
+ "igraph": {
+ "Package": "igraph",
+ "Version": "2.1.1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "Matrix",
+ "R",
+ "cli",
+ "cpp11",
+ "grDevices",
+ "graphics",
+ "lifecycle",
+ "magrittr",
+ "methods",
+ "pkgconfig",
+ "rlang",
+ "stats",
+ "utils",
+ "vctrs"
+ ],
+ "Hash": "c03878b48737a0e2da3b772d7b2e22da"
+ },
+ "isoband": {
+ "Package": "isoband",
+ "Version": "0.2.7",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "grid",
+ "utils"
+ ],
+ "Hash": "0080607b4a1a7b28979aecef976d8bc2"
+ },
+ "jquerylib": {
+ "Package": "jquerylib",
+ "Version": "0.1.4",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "htmltools"
+ ],
+ "Hash": "5aab57a3bd297eee1c1d862735972182"
+ },
+ "jsonlite": {
+ "Package": "jsonlite",
+ "Version": "1.8.9",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "methods"
+ ],
+ "Hash": "4e993b65c2c3ffbffce7bb3e2c6f832b"
+ },
+ "knitr": {
+ "Package": "knitr",
+ "Version": "1.48",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "evaluate",
+ "highr",
+ "methods",
+ "tools",
+ "xfun",
+ "yaml"
+ ],
+ "Hash": "acf380f300c721da9fde7df115a5f86f"
+ },
+ "labeling": {
+ "Package": "labeling",
+ "Version": "0.4.3",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "graphics",
+ "stats"
+ ],
+ "Hash": "b64ec208ac5bc1852b285f665d6368b3"
+ },
+ "lattice": {
+ "Package": "lattice",
+ "Version": "0.22-6",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "R",
+ "grDevices",
+ "graphics",
+ "grid",
+ "stats",
+ "utils"
+ ],
+ "Hash": "cc5ac1ba4c238c7ca9fa6a87ca11a7e2"
+ },
+ "lifecycle": {
+ "Package": "lifecycle",
+ "Version": "1.0.4",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "cli",
+ "glue",
+ "rlang"
+ ],
+ "Hash": "b8552d117e1b808b09a832f589b79035"
+ },
+ "magrittr": {
+ "Package": "magrittr",
+ "Version": "2.0.3",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "R"
+ ],
+ "Hash": "7ce2733a9826b3aeb1775d56fd305472"
+ },
+ "memoise": {
+ "Package": "memoise",
+ "Version": "2.0.1",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "cachem",
+ "rlang"
+ ],
+ "Hash": "e2817ccf4a065c5d9d7f2cfbe7c1d78c"
+ },
+ "mgcv": {
+ "Package": "mgcv",
+ "Version": "1.9-1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "Matrix",
+ "R",
+ "graphics",
+ "methods",
+ "nlme",
+ "splines",
+ "stats",
+ "utils"
+ ],
+ "Hash": "110ee9d83b496279960e162ac97764ce"
+ },
+ "mime": {
+ "Package": "mime",
+ "Version": "0.12",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "tools"
+ ],
+ "Hash": "18e9c28c1d3ca1560ce30658b22ce104"
+ },
+ "munsell": {
+ "Package": "munsell",
+ "Version": "0.5.1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "colorspace",
+ "methods"
+ ],
+ "Hash": "4fd8900853b746af55b81fda99da7695"
+ },
+ "nlme": {
+ "Package": "nlme",
+ "Version": "3.1-166",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "R",
+ "graphics",
+ "lattice",
+ "stats",
+ "utils"
+ ],
+ "Hash": "ccbb8846be320b627e6aa2b4616a2ded"
+ },
+ "pillar": {
+ "Package": "pillar",
+ "Version": "1.9.0",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "cli",
+ "fansi",
+ "glue",
+ "lifecycle",
+ "rlang",
+ "utf8",
+ "utils",
+ "vctrs"
+ ],
+ "Hash": "15da5a8412f317beeee6175fbc76f4bb"
+ },
+ "pkgconfig": {
+ "Package": "pkgconfig",
+ "Version": "2.0.3",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "utils"
+ ],
+ "Hash": "01f28d4278f15c76cddbea05899c5d6f"
+ },
+ "prettyunits": {
+ "Package": "prettyunits",
+ "Version": "1.2.0",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R"
+ ],
+ "Hash": "6b01fc98b1e86c4f705ce9dcfd2f57c7"
+ },
+ "progress": {
+ "Package": "progress",
+ "Version": "1.2.3",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "R6",
+ "crayon",
+ "hms",
+ "prettyunits"
+ ],
+ "Hash": "f4625e061cb2865f111b47ff163a5ca6"
+ },
+ "purrr": {
+ "Package": "purrr",
+ "Version": "1.0.2",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "cli",
+ "lifecycle",
+ "magrittr",
+ "rlang",
+ "vctrs"
+ ],
+ "Hash": "1cba04a4e9414bdefc9dcaa99649a8dc"
+ },
+ "rappdirs": {
+ "Package": "rappdirs",
+ "Version": "0.3.3",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "R"
+ ],
+ "Hash": "5e3c5dc0b071b21fa128676560dbe94d"
+ },
+ "readr": {
+ "Package": "readr",
+ "Version": "2.1.5",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "R6",
+ "cli",
+ "clipr",
+ "cpp11",
+ "crayon",
+ "hms",
+ "lifecycle",
+ "methods",
+ "rlang",
+ "tibble",
+ "tzdb",
+ "utils",
+ "vroom"
+ ],
+ "Hash": "9de96463d2117f6ac49980577939dfb3"
+ },
+ "renv": {
+ "Package": "renv",
+ "Version": "1.0.11",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "utils"
+ ],
+ "Hash": "47623f66b4e80b3b0587bc5d7b309888"
+ },
+ "rlang": {
+ "Package": "rlang",
+ "Version": "1.1.4",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "utils"
+ ],
+ "Hash": "3eec01f8b1dee337674b2e34ab1f9bc1"
+ },
+ "rmarkdown": {
+ "Package": "rmarkdown",
+ "Version": "2.29",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "bslib",
+ "evaluate",
+ "fontawesome",
+ "htmltools",
+ "jquerylib",
+ "jsonlite",
+ "knitr",
+ "methods",
+ "tinytex",
+ "tools",
+ "utils",
+ "xfun",
+ "yaml"
+ ],
+ "Hash": "df99277f63d01c34e95e3d2f06a79736"
+ },
+ "rstudioapi": {
+ "Package": "rstudioapi",
+ "Version": "0.17.1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Hash": "5f90cd73946d706cfe26024294236113"
+ },
+ "sass": {
+ "Package": "sass",
+ "Version": "0.4.9",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R6",
+ "fs",
+ "htmltools",
+ "rappdirs",
+ "rlang"
+ ],
+ "Hash": "d53dbfddf695303ea4ad66f86e99b95d"
+ },
+ "scales": {
+ "Package": "scales",
+ "Version": "1.3.0",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "R6",
+ "RColorBrewer",
+ "cli",
+ "farver",
+ "glue",
+ "labeling",
+ "lifecycle",
+ "munsell",
+ "rlang",
+ "viridisLite"
+ ],
+ "Hash": "c19df082ba346b0ffa6f833e92de34d1"
+ },
+ "stringi": {
+ "Package": "stringi",
+ "Version": "1.8.4",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "R",
+ "stats",
+ "tools",
+ "utils"
+ ],
+ "Hash": "39e1144fd75428983dc3f63aa53dfa91"
+ },
+ "stringr": {
+ "Package": "stringr",
+ "Version": "1.5.1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "cli",
+ "glue",
+ "lifecycle",
+ "magrittr",
+ "rlang",
+ "stringi",
+ "vctrs"
+ ],
+ "Hash": "960e2ae9e09656611e0b8214ad543207"
+ },
+ "tibble": {
+ "Package": "tibble",
+ "Version": "3.2.1",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "R",
+ "fansi",
+ "lifecycle",
+ "magrittr",
+ "methods",
+ "pillar",
+ "pkgconfig",
+ "rlang",
+ "utils",
+ "vctrs"
+ ],
+ "Hash": "a84e2cc86d07289b3b6f5069df7a004c"
+ },
+ "tidyr": {
+ "Package": "tidyr",
+ "Version": "1.3.1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "cli",
+ "cpp11",
+ "dplyr",
+ "glue",
+ "lifecycle",
+ "magrittr",
+ "purrr",
+ "rlang",
+ "stringr",
+ "tibble",
+ "tidyselect",
+ "utils",
+ "vctrs"
+ ],
+ "Hash": "915fb7ce036c22a6a33b5a8adb712eb1"
+ },
+ "tidyselect": {
+ "Package": "tidyselect",
+ "Version": "1.2.1",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "cli",
+ "glue",
+ "lifecycle",
+ "rlang",
+ "vctrs",
+ "withr"
+ ],
+ "Hash": "829f27b9c4919c16b593794a6344d6c0"
+ },
+ "tinytex": {
+ "Package": "tinytex",
+ "Version": "0.54",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "xfun"
+ ],
+ "Hash": "3ec7e3ddcacc2d34a9046941222bf94d"
+ },
+ "tzdb": {
+ "Package": "tzdb",
+ "Version": "0.4.0",
+ "Source": "Repository",
+ "Repository": "RSPM",
+ "Requirements": [
+ "R",
+ "cpp11"
+ ],
+ "Hash": "f561504ec2897f4d46f0c7657e488ae1"
+ },
+ "utf8": {
+ "Package": "utf8",
+ "Version": "1.2.4",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R"
+ ],
+ "Hash": "62b65c52671e6665f803ff02954446e9"
+ },
+ "vctrs": {
+ "Package": "vctrs",
+ "Version": "0.6.5",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "cli",
+ "glue",
+ "lifecycle",
+ "rlang"
+ ],
+ "Hash": "c03fa420630029418f7e6da3667aac4a"
+ },
+ "viridisLite": {
+ "Package": "viridisLite",
+ "Version": "0.4.2",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R"
+ ],
+ "Hash": "c826c7c4241b6fc89ff55aaea3fa7491"
+ },
+ "visNetwork": {
+ "Package": "visNetwork",
+ "Version": "2.1.2",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "grDevices",
+ "htmltools",
+ "htmlwidgets",
+ "jsonlite",
+ "magrittr",
+ "methods",
+ "stats",
+ "utils"
+ ],
+ "Hash": "3e48b097e8d9a91ecced2ed4817a678d"
+ },
+ "vroom": {
+ "Package": "vroom",
+ "Version": "1.6.5",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "bit64",
+ "cli",
+ "cpp11",
+ "crayon",
+ "glue",
+ "hms",
+ "lifecycle",
+ "methods",
+ "progress",
+ "rlang",
+ "stats",
+ "tibble",
+ "tidyselect",
+ "tzdb",
+ "vctrs",
+ "withr"
+ ],
+ "Hash": "390f9315bc0025be03012054103d227c"
+ },
+ "withr": {
+ "Package": "withr",
+ "Version": "3.0.2",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "grDevices",
+ "graphics"
+ ],
+ "Hash": "cc2d62c76458d425210d1eb1478b30b4"
+ },
+ "xfun": {
+ "Package": "xfun",
+ "Version": "0.49",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Requirements": [
+ "R",
+ "grDevices",
+ "stats",
+ "tools"
+ ],
+ "Hash": "8687398773806cfff9401a2feca96298"
+ },
+ "yaml": {
+ "Package": "yaml",
+ "Version": "2.3.10",
+ "Source": "Repository",
+ "Repository": "CRAN",
+ "Hash": "51dab85c6c98e50a18d7551e9d49f76c"
+ }
+ }
+}
diff --git a/results/lifeExp.png b/results/lifeExp.png
new file mode 100644
index 000000000..fb1730b8f
Binary files /dev/null and b/results/lifeExp.png differ
diff --git a/setup.md b/setup.md
new file mode 100644
index 000000000..1e9a9654d
--- /dev/null
+++ b/setup.md
@@ -0,0 +1,10 @@
+---
+title: Setup
+---
+
+This lesson assumes you have R and RStudio installed on your computer.
+
+- [Download and install the latest version of R](https://www.r-project.org/).
+- [Download and install RStudio](https://www.rstudio.com/products/rstudio/download/#download). RStudio is an integrated development environment (IDE) that makes R easier to use and adds a number of convenient features. You will need the free Desktop version for your computer.
+
+