-
Notifications
You must be signed in to change notification settings - Fork 35
/
Copy pathtidyverse_tutorial.Rmd
647 lines (456 loc) · 23.9 KB
/
tidyverse_tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
---
title: "Medium Data in the Tidyverse"
author: "Mike Frank"
date: "6/22/2017"
output: html_document
---
Starting note: The best reference for this material is Hadley Wickham's [R for data scientists](http://r4ds.had.co.nz/). My contribution here is to translate this reference for psychology.
```{r setup, include=FALSE}
library(tidyverse)
library(DT) # dependency
library(ggthemes) # dependency
library(wordbankr)
library(purrr)
library(shiny)
knitr::opts_chunk$set(echo = TRUE, cache=TRUE)
```
<!-- ----------------------------------------------------------------------- -->
# Goals and Introduction
By the end of this tutorial, you will know:
+ What "tidy data" is and why it's an awesome format
+ How to do some stuff with tidy data
+ How to get your data to be tidy
+ Some tips'n'tricks for dealing with "medium data" in R
This intro will describe a few concepts that you will need to know, using the famous `iris` dataset that comes with `ggplot2`.
## Data frames
The basic data structure we're working with is the data frame, or `tibble` (in the `tidyverse` reimplementation).
Data frames have rows and columns, and each column has a distinct data type. The implementation in Python's `pandas` is distinct but most of the concepts are the same.
`iris` is a data frame showing the measurements of a bunch of different instances of iris flowers from different species. (Sepals are the things outside the petals of the flowers that protect the petals while it's blooming, petals are the actual petals of the flower).
```{r}
head(iris)
```
> **Exercise.** R is a very flexible programming language, which is both a strength and a weakness. There are many ways to get a particular value of a variable in a data frame. You can use `$` to access a column, as in `iris$Sepal.Length` or you can treat the data frame as a matrix, e.g. `iris[1,1]` or even as a list, as in `iris[[1]]`. You can also mix numeric references and named references, e.g. `iris[["Sepal.Length"]]`. Turn to your neighbor (and/or google) and find as many ways as you can to access the petal length of the third iris in the dataset (row 3).
```{r}
# fill me in with calls to the iris dataset that all return the same cell (third from the top, Petal Length).
iris$Petal.Length[3]
iris[3,3]
iris[[3]][3]
iris[3,"Petal.Length"]
iris[["Petal.Length"]][3]
```
> **Discussion.** Why might some ways of doing this be better than others?
## Tidy data
> “Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
Here's the basic idea: In tidy data, every row is a single **observation** (trial), and every column describes a **variable** with some **value** describing that trial.
And if you know that data are formatted this way, then you can do amazing things, basically because you can take a uniform approach to the dataset. From R4DS:
"There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine."
`iris` is a tidy dataset. Each row is an observation of an individual iris, each column is a different variable.
> **Exercise.** Take a look at these data, as downloaded from Amazon Mechanical Turk. They describe an experiment where people had to estimate the price of a dog, a plasma TV, and a sushi dinner (and they were primed with anchors that differed across conditions). It's a replication of a paper by [Janiszewksi & Uy (2008)](http://warrington.ufl.edu/departments/mkt/docs/janiszewski/Anchor.pdf). Examine this dataset with your nextdoor neighbor and sketch out what a tidy version of the dataset would look like (using paper and pencil).
```{r}
ju <- read_csv("data/janiszewski_rep_cleaned.csv")
head(ju)
```
## Functions and Pipes
Everything you typically want to do in statistical programming uses **functions**. `mean` is a good example. `mean` takes one **argument**, a numeric vector.
```{r}
mean(iris$Petal.Length)
```
We're going to call this **applying** the function `mean` to the variable `Petal.Length`.
Pipes are a way to write strings of functions more easily. They bring the first argument of the function to the bedginning. So you can write:
```{r}
iris$Petal.Length %>% mean
```
That's not very useful yet, but when you start **nesting** functions, it gets better.
```{r}
mean(unique(iris$Petal.Length))
iris$Petal.Length %>% unique() %>% mean(na.rm=TRUE)
```
or
```{r}
round(mean(unique(iris$Petal.Length)), digits = 2)
iris$Petal.Length %>% unique %>% mean %>% round(digits = 2)
# indenting makes things even easier to read
iris$Petal.Length %>%
unique %>%
mean %>%
round(digits = 2)
```
This can be super helpful for writing strings of functions so that they are readable and distinct.
We'll be doing a lot of piping of functions with multiple arguments later, and it will really help keep our syntax simple.
> **Exercise.** Rewrite these commands using pipes and check that they do the same thing! (Or at least produce the same output). Unpiped version:
```{r}
length(unique(iris$Species)) # number of species
```
Piped version:
```{r}
iris$Species %>%
unique %>%
length
```
## `ggplot2` and tidy data
The last piece of our workflow here is going to be the addition of visualiation elements. `ggplot2` is a plotting package that easily takes advantage of tidy data. ggplots have two important parts (there are of course more):
+ `aes` - the aesthetic mapping, or which data variables get mapped to which visual variables (x, y, color, symbol, etc.)
+ `geom` - the plotting objects that represent the data (points, lines, shapes, etc.)
```{r}
iris %>%
ggplot(aes(x = Sepal.Width, y = Sepal.Length, col = Species)) +
geom_point()
```
And just to let you know my biases, I like `theme_few` from `ggthemes` and `scale_color_solarized` as my palette.
```{r}
iris %>%
ggplot(aes(Sepal.Width, Sepal.Length, col = Species)) +
geom_point() +
ggthemes::theme_few() +
ggthemes::scale_color_solarized()
```
<!-- ----------------------------------------------------------------------- -->
# Tidy Data Analysis with `dplyr`
Reference: [R4DS Chapter 5](http://r4ds.had.co.nz/transform.html)
Let's take a psychological dataset. Here are the raw data from [Stiller, Goodman, & Frank (2015)].
These data are tidy: each row describes a single trial, each column describes some aspect of tha trial, including their id (`subid`), age (`age`), condition (`condition` - "label" is the experimental condition, "No Label" is the control), item (`item` - which thing furble was trying to find).
We are going to manipulate these data using "verbs" from `dplyr`. I'll only teach four verbs, the most common in my workflow (but there are many other useful ones):
+ `filter` - remove rows by some logical condition
+ `mutate` - create new columns
+ `group_by` - group the data into subsets by some column
+ `summarize` - apply some function over columns in each group
## Exploring and characterizing the dataset
```{r}
sgf <- read_csv("data/stiller_scales_data.csv")
sgf
```
Inspect the various variables before you start any analysis. Lots of people recommend `summary` but TBH I don't find it useful.
```{r}
summary(sgf)
```
This output just feels overwhelming and uninformative.
You can look at each variable by itself:
```{r}
unique(sgf$condition)
sgf$subid %>%
unique %>%
length
```
Or use interactive tools like `View` or `DT::datatable` (which I really like).
```{r}
View(sgf)
DT::datatable(sgf)
```
## Filtering & Mutating
There are lots of reasons you might want to remove *rows* from your dataset, including getting rid of outliers, selecting subpopulations, etc. `filter` is a verb (function) that takes a data frame as its first argument, and then as its second takes the **condition** you want to filter on.
So if you wanted to look only at two year olds, you could do this. (Note you can give two conditions, could also do `age > 2 & age < 3`). (equivalent: `filter(sgf, age > 2, age < 3)`)
Note that we're going to be using pipes with functions over data frames here. The way this works is that:
+ `dplyr` verbs always take the data frame as their first argument, and
+ because pipes pull out the first argument, the data frame just gets passed through successive operations
+ so you can read a pipe chain as "take this data frame and first do this, then do this, then do that."
This is essentially the huge insight of `dplyr`: you can chain verbs into readable and efficient sequences of operations over dataframes, provided 1) the verbs all have the same syntax (which they do) and 2) the data all have the same structure (which they do if they are tidy).
OK, so filtering:
```{r}
sgf %>%
filter(age > 2,
age < 3)
```
**Exercise.** Filter out only the "face" trial in the "Label" condition.
```{r}
sgf %>%
filter(condition == "Label",
item == "faces")
sgf[sgf$condition == "Label" & sgf$item == "faces", ] # all the columns
```
There are also times when you want to add or remove *columns*. You might want to remove columns to simplify the dataset. There's not much to simplify here, but if you wanted to do that, the verb is `select`.
```{r}
sgf %>%
select(subid, age, correct)
sgf %>%
select(-condition)
sgf %>%
select(1)
sgf %>%
select(starts_with("sub"))
# learn about this with ?select
```
Perhaps more useful is *adding columns*. You might do this perhaps to compute some kind of derived variable. `mutate` is the verb for these situations - it allows you to add a column. Let's add a discrete age group factor to our dataset.
```{r}
sgf <- sgf %>%
mutate(age_group = cut(age, 2:5, include.lowest = TRUE),
age_group_halfyear = cut(age, seq(2,5,.5), include.lowest = TRUE))
# sgf$age_group <- cut(sgf$age, 2:5, include.lowest = TRUE)
# sgf$age_group <- with(sgf, cut(age, 2:5, include.lowest = TRUE))
head(sgf$age_group)
```
## Standard psychological descriptives
We typically describe datasets at the level of subjects, not trials. We need two verbs to get a summary at the level of subjects: `group_by` and `summarise` (kiwi spelling). Grouping alone doesn't do much.
```{r}
sgf %>%
group_by(age_group)
```
All it does is add a grouping marker.
What `summarise` does is to *apply a function* to a part of the dataset to create a new summary dataset. So we can apply the function `mean` to the dataset and get the grand mean.
```{r}
## DO NOT DO THIS!!!
# foo <- initialize_the_thing_being_bound()
# for (i in 1:length(unique(sgf$item))) {
# for (j in 1:length(unique(sgf$condition))) {
# this_data <- sgf[sgf$item == unique(sgf$item)[i] &
# sgf$condition == unique(sgf$condition)[n],]
# do_a_thing(this_data)
# bind_together_somehow(this_data)
# }
# }
sgf %>%
summarise(correct = mean(correct))
```
Note the syntax here: `summarise` takes multiple `new_column_name = function_to_be_applied_to_data(data_column)` entries in a list. Using this syntax, we can create more elaborate summary datasets also:
```{r}
sgf %>%
summarise(correct = mean(correct),
n_observations = length(subid))
```
Where these two verbs shine is in combination, though. Because `summarise` applies functions to columns in your *grouped data*, not just to the whole dataset!
So we can group by age or condition or whatever else we want and then carry out the same procedure, and all of a sudden we are doing something extremely useful!
```{r}
sgf_means <- sgf %>%
group_by(age_group, condition) %>%
summarise(correct = mean(correct),
n_observations = length(subid))
sgf_means
```
These summary data are typically very useful for plotting. .
```{r}
ggplot(sgf_means,
aes(x = age_group, y = correct, col = condition, group = condition)) +
geom_line() +
ylim(0,1) +
ggthemes::theme_few()
# sgf %>%
# mutate(age_group) %>%
# group_by() %>%
# summarise %>%
# ggplot()
```
> **Exercise.** One of the most important analytic workflows for psychological data is to take some function (e.g., the mean) *for each participant* and then look at grand means and variability *across participant means*. This analytic workflow requires grouping, summarising, and then grouping again and summarising again! Use `dplyr` to make the same table as above (`sgf_means`) but with means (and SDs if you want) computed across subject means, not across all data points. (The means will be pretty similar as this is a balanced design but in a case with lots of missing data, they will vary. In contrast, the SD doesn't even really make sense across the binary data before you aggregate across subjects.)
```{r}
# exercise
sgf_sub_means <- sgf %>%
group_by(age_group, condition, subid) %>%
summarise(correct = mean(correct))
sgf_grand_means <- sgf_sub_means %>%
group_by(age_group, condition) %>%
summarise(mean_correct = mean(correct),
sd_correct = sd(correct))
```
<!-- ----------------------------------------------------------------------- -->
# Getting to Tidy with `tidyr`
Reference: [R4DS Chapter 12](http://r4ds.had.co.nz/tidy-data.html)
Psychological data often comes in two flavors: *long* and *wide* data. Long form data is *tidy*, but that format is less common. It's much more common to get *wide* data, in which every row is a case (e.g., a subject), and each column is a variable. In this format multiple trials (observations) are stored as columns.
This can go a bunch of ways, for example, the most common might be to have subjects as rows and trials as columns. But here's an example from a real dataset on "unconscious arithmetic" from [Sklar et al. (2012)](http://www.pnas.org/content/109/48/19614.short). In it, *items* (particular arithmetic problems) are rows and *subjects* are columns.
```{r}
sklar <- read_csv("data/sklar_data.csv")
head(sklar)
```
## Tidy verbs
The two main verbs for tidying are `gather` and `spread`. (There are lots of others in the `tidyr` package if you want to split or merge columns etc.).
First, let's go *away* from tidiness. We're going to `spread` a tidy dataset. Remember that tidy data has one observation in each row, but we want to "spread" it out so it's wide. (The metaphor works better in this description). This may not be helpful, but I think of the data as a long cream cheese pat, and I "spread" it over a wide bagel.
Let's try it on the SGF data above. First we'll spread it so it's wide. I do this by indicating what column is going to be the *column labels* in the new data frame, here it's `item`, and what column is going to have the *values* in those columns, here it's `correct`:
```{r}
sgf_wide <- sgf %>%
spread(item, correct)
head(sgf_wide)
```
Now you can see that there is no explicit specification that all those item columns, e.g. `faces`, `beds` are holding `correct` values, but the data are much more compact. (This form is easy to work with in Excel, so that's probably why people use it in psych).
OK, let's go back to our original format. `gather` is about making wide data into tidy (long) data. When you gather a dataset you are "gathering" a bunch of columns (maybe that you previously `spread`). You specify what all the columns have in common (e.g., they are all `subject_id`s in the example above), and you say what measure they all contain (they all have RTs). So in that sense, it's the flip of `spread`. You did `spread(item, correct)` and now you'll `gather(item, correct, ...)`. The one extra argument is that you need to specify the columns that will go into `item`!
```{r}
sgf_long <- sgf_wide %>%
gather(item, correct, beds, faces, houses, pasta)
head(sgf_long)
head(sgf)
```
There are lots of flexible ways to specify these columns - you can enumerate their names like I did.
```{r}
# gather(item, correct, 5:8)
# gather(item, correct, starts_with("foo"))
```
> **Exercise.** Take the Sklar data from above, where each column is a separate subject, and `gather` it so that it's a tidy dataset. What challenges come up?
```{r}
sklar
```
```{r}
sklar_tidy <- sklar %>%
gather(subid, rt, 8:28)
sklar_tidy
```
Let's also go back and tidy an easier one: `iris`.
```{r}
iris
```
```{r}
iris %>%
mutate(iris_id = 1:nrow(iris)) %>%
gather(measurement, centimeters, Sepal.Length, Petal.Length, Sepal.Width, Petal.Width)
```
<!-- ----------------------------------------------------------------------- -->
# A bigger worked example: Wordbank data
We're going to be using some data on vocabulary growth that we load from the Wordbank database. [Wordbank](http://wordbank.stanford.edu) is a database of children's language learning.
(Go explore it for a moment).
We're going to look at data from the English Words and Sentences form. These data describe the repsonses of parents to questions about whether their child says 680 different words.
`dplyr` really shines in this context.
```{r}
# to avoid dependency on the wordbankr package, we cache these data.
# ws <- wordbankr::get_administration_data(language = "English",
# form = "WS")
ws <- read_csv("data/ws.csv")
```
Take a look at the data that comes out.
```{r}
DT::datatable(ws)
```
```{r}
ggplot(ws, aes(x = age, y = production)) +
geom_point()
```
Aside: How can we fix this plot? Suggestions from group?
```{r}
ggplot(ws, aes(x = age, y = production)) +
geom_jitter(size = .5, width = .25, height = 0, alpha = .3)
```
Ok, let's plot the relationship between sex and productive vocabulary, using `dplyr`.
```{r}
ggplot(ws, aes(x = age, y = production, col=sex)) +
geom_jitter(size = .5, width = .25, height = 0, alpha = .3)
```
This is a bit useless, because the variability is so high. So let's summarise!
> **Exercise.** Get means and SDs of productive vocabulary (`production`) by `age` and `sex`. Filter the kids with missing data for `sex` (coded by `NA`).
HINT: `is.na(x)` is useful for filtering.
```{r}
# View(ws)
ws_sex <- ws %>%
filter(!is.na(sex)) %>%
group_by(age, sex) %>%
summarise(production = mean(production),
production_sd = sd(production))
ws_sex
```
Now plot:
```{r}
ggplot(ws_sex,
aes(x = age, y = production, col = sex)) +
geom_line() +
geom_jitter(data = filter(ws, !is.na(sex)),
size = .5, width = .25, height = 0, alpha = .3) +
geom_linerange(aes(ymin = production - production_sd,
ymax = production + production_sd),
position = position_dodge(width = .2)) # keep SDs from overlapping
```
**Bonus: Compute effect size.**
```{r}
# instructor demo
```
<!-- ----------------------------------------------------------------------- -->
# Exciting stuff you can do with this workflow
Here are three little demos of exciting stuff that you can do (and that are facilitated by this workflow).
## Reading bigger files, faster
A few other things will help you with "medium size data":
+ `read_csv` - Much faster than `read.csv` and has better defaults.
+ `dbplyr` - For connecting directly to databases. This package got forked off of `dplyr` recently but is very useful.
+ `feather` - The `feather` package is a fast-loading binary format that is interoperable with python. All you need to know is `write_feather(d, "filename")` and `read_feather("filename")`.
Here's a timing demo for `read.csv`, `read_csv`, and `read_feather`.
```{r}
system.time(read.csv("data/ws.csv"))
system.time(read_csv("data/ws.csv"))
system.time(feather::read_feather("data/ws.feather"))
```
I see about a 2x speedup for `read_csv` (bigger for bigger files) and a 20x speedup for `read_feather`.
## Interactive visualization
The `shiny` package is a great way to do interactives in R. We'll walk through constructing a simple shiny app for the wordbank data here.
Technically, this is [embedded shiny](http://rmarkdown.rstudio.com/authoring_embedded_shiny.html) as opposed to freestanding shiny apps (like Wordbank).
The two parts of a shiny app are `ui` and `server`. Both of these are funny in that they are lists of other things. The `ui` is a list of elements of an HTML page, and the server is a list of "reactive" elements. In brief, the UI says what should be shown, and the server specifies the mechanics of how to create those elements.
This little embedded shiny app shows a page with two elements: 1) a selector that lets you choose a demographic field, and 2) a plot of vocabulary split by that field.
The server then has the job of splitting the data by that field (for `ws_split`) and rendering the plot (`agePlot`).
The one fancy thing that's going on here is that the app makes use of the calls `group_by_` (in the `dplyr` chain) and `aes_` (for the `ggplot` call). These `_` functions are a little complex - they are an example of "standard evaluation" that lets you feed *actual variables* into `ggplot2` and `dplyr` rather than *names of variables*. For more information, there is a nice vignette on standard and non-standard evaluation: try `(vignette("nse")`.
```{r}
shinyApp(
ui <- fluidPage(
selectInput("demographic", "Demographic Split Variable",
c("Sex" = "sex", "Maternal Education" = "mom_ed",
"Birth Order" = "birth_order", "Ethnicity" = "ethnicity")),
plotOutput("agePlot")
),
server <- function(input, output) {
ws_split <- reactive({
ws %>%
group_by_("age", input$demographic) %>%
summarise(production_mean = mean(production))
})
output$agePlot <- renderPlot({
ggplot(ws_split(),
aes_(quote(age), quote(production_mean), col = as.name(input$demographic))) +
geom_line()
})
},
options = list(height = 500)
)
```
## Function application
As I've tried to highlight, `dplyr` is actually all about applying functions. `summarise` is a verb that helps you apply functions to chunks of data and then bind them together. But that creates a requirement that all the functions return a single value (e.g., `mean`). There are lots of things you can do that summarise data but *don't* return a single value. For example, maybe you want to run a linear regression and return the slope *and* the intercept.
For that, I want to highlight two things.
One is `do`, which allows function application to grouped data. The only tricky thing about using `do` is that you have to refer to the dataframe that you're working on as `.`.
The second is the amazing `broom` package, which provides methods to `tidy` the output of lots of different statistical models. So for example, you can run a linear regression on chunks of a dataset and get back out the coefficients in a data frame.
Here's a toy example, again with Wordbank data.
```{r}
ws %>%
filter(!is.na(sex)) %>%
group_by(sex) %>%
do(broom::tidy(lm(production ~ age, data = .)))
```
In recent years, this workflow in R ihas gotten really good. `purrr` is an amazing package that introduces consistent ways to `map` functions. It's beyond the scope of the course.
<!-- ----------------------------------------------------------------------- -->
# Exercise solutions
Returning the third cell.
```{r}
iris$Petal.Length[3]
iris[3,3]
iris[3,"Petal.Length"]
iris[[3]][3]
iris[["Petal.Length"]][3]
# probably more?
```
Piped commands.
```{r}
iris$Species %>%
unique %>%
length
```
Mean of participant means.
```{r}
sgf %>%
group_by(age_group, subid) %>%
summarise(correct = mean(correct)) %>%
summarise(mean_correct = mean(correct),
sd_correct = sd(correct))
```
Sklar tidying.
```{r}
sklar %>%
gather(participant, RT, 8:28)
# might be a better way to select these columns than by number, e.g. regex
```
Sex means.
```{r}
ws_sex <- ws %>%
filter(!is.na(sex)) %>%
group_by(age, sex) %>%
summarise(production_sd = sd(production, na.rm=TRUE),
production_mean = mean(production))
```
Effect size. (Instructor demo)
```{r}
ws_es <- ws_sex %>%
group_by(age) %>%
summarise(es = (production_mean[sex=="Female"] - production_mean[sex=="Male"]) /
mean(production_sd))
ggplot(ws_es, aes(x = age, y = es)) +
geom_point() +
geom_smooth(span = 1) +
ylab("Female advantage (standard deviations)") +
xlab("Age (months)")
```