---
title: "02_tidy_data"
output: html_document
---
name: tidy data
## 2.2 The Concept of Tidy Data
> Tidy data sets are all alike; but every messy data set is messy in its own way.
~ [Wickham/Grolemund (2017)](https://r4ds.had.co.nz/tidy-data.html)
.pull-left[
**Tidy Data Principles:** The concept of tidy data was coined by Hadley Wickham in his 2014 paper ["Tidy Data"](https://www.jstatsoft.org/article/view/v059i10). It sets out three principles for structuring rectangular (tabular) data sets consisting of rows and columns:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
]
.pull-right[
```{r, echo=F}
print(penguins, width = 50)
```
]
???
- 3: relates to storing one data set per table (analogous to principles of database design) -> here the type of observational unit might be the citizen, who receives a policy treatment, e.g., a tax reduction (hence information about firms would be stored in a different data frame)
- all the upcoming tools are geared towards bringing data into this tabular shape (conversely, we will not work with text or image data)
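---
## 2.2 The Concept of Tidy Data
**Principle 3 as a sketch:** The following minimal example illustrates what "each type of observational unit forms a table" could look like. The `citizens` and `firms` tibbles, their column names, and all values are hypothetical and serve only as an illustration.

```r
library(tibble)

# Hypothetical data: one table per type of observational unit.
citizens <- tibble(
  citizen_id    = 1:3,
  age           = c(34, 51, 27),
  tax_reduction = c(TRUE, FALSE, TRUE)  # policy treatment received by the citizen
)

# Information about firms lives in its own table.
firms <- tibble(
  firm_id   = c("A", "B"),
  employees = c(120, 8)
)

# In both tables each variable is a column and each observation is a row.
dim(citizens)
dim(firms)
```

Keeping the two units apart avoids repeating firm-level values on every citizen row; the tables can still be combined later through a shared key if an analysis requires it.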
---
## 2.2 The Concept of Tidy Data
**Violations of the Tidy Data Principles:**
1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.
.panelset[
.panel[.panel-name[Example 1]
```{r, echo=F, warning=F}
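set.seed(123)  # seed for the random samples drawn in the later chunks
# Count penguins per species and island, then spread the islands into column headers
penguins %>%
  group_by(species, island) %>%
  summarise(n = n(), .groups = "drop") %>%
  pivot_wider(names_from = island, values_from = n) %>%
  unnest(cols = c(Biscoe, Dream, Torgersen))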
```
]
.panel[.panel-name[Example 2]
```{r, echo=F}
# Paste species and sex into a single column named `col`
penguins %>%
  select(species, island, sex, year) %>%
  unite(col, species, sex) %>%
  sample_n(5)
```
]
.panel[.panel-name[Example 3]
```{r, echo=F, message=F, warning=F}
# Correlation matrix: the variables appear in both rows and columns
penguins %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
  corrr::correlate(method = "pearson")
```
]
.panel[.panel-name[Example 4]
```{r, echo=F, message=F, warning=F}
# Stack penguins and cars (two types of observational units) in one table
penguins %>%
  select(species, island, sex) %>%
  sample_n(3) %>%
  bind_rows(
    mtcars %>%
      tibble::rownames_to_column("model") %>%
      select(model, mpg, cyl) %>%
      sample_n(3)
  )
```
]
.panel[.panel-name[Example 5]
```
# A tibble: 4 x 2            # A tibble: 4 x 4
  species island               species bill_length_mm bill_depth_mm flipper_length_mm
  <chr>   <chr>                <chr>            <dbl>         <dbl>             <dbl>
1 Adelie  Torgersen          1 Adelie            39.1          18.7               181
2 Adelie  Torgersen          2 Adelie            39.5          17.4               186
3 Adelie  Torgersen          3 Adelie            40.3          18                 195
4 Adelie  Torgersen          4 Adelie            NA            NA                  NA
```
]
]
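---
## 2.2 The Concept of Tidy Data
**Repairing violations 1 and 2 (sketch):** The code below hints at how the first two violations could be undone with `tidyr`. It assumes that `penguins` comes from the `palmerpenguins` package, which is not stated in this file; the object names `wide`, `long`, `united`, and `separated` are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)  # assumption: source of the `penguins` data

# Violation 1: island names are used as column headers.
wide <- penguins %>%
  count(species, island) %>%
  pivot_wider(names_from = island, values_from = n)

# pivot_longer() moves the headers back into an `island` column.
long <- wide %>%
  pivot_longer(
    cols = -species,
    names_to = "island",
    values_to = "n",
    values_drop_na = TRUE  # drop species/island pairs that were never observed
  )

# Violation 2: species and sex share one column.
united <- penguins %>%
  select(species, island, sex, year) %>%
  unite(col, species, sex)

# separate() splits the column at the underscore again.
# Both new columns are character; a missing sex comes back as the string "NA".
separated <- united %>%
  separate(col, into = c("species", "sex"), sep = "_")
```

Both repairs are approximate inverses only: column types and missing values need a second look after the round trip.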
---
## 2.2 The Concept of Tidy Data
```{r, echo=F, out.width='80%', out.height='80%', fig.align='center'}
knitr::include_graphics(
  "https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/tidydata_3.jpg"
)
```