---
title: "Team Troubleshooting Deliverable 2"
output: github_document
---
```{r include = FALSE}
knitr::opts_chunk$set(error = TRUE)
```
There are **11 code chunks with errors** in this Rmd. Your objective is to fix all of the errors in this worksheet. For the purpose of grading, each erroneous code chunk is equally weighted.
Note that not all errors are syntactic (i.e., broken code)! Some are logical errors (i.e., code that does not do what it was intended to do).
## Exercise 1: Exploring with `select()` and `filter()`
[MovieLens](https://dl.acm.org/doi/10.1145/2827872) is a series of datasets, widely used in education, that describe movie ratings from the MovieLens [website](https://movielens.org/). There are several MovieLens datasets, collected by the [GroupLens Research Project](https://grouplens.org/datasets/movielens/) at the University of Minnesota. Here, we load the MovieLens 100K dataset from Rafael Irizarry and Amy Gill's R package, [dslabs](https://cran.r-project.org/web/packages/dslabs/dslabs.pdf), which contains datasets useful for data analysis practice, homework, and projects in data science courses and workshops. We'll also load other required packages.
```{r eval=FALSE, include=FALSE}
# Changed previous install.packages to have ""
install.packages("dslabs")
install.packages("tidyverse")
install.packages("stringr")
install.packages("devtools") # Do not run this if you already have this package installed!
devtools::install_github("JoeyBernhardt/singer")
install.packages("gapminder")
```
```{r message=FALSE, warning=FALSE}
library("dslabs")
library("tidyverse")
library("stringr")
library("gapminder")
```
Let's have a look at the dataset! My goal is to:
- Find out the "class" of the dataset.
- If it isn't a tibble already, coerce it into a tibble and store it in the variable "movieLens".
- Have a quick look at the tibble, using a *dplyr function*.
```{r}
class(dslabs::movielens)
movieLens <- as_tibble(dslabs::movielens)
dim(movieLens)
# In addition to dim() (which is part of base R), I used the dplyr function glimpse()
# to get a brief description of what the dataset looks like
glimpse(movieLens)
```
Now that we've had a quick look at the dataset, it would be interesting to explore the rows (observations) in some more detail. I'd like to consider the movie entries that...
- belong *exclusively* to the genre *"Drama"*;
- don't belong *exclusively* to the genre *"Drama"*;
- were filmed *after* the year 2000;
- were filmed in 1999 *or* 2000;
- have *more than* 4.5 stars, and were filmed *before* 1995.
```{r}
filter(movieLens, genres == "Drama")
# Changed this
#filter(movieLens, !genres == "Drama")
# To this:
filter(movieLens, genres != "Drama")
# Changed this
#filter(movieLens, year >= 2000)
# To this:
filter(movieLens, year > 2000)
# Changed this
#filter(movieLens, year == 1999 | month == 2000)
# To this:
filter(movieLens, year == 1999 | year == 2000)
filter(movieLens, rating > 4.5, year < 1995)
```
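As a small aside, the "1999 *or* 2000" condition can also be written with `%in%`. The sketch below uses a made-up tibble, `toy_movies` (a hypothetical name that is not part of the worksheet), to check that both spellings keep the same rows.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data, used only for illustration
toy_movies <- tibble(
  title = c("A", "B", "C", "D"),
  year  = c(1998, 1999, 2000, 2001)
)

# Two ways to express "filmed in 1999 or 2000"
with_or <- filter(toy_movies, year == 1999 | year == 2000)
with_in <- filter(toy_movies, year %in% c(1999, 2000))

# Both approaches keep the same two rows (titles "B" and "C")
all.equal(with_or, with_in)
```

`%in%` tends to read better once the list of accepted values grows beyond two.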
While filtering for *all movies that do not belong to the genre drama* above, I noticed something interesting. I want to filter for the same thing again, this time selecting variables **title and genres first,** and then *everything else*. But I want to do this in a robust way, so that (for example) if I end up changing `movieLens` to contain more or fewer columns some time in the future, the code will still work. Hint: there is a function to select "everything else"...
```{r}
movieLens %>%
  # Changed this
  # filter(!genres == "Drama") %>%
  # To this
  filter(genres != "Drama") %>%
  # Changed this
  # select(title, genres, year, rating, timestamp)
  # To this
  select(title, genres, everything())
```
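To illustrate why `everything()` is the robust choice, here is a minimal sketch on a hypothetical tibble, `toy_cols` (the name and its columns are made up for illustration). The same `select()` call keeps working after a column is added.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data, used only for illustration
toy_cols <- tibble(
  rating = c(4, 5),
  title  = c("A", "B"),
  genres = c("Drama", "Comedy")
)

# title and genres first, then whatever other columns exist
before <- names(select(toy_cols, title, genres, everything()))
before

# Add a column; the identical select() call picks it up automatically
toy_cols_wider <- mutate(toy_cols, year = c(1999, 2000))
after <- names(select(toy_cols_wider, title, genres, everything()))
after
```

A hard-coded column list would silently drop `year` in the second call, which is the fragility the exercise asks you to avoid.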
## Exercise 2: Calculating with `mutate()`-like functions
Some of the variables in the `movieLens` dataset are in *camelCase* (in fact, *movieLens* itself is in camelCase). Let's rename the two camelCase variables, `userId` and `movieId`, to use *snake_case* instead, and assign the renamed object back to `movieLens`.
```{r}
movieLens <- movieLens %>%
  # Changed this
  # rename(user_id == userId,
  #        movie_id == movieId)
  # To this
  rename(user_id = userId,
         movie_id = movieId)
# Inspect the renamed object (movieLens), not the original dslabs::movielens
head(movieLens)
```
As you already know, `mutate()` defines and inserts new variables into a tibble. There is *another mystery function similar to `mutate()`* that adds the new variable, but also drops existing ones. I wanted to create an `average_rating` column that takes the `mean(rating)` across all entries, and I only want to see that variable (i.e., drop all others!), but I forgot what that mystery function is. Can you remember?
```{r}
# Most likely, the prompt of the question refers to transmute, which "creates a new data frame containing only the specified computations"
# (see: https://dplyr.tidyverse.org/reference/transmute.html)
# Changed this
#mutate(movieLens,
# average_rating = mean(rating))
# To this
transmute(movieLens,
          average_rating = mean(rating))
```
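The difference between `mutate()`, `transmute()`, and `summarise()` is easiest to see by comparing the shapes of their results. The following sketch uses a hypothetical tibble, `toy_ratings`, that is not part of the worksheet.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data, used only for illustration
toy_ratings <- tibble(
  title  = c("A", "A", "B"),
  rating = c(3, 5, 4)
)

# mutate() keeps every row and every existing column
kept_all <- mutate(toy_ratings, average_rating = mean(rating))
dim(kept_all)

# transmute() keeps every row but only the new column
new_only <- transmute(toy_ratings, average_rating = mean(rating))
dim(new_only)

# summarise() collapses the rows into a single summary row
collapsed <- summarise(toy_ratings, average_rating = mean(rating))
dim(collapsed)
```

In recent dplyr versions, `mutate(.keep = "none")` is offered as an alternative spelling of `transmute()`; which one you use here is a matter of taste.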
## Exercise 3: Calculating with `summarise()`-like functions
Alone, `tally()` is a short form of `summarise()`. `count()` is short-hand for `group_by()` and `tally()`.
Each entry of the movieLens table corresponds to a movie rating by a user. Therefore, if more than one user rated the same movie, there will be several entries for the same movie. I want to find out how many times each movie has been reviewed, or in other words, how many times each movie title appears in the dataset.
```{r}
movieLens %>%
  group_by(title) %>%
  tally()
```
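The two equivalences stated at the start of this exercise can be checked on a small example. This is only a sketch, and `toy_reviews` is a hypothetical tibble invented for the purpose.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data, used only for illustration
toy_reviews <- tibble(
  title = c("A", "A", "B"),
  year  = c(1999, 2000, 2000)
)

# On ungrouped data, tally() counts all rows, like summarise(n = n())
total_tally     <- tally(toy_reviews)
total_summarise <- summarise(toy_reviews, n = n())
all.equal(total_tally, total_summarise)

# count(title) matches group_by(title) followed by tally()
by_count <- count(toy_reviews, title)
by_tally <- toy_reviews %>%
  group_by(title) %>%
  tally()
all.equal(by_count, by_tally)
```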
Without using `group_by()`, I want to find out how many movie reviews there have been for each year.
```{r}
# movieLens %>%
#   tally(year)
# Changed to: tally(year) treats `year` as a weight and sums it,
# whereas count(year) groups by year and counts the rows in each group
movieLens %>%
  count(year)
```
Both `count()` and `tally()` can be grouped by multiple columns. Below, I want to count the number of movie reviews by title and rating, and sort the results.
```{r}
# movieLens %>%
#   count(c(title, rating), sort = TRUE)
# Changed to:
movieLens %>%
  count(title, rating, sort = TRUE) # pass the columns directly instead of wrapping them in c()
```
`count()` and `tally()` let you quickly count items within your dataset. In addition, `add_tally()` and `add_count()` are handy shortcuts that add a count column to your tibble, rather than collapsing each group.
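As a quick illustration of the difference, the sketch below contrasts `count()` with `add_count()` on a hypothetical tibble, `toy_scores` (a made-up name, not part of the worksheet).

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data, used only for illustration
toy_scores <- tibble(
  title  = c("A", "A", "B"),
  rating = c(3, 5, 4)
)

# count() collapses to one row per title
collapsed_counts <- count(toy_scores, title)
collapsed_counts

# add_count() keeps all original rows and appends the group size as column n
with_counts <- add_count(toy_scores, title)
with_counts
```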
## Exercise 4: Calculating with `group_by()`
We can calculate the mean rating by year, and store it in a new column called `avg_rating`:
```{r}
movieLens %>%
  group_by(year) %>%
  summarize(avg_rating = mean(rating))
```
Using `summarize()`, we can find the minimum and the maximum rating by title, stored under columns named `min_rating`, and `max_rating`, respectively.
```{r}
#movieLens %>%
# mutate(min_rating = min(rating),
# max_rating = max(rating))
# Changed to: (the pipeline did not group data by title)
movieLens %>%
  group_by(title) %>%
  summarize(min_rating = min(rating, na.rm = TRUE),
            max_rating = max(rating, na.rm = TRUE))
```
## Exercise 5: Scoped variants with `across()`
`across()` is a newer dplyr function (introduced in `dplyr` 1.0.0) that allows you to apply a transformation to multiple variables selected with the same syntax as `select()` and `rename()`. For this section, we will use the `starwars` dataset, which ships with dplyr. First, let's transform it into a tibble and store it under the variable `starWars`.
```{r}
starWars <- as_tibble(starwars)
```
We can find the mean for all columns that are numeric, ignoring the missing values:
```{r}
starWars %>%
  summarise(across(where(is.numeric), function(x) mean(x, na.rm = TRUE)))
```
We can find the minimum height and mass within each species, ignoring the missing values:
```{r}
### ERROR HERE (fixed) ###
starWars %>%
  group_by(species) %>%
  summarise(across(c("height", "mass"), function(x) min(x, na.rm = TRUE)))
```
Note that for a species whose values are all `NA`, `min(x, na.rm = TRUE)` is taken over an empty set; R returns `Inf` in that case and issues a warning.
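If the `Inf` values are unwanted, one option is to return `NA` for groups with no observed values. The sketch below is only one possible approach: `safe_min()` is a hypothetical helper and `toy_sw` is made-up data, neither of which is part of the worksheet.

```{r}
library(dplyr)
library(tibble)

# min() over nothing but NAs returns Inf (the warning is suppressed here)
all_missing <- c(NA_real_, NA_real_)
suppressWarnings(min(all_missing, na.rm = TRUE))

# Hypothetical helper: return NA instead of Inf when every value is missing
safe_min <- function(x) {
  if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
}

# Hypothetical toy data: the "Human" group has no observed height
toy_sw <- tibble(
  species = c("Droid", "Droid", "Human"),
  height  = c(96, 167, NA)
)

min_heights <- toy_sw %>%
  group_by(species) %>%
  summarise(across(height, safe_min))
min_heights
```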
## Exercise 6: Making tibbles
Manually create a tibble with 4 columns:
- `birth_year` should contain years 1998 to 2005 (inclusive);
- `birth_weight` should take the `birth_year` column, subtract 1995, and multiply by 0.45;
- `birth_location` should contain three locations (Liverpool, Seattle, and New York).
- Modification: added a `,` after `~birth_location` and wrapped the `birth_location` values in `""`.
```{r}
fakeStarWars <- tribble(
  ~name,                ~birth_weight, ~birth_year, ~birth_location,
  "Luke Skywalker",     1.35,          1998,        "Liverpool, England",
  "C-3PO",              1.80,          1999,        "Liverpool, England",
  "R2-D2",              2.25,          2000,        "Seattle, WA",
  "Darth Vader",        2.70,          2001,        "Liverpool, England",
  "Leia Organa",        3.15,          2002,        "New York, NY",
  "Owen Lars",          3.60,          2003,        "Seattle, WA",
  "Beru Whitesun Lars", 4.05,          2004,        "Liverpool, England",
  "R5-D4",              4.50,          2005,        "New York, NY"
)
fakeStarWars
```
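The rule for `birth_weight` described above (take `birth_year`, subtract 1995, multiply by 0.45) can also be computed rather than typed in by hand. Here is a minimal sketch using `tibble()`, where a later column may refer to an earlier one; the object name `computed_weights` is hypothetical.

```{r}
library(tibble)

# birth_weight is derived from birth_year inside the same tibble() call
computed_weights <- tibble(
  birth_year   = 1998:2005,
  birth_weight = (birth_year - 1995) * 0.45
)
computed_weights

# The computed values agree with the hand-typed ones in the tribble above
typed_weights <- c(1.35, 1.80, 2.25, 2.70, 3.15, 3.60, 4.05, 4.50)
all.equal(computed_weights$birth_weight, typed_weights)
```

Computing the column avoids transcription slips, while `tribble()` remains convenient for the free-text columns such as `name` and `birth_location`.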
## Attributions
Thanks to Icíar Fernández-Boyano for writing most of this document, and Albina Gibadullina, Diana Lin, Yulia Egorova, and Vincenzo Coia for their edits.