Merge pull request #25 from stat545ubc-2023/mile2ex6
Milestone 2 Exercise 6 (Graded)
Sebastian-Santana-Ort authored Sep 28, 2023
2 parents 3a79227 + 284d2e0 commit 0e1ea27
Showing 2 changed files with 564 additions and 22 deletions.
52 changes: 30 additions & 22 deletions troubleshooting-2.Rmd
@@ -15,8 +15,7 @@ Note that errors are not all syntactic (i.e., broken code)! Some are logical err

[MovieLens](https://dl.acm.org/doi/10.1145/2827872) is a series of datasets, widely used in education, that describe movie ratings from the MovieLens [website](https://movielens.org/). The datasets are collected by the [GroupLens Research Project](https://grouplens.org/datasets/movielens/) at the University of Minnesota. Here, we load the MovieLens 100K dataset from Rafael Irizarry and Amy Gill's R package, [dslabs](https://cran.r-project.org/web/packages/dslabs/dslabs.pdf), which contains datasets useful for data analysis practice, homework, and projects in data science courses and workshops. We'll also load the other required packages.

```{r}
### ERROR HERE ###
```{r eval=FALSE, include=FALSE}
# Changed the previous install.packages() calls to put the package names in quotes ("")
install.packages("dslabs")
install.packages("tidyverse")
@@ -25,6 +24,12 @@ install.packages("devtools") # Do not run this if you already have this package
devtools::install_github("JoeyBernhardt/singer")
install.packages("gapminder")
```
```{r message=FALSE, warning=FALSE}
library("dslabs")
library("tidyverse")
library("stringr")
library("gapminder")
```

Let's have a look at the dataset! My goal is to:

@@ -33,7 +38,7 @@ Let's have a look at the dataset! My goal is to:
- Have a quick look at the tibble, using a *dplyr function*.

```{r}
### ERROR HERE ###
class(dslabs::movielens)
movieLens <- as_tibble(dslabs::movielens)
dim(movieLens)
@@ -51,9 +56,8 @@ Now that we've had a quick look at the dataset, it would be interesting to explo
- have *more than* 4.5 stars, and were filmed *before* 1995.

```{r}
### ERROR HERE ###
filter(movieLens, genres == "Drama")
filter(movieLens, genres == "Drama")
# Changed this
#filter(movieLens, !genres == "Drama")
# To this:
@@ -69,14 +73,13 @@ filter(movieLens, year > 2000)
# To this:
filter(movieLens, year == 1999 | year == 2000)
filter(movieLens, rating > 4.5, year < 1995)
```
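
To make the logic of these filters easier to check by eye, here is a minimal sketch on a made-up four-row tibble. The object `toy` and all of its values are hypothetical stand-ins, not MovieLens data, and the sketch assumes only that `dplyr` and `tibble` are installed.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for movieLens (values are made up).
toy <- tibble(
  title  = c("A", "B", "C", "D"),
  genres = c("Drama", "Comedy", "Drama", "Action"),
  year   = c(1994, 1999, 2000, 1990),
  rating = c(5, 4, 3.5, 5)
)

# Everything that is NOT a drama: != negates the comparison.
not_drama <- filter(toy, genres != "Drama")

# Filmed in 1999 or 2000: %in% is shorthand for year == 1999 | year == 2000.
y1999_2000 <- filter(toy, year %in% c(1999, 2000))

# More than 4.5 stars AND filmed before 1995: a comma between conditions means AND.
top_old <- filter(toy, rating > 4.5, year < 1995)
```

Writing `genres != "Drama"` gives the same rows as `!genres == "Drama"`, but it avoids having to remember operator precedence.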

While filtering for *all movies that do not belong to the genre drama* above, I noticed something interesting. I want to filter for the same thing again, this time selecting variables **title and genres first,** and then *everything else*. But I want to do this in a robust way, so that (for example) if I end up changing `movieLens` to contain more or fewer columns some time in the future, the code will still work. Hint: there is a function to select "everything else"...
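
One way to read the hint, sketched on a hypothetical three-row tibble (`toy` and its columns are made up for illustration): the tidyselect helper `everything()` picks up whichever columns have not been named yet, so the call keeps working if columns are added or removed later.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; column names loosely mirror movieLens.
toy <- tibble(
  movieId = 1:3,
  title   = c("A", "B", "C"),
  year    = c(1994, 1999, 2000),
  genres  = c("Drama", "Comedy", "Action")
)

reordered <- toy %>%
  filter(genres != "Drama") %>%
  # title and genres first, then all remaining columns in their original order
  select(title, genres, everything())

names(reordered)
```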

```{r}
### ERROR HERE ###
movieLens %>%
# Changed this
#filter(!genres == "Drama") %>%
@@ -93,7 +96,7 @@ movieLens %>%
Some of the variables in the `movieLens` dataset are in *camelCase* (in fact, *movieLens* is in camelCase). Let's clean these two variables to use *snake_case* instead, and assign our post-rename object back to "movieLens".

```{r}
### ERROR HERE ###
movieLens <- movieLens %>%
# Changed this
#rename(user_id == userId,
@@ -103,12 +106,13 @@ movieLens <- movieLens %>%
movie_id = movieId)
head(movieLens)
```
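
The fix above comes down to the argument form of `rename()`: it takes `new_name = old_name` pairs, written with a single `=` rather than `==`. A minimal sketch on a hypothetical two-row tibble (the values are made up):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with camelCase column names.
toy <- tibble(userId = c(1L, 2L), movieId = c(10L, 20L), rating = c(4, 5))

# rename() expects new_name = old_name.
renamed <- toy %>%
  rename(user_id = userId,
         movie_id = movieId)

names(renamed)
```

Columns that are not mentioned, such as `rating` here, are left untouched and keep their position.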

As you already know, `mutate()` defines and inserts new variables into a tibble. There is *another mystery function similar to `mutate()`* that adds the new variable, but also drops existing ones. I wanted to create an `average_rating` column that takes the `mean(rating)` across all entries, and I only want to see that variable (i.e., drop all others!) but I forgot what that mystery function is. Can you remember?
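
For comparison, here is a small sketch of the two behaviours side by side, assuming the mystery function is `transmute()` (an assumption based on the description, not something the prompt states). The tibble `toy` is hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (values are made up).
toy <- tibble(title = c("A", "B", "C"), rating = c(3, 4, 5))

# mutate() keeps the existing columns and appends the new one.
with_mutate <- toy %>%
  mutate(average_rating = mean(rating, na.rm = TRUE))

# transmute() keeps only the columns it is asked to compute.
with_transmute <- toy %>%
  transmute(average_rating = mean(rating, na.rm = TRUE))

names(with_mutate)
names(with_transmute)
```

Both results still have one row per original row; only the set of columns differs. Recent dplyr releases also offer `mutate(.keep = "none")` for the same purpose.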

```{r}
### ERROR HERE ###
# Most likely, the prompt of the question refers to transmute, which "creates a new data frame containing only the specified computations"
# (see: https://dplyr.tidyverse.org/reference/transmute.html)
# Changed this
@@ -134,7 +138,7 @@ movieLens %>%
Without using `group_by()`, I want to find out how many movie reviews there have been for each year.

```{r}
### ERROR HERE ###
#movieLens %>%
# tally(year)
#Changed to
@@ -145,7 +149,7 @@ movieLens %>% # Tally is used for grouped data, count is a short-hand for grou
Both `count()` and `tally()` can be grouped by multiple columns. Below, I want to count the number of movie reviews by title and rating, and sort the results.

```{r}
### ERROR HERE ###
#movieLens %>%
# count(c(title, rating), sort = TRUE)
# changed to:
@@ -168,14 +172,15 @@ movieLens %>%
Using `summarize()`, we can find the minimum and the maximum rating by title, stored under columns named `min_rating`, and `max_rating`, respectively.

```{r}
### ERROR HERE ###
#movieLens %>%
# mutate(min_rating = min(rating),
# max_rating = max(rating))
# Changed to: (the pipeline did not group data by title)
movieLens %>%
group_by(title) %>%
summarize(min_rating = min(rating, na.rm = TRUE), max_rating = max(rating, na.rm = TRUE))
```
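
The counting and summarizing fixes in this part of the worksheet can be checked on a small example. This is a sketch on a hypothetical five-row tibble (`toy` and its values are made up): `count()` takes bare column names separated by commas rather than `c(...)`, and `summarize()` needs a preceding `group_by()` to return one row per title.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (values are made up; one rating is missing on purpose).
toy <- tibble(
  title  = c("A", "A", "B", "B", "B"),
  year   = c(1994, 1994, 1999, 1999, 1999),
  rating = c(3, 5, 4, 4, NA)
)

# Reviews per year without an explicit group_by():
# count(year) is shorthand for group_by(year) %>% tally().
per_year <- toy %>% count(year)

# Counting by two columns, most frequent combination first.
per_pair <- toy %>% count(title, rating, sort = TRUE)

# One row per title with its lowest and highest rating; na.rm drops the NA.
ranges <- toy %>%
  group_by(title) %>%
  summarize(min_rating = min(rating, na.rm = TRUE),
            max_rating = max(rating, na.rm = TRUE))
```

Using `mutate()` in place of `summarize()` would instead repeat the minimum and maximum on every row, which is why the original pipeline did not give one row per title.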

## Exercise 5: Scoped variants with `across()`
@@ -212,22 +217,25 @@ Manually create a tibble with 4 columns:
- `birth_year` should contain years 1998 to 2005 (inclusive);
- `birth_weight` should take the `birth_year` column, subtract 1995, and multiply by 0.45;
- `birth_location` should contain three locations (Liverpool, Seattle, and New York).
- Modification: added the missing *,* after `~birth_location` in the header row, and wrapped each `birth_location` value in *""* so that it is read as a single string.

```{r}
### ERROR HERE ###
fakeStarWars <- tribble(
-~name, ~birth_weight, ~birth_year, ~birth_location
-"Luke Skywalker", 1.35 , 1998 , Liverpool, England,
-"C-3PO" , 1.80 , 1999 , Liverpool, England,
-"R2-D2" , 2.25 , 2000 , Seattle, WA,
-"Darth Vader" , 2.70 , 2001 , Liverpool, England,
-"Leia Organa" , 3.15 , 2002 , New York, NY,
-"Owen Lars" , 3.60 , 2003 , Seattle, WA,
-"Beru Whitesun Iars", 4.05 , 2004 , Liverpool, England,
-"R5-D4" , 4.50 , 2005 , New York, NY,
+~name, ~birth_weight, ~birth_year, ~birth_location,
+"Luke Skywalker", 1.35 , 1998 , "Liverpool, England",
+"C-3PO" , 1.80 , 1999 , "Liverpool, England",
+"R2-D2" , 2.25 , 2000 , "Seattle, WA",
+"Darth Vader" , 2.70 , 2001 , "Liverpool, England",
+"Leia Organa" , 3.15 , 2002 , "New York, NY",
+"Owen Lars" , 3.60 , 2003 , "Seattle, WA",
+"Beru Whitesun Iars", 4.05 , 2004 , "Liverpool, England",
+"R5-D4" , 4.50 , 2005 , "New York, NY"
)
fakeStarWars
```
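
As a cross-check on the numbers typed into `tribble()`, the `birth_weight` rule from the instructions (take `birth_year`, subtract 1995, multiply by 0.45) can also be computed rather than entered by hand. This is a sketch using column-wise `tibble()` on the first three rows only; the object name `fake_people` is hypothetical.

```r
library(tibble)

# tibble() builds columns in order, so birth_weight can refer to birth_year.
fake_people <- tibble(
  name           = c("Luke Skywalker", "C-3PO", "R2-D2"),
  birth_year     = 1998:2000,
  birth_weight   = (birth_year - 1995) * 0.45,
  # Quotes keep "Liverpool, England" as one string despite the inner comma.
  birth_location = c("Liverpool, England", "Liverpool, England", "Seattle, WA")
)

fake_people
```

Computing the column this way reproduces the first three hand-typed weights (1.35, 1.80, 2.25), up to floating-point rounding.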

## Attributions

Thanks to Icíar Fernández-Boyano for writing most of this document, and Albina Gibadullina, Diana Lin, Yulia Egorova, and Vincenzo Coia for their edits.
