troubleshooting-2.Rmd

---
title: "Team Troubleshooting Deliverable 2"
output: github_document
---

```{r include = FALSE}
knitr::opts_chunk$set(error = TRUE)
```

There are **11 code chunks with errors** in this Rmd. Your objective is to fix all of the errors in this worksheet. For the purpose of grading, each erroneous code chunk is equally weighted.

Note that errors are not all syntactic (i.e., broken code)! Some are logical errors as well (i.e. code that does not do what it was intended to do).

## Exercise 1: Exploring with `select()` and `filter()`

[MovieLens](https://dl.acm.org/doi/10.1145/2827872) are a series of datasets widely used in education, that describe movie ratings from the MovieLens [website](https://movielens.org/). There are several MovieLens datasets, collected by the [GroupLens Research Project](https://grouplens.org/datasets/movielens/) at the University of Minnesota. Here, we load the MovieLens 100K dataset from Rafael Irizarry and Amy Gill's R package, [dslabs](https://cran.r-project.org/web/packages/dslabs/dslabs.pdf), which contains datasets useful for data analysis practice, homework, and projects in data science courses and workshops. We'll also load other required packages.

```{r}
### ERROR HERE ###
library(dslabs) ##change to library function for syntax error
library(tidyverse)
library(stringr)
library(gapminder)
```

Let's have a look at the dataset! My goal is to:

-   Find out the "class" of the dataset.
-   If it isn't a tibble already, coerce it into a tibble and store it in the variable "movieLens".
-   Have a quick look at the tibble, using a *dplyr function*.

```{r}
### ERROR HERE ###
class(dslabs::movielens)
movieLens <- as_tibble(dslabs::movielens)
dim(movieLens)
glimpse(movieLens) ## add to show some rows of dataset
```

Now that we've had a quick look at the dataset, it would be interesting to explore the rows (observations) in some more detail. I'd like to consider the movie entries that...

-   belong *exclusively* to the genre *"Drama"*;
-   don't belong *exclusively* to the genre *"Drama"*;
-   were filmed *after* the year 2000;
-   were filmed in 1999 *or* 2000;
-   have *more than* 4.5 stars, and were filmed *before* 1995.

```{r}
### ERROR HERE ###
filter(movieLens, genres == "Drama")
filter(movieLens, !genres == "Drama")
filter(movieLens, year > 2000) ## change to > 2000 as we need "after" year 2000
filter(movieLens, year == 1999 | year == 2000) ## change to year == 2000
filter(movieLens, rating > 4.5, year < 1995)
```

While filtering for *all movies that do not belong to the genre drama* above, I noticed something interesting. I want to filter for the same thing again, this time selecting variables **title and genres first,** and then *everything else*. But I want to do this in a robust way, so that (for example) if I end up changing `movieLens` to contain more or less columns some time in the future, the code will still work. Hint: there is a function to select "everything else"...

```{r}
### ERROR HERE ###
movieLens %>%
  filter(!genres == "Drama") %>%
  select(title, genres, everything()) ##use everything to represent the rest
```

## Exercise 2: Calculating with `mutate()`-like functions

Some of the variables in the `movieLens` dataset are in *camelCase* (in fact, *movieLens* is in camelCase). Let's clean these two variables to use *snake_case* instead, and assign our post-rename object back to "movieLens".

```{r}
### ERROR HERE ###
movieLens <- movieLens %>%
  rename(user_id = userId,
         movie_id = movieId) ## == to = for syntax error
head(movieLens)
```

As you already know, `mutate()` defines and inserts new variables into a tibble. There is *another mystery function similar to `mutate()`* that adds the new variable, but also drops existing ones. I wanted to create an `average_rating` column that takes the `mean(rating)` across all entries, and I only want to see that variable (i.e drop all others!) but I forgot what that mystery function is. Can you remember?

```{r}
### ERROR HERE ### 
transmute(movieLens,
       average_rating = mean(rating)) ## use transmute to show only new column.
```

## Exercise 3: Calculating with `summarise()`-like functions

Alone, `tally()` is a short form of `summarise()`. `count()` is short-hand for `group_by()` and `tally()`.

Each entry of the movieLens table corresponds to a movie rating by a user. Therefore, if more than one user rated the same movie, there will be several entries for the same movie. I want to find out how many times each movie has been reviewed, or in other words, how many times each movie title appears in the dataset.

```{r}
movieLens %>%
  group_by(title) %>%
  tally()
```

Without using `group_by()`, I want to find out how many movie reviews there have been for each year.

```{r}
### ERROR HERE ###
movieLens %>%
  count(year) ##change from tally to count
```

Both `count()` and `tally()` can be grouped by multiple columns. Below, I want to count the number of movie reviews by title and rating, and sort the results.

```{r}
### ERROR HERE ###
movieLens %>%
  count(title, rating, sort = TRUE) ## don't have to make vector as input to count
```

Not only do `count()` and `tally()` quickly allow you to count items within your dataset, `add_tally()` and `add_count()` are handy shortcuts that add an additional columns to your tibble, rather than collapsing each group.

## Exercise 4: Calculating with `group_by()`

We can calculate the mean rating by year, and store it in a new column called `avg_rating`:

```{r}
movieLens %>%
  group_by(year) %>%
  summarize(avg_rating = mean(rating))
```

Using `summarize()`, we can find the minimum and the maximum rating by title, stored under columns named `min_rating`, and `max_rating`, respectively.

```{r}
### ERROR HERE ###
movieLens %>%
  group_by(title) %>% ## group by the title
  summarise(min_rating = min(rating),  ## summarise the min and max of rating
         max_rating = max(rating))
```

## Exercise 5: Scoped variants with `across()`

`across()` is a newer dplyr function (`dplyr` 1.0.0) that allows you to apply a transformation to multiple variables selected with the `select()` and `rename()` syntax. For this section, we will use the `starwars` dataset, which is built into R. First, let's transform it into a tibble and store it under the variable `starWars`.

```{r}
starWars <- as_tibble(starwars)
```

We can find the mean for all columns that are numeric, ignoring the missing values:

```{r}
starWars %>%
  summarise(across(where(is.numeric), function(x) mean(x, na.rm=TRUE)))
```

We can find the minimum height and mass within each species, ignoring the missing values: 

```{r}
### ERROR HERE ###
starWars %>%
  group_by(species) %>%
  summarise(across(c(height, mass), function(x) min(x, na.rm=TRUE))) ## create vector of columns "height" and "mass" in across to fix syntax error
```

Note that here R has taken the convention that the minimum value of a set of `NA`s is `Inf`.

## Exercise 6: Making tibbles

Manually create a tibble with 4 columns:

-   `birth_year` should contain years 1998 to 2005 (inclusive);
-   `birth_weight` should take the `birth_year` column, subtract 1995, and multiply by 0.45;
-   `birth_location` should contain three locations (Liverpool, Seattle, and New York).

```{r}
### ERROR HERE ###
fakeStarWars <- tribble(
  ~name,            ~birth_weight,  ~birth_year, ~birth_location,
  "Luke Skywalker",  1.35      ,   1998        ,  "Liverpool, England",
  "C-3PO"         ,  1.80      ,   1999        ,  "Liverpool, England",
  "R2-D2"         ,  2.25      ,   2000        ,  "Seattle, WA",
  "Darth Vader"   ,  2.70      ,   2001        ,  "Liverpool, England",
  "Leia Organa"   ,  3.15      ,   2002        ,  "New York, NY",
  "Owen Lars"     ,  3.60      ,   2003        ,  "Seattle, WA",
  "Beru Whitesun Iars", 4.05   ,   2004        ,  "Liverpool, England",
  "R5-D4"         ,  4.50      ,   2005        ,  "New York, NY"
) ## add , after birth_location to solve syntax error, and "" the birth locations
#fakeStarWars %>% mutate(birth_weight_check = (birth_year-1995)*0.45)
fakeStarWars
```

## Attributions

Thanks to Icíar Fernández-Boyano for writing most of this document, and Albina Gibadullina, Diana Lin, Yulia Egorova, and Vincenzo Coia for their edits.