...gisticsstuffnotlinkedanywhere/10100-2024-project10-teachinglearning-backup.adoc
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,231 @@ | ||
= TDM 10100: R Project 10 -- 2024 | ||
|
||
**Motivation:** In Project 9 we introduced the concept of outside libraries and packages in R, learned how to import a new library, and used some basic functions from the libraries we imported. This project is a direct extension of that one, focusing on two of the most-used R libraries: `tidyr` and `dplyr` (pronounced dee-plier), the paradigms that come with each, and some commonly used functions that they contain. | ||
|
||
**Context:** A good understanding of libraries, import statements, and R syntax will be important for this project. | ||
|
||
**Scope:** dplyr, tidyr, libraries, R | ||
|
||
.Learning Objectives: | ||
**** | ||
- Learn what "tidy" data is and why it's useful | ||
- Use basic `tidyr` utilities to process and reformat data | ||
- Learn about `dplyr` mutations | ||
- Manipulate data and reimplement `apply()` processes using `dplyr` | ||
- Combine `dplyr` and `tidyr` utilities to "make a dataframe tidy" | ||
**** | ||
|
||
Make sure to read about and use the template found xref:templates.adoc[here], and read the important information about project submissions xref:submissions.adoc[here]. | ||
|
||
== Dataset(s) | ||
|
||
This project will use the following dataset(s): | ||
|
||
- `/anvil/projects/tdm/data/beer/beers.csv` | ||
|
||
== Questions | ||
|
||
=== Question 1 (2 pts) | ||
|
||
As you've likely observed throughout this semester, much of data science consists of creating new columns or reshaping data so that it is easier to analyze, and a lot of time can be sunk purely into data pre-processing. `tidyr` puts forward an answer to this issue: a standardized way to reformat data so that pre-processing and data analysis are as hands-off and easy as possible. | ||
|
||
While we could continue to provide our own perception of what "tidy data" is, who better to explain it to you than the authors of tidyr? Read through https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html[this fantastic article] and, in a markdown cell, answer the following questions in your own words: | ||
|
||
. What is tidy data and why is it important? | ||
. What is the difference between an observation and a variable? | ||
|
||
Additionally, read the data at `/anvil/projects/tdm/data/beer/beers.csv` into a new data.table called `beers_dat` (you may need to review the previous project for a reminder on how to use the data.table library). Print the head of that table. | ||
|
||
Finally, add to your existing markdown cell a few sentences (at least 1-2) detailing what the variables are in the provided data, what the observations are in the provided data, and any suggestions you can think of to make the data more "tidy". Don't make any of these changes, but take a few minutes to think about how you might restructure the data on your own. | ||
|
||
.Deliverables | ||
==== | ||
- A few sentences on what makes data "tidy" and why that matters | ||
- A few sentences on the difference between an observation and a variable | ||
- A few sentences detailing the variables, observations and how you could tidy the data you read in. | ||
- A new data.table, `beers_dat`, created from the data in the specified `beers.csv` file | ||
==== | ||
|
||
=== Question 2 (2 pts) | ||
|
||
One of the most common uses for `tidyr` is the reshaping and combination of rows and columns in our dataframes/data.tables. Let's take a look at some useful functions provided by `tidyr` and then apply them to make our data a bit more concise. | ||
|
||
Firstly, let's take a look at the https://www.rdocumentation.org/packages/tidyr/versions/0.8.2/topics/unite[`unite()`] function. `unite()` takes two or more columns and "unites" them into one, often with a separator character between the values that makes it easy to see where one column's value ends and the next begins. Here's a short example: | ||
|
||
[source, R] | ||
---- | ||
library(tidyr) | ||
# if I have two columns in a data.table called images_dat, color and brightness, and want to combine them so that each row is of the form color:brightness, I could do the following. | ||
unite(images_dat, "color_brightness", c("color","brightness"), sep=":") | ||
---- | ||
|
||
[NOTE] | ||
==== | ||
The opposite of `unite()` is https://tidyr.tidyverse.org/reference/separate.html[`separate()`]. While we won't use it in this question, you may make a mistake and want to separate your columns. The notation is generally of a form similar to `separate(dat, "newcol", c("col1", "col2"), sep=";")`. | ||
==== | ||
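
To see how `unite()` and `separate()` mirror each other, here is a minimal sketch on a small made-up table. The `images_dat` table and its values are hypothetical, chosen only to match the example above; they are not part of the project data.

[source, R]
----
library(tidyr)

# Hypothetical toy data, not part of the project dataset
images_dat <- data.frame(
  id         = 1:3,
  color      = c("red", "green", "blue"),
  brightness = c("dim", "bright", "dim")
)

# Combine color and brightness into one column, e.g. "red:dim"
united <- unite(images_dat, "color_brightness", c("color", "brightness"), sep = ":")
print(united)

# Split the combined column back into its two original columns
restored <- separate(united, "color_brightness", c("color", "brightness"), sep = ":")
print(restored)
----

Note that the separator you pass to `separate()` should be the same one you used in `unite()`, otherwise the values will not split where you expect.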
|
||
To solve this question, we want you to accomplish two main tasks: | ||
|
||
. Firstly, use `unite()` to merge the state and country columns into a new column called `state_country`, using an underscore "_" as the separator character | ||
. Next, print the `head()` of your modified `beers_dat` table | ||
|
||
.Deliverables | ||
==== | ||
- The results of merging the state and country columns | ||
==== | ||
|
||
=== Question 3 (2 pts) | ||
|
||
Another oft-used `tidyr` function is https://tidyr.tidyverse.org/reference/pivot_longer.html[pivot_longer()], which can be used to efficiently collapse multiple columns down to easy-to-evaluate key-value pairs. Please briefly review the previously linked documentation, paying special attention to the example provided towards the bottom of the page in order to best grasp how `pivot_longer()` is used. | ||
|
||
[NOTE] | ||
==== | ||
The opposite of `pivot_longer()` is https://tidyr.tidyverse.org/reference/pivot_wider.html[pivot_wider()]. Again, we won't be using it in this question, and it's generally not used on tidy data as it typically contributes to "untidiness", but it is still a useful function to be aware of in case you ever want to reformat or drastically restructure your data. | ||
==== | ||
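
If the documentation example feels abstract, the following sketch may help. It uses a tiny made-up table (`products`, `season`, and `on_sale` are hypothetical names, not columns from the beer data) to show `pivot_longer()` collapsing two columns into name/value pairs and `pivot_wider()` undoing that.

[source, R]
----
library(tidyr)

# Hypothetical toy data: one row per product, two status-like columns
products <- data.frame(
  id      = 1:3,
  season  = c("Winter", "Summer", "Winter"),
  on_sale = c("t", "f", "f")
)

# Collapse season and on_sale into name/value pairs (two rows per product)
long <- pivot_longer(
  products,
  cols      = c(season, on_sale),
  names_to  = "attribute",
  values_to = "value"
)
print(long)

# Spread the name/value pairs back out into one column per attribute
wide <- pivot_wider(long, names_from = attribute, values_from = value)
print(wide)
----

Both pivoted columns here hold character values. `pivot_longer()` needs the columns it stacks to share a compatible type, since they all end up in a single `value` column.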
|
||
. Let's look at our data's structure. Most of this data is devoted to attributes of the different beers, but there are three columns that deal with more brewery-specific details: `brewery_id`, `availability` and `retired`. Use `pivot_longer()` on the `availability` and `retired` columns, pivoting so that there is a new column called `status_category` that contains the names (either "availability" or "retired"), and a new column called `status` that contains the values (e.g. "Rotating", "Year-round", "f"). | ||
. Using https://www.statmethods.net/management/subset.html[`subset()`], create a new dataframe called `bar_logistics` that contains the `id` and `brewery_id` columns as identifiers, along with the `status_category` and `status` columns that you just created. Then use `pivot_wider()` to restore the `status_category` and `status` columns to their separate `availability` and `retired` columns. | ||
. Finally, drop `brewery_id`, `status_category`, and `status` from your original `beers_dat` dataframe, as we are now storing them in their own, separate table. (Hint: this can also be done using `subset()`) | ||
|
||
[IMPORTANT] | ||
==== | ||
If this isn't your first foray into the `tidyverse` (the collection of libraries that includes `tidyr` and `dplyr`, curated largely by Dr. Hadley Wickham) you may be familiar with the https://tidyr.tidyverse.org/reference/gather.html[`gather()`] function and think that it largely resembles `pivot_longer()`. This is because `pivot_longer()` is basically a newer, better form of `gather()`. Since its use is encouraged over `gather()`, it's what we'll focus on here. | ||
==== | ||
|
||
Finally, print the head of your resulting data and observe the results. You should see something similar to the following: | ||
|
||
image::f24-101-p10-1.png[Pivoted Data Table, width=792, height=500, loading=lazy, title="Pivoted Data Table"] | ||
|
||
[NOTE] | ||
==== | ||
Typically a pivot like this is made because we want to separate our data. As you now know, modularity is a key concept when it comes to tidy data: different concepts and groups of data should be kept in separate tables. The motivations behind this are many: limiting the number of columns simplifies visual analysis, makes operations that work on an entire table at once easier to use, and can help us separate confidential data from non-confidential data. | ||
==== | ||
|
||
[IMPORTANT] | ||
==== | ||
While this question is technically doable using different methods, only solutions that use the pivot functions, `pivot_longer()` and `pivot_wider()`, will be accepted for full credit. | ||
==== | ||
|
||
.Deliverables | ||
==== | ||
- A new table `bar_logistics` containing `status_category`, `status`, `id`, and `brewery_id` | ||
- Your original `beers_dat` table, with the brewery-specific items dropped out | ||
==== | ||
|
||
=== Question 4 (2 pts) | ||
|
||
Now, with our data a little more separated and easier to handle, we can move into our second library of focus for this project: https://dplyr.tidyverse.org/[`dplyr`]! Arguably the two most important utilities provided by `dplyr` are `group_by()` and `mutate()`, although `filter()` and others are also commonly used. In the next two questions, we'll cover these three functions in detail. | ||
|
||
Before we dive into `group_by()`, let's quickly cover something you likely already saw in documentation: `%>%` piping. Piping allows us to write cleaner code by taking the output of one function and using it as the input to another function. Sounds simple enough, right? It can get pretty complicated, but as long as you break down each step, piping is a fantastic tool. Take a look at the below example: | ||
|
||
[NOTE] | ||
==== | ||
One small thing to notice is that the `%>%` pipe is not part of base R. It comes from the `magrittr` package and is re-exported by `dplyr`, so be sure to load `dplyr` using `library(dplyr)` before attempting to use pipes in your project. | ||
==== | ||
|
||
[source, R] | ||
---- | ||
library(dplyr) | ||
# prints hello world | ||
"Hello World!" %>% print() | ||
# generate a list of 20 random numbers between 1-10, then print the mean | ||
sample(1:10, 20, replace = TRUE) %>% | ||
  mean() %>% | ||
  cat("is the average of our list") | ||
---- | ||
|
||
[IMPORTANT] | ||
==== | ||
If you are piping input to a function that takes multiple arguments, any input you pipe in will be placed before other arguments you pass to the function (see above example) | ||
==== | ||
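
To make the note above concrete, here is a small sketch showing that a piped value lands in the first argument position. The `values` vector is made up purely for illustration.

[source, R]
----
library(dplyr)

values <- c(2.345, 7.891, 4.5)

# The piped value becomes the FIRST argument, so these two calls are equivalent
piped  <- values %>% round(1)
direct <- round(values, 1)
print(identical(piped, direct))

# Chaining works the same way: each result feeds the next call's first argument
total <- values %>% round(1) %>% sum()
print(total)
----

Reading a chain from top to bottom as "take this, then do that" is usually the easiest way to follow what a long pipeline is doing.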
|
||
`group_by()` is a rather on-the-nose name. It creates a "grouped table", which allows you to perform operations over each group separately. For example, if we had a dataset of different cars and their gas mileage we could first group the cars by type of car (e.g. sedan, SUV, pickup) and then get the average gas mileage for each type of car. The possibilities here are dauntingly large, so we'll just cover some basic uses and give you more space to develop your own methodologies in future projects and questions. | ||
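
The car scenario described above could be sketched roughly as follows. The `cars_dat` table, its column names, and the mileage figures are all made up for illustration.

[source, R]
----
library(dplyr)

# Hypothetical toy data: a handful of cars with made-up mileage figures
cars_dat <- data.frame(
  model = c("A", "B", "C", "D", "E", "F"),
  type  = c("sedan", "sedan", "SUV", "SUV", "pickup", "pickup"),
  mpg   = c(32, 28, 24, 22, 18, 20)
)

avg_mpg_by_type <- cars_dat %>%
  group_by(type) %>%                               # one group per type of car
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>% # one row per group
  arrange(desc(avg_mpg))                           # highest average first

print(avg_mpg_by_type)
----

Notice that `summarise()` returns one row per group, so the six cars collapse down to three rows, one for each type.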
|
||
Below is an example where we take our `beers_dat` table, group by state-country pair, and then get the average abv for each pair (and sort highest to lowest). | ||
|
||
[source, R] | ||
---- | ||
library(dplyr) | ||
avg_abv_by_statecountry <- beers_dat %>% # pass in our beer data | ||
  group_by(state_country) %>% # groups by state-country pair | ||
  summarise(avg_abv_by_statecountry = mean(abv, na.rm = TRUE)) %>% # gets average abv for each group | ||
  arrange(desc(avg_abv_by_statecountry)) # sort by highest to lowest average abv | ||
# prints the head of our data | ||
head(avg_abv_by_statecountry) | ||
---- | ||
|
||
For this question, we'd like you to first group your `beers_dat` data by the `style` category and then get the average abv for each style. Your final answer should include both a table of the average values by style called `avg_abv_by_style`, and also a print statement of the top 5 styles by abv (this can just be a head of the `avg_abv_by_style` table, if you sorted it). | ||
|
||
If you're struggling with how to go about this problem, pay special attention to the above example. It is **extremely** similar to what you're being asked to do here. | ||
|
||
[NOTE] | ||
==== | ||
In case you hadn't yet Googled it, _abv_ stands for _alcohol by volume_, and is a common measure of the strength of an alcoholic beverage. | ||
==== | ||
|
||
.Deliverables | ||
==== | ||
- A new table called `avg_abv_by_style` | ||
- The top 5 strongest styles of beer by ABV | ||
==== | ||
|
||
=== Question 5 (2 pts) | ||
|
||
Your data-analysis skills have already increased tenfold since the beginning of this project, so we'll add just two more tools to your skillset to cap things off. | ||
|
||
Firstly, one of the most important utilities that `dplyr` provides: `mutate()`. `mutate()` allows for the easy application of a function over entire columns of data at once, similar to `apply()` with slightly different (and, I'd argue, better) syntax and utility. | ||
|
||
Secondly is the `filter()` function, which allows for the filtration of a dataframe into only rows that meet specific conditions. | ||
|
||
Read through and run on your own the below code, paying specific attention to the intermediate results returned by `mutate()` and `filter()` respectively. | ||
|
||
[source, r] | ||
---- | ||
# NOTE: This code builds off of the results of the example code provided in the previous problem | ||
# Merge the average abv for each state-country pair back into the original dataframe | ||
beers_dat <- merge(beers_dat, avg_abv_by_statecountry, by = "state_country") | ||
# Calculate each beer's difference from its state-country average | ||
beers_dat <- beers_dat %>% | ||
  mutate(mean_difference = abv - avg_abv_by_statecountry) | ||
# Filter for beers with an abv at least 10 above the average for that state-country pair | ||
filtered_data <- beers_dat %>% | ||
  filter(mean_difference >= 10) | ||
---- | ||
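
Because the example above depends on tables built in earlier questions, here is a self-contained sketch of the same merge, mutate, and filter pattern that you can run on its own. The `drinks` table, its column names, and the threshold of 3 are made up for illustration.

[source, R]
----
library(dplyr)

# Hypothetical toy data standing in for the beer table
drinks <- data.frame(
  name  = c("a", "b", "c", "d"),
  group = c("x", "x", "y", "y"),
  abv   = c(4, 8, 5, 15)
)

# Step 1: average abv per group
group_avgs <- drinks %>%
  group_by(group) %>%
  summarise(avg_abv = mean(abv, na.rm = TRUE))

# Step 2: merge the averages back in, matching rows on the shared "group" column
drinks <- merge(drinks, group_avgs, by = "group")

# Step 3: mutate() adds a column computed from existing columns
drinks <- drinks %>%
  mutate(mean_difference = abv - avg_abv)

# Step 4: filter() keeps only the rows that satisfy the condition
above_avg <- drinks %>%
  filter(mean_difference >= 3)

print(above_avg)
----

Printing `drinks` after each step is a good way to watch the intermediate results: the merge adds the `avg_abv` column, `mutate()` adds `mean_difference`, and `filter()` reduces the number of rows.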
|
||
To complete this question, we want you to create a new column in your dataset called `mean_difference` that is the difference in abv between each beer and the average abv for that style of beer. You can think of this in a few steps. First, use `merge()` to merge your `avg_abv_by_style` table from the last question into your `beers_dat`, so that each beer gains a column holding its style's average abv (Hint: we did this in the last project). Then, use `mutate()` to create your new `mean_difference` column. Next, use `filter()` to keep only beers with an abv at least 50 above the average for their style. Finally, print the resulting filtered data. | ||
|
||
[NOTE] | ||
==== | ||
To check your answers, the resulting filtered data should contain 16 entries (viewable using `count()`). | ||
==== | ||
|
||
.Deliverables | ||
==== | ||
- The head of your data, filtered to have beers with an abv at least 50 above their style average | ||
==== | ||
|
||
== Submitting your Work | ||
|
||
In finishing this project, you've successfully learned and applied some of the most prominently used functions in R for data analysis. While the choice of library is largely a matter of preference, common ones like `data.table`, `tidyr`, and `dplyr` are so useful that nearly every working R professional is familiar with them on at least a basic level. | ||
|
||
In the next project we'll slow down massively and introduce the `tidyverse`, a collection of common libraries including both of the ones we introduced in this project. We will spend the majority of the project focusing on date-time analysis, an important and difficult part of handling data in R, and some common tools that can help us ingest and analyze date-time data in various formats. | ||
|
||
.Items to submit | ||
==== | ||
- firstname_lastname_project10.ipynb | ||
==== | ||
|
||
[WARNING] | ||
==== | ||
You _must_ double check your `.ipynb` after submitting it in Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. | ||
You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. | ||
==== |