{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CS 109A/STAT 121A/AC 209A/CSCI E-109A \n",
"\n",
"## Lab 2: Cleaning and EDA of Goodreads \n",
"\n",
"**Harvard University**<br>\n",
"**Fall 2017**<br>\n",
"**Instructors: Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine**\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents \n",
"<ol start=\"0\">\n",
"<li> Learning Goals </li>\n",
"<li> Loading and Cleaning with Pandas </li>\n",
"<li> Asking Questions </li>\n",
"<li> Parsing and Completing the Dataframe </li>\n",
"<li> EDA </li>\n",
"<li> Determining the Best Books </li>\n",
"<li> Trends in Popularity of Genres </li>\n",
"</ol>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Learning Goals\n",
"\n",
"HTML pages for roughly 6,000 \"best books\" were fetched and parsed from [Goodreads](https://www.goodreads.com). The \"bestness\" of these books comes from a proprietary formula used by Goodreads and published as a list on their website.\n",
"\n",
"We parsed the page for each book and saved the data from all these pages in tabular format as a CSV file. In this lab we'll clean and further parse the data, then do some exploratory data analysis to answer questions about these best books and popular genres. \n",
"\n",
"\n",
"By the end of this lab, you should be able to:\n",
"\n",
"- Scrape data using Beautiful Soup and the Python `requests` library.\n",
"- Load and systematically address missing values, encoded as `NaN` values in our data set, for example by removing observations associated with these values.\n",
"- Parse columns in the dataframe to create new dataframe columns.\n",
"- Create and interpret visualizations to explore the data set.\n",
"\n",
"*This lab corresponds to lectures 2 and 3 and maps onto homework 1 and beyond.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Basic EDA workflow\n",
"\n",
"(From the lecture, repeated here for convenience.)\n",
"\n",
"The basic workflow is as follows:\n",
"\n",
"1. **Build** a DataFrame from the data (ideally, put all data in this object)\n",
"2. **Clean** the DataFrame. It should have the following properties:\n",
" - Each row describes a single object\n",
" - Each column describes a property of that object\n",
" - Columns are numeric whenever appropriate\n",
" - Columns contain atomic properties that cannot be further decomposed\n",
"3. Explore **global properties**. Use histograms, scatter plots, and aggregation functions to summarize the data.\n",
"4. Explore **group properties**. Use groupby and small multiples to compare subsets of the data.\n",
"\n",
"This process transforms your data into a format that is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up on in subsequent analysis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: Loading and Cleaning with Pandas \n",
"Read in the `goodreads.csv` file, examine the data, and do any necessary data cleaning. \n",
"\n",
"Here is a description of the columns (in order) present in this CSV file:\n",
"\n",
"```\n",
"rating: the average rating on a 1-5 scale achieved by the book\n",
"review_count: the number of Goodreads users who reviewed this book\n",
"isbn: the ISBN code for the book\n",
"booktype: an internal Goodreads identifier for the book\n",
"author_url: the Goodreads (relative) URL for the author of the book\n",
"year: the year the book was published\n",
"genre_urls: a string with '|'-separated relative URLs of Goodreads genre pages\n",
"dir: a directory identifier internal to the scraping code\n",
"rating_count: the number of ratings for this book (this is different from the number of reviews)\n",
"name: the name of the book\n",
"```\n",
"\n",
"Report all the issues you found with the data and how you resolved them. \n",
"\n",
"[15 minutes]\n",
"\n",
"----"
]
},
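{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one way to start. It assumes the CSV has no header row (so the column names listed above must be supplied) and that dropping incomplete rows is an acceptable cleaning strategy; verify both assumptions against the actual file first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Column names taken from the description above; whether the file\n",
"# actually lacks a header row is an assumption to check first.\n",
"column_names = ['rating', 'review_count', 'isbn', 'booktype', 'author_url',\n",
"                'year', 'genre_urls', 'dir', 'rating_count', 'name']\n",
"df = pd.read_csv('goodreads.csv', header=None, names=column_names)\n",
"\n",
"df.dtypes  # see which columns pandas parsed as numeric\n",
"df[df['year'].isnull()]  # inspect rows with missing years before deciding\n",
"\n",
"# One option: drop rows missing key fields, then fix the dtypes.\n",
"df = df.dropna(subset=['year', 'rating'])\n",
"df['year'] = df['year'].astype(int)\n",
"df['rating_count'] = df['rating_count'].astype(int)"
]
},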
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Asking Questions \n",
"Think of a few questions you want to ask, then examine the data and decide whether the dataframe contains what you need to address them. \n",
"\n",
"[5 min]\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 3: Parsing and Completing the Data Frame \n",
"\n",
"We will need the author and genre to proceed! Parse an `author` column from `author_url` and a `genres` column from `genre_urls`. Keep the `genres` column as a '|'-separated string.\n",
"\n",
"Hint: Use pandas' `map` to assign new columns to the dataframe. \n",
"\n",
"[10 minutes]\n",
"\n",
"---"
]
},
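{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible sketch. The URL formats here are assumptions (the author name is taken to follow the last '.' in the URL's final path segment, and each genre URL to end in the genre name); inspect a few rows to confirm before relying on this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_author(url):\n",
"    # e.g. '/author/show/3389.Stephen_King' -> 'Stephen King' (assumed format)\n",
"    return url.split('/')[-1].split('.')[-1].replace('_', ' ')\n",
"\n",
"def get_genres(genre_urls):\n",
"    if not isinstance(genre_urls, str):  # guard against missing values\n",
"        return ''\n",
"    # e.g. '/genres/fantasy|/genres/horror' -> 'fantasy|horror' (assumed format)\n",
"    return '|'.join(u.split('/')[-1] for u in genre_urls.split('|'))\n",
"\n",
"df['author'] = df['author_url'].map(get_author)\n",
"df['genres'] = df['genre_urls'].map(get_genres)"
]
},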
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 4: EDA \n",
"Before proceeding any further, get to know the dataset using a few \"global property\" visualizations, illustrating histograms with both linear and log scales. Do you find anything interesting or strange? \n",
"\n",
"\n",
"[10 minutes]\n",
"\n",
"---"
]
},
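{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, review counts tend to be heavily right-skewed, so a log-scaled axis can reveal structure that a linear histogram hides. The column names below assume the cleaning done in Part 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"fig, axes = plt.subplots(1, 2, figsize=(10, 4))\n",
"axes[0].hist(df['review_count'], bins=50)\n",
"axes[0].set_title('review_count, linear scale')\n",
"axes[1].hist(df['review_count'], bins=50)\n",
"axes[1].set_yscale('log')  # log-scaled counts make the long tail visible\n",
"axes[1].set_title('review_count, log-scaled y-axis')\n",
"plt.show()"
]
},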
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Part 5: Determining the Best Books \n",
"\n",
"This is an example of an analysis of the \"grouped property\" type.\n",
"\n",
"Think of some reasonable definitions of what it could mean to be a \"best book.\" (After all, these are all the best books according to Goodreads.)\n",
"\n",
"[5 minutes] \n",
"\n",
"---"
]
},
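{
"cell_type": "markdown",
"metadata": {},
"source": [
"One illustrative definition, by no means the only reasonable one: the highest-rated book in each publication year, restricted to books with enough ratings for the average to be meaningful. The threshold of 1,000 ratings below is an arbitrary choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Restrict to books whose average rating rests on many individual ratings.\n",
"reliable = df[df['rating_count'] > 1000]\n",
"\n",
"# For each publication year, pick the row with the highest average rating.\n",
"best_per_year = reliable.loc[reliable.groupby('year')['rating'].idxmax(),\n",
"                             ['year', 'name', 'rating']]\n",
"best_per_year.sort_values('year')"
]
},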
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 6: Trends in Popularity of Genres \n",
"\n",
"This is an example of an analysis of the \"grouped property\" type.\n",
"\n",
"There are a lot of questions you could ask about genres.\n",
"* Which genre is currently the most popular?\n",
"* Better, based on our data: what conclusions can you draw about the time evolution of the popularity of each genre?\n",
"\n",
"[15 minutes]\n",
"\n",
"---"
]
},
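{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of one way to build year-by-genre counts, assuming the '|'-separated `genres` column from Part 3 and a numeric `year` column. The genre names plotted at the end are placeholders; substitute names that actually occur in your data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split the '|'-separated genres so each (book, genre) pair gets its own row.\n",
"exploded = df.assign(genre=df['genres'].str.split('|')).explode('genre')\n",
"\n",
"# Count books per genre per publication year.\n",
"counts = exploded.groupby(['year', 'genre']).size().unstack(fill_value=0)\n",
"\n",
"# Plot a few genres over time; these genre names are placeholders.\n",
"counts[['fantasy', 'romance']].plot()"
]
},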
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 6.1: What can you conclude from the above visualizations?\n",
"Pick two or three genres and describe how the popularity of these genres fluctuates over time. "
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}