{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CS 109A/STAT 121A/AC 209A/CSCI E-109A \n",
"\n",
"## Lab 2: Cleaning and EDA of Goodreads \n",
"\n",
"**Harvard University**<br>\n",
"**Fall 2017**<br>\n",
"**Instructors: Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine**\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents \n",
"<ol start=\"0\">\n",
"<li> Learning Goals </li>\n",
"<li> Loading and Cleaning with Pandas </li>\n",
"<li> Asking Questions </li>\n",
"<li> Parsing and Completing the Dataframe </li>\n",
"<li> EDA </li>\n",
"<li> Determining the Best Books </li>\n",
"<li> Trends in Popularity of Genres </li>\n",
"</ol>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Learning Goals\n",
"\n",
"HTML pages for roughly 6,000 \"best books\" were fetched and parsed from [Goodreads](https://www.goodreads.com). The \"bestness\" of these books comes from a proprietary formula used by Goodreads and published as a list on their website.\n",
"\n",
"We parsed the page for each book and saved the data from all these pages in tabular format as a CSV file. In this lab we'll clean and further parse the data, then do some exploratory data analysis to answer questions about these best books and popular genres. \n",
"\n",
"\n",
"By the end of this lab, you should be able to:\n",
"\n",
"- Scrape data using Beautiful Soup and the Python `requests` library.\n",
"- Load and systematically address missing values, encoded as `NaN` values in our data set, for example by removing observations associated with these values.\n",
"- Parse columns in the dataframe to create new dataframe columns.\n",
"- Create and interpret visualizations to explore the data set.\n",
"\n",
"*This lab corresponds to lectures 2 and 3 and maps onto homework 1 and beyond.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Basic EDA workflow\n",
"\n",
"(From the lecture, repeated here for convenience.)\n",
"\n",
"The basic workflow is as follows:\n",
"\n",
"1. **Build** a DataFrame from the data (ideally, put all data in this object)\n",
"2. **Clean** the DataFrame. It should have the following properties:\n",
" - Each row describes a single object\n",
" - Each column describes a property of that object\n",
" - Columns are numeric whenever appropriate\n",
" - Columns contain atomic properties that cannot be further decomposed\n",
"3. Explore **global properties**. Use histograms, scatter plots, and aggregation functions to summarize the data.\n",
"4. Explore **group properties**. Use groupby and small multiples to compare subsets of the data.\n",
"\n",
"This process transforms your data into a format that is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up on in subsequent analysis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: Loading and Cleaning with Pandas \n",
"Read in the `goodreads.csv` file, examine the data, and do any necessary data cleaning. \n",
"\n",
"Here is a description of the columns (in order) present in this CSV file:\n",
"\n",
"```\n",
"rating: the average rating on a 1-5 scale achieved by the book\n",
"review_count: the number of Goodreads users who reviewed this book\n",
"isbn: the ISBN code for the book\n",
"booktype: an internal Goodreads identifier for the book\n",
"author_url: the Goodreads (relative) URL for the author of the book\n",
"year: the year the book was published\n",
"genre_urls: a string with '|'-separated relative URLs of Goodreads genre pages\n",
"dir: a directory identifier internal to the scraping code\n",
"rating_count: the number of ratings for this book (this is different from the number of reviews)\n",
"name: the name of the book\n",
"```\n",
"\n",
"Report all the issues you found with the data and how you resolved them. \n",
"\n",
"[15 minutes]\n",
"\n",
"----"
]
},
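{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one way to start. It assumes the CSV has no header row (so the column names listed above must be supplied) and that dropping incomplete rows is an acceptable cleaning strategy; verify both assumptions against the actual file first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Column names taken from the description above; whether the file\n",
"# actually lacks a header row is an assumption to check first.\n",
"column_names = ['rating', 'review_count', 'isbn', 'booktype', 'author_url',\n",
"                'year', 'genre_urls', 'dir', 'rating_count', 'name']\n",
"df = pd.read_csv('goodreads.csv', header=None, names=column_names)\n",
"\n",
"df.dtypes  # see which columns pandas parsed as numeric\n",
"df[df['year'].isnull()]  # inspect rows with missing years before deciding\n",
"\n",
"# One option: drop rows missing key fields, then fix the dtypes.\n",
"df = df.dropna(subset=['year', 'rating'])\n",
"df['year'] = df['year'].astype(int)\n",
"df['rating_count'] = df['rating_count'].astype(int)"
]
},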
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Asking Questions \n",
"Think of a few questions you want to ask, then examine the data and decide whether the dataframe contains what you need to address them. \n",
"\n",
"[5 min]\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 3: Parsing and Completing the Data Frame \n",
"\n",
"We will need the author and genre to proceed! Parse an `author` column from `author_url` and a `genres` column from `genre_urls`. Keep the `genres` column as a '|'-separated string.\n",
"\n",
"Hint: Use pandas' `map` to assign new columns to the dataframe. \n",
"\n",
"[10 minutes]\n",
"\n",
"---"
]
},
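{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible sketch. The URL formats here are assumptions (the author name is taken to follow the last '.' in the URL's final path segment, and each genre URL to end in the genre name); inspect a few rows to confirm before relying on this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_author(url):\n",
"    # e.g. '/author/show/3389.Stephen_King' -> 'Stephen King' (assumed format)\n",
"    return url.split('/')[-1].split('.')[-1].replace('_', ' ')\n",
"\n",
"def get_genres(genre_urls):\n",
"    if not isinstance(genre_urls, str):  # guard against missing values\n",
"        return ''\n",
"    # e.g. '/genres/fantasy|/genres/horror' -> 'fantasy|horror' (assumed format)\n",
"    return '|'.join(u.split('/')[-1] for u in genre_urls.split('|'))\n",
"\n",
"df['author'] = df['author_url'].map(get_author)\n",
"df['genres'] = df['genre_urls'].map(get_genres)"
]
},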
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 4: EDA \n",
"Before proceeding any further, get to know the dataset using a few \"global property\" visualizations, illustrating histograms with both linear and log scales. Do you find anything interesting or strange? \n",
"\n",
"\n",
"[10 minutes]\n",
"\n",
"---"
]
},
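{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, review counts tend to be heavily right-skewed, so a log-scaled axis can reveal structure that a linear histogram hides. The column names below assume the cleaning done in Part 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"fig, axes = plt.subplots(1, 2, figsize=(10, 4))\n",
"axes[0].hist(df['review_count'], bins=50)\n",
"axes[0].set_title('review_count, linear scale')\n",
"axes[1].hist(df['review_count'], bins=50)\n",
"axes[1].set_yscale('log')  # log-scaled counts make the long tail visible\n",
"axes[1].set_title('review_count, log-scaled y-axis')\n",
"plt.show()"
]
},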
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Part 5: Determining the Best Books \n",
"\n",
"This is an example of an analysis of the \"grouped property\" type.\n",
"\n",
"Think of some reasonable definitions of what it could mean to be a \"best book.\" (After all, these are all the best books according to Goodreads.)\n",
"\n",
"[5 minutes] \n",
"\n",
"---"
]
},
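{
"cell_type": "markdown",
"metadata": {},
"source": [
"One illustrative definition, by no means the only reasonable one: the highest-rated book in each publication year, restricted to books with enough ratings for the average to be meaningful. The threshold of 1,000 ratings below is an arbitrary choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Restrict to books whose average rating rests on many individual ratings.\n",
"reliable = df[df['rating_count'] > 1000]\n",
"\n",
"# For each publication year, pick the row with the highest average rating.\n",
"best_per_year = reliable.loc[reliable.groupby('year')['rating'].idxmax(),\n",
"                             ['year', 'name', 'rating']]\n",
"best_per_year.sort_values('year')"
]
},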
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 6: Trends in Popularity of Genres \n",
"\n",
"This is an example of an analysis of the \"grouped property\" type.\n",
"\n",
"There are a lot of questions you could ask about genres.\n",
"* Which genre is currently the most popular?\n",
"* Better, based on our data: what conclusions can you draw about the time evolution of the popularity of each genre?\n",
"\n",
"[15 minutes]\n",
"\n",
"---"
]
},
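{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of one way to build year-by-genre counts, assuming the '|'-separated `genres` column from Part 3 and a numeric `year` column. The genre names plotted at the end are placeholders; substitute names that actually occur in your data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split the '|'-separated genres so each (book, genre) pair gets its own row.\n",
"exploded = df.assign(genre=df['genres'].str.split('|')).explode('genre')\n",
"\n",
"# Count books per genre per publication year.\n",
"counts = exploded.groupby(['year', 'genre']).size().unstack(fill_value=0)\n",
"\n",
"# Plot a few genres over time; these genre names are placeholders.\n",
"counts[['fantasy', 'romance']].plot()"
]
},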
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 6.1: What can you conclude from the above visualizations?\n",
"Pick two or three genres and describe how the popularity of these genres fluctuates over time. "
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}