3 Project Proposal

sopheakneak edited this page Nov 6, 2017 · 20 revisions

Table of Contents: Project Description | Technical Description | Logistics


Project Description

  • The goal of our project is to use data science to uncover patterns in the data we collect and find out whether restaurants in affluent areas get better reviews and ratings. We define “affluent” using census data on median household income, which lets us identify areas where income could play a role in restaurant prices and ratings. We will focus on the area in and around Seattle because we know it contains a mix of socio-economic zones. Based on our prior knowledge, we assume that more affluent areas tend to have more higher-end restaurants and that less wealthy areas will have more restaurants in an affordable price range. Even though we know about the prices of these restaurants, our research will focus mainly on the differences in reviews and ratings between restaurants in different communities. The data we will use comes from US census data published through the Seattle government website. It covers all census tracts, from which we will extract city borders and calculate median household incomes. Using this income data, we can draw a clear distinction between affluent and non-affluent areas. To obtain the restaurant data, we will use the Yelp API found on their website. With this API, we can extract information on Seattle restaurants in different areas, along with their ratings and reviews, so that we can try to connect a restaurant’s location, its current reviews and ratings, and its surrounding economic status.
  • We believe restaurant-goers and foodies are our key stakeholders, for several reasons. First, compared with other people, they care more about food quality and popularity. Besides asking friends for recommendations, they turn to online platforms such as Yelp, which have become a popular place to actively search for this information. However, if our assumption holds that restaurants in affluent areas get better reviews and ratings, they might not be getting the most accurate information about restaurant ratings. Our data science project can help them access better-evaluated data on restaurant ratings by extracting information on restaurants and checking whether there is a correlation between their reviews, their locations, and the socio-economic status of the surrounding area.
  • The main goal of our project is to help our audience make better decisions when they use any business-review platform. Specifically, if our assumption is proved correct, our project will help users recognize that ratings differ in a biased way between restaurants in different areas. Foodies and restaurant-goers who judge restaurant quality based on online platforms such as Yelp and Google reviews will then be able to make their decisions more carefully.
  • We will work toward this goal in three steps. Our first sub-goal is to get the dataset of median household income for different locations from Census.gov. With this information, we can get a general understanding of which areas are affluent and which are not, which helps us locate restaurants in those areas in the next sub-goal. Second, we will extract ratings, prices, and reviews from the online platform we are using, which is Yelp. With the restaurants’ data from Yelp, we will be able to filter restaurant information down to the area we researched through Census.gov, which is the Seattle region. Our last sub-goal is to connect the location information with the restaurant data in order to analyze whether there is a correlation between restaurant ratings and their locations. For example, we assumed that restaurants in affluent areas get better reviews and ratings. This last step will let us discover whether this pattern exists, support or refute our assumption, and further develop our data science project based on what we find.
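
The three steps above can be sketched on toy data. This is only an illustration in Python (the project itself will be built in R), and everything in it is an assumption: the tract IDs, incomes, and ratings are made up, and the cutoff for “affluent” (above the median income across tracts) is one possible choice, not a decision we have made.

```python
from statistics import mean, median

# Hypothetical sample data: tract id -> median household income (USD).
tract_income = {"T1": 52000, "T2": 98000, "T3": 121000, "T4": 64000}

# Hypothetical restaurants: (name, tract id, star rating).
restaurants = [
    ("A", "T1", 3.5),
    ("B", "T1", 4.0),
    ("C", "T2", 4.0),
    ("D", "T3", 4.5),
    ("E", "T4", 3.5),
    ("F", "T3", 4.0),
]


def label_tracts(income_by_tract):
    """Mark a tract affluent when its income is above the median across tracts (assumed cutoff)."""
    cutoff = median(income_by_tract.values())
    return {tract: income > cutoff for tract, income in income_by_tract.items()}


def mean_rating_by_group(rows, affluent):
    """Average rating for affluent (True) and non-affluent (False) tracts."""
    groups = {True: [], False: []}
    for _name, tract, rating in rows:
        groups[affluent[tract]].append(rating)
    return {flag: mean(values) for flag, values in groups.items() if values}


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither list is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Step 1: label tracts. Step 2: restaurant data is already filtered to these tracts.
affluent = label_tracts(tract_income)
# Step 3: compare ratings across groups and correlate income with rating.
group_means = mean_rating_by_group(restaurants, affluent)
incomes = [tract_income[tract] for _name, tract, _rating in restaurants]
ratings = [rating for _name, _tract, rating in restaurants]
r = pearson(incomes, ratings)

print(group_means, round(r, 3))
```

A correlation on its own would not prove that income causes better ratings; it would only tell us whether the pattern we assumed shows up in the data.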

Technical Description

  • The final project will be a Shiny app. Because Shiny is a web application framework, we can create a more interactive, dynamic display of our findings. Compared with a static HTML page, it will improve the user experience and give the audience more opportunity to view the data from multiple perspectives. In addition, Shiny allows us to publish our apps online, which will make displaying our final product easier. Our project has two main types of data: (1) restaurants’ ratings and locations, and (2) the socio-economics of communities. We already faced some challenges when trying to retrieve restaurants’ ratings and locations, since the main Yelp and TripAdvisor APIs closed down to the public. However, we found a workaround and are now able to continue using Yelp as a source of information. The next challenge is obtaining census data on the socio-economics of communities. We will have to define not only what “socio-economics” means but also the borders of communities. This leads into our last challenge, which is using a restaurant’s location to determine which community the restaurant is part of.
  • We will all need to learn how to gather the appropriate data within the scope of Seattle, WA. There are different ways to split the neighborhoods into socio-economic groups, and the choice will influence the results we get. To split them up, we will need to find the most reasonable and consistent way to divide the area. The hardest challenge we will need to overcome is determining which neighborhood a restaurant is in based on its latitude and longitude. We will build our own metric to score restaurants based on their current ratings and number of reviews. We want to do this because such a formula combines the different factors that contribute to a restaurant’s overall standing rather than comparing on one factor alone. This is how we are going to model our approach.
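
One common way to decide which neighborhood a latitude/longitude point falls in is a point-in-polygon test. The sketch below uses the ray casting technique in Python purely as an illustration (in R we would likely rely on a spatial package instead). The neighborhood names and rectangular borders are hypothetical placeholders, not real Seattle boundaries, and points that land exactly on a border are not handled specially.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: cast a ray toward increasing longitude and count edge crossings.

    polygon is a list of (lon, lat) vertices in order; an odd number of
    crossings means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            # Longitude where this edge crosses the point's latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_neighborhood(lon, lat, neighborhoods):
    """Return the first neighborhood whose polygon contains the point, or None."""
    for name, polygon in neighborhoods.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical rectangular borders given as (lon, lat) corners.
neighborhoods = {
    "North": [(-122.40, 47.65), (-122.30, 47.65), (-122.30, 47.70), (-122.40, 47.70)],
    "South": [(-122.40, 47.55), (-122.30, 47.55), (-122.30, 47.60), (-122.40, 47.60)],
}

print(find_neighborhood(-122.35, 47.67, neighborhoods))
print(find_neighborhood(-122.35, 47.57, neighborhoods))
print(find_neighborhood(-122.20, 47.67, neighborhoods))
```

Real census tract borders are irregular polygons with many vertices, but the same test applies to them unchanged.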

Logistics

  1. Gather Data (Complete by end of Week 6 - 11/12/2017)

Demographics data:

  • Download household income data from data.seattle.gov
  • Filter data to only neighborhoods within Seattle (not by county)
  • Clean and format the data in R using data cleaning techniques

Restaurant data:

  • Yelp
    • Use Yelp GraphQL to obtain the needed data
    • Define the scope of data (fields)
    • Filter data to only restaurants within Seattle (not by county)
    • Clean and format the data in R using data cleaning techniques
  • Google Maps API
    • Using the Google Maps API, gather a list of all restaurants that would be considered in Seattle areas
    • Benchmark the list against the Yelp directory and add restaurants to the directory as necessary

  2. Analyze Data (Complete by end of Week 7 - 11/19/2017)

  • Mutate data set within R
  • Look for correlating trends through plotting
  • Fit the dataset to our own metrics

  3. Create Shiny App Skeleton (Complete by end of Week 8 - 11/26/2017)

Header

  • Dynamic Content
    • Possible drop-down menus
    • Possible notification system
  • Sidebar
    • Create inputs within the sidebar
  • Body
    • Add dynamic plots to show trends

  4. Apply Data into Shiny App appropriately & Finish presentation (Complete by end of Week 9 - 12/03/2017)

  • Check to see if the plots match the ones from step 2
  • Finish PPT
  • Allot a proper amount of speaking time to each team member during the presentation

Risk(s):

  1. Uneven review sample sizes from Yelp GraphQL:

There is a high chance that we will encounter some restaurants with a large number of reviews and others with very few. For example, a 4.5 rating backed by over 100 reviews represents restaurant quality more reliably than a 4.7 rating backed by only 5 reviews. To help balance the uneven distribution of reviews, instead of using the average rating as the main indicator of quality, we would split all the reviews of each restaurant into rating buckets (1-5 stars) and use the bucket with the highest count as the main indicator for that restaurant. Essentially, we would treat the mode as a better statistic than the mean. In addition, to help validate the credibility of a rating, we will take into account the number of reviews behind it: a rating backed by more reviews should weigh more than one backed by fewer.
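
Both ideas can be sketched in a few lines of Python (illustrative only; our actual metric is not yet defined). The first function picks the most common star bucket. The second is a standard shrinkage, or “Bayesian average”, formula that pulls ratings with few reviews toward an overall mean. It is one possible way to make review count matter, not necessarily the formula we will adopt. The tie-breaking rule, the overall mean of 3.8, and the prior weight of 20 are all assumptions chosen for the example.

```python
def modal_star(star_counts):
    """Return the star bucket with the most reviews.

    star_counts maps star level (1-5) to number of reviews.
    Ties go to the higher star level (an arbitrary, assumed rule).
    """
    return max(star_counts, key=lambda star: (star_counts[star], star))


def shrunk_rating(avg_rating, n_reviews, overall_mean, prior_weight=20):
    """Weighted blend of a restaurant's own average and the overall mean.

    With few reviews the result stays close to overall_mean; with many
    reviews it approaches avg_rating. prior_weight is a tunable assumption.
    """
    return (n_reviews * avg_rating + prior_weight * overall_mean) / (n_reviews + prior_weight)


# Hypothetical star distribution for one restaurant.
counts = {1: 2, 2: 3, 3: 10, 4: 40, 5: 45}
print(modal_star(counts))

# The two restaurants from the example above, with an assumed overall mean of 3.8.
established = shrunk_rating(4.5, 100, overall_mean=3.8)
sparse = shrunk_rating(4.7, 5, overall_mean=3.8)
print(round(established, 2), round(sparse, 2))
```

Under these assumed settings the 4.5-rated restaurant with 100 reviews ends up scoring higher than the 4.7-rated restaurant with 5 reviews, which matches the intuition described above.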

  2. Restaurants not on Yelp may be missing from the Yelp GraphQL results:

It is possible that there are restaurants within the Seattle area that are not associated with Yelp, meaning they do not have a presence on the Yelp platform or have chosen a different rating and review platform. This may be an issue because restaurants in affluent areas may have much more presence on the internet, so the directory of restaurants in affluent areas may be more complete than the directory for non-affluent areas. To take that into consideration, we will look at other platforms that may have a bigger directory than Yelp, for example Google Maps. We can use the Google Maps API as a benchmark to see whether it lists more restaurants and, if so, use it to complete the Seattle restaurant list.
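
The benchmarking step could look roughly like the Python sketch below, which assumes we already have two plain lists of restaurant names, one from each platform (the names here are made up, and no API is called). Matching on normalized names alone is a simplifying assumption; in practice we would probably also need addresses or coordinates to tell apart different restaurants that share a name.

```python
import re


def normalize(name):
    """Lowercase, drop punctuation, and collapse spaces so small spelling differences still match."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def missing_from_yelp(google_names, yelp_names):
    """Return the Google Maps names that have no normalized match in the Yelp list."""
    yelp_keys = {normalize(name) for name in yelp_names}
    return [name for name in google_names if normalize(name) not in yelp_keys]


# Hypothetical directory listings.
google_list = ["Joe's Diner", "Pho Place", "Taco  Stand"]
yelp_list = ["Joes Diner", "taco stand"]

missing = missing_from_yelp(google_list, yelp_list)
print(missing)
```

Counting how many restaurants are missing per neighborhood would also show whether Yelp’s coverage is weaker in non-affluent areas.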