-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathfinalArtifact.Rmd
447 lines (310 loc) · 30.3 KB
/
finalArtifact.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
---
title: "Seattle Yelp Restaurants Rating Analysis"
date: "December 8, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, echo=FALSE, include=FALSE}
if(!require(mlbench)){install.packages("mlbench"); require(mlbench)} # common datasets to use
if(!require(tidyverse)){install.packages("tidyverse"); library(tidyverse)}
if(!require(modelr)){install.packages("modelr"); library(modelr)}
# load Google Maps
if(!require(ggmap)){install.packages("ggmap"); library(ggmap)}
# some dependencies for caret that aren't automatically installed
if(!require(ModelMetrics)){install.packages("ModelMetrics"); require(ModelMetrics)}
if(!require(recipes)){install.packages("recipes"); require(recipes)}
if(!require(DEoptimR)){install.packages("DEoptimR"); require(DEoptimR)}
if(!require(caret)){install.packages("caret"); require(caret)} # ML package WITHOUT its dependencies. Should not take as long
if(!require(dplyr)){install.packages("dplyr"); require(dplyr)}
if(!require(jsonlite)){install.packages("jsonlite"); require(jsonlite)}
set.seed(370)
```
# Are there biases in Yelp ratings for restaurants in different socioeconomic statuses?
```{r, echo=FALSE, include=FALSE}
# Getting and filtering data
json_file <- "SeattleYelpRestaurantsWithNewHealthInfo.json"
yelp.data <- fromJSON(json_file) %>% filter(!is.na(price)) %>% filter(!is.na(reviewCount)) %>% filter(!is.na(censusMedianHHIncome)) %>% filter(!is.na(censusIncomePerCapita)) %>% filter(!is.na(censusGiniIndexOfInequality)) %>% filter(!is.na(rating)) %>% filter(!is.na(category)) %>% filter(!is.na(censusTract)) %>% filter(!is.na(recentHealthInspectionScore)) %>% filter(!is.na(restaurantMaxSeats)) %>% filter(!is.na(totalInspectionScore))
# convert price into numeric value
price.num <- c(nchar(yelp.data[,"price"]))
yelp.data$price = price.num
# cast Health Inspection categorical data into factor for dummy variables
yelp.data$healthInspectionResult <- as.factor(yelp.data$recentHealthInspectionResult)
yelp.data$healthViolationType <- as.factor(yelp.data$recentHealthViolationType)
yelp.data$recentHealthInspectionGrade <- as.factor(yelp.data$recentHealthInspectionGrade)
# create dummy variables for healthInspectionResult, inspection grade and healthViolationType
dmy <- dummyVars(~ recentHealthInspectionResult + recentHealthViolationType + recentHealthInspectionGrade, data=yelp.data)
# add dummy variables data frame to the rest
yelp.data <- data.frame(yelp.data,predict(dmy, newdata = yelp.data))
# define the low/high quantile threshold by MedianHHincome
low <- quantile(yelp.data$censusMedianHHIncome, 0.33) # cutoff number
high <- quantile(yelp.data$censusMedianHHIncome, 0.66) # cutoff number
# filter data into 2 distinct set: high SES and low SES
above.avg <- yelp.data %>% filter(censusMedianHHIncome > high)
below.avg <- yelp.data %>% filter(censusMedianHHIncome < low)
mid.avg <- yelp.data %>% filter(censusMedianHHIncome < high & censusMedianHHIncome > low)
# filter data to only have relating variables: reviewCount, price and rating
# this dataset is used for modeling
# only using healthInspectionResult dummy variables: complete, incomplete, Satisfactory and unsatisfactory
above.avg.wo.census <- subset(above.avg, select=c("rating", "reviewCount", "recentHealthInspectionScore","totalInspectionScore", "totalInspectionCount", "avgInspectionScore", "restaurantTotalMonths", "restaurantMaxSeats", "recentHealthInspectionResultSatisfactory", "recentHealthInspectionResultUnsatisfactory", "recentHealthViolationType.1", "recentHealthViolationTypeblue", "recentHealthViolationTypered", "recentHealthInspectionGrade.1", "recentHealthInspectionGrade.2", "recentHealthInspectionGrade.3", "recentHealthInspectionGrade.4"))
below.avg.wo.census <- subset(below.avg, select=c("rating", "reviewCount", "recentHealthInspectionScore","totalInspectionScore", "totalInspectionCount", "avgInspectionScore", "restaurantTotalMonths", "restaurantMaxSeats", "recentHealthInspectionResultSatisfactory", "recentHealthInspectionResultUnsatisfactory", "recentHealthViolationType.1", "recentHealthViolationTypeblue", "recentHealthViolationTypered", "recentHealthInspectionGrade.1", "recentHealthInspectionGrade.2", "recentHealthInspectionGrade.3", "recentHealthInspectionGrade.4"))
check.unique.dummy.variable <- function(dataset) {
result.complete <- length(unique(dataset$recentHealthInspectionResultComplete)) > 1
result.incomplete <- length(unique(dataset$recentHealthInspectionResultIncomplete)) > 1
result.not.ready <- length(unique(dataset$recentHealthInspectionResultNot.Ready.For.Inspection)) > 1
result.satisfactory <- length(unique(dataset$recentHealthInspectionResultSatisfactory)) > 1
result.unsatisfactory <- length(unique(dataset$recentHealthInspectionResultUnsatisfactory)) > 1
result.violationtype <- length(unique(dataset$recentHealthViolationType)) > 1
result.violationBlue <- length(unique(dataset$recentHealthViolationTypeblue)) > 1
result.violationRed <- length(unique(dataset$recentHealthViolationTypered)) > 1
result.inspectionGrade1 <- length(unique(dataset$recentHealthInspectionGrade.1)) > 1
result.inspectionGrade2 <- length(unique(dataset$recentHealthInspectionGrade.2)) > 1
result.inspectionGrade3 <- length(unique(dataset$recentHealthInspectionGrade.3)) > 1
result.inspectionGrade4 <- length(unique(dataset$recentHealthInspectionGrade.4)) > 1
return(c(result.complete,result.incomplete,result.not.ready,result.satisfactory,result.unsatisfactory,result.violationtype,result.violationBlue,result.violationRed,result.inspectionGrade1,result.inspectionGrade2,result.inspectionGrade3,result.inspectionGrade4))
}
# yelp restaurants with all variables/feature
filtered.yelp.w.census <- subset(yelp.data, select=c("reviewCount", "censusTract", "censusMedianHHIncome", "censusIncomePerCapita", "censusGiniIndexOfInequality", "category", "price"))
```
## Decision context maker
With our decision, what we are trying to inform the Office of Economic Development is to change how they decide to allocate their public fundings. With the information we have, we hope to be able to change allocations of their fundings to improve quality of life in underfunded communities. We chose to base our decision context around OED city officials in order to raise awareness on the issues of economic disparity among less funded community along different census tracts. With this, it will provide an opportunity for business owners to be more successful without the social impact of Yelp bias.
## Yelp Restaurants in Seattle
```{r, echo=FALSE, warning=FALSE}
mapgilbert <- get_map(location = c(lon = mean(yelp.data$longitude), lat = mean(yelp.data$latitude)), zoom = 11, maptype = "roadmap", scale = "auto")
# plotting the map with some points on it
ggmap(mapgilbert) +
geom_point(data = yelp.data, aes(x = yelp.data$longitude, y = yelp.data$latitude, fill = "red", alpha = 0.8), size = 2, shape = 21) +
guides(fill=FALSE, alpha=FALSE, size=FALSE) +
labs(title="Yelp Restaurants in Seattle for the Analysis")
```
This map above the locations for all of the restaurants in the Seattle Area that we used for our project. In the future, we will be plotting each restaurant with a certain color that is determined based on their current rating. 1852 restaurants from 157 different category were evaluated and used for this project.
```{r, include=FALSE}
# total number of restaurants we're observing; we filter out restaurants that don't have PriceLevel, Category, and Rating
restaurantCount <- nrow(yelp.data)
restaurantCount
# all category types
allCategory <- as.data.frame(table(yelp.data$category))
allCategory
#total category
categoryCount <- nrow(allCategory)
categoryCount
```
```{r, echo=FALSE}
## Distribution of Yelp-associated Restaurants by Rating Seattle
ratingFreq <- as.data.frame(table(yelp.data$rating))
ratingFreq$percentage = ratingFreq$Freq / sum(ratingFreq$Freq)
restaurantsByRating <- ggplot(ratingFreq, aes(y=percentage, x=Var1,group=1)) +
geom_line() + labs(title="Distribution of Seattle Restaurants by Rating on Yelp", y="Total Proportion of Restaurants", x="Average Rating on Yelp", caption="(based on data from Yelp Fusion API)")
restaurantsByRating
```
This graph is plotting the distribution of restaurants grouped by rating from 1 to 5 on a 0.5 scale. The x axis indicates the restaurant rating with 1 being the lowest and 5 being the highest. The y axis holds percentage in number of restaurants for each rating. The percentage for each rating level was calculated by dividing the frequency of restaurants with that rating by the sum of all restaurants in Seattle.
The line graph suggested that ratings for Seattle restaurants are on a normal distribution. It indicates that out of all the restaurants in the Seattle area, a rating of 4 is the most common and more than 70% of the restaurants have a rating higher than 3.
```{r, echo=FALSE}
## Distribution of Yelp-associated Restaurants Rating vs Household Income Violin Plot
ratingViolinPlot <- ggplot(yelp.data, aes(as.factor(rating), censusMedianHHIncome)) +
geom_violin() +
geom_boxplot(width=0.1) +
stat_summary(fun.y=mean, geom="point", shape=23, size=2, color="blue") +
stat_summary(fun.y=median, geom="point", size=2, color="red") +
labs(title="Distribution of Seattle Restaurants Rating \nin various Median Household Income on Yelp", y="Median Household Income ($)", x="Average Rating on Yelp", caption="(based on data from Yelp Fusion API and Seattle, Census Reporter, and the U.S. Census Bureau)")
ratingViolinPlot
```
The violin plot above demonstrates the distribution of median household income at specific ratings. The distribution of median household income is moderately skewed for restaurants with 1.5, 3 and 4 rating. The mean point of median household income for 1.5, 3, and 4 ratings are all below the median; meaning the distribution is skewed to the left. In addition, the median point of the restaurant median household income increases as the rating increases.
```{r, echo=FALSE, include=FALSE}
## Distribution of Yelp-associated Restaurants in Seattle
sorted.highSES <- transform(above.avg, censusTract = reorder(censusTract, censusMedianHHIncome))
highSES <- ggplot(sorted.highSES, aes(censusTract, fill=censusMedianHHIncome)) +geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title="Seattle Restaurants in High Socio-Economic Neighborhoods on Yelp", y="Number of Restaurants", x="Seattle Census Tract Code", fill="Median Household Income ($)", caption="(based on data from Yelp Fusion API and Seattle, FCC Data, Census Reporter, and the U.S. Census Bureau)")
sorted.midSES <- transform(mid.avg, censusTract = reorder(censusTract, censusMedianHHIncome))
midSES <- ggplot(sorted.midSES, aes(censusTract, fill=censusMedianHHIncome)) +geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title="Seattle Restaurants in Median Socio-Economic Neighborhoods on Yelp", y="Number of Restaurants", x="Seattle Census Tract Code", fill="Median Household Income ($)", caption="(based on data from Yelp Fusion API and Seattle, FCC Data, Census Reporter, and the U.S. Census Bureau)")
sorted.lowSES <- transform(below.avg, censusTract = reorder(censusTract, censusMedianHHIncome))
lowSES <- ggplot(sorted.lowSES, aes(censusTract, fill=censusMedianHHIncome)) +geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title="Seattle Restaurants in Low Socio-Economic Neighborhoods on Yelp", y="Number of Restaurants", x="Seattle Census Tract Code", fill="Median Household Income ($)", caption="(based on data from Yelp Fusion API and Seattle, FCC Data, Census Reporter, and the U.S. Census Bureau)")
```
```{r, echo=FALSE}
highSES
```
The data we had on the restaurants was broken down into three categories of high, mid and low SES based on median household income. It should be noted that it was not separated based on Seattle household income. The cut off for high socio-economic restaurants are the top 33% based on median household income, low socio-economic restaurants are the bottom 33%, and mid socio-economic restaurants are the cut off in between the high and low. This graph is plotting for the High SES which was determined to be a median household income of a minimum of $74,559 or more. This unique census tract holds 54 counts.
```{r,echo=FALSE}
midSES
```
The restaurants in this data are within census tracts that have a median household income between $59,275 to $74,559. We have found there are 28 unique census tract. The x-axis are the unique census tracts in the dataset and the y-axis counts the number of restaurants in each census tracts. Out of the three socioeconomics, this bracket contains the least amount of census tract, which may imply there is less area that has restaurants spread out or there are not a lot of census tracts that falls within the middle socioeconomic status.
```{r echo=FALSE}
lowSES
```
The restaurants in this data are within census tracts that have a low household income lower than $59,275. We have found there are 39 unique census tract. The x-axis are the unique census tracts in the dataset and the y-axis counts the number of restaurants in each census tracts. We see that there is a disproportionate spread in number of restaurants in each census tracts within the lower household income bracket. This can indicate a bigger, densely populated census tract or a more commercialized area to open restaurant business.
```{r, echo=FALSE}
## Distribution of Yelp-associated Restaurants By Price level in Seattle
rating.data <- yelp.data %>% mutate(MedianHHLevel = ifelse(censusMedianHHIncome < low, "Low SES",
ifelse(censusMedianHHIncome > high, "High SES", "Mid SES")))
priceFreq <- as.data.frame(table(rating.data$price,rating.data$MedianHHLevel))
priceFreq$percentage = priceFreq$Freq / sum(priceFreq$Freq)
restaurantsByPrice <- ggplot(priceFreq, aes(fill=Var2, y=Freq, x=Var1)) +
geom_bar( stat="identity", position="dodge") + labs(title="Distribution of Seattle Restaurants by Price Level on Yelp", y="Number of Restaurants", x="Price Level on Yelp ($ - $$$$)", fill="Socio-Economic Level \n(based on Median Household Income)", caption="(based on data from Yelp Fusion API and Seattle, Census Reporter, and the U.S. Census Bureau)")
restaurantsByPrice
```
The graph above is plotting the distribution of restaurants based on their reviews while being grouped by High/low/mid SES. The x axis indicates the price level or $ sign with low being the same as 1 dollar sign and high being equivalent to 4 dollar signs. The y axis holds the number of the restaurants in the price level.
This graph demonstrates that Middle SES areas and Low SES areas have many cheap restaurants (1 \$s), with Low SES having the most and High SES areas having the least. When it comes to restaurants that are medium-priced (2 \$s), Middle SES areas have the most, and Low SES areas have the second most. This price category also has the most number of restaurants. The moderately expensive restaurants (3 \$s) tend to be prevalent in the Middle SES areas and less prevalent in High SES areas. Low SES areas have the least. For the most expensive restaurants (4 \$s), Middle SES areas have the most, High SES areas have the second most and the Low SES areas have the least. This category also has the fewest number of restaurants. It seems based on the data that majority of the restaurants are in the \$ and \$\$ categories of price.
## Features
With our analysis, the dataset provides a multitude of features and factors relating to restaurants quality. Every Yelp restaurant in our dataset has the features below that we can analyze:
**Rating (*integer*)**: the restaurant's average rating for given by Yelp community users
**reviewCount (*integer*)**: the restaurant's total number of reviews given by Yelp community users
**recentHealthViolationType (*string*)**: the type of violation (red or blue) the restaurant had during the most recent restaurant inspection. Red refers to high risk factors that considered improper practices or procedures identified as the most prevalent contributing factors of foodborne illness or injury. Blue refers to low risk factors that are considered preventive measures to control the addition of pathogens, chemicals, and physical objects into foods.
**recentHealthInspectionScore (*integer*)**: the restaurants total points of violation given by King County food inspector during the most recent restaurant inspection. Inspection point is a cumulative violation point of red and blue violation type. Each violation within red high risk factors and blue low risk factors has an associated points. Health Inspection Score is the sum of violation points during a single restaurant inspection visit.
**totalInspectionScore (*integer*)**: the total sum of all historical health inspection score for the restaurant given by King County food inspector.
**totalInspectionCount (*integer*)**: the total count of inspection visit at the restaurant
**avgInspectionSore (*integer*)**: the overall average inspection score of the restaurant. Formula: totalInspectionScore / totalInspectionCount
**restaurantTotalMonths (*integer*)**: the estimated total months the restaurant has been opened. The number is calculated by the date difference between the most recent and earliest inspection date.
**restaurantMaxSeats (*integer*)**: the maximum number of seatings described by the King County food inspector.
**recentHealthInspectionResult (*string*)**: the inspection condition (Complete, Incomplete, Not Ready for Inspection, Unsatisfactory, or Satisfactory) given at the end of the most recent food inspection visit by a King County food inspector
**recentHealthInspectionGrade (*integer*)**: the restaurant food safety rating given by a King Country food inspector. (1 - Excellent, 2 - Good, 3 - Okay, 4 - Needs to Improve)
To further investigate Yelp restaurants rating in various socio-economic neighborhood, we trained a regression model using the features above. In our regression model, we looked to see what predictor variables (features) are important to the outcome variable (rating). Prior to training a regression model, we evaluated the above features to see which one has a strong correlation with restaurants rating.
recentHealthViolationType, recentHealthInspectionResult, and recentHealthInspectionGrade are categorical variables, which we converted into dummy variables for modeling.
## Features Ranked by Importance for Yelp Restaurants Rating in high Socio-economic area
```{r, echo=FALSE, warning=FALSE}
control <- trainControl(method="repeatedcv", number = 10, repeats = 3)
highSES_model <- train(rating ~., data=above.avg.wo.census, method = "knn", preProcess = "scale", trControl = control)
highSES_importance <- varImp(highSES_model)
ggplot(highSES_importance)
```
## Features Ranked by Importance for Yelp Restaurants Rating in low Socio-economic area
```{r, echo=FALSE, warning=FALSE}
control <- trainControl(method="repeatedcv", number = 10, repeats = 3)
lowSES_model <- train(rating ~., data=below.avg.wo.census, method = "knn", preProcess = "scale", trControl = control)
lowSES_importance <- varImp(lowSES_model)
ggplot(lowSES_importance)
```
```{r, echo=FALSE, include=FALSE, warning=FALSE}
## Create proportion for high and low SES resturants
split_proportion = 0.8
# select outcome variable for below avg HH income
below.avg.outcome <- below.avg.wo.census %>% dplyr::select(rating) # rating column only
# randomly select indices for train/validate set
below.avg.train_ind <- createDataPartition(below.avg.outcome$rating, p = split_proportion, list = FALSE)
below.avg.train <- below.avg.wo.census[below.avg.train_ind,] # get training below avg data
below.avg.test <- below.avg.wo.census[-below.avg.train_ind,] # get test below avg data
# select outcome variable for above avg HH income
above.avg.outcome <- above.avg.wo.census %>% dplyr::select(rating) # rating column only
# randomly select indices for train/validate set
above.avg.train_ind <- createDataPartition(above.avg.outcome$rating, p = split_proportion, list = FALSE)
above.avg.train <- above.avg.wo.census[above.avg.train_ind,] # get training above avg data
above.avg.test <- above.avg.wo.census[-above.avg.train_ind,] # get test above avg data
```
```{r, echo=FALSE, include=FALSE, warning=FALSE}
## Creating training control
ctrl <- trainControl(method = "repeatedcv", number=10, repeats=3) # 10 fold cross-validation, repeated 3 times. better way to do it but takes longer.
```
## SVM Regression Model for Yelp Restaurants
As a means to analyze Yelp restaurants rating, we trained two regression model: restaurants in high socio-economic and restaurants in low socio-economic neighborhood. For prediction in our model, however, we used the contrasting socio-economic data; meaning, inputting restaurants in low socio-economic restaurants for a high socio-economic model and vice versa. Next, we built a residual graph for each model. By looking at the residuals, we can evaluate how a set of restaurants would perform in various socio-economic regression model. If a majority of the residuals lie above the 0 line in a residual graph, the model suggested an underestimation for the predicted set of restaurants. If a majority of the residuals lies below the 0 line, the model suggested an overestimation for the predicted set of restaurants.
### High Socio-Economic Restaurants
```{r, echo=FALSE, warning=FALSE, message=FALSE}
# high SES SVM modeling
model_svm_high <- train(rating ~ .,
data = above.avg.train,
method = "svmRadial",
trControl=ctrl, # Radial kernel
tuneLength = 10)
predict_yelp_svm_high <- predict(model_svm_high, below.avg.test)
#actual minus predicted
delta.high.SES <- below.avg.test$rating - predict_yelp_svm_high
count.high.SES.below.line <- length(which(delta.high.SES < 0)) / length(delta.high.SES)
count.high.SES.above.line <- length(which(delta.high.SES > 0)) / length(delta.high.SES)
highSES_with_residual <- below.avg.test %>%
mutate(
residuals = rating - predict_yelp_svm_high
)
ggplot(highSES_with_residual, aes(1:nrow(highSES_with_residual))) +
geom_point(aes(y=residuals)) +
geom_abline(intercept=0, slope = 0) + ylim(-3,3) +
labs(title="High Socio-Economic Predicted Model Residuals \nwhen predicting with Low Socio-Economic Restaurants",y="Residuals", x="Row Number")
cat("Overestimated residuals proportion: ",count.high.SES.below.line, "\n")
cat("Underestimated residuals proportion: ",count.high.SES.above.line, "\n")
```
The residual graph for high socio-economic model suggested a neutral split among residuals. Approximately 43% of the predicting restaurants had a negative residuals and 57% had a positive residuals. Based on the residual analysis, the regression model for restaurants rating in high socio-economic neighborhood suggested unbiased.
### Low Socio-Economic Restaurants
```{r, echo=FALSE}
### Low Socio-Economic Restaurants
# low SES SVM modeling
model_svm_low <- train(rating ~ .,
data = below.avg.train,
method = "svmRadial",
trControl=ctrl, # Radial kernel
tuneLength = 10)
predict_yelp_svm_low <- predict(model_svm_low, above.avg.test)
#actual minus predicted
delta.low.SES <- above.avg.test$rating - predict_yelp_svm_low
count.low.SES.below.line <- length(which(delta.low.SES < 0)) / length(delta.low.SES)
count.low.SES.above.line <- length(which(delta.low.SES > 0)) / length(delta.low.SES)
lowSES_with_residual <- above.avg.test %>%
mutate(
residuals = rating - predict_yelp_svm_low
)
ggplot(lowSES_with_residual, aes(1:nrow(lowSES_with_residual))) +
geom_point(aes(y=residuals)) +
geom_abline(intercept=0, slope = 0) + ylim(-3,3) +
labs(title="Low Socio-Economic Predicted Model Residuals \nwhen predicting with High Socio-Economic Restaurants",y="Residuals", x="Row Number")
cat("Overestimated residuals proportion: ",count.low.SES.below.line, "\n")
cat("Underestimated residuals proportion: ",count.low.SES.above.line, "\n")
```
The residual graph for low socio-economic model suggested an unequal split among residuals. Approximately 37% of the predicting restaurants has a negative residuals and 63% has a positive residuals. Based on the residual analysis, the regression model for restaurants rating in low socio-economic neighborhood suggested biased. The residuals suggested 63% of the rating of restaurants in high socio-economic were underestimated in a low socio-economic restaurants regression model.
```{r, echo=FALSE, include=FALSE}
## Organize Data for Subway
# filter only Subways for high and low SES
above.avg.subway <- filter(above.avg, name %in% "Subway")
below.avg.subway <- filter(below.avg, name %in% "Subway")
# remove violationTypeRed, InspectionGrade 2 ,3 ,4 for above.avg since those columns aren't unique
above.avg.subway <- subset(above.avg.subway, select=c("rating", "reviewCount", "recentHealthInspectionScore","totalInspectionScore", "totalInspectionCount", "avgInspectionScore", "restaurantTotalMonths", "restaurantMaxSeats", "recentHealthInspectionResultSatisfactory", "recentHealthInspectionResultUnsatisfactory", "recentHealthViolationType.1", "recentHealthViolationTypeblue", "recentHealthInspectionGrade.1"))
# remove InspectionGrade 3 ,4 for above.avg since those columns aren't unique
below.avg.subway <- subset(below.avg.subway, select=c("rating", "reviewCount", "recentHealthInspectionScore","totalInspectionScore", "totalInspectionCount", "avgInspectionScore", "restaurantTotalMonths", "restaurantMaxSeats", "recentHealthInspectionResultSatisfactory", "recentHealthInspectionResultUnsatisfactory", "recentHealthViolationType.1", "recentHealthViolationTypeblue", "recentHealthInspectionGrade.1"))
subway.above.ctrl <- trainControl(method = "repeatedcv", number=nrow(above.avg.subway) - 1, repeats=3) # 10 fold cross-validation, repeated 3 times. better way to do it but takes longer.
subway.below.ctrl <- trainControl(method = "repeatedcv", number=nrow(below.avg.subway) - 1, repeats=3)
```
## Linear Regression Model for Subway Restaurants
To further investigate the results, we ran the same regression models for Subway restaurants in high socio-economic and low socio-economic neighborhood and plot the residuals. Below is the results:
```{r, echo=FALSE, warning=FALSE}
# high SES SVM modeling
subway_lm_high <- train(rating ~ .,
data = above.avg.subway,
method = "lm",
trControl=subway.above.ctrl)
predict_subway_lm_high <- predict(subway_lm_high, below.avg.subway)
#actual minus predicted
delta.subway.high.SES <- below.avg.subway$rating - predict_subway_lm_high
count.subway.high.SES.below.line <- length(which(delta.subway.high.SES < 0)) / length(delta.subway.high.SES)
count.subway.high.SES.above.line <- length(which(delta.subway.high.SES > 0)) / length(delta.subway.high.SES)
subway_high_with_residual <- below.avg.subway %>%
mutate(
residuals = rating - predict_subway_lm_high
)
ggplot(subway_high_with_residual, aes(1:nrow(subway_high_with_residual))) +
geom_point(aes(y=residuals)) +
geom_abline(intercept=0, slope = 0) + ylim(-2,2) +
labs(title="Subway High Socio-Economic Predicted Model Residuals \nwhen predicting with Low Socio-Economic Subways",y="Residuals", x="Row Number")
cat("Overestimated residuals proportion: ",count.subway.high.SES.below.line, "\n")
cat("Underestimated residuals proportion: ",count.subway.high.SES.above.line, "\n")
```
```{r, echo=FALSE, warning=FALSE}
## Linear Model for low SES Subway
# high SES SVM modeling
subway_lm_low <- train(rating ~ .,
data = below.avg.subway,
method = "lm",
trControl=subway.below.ctrl)
predict_subway_lm_low <- predict(subway_lm_low, above.avg.subway)
#actual minus predicted
delta.subway.low.SES <- above.avg.subway$rating - predict_subway_lm_low
count.subway.low.SES.below.line <- length(which(delta.subway.low.SES < 0)) / length(delta.subway.low.SES)
count.subway.low.SES.above.line <- length(which(delta.subway.low.SES > 0)) / length(delta.subway.low.SES)
subway_low_with_residual <- above.avg.subway %>%
mutate(
residuals = rating - predict_subway_lm_low
)
ggplot(subway_low_with_residual, aes(1:nrow(subway_low_with_residual))) +
geom_point(aes(y=residuals)) +
geom_abline(intercept=0, slope = 0) + ylim(-2,2) +
labs(title="Subway Low Socio-Economic Predicted Model Residuals \nwhen predicting with High Socio-Economic Subways",y="Residuals", x="Row Number")
cat("Overestimated residuals proportion: ",count.subway.low.SES.below.line, "\n")
cat("Underestimated residuals proportion: ", count.subway.low.SES.above.line, "\n")
```
## Conclusion
From our analysis of 1852 restaurants, we matched each restaurants’ location to its census tract to find out the median household income within the restaurant’s location. From there, we split the data to low, mid, and high socioeconomic status (SES). Based on the regression model training with our data, the predictive model of high SES areas doesn’t create an overestimation or underestimation of the actual data while the predictive model of low SES areas underestimated the actual data. These results indicate that in low SES areas, approximately 63% of the Yelp ratings create an underestimation of the actual rating of restaurants in those areas and an approximately 37% of overestimation. The difference in proportion is not a substantial enough to consider the result as significant. Furthermore, the additional investigated residual analysis on Subway restaurants model in high and low socioeconomic area suggested a huge proportion of underestimated residuals, however, due to its limited sample size, we could not conclude the result was significant. Therefore, we cannot reject our null hypothesis that restaurant’s ratings have biased based on socioeconomic status because our overestimation and underestimation numbers do not have a significant difference.
The analysis may not provide a conclusive evidence that there is biased in restaurants rating based on socioeconomic status, however, it provides a foundation to future research on restaurants rating. There is an array of factors that contribute to a restaurant’s quality such as location proximately, food freshness, taste, and many more that we have not included in the analysis. In addition, there are also other information that we can pull from the City of Seattle such as crime rate/index in neighborhoods, median house rental price, number of individual household members, average age of a neighbourhood, and more. In future research, we hope to include those additional factors and features to enhance our regression model and findings.