Baseball Data Analysis.Rmd

---
title: "Baseball Data Analysis - Technical Summary"
author: Scott Page & Alex Dantoft
output: pdf_document
---

```{r include = FALSE}
library(ggplot2)
library(dplyr)
library(lm.beta)
library(car)
library(knitr)
source("analyse.R")
load("data_summary.Rdata")
load("silver_slug_model.Rdata")
```

# Exploratory analysis of the data

The data set contains baseball player data from `r min(by_year$yearID)` to `r max(by_year$yearID)`. Records are maintained on each player, for each year, team, league, stint, and fielding position they played. Overall, `r player_count` unique players are recorded in the data set, covering `r man_years` man years.

```{r echo=FALSE}
ggplot(by_year, aes(x = as.factor(yearID), y = players)) +
  geom_bar(stat = "identity", fill = "Blue") +
  xlab("Year") +
  ylab("Players") +
  ggtitle("Players per Year") +
  theme(axis.text.x = element_text(angle = -60, hjust = -0))
```

## Batting Data Exploration

Batting data was recorded including:

* b_G - Games
* b_G_batting - Games as Batter
* b_AB - At Bats
* b_R - Runs
* b_H - Hits
* b_2B - Doubles
* b_3B - Triples
* b_HR - Home Runs
* b_RBI - Runs Batted In
* b_SB - Stolen Bases
* b_CS - Caught Stealing
* b_BB - Bases on Balls (Walks)
* b_SO - Strikeouts
* b_IBB - Intentional Bases on Balls (Walks)
* b_HBP - Hit By Pitch
* b_SH - Sacrifice Hits (Bunts)
* b_SF - Sacrifice Flies
* b_GIDP - Grounded into Double Plays
* b_G_old - Old version of games (depreciated) 

In addition to the baseball statistics provided, several additional ratios were computed to normalize the data:

* b_hits_per_AB - Hits per At Bat
* b_runs_per_AB - Runs per At Bat
* b_runs_per_H - Runs per Hit
* b_home_runs_per_AB - Home Runs per At Bat
* b_balls_per_AB - Balls per At Bat
* b_RBI_per_H - Runs Batted In per Hit
* b_HBP_per_AB - Hit By Pitch per At Bat
* b_games_batted_per_all_games - Games Batted per All Games Played

Finally, award data was considered:

```{r echo=FALSE}
ggplot(filter(award_winners, !is.na(awardID)), aes(x = as.factor(awardID), y = players)) +
  geom_bar(stat = "identity", fill = "Blue") +
  xlab("Award") +
  ylab("Players") +
  ggtitle("Unique Players") +
  theme(axis.text.x = element_text(angle = -60, hjust = -0))
```

As the graphs suggests, the Silver Slugger Award has the most winners contained in the data set. As a result of having the most data, it was chosen as the dependent variable.  

The data-set also contained team, league, & fielding data that was not considered.

# Model selection and Definitions
To model the likelihood that a player would received the Silver Slugger (SS) Award a Logistic Regression was performed. According to Wikipedia, The SS award is awarded annually to the best offensive player (batting) for each position in both leagues of the MLB. A single player can win the SS multiple times over a series of years. Since the SS is a batting award recognizing players of high offensive value, the batting data set was used. Voters vote for players based on several batting ratios that we were able to derive from our raw count data.

## Full Model:

```{r echo=F}
summary(model)
```

We tested both backward selection and step-wise selection routines. They both concluded with the same model - we simply used the backward selection model. We did not pursue additional selection procedures.

## Backward Selected Model

```{r echo=F}
summary(model_backwards_selection)
```

## Multicollinearity

```{r echo=F,results='asis',error=F,warning=F}
kable(vif(model_backwards_linear) %>% 
        {data.frame(Parameters = names(.), VIF = ., row.names = NULL)}, 
      format = "markdown")
```

Since we are dealing with a logistic  model, we had to create a linear "side model" for multicollinearity testing purposes based on the backward selected model. Unfortunately the VIF analysis shows that we have a plethora of multicollinearity issues. This does however make sense as much of the x variables are dependent on one another. For example, there cannot be a run without a hit (normally speaking; being walked is a possibility as well). 

To alleviate the multicollinearity issue, the following variables were removed:

* Games Played (b_G)
* Games Played Where Batted (b_G_batting)
* At Bats (b_AB)
* Runs Batted in (b_RBI)

## Revised Backward Selected Model

```{r echo=F}
summary(model_backwards_semifinal)
```

After removing the x variables that had multicollinearity issues, we then went through and removed non significant variables using a .05 p-value threshold. The variables with the highest p-values were removed one at a time until all variables were significant. Although the AIC increased, our model is now free of multicollinearity and is only fit on significant variables.

## Final Model
We now have a final model with only significant variables and have removed variables with multicolinearity:

```{r echo=F}
summary(model_backwards_final)
```

```{r include = FALSE}
print_back_selected_model <- paste(round(model_backwards_final$coefficients, digits = 2), names(model_backwards_final$coefficients), sep = " * ")
print_back_selected_model[1] <- round(model_backwards_final$coefficients[1],digits = 2)
print_back_selected_model <- paste(print_back_selected_model, collapse = " + ")
print_back_selected_model <- paste("log(odds(win_silver_slug)) = ", print_back_selected_model, sep = "")
```
The Model is: `r print_back_selected_model`

```{r echo=F,results='asis',error=F,warning=F}
kable(vif(model_backwards_final_lm) %>% 
        {data.frame(Parameters = names(.), VIF = ., row.names = NULL)}, 
      format = "markdown")
```

## Check Fit
The following box plots will be used to inspect the distribution of the independent values vs whether the player won the Silver Slugger or not.

```{r echo=FALSE}
batting_data[which(colnames(batting_data) %in% c("win_silver_slug", names(model_backwards_final$coefficients)))] %>%
plot_all_box(which(colnames(.) == "win_silver_slug"), .)
```

In all of the box plots we can see that the mean value is greater in the group of players that won the award. This seems reasonable except in the case of Strike Outs plot (b_SO). Based on our model, we know that players with fewer strike outs are more likely to win the award. Since the box plot clearly shows that players that win the award often have many strike outs, we can see that strike outs are expected from the best players, however they are penalized when deciding the most likely winner given identical Hits, Home Runs, Intentional Balls & Stolen Bases.

## Strongest Model Predictors
```{r include=F}
std_beta <- lm.beta(model_backwards_final)$standardized.coefficients
```

```{r echo=F,results='asis',error=F,warning=F}
kable(std_beta %>% 
        {data.frame(Parameters = names(.), "Standardized_Coefficients" = ., row.names = NULL)}, 
      format = "markdown")
```

The Standardized Beta show that players that wish to increase their odds of winning the award should focus on hitting the ball more (increasing b_H), followed by hitting Home Run's (b_HR), and avoiding Striking Out (b_SO).

## Leverage Points

```{r echo=FALSE}
influencePlot(model_backwards_final, 
              labels= paste(batting_data$playerID, batting_data$yearID), 
              main="Influence Plot",
              sub="Circle size is proportial to Cook's Distance")
```

We can see that Barry Bonds in 2007 (bondsba01 2007) represented a massive outlier in our data set. His outstanding hitting record is represented by large Hat values on the X axis. The rather large circle that represents his Cook Distance, shows that the model poorly predicted the fact that he did not win the award in addition to his strange hitting record. It's important to note that the Silver Slugger Award is not based purely on a player's stats. The winners are chosen based on votes from coaches in the league; in other words the votes can be influenced by things such as the media and the image of the MLB. The reason the model erroneously predicted Barry Bonds winning the award in 2007 is likely due to a steroid abuse scandal that grew in media attention in the years prior to 2007. Since our model is purely based on stats, the subjectivity of the votes is not considered in our analysis.

Also shown as an outlier is Tim Leary in 1988 (learyti01 1998). Tim won the Silver Slugger Award however his batting stats when processed through our model suggest that he should have not won. After further investigation, we determined that Tim was a pitcher in the National league. Recall that the SS Award is awarded to a single player from each field position. Often in baseball, pitchers focus all of their time and talent developing their throwing arm to yield better pitches; batting is not as big of a focus. In the American League, there are Designated Hitters - individuals who step in and bat instead of the pitcher. In the National league, however, designated hitters are not allowed. This means that the National League has a pool of players that have won the SS award who however do not have the greatest batting stats since they are pitchers.

## Computing Predictions

### Prediction 1:
```{r echo=F}
testPredict1 <- data.frame(b_H = c(150),
                           b_HR = c(30), 
                           b_SB = c(10), 
                           b_SO = c(30), 
                           b_IBB = c(10))
predict1 <- predict(model_backwards_final, newdata = testPredict1, type = "response", se.fit = T)
```

When using x variables values of:

```{r echo=F,results='asis',error=F,warning=F}
kable(testPredict1, 
      format = "markdown")
```

Our model predicts that the probability of winning the Silver Slugger Award is **`r paste(round(predict1$fit*100, digits = 1), sep='', '%')`** with a standard error of **`r paste(round(predict1$se.fit*100, digits = 1), sep='', '%')`**. Therefore our 95% prediction intervals are from **`r paste(round((predict1$fit-1.96*predict1$se.fit)*100, digits = 1), sep='', '%')`** to **`r paste(round((predict1$fit+1.96*predict1$se.fit)*100, digits = 1), sep='', '%')`**.


### Prediction 2:
```{r echo=F}
testPredict2 <- data.frame(b_H = c(50),
                           b_HR = c(30), 
                           b_SB = c(1), 
                           b_SO = c(10), 
                           b_IBB = c(5))
predict2 <- predict(model_backwards_final, newdata = testPredict2, type = "response", se.fit = T)
```

When using x variables values of:

```{r echo=F,results='asis',error=F,warning=F}
kable(testPredict2, 
      format = "markdown")
```

Our model predicts that the probability of winning the Silver Slugger Award is **`r paste(round(predict2$fit*100, digits = 1), sep='', '%')`** with a standard error of **`r paste(round(predict2$se.fit*100, digits = 1), sep='', '%')`**. Therefore our 95% prediction intervals are from **`r paste(round((predict2$fit-1.96*predict2$se.fit)*100, digits = 1), sep='', '%')`** to **`r paste(round((predict2$fit+1.96*predict2$se.fit)*100, digits = 1), sep='', '%')`**.

## Finding an Optimal Cut Off

Since few players win the Silver Slugger award each year, a naive prediction cut off of 0.5 would not provide good results. To find a more optimal cut off, cut off values were trialed, and the one that produced the maximum F Measure was chosen. The Optimal F Measure is marked with a vertical line on the following plot.

```{r message=FALSE, warning=FALSE, echo=FALSE}
plot_of_cut_offs
```

After searching for a good cut off value, `r best_cut_off["cut_offs"]` was chosen yielding the following results.

```{r echo=F,results='asis',error=F,warning=F}
kable(best_cut_off[,1:5], 
      format = "markdown")
```

```{r echo=F,results='asis',error=F,warning=F}
kable(best_cut_off[,6:ncol(best_cut_off)], 
      format = "markdown")
```

## Cross Validation
To validate the predictive strength of the model the following procedure was repeated across 5 folds, 5 times for a total of 25 trials:

1. Fit Model To Training Data
2. Determine Cut Offs Using Training Data
3. Test Model Prediction using Test Data

The following metrics were computed:

```{r echo=FALSE}
boxplot(cross_validated_results[,c("accuracy", "specificity")],
        main = "Cross Validated Metrics",
        ylab = "Value of Metric",
        xlab = "Metric")
```

We can see that the Accuracy and Specificity of the cross validated models is fairly stable and high. This is to be expected since the vast majority of players do not win awards each year.

```{r echo=FALSE}
boxplot(cross_validated_results[,c("recall", "precision", "f_measure")],
        main = "Cross Validated Metrics",
        ylab = "Value of Metric",
        xlab = "Metric")
```

We can see that the model has a mean recall of `r round(mean(cross_validated_results$recall), digits = 2)` meaning `r round(mean(cross_validated_results$recall), digits = 2) * 100`% of the players each year that won an award were correctly labeled. Since the mean precision is `r round(mean(cross_validated_results$precision), digits = 2)`, `r round(mean(cross_validated_results$precision), digits = 2) * 100`% of the players the model predicts to win, actually do so.

# Conclusions
In conclusion, our model proves that there is a statistically significant increase in a player's likelihood of winning the Silver Slugger Awards if they hit more balls, have more home runs, avoid striking out, steal more bases, and are intentionally walked by the pitcher more often. Using 5 Fold Cross validation, and an optimal cut off by F Measure:

* `r round(mean(cross_validated_results$recall), digits = 2) * 100`% of the players each year that won an award were correctly labeled
* `r round(mean(cross_validated_results$precision), digits = 2) * 100`% of the players the model predicts to win, actually did

## Ideas to Further Improve the Model
Due to time constraints, some parameters available were not tested. Of specific interest, an interaction model taking into account the players field position might improve the fit since the award is given by field position. Additionally, quantifying the media sentiment toward a player might have helped avoid the Barry Bonds outliers that were detected and made a better model.

# Appendix
## Analysis Files
### do.R
Run the main function in the do file to generate .Rdata files needed when you run `Baseball Data Analysis.Rmd`
```{r echo=F}
for (line in readLines(file("do.R", "r"))){
  cat(line, "\n")
}
```
### get.R
Getter functions to form the initial data frames.
```{r echo=F}
for (line in readLines(file("get.R", "r"))){
  cat(line, "\n")
}
```
### munge.R
Mutate the data for analysis.
```{r echo=F}
for (line in readLines(file("munge.R", "r"))){
  cat(line, "\n")
}
```
### analyse.R
model_silver_slug function will model the create the models used in the report and dump a .Rdata file for use in the markdown.
```{r echo=F}
for (line in readLines(file("analyse.R", "r"))){
  cat(line, "\n")
}
```
## Report Files
### Baseball Data Analysis.Rmd
This is the markdown code that generated this document.
```{r echo=F}
for (line in readLines(file("Baseball Data Analysis.Rmd", "r"))){
  cat(line, "\n")
}
```