-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathhw12.rmd
168 lines (108 loc) · 4.58 KB
/
hw12.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
title: "hw12"
author: "Alvaro Bueno"
date: "4/29/2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
The attached dataset contains real-world data from 2008.
```{r}
countries <- read.csv("https://raw.githubusercontent.com/delagroove/datascience/master/who.csv")
```
### 1.
```{r}
plot(countries$TotExp, countries$LifeExp)
```
There is no trend that shows correlation between average life expectancy for the country and the sum of personal and government expenditures.
#### Simple linear regression and ANOVA
```{r}
lifeexp_lm <- lm(countries$LifeExp ~ countries$TotExp, data = countries)
summary(lifeexp_lm)
lifeexp_lm_anova <- anova(lifeexp_lm)
lifeexp_lm_anova
```
the R square value of 0.2577 and the P-values of 7.714e-14 indicates that there is a very weak correlation between LifeExp and TotExp.
#### Residual Analysis
```{r}
plot(fitted(lifeexp_lm),resid(lifeexp_lm))
abline(0,0)
```
```{r}
qqnorm(resid(lifeexp_lm))
qqline(resid(lifeexp_lm))
```
This plot tells us that using the sum of personal and government expenditures as the sole predictor in the regression model does not sufficiently or fully explain the data.
The Q-Q plot shows that the residuals did not tightly follow the indicated line. The two ends diverge significantly from that line, This behavior indicates that the residuals are not normally distributed.
###2.
```{r}
t_life_exp<- countries$LifeExp^4.6
t_tot_exp<- countries$TotExp^.06
plot(t_tot_exp,t_life_exp)
```
There is correlation between the transformed average life expectancy for the country(t_life_exp) and the transformed sum of personal and government expenditures(t_tot_exp).
####Simple linear regression and ANOVA
```{r}
tlifeexp_lm <- lm(t_life_exp ~ t_tot_exp, data=countries)
summary(tlifeexp_lm)
t_lifeexp_lm_anova <- anova(tlifeexp_lm)
t_lifeexp_lm_anova
```
R square, being higher than 0.7 indicates a strong correlation between LifeExp and TotExp.
#### Residual Analysis
```{r}
plot(fitted(tlifeexp_lm),resid(tlifeexp_lm))
abline(0,0)
qqnorm(resid(tlifeexp_lm))
qqline(resid(tlifeexp_lm))
```
The variances of residuals are Uniformly scattered about zero, a lot of outliers in the lower end, but this model performs bettter then the original one.
### 3.
```{r}
rows <- nrow(countries)
f <- 0.5
upper_bound <- floor(f * rows)
permuted_countries.dat <- countries[sample(rows), ]
train.dat <- permuted_countries.dat[1:upper_bound, ]
test.dat <- permuted_countries.dat[(upper_bound+1):rows, ]
t_life_exp <- train.dat$LifeEx^4.6
t_tot_exp <- train.dat$TotExp^0.06
#computing the model's coefficients as training the regression model.
tnlifeexp_lm <- lm(t_life_exp ~ t_tot_exp, data=train.dat )
#uses the model obtained from above to compute the predicted outputs
predicted.dat <- predict(tnlifeexp_lm, newdata=test.dat)
#finc the difference between the predicted and measured performance
delta <- predicted.dat - test.dat$LifeEx^4.6
#use t-test to see how well a model trained on the train.dat data set predicted the performance of the processors in the test.dat data set
t.test(delta, conf.level = 0.95)
```
we obtain a 95 percent confidence interval of [-36302146, 56231243]. This is a reasonably tight confidence interval that includes zero. Thus, we conclude that the model is reasonably good at predicting values in the test.dat data set when trained on the train.dat data set.
##### When TotExp^.06 =1.5
```{r}
LifeExp_1.5 <- -736527910 + 620060216 * 1.5
cat("life expectancy: ",LifeExp_1.5^(1/4.6))
```
##### When TotExp^.06 =2.5
```{r}
LifeExp_2.5 <- -736527910 + 620060216 * 2.5
cat("life expectancy: ",LifeExp_2.5^(1/4.6))
```
### 4.
LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
#### multiple regression model and ANOVA
```{r}
ml_lifeexp_lm <- lm(LifeExp ~ PropMD+TotExp+(PropMD*TotExp), data=countries)
summary(ml_lifeexp_lm )
ml_lifeexp_lm_anova <- anova(ml_lifeexp_lm)
ml_lifeexp_lm_anova
```
R square indicates that there is not a strong correlation between LifeExp and TotExp.
### Residual Analysis
```{r}
plot(fitted(ml_lifeexp_lm),resid(ml_lifeexp_lm))
abline(0,0)
qqnorm(resid(ml_lifeexp_lm))
qqline(resid(ml_lifeexp_lm))
```
The plot tells that using the sum of personal and government expenditures, proportion of the population countries are MDs, and the products of these two variables as the predictors in the regression model does not sufficiently expalin the data. Also, The Q-Q plot shows that the residuals did not tightly follow the indicated line. The two ends diverge significantly from that line, using the variables above as predictors in the model is insufficient to explain the data.