-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathGlobalHealth_stats.Rmd
224 lines (139 loc) · 4.02 KB
/
GlobalHealth_stats.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
---
title: "Global Health Stats"
output:
html_notebook:
toc: yes
theme: spacelab
---
<style type="text/css">
a {color: #3d0066; font-size: 11;}
p {color: #000000; font-family:"monaco" font-size:13;}
h1 {color: #3d0066 }
h2 {color: #6b00b3 }
h3 {color: #9900ff }
h4 {color: #cc0099 }
ul {list-style-type: circle; color: #6b00b3 }
ol {color: #cc0099; font-size:11; }
p1 {color: #004d4d;}
li {color: #004d4d;}
</style>
<t>
![](RStudio-Logo 1.png)</t>
# Install Gapminder library
```{r Install Gapminder Library}
install.packages("gapminder")
library(gapminder)
data(gapminder)
```
# Get the Summary Statistics
```{r Get Summary Stats}
summary(gapminder)
```
# Get the mean of a column
```{r Mean}
# the mean of the GDP per Capita
gdp.mean = mean(gapminder$gdpPercap)
gdp.mean
```
## use the attach function to get column values
```{r}
attach(gapminder)
median(pop)
```
## quick histogram
```{r}
# histogram of life expectancy log base
log.lifeExp = log(pop)
hist(log.lifeExp)
```
## quick boxplot
```{r}
# life expectancy based on continent
boxplot(lifeExp ~ continent)
```
## quick scatterplot
```{r}
# life expectancy (dep var) on the Y axis
# gdpPercap (indep var) on X axis
plot(lifeExp ~ log(gdpPercap))
```
# install the dplyr library
The pipe function %>% with the select function for the column variables of interest. The second pipe is like the "and then" with the filter function to select row variable based on a condition.
The mutate function is used to create a new variable from a data set. In order to use the function, we need to install the dplyr package.
data %>% select(col.1, col.2, col.3) %>%
filter(col.1 =="Type 2") %>%
mutate(BMI = weight/ (height^2)) %>%
summarize(Average_BMI = mean(BMI))
```{r}
library(dplyr)
gapminder %>%
select(country, lifeExp)
```
```{r}
gapminder %>%
select(country, lifeExp) %>%
filter(country =="Norway" | country =="Canada") %>%
group_by(country) %>%
summarise(Avg_LifeExp = mean(lifeExp))
```
## Variation of sample
Differences by chance/ no difference (Null hypothesis) which t-test is used with a p-value p= 0.05. Null is p < 0.05. This is compared to real change (alternative hypothesis) p > 0.05. The 95% confidence interval.
```{r}
df.1 = gapminder %>%
select(country, lifeExp) %>%
filter(country =="Norway" | country =="South Africa")
# use the t-test to determine statistical differance
t.test(data= df.1,
lifeExp ~ country)
```
The odds of our sample difference is the p-value, so for this case we reject the null hypothesis based on the differences of means is NOT zero.
```{r}
library(ggplot2)
gapminder %>%
filter(gdpPercap < 50000) %>%
ggplot(aes(x= log(gdpPercap),
y= lifeExp,
col= continent,
size= pop)) +
geom_point(alpha= 0.3) +
geom_smooth(method = lm)
```
```{r}
# linear regression
summary(lm(lifeExp ~ gdpPercap))
```
```{r}
# multiple linear regression
summary(lm(lifeExp ~ gdpPercap + pop))
# the P-value is the Pr
```
The p-values are 2e-16 and 4.72e-05
Population => sample : measure variables
Categorical = groups, summarize
Numerical = summarize : range, IQR, mean, median
sample tests:
- 1 sample proportion test
- chi-squared test
- t-test
- ANOVA
- correlation test
1 categorical (binary) : 1 sample proportion test <br>
2 categorical (child/adult/senior) : chi squared <br>
1 numeric : t-test <br>
1 numeric AND 1 categorical : t-test or ANOVA <br>
2 numeric : correlation <br>
## Null and Alt hypothesis
- is there a difference? what are chances/ probability of difference?
- Null: no difference
- ALt: there is no difference
- alpha value = 0.05
- test: 1 sample proportion test, get p-value
- if p-value < alpha = reject null, statistical significant
"is the average height different significant between groups?"
- Null: there is no difference
- Alt: there is a difference
- alpha= 0.05
- test: t-test, compare averages, p-value
- test: ANOVA/ correlation, compare weight vs height
- if p-value < alpha = reject null
correlation -1 to +1