-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathprobability_20191106_152_chang.Rmd
222 lines (168 loc) · 10.4 KB
/
probability_20191106_152_chang.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
---
title: "R Notebook"
output: html_notebook
---
## 13.10 Continuous probability
When summarizing a list of numeric values, such as heights, it is not useful to construct a distribution that defines a proportion to each possible outcome. It is not useful to assign a very small probability to every single height.
Just as when using distributions to summarize numeric data, it is much more practical to define a function that operates on intervals rather than single values. The standard way of doing this is using the _cumulative distribution function_ (CDF).
As an example of empirical cumulative distribution function (eCDF), we earlier defined the height distribution for adult male students. Here, we define the vector x to contain these heights:
```{r}
library(tidyverse)
library(dslabs)
data("heights")
x<-heights %>% filter(sex=="Male") %>% pull(height)
```
We defined the empirical distribution function as:
```{r}
F<-function(a) mean(x<=a)
```
which, for any value `a`, gives the proportion of values in the list `x` that are smaller or equal than `a`.
Keep in mind that we have not yet introduced probability in the context of CDFs. Let’s do this by asking the following: if I pick one of the male students at random, what is the chance that he is taller than 70.5 inches? Because every student has the same chance of being picked, the answer to this is equivalent to the proportion of students that are taller than 70.5 inches. Using the CDF we obtain an answer by typing:
```{r}
1-F(70)
```
Once a CDF is defined, we can use this to compute the probability of any subset. For instance, the probability of a student being between height `a` and height `b` is:
```{r}
F(b)-F(a)
```
Because we can compute the probability for any possible event this way, the cumulative probability function defines the probability distribution for picking a height at random from our vector of heights `x`.
## 13.11 Theoretical continuous distributions
Normal distribution is a useful approximation to many naturally occurring distributions, including that of height. The cumulative distribution for the normal distribution is defined by a mathematical formula which in R can be obtained with the function pnorm. We say that a random quantity is normally distributed with average m and standard deviation s if its probability distribution is defined by:
```{r}
F(a)=pnorm(a,m,s)
```
This is useful because if we are willing to use the normal approximation for, say, height, we don’t need the entire dataset to answer questions such as: what is the probability that a randomly selected student is taller then 70 inches? We just need the average height and standard deviation:
```{r}
m<-mean(x)
s<-sd(x)
1-pnorm(70.5, m,s)
```
### 13.11.1 Theoretical distributions as approximations
With continuous distributions, the probability of a singular value is not even defined. For example, it does not make sense to ask what is the probability that a normally distributed value is 70. Instead, we define probabilities for intervals. We thus could ask what is the probability that someone is between 69.5 and 70.5.
In cases like height, in which the data is rounded, the normal approximation is particularly useful if we deal with intervals that include exactly one round number. For example, the normal distribution is useful for approximating the proportion of students reporting values in intervals like the following three:
```{r}
mean(x<=68.5) - mean(x<=67.5)
mean(x<=69.5) - mean(x<=68.5)
mean(x<=70.5) - mean(x<=69.5)
```
Note how close we get with the normal approximation:
```{r}
pnorm(68.5, m,s)-pnorm(67.5, m,s)
pnorm(69.5, m,s)-pnorm(68.5, m,s)
pnorm(70.5, m,s)-pnorm(69.5, m,s)
```
However, the approximation is not as useful for other intervals. For instance, notice how the approximation breaks down when we try to estimate:
```{r}
mean(x<=70.9)-mean(x<=70.1)
```
with
```{r}
pnorm(70.9,m,s)-pnorm(70.1,m,s)
```
In general, we call this situation _discretization_. Although the true height distribution is continuous, the reported heights tend to be more common at discrete values, in this case, due to rounding. As long as we are aware of how to deal with this reality, the normal approximation can still be a very useful tool.
### 13.11.2 The probability density
For categorical distributions, we can define the probability of a category. For example, a roll of a die, let’s call it X, can be 1,2,3,4,5 or 6. The probability of 4 is defined as:
$$Pr(X=4)=1/6$$
The CDF can then easily be defined:
$$F(4)=Pr(X<=4)=Pr(X=4)+Pr(X=3)+Pr(X=2)+Pr(X=1)$$
Although for continuous distributions the probability of a single value $Pr(X=x)$is not defined, there is a theoretical definition that has a similar interpretation. The probability density at $x$ is defined as the function $f(a)$ such that:
$$F(a)=Pr(X<=a)= \int^a_{\-infty}f(x)dx $$
For example, to use the normal approximation to estimate the probability of someone being taller than 76 inches, we use:
```{r}
1-pnorm(76,m,s)
```
In R, we get the probability density for normal distribution using the function `dnorm`.
## 13.12 Monte Carlo simulations for continuous variables
R provides functions to generate normally distributed outcomes. Specifically, the `rnorm` function takes three arguments: size, average (defaults to 0), and standard deviation (defaults to 1) and produces random numbers. Here is an example of how we could generate data that looks like our reported heights:
```{r}
n<-length(x)
m<-mean(x)
s<-sd(x)
simulated_heights<-rnorm(n,m,s)
```
Not surprisingly, the distribution looks normal. This is one of the most useful functions in R as it will permit us to generate data that mimics natural events and answers questions related to what could happen by chance by running Monte Carlo simulations.
If, for example, we pick 800 males at random, what is the distribution of the tallest person? How rare is a seven footer in a group of 800 males? The following Monte Carlo simulation helps us answer that question:
```{r}
B<-10000
tallest<-replicate(B, {
simulated_heights<-rnorm(800,m,s)
max(simulated_heights)
})
```
Having a seven footer is quite rare:
```{r}
mean(tallest>=7*12)
```
Note that it does not look normal.
## 13.13 Continuous distributions
The normal distribution is not the only useful theoretical distribution. Other continuous distributions that we may encounter are the student-t, Chi-square, exponential, gamma, beta, and beta-binomial. R provides functions to compute the density, the quantiles, the cumulative distribution functions and to generate Monte Carlo simulations. R uses a convention that lets us remember the names, namely using the letters `d`, `q`, `p`, and `r` in front of a shorthand for the distribution. We have already seen the functions `dnorm`, `pnorm`, and `rnorm` for the normal distribution. The functions `qnorm` gives us the quantiles. We can therefore draw a distribution like this:
```{r}
x<-seq(-4, 4, length.out = 100)
qplot(x,f, geom = "line", data=data.frame(x,f=dnorm(x)))
```
For the student-t, the shorthand `t` is used so the functions are `dt` for the density, `qt` for the quantiles, `pt` for the cumulative distribution function, and `rt` for Monte Carlo simulation.
## 13.14 Exercises
1. Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is 5 feet or shorter?
```{r}
m<-64
s<-3
pnorm(5*12,m,s)
```
2. Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is 6 feet or taller?
```{r}
m<-64
s<-3
1-pnorm(6*12,m,s)
```
3. Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is between 61 and 67 inches?
```{r}
m<-64
s<-3
pnorm(67, m,s)-pnorm(61,m,s)
```
4. Repeat the exercise above, but convert everything to centimeters. That is, multiply every height, including the standard deviation, by 2.54. What is the answer now?
```{r}
m<-64*2.54
s<-3*2.54
pnorm(67*2.54, m,s)-pnorm(61*2.54,m,s)
```
5. Notice that the answer to the question does not change when you change units. This makes sense since the answer to the question should not be affected by what units we use. In fact, if you look closely, you notice that 61 and 64 are both 1 SD away from the average. Compute the probability that a randomly picked, normally distributed random variable is within 1 SD from the average.
```{r}
m<-64
s<-3
pnorm(m+s,m,s)-pnorm(m-s,m,s)
```
7. To see the math that explains why the answers to questions 3, 4, and 5 are the same, suppose we have a random variable with average $m$ and standard error $s$. Suppose we ask the probability of $X$ being smaller or equal to $a$.Remember that, by definition, $a$ is $(a-m)/s$ standard deviations $s$ away from the average $m$. The probability is:
$$Pr(X<=a)$$
Now we subtract $μ$ to both sides and then divide both sides by $σ$:
$$Pr( \frac{X-m}{s}<= \frac{a-m}{s})$$
The quantity on the left is a standard normal random variable. It has an average of 0 and a standard error of 1. We will call it $Z$:
$$Pr(Z<= \frac{a-m}{s})$$
So, no matter the units, the probability of $X≤a$ is the same as the probability of a standard normal variable being less than $(a−m)/s$. If `mu` is the average and `sigma` the standard error, which of the following R code would give us the right answer in every situation:
a. mean(X<=a)
b. pnorm((a - m)/s)
__c. pnorm((a - m)/s, m, s)__
d. pnorm(a)
7. Imagine the distribution of male adults is approximately normal with an expected value of 69 and a standard deviation of 3. How tall is the male in the 99th percentile?
```{r}
m<-69
s<-3
qnorm(0.99,m,s)
```
8. The distribution of IQ scores is approximately normally distributed. The average is 100 and the standard deviation is 15. Suppose you want to know the distribution of the highest IQ across all graduating classes if 10,000 people are born each in your school district. Run a Monte Carlo simulation with B=1000 generating 10,000 IQ scores and keeping the highest. Make a histogram.
```{r}
B<-1000
m<-100
s<-15
highest<-replicate(B, {
simulated_data<-rnorm(10000,m,s)
max(simulated_data)
})
```
```{r}
df<-data.frame(highest)
```
```{r}
library(ggplot2)
df %>% ggplot(aes(highest, count=n))+geom_histogram(bins=30)
```