---
title: 'Lab 2: Probability distributions'
author: "GE509"
output: html_document
---
## Objectives
* Familiarize yourself with common probability distributions and R's functions for calling them
* Understand how different probability distributions are related to one another
* Understand how discrete distributions can be approximated by continuous distributions
## Git updating
* If you originally pulled directly from the original repository
+ In RStudio's "Git" tab, click on the down arrow
+ OR at the command line `git pull`
* If you FORKED the original repo to your Github account
+ At the command line ```git pull git@github.com:mdietze/EE509.git```
+ OR ```git pull https://github.com/mdietze/EE509.git```
## Probability distributions in R
Because R is a statistical language, a large number of probability distributions are available by default and an even larger number can be loaded from packages. The table below lists the most common distributions in R, the name of the corresponding function, and the parameters of each distribution.
Common distributions
----------------------
Distribution | R name | Parameters
------------ | ------ | ----------
beta | beta | shape1, shape2, ncp
Binomial | binom | size, prob
Cauchy | cauchy | location, scale
chi-squared | chisq | df, ncp
exponential | exp | rate
F | f | df1, df2, ncp
gamma | gamma | shape, rate (or scale)
geometric | geom | prob
hypergeometric | hyper | m, n, k
log-normal | lnorm | meanlog, sdlog
logistic | logis | location, scale
Negative binomial | nbinom | size, prob
normal | norm | mean, sd
Poisson | pois | lambda
Student's t | t | df, ncp
uniform | unif | min, max
Weibull | weibull | shape, scale
Wilcoxon | wilcox | m, n
R actually provides four related functions for each probability distribution. These functions are called by adding a letter at the beginning of the function name. The variants of each probability distribution are:
* "d" = density: probability density function (PDF)
* "p" = cumulative distribution function (CDF)
* "q" = quantile: calculates the value associated with a specified tail probability, inverse of "p"
* “r” = random: simulates random numbers
The first argument to these functions is the same regardless of the distribution and is the value *x* for "d", the quantile *q* for "p", the probability *p* for "q", and the sample size *n* for "r"
All of this will make more sense once we consider a concrete example. Let's take a look at the normal probability density function first, since it's the one you're most familiar with. If you use **?dnorm** you'll see that for many of the function arguments there are default values, specifically mean=0 and sd=1. Therefore if these values are not specified explicitly in the function call R assumes you want a *standard Normal* distribution.
```{r}
x = seq(-5,5,by=0.1)
plot(x,dnorm(x),type='l') ## that’s a lowercase “L” for “line”
abline(v=0) ## add a line to indicate the mean (“v” is for “vertical”)
lines(x,dnorm(x,2),col=2) ## try changing the mean (“col” sets the color)
abline(v=2,col=2)
lines(x,dnorm(x,-1,2),col=3) ## try changing the mean and standard dev
abline(v=-1,col=3)
```
The above plot can also be produced using R's `curve` function
```{r}
curve(dnorm,-5,5)
abline(v=0)
curve(dnorm(x,2),-5,5,add=TRUE,col=2)
abline(v=2,col=2)
curve(dnorm(x,-1,2),-5,5,add=TRUE,col=3) ## try changing the mean and standard dev
abline(v=-1,col=3)
```
You are welcome to use either approach in the following examples.
You will use density functions quite frequently to estimate the **likelihood** of data. For example, if we collected a data point of x = 0.5, we can calculate the likelihood that this data point came from each of these three normal distributions. Implicit in the likelihood is that we're looking at the probability of the data in a very small window, $\delta x$, and that $\delta x$ never changes, so that:
$$Pr(x < X < x + \delta x) = \int_{x}^{x + \delta x} f(x)dx \propto f(x)$$
```{r}
dnorm(0.5,0,1)
dnorm(0.5,2,1)
dnorm(0.5,-1,2)
```
This shows that the first distribution has a higher likelihood than the other two, which are about the same. We interpret this as saying that the observed data point was more likely to have been generated by the first distribution than the other two. This is consistent with the height of each curve at x = 0.5 in the plot above.
This plot of the normal distribution and the effects of varying the parameters in the normal are both probably familiar to you already – changing the mean changes where the distribution is centered while changing the standard deviation changes the spread of the distribution. Next try looking at the CDF of the normal:
```{r}
plot(x,pnorm(x,0,1),type='l')
abline(v=0)
lines(x,pnorm(x,2,1),col=2)
abline(v=2,col=2)
lines(x,pnorm(x,-1,2),col=3)
abline(v=-1,col=3)
```
Using the CDF we can calculate $Pr(X \leq x)$ for our data point
```{r}
pnorm(0.5,0,1)
pnorm(0.5,2,1)
pnorm(0.5,-1,2)
```
If the value 0.5 here corresponded to some hypothesis, the **CDF could be used to calculate a p-value associated with the one-sided test**. Would any be significant at the $\alpha$=0.05 level?
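As a reminder, the corresponding upper-tail probability can be computed either by subtraction or directly with `pnorm`'s `lower.tail` argument:
```{r}
1 - pnorm(0.5,0,1)              ## Pr(X > 0.5)
pnorm(0.5,0,1,lower.tail=FALSE) ## same value, computed directly
```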
Next let's look at the function qnorm. Since the input to this function is a probability, the x-values for the plot are restricted to the range [0,1].
```{r}
p = seq(0,1,by=0.01)
plot(p,qnorm(p,0,1),type='l',ylim=range(x)) # ylim sets the y-axis range
# range returns the min/max as a 2-element vector
abline(h=0) # “h” for “horizontal”
lines(p,qnorm(p,2,1),col=2)
abline(h=2,col=2)
lines(p,qnorm(p,-1,2),col=3)
abline(h=-1,col=3)
```
It should be readily apparent that the quantile function is the inverse of the CDF. This function can be used to find the median of the distribution (p = 0.5) or to **estimate confidence intervals** at any level desired.
```{r}
qnorm(c(0.025,0.975),0,1) # what width CI is specified by these values?
plot(p,qnorm(p,0,1),type='l',ylim=range(x))
abline(v=c(0.025,0.975),lty=2) # add vertical lines at the CI
abline(h=qnorm(c(0.025,0.975)),lty=2) #add horizontal lines at the threshold vals
plot(x,dnorm(x,0,1),type='l') # plot the corresponding pdf
abline(v=qnorm(c(0.025,0.975)),lty=2)
```
Finally, let's investigate the rnorm function for generating random numbers that have a normal distribution. Here we generate histograms that have a progressively larger sample size and compare that to the actual density of the standard normal.
```{r}
n = c(10,100,1000,10000) # sequence of sample sizes
for(i in 1:4){ # loop over these sample sizes
hist(rnorm(n[i]),main=n[i],probability=TRUE,breaks=40)
#here breaks defines number of bins in the histogram
lines(x,dnorm(x),col=2)
}
```
To make these plots we introduced the use of a ```for``` loop. A for loop iterates over all specified values of some index. Above, we used ```i``` as our index, and specified values 1, 2, 3, and 4 with the 1:4 syntax from last week. `for` loops always start this way (including parentheses): ```for(index in vector)```. The code in the { } then executes once for every value in the vector, with `i` taking on the next value in the list. The ```for``` syntax in R is very flexible and will allow you to loop over a vector of anything, not just integers...characters, factors, filenames, etc. Also note that your index variable can be anything, but to avoid unwanted results make sure it’s not the same as a variable you’re using elsewhere in the code (e.g. indexing by `x` above would ruin everything, because `x` would be overwritten on every iteration).
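For instance, here is a minimal sketch (the filenames are made up for illustration) of looping over a character vector rather than integers:
```{r}
for(fname in c("plotA.csv","plotB.csv","plotC.csv")){ ## hypothetical filenames
  print(fname) ## in real code you might read and process each file here
}
```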
One other technical note: like any function in R that generates random output, this example will give different results every time you run it.
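If you need a random-number example to be reproducible, you can fix R's random seed beforehand with base R's `set.seed`:
```{r}
set.seed(509) ## any integer works; the same seed reproduces the same draws
rnorm(3)
rnorm(3)      ## a second call gives different numbers...
set.seed(509)
rnorm(3)      ## ...but resetting the seed recovers the first set
```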
This example demonstrates that **as the number of random draws from a probability distribution increases, the histogram of those draws provides a better and better approximation of the density itself**. We will make use of this fact extensively this semester because – as odd as this may sound now – there are many distributions that are easier to randomly sample from than solve for analytically. We can also show that as the sample size increases, the sample mean and standard deviation get closer to the true population values (mean = 0, sd = 1 in this example) and the standard error gets smaller:
```{r}
y = rnorm(10) ## Sample size = 10
y
mean(y)
sd(y) ## standard deviation
sd(y)/sqrt(10) ## standard error
y = rnorm(10000) ## Sample size = 10000
mean(y)
sd(y) ## standard deviation
sd(y)/sqrt(10000) ## standard error
```
We are next going to investigate several other common probability distributions. There will be a few examples of the effects of varying parameters for each distribution, but you are strongly encouraged to explore other values to see how they affect the shape of each PDF. You might also want to refer to Appendix F in the textbook or Chapter 3 of Hilborn and Mangel in order to learn more about each distribution. There is also a good chart at http://www.johndcook.com/distribution_chart.html that describes the relationships among these common distributions, and the Wikipedia articles for most of them are good for quick reference.
There's a good bit of code this week. We recommend running each chunk of code by itself, and especially that you **make sure you understand what it did before proceeding to the next box**.
### LAB REPORT
For each of the distributions below, please include the multi-panel figure generated in your lab report and provide a short answer to the questions that follow. If you write any additional code in deriving your answers (you may not need to in this lab), include it and any R outputs or figures it generates. When doing calculations "by hand," include the formulas you used. You do not have to show all your work, but it may make it easier to assign partial credit.
**For this lab, turn everything in as a single .Rmd file by the start of next lab.**
# Part 1: Continuous distributions
*Note: if you didn’t start with the tutorial above, you’ll need to define the x and p variables before running the code below.*
## Uniform
The uniform is used when there's an equal probability of an event occurring over a range of values.
```{r}
plot(x,dunif(x),type='l')
lines(x,dunif(x,0,4),col=2)
lines(x,dunif(x,-3,-0.5),col=3)
plot(x,punif(x),type='l')
lines(x,punif(x,0,4),col=2)
lines(x,punif(x,-3,-0.5),col=3)
plot(p,qunif(p),type='l',ylim=range(x))
lines(p,qunif(p,0,4),col=2)
lines(p,qunif(p,-3,-0.5),col=3)
hist(runif(500,-3,3),breaks=30,xlim=range(x),probability=TRUE)
lines(x,dunif(x,-3,3),col=2)
```
### Lab questions:
```
1. Why does the height of the uniform PDF change as the width changes?
2. What do the second and third arguments to the uniform specify? What are their default values?
```
## Beta
The Beta is an interesting distribution because it is bounded on the range $0 \leq X \leq 1$ and thus is very often used to describe data that are proportions. At first glance the mathematical formula for the Beta looks a lot like the Binomial:
$$Beta(x \mid a,b) \propto x^{a-1} (1-x)^{b-1}$$
$$Binom(x \mid n,p) \propto p^x (1-p)^{n-x}$$
The critical difference between the two is that in the Beta the random variable X is in the base while in the Binomial it is in the exponent. Unlike many distributions you may have used in the past, the two shape parameters of the Beta do not directly define the mean and variance, though these can be calculated as simple functions of $\alpha$ and $\beta$. The Beta does have an interesting property of symmetry, whereby Beta($\alpha$,$\beta$) is the reflection of Beta($\beta$,$\alpha$).
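As a quick numerical check of that reflection property (this snippet is not part of the original lab):
```{r}
dbeta(0.3,2,5)
dbeta(0.7,5,2) ## same density: Beta(a,b) at x equals Beta(b,a) at 1-x
```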
```{r}
plot(p,dbeta(p,5,5),type='l')
lines(p,dbeta(p,1,1),col=2)
lines(p,dbeta(p,0.2,0.2),col=3)
## vary beta
plot(p,dbeta(p,6,6),type='l',ylim=c(0,5))
lines(p,dbeta(p,6,4),col=2)
lines(p,dbeta(p,6,2),col=3)
lines(p,dbeta(p,6,1.25),col=4)
lines(p,dbeta(p,6,1),col=5)
lines(p,dbeta(p,6,0.25),col=6)
legend("topleft",legend=c(6,4,2,1.25,1,0.5),lty=1,col=1:6)
plot(p,pbeta(p,5,5),type='l')
lines(p,pbeta(p,1,1),col=2)
lines(p,pbeta(p,0.2,0.2),col=3)
hist(rbeta(500,3,3),breaks=30,xlim=range(p),probability=TRUE)
lines(p,dbeta(p,3,3),col=2)
```
### Lab questions:
```
3) The Beta has a special case, Beta(1,1) that is equivalent to what other PDF?
4) In the first panel, the mean is the same for each line (0.5). What are the variances? (Hint: Calculate this analytically. Look up the distribution in one of the recommended references.)
5) In the second panel, what are the means and medians of each of these curves? (Hint: you'll need to calculate the mean analytically and use one of the variants of R's beta function to find the median.)
```
## Log Normal
The lognormal is the distribution of a random variable whose logarithm is normally distributed. It is defined on the range X > 0 and so is commonly used for data that cannot be negative by definition. The distribution is also positively skewed and so is often used for skewed data. One thing that often goes unappreciated with the log-normal is that the mean, E[X], depends on the variance:
$$E[X] = e^{\mu + {{1}\over{2}}\sigma^2}$$
This applies not just when you explicitly use the lognormal, but also **whenever you log-transform data** and then calculate a mean or standard deviation – a fact that is vastly under-appreciated in the biological and environmental sciences and frequently missed in the published literature. In fact, ANY data transformation applied to make data “more normal” will change the mean, with the functional form of the bias depending on the transformation used. You cannot simply back-transform the data without correcting for this bias. This phenomenon is another illustration of Jensen's Inequality.
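Here is a minimal simulation sketch of this bias, using $\mu = 2$ and $\sigma = 1$ (the same parameters as the random-sample panel below):
```{r}
set.seed(1)
z = rlnorm(100000,meanlog=2,sdlog=1)
mean(z)           ## arithmetic mean of the draws
exp(2 + 1/2)      ## E[X] from the formula above; should agree closely
exp(mean(log(z))) ## naive back-transform of the log-scale mean: biased low (~exp(2))
```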
```{r}
## changing the mean
x <- 10^seq(-2,2,by=0.01)
plot(x,dlnorm(x,0),type='l',xlim=c(0,15),main="Changing the Mean")
lines(x,dlnorm(x,1),col=2)
lines(x,dlnorm(x,2),col=3)
legend("topright",legend=0:2,lty=1,col=1:3)
## on a log scale
plot(x,dlnorm(x,0),type='l',log='x',main="Log Scale")
lines(x,dlnorm(x,1),col=2)
lines(x,dlnorm(x,2),col=3)
abline(v=exp(0),col=1)
abline(v=exp(1),col=2)
abline(v=exp(2),col=3)
legend("topright",legend=0:2,lty=1,col=1:3)
## changing the variance
plot(x,dlnorm(x,2,.125),type='l',xlim=c(0,20),ylim=c(0,0.6),main="Changing the Variance")
lines(x,dlnorm(x,2,0.25),col=2)
lines(x,dlnorm(x,2,0.5),col=3)
lines(x,dlnorm(x,2,1),col=4)
lines(x,dlnorm(x,2,2),col=5)
lines(x,dlnorm(x,2,4),col=6)
abline(v=exp(2),col=1)
legend("topright",legend=c(0.125,0.25,0.5,1,2,4),lty=1,col=1:6)
## random sample
hist(rlnorm(250,2,1),breaks=30,probability=TRUE)
lines(x,dlnorm(x,2,1),col=4)
```
### Lab questions:
```
6) What are the arithmetic and geometric means of the three curves in the first panel? (Reminder: arithmetic means are means in the linear domain, geometric means are means in the log domain)
```
## Exponential & Laplace
The exponential distribution arises naturally as the time it takes for an event to occur when the average rate of occurrence, *r*, is constant. The exponential is a special case of the Gamma (discussed next), $Exp(X \mid r) = Gamma(X \mid 1,r)$. The exponential is also a special case of the Weibull, $Exp(X \mid r) = Weibull(X \mid 1, 1/r)$ (shape = 1, scale = $1/r$), where the Weibull is a generalization of the exponential that allows the event rate to increase or decrease with time. The Laplace is basically a two-sided exponential and arises naturally if one is dealing with absolute deviation, $|x-m|$, rather than squared deviation, $(x-m)^2$, as is done with the normal.
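These special cases are easy to verify numerically (a quick check, not part of the original lab; note that R's `dweibull` takes shape and scale arguments):
```{r}
r = 0.5
dexp(2,rate=r)
dgamma(2,shape=1,rate=r)      ## Gamma(1,r) matches the exponential
dweibull(2,shape=1,scale=1/r) ## so does the Weibull with shape 1, scale 1/r
```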
```{r}
## changing the rate (mean = 1/rate)
x <- seq(0,10,by=0.01)
plot(x,dexp(x,0.125),type='l',ylim=c(0,1))
lines(x,dexp(x,0.25),col=2)
lines(x,dexp(x,0.5),col=3)
lines(x,dexp(x,1),col=4)
lines(x,dexp(x,2),col=5)
lines(x,dexp(x,4),col=6)
legend("topright",legend=c(0.125,0.25,0.5,1,2,4),lty=1,col=1:6)
## random sample
hist(rexp(250,2),breaks=30,probability=TRUE)
lines(x,dexp(x,2),col=4)
## laplace vs Gaussian
plot(x,dexp(abs(x-5),1)/2,type='l')
lines(x,dnorm(x,5),col=2)
plot(x,dexp(abs(x-5),1)/2,type='l',log='y') ## same plot as last but on a log scale
lines(x,dnorm(x,5),col=2)
```
### Lab questions:
```
7) The last two panels compare a normal and a Laplace distribution with the same mean. How do the two distributions compare? In particular, compare the difference in the probabilities of extreme events in each distribution.
```
## Gamma
The gamma and inverse-gamma distributions are flexible distributions defined for positive real numbers. They are frequently used to model the distribution of variances or precisions (precision = 1/variance), in which case the shape and rate parameters are related to the sample size and sum of squares, respectively. The gamma is frequently used in Bayesian statistics as a prior distribution, and also in mixture distributions for inflating the variance of another distribution.
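To make the "sample size and sum of squares" connection concrete, here is a hedged sketch of the standard conjugate result for a Gamma prior on a normal precision: given a $Gamma(a,r)$ prior and $n$ observations with known mean $m$, the posterior is $Gamma(a + n/2, \; r + SS/2)$ where $SS = \sum{(y_i-m)^2}$. The data values below are made up for illustration:
```{r}
y = c(4.1,5.3,4.8,5.9,4.4)  ## hypothetical data
m = 5                       ## known mean
a = 0.1; r = 0.1            ## vague Gamma prior on the precision
a.post = a + length(y)/2    ## shape grows with the sample size
r.post = r + sum((y-m)^2)/2 ## rate grows with the sum of squares
c(a.post,r.post)
```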
```{r}
x <- seq(0,10,by=0.01)
## change rate
plot(x,dgamma(x,3,3),type='l' ,ylim=c(0,1.6))
lines(x,dgamma(x,3,1),col=2)
lines(x,dgamma(x,3,6),col=3)
lines(x,dgamma(x,3,1/3),col=4)
legend("topright",legend=c(3,1,6,0.33),lty=1,col=1:4)
## change shape
plot(x,dgamma(x,3,3),type='l')
lines(x,dgamma(x,1,3),col=2)
lines(x,dgamma(x,6,3),col=3)
lines(x,dgamma(x,18,3),col=4)
legend("topright",legend=c(3,1,6,18),lty=1,col=1:4)
## change variance
a <- c(20,15,10,5,2.5,1.25)
r <- c(4,3,2,1,0.5,0.25)
plot(x,dgamma(x,20,4),type='l')
for(i in 1:6){
lines(x,dgamma(x,a[i],r[i]),col=i)}
var = a/r^2
mean = a/r
legend("topright",legend=format(var,digits=3),lty=1,col=1:6)
var
mean
## change mean
var = 4
mean = c(1,2,3,4,5)
rate = mean/var
shape = mean^2/var
plot(x,dgamma(x,shape[2],rate[2]),type='l',col=2)
for(i in 1:5){
lines(x,dgamma(x,shape[i],rate[i]),col=i)
}
legend("topright",legend=1:5,lty=1,col=1:5)
rate
shape
```
### Lab questions:
```
8) Looking at the 'change variance' figure, how does the variance change as a and r increase? Qualitatively, how does this affect the mode and skew? Quantitatively, how does this affect the median (relative to the mean)?
```
# Part 2: Discrete distributions
## Binomial
The binomial arises naturally from counts of the number of successes given a probability of success, p, and a sample size, n. You are probably already familiar with the binomial in the context of coin toss examples.
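For example, each random draw below simulates the number of heads in ten fair coin flips:
```{r}
rbinom(5,size=10,prob=0.5) ## five replicates of a 10-flip experiment
```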
```{r}
x <- 0:11
## vary size of sample (number of draws)
size = c(1,5,10)
plot(x,dbinom(x,size[1],0.5),type='s')
for(i in 2:3){
lines(x,dbinom(x,size[i],0.5),type='s',col=i)
}
legend("topright",legend=size,lty=1,col=1:3)
## vary probability
n = 10
p = c(0.1,0.25,0.5)
plot(x,dbinom(x,n,p[1]),type='s')
for(i in 2:3){
lines(x,dbinom(x,n,p[i]),col=i,type='s')
}
abline(v = n*p,col=1:3,lty=2)
legend("topright",legend=p,lty=1,col=1:3)
## CDF
plot(x,pbinom(x,n,p[1]),type='s',ylim=c(0,1))
for(i in 2:3){
lines(x,pbinom(x,n,p[i]),col=i,type='s')
}
legend("bottomright",legend=p,lty=1,col=1:3)
## Random samples
hist(rbinom(100,10,0.5)+0.0001, ## small amount added because
probability=TRUE, ## of way R calculates breaks
breaks=0:11,main="Random Binomial")
lines(x,dbinom(x,10,0.5),type='s',col=2) #Analytical solution
legend("topright",legend=c("sample","pmf"),lty=1,col=1:2)
```
### Lab questions:
```
9) Consider a binomial distribution that has a constant mean, np. What are the differences in the shape of this distribution if it has a high n and low p vs. a low n and high p?
```
## Poisson
The Poisson is also very common for count data and arises as the number of events that occur in a fixed amount of time (e.g. number of bird sightings per hour), or the number of items found in a fixed amount of space (e.g. the number of trees in a plot). Unlike the Binomial distribution the Poisson doesn't have a fixed upper bound for the number of events that can occur.
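For example, each draw below simulates an hour of bird sightings at an average rate of 3 per hour; note that for the Poisson the sample variance tracks the sample mean:
```{r}
y = rpois(10000,lambda=3) ## e.g. 10000 simulated hourly counts
c(mean(y),var(y))         ## both close to 3: the Poisson variance equals its mean
```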
```{r}
x <- 0:12
plot(x,dpois(x,1),type='s')
lines(x,dpois(x,2),type='s',col=2)
lines(x,dpois(x,5),type='s',col=3)
legend("topright",legend=c(1,2,5),lty=1,col=1:3)
x <- 20:80
plot(x,dpois(x,50),type='s',ylim=c(0,0.08)) #Poisson with mean 50 (variance = 50)
lines(x+0.5,dnorm(x,50,sqrt(50)),col=2) #Normal with mean and variance of 50
lines(x,dbinom(x,100,0.5),col=3,type='s') #Binomial with mean 50 (variance = 25)
legend("topright",legend=c("pois","norm","binom"),col=1:3,lty=1)
plot(x,dbinom(x,100,0.5),type='s',col=3) #Binomial with mean 50 (variance = 25)
lines(x+0.5,dnorm(x,50,sqrt(25)),col=2) #Normal with mean 50 and variance of 25
lines(x,dpois(x,50),col=1,type='s') #Poisson with mean 50 (variance = 50)
legend("topright",legend=c("pois","norm","binom"),col=1:3,lty=1)
```
The last two panels compare the Poisson, Normal, and Binomial with the same mean and a large sample size. The Poisson and Binomial curves are identical in the two figures, but in the first the Normal has the same variance as the Poisson, while in the second it has the same variance as the Binomial.
### Lab questions:
```
10) Normal distributions are often applied to count data even though count data can only take on positive integer values. Is this fair to do in these two examples? (i.e. how good is the normal approximation?)
11) Would the normal be a fair approximation to the Poisson curves for small numbers (the first panel)? How about for the Binomial for small numbers (earlier panel of figures on the Binomial)?
12) Is the Poisson a good approximation of the Binomial?
13) Is it possible to choose the parameters so that the Poisson and Binomial both have the same mean and variance? If so, what is this parameterization?
```
## Negative binomial
The negative binomial has two interesting interpretations that are subtly different from either the Poisson or Binomial. In the first, it is the number of trials needed in order to observe a fixed number of occurrences, which is the reverse of the Binomial's number of occurrences in a fixed number of trials, and is where it gets its name. The Negative Binomial also arises as the distribution of the number of events that occur in a fixed space or time when the rate is not constant (as in the Poisson) but varies according to a Gamma distribution. Hence the Negative Binomial is also used to describe data that logically seem to come from a Poisson process but have greater variability than is expected from the Poisson (which by definition has a variance equal to its mean). The Geometric distribution arises as a special case of the negative binomial where the number of occurrences is fixed at 1.
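The gamma-Poisson interpretation is easy to demonstrate by simulation. Below is a sketch (not part of the original lab) using R's size/prob parameterization, under which drawing each rate $\lambda$ from $Gamma(size, \; prob/(1-prob))$ yields Negative Binomial counts:
```{r}
set.seed(2)
size = 3; prob = 0.3
lambda = rgamma(10000,shape=size,rate=prob/(1-prob)) ## variable rates
y = rpois(10000,lambda)   ## Poisson counts at those variable rates
c(mean(y),var(y))         ## variance exceeds the mean (overdispersion)
c(size*(1-prob)/prob,     ## NB mean for comparison
  size*(1-prob)/prob^2)   ## NB variance for comparison
```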
```{r}
x <- 0:20
## negative binomial
## vary size
plot(x,dnbinom(x,1,0.5),type="s",main="vary size")
lines(x,dnbinom(x,2,0.5),type="s",col=2)
lines(x,dnbinom(x,3,0.5),type="s",col=3)
lines(x,dnbinom(x,5,0.5),type="s",col=4)
lines(x,dnbinom(x,10,0.5),type="s",col=5)
legend("topright",legend=c(1,2,3,5,10),col=1:5,lty=1)
## vary probability
plot(x,dnbinom(x,3,0.5),type="s",main="vary probability")
lines(x,dnbinom(x,3,0.3),type="s",col=2)
lines(x,dnbinom(x,3,0.2),type="s",col=3)
lines(x,dnbinom(x,3,0.1),type="s",col=4)
legend("topright",legend=c(0.5,0.3,0.2,0.1),col=1:5,lty=1)
## vary variance , alternate parameterization
mean = 8
var = c(10,20,30)
size = mean^2/(var-mean)
plot(x,dnbinom(x,mu=mean,size=size[1]),type="s",ylim=c(0,0.14),main="vary variance")
lines(x,dnbinom(x,mu=mean,size=size[2]),type="s",col=2)
lines(x,dnbinom(x,mu=mean,size=size[3]),type="s",col=3)
legend('topright',legend=format(c(var,mean),digits=2),col=1:4,lty=1)
lines(x,dpois(x,mean),col=4,type="s")
## NB as generalization of pois with inflated variance
## geometric
plot(x,dgeom(x,0.5),type="s",main="Geometric")
lines(x,dgeom(x,0.15),type="s",col=2)
lines(x,dgeom(x,0.05),type="s",col=3)
lines(x,dnbinom(x,1,0.15),type="s",col=4,lty=2)
## geometric as special case of NB where size = 1
```
### Lab questions:
```
14) In the 'vary size' panel, what are the means of the curves?
15) In the "vary variance" panel, how does the shape of the Negative Binomial compare to a Poisson with the same mean?
```