-
Notifications
You must be signed in to change notification settings - Fork 22
/
Copy pathBio720_SimulatingData_Part1.Rmd
208 lines (150 loc) · 6.57 KB
/
Bio720_SimulatingData_Part1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
---
title: "Bio720 - Simulating Data"
author: "Ian Dworkin"
date: "`r Sys.Date()`"
output:
pdf_document:
toc: yes
ioslides_presentation:
incremental: yes
keep_md: yes
widescreen: yes
html_document:
keep_md: yes
number_sections: yes
toc: yes
slidy_presentation:
incremental: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Simulation as a key skill in your toolbox
One of the most important skills in your bag as new computational biologists is the ability to perform simulations. In particular, to simulate data or evaluate models numerically (or both).
## Mathematical models abound
In biology mathematical models are the basis for much of the theoretical and conceptual background in disciplines like ecology, evolution, population genetics, molecular evolution & biochemistry.
## When you can't solve a model...
Many of the models that are developed are not *analytically* tractable. That is, without making very strong biological assumptions there are no *closed form solutions*.
That is the models can not be solved for the general case.
## Using computers to find numerical solutions
- For such models, "solutions" can still be found.
- For some models, stability analysis can be performed (among other techniques).
- However, most often, scientists resort to computers to identify *numerical solutions*
## Deterministic VS. Stochastic
- Within *dynamical* models, that include the majority of models in biology there are two broad categories.
- **Deterministic** - where the outcome of the model entirely depends on the model itself and the starting conditions.
- **Stochastic** - where random events influence the outcomes of the model, and only the probability of events can be predicted, not exact outcomes.
## A simple deterministic model one-locus model of natural selection.
- In population genetics we often start with a simple deterministic model of how selection on an allele influences its frequency in the population over time.
- In a simple haploid model we can envision two alternative alleles, *A* and *a*.
- *a* is initially fixed in the population, but the new mutation *A* arrives and it is beneficial. What will happen to this allele?
## Haploid selection model
- Frequency of *A* at time *t* is $p(t)$
- Fitness of *A* is $W_A$, and for *a* is $W_a$
$$p(t+1) = \frac{p(t) W_A}{p(t) W_A + W_a(1-p(t))}$$
$$p(t+1) = \frac{p(t) W_A} {\overline W }$$
- Where $\overline{W}$ is mean fitness for the population
- How would you convert this into *R* code for one generation to the next?
- Start with what variables you need.
## Converting this into R code - What variables
- We need variables for the fitness values for each allele, $W_A$, $W_a$ and for allele frequency of A $p(t+1)$ at time t and t+1.
- With this we can write the equation for allele frequency change from one generation to the next.
## Converting this into R code - What variables
```{r echo=TRUE}
p_t1 <- function(w_A, w_a, p_t0) {
w_bar <- (p_t0*w_A) + ((1-p_t0)*w_a) # mean pop fitness
p_t1 <- (w_A*p_t0)/w_bar
return(p_t1)}
p_t1(w_A = 1.1, w_a = 1.0, p_t0 = 0.5)
```
## Is this deterministic or stochastic, what would be a quick way to check?
## Is this deterministic or stochastic, what would be a quick way to check?
```{r echo=TRUE}
replicate(n = 100, p_t1(1.1, 1.0, 0.5))
```
## Allele frequency dynamics.
- Now let's extend this across many generations. We need to rewrite the function as we expect the allele frequencies to change each generation. Also, mean population fitness will change each generation (as allele frequency changes)
- write this simulation (a *for loop* would be sensible) and go for 200 generations, and look at the dynamics (starting allele frequency is p = 0.01). Wrap it all as a function called `haploid_selection`
##
```{r echo=TRUE}
haploid_selection <- function(p0 = 0.01, w1 = 1, w2 = 0.9, n = 100) {
# Initialize vectors to store allele frequencies and mean pop fitness
p <- rep(NA,n) # a vector to store allele frequencies
w_bar <- rep(NA, n)
# starting conditions
p[1] <- p0 # starting allele frequencies
w_bar[1] <- (p[1]*w1) + ((1-p[1])*w2)
# now we need to loop from generation to generation
for ( i in 2:n) {
w_bar[i - 1] <- (p[i - 1]*w1) + ((1-p[i - 1])*w2) # mean population fitness
p[i] <- (w1*p[i - 1])/w_bar[i - 1]
}
return(p)
}
```
## Test the model and plot it.
```{r echo=TRUE}
p <- haploid_selection()
generations <- 1:length(p)
plot(p ~ generations, pch = 20,
ylab = "allele frequency",
xlab = "generation")
```
## Using simple numerical simulation to gain intuition for the system.
- Try altering fitness advantage of *A* a little bit. Or reducing initial allele frequency
- Is this deterministic or stochastic?
## Making a more general function
```{r, echo = TRUE}
haploid.selection <- function(p0 = 0.01, w1 = 1, w2 = 0.9, n = 100) {
# Initialize vectors to store p, delta p and mean pop fitness
p <- rep(NA,n)
delta_p <- rep(NA, n)
w_bar <- rep(NA, n)
# starting conditions
p[1] <- p0 # starting allele frequencies
delta_p[1] <- 0 #change in allele frequency
w_bar[1] <- (p[1]*w1) + ((1-p[1])*w2)
# now we need to loop from generation to generation
for ( i in 2:n) {
w_bar[i - 1] <- (p[i - 1]*w1) + ((1-p[i - 1])*w2)
p[i] <- (w1*p[i - 1])/w_bar[i - 1]
delta_p[i] <- p[i] - p[i-1]
}
if (any(p > 0.9999)) {
fixation <- min(which.max(p > 0.9999))
cat("fixation for A1 occurs approximately at generation:", fixation )
} else {
maxAlleleFreq <- max(p)
cat("fixation of A1 does not occur, max. allele frequency is:", print(maxAlleleFreq, digits = 2) )
}
# Let's make some plots
par(mfrow=c(2,2))
# 1. mean population fitness over time
plot(x = 1:n, y = w_bar,
xlab = "generations",
ylab = expression(bar(w)),
pch=20, col="red", cex = 2, cex.lab = 1.5, cex.main = 1.5,
main = paste("p0 = ", p0, "and s = ", (1 - (w2/w1))))
# 2. change in allele frequency over time
plot(x = 1:n, y = p,
xlab="generations",
ylab="Allele frequency (p)",
pch = 20, col = "red", cex.lab = 1.5)
# 3. plot of p[t+1] vs p[t]
p.1 <- p[-n]
p.2 <- p[-1]
plot(p.2 ~ p.1,
xlab = expression(p[t]),
ylab = expression(p[t+1]),
pch = 20, col = "red", cex = 2, cex.lab = 1.5)
# 4. plot of allele frequency change
plot(x = 2:n, y = delta_p[-1],
xlab = "generation",
ylab= expression(paste(Delta,"p")),
pch = 20, col = "red", cex = 2, cex.lab = 1.5)
}
```
## Let's see what this does.
```{r}
haploid.selection(p0 = 0.0001, w1 = 1, w2 = 0.987, n = 1000)
```