Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Getting Started vignette #234

Merged
merged 20 commits into from
Mar 27, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 1 addition & 4 deletions inst/WORDLIST
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ R's
README's
TransPhylo
Vukosi
WAIC
Zhian
borel
bpmodels
Expand All @@ -28,7 +29,6 @@ epicontacts
epiparameter
fn
frac
gborel
geosocial
geq
infectee
Expand All @@ -43,13 +43,11 @@ ln
mathbf
mathrm
nCoV
nbinom
nosoi
outbreaker
outbreakr
packagename
parameterised
pois
poisson
rborel
ringbp
Expand All @@ -59,4 +57,3 @@ superspreading
th
truncdist
unnormalised
var
234 changes: 166 additions & 68 deletions vignettes/epichains.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,34 @@ knitr::opts_chunk$set(
)
```

This vignette demonstrates how get started with using the _epichains_ package
for simulating transmission chains or estimating the likelihood of observing
a transmission chain.

::: {.alert .alert-info}
* To understand the theoretical background of the branching processes methods
used here, refer to the theory vignette `vignette("theoretical_background")`.
* To understand the software development design decisions and implementation details of the package,
refer to the design vignette `vignette("design-principles")`.
:::

Infectious disease epidemics spread through populations when a chain of infected individuals transmit the
infection to others.
[Branching processes](https://en.wikipedia.org/wiki/Branching_process) can be used to model this process.
A branching process is a stochastic process where each infectious individual in a
generation gives rise to a random number of individuals in the next generation, starting with the index case in generation 1.
The distribution of the number of offspring is called the offspring
distribution.

The size of the transmission chain is the total number of
individuals infected by a single case, and the length of the transmission
chain is the number of generations from the first case to the last case they
produced before the chain ended. The size calculation includes the first case and the length calculation includes the first generation when the first case starts the chain (See figure below).

```{r out.width = "75%", echo = FALSE, fig.cap = "An example of a transmission chain starting with a single case C1. Cases are represented by blue circles and arrows indicate who infected whom. The chain grows through generations Gen 1, Gen 2, and Gen 3, producing cases C2, C3, C4, C5, and C6. The chain ends at generation Gen 3 with cases C4, C5, and C6. The size of C1's chain is 6, including C1 (that is, the sum of all blue circles) and the length is 3, which includes Gen 1 (maximum number of generations reached by C1's chain).", fig.alt = "The figure shows an example of a transmission chain starting with a single case C1. The complete chain is depicted as blue circles linked by orange directed arrows showing the cases produced in each generation. Each generation is marked by a brown rectangle and labelled Gen 1, Gen 2, and Gen 3. The chain grows through generations Gen 1, Gen 2, and Gen 3. Case C1 starts in Gen 1 and produces cases C2, and C3 in Gen 2. In Gen 3, C2 produces C4 and C5, and C3 produces C6. The chain ends in Gen 3. The chain size of C1 is 6, which includes C1 (that is the sum of all the blue circles, representing cases) and the length is 3, which includes Gen 1 (maximum number of generations reached by C1's chain)."}
knitr::include_graphics(file.path("img", "transmission_chain_example.png"))
```

_epichains_ provides methods to analyse and simulate the size and length
of branching processes with an arbitrary offspring distribution. These
can be used, for example, to analyse the distribution of chain sizes
Expand All @@ -39,37 +67,68 @@ or length of infectious disease outbreaks, as discussed in @farrington2003 and
library("epichains")
```

## Chain likelihoods
## Transmission chains likelihoods

::: {.alert .alert-primary}
### Use case {-}
Suppose we have observed a number of transmission chains, each arising from a separate index case. What are the likely transmission properties
(reproduction number and/or superspreading coefficient) that
generated these chains (assuming these parameters are the same across all the chains)?

The first step in answering this question is to calculate the likelihood of
observing the observed chain summaries given the transmission properties. This
is where the `likelihood()` function comes in. The returned estimate can then
be used to infer the transmission properties using estimation frameworks such
as maximum likelihood or Bayesian inference.

_epichains_ does not provide these estimation frameworks.
:::

::: {.alert .alert-secondary}
#### What we have {-}

1. Data on observed transmission chains summaries (sizes or lengths).

#### What we assume {-}

1. An offspring distribution that governs the number of secondary cases
produced by each case.
2. Parameters of the offspring distribution (for example, mean and overdispersion if using a negative binomial; which can then be interpreted as reproduction number and superspreading coefficient, respectively)
3. (optional) An observation process/probability that generates the observed chain
summaries.
4. (optional) A threshold of chain summaries that are considered too large to be consistent with a reproduction number of $R<1$ (e.g., $N$ cases or $T$ generations in total).
:::

### [`likelihood()`](https://epiverse-trace.github.io/epichains/reference/likelihood.html)

This function calculates the likelihood/loglikelihood of observing a vector of outbreak summaries obtained from transmission chains. "Outbreak summaries" here refer to transmission chain sizes or lengths/durations.
This function calculates the likelihood/loglikelihood of observing a vector of
outbreak summaries obtained from transmission chains. "Outbreak summaries" here
refer to transmission chain sizes or lengths/durations.

`likelihood()` requires a vector of chain summaries (sizes or lengths),
`chains`, the corresponding statistic to calculate, `statistic`, the offspring
distribution, `offspring_dist` and its associated parameters. `offspring_dist`
is specified as the function that is used to generate random numbers, i.e.
`rpois` for Poisson, `rnbinom` for negative binomial, or a custom function. If no
the closed-form solution is available then simulated replicates are used to
estimate the likelihoods. In that case `likelihood()` also requires
`nsim_offspring`, which is the number of simulations to run . This argument will
be explained further in the next section.
`chains`, the corresponding statistic to use, `statistic`, the offspring
distribution, `offspring_dist` and its associated parameters.

`offspring_dist` is specified by the R function that is used to generate random
numbers, i.e. `rpois` for Poisson, `rnbinom` for negative binomial, or a custom
function.

By default, the result is a log-likelihood but if `log = FALSE`, then
likelihoods are returned.
likelihoods are returned (See `?likelihood` for more details).

::: {.alert .alert-info}
To understand how `likelihood()` works see the section on [How `likelihood()` works](#how-likelihood-works).
:::

Let's look at the following example where we estimate the log-likelihood of
observing `chain_sizes`.
```{r chain_sizes_setup}
```{r estimate_likelihoods}
set.seed(121)
# example of observed chain sizes
# randomly generate 20 chains of size between 1 to 10
chain_sizes <- sample(1:10, 20, replace = TRUE)
chain_sizes
```

```{r estimate_likelihoods}
# estimate loglikelihood of the observed chain sizes
# estimate loglikelihood of the observed chain sizes for given lambda
likelihood_eg <- likelihood(
chains = chain_sizes,
statistic = "size",
Expand All @@ -83,18 +142,16 @@ likelihood_eg

### Joint and individual log-likelihoods

`likelihood()`, by default, returns the joint log-likelihood. If instead,
the individual log-likelihoods are required, then the `individual` argument
`likelihood()`, by default, returns the joint log-likelihood, given by the sum of log-likelihoods of each observed chain. If instead
the individual log-likelihoods are desired (for example for calculating [Watanabe–Akaike information criterion](https://en.wikipedia.org/wiki/Watanabe%E2%80%93Akaike_information_criterion) values, then the `individual` argument
must be set to `TRUE`. To return likelihoods instead, set `log = FALSE`.
```{r chains_setup2}
```{r estimate_likelihoods2}
set.seed(121)
# example of observed chain sizes
# randomly generate 20 chains of size between 1 to 10
chain_sizes <- sample(1:10, 20, replace = TRUE)
chain_sizes
```

```{r estimate_likelihoods2}
# estimate loglikelihood of the observed chain sizes
likelihood_ind_eg <- likelihood(
chains = chain_sizes,
Expand All @@ -108,42 +165,6 @@ likelihood_ind_eg <- likelihood(
likelihood_ind_eg
```

### How `likelihood()` works

*epichains* ships with functions for the analytical solutions of some
transmission chain "size" and "length" distributions. For the size
distributions, we provide the poisson, negative binomial, and gamma-borel
mixture. For the length distribution, we provide the poisson and geometric
distributions. These can be used with `likelihood()` based on what is specified
for `offspring_dist` and `statistic`.

If an analytical solution does not exist, we provide `offspring_ll()`, which
uses simulations to approximate the probability distributions
([using a linear approximation to the cumulative
distribution](https://en.wikipedia.org/wiki/Empirical_distribution_function)
for unobserved sizes/lengths). In this case, an extra argument `nsim_offspring`
must be passed to `likelihood()` to specify the number of simulations to be
used for this approximation.

For example, let's look at an example where `chain_sizes` is observed and we
want to calculate the likelihood of this being drawn from a binomial
distribution with probability `prob = 0.9`.
```{r likelihoods_by_simulation}
set.seed(121)
# example of observed chain sizes; randomly generate 20 chains of size 1 to 10
chain_sizes <- sample(1:10, 20, replace = TRUE)
# get their likelihood
liks <- likelihood(
chains = chain_sizes,
offspring_dist = rbinom,
statistic = "size",
size = 1,
prob = 0.9,
nsim_offspring = 250
)
liks
```

### Observation probability

`likelihood()` uses the argument `obs_prob` to model the observation
Expand Down Expand Up @@ -180,12 +201,82 @@ liks
This returns `10` likelihood values (because `nsim_obs = 10`), which can be
averaged to come up with an overall likelihood estimate.

To find out about the usage of the `likelihood()` function, you can run
`?likelihood` to access its `R` help file.
### How `likelihood()` works {#how-likelihood-works}

`likelihood()` first checks if an analytical solution of the likelihood exists
for the given offspring distribution and chain statistic specified. If there's
none, simulations are used to estimate the likelihoods.

## Transmission chains simulation
The _epichains_ package includes closed-form (analytical) solutions for
calculating the likelihoods associated with certain summaries of transmission
chains ("size" or "length") for specific offspring distributions.

There are two simulation functions, herein referred to collectively as the `simulate_*()` functions.
For the size distributions, the package provides the Poisson, negative binomial,
and gamma-borel mixture.

::: {.alert .alert-info}
To provide the gamma-Borel size likelihood, the [Borel distribution](https://en.wikipedia.org/wiki/Borel_distribution) is needed.
However, base R does not provide this distribution natively like poisson and
gamma, so _epichains_ provides its [density and random generator](https://epiverse-trace.github.io/epichains/reference/index.html#special-distributions).
:::

For the length distribution, there's the Poisson and
geometric distributions. These can be used with `likelihood()` based on what is
specified for `offspring_dist` and `statistic`.

If an analytical solution does not exist, an internal simulation function,
`.offspring_ll()` is employed. It uses simulations to approximate the probability
distributions ([using a linear approximation to the cumulative
distribution](https://en.wikipedia.org/wiki/Empirical_distribution_function)
for unobserved sizes/lengths). If simulation is to be used, an extra argument
`nsim_offspring` must be passed to `likelihood()` to specify the number of
simulations to be used for this approximation.
jamesmbaazam marked this conversation as resolved.
Show resolved Hide resolved
Approximate values of the likelihood will vary with every call to the simulation (because the simulations used for estimation vary), and it may be worth calling `likelihood()` multiple times in this case to see the error this may introduce.

For example, let's look at an example where `chain_sizes` is observed and we
want to calculate the likelihood of this being drawn from a binomial
distribution with probability `prob = 0.9`.
```{r likelihoods_by_simulation}
set.seed(121)
# example of observed chain sizes; randomly generate 20 chains of size 1 to 10
chain_sizes <- sample(1:10, 20, replace = TRUE)
# get their likelihood
liks <- likelihood(
chains = chain_sizes,
offspring_dist = rbinom,
statistic = "size",
size = 1,
prob = 0.9,
nsim_offspring = 250
)
liks
```

## Transmission chain simulation

::: {.alert .alert-primary}
### Use case {-}
We want to simulate a scenario where a number of chains are each
produced by a separate index case. We want to simulate the chains of
transmission that result from these cases, given a specific offspring
distribution and a reproduction number.
:::

::: {.alert .alert-secondary}
#### What we have {-}

1. An initial number of cases that seed the outbreak.

#### What we assume {-}

1. An offspring distribution that governs the number of secondary cases produced by each case.
2. Parameters of the offspring distribution (for example, mean and overdispersion if using a negative binomial; which can then be interpreted as reproduction number and superspreading coefficient, respectively)
3. (optional) A threshold of chain summaries that are considered too large to be consistent with a reproduction number of $R<1$ (e.g., $N$ cases or $T$ generations in total).
4. (optional) A generation time distribution that governs the time between successive cases.
:::

There are two simulation functions, herein referred to collectively as the
`simulate_*()` functions that can help us achieve this.

### [`simulate_chains()`](https://epiverse-trace.github.io/epichains/reference/simulate_chains.html)

Expand Down Expand Up @@ -220,10 +311,12 @@ sim_chains <- simulate_chains(
head(sim_chains)
```

Beyond the defaults, `simulate_chains()` can also simulate outbreaks based on
a specified population size and pre-existing immunity until the susceptible
pool runs out.

::: {.alert .alert-info}
By default, `simulate_chains()` assumes an infinite population but can account for susceptible depletion when a finite `pop` is specified.

Pre-existing immunity levels can also be specified, which will be applied to `pop` before the simulation is initialised.
:::

Here is a quick example where we simulate an outbreak in a population of
size $1000$ with $10$ index cases and $10/%$ pre-existing immunity. We assume
individuals have a poisson offspring distribution with mean,
Expand Down Expand Up @@ -252,13 +345,15 @@ head(sim_chains_with_pop)

### [`simulate_summary()`](https://epiverse-trace.github.io/epichains/reference/simulate_summary.html)

::: {.alert .alert-primary}
`simulate_summary()` is a performant version of `simulate_chains()` that
does not track information on each infector and infectee. It returns the
eventual size or length/duration of each transmission chain. This function is
especially useful for calculating numerical likelihoods in `likelihood()`.
:::

Here is an example to simulate the previous examples without intervention,
returning the size of each of the $10$ chains. It assumes a poisson offspring
returning the size of each of the $10$ chains. It assumes a Poisson offspring distribution
distribution with mean of $0.9$.
```{r sim_summary_pois}
set.seed(123)
Expand All @@ -275,7 +370,7 @@ simulate_summary_eg <- simulate_summary(
simulate_summary_eg
```

## Other functionalities
## S3 Methods

### Summarising

Expand Down Expand Up @@ -315,13 +410,16 @@ simulate_summary_eg <- simulate_summary(
summary(simulate_summary_eg)
```

::: {.alert .alert-danger}
This summary is the same as the output of `simulate_summary()` if the same
inputs are used. `simulate_summary()` is a more performant version of
`simulate_chains()`, hence, you can use it instead, if you are only interested
in the summary of the simulated chains without details about the infection
tree.
:::

We can confirm if the two outputs are the same using `base::setequal()`
We can confirm if the two outputs are the same using `base::setequal()`, which
checks if two objects are equal and returns `TRUE/FALSE`.

```{r compare_sim_funcs, include=TRUE,echo=TRUE}
setequal(
Expand Down
Binary file added vignettes/img/transmission_chain_example.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.