---
title: "Computational Lebesgue Integration"
format:
  html:
    toc: true
bibliography: references.bib
---
## Introduction
The Lebesgue integral, $\int_{\mathcal{X}} f(x) \mu(\text{d}x)$, makes few structural assumptions about the domain $\mathcal{X}$ of the integrand. This high level of generality is useful, for example, when performing Bayesian inference over combinatorial spaces. As a specific use case, consider Bayesian phylogenetic inference, where $\mathcal{X}$ is the space of phylogenetic trees.
Are there *generic* tools for computing Lebesgue integrals? By generic, we mean tools that apply to every situation where the Lebesgue integral is defined.
Surprisingly, the answer is yes. The generic setup is too abstract to permit concrete algorithms, but it does permit the elaboration and analysis of "meta-algorithms."
The general story of these meta-algorithms starts in the same way as Lebesgue's original idea: exploiting the fact that while the domain of $f$ is unstructured, its image is $\mathbb{R}$.
::: column-margin
![](lebesgue.png){width="200"}
:::
But in contrast to the theory of Lebesgue integration, which is a settled field, Computational Lebesgue Integration is in active development.
This page is a growing bibliography of Computational Lebesgue Integration methods. Create a pull request if you would like to add missing entries. To keep the length of this page manageable, we only include methods that (1) apply to any type of target distribution (e.g., we do not include methods that assume real-valued targets), and (2) scale reasonably to large problems.
## Fundamentals
### Annealing
::: column-margin
![An example of annealed distributions.](annealing.png){width="200"}
:::
Many of the methods described here *anneal* the distribution of interest $\pi(\text{d}x) \propto \exp(\ell(x)) \pi_0(\text{d}x)$, i.e., they consider the path $\pi_\beta(\text{d}x) \propto \exp(\beta \ell(x)) \pi_0(\text{d}x)$ for $\beta \in [0, 1]$. Given the direct connection between $\beta$ and the inverse temperature in statistical mechanics, annealing appears at the very beginning of the Monte Carlo literature, in Equation (2) of @metropolis1953. It is used as a tool detached from its statistical mechanics interpretation as early as @pincus1970.
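To make the notation concrete, here is a minimal sketch of the annealed path in Python, using a standard normal reference and a hypothetical log-likelihood as stand-ins for $\pi_0$ and $\ell$ (all names are illustrative, not from any package):

```python
import math

def log_pi0(x):
    # Example reference: standard normal log-density. This is an assumption
    # for illustration; in general pi_0 can live on any state space.
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def log_ell(x):
    # Hypothetical log-likelihood, standing in for ell(x).
    return -0.5 * (x - 3.0) ** 2

def annealed_log_density(x, beta):
    # Unnormalized log-density of pi_beta(dx) ∝ exp(beta * ell(x)) pi_0(dx):
    # beta = 0 recovers the reference, beta = 1 the target pi.
    return beta * log_ell(x) + log_pi0(x)
```

Note that only pointwise evaluations of $\ell$ and of the reference density are needed; no other structure on $\mathcal{X}$ is used.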
### Meta-algorithms
::: column-margin
![Swap structure of a PT algorithm.](PT.png){width="200"}
:::
Two large families of meta-algorithms can take a basic, inefficient MCMC algorithm and transform it into a potentially much faster one while preserving consistency guarantees.
The first family performs MCMC on augmented distributions:
::: column-margin
![SMC particle genealogy.](SMC.png){width="150"}
:::
- Parallel tempering (PT): @geyer1991
- Simulated tempering (ST): @geyer1995
The second family uses importance sampling on augmented distributions combined with resampling:
- Annealed importance sampling (AIS): @neal2001
- Annealed Sequential Monte Carlo (ASMC): @delmoral2006
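To illustrate the first family, here is a minimal sketch of the communication step of PT, with illustrative names not taken from any package. Each chain targets one $\pi_{\beta_k}$, and adjacent chains propose to exchange their states with a Metropolis correction:

```python
import math
import random

def pt_swap_sweep(states, betas, log_ell):
    # One sweep of adjacent-pair swap proposals under the product target
    # prod_k exp(beta_k * ell(x_k)) pi_0(dx_k). The pi_0 factors cancel in
    # the ratio, so only log-likelihood evaluations are needed. The local
    # exploration moves (the basic MCMC kernel being sped up) run elsewhere.
    for i in range(len(states) - 1):
        log_ratio = (betas[i + 1] - betas[i]) * (
            log_ell(states[i]) - log_ell(states[i + 1])
        )
        if math.log(random.random()) < log_ratio:
            states[i], states[i + 1] = states[i + 1], states[i]
    return states
```

The swap move never evaluates anything beyond $\ell$, which is what keeps the meta-algorithm applicable to arbitrary state spaces.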
### Normalization constants
To obtain estimates of the normalization constant from the output of the above algorithms, several estimators are available, for example:
- The stepping stone estimator, a product of importance sampling estimators (sketched after this list); see @bennett1976 and @xie2011.
- Bridge sampling [@gelman1998].
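For concreteness, a minimal sketch of the stepping stone estimator of $\log(Z_1/Z_0)$, with illustrative names; it assumes stored draws from each $\pi_{\beta_k}$, e.g., from a PT run:

```python
import math

def stepping_stone_log_z(samples_per_beta, betas, log_ell):
    # Telescoping product of importance sampling estimators:
    # Z_1 / Z_0 = prod_k E_{pi_beta_k}[ exp((beta_{k+1} - beta_k) * ell(x)) ].
    total = 0.0
    for k in range(len(betas) - 1):
        delta = betas[k + 1] - betas[k]
        terms = [delta * log_ell(x) for x in samples_per_beta[k]]
        m = max(terms)  # log-sum-exp shift for numerical stability
        total += m + math.log(sum(math.exp(t - m) for t in terms) / len(terms))
    return total
```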
## Non-reversible variants
The variable $\beta$ lives on $[0, 1]$, and hence facilitates the design of non-reversible algorithms even when $\mathcal{X}$ is a black-box state space.
- In parallel tempering:
  - First occurrence of odd-even swaps: @okabe2001
  - Non-Reversible Parallel Tempering (NRPT): @syed2022; the odd-even pattern is sketched below.
- Non-reversible simulated tempering (NRST):
  - @sakai2016
  - @biron-lattes2024
::: column-margin
![](non-reversible.png){height="150"}
:::
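A minimal sketch of the odd-even swap pattern underlying NRPT (illustrative names): compared to a plain sweep, the set of proposed pairs alternates deterministically between rounds, so no two proposals within a round overlap:

```python
import math
import random

def odd_even_swap_round(states, betas, log_ell, round_index):
    # Even rounds propose swaps at pairs (0,1), (2,3), ...; odd rounds at
    # (1,2), (3,4), .... This deterministic alternation is what enables the
    # non-reversible, lifted dynamics of the chain indices.
    for i in range(round_index % 2, len(states) - 1, 2):
        log_ratio = (betas[i + 1] - betas[i]) * (
            log_ell(states[i]) - log_ell(states[i + 1])
        )
        if math.log(random.random()) < log_ratio:
            states[i], states[i + 1] = states[i + 1], states[i]
    return states
```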
## Methodological extensions
### Adaptive schedules
In challenging problems, using an adaptive discretization of $\beta \in [0, 1]$ is critical for good performance.
::: column-margin
![](adaptive.png){width="150"}
:::
- For stochastic gradient methods, see @atchadé2011 and @miasojedow2013.
- Round based: @syed2022; a minimal sketch of the schedule update follows this list.
  - See the subsection "Round-based method" of @Bouchard2024MCMCdriven for an explanation of how this implicitly pre-conditions the adaptation optimization.
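Here is a minimal sketch of a round-based schedule update in the spirit of @syed2022 (a simplification, not the paper's exact procedure): estimate the rejection rate of each adjacent swap, treat its running sum as a cumulative communication barrier, and place the new grid points so that each adjacent pair carries an equal share of it:

```python
def update_schedule(betas, rejection_rates):
    # betas: current sorted grid on [0, 1]; rejection_rates[i]: estimated
    # swap rejection rate between chains i and i + 1 from the last round.
    cum = [0.0]
    for r in rejection_rates:
        cum.append(cum[-1] + r)  # piecewise-linear cumulative barrier
    total, n = cum[-1], len(betas)
    new_betas, j = [0.0], 0
    for k in range(1, n - 1):
        level = total * k / (n - 1)  # equal barrier share per pair
        while cum[j + 1] < level:
            j += 1
        # invert the piecewise-linear barrier by interpolation
        t = (level - cum[j]) / max(cum[j + 1] - cum[j], 1e-12)
        new_betas.append(betas[j] + t * (betas[j + 1] - betas[j]))
    new_betas.append(1.0)
    return new_betas
```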
### Generalized distribution paths
Most of the methods described so far also apply when the annealing distributions are generalized to other paths of distributions. Paths that are geodesic can lead to substantial savings.
::: column-margin
![](geodesic.png){width="150"}
:::
- For AIS/ASMC, see:
  - @gelman1998
  - @grosse2013
  - @brekelmans2020
- For PT, see @Syed2021Parallel.
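As a simple instance, the geometric path between an arbitrary tractable reference $q_0$ and an (unnormalized) target $q_1$, of which the annealed path above is the special case $q_0 = \pi_0$ (a sketch with illustrative names):

```python
def geometric_path_log_density(x, beta, log_q0, log_q1):
    # Unnormalized log-density of pi_beta ∝ q0(x)^(1 - beta) * q1(x)^beta.
    # Taking log_q0 = log_pi0 and log_q1 = log_ell + log_pi0 recovers the
    # annealed path beta * ell(x) + log pi0(x).
    return (1.0 - beta) * log_q0(x) + beta * log_q1(x)
```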
### Variational blends
Instead of a fixed prior $\pi_0$ as the tractable reference, a fitted variational distribution $q_\phi$ can also be used, either by itself or in conjunction with $\pi_0$.
::: column-margin
![](variational.png){width="120"}
:::
- In the context of path sampling:
  - @lefebvre2009
  - @wang2022
- For PT:
  - @paquet2009
  - @cameron_recursive_2014
  - @Surjanovic2022Parallel
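For intuition, a minimal real-valued sketch of one way to fit a variational reference: moment-match a Gaussian $q_\phi$ to samples from the target chain, then reuse it as the $\beta = 0$ endpoint of the path (a toy in the spirit of @Surjanovic2022Parallel, whose actual scheme covers more general exponential families):

```python
import math
import statistics

def fit_gaussian_reference(target_samples):
    # Moment matching: pick phi = (mu, sigma) so that q_phi matches the
    # first two moments of the samples collected during the last round.
    mu = statistics.fmean(target_samples)
    sigma = max(statistics.stdev(target_samples), 1e-6)
    return mu, sigma

def log_q_phi(x, mu, sigma):
    # Log-density of the fitted reference; can play the role of log_q0 in
    # the geometric path sketch above.
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
```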
### Convergence of adaptive methods
Theoretical frameworks to analyze the convergence of the above adaptive methods (schedule, end points, paths):
- For stochastic gradient methods, see, e.g., @andrieu2008 for the general toolbox and @miasojedow2013 for analysis in the case of PT specifically.
- For round based schemes, see @chimisov2018 for general tools, and @Surjanovic2022Parallel for analysis in the case of variational PT specifically.
## Software
Software implementations of the above meta-algorithms (PT, ST, AIS, ASMC) that support arbitrary data types. We do not include packages specialized to real-valued targets, nor software that uses algorithms that do not scale to large problems (e.g., universal PPLs that only provide algorithms such as non-Markovian SMC or lightweight Metropolis-Hastings).
- [Turing.jl](https://github.com/TuringLang/Turing.jl), a universal Probabilistic Programming Language (PPL) [@ge2018] with some PT support.
- [Blang](https://github.com/UBC-Stat-ML/blangSDK), a PPL over arbitrary data types with NRPT, AIS and ASMC [@Bouchard2022Blang].
- [NIMBLE](https://cran.r-project.org/web/packages/nimble/nimble.pdf), a PPL with PT support [@pleydell2021].
- [Pigeons](https://pigeons.run/), an NRPT implementation interfacing with multiple PPLs (Turing, Stan, Blang) as well as black-box MCMC code [@surjanovic_pigeonsjl_2023]. It also supports distributed computation over MPI.