Bayesian Learning and Probabilistic Programming

Edwin T. Jaynes was one of the first people to realize that probability theory, as originated by Laplace, is a generalization of Aristotelian logic that reduces to deductive logic in the special case that our hypotheses are either true or false.

The joint probability identity is the first one to understand in probabilistic modelling: $$P(A\cap B)=P(A)P(B\mid A)=P(A\mid B)P(B)$$ where $A$ and $B$ are two events and $P(\cdot\mid \cdot)$ is the conditional probability.


Bayesian learning can be regarded as an extension of Bayesian statistics. Its core idea is to use prior information to express the uncertainty of parameters. It is related to Bayesian statistics, computational statistics, probabilistic programming and machine learning.

$\color{red}{Note}$: We can suppose that every variable in Bayesian learning is random, so that each parameter is associated with a probability distribution.

(Figure: model-based clustering)

Traditional Bayesianism is defined by the following attributes:

  • willingness to accept subjective belief as an expedient substitute for raw data,
  • reliance on complete (i.e., coherent) probabilistic models of beliefs,
  • adherence to Bayes’ conditionalization as the primary mechanism for updating belief in light of new information.

There are many varieties of Bayesian analysis. The fullest version of the Bayesian paradigm casts statistical problems in the framework of decision making. It entails formulating subjective prior probabilities to express pre-existing information, careful modelling of the data structure, checking and allowing for uncertainty in model assumptions, formulating a set of possible decisions and a utility function to express how the value of each alternative decision is affected by the unknown model parameters. But each of these components can be omitted. Many users of Bayesian methods do not employ genuine prior information, either because it is insubstantial or because they are uncomfortable with subjectivity. The decision-theoretic framework is also widely omitted, with many feeling that statistical inference should not really be formulated as a decision. So there are varieties of Bayesian analysis and varieties of Bayesian analysts. But the common strand that underlies this variation is the basic principle of using Bayes’ theorem and expressing uncertainty about unknown parameters probabilistically.

The following formula is worthy of keeping in mind.

$$\underbrace{P(\theta\mid y)}_{\text{posterior}}=\frac{\overbrace{P(y\mid \theta)}^{\text{likelihood}}\ \overbrace{P(\theta)}^{\text{prior}}}{\underbrace{P(y)}_{\text{evidence}}}$$
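A minimal numeric illustration of this formula, assuming a toy coin-flip model: a Beta(2,2) prior on the bias $\theta$ and $y$ = 7 heads in 10 flips, evaluated on a grid.

```python
# A minimal sketch of posterior = likelihood * prior / evidence on a grid,
# assuming a toy coin-flip model: theta ~ Beta(2,2), y = 7 heads in 10 flips.
import numpy as np
from scipy.stats import beta, binom

theta = np.linspace(0.0005, 0.9995, 1000)  # grid over the coin bias
d = theta[1] - theta[0]
prior = beta.pdf(theta, 2, 2)
like = binom.pmf(7, 10, theta)             # P(y | theta) for 7 heads in 10 flips

evidence = (like * prior).sum() * d        # P(y): a single number
posterior = like * prior / evidence        # P(theta | y) on the grid

print((posterior * d).sum())               # ~1: a proper density
print(theta[np.argmax(posterior)])         # mode ~ 8/12 (posterior is Beta(9,5))
```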

(Figure: updating one's beliefs)

| Bayes formulae | Inverse Bayes formulae |
| --- | --- |
| $f_X(x)=\frac{f_{X, Y}(x, y)}{f_{Y\mid X}(y\mid x)}=\frac{f_{X\mid Y}(x\mid y)f_Y(y)}{f_{Y\mid X}(y\mid x)}$ | $f_X(x) = \left(\int_{S_y} \frac{ f_{Y\mid X}(y\mid x)}{f_{X\mid Y}(x\mid y)}\,\mathrm{d}y\right)^{-1}$ |
| $f_X(x)\propto f_{X\mid Y}(x\mid y)f_Y(y)$ | $f_X(x) \propto \frac{f_{X\mid Y}(x\mid y_0)}{f_{Y\mid X}(y_0\mid x)}$ |

$P(X=x)=\frac{P(X=x, Y=y)}{P(Y=y\mid X=x)}=\sum_{y}P(X=x, Y=y)$

Bayesian learning is the application of Bayesian methods in machine learning, for example in Bayesian networks and conditional random fields. In computational Bayesian statistics, every numerical feature is modelled as generated from a probability distribution. For example, in Bayesian linear regression we restrict the parameters of linear regression to some prior distribution. An increasingly popular prior is the double-exponential, or Bayesian LASSO, prior.
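As a concrete illustration, here is a minimal sketch of Bayesian linear regression with a Gaussian prior on the weights (the conjugate case, so the posterior is available in closed form); the prior precision `alpha` and noise precision `beta_` are assumed toy values, not fitted.

```python
# A minimal sketch of Bayesian linear regression: prior w ~ N(0, alpha^{-1} I),
# noise precision beta_ (both assumed known here); the posterior over w is
# Gaussian with covariance S and mean m.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.uniform(-1, 1, 30)])  # intercept + slope
w_true = np.array([0.5, -1.0])
y = X @ w_true + rng.normal(0, 0.2, 30)

alpha, beta_ = 1.0, 25.0                   # prior precision, noise precision
S = np.linalg.inv(alpha * np.eye(2) + beta_ * X.T @ X)  # posterior covariance
m = beta_ * S @ X.T @ y                                 # posterior mean

print(m)   # shrinks the least-squares solution toward the prior mean 0
```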

Probabilistic programming lies at the intersection of machine learning, statistics, programming languages, and deep learning. Historically probabilistic programming has been about automating Bayesian statistical inference; more recently it has emerged as a candidate for the next toolchain for AI, particularly unsupervised and reinforcement learning.

Probability theory describes uncertainty; statistical theory, in contrast, quantifies uncertainty from observed data.

Cox’s famous theorem says that if your way of reasoning is not in some sense isomorphic to Bayesianism, then you are violating at least one of a small set of basic consistency principles.

Resources on Bayesian Learning

Blogs on Bayesian Learning

Bayesian Statistics

Bayesian statistics takes a fundamentally different view of statistical inference from classical (frequentist) inference. Knowledge of the problem of interest prior to data collection is represented by a probability distribution (the prior distribution); after the data are collected, this distribution is updated using Bayes' theorem and is then called the posterior distribution. All Bayesian inference is based on this posterior distribution. Philosophically, this process is very similar to the scientific discovery process.

Advantages of Bayesian statistics include: the inference is conditional on the given data; prior knowledge can be integrated into the analysis using prior distributions; and complex systems can be modelled easily using hierarchical models. Disadvantages include: different priors lead to different inferences, and computation is often intensive.

After this class, students will master the basic principle of Bayesian inference and its applications in different models, such as prior specification, Markov chain Monte Carlo, Gibbs Sampling, Bayes factor, empirical Bayes, analysis of hierarchical models, etc.

We will use R and WinBUGS for data analysis and have some computer lab sessions in which students will gain hands-on experience with real data analysis.

Bayesian data analysis

The process of Bayesian data analysis can be idealized by dividing it into the following three steps:

  1. Setting up a full probability model—a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.
  2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution—the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.
  3. Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.

Inverse Bayes' Formula

The joint pdf of the random variables $X$ and $Y$ can, under suitable conditions, be expressed as $$ f_{X,Y}(x,y)=f_{X|Y}(x|y)f_{Y}(y)=f_{Y|X}(y|x)f_{X}(x). $$ Thus we get by division $$f_{Y}(y)=\frac{f_{Y|X}(y|x)f_{X}(x)}{f_{X|Y}(x|y)}. \tag{1}$$

Integrating this identity with respect to $y$ over the support of $f_{Y}(y)$, we immediately have the point-wise formula $$f_{X}(x)=\left(\int_{f_{X|Y}(x|y)\neq 0}\frac{f_{Y|X}(y|x)}{f_{X|Y}(x|y)}\,\mathrm{d}y\right)^{-1}. \tag{2}$$ Now substituting (2) into (1), we obtain the dual form of the IBF for $f_Y(y)$, and hence by symmetry we obtain the function-wise formula of $f_X(x)$ at $y_0$ shown in (3), or the sampling formula (4) when the normalizing constant is omitted: $$f_{X}(x) =\left(\int_{f_{Y|X}(y_0|x)\neq 0}\frac{f_{X|Y}(x|y_0)}{f_{Y|X}(y_0|x)}\,\mathrm{d}x\right)^{-1}\frac{f_{X|Y}(x|y_0)}{f_{Y|X}(y_0|x)} \tag{3}$$ $$f_X(x) \propto \frac{f_{X|Y}(x|y_0)}{f_{Y|X}(y_0|x)}. \tag{4}$$
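As a sanity check, the point-wise formula (2) can be verified numerically. Below is a minimal sketch, assuming a standard bivariate normal with correlation $\rho=0.6$ (a toy choice), whose two conditionals and target marginal $X\sim N(0,1)$ are known in closed form.

```python
# A minimal numeric check of the point-wise inverse Bayes formula (2):
# integrate the ratio of the two conditionals and invert, then compare
# with the known marginal X ~ N(0,1).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rho = 0.6
f_y_given_x = lambda y, x: norm.pdf(y, rho * x, np.sqrt(1 - rho**2))
f_x_given_y = lambda x, y: norm.pdf(x, rho * y, np.sqrt(1 - rho**2))

x0 = 0.7
integral, _ = quad(lambda y: f_y_given_x(y, x0) / f_x_given_y(x0, y), -10, 10)
print(1.0 / integral, norm.pdf(x0))        # both ~ f_X(x0)
```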


(Figure: Kai Wang Ng, the inventor of the IBF)

Sparse Bayesian Models

"Sparse Bayesian Modelling" describes the application of Bayesian "automatic relevance determination" (ARD) methodology to predictive models that are linear in their parameters. The motivation behind the approach is that one can infer a flexible, nonlinear, predictive model which is accurate and at the same time makes its predictions using only a small number of relevant basis functions which are automatically selected from a potentially large initial set.

Bayesian Lasso

Tibshirani (1996) suggested that Lasso estimates can be interpreted as posterior mode estimates when the regression parameters have independent and identical Laplace (i.e., double-exponential) priors.

Empirical Bayes

In practice, empirical Bayes analysis employs both frequentist and Bayesian inferential methods.

The fundamental question of empirical Bayes is to estimate the prior density of the latent variables.

We will work in the following simplified framework: unobserved parameters $\theta_i$ have each independently generated an observation $x_i$ according to a known probability kernel $p(x\mid \theta)$, $$x_i\overset{ind}{\sim} p(x_i\mid \theta_i), \quad i=1,2,\cdots, N.$$ The entire data set $x = (x_1, x_2, \cdots ,x_N)$ can profitably be employed in the estimation of each $\theta_i$. Empirical Bayes procedures typically add the assumption that the parameters $\theta_i$ above have been independently drawn from some hidden prior density $$\theta_i\overset{ind}{\sim}g(\theta), \quad i=1,2,\cdots, N.$$
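A minimal sketch in Robbins' classic Poisson setting (an assumed special case of the framework above, with $p(x\mid\theta)$ Poisson): Robbins' formula $E[\theta\mid x]=(x+1)\,N(x+1)/N(x)$, where $N(k)$ counts observations equal to $k$, estimates each posterior mean using only the data and never the hidden prior $g$.

```python
# A minimal empirical Bayes sketch: theta_i ~ g (hidden), x_i ~ Poisson(theta_i).
# Robbins' estimator of E[theta | x = k] uses only the observed counts.
import numpy as np

rng = np.random.default_rng(3)
theta = rng.gamma(2.0, 1.0, size=100_000)  # hidden prior, used only to simulate
x = rng.poisson(theta)

counts = np.bincount(x, minlength=x.max() + 2)
for k in range(5):
    robbins = (k + 1) * counts[k + 1] / counts[k]
    print(k, robbins)   # ~ E[theta | x = k]; exact value is (k+2)/2 here
```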

Objective Bayes Methodology

In subjective Bayesian analysis, prior distributions for $\theta$, $\pi(\theta)$, represent beliefs, which change with data via Bayes theorem. In objective Bayesian analysis, prior distributions represent 'neutral' knowledge and the posterior distribution is claimed to give the probability of unknowns arising from just the data.

Variational Bayes Methods

Variational inference is an umbrella term for algorithms which cast posterior inference as optimization.

Variational Bayesian (VB) methods, also called "ensemble learning", are a family of techniques for approximating intractable integrals arising in Bayesian statistics and machine learning. They are an alternative to other approaches for approximate Bayesian inference such as Markov chain Monte Carlo, the Laplace approximation, etc. They can be used to lower bound the marginal likelihood (i.e. "evidence") of several models with a view to performing model selection, and often provide an analytical approximation to the parameter posterior which is useful for prediction.
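As a minimal sketch of casting posterior inference as optimization, the following fits a Gaussian $q=N(m, s^2)$ to the posterior of an assumed toy conjugate model (prior $\theta\sim N(0,1)$, likelihood $y_i\sim N(\theta,1)$, chosen so the answer can be checked exactly) by stochastic gradient ascent on the ELBO with the reparameterization trick.

```python
# A minimal variational inference sketch via the reparameterization trick.
# Toy model (assumed): theta ~ N(0,1), y_i ~ N(theta, 1); q = N(m, s^2).
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=20)          # simulated data
n = len(y)

m, log_s = 0.0, 0.0                        # variational parameters of q
lr, n_mc = 0.05, 64                        # step size, Monte Carlo samples

for step in range(2000):
    s = np.exp(log_s)
    eps = rng.normal(size=n_mc)
    theta = m + s * eps                    # reparameterized draws from q
    # d/dtheta log p(y, theta) for this model: -theta + sum(y) - n*theta
    dlogp = -theta + y.sum() - n * theta
    grad_m = dlogp.mean()                  # ELBO gradient w.r.t. m
    grad_log_s = (dlogp * eps).mean() * s + 1.0   # +1 is the entropy term
    m += lr * grad_m / n
    log_s += lr * grad_log_s / n

print("VI:   ", m, np.exp(log_s))
print("exact:", y.sum() / (n + 1), (1.0 / (n + 1)) ** 0.5)
```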

Approximate Bayesian Computation

Approximate Bayesian Computation (ABC for short) is a family of computational techniques which offer an almost automated solution in situations where evaluation of the posterior likelihood is computationally prohibitive, or whenever suitable likelihoods are not available.

Approximate Rejection Algorithm (implemented in the sketch after this list):

  1. Draw $\theta$ from $\pi(\theta)$;
  2. Simulate $D′ \sim P(\cdot\mid \theta)$;
  3. Accept $\theta$ if $\rho(D, D')\leq \epsilon$.
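A minimal sketch of the rejection algorithm above, assuming a toy coin-flip model: $\theta\sim\text{Uniform}(0,1)$, $D$ is 100 Bernoulli draws, and $\rho$ compares sample means.

```python
# A minimal ABC rejection sketch: no likelihood evaluations, only simulation.
import numpy as np

rng = np.random.default_rng(1)
D = rng.binomial(1, 0.3, size=100)         # observed data (true theta = 0.3)
eps = 0.02                                 # tolerance

accepted = []
while len(accepted) < 1000:
    theta = rng.uniform()                          # 1. draw theta from the prior
    D_sim = rng.binomial(1, theta, size=100)       # 2. simulate D' ~ P(.|theta)
    if abs(D_sim.mean() - D.mean()) <= eps:        # 3. accept if rho(D, D') <= eps
        accepted.append(theta)

print("approximate posterior mean:", np.mean(accepted))
```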

Expectation propagation

Expectation Propagation (EP) is a family of algorithms for approximate inference in Bayesian models.

Multi-stage inference

  1. Draw samples from the model
  2. Using samples as training data, locally approximate each component of the model
  3. Combine the local approximations to form a surrogate model
  4. Perform exact inference in the surrogate model

Quasi-Bayesian theory

Stein Method

Stein's method, due to Charles M. Stein, is a set of remarkably powerful theoretical techniques for proving approximation and limit theorems in probability theory. It has been mostly known to theoretical statisticians. Recently, however, it has been shown that some of the key ideas from Stein's method can be naturally adopted to solve computational and statistical challenges in practical machine learning.

Maximum Entropy Methods

Maximum entropy methods, similar to maximum likelihood estimation, are based on probability distributions.

The maximum entropy method is usually stated in a deceptively simple way: from among all the probability distributions compatible with empirical data, pick the one with the highest information-theoretic entropy. To really understand where this comes from, and appreciate it at its proper worth, we need to look at its origins in equilibrium statistical mechanics.

Entropy is defined as the expectation of self-information for a discrete multi-state system: $$H(p)=\mathbb{E}_{p(x)}\left(\ln\frac{1}{p(x)}\right)=-\sum_{x\in\mathcal{S}}p(x)\log p(x)$$ where $x$ ranges over the support $\mathcal{S}$ of the probability mass function $p(x)$, and we call $\ln\frac{1}{p(x)}$ the self-information of the state $x$. Note that $\lim_{p\to 0}p\ln\frac{1}{p}=\lim_{p\to 0}\frac{\ln \frac{1}{p}}{\frac{1}{p}}=0$, so by convention the term $0\log\frac{1}{0}$ is defined to be $0$.

The self-information $I(\cdot)$ is a function $I:[0, 1]\to [0,\infty)$ which satisfies the following axioms:

  1. Monotonicity: if $p(x)> p(y)$, then $I(p(x))< I(p(y))$ for any events $x, y$.
  2. Non-negativeness: $I(p(x))\geq 0$ for all $p(x)$ and specially $I(p(x))=0$ if $p(x)=1$.
  3. $I(p)$ is a continuous function of $p\in [0, 1]$.
  4. Additivity: $I(p(x\cap y)) = I(p(x)) + I(p(y))$ if the events $x, y$ are independent.

Thus the entropy is $0$ for a single-state, i.e., deterministic, system. The entropy function $H$ is concave: $H(\lambda p +(1-\lambda)q)\geq \lambda H(p)+(1-\lambda)H(q)$ for any pmfs $p, q$ and $\lambda\in[0, 1]$. Additionally, $\max_{p}H(p)=\ln |\mathcal{S}|$, attained when $p(x)=\frac{1}{|\mathcal{S}|}$, where $|\mathcal{S}|$ is the cardinality of $\mathcal{S}$.

If the random variable $X$ is continuous, i.e., the states of the system are continuous, we extend the entropy to $$H(X)=-\int_{x\in\mathcal{S}}f(x)\log f(x)\,\mathrm{d}x$$ where $f(x)$ is the probability density function of the random variable $X$ at the point $x$. We call this the differential entropy.

The joint entropy $H(X, Y)$ of two random variables $(X, Y)$ with pmf $p(x, y)$ is defined as $$H(X, Y)=\sum_{(x,y) \in\mathcal{S}}-p(x, y)\log p(x, y)=-\mathbb{E}_{p(x, y)}(\log p(x, y))$$

The conditional entropy measures how much entropy a random variable $X$ has remaining if we have already learned the value of a second random variable $Y$. It is referred to as the entropy of $X$ conditional on $Y$, and is written $H(X\mid Y)$. If the probability that $X = x$ is denoted by $p(x)$, then we denote by $p(x\mid y)$ the probability that $X = x$ given that we already know that $Y = y$; $p(x\mid y)$ is a conditional probability. In Bayesian language, $Y$ represents our prior information about $X$. The conditional entropy is just the Shannon entropy with $p(x\mid y)$ replacing $p(x)$, averaged over all possible values of $Y$: $$H(X\mid Y):=\sum_{y}p(y)H(X\mid Y=y)=-\sum_{x, y} p(x\mid y)\log p(x\mid y)\, p(y)\\=-\sum_{x, y}p(x, y)\log p(x\mid y)=-\sum_{x, y} p(x, y)\log p(x, y)+\sum_{x, y} p(x, y) \log p(y).$$

Using the Bayesian product rule $p(x, y) = p(x\mid y)p(y)$, one finds that the conditional entropy is equal to $H(X\mid Y) = H(X,Y) - H(Y)$, with $H(X, Y)$ the joint entropy of $X$ and $Y$.
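A minimal numeric check of this identity on an assumed toy joint pmf:

```python
# Verify the chain rule H(X|Y) = H(X,Y) - H(Y) on a small hand-picked pmf.
import numpy as np

p_xy = np.array([[0.2, 0.1],               # rows index x, columns index y
                 [0.1, 0.6]])

def H(p):
    p = p[p > 0]                           # convention: 0 log 0 = 0
    return -np.sum(p * np.log2(p))

p_y = p_xy.sum(axis=0)                     # marginal of Y
H_joint, H_y = H(p_xy.ravel()), H(p_y)

# conditional entropy directly: -sum_{x,y} p(x,y) log p(x|y)
p_x_given_y = p_xy / p_y
H_cond = -np.sum(p_xy * np.log2(p_x_given_y))

print(H_cond, H_joint - H_y)               # the two values agree
```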

The maximum entropy principle (MaxEnt) states that the most appropriate distribution to model a given set of data is the one with the highest entropy among all those that satisfy the constraints of our prior knowledge.
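A classic illustration is Jaynes' dice problem (assumed here as a toy example): among all pmfs on $\{1,\dots,6\}$ with mean $4.5$, the MaxEnt solution has the exponential-family form $p_i\propto e^{\lambda i}$, with $\lambda$ chosen to satisfy the mean constraint.

```python
# A minimal MaxEnt sketch: solve for the Lagrange multiplier lam so that
# p_i ∝ exp(lam * i) on {1,...,6} has mean 4.5.
import numpy as np
from scipy.optimize import brentq

x = np.arange(1, 7)
target_mean = 4.5

def mean_given(lam):
    w = np.exp(lam * x)
    p = w / w.sum()
    return p @ x

lam = brentq(lambda l: mean_given(l) - target_mean, -5, 5)
p = np.exp(lam * x); p /= p.sum()
print(np.round(p, 4))                      # MaxEnt pmf with mean 4.5
```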

Naive Bayes

Naive Bayes aims to reconstruct the joint distribution of features and labels $Pr(\vec{X}, y)$ given a training dataset $T=\{(\vec{X}_i, y_i)\}_{i=1}^{n}$. However, in practice the features live in a high-dimensional space, so the curse of dimensionality makes it infeasible to estimate the joint distribution $Pr(\vec{X}, y)$ via the (empirical) conditional probability $Pr(\vec{X}\mid y)$ and the prior $Pr(y)$. The naive idea is to simplify the computation by assuming that the features are conditionally independent given the label, so that $$Pr(\vec{X}\mid y) =\prod_{i=1}^{p} Pr(\vec{X}^{(i)}\mid y).\tag{1}$$

The predicted label is then computed via $$ y=\arg\max_{y\in\mathcal{Y}}Pr(y\mid \vec{X}) =\arg\max_{y\in\mathcal{Y}} \frac{Pr(y)\, Pr(\vec{X}\mid y)}{\sum_{y^{\prime}\in\mathcal{Y}} Pr(y^{\prime})Pr(\vec{X}\mid y^{\prime})} \tag{2} $$

where the conditional probability $Pr(\vec{X}\mid y)$ is simplified by the conditional independence assumption in formula (1). Thus the naive Bayes classifier is a maximum a posteriori (MAP) classifier.

The prior probability $Pr(y)$ can be empirical or estimated.

For continuous features, we assume that they are Gaussian distributed, as in the sketch below.
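A minimal Gaussian naive Bayes sketch on simulated data: class priors are empirical, each feature gets an independent per-class Gaussian, and prediction follows the MAP rule (2).

```python
# A minimal Gaussian naive Bayes: per-class feature means/variances plus
# empirical class priors, combined by the MAP rule.
import numpy as np

class GaussianNB:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([(y == c).mean() for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        return self

    def predict(self, X):
        # log Pr(y) + sum_j log N(x_j | mu_{y,j}, var_{y,j})
        ll = -0.5 * (np.log(2 * np.pi * self.var)[None]
                     + (X[:, None, :] - self.mu[None]) ** 2 / self.var[None]).sum(-1)
        return self.classes[np.argmax(np.log(self.prior) + ll, axis=1)]

# toy usage on two well-separated simulated classes
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print((GaussianNB().fit(X, y).predict(X) == y).mean())  # training accuracy
```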

Hierarchical Naïve Bayes models

In a probabilistic framework, classification is the calculation of $P(C\mid A)$, where $C$ is the class variable and $A$ the attributes. A new instance with attribute values $\bar a$ is classified as $c^{\ast}$, where: $$c^{\ast}=\arg\min_{c\in sp(C)}\sum_{c^{\prime}\in sp(C)}\ell(c, c^{\prime})P(C=c^{\prime}\mid \bar a)$$ and $\ell(c, c^{\prime})$ is the loss function.

Average One-Dependence Estimator (AODE)

Hierarchical Models

Multilevel models (also known as hierarchical linear models, linear mixed-effect models, mixed models, nested data models, random coefficient models, random-effects models, random parameter models, or split-plot designs) are statistical models of parameters that vary at more than one level.

We consider a multilevel model to be a regression (a linear or generalized linear model) in which the parameters—the regression coefficients—are given a probability model. This second-level model has parameters of its own—the hyperparameters of the model—which are also estimated from data. The two key parts of a multilevel model are varying coefficients, and a model for those varying coefficients (which can itself include group-level predictors). Classical regression can sometimes accommodate varying coefficients by using indicator variables. The feature that distinguishes multilevel models from classical regression is in the modeling of the variation between groups.

Multilevel models are also called hierarchical, for two different reasons: first, from the structure of the data (for example, students clustered within schools); and second, from the model itself, which has its own hierarchy, with the parameters of the within-school regressions at the bottom, controlled by the hyperparameters of the upper-level model.

Hierarchical Bayesian Regression

Hierarchical Bayes:

  • explicitly represents category hierarchies for sharing abstract knowledge;
  • explicitly identifies only a small number of parameters that are relevant to the new concept being learned.

Hierarchical Bayesian regression extends Bayesian models by modelling the uncertainty of the uncertainty, such as $$ y\sim Pr(\phi(x)\mid \theta),\\ Pr(\phi(x)\mid \theta) = \frac{Pr(\theta\mid\phi(x))Pr(\phi(x))}{Pr(\theta)},\\ Pr(\phi(x))= Pr(\phi(x)\mid \eta) Pr(\eta),\\ \vdots $$

We can take any factor into consideration in this hierarchical Bayesian model. It is a probabilistic graphical model, consisting of the connections (the structure) and the probabilities. The probability of the observed sample $y$ is $$y\sim \frac{Pr(\theta\mid\phi(x))\, Pr(\phi(x)\mid \eta)\, Pr(\eta)}{Pr(\theta)}.$$

Beta-logistic

Assume that the instantaneous event probability at time step $t$ is characterized by a geometric distribution: $$P(T=t\mid \theta)=\theta(1-\theta)^{t-1}.$$

Instead of a point estimate for $\theta$, use a Beta prior parameterized as follows: $$\alpha(x)=\exp(a(x)), \quad \beta(x)=\exp(b(x)).$$ The likelihood function is $$L(\alpha, \beta)=\prod_{i\ \text{observed}} P(T=t_i\mid \alpha(x_i), \beta(x_i))\prod_{i\ \text{censored}} P(T> t_i\mid \alpha(x_i), \beta(x_i)).$$

Nice properties:

  • $P(T=1\mid \alpha, \beta)=\frac{\alpha}{\alpha + \beta}$;
  • $P(T=t\mid \alpha, \beta)=(\frac{\beta + t - 2}{\alpha + \beta+t-1})P(T=t-1\mid \alpha, \beta)$.

If $a$ and $b$ are linear, $a(x)=\gamma_a\cdot x$ and $b(x)=\gamma_b \cdot x$, then $$P(T=1\mid \alpha, \beta)=\frac{\alpha(x)}{\alpha(x) + \beta(x)}=\frac{1}{1+\exp(\left<\gamma_b-\gamma_a, x\right>)}.$$

For $T=1$, it reduces to overparameterized logistic regression.
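A minimal sketch of the recursion above, assuming scalar values of $a(x)$ and $b(x)$ have already been computed:

```python
# Beta-geometric pmf built from P(T=1) = alpha/(alpha+beta) and the
# recursion P(T=t) = ((beta + t - 2)/(alpha + beta + t - 1)) * P(T=t-1).
import numpy as np

def beta_geometric_pmf(alpha, beta, t_max):
    p = np.empty(t_max)
    p[0] = alpha / (alpha + beta)                       # P(T = 1)
    for t in range(2, t_max + 1):
        p[t - 1] = p[t - 2] * (beta + t - 2) / (alpha + beta + t - 1)
    return p

pmf = beta_geometric_pmf(np.exp(0.2), np.exp(1.0), 50)  # alpha=e^a, beta=e^b
print(pmf[:5], pmf.sum())                  # total mass approaches 1 as t_max grows
```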

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a topic model in natural language processing.

Dirichlet distribution


Generative process, $w \sim \text{LDA}$ (see the sketch after this list):

  • Draw each topic $\theta_k \sim Dir(\eta)$ for $k=1,\cdots, K$.
  • For each document:
    • Draw topic proportions $\pi_d \sim Dir(\alpha)$
    • For each word:
      • Draw topic indicator $z_{d, n} \sim Mult(\pi_d)$
      • Draw word $w_{d, n} \sim Mult(\theta_{z_{d, n}})$
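A minimal sketch of this generative process with assumed toy sizes ($K$ topics, $V$ vocabulary words, $D$ documents of $N$ words each):

```python
# Sample documents from the LDA generative process above.
import numpy as np

rng = np.random.default_rng(4)
K, V, D, N = 3, 20, 5, 30
eta, alpha = 0.1, 0.5

topics = rng.dirichlet(np.full(V, eta), size=K)        # theta_k ~ Dir(eta)
docs = []
for d in range(D):
    pi = rng.dirichlet(np.full(K, alpha))              # pi_d ~ Dir(alpha)
    z = rng.choice(K, size=N, p=pi)                    # z_{d,n} ~ Mult(pi_d)
    w = np.array([rng.choice(V, p=topics[k]) for k in z])  # w_{d,n} ~ Mult(theta_{z_{d,n}})
    docs.append(w)

print(docs[0])                             # word ids of the first document
```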

Hierarchical Generative Model

Bayesian Optimization

Bayesian optimization has been successful at global optimization of expensive-to-evaluate multimodal objective functions. However, unlike most optimization methods, Bayesian optimization typically does not use derivative information.

As response surface methods, they date back to Box and Wilson in 1951. Bayesian optimization usually uses Gaussian process regression.

At a high level, Bayesian Optimization offers a principled method of hyperparameter searching that takes advantage of information one learns during the optimization process. The idea is as follows: you pick some prior belief (it is Bayesian, after all) about how your parameters behave, and search the parameter space of interest by enforcing and updating that prior belief based on your ongoing measurements. The trade-off between exploration (making sure that you’ve visited most relevant corners of your space) and exploitation (once you’ve found a promising region of your space, finding a good setting of parameters within it) is handled in a structured and intelligent way.
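A minimal sketch of this loop, assuming a 1-D toy objective, an RBF-kernel Gaussian process surrogate, and an upper-confidence-bound (UCB) acquisition rule maximized over a grid; note that no derivatives of the objective are used.

```python
# Bayesian optimization sketch: GP posterior mean/variance on a grid,
# UCB acquisition to balance exploration and exploitation.
import numpy as np

f = lambda x: -np.sin(3 * x) - x**2 + 0.7 * x          # "expensive" black box
grid = np.linspace(-2, 2, 401)

def kern(a, b, ls=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ls**2)

X = np.array([-1.0, 1.0])                  # initial design points
y = f(X)
for it in range(15):
    K = kern(X, X) + 1e-6 * np.eye(len(X))            # jitter for stability
    Ks = kern(grid, X)
    mu = Ks @ np.linalg.solve(K, y)                   # GP posterior mean
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    ucb = mu + 2.0 * np.sqrt(np.maximum(var, 0))      # explore + exploit
    x_next = grid[np.argmax(ucb)]                     # next query point
    X, y = np.append(X, x_next), np.append(y, f(x_next))

print(X[np.argmax(y)], y.max())            # best point found so far
```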


Bayesian Learning

We can think of machine learning as learning models of data. The Bayesian framework for machine learning states that you start out by enumerating all reasonable models of the data and assigning your prior belief $P(M)$ to each of these models. Then, upon observing the data $D$, you evaluate how probable the data was under each of these models to compute $P(D\mid M)$. Multiplying this likelihood by the prior and renormalizing results in the posterior probability over models, $P(M\mid D)$, which encapsulates everything that you have learned from the data regarding the possible models under consideration. Thus, to compare two models $M$ and $M'$, we need to compute their relative probability given the data: $\frac{P(M)P(D\mid M)}{P(M')P(D\mid M')}$.

Bayes Optimal Classifier

The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for a new example, given the training dataset $$f(x)=\hat y=\arg\max_{y}P(y\mid x)=\arg\max_{y\in\mathcal{Y}}\frac{P(x,y)}{P(x)}=\arg\max_{y\in\mathcal{Y}}\frac{P(x\mid y)P(y)}{P(x)}.$$

This model is also referred to as the Bayes optimal learner, the Bayes classifier, Bayes optimal decision boundary, or the Bayes optimal discriminant function.

PAC-Bayesian Learning

Industry-wide successes of machine learning at the dawn of the (so-called) big data era have led to an increasing gap between practitioners and theoreticians. The former are using off-the-shelf statistical and machine learning methods, while the latter are designing and studying the mathematical properties of such algorithms. The tradeoff between those two movements is somewhat addressed by Bayesian researchers, where sound mathematical guarantees often meet efficient implementation and provide model selection criteria. In the late 90s, a new paradigm emerged in the statistical learning community, used to derive probably approximately correct (PAC) bounds on Bayesian-flavored estimators. This PAC-Bayesian theory has been pioneered by Shawe-Taylor and Williamson (1997), and McAllester (1998, 1999). It has been extensively formalized by Catoni (2004, 2007) and has triggered, slowly but surely, increasing research efforts over the last decades.

We believe it is time to pinpoint the current PAC-Bayesian trends relative to other modern approaches in the (statistical) machine learning community. Indeed, we observe that, while the field grows on its own, it has taken some undesirable distance from related areas. Firstly, it seems to us that the relation to Bayesian methods has been forsaken in numerous works, despite the potential of PAC-Bayesian theory to bring new insights to the Bayesian community and to go beyond the classical Bayesian/frequentist divide. Secondly, the PAC-Bayesian methods share similarities with other quasi-Bayesian (or pseudo-Bayesian) methods studying Bayesian practices from a frequentist standpoint, such as the Minimum Description Length (MDL) principle (Grünwald, 2007). Last but not least, even though some practical and theory-grounded learning algorithms have emerged from PAC-Bayesian works, these are almost unused for real-world problems.

Bayesian Ensemble Methods

We consider the fundamental problem of making inference about an unknown function $f$ that predicts an output $Y$ using a $p$-dimensional vector of inputs $x$ when $$Y = f(x) + \epsilon, \quad \epsilon\sim N(0, \sigma^2).$$

To do this, we consider modelling, or at least approximating, $f(x) = E(Y \mid x)$, the mean of $Y$ given $x$, by a sum of $m$ regression trees: $$f(x) \approx w_1g_1(x) + w_2g_2(x) + \cdots + w_mg_m(x)$$ where each $g_i$ denotes a binary regression tree.

In stacking, the weights are no longer posterior probabilities of models; they are obtained by a technique based on cross-validation. Bayesian stacking is based on generalized stacking, or weight-learning, methods: $$\arg\min_{w, \Theta}L(\mathcal{E}(g(x)) , y)$$ where $\mathcal{E}(g(x))\approx w_1g_1(x) + w_2g_2(x) + \cdots + w_mg_m(x)$.
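A minimal stacking sketch with assumed toy base models $g_i$: the combination weights are fit by nonnegative least squares and then renormalized onto the simplex, which is one common heuristic (not the only weight-learning scheme).

```python
# Learn stacking weights for three fixed base regressors by NNLS,
# then renormalize so the weights are nonnegative and sum to one.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 200)
y = np.sin(2 * x) + rng.normal(0, 0.1, 200)

# predictions of three simple base models g_i(x), one per column
G = np.column_stack([x, x**3, np.tanh(2 * x)])

w, _ = nnls(G, y)                          # nonnegative weights
w /= w.sum()                               # renormalize onto the simplex
print(w, np.mean((G @ w - y)**2))          # weights and combined MSE
```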

Bayesian model averaging

Bayesian parameter averaging (BPA) is an ensemble technique that seeks to approximate the Bayes optimal classifier by sampling hypotheses from the hypothesis space, and combining them using Bayes' law.

As the size of the data set increases, this uncertainty reduces, and the posterior probabilities $p(h\mid X)$ become increasingly focussed on just one of the models. This highlights the key difference between Bayesian model averaging and model combination, because in Bayesian model averaging the whole data set is generated by a single model.

Bayesian model averaging is best thought of as a method for ‘soft model selection.’ It answers the question: “Given that all of the data so far was generated by exactly one of the hypotheses, what is the probability of observing the new pair $(c, x)$?” The soft weights in BMA only reflect a statistical inability to distinguish the hypothesis based on limited data. As more data arrives, the hypotheses become more distinguishable and BMA will always focus its weight on the most probable hypothesis, just as the posterior for the mean of a Gaussian focuses ever more narrowly on the sample mean. Mathematically, we can write the BMA rule as $$P((c, x)\mid D)\propto\sum_{h} P((c, x), D\mid h)P(h)$$ which emphasizes the assumption that exactly one hypothesis is responsible for all of the data.

And the only flaw with BMA is the belief that it is an algorithm for model combination, when it is not.

Bayesian model combination

Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weightings drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all of the weight to a single model.

Bayesian classifier combination does not assume that any of the classifiers is the true one. Moreover, it does not require that the classifiers be probabilistic; they can even be human experts. Finally, the classifiers can embody widely different prior assumptions about the data, and have observed different data sets.

Bayesian model selection

Bayesian model selection is the hard version of model averaging.

In Bayesian Model Selection (BMS), a model is selected which maximises the posterior model probability $$m^{\ast}=\arg\max_{m}P(m\mid y).$$ If the prior is uniform, $P(m) = 1/M$, then this is equivalent to picking the model with the highest evidence.
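A minimal numeric sketch of evidence-based selection, assuming two toy models for $k=62$ heads in $n=100$ coin flips: $M_1$ fixes $\theta=0.5$, while $M_2$ places a uniform prior on $\theta$.

```python
# Compare two models by their evidence P(y | M); the Bayes factor is the ratio.
from scipy.stats import binom
from scipy.integrate import quad

n, k = 100, 62
ev_m1 = binom.pmf(k, n, 0.5)                            # P(y | M1), theta fixed
ev_m2, _ = quad(lambda t: binom.pmf(k, n, t), 0, 1)     # P(y | M2) = ∫ P(y|t) dt
print(ev_m1, ev_m2, "Bayes factor M2/M1:", ev_m2 / ev_m1)
```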

Bayesian committee machine

The Bayesian committee machine (BCM) is a novel approach to combining estimators which were trained on different data sets. Although the BCM can be applied to the combination of any kind of estimators the main foci are Gaussian process regression and related systems such as regularization networks and smoothing splines for which the degrees of freedom increase with the number of training data. Somewhat surprisingly, we find that the performance of the BCM improves if several test points are queried at the same time and is optimal if the number of test points is at least as large as the degrees of freedom of the estimator. The BCM also provides a new solution for online learning with potential applications to data mining. We apply the BCM to systems with fixed basis functions and discuss its relationship to Gaussian process regression. Finally, we also show how the ideas behind the BCM can be applied in a non-Bayesian setting to extend the input dependent combination of estimators.


| Task | Description |
| --- | --- |
| Model comparison | define criteria to rank models by which is best |
| Model selection | choose the best model |
| Model averaging | combine models into a single meta-model |

Bayesian model comparison

In the context of model comparison it is appropriate to think of a model as a specification of a set of parameters $\theta$ and of their prior distribution, $p(\theta | M)$. As shown below, it is the number of free parameters and their prior range that control the strength of the Occam's razor effect in Bayesian model comparison: models that have many parameters that can take on a wide range of values but that are not needed in the light of the data are penalized for their unwarranted complexity. Therefore, the prior choice ought to reflect the available parameter space under the model M, independently of experimental constraints we might already be aware of. This is because we are trying to assess the economy (or simplicity) of the model itself, and hence the prior should be based on theoretical or physical constraints on the model under consideration. Often these will take the form of a range of values that are deemed "intuitively" plausible, or "natural". Thus the prior specification is inherent in the model comparison approach.

Bayesian Decision Theory

Bayesian decision theory refers to a decision theory which is informed by Bayesian probability. It is a statistical system that tries to quantify the tradeoff between various decisions, making use of probabilities and costs. An agent operating under such a decision theory uses the concepts of Bayesian statistics to estimate the expected value of its actions, and update its expectations based on new information.

Bayesian choice model

Probabilistic Machine Learning

Probabilistic modeling is a practical approach for designing machines that:

  • Learn from noisy and unlabeled data
  • Define confidence levels in predictions
  • Allow decision making in the absence of complete information
  • Infer missing data and latent correlations in data

Probabilistic Programming

Probabilistic graphical models provide a formal lingua franca for modeling and a common target for efficient inference algorithms. Their introduction gave rise to an extensive body of work in machine learning, statistics, robotics, vision, biology, neuroscience, artificial intelligence (AI) and cognitive science. However, many of the most innovative and useful probabilistic models published by the AI, machine learning, and statistics community far outstrip the representational capacity of graphical models and associated inference techniques. Models are communicated using a mix of natural language, pseudo code, and mathematical formulae and solved using special purpose, one-off inference methods. Rather than precise specifications suitable for automatic inference, graphical models typically serve as coarse, high-level descriptions, eliding critical aspects such as fine-grained independence, abstraction and recursion.

PROBABILISTIC PROGRAMMING LANGUAGES aim to close this representational gap, unifying general purpose programming with probabilistic modeling; literally, users specify a probabilistic model in its entirety (e.g., by writing code that generates a sample from the joint distribution) and inference follows automatically given the specification. These languages provide the full power of modern programming languages for describing complex distributions, and can enable reuse of libraries of models, support interactive modeling and formal verification, and provide a much-needed abstraction barrier to foster generic, efficient inference in universal model classes.

We believe that the probabilistic programming language approach within AI has the potential to fundamentally change the way we understand, design, build, test and deploy probabilistic systems. This approach has seen growing interest within AI over the last 10 years, yet the endeavor builds on over 40 years of work in a range of diverse fields including mathematical logic, theoretical computer science, formal methods, programming languages, as well as machine learning, computational statistics, systems biology, and probabilistic AI.

There is a list of existing probabilistic programming systems at http://www.probabilistic-programming.org/wiki/Home and a list of research articles on probabilistic programming until 2015.

The algorithms behind probabilistic programming include: Bayesian inference, Hamiltonian Monte Carlo and the No-U-Turn Sampler, variational inference and automatic differentiation, and probabilistic programming languages.

Broadly speaking, probabilistic programming languages express computations with degrees of uncertainty, which comes from imprecision in the input data, from a lack of complete knowledge, or is inherent in the domain. More precisely, the goal of probabilistic programming languages is to represent and automate reasoning about probabilistic models, which describe uncertain quantities -- random variables -- and relationships among them.
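A minimal illustration of that goal, assuming a toy coin-flip program: the model is ordinary code that samples from the joint distribution, and a generic inference routine (here, simple likelihood weighting) is applied to it.

```python
# The "program" samples from the prior and exposes a likelihood; a generic
# likelihood-weighting routine turns prior samples into posterior estimates.
import numpy as np

rng = np.random.default_rng(6)

def model():
    # generative program: prior over the coin bias, then the data model
    theta = rng.uniform()
    return theta, lambda k, n: theta**k * (1 - theta)**(n - k)

k_obs, n_obs = 7, 10
samples, weights = [], []
for _ in range(100_000):
    theta, lik = model()
    samples.append(theta)
    weights.append(lik(k_obs, n_obs))      # weight by P(data | theta)

w, s = np.array(weights), np.array(samples)
print(np.sum(w * s) / np.sum(w))           # posterior mean ~ (7+1)/(10+2)
```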



Optimal Learning

The Bayesian perspective casts a different interpretation on the statistics we compute, which is particularly useful in the context of optimal learning. In the frequentist perspective, we do not start with any knowledge about the system before we have collected any data. By contrast, in the Bayesian perspective we assume that we begin with a prior distribution of belief about the unknown parameters.

Everyday decisions are made without the benefit of accurate information. Optimal Learning develops the needed principles for gathering information to make decisions, especially when collecting information is time-consuming and expensive. Optimal learning addresses the problem of efficiently collecting information with which to make decisions. Optimal learning is an issue primarily in applications where observations or measurements are expensive.

It is possible to approach the learning problem using classical and familiar ideas from optimization. The operations research community is very familiar with the use of gradients to minimize or maximize functions. Dual variables in linear programs are a form of gradient, and these are what guide the simplex algorithm. Gradients capture the value of an incremental change in some input such as a price, fleet size or the size of buffers in a manufacturing system. We can apply this same idea to learning.

There is a list of optimal learning problems.