Distributions implements low-level primitives for Bayesian MCMC inference in Python and C++ including:
- special numerical functions `distributions.<flavor>.special`,
- samplers and density functions from a variety of distributions `distributions.<flavor>.random`,
- conjugate component models (e.g., gamma-Poisson, normal-inverse-chi-squared) `distributions.<flavor>.models`, and
- clustering models (e.g., CRP, Pitman-Yor) `distributions.<flavor>.clustering`.
Python implementations are provided in up to three flavors:
- Debug `distributions.dbg` are pure-python implementations for correctness auditing, error checking, and debugging via pdb.
- High-Precision `distributions.hp` are cython implementations for fast inference in python and for use as a numerical reference.
- Low-Precision `distributions.lp` are inefficient python wrappers around blazingly fast C++ implementations, intended mostly to check that the C++ implementations are correct.
Our typical workflow is to first prototype models in python, then prototype faster inference applications using cython models, and finally implement optimized scalable inference products in C++, while testing all implementations for correctness.
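For instance, the same component model API (described below) can be exercised under different flavors just by changing the import path. A minimal sketch, assuming the gamma-Poisson model is exposed as a `gp` submodule of `distributions.<flavor>.models` (the submodule name is an assumption):

    # Sketch: exercise the same model under two flavors.
    # The submodule name `gp` (gamma-Poisson) is an assumption.
    from distributions.dbg.models import gp as dbg_gp
    from distributions.lp.models import gp as lp_gp

    for gp in (dbg_gp, lp_gp):
        example = gp.EXAMPLES[0]                        # testing metadata shipped with each model
        shared = gp.Shared.from_dict(example['shared'])
        group = gp.Group.from_values(shared, example['values'])
        print(group.score_value(shared, example['values'][0]))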
Feature models are contained in modules in python and structs in C++.
Below we write `Model.thing` to denote `module.thing` in python and `Model::thing` in C++. Most functions consume explicit entropy sources in C++, or implicitly consume the `global_rng` in python. Below, `json` denotes a python dict/list/number/string suitable for serialization with the `json` package.
Each feature model API consists of:
Datatypes.
- `Shared` - shared global model state including fixed parameters, hyperparameters, and, for datatypes with dynamic support, shared sufficient statistics.
- `Value` - observation state, i.e., a datum.
- `Group` - local component state including sufficient statistics and possibly group parameters.
- `Sampler` - partially evaluated per-group sampling function (optional in python).
- `Scorer` - cached per-group scoring function (optional in python).
- `Mixture` - vectorized scoring functions for mixture models (optional in python).
Shared operations. These should be simple and fast:
    shared = Model.Shared()
    shared.protobuf_load(message)
    shared.protobuf_dump(message)
    shared.load(json)                       # python only
    shared.dump() -> json                   # python only
    Shared.from_dict(json) -> shared        # python only
    Shared.from_protobuf(json, message)     # python only
    Shared.to_protobuf(message) -> json     # python only
    shared.add_value(value)
    shared.add_repeated_value(value)
    shared.remove_value(value)
    shared.realize()
    shared.plus_group(group) -> shared      # optional
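For illustration, a rough sketch of typical shared-state usage, assuming a feature model module imported as `Model` (the `gp` submodule name is an assumption):

    # Sketch: build shared state from a dict, update it with observed values,
    # and round-trip it through json.
    import json
    from distributions.dbg.models import gp as Model   # submodule name is an assumption

    example = Model.EXAMPLES[0]
    shared = Model.Shared.from_dict(example['shared'])
    for value in example['values']:
        shared.add_value(value)     # updates shared sufficient statistics (dynamic-support models)
    restored = Model.Shared.from_dict(json.loads(json.dumps(shared.dump())))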
Group operations. These should be simple and fast. These may consume entropy:
    group = Model.Group()
    group.protobuf_load(message)
    group.protobuf_dump(message)
    group.load(json)                                # python only
    group.dump() -> json                            # python only
    Group.from_values(shared, values) -> group      # python only
    Group.from_dict(json) -> group                  # python only
    Group.from_protobuf(json, message)              # python only
    Group.to_protobuf(message) -> json              # python only
    group.init(shared)
    group.add_value(shared, value)
    group.add_repeated_value(shared, value, count)
    group.remove_value(shared, value)
    group.merge(shared, other_group)
    group.sample_value(shared)
    group.score_value(shared, value)
    group.validate()                                # C++ only
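A sketch of incremental group updates under the same hypothetical `gp` import:

    # Sketch: incremental sufficient-statistic updates on a group.
    from distributions.dbg.models import gp as Model   # submodule name is an assumption

    example = Model.EXAMPLES[0]
    shared = Model.Shared.from_dict(example['shared'])
    values = example['values']

    group = Model.Group.from_values(shared, values)      # batch construction (python only)
    other = Model.Group()
    other.init(shared)                                   # start an empty group
    other.add_value(shared, values[0])                   # incremental add
    other.remove_value(shared, values[0])                # ...and undo it
    group.merge(shared, other)                           # absorb other into group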
Sampling. These may consume entropy:
    sampler = Model.Sampler()
    sampler.init(shared, group)
    sampler.eval(shared) -> value
    group.sample_value(shared) -> value
    Model.sample_group(shared, group_size) -> group     # python only
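A sketch of drawing synthetic data, under the same hypothetical import:

    # Sketch: draw values from a group's posterior predictive.
    from distributions.dbg.models import gp as Model   # submodule name is an assumption

    example = Model.EXAMPLES[0]
    shared = Model.Shared.from_dict(example['shared'])
    group = Model.Group.from_values(shared, example['values'])

    sampler = Model.Sampler()
    sampler.init(shared, group)                  # partially evaluate the per-group sampler
    fresh = [sampler.eval(shared) for _ in range(10)]

    one_off = group.sample_value(shared)         # convenience form, no explicit Sampler
    synthetic = Model.sample_group(shared, 20)   # python only: sample a whole synthetic group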
Scoring. These may also consume entropy (e.g., when implemented using Monte Carlo integration):
    scorer = Model.Scorer()
    scorer.init(shared, group)
    scorer.eval(shared, value) -> float
    group.score_value(shared, value) -> float
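A sketch of scoring a held-out value, under the same hypothetical import:

    # Sketch: score a held-out value under a fitted group.
    from distributions.dbg.models import gp as Model   # submodule name is an assumption

    example = Model.EXAMPLES[0]
    shared = Model.Shared.from_dict(example['shared'])
    values = example['values']
    group = Model.Group.from_values(shared, values[:-1])
    held_out = values[-1]

    scorer = Model.Scorer()
    scorer.init(shared, group)                   # cache per-group terms
    score = scorer.eval(shared, held_out)        # log probability of held_out
    # the cached scorer and the direct form should agree
    assert abs(score - group.score_value(shared, held_out)) < 1e-6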
Mixture Slaves (optional in python). These provide batch operations on a collection of groups:
    mixture = Model.Mixture()
    mixture.groups().push_back(group)               # C++ only
    mixture.append(group)                           # python only
    mixture.init(shared)
    mixture.add_group(shared)
    mixture.remove_group(shared, groupid)
    mixture.add_value(shared, groupid, value)
    mixture.remove_value(shared, groupid, value)
    mixture.score_value(shared, value, scores_accum)
    mixture.score_data(shared) -> float
    mixture.score_data_grid(shareds, scores_out)    # C++ only
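A sketch of vectorized scoring against every group in a mixture, under the same hypothetical import (and assuming this flavor provides a `Mixture` class):

    # Sketch: score one value against all groups of a mixture slave.
    import numpy
    from distributions.dbg.models import gp as Model   # submodule name is an assumption

    example = Model.EXAMPLES[0]
    shared = Model.Shared.from_dict(example['shared'])
    values = example['values']

    mixture = Model.Mixture()
    for value in values:
        group = Model.Group()
        group.init(shared)
        group.add_value(shared, value)
        mixture.append(group)                    # python only: one singleton group per value
    mixture.init(shared)

    scores = numpy.zeros(len(values))
    mixture.score_value(shared, values[0], scores)   # accumulates per-group log scores
    print(int(scores.argmax()))                      # most likely group for values[0]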
Testing metadata. Example model parameters and datasets are automatically discovered by unit test infrastructures, reducing the cost of per-model test-writing:
    # in python
    for example in Model.EXAMPLES:
        shared = Model.shared_load(example['shared'])
        values = example['values']
        ...

    // in C++
    Model::Shared shared = Model::Shared::EXAMPLE();
    ...
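For example, a generic serialization round-trip test can be written once against this metadata (a sketch, again assuming the hypothetical `gp` submodule):

    # Sketch: a generic round-trip test driven by Model.EXAMPLES.
    from distributions.dbg.models import gp as Model   # submodule name is an assumption

    def test_dump_load_roundtrip():
        for example in Model.EXAMPLES:
            shared = Model.Shared.from_dict(example['shared'])
            assert Model.Shared.from_dict(shared.dump()).dump() == shared.dump()
            group = Model.Group.from_values(shared, example['values'])
            assert Model.Group.from_dict(group.dump()).dump() == group.dump()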
Clustering models support sampling and scoring:
    model = Model()
    model.sample_assignments(sample_size)
    model.score_counts(counts)
    model.score_add_value(...)
    model.score_remove_value(...)
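For instance (a sketch; the class name and location `distributions.dbg.clustering.PitmanYor` and its default parameters are assumptions):

    # Sketch: sample a partition and score it with a clustering model.
    from collections import Counter
    from distributions.dbg.clustering import PitmanYor   # class name/location is an assumption

    model = PitmanYor()                              # hypothetical default hyperparameters
    assignments = model.sample_assignments(100)      # one group id per datum
    counts = list(Counter(assignments).values())     # group sizes
    log_prob = model.score_counts(counts)            # log probability of this partition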
Mixture driver (optional in python). These provide batch operations on a collection of groups, referencing a `clustering` model:

    mixture = model.Mixture()
    mixture.counts().push_back(count)                    # C++ only
    mixture.init(model)                                  # C++ only
    mixture.init(model, counts)                          # python only
    mixture.remove_group(shared, groupid)
    mixture.add_value(shared, groupid, value) -> bool
    mixture.remove_value(shared, groupid, value) -> bool
    mixture.score_value(shared, value, scores_out)
    mixture.score_data(shared) -> float
Mixture drivers and slaves coordinate using the pattern:
    # driver is a single clustering model
    # slaves is a list of feature models

    def add_value(driver, slaves, groupid, value):
        added = driver.mixture.add_value(driver.shared, groupid, value)
        for slave in slaves:
            slave.mixture.add_value(slave.shared, groupid, value)
            if added:
                slave.mixture.add_group(slave.shared)

    def remove_value(driver, slaves, groupid, value):
        removed = driver.mixture.remove_value(driver.shared, groupid, value)
        for slave in slaves:
            slave.mixture.remove_value(slave.shared, groupid, value)
            if removed:
                slave.mixture.remove_group(slave.shared, groupid)
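The same pattern extends to scoring, where per-group log scores from the driver and every slave are accumulated before choosing an assignment (a sketch reusing the hypothetical `driver`/`slaves` objects and the `add_value` helper above; `group_count`, the current number of groups, is an assumed argument):

    import numpy

    # Sketch: combine driver and slave scores, then assign the value to a group.
    def score_value(driver, slaves, value, group_count):
        scores = numpy.zeros(group_count)
        driver.mixture.score_value(driver.shared, value, scores)     # clustering prior terms
        for slave in slaves:
            slave.mixture.score_value(slave.shared, value, scores)   # feature likelihood terms
        return scores

    def assign_value(driver, slaves, value, group_count):
        scores = score_value(driver, slaves, value, group_count)
        groupid = int(scores.argmax())            # or sample proportionally to exp(scores)
        add_value(driver, slaves, groupid, value) # as defined above
        return groupid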
See `examples/mixture/main.py` for a working example.

Testing metadata (python only). Example model parameters and datasets are automatically discovered by unit test infrastructures, reducing the cost of per-model test-writing:

    ExampleModel.EXAMPLES = [
        ...model specific...
    ]
The C++ methods explicitly require a random number generator `rng` everywhere entropy may be consumed. The python models try to maintain compatibility with `numpy.random` by hiding this source either as the global `numpy.random` generator, or as a single `global_rng` in wrapped C++.
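In the pure-python flavor this means sampling can be made reproducible through numpy's global state (a sketch, assuming the hypothetical `gp` submodule; seeding of the wrapped C++ `global_rng` is not shown):

    # Sketch: reproducible dbg-flavor sampling via the global numpy.random generator.
    import numpy
    from distributions.dbg.models import gp as Model   # submodule name is an assumption

    example = Model.EXAMPLES[0]
    shared = Model.Shared.from_dict(example['shared'])
    group = Model.Group.from_values(shared, example['values'])

    numpy.random.seed(0)
    first = group.sample_value(shared)
    numpy.random.seed(0)
    second = group.sample_value(shared)
    assert first == second      # dbg models draw from the global numpy.random generator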