Added minor adjustments to SimpleStatic
thevolatilebit committed Dec 1, 2023
1 parent 6bbb021 commit 7cd8673
Showing 6 changed files with 41 additions and 35 deletions.
6 changes: 3 additions & 3 deletions docs/src/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,11 +7,11 @@ A decision-making framework for the cost-efficient design of experiments, balanc
```

## Static experimental designs
Here we assume that the same experimental design will be used for a population of examined entities, hence the word 'static'.
Here we assume that the same experimental design will be used for a population of examined entities, hence the word "static".

For each subset of experiments, we consider an estimate of the value of acquired information. To give an example, if a set of experiments is used to predict the value of a specific target variable, our framework can leverage a built-in integration with [MLJ.jl](https://github.com/alan-turing-institute/MLJ.jl) to estimate predictive accuracies of machine learning models fitted over subsets of experimental features.

In the cost-sensitive setting of CEEDesigns, a user provides the monetary cost and execution time of each experiment. Given the constraint on the maximum number of parallel experiments along with a fixed tradeoff between monetary cost and execution time, we devise an arrangement of each subset of experiments such that the expected combined cost is minimized.
In the cost-sensitive setting of `CEEDesigns.jl`, a user provides the monetary cost and execution time of each experiment. Given the constraint on the maximum number of parallel experiments along with a fixed tradeoff between monetary cost and execution time, we devise an arrangement of each subset of experiments such that the expected combined cost is minimized.

Assuming the information values and optimized experimental costs for each subset of experiments, we then generate a set of cost-efficient experimental designs.

Expand All @@ -23,7 +23,7 @@ Assuming the information values and optimized experimental costs for each subset

We consider 'personalized' experimental designs that dynamically adjust based on the evidence gathered from the experiments. This approach is motivated by the fact that the value of information collected from an experiment generally differs across subpopulations of the entities involved in the triage process.

At the beginning of the triage process, an entity's prior data is used to project a range of cost-efficient experimental designs. Internally, while constructing these designs, we incorporate multiple-step-ahead lookups to model likely experimental outcomes and consider the subsequent decisions for each outcome. Then after choosing a specific decision policy from this set and acquiring additional experimental readouts (sampled from a generative model, hence the word 'generative'), we adjust the continuation based on this evidence.
At the beginning of the triage process, an entity's prior data is used to project a range of cost-efficient experimental designs. Internally, while constructing these designs, we incorporate multiple-step-ahead lookups to model likely experimental outcomes and consider the subsequent decisions for each outcome. Then after choosing a specific decision policy from this set and acquiring additional experimental readouts (sampled from a generative model, hence the word "generative"), we adjust the continuation based on this evidence.

```@raw html
<a><img src="assets/search_tree.png" align="left" alt="code" width="400"></a>
Expand Down
16 changes: 9 additions & 7 deletions docs/src/tutorials/SimpleStatic.jl
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
# # Static Experimental Designs

# In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs",
# In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs," i.e.,
# arrangements of experiments that exist along a Pareto-optimal tradeoff between cost and information gain.
# We also show an example with synthetic data.

# ## Setting

# Consider the following scenario. There exists a set of experiments, each of which, when performed, yields
# measurements on one or more observables (features). Each subset of observables (and therefore each subset of experiments)
# has some "information value", which is intentionally vaguely defined for generality, but for example, may be
# has some "information value," which is intentionally vaguely defined for generality, but for example, may be
# a loss function if that subset is used to train some machine learning model. It is generally the value of acquiring that information.
# Finally, each experiment has some monetary cost and execution time to perform the experiment, and
# the user has some known tradeoff between overall execution time and cost.
Expand Down Expand Up @@ -60,7 +60,7 @@
# $$m_{O_{S}}=\sum_{e\in S} m_{e}$$
# The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
# arrangement is done in serial, experiments within partitions are done in parallel.
# $$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e} e \in o_{i}\}$$
# $$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$$
# Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
# cost of an arrangement is:
# $$\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$$
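# To make the cost formulas above concrete, here is a small illustrative sketch
# (in Python, for exposition only; it is not part of the `CEEDesigns.jl` API).
# It evaluates the combined cost $\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$ of an
# arrangement given per-experiment (monetary cost, execution time) pairs, using
# the four experiments with costs $(1, 1)$, $(1, 3)$, $(1, 2)$, $(1, 4)$ discussed later.

```python
# Combined cost of an arrangement o = [o_1, ..., o_l], where each partition
# is a list of (monetary_cost, execution_time) pairs. Experiments within a
# partition run in parallel; partitions run in serial.

def combined_cost(arrangement, lam=0.5):
    # Monetary cost is the sum over all experiments, regardless of grouping.
    monetary = sum(m for partition in arrangement for (m, _) in partition)
    # Time is the sum, over partitions, of the slowest experiment in each.
    time = sum(max(t for (_, t) in partition) for partition in arrangement)
    return lam * monetary + (1 - lam) * time

# Four experiments with (cost, time) = (1, 1), (1, 3), (1, 2), (1, 4):
serial = [[(1, 1)], [(1, 3)], [(1, 2)], [(1, 4)]]  # one at a time
paired = [[(1, 1), (1, 3)], [(1, 2), (1, 4)]]      # two at a time

print(combined_cost(serial))  # 0.5*4 + 0.5*(1+3+2+4) = 7.0
print(combined_cost(paired))  # 0.5*4 + 0.5*(3+4)     = 5.5
```

# Running at most two experiments in parallel reduces the combined cost from
# 7.0 to 5.5, since only the slower experiment of each pair contributes to the time term.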
Expand All @@ -82,7 +82,7 @@

# ## Synthetic Data Example

# We now present an example of finding cost-efficient designs using synthetic data using the `CEEDesigns.jl` package.
# We now present an example of finding cost-efficient designs for synthetic data using the `CEEDesigns.jl` package.
#
# First we load necessary packages.

Expand Down Expand Up @@ -139,13 +139,15 @@ plot_evals(

# We print the data frame showing each subset of experiments and its overall information value.

DataFrame(;
df_values = DataFrame(;
S = collect.(collect(keys(experiments_evals))),
value = collect(values(experiments_evals)),
)

sort(df_values, :value)

# Now we are ready to find the subsets of experiments giving an optimal tradeoff between information
# value and combined cost (where we use $\lambda=0.5$). CEED exports a function `efficient_designs`
# value and combined cost. CEED exports a function `efficient_designs`
# which formulates the problem of finding optimal arrangements as a Markov Decision Process and solves
# optimal arrangements for each subset on the Pareto frontier.
#
Expand All @@ -165,7 +167,7 @@ designs = efficient_designs(
# Finally we may produce a plot of the set of cost-efficient experimental designs. The set of designs
# is plotted along a Pareto frontier giving the tradeoff between information value (y-axis) and combined cost (x-axis).
# Note that because we set the maximum number of parallel experiments equal to 2, the efficient design for the complete set
# of experiments groups the experiments with long execution times together (see plot legend; each group/partition is
# of experiments groups the experiments with long execution times together (see plot legend; each group within a partition is
# prefixed with a number).

plot_front(designs; labels = make_labels(designs), ylabel = "loss")
16 changes: 9 additions & 7 deletions docs/src/tutorials/SimpleStatic.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,15 @@ EditURL = "SimpleStatic.jl"

# Static Experimental Designs

In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs",
In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs," i.e.,
arrangements of experiments that exist along a Pareto-optimal tradeoff between cost and information gain.
We also show an example with synthetic data.

## Setting

Consider the following scenario. There exists a set of experiments, each of which, when performed, yields
measurements on one or more observables (features). Each subset of observables (and therefore each subset of experiments)
has some "information value", which is intentionally vaguely defined for generality, but for example, may be
has some "information value," which is intentionally vaguely defined for generality, but for example, may be
a loss function if that subset is used to train some machine learning model. It is generally the value of acquiring that information.
Finally, each experiment has some monetary cost and execution time to perform the experiment, and
the user has some known tradeoff between overall execution time and cost.
Expand Down Expand Up @@ -64,7 +64,7 @@ the sum of the costs of each experiment:
$$m_{O_{S}}=\sum_{e\in S} m_{e}$$
The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
arrangement is done in serial, experiments within partitions are done in parallel.
$$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e} e \in o_{i}\}$$
$$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$$
Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
cost of an arrangement is:
$$\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$$
Expand All @@ -86,7 +86,7 @@ if the maximum number of parallel experiments does not divide $S$ evenly.

## Synthetic Data Example

We now present an example of finding cost-efficient designs using synthetic data using the `CEEDesigns.jl` package.
We now present an example of finding cost-efficient designs for synthetic data using the `CEEDesigns.jl` package.

First we load necessary packages.

Expand Down Expand Up @@ -153,14 +153,16 @@ plot_evals(
We print the data frame showing each subset of experiments and its overall information value.

````@example SimpleStatic
DataFrame(;
df_values = DataFrame(;
S = collect.(collect(keys(experiments_evals))),
value = collect(values(experiments_evals)),
)
sort(df_values, :value)
````

Now we are ready to find the subsets of experiments giving an optimal tradeoff between information
value and combined cost (where we use $\lambda=0.5$). CEED exports a function `efficient_designs`
value and combined cost. CEED exports a function `efficient_designs`
which formulates the problem of finding optimal arrangements as a Markov Decision Process and solves
optimal arrangements for each subset on the Pareto frontier.

Expand All @@ -183,7 +185,7 @@ nothing #hide
Finally we may produce a plot of the set of cost-efficient experimental designs. The set of designs
is plotted along a Pareto frontier giving the tradeoff between information value (y-axis) and combined cost (x-axis).
Note that because we set the maximum number of parallel experiments equal to 2, the efficient design for the complete set
of experiments groups the experiments with long execution times together (see plot legend; each group/partition is
of experiments groups the experiments with long execution times together (see plot legend; each group within a partition is
prefixed with a number).

````@example SimpleStatic
Expand Down
4 changes: 2 additions & 2 deletions readme.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ A decision-making framework for the cost-efficient design of experiments, balanc
<a><img src="docs/src/assets/front_static.png" align="right" alt="code" width="400"></a>

### Static experimental designs
Here we assume that the same experimental design will be used for a population of examined entities, hence the word 'static'.
Here we assume that the same experimental design will be used for a population of examined entities, hence the word "static".

For each subset of experiments, we consider an estimate of the value of acquired information. To give an example, if a set of experiments is used to predict the value of a specific target variable, our framework can leverage a built-in integration with [MLJ.jl](https://github.com/alan-turing-institute/MLJ.jl) to estimate predictive accuracies of machine learning models fitted over subsets of experimental features.

Expand All @@ -26,7 +26,7 @@ Assuming the information values and optimized experimental costs for each subset

We consider 'personalized' experimental designs that dynamically adjust based on the evidence gathered from the experiments. This approach is motivated by the fact that the value of information collected from an experiment generally differs across subpopulations of the entities involved in the triage process.

At the beginning of the triage process, an entity's prior data is used to project a range of cost-efficient experimental designs. Internally, while constructing these designs, we incorporate multiple-step-ahead lookups to model likely experimental outcomes and consider the subsequent decisions for each outcome. Then after choosing a specific decision policy from this set and acquiring additional experimental readouts (sampled from a generative model, hence the word 'generative'), we adjust the continuation based on this evidence.
At the beginning of the triage process, an entity's prior data is used to project a range of cost-efficient experimental designs. Internally, while constructing these designs, we incorporate multiple-step-ahead lookups to model likely experimental outcomes and consider the subsequent decisions for each outcome. Then after choosing a specific decision policy from this set and acquiring additional experimental readouts (sampled from a generative model, hence the word "generative"), we adjust the continuation based on this evidence.

<a><img src="docs/src/assets/search_tree.png" align="left" alt="code" width="400"></a>

Expand Down
2 changes: 1 addition & 1 deletion src/fronts.jl
Original file line number Diff line number Diff line change
Expand Up @@ -134,7 +134,7 @@ end
Create a stick plot that visualizes the performance measures evaluated for subsets of experiments.
Argument `evals` should be the output of [`evaluate_experiments`](@ref) and the kwarg `f` (if provided) is a function that
Argument `evals` should be the output of [`evaluate_experiments`](@ref CEEDesigns.StaticDesigns.evaluate_experiments) and the kwarg `f` (if provided) is a function that
should take as input `evals` and return a list of its keys in the order to be plotted on the x-axis.
By default they are sorted by length.
"""
Expand Down
32 changes: 17 additions & 15 deletions tutorials/SimpleStatic.jl
Original file line number Diff line number Diff line change
@@ -1,15 +1,15 @@
# # Static Experimental Designs

# In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs",
# In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs," i.e.,
# arrangements of experiments that exist along a Pareto-optimal tradeoff between cost and information gain.
# We also show an example with synthetic data.

# ## Setting

# Consider the following scenario. There exists a set of experiments, each of which, when performed, yields
# measurements on one or more observables (features). Each subset of observables (and therefore each subset of experiments)
# has some "information value", which is intentionally vaguely defined for generality, but for example, may be
# a loss function if that subset is used to train some machine learning model. It is generally the value of acquiring that information.
# has some "information value," which is intentionally vaguely defined for generality, but for example, may be
# a loss function if that subset is used to train some machine learning model. It is generally the value of acquiring that information.
# Finally, each experiment has some monetary cost and execution time to perform the experiment, and
# the user has some known tradeoff between overall execution time and cost.
#
Expand All @@ -19,8 +19,11 @@
# resources to a set of experiments that attain some acceptable level of information (or, conversely, reduce
# uncertainty below some level).
#
# The arrangements produced by the tools introduced in this tutorial are called "static" because they implicitly
# assume that future data will have exactly the information gain of each experiment as the "historical" input.
# The arrangements produced by the tools introduced in this tutorial are called "static" because they inherently
# assume that for each experiment, future data will deterministically yield the same information gain as the "historical" data did.
# This information gain from the "historical" data is quantified based on certain aggregate statistics.
#
# We can also consider "generative experimental designs," where the information gain is modeled as a random variable. This concept is detailed in another [tutorial](./SimpleGenerative.jl).
#
# This tutorial introduces the theoretical framework behind static experimental designs with synthetic data.
# For examples using real data, please see our other tutorials.
Expand Down Expand Up @@ -56,14 +59,11 @@
# ### Optimal Arrangements

# To find the optimal arrangement for each $S$ we need to know the cost of $O_{S}$. The monetary cost of $O_{S}$ is simply
# the sum of the costs of each experiment:
# $$m_{O_{S}}=\sum_{e\in S} m_{e}$$
# the sum of the costs of each experiment: $m_{O_{S}}=\sum_{e\in S} m_{e}$.
# The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
# arrangement is done in serial, experiments within partitions are done in parallel.
# $$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e} e \in o_{i}\}$$
# arrangement is done in serial, experiments within partitions are done in parallel. That is, $t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$
# Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
# cost of an arrangement is:
# $$\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$$
# cost of an arrangement is: $\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$.
#
# For instance, consider the experiments $S = \{e_{1},e_{2},e_{3},e_{4}\}$, with associated costs $(1, 1)$, $(1, 3)$, $(1, 2)$, and $(1, 4)$.
# If we conduct experiments $e_1$ through $e_4$ in sequence, this would correspond to an arrangement
Expand All @@ -82,7 +82,7 @@

# ## Synthetic Data Example

# We now present an example of finding cost-efficient designs using synthetic data using the `CEEDesigns.jl` package.
# We now present an example of finding cost-efficient designs for synthetic data using the `CEEDesigns.jl` package.
#
# First we load necessary packages.

Expand Down Expand Up @@ -139,13 +139,15 @@ plot_evals(

# We print the data frame showing each subset of experiments and its overall information value.

DataFrame(;
df_values = DataFrame(;
S = collect.(collect(keys(experiments_evals))),
value = collect(values(experiments_evals)),
)

sort(df_values, order(:value, rev=true))

# Now we are ready to find the subsets of experiments giving an optimal tradeoff between information
# value and combined cost (where we use $\lambda=0.5$). CEED exports a function `efficient_designs`
# value and combined cost. CEED exports a function `efficient_designs`
# which formulates the problem of finding optimal arrangements as a Markov Decision Process and solves
# optimal arrangements for each subset on the Pareto frontier.
#
Expand All @@ -165,7 +167,7 @@ designs = efficient_designs(
# Finally we may produce a plot of the set of cost-efficient experimental designs. The set of designs
# is plotted along a Pareto frontier giving the tradeoff between information value (y-axis) and combined cost (x-axis).
# Note that because we set the maximum number of parallel experiments equal to 2, the efficient design for the complete set
# of experiments groups the experiments with long execution times together (see plot legend; each group/partition is
# of experiments groups the experiments with long execution times together (see plot legend; each group within a partition is
# prefixed with a number).

plot_front(designs; labels = make_labels(designs), ylabel = "loss")
