Added minor adjustments to SimpleStatic
thevolatilebit committed Dec 1, 2023
1 parent 6bbb021 commit 7cd8673
Showing 6 changed files with 41 additions and 35 deletions.
6 changes: 3 additions & 3 deletions docs/src/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,11 +7,11 @@ A decision-making framework for the cost-efficient design of experiments, balanc
```

## Static experimental designs
Here we assume that the same experimental design will be used for a population of examined entities, hence the word 'static'.
Here we assume that the same experimental design will be used for a population of examined entities, hence the word "static".

For each subset of experiments, we consider an estimate of the value of acquired information. To give an example, if a set of experiments is used to predict the value of a specific target variable, our framework can leverage a built-in integration with [MLJ.jl](https://github.com/alan-turing-institute/MLJ.jl) to estimate predictive accuracies of machine learning models fitted over subsets of experimental features.

In the cost-sensitive setting of CEEDesigns, a user provides the monetary cost and execution time of each experiment. Given the constraint on the maximum number of parallel experiments along with a fixed tradeoff between monetary cost and execution time, we devise an arrangement of each subset of experiments such that the expected combined cost is minimized.
In the cost-sensitive setting of `CEEDesigns.jl`, a user provides the monetary cost and execution time of each experiment. Given the constraint on the maximum number of parallel experiments along with a fixed tradeoff between monetary cost and execution time, we devise an arrangement of each subset of experiments such that the expected combined cost is minimized.

Assuming the information values and optimized experimental costs for each subset of experiments, we then generate a set of cost-efficient experimental designs.

Expand All @@ -23,7 +23,7 @@ Assuming the information values and optimized experimental costs for each subset

We consider 'personalized' experimental designs that dynamically adjust based on the evidence gathered from the experiments. This approach is motivated by the fact that the value of information collected from an experiment generally differs across subpopulations of the entities involved in the triage process.

At the beginning of the triage process, an entity's prior data is used to project a range of cost-efficient experimental designs. Internally, while constructing these designs, we incorporate multiple-step-ahead lookups to model likely experimental outcomes and consider the subsequent decisions for each outcome. Then after choosing a specific decision policy from this set and acquiring additional experimental readouts (sampled from a generative model, hence the word 'generative'), we adjust the continuation based on this evidence.
At the beginning of the triage process, an entity's prior data is used to project a range of cost-efficient experimental designs. Internally, while constructing these designs, we incorporate multiple-step-ahead lookups to model likely experimental outcomes and consider the subsequent decisions for each outcome. Then after choosing a specific decision policy from this set and acquiring additional experimental readouts (sampled from a generative model, hence the word "generative"), we adjust the continuation based on this evidence.

```@raw html
<a><img src="assets/search_tree.png" align="left" alt="code" width="400"></a>
Expand Down
16 changes: 9 additions & 7 deletions docs/src/tutorials/SimpleStatic.jl
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
# # Static Experimental Designs

# In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs",
# In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs," i.e.,
# arrangements of experiments that exist along a Pareto-optimal tradeoff between cost and information gain.
# We also show an example with synthetic data.

# ## Setting

# Consider the following scenario. There exists a set of experiments, each of which, when performed, yields
# measurements on one or more observables (features). Each subset of observables (and therefore each subset of experiments)
# has some "information value", which is intentionally vaguely defined for generality, but for example, may be
# has some "information value," which is intentionally vaguely defined for generality, but for example, may be
# a loss function if that subset is used to train some machine learning model. It is generally the value of acquiring that information.
# Finally, each experiment has some monetary cost and execution time to perform the experiment, and
# the user has some known tradeoff between overall execution time and cost.
Expand Down Expand Up @@ -60,7 +60,7 @@
# $$m_{O_{S}}=\sum_{e\in S} m_{e}$$
# The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
# arrangement is done in serial, experiments within partitions are done in parallel.
# $$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e} e \in o_{i}\}$$
# $$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$$
# Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
# cost of an arrangement is:
# $$\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$$
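# To make the cost formulas above concrete, here is a small illustrative sketch
# (in Python, for exposition only; it is not part of the `CEEDesigns.jl` API).
# It evaluates the combined cost $\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$ of an
# arrangement given per-experiment (monetary cost, execution time) pairs, using
# the four experiments with costs $(1, 1)$, $(1, 3)$, $(1, 2)$, $(1, 4)$ discussed later.

```python
# Combined cost of an arrangement o = [o_1, ..., o_l], where each partition
# is a list of (monetary_cost, execution_time) pairs. Experiments within a
# partition run in parallel; partitions run in serial.

def combined_cost(arrangement, lam=0.5):
    # Monetary cost is the sum over all experiments, regardless of grouping.
    monetary = sum(m for partition in arrangement for (m, _) in partition)
    # Time is the sum, over partitions, of the slowest experiment in each.
    time = sum(max(t for (_, t) in partition) for partition in arrangement)
    return lam * monetary + (1 - lam) * time

# Four experiments with (cost, time) = (1, 1), (1, 3), (1, 2), (1, 4):
serial = [[(1, 1)], [(1, 3)], [(1, 2)], [(1, 4)]]  # one at a time
paired = [[(1, 1), (1, 3)], [(1, 2), (1, 4)]]      # two at a time

print(combined_cost(serial))  # 0.5*4 + 0.5*(1+3+2+4) = 7.0
print(combined_cost(paired))  # 0.5*4 + 0.5*(3+4)     = 5.5
```

# Running at most two experiments in parallel reduces the combined cost from
# 7.0 to 5.5, since only the slower experiment of each pair contributes to the time term.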
Expand All @@ -82,7 +82,7 @@

# ## Synthetic Data Example

# We now present an example of finding cost-efficient designs using synthetic data using the `CEEDesigns.jl` package.
# We now present an example of finding cost-efficient designs for synthetic data using the `CEEDesigns.jl` package.
#
# First we load necessary packages.

Expand Down Expand Up @@ -139,13 +139,15 @@ plot_evals(

# We print the data frame showing each subset of experiments and its overall information value.

DataFrame(;
df_values = DataFrame(;
S = collect.(collect(keys(experiments_evals))),
value = collect(values(experiments_evals)),
)

sort(df_values, :value)

# Now we are ready to find the subsets of experiments giving an optimal tradeoff between information
# value and combined cost (where we use $\lambda=0.5$). CEED exports a function `efficient_designs`
# value and combined cost. CEED exports a function `efficient_designs`
# which formulates the problem of finding optimal arrangements as a Markov Decision Process and solves
# optimal arrangements for each subset on the Pareto frontier.
#
Expand All @@ -165,7 +167,7 @@ designs = efficient_designs(
# Finally we may produce a plot of the set of cost-efficient experimental designs. The set of designs
# is plotted along a Pareto frontier giving the tradeoff between information value (y-axis) and combined cost (x-axis).
# Note that because we set the maximum number of parallel experiments equal to 2, the efficient design for the complete set
# of experiments groups the experiments with long execution times together (see plot legend; each group/partition is
# of experiments groups the experiments with long execution times together (see plot legend; each group within a partition is
# prefixed with a number).

plot_front(designs; labels = make_labels(designs), ylabel = "loss")
16 changes: 9 additions & 7 deletions docs/src/tutorials/SimpleStatic.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,15 @@ EditURL = "SimpleStatic.jl"

# Static Experimental Designs

In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs",
In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs," i.e.,
arrangements of experiments that exist along a Pareto-optimal tradeoff between cost and information gain.
We also show an example with synthetic data.

## Setting

Consider the following scenario. There exists a set of experiments, each of which, when performed, yields
measurements on one or more observables (features). Each subset of observables (and therefore each subset of experiments)
has some "information value", which is intentionally vaguely defined for generality, but for example, may be
has some "information value," which is intentionally vaguely defined for generality, but for example, may be
a loss function if that subset is used to train some machine learning model. It is generally the value of acquiring that information.
Finally, each experiment has some monetary cost and execution time to perform the experiment, and
the user has some known tradeoff between overall execution time and cost.
Expand Down Expand Up @@ -64,7 +64,7 @@ the sum of the costs of each experiment:
$$m_{O_{S}}=\sum_{e\in S} m_{e}$$
The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
arrangement is done in serial, experiments within partitions are done in parallel.
$$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e} e \in o_{i}\}$$
$$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$$
Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
cost of an arrangement is:
$$\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$$
Expand All @@ -86,7 +86,7 @@ if the maximum number of parallel experiments does not divide $S$ evenly.

## Synthetic Data Example

We now present an example of finding cost-efficient designs using synthetic data using the `CEEDesigns.jl` package.
We now present an example of finding cost-efficient designs for synthetic data using the `CEEDesigns.jl` package.

First we load necessary packages.

Expand Down Expand Up @@ -153,14 +153,16 @@ plot_evals(
We print the data frame showing each subset of experiments and its overall information value.

````@example SimpleStatic
DataFrame(;
df_values = DataFrame(;
S = collect.(collect(keys(experiments_evals))),
value = collect(values(experiments_evals)),
)
sort(df_values, :value)
````

Now we are ready to find the subsets of experiments giving an optimal tradeoff between information
value and combined cost (where we use $\lambda=0.5$). CEED exports a function `efficient_designs`
value and combined cost. CEED exports a function `efficient_designs`
which formulates the problem of finding optimal arrangements as a Markov Decision Process and solves
optimal arrangements for each subset on the Pareto frontier.

Expand All @@ -183,7 +185,7 @@ nothing #hide
Finally we may produce a plot of the set of cost-efficient experimental designs. The set of designs
is plotted along a Pareto frontier giving the tradeoff between information value (y-axis) and combined cost (x-axis).
Note that because we set the maximum number of parallel experiments equal to 2, the efficient design for the complete set
of experiments groups the experiments with long execution times together (see plot legend; each group/partition is
of experiments groups the experiments with long execution times together (see plot legend; each group within a partition is
prefixed with a number).

````@example SimpleStatic
Expand Down
4 changes: 2 additions & 2 deletions readme.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ A decision-making framework for the cost-efficient design of experiments, balanc
<a><img src="docs/src/assets/front_static.png" align="right" alt="code" width="400"></a>

### Static experimental designs
Here we assume that the same experimental design will be used for a population of examined entities, hence the word 'static'.
Here we assume that the same experimental design will be used for a population of examined entities, hence the word "static".

For each subset of experiments, we consider an estimate of the value of acquired information. To give an example, if a set of experiments is used to predict the value of a specific target variable, our framework can leverage a built-in integration with [MLJ.jl](https://github.com/alan-turing-institute/MLJ.jl) to estimate predictive accuracies of machine learning models fitted over subsets of experimental features.

Expand All @@ -26,7 +26,7 @@ Assuming the information values and optimized experimental costs for each subset

We consider 'personalized' experimental designs that dynamically adjust based on the evidence gathered from the experiments. This approach is motivated by the fact that the value of information collected from an experiment generally differs across subpopulations of the entities involved in the triage process.

At the beginning of the triage process, an entity's prior data is used to project a range of cost-efficient experimental designs. Internally, while constructing these designs, we incorporate multiple-step-ahead lookups to model likely experimental outcomes and consider the subsequent decisions for each outcome. Then after choosing a specific decision policy from this set and acquiring additional experimental readouts (sampled from a generative model, hence the word 'generative'), we adjust the continuation based on this evidence.
At the beginning of the triage process, an entity's prior data is used to project a range of cost-efficient experimental designs. Internally, while constructing these designs, we incorporate multiple-step-ahead lookups to model likely experimental outcomes and consider the subsequent decisions for each outcome. Then after choosing a specific decision policy from this set and acquiring additional experimental readouts (sampled from a generative model, hence the word "generative"), we adjust the continuation based on this evidence.

<a><img src="docs/src/assets/search_tree.png" align="left" alt="code" width="400"></a>

Expand Down
2 changes: 1 addition & 1 deletion src/fronts.jl
Original file line number Diff line number Diff line change
Expand Up @@ -134,7 +134,7 @@ end
Create a stick plot that visualizes the performance measures evaluated for subsets of experiments.
Argument `evals` should be the output of [`evaluate_experiments`](@ref) and the kwarg `f` (if provided) is a function that
Argument `evals` should be the output of [`evaluate_experiments`](@ref CEEDesigns.StaticDesigns.evaluate_experiments) and the kwarg `f` (if provided) is a function that
should take as input `evals` and return a list of its keys in the order to be plotted on the x-axis.
By default they are sorted by length.
"""
Expand Down
32 changes: 17 additions & 15 deletions tutorials/SimpleStatic.jl
Original file line number Diff line number Diff line change
@@ -1,15 +1,15 @@
# # Static Experimental Designs

# In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs",
# In this document we describe the theoretical background behind the tools in `CEEDesigns.jl` for producing optimal "static experimental designs," i.e.,
# arrangements of experiments that exist along a Pareto-optimal tradeoff between cost and information gain.
# We also show an example with synthetic data.

# ## Setting

# Consider the following scenario. There exists a set of experiments, each of which, when performed, yields
# measurements on one or more observables (features). Each subset of observables (and therefore each subset of experiments)
# has some "information value", which is intentionally vaguely defined for generality, but for example, may be
# a loss function if that subset is used to train some machine learning model. It is generally the value of acquiring that information.
# has some "information value," which is intentionally vaguely defined for generality, but for example, may be
# a loss function if that subset is used to train some machine learning model. It is generally the value of acquiring that information.
# Finally, each experiment has some monetary cost and execution time to perform the experiment, and
# the user has some known tradeoff between overall execution time and cost.
#
Expand All @@ -19,8 +19,11 @@
# resources to a set of experiments that attain some acceptable level of information (or, conversely, reduce
# uncertainty below some level).
#
# The arrangements produced by the tools introduced in this tutorial are called "static" because they implicitly
# assume that future data will have exactly the information gain of each experiment as the "historical" input.
# The arrangements produced by the tools introduced in this tutorial are called "static" because they inherently
# assume that for each experiment, future data will deterministically yield the same information gain as the "historical" data did.
# This information gain from the "historical" data is quantified based on certain aggregate statistics.
#
# We can also consider "generative experimental designs," where the information gain is modeled as a random variable. This concept is detailed in another [tutorial](./SimpleGenerative.jl).
#
# This tutorial introduces the theoretical framework behind static experimental designs with synthetic data.
# For examples using real data, please see our other tutorials.
Expand Down Expand Up @@ -56,14 +59,11 @@
# ### Optimal Arrangements

# To find the optimal arrangement for each $S$ we need to know the cost of $O_{S}$. The monetary cost of $O_{S}$ is simply
# the sum of the costs of each experiment:
# $$m_{O_{S}}=\sum_{e\in S} m_{e}$$
# the sum of the costs of each experiment: $m_{O_{S}}=\sum_{e\in S} m_{e}$.
# The total time required is the sum of the maximum time *of each partition*. This is because while each partition in the
# arrangement is done in serial, experiments within partitions are done in parallel.
# $$t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e} e \in o_{i}\}$$
# arrangement is done in serial, experiments within partitions are done in parallel. That is, $t_{O_{S}}=\sum_{i=1}^{l} \text{max} \{ t_{e}: e \in o_{i}\}$
# Given these costs and a parameter $\lambda$ which controls the tradeoff between monetary cost and time, the combined
# cost of an arrangement is:
# $$\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$$
# cost of an arrangement is: $\lambda m_{O_{S}} + (1-\lambda) t_{O_{S}}$.
#
# For instance, consider the experiments $S = \{e_{1},e_{2},e_{3},e_{4}\}$, with associated costs $(1, 1)$, $(1, 3)$, $(1, 2)$, and $(1, 4)$.
# If we conduct experiments $e_1$ through $e_4$ in sequence, this would correspond to an arrangement
Expand All @@ -82,7 +82,7 @@

# ## Synthetic Data Example

# We now present an example of finding cost-efficient designs using synthetic data using the `CEEDesigns.jl` package.
# We now present an example of finding cost-efficient designs for synthetic data using the `CEEDesigns.jl` package.
#
# First we load necessary packages.

Expand Down Expand Up @@ -139,13 +139,15 @@ plot_evals(

# We print the data frame showing each subset of experiments and its overall information value.

DataFrame(;
df_values = DataFrame(;
S = collect.(collect(keys(experiments_evals))),
value = collect(values(experiments_evals)),
)

sort(df_values, order(:value, rev=true))

# Now we are ready to find the subsets of experiments giving an optimal tradeoff between information
# value and combined cost (where we use $\lambda=0.5$). CEED exports a function `efficient_designs`
# value and combined cost. CEED exports a function `efficient_designs`
# which formulates the problem of finding optimal arrangements as a Markov Decision Process and solves
# optimal arrangements for each subset on the Pareto frontier.
#
Expand All @@ -165,7 +167,7 @@ designs = efficient_designs(
# Finally we may produce a plot of the set of cost-efficient experimental designs. The set of designs
# is plotted along a Pareto frontier giving the tradeoff between information value (y-axis) and combined cost (x-axis).
# Note that because we set the maximum number of parallel experiments equal to 2, the efficient design for the complete set
# of experiments groups the experiments with long execution times together (see plot legend; each group/partition is
# of experiments groups the experiments with long execution times together (see plot legend; each group within a partition is
# prefixed with a number).

plot_front(designs; labels = make_labels(designs), ylabel = "loss")
