Commit

edits based on comments
nhejazi committed May 20, 2021
1 parent 7de47e1 commit 3df7d2d
Showing 2 changed files with 40 additions and 27 deletions.
67 changes: 40 additions & 27 deletions 02-roadmap.Rmd

## The Roadmap {#roadmap}

The roadmap is a five-stage process of defining the following.

1. Data as a random variable with a probability distribution, $O \sim P_0$.
2. The statistical model $\M$ such that $P_0 \in \M$.

### (1) Data: A random variable with a probability distribution, $O \sim P_0$ {-}

The data set we are confronted with is the collection of the results of an
experiment, and we can view the data as a _random variable_ -- that is, if we
were to repeat the experiment, we would have a different realization of the data
generated by the experiment in question. In particular, if the experiment were
repeated many times, the probability distribution generating the data, $P_0$,
could be learned. So, the observed data on a single unit, $O$, may be thought of
as being drawn from a probability distribution $P_0$. Most often, we observe $n$
independent identically distributed (i.i.d.) observations of the random variable
$O$, so the observed data is the collection $O_1, \ldots, O_n$, where the
subscripts denote the individual observational units. While not all data are
i.i.d., this is certainly the most common case in applied data analysis;
moreover, there are a number of techniques for handling non-i.i.d. data, such as
establishing conditional independence, stratifying data to create distinct sets
of identically distributed data, and inferential corrections for repeated or
clustered observations, to name but a few.
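The idea that repeating the experiment yields a different realization of $O_1, \ldots, O_n$, all governed by the same $P_0$, can be sketched in a short simulation. This is an illustrative sketch only, not part of the chapter: Python is used for brevity (the book's own examples are in R), and the choice of a Bernoulli(0.3) distribution for $P_0$, as well as all names, are hypothetical.

```python
# Hypothetical illustration: each call to run_experiment() is one
# realization of the i.i.d. data O_1, ..., O_n, with P_0 taken to be
# a Bernoulli(0.3) distribution purely for the sake of the sketch.
import random

def run_experiment(n, p=0.3, seed=None):
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

draws_a = run_experiment(n=100, seed=1)  # one realization of the experiment
draws_b = run_experiment(n=100, seed=2)  # repeating it gives a different one

# Averaging over many repetitions recovers the mean of P_0, here p = 0.3.
means = [sum(run_experiment(n=1000, seed=s)) / 1000 for s in range(50)]
```

The two realizations differ observation by observation, yet summaries computed across many repetitions stabilize around features of $P_0$, which is exactly the sense in which repeated experimentation would let us learn the data-generating distribution.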

It is crucial that the domain scientist (i.e., researcher) have absolute clarity
about what is actually known about the data-generating distribution for a given
problem of interest. Just as critical is that this scientific information be
communicated to the statistician, whose job it is to use such knowledge to guide
any assumptions encoded in the choice of statistical model. Unfortunately,
communication between statisticians and researchers is often fraught with
misinterpretation. The roadmap provides a mechanism by which to ensure clear
communication between the researcher and the statistician -- it is an invaluable
tool for such communication!

#### The empirical probability measure, $P_n$ {-}

With $n$ i.i.d. observations in hand, we can define an empirical probability
measure, $P_n$. The empirical probability measure is an approximation of the
true probability measure, $P_0$, allowing us to learn from the observed data.
For example, we can define the empirical probability measure of a set $X$ to be
the proportion of observations that belong to $X$. That is,
\begin{equation*}
P_n(X) = \frac{1}{n}\sum_{i=1}^{n} \I(O_i \in X)
\end{equation*}
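This definition translates directly into code: $P_n(X)$ is simply the proportion of the observed $O_i$ that fall in $X$. The sketch below is illustrative only (Python rather than the book's R; the simulated Uniform(0, 1) data and the particular set $X = [0.25, 0.75)$ are hypothetical choices).

```python
# Hypothetical sketch of the empirical measure P_n on simulated data.
import random

rng = random.Random(0)
observations = [rng.random() for _ in range(1000)]  # O_1, ..., O_n ~ Uniform(0, 1)

def P_n(in_X, obs):
    """P_n(X) = (1/n) * sum_i I(O_i in X), with X given by its indicator function."""
    return sum(1 for o in obs if in_X(o)) / len(obs)

# For X = [0.25, 0.75), the true P_0(X) under Uniform(0, 1) is 0.5,
# so P_n(X) should be close to 0.5 for large n.
prob = P_n(lambda o: 0.25 <= o < 0.75, observations)
```

As $n$ grows, $P_n(X)$ concentrates around $P_0(X)$, which is the sense in which the empirical measure lets us learn from the observed data.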

In order to start learning from the data, we next need to ask *"What do we know
about the probability distribution of the data?"* This brings us to Step 2.

### (2) Defining the statistical model $\M$ such that $P_0 \in \M$ {-}

The statistical model $\M$ is defined by the question we asked at the end of
Step 1. It is defined as the set of possible probability distributions that
could describe our observed data, appropriately constrained by background
scientific knowledge. Often $\M$ is very large (e.g., nonparametric
or infinite-dimensional), reflecting the fact that statistical knowledge about
the data-generating process is limited.
In the case that $\M$ is
infinite-dimensional, we deem this a nonparametric statistical model.
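One way to make the contrast concrete: a parametric model indexes its candidate distributions by a finite-dimensional parameter (for instance, the Normal family indexed by $(\mu, \sigma)$), whereas a nonparametric quantity such as the empirical CDF makes no such shape assumption. The sketch below is illustrative only, not from the chapter; it is written in Python with hypothetical names.

```python
# Hypothetical contrast: a parametric candidate CDF vs. the
# nonparametric empirical CDF computed straight from the data.
import math
import random

def normal_cdf(x, mu, sigma):
    # CDF of one member of the parametric model {N(mu, sigma^2) : mu, sigma}
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def empirical_cdf(x, obs):
    # Nonparametric: P_n((-inf, x]), with no assumption on the shape of P_0
    return sum(1 for o in obs if o <= x) / len(obs)

rng = random.Random(0)
obs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# When P_0 happens to lie in the parametric model, the two agree closely;
# the empirical CDF, however, never needed that assumption to be valid.
```

The point of a large (nonparametric) $\M$ is precisely that the empirical-CDF-style reasoning remains valid even when no finite-dimensional family can be assumed to contain $P_0$.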

Alternatively, if the probability distribution of the data at hand is described
Binary file removed img/.DS_Store
