diff --git a/02-roadmap.Rmd b/02-roadmap.Rmd
index 3800dfa..8432076 100644
--- a/02-roadmap.Rmd
+++ b/02-roadmap.Rmd
@@ -24,7 +24,7 @@ estimator.
 
 ## The Roadmap {#roadmap}
 
-Following the roadmap is a process of five stages.
+The roadmap is a five-stage process of defining the following.
 
 1. Data as a random variable with a probability distribution, $O \sim P_0$.
 2. The statistical model $\M$ such that $P_0 \in \M$.
@@ -34,42 +34,55 @@ Following the roadmap is a process of five stages.
 ### (1) Data: A random variable with a probability distribution, $O \sim P_0$ {-}
 
-The data set we're confronted with is the result of an experiment and we can
-view the data as a random variable, $O$, because if we repeat the experiment
-we would have a different realization of this experiment. In particular, if we
-repeat the experiment many times we could learn the probability distribution,
-$P_0$, of our data. So, the observed data $O$ with probability distribution
-$P_0$ are $n$ independent identically distributed (i.i.d.) observations of the
-random variable $O; O_1, \ldots, O_n$. Note that while not all data are i.i.d.,
-there are ways to handle non-i.i.d. data, such as establishing conditional
-independence, stratifying data to create sets of identically distributed data,
-etc. It is crucial that researchers be absolutely clear about what they actually
-know about the data-generating distribution for a given problem of interest.
-Unfortunately, communication between statisticians and researchers is often
-fraught with misinterpretation. The roadmap provides a mechanism by which to
-ensure clear communication between research and statistician -- it truly helps
-with this communication!
+The data set we are confronted with is the collection of the results of an
+experiment, and we can view the data as a _random variable_ -- that is, if we
+were to repeat the experiment, we would have a different realization of the data
+generated by the experiment in question. In particular, if the experiment were
+repeated many times, the probability distribution generating the data, $P_0$,
+could be learned. So, the observed data on a single unit, $O$, may be thought of
+as being drawn from a probability distribution $P_0$. Most often, we observe $n$
+independent identically distributed (i.i.d.) observations of the random variable
+$O$, so the observed data is the collection $O_1, \ldots, O_n$, where the
+subscripts denote the individual observational units. While not all data are
+i.i.d., this is certainly the most common case in applied data analysis;
+moreover, there are a number of techniques for handling non-i.i.d. data, such as
+establishing conditional independence, stratifying data to create distinct sets
+of identically distributed data, and inferential corrections for repeated or
+clustered observations, to name but a few.
+
+It is crucial that the domain scientist (i.e., researcher) have absolute clarity
+about what is actually known about the data-generating distribution for a given
+problem of interest. Just as critical is that this scientific information be
+communicated to the statistician, whose job it is to use such knowledge to guide
+any assumptions encoded in the choice of statistical model. Unfortunately,
+communication between statisticians and researchers is often fraught with
+misinterpretation. The roadmap provides a mechanism by which to ensure clear
+communication between the researcher and the statistician -- it is an invaluable
+tool for such communication!
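+
+To make the idea of data as a random variable concrete, consider the following
+minimal simulation sketch, in which we hypothetically take $P_0$ to be a
+standard normal distribution (purely for illustration):
+
+```{r data-as-random-variable}
+# Purely for illustration, suppose P_0 is the standard normal distribution;
+# a single run of the experiment then yields n i.i.d. draws O_1, ..., O_n.
+experiment <- function(n) rnorm(n)
+o_run1 <- experiment(5)  # one realization of the data
+o_run2 <- experiment(5)  # repeating the experiment gives a different one
+rbind(o_run1, o_run2)
+```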
 
 #### The empirical probability measure, $P_n$ {-}
 
-Once we have $n$ of such i.i.d. observations we have an empirical probability
+With $n$ i.i.d. observations in hand, we can define an empirical probability
 measure, $P_n$. The empirical probability measure is an approximation of the
-true probability measure $P_0$, allowing us to learn from our data. For
-example, we can define the empirical probability measure of a set, $A$, to be
-the proportion of observations which end up in $A$. That is,
+true probability measure, $P_0$, allowing us to learn from the observed data.
+For example, we can define the empirical probability measure of a set $X$ to be
+the proportion of observations that belong to $X$. That is,
 
 \begin{equation*}
-  P_n(A) = \frac{1}{n}\sum_{i=1}^{n} \I(O_i \in A)
+  P_n(X) = \frac{1}{n}\sum_{i=1}^{n} \I(O_i \in X)
 \end{equation*}
 
-In order to start learning something, we need to ask *"What do we know about the
-probability distribution of the data?"* This brings us to Step 2.
+In order to start learning from the data, we next need to ask *"What do we know
+about the probability distribution of the data?"* This brings us to Step 2.
 
-### (2) The statistical model $\M$ such that $P_0 \in \M$ {-}
+### (2) Defining the statistical model $\M$ such that $P_0 \in \M$ {-}
 
 The statistical model $\M$ is defined by the question we asked at the end of
-Step 1. It is defined as the set of possible probability distributions for our
-observed data. Often $\M$ is very large (possibly infinite-dimensional), to
-reflect the fact that statistical knowledge is limited. In the case that $\M$ is
+Step 1. It is defined as the set of possible probability distributions that
+could describe our observed data, appropriately constrained by background
+scientific knowledge. Often $\M$ is very large (e.g., nonparametric
+or infinite-dimensional), reflecting the fact that statistical knowledge about
+the data-generating process is limited.
+In the case that $\M$ is
 infinite-dimensional, we deem this a nonparametric statistical model.
 
 Alternatively, if the probability distribution of the data at hand is described
diff --git a/img/.DS_Store b/img/.DS_Store
deleted file mode 100644
index a7ff58d..0000000
Binary files a/img/.DS_Store and /dev/null differ
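
As a concrete illustration of the empirical probability measure defined above,
consider a minimal R sketch. Here we hypothetically take $P_0$ to be the
standard normal distribution and the set of interest to be the interval
$(-1, 1]$, purely for illustration, so that both the empirical and the true
probabilities of the set are available for comparison:

```{r empirical-measure-sketch}
# Hypothetical setup: P_0 taken to be standard normal; set X = (-1, 1].
set.seed(34172)
n <- 1000
o <- rnorm(n)                    # O_1, ..., O_n, i.i.d. draws from P_0
p_n <- mean(o > -1 & o <= 1)     # P_n(X): proportion of observations in X
p_0 <- pnorm(1) - pnorm(-1)      # true P_0(X), known here by construction
c(empirical = p_n, truth = p_0)
```

By the law of large numbers, the empirical probability of the set converges to
its true probability under $P_0$ as $n$ grows, which is precisely what allows
us to learn about $P_0$ from the observed data.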