Commit

edits based on comments
nhejazi committed May 20, 2021
1 parent 7de47e1 commit 3df7d2d
Showing 2 changed files with 40 additions and 27 deletions.
67 changes: 40 additions & 27 deletions 02-roadmap.Rmd

## The Roadmap {#roadmap}

The roadmap is a five-stage process of defining the following.

1. Data as a random variable with a probability distribution, $O \sim P_0$.
2. The statistical model $\M$ such that $P_0 \in \M$.

### (1) Data: A random variable with a probability distribution, $O \sim P_0$ {-}

The data set we are confronted with is the collection of the results of an
experiment, and we can view the data as a _random variable_ -- that is, if we
were to repeat the experiment, we would have a different realization of the data
generated by the experiment in question. In particular, if the experiment were
repeated many times, the probability distribution generating the data, $P_0$,
could be learned. So, the observed data on a single unit, $O$, may be thought of
as being drawn from a probability distribution $P_0$. Most often, we observe $n$
independent identically distributed (i.i.d.) observations of the random variable
$O$, so the observed data is the collection $O_1, \ldots, O_n$, where the
subscripts denote the individual observational units. While not all data are
i.i.d., this is certainly the most common case in applied data analysis;
moreover, there are a number of techniques for handling non-i.i.d. data, such as
establishing conditional independence, stratifying data to create distinct sets
of identically distributed data, and inferential corrections for repeated or
clustered observations, to name but a few.
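The idea that repeating the experiment yields a different realization of $O_1, \ldots, O_n$, all governed by the same $P_0$, can be sketched in a short simulation. This is an illustrative sketch only, not part of the chapter: Python is used for brevity (the book's own examples are in R), and the choice of a Bernoulli(0.3) distribution for $P_0$, as well as all names, are hypothetical.

```python
# Hypothetical illustration: each call to run_experiment() is one
# realization of the i.i.d. data O_1, ..., O_n, with P_0 taken to be
# a Bernoulli(0.3) distribution purely for the sake of the sketch.
import random

def run_experiment(n, p=0.3, seed=None):
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

draws_a = run_experiment(n=100, seed=1)  # one realization of the experiment
draws_b = run_experiment(n=100, seed=2)  # repeating it gives a different one

# Averaging over many repetitions recovers the mean of P_0, here p = 0.3.
means = [sum(run_experiment(n=1000, seed=s)) / 1000 for s in range(50)]
```

The two realizations differ observation by observation, yet summaries computed across many repetitions stabilize around features of $P_0$, which is exactly the sense in which repeated experimentation would let us learn the data-generating distribution.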

It is crucial that the domain scientist (i.e., researcher) have absolute clarity
about what is actually known about the data-generating distribution for a given
problem of interest. Just as critical is that this scientific information be
communicated to the statistician, whose job it is to use such knowledge to guide
any assumptions encoded in the choice of statistical model. Unfortunately,
communication between statisticians and researchers is often fraught with
misinterpretation. The roadmap provides a mechanism by which to ensure clear
communication between the researcher and the statistician -- it is an invaluable
tool for such communication!

#### The empirical probability measure, $P_n$ {-}

With $n$ i.i.d. observations in hand, we can define an empirical probability
measure, $P_n$. The empirical probability measure is an approximation of the
true probability measure, $P_0$, allowing us to learn from the observed data.
For example, we can define the empirical probability measure of a set $X$ to be
the proportion of observations that belong to $X$. That is,
\begin{equation*}
P_n(X) = \frac{1}{n}\sum_{i=1}^{n} \I(O_i \in X)
\end{equation*}
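This definition translates directly into code: $P_n(X)$ is simply the proportion of the observed $O_i$ that fall in $X$. The sketch below is illustrative only (Python rather than the book's R; the simulated Uniform(0, 1) data and the particular set $X = [0.25, 0.75)$ are hypothetical choices).

```python
# Hypothetical sketch of the empirical measure P_n on simulated data.
import random

rng = random.Random(0)
observations = [rng.random() for _ in range(1000)]  # O_1, ..., O_n ~ Uniform(0, 1)

def P_n(in_X, obs):
    """P_n(X) = (1/n) * sum_i I(O_i in X), with X given by its indicator function."""
    return sum(1 for o in obs if in_X(o)) / len(obs)

# For X = [0.25, 0.75), the true P_0(X) under Uniform(0, 1) is 0.5,
# so P_n(X) should be close to 0.5 for large n.
prob = P_n(lambda o: 0.25 <= o < 0.75, observations)
```

As $n$ grows, $P_n(X)$ concentrates around $P_0(X)$, which is the sense in which the empirical measure lets us learn from the observed data.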

In order to start learning from the data, we next need to ask *"What do we know
about the probability distribution of the data?"* This brings us to Step 2.

### (2) Defining the statistical model $\M$ such that $P_0 \in \M$ {-}

The statistical model $\M$ is defined by the question we asked at the end of
Step 1. It is defined as the set of possible probability distributions that
could describe our observed data, appropriately constrained by background
scientific knowledge. Often $\M$ is very large (e.g., nonparametric
or infinite-dimensional), reflecting the fact that statistical knowledge about
the data-generating process is limited.
In the case that $\M$ is
infinite-dimensional, we deem this a nonparametric statistical model.
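One way to make the contrast concrete: a parametric model indexes its candidate distributions by a finite-dimensional parameter (for instance, the Normal family indexed by $(\mu, \sigma)$), whereas a nonparametric quantity such as the empirical CDF makes no such shape assumption. The sketch below is illustrative only, not from the chapter; it is written in Python with hypothetical names.

```python
# Hypothetical contrast: a parametric candidate CDF vs. the
# nonparametric empirical CDF computed straight from the data.
import math
import random

def normal_cdf(x, mu, sigma):
    # CDF of one member of the parametric model {N(mu, sigma^2) : mu, sigma}
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def empirical_cdf(x, obs):
    # Nonparametric: P_n((-inf, x]), with no assumption on the shape of P_0
    return sum(1 for o in obs if o <= x) / len(obs)

rng = random.Random(0)
obs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# When P_0 happens to lie in the parametric model, the two agree closely;
# the empirical CDF, however, never needed that assumption to be valid.
```

The point of a large (nonparametric) $\M$ is precisely that the empirical-CDF-style reasoning remains valid even when no finite-dimensional family can be assumed to contain $P_0$.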

Alternatively, if the probability distribution of the data at hand is described
Binary file removed img/.DS_Store
