
Commit

begin intro revisions; all images to PNG
nhejazi committed May 11, 2021
1 parent d171b42 commit 0112fa6
Showing 52 changed files with 85 additions and 58 deletions.
93 changes: 55 additions & 38 deletions 01-preface.Rmd
@@ -1,56 +1,70 @@
# Robust Statistics and Reproducible Science {#robust}

<!--
`r if (knitr::is_latex_output()) '\\begin{shortbox}\n\\Boxhead{Test}'`
test shortbox
`r if (knitr::is_latex_output()) '\\end{shortbox}'`
`r if (knitr::is_latex_output()) '\\begin{VT1}\n\\VH{Test}'`
test VT1
`r if (knitr::is_latex_output()) '\\end{VT1}'`
-->

> "One enemy of robust science is our humanity our appetite for
> "One enemy of robust science is our humanity -- our appetite for
> being right, and our tendency to find patterns in noise, to see supporting
> evidence for what we already believe is true, and to ignore the facts that do
> not fit."
>
> --- @naturenews_2015
Scientific research is at a unique point in history. The need to improve rigor
and reproducibility in our field is greater than ever; corroboration moves
science forward, yet there is a growing alarm about results that cannot be
reproduced and that report false discoveries [@baker2016there]. Consequences of
not meeting this need will result in further decline in the rate of scientific
progression, the reputation of the sciences, and the public’s trust in its
findings [@munafo2017manifesto; @naturenews2_2015].


Scientific research is at a unique point in its history. The need to improve
rigor and reproducibility in our field is greater than ever; corroboration moves
science forward, yet there is growing alarm that results cannot be reproduced or
validated, suggesting the possibility that many discoveries may be false
[@baker2016there]. Failure to meet this need will result in further decline in
the rate of scientific progress, the reputation of the sciences, and the
public's trust in scientific findings [@munafo2017manifesto;
@naturenews2_2015].

> "The key question we want to answer when seeing the results of any scientific
> study is whether we can trust the data analysis."
>
> --- @peng2015reproducibility
Unfortunately, at its current state the culture of data analysis and statistics
actually enables human bias through improper model selection. All hypothesis
tests and estimators are derived from statistical models, so to obtain valid
estimates and inference it is critical that the statistical model contains the
process that generated the data. Perhaps treatment was randomized or only
depended on a small number of baseline covariates; this knowledge should and
can be incorporated in the model. Alternatively, maybe the data is
observational, and there is no knowledge about the data-generating process (DGP).
If this is the case, then the statistical model should contain *all* data
distributions. In practice; however, models are not selected based on knowledge
of the DGP, instead models are often selected based on (1) the p-values they
yield, (2) their convenience of implementation, and/or (3) an analysts loyalty
to a particular model. This practice of "cargo-cult statistics --- the
ritualistic miming of statistics rather than conscientious practice,"
[@stark2018cargo] is characterized by arbitrary modeling choices, even though
these choices often result in different answers to the same research question.
That is, "increasingly often, [statistics] is used instead to aid and
abet weak science, a role it can perform well when used mechanically or
ritually," as opposed to its original purpose of safeguarding against weak
science [@stark2018cargo]. This presents a fundamental drive behind the epidemic
of false findings that scientific research is suffering from [@vdl2014entering].
Unfortunately, in its current state, the culture of statistical data analysis
enables, rather than precludes, the influence of human bias on the results of
(ideally objective) data analytic efforts. A significant degree of human bias
enters statistical analyses in the form of improper model selection. All
procedures for estimation and hypothesis testing are derived
based on a choice of statistical model; thus, obtaining valid estimates and
statistical inference relies critically on the chosen statistical model
containing an accurate representation of the process that generated the data.
Consider, for example, a hypothetical study in which a treatment was assigned to
a group of patients: Was the treatment assigned randomly or were characteristics
of the individuals (i.e., baseline covariates) used in making the treatment
decision? Such knowledge can and should be incorporated into the statistical
model. Alternatively, the data could be from an observational study, in which
there is no control over the treatment assignment mechanism. In such cases,
available knowledge about the data-generating process (DGP) is more limited
still, and the statistical model should then contain *all* possible
distributions of the data. In practice, however, models are not selected based
on scientific knowledge available about the DGP; instead, models are often
selected based on (1) the philosophical leanings of the analyst, (2) the
relative convenience of implementation of statistical methods admissible within
the chosen model, and (3) the results of significance testing (i.e., p-values)
applied within the chosen model.

This practice of "cargo-cult statistics --- the ritualistic miming of statistics
rather than conscientious practice," [@stark2018cargo] is characterized by
arbitrary modeling choices, even though these choices often result in different
answers to the same research question. That is, "increasingly often,
[statistics] is used instead to aid and abet weak science, a role it can perform
well when used mechanically or ritually," as opposed to its original purpose of
safeguarding against weak science by providing formal techniques for evaluating
the veracity of a claim using properly collected data [@stark2018cargo]. This
is a fundamental driver of the epidemic of false findings from which scientific
research is suffering [@vdl2014entering].

> "We suggest that the weak statistical understanding is probably due to
> inadequate "statistics lite" education. This approach does not build up
@@ -65,15 +79,18 @@ of false findings that scientific research is suffering from [@vdl2014entering].
>
> --- @szucs2017null

Our team at The University of California, Berkeley, is uniquely positioned to
Our team at the University of California, Berkeley, is uniquely positioned to
provide such an education. Spearheaded by Professor Mark van der Laan, and
spread rapidly by many of his students and colleagues who have greatly
enriched the field, the aptly named "Targeted Learning" methodology targets the
scientific question at hand and is counter to the current culture of
"convenience statistics" which opens the door to biased estimation, misleading
results, and false discoveries. Targeted Learning restores the fundamentals that
formalized the field of statistics, such as the that facts that a statistical
enriched the field, the aptly named "Targeted Learning" methodology emphasizes
a focus on (i.e., a "targeting" of) the scientific question at hand, running
counter to the prevailing culture of "convenience statistics," which opens the
door to biased estimation, misleading analytic results, and erroneous
discoveries. Targeted Learning embraces the fundamentals that formalized the
field of statistics, such as the fact that a statistical
model represents real knowledge about the experiment that generated the data,
and a target parameter represents what we are seeking to learn from the data as
a feature of the distribution that generated it [@vdl2014entering]. In this way,
14 changes: 7 additions & 7 deletions 05-origami.Rmd
@@ -19,15 +19,15 @@ By the end of this chapter you will be able to:

3. Select a loss function that is appropriate for the functional parameter to be
estimated.

4. Understand and contrast different cross-validation schemes for i.i.d. data.

5. Understand and contrast different cross-validation schemes for
   time-dependent data.

6. Set up the proper fold structure, build a custom fold-based function, and
   cross-validate the proposed function using the `origami` `R` package.

7. Set up the proper cross-validation structure for use by the Super Learner
   using the `origami` `R` package.

@@ -491,7 +491,7 @@ time points, including the original 15 we started with. We then evaluate its
performance on 10 time points in the future.

```{r, fig.cap="Rolling origin CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/image/rolling_origin.png")
knitr::include_graphics(path = "img/png/rolling_origin.png")
```
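
As a quick sketch of what this fold structure might look like in code (assuming
a hypothetical toy series of 30 time points; the chapter's full worked example
appears below), the `folds_rolling_origin` fold generator can simply be passed
to `make_folds()`:

```{r, eval=FALSE}
library(origami)

# hypothetical toy series of 30 time points: the first training window holds 15
# points, each validation set holds the next 10 future points, and the origin
# advances by 5 points (batch) between folds
folds_ro <- make_folds(
  n = 30, fold_fun = folds_rolling_origin,
  first_window = 15, validation_size = 10, gap = 0, batch = 5
)
folds_ro[[1]]  # first fold: training indices 1:15, validation indices 16:25
```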

We illustrate the usage of the rolling origin cross-validation with `origami`
@@ -541,7 +541,7 @@ to the rolling origin CV. We then evaluate the performance of the proposed
algorithm on 10 time points in the future.

```{r, fig.cap="Rolling window CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/image/rolling_window.png")
knitr::include_graphics(path = "img/png/rolling_window.png")
```
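
A corresponding sketch for the rolling window scheme (again assuming a
hypothetical toy series of 30 time points) swaps in the `folds_rolling_window`
fold generator and specifies a fixed `window_size` in place of a first window:

```{r, eval=FALSE}
library(origami)

# rolling window: the training window stays a fixed 15 points wide and slides
# forward by 5 points (batch) between folds; validation is again the next 10
# future points
folds_rw <- make_folds(
  n = 30, fold_fun = folds_rolling_window,
  window_size = 15, validation_size = 10, gap = 0, batch = 5
)
folds_rw[[1]]  # first fold: training indices 1:15, validation indices 16:25
```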

We illustrate the usage of the rolling window cross-validation with `origami`
@@ -581,7 +581,7 @@ first_window, validation_size, gap, batch)`. In the figure below, we show $V=2$
$V$-folds, and 2 time-series CV folds.

```{r, fig.cap="Rolling origin V-fold CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/image/rolling_origin_v_fold.png")
knitr::include_graphics(path = "img/png/rolling_origin_v_fold.png")
```

#### Rolling window with v-fold
@@ -594,7 +594,7 @@ validation_size, gap, batch)`. In the figure below, we show $V=2$ $V$-folds, and
2 time-series CV folds.

```{r, fig.cap="Rolling window V-fold CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/image/rolling_window_v_fold.png")
knitr::include_graphics(path = "img/png/rolling_window_v_fold.png")
```

## General workflow of `origami`
14 changes: 7 additions & 7 deletions 06-sl3.Rmd
@@ -218,7 +218,7 @@ Below is a figure from [ADD REF] describing the same step-by-step procedure.
This figure considers $k=16$ learners (in the figure, $p=k$) and the squared
error loss function, so the risk is the mean squared error (MSE).
```{r cv_fig, fig.show="hold", echo = FALSE}
knitr::include_graphics("img/misc/SLKaiserNew.pdf")
knitr::include_graphics("img/png/SLKaiserNew.png")
```
<!-- ADD REFERENCE + CV-SL FIGURE AND REFERENCE -->
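
Concretely, the risk is just the expected loss, estimated by averaging the loss
over held-out (validation) observations. A minimal sketch with the squared
error loss, assuming hypothetical vectors `pred` (a learner's predictions on
validation data) and `y` (the corresponding observed outcomes):

```{r, eval=FALSE}
# squared error loss, evaluated observation by observation
sq_error_loss <- function(pred, y) (y - pred)^2

# empirical risk = mean loss over the held-out observations, i.e., the MSE;
# `pred` and `y` are hypothetical validation-set predictions and outcomes
risk_mse <- mean(sq_error_loss(pred, y))
```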

@@ -227,9 +227,9 @@ knitr::include_graphics("img/misc/SLKaiserNew.pdf")
- Cross-validation is proven to be optimal for selection among estimators. This
result was established through the oracle inequality for the cross-validation
selector among a collection of candidate estimators [@vdl2003unified;
@vaart2006oracle]. The only conditions are that the loss function is uniformly
bounded, which is guaranteed in `sl3`, and that the loss function is *valid*
(defined below). (Do we also need the proportion of validation observations
times the number of observations to go to infinity?)
- We use a *loss function* $L$ to assign a measure of performance to each
learner $\psi$ when applied to the data $O$, and subsequently compare
@@ -239,9 +239,9 @@ knitr::include_graphics("img/misc/SLKaiserNew.pdf")

+ It is important to recall that $\psi$ is an estimator of $\psi_0$, the
unknown and true parameter value under $P_0$.
+ A *valid loss function* will have mean/expectation (i.e., risk) that is
minimized at the true value of the parameter $\psi_0$. Thus, minimizing
the expected loss will bring an estimator $\psi$ closer to the true
$\psi_0$.
+ For example, say we observe a learning data set $O_i=(Y_i,X_i)$, of
$i=1, \ldots, n$ independent and identically distributed observations,
@@ -293,7 +293,7 @@ such a study, comparing the fits of several different learners, including the
SL algorithms.
r cv_fig3, results="asis", echo = FALSE
knitr::include_graphics("img/misc/ericSL.pdf")
knitr::include_graphics("img/png/ericSL.png")
For more detail on Super Learner we refer the reader to @vdl2007super and
@polley2010super. The optimality results for the cross-validation selector
Expand Down
6 changes: 3 additions & 3 deletions 07-tmle3.Rmd
@@ -33,7 +33,7 @@ estimation; targeted minimum loss-based estimation) framework, using the
following example data:

```{r tmle_fig1, results="asis", echo = FALSE}
knitr::include_graphics("img/misc/tmle_sim/schematic_1_truedgd.png")
knitr::include_graphics("img/png/schematic_1_truedgd.png")
```

The small ticks on the right indicate the mean outcomes (averaging over $W$)
@@ -62,7 +62,7 @@ Applying `sl3` to estimate the outcome regression in our example, we can see
that the ensemble machine learning predictions fit the data quite well:

```{r tmle_fig2, results="asis", echo = FALSE}
knitr::include_graphics("img/misc/tmle_sim/schematic_2b_sllik.png")
knitr::include_graphics("img/png/schematic_2b_sllik.png")
```

The solid lines indicate the `sl3` estimate of the regression function, with the
@@ -81,7 +81,7 @@ We can see these limitations illustrated in the estimates generated for the
example data:

```{r tmle_fig3, results="asis", echo = FALSE}
knitr::include_graphics("img/misc/tmle_sim/schematic_3_effects.png")
knitr::include_graphics("img/png/schematic_3_effects.png")
```

We see that the Super Learner estimates the true parameter value (indicated by the
2 changes: 1 addition & 1 deletion 08-tmle3mopttx.Rmd
@@ -49,7 +49,7 @@ improve efficiency by not allocating resources to individuals that do not need
them or would not benefit from them.

```{r, fig.cap="Dynamic Treatment Regime in a Clinical Setting", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/image/DynamicA_Illustration.png")
knitr::include_graphics(path = "img/png/DynamicA_Illustration.png")
```
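
In code, a dynamic treatment rule is simply a deterministic function mapping an
individual's measured covariates to a treatment decision. A minimal sketch,
assuming a hypothetical covariate `cd4_count` and an illustrative threshold:

```{r, eval=FALSE}
# hypothetical dynamic rule: treat (A = 1) only when CD4 count drops below 350
dynamic_rule <- function(W) {
  as.numeric(W$cd4_count < 350)
}
```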

One opts to administer the intervention to individuals who will profit from it instead,
2 changes: 1 addition & 1 deletion book.bib
@@ -306,7 +306,7 @@ @article{wolpert1992stacked
}

@article{naturenews_2015,
author = {Anonymous},
author = {{Anonymous Editorial in \textit{Nature}}},
title={Let’s think about cognitive bias},
journal={Nature},
publisher={Springer Nature},
Binary file removed img/misc/.DS_Store
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_1_truedgd.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_2a_glmlik.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_2b_sllik.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_3_effects.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_4_opttx_truedgd.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_5_opttx_estimates.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/sim_distribution.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/sim_performance.png
Binary file not shown.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
10 changes: 10 additions & 0 deletions img/pdf2png.sh
@@ -0,0 +1,10 @@
#!/bin/bash
# requires ImageMagick

for f in pdf/*.pdf
do
echo "Converting PDF file: $f to ${f%.pdf}.png"
convert -density 300 $f -quality 100 ${f%.pdf}.png
done

mv pdf/*.png png/
File renamed without changes
Binary file added img/png/NatureSlides-0.png
Binary file added img/png/NatureSlides-1.png
Binary file added img/png/NatureSlides-2.png
Binary file added img/png/NatureSlides-3.png
Binary file added img/png/NatureSlides-4.png
Binary file added img/png/SLKaiserNew.png
Binary file added img/png/TMLEimage.png
Binary file added img/png/ericSL.png
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
Binary file added img/png/schematic_1_truedgd.png
Binary file added img/png/schematic_2a_glmlik.png
Binary file added img/png/schematic_2b_sllik.png
Binary file added img/png/schematic_3_effects.png
Binary file added img/png/schematic_4_opttx_truedgd.png
Binary file added img/png/schematic_5_opttx_estimates.png
Binary file added img/png/sim_distribution.png
Binary file added img/png/sim_performance.png
Binary file added img/png/vs.png
2 changes: 1 addition & 1 deletion index.Rmd
@@ -23,7 +23,7 @@ graphics: yes
description: "An open source handbook for causal machine learning and data
science with the Targeted Learning framework using the [`tlverse` software
ecosystem](https://github.com/tlverse)."
favicon: "img/logos/favicons/favicon.png"
favicon: "img/favicons/favicon.png"
---

# About this book {-}
