
Commit

begin intro revisions; all images to PNG
nhejazi committed May 11, 2021
1 parent d171b42 commit 0112fa6
Showing 52 changed files with 85 additions and 58 deletions.
93 changes: 55 additions & 38 deletions 01-preface.Rmd
@@ -1,56 +1,70 @@
# Robust Statistics and Reproducible Science {#robust}

<!--
`r if (knitr::is_latex_output()) '\\begin{shortbox}\n\\Boxhead{Test}'`
test shortbox
`r if (knitr::is_latex_output()) '\\end{shortbox}'`
`r if (knitr::is_latex_output()) '\\begin{VT1}\n\\VH{Test}'`
test VT1
`r if (knitr::is_latex_output()) '\\end{VT1}'`
-->

> "One enemy of robust science is our humanity our appetite for
> "One enemy of robust science is our humanity -- our appetite for
> being right, and our tendency to find patterns in noise, to see supporting
> evidence for what we already believe is true, and to ignore the facts that do
> not fit."
>
> --- @naturenews_2015
Scientific research is at a unique point in history. The need to improve rigor
and reproducibility in our field is greater than ever; corroboration moves
science forward, yet there is a growing alarm about results that cannot be
reproduced and that report false discoveries [@baker2016there]. Consequences of
not meeting this need will result in further decline in the rate of scientific
progression, the reputation of the sciences, and the public’s trust in its
findings [@munafo2017manifesto; @naturenews2_2015].


Scientific research is at a unique point in its history. The need to improve
rigor and reproducibility in our field is greater than ever; corroboration moves
science forward, yet there is growing alarm that results cannot be reproduced or
validated, suggesting the possibility that many discoveries may be false
[@baker2016there]. Failure to meet this need will result in further decline in
the rate of scientific progress, the reputation of the sciences, and the
public's trust in scientific findings [@munafo2017manifesto;
@naturenews2_2015].

> "The key question we want to answer when seeing the results of any scientific
> study is whether we can trust the data analysis."
>
> --- @peng2015reproducibility
Unfortunately, at its current state the culture of data analysis and statistics
actually enables human bias through improper model selection. All hypothesis
tests and estimators are derived from statistical models, so to obtain valid
estimates and inference it is critical that the statistical model contains the
process that generated the data. Perhaps treatment was randomized or only
depended on a small number of baseline covariates; this knowledge should and
can be incorporated in the model. Alternatively, maybe the data is
observational, and there is no knowledge about the data-generating process (DGP).
If this is the case, then the statistical model should contain *all* data
distributions. In practice; however, models are not selected based on knowledge
of the DGP, instead models are often selected based on (1) the p-values they
yield, (2) their convenience of implementation, and/or (3) an analysts loyalty
to a particular model. This practice of "cargo-cult statistics --- the
ritualistic miming of statistics rather than conscientious practice,"
[@stark2018cargo] is characterized by arbitrary modeling choices, even though
these choices often result in different answers to the same research question.
That is, "increasingly often, [statistics] is used instead to aid and
abet weak science, a role it can perform well when used mechanically or
ritually," as opposed to its original purpose of safeguarding against weak
science [@stark2018cargo]. This presents a fundamental drive behind the epidemic
of false findings that scientific research is suffering from [@vdl2014entering].
Unfortunately, in its current state, the culture of statistical data analysis
enables, rather than precludes, the influence of human bias on the results of
(ideally objective) data analytic efforts. A significant degree of human bias
enters statistical analyses in the form of improper model selection. All
procedures for estimation and hypothesis testing are derived
based on a choice of statistical model; thus, obtaining valid estimates and
statistical inference relies critically on the chosen statistical model
containing an accurate representation of the process that generated the data.
Consider, for example, a hypothetical study in which a treatment was assigned to
a group of patients: Was the treatment assigned randomly or were characteristics
of the individuals (i.e., baseline covariates) used in making the treatment
decision? Such knowledge can and should be incorporated into the statistical
model. Alternatively, the data could be from an observational study, in which
there is no control over the treatment assignment mechanism. In such cases,
available knowledge about the data-generating process (DGP) is more limited
still, and the statistical model should then contain *all* possible
distributions of the data. In practice, however, models are not selected based
on scientific knowledge available about the DGP; instead, models are often
selected based on (1) the philosophical leanings of the analyst, (2) the
relative convenience of implementation of statistical methods admissible within
the chosen model, and (3) the results of significance testing (i.e., p-values)
applied within the chosen model.

This practice of "cargo-cult statistics --- the ritualistic miming of statistics
rather than conscientious practice," [@stark2018cargo] is characterized by
arbitrary modeling choices, even though these choices often result in different
answers to the same research question. That is, "increasingly often,
[statistics] is used instead to aid and abet weak science, a role it can perform
well when used mechanically or ritually," as opposed to its original purpose of
safeguarding against weak science by providing formal techniques for evaluating
the veracity of a claim using properly collected data [@stark2018cargo]. This
is a fundamental driver of the epidemic of false findings from which scientific
research is suffering [@vdl2014entering].

> "We suggest that the weak statistical understanding is probably due to
> inadequate "statistics lite" education. This approach does not build up
@@ -65,15 +79,18 @@ of false findings that scientific research is suffering from [@vdl2014entering].
>
> --- @szucs2017null

Our team at The University of California, Berkeley, is uniquely positioned to
Our team at the University of California, Berkeley, is uniquely positioned to
provide such an education. Spearheaded by Professor Mark van der Laan, and
spread rapidly by many of his students and colleagues who have greatly
enriched the field, the aptly named "Targeted Learning" methodology targets the
scientific question at hand and is counter to the current culture of
"convenience statistics" which opens the door to biased estimation, misleading
results, and false discoveries. Targeted Learning restores the fundamentals that
formalized the field of statistics, such as the that facts that a statistical
enriched the field, the aptly named "Targeted Learning" methodology emphasizes
a focus on (i.e., a "targeting" of) the scientific question at hand, running
counter to the prevailing culture of "convenience statistics," which opens the
door to biased estimation, misleading analytic results, and erroneous
discoveries. Targeted Learning embraces the fundamentals that formalized the
field of statistics, such as the fact that a statistical
model represents real knowledge about the experiment that generated the data,
and a target parameter represents what we are seeking to learn from the data as
a feature of the distribution that generated it [@vdl2014entering]. In this way,
14 changes: 7 additions & 7 deletions 05-origami.Rmd
@@ -19,15 +19,15 @@ By the end of this chapter you will be able to:

3. Select a loss function that is appropriate for the functional parameter to be
estimated.

4. Understand and contrast different cross-validation schemes for i.i.d. data.

5. Understand and contrast different cross-validation schemes for
   time-dependent data.

6. Set up the proper fold structure, build a custom fold-based function, and
   cross-validate the proposed function using the `origami` `R` package.

7. Set up the proper cross-validation structure for use by the Super Learner
   using the `origami` `R` package.

@@ -491,7 +491,7 @@ time points, including the original 15 we started with. We then evaluate its
performance on 10 time points in the future.

```{r, fig.cap="Rolling origin CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/image/rolling_origin.png")
knitr::include_graphics(path = "img/png/rolling_origin.png")
```
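
As a quick sketch of what this fold structure might look like in code (assuming
a hypothetical toy series of 30 time points; the chapter's full worked example
appears below), the `folds_rolling_origin` fold generator can simply be passed
to `make_folds()`:

```{r, eval=FALSE}
library(origami)

# hypothetical toy series of 30 time points: the first training window holds 15
# points, each validation set holds the next 10 future points, and the origin
# advances by 5 points (batch) between folds
folds_ro <- make_folds(
  n = 30, fold_fun = folds_rolling_origin,
  first_window = 15, validation_size = 10, gap = 0, batch = 5
)
folds_ro[[1]]  # first fold: training indices 1:15, validation indices 16:25
```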

We illustrate the usage of the rolling origin cross-validation with `origami`
@@ -541,7 +541,7 @@ to the rolling origin CV. We then evaluate the performance of the proposed
algorithm on 10 time points in the future.

```{r, fig.cap="Rolling window CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/image/rolling_window.png")
knitr::include_graphics(path = "img/png/rolling_window.png")
```
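
A corresponding sketch for the rolling window scheme (again assuming a
hypothetical toy series of 30 time points) swaps in the `folds_rolling_window`
fold generator and specifies a fixed `window_size` in place of a first window:

```{r, eval=FALSE}
library(origami)

# rolling window: the training window stays a fixed 15 points wide and slides
# forward by 5 points (batch) between folds; validation is again the next 10
# future points
folds_rw <- make_folds(
  n = 30, fold_fun = folds_rolling_window,
  window_size = 15, validation_size = 10, gap = 0, batch = 5
)
folds_rw[[1]]  # first fold: training indices 1:15, validation indices 16:25
```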

We illustrate the usage of the rolling window cross-validation with `origami`
@@ -581,7 +581,7 @@ first_window, validation_size, gap, batch)`. In the figure below, we show $V=2$
$V$-folds, and 2 time-series CV folds.

```{r, fig.cap="Rolling origin V-fold CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/image/rolling_origin_v_fold.png")
knitr::include_graphics(path = "img/png/rolling_origin_v_fold.png")
```

#### Rolling window with v-fold
@@ -594,7 +594,7 @@ validation_size, gap, batch)`. In the figure below, we show $V=2$ $V$-folds, and
2 time-series CV folds.

```{r, fig.cap="Rolling window V-fold CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/image/rolling_window_v_fold.png")
knitr::include_graphics(path = "img/png/rolling_window_v_fold.png")
```

## General workflow of `origami`
14 changes: 7 additions & 7 deletions 06-sl3.Rmd
@@ -218,7 +218,7 @@ Below is a figure from [ADD REF] describing the same step-by-step procedure.
This figure considers $k=16$ learners (in the figure, $p=k$) and the squared
error loss function, so the risk is the mean squared error (MSE).
```{r cv_fig, fig.show="hold", echo = FALSE}
knitr::include_graphics("img/misc/SLKaiserNew.pdf")
knitr::include_graphics("img/png/SLKaiserNew.png")
```
<!-- ADD REFERENCE + CV-SL FIGURE AND REFERENCE -->
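
Concretely, the risk is just the expected loss, estimated by averaging the loss
over held-out (validation) observations. A minimal sketch with the squared
error loss, assuming hypothetical vectors `pred` (a learner's predictions on
validation data) and `y` (the corresponding observed outcomes):

```{r, eval=FALSE}
# squared error loss, evaluated observation by observation
sq_error_loss <- function(pred, y) (y - pred)^2

# empirical risk = mean loss over the held-out observations, i.e., the MSE;
# `pred` and `y` are hypothetical validation-set predictions and outcomes
risk_mse <- mean(sq_error_loss(pred, y))
```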

@@ -227,9 +227,9 @@ knitr::include_graphics("img/misc/SLKaiserNew.pdf")
- Cross-validation is proven to be optimal for selection among estimators. This
result was established through the oracle inequality for the cross-validation
selector among a collection of candidate estimators [@vdl2003unified;
@vaart2006oracle]. The only conditions are that the loss function is uniformly
bounded, which is guaranteed in `sl3`, and that the loss function is *valid*
(defined below). (Do we also need the proportion of validation observations
times the number of observations to go to infinity?)
- We use a *loss function* $L$ to assign a measure of performance to each
learner $\psi$ when applied to the data $O$, and subsequently compare
@@ -239,9 +239,9 @@ knitr::include_graphics("img/misc/SLKaiserNew.pdf")

+ It is important to recall that $\psi$ is an estimator of $\psi_0$, the
unknown and true parameter value under $P_0$.
+ A *valid loss function* will have mean/expectation (i.e., risk) that is
minimized at the true value of the parameter $\psi_0$. Thus, minimizing
the expected loss will bring an estimator $\psi$ closer to the true
$\psi_0$.
+ For example, say we observe a learning data set $O_i=(Y_i,X_i)$, of
$i=1, \ldots, n$ independent and identically distributed observations,
@@ -293,7 +293,7 @@ such a study, comparing the fits of several different learners, including the
SL algorithms.
r cv_fig3, results="asis", echo = FALSE
knitr::include_graphics("img/misc/ericSL.pdf")
knitr::include_graphics("img/png/ericSL.png")
For more detail on Super Learner we refer the reader to @vdl2007super and
@polley2010super. The optimality results for the cross-validation selector
Expand Down
6 changes: 3 additions & 3 deletions 07-tmle3.Rmd
@@ -33,7 +33,7 @@ estimation; targeted minimum loss-based estimation) framework, using the
following example data:

```{r tmle_fig1, results="asis", echo = FALSE}
knitr::include_graphics("img/misc/tmle_sim/schematic_1_truedgd.png")
knitr::include_graphics("img/png/schematic_1_truedgd.png")
```

The small ticks on the right indicate the mean outcomes (averaging over $W$)
@@ -62,7 +62,7 @@ Applying `sl3` to estimate the outcome regression in our example, we can see
that the ensemble machine learning predictions fit the data quite well:

```{r tmle_fig2, results="asis", echo = FALSE}
knitr::include_graphics("img/misc/tmle_sim/schematic_2b_sllik.png")
knitr::include_graphics("img/png/schematic_2b_sllik.png")
```

The solid lines indicate the `sl3` estimate of the regression function, with the
@@ -81,7 +81,7 @@ We can see these limitations illustrated in the estimates generated for the
example data:

```{r tmle_fig3, results="asis", echo = FALSE}
knitr::include_graphics("img/misc/tmle_sim/schematic_3_effects.png")
knitr::include_graphics("img/png/schematic_3_effects.png")
```

We see that the Super Learner estimates the true parameter value (indicated by the
2 changes: 1 addition & 1 deletion 08-tmle3mopttx.Rmd
@@ -49,7 +49,7 @@ improve efficiency by not allocating resources to individuals that do not need
them or would not benefit from them.

```{r, fig.cap="Dynamic Treatment Regime in a Clinical Setting", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/image/DynamicA_Illustration.png")
knitr::include_graphics(path = "img/png/DynamicA_Illustration.png")
```
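
In code, a dynamic treatment rule is simply a deterministic function mapping an
individual's measured covariates to a treatment decision. A minimal sketch,
assuming a hypothetical covariate `cd4_count` and an illustrative threshold:

```{r, eval=FALSE}
# hypothetical dynamic rule: treat (A = 1) only when CD4 count drops below 350
dynamic_rule <- function(W) {
  as.numeric(W$cd4_count < 350)
}
```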

One opts to administer the intervention to individuals who will profit from it instead,
2 changes: 1 addition & 1 deletion book.bib
@@ -306,7 +306,7 @@ @article{wolpert1992stacked
}

@article{naturenews_2015,
author = {Anonymous},
author = {{Anonymous Editorial in \textit{Nature}}},
title={Let’s think about cognitive bias},
journal={Nature},
publisher={Springer Nature},
Binary file removed img/misc/.DS_Store
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_1_truedgd.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_2a_glmlik.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_2b_sllik.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_3_effects.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_4_opttx_truedgd.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/schematic_5_opttx_estimates.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/sim_distribution.png
Binary file not shown.
Binary file removed img/misc/tmle_sim/sim_performance.png
Binary file not shown.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
10 changes: 10 additions & 0 deletions img/pdf2png.sh
@@ -0,0 +1,10 @@
#!/bin/bash
# requires ImageMagick

for f in pdf/*.pdf
do
echo "Converting PDF file: $f to ${f%.pdf}.png"
convert -density 300 $f -quality 100 ${f%.pdf}.png
done

mv pdf/*.png png/
File renamed without changes
Binary file added img/png/NatureSlides-0.png
Binary file added img/png/NatureSlides-1.png
Binary file added img/png/NatureSlides-2.png
Binary file added img/png/NatureSlides-3.png
Binary file added img/png/NatureSlides-4.png
Binary file added img/png/SLKaiserNew.png
Binary file added img/png/TMLEimage.png
Binary file added img/png/ericSL.png
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
Binary file added img/png/schematic_1_truedgd.png
Binary file added img/png/schematic_2a_glmlik.png
Binary file added img/png/schematic_2b_sllik.png
Binary file added img/png/schematic_3_effects.png
Binary file added img/png/schematic_4_opttx_truedgd.png
Binary file added img/png/schematic_5_opttx_estimates.png
Binary file added img/png/sim_distribution.png
Binary file added img/png/sim_performance.png
Binary file added img/png/vs.png
2 changes: 1 addition & 1 deletion index.Rmd
@@ -23,7 +23,7 @@ graphics: yes
description: "An open source handbook for causal machine learning and data
science with the Targeted Learning framework using the [`tlverse` software
ecosystem](https://github.com/tlverse)."
favicon: "img/logos/favicons/favicon.png"
favicon: "img/favicons/favicon.png"
---

# About this book {-}
