diff --git a/01-preface.Rmd b/01-preface.Rmd index 132bcbc..4f0ad3c 100644 --- a/01-preface.Rmd +++ b/01-preface.Rmd @@ -1,5 +1,6 @@ # Robust Statistics and Reproducible Science {#robust} + -> "One enemy of robust science is our humanity — our appetite for +> "One enemy of robust science is our humanity -- our appetite for > being right, and our tendency to find patterns in noise, to see supporting > evidence for what we already believe is true, and to ignore the facts that do > not fit." > > --- @naturenews_2015 -Scientific research is at a unique point in history. The need to improve rigor -and reproducibility in our field is greater than ever; corroboration moves -science forward, yet there is a growing alarm about results that cannot be -reproduced and that report false discoveries [@baker2016there]. Consequences of -not meeting this need will result in further decline in the rate of scientific -progression, the reputation of the sciences, and the public’s trust in its -findings [@munafo2017manifesto; @naturenews2_2015]. - - +Scientific research is at a unique point in its history. The need to improve +rigor and reproducibility in our field is greater than ever; corroboration moves +science forward, yet there is growing alarm that results cannot be reproduced or +validated, suggesting the possibility that many discoveries may be false +[@baker2016there]. Consequences of not meeting this need will result in further +decline in the rate of scientific progress, the reputation of the sciences, and +the public's trust in scientific findings [@munafo2017manifesto; +@naturenews2_2015]. > "The key question we want to answer when seeing the results of any scientific > study is whether we can trust the data analysis." > > --- @peng2015reproducibility -Unfortunately, at its current state the culture of data analysis and statistics -actually enables human bias through improper model selection. 
All hypothesis -tests and estimators are derived from statistical models, so to obtain valid -estimates and inference it is critical that the statistical model contains the -process that generated the data. Perhaps treatment was randomized or only -depended on a small number of baseline covariates; this knowledge should and -can be incorporated in the model. Alternatively, maybe the data is -observational, and there is no knowledge about the data-generating process (DGP). -If this is the case, then the statistical model should contain *all* data -distributions. In practice; however, models are not selected based on knowledge -of the DGP, instead models are often selected based on (1) the p-values they -yield, (2) their convenience of implementation, and/or (3) an analysts loyalty -to a particular model. This practice of "cargo-cult statistics --- the -ritualistic miming of statistics rather than conscientious practice," -[@stark2018cargo] is characterized by arbitrary modeling choices, even though -these choices often result in different answers to the same research question. -That is, "increasingly often, [statistics] is used instead to aid and -abet weak science, a role it can perform well when used mechanically or -ritually," as opposed to its original purpose of safeguarding against weak -science [@stark2018cargo]. This presents a fundamental drive behind the epidemic -of false findings that scientific research is suffering from [@vdl2014entering]. +Unfortunately, in its current state, the culture of statistical data analysis +enables, rather than precludes, the manner in which human bias may affect the +results of (ideally objective) data analytic efforts. A significant degree of +human bias enters statistical analysis efforts in the form of improper model +selection. 
All procedures for estimation and hypothesis testing are derived +based on a choice of statistical model; thus, obtaining valid estimates and +statistical inference relies critically on the chosen statistical model +containing an accurate representation of the process that generated the data. +Consider, for example, a hypothetical study in which a treatment was assigned to +a group of patients: Was the treatment assigned randomly or were characteristics +of the individuals (i.e., baseline covariates) used in making the treatment +decision? Such knowledge can and should be incorporated in the statistical model. +Alternatively, the data could be from an observational study, in which there is +no control over the treatment assignment mechanism. In such cases, available +knowledge about the data-generating process (DGP) is more limited still. If +this is the case, then the statistical model should contain *all* possible +distributions of the data. In practice, however, models are not selected based +on scientific knowledge available +about the DGP; instead, models are often selected based on (1) the philosophical +leanings of the analyst, (2) +the relative convenience of implementation of statistical methods admissible +within the choice of model, and (3) the results of significance testing (i.e., +p-values) applied within the choice of model. + +This practice of "cargo-cult statistics --- the ritualistic miming of statistics +rather than conscientious practice," [@stark2018cargo] is characterized by +arbitrary modeling choices, even though these choices often result in different +answers to the same research question. That is, "increasingly often, +[statistics] is used instead to aid and abet weak science, a role it can perform +well when used mechanically or ritually," as opposed to its original purpose of +safeguarding against weak science by providing formal techniques for evaluating +the veracity of a claim using properly collected data [@stark2018cargo]. 
This +presents a fundamental driver behind the epidemic of false findings from which +scientific research is suffering [@vdl2014entering]. > "We suggest that the weak statistical understanding is probably due to > inadequate "statistics lite" education. This approach does not build up @@ -65,15 +79,18 @@ of false findings that scientific research is suffering from [@vdl2014entering]. > > --- @szucs2017null - -Our team at The University of California, Berkeley, is uniquely positioned to +Our team at the University of California, Berkeley, is uniquely positioned to provide such an education. Spearheaded by Professor Mark van der Laan, and spreading rapidly by many of his students and colleagues who have greatly -enriched the field, the aptly named "Targeted Learning" methodology targets the -scientific question at hand and is counter to the current culture of -"convenience statistics" which opens the door to biased estimation, misleading -results, and false discoveries. Targeted Learning restores the fundamentals that -formalized the field of statistics, such as the that facts that a statistical +enriched the field, the aptly named "Targeted Learning" methodology emphasizes a +focus on (i.e., "targeting of") the scientific question at hand, running counter +to the current culture of "convenience statistics," which opens the door +to biased estimation, misleading analytic results, and erroneous discoveries. +Targeted Learning embraces the fundamentals that formalized the field of +statistics, + + +such as the facts that a statistical model represents real knowledge about the experiment that generated the data, and a target parameter represents what we are seeking to learn from the data as a feature of the distribution that generated it [@vdl2014entering]. In this way, diff --git a/05-origami.Rmd b/05-origami.Rmd index 6b5d3cc..f9036c7 100644 --- a/05-origami.Rmd +++ b/05-origami.Rmd @@ -19,15 +19,15 @@ By the end of this chapter you will be able to: 3. 
Select a loss function that is appropriate for the functional parameter to be estimated. - + 4. Understand and contrast different cross-validation schemes for i.i.d. data. 5. Understand and contrast different cross-validation schemes for time dependent data. - + 6. Setup the proper fold structure, build custom fold-based function, and cross-validate the proposed function using the `origami` `R` package. - + 7. Setup the proper cross-validation structure for the use by the Super Learner using the the `origami` `R` package. @@ -491,7 +491,7 @@ time points, including the original 15 we started with. We then evaluate its performance on 10 time points in the future. ```{r, fig.cap="Rolling origin CV", results="asis", echo=FALSE} -knitr::include_graphics(path = "img/image/rolling_origin.png") +knitr::include_graphics(path = "img/png/rolling_origin.png") ``` We illustrate the usage of the rolling origin cross-validation with `origami` @@ -541,7 +541,7 @@ to the rolling origin CV. We then evaluate the performance of the proposed algorithm on 10 time points in the future. ```{r, fig.cap="Rolling window CV", results="asis", echo=FALSE} -knitr::include_graphics(path = "img/image/rolling_window.png") +knitr::include_graphics(path = "img/png/rolling_window.png") ``` We illustrate the usage of the rolling window cross-validation with `origami` @@ -581,7 +581,7 @@ first_window, validation_size, gap, batch)`. In the figure below, we show $V=2$ $V$-folds, and 2 time-series CV folds. ```{r, fig.cap="Rolling origin V-fold CV", results="asis", echo=FALSE} -knitr::include_graphics(path = "img/image/rolling_origin_v_fold.png") +knitr::include_graphics(path = "img/png/rolling_origin_v_fold.png") ``` #### Rolling window with v-fold @@ -594,7 +594,7 @@ validation_size, gap, batch)`. In the figure below, we show $V=2$ $V$-folds, and 2 time-series CV folds. 
```{r, fig.cap="Rolling window V-fold CV", results="asis", echo=FALSE} -knitr::include_graphics(path = "img/image/rolling_window_v_fold.png") +knitr::include_graphics(path = "img/png/rolling_window_v_fold.png") ``` ## General workflow of `origami` diff --git a/06-sl3.Rmd b/06-sl3.Rmd index 9a5960b..04a7355 100644 --- a/06-sl3.Rmd +++ b/06-sl3.Rmd @@ -218,7 +218,7 @@ Below is a figure from [ADD REF] describing the same step-by-step procedure. This figure considers $k=16$ learners, and in the figure $p=k$; and the squared error loss function, thus mean squared error (MSE) is the risk. ```{r cv_fig, fig.show="hold", echo = FALSE} -knitr::include_graphics("img/misc/SLKaiserNew.pdf") +knitr::include_graphics("img/png/SLKaiserNew.png") ``` @@ -227,9 +227,9 @@ knitr::include_graphics("img/misc/SLKaiserNew.pdf") - Cross-validation is proven to be optimal for selection among estimators. This result was established through the oracle inequality for the cross-validation selector among a collection of candidate estimators [@vdl2003unified; - @vaart2006oracle]. The only conditions are that loss function uniformly + @vaart2006oracle]. The only conditions are that loss function uniformly bounded, which is guaranteed in `sl3`, and that the loss function is *valid* - (defined below). (Do we also need for the proportion of validation observation + (defined below). (Do we also need for the proportion of validation observation times the number of observations to go to infinity?) - We use a *loss function* $L$ to assign a measure of performance to each learner $\psi$ when applied to the data $O$, and subsequently compare @@ -239,9 +239,9 @@ knitr::include_graphics("img/misc/SLKaiserNew.pdf") + It is important to recall that $\psi$ is an estimator of $\psi_0$, the unknown and true parameter value under $P_0$. - + A *valid loss function* will have mean/expectation (i.e., risk) that is - minimized at the true value of the parameter $\psi_0$. 
Thus, minimizing - the expected loss will bring an estimator $\psi$ closer to the true + + A *valid loss function* will have mean/expectation (i.e., risk) that is + minimized at the true value of the parameter $\psi_0$. Thus, minimizing + the expected loss will bring an estimator $\psi$ closer to the true $\psi_0$. + For example, say we observe a learning data set $O_i=(Y_i,X_i)$, of $i=1, \ldots, n$ independent and identically distributed observations, @@ -293,7 +293,7 @@ such a study, comparing the fits of several different learners, including the SL algorithms. r cv_fig3, results="asis", echo = FALSE -knitr::include_graphics("img/misc/ericSL.pdf") +knitr::include_graphics("img/png/ericSL.png") For more detail on Super Learner we refer the reader to @vdl2007super and @polley2010super. The optimality results for the cross-validation selector diff --git a/07-tmle3.Rmd b/07-tmle3.Rmd index 778333f..ca8b04f 100644 --- a/07-tmle3.Rmd +++ b/07-tmle3.Rmd @@ -33,7 +33,7 @@ estimation; targeted minimum loss-based estimation) framework, using the following example data: ```{r tmle_fig1, results="asis", echo = FALSE} -knitr::include_graphics("img/misc/tmle_sim/schematic_1_truedgd.png") +knitr::include_graphics("img/png/schematic_1_truedgd.png") ``` The small ticks on the right indicate the mean outcomes (averaging over $W$) @@ -62,7 +62,7 @@ Applying `sl3` to estimate the outcome regression in our example, we can see that the ensemble machine learning predictions fit the data quite well: ```{r tmle_fig2, results="asis", echo = FALSE} -knitr::include_graphics("img/misc/tmle_sim/schematic_2b_sllik.png") +knitr::include_graphics("img/png/schematic_2b_sllik.png") ``` The solid lines indicate the `sl3` estimate of the regression function, with the @@ -81,7 +81,7 @@ We can see these limitations illustrated in the estimates generated for the example data: ```{r tmle_fig3, results="asis", echo = FALSE} -knitr::include_graphics("img/misc/tmle_sim/schematic_3_effects.png") 
+knitr::include_graphics("img/png/schematic_3_effects.png") ``` We see that Super Learner, estimates the true parameter value (indicated by the diff --git a/08-tmle3mopttx.Rmd b/08-tmle3mopttx.Rmd index 5f3b448..105dd04 100644 --- a/08-tmle3mopttx.Rmd +++ b/08-tmle3mopttx.Rmd @@ -49,7 +49,7 @@ improve efficiency by not allocating resources to individuals that do not need them, or would not benefit from it. ```{r, fig.cap="Dynamic Treatment Regime in a Clinical Setting", results="asis", echo=FALSE} -knitr::include_graphics(path = "img/image/DynamicA_Illustration.png") +knitr::include_graphics(path = "img/png/DynamicA_Illustration.png") ``` One opts to administer the intervention to individuals who will profit from it instead, diff --git a/book.bib b/book.bib index a6b99cd..3c17d23 100644 --- a/book.bib +++ b/book.bib @@ -306,7 +306,7 @@ @article{wolpert1992stacked } @article{naturenews_2015, - author = {Anonymous}, + author = {{Anonymous Editorial in \textit{Nature}}}, title={Let’s think about cognitive bias}, journal={Nature}, publisher={Springer Nature}, diff --git a/img/misc/.DS_Store b/img/misc/.DS_Store deleted file mode 100644 index 4a61f78..0000000 Binary files a/img/misc/.DS_Store and /dev/null differ diff --git a/img/misc/tmle_sim/schematic_1_truedgd.png b/img/misc/tmle_sim/schematic_1_truedgd.png deleted file mode 100644 index 9e1b4d8..0000000 Binary files a/img/misc/tmle_sim/schematic_1_truedgd.png and /dev/null differ diff --git a/img/misc/tmle_sim/schematic_2a_glmlik.png b/img/misc/tmle_sim/schematic_2a_glmlik.png deleted file mode 100644 index a369dfa..0000000 Binary files a/img/misc/tmle_sim/schematic_2a_glmlik.png and /dev/null differ diff --git a/img/misc/tmle_sim/schematic_2b_sllik.png b/img/misc/tmle_sim/schematic_2b_sllik.png deleted file mode 100644 index fcc74a4..0000000 Binary files a/img/misc/tmle_sim/schematic_2b_sllik.png and /dev/null differ diff --git a/img/misc/tmle_sim/schematic_3_effects.png b/img/misc/tmle_sim/schematic_3_effects.png 
deleted file mode 100644 index b973a9c..0000000 Binary files a/img/misc/tmle_sim/schematic_3_effects.png and /dev/null differ diff --git a/img/misc/tmle_sim/schematic_4_opttx_truedgd.png b/img/misc/tmle_sim/schematic_4_opttx_truedgd.png deleted file mode 100644 index 2001cc1..0000000 Binary files a/img/misc/tmle_sim/schematic_4_opttx_truedgd.png and /dev/null differ diff --git a/img/misc/tmle_sim/schematic_5_opttx_estimates.png b/img/misc/tmle_sim/schematic_5_opttx_estimates.png deleted file mode 100644 index ad997b9..0000000 Binary files a/img/misc/tmle_sim/schematic_5_opttx_estimates.png and /dev/null differ diff --git a/img/misc/tmle_sim/sim_distribution.png b/img/misc/tmle_sim/sim_distribution.png deleted file mode 100644 index 0fb5afd..0000000 Binary files a/img/misc/tmle_sim/sim_distribution.png and /dev/null differ diff --git a/img/misc/tmle_sim/sim_performance.png b/img/misc/tmle_sim/sim_performance.png deleted file mode 100644 index 9d3e0df..0000000 Binary files a/img/misc/tmle_sim/sim_performance.png and /dev/null differ diff --git a/img/misc/NatureSlides.pdf b/img/pdf/NatureSlides.pdf similarity index 100% rename from img/misc/NatureSlides.pdf rename to img/pdf/NatureSlides.pdf diff --git a/img/misc/SLKaiserNew.pdf b/img/pdf/SLKaiserNew.pdf similarity index 100% rename from img/misc/SLKaiserNew.pdf rename to img/pdf/SLKaiserNew.pdf diff --git a/img/misc/TMLEimage.pdf b/img/pdf/TMLEimage.pdf similarity index 100% rename from img/misc/TMLEimage.pdf rename to img/pdf/TMLEimage.pdf diff --git a/img/misc/ericSL.pdf b/img/pdf/ericSL.pdf similarity index 100% rename from img/misc/ericSL.pdf rename to img/pdf/ericSL.pdf diff --git a/img/misc/tmle_sim/schematic_1_truedgd.pdf b/img/pdf/schematic_1_truedgd.pdf similarity index 100% rename from img/misc/tmle_sim/schematic_1_truedgd.pdf rename to img/pdf/schematic_1_truedgd.pdf diff --git a/img/misc/tmle_sim/schematic_2a_glmlik.pdf b/img/pdf/schematic_2a_glmlik.pdf similarity index 100% rename from 
img/misc/tmle_sim/schematic_2a_glmlik.pdf rename to img/pdf/schematic_2a_glmlik.pdf diff --git a/img/misc/tmle_sim/schematic_2b_sllik.pdf b/img/pdf/schematic_2b_sllik.pdf similarity index 100% rename from img/misc/tmle_sim/schematic_2b_sllik.pdf rename to img/pdf/schematic_2b_sllik.pdf diff --git a/img/misc/tmle_sim/schematic_3_effects.pdf b/img/pdf/schematic_3_effects.pdf similarity index 100% rename from img/misc/tmle_sim/schematic_3_effects.pdf rename to img/pdf/schematic_3_effects.pdf diff --git a/img/misc/tmle_sim/schematic_4_opttx_truedgd.pdf b/img/pdf/schematic_4_opttx_truedgd.pdf similarity index 100% rename from img/misc/tmle_sim/schematic_4_opttx_truedgd.pdf rename to img/pdf/schematic_4_opttx_truedgd.pdf diff --git a/img/misc/tmle_sim/schematic_5_opttx_estimates.pdf b/img/pdf/schematic_5_opttx_estimates.pdf similarity index 100% rename from img/misc/tmle_sim/schematic_5_opttx_estimates.pdf rename to img/pdf/schematic_5_opttx_estimates.pdf diff --git a/img/misc/tmle_sim/sim_distribution.pdf b/img/pdf/sim_distribution.pdf similarity index 100% rename from img/misc/tmle_sim/sim_distribution.pdf rename to img/pdf/sim_distribution.pdf diff --git a/img/misc/tmle_sim/sim_performance.pdf b/img/pdf/sim_performance.pdf similarity index 100% rename from img/misc/tmle_sim/sim_performance.pdf rename to img/pdf/sim_performance.pdf diff --git a/img/misc/vs.pdf b/img/pdf/vs.pdf similarity index 100% rename from img/misc/vs.pdf rename to img/pdf/vs.pdf diff --git a/img/pdf2png.sh b/img/pdf2png.sh new file mode 100755 index 0000000..831f24d --- /dev/null +++ b/img/pdf2png.sh @@ -0,0 +1,10 @@ +#!/bin/bash +# requires ImageMagick + +for f in pdf/*.pdf +do + echo "Converting PDF file: $f to ${f%.pdf}.png" + convert -density 300 "$f" -quality 100 "${f%.pdf}.png" +done + +mv pdf/*.png png/ diff --git a/img/image/DynamicA_Illustration.png b/img/png/DynamicA_Illustration.png similarity index 100% rename from img/image/DynamicA_Illustration.png rename to 
img/png/DynamicA_Illustration.png diff --git a/img/png/NatureSlides-0.png b/img/png/NatureSlides-0.png new file mode 100644 index 0000000..1f0868f Binary files /dev/null and b/img/png/NatureSlides-0.png differ diff --git a/img/png/NatureSlides-1.png b/img/png/NatureSlides-1.png new file mode 100644 index 0000000..acde7bf Binary files /dev/null and b/img/png/NatureSlides-1.png differ diff --git a/img/png/NatureSlides-2.png b/img/png/NatureSlides-2.png new file mode 100644 index 0000000..e4162a1 Binary files /dev/null and b/img/png/NatureSlides-2.png differ diff --git a/img/png/NatureSlides-3.png b/img/png/NatureSlides-3.png new file mode 100644 index 0000000..502e085 Binary files /dev/null and b/img/png/NatureSlides-3.png differ diff --git a/img/png/NatureSlides-4.png b/img/png/NatureSlides-4.png new file mode 100644 index 0000000..01c05e8 Binary files /dev/null and b/img/png/NatureSlides-4.png differ diff --git a/img/png/SLKaiserNew.png b/img/png/SLKaiserNew.png new file mode 100644 index 0000000..dfaa755 Binary files /dev/null and b/img/png/SLKaiserNew.png differ diff --git a/img/png/TMLEimage.png b/img/png/TMLEimage.png new file mode 100644 index 0000000..850951b Binary files /dev/null and b/img/png/TMLEimage.png differ diff --git a/img/png/ericSL.png b/img/png/ericSL.png new file mode 100644 index 0000000..9380015 Binary files /dev/null and b/img/png/ericSL.png differ diff --git a/img/image/rolling_origin.png b/img/png/rolling_origin.png similarity index 100% rename from img/image/rolling_origin.png rename to img/png/rolling_origin.png diff --git a/img/image/rolling_origin_v_fold.png b/img/png/rolling_origin_v_fold.png similarity index 100% rename from img/image/rolling_origin_v_fold.png rename to img/png/rolling_origin_v_fold.png diff --git a/img/image/rolling_window.png b/img/png/rolling_window.png similarity index 100% rename from img/image/rolling_window.png rename to img/png/rolling_window.png diff --git a/img/image/rolling_window_v_fold.png 
b/img/png/rolling_window_v_fold.png similarity index 100% rename from img/image/rolling_window_v_fold.png rename to img/png/rolling_window_v_fold.png diff --git a/img/png/schematic_1_truedgd.png b/img/png/schematic_1_truedgd.png new file mode 100644 index 0000000..6b053e5 Binary files /dev/null and b/img/png/schematic_1_truedgd.png differ diff --git a/img/png/schematic_2a_glmlik.png b/img/png/schematic_2a_glmlik.png new file mode 100644 index 0000000..eaac365 Binary files /dev/null and b/img/png/schematic_2a_glmlik.png differ diff --git a/img/png/schematic_2b_sllik.png b/img/png/schematic_2b_sllik.png new file mode 100644 index 0000000..e530429 Binary files /dev/null and b/img/png/schematic_2b_sllik.png differ diff --git a/img/png/schematic_3_effects.png b/img/png/schematic_3_effects.png new file mode 100644 index 0000000..e0cc3de Binary files /dev/null and b/img/png/schematic_3_effects.png differ diff --git a/img/png/schematic_4_opttx_truedgd.png b/img/png/schematic_4_opttx_truedgd.png new file mode 100644 index 0000000..7c0ad73 Binary files /dev/null and b/img/png/schematic_4_opttx_truedgd.png differ diff --git a/img/png/schematic_5_opttx_estimates.png b/img/png/schematic_5_opttx_estimates.png new file mode 100644 index 0000000..50aed65 Binary files /dev/null and b/img/png/schematic_5_opttx_estimates.png differ diff --git a/img/png/sim_distribution.png b/img/png/sim_distribution.png new file mode 100644 index 0000000..544a467 Binary files /dev/null and b/img/png/sim_distribution.png differ diff --git a/img/png/sim_performance.png b/img/png/sim_performance.png new file mode 100644 index 0000000..d40e270 Binary files /dev/null and b/img/png/sim_performance.png differ diff --git a/img/png/vs.png b/img/png/vs.png new file mode 100644 index 0000000..f9fc4c7 Binary files /dev/null and b/img/png/vs.png differ diff --git a/index.Rmd b/index.Rmd index 250e0ce..fa8a3e4 100644 --- a/index.Rmd +++ b/index.Rmd @@ -23,7 +23,7 @@ graphics: yes description: "An open source 
handbook for causal machine learning and data science with the Targeted Learning framework using the [`tlverse` software ecosystem](https://github.com/tlverse)." -favicon: "img/logos/favicons/favicon.png" +favicon: "img/favicons/favicon.png" --- # About this book {-}
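
The `img/pdf2png.sh` script added in this change converts each PDF next to its source and then moves the results with `mv pdf/*.png png/`. A slightly more defensive variant might look like the sketch below. It is only an illustration under stated assumptions: the script is run from `img/`, ImageMagick's `convert` is on the `PATH`, and `png_target` and `convert_all` are hypothetical helper names that do not appear in the repository.

```shell
#!/bin/bash
# Sketch only: a more defensive take on img/pdf2png.sh (requires ImageMagick).
# Assumes it is run from img/, reading pdf/*.pdf and writing into png/.

# png_target is a hypothetical helper: pdf/vs.pdf -> png/vs.png
png_target() {
  printf 'png/%s.png\n' "$(basename "$1" .pdf)"
}

# convert_all is likewise hypothetical; it converts every PDF under pdf/
convert_all() {
  mkdir -p png
  for f in pdf/*.pdf; do
    # Skip the unexpanded pattern when pdf/ contains no PDF files
    [ -e "$f" ] || continue
    out="$(png_target "$f")"
    echo "Converting PDF file: $f to $out"
    # Quoting keeps file names with spaces intact
    convert -density 300 "$f" -quality 100 "$out"
  done
}

# Only convert when a pdf/ directory is present in the working directory
if [ -d pdf ]; then
  convert_all
fi
```

Writing straight into `png/` would remove the need for the trailing `mv` step, and the existence check avoids calling `convert` on a literal `pdf/*.pdf` when the directory is empty. For multi-page PDFs, ImageMagick typically writes one numbered PNG per page, which matches the `NatureSlides-0.png` through `NatureSlides-4.png` files added here.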