First rows of the `NIEHS_data_1` example data (after adding the simulated covariate `W`):

| obs | Y | X1 | X2 | X3 | X4 | X5 | X6 | X7 | Z | W |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.534686 | 0.4157066 | 0.5308077 | 0.2223965 | 1.1592634 | 2.4577556 | 0.9438601 | 1.8714406 | 0 | 0.1335790 |
| 2 | 19.611934 | 0.5293572 | 0.9339570 | 1.1210595 | 1.3350074 | 0.3096883 | 0.5190970 | 0.2418065 | 0 | 0.0585291 |
| 3 | 12.664050 | 0.4849759 | 0.7210988 | 0.4629027 | 1.0334138 | 0.9492810 | 0.3664090 | 0.3502445 | 0 | 0.1342057 |
| 4 | 15.600288 | 0.8275456 | 1.0457137 | 0.9699040 | 0.9045099 | 0.9107914 | 0.4299847 | 1.0007901 | 0 | 0.0734320 |
| 5 | 18.606498 | 0.5190363 | 0.7802400 | 0.6142188 | 0.3729743 | 0.5038126 | 0.3575472 | 0.5906156 | 0 | -0.0148427 |
| 6 | 18.525890 | 0.4009491 | 0.8639886 | 0.5501847 | 0.9011016 | 1.2907615 | 0.7990418 | 1.5097039 | 0 | 0.1749775 |
diff --git a/doc/SuperNOVA-vignette.R b/doc/SuperNOVA-vignette.R
deleted file mode 100644
index 802b667..0000000
--- a/doc/SuperNOVA-vignette.R
+++ /dev/null
@@ -1,77 +0,0 @@
-## ---- include = FALSE---------------------------------------------------------
-knitr::opts_chunk$set(
- collapse = TRUE,
- comment = "#>"
-)
-
-## ----figure, echo=FALSE, out.width='100%', fig.align='center'-----------------
-library(knitr)
-include_graphics("Biometrics_Flow_Chart.png")
-
-## ----deltas, message=FALSE, warning=FALSE-------------------------------------
-deltas <- c("M1" = 1, "M2" = 2.3, "M3" = 1.4)
-
-## ----setup, message=FALSE, warning=FALSE--------------------------------------
-library(data.table)
-library(dplyr)
-library(kableExtra)
-library(SuperNOVA)
-
-seed <- 325911
-
-## ----NIEHS example------------------------------------------------------------
-data("NIEHS_data_1", package = "SuperNOVA")
-
-## ----NIEHS Nodes--------------------------------------------------------------
-# simulate an additional baseline covariate W
-NIEHS_data_1$W <- rnorm(nrow(NIEHS_data_1), mean = 0, sd = 0.1)
-w <- NIEHS_data_1[, c("W", "Z")] # baseline covariates
-a <- NIEHS_data_1[, c("X1", "X2", "X3", "X4", "X5", "X6", "X7")] # mixture exposures
-y <- NIEHS_data_1$Y # outcome
-
-## ----run SuperNOVA NIEHS data, eval = TRUE------------------------------------
-deltas <- list(
- "X1" = 1, "X2" = 1, "X3" = 1,
- "X4" = 1, "X5" = 1, "X6" = 1, "X7" = 1
-)
-
-ptm <- proc.time()
-
-NIEHS_results <- SuperNOVA(
- w = w,
- a = a,
- y = y,
- deltas = deltas,
- estimator = "tmle",
- fluctuation = "standard",
- n_folds = 2,
- outcome_type = "continuous",
- quantile_thresh = 0,
- verbose = TRUE,
- parallel = FALSE,
- parallel_type = "sequential",
- num_cores = 2,
- seed = seed,
- adaptive_delta = TRUE
-)
-
-proc.time() - ptm
-
-indiv_shift_results <- NIEHS_results$`Indiv Shift Results`
-em_results <- NIEHS_results$`Effect Mod Results`
-joint_shift_results <- NIEHS_results$`Joint Shift Results`
-
-## ----individual results-------------------------------------------------------
-indiv_shift_results$X7 %>%
-  kableExtra::kbl(caption = "Individual Shift Results") %>%
-  kableExtra::kable_classic(full_width = F, html_font = "Cambria")
-
-## ----effect mod results--------------------------------------------------------
-em_results$X7Z %>%
- kableExtra::kbl(caption = "Effect Modification Results") %>%
- kableExtra::kable_classic(full_width = F, html_font = "Cambria")
-
-## ----interaction results------------------------------------------------------
-joint_shift_results$X2X7 %>%
- kableExtra::kbl(caption = "Interaction Results") %>%
- kableExtra::kable_classic(full_width = T, html_font = "Cambria")
-
diff --git a/doc/SuperNOVA-vignette.Rmd b/doc/SuperNOVA-vignette.Rmd
deleted file mode 100644
index 6046ee3..0000000
--- a/doc/SuperNOVA-vignette.Rmd
+++ /dev/null
@@ -1,280 +0,0 @@
----
-title: "Analysis of Variance using Super Learner with Data-Adaptive Stochastic Interventions"
-author: "David McCoy"
-date: "`r Sys.Date()`"
-output: rmarkdown::html_vignette
-bibliography: ../inst/references.bib
-vignette: >
- %\VignetteIndexEntry{Analysis of Variance using Super Learner with Data-Adaptive Stochastic Interventions}
- %\VignetteEngine{knitr::rmarkdown}
- %\VignetteEncoding{UTF-8}
----
-
-```{r, include = FALSE}
-knitr::opts_chunk$set(
- collapse = TRUE,
- comment = "#>"
-)
-```
-
-## Motivation
-
-The motivation behind the package SuperNOVA is to address the limitations of traditional statistical methods in environmental epidemiology studies. Analysts are often interested in understanding the joint impact of a mixed exposure, i.e. a vector of exposures, but the most important variables and variable sets remain unknown. Traditional methods make overly simplistic assumptions, such as linear and additive relationships, and the resulting statistical quantities may not be directly applicable to public policy decisions.
-
-To overcome these limitations, SuperNOVA uses data-adaptive machine learning methods to identify the variables and variable sets that have the most explanatory power on an outcome of interest. The package builds a discrete Super Learner, which is then analyzed using ANOVA style analysis to determine the variables that contribute most to the model fit through basis functions. The target parameter, which may be a single shift, effect modification, interaction or mediation, is then applied to the data-adaptively determined variable sets.
-
-In this way, SuperNOVA allows analysts to explore modified treatment policies and ask causal questions, such as "If exposure to collections of PFAS chemicals increases, what is the change in cholesterol, immune function, or cancer?" By providing more nuanced and data-driven insights, SuperNOVA aims to inform public policy decisions and improve our understanding of the impact of mixed exposures on health outcomes.
-
-For more detailed information, please see [@mccoy2023semiparametric; @McCoy2023mediation]; here we focus on providing a general overview and enough information to understand the process and output.
-
-### Data-Adaptive Machine Learning
-
-The package SuperNOVA uses a data-adaptive approach to identify important variables and variable sets in mixed exposure studies. To avoid over-fitting and incorrect model assumptions, the package employs cross-validation procedures. In each fold, the full data is split into a training sample and an estimation sample. The training sample is used to:
-
-1. Identify the variable sets of interest using flexible machine learning algorithms.
-2. Fit estimators for the relevant nuisance parameters and the final target parameter of interest using the identified variable sets.
-
-Then the estimation sample is used to:
-
-3. Obtain estimates of the nuisance parameters and the final target parameter of interest on the held-out estimation sample.
-4. Repeat this process in each fold, providing an estimate specific to each estimation sample. To optimize the balance between bias and variance, SuperNOVA uses targeted minimum loss based estimation (TMLE).
-5. Estimates across the folds are then averaged, and a pooled fluctuation step is performed to estimate the efficient influence function and derive pooled variance estimates. A minimal sketch of this sample-splitting scheme follows the list.
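-
-The following is a minimal sketch of this sample-splitting scheme, assuming a data frame `dat` and hypothetical helpers `discover_variable_sets()`, `fit_nuisance()`, and `estimate_target()` standing in for steps 1-3 (none of these are SuperNOVA functions):
-
-```r
-# V-fold sample splitting: each fold's training portion discovers variable
-# sets and fits nuisance estimators; the held-out portion estimates targets.
-set.seed(325911)
-V <- 2
-fold_id <- sample(rep(seq_len(V), length.out = nrow(dat)))
-
-fold_estimates <- lapply(seq_len(V), function(v) {
-  train <- dat[fold_id != v, ]          # parameter-generating sample
-  estim <- dat[fold_id == v, ]          # estimation sample
-  sets <- discover_variable_sets(train) # step 1 (hypothetical helper)
-  fits <- fit_nuisance(train, sets)     # step 2 (hypothetical helper)
-  estimate_target(estim, fits, sets)    # step 3 (hypothetical helper)
-})
-```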
-
-
-# Data and Parameter of Interest
-
-## Overview of our Data-Adaptive Target Parameter
-
-In contemporary epidemiological analyses, high-dimensional data sets have revealed the constraints of conventional statistical methods. In many cases, the interactions among multiple exposures, the baseline covariates that modify the impacts of exposures, or the mediators that transmit the impact of exposures are all unknown a priori. Thus, we need to identify these relationships and build efficient estimators for interventions made on the variable sets in these relationships.
-
-Hubbard and van der Laan [@Hubbard2016] championed data-adaptive target parameters, which aim to answer the question of how to both identify and estimate a target parameter in the data, a task that requires combining data-splitting with efficient estimators. Because the number of steps required for robust estimation is large, we first provide an overview of the method here:
-
-```{r figure, echo=FALSE, out.width='100%', fig.align='center'}
-library(knitr)
-include_graphics("Biometrics_Flow_Chart.png")
-```
-
-## Notation and Framework
-
-We build upon the foundational concepts presented in [@diaz2012population].
-
-Let $O=(W, \mathbf{A}, Y)$ represent a random variable with distribution $P_0$, and $O_1,...,O_n$ represent $n$ i.i.d. observations of $O$; these constitute our observed data.
-
-The true distribution $P_0$ of $O$ is:
-
-\[ P_0(O) = P_0(Y|\mathbf{A}, W)P_0(\mathbf{A}|W)P_0(W), \]
-
-Here, we are not considering mediation.
-
-We are interested in the counterfactual outcome if the exposure distribution was shifted by some amount $\delta$. For example, we may want to observe how childhood asthma changes if we reduce exposure to PM2.5 by a small amount and also if we reduce exposure to NO by a small amount.
-
-These counterfactual outcomes are denoted by $Y_{\mathbf{P}\delta}$.
-
-We denote a shift in the conditional density distribution of exposures as:
-
-\[ g_0(\mathbf{A}-\boldsymbol{\delta}(W)|W), \]
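-
-As a toy illustration (not package code), if the conditional exposure density were normal, the post-shift density is simply the original density evaluated at $a - \delta$; here `mu_w` and `sd_a` are assumed values for a given covariate stratum:
-
-```r
-# Toy example: A | W ~ N(mu_w, sd_a); the shifted density
-# g0(a - delta | w) re-centers the exposure mass delta units higher.
-delta <- 1
-mu_w <- 2
-sd_a <- 0.5
-a_grid <- seq(0, 6, by = 0.1)
-g_obs <- dnorm(a_grid, mean = mu_w, sd = sd_a)           # g0(a | w)
-g_shift <- dnorm(a_grid - delta, mean = mu_w, sd = sd_a) # g0(a - delta | w)
-```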
-
-Of course, we cannot shift the entire population's exposure to a chemical by some amount, follow up to observe outcomes, then go back in time, adjust the exposures again, and follow up once more; this ideal experiment corresponds to the full causal model, which we cannot observe. Thus, we need to see how close we can get to this parameter given our observed data.
-
-## Identification
-
-The statistical parameter for a stochastic shift intervention on one or many exposures is expressed as:
-
-\begin{align*}
-\Phi(P_n) &= \int_{\mathbf{A}} \int_{W} P(Y=y|\mathbf{A}=a, W=w) \\
-&\times P_\delta(g)(\mathbf{A}=a|W=w)P(W=w) \, da \, dw.
-\end{align*}
-
-The causal parameter is:
-
-\[ \theta(P_0) = \mathbb{E}[Y(A-\delta)] \]
-
-Under the assumptions of Conditional Ignorability and Positivity, that is, no unmeasured confounding (we have adjusted for all necessary covariates) and sufficient support of the exposure values across our covariates, the statistical parameter identifies the causal parameter:
-
-\[ \Phi(P_n) = \theta(P_0) \]
-
-\[ \Phi(P_n) = \theta(P_0) = \int_{\mathbf{A}} \int_{W} \bar{Q}(\mathbf{A}, W)P_\delta(g)(\mathbf{A}|W)Q_W(W) \, da \, dw. \]
-
-Thus, to construct our target parameter we need estimators for the outcome regression $\bar{Q}$ and the conditional exposure density $g$.
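-
-For intuition, here is a minimal sketch of the substitution (plug-in) estimator this identification result implies; `Q_fit` is an assumed fitted outcome regression with a `predict` method, and `dat` an assumed data frame whose exposure column `A` is shifted by `delta`:
-
-```r
-# Substitution estimator: average the outcome regression predictions at the
-# shifted exposures; the TMLE step described later debiases this plug-in.
-psi_plugin <- function(Q_fit, dat, delta) {
-  shifted <- dat
-  shifted$A <- shifted$A + delta # shift the exposure column
-  mean(predict(Q_fit, newdata = shifted))
-}
-```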
-
-## Interaction Target Parameter
-
-For $\mathbf{A} = (A_1, A_2)$, a joint shift in both exposures is denoted $E[Y_{g^*_{\boldsymbol{\delta}}}]$. Our interaction target parameter is then:
-
-\[ \mathbb{E}[Y_{g^*_{\boldsymbol{\delta}}}] - \left( \mathbb{E}[Y_{g^*_{\delta_1}}] + \mathbb{E}[Y_{g^*_{\delta_2}}] \right) \]
-
-In words, our target parameter for interaction compares the counterfactual outcome mean under a joint exposure shift to the sum of expected outcomes under individual shifts of the same exposures.
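-
-Once the three shift estimands have been estimated (e.g., by TMLE), the contrast itself is trivial; `psi_joint`, `psi_1`, and `psi_2` below are assumed estimates of the corresponding counterfactual means:
-
-```r
-# Interaction contrast: joint-shift mean minus the sum of the
-# individual-shift means, mirroring the formula above.
-interaction_contrast <- function(psi_joint, psi_1, psi_2) {
-  psi_joint - (psi_1 + psi_2)
-}
-```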
-
-## Interpretation: Analogy to Interaction Coefficient in a Linear Model
-
-In traditional epidemiological studies, interactions are often defined and estimated as a product term in a linear regression model. Given a simplified model:
-
-\[ Y = \beta_0 + \beta_1A_1 + \beta_2A_2 + \beta_{12}A_1A_2 + \epsilon \]
-
-where \( \beta_{12} \) is the interaction term of interest, we can understand this coefficient from a stochastic shift perspective. Given this structural model, we can ask about average outcomes under unit shifts in $A_1$ and $A_2$ individually, as well as a unit shift in both; with some algebra we arrive at a representation of \( \beta_{12} \) in terms of expected outcomes under these shifts:
-
-\[ \beta_{12} = \mathbb{E}[Y_{g^*_{\boldsymbol{\delta}}}] - \left( \mathbb{E}[Y_{g^*_{\delta_1}}] + \mathbb{E}[Y_{g^*_{\delta_2}}] \right) + \mathbb{E}[Y] \]
-
-This not only links classical regression approaches with modern causal inference techniques for stochastic shift interventions but also simplifies interpretation in epidemiological studies. Therefore, the estimates provided by our method have a similar interpretation to that of an interaction $\beta$ in a linear model (up to a constant $E[Y]$), which epidemiologists are more accustomed to; however, our estimate is not contingent on assumptions of linearity.
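-
-This identity is easy to verify numerically under the assumed linear model (a self-contained check, not package code):
-
-```r
-# Under Y = b0 + b1*A1 + b2*A2 + b12*A1*A2, the shift contrast recovers
-# b12 exactly for unit shifts, since the identity holds pointwise.
-set.seed(1)
-n <- 1e4
-A1 <- rnorm(n)
-A2 <- rnorm(n, mean = 1)
-b0 <- 1; b1 <- 2; b2 <- -1; b12 <- 0.5
-EY <- function(a1, a2) mean(b0 + b1 * a1 + b2 * a2 + b12 * a1 * a2)
-EY(A1 + 1, A2 + 1) - (EY(A1 + 1, A2) + EY(A1, A2 + 1)) + EY(A1, A2)
-#> [1] 0.5
-```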
-
-# Estimation and Interpretation
-
-In Diaz and van der Laan's work on stochastic interventions [@diaz2012population], the efficient influence function (EIF) is given for an individual shift, and this EIF does not change for multiple shifts. We provide more details on the EIF in our paper [@mccoy2023semiparametric]. Briefly, we update our counterfactual predictions from plug-in machine learning using targeted learning such that the bias for our target parameter is removed; we then plug these updated counterfactuals into the EIF, from which we obtain variance estimates for confidence intervals and hypothesis testing.
-
-# Cross-Validation Results
-
-Because separate folds are used to find the variable relationships and to estimate effects for these relationships, we offer two types of results. We provide fold-specific results: tables for each variable relationship and its estimate in each fold. This is done so users can see how consistent estimates are across the folds, or whether there are inconsistencies in identifying variable relationships.
-
-We also offer pooled results. In this case, estimates are averaged across all the folds, leveraging the full data and thus providing tighter confidence intervals. This is done by pooling our nuisance estimates across all the validation folds, performing a pooled TMLE update, and calculating the EIF on the full data. For the purposes of this vignette, users just need to know that both pooled and v-fold results are given in the SuperNOVA output.
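-
-A minimal sketch of the pooled inference step, where `psi_v` is an assumed vector of fold-specific estimates and `eif` an assumed vector of per-observation efficient influence function values stacked across the validation folds:
-
-```r
-# Pooled point estimate averages the fold estimates; pooled variance comes
-# from the stacked EIF values over all n observations.
-pooled_inference <- function(psi_v, eif) {
-  psi <- mean(psi_v)
-  se <- sqrt(var(eif) / length(eif))
-  c(psi = psi, lower = psi - 1.96 * se, upper = psi + 1.96 * se)
-}
-```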
-
-Users should be aware that it is possible to get inconsistent results. For example, if an interaction is found in only 1 of 10 folds, this is an inconsistent finding, and reporting it as robust should be done with caution; likewise for any other variable relationship. Thus, users should evaluate (1) how consistently the relationships are found across the folds and (2) how consistent the empirical estimates are for those relationships.
-
-# Exposure Set Identification using Ensemble Approaches
-
-In the realm of high-dimensional data, the challenge isn't solely the estimation but rather the identification of influential variable sets, especially when exposure dimensions are vast. To address this, we resort to an ensemble method: the discrete Super Learner [@SL_2008]. This algorithm employs a cross-validation mechanism, optimizing over a library of candidate algorithms to select the estimator with the best empirical fit.
-
-More specifically, the Super Learner in our context uses basis splines and their tensor products, permitting the model to capture intricate non-linear relationships as linear combinations of basis functions while remaining interpretable. Such formulations are implemented in packages like `earth` [@earth], `polspline` [@polspline], and `hal9001` [@hal9001]. We employ this estimator for $E[Y| \boldsymbol{A}, W]$ in the data-adaptive discovery procedure using the parameter-generating data.
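-
-To make the selection mechanics concrete, here is a toy discrete Super Learner with a stand-in candidate library; it assumes a data frame `dat` with columns `y` and `x`, whereas SuperNOVA's actual library uses basis-function learners such as `earth` and `hal9001`:
-
-```r
-# Discrete Super Learner: estimate each candidate's cross-validated risk
-# and keep the single candidate with the lowest CV risk.
-cv_risk <- function(fit_fun, dat, V = 5) {
-  fold <- sample(rep(seq_len(V), length.out = nrow(dat)))
-  mean(sapply(seq_len(V), function(v) {
-    fit <- fit_fun(dat[fold != v, ])
-    mean((dat$y[fold == v] - predict(fit, dat[fold == v, ]))^2)
-  }))
-}
-candidates <- list(
-  linear = function(d) lm(y ~ x, data = d),
-  spline = function(d) lm(y ~ splines::bs(x, df = 5), data = d)
-)
-risks <- vapply(candidates, cv_risk, numeric(1), dat = dat)
-best_fit <- candidates[[which.min(risks)]](dat)
-```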
-
-# Non-Parametric Analysis of Variance (NP-ANOVA)
-
-Upon establishing the best fitting estimator, discerning the hierarchy of variable set importance becomes paramount. Traditional ANOVA techniques fall short given the inherent non-linearity and extensive dimensionality intrinsic to our models. In response, we introduce a non-parametric extension: NP-ANOVA.
-
-This method decomposes the variance of the response variable, attributing it to contributions from individual basis functions. Irrespective of the model's underpinnings (be it MARS, HAL, or others), the variance partitioning furnishes insights into the relative importance of each basis function.
-
-The F-statistic serves as our metric of choice; but instead of applying it to variables in a linear model, as in standard ANOVA analyses, we apply an ANOVA over the basis functions used in the best-fitting estimator. At its core, the F-statistic juxtaposes the variance explained by the model against the residual variance, normalized by their respective degrees of freedom. With the model construed as a blend of basis functions, we use the conventional sums of squares (RSS, TSS, ESS) to compute the F-statistic.
-
-Subsequent to this, variable sets earn their rankings via aggregated F-statistics. That is, each basis function, attached to a particular exposure or exposure set, has a particular F-statistic; we then aggregate these F-statistics to the exposure-set level. Employing quantile-based thresholding, we extract influential subsets, thus streamlining the set for subsequent parameter estimation; this is useful in very high-dimensional exposure settings but can also be bypassed as part of the data-adaptive parameter.
-
-For more details on the NP-ANOVA method for exposure set discovery, we direct readers to the supplementary material.
-
-Overall, our method for exposure set discovery includes the following:
-
- 1. Fit $E[Y| \boldsymbol{A}, W]$ using the aforementioned Super Learner of basis-function estimators.
- 2. Extract the model matrix from the best-fitting estimator.
- 3. Either use all exposure sets of size 2 appearing in the basis functions (a basis function containing two exposures indicates it captures an interaction) or apply our F-statistic method to extract "important" basis functions with interaction (see the sketch after this list).
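-
-A sketch of the group F-statistic underlying step 3, assuming `X` is the model matrix of basis functions extracted in step 2, `y` is the outcome, and `cols` indexes the basis functions attached to one exposure set:
-
-```r
-# Compare the RSS with and without one exposure set's basis functions; a
-# larger F-statistic means that group explains more of the variance.
-group_f_stat <- function(X, y, cols) {
-  rss_full <- sum(resid(lm(y ~ X))^2)
-  rss_reduced <- sum(resid(lm(y ~ X[, -cols, drop = FALSE]))^2)
-  df1 <- length(cols)            # basis functions dropped
-  df2 <- length(y) - ncol(X) - 1 # residual df of the full model
-  ((rss_reduced - rss_full) / df1) / (rss_full / df2)
-}
-```
-
-Exposure sets are then ranked by these aggregated F-statistics, and sets above the chosen quantile threshold are carried forward to estimation.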
-
-
-## Targeted Learning
-
-We use targeted minimum loss based estimation (TMLE) to debias our initial outcome estimates given a shift in exposure, such that the resulting estimator is asymptotically unbiased and efficient. This procedure involves constructing a least favorable submodel that uses a ratio of conditional exposure densities under shift and no shift as a covariate to debias the initial estimates. More details on targeted learning for stochastic shifts are given in @diaz2012population.
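-
-A heavily simplified sketch of this fluctuation for a single shifted exposure with a continuous outcome and a linear fluctuation; all inputs are assumed pre-computed nuisance estimates, and the package's internal implementation differs in detail:
-
-```r
-# Q_obs / Q_shift: initial outcome predictions at the observed and shifted
-# exposures; H_obs / H_shift: the density-ratio "clever covariate"
-# evaluated at the observed and shifted exposures.
-tmle_shift <- function(y, Q_obs, Q_shift, H_obs, H_shift) {
-  eps <- coef(lm(y ~ -1 + H_obs, offset = Q_obs)) # fluctuation parameter
-  mean(Q_shift + eps * H_shift)                   # debiased plug-in estimate
-}
-```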
-
-
-## Review
-
-Overall, because we need to both identify variable relationships and create efficient estimators for interventions on these relationships, we need to use sample splitting. As such, we use one part of the data to find these relationships and the held-out part to estimate our target parameters for them, rotating this role across the folds as described above.
-
-## Data Adaptive Delta
-
-For each exposure, the user inputs a respective delta. For example, for exposures $M1, M2, M3$, the analyst supplies the `deltas` vector:
-
-```{r deltas, message=FALSE, warning=FALSE}
-deltas <- c("M1" = 1, "M2" = 2.3, "M3" = 1.4)
-```
-
-This assigns a delta shift amount to each exposure. However, because the user doesn't know the underlying experimentation in the data, a delta that is too big may result in positivity violations, which lead to bias and high variance of the estimator; that is, the user may be asking for a shift in exposure density that is very unlikely given the data. To solve this, the analyst can set `adaptive_delta = TRUE`, which makes the shift amount a data-adaptive parameter as well. The delta is reduced until the maximum of the ratio of densities for each exposure is less than or equal to `hn_trunc_thresh`.
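-
-A sketch of this adaptive reduction (the exact rule inside SuperNOVA may differ); `density_ratio()` is a hypothetical helper returning the estimated ratio $g(a - \delta \mid w) / g(a \mid w)$ for each observation:
-
-```r
-# Shrink delta until the largest estimated density ratio respects the
-# truncation threshold, guarding against positivity violations.
-adapt_delta <- function(a, w, delta, g_fit, hn_trunc_thresh = 10) {
-  ratio <- density_ratio(g_fit, a, w, delta) # hypothetical helper
-  while (max(ratio) > hn_trunc_thresh) {
-    delta <- 0.9 * delta # assumed reduction factor
-    ratio <- density_ratio(g_fit, a, w, delta)
-  }
-  delta
-}
-```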
-
-## Application
-
-First, let's load the packages we'll use and set a seed for simulation:
-
-```{r setup, message=FALSE, warning=FALSE}
-library(data.table)
-library(dplyr)
-library(kableExtra)
-library(SuperNOVA)
-
-seed <- 325911
-```
-
-Below we show an implementation of `SuperNOVA` using simulated data. Because estimates are given for each data-adaptively identified parameter in each fold, after the demonstration we also show our pooled-estimate approach for parameters that are found across all the folds.
-
-## NIEHS Mixture Workshop Data
-
-The `SuperNOVA` package comes with two datasets from the 2015 NIEHS Mixtures Workshop simulation data (https://github.com/niehs-prime/2015-NIEHS-MIxtures-Workshop). Let's load this data and run `SuperNOVA` to see if we identify (1) any interactions among the mixture variables or any marginally important mixture variables and/or (2) any effect-modifying variables.
-
-```{r NIEHS example}
-data("NIEHS_data_1", package = "SuperNOVA")
-```
-
-```{r NIEHS Nodes}
-# simulate an additional baseline covariate W
-NIEHS_data_1$W <- rnorm(nrow(NIEHS_data_1), mean = 0, sd = 0.1)
-w <- NIEHS_data_1[, c("W", "Z")] # baseline covariates
-a <- NIEHS_data_1[, c("X1", "X2", "X3", "X4", "X5", "X6", "X7")] # mixture exposures
-y <- NIEHS_data_1$Y # outcome
-```
-
-
-```{r run SuperNOVA NIEHS data, eval = TRUE}
-deltas <- list(
- "X1" = 1, "X2" = 1, "X3" = 1,
- "X4" = 1, "X5" = 1, "X6" = 1, "X7" = 1
-)
-
-ptm <- proc.time()
-
-NIEHS_results <- SuperNOVA(
- w = w,
- a = a,
- y = y,
- deltas = deltas,
- estimator = "tmle",
- fluctuation = "standard",
- n_folds = 2,
- outcome_type = "continuous",
- quantile_thresh = 0,
- verbose = TRUE,
- parallel = FALSE,
- parallel_type = "sequential",
- num_cores = 2,
- seed = seed,
- adaptive_delta = TRUE
-)
-
-proc.time() - ptm
-
-indiv_shift_results <- NIEHS_results$`Indiv Shift Results`
-em_results <- NIEHS_results$`Effect Mod Results`
-joint_shift_results <- NIEHS_results$`Joint Shift Results`
-```
-
-
-Let's look at the results for the variable $X7$, which should have a positive effect given the data dictionary provided:
-
-```{r individual results}
-indiv_shift_results$X7 %>%
-  kableExtra::kbl(caption = "Individual Shift Results") %>%
-  kableExtra::kable_classic(full_width = F, html_font = "Cambria")
-```
-
-We ran `SuperNOVA` with two folds, and above we see this exposure was identified in both folds. We also set `adaptive_delta = TRUE` with an initial shift value of 1; in both folds the delta was reduced to 0.59 to ensure there are no positivity violations (i.e., to avoid asking for a shift that is unfeasible given the data). The pooled estimate is an average across the folds, and variance estimates for the pooled result come from a pooled targeted fluctuation step using nuisance parameters across the folds. Here, our interpretation of this finding is: "If all individuals were exposed to a 0.59 increase in X7, the expected outcome increases by 2.37." This result is significant for both the pooled estimate and the fold-specific estimates.
-
-Now let's look for effect modification:
-```{r effect mod results}
-em_results$X7Z %>%
- kableExtra::kbl(caption = "Effect Modification Results") %>%
- kableExtra::kable_classic(full_width = F, html_font = "Cambria")
-```
-
-This shows there are differential effects of shifting $X7$ between strata of the baseline covariate $Z$. This result was found in only one fold, so the fold-specific and pooled results are the same. This is an example of a less consistent finding, since it wasn't identified in all the folds; the user should incorporate this information when interpreting the findings. Consistency is easier to assess with a larger number of CV folds.
-
-Here we see that when Z = 0, a 0.47 shift in X7 leads to an average outcome of 17.43, and when Z = 1 the average outcome for the same shift is 33.7.
-
-```{r interaction results}
-joint_shift_results$X2X7 %>%
- kableExtra::kbl(caption = "Interaction Results") %>%
- kableExtra::kable_classic(full_width = T, html_font = "Cambria")
-```
-
-Here we see that in one of the folds an interaction between X2 and X7 was found. The adaptive delta was set to 0.22 for X2 and 0.35 for X7. An increase of 0.22 in X2 leads to a 0.64 increase in Y, and an increase of 0.35 in X7 leads to a 1.23 increase in Y. Increasing both by these amounts changes Y by 1.87, which equals the additive sum of 1.87 from the two individual estimates. Thus, there is no evidence of interaction under our definition, which requires a departure from additive effects.
-
-
-
-
-
diff --git a/doc/SuperNOVA-vignette.html b/doc/SuperNOVA-vignette.html
deleted file mode 100644
index a669260..0000000
--- a/doc/SuperNOVA-vignette.html
+++ /dev/null
@@ -1,1057 +0,0 @@
-[rendered HTML output; duplicates the vignette content above]
diff --git a/man/calc_final_joint_shift_param.Rd b/man/calc_final_joint_shift_param.Rd
index b8ce4fc..5625a89 100644
--- a/man/calc_final_joint_shift_param.Rd
+++ b/man/calc_final_joint_shift_param.Rd
@@ -8,12 +8,15 @@ calc_final_joint_shift_param(
joint_shift_fold_results,
rank,
fold_k,
- deltas_updated
+ deltas_updated,
+ exposures
)
}
\arguments{
\item{joint_shift_fold_results}{Results of the joint shift}
+\item{rank}{Ranking of the interaction found}
+
 \item{fold_k}{Fold in which the joint shift is identified}
\item{deltas_updated}{The new delta, could be updated if Hn has positivity
-
-# References
-
-Dı́az, Iván, and Mark J. van der Laan. 2012. "Population Intervention Causal Effects Based on Stochastic Interventions." *Biometrics* 68 (2): 541–49.
-
-Hubbard, Alan E., Sara Kherad-Pajouh, and Mark J. van der Laan. 2016. "Statistical Inference for Data Adaptive Target Parameters." *International Journal of Biostatistics* 12 (1): 3–19. https://doi.org/10.1515/ijb-2015-0013.
-
-McCoy, David B., Alan E. Hubbard, Mark van der Laan, and Alejandro Schuler. 2023. "Unveiling Causal Mediation Pathways in High-Dimensional Mixed Exposures: A Data-Adaptive Target Parameter Strategy," 1–32. http://arxiv.org/abs/2307.02667.
-
-McCoy, David B., Alan E. Hubbard, Alejandro Schuler, and Mark J. van der Laan. 2023. "Semi-Parametric Identification and Estimation of Interaction and Effect Modification in Mixed Exposures Using Stochastic Interventions." https://arxiv.org/abs/2305.01849.