Skip to content

Latest commit

 

History

History
1714 lines (1588 loc) · 38.2 KB

README.md

File metadata and controls

1714 lines (1588 loc) · 38.2 KB

R/SuperNOVA

R-CMD-check Coverage Status CRAN CRAN downloads CRAN total downloads Project Status: Active – The project has reached a stable, usable state and is being actively developed. MIT license

Efficient Estimation of the Causal Effects of Non-Parametric Interactions, Effect Modifications and Mediation using Stochastic Interventions Authors: David McCoy


What’s SuperNOVA?

The SuperNOVA R package offers a comprehensive toolset for identifying predictive variable sets, be it subsets of exposures, exposure-covariates, or exposure-mediators, for a specified outcome. It further assists in creating efficient estimators for the counterfactual mean of the outcome under stochastic interventions on these variable sets. This means making exposure changes that depend on naturally observed values, as described in past literature (Dı́az and van der Laan 2012; Haneuse and Rotnitzky 2013).

SuperNOVA introduces several estimators, constructed based on the patterns found in the data. At present, semi-parametric estimators are available for various contexts: interaction, effect modification, marginal impacts, and mediation. The target parameters that this package calculates are grounded in previously published works: one regarding interaction and effect modification (McCoy et al. 2023) and another dedicated to mediation [McCoy2023mediation].

The SuperNOVA package builds upon the capabilities of the txshift package, which implements the TML estimator for a stochastic shift causal parameter (Dı́az and van der Laan 2012). A notable extension in SuperNOVA is its support for joint stochastic interventions on two exposures. This allows for the creation of a non-parametric interaction parameter, shedding light on the combined effect of concurrently shifting two variables relative to the cumulative effect of individual shifts. At this stage, the focus is on two-way shifts.

Various parameters are calculated based on identified patterns:

  • Interaction Parameter: When two exposures show signs of interaction, this parameter is calculated.

  • Individual Stochastic Interventions: When marginal effects are identified, this estimates the difference in outcomes upon making a specific shift compared to no intervention.

  • Effect Modification: In scenarios where effect modification is evident, the target parameter for it is the mean outcome under intervention within specific regions of the covariate space that the data suggests. This contrasts with the marginal effect by diving deeper into strata of covariates, for example, gauging the effects of exposure shifts among distinct genders.

  • Mediation: In scenarios where mediation is found, the target parameter are the natural direct and indirect effects as defined by stochastic intervention. That is, for the natural indirect effect, SuperNOVA estimates the impact of shifting an exposure through a mediator and the direct effect which is not through the mediator.

Mediation in SuperNOVA is a new feature. Here, SuperNOVA brings to the table estimates initially conceived in another work (Díaz and Hejazi 2020), while also supporting mediation estimates for continuous exposures.

The package ensures robustness by employing a k-fold cross-validation framework. This framework helps in estimating a data-adaptive parameter, which is the stochastic shift target parameters for the variable sets identified as influential for the outcome. The process begins by partitioning the data into parameter-generating and estimation samples. The former sample assists in fitting a collection of basis function estimators to the data. The one with the lowest cross-validated mean squared error is selected. Key variable sets are then extracted using an ANOVA-like variance decomposition methodology. For the estimation sample, targeted learning is harnessed to gauge causal target parameters across different contexts: interaction, mediation, effect modification, and individual variable shifts.

By using SuperNOVA, users get access to a tool that offers both k-fold specific and aggregated results for each target parameter, ensuring that researchers can glean the most information from their data. For a more in-depth exploration, there’s an accompanying vignette.

To utilize the package, users need to provide vectors for exposures, covariates, mediators, and outcomes. They also specify the respective $\delta$ for each exposure (indicating the degree of shift) and if this delta should be adaptive in response to positivity violations. A detailed guide is provided in the vignette. With these inputs, SuperNOVA processes the data and delivers tables showcasing fold-specific results and aggregated outcomes, allowing users to glean insights effectively.

SuperNOVA also incorporates features from the sl3 package (Coyle, Hejazi, Malenica, et al. 2022), facilitating ensemble machine learning in the estimation process. If the user does not specify any stack parameters, SuperNOVA will automatically create an ensemble of machine learning algorithms that strike a balance between flexibility and computational efficiency.


Installation

Note: Because the SuperNOVA package (currently) depends on sl3 that allows ensemble machine learning to be used for nuisance parameter estimation and sl3 is not on CRAN the SuperNOVA package is not available on CRAN and must be downloaded here.

There are many depedencies for SuperNOVA so it’s easier to break up installation of the various packages to ensure proper installation.

First install the basis estimators used in the data-adaptive variable discovery of the exposure and covariate space:

install.packages("earth")
install.packages("hal9001")

SuperNOVA uses the sl3 package to build ensemble machine learners for each nuisance parameter. We have to install off the development branch, first download these two packages for sl3

install.packages(c("ranger", "arm", "xgboost", "nnls"))

Now install sl3 on devel:

remotes::install_github("tlverse/sl3@devel")

Make sure sl3 installs correctly then install SuperNOVA

remotes::install_github("blind-contours/SuperNOVA@main")

SuperNOVA has some other miscellaneous dependencies that are used in the examples as well as in the plotting functions.

install.packages(c("kableExtra", "hrbrthemes", "viridis"))

Example

To illustrate how SuperNOVA may be used to ascertain the effect of a mixed exposure, consider the following example:

library(SuperNOVA)
library(devtools)
#> Loading required package: usethis
library(kableExtra)
library(sl3)

set.seed(429153)
# simulate simple data
n_obs <- 100000

The simulate_data is a function for generating synthetic data with a complex structure to study the causal effects of shifting values in the mixtures of exposures. The primary purpose of this simulation is to provide a controlled environment for testing and validating estimates given by the SuperNOVA package.

The simulate_data function generates synthetic data for n_obs observations with a pre-specified covariance structure (sigma_mod) and a shift parameter (delta). It simulates four mixture components (M1, M2, M3, M4) and three covariates (W1, W2, W3) with specific relationships between them. The outcome variable Y is generated as a function of these mixtures and covariates.

After generating the data, the function applies a shift (delta) to each mixture component separately and calculates the average treatment effect for each component. Additionally, it calculates the interaction effect of shifting two mixture components simultaneously (m14_intxn). These ground truth effects can be used for validating and comparing the performance of various causal inference methods on this synthetic dataset.

The function returns a list containing the generated data and the ground truth effects of the shifts applied to each mixture component, the interaction effect, and the modified effect results based on a specific level in the W3 covariate.

sim_out <- simulate_data(n_obs = n_obs)
data <- sim_out$data
head(data) %>%
  kbl(caption = "Simulated Data") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Simulated Data
M1 M2 M3 M4 W1 W2 W3 Y
23.01099 3.864569 4.461534 4.747169 7.873068 7.362137 1 72.76739
23.70848 4.169044 4.990670 5.568766 6.677126 7.302253 1 87.81536
22.44274 2.922274 4.885933 3.347086 7.598614 7.647076 0 42.33824
22.99681 2.507461 4.227508 1.719390 7.321127 6.907626 1 38.72763
22.88541 3.349974 4.369185 6.618387 6.203951 6.547683 1 90.86418
22.88781 3.890133 5.370329 5.736624 7.677073 8.250861 0 68.59041

And therefore, in SuperNOVA we would expect most of the fold CIs to cover this number and the pooled estimate to also cover this true effect. Let’s run SuperNOVA to see if it correctly identifies the exposures that drive the outcome and any interaction/effect modification that exists in the DGP.

Of note, there are three exposures M1, M2, M3 - M1 and M3 have individual effects and interactions that drive the outcome. There is also effect modification between M3 and W1.

data_sample <- data[sample(nrow(data), 4000), ]

w <- data_sample[, c("W1", "W2", "W3")]
a <- data_sample[, c("M1", "M2", "M3", "M4")]
y <- data_sample$Y

deltas <- list("M1" = 1, "M2" = 1, "M3" = 1, "M4" = 1)

ptm <- proc.time()
sim_results <- SuperNOVA(
  w = w,
  a = a,
  y = y,
  delta = deltas,
  n_folds = 3,
  num_cores = 6,
  outcome_type = "continuous",
  quantile_thresh = 0,
  seed = 294580
)
#> 
#> Iter: 1 fn: 5616.2821     Pars:  0.83768 0.16232
#> Iter: 2 fn: 5615.9460     Pars:  0.99996649 0.00003351
#> Iter: 3 fn: 5615.9460     Pars:  0.99997849 0.00002151
#> solnp--> Completed in 3 iterations
#> 
#> Iter: 1 fn: 3815.7486     Pars:  0.99998993 0.00001007
#> Iter: 2 fn: 3815.7486     Pars:  0.999998834 0.000001166
#> solnp--> Completed in 2 iterations
#> 
#> Iter: 1 fn: 3737.1413     Pars:  0.999996084 0.000003915
#> Iter: 2 fn: 3737.1412     Pars:  0.9999991151 0.0000008849
#> solnp--> Completed in 2 iterations
#> 
#> Iter: 1 fn: 3731.8788     Pars:  0.88038 0.11962
#> Iter: 2 fn: 3731.8383     Pars:  0.99995226 0.00004774
#> Iter: 3 fn: 3731.8383     Pars:  0.99997338 0.00002662
#> solnp--> Completed in 3 iterations
#> 
#> Iter: 1 fn: 5615.8427     Pars:  0.37837 0.62163
#> Iter: 2 fn: 5615.7080     Pars:  0.61125 0.38875
#> Iter: 3 fn: 5615.7080     Pars:  0.61254 0.38746
#> solnp--> Completed in 3 iterations
#> 
#> Iter: 1 fn: 5621.0789     Pars:  0.84929 0.15071
#> Iter: 2 fn: 5620.8684     Pars:  0.00007712 0.99992288
#> Iter: 3 fn: 5620.8684     Pars:  0.000002771 0.999997229
#> solnp--> Completed in 3 iterations
#> 
#> Iter: 1 fn: 5612.7158     Pars:  0.82116 0.17884
#> Iter: 2 fn: 5612.5096     Pars:  0.9999838 0.0000162
#> Iter: 3 fn: 5612.5096     Pars:  0.99998959 0.00001041
#> solnp--> Completed in 3 iterations
#> 
#> Iter: 1 fn: 3842.7726     Pars:  0.999856 0.000144
#> Iter: 2 fn: 3842.7726     Pars:  0.99995859 0.00004141
#> solnp--> Completed in 2 iterations
#> 
#> Iter: 1 fn: 3811.0844     Pars:  0.67596 0.32404
#> Iter: 2 fn: 3811.0844     Pars:  0.67601 0.32399
#> solnp--> Completed in 2 iterations
#> 
#> Iter: 1 fn: 3841.7125     Pars:  0.997439 0.002561
#> Iter: 2 fn: 3841.7125     Pars:  0.9993128 0.0006872
#> solnp--> Completed in 2 iterations
#> 
#> Iter: 1 fn: 5612.7076     Pars:  0.43156 0.56844
#> Iter: 2 fn: 5612.3977     Pars:  0.999891 0.000109
#> Iter: 3 fn: 5612.3977     Pars:  0.99992987 0.00007013
#> solnp--> Completed in 3 iterations
#> 
#> Iter: 1 fn: 5621.1314     Pars:  0.51060 0.48940
#> Iter: 2 fn: 5621.1314     Pars:  0.51078 0.48922
#> solnp--> Completed in 2 iterations
#> 
#> Iter: 1 fn: 5600.7646     Pars:  0.99997834 0.00002166
#> Iter: 2 fn: 5600.7645     Pars:  0.999995592 0.000004408
#> solnp--> Completed in 2 iterations
#> 
#> Iter: 1 fn: 3831.8447     Pars:  0.45887 0.54113
#> Iter: 2 fn: 3831.8447     Pars:  0.45886 0.54114
#> solnp--> Completed in 2 iterations
#> 
#> Iter: 1 fn: 3799.4818     Pars:  0.02808 0.97192
#> Iter: 2 fn: 3797.3950     Pars:  0.9998439 0.0001561
#> Iter: 3 fn: 3797.3948     Pars:  0.99998699 0.00001301
#> Iter: 4 fn: 3797.3948     Pars:  0.999993236 0.000006764
#> solnp--> Completed in 4 iterations
#> 
#> Iter: 1 fn: 3830.8913     Pars:  0.84699 0.15301
#> Iter: 2 fn: 3830.8719     Pars:  0.75597 0.24403
#> Iter: 3 fn: 3830.8719     Pars:  0.75597 0.24403
#> solnp--> Completed in 3 iterations
#> 
#> Iter: 1 fn: 5600.6732     Pars:  0.13089 0.86911
#> Iter: 2 fn: 5600.6602     Pars:  0.007388 0.992612
#> Iter: 3 fn: 5600.6602     Pars:  0.007384 0.992616
#> solnp--> Completed in 3 iterations
#> 
#> Iter: 1 fn: 5600.7959     Pars:  0.9998956 0.0001044
#> Iter: 2 fn: 5600.7958     Pars:  0.99998238 0.00001762
#> solnp--> Completed in 2 iterations
proc.time() - ptm
#>    user  system elapsed 
#>  51.213   4.756 824.470

basis_in_folds <- sim_results$`Basis Fold Proportions`
indiv_shift_results <- sim_results$`Indiv Shift Results`
em_results <- sim_results$`Effect Mod Results`
joint_shift_results <- sim_results$`Joint Shift Results`

Let’s first look at the variable relationships used in the folds:

basis_in_folds
#> 
#>   M1 M1M4 M3M4 M3W3   M4   W3 
#> 1.00 0.33 0.67 1.00 1.00 1.00

The above list shows that marginal effects for exposures M1 and M4 were found, an interaction for M1 and M4, and effect modification for M3 and W3 - all of which are correct as the outcome is generated from these relationships. There is no effect for M2 which we correctly reject.

Let’s first look at the results for individual stochastic shifts by delta compared to no shift:

indiv_shift_results$M1 %>%
  kbl(caption = "Individual Stochastic Intervention Results for M1") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Individual Stochastic Intervention Results for M1
Condition Psi Variance SE Lower CI Upper CI P-value Fold Type Variables N Delta
M1 2.664497 0.7389329 0.8596120 0.9797 4.3493 0.0019375 1 Indiv Shift M1 1334 1
M1 1.752571 0.5247758 0.7244141 0.3327 3.1724 0.0155506 2 Indiv Shift M1 1333 1
M1 2.214215 0.5609105 0.7489396 0.7463 3.6821 0.0031119 3 Indiv Shift M1 1333 1
M1 1.496780 0.1667218 0.4083158 0.6965 2.2971 0.0002466 Pooled TMLE Indiv Shift M1 4000 1

The true effect for a shifted M1 vs observed M1 is:

sim_out$m1_effect
#> [1] 1.603984

And so we see that we have proper coverage.

Next we can look at effect modifications:

em_results$M3W3 %>%
  kbl(caption = "Effect Modification Stochastic Intervention Results for M3 and W3") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Effect Modification Stochastic Intervention Results for M3 and W3
Condition Psi Variance SE Lower_CI Upper_CI P_value Fold Type Variables N Delta
Level 1 Shift Diff in W3 \<= 0 -3.335721 3.3306066 1.8249950 -6.9126 0.2412 0.0675799 1 Effect Mod M3W3 1334 1
Level 0 Shift Diff in W3 \<= 0 4.242965 8.7593547 2.9596207 -1.5578 10.0437 0.1516814 1 Effect Mod M3W3 1334 1
Level 1 Shift Diff in W3 \<= 0 3.485335 4.2329425 2.0574116 -0.5471 7.5178 0.0902579 2 Effect Mod M3W3 1333 1
Level 0 Shift Diff in W3 \<= 0 11.881071 3.4080300 1.8460850 8.2628 15.4993 0.0000000 2 Effect Mod M3W3 1333 1
Level 1 Shift Diff in W3 \<= 0 4.945589 41.1813216 6.4172675 -7.6320 17.5232 0.4409031 3 Effect Mod M3W3 1333 1
Level 0 Shift Diff in W3 \<= 0 11.089110 6.3564564 2.5212014 6.1476 16.0306 0.0000109 3 Effect Mod M3W3 1333 1
Level 1 Shift Diff in W3 \<= 0 1.689869 0.9935718 0.9967807 -0.2638 3.6435 0.0900134 Pooled TMLE Effect Mod M3W3 4000 1
Level 0 Shift Diff in W3 \<= 0 10.114451 1.4270177 1.1945785 7.7731 12.4558 0.0000000 Pooled TMLE Effect Mod M3W3 4000 1

Let’s first look at the truth:

sim_out$effect_mod
#> $`Level 0 Shift Diff in W3 <= 0`
#> [1] 10.99749
#> 
#> $`Level 1 Shift Diff in W3 <= 0`
#> [1] 1

When W3 is 1 the truth effect is 11, our estimates are 11 with CI coverage. When W3 is 0 the truth is 1, our estimate is 1.9 with CIs that cover the truth as well.

And finally results for the joint shift which is a joint shift compared to additive individual shifts.

joint_shift_results$M1M4 %>%
  kbl(caption = "Interactions Stochastic Intervention Results for M1 and M4") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Interactions Stochastic Intervention Results for M1 and M4
Condition Psi Variance SE Lower CI Upper CI P-value Fold Type Variables N Delta M1 Delta M4
M1 2.701885 0.7471848 0.8643985 1.0077 4.3961 0.0036597 1 Interaction M1&M4 1334 1 1
M4 10.236543 0.3365067 0.5800920 9.0996 11.3735 0.0000000 1 Interaction M1&M4 1334 1 1
M1&M4 11.832085 0.3484322 0.5902815 10.6752 12.9890 0.0000000 1 Interaction M1&M4 1334 1 1
Psi -1.106343 0.7334625 0.8564242 -2.7849 0.5722 0.2318963 1 Interaction M1&M4 1334 1 1
M1 2.701885 0.7471848 0.8643985 1.0077 4.3961 0.0036597 Pooled TMLE Interaction M1&M4 1334 1 1
M4 10.236543 0.3365067 0.5800920 9.0996 11.3735 0.0000000 Pooled TMLE Interaction M1&M4 1334 1 1
M1&M4 11.832085 0.3484322 0.5902815 10.6752 12.9890 0.0000000 Pooled TMLE Interaction M1&M4 1334 1 1
Psi -1.106343 0.7334625 0.8564242 -2.7849 0.5722 0.2318963 Pooled TMLE Interaction M1&M4 1334 1 1

Let’s look at the truth again:

sim_out$m1_effect
#> [1] 1.603984
sim_out$m4_effect
#> [1] 10.41265
sim_out$m14_effect
#> [1] 12.41664
sim_out$m14_intxn
#> [1] 0.4

So comparing the results to the above table in the pooled section we can see all our estimates for the marginal shifts, dual shift, and difference between dual and sum of marginals have CIs that cover the truth.


Issues

If you encounter any bugs or have any specific feature requests, please file an issue. Further details on filing issues are provided in our contribution guidelines.


Contributions

Contributions are very welcome. Interested contributors should consult our contribution guidelines prior to submitting a pull request.


Citation

After using the SuperNOVA R package, please cite the following:


Related

  • R/tmle3shift - An R package providing an independent implementation of the same core routines for the TML estimation procedure and statistical methodology as is made available here, through reliance on a unified interface for Targeted Learning provided by the tmle3 engine of the tlverse ecosystem.

  • R/medshift - An R package providing facilities to estimate the causal effect of stochastic treatment regimes in the mediation setting, including classical (IPW) and augmented double robust (one-step) estimators. This is an implementation of the methodology explored by Dı́az and Hejazi (2020).

  • R/haldensify - A minimal package for estimating the conditional density treatment mechanism component of this parameter based on using the highly adaptive lasso (Coyle, Hejazi, Phillips, et al. 2022; Hejazi, Coyle, and van der Laan 2020) in combination with a pooled hazard regression. This package implements a variant of the approach advocated by Dı́az and van der Laan (2011).


Funding

The development of this software was supported in part through grants from the


License

© 2020-2022 David B. McCoy

The contents of this repository are distributed under the MIT license. See below for details:

MIT License
Copyright (c) 2020-2022 David B. McCoy
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

References

Coyle, Jeremy R, Nima S Hejazi, Ivana Malenica, Rachael V Phillips, and Oleg Sofrygin. 2022. “sl3: Modern Machine Learning Pipelines for Super Learning.” https://doi.org/10.5281/zenodo.1342293.

Coyle, Jeremy R, Nima S Hejazi, Rachael V Phillips, Lars W van der Laan, and Mark J van der Laan. 2022. “hal9001: The Scalable Highly Adaptive Lasso.” https://doi.org/10.5281/zenodo.3558313.

Díaz, Iván, and Nima S. Hejazi. 2020. “Causal mediation analysis for stochastic interventions.” Journal of the Royal Statistical Society. Series B: Statistical Methodology 82 (3): 661–83. https://doi.org/10.1111/rssb.12362.

Dı́az, Iván, and Nima S Hejazi. 2020. “Causal Mediation Analysis for Stochastic Interventions.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82 (3): 661–83. https://doi.org/10.1111/rssb.12362.

Dı́az, Iván, and Mark J van der Laan. 2011. “Super Learner Based Conditional Density Estimation with Application to Marginal Structural Models.” The International Journal of Biostatistics 7 (1): 1–20.

———. 2012. “Population Intervention Causal Effects Based on Stochastic Interventions.” Biometrics 68 (2): 541–49.

Haneuse, Sebastian, and Andrea Rotnitzky. 2013. “Estimation of the Effect of Interventions That Modify the Received Treatment.” Statistics in Medicine 32 (30): 5260–77.

Hejazi, Nima S, Jeremy R Coyle, and Mark J van der Laan. 2020. “hal9001: Scalable Highly Adaptive Lasso Regression in R.” Journal of Open Source Software 5 (53): 2526. https://doi.org/10.21105/joss.02526.

McCoy, David B., Alan E. Hubbard, Alejandro Schuler, and Mark J. van der Laan. 2023. “Semi-Parametric Identification and Estimation of Interaction and Effect Modification in Mixed Exposures Using Stochastic Interventions.” https://arxiv.org/abs/2305.01849.