PeDecURe provides feature extraction with built-in adjustment for nuisance variables. Our method identifies sources of variation in the data that are shared between features (e.g., measurements derived from neuroimaging scans) and an outcome of interest (e.g., diagnosis), while substantially reducing overlap with information about nuisance variables (e.g., age or sex).
In our paper (now published in Biostatistics), we introduce the intuition behind our method and illustrate that features extracted using PeDecURe are predictive of an outcome of interest, have low correlations with nuisance variables, and show promise for out-of-sample generalizability.
# install.packages("devtools")
devtools::install_github("smweinst/PeDecURe")
Notation:
X: feature matrix
A: matrix of nuisance variables
Y: vector with outcome labels (e.g., disease group), one entry per row of X
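To make the shapes of these inputs concrete, here is a minimal base-R sketch that simulates toy versions of X, A, and Y. The sample size, number of features, effect sizes, and the choice of age and sex as nuisance variables are all assumptions for illustration, not values from the paper.

```r
set.seed(1)
n <- 200  # number of subjects (hypothetical)
p <- 10   # number of features (hypothetical)

# Nuisance variables: a continuous one (age) and a binary one (sex)
A <- cbind(age = rnorm(n, mean = 60, sd = 8),
           sex = rbinom(n, size = 1, prob = 0.5))

# Outcome labels, e.g., 0 = control, 1 = patient
Y <- rbinom(n, size = 1, prob = 0.5)

# Features: noise plus signal from the outcome and from the nuisance variables
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
X[, 1:3] <- X[, 1:3] + 0.8 * Y                     # outcome-related features
X[, 4:6] <- X[, 4:6] + 0.05 * (A[, "age"] - 60)    # age-related features
X[, 7:8] <- X[, 7:8] + 0.5 * A[, "sex"]            # sex-related features
```

Each row of X, A, and Y refers to the same subject, so all three must have n rows (or n entries for Y).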
Implementation using PeDecURe R package:
library(PeDecURe)
# get residuals:
resid.dat = get.resid(X,Y,A)
X.star = resid.dat$X.star
X.tilde = resid.dat$X.tilde
# tune lambda:
lambda.tune = pedecure.tune(X.orig = X,
                            X.max = X.star,
                            X.penalize = X.tilde,
                            lambdas = seq(0,10,by=0.1),
                            A = A,
                            Y = Y,
                            nPC = 3)
best.lambda = lambda.tune$lambda_tune
# run pedecure:
pedecure.out = pedecure(X = X.star,
                        X.penalize = X.tilde,
                        A = A,
                        Y = Y,
                        lambda = best.lambda,
                        nPC = 3)
# PC scores - these are our new features that can be used for an association study, predictive model, etc.
## note: X should be centered by column
PC.scores = X%*%pedecure.out$vectors
# Look at correlations between the first few PC scores and the nuisance variables and outcome (A1, A2, Y)
cor.scores = partial.cor(PC.scores, A, Y)
scores.partial.cor = cor.scores$partial$estimates
scores.marginal.cor = cor.scores$marginal$estimates
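The note in the code above asks for X to be centered by column before the scores are computed. A minimal base-R sketch of that step follows. The toy matrix and the placeholder loading matrix `V` (standing in for `pedecure.out$vectors`) are assumptions for illustration only and are not package output.

```r
set.seed(2)
n <- 100
p <- 6
X <- matrix(rnorm(n * p, mean = 5), nrow = n, ncol = p)  # toy feature matrix

# Center each column; scale = FALSE leaves the column variances unchanged
X.centered <- scale(X, center = TRUE, scale = FALSE)

# Placeholder loadings with orthonormal columns (stand-in for pedecure.out$vectors)
V <- qr.Q(qr(matrix(rnorm(p * 3), nrow = p, ncol = 3)))

# Scores: one row per subject, one column per component
PC.scores <- X.centered %*% V
```

Because the columns of `X.centered` have mean zero, the resulting score columns have mean zero as well.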
PC scores in new sample: multiply new feature matrix by PC loadings from above.
PC.scores.test = X.test%*%pedecure.out$vectors
# note: pedecure.out$vectors was the output from applying PeDecURe in training sample above
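When projecting a new sample, one common convention is to center the new feature matrix with the training-sample column means, so that training and test scores share the same origin. This is an assumption on our part rather than something the package documentation prescribes. A base-R sketch with toy matrices (all names and sizes hypothetical):

```r
set.seed(3)
p <- 5
X.train <- matrix(rnorm(80 * p, mean = 2), ncol = p)  # toy training features
X.test  <- matrix(rnorm(40 * p, mean = 2), ncol = p)  # toy test features

# Column means estimated in the training sample only
train.means <- colMeans(X.train)

# Subtract the training means from each column of both samples
X.train.c <- sweep(X.train, 2, train.means, "-")
X.test.c  <- sweep(X.test,  2, train.means, "-")
```

The centered test matrix can then take the place of `X.test` in the multiplication by the training loadings.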
If A.test and Y.test are observed in the new sample, we can also look at their correlations with the PC scores in the test sample:
A.test: matrix of nuisance variables in the new sample (if observed)
Y.test: vector with outcome labels in the new sample (if observed), one entry per row of X.test
cor.scores.test = partial.cor(PC.scores.test, A.test, Y.test)
scores.partial.cor.test = cor.scores.test$partial$estimates
scores.marginal.cor.test = cor.scores.test$marginal$estimates
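To see why marginal and partial estimates can differ, here is a generic base-R illustration of the two quantities on toy data. It uses the textbook residual-based definition of partial correlation and is not the package's `partial.cor` implementation; the variables `s`, `a`, and `y` are hypothetical.

```r
set.seed(4)
n <- 2000
y <- rbinom(n, size = 1, prob = 0.5)  # toy outcome labels
a <- 2 * y + rnorm(n)                 # toy nuisance variable related to y
s <- 2 * y + rnorm(n)                 # toy score related to y, unrelated to a given y

# Marginal correlation: ignores y
marginal <- cor(s, a)

# Partial correlation given y: correlate what remains after regressing out y
partial <- cor(resid(lm(s ~ y)), resid(lm(a ~ y)))
```

In this toy setup the score and the nuisance variable are marginally correlated only because both depend on the outcome, so the partial correlation is close to zero while the marginal correlation is not.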