
Knockoffs (1/4): add comments and docstrings of the functions #128

Open: wants to merge 47 commits into base: main

Commits (47)
e202b84
Remove unnecessary argument
lionelkusch Jan 10, 2025
82b829c
Change the behavior a bit
lionelkusch Jan 10, 2025
d4f80c2
Update the variables
lionelkusch Jan 10, 2025
47dbe24
Put all the knockoff tests together
lionelkusch Jan 10, 2025
81e64c2
Fix a bug
lionelkusch Jan 10, 2025
d6ded88
remove the function for estimating covariance
lionelkusch Jan 10, 2025
801c5f9
Remove unnecessary file
lionelkusch Jan 10, 2025
4c61dd7
Remove a function
lionelkusch Jan 10, 2025
b890cba
comparison with original code
lionelkusch Jan 10, 2025
b6fe948
improve docstring and function
lionelkusch Jan 10, 2025
43a3e9a
Merge file for knockoff together
lionelkusch Jan 10, 2025
a116b53
Add function to repeat the Gaussian knockoff
lionelkusch Jan 10, 2025
1699675
Put all the tests for knockoff in one file
lionelkusch Jan 10, 2025
db6098f
Include the new function in the init
lionelkusch Jan 10, 2025
e38f376
Fix bug
lionelkusch Jan 10, 2025
8861480
Fix bugs
lionelkusch Jan 10, 2025
a0d53b0
Fix test for new signature of the function
lionelkusch Jan 10, 2025
4bb9eb4
Improve the docstring
lionelkusch Jan 10, 2025
d21f46a
Improve docstring knockoff
lionelkusch Jan 10, 2025
f752f9e
/bin/bash: line 1: :wq: command not found
lionelkusch Jan 10, 2025
071ea6d
Remove the beginning of the file
lionelkusch Jan 10, 2025
68dd7e6
Add equations
lionelkusch Jan 13, 2025
727ee7d
Change reference for paper
lionelkusch Jan 13, 2025
efbe49a
add a reference
lionelkusch Jan 15, 2025
bfe111a
Add new tests
lionelkusch Jan 15, 2025
211599e
Rename function and remove warning for test
lionelkusch Jan 15, 2025
beeda0a
Add parameters of knockoff
lionelkusch Jan 15, 2025
3736661
Merge branch 'main' into PR_knockoffs
lionelkusch Jan 15, 2025
d066c5a
Format files
lionelkusch Jan 15, 2025
f6c699f
format file
lionelkusch Jan 15, 2025
bf621d2
Fix bugs
lionelkusch Jan 15, 2025
779288b
Fix bug in utils
lionelkusch Jan 15, 2025
20d0567
Update example knockoff
lionelkusch Jan 15, 2025
c71cc22
Formatting
lionelkusch Jan 15, 2025
bbd3252
Update function
lionelkusch Jan 15, 2025
9af787f
Apply suggestions from code review
lionelkusch Jan 16, 2025
548e283
Fix name variables
lionelkusch Jan 16, 2025
0ad893e
Fix name variables
lionelkusch Jan 16, 2025
252a012
Fix test and name variables
lionelkusch Jan 16, 2025
b34cf1f
Add tests and fix bugs
lionelkusch Jan 16, 2025
2ddfd9d
Add a test and format file
lionelkusch Jan 16, 2025
df45c82
Undo delete of the tests
lionelkusch Jan 16, 2025
e3eeb72
Improve coverage and test
lionelkusch Jan 17, 2025
4df5d90
Format
lionelkusch Jan 17, 2025
8aee168
Group the aggregation and non-aggregation functions together
lionelkusch Jan 17, 2025
bfe6346
Formatting
lionelkusch Jan 17, 2025
4e3f4d2
Replace lambda by alpha
lionelkusch Jan 20, 2025
5 changes: 4 additions & 1 deletion doc_conf/api.rst
@@ -23,8 +23,11 @@ Functions
ensemble_clustered_inference
group_reid
hd_inference
knockoff_aggregation
model_x_knockoff
model_x_knockoff_filter
model_x_knockoff_pvalue
model_x_knockoff_bootstrap_quantile
model_x_knockoff_bootstrap_e_value
Contributor:

Are all these functions meant to be public?

Collaborator (Author):

Yes, they should be public.

multivariate_1D_simulation
permutation_test_cv
reid
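The `model_x_knockoff_filter` function added above performs the selection step on precomputed knockoff statistics. As a rough illustration of what that step computes, here is a self-contained sketch of the knockoff+ selection rule of Barber and Candès (2015); this is not hidimstat's actual implementation, and the name `knockoff_filter_sketch` is hypothetical:

```python
import numpy as np

def knockoff_filter_sketch(W, fdr=0.1, offset=1):
    """Select variables via the knockoff+ threshold.

    W : signed knockoff statistics, one per variable; a large positive
        value is evidence that the variable beats its knockoff copy.
    offset=1 gives the knockoff+ variant with provable FDR control;
    offset=0 gives the more liberal original knockoff filter.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the magnitudes of the nonzero statistics.
    thresholds = np.sort(np.abs(W[W != 0]))
    for t in thresholds:
        # Estimated FDP at threshold t: strongly negative statistics
        # stand in for the unobservable false positives.
        fdp_estimate = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_estimate <= fdr:
            return np.where(W >= t)[0]
    # No threshold achieves the target FDR: select nothing.
    return np.array([], dtype=int)
```

The selection is the set of variables above the smallest magnitude at which the estimated FDP drops below the target; if no threshold qualifies, the procedure returns an empty selection.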
60 changes: 28 additions & 32 deletions doc_conf/references.bib
@@ -77,22 +77,6 @@ @article{Ren_2023
eprint = {https://academic.oup.com/jrsssb/article-pdf/86/1/122/56629998/qkad085.pdf},
}

@article{Candes_2018,
author = {Candès, Emmanuel and Fan, Yingying and Janson, Lucas and Lv, Jinchi},
title = "{Panning for Gold: ‘Model-X’ Knockoffs for High Dimensional Controlled Variable Selection}",
journal = {Journal of the Royal Statistical Society Series B: Statistical Methodology},
volume = {80},
number = {3},
pages = {551-577},
year = {2018},
month = {01},
abstract = "{ Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the fraction of false discoveries effectively even in high dimensional logistic regression, not to mention general high dimensional non-linear models. To address such a practical problem, we propose a new framework of ‘model-X’ knockoffs, which reads from a different perspective the knockoff procedure that was originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with n⩾p, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. To do this, our approach requires that the covariates are random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown or estimated distributions. To our knowledge, no other procedure solves the controlled variable selection problem in such generality but, in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. Finally, we apply our procedure to data from a case–control study of Crohn's disease in the UK, making twice as many discoveries as the original analysis of the same data.}",
issn = {1369-7412},
doi = {10.1111/rssb.12265},
url = {https://doi.org/10.1111/rssb.12265},
eprint = {https://academic.oup.com/jrsssb/article-pdf/80/3/551/49274696/jrsssb\_80\_3\_551.pdf},
}

@article{breimanRandomForests2001,
title = {Random {{Forests}}},
author = {Breiman, Leo},
@@ -148,20 +132,24 @@ @article{miPermutationbasedIdentificationImportant2021
keywords = {Cancer,Data mining,Machine learning,Statistical methods},
}

@article{candesPanningGoldModelX2017,
title = {Panning for {{Gold}}: {{Model-X Knockoffs}} for {{High-dimensional Controlled Variable Selection}}},
shorttitle = {Panning for {{Gold}}},
author = {Candes, Emmanuel and Fan, Yingying and Janson, Lucas and Lv, Jinchi},
year = {2017},
month = dec,
journal = {arXiv:1610.02351 [math, stat]},
eprint = {1610.02351},
primaryclass = {math, stat},
urldate = {2022-01-12},
abstract = {Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a nonlinear fashion, such as when the response is binary. Although this modeling problem has been extensively studied, it remains unclear how to effectively control the fraction of false discoveries even in high-dimensional logistic regression, not to mention general high-dimensional nonlinear models. To address such a practical problem, we propose a new framework of \$model\$-\$X\$ knockoffs, which reads from a different perspective the knockoff procedure (Barber and Cand{\textbackslash}`es, 2015) originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with \$n{\textbackslash}ge p\$, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. To do this, our approach requires the covariates be random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown/estimated distributions. To our knowledge, no other procedure solves the \$controlled\$ variable selection problem in such generality, but in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. Finally, we apply our procedure to data from a case-control study of Crohn's disease in the United Kingdom, making twice as many discoveries as the original analysis of the same data.},
archiveprefix = {arxiv},
keywords = {Mathematics - Statistics Theory,Statistics - Applications,Statistics - Methodology},
file = {/home/ahmad/Zotero/storage/YZ23F3Q5/Candes et al. - 2017 - Panning for Gold Model-X Knockoffs for High-dimen.pdf;/home/ahmad/Zotero/storage/ZSN64F6N/1610.html}
@article{candes2018panning,
title={Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection},
author={Candes, Emmanuel and Fan, Yingying and Janson, Lucas and Lv, Jinchi},
journal={Journal of the Royal Statistical Society Series B: Statistical Methodology},
volume={80},
number={3},
pages={551--577},
year={2018},
publisher={Oxford University Press}
}

@article{barber2015controlling,
title={Controlling the false discovery rate via knockoffs},
author={Barber, Rina Foygel and Cand{\`e}s, Emmanuel J},
journal={The Annals of statistics},
pages={2055--2085},
year={2015},
publisher={JSTOR}
}

@article{liuFastPowerfulConditional2021,
@@ -176,5 +164,13 @@ @article{liuFastPowerfulConditional2021
abstract = {We consider the problem of conditional independence testing: given a response Y and covariates (X,Z), we test the null hypothesis that Y is independent of X given Z. The conditional randomization test (CRT) was recently proposed as a way to use distributional information about X{\textbar}Z to exactly (non-asymptotically) control Type-I error using any test statistic in any dimensionality without assuming anything about Y{\textbar}(X,Z). This flexibility in principle allows one to derive powerful test statistics from complex prediction algorithms while maintaining statistical validity. Yet the direct use of such advanced test statistics in the CRT is prohibitively computationally expensive, especially with multiple testing, due to the CRT's requirement to recompute the test statistic many times on resampled data. We propose the distilled CRT, a novel approach to using state-of-the-art machine learning algorithms in the CRT while drastically reducing the number of times those algorithms need to be run, thereby taking advantage of their power and the CRT's statistical guarantees without suffering the usual computational expense. In addition to distillation, we propose a number of other tricks like screening and recycling computations to further speed up the CRT without sacrificing its high power and exact validity. Indeed, we show in simulations that all our proposals combined lead to a test that has similar power to the most powerful existing CRT implementations but requires orders of magnitude less computation, making it a practical tool even for large data sets. We demonstrate these benefits on a breast cancer dataset by identifying biomarkers related to cancer stage.},
archiveprefix = {arxiv},
keywords = {Statistics - Methodology},
file = {/home/ahmad/Zotero/storage/8HRQZX3H/Liu et al. - 2021 - Fast and Powerful Conditional Randomization Testin.pdf;/home/ahmad/Zotero/storage/YFNDKN2B/2006.html}
}
}

@article{reid2016study,
title={A study of error variance estimation in lasso regression},
author={Reid, Stephen and Tibshirani, Robert and Friedman, Jerome},
journal={Statistica Sinica},
pages={35--67},
year={2016},
publisher={JSTOR}
}
52 changes: 36 additions & 16 deletions examples/plot_knockoff_aggregation.py
@@ -24,10 +24,16 @@

import numpy as np
from hidimstat.data_simulation import simu_data
from hidimstat.knockoffs import model_x_knockoff
from hidimstat.knockoff_aggregation import knockoff_aggregation
from hidimstat.knockoffs import (
model_x_knockoff,
model_x_knockoff_filter,
model_x_knockoff_bootstrap_quantile,
model_x_knockoff_bootstrap_e_value,
)
from hidimstat.utils import cal_fdp_power
from sklearn.utils import check_random_state
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt

plt.rcParams.update({"font.size": 26})
@@ -61,32 +67,46 @@ def single_run(
)

# Use model-X Knockoffs [1]
mx_selection = model_x_knockoff(X, y, fdr=fdr, n_jobs=n_jobs, seed=seed)

test_scores = model_x_knockoff(
X,
y,
estimator=LassoCV(
n_jobs=n_jobs,
verbose=0,
max_iter=1000,
cv=KFold(n_splits=5, shuffle=True, random_state=0),
tol=1e-6,
),
n_bootstraps=1,
random_state=seed,
)
mx_selection = model_x_knockoff_filter(test_scores, fdr=fdr)
fdp_mx, power_mx = cal_fdp_power(mx_selection, non_zero_index)

# Use p-values aggregation [2]
aggregated_ko_selection = knockoff_aggregation(
test_scores = model_x_knockoff(
X,
y,
fdr=fdr,
estimator=LassoCV(
n_jobs=n_jobs,
verbose=0,
max_iter=1000,
cv=KFold(n_splits=5, shuffle=True, random_state=0),
tol=1e-6,
),
n_bootstraps=n_bootstraps,
n_jobs=n_jobs,
gamma=0.3,
random_state=seed,
)
aggregated_ko_selection = model_x_knockoff_bootstrap_quantile(
test_scores, fdr=fdr, gamma=0.3, selection_only=True
)

fdp_pval, power_pval = cal_fdp_power(aggregated_ko_selection, non_zero_index)

# Use e-values aggregation [1]
eval_selection = knockoff_aggregation(
X,
y,
fdr=fdr,
method="e-values",
n_bootstraps=n_bootstraps,
n_jobs=n_jobs,
gamma=0.3,
random_state=seed,
eval_selection = model_x_knockoff_bootstrap_e_value(
test_scores, fdr=fdr, selection_only=True
)

fdp_eval, power_eval = cal_fdp_power(eval_selection, non_zero_index)
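The example scores each method with `cal_fdp_power`. As a minimal sketch of what such a helper presumably computes from a selection set and the true support (an illustration under assumed semantics, not hidimstat's code; `fdp_power_sketch` is a hypothetical name):

```python
import numpy as np

def fdp_power_sketch(selected, true_support):
    """False discovery proportion and power of a selection set.

    selected : indices returned by the selection procedure.
    true_support : indices of the truly non-zero coefficients.
    """
    selected = set(np.atleast_1d(np.asarray(selected, dtype=int)).tolist())
    true_support = set(np.atleast_1d(np.asarray(true_support, dtype=int)).tolist())
    # FDP: fraction of selections that are not in the true support.
    fdp = len(selected - true_support) / max(len(selected), 1)
    # Power: fraction of the true support that was recovered.
    power = len(selected & true_support) / max(len(true_support), 1)
    return fdp, power
```

The `max(..., 1)` guards define both quantities as 0 when nothing is selected, which matches the usual convention for empty selections.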
14 changes: 11 additions & 3 deletions src/hidimstat/__init__.py
@@ -3,8 +3,13 @@
from .desparsified_lasso import desparsified_group_lasso, desparsified_lasso
from .Dnn_learner_single import DnnLearnerSingle
from .ensemble_clustered_inference import ensemble_clustered_inference
from .knockoff_aggregation import knockoff_aggregation
from .knockoffs import model_x_knockoff
from .knockoffs import (
model_x_knockoff,
model_x_knockoff_filter,
model_x_knockoff_pvalue,
model_x_knockoff_bootstrap_quantile,
model_x_knockoff_bootstrap_e_value,
)
from .multi_sample_split import aggregate_quantiles
from .noise_std import group_reid, reid
from .permutation_test import permutation_test_cv
@@ -31,8 +36,11 @@
"ensemble_clustered_inference",
"group_reid",
"hd_inference",
"knockoff_aggregation",
"model_x_knockoff",
"model_x_knockoff_filter",
"model_x_knockoff_pvalue",
"model_x_knockoff_bootstrap_quantile",
"model_x_knockoff_bootstrap_e_value",
"multivariate_1D_simulation",
"permutation_test_cv",
"reid",
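Among the newly exported names, `model_x_knockoff_bootstrap_quantile` aggregates the per-bootstrap p-values into a single selection. A self-contained sketch of the standard gamma-quantile aggregation rule (Meinshausen, Meier, and Bühlmann, 2009) that such a function is typically built on; the function name and exact behavior here are illustrative, not hidimstat's implementation:

```python
import numpy as np

def quantile_aggregate_pvalues(pvals, gamma=0.3):
    """Aggregate one p-value per bootstrap into one p-value per variable.

    pvals : array of shape (n_bootstraps, n_variables).
    The rule takes the empirical gamma-quantile across bootstraps and
    inflates it by 1/gamma, which preserves validity of the aggregated
    p-value; the result is clipped at 1.
    """
    pvals = np.asarray(pvals, dtype=float)
    q = np.quantile(pvals, gamma, axis=0)
    return np.minimum(1.0, q / gamma)
```

A smaller `gamma` keys the aggregate to the most significant bootstraps but pays a larger 1/gamma inflation penalty; gamma=0.3 (as in the example above) is a common middle ground.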