Commit: Merge pull request #3 from molgenis/feature/model_creation (Feature/model creation)

Showing 34 changed files with 2,225 additions and 0 deletions.
# capice-resources

Repository for resource files for CAPICE and for updating the CAPICE model. It contains several modules, listed below. Each module has its own README (if applicable) describing how to use it.

## Installation

Installation should be as easy as `pip install -e '.[test]'`.
To test the individual modules, change directory to the specific module and run `pytest`.

## Modules

### validation

Validation is not really a module, but rather a folder containing plots of supporting evidence for a model's performance per GitHub release.

### train_data_creator

The module train_data_creator creates new CAPICE train-test and validation datasets. For more information, refer to the README within the module.

### utility_scripts

utility_scripts contains a couple of bash scripts for ease of use, which include but are not limited to lifting over variants to GRCh38, running VEP104 and converting a VEP VCF output to TSV.
It is advised to use these scripts, as they maintain a level of consistency throughout the lifespan of CAPICE.

- liftover_variants.sh:

The script liftover_variants.sh converts a VCF containing GRCh37 variants to GRCh38 variants.

- vep_to_tsv.sh:

The script vep_to_tsv.sh converts the VEP output VCF back to TSV with all required columns.
If features are added or removed, be sure to adjust the `pre_header` variable within this bash script accordingly.
For instance, if the feature `CADD_score` is added, `\tCADD_score` has to be added to `pre_header`.

## Usage

### TLDR

_(For a more detailed explanation of creating the train-test and validation datasets, please see the README in [train_data_creator](./train_data_creator/README.md).)_

1. Make a new [CAPICE](https://github.com/molgenis/capice) release, containing added or removed processors and/or code changes supporting a new model. Steps:
    1. Add newly added features to the [impute json](https://github.com/molgenis/capice/blob/master/resources/train_impute_values.json) and/or remove deprecated features from it.
    2. Apply the changes in features to PRE_HEADER in the [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh).
    3. Annotate the new training VCF using VEP and convert the VCF using the [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh) (raw file: [train_input_raw.vcf.gz](https://github.com/molgenis/capice/blob/master/resources/train_input_raw.vcf.gz)) (note: use `-t` when using the conversion tool).
    4. Make the training TSV train-ready using `utility_scripts/vep_to_train.py`.
    5. Use the newly generated training TSV to create a new [PoC](https://github.com/molgenis/capice/blob/master/tests/resources/xgb_booster_poc.pickle.dat) model.
    6. Update [predict_input.tsv.gz](https://github.com/molgenis/capice/blob/master/resources/predict_input.tsv.gz) (raw file: [predict_input_raw.vcf.gz](https://github.com/molgenis/capice/blob/master/resources/predict_input_raw.vcf.gz)) with altered features.
    7. Update [breakends_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/breakends_vep.tsv.gz) (raw file: [breakends.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/breakends.vcf.gz)) with altered features.
    8. Update [edge_cases_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/edge_cases_vep.tsv.gz) (raw file: [edge_cases.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/edge_cases.vcf.gz)) with altered features.
    9. Update [symbolic_alleles_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/symbolic_alleles_vep.tsv.gz) (raw file: [symbolic_alleles.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/symbolic_alleles.vcf.gz)) with altered features.
    10. Run the [CAPICE](https://github.com/molgenis/capice) tests.
2. Download the latest non-public GRCh37 VKGL (`/apps/data/VKGL/GRCh37`) and [Clinvar](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/) datasets.
3. Use [train_data_creator](./train_data_creator/README.md) to create the train-test and validation VCFs.
    1. `python3 ./train_data_creator/main.py --input_vkgl </path/to/vkgl_nonpublic.tsv> --input_clinvar </path/to/clinvar.vcf.gz> -o </path/to/output>`
4. Make a [capice-resources](https://github.com/molgenis/capice-resources) GitHub release, matching the CAPICE release in step 1.
5. Attach both the train-test and validation VCFs to the [capice-resources](https://github.com/molgenis/capice-resources) release.
6. Download the latest VEP release from the [Molgenis Cloud](https://download.molgeniscloud.org/downloads/vip/images/).
7. Use the VEP singularity image, combined with the latest VEP cache, to annotate the VCF files.
    1. `singularity exec --bind /apps,/groups,/tmp vep <command>` (see [VEP](#VEP) for the command)
8. (Optional for XGBoost 0.72.1 models) Upload the validation VCF to [CADD1.4-GRCh37](https://cadd.gs.washington.edu/score) for annotation.
9. Lift over the not-yet-annotated VCFs to GRCh38 using `liftover_variants.sh`.
    1. `sbatch liftover_variants.sh -i </path/to/step3.vcf.gz> -o </path/to/output/directory/file>` (please note: do not supply an extension, as the script does not produce a single file)
10. Use the VEP singularity image to annotate the GRCh38 VCF files.
11. Convert the VEP-annotated VCFs back to TSV using the [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh) (using `-t`).
    1. `capice/scripts/convert_vep_vcf_to_tsv_capice.sh -i </path/to/vep.vcf.gz> -o </path/to/vep.tsv.gz> -t`
12. Process the GRCh37 train-test TSV.
    1. `python3 ./utility_scripts/vep_to_train.py -i /path/to/vep.tsv.gz -o /path/to/vep_processed.tsv.gz`
13. Repeat step 12 for the GRCh37 validation TSV.
14. Repeat steps 11 and 12 for train-test and validation for GRCh38 (add the `-a` flag to `vep_to_train.py`).
15. Update the imputing JSON according to the newly added and/or removed features.
16. Make sure the latest CAPICE release made in step 1 is available on the GCC cluster.
    1. `pip install capice` (be sure to install within a virtual environment or use the singularity image)
17. Use the JSON updated in step 15 and the train-test TSV made in step 12 to start the CAPICE train protocol.
    1. You may want to use a custom script that loads the latest Python module, activates the virtual environment and then starts the train protocol. It can take 5 hours for a new model to train.
    2. `module load <python>`
    3. `source ./capice/venv/bin/activate`
    4. `capice train -i </path/to/train-test.tsv.gz> -m </path/to/impute.json> -o </path/to/output>`
18. Attach the new models to the [CAPICE](https://github.com/molgenis/capice) and [capice-resources](https://github.com/molgenis/capice-resources) releases.
19. Use the new model generated in step 17 to generate the CAPICE results file of the validation TSV.
20. Use the latest non-`Release Candidate` model to generate a CAPICE results file of the same validation TSV.
21. Use `compare_old_model.py` in `utility_scripts` to compare the performance of the old model to the new model.
    1. `compare_old_model.py --old_model_results </path/to/old_capice_results.tsv> --vep_processed_capice_input </path/to/validation_vep.tsv.gz> --new_model_results </path/to/validation_vep_capice.tsv.gz> --output </path/to/rcX>`

## Making train-test and validation VCF files

Download the latest non-public GRCh37 VKGL (GCC `/apps/data/VKGL/GRCh37/vkgl_consensus_*.tsv`) and [Clinvar](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/) datasets and simply supply them to main.py within train_data_creator. For further details, use `python3 main.py -h` or refer to the README in the train_data_creator module.

## VEP

VEP is used to add annotations to the train-test and validation files. The following command is a good representation of what VEP command should be called:
```commandline
vep --input_file <path to your input file> --format vcf --output_file <path to your output file> --vcf --compress_output gzip --sift s --polyphen s --numbers --symbol --shift_3prime 1 --allele_number --no_stats --offline --cache --dir_cache */path/to/latest/vep/cache* --species "homo_sapiens" --assembly *GRCh37 or GRCh38* --refseq --use_given_ref --exclude_predicted --flag_pick_allele --force_overwrite --fork 4 --dont_skip --allow_non_variant --per_gene
```
_(`--input_file`, `--output_file` and `--assembly` have to be manually altered)_

_The command can vary based on the training features._

**When comparing an XGBoost 0.72.1 (CAPICE 2.0.0 or lower) model to a new XGBoost 1.4.2 model, upload the validation.vcf.gz to [CADD](https://cadd.gs.washington.edu/score) for annotation** (set "INCLUDE ANNOTATIONS" to True).

## Post-VEP processing

First convert the VEP output VCF back to TSV using `vep_to_tsv.sh` _(note: pre_header must be altered according to newly added or removed features)_.
**WARNING: Drop all columns that are not wanted within training, such as the ID column.** This step can be performed for all train-test and validation VCFs for both GRCh37 and GRCh38.

Making the converted VEP TSV train-ready requires the use of `vep_to_train.py` in `utility_scripts`. This works for both GRCh37 and GRCh38 (for GRCh38, only when the `-a` flag is supplied).

This script makes sure that variants are preserved with the correct labels, so that the label remains accurate for each variant.
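As an illustration of the column-dropping warning above, a minimal pandas sketch; the column names here are hypothetical, not the actual CAPICE feature set:

```python
import pandas as pd

# Hypothetical converted VEP TSV; column names are illustrative only.
df = pd.DataFrame({
    'ID': ['1_123_A_G', '2_456_C_T'],
    'CHROM': ['1', '2'],
    'some_feature': [0.5, 0.7],
    'binarized_label': [0.0, 1.0],
})

# Drop columns that must not end up as training features, such as ID,
# while keeping the binarized label intact.
train_ready = df.drop(columns=['ID'])

print(list(train_ready.columns))
```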

## Training

Before the training module of CAPICE can be called, make sure the impute JSON used within training is up to date with your newly added and/or removed features.

Training a new model is rather easy, but takes quite a bit of time. It is recommended to start a Slurm job on the GCC cluster and let it run overnight.

The training module can be activated with the following command (assuming you have installed CAPICE the way it is supposed to be installed: `pip install .` or `pip install capice`):

`capice -v train -i /path/to/processed/train_test.tsv.gz -m /path/to/up/to/date/impute.json -o /path/to/output.pickle.dat`

## Generating results

### XGBoost 0.90 vs XGBoost 1.4.2 models

By now, you should already have a CADD 1.4 annotated output file for "old" CAPICE to score.
Get the CAPICE result file for both the old generation and the new generation.
It is preferred to use CAPICE 1.3.1 for "old" CAPICE. For "new" CAPICE, use your latest release, as the model file will try to match the model version to the CAPICE version.

### XGBoost 1.4.2 vs XGBoost 1.4.2 models

Work in progress.

## Validate

Merging the old and new validation datafiles happens within `utility_scripts/compare_old_model.py`.
This script merges the old results file with a file containing the original labels (the direct output of a VEP output converted to TSV using the `-t` flag of `convert_vep_vcf_to_tsv_capice.sh` of [CAPICE](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh) is perfect).
After that merge, and some cleanup of columns that are no longer going to be used, the new results file is merged on top of the old results merged with their original true labels.

After this merge, the AUC is calculated globally, as well as the absolute difference between each predicted score and its true label.
These global results are then plotted, and the AUC calculations and absolute score differences are also computed per consequence present within the **old CAPICE** datafile.

This absolute score difference tells us how close the scores are to the true label, where 1 means the score is far off from the true label and 0 means the score matches the label. We want 0.
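The two global metrics can be sketched in a small self-contained example. The numbers are made up, and the rank-based AUC below is only a stand-in for a library implementation such as scikit-learn's `roc_auc_score`:

```python
# Illustrative values, not real CAPICE output.
true_labels = [0.0, 0.0, 1.0, 1.0]
new_scores = [0.1, 0.3, 0.8, 0.9]


def auc(labels, scores):
    """Probability that a random pathogenic (1.0) variant scores higher
    than a random benign (0.0) variant (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1.0]
    neg = [s for l, s in zip(labels, scores) if l == 0.0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


global_auc = auc(true_labels, new_scores)

# Absolute score difference per variant: 0 means the score matches the
# label exactly, 1 means it is maximally far off.
abs_diff = [abs(s - l) for s, l in zip(new_scores, true_labels)]

print(global_auc, sum(abs_diff) / len(abs_diff))
```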

Furthermore, a table of the feature importances of the new model is plotted as well, so you can see what is and what is not important.

Compare the results of old vs new, discuss within the team and decide on next steps.

## Common issues

- Training with my generated dataset causes the warning `ValueError: multiclass format is not supported`, resulting in an average CV performance of NaN.

This one is a bit tricky, but it comes down to a bug somewhere within the creation of the ID column. One of the binarized labels is set to 2; the origin is unknown. This is fixed as of CAPICE 3.0.0-beta2.

## FAQ

- I'm getting a `ValueError: invalid literal for int() with base 10: 'chr1'`, what's wrong?

You've likely downloaded a genome build 38 file, which uses a different annotation for the chromosome. `train_data_creator` only supports build 37.

***
```python
#!/usr/bin/env python3

from setuptools import setup

with open('README.md', 'r', encoding='utf-8') as fh:
    long_description = fh.read()

setup(
    name='capice-resources',
    version='1.0.0-dev',
    url='https://capice.molgeniscloud.org/',
    license='LGPL-3.0',
    author='Molgenis',
    author_email='[email protected]',
    description='Resource files for CAPICE to train new CAPICE models',
    long_description=long_description,
    classifiers=[
        'Development Status :: 4 - Beta',
        'License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)',
        'Programming Language :: Python :: 3.9',
        'Programming Language :: Python :: 3.10'
    ],
    python_requires='>=3.9',
    install_requires=[
        'pandas==1.3.4',
        'numpy==1.21.4',
        'matplotlib==3.5.1',
        'xgboost==1.4.2',
        'scikit-learn==1.0.2'
    ],
    extras_require={
        'test': [
            'pytest',
        ]
    }
)
```

***
# Train data creator

###### Resource module for CAPICE to create new train-test and validation datasets

## Usage

`python3 main.py -iv path/to/vkgl_consensus.tsv.gz -ic path/to/clinvar.vcf.gz -o path/to/existing/output/directory`

_Note: -iv needs to be the non-public VKGL; the public one will throw a KeyNotFoundError._

_Note 2: the parent of the output directory has to exist. train_data_creator will, at most, create only 1 new directory._

## Theory

(data_preprocessor.py)

Step 1 is to load the different types of datasets, which is not all that easy.

VCF files have header lines that cannot be parsed directly. The number of header lines is therefore counted and supplied to the pandas parser as lines to skip.

The VKGL TSV is easier to parse, since it does not require extra skips.

After that, the files are validated.
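The header-skipping idea can be sketched as follows; the file contents are illustrative:

```python
import io

import pandas as pd

# Toy VCF content: '##' meta lines, then the '#CHROM' column header row.
vcf_text = (
    '##fileformat=VCFv4.2\n'
    '##source=example\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    '1\t123\t.\tA\tG\t.\t.\t.\n'
)

# Count the '##' meta lines, then tell pandas to skip them so that the
# '#CHROM' line becomes the header row.
skip_rows = sum(1 for line in vcf_text.splitlines() if line.startswith('##'))
df = pd.read_csv(io.StringIO(vcf_text), sep='\t', skiprows=skip_rows)

print(skip_rows, list(df.columns))
```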

For VKGL, the file follows the following path:

- Correcting the TSV to VCF-like column names.
- Correcting the consensus classification "Classified by one lab" to the actual classification of that lab (also setting the number of labs for that variant to 1 for the review status).
- Ordering according to VCF standards.
- Applying the ClinVar review status to its Gold Star description.
- Equalizing the (likely) benign and (likely) pathogenic (also VUS) classifications (and dropping everything else).
- Applying a binarized label (0.0 for (likely) benign and 1.0 for (likely) pathogenic).

For ClinVar, the file follows a similar path:

- (Before the file even gets loaded) obtaining the number of header lines.
- Obtaining the classification from the INFO field.
- Obtaining the gene from the INFO field.
- Obtaining the review status from the INFO field.
- Equalizing the (likely) benign and (likely) pathogenic (also VUS) classifications (and dropping everything else).
- Applying a binarized label (0.0 for (likely) benign and 1.0 for (likely) pathogenic).

After that, both files are equal in terms of columns, column types and the values within the columns.
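The equalizing and binarizing steps above can be sketched as follows. This is a simplified sketch: the classification strings are illustrative, not the exact VKGL/ClinVar vocabulary, and VUS handling is omitted:

```python
import pandas as pd

# Illustrative classifications, not the exact source vocabulary.
df = pd.DataFrame({'class': [
    'Benign', 'Likely benign', 'Pathogenic', 'Likely pathogenic',
    'Conflicting',
]})

# Collapse (likely) benign / (likely) pathogenic into two classes;
# anything unmapped becomes NaN and is dropped.
mapping = {
    'Benign': 'LB', 'Likely benign': 'LB',
    'Pathogenic': 'LP', 'Likely pathogenic': 'LP',
}
df['class'] = df['class'].map(mapping)
df = df.dropna(subset=['class'])

# Binarized label: 0.0 for (likely) benign, 1.0 for (likely) pathogenic.
df['binarized_label'] = (df['class'] == 'LP').astype(float)

print(df['binarized_label'].tolist())
```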

(main.py)

Both datasets are merged, with the column "source" indicating their true origin, and the originally loaded datasets are deleted to reduce RAM usage.

(duplicate_processor.py)

Once merged, duplicates have to be removed, since it is possible that VKGL samples are also present within ClinVar. Entries are dropped as duplicates if they match on all of the following columns:

- CHROM
- POS
- REF
- ALT
- gene
- class

Since the `keep` argument of `pandas.DataFrame.drop_duplicates()` is left at its default (`'first'`), the first occurrence of a duplicate remains present while the rest are dropped. Since VKGL is appended to the ClinVar dataset, the ClinVar variant will be kept.
_Note: it is possible that duplicate variants are still present with conflicting classes, since those do not count as duplicates. We don't think this will be an issue, given the number of samples._
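A minimal pandas sketch of this deduplication (toy rows, not real data), showing that the default `keep='first'` retains the ClinVar copy when VKGL is appended after ClinVar:

```python
import pandas as pd

columns = ['CHROM', 'POS', 'REF', 'ALT', 'gene', 'class', 'source']

clinvar = pd.DataFrame(
    [['1', 100, 'A', 'G', 'GENE1', 1.0, 'clinvar']], columns=columns)
vkgl = pd.DataFrame(
    [['1', 100, 'A', 'G', 'GENE1', 1.0, 'vkgl'],      # duplicate of ClinVar row
     ['2', 200, 'C', 'T', 'GENE2', 0.0, 'vkgl']],     # unique VKGL row
    columns=columns)

# VKGL is appended after ClinVar, so the ClinVar copy comes first.
merged = pd.concat([clinvar, vkgl], ignore_index=True)

# keep='first' is the drop_duplicates default: the ClinVar copy survives.
deduplicated = merged.drop_duplicates(
    subset=['CHROM', 'POS', 'REF', 'ALT', 'gene', 'class'])

print(deduplicated['source'].tolist())
```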

Now that we have assembled one large dataset, we can start building the train-test and validation datasets. __But before we can do that__, it is useful to apply the sample weights first. Sample weights tell XGBoost (and other machine learning algorithms) how much it should trust a particular sample. Currently, the following ClinVar Gold Star ratings are mapped to the following sample weights:

| Gold star rating | Sample weight |
|:----------------:|:-------------:|
| 0 | Removed |
| 1 | 0.8 |
| 2 | 0.9 |
| 3 | 1.0 |
| 4 | 1.0 |

These sample weights follow the idea that [Li et al.](https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-020-00775-w) originally had for the initial model.
Applying sample weights lower than 0.8 has the chance to influence the end score too much, in such a way that scores will not reach 0.0 for benign and 1.0 for pathogenic (beta1 iteration 1).
Higher-quality variants (2 labs or more confirming the consensus) should be rewarded with a higher sample weight.
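The table above can be sketched as a simple mapping (the column names are illustrative): 0-star variants are removed, and a higher review status receives a higher weight.

```python
import pandas as pd

# Gold star rating -> sample weight, as in the table above;
# 0-star variants have no entry and are therefore removed.
weights = {1: 0.8, 2: 0.9, 3: 1.0, 4: 1.0}

df = pd.DataFrame({'gold_stars': [0, 1, 2, 3, 4]})
df['sample_weight'] = df['gold_stars'].map(weights)
df = df.dropna(subset=['sample_weight'])  # drops the 0-star row

print(df['sample_weight'].tolist())
```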

Now that sample weights have been mapped and applied, the large dataset can be split into the train-test and validation datasets.

(split_datasets.py)

The most important part of splitting the large dataset into train-test and validation datasets is that the validation dataset is of high quality. This is achieved by randomly sampling 50% of the high-quality pathogenic variants (review score 2 or higher, which means at least 2 labs have confirmed the consensus).
Once 50% of the high-quality pathogenic variants have been sampled, an equal number of benign samples of the same quality is sampled.
These sampled variants now form the validation dataset.
The validation samples are then merged back into the training dataset, after which a `pd.drop_duplicates(keep=False)` is performed, so that all validation samples still present in the train-test dataset are removed (since `sample` does not remove the dataframe entries).

This results in the train-test and validation datasets, which are exported to the output argument after adding a pseudo-VCF header and pseudo-VCF columns (and merging critical information into the ID column).
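The split described above can be sketched with toy data and hypothetical column names; the real module operates on the full merged dataset:

```python
import pandas as pd

# Toy merged dataset; 'review' stands in for the ClinVar review status.
df = pd.DataFrame({
    'variant': range(8),
    'binarized_label': [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    'review': [2, 2, 3, 3, 1, 2, 2, 2],
})

# Only high-quality variants (review >= 2) are eligible for validation.
high_quality = df[df['review'] >= 2]

# Sample 50% of the high-quality pathogenic variants, then an equal
# number of high-quality benign variants.
val_path = high_quality[high_quality['binarized_label'] == 1.0].sample(
    frac=0.5, random_state=0)
val_benign = high_quality[high_quality['binarized_label'] == 0.0].sample(
    n=len(val_path), random_state=0)
validation = pd.concat([val_path, val_benign])

# Concatenate the validation rows back and drop every row that occurs
# twice (keep=False): only the train-test rows remain.
train_test = pd.concat([df, validation]).drop_duplicates(keep=False)

print(len(validation), len(train_test))
```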