Commit: Merge pull request #3 from molgenis/feature/model_creation (Feature/model creation)

Showing 34 changed files with 2,225 additions and 0 deletions.
# capice-resources

Repository for resource files for CAPICE and for updating the CAPICE model. It contains several modules, listed below. Each module has its own README (if applicable) describing how to use it.

## Installation

Installation should be as easy as `pip install -e '.[test]'`.
To test the individual modules, change directory to the specific module and run `pytest`.

## Modules

### validation

Validation is not really a module, but rather a folder containing plots of supporting evidence for a model's performance per GitHub release.

### train_data_creator

The module train_data_creator creates new CAPICE train-test and validation datasets. For more information, refer to the README within the module.

### utility_scripts

utility_scripts contains a couple of bash scripts for ease of use, which include but are not limited to lifting over variants to GRCh38, running VEP104 and converting a VEP VCF output to TSV.
It is advised to use these scripts, as they maintain a level of consistency throughout the lifespan of CAPICE.

- liftover_variants.sh:

The script liftover_variants.sh converts a VCF containing GRCh37 variants to GRCh38 variants.

- vep_to_tsv.sh:

The script vep_to_tsv.sh converts the VEP output VCF back to TSV with all required columns.
If features are added or removed, be sure to adjust the `pre_header` variable within this bash script accordingly.
For instance, if the feature `CADD_score` is added, `\tCADD_score` has to be added to `pre_header`.

## Usage

### TLDR

_(For a more detailed explanation of creating the train-test and validation datasets, please see the README in [train_data_creator](./train_data_creator/README.md).)_

1. Make a new [CAPICE](https://github.com/molgenis/capice) release, containing added or removed processors and/or code changes supporting a new model. Steps:
    1. Add newly added features to the [impute json](https://github.com/molgenis/capice/blob/master/resources/train_impute_values.json) and/or remove deprecated features from it.
    2. Apply the changes in features to PRE_HEADER in the [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh).
    3. Annotate the new training VCF using VEP and convert the VCF using the [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh) (raw file: [train_input_raw.vcf.gz](https://github.com/molgenis/capice/blob/master/resources/train_input_raw.vcf.gz)) (note: use `-t` when using the conversion tool).
    4. Make the training TSV train-ready using `utility_scripts/vep_to_train.py`.
    5. Use the newly generated training TSV to create a new [PoC](https://github.com/molgenis/capice/blob/master/tests/resources/xgb_booster_poc.pickle.dat) model.
    6. Update [predict_input.tsv.gz](https://github.com/molgenis/capice/blob/master/resources/predict_input.tsv.gz) (raw file: [predict_input_raw.vcf.gz](https://github.com/molgenis/capice/blob/master/resources/predict_input_raw.vcf.gz)) with altered features.
    7. Update [breakends_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/breakends_vep.tsv.gz) (raw file: [breakends.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/breakends.vcf.gz)) with altered features.
    8. Update [edge_cases_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/edge_cases_vep.tsv.gz) (raw file: [edge_cases.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/edge_cases.vcf.gz)) with altered features.
    9. Update [symbolic_alleles_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/symbolic_alleles_vep.tsv.gz) (raw file: [symbolic_alleles.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/symbolic_alleles.vcf.gz)) with altered features.
    10. Run the [CAPICE](https://github.com/molgenis/capice) tests.
2. Download the latest non-public GRCh37 VKGL (`/apps/data/VKGL/GRCh37`) and [Clinvar](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/) datasets.
3. Use [train_data_creator](./train_data_creator/README.md) to create the train-test and validation VCFs.
    1. `python3 ./train_data_creator/main.py --input_vkgl </path/to/vkgl_nonpublic.tsv> --input_clinvar </path/to/clinvar.vcf.gz> -o </path/to/output>`
4. Make a [capice-resources](https://github.com/molgenis/capice-resources) GitHub release, matching the CAPICE release in step 1.
5. Attach both the train-test and validation VCFs to the [capice-resources](https://github.com/molgenis/capice-resources) release.
6. Download the latest VEP release from the [Molgenis Cloud](https://download.molgeniscloud.org/downloads/vip/images/).
7. Use the VEP singularity image, combined with the latest VEP cache, to annotate the VCF files.
    1. `singularity exec --bind /apps,/groups,/tmp vep <command>` (see [VEP](#VEP) for the command)
8. (Optional for XGBoost 0.72.1 models) Upload the validation VCF to [CADD1.4-GRCh37](https://cadd.gs.washington.edu/score) for annotation.
9. Lift over the not-yet-annotated VCFs to GRCh38 using `liftover_variants.sh`.
    1. `sbatch liftover_variants.sh -i </path/to/step3.vcf.gz> -o </path/to/output/directory/file>` (please note: do not supply an extension, as the script does not produce a single file)
10. Use the VEP singularity image to annotate the GRCh38 VCF files.
11. Convert the VEP-annotated VCFs back to TSV using the [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh) (using `-t`).
    1. `capice/scripts/convert_vep_vcf_to_tsv_capice.sh -i </path/to/vep.vcf.gz> -o </path/to/vep.tsv.gz> -t`
12. Process the GRCh37 train-test TSV.
    1. `python3 ./utility_scripts/vep_to_train.py -i /path/to/vep.tsv.gz -o /path/to/vep_processed.tsv.gz`
13. Repeat step 12 for the GRCh37 validation TSV.
14. Repeat steps 11 and 12 for train-test and validation for GRCh38 (add the `-a` flag to `vep_to_train.py`).
15. Update the imputing JSON according to the newly added and/or removed features.
16. Make sure the latest CAPICE release made in step 1 is available on the GCC cluster.
    1. `pip install capice` (be sure to install within a virtual environment or use the singularity image)
17. Use the JSON updated in step 15 and the train-test TSV made in step 12 to start the CAPICE train protocol.
    1. You may want to use a custom script that loads the latest Python module, activates the virtual environment and then starts the train protocol. It can take 5 hours for a new model to train.
    2. `module load <python>`
    3. `source ./capice/venv/bin/activate`
    4. `capice train -i </path/to/train-test.tsv.gz> -m </path/to/impute.json> -o </path/to/output>`
18. Attach the new models to the [CAPICE](https://github.com/molgenis/capice) and [capice-resources](https://github.com/molgenis/capice-resources) releases.
19. Use the new model generated in step 17 to generate the CAPICE results file of the validation TSV.
20. Use the latest non-`Release Candidate` model to generate a CAPICE results file of the same validation TSV.
21. Use `compare_old_model.py` in `utility_scripts` to compare the performance of the old model to the new model.
    1. `compare_old_model.py --old_model_results </path/to/old_capice_results.tsv> --vep_processed_capice_input </path/to/validation_vep.tsv.gz> --new_model_results </path/to/validation_vep_capice.tsv.gz> --output </path/to/rcX>`

## Making train-test and validation VCF files

Download the latest non-public GRCh37 VKGL (GCC `/apps/data/VKGL/GRCh37/vkgl_consensus_*.tsv`) and [Clinvar](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/) datasets and simply supply them to main.py within train_data_creator. For further details, use `python3 main.py -h` or refer to the README in the train_data_creator module.

## VEP

VEP is used to add annotations to the train-test and validation files. The following command is a good representation of what VEP command should be called:
```commandline
vep --input_file <path to your input file> --format vcf --output_file <path to your output file> --vcf --compress_output gzip --sift s --polyphen s --numbers --symbol --shift_3prime 1 --allele_number --no_stats --offline --cache --dir_cache */path/to/latest/vep/cache* --species "homo_sapiens" --assembly *GRCh37 or GRCh38* --refseq --use_given_ref --exclude_predicted --flag_pick_allele --force_overwrite --fork 4 --dont_skip --allow_non_variant --per_gene
```
_(`--input_file`, `--output_file` and `--assembly` have to be manually altered)_

_The command can vary based on the training features._

**When comparing an XGBoost 0.72.1 (CAPICE 2.0.0 or lower) model to a new XGBoost 1.4.2 model, upload the validation.vcf.gz to [CADD](https://cadd.gs.washington.edu/score) for annotation** (set "INCLUDE ANNOTATIONS" to True).

## Post-VEP processing

First convert the VEP output VCF back to TSV using `vep_to_tsv.sh` _(note: pre_header must be altered according to newly added or removed features)_.
**WARNING: Drop all columns that are not wanted within training, such as the ID column.** This step can be performed for all train-test and validation VCFs for both GRCh37 and GRCh38.

Making the converted VEP TSV train-ready requires the use of `vep_to_train.py` in `utility_scripts`. This works for both GRCh37 and GRCh38 (for GRCh38, only when the `-a` flag is supplied).

This script makes sure that variants are preserved with the correct labels, so that the label remains accurate for each variant.
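As an illustration of the column-dropping warning above, a minimal pandas sketch; the column names here are hypothetical, not the actual CAPICE feature set:

```python
import pandas as pd

# Hypothetical converted VEP TSV; column names are illustrative only.
df = pd.DataFrame({
    'ID': ['1_123_A_G', '2_456_C_T'],
    'CHROM': ['1', '2'],
    'some_feature': [0.5, 0.7],
    'binarized_label': [0.0, 1.0],
})

# Drop columns that must not end up as training features, such as ID,
# while keeping the binarized label intact.
train_ready = df.drop(columns=['ID'])

print(list(train_ready.columns))
```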

## Training

Before the training module of CAPICE can be called, make sure the impute JSON used within training is up to date with your newly added and/or removed features.

Training a new model is rather easy, but takes quite a bit of time. It is recommended to start a Slurm job on the GCC cluster and let it run overnight.

The training module can be activated with the following command (assuming you have installed CAPICE the way it is supposed to be installed: `pip install .` or `pip install capice`):

`capice -v train -i /path/to/processed/train_test.tsv.gz -m /path/to/up/to/date/impute.json -o /path/to/output.pickle.dat`

## Generating results

### XGBoost 0.90 vs XGBoost 1.4.2 models

By now, you should already have a CADD 1.4 annotated output file for "old" CAPICE to score.
Get the CAPICE result file for both the old generation and the new generation.
It is preferred to use CAPICE 1.3.1 for "old" CAPICE. For "new" CAPICE, use your latest release, as the model file will try to match the model version to the CAPICE version.

### XGBoost 1.4.2 vs XGBoost 1.4.2 models

Work in progress.

## Validate

Merging the old and new validation datafiles happens within `utility_scripts/compare_old_model.py`.
This script merges the old results file with a file containing the original labels (the direct output of a VEP output converted to TSV using the `-t` flag of `convert_vep_vcf_to_tsv_capice.sh` of [CAPICE](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh) is perfect).
After that merge, and some cleanup of columns that are no longer going to be used, the new results file is merged on top of the old results merged with their original true labels.

After this merge, the AUC is calculated globally, as well as the absolute difference between each predicted score and its true label.
These global results are then plotted, and the AUC calculations and absolute score differences are also computed per consequence present within the **old CAPICE** datafile.

This absolute score difference tells us how close the scores are to the true label, where 1 means the score is far off from the true label and 0 means the score matches the label. We want 0.
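The two global metrics can be sketched in a small self-contained example. The numbers are made up, and the rank-based AUC below is only a stand-in for a library implementation such as scikit-learn's `roc_auc_score`:

```python
# Illustrative values, not real CAPICE output.
true_labels = [0.0, 0.0, 1.0, 1.0]
new_scores = [0.1, 0.3, 0.8, 0.9]


def auc(labels, scores):
    """Probability that a random pathogenic (1.0) variant scores higher
    than a random benign (0.0) variant (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1.0]
    neg = [s for l, s in zip(labels, scores) if l == 0.0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


global_auc = auc(true_labels, new_scores)

# Absolute score difference per variant: 0 means the score matches the
# label exactly, 1 means it is maximally far off.
abs_diff = [abs(s - l) for s, l in zip(new_scores, true_labels)]

print(global_auc, sum(abs_diff) / len(abs_diff))
```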

Furthermore, a table of the feature importances of the new model is plotted as well, so you can see what is and what is not important.

Compare the results of old vs new, discuss within the team and decide on next steps.

## Common issues

- Training with my generated dataset causes the warning `ValueError: multiclass format is not supported`, resulting in an average CV performance of NaN.

This one is a bit tricky, but it comes down to a bug somewhere within the creation of the ID column. One of the binarized labels is set to 2; the origin is unknown. This is fixed as of CAPICE 3.0.0-beta2.

## FAQ

- I'm getting a `ValueError: invalid literal for int() with base 10: 'chr1'`, what's wrong?

You've likely downloaded a genome build 38 file, which uses a different annotation for the chromosome. `train_data_creator` only supports build 37.

***
```python
#!/usr/bin/env python3

from setuptools import setup

with open('README.md', 'r', encoding='utf-8') as fh:
    long_description = fh.read()

setup(
    name='capice-resources',
    version='1.0.0-dev',
    url='https://capice.molgeniscloud.org/',
    license='LGPL-3.0',
    author='Molgenis',
    author_email='[email protected]',
    description='Resource files for CAPICE to train new CAPICE models',
    long_description=long_description,
    classifiers=[
        'Development Status :: 4 - Beta',
        'License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)',
        'Programming Language :: Python :: 3.9',
        'Programming Language :: Python :: 3.10'
    ],
    python_requires='>=3.9',
    install_requires=[
        'pandas==1.3.4',
        'numpy==1.21.4',
        'matplotlib==3.5.1',
        'xgboost==1.4.2',
        'scikit-learn==1.0.2'
    ],
    extras_require={
        'test': [
            'pytest',
        ]
    }
)
```

***
# Train data creator

###### Resource module for CAPICE to create new train-test and validation datasets

## Usage

`python3 main.py -iv path/to/vkgl_consensus.tsv.gz -ic path/to/clinvar.vcf.gz -o path/to/existing/output/directory`

_Note: -iv needs to be the non-public VKGL; the public one will throw a KeyNotFoundError._

_Note 2: the parent of the output directory has to exist. train_data_creator will, at most, create only 1 new directory._

## Theory

(data_preprocessor.py)

Step 1 is to load the different types of datasets, which is not all that easy.

VCF files have header lines that cannot be parsed directly. The number of header lines is therefore counted and supplied to the pandas parser as lines to skip.

The VKGL TSV is easier to parse, since it does not require extra skips.

After that, the files are validated.
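The header-skipping idea can be sketched as follows; the file contents are illustrative:

```python
import io

import pandas as pd

# Toy VCF content: '##' meta lines, then the '#CHROM' column header row.
vcf_text = (
    '##fileformat=VCFv4.2\n'
    '##source=example\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    '1\t123\t.\tA\tG\t.\t.\t.\n'
)

# Count the '##' meta lines, then tell pandas to skip them so that the
# '#CHROM' line becomes the header row.
skip_rows = sum(1 for line in vcf_text.splitlines() if line.startswith('##'))
df = pd.read_csv(io.StringIO(vcf_text), sep='\t', skiprows=skip_rows)

print(skip_rows, list(df.columns))
```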

For VKGL, the file follows the following path:

- Correcting the TSV to VCF-like column names.
- Correcting the consensus classification "Classified by one lab" to the actual classification of that lab (also setting the number of labs for that variant to 1 for the review status).
- Ordering according to VCF standards.
- Applying the ClinVar review status to its Gold Star description.
- Equalizing the (likely) benign and (likely) pathogenic (also VUS) classifications (and dropping everything else).
- Applying a binarized label (0.0 for (likely) benign and 1.0 for (likely) pathogenic).

For ClinVar, the file follows a similar path:

- (Before the file even gets loaded) obtaining the number of header lines.
- Obtaining the classification from the INFO field.
- Obtaining the gene from the INFO field.
- Obtaining the review status from the INFO field.
- Equalizing the (likely) benign and (likely) pathogenic (also VUS) classifications (and dropping everything else).
- Applying a binarized label (0.0 for (likely) benign and 1.0 for (likely) pathogenic).

After that, both files are equal in terms of columns, column types and the values within the columns.
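The equalizing and binarizing steps above can be sketched as follows. This is a simplified sketch: the classification strings are illustrative, not the exact VKGL/ClinVar vocabulary, and VUS handling is omitted:

```python
import pandas as pd

# Illustrative classifications, not the exact source vocabulary.
df = pd.DataFrame({'class': [
    'Benign', 'Likely benign', 'Pathogenic', 'Likely pathogenic',
    'Conflicting',
]})

# Collapse (likely) benign / (likely) pathogenic into two classes;
# anything unmapped becomes NaN and is dropped.
mapping = {
    'Benign': 'LB', 'Likely benign': 'LB',
    'Pathogenic': 'LP', 'Likely pathogenic': 'LP',
}
df['class'] = df['class'].map(mapping)
df = df.dropna(subset=['class'])

# Binarized label: 0.0 for (likely) benign, 1.0 for (likely) pathogenic.
df['binarized_label'] = (df['class'] == 'LP').astype(float)

print(df['binarized_label'].tolist())
```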

(main.py)

Both datasets are merged, with the column "source" indicating their true origin, and the originally loaded datasets are deleted to reduce RAM usage.

(duplicate_processor.py)

Once merged, duplicates have to be removed, since it is possible that VKGL samples are also present within ClinVar. Entries are dropped as duplicates if they match on all of the following columns:

- CHROM
- POS
- REF
- ALT
- gene
- class

Since the `keep` argument of `pandas.DataFrame.drop_duplicates()` is left at its default (`'first'`), the first occurrence of a duplicate remains present while the rest are dropped. Since VKGL is appended to the ClinVar dataset, the ClinVar variant will be kept.
_Note: it is possible that duplicate variants are still present with conflicting classes, since those do not count as duplicates. We don't think this will be an issue, given the number of samples._
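A minimal pandas sketch of this deduplication (toy rows, not real data), showing that the default `keep='first'` retains the ClinVar copy when VKGL is appended after ClinVar:

```python
import pandas as pd

columns = ['CHROM', 'POS', 'REF', 'ALT', 'gene', 'class', 'source']

clinvar = pd.DataFrame(
    [['1', 100, 'A', 'G', 'GENE1', 1.0, 'clinvar']], columns=columns)
vkgl = pd.DataFrame(
    [['1', 100, 'A', 'G', 'GENE1', 1.0, 'vkgl'],      # duplicate of ClinVar row
     ['2', 200, 'C', 'T', 'GENE2', 0.0, 'vkgl']],     # unique VKGL row
    columns=columns)

# VKGL is appended after ClinVar, so the ClinVar copy comes first.
merged = pd.concat([clinvar, vkgl], ignore_index=True)

# keep='first' is the drop_duplicates default: the ClinVar copy survives.
deduplicated = merged.drop_duplicates(
    subset=['CHROM', 'POS', 'REF', 'ALT', 'gene', 'class'])

print(deduplicated['source'].tolist())
```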

Now that we have assembled one large dataset, we can start building the train-test and validation datasets. __But before we can do that__, it is useful to apply the sample weights first. Sample weights tell XGBoost (and other machine learning algorithms) how much it should trust a particular sample. Currently, the following ClinVar Gold Star ratings are mapped to the following sample weights:

| Gold star rating | Sample weight |
|:----------------:|:-------------:|
| 0 | Removed |
| 1 | 0.8 |
| 2 | 0.9 |
| 3 | 1.0 |
| 4 | 1.0 |

These sample weights follow the idea that [Li et al.](https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-020-00775-w) originally had for the initial model.
Applying sample weights lower than 0.8 has the chance to influence the end score too much, in such a way that scores will not reach 0.0 for benign and 1.0 for pathogenic (beta1 iteration 1).
Higher-quality variants (2 labs or more confirming the consensus) should be rewarded with a higher sample weight.
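The table above can be sketched as a simple mapping (the column names are illustrative): 0-star variants are removed, and a higher review status receives a higher weight.

```python
import pandas as pd

# Gold star rating -> sample weight, as in the table above;
# 0-star variants have no entry and are therefore removed.
weights = {1: 0.8, 2: 0.9, 3: 1.0, 4: 1.0}

df = pd.DataFrame({'gold_stars': [0, 1, 2, 3, 4]})
df['sample_weight'] = df['gold_stars'].map(weights)
df = df.dropna(subset=['sample_weight'])  # drops the 0-star row

print(df['sample_weight'].tolist())
```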

Now that sample weights have been mapped and applied, the large dataset can be split into the train-test and validation datasets.

(split_datasets.py)

The most important part of splitting the large dataset into train-test and validation datasets is that the validation dataset is of high quality. This is achieved by randomly sampling 50% of the high-quality pathogenic variants (review score 2 or higher, which means at least 2 labs have confirmed the consensus).
Once 50% of the high-quality pathogenic variants have been sampled, an equal number of benign samples of the same quality is sampled.
These sampled variants now form the validation dataset.
The validation samples are then merged back into the training dataset, after which a `pd.drop_duplicates(keep=False)` is performed, so that all validation samples still present in the train-test dataset are removed (since `sample` does not remove the dataframe entries).

This results in the train-test and validation datasets, which are exported to the output argument after adding a pseudo-VCF header and pseudo-VCF columns (and merging critical information into the ID column).
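The split described above can be sketched with toy data and hypothetical column names; the real module operates on the full merged dataset:

```python
import pandas as pd

# Toy merged dataset; 'review' stands in for the ClinVar review status.
df = pd.DataFrame({
    'variant': range(8),
    'binarized_label': [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    'review': [2, 2, 3, 3, 1, 2, 2, 2],
})

# Only high-quality variants (review >= 2) are eligible for validation.
high_quality = df[df['review'] >= 2]

# Sample 50% of the high-quality pathogenic variants, then an equal
# number of high-quality benign variants.
val_path = high_quality[high_quality['binarized_label'] == 1.0].sample(
    frac=0.5, random_state=0)
val_benign = high_quality[high_quality['binarized_label'] == 0.0].sample(
    n=len(val_path), random_state=0)
validation = pd.concat([val_path, val_benign])

# Concatenate the validation rows back and drop every row that occurs
# twice (keep=False): only the train-test rows remain.
train_test = pd.concat([df, validation]).drop_duplicates(keep=False)

print(len(validation), len(train_test))
```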