diff --git a/README.md b/README.md
new file mode 100644
index 0000000..3607bc1
--- /dev/null
+++ b/README.md
@@ -0,0 +1,172 @@
+# capice-resources
+
+Repository for resource files for CAPICE and for updating the CAPICE model. It contains several modules listed below. Each
+module has its own README (if applicable) describing how to use it.
+
+## Installation
+
+Installation should be as easy as `pip install -e '.[test]'`.
+To test the individual modules, change directory to the specific module and run `pytest`.
+
+## Modules:
+
+### validation
+
+Validation is not really a module, but rather a folder containing plots of supporting evidence of a model's performance
+per GitHub release.
+
+### train_data_creator
+
+The module train_data_creator creates new CAPICE train-test and validation datasets. For more information,
+refer to the README within the module.
+
+### utility_scripts
+
+utility_scripts contains a couple of bash scripts for ease of use, which include but are not limited to lifting over
+variants to GRCh38, running VEP104 and converting a VEP VCF output to TSV.
+It is advised to use these scripts as they maintain a level of consistency throughout the lifespan of CAPICE.
+
+- liftover_variants.sh:
+
+The script liftover_variants.sh converts a VCF containing GRCh37 variants to GRCh38 variants.
+
+- vep_to_tsv.sh:
+
+The script vep_to_tsv.sh converts the VEP output VCF back to TSV with all required columns.
+If features are added/removed, be sure to adjust the `pre_header` variable within this bash script accordingly.
+For instance, if feature `CADD_score` is added, `\tCADD_score` has to be added within pre_header.
+
+## Usage
+
+### TLDR
+
+_(For a more detailed explanation on creating the train-test and validation datasets, please see the README in
+[train_data_creator](./train_data_creator/README.md))_
+
+1. Make new [CAPICE](https://github.com/molgenis/capice) release, containing added or removed processors and/or code changes supporting a new model. Steps:
+    1. Newly added features have been added to the [impute JSON](https://github.com/molgenis/capice/blob/master/resources/train_impute_values.json) and/or deprecated features have been removed.
+    2. Apply changes in features to PRE_HEADER in the [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh).
+    3. Annotate the new training VCF using VEP and convert the VCF using the [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh) (raw file: [train_input_raw.vcf.gz](https://github.com/molgenis/capice/blob/master/resources/train_input_raw.vcf.gz)) (note: use `-t` when using the conversion tool).
+    4. Make the training TSV train-ready using `utility_scripts/vep_to_train.py`.
+    5. Use the newly generated training TSV to create a new [PoC](https://github.com/molgenis/capice/blob/master/tests/resources/xgb_booster_poc.pickle.dat) model.
+    6. Update [predict_input.tsv.gz](https://github.com/molgenis/capice/blob/master/resources/predict_input.tsv.gz) (raw file: [predict_input_raw.vcf.gz](https://github.com/molgenis/capice/blob/master/resources/predict_input_raw.vcf.gz)) with altered features.
+    7. Update [breakends_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/breakends_vep.tsv.gz) (raw file: [breakends.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/breakends.vcf.gz)) with altered features.
+    8. Update [edge_cases_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/edge_cases_vep.tsv.gz) (raw file: [edge_cases.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/edge_cases.vcf.gz)) with altered features.
+    9. Update [symbolic_alleles_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/symbolic_alleles_vep.tsv.gz) (raw file: [symbolic_alleles.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/symbolic_alleles.vcf.gz)) with altered features.
+    10. Run the [CAPICE](https://github.com/molgenis/capice) tests.
+2. Download the latest non-public GRCh37 VKGL (`/apps/data/VKGL/GRCh37`)
+   and [ClinVar](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/) datasets.
+3. Use [train_data_creator](./train_data_creator/README.md) to create the train-test and validation VCFs.
+    1. `python3 ./train_data_creator/main.py --input_vkgl --input_clinvar -o `
+4. Make a [capice-resources](https://github.com/molgenis/capice-resources) GitHub release, matching the CAPICE release in step 1.
+5. Attach both train-test and validation VCFs to the [capice-resources](https://github.com/molgenis/capice-resources) release.
+6. Download the latest VEP release from the [Molgenis Cloud](https://download.molgeniscloud.org/downloads/vip/images/).
+7. Use the VEP singularity image, combined with the latest VEP cache, to annotate the VCF files.
+    1. `singularity exec --bind /apps,/groups,/tmp vep ` (see [VEP](#VEP) for the command)
+8. (Optional for XGBoost 0.72.1 models) Upload the validation.vcf to [CADD1.4-GRCh37](https://cadd.gs.washington.edu/score) for annotation.
+9. Lift over the not-annotated VCFs to GRCh38 using `liftover_variants.sh`.
+    1. `sbatch liftover_variants.sh -i -o ` (please note: do not supply an extension as it doesn't produce a single file)
+10. Use the VEP singularity image to annotate the GRCh38 VCF files.
+11. Convert the VEP annotated VCFs back to TSV using the [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh) (using `-t`).
+    1. `capice/scripts/convert_vep_vcf_to_tsv_capice.sh -i -o -t`
+12. Process the GRCh37 train-test TSV.
+    1. `python3 ./utility_scripts/vep_to_train.py -i /path/to/vep.tsv.gz -o /path/to/vep_processed.tsv.gz`
+13. Repeat step 12 for the GRCh37 validation TSV.
+14. Repeat steps 11 and 12 for train-test and validation for GRCh38 (add the `-a` flag to `vep_to_train.py`).
+15. Update the impute JSON according to the newly added and/or removed features.
+16. Make sure the latest release of CAPICE made in step 1 is available on the GCC cluster.
+    1. `pip install capice` (be sure to install within a virtual environment or use the singularity image)
+17. Use the JSON made in step 15 and the train-test TSV made in step 12 to start the train protocol of CAPICE.
+    1. You may want to use a custom script that loads the latest Python module, activates the virtual environment and then starts the train protocol. It can take 5 hours for a new model to train.
+    2. `module load `
+    3. `source ./capice/venv/bin/activate`
+    4. `capice train -i -m -o `
+18. Attach the new models to the [CAPICE](https://github.com/molgenis/capice) and [capice-resources](https://github.com/molgenis/capice-resources) releases.
+19. Use the new model generated in step 17 to generate a CAPICE results file of the validation TSV.
+20. Use the latest non-`Release Candidate` model to generate a CAPICE results file of the same validation TSV.
+21. Use `compare_old_model.py` in `utility_scripts` to compare the performance of the old model to the new model.
+    1. `compare_old_model.py --old_model_results --vep_processed_capice_input --new_model_results --output `
+
+## Making train-test and validation VCF files
+
+Download the latest non-public GRCh37 VKGL (GCC `/apps/data/VKGL/GRCh37/vkgl_consensus_*.tsv`)
+and [ClinVar](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/) datasets and simply supply them to main.py within
+train_data_creator. For further details, use `python3 main.py -h` or refer to the README in the train_data_creator
+module.
+
+## VEP
+
+VEP is used to add annotations to the train-test and validation files. The following command is a good representation of how VEP should be called:
+```commandline
+vep --input_file --format vcf --output_file --vcf --compress_output gzip --sift s --polyphen s --numbers --symbol --shift_3prime 1 --allele_number --no_stats --offline --cache --dir_cache */path/to/latest/vep/cache* --species "homo_sapiens" --assembly *GRCh37 or GRCh38* --refseq --use_given_ref --exclude_predicted --flag_pick_allele --force_overwrite --fork 4 --dont_skip --allow_non_variant --per_gene
+```
+_(`--input_file`, `--output_file` and `--assembly` have to be manually altered)_
+
+_The command can vary based on the training features._
+
+**When comparing an XGBoost 0.72.1 (CAPICE 2.0.0 or lower) model to a new XGBoost 1.4.2 model, upload the validation.vcf.gz to [CADD](https://cadd.gs.washington.edu/score) for annotation** (set "INCLUDE ANNOTATIONS" to True).
+
+## Post-VEP processing
+
+First convert the VEP output VCF back to TSV using `vep_to_tsv.sh` _(note: pre_header must be altered according to the newly added or removed features)_.
+**WARNING: Drop all columns that are not wanted within training, such as the ID column.** This step can be performed for all train-test and validation VCFs for both GRCh37 and 38.
+
+Making the converted VEP TSV train-ready requires use of `vep_to_train.py` in `utility_scripts`. This works for both GRCh37 and GRCh38 (GRCh38 only when the `-a` flag is supplied).
+
+This script makes sure that variants with the correct labels are preserved, so that the labels remain accurate for the variants.
+
+## Training
+
+Before the training module of CAPICE can be called, make sure the impute JSON used within training is up to date with your newly added and/or removed features.
+
+Training a new model is rather easy, but takes quite a bit of time. It is recommended to start a slurm job on the GCC cluster and to let it run overnight.
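The slurm route can be sketched as a small job wrapper. Everything below is a hypothetical placeholder (job name, resource requests, module name and paths are not part of this repository); only the `capice -v train` invocation itself comes from this README.

```shell
# Write a hypothetical slurm job script for the overnight training run.
# Quoted heredoc delimiter prevents any variable expansion while writing.
cat > train_capice.sbatch << 'EOF'
#!/bin/bash
#SBATCH --job-name=capice_train
#SBATCH --time=23:59:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=16gb
module load Python
source ./capice/venv/bin/activate
capice -v train -i train_test.tsv.gz -m impute.json -o model.pickle.dat
EOF
echo "job script written: train_capice.sbatch"
```

Submit it with `sbatch train_capice.sbatch`, after adjusting paths and resources to your cluster.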
+
+The training module can be activated with the following command (assuming you have installed CAPICE the intended way: `pip install .` or `pip install capice`):
+
+`capice -v train -i /path/to/processed/train_test.tsv.gz -m /path/to/up/to/date/impute.json -o /path/to/output.pickle.dat`
+
+## Generating results
+
+### XGBoost 0.90 vs XGBoost 1.4.2 models
+
+By now, you should already have a CADD 1.4 annotated output file for "old" CAPICE to score.
+Get the CAPICE result file for both the old and the new generation.
+I prefer to use CAPICE 1.3.1 for "old" CAPICE. For "new" CAPICE, use your latest release, as the model file will try to match its
+version to the CAPICE version.
+
+### XGBoost 1.4.2 vs XGBoost 1.4.2 models
+
+Work in progress.
+
+## Validate
+
+Merging the old and new validation datafiles happens within `utility_scripts/compare_old_model.py`.
+This script merges the old results file with a file containing the original labels (the direct output of
+`convert_vep_vcf_to_tsv_capice.sh` of [CAPICE](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh), run with the `-t` flag, is perfect).
+After that merge, and some cleanup of columns that are no longer used,
+the new results file is merged on top of the old results joined with their original true labels.
+
+After this merge, AUC is calculated globally, as well as an absolute difference between the predicted score and the true label.
+These global results are then plotted, and the AUC and absolute score difference calculations are also performed per consequence present within the **old CAPICE** datafile.
+
+This absolute score difference tells us how close the scores are to the true label, where 1 means the score is far off from the true label and 0 means the score matches the label. We want 0.
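The global metrics can be sketched as follows. The merged frame and its column names are hypothetical stand-ins for the `compare_old_model.py` output, and plain NumPy (Mann-Whitney U) replaces whatever AUC routine the script actually uses:

```python
import numpy as np
import pandas as pd

# Hypothetical merged results: true labels plus old and new model scores.
merged = pd.DataFrame({
    'binarized_label': [0, 0, 1, 1, 0, 1],
    'score_old': [0.1, 0.4, 0.35, 0.8, 0.2, 0.7],
    'score_new': [0.05, 0.2, 0.6, 0.9, 0.1, 0.8],
})


def auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """ROC AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


for col in ('score_old', 'score_new'):
    labels = merged['binarized_label'].to_numpy()
    scores = merged[col].to_numpy()
    # Mean absolute difference to the true label: 0 is perfect, 1 is maximally wrong.
    abs_diff = np.abs(scores - labels).mean()
    print(f'{col}: AUC={auc(labels, scores):.3f}, mean |score - label|={abs_diff:.3f}')
```

The real script additionally repeats both calculations per VEP consequence, as described above.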
+
+Furthermore, a table of the feature importances of the new model is plotted as well, so you can see what is and what is not important.
+
+Compare the results of old vs new and discuss within the team, then decide on next steps.
+
+
+## Common issues
+
+- Training with my generated dataset causes warning `ValueError: multiclass format is not supported`, resulting in average
+  CV performance of NaN.
+
+This one is a bit tricky, but it comes down to a bug somewhere within creating the ID column. One of the binarized
+labels is set to 2; its origin is unknown. It is fixed as of CAPICE 3.0.0-beta2.
+
+## FAQ
+
+- I'm getting a `ValueError: invalid literal for int() with base 10: 'chr1'`, what's wrong?
+
+You've likely downloaded a genome build 38 file, which uses a different chromosome notation.
+`train_data_creator` only supports build 37.
diff --git a/setup.py b/setup.py
new file mode 100644
index 0000000..4dc669e
--- /dev/null
+++ b/setup.py
@@ -0,0 +1,35 @@
+#!/usr/bin/env python3
+
+from setuptools import setup
+with open('README.md', 'r', encoding='utf-8') as fh:
+    long_description = fh.read()
+
+setup(
+    name='capice-resources',
+    version='1.0.0-dev',
+    url='https://capice.molgeniscloud.org/',
+    license='LGPL-3.0',
+    author='Molgenis',
+    author_email='molgenis-support@umcg.nl',
+    description='Resource files for CAPICE to train new CAPICE models',
+    long_description=long_description,
+    classifiers=[
+        'Development Status :: 4 - Beta',
+        'License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)',
+        'Programming Language :: Python :: 3.9',
+        'Programming Language :: Python :: 3.10'
+    ],
+    python_requires='>=3.9',
+    install_requires=[
+        'pandas==1.3.4',
+        'numpy==1.21.4',
+        'matplotlib==3.5.1',
+        'xgboost==1.4.2',
+        'scikit-learn==1.0.2'
+    ],
+    extras_require={
+        'test': [
+            'pytest',
+        ]
+    }
+)
diff --git a/train_data_creator/README.md b/train_data_creator/README.md
new file mode 100644
index 0000000..1c9d42a
--- /dev/null
+++ b/train_data_creator/README.md
@@ -0,0 +1,100 @@
+# Train data creator
+
+###### Resource module for CAPICE to create new train-test and validation datasets
+
+## Usage
+
+`python3 main.py --input_vkgl path/to/vkgl_consensus.tsv.gz --input_clinvar path/to/clinvar.vcf.gz -o path/to/existing/output/directory`
+
+_Note: --input_vkgl needs to be the non-public VKGL; the public version will raise a KeyError._
+
+_Note 2: the parent of the output directory has to exist. train_data_creator will, at most, create only 1 new directory._
+
+## Theory
+
+(data_preprocessor.py)
+
+Step 1 is to load the different dataset types, which is not all that easy.
+
+VCF files have header lines that pandas can't parse directly. The number of header lines is obtained and provided to the
+pandas parser as rows to skip.
+
+The VKGL TSV is easier to parse since it doesn't require extra skips.
+
+After that the files are validated.
+
+The VKGL file follows this path:
+
+- Correcting the TSV to VCF-like column names.
+- Correcting the consensus-classification of "Classified by one lab" to the actual classification of the lab (also
+  setting the amount of labs for that variant to 1 for the review status).
+- Ordering according to VCF standards.
+- Applying the ClinVar review status to their Gold Star Description.
+- Equalizing the (likely) benign and (likely) pathogenic (also VUS) classifications (and dropping everything else).
+- Applying a binarized label (0.0 for (likely) benign and 1.0 for (likely) pathogenic).
+
+The ClinVar file follows a similar path:
+
+- (Before the file even gets loaded) obtaining the amount of header lines.
+- Obtaining the classification from the INFO field.
+- Obtaining the gene from the INFO field.
+- Obtaining the review status from the INFO field.
+- Equalizing the (likely) benign and (likely) pathogenic (also VUS) classifications (and dropping everything else).
+- Applying a binarized label (0.0 for (likely) benign and 1.0 for (likely) pathogenic).
+
+After that, both files are equal in terms of columns, column types and variables within the columns.
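The equalize + binarize steps above can be sketched in pandas. The toy classes and the truncated mapping below are illustrative, not the module's full dictionaries:

```python
import pandas as pd

# Toy classification column; real input has many more classes.
data = pd.DataFrame({'class': ['Benign', 'Likely benign', 'Pathogenic',
                               'Likely pathogenic', 'VUS']})
equalize_dict = {'Benign': 'B', 'Likely benign': 'LB',
                 'Pathogenic': 'P', 'Likely pathogenic': 'LP'}
# Drop classes absent from the mapping (everything else), then normalize.
data = data[data['class'].isin(list(equalize_dict))].copy()
data['class'] = data['class'].map(equalize_dict)
# Binarized label: 0.0 for (likely) benign, 1.0 for (likely) pathogenic.
data['binarized_label'] = data['class'].isin(['P', 'LP']).astype(float)
```

After these two steps both sources share the same `class` vocabulary and a numeric label, which is what makes the later merge possible.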
+
+(main.py)
+
+Both datasets are merged, with the column "source" indicating their true origin, and the originally loaded datasets are
+deleted to reduce memory usage.
+
+(duplicate_processor.py)
+
+Once merged, duplicates have to be removed since it is possible that VKGL samples are present within ClinVar. Entries
+are dropped as duplicates if they match on all of the following columns:
+
+- CHROM
+- POS
+- REF
+- ALT
+- gene
+- class
+
+Since the `keep` argument of pandas.DataFrame.drop_duplicates() is left at its default ('first'), the first duplicate
+remains present while the rest are dropped. Since ClinVar is appended to the VKGL dataset, the VKGL variant will be kept.
+_Note: it might be possible that duplicate variants are still present with conflicting classes, since those do not count
+as duplicates. We don't think this will be an issue due to the amount of samples._
+
+Now that we have an assembled large dataset, we can start building the train-test and validation datasets. __But before
+we can do that__, it is useful to apply the sample weights first. Sample weights tell XGBoost (and other
+machine learning algorithms) how much it should trust a particular sample. Currently, the following ClinVar Gold
+Star ratings are mapped to the following sample weights:
+
+| Gold star rating | Sample weight |
+|:----------------:|:-------------:|
+| 0 | Removed |
+| 1 | 0.8 |
+| 2 | 0.9 |
+| 3 | 1.0 |
+| 4 | 1.0 |
+
+These sample weights follow the idea that [Li et al.](https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-020-00775-w) originally had for the initial model.
+Applying sample weights lower than 0.8 risks influencing the end score so much that scores will not reach 0.0 for benign and 1.0 for pathogenic (beta1 iteration 1).
+Higher quality variants (2 labs or more confirming the consensus) should be rewarded with a higher sample weight.
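On toy rows, the deduplication and sample weighting described above look roughly like this (variant values are made up; column names and the weight mapping follow this README):

```python
import pandas as pd

# VKGL rows come first in the merged frame, so with keep='first' (the default)
# the VKGL copy of a shared variant survives drop_duplicates.
data = pd.DataFrame({
    '#CHROM': ['1', '1', '2'], 'POS': [100, 100, 200],
    'REF': ['A', 'A', 'G'], 'ALT': ['T', 'T', 'C'],
    'gene': ['GENE1', 'GENE1', 'GENE2'], 'class': ['LP', 'LP', 'LB'],
    'source': ['VKGL', 'ClinVar', 'ClinVar'], 'review': [2, 3, 1],
})
data.drop_duplicates(
    subset=['#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class'], inplace=True
)
# Gold star rating -> sample weight (0-star variants were already removed).
data['sample_weight'] = data['review'].map({1: 0.8, 2: 0.9, 3: 1.0, 4: 1.0})
```

Note that the duplicated GENE1 variant keeps the VKGL row (review 2, weight 0.9) even though the dropped ClinVar copy had a higher review status.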
+
+Now that sample weights have been mapped and applied, the large dataset can be split into the train-test and validation
+datasets.
+
+(split_datasets.py)
+
+The most important part about splitting the large dataset into the train-test and validation datasets is that the
+validation dataset is of high quality. This is achieved by randomly sampling 50% of the high quality pathogenic variants (review
+score 2 or higher, which means at least 2 labs have confirmed the consensus).
+Once 50% of the high quality pathogenic variants have been sampled, an equal amount of benign samples of the same quality is sampled.
+These sampled variants now form the validation dataset.
+The validation samples are then merged back onto the training dataset, after which a `pd.drop_duplicates(keep=False)` is performed,
+so all validation samples that were still in the train-test dataset are removed (since `sample` does not remove the dataframe entries).
+
+This results in the train-test and validation datasets,
+which are exported to the output directory after adding a pseudo-VCF header and pseudo-VCF columns (and merging critical information into the ID column).
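The split can be sketched with toy variants. Here `variant_id` stands in for the real variant columns, all 16 rows are made up, and `pd.concat` replaces the module's (since-deprecated) `DataFrame.append`:

```python
import pandas as pd

data = pd.DataFrame({
    'variant_id': range(16),
    'binarized_label': [1.0] * 8 + [0.0] * 8,
    'sample_weight': ([1.0] * 4 + [0.9] * 4) * 2,
})
high_quality = data['sample_weight'] >= 0.9
# 50% of the high quality pathogenic variants...
validation = data[(data['binarized_label'] == 1) & high_quality].sample(
    frac=0.5, random_state=0)
# ...plus an equal amount of equally high quality benign variants.
validation = pd.concat([
    validation,
    data[(data['binarized_label'] == 0) & high_quality].sample(
        n=len(validation), random_state=0),
], ignore_index=True)
# keep=False drops BOTH copies of every validation row, removing it from train-test.
train_test = pd.concat([data, validation]).drop_duplicates(keep=False)
```

The `keep=False` trick is the design choice to note: because `sample` does not remove rows from `data`, concatenating the validation set back in and dropping every row that occurs twice is what makes the two datasets disjoint.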
diff --git a/train_data_creator/__init__.py b/train_data_creator/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/train_data_creator/main.py b/train_data_creator/main.py
new file mode 100755
index 0000000..eb131ea
--- /dev/null
+++ b/train_data_creator/main.py
@@ -0,0 +1,67 @@
+#!/usr/bin/env python3
+
+import gc
+
+from train_data_creator.src.main.exporter import Exporter
+from train_data_creator.src.main.split_datasets import SplitDatasets
+from train_data_creator.src.main.sample_weighter import SampleWeighter
+from train_data_creator.src.main.data_preprocessor import VKGL, ClinVar
+from train_data_creator.src.main.utilities import correct_order_vcf_notation
+from train_data_creator.src.main.command_line_parser import CommandLineParser
+from train_data_creator.src.main.duplicate_processor import DuplicateProcessor
+from train_data_creator.src.main.validators.input_validator import InputValidator
+
+
+def main():
+    # Parse command line
+    print('Parsing command line arguments.')
+    clp = CommandLineParser()
+    input_vkgl = clp.get_argument('input_vkgl')
+    input_clinvar = clp.get_argument('input_clinvar')
+    output = clp.get_argument('output')
+
+    # Validate
+    print('Validating input arguments.')
+    validator = InputValidator()
+    validator.validate_vkgl(input_vkgl)
+    validator.validate_clinvar(input_clinvar)
+    output = validator.validate_output(output)
+    print('Input arguments valid.\n')
+
+    # Parse
+    print('Parsing VKGL.')
+    vkgl = VKGL().parse(input_vkgl)
+    print('Parsing ClinVar.')
+    clinvar = ClinVar().parse(input_clinvar)
+    print('Datafiles parsed.\n')
+
+    # Combine
+    print('Combining ClinVar and VKGL')
+    data = vkgl.append(clinvar, ignore_index=True)
+    print('ClinVar and VKGL combined.\n')
+
+    # Freeing up memory
+    del vkgl, clinvar
+    gc.collect()
+
+    # Dropping duplicates
+    DuplicateProcessor().process(data)
+
+    # Applying sample weight
+    SampleWeighter().apply_sample_weight(data)
+
+    # Dividing into train-test and validation
+    train_test, 
validation = SplitDatasets().split(data) + + # Freeing up memory + del data + gc.collect() + + # Exporting + exporter = Exporter(output) + exporter.export_validation_dataset(correct_order_vcf_notation(validation)) + exporter.export_train_test_dataset(train_test) + + +if __name__ == '__main__': + main() diff --git a/train_data_creator/src/__init__.py b/train_data_creator/src/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/train_data_creator/src/main/__init__.py b/train_data_creator/src/main/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/train_data_creator/src/main/command_line_parser.py b/train_data_creator/src/main/command_line_parser.py new file mode 100644 index 0000000..d424f72 --- /dev/null +++ b/train_data_creator/src/main/command_line_parser.py @@ -0,0 +1,41 @@ +import argparse + + +class CommandLineParser: + def __init__(self): + parser = self._create_argument_parser() + self.arguments = parser.parse_args() + + def _create_argument_parser(self): + parser = argparse.ArgumentParser( + prog='Train-data-creator', + description='Creator of CAPICE train/test and validation datasets.' + ) + required = parser.add_argument_group('Required arguments') + + required.add_argument( + '--input_vkgl', + type=str, + required=True, + help='Input location of the VKGL dataset (consensus, not public)' + ) + required.add_argument( + '--input_clinvar', + type=str, + required=True, + help='Input location of the ClinVar dataset.' + ) + required.add_argument( + '-o', + '--output', + type=str, + required=True, + help='Output directory where the files should be placed.' 
+ ) + return parser + + def get_argument(self, argument_key): + value = None + if argument_key in self.arguments: + value = getattr(self.arguments, argument_key) + return value diff --git a/train_data_creator/src/main/data_preprocessor.py b/train_data_creator/src/main/data_preprocessor.py new file mode 100644 index 0000000..98283da --- /dev/null +++ b/train_data_creator/src/main/data_preprocessor.py @@ -0,0 +1,188 @@ +import gzip +import warnings +import pandas as pd + +from train_data_creator.src.main.utilities import correct_order_vcf_notation, equalize_class, \ + apply_binarized_label +from train_data_creator.src.main.validators.dataset_validator import DatasetValidator + +_COLUMNS_OF_INTEREST = ['#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class', 'review', 'source'] + + +class VKGL: + def __init__(self): + self.validator = DatasetValidator() + + def parse(self, file_location): + data = pd.read_csv(file_location, sep='\t') + self.validator.validate_vkgl(data) + self._correct_column_names(data) + self._correct_single_consensus(data) + correct_order_vcf_notation(data) + self._apply_review_status(data) + data['source'] = 'VKGL' + data = data[_COLUMNS_OF_INTEREST] + equalize_class( + data, + equalize_dict={ + 'Likely benign': 'LB', + 'Benign': 'B', + '(Likely) benign': 'LB', + 'Pathogenic': 'P', + 'Likely pathogenic': 'LP', + '(Likely) pathogenic': 'LP' + } + ) + apply_binarized_label(data) + data.reset_index(drop=True, inplace=True) + return data + + @staticmethod + def _apply_review_status(data: pd.DataFrame): + print('Applying review status.') + data['review'] = 2 + data.loc[data[data['labs'] == 1].index, 'review'] = 1 + return data + + @staticmethod + def _correct_single_consensus(data: pd.DataFrame): + print('Correcting single consensus.') + lab_concensi = [] + for col in data.columns: + if col.endswith('_link'): + lab_concensi.append(col.split('_link')[0]) + for lab in lab_concensi: + data[lab].fillna('', inplace=True) + # The true amount of labs doesn't matter, 
as long as it is at least 2 + data['labs'] = 2 + # Grabbing the variants that only have 1 lab supporting the consensus + data.loc[ + data[ + data[ + 'class' + ] == 'Classified by one lab' + ].index, + 'labs' + ] = 1 + # Correcting consensus + data.loc[ + data[ + data[ + 'class' + ] == 'Classified by one lab' + ].index, + 'class' + ] = data[lab_concensi].astype(str).agg(''.join, axis=1) + return data + + @staticmethod + def _correct_column_names(data: pd.DataFrame): + print('Correcting column names.') + data.rename( + columns={ + 'chromosome': '#CHROM', + 'start': 'POS', + 'ref': 'REF', + 'alt': 'ALT', + 'consensus_classification': 'class' + }, inplace=True + ) + return data + + +class ClinVar: + def __init__(self): + self.validator = DatasetValidator() + + def parse(self, file_location): + skiprows = self._get_n_header(file_location) + data = pd.read_csv(file_location, sep='\t', skiprows=skiprows) + self.validator.validate_clinvar(data) + self._obtain_class(data) + self._obtain_gene(data) + self._obtain_review(data) + data['source'] = 'ClinVar' + data = data[_COLUMNS_OF_INTEREST] + equalize_class( + data, + equalize_dict={ + 'Uncertain_significance': 'VUS', + 'Likely_benign': 'LB', + 'Benign': 'B', + 'Pathogenic': 'P', + 'Likely_pathogenic': 'LP', + 'Benign/Likely_benign': 'LB', + 'Pathogenic/Likely_pathogenic': 'LP' + } + ) + apply_binarized_label(data) + data.reset_index(drop=True, inplace=True) + return data + + @staticmethod + def _get_n_header(file_location): + print('Obtaining amount of skip rows.') + n_skip = 0 + if file_location.endswith('.gz'): + file_conn = gzip.open(file_location, 'rt') + else: + file_conn = open(file_location, 'rt') + for line in file_conn: + if line.strip().startswith('##'): + n_skip += 1 + else: + break + file_conn.close() + print(f'Total skip rows: {n_skip}') + return n_skip + + @staticmethod + def _obtain_gene(data): + print('Obtaining gene.') + data['gene'] = data['INFO'].str.split( + 'GENEINFO=', expand=True + )[1].str.split(':', 
expand=True)[0] + return data + + @staticmethod + def _obtain_class(data): + print('Obtaining class.') + data['class'] = data['INFO'].str.split( + 'CLNSIG=', expand=True + )[1].str.split(';', expand=True)[0] + return data + + @staticmethod + def _obtain_review(data): + print('Obtaining review.') + # Dict with Gold stars matching the ClinVar review status + # Conflicting -1 because we dont want those + stars = { + 'criteria_provided,_conflicting_interpretations': -1, + 'no_assertion_provided': 0, + 'no_assertion_criteria_provided': 0, + 'no_assertion_for_individual_variant': 0, + 'criteria_provided,_single_submitter': 1, + 'criteria_provided,_multiple_submitters,_no_conflicts': 2, + 'reviewed_by_expert_panel': 3, + 'practice_guideline': 4 + } + + # Isolating column + data['review'] = data['INFO'].str.split( + 'CLNREVSTAT=', expand=True + )[1].str.split(';', expand=True)[0] + + # Dropping uninteresting review status + for status in data['review'].unique(): + if status not in stars.keys(): + warnings.warn(f'Found unknown review status: {status}') + + # Mapping to Gold Stars values + for key, value in stars.items(): + data.loc[data[data['review'] == key].index, 'review'] = value + data['review'] = data['review'].astype(int) + + # Droppin + data.drop(data[data['review'] < 1].index, inplace=True) + return data diff --git a/train_data_creator/src/main/duplicate_processor.py b/train_data_creator/src/main/duplicate_processor.py new file mode 100644 index 0000000..0402ec2 --- /dev/null +++ b/train_data_creator/src/main/duplicate_processor.py @@ -0,0 +1,11 @@ +import pandas as pd + + +class DuplicateProcessor: + @staticmethod + def process(data: pd.DataFrame): + print('Dropping duplicates.') + data.drop_duplicates( + subset=['#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class'], inplace=True + ) + return data diff --git a/train_data_creator/src/main/exporter.py b/train_data_creator/src/main/exporter.py new file mode 100644 index 0000000..1a6a105 --- /dev/null +++ 
b/train_data_creator/src/main/exporter.py @@ -0,0 +1,71 @@ +import os +import gzip +import pandas as pd + + +class Exporter: + def __init__(self, output): + self.output = output + self.vcf_header = [ + '##fileformat=VCFv4.2', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##contig=', + '##fileDate=20200320' + ] + + def export_validation_dataset(self, data: pd.DataFrame): + self._export(data, dataset_type='validation') + + def export_train_test_dataset(self, data: pd.DataFrame): + self._export(data, dataset_type='train_test') + + def _export(self, data: pd.DataFrame, dataset_type: str): + export_loc = os.path.join(self.output, dataset_type + '.vcf.gz') + with gzip.open(export_loc, 'wt') as pseudo_vcf: + for line in self.vcf_header: + pseudo_vcf.write(f'{line}\n') + + # Adding required columns + data['QUAL'] = '.' + data['FILTER'] = 'PASS' + data['INFO'] = '.' 
+ + # Making sure that the variant can be mapped back after VEP + data['ID'] = data[ + [ + '#CHROM', + 'POS', + 'REF', + 'ALT', + 'gene', + 'binarized_label', + 'sample_weight' + ] + ].astype(str).agg('_'.join, axis=1) + data[ + ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'] + ].to_csv(export_loc, mode='a', sep='\t', index=False, compression='gzip', na_rep='.') + print(f'Export to {export_loc} successful.') diff --git a/train_data_creator/src/main/sample_weighter.py b/train_data_creator/src/main/sample_weighter.py new file mode 100644 index 0000000..9d7a264 --- /dev/null +++ b/train_data_creator/src/main/sample_weighter.py @@ -0,0 +1,15 @@ +import pandas as pd + + +class SampleWeighter: + @staticmethod + def apply_sample_weight(dataset: pd.DataFrame): + print('Applying sample weights') + sample_weights = { + 4: 1.0, + 3: 1.0, + 2: 0.9, + 1: 0.8 + } + dataset['sample_weight'] = dataset['review'].map(sample_weights) + return dataset diff --git a/train_data_creator/src/main/split_datasets.py b/train_data_creator/src/main/split_datasets.py new file mode 100644 index 0000000..d180746 --- /dev/null +++ b/train_data_creator/src/main/split_datasets.py @@ -0,0 +1,37 @@ +import gc + +import pandas as pd + + +class SplitDatasets: + def __init__(self): + self.frac = 0.5 + + def split(self, data: pd.DataFrame): + print('Splitting into validation and training.') + pathogenic_set = data[data['binarized_label'] == 1] + print(f'Amount of pathogenic variants:{pathogenic_set.shape[0]}') + benign_set = data[data['binarized_label'] == 0] + print(f'Amount of benign variants:{benign_set.shape[0]}') + validation = pathogenic_set[pathogenic_set['sample_weight'] >= 0.9].sample(frac=self.frac) + print(f'Sampled: {validation.shape[0]} high confidence pathogenic variants.') + if benign_set[benign_set['sample_weight'] >= 0.9].shape[0] < validation.shape[0]: + raise ValueError( + f'Not enough benign variants to match pathogenic variants, unable to create ' + f'validation set.' 
+ ) + validation = validation.append( + benign_set[ + benign_set['sample_weight'] >= 0.9].sample(n=validation.shape[0]), ignore_index=True + ) + print(f'Validation dataset made, number of samples: {validation.shape[0]}') + del pathogenic_set, benign_set + gc.collect() + + # Creating train_test dataset + train_test = data.copy(deep=True) + train_test = train_test.append(validation, ignore_index=True) + train_test.drop_duplicates(keep=False, inplace=True) + print(f'Train dataset made, number of samples: {train_test.shape[0]}') + + return train_test, validation diff --git a/train_data_creator/src/main/utilities.py b/train_data_creator/src/main/utilities.py new file mode 100644 index 0000000..89469e0 --- /dev/null +++ b/train_data_creator/src/main/utilities.py @@ -0,0 +1,35 @@ +import numpy as np +import pandas as pd + + +def correct_order_vcf_notation(pseudo_vcf: pd.DataFrame): + print('Ordering (pseudo_)VCF') + pseudo_vcf['order'] = pseudo_vcf['#CHROM'] + pseudo_vcf.loc[pseudo_vcf[pseudo_vcf['order'] == 'X'].index, 'order'] = 23 + pseudo_vcf.loc[pseudo_vcf[pseudo_vcf['order'] == 'Y'].index, 'order'] = 24 + pseudo_vcf.loc[pseudo_vcf[pseudo_vcf['order'] == 'MT'].index, 'order'] = 25 + pseudo_vcf['order'] = pseudo_vcf['order'].astype(int) + pseudo_vcf.sort_values(by=['order', 'POS'], inplace=True) + pseudo_vcf.drop(columns='order', inplace=True) + pseudo_vcf.reset_index(drop=True, inplace=True) + return pseudo_vcf + + +def equalize_class(data: pd.DataFrame, equalize_dict: dict): + print('Equalizing classes.') + # Drop the classes that we are not interested in. 
+ for c in data['class'].unique(): + if c not in equalize_dict.keys(): + data.drop(index=data[data['class'] == c].index, inplace=True) + for key, value in equalize_dict.items(): + data.loc[data[data['class'] == key].index, 'class'] = value + return data + + +def apply_binarized_label(data: pd.DataFrame): + print('Applying binarized label.') + data['binarized_label'] = np.nan + data.loc[data[data['class'].isin(['LB', 'B'])].index, 'binarized_label'] = 0 + data.loc[data[data['class'].isin(['LP', 'P'])].index, 'binarized_label'] = 1 + data.drop(index=data[data['binarized_label'].isnull()].index, inplace=True) + return data diff --git a/train_data_creator/src/main/validators/__init__.py b/train_data_creator/src/main/validators/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/train_data_creator/src/main/validators/dataset_validator.py b/train_data_creator/src/main/validators/dataset_validator.py new file mode 100644 index 0000000..0f0c3b7 --- /dev/null +++ b/train_data_creator/src/main/validators/dataset_validator.py @@ -0,0 +1,26 @@ +import pandas as pd + + +class DatasetValidator: + def validate_vkgl(self, vkgl_dataset: pd.DataFrame): + columns_specific_to_vkgl = ['consensus_classification', 'chromosome', 'start'] + self._validate_n_variants(vkgl_dataset, 'VKGL') + self._validate_dataset(vkgl_dataset, columns_specific_to_vkgl, 'VKGL') + + def validate_clinvar(self, clinvar_dataset: pd.DataFrame): + columns_specific_to_clinvar = ['#CHROM', 'POS', 'REF', 'ALT', 'INFO'] + self._validate_n_variants(clinvar_dataset, 'ClinVar') + self._validate_dataset(clinvar_dataset, columns_specific_to_clinvar, 'ClinVar') + + @staticmethod + def _validate_dataset(dataset: pd.DataFrame, columns_required: list, data_source: str): + for column in columns_required: + if column not in dataset.columns: + raise KeyError( + f'Input {data_source} dataset is not consensus, missing column: {column}' + ) + + @staticmethod + def _validate_n_variants(dataset: pd.DataFrame, 
data_source: str): + if dataset.shape[0] == 0: + raise EOFError(f'Data source {data_source} does not contain variants!') diff --git a/train_data_creator/src/main/validators/input_validator.py b/train_data_creator/src/main/validators/input_validator.py new file mode 100644 index 0000000..06bb725 --- /dev/null +++ b/train_data_creator/src/main/validators/input_validator.py @@ -0,0 +1,36 @@ +import os +from pathlib import Path + + +class InputValidator: + def validate_vkgl(self, input_path): + self._validate_file_exist(input_path, 'VKGL') + if not input_path.endswith(('.tsv.gz', '.tsv')): + raise IOError('Input VKGL is not a (gzipped) tsv!') + + def validate_clinvar(self, input_path): + self._validate_file_exist(input_path, 'ClinVar') + if not input_path.endswith(('.vcf', '.vcf.gz')): + raise IOError('Input ClinVar is not a (gzipped) vcf!') + + @staticmethod + def _validate_file_exist(input_path, file_type): + if not os.path.isfile(input_path): + raise FileNotFoundError(f'Input for {file_type}: {input_path} is not a file!') + + @staticmethod + def validate_output(output_path): + makedir = False + if not os.path.isdir(output_path): + path_parent = str(Path(output_path).parent) + if not os.path.isdir(path_parent): + raise OSError(f'Output {output_path} cannot be made because parent does not exist!') + elif not os.access(path_parent, os.W_OK): + raise OSError(f'Output path {output_path} is not writable!') + else: + makedir = True + elif not os.access(output_path, os.W_OK): + raise OSError(f'Output path {output_path} is not writable!') + if makedir: + os.makedirs(output_path) + return output_path diff --git a/train_data_creator/test/__init__.py b/train_data_creator/test/__init__.py new file mode 100644 index 0000000..0613eb2 --- /dev/null +++ b/train_data_creator/test/__init__.py @@ -0,0 +1,5 @@ +from pathlib import Path + + +def get_project_root_dir(): + return Path(__file__).parent.parent diff --git a/train_data_creator/test/resources/hashtag_file.txt 
b/train_data_creator/test/resources/hashtag_file.txt new file mode 100644 index 0000000..f5b7f88 --- /dev/null +++ b/train_data_creator/test/resources/hashtag_file.txt @@ -0,0 +1,30 @@ +## This file should provide a sum of 14 in the test specific to counting double hashtag lines +## +## +## +## +## +## +## +## +## +## +## +## +## + + + + + + + + + + + + + + + + diff --git a/train_data_creator/test/resources/smol_clinvar.vcf.gz b/train_data_creator/test/resources/smol_clinvar.vcf.gz new file mode 100755 index 0000000..ed343ff Binary files /dev/null and b/train_data_creator/test/resources/smol_clinvar.vcf.gz differ diff --git a/train_data_creator/test/resources/smol_vkgl.tsv.gz b/train_data_creator/test/resources/smol_vkgl.tsv.gz new file mode 100644 index 0000000..205af82 Binary files /dev/null and b/train_data_creator/test/resources/smol_vkgl.tsv.gz differ diff --git a/train_data_creator/test/resources/smol_vkgl_novars.tsv.gz b/train_data_creator/test/resources/smol_vkgl_novars.tsv.gz new file mode 100644 index 0000000..f5fb62e Binary files /dev/null and b/train_data_creator/test/resources/smol_vkgl_novars.tsv.gz differ diff --git a/train_data_creator/test/test_data_preprocessor.py b/train_data_creator/test/test_data_preprocessor.py new file mode 100644 index 0000000..eb6d624 --- /dev/null +++ b/train_data_creator/test/test_data_preprocessor.py @@ -0,0 +1,335 @@ +import os +import unittest + +import numpy as np +import pandas as pd + +from train_data_creator.test import get_project_root_dir +from train_data_creator.src.main.data_preprocessor import VKGL, ClinVar + + +class TestDataProcessor(unittest.TestCase): + __CHROM__ = '#CHROM' + + def setUp(self) -> None: + self.vkgl = os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_vkgl.tsv.gz') + self.clinvar = os.path.join(get_project_root_dir(), 'test', 'resources', + 'smol_clinvar.vcf.gz') + self.dataset = pd.DataFrame( + { + 'INFO': [ + 'SOMEVALUE=1003;CLNSIG=VUS;SOMEOTHERVALUE=103438;' + 
'CLNREVSTAT=no_assertion_provided;GENEINFO=foo:10051', + 'SOMEVALUE=1243;CLNSIG=Likely_pathogenic;SOMEOTHERVALUE=8864389;' + 'CLNREVSTAT=reviewed_by_expert_panel;GENEINFO=baz:509709' + ] + } + ) + + def test_vkgl(self): + vkgl = VKGL() + observed = vkgl.parse(self.vkgl) + expected = pd.DataFrame( + {self.__CHROM__: {0: 2, 1: 2, 2: 2, 3: 7, 4: 7, 5: 7, 6: 7, 7: 10, + 8: 11, 9: 11, 10: 11, 11: 11, 12: 11, 13: 11, + 14: 11, 15: 12, 16: 13, 17: 13, 18: 13, 19: 13, + 20: 13, 21: 13, 22: 13, 23: 13, 24: 13, 25: 13, + 26: 13, 27: 13, 28: 13, 29: 13, 30: 13, 31: 13, + 32: 13, 33: 13, 34: 13, 35: 13, 36: 14, 37: 16, + 38: 17, 39: 17, 40: 17, 41: 17, 42: 17, 43: 17, + 44: 17, 45: 17, 46: 17, 47: 17, 48: 17, 49: 17, + 50: 17, 51: 17, 52: 17, 53: 17, 54: 17, 55: 17, + 56: 17, 57: 17, 58: 17, 59: 17, 60: 19}, + 'POS': {0: 47600591, 1: 47614114, 2: 47698179, 3: 6026708, + 4: 6029452, 5: 6043386, 6: 124475296, 7: 89623860, + 8: 108119615, 9: 108119615, 10: 108121410, + 11: 108178738, 12: 108183385, 13: 108188266, + 14: 108202260, 15: 133249166, 16: 32900363, + 17: 32900930, 18: 32910891, 19: 32911703, + 20: 32912007, 21: 32913055, 22: 32913196, + 23: 32913609, 24: 32913760, 25: 32914196, + 26: 32914349, 27: 32914489, 28: 32914815, + 29: 32915135, 30: 32929360, 31: 32937429, + 32: 32944628, 33: 32971235, 34: 32972380, + 35: 32972695, 36: 65550984, 37: 68771372, + 38: 7578457, 39: 41209068, 40: 41215954, + 41: 41226601, 42: 41228668, 43: 41243190, + 44: 41244122, 45: 41244481, 46: 41245136, + 47: 41245237, 48: 41245900, 49: 41246092, + 50: 41246483, 51: 41249263, 52: 41251752, + 53: 41256074, 54: 41256266, 55: 41258361, + 56: 56780540, 57: 56801451, 58: 59924505, + 59: 59926423, 60: 11169097}, + 'REF': {0: 'T', 1: 'A', 2: 'A', 3: 'C', 4: 'GCTGA', 5: 'G', + 6: 'TAAACA', 7: 'CT', 8: 'CAA', 9: 'CA', 10: 'CT', + 11: 'G', 12: 'G', 13: 'AT', 14: 'A', 15: 'C', + 16: 'C', 17: 'A', 18: 'G', 19: 'C', 20: 'C', + 21: 'A', 22: 'G', 23: 'A', 24: 'A', 25: 'G', + 26: 'G', 27: 'G', 28: 'G', 29: 
'TACTC', 30: 'T', + 31: 'G', 32: 'G', 33: 'G', 34: 'G', 35: 'A', + 36: 'C', 37: 'C', 38: 'C', 39: 'C', 40: 'A', + 41: 'G', 42: 'C', 43: 'T', 44: 'T', 45: 'C', + 46: 'C', 47: 'A', 48: 'T', 49: 'A', 50: 'C', + 51: 'G', 52: 'G', 53: 'CA', 54: 'T', 55: 'T', + 56: 'G', 57: 'C', 58: 'A', 59: 'A', 60: 'G'}, + 'ALT': {0: 'A', 1: 'G', 2: 'G', 3: 'A', 4: 'G', 5: 'A', + 6: 'T', 7: 'C', 8: 'C', 9: 'C', 10: 'C', 11: 'A', + 12: 'A', 13: 'A', 14: 'G', 15: 'T', 16: 'CT', + 17: 'G', 18: 'A', 19: 'T', 20: 'T', 21: 'G', + 22: 'C', 23: 'C', 24: 'G', 25: 'A', 26: 'T', + 27: 'A', 28: 'A', 29: 'T', 30: 'G', 31: 'A', + 32: 'A', 33: 'A', 34: 'A', 35: 'G', 36: 'T', + 37: 'T', 38: 'T', 39: 'T', 40: 'G', 41: 'C', + 42: 'T', 43: 'G', 44: 'C', 45: 'T', 46: 'G', + 47: 'G', 48: 'G', 49: 'G', 50: 'T', 51: 'A', + 52: 'T', 53: 'C', 54: 'C', 55: 'C', 56: 'T', + 57: 'T', 58: 'G', 59: 'G', 60: 'A'}, + 'gene': {0: 'EPCAM', 1: 'EPCAM', 2: 'MSH2', 3: 'PMS2', + 4: 'PMS2', 5: 'PMS2', 6: 'POT1', 7: 'PTEN', + 8: 'ATM', 9: 'ATM', 10: 'ATM', 11: 'ATM', + 12: 'ATM', 13: 'ATM', 14: 'ATM', 15: 'POLE', + 16: 'BRCA2', 17: 'BRCA2', 18: 'BRCA2', + 19: 'BRCA2', 20: 'BRCA2', 21: 'BRCA2', + 22: 'BRCA2', 23: 'BRCA2', 24: 'BRCA2', + 25: 'BRCA2', 26: 'BRCA2', 27: 'BRCA2', + 28: 'BRCA2', 29: 'BRCA2', 30: 'BRCA2', + 31: 'BRCA2', 32: 'BRCA2', 33: 'BRCA2', + 34: 'BRCA2', 35: 'BRCA2', 36: 'MAX', 37: 'CDH1', + 38: 'TP53', 39: 'BRCA1', 40: 'BRCA1', 41: 'BRCA1', + 42: 'BRCA1', 43: 'BRCA1', 44: 'BRCA1', + 45: 'BRCA1', 46: 'BRCA1', 47: 'BRCA1', + 48: 'BRCA1', 49: 'BRCA1', 50: 'BRCA1', + 51: 'BRCA1', 52: 'BRCA1', 53: 'BRCA1', + 54: 'BRCA1', 55: 'BRCA1', 56: 'RAD51C', + 57: 'RAD51C', 58: 'BRIP1', 59: 'BRIP1', + 60: 'SMARCA4'}, + 'class': {0: 'LB', 1: 'LB', 2: 'LB', 3: 'LB', 4: 'P', + 5: 'LB', 6: 'LB', 7: 'LB', 8: 'B', 9: 'LB', + 10: 'LB', 11: 'LB', 12: 'LB', 13: 'LB', 14: 'LB', + 15: 'LB', 16: 'LB', 17: 'LB', 18: 'LB', 19: 'LB', + 20: 'LB', 21: 'LB', 22: 'LB', 23: 'LB', 24: 'LB', + 25: 'LB', 26: 'LP', 27: 'LB', 28: 'LB', 29: 'LP', + 
30: 'LB', 31: 'LB', 32: 'LB', 33: 'LB', 34: 'LB', + 35: 'LB', 36: 'LB', 37: 'LB', 38: 'LP', 39: 'LP', + 40: 'LP', 41: 'LB', 42: 'LB', 43: 'LB', 44: 'LB', + 45: 'LB', 46: 'LB', 47: 'LB', 48: 'LB', 49: 'LB', + 50: 'LB', 51: 'LB', 52: 'LB', 53: 'LB', 54: 'LB', + 55: 'LB', 56: 'LB', 57: 'LP', 58: 'LB', 59: 'B', + 60: 'LB'}, + 'review': {0: 2, 1: 1, 2: 2, 3: 2, 4: 1, 5: 2, 6: 2, 7: 2, + 8: 1, 9: 2, 10: 2, 11: 2, 12: 1, 13: 2, 14: 1, + 15: 1, 16: 2, 17: 1, 18: 2, 19: 1, 20: 2, 21: 2, + 22: 1, 23: 2, 24: 2, 25: 2, 26: 2, 27: 1, 28: 2, + 29: 2, 30: 2, 31: 2, 32: 2, 33: 1, 34: 2, 35: 2, + 36: 1, 37: 2, 38: 2, 39: 2, 40: 2, 41: 2, 42: 1, + 43: 2, 44: 2, 45: 1, 46: 2, 47: 2, 48: 2, 49: 2, + 50: 2, 51: 2, 52: 1, 53: 2, 54: 2, 55: 1, 56: 2, + 57: 2, 58: 2, 59: 1, 60: 1}, + 'source': {0: 'VKGL', 1: 'VKGL', 2: 'VKGL', 3: 'VKGL', + 4: 'VKGL', 5: 'VKGL', 6: 'VKGL', 7: 'VKGL', + 8: 'VKGL', 9: 'VKGL', 10: 'VKGL', 11: 'VKGL', + 12: 'VKGL', 13: 'VKGL', 14: 'VKGL', 15: 'VKGL', + 16: 'VKGL', 17: 'VKGL', 18: 'VKGL', 19: 'VKGL', + 20: 'VKGL', 21: 'VKGL', 22: 'VKGL', 23: 'VKGL', + 24: 'VKGL', 25: 'VKGL', 26: 'VKGL', 27: 'VKGL', + 28: 'VKGL', 29: 'VKGL', 30: 'VKGL', 31: 'VKGL', + 32: 'VKGL', 33: 'VKGL', 34: 'VKGL', 35: 'VKGL', + 36: 'VKGL', 37: 'VKGL', 38: 'VKGL', 39: 'VKGL', + 40: 'VKGL', 41: 'VKGL', 42: 'VKGL', 43: 'VKGL', + 44: 'VKGL', 45: 'VKGL', 46: 'VKGL', 47: 'VKGL', + 48: 'VKGL', 49: 'VKGL', 50: 'VKGL', 51: 'VKGL', + 52: 'VKGL', 53: 'VKGL', 54: 'VKGL', 55: 'VKGL', + 56: 'VKGL', 57: 'VKGL', 58: 'VKGL', 59: 'VKGL', + 60: 'VKGL'}, + 'binarized_label': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, + 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0, + 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.0, + 14: 0.0, 15: 0.0, 16: 0.0, 17: 0.0, + 18: 0.0, 19: 0.0, 20: 0.0, 21: 0.0, + 22: 0.0, 23: 0.0, 24: 0.0, 25: 0.0, + 26: 1.0, 27: 0.0, 28: 0.0, 29: 1.0, + 30: 0.0, 31: 0.0, 32: 0.0, 33: 0.0, + 34: 0.0, 35: 0.0, 36: 0.0, 37: 0.0, + 38: 1.0, 39: 1.0, 40: 1.0, 41: 0.0, + 42: 0.0, 43: 0.0, 44: 0.0, 45: 0.0, + 46: 0.0, 47: 
0.0, 48: 0.0, 49: 0.0, + 50: 0.0, 51: 0.0, 52: 0.0, 53: 0.0, + 54: 0.0, 55: 0.0, 56: 0.0, 57: 1.0, + 58: 0.0, 59: 0.0, 60: 0.0}}) + pd.testing.assert_frame_equal(observed, expected) + + def test_clinvar(self): + clinvar = ClinVar() + observed = clinvar.parse(self.clinvar) + expected = pd.DataFrame( + {self.__CHROM__: {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, + 11: 1, + 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, + 22: 1, + 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 1, + 33: 1, + 34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1}, + 'POS': {0: 865519, 1: 865545, 2: 865567, 3: 865579, 4: 865584, 5: 865625, 6: 865627, + 7: 865628, 8: 865662, 9: 865665, 10: 865694, 11: 865700, 12: 865705, + 13: 865723, 14: 865726, 15: 866404, 16: 866422, 17: 866431, 18: 866438, + 19: 866461, 20: 866478, 21: 871143, 22: 871143, 23: 871146, 24: 871155, + 25: 871158, 26: 871159, 27: 871173, 28: 871176, 29: 871192, 30: 871206, + 31: 871215, 32: 871215, 33: 871219, 34: 871229, 35: 871230, 36: 871239, + 37: 871252, 38: 874410, 39: 874415, 40: 874416, 41: 874451, 42: 874456, + 43: 874457}, + 'REF': {0: 'C', 1: 'G', 2: 'C', 3: 'C', 4: 'G', 5: 'G', 6: 'T', 7: 'G', 8: 'G', 9: 'G', + 10: 'C', 11: 'C', 12: 'C', 13: 'G', 14: 'C', 15: 'C', 16: 'C', 17: 'C', + 18: 'G', 19: 'G', 20: 'C', 21: 'G', 22: 'G', 23: 'C', 24: 'G', 25: 'C', + 26: 'G', 27: 'C', 28: 'C', 29: 'C', 30: 'C', 31: 'C', 32: 'C', 33: 'C', + 34: 'G', 35: 'C', 36: 'C', 37: 'C', 38: 'C', 39: 'C', 40: 'G', 41: 'C', + 42: 'G', 43: 'T'}, + 'ALT': {0: 'T', 1: 'A', 2: 'T', 3: 'T', 4: 'A', 5: 'A', 6: 'C', 7: 'A', 8: 'A', 9: 'A', + 10: 'T', 11: 'T', 12: 'T', 13: 'T', 14: 'G', 15: 'T', 16: 'T', 17: 'T', + 18: 'A', 19: 'A', 20: 'T', 21: 'A', 22: 'T', 23: 'T', 24: 'A', 25: 'T', + 26: 'A', 27: 'T', 28: 'T', 29: 'T', 30: 'T', 31: 'G', 32: 'T', 33: 'T', + 34: 'C', 35: 'T', 36: 'T', 37: 'T', 38: 'T', 39: 'T', 40: 'A', 41: 'T', + 42: 'A', 43: 'C'}, + 'gene': 
{0: 'SAMD11', 1: 'SAMD11', 2: 'SAMD11', 3: 'SAMD11', 4: 'SAMD11', 5: 'SAMD11', + 6: 'SAMD11', 7: 'SAMD11', 8: 'SAMD11', 9: 'SAMD11', 10: 'SAMD11', + 11: 'SAMD11', 12: 'SAMD11', 13: 'SAMD11', 14: 'SAMD11', 15: 'SAMD11', + 16: 'SAMD11', 17: 'SAMD11', 18: 'SAMD11', 19: 'SAMD11', 20: 'SAMD11', + 21: 'SAMD11', 22: 'SAMD11', 23: 'SAMD11', 24: 'SAMD11', 25: 'SAMD11', + 26: 'SAMD11', 27: 'SAMD11', 28: 'SAMD11', 29: 'SAMD11', 30: 'SAMD11', + 31: 'SAMD11', 32: 'SAMD11', 33: 'SAMD11', 34: 'SAMD11', 35: 'SAMD11', + 36: 'SAMD11', 37: 'SAMD11', 38: 'SAMD11', 39: 'SAMD11', 40: 'SAMD11', + 41: 'SAMD11', 42: 'SAMD11', 43: 'SAMD11'}, + 'class': {0: 'LB', 1: 'B', 2: 'LB', 3: 'LB', 4: 'B', 5: 'LB', 6: 'LB', 7: 'LB', + 8: 'LB', 9: 'B', 10: 'B', 11: 'LB', 12: 'B', 13: 'LB', 14: 'LB', 15: 'LB', + 16: 'B', 17: 'LB', 18: 'LB', 19: 'LB', 20: 'LB', 21: 'LB', 22: 'LB', 23: 'B', + 24: 'LB', 25: 'LB', 26: 'B', 27: 'LB', 28: 'B', 29: 'LB', 30: 'LB', 31: 'B', + 32: 'B', 33: 'LB', 34: 'LB', 35: 'LB', 36: 'B', 37: 'LB', 38: 'LB', 39: 'B', + 40: 'LB', 41: 'LB', 42: 'LB', 43: 'LB'}, + 'review': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, + 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, + 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 1, 33: 1, + 34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1}, + 'source': {0: 'ClinVar', 1: 'ClinVar', 2: 'ClinVar', 3: 'ClinVar', 4: 'ClinVar', + 5: 'ClinVar', 6: 'ClinVar', 7: 'ClinVar', 8: 'ClinVar', 9: 'ClinVar', + 10: 'ClinVar', 11: 'ClinVar', 12: 'ClinVar', 13: 'ClinVar', 14: 'ClinVar', + 15: 'ClinVar', 16: 'ClinVar', 17: 'ClinVar', 18: 'ClinVar', 19: 'ClinVar', + 20: 'ClinVar', 21: 'ClinVar', 22: 'ClinVar', 23: 'ClinVar', 24: 'ClinVar', + 25: 'ClinVar', 26: 'ClinVar', 27: 'ClinVar', 28: 'ClinVar', 29: 'ClinVar', + 30: 'ClinVar', 31: 'ClinVar', 32: 'ClinVar', 33: 'ClinVar', 34: 'ClinVar', + 35: 'ClinVar', 36: 'ClinVar', 37: 'ClinVar', 38: 'ClinVar', 39: 'ClinVar', 
+ 40: 'ClinVar', 41: 'ClinVar', 42: 'ClinVar', 43: 'ClinVar'}, + 'binarized_label': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, + 8: 0.0, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.0, 14: 0.0, + 15: 0.0, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0, 20: 0.0, 21: 0.0, + 22: 0.0, 23: 0.0, 24: 0.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0, + 29: 0.0, 30: 0.0, 31: 0.0, 32: 0.0, 33: 0.0, 34: 0.0, 35: 0.0, + 36: 0.0, 37: 0.0, 38: 0.0, 39: 0.0, 40: 0.0, 41: 0.0, 42: 0.0, + 43: 0.0}} + ) + pd.testing.assert_frame_equal(observed, expected) + + def test_vgkl__apply_review_status(self): + dataset = pd.DataFrame( + { + 'labs': [1, 2, 3, 4, 1] + } + ) + observed = VKGL()._apply_review_status(dataset) + expected = pd.DataFrame( + { + 'labs': [1, 2, 3, 4, 1], + 'review': [1, 2, 2, 2, 1] + } + ) + pd.testing.assert_frame_equal(observed, expected) + + def test_vkgl__correct_single_consensus(self): + one_lab = 'Classified by one lab' + link_column = ['A_link', 'Some_other_link', 'A_third_link', 'A_forth_link'] + likely_benign = '(Likely) Benign' + dataset = pd.DataFrame( + { + 'foo': [np.nan, likely_benign, np.nan, np.nan], + 'foo_link': link_column, + 'bar': ['Benign', np.nan, np.nan, np.nan], + 'bar_link': link_column, + 'baz': [np.nan, np.nan, 'Pathogenic', np.nan], + 'baz_link': link_column, + 'class': [one_lab, one_lab, one_lab, 'foo'] + } + ) + observed = VKGL()._correct_single_consensus(dataset) + expected = pd.DataFrame( + { + 'foo': ['', likely_benign, '', ''], + 'foo_link': link_column, + 'bar': ['Benign', '', '', ''], + 'bar_link': link_column, + 'baz': ['', '', 'Pathogenic', ''], + 'baz_link': link_column, + 'class': ['Benign', likely_benign, 'Pathogenic', 'foo'], + 'labs': [1, 1, 1, 2] + } + ) + pd.testing.assert_frame_equal(observed, expected) + + def test_vkgl__correct_column_names(self): + dataset = pd.DataFrame( + columns=['chromosome', 'start', 'ref', 'alt', 'consensus_classification', 'foo', 'bar'] + ) + observed = VKGL()._correct_column_names(dataset) + 
expected = pd.DataFrame( + columns=[self.__CHROM__, 'POS', 'REF', 'ALT', 'class', 'foo', 'bar'] + ) + pd.testing.assert_frame_equal(observed, expected) + + def test_clinvar__get_n_header(self): + file = os.path.join(get_project_root_dir(), 'test', 'resources', 'hashtag_file.txt') + observed = ClinVar()._get_n_header(file) + self.assertEqual(observed, 14) + + def test_clinvar__obtain_class(self): + expected = pd.concat( + [ + self.dataset, + pd.DataFrame( + { + 'class': ['VUS', 'Likely_pathogenic'] + } + ) + ], axis=1 + ) + observed = ClinVar()._obtain_class(self.dataset) + pd.testing.assert_frame_equal(observed, expected) + + def test_clinvar__obtain_gene(self): + expected = pd.concat( + [ + self.dataset, + pd.DataFrame( + { + 'gene': ['foo', 'baz'] + } + ) + ], axis=1 + ) + observed = ClinVar()._obtain_gene(self.dataset) + pd.testing.assert_frame_equal(observed, expected) + + def test_clinvar__obtain_review(self): + # Slightly modified expected since review status 0 is dropped + expected = pd.concat( + [ + pd.DataFrame(data=self.dataset.iloc[1, :].values, columns=['INFO'], index=[1]), + pd.DataFrame( + { + 'review': [3] + }, index=[1] + ) + ], axis=1 + ) + observed = ClinVar()._obtain_review(self.dataset) + pd.testing.assert_frame_equal(observed, expected) + + +if __name__ == '__main__': + unittest.main() diff --git a/train_data_creator/test/test_duplicate_processor.py b/train_data_creator/test/test_duplicate_processor.py new file mode 100644 index 0000000..cb0b2ba --- /dev/null +++ b/train_data_creator/test/test_duplicate_processor.py @@ -0,0 +1,42 @@ +import unittest + +import pandas as pd + +from train_data_creator.src.main.duplicate_processor import DuplicateProcessor + + +class TestDuplicateProcessor(unittest.TestCase): + def test_duplicate_processor(self): + dataset = pd.DataFrame( + { + '#CHROM': [1, 1, 2], + 'POS': [100, 100, 200], + 'REF': ['A', 'A', 'G'], + 'ALT': ['T', 'T', 'C'], + 'gene': ['foo', 'foo', 'bar'], + 'class': ['LP', 'LP', 'LB'], + 
'review': [2, 3, 1], + 'source': ['ClinVar', 'VKGL', 'VKGL'], + 'binarized_label': [1.0, 1.0, 0.0] + } + ) + processor = DuplicateProcessor() + observed = processor.process(dataset) + expected = pd.DataFrame( + { + '#CHROM': [1, 2], + 'POS': [100, 200], + 'REF': ['A', 'G'], + 'ALT': ['T', 'C'], + 'gene': ['foo', 'bar'], + 'class': ['LP', 'LB'], + 'review': [2, 1], + 'source': ['ClinVar', 'VKGL'], + 'binarized_label': [1.0, 0.0] + }, index=[0, 2] + ) + pd.testing.assert_frame_equal(observed, expected) + + +if __name__ == '__main__': + unittest.main() diff --git a/train_data_creator/test/test_exporter.py b/train_data_creator/test/test_exporter.py new file mode 100644 index 0000000..8a1b468 --- /dev/null +++ b/train_data_creator/test/test_exporter.py @@ -0,0 +1,41 @@ +import os +import unittest + +import pandas as pd + +from train_data_creator.test import get_project_root_dir +from train_data_creator.src.main.exporter import Exporter + + +class TestExporter(unittest.TestCase): + __test_dir__ = os.path.join(get_project_root_dir(), '.test_dir') + + @classmethod + def tearDownClass(cls) -> None: + for file in os.listdir(cls.__test_dir__): + os.remove(os.path.join(cls.__test_dir__, file)) + os.rmdir(cls.__test_dir__) + + def test_exporter(self): + if not os.path.isdir(self.__test_dir__): + os.makedirs(self.__test_dir__) + exporter = Exporter(self.__test_dir__) + some_data = pd.DataFrame( + { + '#CHROM': [1, 2, 3], + 'POS': [100, 200, 300], + 'REF': ['A', 'A', 'A'], + 'ALT': ['C', 'C', 'C'], + 'gene': ['very', 'hot', 'fuzz'], + 'binarized_label': [1, 2, 3], + 'sample_weight': [0.8, 0.9, 1.0] + } + ) + exporter.export_train_test_dataset(some_data) + self.assertIn('train_test.vcf.gz', os.listdir(self.__test_dir__)) + exporter.export_validation_dataset(some_data) + self.assertIn('validation.vcf.gz', os.listdir(self.__test_dir__)) + + +if __name__ == '__main__': + unittest.main() diff --git a/train_data_creator/test/test_sample_weighter.py 
b/train_data_creator/test/test_sample_weighter.py new file mode 100644 index 0000000..c827c35 --- /dev/null +++ b/train_data_creator/test/test_sample_weighter.py @@ -0,0 +1,30 @@ +import unittest + +import pandas as pd + +from train_data_creator.src.main.sample_weighter import SampleWeighter + + +class TestSampleWeighter(unittest.TestCase): + def test_sample_weighter(self): + dataset = pd.DataFrame( + { + 'review': [3, 4, 1, 2] + } + ) + expected = pd.concat( + [ + dataset, + pd.DataFrame( + { + 'sample_weight': [1.0, 1.0, 0.8, 0.9] + } + ) + ], axis=1 + ) + observed = SampleWeighter().apply_sample_weight(dataset) + pd.testing.assert_frame_equal(observed, expected) + + +if __name__ == '__main__': + unittest.main() diff --git a/train_data_creator/test/test_split_datasets.py b/train_data_creator/test/test_split_datasets.py new file mode 100644 index 0000000..7e85c9d --- /dev/null +++ b/train_data_creator/test/test_split_datasets.py @@ -0,0 +1,38 @@ +import unittest + +import pandas as pd + +from train_data_creator.src.main.split_datasets import SplitDatasets + + +class TestSplitDatasets(unittest.TestCase): + def test_split(self): + dataset = pd.DataFrame( + { + '#CHROM': ['1', '2', '4', '5', '6', '7', '8', '10'], + 'POS': [100, 200, 400, 500, 600, 700, 800, 1000], + 'REF': ['A', 'A', 'C', 'A', 'A', 'G', 'C', 'CG'], + 'ALT': ['T', 'T', 'G', 'T', 'T', 'C', 'G', 'AT'], + 'source': [ + 'VKGL', 'VKGL', 'VKGL', 'VKGL', 'ClinVar', 'ClinVar', 'ClinVar', 'VKGL' + ], + 'review': [2, 2, 2, 2, 2, 3, 2, 2], + 'sample_weight': [0.9, 0.9, 0.9, 0.9, 0.9, 1.0, 0.9, 0.9], + 'binarized_label': [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0] + } + ) + splitter = SplitDatasets() + observed_train, observed_validation = splitter.split(dataset) + self.assertEqual( + observed_validation[observed_validation['binarized_label'] == 1.0].shape[0], + observed_validation[observed_validation['binarized_label'] == 0.0].shape[0] + ) + self.assertGreater( + observed_validation['review'].min(), 1 + ) + for row 
in observed_validation.iterrows():
+ # assertNotIn(member, container): each validation row must be absent from train
+ self.assertNotIn(row[1].values.tolist(), observed_train.values.tolist())
+
+
+if __name__ == '__main__':
+ unittest.main() diff --git a/train_data_creator/test/test_utilities.py b/train_data_creator/test/test_utilities.py new file mode 100644 index 0000000..232bbfc --- /dev/null +++ b/train_data_creator/test/test_utilities.py @@ -0,0 +1,80 @@ +import unittest
+
+import pandas as pd
+
+from train_data_creator.src.main.utilities import correct_order_vcf_notation, equalize_class, \
+ apply_binarized_label
+
+
+class TestUtilities(unittest.TestCase):
+ @classmethod
+ def setUpClass(cls) -> None:
+ cls.dataset = pd.DataFrame(
+ {
+ '#CHROM': ['1', '2', '3', '4', '6', '7', '5', '8', '9'],
+ 'POS': [100, 200, 300, 600, 700, 400, 500, 800, 900],
+ 'REF': ['A', 'A', 'G', 'C', 'A', 'A', 'G', 'C', 'T'],
+ 'ALT': ['T', 'T', 'C', 'G', 'T', 'T', 'C', 'G', 'A'],
+ 'source': [
+ 'VKGL', 'VKGL', 'ClinVar', 'VKGL', 'VKGL', 'ClinVar', 'ClinVar', 'ClinVar',
+ 'ClinVar'
+ ],
+ 'review': [2, 2, 1, 2, 2, 2, 3, 2, 1],
+ 'binarized_label': [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
+ }
+ )
+ cls.class_dataset = pd.DataFrame(
+ {
+ 'class': ['(Likely) Benign', 'LB', 'foo', 'Pathogenic']
+ }
+ )
+ cls.equalize_dict = {'(Likely) Benign': 'LB', 'Pathogenic': 'P'}
+
+ def test_order(self):
+ observed = correct_order_vcf_notation(self.dataset)
+ expected = pd.DataFrame(
+ {
+ '#CHROM': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
+ 'POS': [100, 200, 300, 600, 500, 700, 400, 800, 900],
+ 'REF': ['A', 'A', 'G', 'C', 'G', 'A', 'A', 'C', 'T'],
+ 'ALT': ['T', 'T', 'C', 'G', 'C', 'T', 'T', 'G', 'A'],
+ 'source': [
+ 'VKGL', 'VKGL', 'ClinVar', 'VKGL', 'ClinVar', 'VKGL', 'ClinVar', 'ClinVar',
+ 'ClinVar'
+ ],
+ 'review': [2, 2, 1, 2, 3, 2, 2, 2, 1],
+ 'binarized_label': [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
+ }
+ )
+ pd.testing.assert_frame_equal(observed, expected)
+
+ def test_equalize(self):
+ observed = equalize_class(
+ self.class_dataset,
equalize_dict=self.equalize_dict + ) + expected = pd.DataFrame( + { + 'class': ['LB', 'P'] + } + ) + pd.testing.assert_frame_equal( + observed.reset_index(drop=True), expected.reset_index(drop=True) + ) + + def test_apply_binarized_label(self): + copy_class_dataset = self.class_dataset.copy(deep=True) + equalized = equalize_class(copy_class_dataset, self.equalize_dict) + observed = apply_binarized_label(equalized) + expected = pd.DataFrame( + { + 'class': ['LB', 'P'], + 'binarized_label': [0.0, 1.0] + } + ) + pd.testing.assert_frame_equal( + observed.reset_index(drop=True), expected.reset_index(drop=True) + ) + + +if __name__ == '__main__': + unittest.main() diff --git a/train_data_creator/test/validators/__init__.py b/train_data_creator/test/validators/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/train_data_creator/test/validators/test_dataset_validator.py b/train_data_creator/test/validators/test_dataset_validator.py new file mode 100644 index 0000000..e227c08 --- /dev/null +++ b/train_data_creator/test/validators/test_dataset_validator.py @@ -0,0 +1,59 @@ +import os +import unittest + +import pandas as pd + +from train_data_creator.test import get_project_root_dir +from train_data_creator.src.main.validators.dataset_validator import DatasetValidator + + +class TestDatasetValidator(unittest.TestCase): + @classmethod + def setUpClass(cls) -> None: + cls.vkgl = pd.read_csv( + os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_vkgl.tsv.gz'), + sep='\t' + ) + cls.clinvar = pd.read_csv( + os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_clinvar.vcf.gz'), + sep='\t', + skiprows=27 + ) + cls.validator = DatasetValidator() + + def test_validate_vkgl_corr(self): + self.validator.validate_vkgl(self.vkgl) + + def test_validate_clinvar_corr(self): + self.validator.validate_clinvar(self.clinvar) + + def test_validate_vkgl_incorr(self): + # Mimicking what happens when the public consensus is used. 
+ self.assertRaises( + KeyError, + self.validator.validate_vkgl, + self.vkgl.rename(columns={'consensus_classification': 'classification'}) + ) + + def test_validate_clinvar_incorr(self): + # Mimicking what happens when the amount of comment lines are incorrect. + self.assertRaises( + KeyError, + self.validator.validate_clinvar, + self.clinvar.rename(columns={'#CHROM': 'chr', 'INFO': 'something_not_info'}) + ) + + def test_raise_no_variants(self): + # No variants in the file raise EOFError + vkgl_novars = pd.read_csv( + os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_vkgl_novars.tsv.gz') + ) + self.assertRaises( + EOFError, + self.validator.validate_vkgl, + vkgl_novars + ) + + +if __name__ == '__main__': + unittest.main() diff --git a/train_data_creator/test/validators/test_input_validator.py b/train_data_creator/test/validators/test_input_validator.py new file mode 100644 index 0000000..1ac5692 --- /dev/null +++ b/train_data_creator/test/validators/test_input_validator.py @@ -0,0 +1,93 @@ +import os +import unittest +from stat import S_IREAD + +from train_data_creator.test import get_project_root_dir +from train_data_creator.src.main.validators.input_validator import InputValidator + + +class TestInputValidator(unittest.TestCase): + __DIRECTORY__ = 'some_very_special_directory' + __READONLY_DIRECTORY__ = 'a_readonly_directory' + + @classmethod + def setUpClass(cls) -> None: + cls.validator = InputValidator() + + @classmethod + def tearDownClass(cls) -> None: + if cls.__DIRECTORY__ in os.listdir(get_project_root_dir()): + os.removedirs(os.path.join(get_project_root_dir(), cls.__DIRECTORY__)) + if cls.__READONLY_DIRECTORY__ in os.listdir(get_project_root_dir()): + os.removedirs(os.path.join(get_project_root_dir(), cls.__READONLY_DIRECTORY__)) + + def test_vkgl_corr(self): + input_path = os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_vkgl.tsv.gz') + self.validator.validate_vkgl(input_path) + + def test_clinvar_corr(self): + input_path 
= os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_clinvar.vcf.gz') + self.validator.validate_clinvar(input_path) + + def test_vkgl_incorr(self): + input_path = os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_clinvar.vcf.gz') + self.assertRaises( + IOError, + self.validator.validate_vkgl, + input_path + ) + + def test_clinvar_incorr(self): + input_path = os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_vkgl.tsv.gz') + self.assertRaises( + IOError, + self.validator.validate_clinvar, + input_path + ) + + def test__validate_file_exist(self): + input_path = os.path.join(get_project_root_dir(), 'test', 'resources', 'not_a_file.tsv.gz') + self.assertRaises( + FileNotFoundError, + self.validator._validate_file_exist, + input_path, + 'VKGL' + ) + + def test_output_not_directory(self): + output = os.path.join(get_project_root_dir(), 'not_a_directory', 'not_a_directory') + self.assertRaises( + OSError, + self.validator.validate_output, + output + ) + + def test_output_not_writable(self): + output = os.path.join(get_project_root_dir(), self.__READONLY_DIRECTORY__) + os.mkdir(os.path.join(output)) + os.chmod(output, S_IREAD) + # Testing if an existing not writable directory raises OSError + self.assertRaises( + OSError, + self.validator.validate_output, + output + ) + # Testing if making a new directory in a not writable directory raises OSError + self.assertRaises( + OSError, + self.validator.validate_output, + os.path.join(output, 'not_a_directory') + ) + + def test_output_back_to_parent(self): + output = os.path.join(get_project_root_dir(), self.__DIRECTORY__) + self.validator.validate_output(output) + directories = os.listdir(get_project_root_dir()) + full_directories = [] + for directory in directories: + full_directories.append(str(os.path.join(get_project_root_dir(), directory))) + self.assertIn(output, full_directories) + + +if __name__ == '__main__': + unittest.main() diff --git a/utility_scripts/compare_old_model.py 
b/utility_scripts/compare_old_model.py new file mode 100755 index 0000000..401b332 --- /dev/null +++ b/utility_scripts/compare_old_model.py @@ -0,0 +1,346 @@ +#!/usr/bin/env python3
+
+import os
+import argparse
+import warnings
+
+import math
+import pickle
+import pandas as pd
+from matplotlib import pyplot as plt
+from sklearn.metrics import roc_auc_score
+
+
+# Defining errors
+class IncorrectFileError(Exception):
+ pass
+
+
+class DataError(Exception):
+ pass
+
+
+class CommandLineDigest:
+ def __init__(self):
+ parser = self._create_argument_parser()
+ self.arguments = parser.parse_args()
+
+ def _create_argument_parser(self):
+ parser = argparse.ArgumentParser(
+ prog='Compare old model',
+ description='Compares the results of the old CAPICE model (XGBoost >= 0.90) to the '
+ 'results of the new CAPICE model (XGBoost == 1.4.2). '
+ 'PLEASE NOTE: This script does NOT validate your input!'
+ )
+ required = parser.add_argument_group('Required arguments')
+
+ required.add_argument(
+ '--old_model_results',
+ type=str,
+ required=True,
+ help='Input location of the results file generated with the old CAPICE model '
+ '(XGBoost >= 0.90). Has to be a (gzipped) tsv!'
+ )
+
+ required.add_argument(
+ '--new_model_results',
+ type=str,
+ required=True,
+ help='Input location of the results file generated with the new CAPICE model '
+ '(XGBoost == 1.4.2). Has to be a gzipped tsv!'
+ )
+
+ required.add_argument(
+ '--vep_processed_capice_input',
+ type=str,
+ required=True,
+ help='Input location of the new CAPICE (XGBoost == 1.4.2) input TSV, processed using '
+ 'the -t flag in the conversion script. Has to be a gzipped tsv!'
+ )
+
+ required.add_argument(
+ '--new_model',
+ type=str,
+ required=True,
+ help='Input location of the XGBoost == 1.4.2 model. Has to be a .pickle.dat!'
+ )
+
+ required.add_argument(
+ '--output',
+ type=str,
+ required=True,
+ help='Output directory.'
+ ) + + return parser + + def get_argument(self, argument_key): + value = None + if argument_key in self.arguments: + value = getattr(self.arguments, argument_key) + return value + + +class Validator: + + def validate_old_input_argument(self, old_input_path): + self._validate_argument(old_input_path, ('.tsv', '.tsv.gz')) + + def validate_ol_input_argument(self, ol_input_path): + self._validate_argument(ol_input_path, '.tsv.gz') + + def validate_new_input_argument(self, new_input_path): + self._validate_argument(new_input_path, '.tsv.gz') + + def validate_model_input_argument(self, model_input_path): + self._validate_argument(model_input_path, '.pickle.dat') + + @staticmethod + def _validate_argument(argument, extension): + if not argument.endswith(extension): + raise IncorrectFileError( + f'Argument {argument} was supplied an incorrect extension. ' + f'Required extensions: {extension}' + ) + + @staticmethod + def validate_output_argument(output): + if not os.path.isdir(output): + warnings.warn(f'Output path {output} does not exist! 
Creating...')
+            try:
+                os.makedirs(output)
+            except Exception as e:
+                print(f'Error encountered creating output path: {e}')
+                exit(1)
+
+    def validate_old_dataset(self, data):
+        required_columns = ['chr_pos_ref_alt', 'GeneName', 'probabilities']
+        self._validate_dataframe(data, required_columns, 'Old CAPICE generated dataset')
+
+    def validate_ol_dataset(self, data):
+        required_columns = ['%ID']
+        self._validate_dataframe(data, required_columns, 'Original labels dataset')
+
+    def validate_new_dataset(self, data):
+        required_columns = ['score', 'chr', 'pos', 'ref', 'alt', 'gene_name']
+        self._validate_dataframe(data, required_columns, 'New CAPICE generated dataset')
+
+    @staticmethod
+    def _validate_dataframe(data, required_columns, dataset):
+        for column in required_columns:
+            if column not in data.columns:
+                raise DataError(f'Required column {column} not found in: {dataset}!')
+
+
+def main():
+    print('Obtaining command line arguments')
+    cla = CommandLineDigest()
+    cadd_location = cla.get_argument('old_model_results')
+    ol_location = cla.get_argument('vep_processed_capice_input')
+    vep_location = cla.get_argument('new_model_results')
+    model_location = cla.get_argument('new_model')
+    output = cla.get_argument('output')
+    print('Validating command line arguments')
+    validator = Validator()
+    validator.validate_old_input_argument(cadd_location)
+    validator.validate_ol_input_argument(ol_location)
+    validator.validate_new_input_argument(vep_location)
+    validator.validate_model_input_argument(model_location)
+    validator.validate_output_argument(output)
+    print('Arguments passed\n')
+
+    # Reading in data
+    print('Reading data')
+    cadd = pd.read_csv(cadd_location, sep='\t')
+    original_labels = pd.read_csv(ol_location, sep='\t')
+    vep = pd.read_csv(vep_location, sep='\t')
+    print('Validating data')
+    validator.validate_old_dataset(cadd)
+    validator.validate_ol_dataset(original_labels)
+    validator.validate_new_dataset(vep)
+    print('Data passed, processing model data')
+
with open(model_location, 'rb') as model_file:
+        model = pickle.load(model_file)
+    importances = model.get_booster().get_score(importance_type='gain')
+    importances_dataframe = pd.DataFrame(data=list(importances.values()),
+                                         index=list(importances.keys()),
+                                         columns=['score']).sort_values(by='score', ascending=False)
+    importances_dataframe['feature'] = importances_dataframe.index
+    importances_dataframe.reset_index(drop=True, inplace=True)
+    print('All data loaded:')
+    print(f'Number of variants within the old CAPICE generated dataset: {cadd.shape[0]}. '
+          f'Columns: {cadd.shape[1]}')
+    print(f'Number of variants within the original labels dataset: {original_labels.shape[0]}. '
+          f'Columns: {original_labels.shape[1]}')
+    print(f'Number of variants within the new CAPICE generated dataset '
+          f'(should match the original labels dataset!): {vep.shape[0]}. '
+          f'Columns: {vep.shape[1]}\n')
+
+    print('Creating merge columns.')
+    # Making merge column
+    cadd['ID'] = cadd[['chr_pos_ref_alt', 'GeneName']].astype(str).agg('_'.join, axis=1)
+    vep['ID'] = vep[['chr', 'pos', 'ref', 'alt', 'gene_name']].astype(str).agg('_'.join, axis=1)
+
+    # Getting the true original labels
+    original_labels['binarized_label'] = original_labels['%ID'].str.split('_', expand=True)[
+        5].astype(float)
+    original_labels['ID'] = original_labels['%ID'].str.split('_', expand=True).loc[:, :4].astype(
+        str).agg('_'.join, axis=1)
+    print('Merge columns made, starting cleaning.')
+
+    # Preparing all datasets
+    # Using copy() so the subsets are independent frames (avoids SettingWithCopyWarning)
+    vep = vep[['ID', 'score']].copy()
+    vep.rename(columns={'score': 'score_new'}, inplace=True)
+    cadd.rename(columns={'probabilities': 'score_old'}, inplace=True)
+    original_labels = original_labels[['ID', 'binarized_label']].copy()
+    print('Cleaning done.\n')
+
+    print('Starting merge.')
+    # Merging
+    merge = cadd.merge(original_labels, on='ID', how='left')
+    merge.drop(index=merge[merge['binarized_label'].isnull()].index, inplace=True)
+    merge = 
merge.merge(vep, on='ID', how='left') + merge.drop(index=merge[merge['score_new'].isnull()].index, inplace=True) + print(f'Merge done. Merge has {merge.shape[0]} variants.\n') + + print('Preparing plots.') + # Preparing plots + consequences = merge['Consequence'].unique() + ncols = 4 + nrows = math.ceil((consequences.size / ncols) + 1) + index = 1 + fig_auc = plt.figure(figsize=(20, 40)) + fig_auc.suptitle( + f'Old model vs new model AUC comparison.\n' + f'Old: {cadd_location}\n' + f'New: {vep_location}') + fig_vio = plt.figure(figsize=(20, 40)) + fig_vio.suptitle( + f'Old model vs new model score distributions.\n' + f'Old: {cadd_location}\n' + f'New: {vep_location}') + fig_box = plt.figure(figsize=(20, 40)) + fig_box.suptitle( + f'Old model vs new model score closeness (difference to true label)\n' + f'Old: {cadd_location}\n' + f'New: {vep_location}') + print('Preparation done.\n') + + print('Calculating global AUC.') + # Calculating global AUC + auc_old = round(roc_auc_score(merge['binarized_label'], merge['score_old']), 4) + auc_new = round(roc_auc_score(merge['binarized_label'], merge['score_new']), 4) + print(f'Global AUC calculated, old: {auc_old} and new: {auc_new}') + + print('Plotting global AUC.') + # Plotting global AUC + ax_auc = fig_auc.add_subplot(nrows, ncols, index) + ax_auc.bar(1, auc_old, color='red', label=f'Old: {auc_old}') + ax_auc.bar(2, auc_new, color='blue', label=f'New: {auc_new}') + ax_auc.set_title(f'Global (n={merge.shape[0]})') + ax_auc.set_xticks([1, 2], ['Old', 'New']) + ax_auc.set_xlim(0.0, 3.0) + ax_auc.set_ylim(0.0, 1.0) + ax_auc.legend(loc='lower right') + + # Plotting Violinplot for global + ax_vio = fig_vio.add_subplot(nrows, ncols, index) + ax_vio.violinplot(merge[['score_old', 'binarized_label', 'score_new']]) + ax_vio.set_title(f'Global (n={merge.shape[0]})') + ax_vio.set_xticks([1, 2, 3], ['Old', 'True', 'New']) + ax_vio.set_xlim(0.0, 4.0) + ax_vio.set_ylim(0.0, 1.0) + + # Plotting the score differences to the true label + 
merge['score_diff_old'] = abs(merge['binarized_label'] - merge['score_old'])
+    merge['score_diff_new'] = abs(merge['binarized_label'] - merge['score_new'])
+    ax_box = fig_box.add_subplot(nrows, ncols, index)
+    ax_box.boxplot(
+        [
+            merge[merge['binarized_label'] == 0]['score_diff_old'],
+            merge[merge['binarized_label'] == 0]['score_diff_new'],
+            merge[merge['binarized_label'] == 1]['score_diff_old'],
+            merge[merge['binarized_label'] == 1]['score_diff_new']
+        ], labels=['B_old', 'B_new', 'P_old', 'P_new']
+    )
+    ax_box.set_title(f'Global (n={merge.shape[0]})')
+    print('Plotting global AUC done.')
+
+    print('Plotting for each consequence:')
+    # Global plots have been made, now for each consequence
+    index = 2
+    for consequence in consequences:
+        print(f'Currently processing: {consequence}')
+        # Subsetting
+        subset = merge[merge['Consequence'] == consequence]
+
+        # Calculating
+        # Try except because when a consequence is encountered with only 1 label,
+        # roc_auc_score will throw an error
+        try:
+            auc_old = round(roc_auc_score(subset['binarized_label'], subset['score_old']), 4)
+        except ValueError:
+            print(f'For consequence {consequence}, AUC old could not be calculated.')
+            continue
+        try:
+            auc_new = round(roc_auc_score(subset['binarized_label'], subset['score_new']), 4)
+        except ValueError:
+            print(f'For consequence {consequence}, AUC new could not be calculated.')
+            continue
+
+        # Plotting auc
+        ax_auc = fig_auc.add_subplot(nrows, ncols, index)
+        ax_auc.bar(1, auc_old, color='red', label=f'Old: {auc_old}')
+        ax_auc.bar(2, auc_new, color='blue', label=f'New: {auc_new}')
+        ax_auc.set_title(f'{consequence} (n={subset.shape[0]})')
+        ax_auc.set_xticks([1, 2], ['Old', 'New'])
+        ax_auc.set_xlim(0.0, 3.0)
+        ax_auc.set_ylim(0.0, 1.0)
+        ax_auc.legend(loc='lower right')
+
+        # Plotting Violinplot for this consequence
+        ax_vio = fig_vio.add_subplot(nrows, ncols, index)
+        ax_vio.violinplot(subset[['score_old', 'binarized_label', 'score_new']])
+        ax_vio.set_title(f'{consequence} 
(n={subset.shape[0]})') + ax_vio.set_xticks([1, 2, 3], ['Old', 'True', 'New']) + ax_vio.set_xlim(0.0, 4.0) + ax_vio.set_ylim(0.0, 1.0) + + # Plotting boxplots + ax_box = fig_box.add_subplot(nrows, ncols, index) + ax_box.boxplot( + [ + subset[subset['binarized_label'] == 0]['score_diff_old'], + subset[subset['binarized_label'] == 0]['score_diff_new'], + subset[subset['binarized_label'] == 1]['score_diff_old'], + subset[subset['binarized_label'] == 1]['score_diff_new'] + ], labels=['B_old', 'B_new', 'P_old', 'P_new'] + ) + ax_box.set_title(f'{consequence} (n={subset.shape[0]})') + + index += 1 + + print('Plotting for consequences done.\n') + + print('Plotting feature importances.') + fig_table, ax_table = plt.subplots(figsize=(10, 20)) + fig_table.suptitle(f'Feature importances model: {model_location}') + fig_table.patch.set_visible(False) + ax_table.axis('off') + ax_table.table(cellText=importances_dataframe.values, colLabels=importances_dataframe.columns, + loc='center') + print('Plotting for feature importances done.\n') + + print(f'Exporting plots to: {output}') + fig_auc.savefig(os.path.join(output, 'aucs.png')) + fig_vio.savefig(os.path.join(output, 'violins.png')) + fig_box.savefig(os.path.join(output, 'box.png')) + fig_table.savefig(os.path.join(output, 'feature_importances.png')) + print('Done.') + + +if __name__ == '__main__': + main() + diff --git a/utility_scripts/liftover_variants.sh b/utility_scripts/liftover_variants.sh new file mode 100755 index 0000000..5d82f4b --- /dev/null +++ b/utility_scripts/liftover_variants.sh @@ -0,0 +1,118 @@ +#!/bin/bash +#SBATCH --job-name=liftover_variants +#SBATCH --time=05:00:00 +#SBATCH --cpus-per-task=4 +#SBATCH --mem=20gb +#SBATCH --nodes=1 + +set -e + +errcho() { echo "$@" 1>&2; } + +readonly USAGE="Easy bash script to convert a GRCh37 VCF to GRCh38 VCF +Usage: +liftover_variants.sh -i -o +-i required: the GRCh37 input VCF +-o required: the GRCh38 output VCF path+filename (except extension!) 
+
+Example:
+bash liftover_variants.sh -i /path/to/GRCh37.vcf -o /path/to/GRCh38
+
+Requirements:
+Picard
+EasyBuild
+"
+
+main() {
+  digestCommandLine "$@"
+  runLiftover
+}
+
+digestCommandLine() {
+  while getopts i:o:h flag
+  do
+    case "${flag}" in
+      i) input=${OPTARG};;
+      o) output=${OPTARG};;
+      h)
+        echo "${USAGE}"
+        exit;;
+      \?)
+        errcho "Error: invalid option"
+        echo "${USAGE}"
+        exit 1;;
+    esac
+  done
+
+  validateCommandLine
+
+  echo "CLA passed"
+}
+
+validateCommandLine() {
+  local valid_command_line=true
+  if [ -z "${input}" ]
+  then
+    valid_command_line=false
+    errcho "input file not set"
+  else
+    if [ ! -f "${input}" ]
+    then
+      valid_command_line=false
+      errcho "input file does not exist"
+    fi
+  fi
+
+  if [ -z "${output}" ]
+  then
+    valid_command_line=false
+    errcho "output not set"
+  fi
+
+  if [ "${valid_command_line}" == false ]
+  then
+    errcho 'exiting.'
+    exit 1
+  fi
+}
+
+
+runLiftover() {
+  # Use distinct names so the rejected VCF and the Picard index file resolve
+  # against the user-supplied prefix, not the already-suffixed output path.
+  local output_vcf="${output}.vcf"
+  local rejected_vcf="${output}_rejected.vcf"
+
+  local args=()
+
+  args+=("/apps/software/picard/2.20.5-Java-11-LTS/picard.jar" "LiftoverVcf")
+  args+=("I=${input}")
+  args+=("O=${output_vcf}")
+  args+=("CHAIN=/apps/data/GRC/b37ToHg38.over.chain")
+  args+=("REJECT=${rejected_vcf}")
+  args+=("R=/apps/data/GRC/GCA_000001405.15/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz")
+
+  echo "Loading Picard"
+
+  module load picard
+
+  echo "Running Picard"
+
+  java -jar "${args[@]}"
+
+  echo "Gzipping outputs"
+
+  gzip "${output_vcf}"
+  gzip "${rejected_vcf}"
+
+  echo "Removing indexing file if made"
+  if [ -f "${output_vcf}.idx" ]
+  then
+    rm "${output_vcf}.idx"
+    echo "Indexing file removed"
+  fi
+
+  echo "Done"
+}
+
+main "$@"
+
diff --git a/utility_scripts/vep_to_train.py b/utility_scripts/vep_to_train.py
new file mode 100755
index 0000000..ec5f833
--- /dev/null
+++ b/utility_scripts/vep_to_train.py
@@ -0,0 +1,174 @@
+#!/usr/bin/env python3
+
+import os
+import argparse
+import warnings
+
+import numpy as np
+import 
pandas as pd
+
+
+# Defining errors
+class IncorrectFileError(Exception):
+    pass
+
+
+class DataError(Exception):
+    pass
+
+
+SAMPLE_WEIGHTS = [0.8, 0.9, 1.0]
+
+
+class CommandLineDigest:
+    def __init__(self):
+        parser = self._create_argument_parser()
+        self.arguments = parser.parse_args()
+
+    def _create_argument_parser(self):
+        parser = argparse.ArgumentParser(
+            prog='Make VEP TSV train ready',
+            description='Converts a VEP output TSV (after conversion) to make it train ready.'
+        )
+        required = parser.add_argument_group('Required arguments')
+        optional = parser.add_argument_group('Optional arguments')
+
+        required.add_argument(
+            '-i',
+            '--input',
+            type=str,
+            required=True,
+            help='Input location of the VEP TSV'
+        )
+        required.add_argument(
+            '-o',
+            '--output',
+            type=str,
+            required=True,
+            help='Output path + filename'
+        )
+        optional.add_argument(
+            '-a',
+            '--assembly',
+            action='store_true',
+            help='Flag to enable GRCh38 mode.'
+        )
+        return parser
+
+    def get_argument(self, argument_key):
+        value = None
+        if argument_key in self.arguments:
+            value = getattr(self.arguments, argument_key)
+        return value
+
+
+class Validator:
+    @staticmethod
+    def validate_input_cla(input_path):
+        if not input_path.endswith(('.tsv.gz', '.tsv')):
+            raise IncorrectFileError('Input file is not a TSV!')
+
+    @staticmethod
+    def validate_output_cla(output_path):
+        if not output_path.endswith('.tsv.gz'):
+            raise IncorrectFileError('Output file has to be defined as .tsv.gz!')
+        # This is intentional.
+        # You have more than enough time to prepare directories when VEP is running.
+        if not os.path.isdir(os.path.dirname(output_path)):
+            raise NotADirectoryError('Output file has to be placed in a directory that already '
+                                     'exists!')
+
+    @staticmethod
+    def validate_input_dataset(input_data):
+        columns_must_be_present = ['%SYMBOL', '%CHROM', '%ID']
+        for column in columns_must_be_present:
+            if column not in input_data.columns:
+                raise DataError(f'Missing required column: {column}')
+
+
+def main():
+    print('Parsing CLI')
+    cli = CommandLineDigest()
+    data_path = cli.get_argument('input')
+    output = cli.get_argument('output')
+    grch38 = cli.get_argument('assembly')
+    print('')
+
+    print('Validating input arguments.')
+    validator = Validator()
+    validator.validate_input_cla(data_path)
+    validator.validate_output_cla(output)
+
+    print('Reading in dataset')
+    data = pd.read_csv(data_path, sep='\t', na_values='.')
+    print('Validating dataset')
+    validator.validate_input_dataset(data)
+    print('Read in dataset.')
+    print(f'Shape: {data.shape}')
+    print(f'Head:\n{data.head()}\n')
+
+    print('Dropping entries without gene.')
+    before_drop = data.shape[0]
+    data.drop(index=data[data['%SYMBOL'].isnull()].index, inplace=True)
+    after_drop = data.shape[0]
+    print(f'Dropped {before_drop-after_drop} variants.\n')
+
+    if grch38:
+        print('Converting chromosome column')
+        data['%CHROM'] = data['%CHROM'].str.split('chr', expand=True)[1]
+        valid_chromosomes = np.append(np.arange(1, 23).astype(str), ['X', 'Y', 'MT'])
+        before_drop = data.shape[0]
+        data.drop(data[~data["%CHROM"].isin(valid_chromosomes)].index, inplace=True)
+        after_drop = data.shape[0]
+        print(f'Dropped {before_drop-after_drop} rows due to unknown chromosome.')
+        print('Conversion done.\n')
+
+    print('Dropping full duplicates')
+    before_drop = data.shape[0]
+    data.drop_duplicates(inplace=True)
+    after_drop = data.shape[0]
+    print(f'Dropped {before_drop-after_drop} duplicates')
+    print('Dropping duplicates done.\n')
+
+    print('Dropping mismatching gene entries.')
+    before_drop = data.shape[0]
+    data.drop(
+        
index=data[data['%ID'].str.split('_', expand=True)[4] != data['%SYMBOL']].index, + inplace=True + ) + after_drop = data.shape[0] + print(f'Dropped {before_drop-after_drop} variants.\n') + + print('Extracting sample weight and binarized_label') + data['binarized_label'] = data['%ID'].str.split('_', expand=True)[5].astype(float) + data['sample_weight'] = data['%ID'].str.split('_', expand=True)[6].astype(float) + print('') + + print('Correcting possible errors within binarized_label or sample_weight') + before_drop = data.shape[0] + # Drop everything that doesn't have a binarized_label, also drop unused columns + data.drop(index=data[data['binarized_label'].isnull()].index, columns=['%ID'], inplace=True) + data.drop(index=data[~data['binarized_label'].isin([0.0, 1.0])].index, inplace=True) + data.drop(index=data[~data['sample_weight'].isin(SAMPLE_WEIGHTS)].index, inplace=True) + after_drop = data.shape[0] + print(f'Dropped {before_drop-after_drop} variants due to incorrect binarized_label or ' + f'sample_weight.\n') + + print('Please check the sample weights:') + print(data[['binarized_label', 'sample_weight']].value_counts()) + print('') + + print(f'Final shape of data: {data.shape}') + print(f'Of which benign: {data[data["binarized_label"] == 0].shape[0]}') + print(f'Of which pathogenic: {data[data["binarized_label"] == 1].shape[0]}') + n_other = data[~data["binarized_label"].isin([0, 1])].shape[0] + if n_other > 0: + warnings.warn(f'Of which other: {n_other}') + print('') + + print(f'Done! Exporting to {output}') + data.to_csv(output, index=False, compression='gzip', na_rep='.', sep='\t') + + +if __name__ == '__main__': + main()
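Both scripts above lean on the same convention: the `%ID` column packs `chr_pos_ref_alt_gene_label_weight` into one underscore-delimited string, and `str.split('_', expand=True)` recovers the label at index 5 and the sample weight at index 6. A minimal, standalone sketch of that extraction — the IDs below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical IDs following the chr_pos_ref_alt_gene_label_weight convention
# that vep_to_train.py expects; values are invented for this example.
data = pd.DataFrame({'%ID': ['1_12345_A_T_BRCA1_1.0_0.9',
                             '2_67890_G_C_TP53_0.0_0.8']})

# expand=True yields one column per underscore-delimited field (0 through 6)
parts = data['%ID'].str.split('_', expand=True)
data['binarized_label'] = parts[5].astype(float)
data['sample_weight'] = parts[6].astype(float)

print(data[['binarized_label', 'sample_weight']])
```

Note that this silently misparses IDs with fewer fields (the missing columns become `NaN`), which is exactly why the script afterwards drops rows whose `binarized_label` is not 0/1 or whose `sample_weight` is outside `SAMPLE_WEIGHTS`.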