diff --git a/README.md b/README.md
new file mode 100644
index 0000000..3607bc1
--- /dev/null
+++ b/README.md
@@ -0,0 +1,172 @@
+# capice-resources
+
+Repository for resource files for CAPICE and for updating the CAPICE model. It contains several modules, listed below. Each
+module contains its own README (if applicable) describing how to use it.
+
+## Installation
+
+Installation should be as easy as `pip install -e '.[test]'`.
+To test the individual modules, change directory to the specific module and run `pytest`.
+
+## Modules:
+
+### validation
+
+Validation is not really a module, but rather a folder containing plots of supporting evidence of a model's performance
+per GitHub release.
+
+### train_data_creator
+
+The train_data_creator module creates new CAPICE train-test and validation datasets. For more information,
+refer to the README within the module.
+
+### utility_scripts
+
+utility_scripts contains a couple of bash scripts for ease of use, which include but are not limited to lifting over
+variants to GRCh38, running VEP104 and converting a VEP VCF output to TSV. 
+It is advised to use these scripts as they maintain a level of consistency throughout the lifespan of CAPICE.
+
+- liftover_variants.sh:
+
+The script liftover_variants.sh converts a VCF containing GRCh37 variants to GRCh38 variants.
+
+- vep_to_tsv.sh:
+
+The script vep_to_tsv.sh converts the VEP output VCF back to TSV with all required columns.
+If features are added/removed, be sure to adjust the `pre_header` variable within this bash script accordingly.
+For instance, if feature `CADD_score` is added, within pre_header `\tCADD_score` has to be added.
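
A hypothetical sketch of that adjustment (the column list shown here is illustrative, not the script's real value):

```shell
# pre_header lists every column of the output TSV, separated by tab escapes.
# The columns below are illustrative placeholders.
pre_header="CHROM\tPOS\tREF\tALT\tSYMBOL"
# Adding the CADD_score feature means appending its column:
pre_header="${pre_header}\tCADD_score"
printf '%b\n' "${pre_header}"
```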
+
+## Usage
+
+### TLDR
+
+_(For a more detailed explanation on creating the train-test and validation datasets, please see the README in
+[train_data_creator](./train_data_creator/README.md))_
+
+1. Make a new [CAPICE](https://github.com/molgenis/capice) release, containing added or removed processors and/or code changes supporting a new model.
+   1. Ensure newly added features have been added to the [impute json](https://github.com/molgenis/capice/blob/master/resources/train_impute_values.json) and/or deprecated features have been removed.
+   2. Apply changes in features to PRE_HEADER in the [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh).
+   3. Annotate the new training VCF using VEP and convert the VCF using the [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh) (raw file: [train_input_raw.vcf.gz](https://github.com/molgenis/capice/blob/master/resources/train_input_raw.vcf.gz)) (note: use `-t` when using the conversion tool).
+   4. Make the training TSV train ready using `utility_scripts/vep_to_train.py`.
+   5. Use the newly generated training TSV to create a new [PoC](https://github.com/molgenis/capice/blob/master/tests/resources/xgb_booster_poc.pickle.dat) model.
+   6. Update [predict_input.tsv.gz](https://github.com/molgenis/capice/blob/master/resources/predict_input.tsv.gz) (raw file: [predict_input_raw.vcf.gz](https://github.com/molgenis/capice/blob/master/resources/predict_input_raw.vcf.gz)) with altered features.
+   7. Update [breakends_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/breakends_vep.tsv.gz) (raw file: [breakends.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/breakends.vcf.gz)) with altered features.
+   8. Update [edge_cases_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/edge_cases_vep.tsv.gz) (raw file: [edge_cases.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/edge_cases.vcf.gz)) with altered features.
+   9. Update [symbolic_alleles_vep.tsv.gz](https://github.com/molgenis/capice/blob/master/tests/resources/symbolic_alleles_vep.tsv.gz) (raw file: [symbolic_alleles.vcf.gz](https://github.com/molgenis/capice/blob/master/tests/resources/symbolic_alleles.vcf.gz)) with altered features.
+   10. Run the [CAPICE](https://github.com/molgenis/capice) tests.
+2. Download the latest non-public GRCh37 VKGL (`/apps/data/VKGL/GRCh37`)
+   and [ClinVar](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/) datasets.
+3. Use [train_data_creator](./train_data_creator/README.md) to create the train-test and validation VCFs.
+   1. `python3 ./train_data_creator/main.py --input_vkgl </path/to/vkgl_nonpublic.tsv> --input_clinvar </path/to/clinvar.vcf.gz> -o </path/to/output>`
+4. Make [capice-resources](https://github.com/molgenis/capice-resources) GitHub release, matching the CAPICE release in step 1. 
+5. Attach both train-test and validation VCFs to [capice-resources](https://github.com/molgenis/capice-resources) release.
+6. Download the latest VEP release from the [Molgenis Cloud](https://download.molgeniscloud.org/downloads/vip/images/)
+7. Use the VEP singularity image, combined with the latest VEP cache to annotate the VCF files.
+   1. `singularity exec --bind /apps,/groups,/tmp vep <command>` (See [VEP](#VEP) for the command)
+8. (Optional for XGBoost 0.72.1 models) Upload the validation.vcf to [CADD1.4-GRCh37](https://cadd.gs.washington.edu/score) for annotation.
+9. Lift over the not-yet-annotated VCFs to GRCh38 using `liftover_variants.sh`.
+   1. `sbatch liftover_variants.sh -i </path/to/step3.vcf.gz> -o </path/to/output/directory/file>` (please note: do not supply an extension, as the script does not produce a single file)
+10. Use the VEP singularity image to annotate the GRCh38 VCF files.
+11. Convert VEP annotated VCFs back to TSV using [CAPICE conversion tool](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh) (using `-t`)
+    1. `capice/scripts/convert_vep_vcf_to_tsv_capice.sh -i </path/to/vep.vcf.gz> -o </path/to/vep.tsv.gz> -t`
+12. Process the GRCh37 train-test TSV
+    1. `python3 ./utility_scripts/vep_to_train.py -i /path/to/vep.tsv.gz -o /path/to/vep_processed.tsv.gz`
+13. Repeat for GRCh37 validation (see step 12)
+14. Repeat steps 11 and 12 for train-test and validation for GRCh38 (add the `-a` flag to the `vep_to_train.py`).
+15. Update the impute JSON according to the newly added and/or removed features.
+16. Make sure the latest release of CAPICE made in step 1 is available on the GCC cluster
+    1. `pip install capice` (be sure to install within a virtual environment or use the singularity image)
+17. Use the impute JSON updated in step 15 and the train-test TSV made in step 12 to start the train protocol of CAPICE
+    1. You may want to use a custom script that loads the latest Python module, activates the virtual environment and then starts the train protocol. It can take 5 hours for a new model to train.
+    2. `module load <python>`
+    3. `source ./capice/venv/bin/activate`
+    4. `capice train -i </path/to/train-test.tsv.gz> -m </path/to/impute.json> -o </path/to/output>` 
+18. Attach new models to [CAPICE](https://github.com/molgenis/capice) and [capice-resources](https://github.com/molgenis/capice-resources) releases.
+19. Use the new model generated in step 17 to generate the CAPICE results file of the validation TSV.
+20. Use latest non `Release Candidate` model to generate CAPICE results file of the same validation TSV.
+21. Use `compare_old_model.py` in `utility_scripts` to compare performance of an old model to a new model.
+    1. `compare_old_model.py --old_model_results </path/to/old_capice_results.tsv> --vep_processed_capice_input </path/to/validation_vep.tsv.gz> --new_model_results </path/to/validation_vep_capice.tsv.gz> --output </path/to/rcX>`
+
+## Making train-test and validation VCF files
+
+Download the latest non-public GRCh37 VKGL (GCC `/apps/data/VKGL/GRCh37/vkgl_consensus_*.tsv`)
+and [Clinvar](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/) datasets and simply supply them to main.py within
+train_data_creator. For further details, use `python3 main.py -h` or refer to the README in the train_data_creator
+module.
+
+## VEP
+
+VEP is used to add annotations to the train-test and validation files. The following command is a good representation of the VEP command that should be called:
+```commandline
+vep --input_file <path to your input file> --format vcf --output_file <path to your output file> --vcf --compress_output gzip --sift s --polyphen s --numbers --symbol --shift_3prime 1 --allele_number --no_stats --offline --cache --dir_cache */path/to/latest/vep/cache* --species "homo_sapiens" --assembly *GRCh37 or GRCh38* --refseq --use_given_ref --exclude_predicted --flag_pick_allele --force_overwrite --fork 4 --dont_skip --allow_non_variant --per_gene
+```
+_(`--input_file`, `--output_file` and `--assembly` have to be manually altered)_
+
+_Command can vary based on the training features._
+
+**When comparing an XGBoost 0.72.1 model (CAPICE 2.0.0 or lower) to a new XGBoost 1.4.2 model, upload the validation.vcf.gz to [CADD](https://cadd.gs.washington.edu/score) for annotation.** (set "INCLUDE ANNOTATIONS" to True)
+
+## Post-VEP processing
+
+First, convert the VEP output VCF back to TSV using `vep_to_tsv.sh` _(Note: pre_header must be altered according to the newly added or removed features)_.
+**WARNING: Drop all columns that are not wanted within training, such as the ID column.** This step can be performed for all train-test and validation VCFs for both GRCh37 and 38.
+
+Making the converted VEP TSV train-ready requires the use of `vep_to_train.py` in `utility_scripts`. This works for both GRCh37 and GRCh38 (for GRCh38, supply the `-a` flag).
+
+This script makes sure that variants with the correct labels are preserved, so that the labels remain accurate for each variant.
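
A minimal sketch of why labels survive this round trip: the pseudo-VCF ID column written by train_data_creator encodes CHROM, POS, REF, ALT, gene, binarized label and sample weight joined by underscores (see `train_data_creator/src/main/exporter.py`), so the original label can be restored after annotation by splitting the ID. Toy data, illustrative values:

```python
import pandas as pd

# Toy IDs in the CHROM_POS_REF_ALT_gene_<binarized label>_<sample weight>
# layout used by the exporter; variants here are made up for illustration.
df = pd.DataFrame({'ID': ['1_12345_A_T_BRCA2_1.0_0.9', '2_54321_G_C_TP53_0.0_1.0']})
id_parts = df['ID'].str.split('_', expand=True)
df['binarized_label'] = id_parts[5].astype(float)
df['sample_weight'] = id_parts[6].astype(float)
print(df[['binarized_label', 'sample_weight']])
```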
+
+## Training
+
+Before the training module of CAPICE can be called, make sure the impute JSON used within training is up to date with your newly added and/or removed features.
+
+Training a new model is rather easy, but takes quite a bit of time. It is recommended to start a slurm job on the GCC cluster and to let it run overnight.
+
+The training module can be activated with the following command (assuming you have installed CAPICE as intended: `pip install .` or `pip install capice`):
+
+`capice -v train -i /path/to/processed/train_test.tsv.gz -m /path/to/up/to/date/impute.json -o /path/to/output.pickle.dat`
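
Since training takes hours, a SLURM wrapper such as the following hypothetical sketch can be submitted overnight (directives, module name and paths are illustrative and must be adapted to your cluster):

```shell
#!/bin/bash
#SBATCH --job-name=capice_train
#SBATCH --time=23:59:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=16gb
# Load the cluster's Python module and the CAPICE virtual environment
module load <python>
source ./capice/venv/bin/activate
# Start the train protocol
capice -v train -i /path/to/processed/train_test.tsv.gz \
    -m /path/to/up/to/date/impute.json -o /path/to/output.pickle.dat
```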
+
+## Generating results
+
+### XGBoost 0.90 vs XGBoost 1.4.2 models
+
+By now, you should already have a CADD 1.4 annotated output file for "old" CAPICE to score.
+Get the CAPICE result file for both the old and the new generation.
+We prefer to use CAPICE 1.3.1 for "old" CAPICE. For "new" CAPICE, use your latest release, as the model file will try to match its
+version to the CAPICE version.
+
+### XGBoost 1.4.2 vs XGBoost 1.4.2 models
+
+Work in Progress
+
+## Validate
+
+Merging the old and new validation datafiles happens within `utility_scripts/compare_old_model.py`.
+This script merges the old results file with a file containing the original labels (the direct output of a VEP output converted to TSV using the `-t` flag of `convert_vep_vcf_to_tsv_capice.sh` of [CAPICE](https://github.com/molgenis/capice/blob/master/scripts/convert_vep_vcf_to_tsv_capice.sh) is perfect).
+After that merge, and after some cleanup of columns that will no longer be used,
+the new results file is merged on top of the old results merged with their original true labels.
+
+After this merge, AUC is calculated globally, as well as an absolute score difference between the predicted score and the true label.
+These global results are then plotted, and AUC calculations are performed over all consequences present within the **old capice** datafile, as well as the absolute score difference.
+
+This absolute score difference tells us how close the scores are to the true label: 1 means the score is far off from the true label, 0 means the score matches the label. We want 0.
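
A minimal sketch of these two metrics with toy data (column names are illustrative; the real script also computes them per consequence):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Toy results: 'score' is a CAPICE prediction, 'binarized_label' the truth.
df = pd.DataFrame({
    'score': [0.1, 0.9, 0.8, 0.2],
    'binarized_label': [0.0, 1.0, 1.0, 0.0],
})
# Global AUC over all variants
auc = roc_auc_score(df['binarized_label'], df['score'])
# Absolute score difference: 0 = score matches the label, 1 = far off
df['score_diff'] = (df['score'] - df['binarized_label']).abs()
print(auc, df['score_diff'].mean())
```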
+
+Furthermore, a table of the feature importances of the new model is plotted as well, so you can see what is and what is not important.
+
+Compare the results of old vs new, discuss within the team, and decide on next steps.
+
+
+## Common issues
+
+- Training with my generated dataset causes the warning `ValueError: multiclass format is not supported`, resulting in an average
+  CV performance of NaN.
+
+This one is a bit tricky, but it comes down to a bug somewhere within the creation of the ID column. One of the binarized
+labels is set to 2; the origin is unknown. This is fixed as of CAPICE 3.0.0-beta2.
+
+## FAQ
+
+- I'm getting a `ValueError: invalid literal for int() with base 10: 'chr1'`, what's wrong?
+
+You've likely downloaded a genome build 38 file, which uses a different chromosome notation.
+`train_data_creator` only supports build 37.
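
A small illustration of the cause (build 38 prefixes contigs with `chr`, which cannot be cast to an integer):

```python
# Build 38 contig names carry a 'chr' prefix, so the int() cast fails;
# build 37 uses plain numbers, which parse fine.
try:
    int('chr1')
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: 'chr1'
print(int('1'))
```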
diff --git a/setup.py b/setup.py
new file mode 100644
index 0000000..4dc669e
--- /dev/null
+++ b/setup.py
@@ -0,0 +1,35 @@
+#!/usr/bin/env python3
+
+from setuptools import setup
+with open('README.md', 'r', encoding='utf-8') as fh:
+    long_description = fh.read()
+
+setup(
+    name='capice-resources',
+    version='1.0.0-dev',
+    url='https://capice.molgeniscloud.org/',
+    license='LGPL-3.0',
+    author='Molgenis',
+    author_email='molgenis-support@umcg.nl',
+    description='Resource files for CAPICE to train new CAPICE models',
+    long_description=long_description,
+    classifiers=[
+        'Development Status :: 4 - Beta',
+        'License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)',
+        'Programming Language :: Python :: 3.9',
+        'Programming Language :: Python :: 3.10'
+    ],
+    python_requires='>=3.9',
+    install_requires=[
+        'pandas==1.3.4',
+        'numpy==1.21.4',
+        'matplotlib==3.5.1',
+        'xgboost==1.4.2',
+        'scikit-learn==1.0.2'
+    ],
+    extras_require={
+        'test': [
+            'pytest',
+        ]
+    }
+)
diff --git a/train_data_creator/README.md b/train_data_creator/README.md
new file mode 100644
index 0000000..1c9d42a
--- /dev/null
+++ b/train_data_creator/README.md
@@ -0,0 +1,100 @@
+# Train data creator
+
+###### Resource module for CAPICE to create new train-test and validation datasets
+
+## Usage
+
+`python3 main.py --input_vkgl path/to/vkgl_consensus.tsv.gz --input_clinvar path/to/clinvar.vcf.gz -o path/to/existing/output/directory`
+
+_Note: --input_vkgl needs to be the non-public VKGL; the public variant lacks the required columns and will raise an error._
+
+_Note 2: the parent of the output directory has to exist. train_data_creator will, at most, create only 1 new directory._
+
+## Theory
+
+(data_preprocessor.py)
+
+Step 1 is to load the different dataset types, which is not entirely trivial.
+
+VCF files have header lines that pandas cannot parse directly. The number of header lines is counted and supplied to the
+pandas parser as rows to skip.
+
+The VKGL TSV is easier to parse since it doesn't require extra skips.
+
+After that the files are validated.
+
+For VKGL the file follows the following path:
+
+- Correcting the TSV to VCF-like column names.
+- Correcting the consensus-classification of "Classified by one lab" to the actual classification of the lab (also
+  setting the amount of labs for that variant to 1 for review status).
+- Ordering according to VCF standards.
+- Applying the ClinVar review status to their Gold Star Description.
+- Equalizing the (likely) benign and (likely) pathogenic (also VUS) classifications (and dropping everything else).
+- Applying a binarized label (0.0 for (likely) benign and 1.0 for (likely) pathogenic).
+
+For ClinVar the file follows the following (similar) path:
+
+- (Before the file even gets loaded) obtaining the number of header lines.
+- Obtaining the classification from the INFO field.
+- Obtaining the gene from the INFO field.
+- Obtaining the review status from the INFO field.
+- Equalizing the (likely) benign and (likely) pathogenic (also VUS) classifications (and dropping everything else).
+- Applying a binarized label (0.0 for (likely) benign and 1.0 for (likely) pathogenic).
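
The two shared final steps (equalizing classes, then binarizing) can be sketched as follows; this is a toy illustration mirroring the bullets above, not the module's actual code:

```python
import pandas as pd

# Toy ClinVar-style classes; the mapping follows the equalization bullets.
df = pd.DataFrame({'class': ['Likely_benign', 'Pathogenic', 'Benign/Likely_benign']})
df['class'] = df['class'].map(
    {'Likely_benign': 'LB', 'Benign': 'B', 'Pathogenic': 'P',
     'Likely_pathogenic': 'LP', 'Benign/Likely_benign': 'LB'})
# Binarized label: 0.0 for (likely) benign, 1.0 for (likely) pathogenic
df['binarized_label'] = df['class'].map({'B': 0.0, 'LB': 0.0, 'P': 1.0, 'LP': 1.0})
print(df['binarized_label'].tolist())  # → [0.0, 1.0, 0.0]
```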
+
+After that, both files are equal in terms of columns, column types and variables within the columns.
+
+(main.py)
+
+Both datasets are merged, with the column "source" indicating their true origin, and the originally loaded datasets are
+deleted to reduce RAM usage.
+
+(duplicate_processor.py)
+
+Once merged, duplicates have to be removed, since it is possible that VKGL samples are present within ClinVar. Entries
+are dropped as duplicates if they are equal in all of the following columns:
+
+- CHROM
+- POS
+- REF
+- ALT
+- gene
+- class
+
+Since `keep` in `pandas.DataFrame.drop_duplicates()` is left at its default (`'first'`), the first duplicate will remain present while the
+rest are dropped. Since ClinVar is appended to the VKGL dataset, the VKGL variant will be kept.
+_Note: it might be possible that duplicate variants are still present with conflicting classes, since those do not count
+as duplicates. We don't think this will be an issue due to the number of samples._
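
A toy sketch of this duplicate drop (pandas' default `keep='first'` retains the first occurrence):

```python
import pandas as pd

# Two entries for the same variant: the VKGL copy first (as in main.py,
# where ClinVar is appended to VKGL), then the ClinVar copy.
df = pd.DataFrame({
    '#CHROM': [1, 1], 'POS': [100, 100], 'REF': ['A', 'A'], 'ALT': ['T', 'T'],
    'gene': ['BRCA2', 'BRCA2'], 'class': ['LP', 'LP'],
    'source': ['VKGL', 'ClinVar'],
})
df.drop_duplicates(subset=['#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class'],
                   inplace=True)
print(df['source'].tolist())  # → ['VKGL']
```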
+
+Now that we have an assembled large dataset, we can start building the train-test and validation datasets. __But before
+we can do that__, it is useful to apply the sample weights first. Sample weights tell XGBoost (and other
+machine learning algorithms) how much it should trust a particular sample. Currently, the following ClinVar Gold
+Star ratings are used to apply the following sample weights:
+
+| Gold star rating | Sample weight |
+|:----------------:|:-------------:|
+|         0        |      Removed      |
+|         1        |      0.8      |
+|         2        |      0.9      |
+|         3        |      1.0      |
+|         4        |      1.0      |
+
+These sample weights follow the idea that [Li et al.](https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-020-00775-w) originally had for the initial model.
+Applying sample weights lower than 0.8 has the chance to influence the end score too much, in such a way that scores will not reach 0.0 for benign and 1.0 for pathogenic (beta1 iteration 1).
+Higher quality variants (2 labs or more confirming the consensus) should be rewarded with a higher sample weight.
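
A minimal sketch of applying this mapping (toy data; 0-star variants are already removed during parsing, so only ratings 1 through 4 remain):

```python
import pandas as pd

# Gold star rating -> sample weight, per the table above
weights = {1: 0.8, 2: 0.9, 3: 1.0, 4: 1.0}
df = pd.DataFrame({'review': [1, 2, 3, 4]})
df['sample_weight'] = df['review'].map(weights)
print(df['sample_weight'].tolist())  # → [0.8, 0.9, 1.0, 1.0]
```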
+
+Now that sample weights have been mapped and applied, the large dataset can be split into the train-test and validation
+datasets.
+
+(split_datasets.py)
+
+The most important part about the splitting of the large dataset into the train-test and validation datasets is that the
+validation dataset is of high quality. This is done by randomly sampling 50% of high quality pathogenic variants (review
+score 2 or higher, which means at least 2 labs have confirmed the consensus). 
+Once 50% of high quality pathogenic variants have been sampled, an equal amount of benign samples of the same quality is sampled.
+These sampled variants now form the validation dataset. 
+These validation samples are then merged back into the train-test dataset, after which a `pd.drop_duplicates(keep=False)` is performed,
+so all validation samples that were still in the train-test dataset are removed (since `sample` does not remove the dataframe entries).
+
+This results in the train-test and validation datasets, 
+that are exported to the output argument after adding a pseudo-vcf header and pseudo-vcf columns (and merging critical information to the ID column).
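
The splitting strategy can be sketched with toy data as follows (a simplified illustration, not the module's actual code):

```python
import pandas as pd

# Toy dataset: 4 pathogenic and 4 benign variants, all high quality (review >= 2)
data = pd.DataFrame({
    'variant': range(8),
    'binarized_label': [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    'review': [2, 3, 2, 2, 2, 2, 3, 2],
})
# Randomly sample 50% of the high quality pathogenic variants
pathogenic = data[(data['binarized_label'] == 1.0) & (data['review'] >= 2)]
validation_path = pathogenic.sample(frac=0.5, random_state=5)
# Sample an equal amount of high quality benign variants
benign = data[(data['binarized_label'] == 0.0) & (data['review'] >= 2)]
validation_ben = benign.sample(n=validation_path.shape[0], random_state=5)
validation = pd.concat([validation_path, validation_ben])
# keep=False drops both copies of every validation row from train-test
train_test = pd.concat([data, validation]).drop_duplicates(keep=False)
print(len(validation), len(train_test))  # → 4 4
```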
diff --git a/train_data_creator/__init__.py b/train_data_creator/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/train_data_creator/main.py b/train_data_creator/main.py
new file mode 100755
index 0000000..eb131ea
--- /dev/null
+++ b/train_data_creator/main.py
@@ -0,0 +1,67 @@
+#!/usr/bin/env python3
+
+import gc
+
+from train_data_creator.src.main.exporter import Exporter
+from train_data_creator.src.main.split_datasets import SplitDatasets
+from train_data_creator.src.main.sample_weighter import SampleWeighter
+from train_data_creator.src.main.data_preprocessor import VKGL, ClinVar
+from train_data_creator.src.main.utilities import correct_order_vcf_notation
+from train_data_creator.src.main.command_line_parser import CommandLineParser
+from train_data_creator.src.main.duplicate_processor import DuplicateProcessor
+from train_data_creator.src.main.validators.input_validator import InputValidator
+
+
+def main():
+    # Parse command line
+    print('Parsing command line arguments.')
+    clp = CommandLineParser()
+    input_vkgl = clp.get_argument('input_vkgl')
+    input_clinvar = clp.get_argument('input_clinvar')
+    output = clp.get_argument('output')
+
+    # Validate
+    print('Validating input arguments.')
+    validator = InputValidator()
+    validator.validate_vkgl(input_vkgl)
+    validator.validate_clinvar(input_clinvar)
+    output = validator.validate_output(output)
+    print('Input arguments valid.\n')
+
+    # Parse
+    print('Parsing VKGL.')
+    vkgl = VKGL().parse(input_vkgl)
+    print('Parsing ClinVar.')
+    clinvar = ClinVar().parse(input_clinvar)
+    print('Datafiles parsed.\n')
+
+    # Combine
+    print('Combining ClinVar and VKGL')
+    data = vkgl.append(clinvar, ignore_index=True)
+    print('ClinVar and VKGL combined.\n')
+
+    # Freeing up memory
+    del vkgl, clinvar
+    gc.collect()
+
+    # Dropping duplicates
+    DuplicateProcessor().process(data)
+
+    # Applying sample weight
+    SampleWeighter().apply_sample_weight(data)
+
+    # Dividing into train-test and validation
+    train_test, validation = SplitDatasets().split(data)
+
+    # Freeing up memory
+    del data
+    gc.collect()
+
+    # Exporting
+    exporter = Exporter(output)
+    exporter.export_validation_dataset(correct_order_vcf_notation(validation))
+    exporter.export_train_test_dataset(train_test)
+
+
+if __name__ == '__main__':
+    main()
diff --git a/train_data_creator/src/__init__.py b/train_data_creator/src/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/train_data_creator/src/main/__init__.py b/train_data_creator/src/main/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/train_data_creator/src/main/command_line_parser.py b/train_data_creator/src/main/command_line_parser.py
new file mode 100644
index 0000000..d424f72
--- /dev/null
+++ b/train_data_creator/src/main/command_line_parser.py
@@ -0,0 +1,41 @@
+import argparse
+
+
+class CommandLineParser:
+    def __init__(self):
+        parser = self._create_argument_parser()
+        self.arguments = parser.parse_args()
+
+    def _create_argument_parser(self):
+        parser = argparse.ArgumentParser(
+            prog='Train-data-creator',
+            description='Creator of CAPICE train/test and validation datasets.'
+        )
+        required = parser.add_argument_group('Required arguments')
+
+        required.add_argument(
+            '--input_vkgl',
+            type=str,
+            required=True,
+            help='Input location of the VKGL dataset (consensus, not public)'
+        )
+        required.add_argument(
+            '--input_clinvar',
+            type=str,
+            required=True,
+            help='Input location of the ClinVar dataset.'
+        )
+        required.add_argument(
+            '-o',
+            '--output',
+            type=str,
+            required=True,
+            help='Output directory where the files should be placed.'
+        )
+        return parser
+
+    def get_argument(self, argument_key):
+        value = None
+        if argument_key in self.arguments:
+            value = getattr(self.arguments, argument_key)
+        return value
diff --git a/train_data_creator/src/main/data_preprocessor.py b/train_data_creator/src/main/data_preprocessor.py
new file mode 100644
index 0000000..98283da
--- /dev/null
+++ b/train_data_creator/src/main/data_preprocessor.py
@@ -0,0 +1,188 @@
+import gzip
+import warnings
+import pandas as pd
+
+from train_data_creator.src.main.utilities import correct_order_vcf_notation, equalize_class, \
+    apply_binarized_label
+from train_data_creator.src.main.validators.dataset_validator import DatasetValidator
+
+_COLUMNS_OF_INTEREST = ['#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class', 'review', 'source']
+
+
+class VKGL:
+    def __init__(self):
+        self.validator = DatasetValidator()
+
+    def parse(self, file_location):
+        data = pd.read_csv(file_location, sep='\t')
+        self.validator.validate_vkgl(data)
+        self._correct_column_names(data)
+        self._correct_single_consensus(data)
+        correct_order_vcf_notation(data)
+        self._apply_review_status(data)
+        data['source'] = 'VKGL'
+        data = data[_COLUMNS_OF_INTEREST]
+        equalize_class(
+            data,
+            equalize_dict={
+                'Likely benign': 'LB',
+                'Benign': 'B',
+                '(Likely) benign': 'LB',
+                'Pathogenic': 'P',
+                'Likely pathogenic': 'LP',
+                '(Likely) pathogenic': 'LP'
+            }
+        )
+        apply_binarized_label(data)
+        data.reset_index(drop=True, inplace=True)
+        return data
+
+    @staticmethod
+    def _apply_review_status(data: pd.DataFrame):
+        print('Applying review status.')
+        data['review'] = 2
+        data.loc[data['labs'] == 1, 'review'] = 1
+        return data
+
+    @staticmethod
+    def _correct_single_consensus(data: pd.DataFrame):
+        print('Correcting single consensus.')
+        lab_consensi = []
+        for col in data.columns:
+            if col.endswith('_link'):
+                lab_consensi.append(col.split('_link')[0])
+        for lab in lab_consensi:
+            data[lab].fillna('', inplace=True)
+        # The true amount of labs doesn't matter, as long as it is at least 2
+        data['labs'] = 2
+        # Variants that only have 1 lab supporting the consensus
+        single_lab = data['class'] == 'Classified by one lab'
+        data.loc[single_lab, 'labs'] = 1
+        # Correct the consensus to the single supporting lab's classification
+        # (the other lab columns are empty strings, so the join yields it)
+        data.loc[single_lab, 'class'] = data[lab_consensi].astype(str).agg(
+            ''.join, axis=1)
+        return data
+
+    @staticmethod
+    def _correct_column_names(data: pd.DataFrame):
+        print('Correcting column names.')
+        data.rename(
+            columns={
+                'chromosome': '#CHROM',
+                'start': 'POS',
+                'ref': 'REF',
+                'alt': 'ALT',
+                'consensus_classification': 'class'
+            }, inplace=True
+        )
+        return data
+
+
+class ClinVar:
+    def __init__(self):
+        self.validator = DatasetValidator()
+
+    def parse(self, file_location):
+        skiprows = self._get_n_header(file_location)
+        data = pd.read_csv(file_location, sep='\t', skiprows=skiprows)
+        self.validator.validate_clinvar(data)
+        self._obtain_class(data)
+        self._obtain_gene(data)
+        self._obtain_review(data)
+        data['source'] = 'ClinVar'
+        data = data[_COLUMNS_OF_INTEREST]
+        equalize_class(
+            data,
+            equalize_dict={
+                'Uncertain_significance': 'VUS',
+                'Likely_benign': 'LB',
+                'Benign': 'B',
+                'Pathogenic': 'P',
+                'Likely_pathogenic': 'LP',
+                'Benign/Likely_benign': 'LB',
+                'Pathogenic/Likely_pathogenic': 'LP'
+            }
+        )
+        apply_binarized_label(data)
+        data.reset_index(drop=True, inplace=True)
+        return data
+
+    @staticmethod
+    def _get_n_header(file_location):
+        print('Obtaining amount of skip rows.')
+        n_skip = 0
+        if file_location.endswith('.gz'):
+            file_conn = gzip.open(file_location, 'rt')
+        else:
+            file_conn = open(file_location, 'rt')
+        for line in file_conn:
+            if line.strip().startswith('##'):
+                n_skip += 1
+            else:
+                break
+        file_conn.close()
+        print(f'Total skip rows: {n_skip}')
+        return n_skip
+
+    @staticmethod
+    def _obtain_gene(data):
+        print('Obtaining gene.')
+        data['gene'] = data['INFO'].str.split(
+            'GENEINFO=', expand=True
+        )[1].str.split(':', expand=True)[0]
+        return data
+
+    @staticmethod
+    def _obtain_class(data):
+        print('Obtaining class.')
+        data['class'] = data['INFO'].str.split(
+            'CLNSIG=', expand=True
+        )[1].str.split(';', expand=True)[0]
+        return data
+
+    @staticmethod
+    def _obtain_review(data):
+        print('Obtaining review.')
+        # Dict with Gold stars matching the ClinVar review status
+        # Conflicting is -1 because we don't want those
+        stars = {
+            'criteria_provided,_conflicting_interpretations': -1,
+            'no_assertion_provided': 0,
+            'no_assertion_criteria_provided': 0,
+            'no_assertion_for_individual_variant': 0,
+            'criteria_provided,_single_submitter': 1,
+            'criteria_provided,_multiple_submitters,_no_conflicts': 2,
+            'reviewed_by_expert_panel': 3,
+            'practice_guideline': 4
+        }
+
+        # Isolating column
+        data['review'] = data['INFO'].str.split(
+            'CLNREVSTAT=', expand=True
+        )[1].str.split(';', expand=True)[0]
+
+        # Warning about review statuses not present in the stars dict
+        for status in data['review'].unique():
+            if status not in stars.keys():
+                warnings.warn(f'Found unknown review status: {status}')
+
+        # Mapping to Gold Stars values
+        for key, value in stars.items():
+            data.loc[data[data['review'] == key].index, 'review'] = value
+        data['review'] = data['review'].astype(int)
+
+        # Dropping review statuses below 1 gold star (including conflicting: -1)
+        data.drop(data[data['review'] < 1].index, inplace=True)
+        return data
diff --git a/train_data_creator/src/main/duplicate_processor.py b/train_data_creator/src/main/duplicate_processor.py
new file mode 100644
index 0000000..0402ec2
--- /dev/null
+++ b/train_data_creator/src/main/duplicate_processor.py
@@ -0,0 +1,11 @@
+import pandas as pd
+
+
+class DuplicateProcessor:
+    @staticmethod
+    def process(data: pd.DataFrame):
+        print('Dropping duplicates.')
+        data.drop_duplicates(
+            subset=['#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class'], inplace=True
+        )
+        return data
diff --git a/train_data_creator/src/main/exporter.py b/train_data_creator/src/main/exporter.py
new file mode 100644
index 0000000..1a6a105
--- /dev/null
+++ b/train_data_creator/src/main/exporter.py
@@ -0,0 +1,71 @@
+import os
+import gzip
+import pandas as pd
+
+
+class Exporter:
+    def __init__(self, output):
+        self.output = output
+        self.vcf_header = [
+            '##fileformat=VCFv4.2',
+            '##contig=<ID=1,length=249250621,assembly=b37>',
+            '##contig=<ID=2,assembly=b37,length=243199373>',
+            '##contig=<ID=3,assembly=b37,length=198022430>',
+            '##contig=<ID=4,length=191154276,assembly=b37>',
+            '##contig=<ID=5,length=180915260,assembly=b37>',
+            '##contig=<ID=6,length=171115067,assembly=b37>',
+            '##contig=<ID=7,length=159138663,assembly=b37>',
+            '##contig=<ID=8,length=146364022,assembly=b37>',
+            '##contig=<ID=9,length=141213431,assembly=b37>',
+            '##contig=<ID=10,length=135534747,assembly=b37>',
+            '##contig=<ID=11,length=135006516,assembly=b37>',
+            '##contig=<ID=12,length=133851895,assembly=b37>',
+            '##contig=<ID=13,length=115169878,assembly=b37>',
+            '##contig=<ID=14,length=107349540,assembly=b37>',
+            '##contig=<ID=15,length=102531392,assembly=b37>',
+            '##contig=<ID=16,length=90354753,assembly=b37>',
+            '##contig=<ID=17,length=81195210,assembly=b37>',
+            '##contig=<ID=18,length=78077248,assembly=b37>',
+            '##contig=<ID=19,length=59128983,assembly=b37>',
+            '##contig=<ID=20,length=63025520,assembly=b37>',
+            '##contig=<ID=21,length=48129895,assembly=b37>',
+            '##contig=<ID=22,length=51304566,assembly=b37>',
+            '##contig=<ID=X,assembly=b37,length=155270560>',
+            '##contig=<ID=Y,length=59373566,assembly=b37>',
+            '##contig=<ID=MT,length=16569,assembly=b37>',
+            '##fileDate=20200320'
+        ]
+
+    def export_validation_dataset(self, data: pd.DataFrame):
+        self._export(data, dataset_type='validation')
+
+    def export_train_test_dataset(self, data: pd.DataFrame):
+        self._export(data, dataset_type='train_test')
+
+    def _export(self, data: pd.DataFrame, dataset_type: str):
+        export_loc = os.path.join(self.output, dataset_type + '.vcf.gz')
+        with gzip.open(export_loc, 'wt') as pseudo_vcf:
+            for line in self.vcf_header:
+                pseudo_vcf.write(f'{line}\n')
+
+        # Adding required columns
+        data['QUAL'] = '.'
+        data['FILTER'] = 'PASS'
+        data['INFO'] = '.'
+
+        # Making sure that the variant can be mapped back after VEP
+        data['ID'] = data[
+            [
+                '#CHROM',
+                'POS',
+                'REF',
+                'ALT',
+                'gene',
+                'binarized_label',
+                'sample_weight'
+            ]
+        ].astype(str).agg('_'.join, axis=1)
+        # mode='a' appends a second gzip member after the header block; gzip
+        # readers decompress concatenated members as a single stream
+        data[
+            ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']
+        ].to_csv(export_loc, mode='a', sep='\t', index=False, compression='gzip', na_rep='.')
+        print(f'Export to {export_loc} successful.')
diff --git a/train_data_creator/src/main/sample_weighter.py b/train_data_creator/src/main/sample_weighter.py
new file mode 100644
index 0000000..9d7a264
--- /dev/null
+++ b/train_data_creator/src/main/sample_weighter.py
@@ -0,0 +1,15 @@
+import pandas as pd
+
+
+class SampleWeighter:
+    @staticmethod
+    def apply_sample_weight(dataset: pd.DataFrame):
+        print('Applying sample weights.')
+        sample_weights = {
+            4: 1.0,
+            3: 1.0,
+            2: 0.9,
+            1: 0.8
+        }
+        dataset['sample_weight'] = dataset['review'].map(sample_weights)
+        return dataset
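The review status (1 through 4) maps directly onto a per-sample weight, with statuses 3 and 4 getting full weight. A minimal sketch of the mapping on hypothetical review values:

```python
import pandas as pd

# Higher review status -> higher confidence -> higher sample weight
sample_weights = {4: 1.0, 3: 1.0, 2: 0.9, 1: 0.8}
dataset = pd.DataFrame({'review': [1, 2, 3, 4]})
dataset['sample_weight'] = dataset['review'].map(sample_weights)
print(dataset['sample_weight'].tolist())  # [0.8, 0.9, 1.0, 1.0]
```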
diff --git a/train_data_creator/src/main/split_datasets.py b/train_data_creator/src/main/split_datasets.py
new file mode 100644
index 0000000..d180746
--- /dev/null
+++ b/train_data_creator/src/main/split_datasets.py
@@ -0,0 +1,37 @@
+import gc
+
+import pandas as pd
+
+
+class SplitDatasets:
+    def __init__(self):
+        self.frac = 0.5
+
+    def split(self, data: pd.DataFrame):
+        print('Splitting into validation and training.')
+        pathogenic_set = data[data['binarized_label'] == 1]
+        print(f'Number of pathogenic variants: {pathogenic_set.shape[0]}')
+        benign_set = data[data['binarized_label'] == 0]
+        print(f'Number of benign variants: {benign_set.shape[0]}')
+        validation = pathogenic_set[pathogenic_set['sample_weight'] >= 0.9].sample(frac=self.frac)
+        print(f'Sampled: {validation.shape[0]} high confidence pathogenic variants.')
+        high_confidence_benign = benign_set[benign_set['sample_weight'] >= 0.9]
+        if high_confidence_benign.shape[0] < validation.shape[0]:
+            raise ValueError(
+                'Not enough high confidence benign variants to match the pathogenic '
+                'variants, unable to create the validation set.'
+            )
+        # pd.concat replaces DataFrame.append, which is removed in pandas 2.0
+        validation = pd.concat(
+            [validation, high_confidence_benign.sample(n=validation.shape[0])],
+            ignore_index=True)
+        print(f'Validation dataset made, number of samples: {validation.shape[0]}')
+        del pathogenic_set, benign_set
+        gc.collect()
+
+        # Creating train_test dataset
+        train_test = data.copy(deep=True)
+        train_test = pd.concat([train_test, validation], ignore_index=True)
+        train_test.drop_duplicates(keep=False, inplace=True)
+        print(f'Train-test dataset made, number of samples: {train_test.shape[0]}')
+
+        return train_test, validation
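The `drop_duplicates(keep=False)` step above acts as an anti-join: appending the validation rows to a copy of the full set makes every validation row appear twice, and dropping all duplicated rows then removes them entirely from the remainder. A minimal sketch with hypothetical ids standing in for variant rows:

```python
import pandas as pd

# Six hypothetical rows; two are sampled into the validation set
full = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6]})
validation = full.sample(n=2, random_state=42)

# Each validation row now occurs twice, so keep=False drops all its copies
train_test = pd.concat([full, validation], ignore_index=True)
train_test = train_test.drop_duplicates(keep=False)
print(len(train_test))  # 4
```

Note this only works when validation rows are exact copies of rows in the full set; any column added in between would break the duplicate match.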
diff --git a/train_data_creator/src/main/utilities.py b/train_data_creator/src/main/utilities.py
new file mode 100644
index 0000000..89469e0
--- /dev/null
+++ b/train_data_creator/src/main/utilities.py
@@ -0,0 +1,35 @@
+import numpy as np
+import pandas as pd
+
+
+def correct_order_vcf_notation(pseudo_vcf: pd.DataFrame):
+    print('Ordering (pseudo_)VCF')
+    pseudo_vcf['order'] = pseudo_vcf['#CHROM']
+    pseudo_vcf.loc[pseudo_vcf['order'] == 'X', 'order'] = 23
+    pseudo_vcf.loc[pseudo_vcf['order'] == 'Y', 'order'] = 24
+    pseudo_vcf.loc[pseudo_vcf['order'] == 'MT', 'order'] = 25
+    pseudo_vcf['order'] = pseudo_vcf['order'].astype(int)
+    pseudo_vcf.sort_values(by=['order', 'POS'], inplace=True)
+    pseudo_vcf.drop(columns='order', inplace=True)
+    pseudo_vcf.reset_index(drop=True, inplace=True)
+    return pseudo_vcf
+
+
+def equalize_class(data: pd.DataFrame, equalize_dict: dict):
+    print('Equalizing classes.')
+    # Drop the classes that we are not interested in.
+    for c in data['class'].unique():
+        if c not in equalize_dict:
+            data.drop(index=data[data['class'] == c].index, inplace=True)
+    for key, value in equalize_dict.items():
+        data.loc[data['class'] == key, 'class'] = value
+    return data
+
+
+def apply_binarized_label(data: pd.DataFrame):
+    print('Applying binarized label.')
+    data['binarized_label'] = np.nan
+    data.loc[data['class'].isin(['LB', 'B']), 'binarized_label'] = 0
+    data.loc[data['class'].isin(['LP', 'P']), 'binarized_label'] = 1
+    data.drop(index=data[data['binarized_label'].isnull()].index, inplace=True)
+    return data
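`correct_order_vcf_notation` works by temporarily mapping X, Y and MT to 23, 24 and 25, so they sort numerically after the autosomes. A minimal sketch of the same idea on a hypothetical pseudo-VCF:

```python
import pandas as pd

# X, Y and MT are mapped to 23, 24 and 25 so they sort after chromosomes 1-22
vcf = pd.DataFrame({'#CHROM': ['X', 2, 'MT', 1], 'POS': [10, 20, 30, 40]})
vcf['order'] = vcf['#CHROM'].replace({'X': 23, 'Y': 24, 'MT': 25}).astype(int)
vcf = vcf.sort_values(by=['order', 'POS']).drop(columns='order').reset_index(drop=True)
print(vcf['#CHROM'].tolist())  # [1, 2, 'X', 'MT']
```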
diff --git a/train_data_creator/src/main/validators/__init__.py b/train_data_creator/src/main/validators/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/train_data_creator/src/main/validators/dataset_validator.py b/train_data_creator/src/main/validators/dataset_validator.py
new file mode 100644
index 0000000..0f0c3b7
--- /dev/null
+++ b/train_data_creator/src/main/validators/dataset_validator.py
@@ -0,0 +1,26 @@
+import pandas as pd
+
+
+class DatasetValidator:
+    def validate_vkgl(self, vkgl_dataset: pd.DataFrame):
+        columns_specific_to_vkgl = ['consensus_classification', 'chromosome', 'start']
+        self._validate_n_variants(vkgl_dataset, 'VKGL')
+        self._validate_dataset(vkgl_dataset, columns_specific_to_vkgl, 'VKGL')
+
+    def validate_clinvar(self, clinvar_dataset: pd.DataFrame):
+        columns_specific_to_clinvar = ['#CHROM', 'POS', 'REF', 'ALT', 'INFO']
+        self._validate_n_variants(clinvar_dataset, 'ClinVar')
+        self._validate_dataset(clinvar_dataset, columns_specific_to_clinvar, 'ClinVar')
+
+    @staticmethod
+    def _validate_dataset(dataset: pd.DataFrame, columns_required: list, data_source: str):
+        for column in columns_required:
+            if column not in dataset.columns:
+                raise KeyError(
+                    f'Input {data_source} dataset is missing required column: {column}'
+                )
+
+    @staticmethod
+    def _validate_n_variants(dataset: pd.DataFrame, data_source: str):
+        if dataset.shape[0] == 0:
+            raise EOFError(f'Data source {data_source} does not contain variants!')
diff --git a/train_data_creator/src/main/validators/input_validator.py b/train_data_creator/src/main/validators/input_validator.py
new file mode 100644
index 0000000..06bb725
--- /dev/null
+++ b/train_data_creator/src/main/validators/input_validator.py
@@ -0,0 +1,36 @@
+import os
+from pathlib import Path
+
+
+class InputValidator:
+    def validate_vkgl(self, input_path):
+        self._validate_file_exist(input_path, 'VKGL')
+        if not input_path.endswith(('.tsv.gz', '.tsv')):
+            raise IOError('Input VKGL is not a (gzipped) tsv!')
+
+    def validate_clinvar(self, input_path):
+        self._validate_file_exist(input_path, 'ClinVar')
+        if not input_path.endswith(('.vcf', '.vcf.gz')):
+            raise IOError('Input ClinVar is not a (gzipped) vcf!')
+
+    @staticmethod
+    def _validate_file_exist(input_path, file_type):
+        if not os.path.isfile(input_path):
+            raise FileNotFoundError(f'Input for {file_type}: {input_path} is not a file!')
+
+    @staticmethod
+    def validate_output(output_path):
+        makedir = False
+        if not os.path.isdir(output_path):
+            path_parent = str(Path(output_path).parent)
+            if not os.path.isdir(path_parent):
+                raise OSError(f'Output {output_path} cannot be made because parent does not exist!')
+            elif not os.access(path_parent, os.W_OK):
+                raise OSError(f'Output path {output_path} is not writable!')
+            else:
+                makedir = True
+        elif not os.access(output_path, os.W_OK):
+            raise OSError(f'Output path {output_path} is not writable!')
+        if makedir:
+            os.makedirs(output_path)
+        return output_path
diff --git a/train_data_creator/test/__init__.py b/train_data_creator/test/__init__.py
new file mode 100644
index 0000000..0613eb2
--- /dev/null
+++ b/train_data_creator/test/__init__.py
@@ -0,0 +1,5 @@
+from pathlib import Path
+
+
+def get_project_root_dir():
+    return Path(__file__).parent.parent
diff --git a/train_data_creator/test/resources/hashtag_file.txt b/train_data_creator/test/resources/hashtag_file.txt
new file mode 100644
index 0000000..f5b7f88
--- /dev/null
+++ b/train_data_creator/test/resources/hashtag_file.txt
@@ -0,0 +1,30 @@
+## This file should provide a sum of 14 in the test specific to counting double hashtag lines
+##
+##
+##
+##
+##
+##
+##
+##
+##
+##
+##
+##
+##
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/train_data_creator/test/resources/smol_clinvar.vcf.gz b/train_data_creator/test/resources/smol_clinvar.vcf.gz
new file mode 100755
index 0000000..ed343ff
Binary files /dev/null and b/train_data_creator/test/resources/smol_clinvar.vcf.gz differ
diff --git a/train_data_creator/test/resources/smol_vkgl.tsv.gz b/train_data_creator/test/resources/smol_vkgl.tsv.gz
new file mode 100644
index 0000000..205af82
Binary files /dev/null and b/train_data_creator/test/resources/smol_vkgl.tsv.gz differ
diff --git a/train_data_creator/test/resources/smol_vkgl_novars.tsv.gz b/train_data_creator/test/resources/smol_vkgl_novars.tsv.gz
new file mode 100644
index 0000000..f5fb62e
Binary files /dev/null and b/train_data_creator/test/resources/smol_vkgl_novars.tsv.gz differ
diff --git a/train_data_creator/test/test_data_preprocessor.py b/train_data_creator/test/test_data_preprocessor.py
new file mode 100644
index 0000000..eb6d624
--- /dev/null
+++ b/train_data_creator/test/test_data_preprocessor.py
@@ -0,0 +1,335 @@
+import os
+import unittest
+
+import numpy as np
+import pandas as pd
+
+from train_data_creator.test import get_project_root_dir
+from train_data_creator.src.main.data_preprocessor import VKGL, ClinVar
+
+
+class TestDataProcessor(unittest.TestCase):
+    __CHROM__ = '#CHROM'
+
+    def setUp(self) -> None:
+        self.vkgl = os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_vkgl.tsv.gz')
+        self.clinvar = os.path.join(get_project_root_dir(), 'test', 'resources',
+                                    'smol_clinvar.vcf.gz')
+        self.dataset = pd.DataFrame(
+            {
+                'INFO': [
+                    'SOMEVALUE=1003;CLNSIG=VUS;SOMEOTHERVALUE=103438;'
+                    'CLNREVSTAT=no_assertion_provided;GENEINFO=foo:10051',
+                    'SOMEVALUE=1243;CLNSIG=Likely_pathogenic;SOMEOTHERVALUE=8864389;'
+                    'CLNREVSTAT=reviewed_by_expert_panel;GENEINFO=baz:509709'
+                ]
+            }
+        )
+
+    def test_vkgl(self):
+        vkgl = VKGL()
+        observed = vkgl.parse(self.vkgl)
+        expected = pd.DataFrame(
+            {self.__CHROM__: {0: 2, 1: 2, 2: 2, 3: 7, 4: 7, 5: 7, 6: 7, 7: 10,
+                              8: 11, 9: 11, 10: 11, 11: 11, 12: 11, 13: 11,
+                              14: 11, 15: 12, 16: 13, 17: 13, 18: 13, 19: 13,
+                              20: 13, 21: 13, 22: 13, 23: 13, 24: 13, 25: 13,
+                              26: 13, 27: 13, 28: 13, 29: 13, 30: 13, 31: 13,
+                              32: 13, 33: 13, 34: 13, 35: 13, 36: 14, 37: 16,
+                              38: 17, 39: 17, 40: 17, 41: 17, 42: 17, 43: 17,
+                              44: 17, 45: 17, 46: 17, 47: 17, 48: 17, 49: 17,
+                              50: 17, 51: 17, 52: 17, 53: 17, 54: 17, 55: 17,
+                              56: 17, 57: 17, 58: 17, 59: 17, 60: 19},
+             'POS': {0: 47600591, 1: 47614114, 2: 47698179, 3: 6026708,
+                     4: 6029452, 5: 6043386, 6: 124475296, 7: 89623860,
+                     8: 108119615, 9: 108119615, 10: 108121410,
+                     11: 108178738, 12: 108183385, 13: 108188266,
+                     14: 108202260, 15: 133249166, 16: 32900363,
+                     17: 32900930, 18: 32910891, 19: 32911703,
+                     20: 32912007, 21: 32913055, 22: 32913196,
+                     23: 32913609, 24: 32913760, 25: 32914196,
+                     26: 32914349, 27: 32914489, 28: 32914815,
+                     29: 32915135, 30: 32929360, 31: 32937429,
+                     32: 32944628, 33: 32971235, 34: 32972380,
+                     35: 32972695, 36: 65550984, 37: 68771372,
+                     38: 7578457, 39: 41209068, 40: 41215954,
+                     41: 41226601, 42: 41228668, 43: 41243190,
+                     44: 41244122, 45: 41244481, 46: 41245136,
+                     47: 41245237, 48: 41245900, 49: 41246092,
+                     50: 41246483, 51: 41249263, 52: 41251752,
+                     53: 41256074, 54: 41256266, 55: 41258361,
+                     56: 56780540, 57: 56801451, 58: 59924505,
+                     59: 59926423, 60: 11169097},
+             'REF': {0: 'T', 1: 'A', 2: 'A', 3: 'C', 4: 'GCTGA', 5: 'G',
+                     6: 'TAAACA', 7: 'CT', 8: 'CAA', 9: 'CA', 10: 'CT',
+                     11: 'G', 12: 'G', 13: 'AT', 14: 'A', 15: 'C',
+                     16: 'C', 17: 'A', 18: 'G', 19: 'C', 20: 'C',
+                     21: 'A', 22: 'G', 23: 'A', 24: 'A', 25: 'G',
+                     26: 'G', 27: 'G', 28: 'G', 29: 'TACTC', 30: 'T',
+                     31: 'G', 32: 'G', 33: 'G', 34: 'G', 35: 'A',
+                     36: 'C', 37: 'C', 38: 'C', 39: 'C', 40: 'A',
+                     41: 'G', 42: 'C', 43: 'T', 44: 'T', 45: 'C',
+                     46: 'C', 47: 'A', 48: 'T', 49: 'A', 50: 'C',
+                     51: 'G', 52: 'G', 53: 'CA', 54: 'T', 55: 'T',
+                     56: 'G', 57: 'C', 58: 'A', 59: 'A', 60: 'G'},
+             'ALT': {0: 'A', 1: 'G', 2: 'G', 3: 'A', 4: 'G', 5: 'A',
+                     6: 'T', 7: 'C', 8: 'C', 9: 'C', 10: 'C', 11: 'A',
+                     12: 'A', 13: 'A', 14: 'G', 15: 'T', 16: 'CT',
+                     17: 'G', 18: 'A', 19: 'T', 20: 'T', 21: 'G',
+                     22: 'C', 23: 'C', 24: 'G', 25: 'A', 26: 'T',
+                     27: 'A', 28: 'A', 29: 'T', 30: 'G', 31: 'A',
+                     32: 'A', 33: 'A', 34: 'A', 35: 'G', 36: 'T',
+                     37: 'T', 38: 'T', 39: 'T', 40: 'G', 41: 'C',
+                     42: 'T', 43: 'G', 44: 'C', 45: 'T', 46: 'G',
+                     47: 'G', 48: 'G', 49: 'G', 50: 'T', 51: 'A',
+                     52: 'T', 53: 'C', 54: 'C', 55: 'C', 56: 'T',
+                     57: 'T', 58: 'G', 59: 'G', 60: 'A'},
+             'gene': {0: 'EPCAM', 1: 'EPCAM', 2: 'MSH2', 3: 'PMS2',
+                      4: 'PMS2', 5: 'PMS2', 6: 'POT1', 7: 'PTEN',
+                      8: 'ATM', 9: 'ATM', 10: 'ATM', 11: 'ATM',
+                      12: 'ATM', 13: 'ATM', 14: 'ATM', 15: 'POLE',
+                      16: 'BRCA2', 17: 'BRCA2', 18: 'BRCA2',
+                      19: 'BRCA2', 20: 'BRCA2', 21: 'BRCA2',
+                      22: 'BRCA2', 23: 'BRCA2', 24: 'BRCA2',
+                      25: 'BRCA2', 26: 'BRCA2', 27: 'BRCA2',
+                      28: 'BRCA2', 29: 'BRCA2', 30: 'BRCA2',
+                      31: 'BRCA2', 32: 'BRCA2', 33: 'BRCA2',
+                      34: 'BRCA2', 35: 'BRCA2', 36: 'MAX', 37: 'CDH1',
+                      38: 'TP53', 39: 'BRCA1', 40: 'BRCA1', 41: 'BRCA1',
+                      42: 'BRCA1', 43: 'BRCA1', 44: 'BRCA1',
+                      45: 'BRCA1', 46: 'BRCA1', 47: 'BRCA1',
+                      48: 'BRCA1', 49: 'BRCA1', 50: 'BRCA1',
+                      51: 'BRCA1', 52: 'BRCA1', 53: 'BRCA1',
+                      54: 'BRCA1', 55: 'BRCA1', 56: 'RAD51C',
+                      57: 'RAD51C', 58: 'BRIP1', 59: 'BRIP1',
+                      60: 'SMARCA4'},
+             'class': {0: 'LB', 1: 'LB', 2: 'LB', 3: 'LB', 4: 'P',
+                       5: 'LB', 6: 'LB', 7: 'LB', 8: 'B', 9: 'LB',
+                       10: 'LB', 11: 'LB', 12: 'LB', 13: 'LB', 14: 'LB',
+                       15: 'LB', 16: 'LB', 17: 'LB', 18: 'LB', 19: 'LB',
+                       20: 'LB', 21: 'LB', 22: 'LB', 23: 'LB', 24: 'LB',
+                       25: 'LB', 26: 'LP', 27: 'LB', 28: 'LB', 29: 'LP',
+                       30: 'LB', 31: 'LB', 32: 'LB', 33: 'LB', 34: 'LB',
+                       35: 'LB', 36: 'LB', 37: 'LB', 38: 'LP', 39: 'LP',
+                       40: 'LP', 41: 'LB', 42: 'LB', 43: 'LB', 44: 'LB',
+                       45: 'LB', 46: 'LB', 47: 'LB', 48: 'LB', 49: 'LB',
+                       50: 'LB', 51: 'LB', 52: 'LB', 53: 'LB', 54: 'LB',
+                       55: 'LB', 56: 'LB', 57: 'LP', 58: 'LB', 59: 'B',
+                       60: 'LB'},
+             'review': {0: 2, 1: 1, 2: 2, 3: 2, 4: 1, 5: 2, 6: 2, 7: 2,
+                        8: 1, 9: 2, 10: 2, 11: 2, 12: 1, 13: 2, 14: 1,
+                        15: 1, 16: 2, 17: 1, 18: 2, 19: 1, 20: 2, 21: 2,
+                        22: 1, 23: 2, 24: 2, 25: 2, 26: 2, 27: 1, 28: 2,
+                        29: 2, 30: 2, 31: 2, 32: 2, 33: 1, 34: 2, 35: 2,
+                        36: 1, 37: 2, 38: 2, 39: 2, 40: 2, 41: 2, 42: 1,
+                        43: 2, 44: 2, 45: 1, 46: 2, 47: 2, 48: 2, 49: 2,
+                        50: 2, 51: 2, 52: 1, 53: 2, 54: 2, 55: 1, 56: 2,
+                        57: 2, 58: 2, 59: 1, 60: 1},
+             'source': {0: 'VKGL', 1: 'VKGL', 2: 'VKGL', 3: 'VKGL',
+                        4: 'VKGL', 5: 'VKGL', 6: 'VKGL', 7: 'VKGL',
+                        8: 'VKGL', 9: 'VKGL', 10: 'VKGL', 11: 'VKGL',
+                        12: 'VKGL', 13: 'VKGL', 14: 'VKGL', 15: 'VKGL',
+                        16: 'VKGL', 17: 'VKGL', 18: 'VKGL', 19: 'VKGL',
+                        20: 'VKGL', 21: 'VKGL', 22: 'VKGL', 23: 'VKGL',
+                        24: 'VKGL', 25: 'VKGL', 26: 'VKGL', 27: 'VKGL',
+                        28: 'VKGL', 29: 'VKGL', 30: 'VKGL', 31: 'VKGL',
+                        32: 'VKGL', 33: 'VKGL', 34: 'VKGL', 35: 'VKGL',
+                        36: 'VKGL', 37: 'VKGL', 38: 'VKGL', 39: 'VKGL',
+                        40: 'VKGL', 41: 'VKGL', 42: 'VKGL', 43: 'VKGL',
+                        44: 'VKGL', 45: 'VKGL', 46: 'VKGL', 47: 'VKGL',
+                        48: 'VKGL', 49: 'VKGL', 50: 'VKGL', 51: 'VKGL',
+                        52: 'VKGL', 53: 'VKGL', 54: 'VKGL', 55: 'VKGL',
+                        56: 'VKGL', 57: 'VKGL', 58: 'VKGL', 59: 'VKGL',
+                        60: 'VKGL'},
+             'binarized_label': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0,
+                                 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0,
+                                 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.0,
+                                 14: 0.0, 15: 0.0, 16: 0.0, 17: 0.0,
+                                 18: 0.0, 19: 0.0, 20: 0.0, 21: 0.0,
+                                 22: 0.0, 23: 0.0, 24: 0.0, 25: 0.0,
+                                 26: 1.0, 27: 0.0, 28: 0.0, 29: 1.0,
+                                 30: 0.0, 31: 0.0, 32: 0.0, 33: 0.0,
+                                 34: 0.0, 35: 0.0, 36: 0.0, 37: 0.0,
+                                 38: 1.0, 39: 1.0, 40: 1.0, 41: 0.0,
+                                 42: 0.0, 43: 0.0, 44: 0.0, 45: 0.0,
+                                 46: 0.0, 47: 0.0, 48: 0.0, 49: 0.0,
+                                 50: 0.0, 51: 0.0, 52: 0.0, 53: 0.0,
+                                 54: 0.0, 55: 0.0, 56: 0.0, 57: 1.0,
+                                 58: 0.0, 59: 0.0, 60: 0.0}})
+        pd.testing.assert_frame_equal(observed, expected)
+
+    def test_clinvar(self):
+        clinvar = ClinVar()
+        observed = clinvar.parse(self.clinvar)
+        expected = pd.DataFrame(
+            {self.__CHROM__: {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1,
+                              11: 1,
+                              12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1,
+                              22: 1,
+                              23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 1,
+                              33: 1,
+                              34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1},
+             'POS': {0: 865519, 1: 865545, 2: 865567, 3: 865579, 4: 865584, 5: 865625, 6: 865627,
+                     7: 865628, 8: 865662, 9: 865665, 10: 865694, 11: 865700, 12: 865705,
+                     13: 865723, 14: 865726, 15: 866404, 16: 866422, 17: 866431, 18: 866438,
+                     19: 866461, 20: 866478, 21: 871143, 22: 871143, 23: 871146, 24: 871155,
+                     25: 871158, 26: 871159, 27: 871173, 28: 871176, 29: 871192, 30: 871206,
+                     31: 871215, 32: 871215, 33: 871219, 34: 871229, 35: 871230, 36: 871239,
+                     37: 871252, 38: 874410, 39: 874415, 40: 874416, 41: 874451, 42: 874456,
+                     43: 874457},
+             'REF': {0: 'C', 1: 'G', 2: 'C', 3: 'C', 4: 'G', 5: 'G', 6: 'T', 7: 'G', 8: 'G', 9: 'G',
+                     10: 'C', 11: 'C', 12: 'C', 13: 'G', 14: 'C', 15: 'C', 16: 'C', 17: 'C',
+                     18: 'G', 19: 'G', 20: 'C', 21: 'G', 22: 'G', 23: 'C', 24: 'G', 25: 'C',
+                     26: 'G', 27: 'C', 28: 'C', 29: 'C', 30: 'C', 31: 'C', 32: 'C', 33: 'C',
+                     34: 'G', 35: 'C', 36: 'C', 37: 'C', 38: 'C', 39: 'C', 40: 'G', 41: 'C',
+                     42: 'G', 43: 'T'},
+             'ALT': {0: 'T', 1: 'A', 2: 'T', 3: 'T', 4: 'A', 5: 'A', 6: 'C', 7: 'A', 8: 'A', 9: 'A',
+                     10: 'T', 11: 'T', 12: 'T', 13: 'T', 14: 'G', 15: 'T', 16: 'T', 17: 'T',
+                     18: 'A', 19: 'A', 20: 'T', 21: 'A', 22: 'T', 23: 'T', 24: 'A', 25: 'T',
+                     26: 'A', 27: 'T', 28: 'T', 29: 'T', 30: 'T', 31: 'G', 32: 'T', 33: 'T',
+                     34: 'C', 35: 'T', 36: 'T', 37: 'T', 38: 'T', 39: 'T', 40: 'A', 41: 'T',
+                     42: 'A', 43: 'C'},
+             'gene': {0: 'SAMD11', 1: 'SAMD11', 2: 'SAMD11', 3: 'SAMD11', 4: 'SAMD11', 5: 'SAMD11',
+                      6: 'SAMD11', 7: 'SAMD11', 8: 'SAMD11', 9: 'SAMD11', 10: 'SAMD11',
+                      11: 'SAMD11', 12: 'SAMD11', 13: 'SAMD11', 14: 'SAMD11', 15: 'SAMD11',
+                      16: 'SAMD11', 17: 'SAMD11', 18: 'SAMD11', 19: 'SAMD11', 20: 'SAMD11',
+                      21: 'SAMD11', 22: 'SAMD11', 23: 'SAMD11', 24: 'SAMD11', 25: 'SAMD11',
+                      26: 'SAMD11', 27: 'SAMD11', 28: 'SAMD11', 29: 'SAMD11', 30: 'SAMD11',
+                      31: 'SAMD11', 32: 'SAMD11', 33: 'SAMD11', 34: 'SAMD11', 35: 'SAMD11',
+                      36: 'SAMD11', 37: 'SAMD11', 38: 'SAMD11', 39: 'SAMD11', 40: 'SAMD11',
+                      41: 'SAMD11', 42: 'SAMD11', 43: 'SAMD11'},
+             'class': {0: 'LB', 1: 'B', 2: 'LB', 3: 'LB', 4: 'B', 5: 'LB', 6: 'LB', 7: 'LB',
+                       8: 'LB', 9: 'B', 10: 'B', 11: 'LB', 12: 'B', 13: 'LB', 14: 'LB', 15: 'LB',
+                       16: 'B', 17: 'LB', 18: 'LB', 19: 'LB', 20: 'LB', 21: 'LB', 22: 'LB', 23: 'B',
+                       24: 'LB', 25: 'LB', 26: 'B', 27: 'LB', 28: 'B', 29: 'LB', 30: 'LB', 31: 'B',
+                       32: 'B', 33: 'LB', 34: 'LB', 35: 'LB', 36: 'B', 37: 'LB', 38: 'LB', 39: 'B',
+                       40: 'LB', 41: 'LB', 42: 'LB', 43: 'LB'},
+             'review': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1,
+                        12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1,
+                        23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 1, 33: 1,
+                        34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1},
+             'source': {0: 'ClinVar', 1: 'ClinVar', 2: 'ClinVar', 3: 'ClinVar', 4: 'ClinVar',
+                        5: 'ClinVar', 6: 'ClinVar', 7: 'ClinVar', 8: 'ClinVar', 9: 'ClinVar',
+                        10: 'ClinVar', 11: 'ClinVar', 12: 'ClinVar', 13: 'ClinVar', 14: 'ClinVar',
+                        15: 'ClinVar', 16: 'ClinVar', 17: 'ClinVar', 18: 'ClinVar', 19: 'ClinVar',
+                        20: 'ClinVar', 21: 'ClinVar', 22: 'ClinVar', 23: 'ClinVar', 24: 'ClinVar',
+                        25: 'ClinVar', 26: 'ClinVar', 27: 'ClinVar', 28: 'ClinVar', 29: 'ClinVar',
+                        30: 'ClinVar', 31: 'ClinVar', 32: 'ClinVar', 33: 'ClinVar', 34: 'ClinVar',
+                        35: 'ClinVar', 36: 'ClinVar', 37: 'ClinVar', 38: 'ClinVar', 39: 'ClinVar',
+                        40: 'ClinVar', 41: 'ClinVar', 42: 'ClinVar', 43: 'ClinVar'},
+             'binarized_label': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0,
+                                 8: 0.0, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.0, 14: 0.0,
+                                 15: 0.0, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0, 20: 0.0, 21: 0.0,
+                                 22: 0.0, 23: 0.0, 24: 0.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0,
+                                 29: 0.0, 30: 0.0, 31: 0.0, 32: 0.0, 33: 0.0, 34: 0.0, 35: 0.0,
+                                 36: 0.0, 37: 0.0, 38: 0.0, 39: 0.0, 40: 0.0, 41: 0.0, 42: 0.0,
+                                 43: 0.0}}
+        )
+        pd.testing.assert_frame_equal(observed, expected)
+
+    def test_vkgl__apply_review_status(self):
+        dataset = pd.DataFrame(
+            {
+                'labs': [1, 2, 3, 4, 1]
+            }
+        )
+        observed = VKGL()._apply_review_status(dataset)
+        expected = pd.DataFrame(
+            {
+                'labs': [1, 2, 3, 4, 1],
+                'review': [1, 2, 2, 2, 1]
+            }
+        )
+        pd.testing.assert_frame_equal(observed, expected)
+
+    def test_vkgl__correct_single_consensus(self):
+        one_lab = 'Classified by one lab'
+        link_column = ['A_link', 'Some_other_link', 'A_third_link', 'A_fourth_link']
+        likely_benign = '(Likely) Benign'
+        dataset = pd.DataFrame(
+            {
+                'foo': [np.nan, likely_benign, np.nan, np.nan],
+                'foo_link': link_column,
+                'bar': ['Benign', np.nan, np.nan, np.nan],
+                'bar_link': link_column,
+                'baz': [np.nan, np.nan, 'Pathogenic', np.nan],
+                'baz_link': link_column,
+                'class': [one_lab, one_lab, one_lab, 'foo']
+            }
+        )
+        observed = VKGL()._correct_single_consensus(dataset)
+        expected = pd.DataFrame(
+            {
+                'foo': ['', likely_benign, '', ''],
+                'foo_link': link_column,
+                'bar': ['Benign', '', '', ''],
+                'bar_link': link_column,
+                'baz': ['', '', 'Pathogenic', ''],
+                'baz_link': link_column,
+                'class': ['Benign', likely_benign, 'Pathogenic', 'foo'],
+                'labs': [1, 1, 1, 2]
+            }
+        )
+        pd.testing.assert_frame_equal(observed, expected)
+
+    def test_vkgl__correct_column_names(self):
+        dataset = pd.DataFrame(
+            columns=['chromosome', 'start', 'ref', 'alt', 'consensus_classification', 'foo', 'bar']
+        )
+        observed = VKGL()._correct_column_names(dataset)
+        expected = pd.DataFrame(
+            columns=[self.__CHROM__, 'POS', 'REF', 'ALT', 'class', 'foo', 'bar']
+        )
+        pd.testing.assert_frame_equal(observed, expected)
+
+    def test_clinvar__get_n_header(self):
+        file = os.path.join(get_project_root_dir(), 'test', 'resources', 'hashtag_file.txt')
+        observed = ClinVar()._get_n_header(file)
+        self.assertEqual(observed, 14)
+
+    def test_clinvar__obtain_class(self):
+        expected = pd.concat(
+            [
+                self.dataset,
+                pd.DataFrame(
+                    {
+                        'class': ['VUS', 'Likely_pathogenic']
+                    }
+                )
+            ], axis=1
+        )
+        observed = ClinVar()._obtain_class(self.dataset)
+        pd.testing.assert_frame_equal(observed, expected)
+
+    def test_clinvar__obtain_gene(self):
+        expected = pd.concat(
+            [
+                self.dataset,
+                pd.DataFrame(
+                    {
+                        'gene': ['foo', 'baz']
+                    }
+                )
+            ], axis=1
+        )
+        observed = ClinVar()._obtain_gene(self.dataset)
+        pd.testing.assert_frame_equal(observed, expected)
+
+    def test_clinvar__obtain_review(self):
+        # Slightly modified expected since review status 0 is dropped
+        expected = pd.concat(
+            [
+                pd.DataFrame(data=self.dataset.iloc[1, :].values, columns=['INFO'], index=[1]),
+                pd.DataFrame(
+                    {
+                        'review': [3]
+                    }, index=[1]
+                )
+            ], axis=1
+        )
+        observed = ClinVar()._obtain_review(self.dataset)
+        pd.testing.assert_frame_equal(observed, expected)
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/train_data_creator/test/test_duplicate_processor.py b/train_data_creator/test/test_duplicate_processor.py
new file mode 100644
index 0000000..cb0b2ba
--- /dev/null
+++ b/train_data_creator/test/test_duplicate_processor.py
@@ -0,0 +1,42 @@
+import unittest
+
+import pandas as pd
+
+from train_data_creator.src.main.duplicate_processor import DuplicateProcessor
+
+
+class TestDuplicateProcessor(unittest.TestCase):
+    def test_duplicate_processor(self):
+        dataset = pd.DataFrame(
+            {
+                '#CHROM': [1, 1, 2],
+                'POS': [100, 100, 200],
+                'REF': ['A', 'A', 'G'],
+                'ALT': ['T', 'T', 'C'],
+                'gene': ['foo', 'foo', 'bar'],
+                'class': ['LP', 'LP', 'LB'],
+                'review': [2, 3, 1],
+                'source': ['ClinVar', 'VKGL', 'VKGL'],
+                'binarized_label': [1.0, 1.0, 0.0]
+            }
+        )
+        processor = DuplicateProcessor()
+        observed = processor.process(dataset)
+        expected = pd.DataFrame(
+            {
+                '#CHROM': [1, 2],
+                'POS': [100, 200],
+                'REF': ['A', 'G'],
+                'ALT': ['T', 'C'],
+                'gene': ['foo', 'bar'],
+                'class': ['LP', 'LB'],
+                'review': [2, 1],
+                'source': ['ClinVar', 'VKGL'],
+                'binarized_label': [1.0, 0.0]
+            }, index=[0, 2]
+        )
+        pd.testing.assert_frame_equal(observed, expected)
+
+
+if __name__ == '__main__':
+    unittest.main()
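The test above pins down the deduplication contract: rows identical on variant coordinates, gene, class and label collapse to their first occurrence, even when `review` and `source` differ. A minimal sketch of that behaviour, which may differ from the real `DuplicateProcessor`:

```python
import pandas as pd


def drop_variant_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    # Sketch of the contract the test checks: collapse rows describing the
    # same variant, keeping the first occurrence. 'review' and 'source' are
    # deliberately excluded from the subset, since they may differ per source.
    subset = ['#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class', 'binarized_label']
    return df.drop_duplicates(subset=subset, keep='first')
```

Because ClinVar rows are concatenated before VKGL rows, `keep='first'` implicitly prefers the ClinVar entry for a shared variant.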
diff --git a/train_data_creator/test/test_exporter.py b/train_data_creator/test/test_exporter.py
new file mode 100644
index 0000000..8a1b468
--- /dev/null
+++ b/train_data_creator/test/test_exporter.py
@@ -0,0 +1,41 @@
+import os
+import unittest
+
+import pandas as pd
+
+from train_data_creator.test import get_project_root_dir
+from train_data_creator.src.main.exporter import Exporter
+
+
+class TestExporter(unittest.TestCase):
+    __test_dir__ = os.path.join(get_project_root_dir(), '.test_dir')
+
+    @classmethod
+    def tearDownClass(cls) -> None:
+        for file in os.listdir(cls.__test_dir__):
+            os.remove(os.path.join(cls.__test_dir__, file))
+        os.rmdir(cls.__test_dir__)
+
+    def test_exporter(self):
+        if not os.path.isdir(self.__test_dir__):
+            os.makedirs(self.__test_dir__)
+        exporter = Exporter(self.__test_dir__)
+        some_data = pd.DataFrame(
+            {
+                '#CHROM': [1, 2, 3],
+                'POS': [100, 200, 300],
+                'REF': ['A', 'A', 'A'],
+                'ALT': ['C', 'C', 'C'],
+                'gene': ['very', 'hot', 'fuzz'],
+                'binarized_label': [1, 2, 3],
+                'sample_weight': [0.8, 0.9, 1.0]
+            }
+        )
+        exporter.export_train_test_dataset(some_data)
+        self.assertIn('train_test.vcf.gz', os.listdir(self.__test_dir__))
+        exporter.export_validation_dataset(some_data)
+        self.assertIn('validation.vcf.gz', os.listdir(self.__test_dir__))
+
+
+if __name__ == '__main__':
+    unittest.main()
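As a reading aid, a minimal stand-in for the `Exporter` interface this test exercises, assuming it simply writes the frame as a gzipped TSV under a fixed name (the actual implementation may add VCF-style headers):

```python
import os

import pandas as pd


class MinimalExporter:
    # Hypothetical stand-in for train_data_creator's Exporter: writes the frame
    # as a gzipped, tab-separated file with a fixed name in the output directory.
    def __init__(self, output_directory: str):
        self.output_directory = output_directory

    def export_train_test_dataset(self, df: pd.DataFrame) -> None:
        self._export(df, 'train_test.vcf.gz')

    def export_validation_dataset(self, df: pd.DataFrame) -> None:
        self._export(df, 'validation.vcf.gz')

    def _export(self, df: pd.DataFrame, filename: str) -> None:
        df.to_csv(os.path.join(self.output_directory, filename),
                  sep='\t', index=False, compression='gzip')
```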
diff --git a/train_data_creator/test/test_sample_weighter.py b/train_data_creator/test/test_sample_weighter.py
new file mode 100644
index 0000000..c827c35
--- /dev/null
+++ b/train_data_creator/test/test_sample_weighter.py
@@ -0,0 +1,30 @@
+import unittest
+
+import pandas as pd
+
+from train_data_creator.src.main.sample_weighter import SampleWeighter
+
+
+class TestSampleWeighter(unittest.TestCase):
+    def test_sample_weighter(self):
+        dataset = pd.DataFrame(
+            {
+                'review': [3, 4, 1, 2]
+            }
+        )
+        expected = pd.concat(
+            [
+                dataset,
+                pd.DataFrame(
+                    {
+                        'sample_weight': [1.0, 1.0, 0.8, 0.9]
+                    }
+                )
+            ], axis=1
+        )
+        observed = SampleWeighter().apply_sample_weight(dataset)
+        pd.testing.assert_frame_equal(observed, expected)
+
+
+if __name__ == '__main__':
+    unittest.main()
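The weights asserted above imply a direct mapping from review status to sample weight; a sketch under that assumption (the mapping is inferred from the test, not from the `SampleWeighter` source):

```python
import pandas as pd

# Assumed mapping, inferred from the expected test values: higher review
# status -> higher weight, capped at 1.0 for review status 3 and up.
REVIEW_TO_WEIGHT = {1: 0.8, 2: 0.9, 3: 1.0, 4: 1.0}


def apply_sample_weight(df: pd.DataFrame) -> pd.DataFrame:
    # Returns a copy with the 'sample_weight' column appended.
    df = df.copy()
    df['sample_weight'] = df['review'].map(REVIEW_TO_WEIGHT)
    return df
```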
diff --git a/train_data_creator/test/test_split_datasets.py b/train_data_creator/test/test_split_datasets.py
new file mode 100644
index 0000000..7e85c9d
--- /dev/null
+++ b/train_data_creator/test/test_split_datasets.py
@@ -0,0 +1,38 @@
+import unittest
+
+import pandas as pd
+
+from train_data_creator.src.main.split_datasets import SplitDatasets
+
+
+class TestSplitDatasets(unittest.TestCase):
+    def test_split(self):
+        dataset = pd.DataFrame(
+            {
+                '#CHROM': ['1', '2', '4', '5', '6', '7', '8', '10'],
+                'POS': [100, 200, 400, 500, 600, 700, 800, 1000],
+                'REF': ['A', 'A', 'C', 'A', 'A', 'G', 'C', 'CG'],
+                'ALT': ['T', 'T', 'G', 'T', 'T', 'C', 'G', 'AT'],
+                'source': [
+                    'VKGL', 'VKGL', 'VKGL', 'VKGL', 'ClinVar', 'ClinVar', 'ClinVar', 'VKGL'
+                ],
+                'review': [2, 2, 2, 2, 2, 3, 2, 2],
+                'sample_weight': [0.9, 0.9, 0.9, 0.9, 0.9, 1.0, 0.9, 0.9],
+                'binarized_label': [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
+            }
+        )
+        splitter = SplitDatasets()
+        observed_train, observed_validation = splitter.split(dataset)
+        self.assertEqual(
+            observed_validation[observed_validation['binarized_label'] == 1.0].shape[0],
+            observed_validation[observed_validation['binarized_label'] == 0.0].shape[0]
+        )
+        self.assertGreater(
+            observed_validation['review'].min(), 1
+        )
+        for row in observed_validation.iterrows():
+            self.assertNotIn(row[1].values.tolist(), observed_train.values.tolist())
+
+
+if __name__ == '__main__':
+    unittest.main()
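The assertions above describe the split's invariants rather than its exact implementation: the validation set is label-balanced, contains only high-confidence entries, and is disjoint from train-test. A hypothetical sketch satisfying those invariants:

```python
import pandas as pd


def split_validation(df: pd.DataFrame, seed: int = 5):
    # Hypothetical sketch of the properties the test asserts: validation holds
    # an equal number of benign and pathogenic variants, drawn only from
    # high-confidence (review > 1) entries; everything else stays in train-test.
    high_conf = df[df['review'] > 1]
    benign = high_conf[high_conf['binarized_label'] == 0.0]
    pathogenic = high_conf[high_conf['binarized_label'] == 1.0]
    n = min(benign.shape[0], pathogenic.shape[0])
    validation = pd.concat([
        benign.sample(n, random_state=seed),
        pathogenic.sample(n, random_state=seed)
    ])
    train_test = df.drop(index=validation.index)
    return train_test, validation
```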
diff --git a/train_data_creator/test/test_utilities.py b/train_data_creator/test/test_utilities.py
new file mode 100644
index 0000000..232bbfc
--- /dev/null
+++ b/train_data_creator/test/test_utilities.py
@@ -0,0 +1,80 @@
+import unittest
+
+import pandas as pd
+
+from train_data_creator.src.main.utilities import correct_order_vcf_notation, equalize_class, \
+    apply_binarized_label
+
+
+class TestUtilities(unittest.TestCase):
+    @classmethod
+    def setUpClass(cls) -> None:
+        cls.dataset = pd.DataFrame(
+            {
+                '#CHROM': ['1', '2', '3', '4', '6', '7', '5', '8', '9'],
+                'POS': [100, 200, 300, 600, 700, 400, 500, 800, 900],
+                'REF': ['A', 'A', 'G', 'C', 'A', 'A', 'G', 'C', 'T'],
+                'ALT': ['T', 'T', 'C', 'G', 'T', 'T', 'C', 'G', 'A'],
+                'source': [
+                    'VKGL', 'VKGL', 'ClinVar', 'VKGL', 'VKGL', 'ClinVar', 'ClinVar', 'ClinVar',
+                    'ClinVar'
+                ],
+                'review': [2, 2, 1, 2, 2, 2, 3, 2, 1],
+                'binarized_label': [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
+            }
+        )
+        cls.class_dataset = pd.DataFrame(
+            {
+                'class': ['(Likely) Benign', 'LB', 'foo', 'Pathogenic']
+             }
+        )
+        cls.equalize_dict = {'(Likely) Benign': 'LB', 'Pathogenic': 'P'}
+
+    def test_order(self):
+        observed = correct_order_vcf_notation(self.dataset)
+        expected = pd.DataFrame(
+            {
+                '#CHROM': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
+                'POS': [100, 200, 300, 600, 500, 700, 400, 800, 900],
+                'REF': ['A', 'A', 'G', 'C', 'G', 'A', 'A', 'C', 'T'],
+                'ALT': ['T', 'T', 'C', 'G', 'C', 'T', 'T', 'G', 'A'],
+                'source': [
+                    'VKGL', 'VKGL', 'ClinVar', 'VKGL', 'ClinVar', 'VKGL', 'ClinVar', 'ClinVar',
+                    'ClinVar'
+                ],
+                'review': [2, 2, 1, 2, 3, 2, 2, 2, 1],
+                'binarized_label': [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
+            }
+        )
+        pd.testing.assert_frame_equal(observed, expected)
+
+    def test_equalize(self):
+        observed = equalize_class(
+            self.class_dataset, equalize_dict=self.equalize_dict
+        )
+        expected = pd.DataFrame(
+            {
+                'class': ['LB', 'P']
+            }
+        )
+        pd.testing.assert_frame_equal(
+            observed.reset_index(drop=True), expected.reset_index(drop=True)
+        )
+
+    def test_apply_binarized_label(self):
+        copy_class_dataset = self.class_dataset.copy(deep=True)
+        equalized = equalize_class(copy_class_dataset, self.equalize_dict)
+        observed = apply_binarized_label(equalized)
+        expected = pd.DataFrame(
+            {
+                'class': ['LB', 'P'],
+                'binarized_label': [0.0, 1.0]
+            }
+        )
+        pd.testing.assert_frame_equal(
+            observed.reset_index(drop=True), expected.reset_index(drop=True)
+        )
+
+
+if __name__ == '__main__':
+    unittest.main()
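`test_order` expects variants sorted into canonical VCF contig order. A sketch of that sorting logic, assuming the contig order 1..22, X, Y, MT (the real `correct_order_vcf_notation` may rank contigs differently):

```python
import pandas as pd

# Assumed contig ranking for sorting; str() handles both the int and str items.
CHROM_RANK = {str(c): i for i, c in enumerate([*range(1, 23), 'X', 'Y', 'MT'])}


def order_vcf_notation(df: pd.DataFrame) -> pd.DataFrame:
    # Sort variants by chromosome rank, then position, and reset the index
    # so downstream exports see a clean 0..n-1 index.
    df = df.copy()
    df['__rank__'] = df['#CHROM'].map(CHROM_RANK)
    df.sort_values(by=['__rank__', 'POS'], inplace=True)
    return df.drop(columns='__rank__').reset_index(drop=True)
```

Note that a plain lexicographic sort on `#CHROM` would be wrong here ('10' would sort before '2'), which is why the explicit rank map is needed.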
diff --git a/train_data_creator/test/validators/__init__.py b/train_data_creator/test/validators/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/train_data_creator/test/validators/test_dataset_validator.py b/train_data_creator/test/validators/test_dataset_validator.py
new file mode 100644
index 0000000..e227c08
--- /dev/null
+++ b/train_data_creator/test/validators/test_dataset_validator.py
@@ -0,0 +1,59 @@
+import os
+import unittest
+
+import pandas as pd
+
+from train_data_creator.test import get_project_root_dir
+from train_data_creator.src.main.validators.dataset_validator import DatasetValidator
+
+
+class TestDatasetValidator(unittest.TestCase):
+    @classmethod
+    def setUpClass(cls) -> None:
+        cls.vkgl = pd.read_csv(
+            os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_vkgl.tsv.gz'),
+            sep='\t'
+        )
+        cls.clinvar = pd.read_csv(
+            os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_clinvar.vcf.gz'),
+            sep='\t',
+            skiprows=27
+        )
+        cls.validator = DatasetValidator()
+
+    def test_validate_vkgl_corr(self):
+        self.validator.validate_vkgl(self.vkgl)
+
+    def test_validate_clinvar_corr(self):
+        self.validator.validate_clinvar(self.clinvar)
+
+    def test_validate_vkgl_incorr(self):
+        # Mimicking what happens when the public consensus is used.
+        self.assertRaises(
+            KeyError,
+            self.validator.validate_vkgl,
+            self.vkgl.rename(columns={'consensus_classification': 'classification'})
+        )
+
+    def test_validate_clinvar_incorr(self):
+        # Mimicking what happens when the number of comment lines is incorrect.
+        self.assertRaises(
+            KeyError,
+            self.validator.validate_clinvar,
+            self.clinvar.rename(columns={'#CHROM': 'chr', 'INFO': 'something_not_info'})
+        )
+
+    def test_raise_no_variants(self):
+        # A file without any variants should raise an EOFError
+        vkgl_novars = pd.read_csv(
+            os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_vkgl_novars.tsv.gz')
+        )
+        self.assertRaises(
+            EOFError,
+            self.validator.validate_vkgl,
+            vkgl_novars
+        )
+
+
+if __name__ == '__main__':
+    unittest.main()
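The hard-coded `skiprows=27` above breaks silently if the ClinVar header grows or shrinks. A hypothetical helper that derives the value by counting the leading `##` meta lines instead:

```python
import gzip


def count_vcf_comment_lines(path: str) -> int:
    # Counts the leading '##' meta lines of a gzipped VCF, so the value
    # passed to pandas' skiprows does not have to be hard-coded.
    count = 0
    with gzip.open(path, 'rt') as vcf:
        for line in vcf:
            if line.startswith('##'):
                count += 1
            else:
                break
    return count
```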
diff --git a/train_data_creator/test/validators/test_input_validator.py b/train_data_creator/test/validators/test_input_validator.py
new file mode 100644
index 0000000..1ac5692
--- /dev/null
+++ b/train_data_creator/test/validators/test_input_validator.py
@@ -0,0 +1,93 @@
+import os
+import unittest
+from stat import S_IREAD
+
+from train_data_creator.test import get_project_root_dir
+from train_data_creator.src.main.validators.input_validator import InputValidator
+
+
+class TestInputValidator(unittest.TestCase):
+    __DIRECTORY__ = 'some_very_special_directory'
+    __READONLY_DIRECTORY__ = 'a_readonly_directory'
+
+    @classmethod
+    def setUpClass(cls) -> None:
+        cls.validator = InputValidator()
+
+    @classmethod
+    def tearDownClass(cls) -> None:
+        if cls.__DIRECTORY__ in os.listdir(get_project_root_dir()):
+            os.removedirs(os.path.join(get_project_root_dir(), cls.__DIRECTORY__))
+        if cls.__READONLY_DIRECTORY__ in os.listdir(get_project_root_dir()):
+            os.removedirs(os.path.join(get_project_root_dir(), cls.__READONLY_DIRECTORY__))
+
+    def test_vkgl_corr(self):
+        input_path = os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_vkgl.tsv.gz')
+        self.validator.validate_vkgl(input_path)
+
+    def test_clinvar_corr(self):
+        input_path = os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_clinvar.vcf.gz')
+        self.validator.validate_clinvar(input_path)
+
+    def test_vkgl_incorr(self):
+        input_path = os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_clinvar.vcf.gz')
+        self.assertRaises(
+            IOError,
+            self.validator.validate_vkgl,
+            input_path
+        )
+
+    def test_clinvar_incorr(self):
+        input_path = os.path.join(get_project_root_dir(), 'test', 'resources', 'smol_vkgl.tsv.gz')
+        self.assertRaises(
+            IOError,
+            self.validator.validate_clinvar,
+            input_path
+        )
+
+    def test__validate_file_exist(self):
+        input_path = os.path.join(get_project_root_dir(), 'test', 'resources', 'not_a_file.tsv.gz')
+        self.assertRaises(
+            FileNotFoundError,
+            self.validator._validate_file_exist,
+            input_path,
+            'VKGL'
+        )
+
+    def test_output_not_directory(self):
+        output = os.path.join(get_project_root_dir(), 'not_a_directory', 'not_a_directory')
+        self.assertRaises(
+            OSError,
+            self.validator.validate_output,
+            output
+        )
+
+    def test_output_not_writable(self):
+        output = os.path.join(get_project_root_dir(), self.__READONLY_DIRECTORY__)
+        os.mkdir(os.path.join(output))
+        os.chmod(output, S_IREAD)
+        # Testing if an existing not writable directory raises OSError
+        self.assertRaises(
+            OSError,
+            self.validator.validate_output,
+            output
+        )
+        # Testing if making a new directory in a not writable directory raises OSError
+        self.assertRaises(
+            OSError,
+            self.validator.validate_output,
+            os.path.join(output, 'not_a_directory')
+        )
+
+    def test_output_back_to_parent(self):
+        output = os.path.join(get_project_root_dir(), self.__DIRECTORY__)
+        self.validator.validate_output(output)
+        directories = os.listdir(get_project_root_dir())
+        full_directories = []
+        for directory in directories:
+            full_directories.append(str(os.path.join(get_project_root_dir(), directory)))
+        self.assertIn(output, full_directories)
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/utility_scripts/compare_old_model.py b/utility_scripts/compare_old_model.py
new file mode 100755
index 0000000..401b332
--- /dev/null
+++ b/utility_scripts/compare_old_model.py
@@ -0,0 +1,346 @@
+#!/usr/bin/env python3
+
+import os
+import argparse
+import warnings
+
+import math
+import pickle
+import pandas as pd
+from matplotlib import pyplot as plt
+from sklearn.metrics import roc_auc_score
+
+
+# Defining errors
+class IncorrectFileError(Exception):
+    pass
+
+
+class DataError(Exception):
+    pass
+
+
+class CommandLineDigest:
+    def __init__(self):
+        parser = self._create_argument_parser()
+        self.arguments = parser.parse_args()
+
+    def _create_argument_parser(self):
+        parser = argparse.ArgumentParser(
+            prog='Compare old model',
+            description='Compares the performance of an old CAPICE model to a new CAPICE '
+                        'model. PLEASE NOTE: This script does NOT validate your input!'
+        )
+        required = parser.add_argument_group('Required arguments')
+
+        required.add_argument(
+            '--old_model_results',
+            type=str,
+            required=True,
+            help='Input location of the results file generated with the old CAPICE model '
+                 '(XGBoost >= 0.90). Has to be a (gzipped) tsv!'
+        )
+
+        required.add_argument(
+            '--new_model_results',
+            type=str,
+            required=True,
+            help='Input location of the results file generated with the new CAPICE model '
+                 '(XGBoost == 1.4.2). Has to be a gzipped tsv!'
+        )
+
+        required.add_argument(
+            '--vep_processed_capice_input',
+            type=str,
+            required=True,
+            help='Input location of the new CAPICE (XGBoost == 1.4.2) input TSV, processed using '
+                 'the -t flag in the conversion script. Has to be a gzipped tsv!'
+        )
+
+        required.add_argument(
+            '--new_model',
+            type=str,
+            required=True,
+            help='Input location of the XGBoost == 1.4.2 model. Has to be a .pickle.dat!'
+        )
+
+        required.add_argument(
+            '--output',
+            type=str,
+            required=True,
+            help='Output directory.'
+        )
+
+        return parser
+
+    def get_argument(self, argument_key):
+        value = None
+        if argument_key in self.arguments:
+            value = getattr(self.arguments, argument_key)
+        return value
+
+
+class Validator:
+
+    def validate_old_input_argument(self, old_input_path):
+        self._validate_argument(old_input_path, ('.tsv', '.tsv.gz'))
+
+    def validate_ol_input_argument(self, ol_input_path):
+        self._validate_argument(ol_input_path, '.tsv.gz')
+
+    def validate_new_input_argument(self, new_input_path):
+        self._validate_argument(new_input_path, '.tsv.gz')
+
+    def validate_model_input_argument(self, model_input_path):
+        self._validate_argument(model_input_path, '.pickle.dat')
+
+    @staticmethod
+    def _validate_argument(argument, extension):
+        if not argument.endswith(extension):
+            raise IncorrectFileError(
+                f'Argument {argument} was supplied an incorrect extension. '
+                f'Required extensions: {extension}'
+            )
+
+    @staticmethod
+    def validate_output_argument(output):
+        if not os.path.isdir(output):
+            warnings.warn(f'Output path {output} does not exist! Creating...')
+            try:
+                os.makedirs(output)
+            except Exception as e:
+                print(f'Error encountered creating output path: {e}')
+                exit(1)
+
+    def validate_old_dataset(self, data):
+        required_columns = ['chr_pos_ref_alt', 'GeneName', 'probabilities']
+        self._validate_dataframe(data, required_columns, 'Old CAPICE generated dataset')
+
+    def validate_ol_dataset(self, data):
+        required_columns = ['%ID']
+        self._validate_dataframe(data, required_columns, 'Original labels dataset')
+
+    def validate_new_dataset(self, data):
+        required_columns = ['score', 'chr', 'pos', 'ref', 'alt', 'gene_name']
+        self._validate_dataframe(data, required_columns, 'New CAPICE generated dataset')
+
+    @staticmethod
+    def _validate_dataframe(data, required_columns, dataset):
+        for column in required_columns:
+            if column not in data.columns:
+                raise DataError(f'Required column {column} not found in: {dataset}!')
+
+
+def main():
+    print('Obtaining command line arguments')
+    cla = CommandLineDigest()
+    cadd_location = cla.get_argument('old_model_results')
+    ol_location = cla.get_argument('vep_processed_capice_input')
+    vep_location = cla.get_argument('new_model_results')
+    model_location = cla.get_argument('new_model')
+    output = cla.get_argument('output')
+    print('Validating command line arguments')
+    validator = Validator()
+    validator.validate_old_input_argument(cadd_location)
+    validator.validate_ol_input_argument(ol_location)
+    validator.validate_new_input_argument(vep_location)
+    validator.validate_model_input_argument(model_location)
+    validator.validate_output_argument(output)
+    print('Arguments passed\n')
+
+    # Reading in data
+    print('Reading data')
+    cadd = pd.read_csv(cadd_location, sep='\t')
+    original_labels = pd.read_csv(ol_location, sep='\t')
+    vep = pd.read_csv(vep_location, sep='\t')
+    print('Validating data')
+    validator.validate_old_dataset(cadd)
+    validator.validate_ol_dataset(original_labels)
+    validator.validate_new_dataset(vep)
+    print('Data passed, processing model data')
+
+    with open(model_location, 'rb') as model_file:
+        model = pickle.load(model_file)
+    importances = model.get_booster().get_score(importance_type='gain')
+    importances_dataframe = pd.DataFrame(data=list(importances.values()),
+                                         index=list(importances.keys()),
+                                         columns=['score']).sort_values(by='score', ascending=False)
+    importances_dataframe['feature'] = importances_dataframe.index
+    importances_dataframe.reset_index(drop=True, inplace=True)
+    print('All data loaded:')
+    print(f'Number of variants within old CAPICE generated dataset: {cadd.shape[0]}. '
+          f'Columns: {cadd.shape[1]}')
+    print(f'Number of variants within the original labels dataset: {original_labels.shape[0]}. '
+          f'Columns: {original_labels.shape[1]}')
+    print(f'Number of variants within new CAPICE generated dataset '
+          f'(should match the original labels dataset!): {vep.shape[0]}. '
+          f'Columns: {vep.shape[1]}\n')
+
+    print('Starting making merge columns.')
+    # Making merge column
+    cadd['ID'] = cadd[['chr_pos_ref_alt', 'GeneName']].astype(str).agg('_'.join, axis=1)
+    vep['ID'] = vep[['chr', 'pos', 'ref', 'alt', 'gene_name']].astype(str).agg('_'.join, axis=1)
+
+    # Getting the true original labels
+    original_labels['binarized_label'] = original_labels['%ID'].str.split('_', expand=True)[
+        5].astype(float)
+    original_labels['ID'] = original_labels['%ID'].str.split('_', expand=True).loc[:, :4].astype(
+        str).agg('_'.join, axis=1)
+    print('Merge columns made, starting cleaning.')
+
+    # Preparing all datasets
+    vep = vep[['ID', 'score']].copy()
+    vep.rename(columns={'score': 'score_new'}, inplace=True)
+    cadd.rename(columns={'probabilities': 'score_old'}, inplace=True)
+    original_labels = original_labels[['ID', 'binarized_label']].copy()
+    print('Cleaning done.\n')
+
+    print('Starting merge.')
+    # Merging
+    merge = cadd.merge(original_labels, on='ID', how='left')
+    merge.drop(index=merge[merge['binarized_label'].isnull()].index, inplace=True)
+    merge = merge.merge(vep, on='ID', how='left')
+    merge.drop(index=merge[merge['score_new'].isnull()].index, inplace=True)
+    print(f'Merge done. Merge has {merge.shape[0]} variants.\n')
+
+    print('Preparing plots.')
+    # Preparing plots
+    consequences = merge['Consequence'].unique()
+    ncols = 4
+    nrows = math.ceil((consequences.size / ncols) + 1)
+    index = 1
+    fig_auc = plt.figure(figsize=(20, 40))
+    fig_auc.suptitle(
+        f'Old model vs new model AUC comparison.\n'
+        f'Old: {cadd_location}\n'
+        f'New: {vep_location}')
+    fig_vio = plt.figure(figsize=(20, 40))
+    fig_vio.suptitle(
+        f'Old model vs new model score distributions.\n'
+        f'Old: {cadd_location}\n'
+        f'New: {vep_location}')
+    fig_box = plt.figure(figsize=(20, 40))
+    fig_box.suptitle(
+        f'Old model vs new model score closeness (difference to true label)\n'
+        f'Old: {cadd_location}\n'
+        f'New: {vep_location}')
+    print('Preparation done.\n')
+
+    print('Calculating global AUC.')
+    # Calculating global AUC
+    auc_old = round(roc_auc_score(merge['binarized_label'], merge['score_old']), 4)
+    auc_new = round(roc_auc_score(merge['binarized_label'], merge['score_new']), 4)
+    print(f'Global AUC calculated, old: {auc_old} and new: {auc_new}')
+
+    print('Plotting global AUC.')
+    # Plotting global AUC
+    ax_auc = fig_auc.add_subplot(nrows, ncols, index)
+    ax_auc.bar(1, auc_old, color='red', label=f'Old: {auc_old}')
+    ax_auc.bar(2, auc_new, color='blue', label=f'New: {auc_new}')
+    ax_auc.set_title(f'Global (n={merge.shape[0]})')
+    ax_auc.set_xticks([1, 2], ['Old', 'New'])
+    ax_auc.set_xlim(0.0, 3.0)
+    ax_auc.set_ylim(0.0, 1.0)
+    ax_auc.legend(loc='lower right')
+
+    # Plotting Violinplot for global
+    ax_vio = fig_vio.add_subplot(nrows, ncols, index)
+    ax_vio.violinplot(merge[['score_old', 'binarized_label', 'score_new']])
+    ax_vio.set_title(f'Global (n={merge.shape[0]})')
+    ax_vio.set_xticks([1, 2, 3], ['Old', 'True', 'New'])
+    ax_vio.set_xlim(0.0, 4.0)
+    ax_vio.set_ylim(0.0, 1.0)
+
+    # Plotting the score differences to the true label
+    merge['score_diff_old'] = abs(merge['binarized_label'] - merge['score_old'])
+    merge['score_diff_new'] = abs(merge['binarized_label'] - merge['score_new'])
+    ax_box = fig_box.add_subplot(nrows, ncols, index)
+    ax_box.boxplot(
+        [
+            merge[merge['binarized_label'] == 0]['score_diff_old'],
+            merge[merge['binarized_label'] == 0]['score_diff_new'],
+            merge[merge['binarized_label'] == 1]['score_diff_old'],
+            merge[merge['binarized_label'] == 1]['score_diff_new']
+        ], labels=['B_old', 'B_new', 'P_old', 'P_new']
+    )
+    ax_box.set_title(f'Global (n={merge.shape[0]})')
+    print('Plotting global AUC done.')
+
+    print('Plotting for each consequence:')
+    # Global plots have been made, now for each consequence
+    index = 2
+    for consequence in consequences:
+        print(f'Currently processing: {consequence}')
+        # Subsetting
+        subset = merge[merge['Consequence'] == consequence]
+
+        # Calculating
+        # Try/except because roc_auc_score raises a ValueError when a consequence
+        # has only one class label present
+        try:
+            auc_old = round(roc_auc_score(subset['binarized_label'], subset['score_old']), 4)
+        except ValueError:
+            print(f'For consequence {consequence}, AUC old could not be calculated.')
+            continue
+        try:
+            auc_new = round(roc_auc_score(subset['binarized_label'], subset['score_new']), 4)
+        except ValueError:
+            print(f'For consequence {consequence}, AUC new could not be calculated.')
+            continue
+
+        # Plotting auc
+        ax_auc = fig_auc.add_subplot(nrows, ncols, index)
+        ax_auc.bar(1, auc_old, color='red', label=f'Old: {auc_old}')
+        ax_auc.bar(2, auc_new, color='blue', label=f'New: {auc_new}')
+        ax_auc.set_title(f'{consequence} (n={subset.shape[0]})')
+        ax_auc.set_xticks([1, 2], ['Old', 'New'])
+        ax_auc.set_xlim(0.0, 3.0)
+        ax_auc.set_ylim(0.0, 1.0)
+        ax_auc.legend(loc='lower right')
+
+        # Plotting Violinplot for global
+        ax_vio = fig_vio.add_subplot(nrows, ncols, index)
+        ax_vio.violinplot(subset[['score_old', 'binarized_label', 'score_new']])
+        ax_vio.set_title(f'{consequence} (n={subset.shape[0]})')
+        ax_vio.set_xticks([1, 2, 3], ['Old', 'True', 'New'])
+        ax_vio.set_xlim(0.0, 4.0)
+        ax_vio.set_ylim(0.0, 1.0)
+
+        # Plotting boxplots
+        ax_box = fig_box.add_subplot(nrows, ncols, index)
+        ax_box.boxplot(
+            [
+                subset[subset['binarized_label'] == 0]['score_diff_old'],
+                subset[subset['binarized_label'] == 0]['score_diff_new'],
+                subset[subset['binarized_label'] == 1]['score_diff_old'],
+                subset[subset['binarized_label'] == 1]['score_diff_new']
+            ], labels=['B_old', 'B_new', 'P_old', 'P_new']
+        )
+        ax_box.set_title(f'{consequence} (n={subset.shape[0]})')
+
+        index += 1
+
+    print('Plotting for consequences done.\n')
+
+    print('Plotting feature importances.')
+    fig_table, ax_table = plt.subplots(figsize=(10, 20))
+    fig_table.suptitle(f'Feature importances model: {model_location}')
+    fig_table.patch.set_visible(False)
+    ax_table.axis('off')
+    ax_table.table(cellText=importances_dataframe.values, colLabels=importances_dataframe.columns,
+                   loc='center')
+    print('Plotting for feature importances done.\n')
+
+    print(f'Exporting plots to: {output}')
+    fig_auc.savefig(os.path.join(output, 'aucs.png'))
+    fig_vio.savefig(os.path.join(output, 'violins.png'))
+    fig_box.savefig(os.path.join(output, 'box.png'))
+    fig_table.savefig(os.path.join(output, 'feature_importances.png'))
+    print('Done.')
+
+
+if __name__ == '__main__':
+    main()
+
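The merge logic in this script hinges on both result sets producing an identical `ID` key. Isolated below with made-up toy values (column names match the script; the variant and scores are illustrative only):

```python
import pandas as pd

# Toy frames mimicking the two result sets; the values are fabricated.
old = pd.DataFrame({'chr_pos_ref_alt': ['1_100_A_T'], 'GeneName': ['BRCA1'],
                    'score_old': [0.91]})
new = pd.DataFrame({'chr': [1], 'pos': [100], 'ref': ['A'], 'alt': ['T'],
                    'gene_name': ['BRCA1'], 'score_new': [0.87]})

# Same key construction as in the script: join all parts with underscores.
old['ID'] = old[['chr_pos_ref_alt', 'GeneName']].astype(str).agg('_'.join, axis=1)
new['ID'] = new[['chr', 'pos', 'ref', 'alt', 'gene_name']].astype(str).agg('_'.join, axis=1)

merge = old.merge(new[['ID', 'score_new']], on='ID', how='left')
```

Any formatting mismatch between the two key constructions (e.g. 'chr1' vs '1') yields NaN scores, which the script then silently drops — worth checking when the merged variant count looks low.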
diff --git a/utility_scripts/liftover_variants.sh b/utility_scripts/liftover_variants.sh
new file mode 100755
index 0000000..5d82f4b
--- /dev/null
+++ b/utility_scripts/liftover_variants.sh
@@ -0,0 +1,118 @@
+#!/bin/bash
+#SBATCH --job-name=liftover_variants
+#SBATCH --time=05:00:00
+#SBATCH --cpus-per-task=4
+#SBATCH --mem=20gb
+#SBATCH --nodes=1
+
+set -e
+
+errcho() { echo "$@" 1>&2; }
+
+readonly USAGE="Easy bash script to convert a GRCh37 VCF to GRCh38 VCF
+Usage:
+liftover_variants.sh -i <arg> -o <arg>
+-i  required: the GRCh37 input VCF
+-o  required: the GRCh38 output VCF path+filename (except extension!)
+
+Example:
+bash liftover_variants.sh -i /path/to/GRCh37.vcf -o /path/to/GRCh38
+
+Requirements:
+Picard
+EasyBuild
+"
+
+main() {
+  digestCommandLine "$@"
+  runLiftover
+}
+
+digestCommandLine() {
+  while getopts i:o:h flag
+  do
+    case "${flag}" in
+      i) input=${OPTARG};;
+      o) output=${OPTARG};;
+      h)
+        echo "${USAGE}"
+        exit;;
+      \?)
+        errcho "Error: invalid option"
+        echo "${USAGE}"
+        exit 1;;
+    esac
+  done
+
+  validateCommandLine
+
+  echo "CLA passed"
+}
+
+validateCommandLine() {
+  local valid_command_line=true
+  if [ -z "${input}" ]
+  then
+    valid_command_line=false
+    errcho "input file not set"
+  else
+     if [ ! -f "${input}" ]
+     then
+       valid_command_line=false
+       errcho "input file does not exist"
+     fi
+  fi
+
+  if [ -z "${output}" ]
+  then
+    valid_command_line=false
+    errcho "output not set"
+  fi
+
+  if [ "${valid_command_line}" == false ];
+  then
+    errcho 'exiting.'
+    exit 1
+  fi
+}
+
+
+runLiftover() {
+  # Declare rejected first so it is based on the extension-less output path,
+  # yielding 'name_rejected.vcf' instead of 'name.vcf_rejected.vcf'.
+  local rejected="${output}_rejected.vcf"
+  local output="${output}.vcf"
+  local input="${input}"
+
+  local args=()
+
+  args+=("/apps/software/picard/2.20.5-Java-11-LTS/picard.jar" "LiftoverVcf")
+  args+=("I=${input}")
+  args+=("O=${output}")
+  args+=("CHAIN=/apps/data/GRC/b37ToHg38.over.chain")
+  args+=("REJECT=${rejected}")
+  args+=("R=/apps/data/GRC/GCA_000001405.15/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz")
+
+  echo "Loading Picard"
+
+  module load picard
+
+  echo "Running Picard"
+
+  java -jar "${args[@]}"
+
+  echo "Gzipping outputs"
+
+  gzip "${output}"
+  gzip "${rejected}"
+
+  echo "Removing index file if made"
+  if [ -f "${output}.idx" ]
+  then
+    rm "${output}.idx"
+    echo "Index file removed"
+  fi
+
+  echo "Done"
+}
+
+main "$@"
+
diff --git a/utility_scripts/vep_to_train.py b/utility_scripts/vep_to_train.py
new file mode 100755
index 0000000..ec5f833
--- /dev/null
+++ b/utility_scripts/vep_to_train.py
@@ -0,0 +1,174 @@
+#!/usr/bin/env python3
+
+import os
+import argparse
+import warnings
+
+import numpy as np
+import pandas as pd
+
+
+# Defining errors
+class IncorrectFileError(Exception):
+    pass
+
+
+class DataError(Exception):
+    pass
+
+
+SAMPLE_WEIGHTS = [0.8, 0.9, 1.0]
+
+
+class CommandLineDigest:
+    def __init__(self):
+        parser = self._create_argument_parser()
+        self.arguments = parser.parse_args()
+
+    def _create_argument_parser(self):
+        parser = argparse.ArgumentParser(
+            prog='Make VEP TSV train ready',
+            description='Converts a VEP output TSV (after conversion) to make it train-ready.'
+        )
+        required = parser.add_argument_group('Required arguments')
+        optional = parser.add_argument_group('Optional arguments')
+
+        required.add_argument(
+            '-i',
+            '--input',
+            type=str,
+            required=True,
+            help='Input location of the VEP TSV'
+        )
+        required.add_argument(
+            '-o',
+            '--output',
+            type=str,
+            required=True,
+            help='Output path + filename'
+        )
+        optional.add_argument(
+            '-a',
+            '--assembly',
+            action='store_true',
+            help='Flag to enable GRCh38 mode.'
+        )
+        return parser
+
+    def get_argument(self, argument_key):
+        value = None
+        if argument_key in self.arguments:
+            value = getattr(self.arguments, argument_key)
+        return value
+
+
+class Validator:
+    @staticmethod
+    def validate_input_cla(input_path):
+        if not input_path.endswith(('.tsv.gz', '.tsv')):
+            raise IncorrectFileError('Input file is not a TSV!')
+
+    @staticmethod
+    def validate_output_cla(output_path):
+        if not output_path.endswith('.tsv.gz'):
+            raise IncorrectFileError('Output file has to be defined as .tsv.gz!')
+        # This is intentional.
+        # You have more than enough time to prepare directories when VEP is running.
+        if not os.path.isdir(os.path.dirname(output_path)):
+            raise NotADirectoryError('Output file has to be placed in a directory that already '
+                                     'exists!')
+
+    @staticmethod
+    def validate_input_dataset(input_data):
+        columns_must_be_present = ['%SYMBOL', '%CHROM', '%ID']
+        for column in columns_must_be_present:
+            if column not in input_data.columns:
+                raise DataError(f'Missing required column: {column}')
+
+
+def main():
+    print('Parsing CLI')
+    cli = CommandLineDigest()
+    data_path = cli.get_argument('input')
+    output = cli.get_argument('output')
+    grch38 = cli.get_argument('assembly')
+    print('')
+
+    print('Validating input arguments.')
+    validator = Validator()
+    validator.validate_input_cla(data_path)
+    validator.validate_output_cla(output)
+
+    print('Reading in dataset')
+    data = pd.read_csv(data_path, sep='\t', na_values='.')
+    print('Validating dataset')
+    validator.validate_input_dataset(data)
+    print('Read in dataset.')
+    print(f'Shape: {data.shape}')
+    print(f'Head:\n{data.head()}\n')
+
+    print('Dropping entries without gene.')
+    before_drop = data.shape[0]
+    data.drop(index=data[data['%SYMBOL'].isnull()].index, inplace=True)
+    after_drop = data.shape[0]
+    print(f'Dropped {before_drop-after_drop} variants.\n')
+
+    if grch38:
+        print('Converting chromosome column')
+        data['%CHROM'] = data['%CHROM'].str.split('chr', expand=True)[1]
+        valid_chromosomes = np.append(np.arange(1, 23).astype(str), ['X', 'Y', 'MT'])
+        before_drop = data.shape[0]
+        data.drop(data[~data['%CHROM'].isin(valid_chromosomes)].index, inplace=True)
+        after_drop = data.shape[0]
+        print(f'Dropped {before_drop-after_drop} rows due to unknown chromosome.')
+        print('Conversion done.\n')
+
+    print('Dropping full duplicates')
+    before_drop = data.shape[0]
+    data.drop_duplicates(inplace=True)
+    after_drop = data.shape[0]
+    print(f'Dropped {before_drop-after_drop} duplicates')
+    print('Dropping duplicates done.\n')
+
+    print('Dropping mismatching gene entries.')
+    before_drop = data.shape[0]
+    data.drop(
+        index=data[data['%ID'].str.split('_', expand=True)[4] != data['%SYMBOL']].index,
+        inplace=True
+    )
+    after_drop = data.shape[0]
+    print(f'Dropped {before_drop-after_drop} variants.\n')
+
+    print('Extracting sample weight and binarized_label')
+    # Split %ID once and reuse it for both columns.
+    id_columns = data['%ID'].str.split('_', expand=True)
+    data['binarized_label'] = id_columns[5].astype(float)
+    data['sample_weight'] = id_columns[6].astype(float)
+    print('')
+
+    print('Correcting possible errors within binarized_label or sample_weight')
+    before_drop = data.shape[0]
+    # Drop everything that doesn't have a binarized_label, also drop unused columns
+    data.drop(index=data[data['binarized_label'].isnull()].index, columns=['%ID'], inplace=True)
+    data.drop(index=data[~data['binarized_label'].isin([0.0, 1.0])].index, inplace=True)
+    data.drop(index=data[~data['sample_weight'].isin(SAMPLE_WEIGHTS)].index, inplace=True)
+    after_drop = data.shape[0]
+    print(f'Dropped {before_drop-after_drop} variants due to incorrect binarized_label or '
+          f'sample_weight.\n')
+
+    print('Please check the sample weights:')
+    print(data[['binarized_label', 'sample_weight']].value_counts())
+    print('')
+
+    print(f'Final shape of data: {data.shape}')
+    print(f'Of which benign: {data[data["binarized_label"] == 0].shape[0]}')
+    print(f'Of which pathogenic: {data[data["binarized_label"] == 1].shape[0]}')
+    n_other = data[~data["binarized_label"].isin([0, 1])].shape[0]
+    if n_other > 0:
+        warnings.warn(f'Of which other: {n_other}')
+    print('')
+
+    print(f'Done! Exporting to {output}')
+    data.to_csv(output, index=False, compression='gzip', na_rep='.', sep='\t')
+
+
+if __name__ == '__main__':
+    main()
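The script above relies on the `%ID` column packing several `_`-separated fields: (0-based) field 4 holds the gene symbol, field 5 the binarized label and field 6 the sample weight. A minimal sketch of that parsing, using a hypothetical ID value — only the field indices are taken from the script; the leading fields (chromosome, position, ref, alt) are assumptions about the ID layout:

```python
import pandas as pd

# Hypothetical %ID value; only the positions of the gene symbol (4),
# binarized_label (5) and sample_weight (6) match the script above.
ids = pd.Series(['1_12345_A_G_BRCA1_1.0_0.9'])
parts = ids.str.split('_', expand=True)

gene = parts[4]                   # gene symbol, compared against %SYMBOL
label = parts[5].astype(float)    # binarized_label: 0.0 (benign) or 1.0 (pathogenic)
weight = parts[6].astype(float)   # sample_weight: one of 0.8, 0.9, 1.0
```

IDs whose gene field does not match `%SYMBOL`, or whose label/weight fall outside the allowed values, are dropped by the script.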