This repo contains the code used for the project "Using DNA Language Models for mutation detection and correction".
Files used for mutation detection:
- data_handling_for_NER.py - data loader that parses the FASTA files etc.
- dnabert_for_token_classification - an implementation of a token classification model, written because the DNABERT authors did not provide one.
- run_detection.py - the training loop for mutation detection
- run_detection_test.py - code for running a pre-trained model on the test set
- slurm_run_detection.sh - bash script for running the detection on a Slurm cluster
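
For readers unfamiliar with the input format, the kind of parsing the data loader performs can be sketched as follows. This is only an illustration under stated assumptions, not the code in data_handling_for_NER.py: the names `read_fasta` and `to_kmers` are hypothetical, and the overlapping k-mer split with k=6 is assumed from how DNABERT tokenizes sequences.

```python
from io import StringIO
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} dict.

    Hypothetical helper for illustration; not the repo's loader.
    """
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the id is the first whitespace-separated word.
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may wrap, so append them to the current record.
            records[current_id] += line.upper()
    return records


def to_kmers(seq: str, k: int = 6) -> List[str]:
    """Split a sequence into overlapping k-mers with stride 1 (assumed k=6)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Toy input; real data would come from the FASTA files used by the project.
example = StringIO(">seq1 toy record\nACGTAC\nGTA\n>seq2\nttgaca\n")
records = read_fasta(example)
kmers = to_kmers(records["seq1"], k=6)
```

Each k-mer is one token, so a token classification model assigns one label per k-mer.
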
Files used for mutation correction:
- MLM_DNA_BERT.ipynb
- data_handling_for_MLM.py - data loader that parses the FASTA files etc.
- run_correction_train.py - the training loop for mutation correction
Files used for data analysis and generation:
- analyze_seqs_stats.ipynb - used to analyze how simulated mutations change the amino acid sequences and the tokenized sequences.
- sequences_and_tokens_analysis.py - creates stats on the different datasets
- split_data.py - splits the data into train, dev, and test.
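
A train/dev/test split of the kind split_data.py produces could look roughly like the sketch below. The function name `split_records`, the 80/10/10 fractions, and the fixed seed are all assumptions made for illustration; the actual script may split differently.

```python
import random
from typing import List, Sequence, Tuple


def split_records(
    ids: Sequence[str],
    train_frac: float = 0.8,  # assumed fraction, not taken from the repo
    dev_frac: float = 0.1,    # assumed fraction, not taken from the repo
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle ids reproducibly and cut them into train/dev/test lists."""
    if not 0 < train_frac < 1 or not 0 <= dev_frac < 1 or train_frac + dev_frac > 1:
        raise ValueError("fractions must be in range and sum to at most 1")
    shuffled = list(ids)
    # A seeded Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains goes to test
    return train, dev, test


ids = [f"seq{i}" for i in range(20)]
train, dev, test = split_records(ids)
```

Splitting on record ids rather than on individual mutated copies keeps every variant of a sequence in the same subset, which avoids leakage between train and test.
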
Misc files:
- mutation_correction_env_1.yml - a yml file containing the requirements for the conda env we used
- test_stats.csv, train_stats.csv, val_stats.csv - files with stats on the mutated and non-mutated sequences