mutations-detection

This repo contains the code used for the project "Using DNA Language Models for mutation detection and correction".

Files used for mutation detection:

data_handling_for_NER.py - data loader that parses the FASTA files for the detection task
dnabert_for_token_classification.py - an implementation of a token classification model, written because the DNABERT authors did not provide one
run_detection.py - the training loop for mutation detection
run_detection_test.py - code for running a pre-trained model on the test set
slurm_run_detection.sh - bash script for running the detection on the Slurm cluster
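
For readers unfamiliar with the FASTA format that the data loader parses, here is a minimal sketch of reading FASTA records using only the Python standard library. It is not the code in data_handling_for_NER.py; the function name parse_fasta and the example records are hypothetical and only illustrate the format.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} dict.

    Hypothetical helper for illustration; not taken from the repository.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the record ID,
            # falling back to a positional name if the header is empty.
            parts = line[1:].split()
            current_id = parts[0] if parts else f"record_{len(records)}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may wrap over several lines; concatenate them.
            records[current_id] += line.upper()
    return records


# Made-up example input, not data from the project.
example = """>seq1 hypothetical example record
ACGTAC
GTTA
>seq2
ttgaca
""".splitlines()

sequences = parse_fasta(example)
```

In practice a library such as Biopython could be used instead; the sketch only shows the structure of the format (a header line starting with ">" followed by one or more sequence lines).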

Files used for mutation correction:

MLM_DNA_BERT.ipynb
data_handling_for_MLM.py - data loader that parses the FASTA files for the correction task
run_correction_train.py - the training loop for mutation correction

Files used for data analysis and generation:

analyze_seqs_stats.ipynb - notebook used to analyze how simulated mutations change the amino acid sequences and the tokenized sequences
sequences_and_tokens_analysis.py - computes statistics on the different datasets
split_data.py - splits the data into train, dev, and test sets
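
As an illustration of the train/dev/test split mentioned above, here is a minimal sketch of a reproducible random split. It is not the logic of split_data.py: the function name split_records, the 80/10/10 proportions, and the fixed seed are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple


def split_records(
    records: Sequence[str],
    dev_frac: float = 0.1,   # assumed proportion, not from the repository
    test_frac: float = 0.1,  # assumed proportion, not from the repository
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle records with a fixed seed and cut them into train/dev/test."""
    if dev_frac < 0 or test_frac < 0 or dev_frac + test_frac >= 1:
        raise ValueError("dev_frac and test_frac must be >= 0 and sum to < 1")
    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_dev = int(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# Made-up record IDs for demonstration.
ids = [f"seq{i}" for i in range(100)]
train, dev, test = split_records(ids)
```

Splitting by record ID before any mutations are simulated is one way to keep variants of the same sequence from landing in different splits; whether the project does this is not stated here.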

Misc files:

mutation_correction_env_1.yml - a YAML file containing the requirements for the conda env we used
test_stats.csv, train_stats.csv, val_stats.csv - files with stats on the mutated and non-mutated sequences
