Promoters and enhancers are both gene-regulatory elements and are often involved in activating specific genes. They are crucial to the regulation of gene expression, playing key roles in controlling when, where, and how much of a gene is transcribed into RNA and eventually translated into protein.
Their similarity can make them difficult to distinguish for several reasons. Promoters and enhancers may share similar or identical DNA sequences.
Both contain binding sequences for regulatory proteins, such as transcription factors, which can bind and influence gene activity. These sequences can appear in different combinations and contexts in both promoters and enhancers, making it challenging to distinguish the two elements based solely on DNA sequence. Furthermore, they perform different but related functions: promoters are generally positioned near the gene they regulate and are responsible for initiating transcription, while enhancers can be located at variable distances from the target gene and act by enhancing transcription efficiency.
- Clone the repository:
  ```sh
  git clone https://github.com/StePoli-00/Promoter-and-Enhancer-Classifier.git
  ```
- After cloning, move into the folder:
  ```sh
  cd Promoter-and-Enhancer-Classifier/
  ```
- Create a conda environment with Python 3.11:
  ```sh
  conda create -n <env> python=3.11
  ```
- Install the dependencies:
  ```sh
  pip install -r requirements.txt
  ```
Before using our models, we must first create the dataset. The input data come in two forms:
- Token embeddings: generated by the DNABERT tokenizer
- One-hot encoding: a simple encoding of the DNA bases: A = {1,0,0,0}, T = {0,1,0,0}, C = {0,0,1,0} and G = {0,0,0,1}
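As an illustration, here is a minimal Python sketch of the one-hot scheme above (the repository's own encoding lives in `create_datasetOHE.py`; this helper is purely illustrative):

```python
import numpy as np

# One-hot table matching the scheme described above.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "G": [0, 0, 0, 1],
}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode a DNA sequence as a (length, 4) matrix; unknown bases map to zeros."""
    return np.array([ONE_HOT.get(base, [0, 0, 0, 0]) for base in sequence.upper()])

print(one_hot_encode("ATCG"))  # one row per base
```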
Depending on which model you want to test, you must process the data in one of two distinct ways. This section explains how to prepare both.
First of all, we must extract our data: promoters and enhancers.
- Download the following files:
  - genome file: `GRCh38.p14.genome.fa`
  - BED file: `human_epdnew_xxxxxx.bed`
- Place them wherever you prefer; we suggest using the `Data` folder to store the files used by the model.
- From `Data_Preparation`, run `extract_promoters.py`:
  ```sh
  cd Data_Preparation
  python extract_promoters.py -g <genome_file> -b <bed_file> [-l <promoters_length> -o <output_path>]
  ```
where:
- `-o` is the desired output path

example:
```sh
python extract_promoters.py -g GRCh38.p14.genome.fa -b human_epdnew_VgGtt.bed -l 100 -o Desktop/folder
```
> **Note**
> Omitting the `-l` parameter cuts the promoter sequences into a set of variable lengths: 5, 10, 20, 100, 200, 1000, 2000.
> Except for the Transformer, all the other models take an input sequence of fixed length, so for them the `-l` parameter is mandatory.
### First Enhancers Preprocessing
The first enhancer preprocessing step requires you to:
- Go to https://bio.liclab.net/ENdb/Download.php
- Download the file labeled "All the experimentally confirmed enhancers"

Now run `first_enhancer_preprocessing.py`. This script takes `enhancer_file`, removes the mm10 genes (i.e., mouse genes), and generates a file in hg19 format (necessary for the next preprocessing step).
In the terminal, run:
```sh
python first_enhancer_preprocessing.py -e <enhancer_file.txt> -hg19_list <name_of_file_for_hg19_list>.txt -mm10_list <name_of_mm10_file>.txt
```
> **Note**
> This command generates a file called `name_of_file_for_hg19_list.txt` that you will use in the second step of enhancer preprocessing.
### Second Enhancers Preprocessing
Using https://genome.ucsc.edu/cgi-bin/hgLiftOver, upload the `name_of_file_for_hg19_list.txt` obtained in the previous step; the tool returns a .BED file listing all enhancers converted to hg38 coordinates.
This file contains enhancers in the format "chr_name:<start_position> - <end_position>".
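As an illustration, a small helper for parsing that format might look like this (`parse_region` is a hypothetical helper, not part of the repository):

```python
def parse_region(line: str) -> tuple[str, int, int]:
    """Parse an entry like 'chr1:10500 - 11000' into (name, start, end)."""
    name, coords = line.strip().split(":")
    start, end = (int(x) for x in coords.replace(" ", "").split("-"))
    return name, start, end

print(parse_region("chr1:10500 - 11000"))  # ('chr1', 10500, 11000)
```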
- In the terminal, run:
  ```sh
  pip3 install biopython
  ```

The `enhancer_preprocessing.py` script takes several mandatory command-line arguments:
- `-b`: path to the BED file (hg38)
- `-f`: path to the FASTA file (GRCh38.FASTA, ~3 GB)
- `-e`: name of the output file (.txt) for the final sequences
The script first opens the .BED file and, using a Gene class (name, start, end), builds a list of Gene objects. It also opens the `GRCh38.FASTA` file and, with a FastaElem class (name, sequence), builds a list of FastaElem objects. Then, using the Biopython library, the script takes, for each Gene, the corresponding nucleotide sequence from its FastaElem (using the specified start and end positions) and saves it to the `enhancer.txt` output file.
At the end, the `enhancer.txt` file contains all the enhancers with the lengths specified in the initial hg38 .BED file.
```sh
python enhancer_preprocessing.py -b <path_bed_file> -f <path_fasta_file> -e <output_name_enhancer_file>
```
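A minimal sketch of that extraction logic, assuming simplified Gene records (illustrative only; the repository's actual Gene and FastaElem classes differ):

```python
from Bio import SeqIO

# Simplified stand-ins for the Gene records parsed from the .BED file.
genes = [("ENH_1", "chr1", 10500, 11000)]  # (name, chromosome, start, end)

# Index the genome FASTA lazily by record name (avoids loading ~3 GB into RAM).
genome = SeqIO.index("GRCh38.p14.genome.fa", "fasta")

with open("enhancer.txt", "w") as out:
    for name, chrom, start, end in genes:
        sequence = str(genome[chrom].seq[start:end])  # slice the region
        out.write(f">{name}\n{sequence}\n")
```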
The output contains the promoter and enhancer sequences. From the previous folder, run `create_csv_dataset.py`:
```sh
python create_csv_dataset.py -p <promoter_file> -e <enhancer_file> -o <output_file> -l <promoter_length>
```
example:
```sh
python create_csv_dataset.py -p promoters_100.fa -e enhancers.txt -o dataset.csv -l 100
```
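For reference, the resulting CSV pairs each sequence with a class label. A hedged sketch of that layout (column names and label convention are assumptions, not necessarily those of `create_csv_dataset.py`):

```python
import pandas as pd

promoters = ["ACGTTGCA"]  # sequences read from promoters_100.fa
enhancers = ["TTGACGTA"]  # sequences read from enhancers.txt

# One row per sequence, with a binary class label.
df = pd.DataFrame({
    "sequence": promoters + enhancers,
    "label": [0] * len(promoters) + [1] * len(enhancers),  # assumed: 0 = promoter, 1 = enhancer
})
df.to_csv("dataset.csv", index=False)
```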
> **Warning**
> The following script must be executed on a GPU.

- Make sure you have the einops package installed; otherwise install it:
  ```sh
  pip install einops
  ```
- Feed the CSV file into `create_embedding.py`:
  ```sh
  python create_embedding.py -s <dataset.csv> -d <destination>
  ```
  example: `python create_embedding.py -s data.csv -d Project_Folder/`
> **Note**
> - This script creates an `Embedding` folder in the provided destination path, containing the `embeddings` and `labels` of our data, converted from sequences to embeddings using the DNABERT tokenizer.
> - Before running the script, make sure the destination folder does not already contain a folder named `Embedding`, because the script will stop if it does. This behaviour is implemented to avoid overwriting data from previous executions of the code.
> - If you encounter this type of error: `assert q.is_cuda and k.is_cuda and v.is_cuda`, try uninstalling the triton package.
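For reference, a hedged sketch of tokenizing a sequence with the DNABERT-2 tokenizer through Hugging Face `transformers` (the checkpoint name is an assumption; `create_embedding.py` may configure this differently):

```python
from transformers import AutoTokenizer

# DNABERT-2 ships custom tokenizer code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

encoded = tokenizer("ACGTAGCATCGGATCTATCTATCGACACTTGG", return_tensors="pt")
print(encoded["input_ids"].shape)  # (1, number_of_tokens)
```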
- Run `split_dataset.py` to create the Dataset folder:
  ```sh
  python split_dataset.py -e <embedding_folder_path> -d <dataset_folder_path> [-z <embedding_zip_path>]
  ```
  example: `python split_dataset.py -e Project/Embedding -d Dataset`

> **Note**
> - If you want to use the `-z` option, `<embedding_zip_path>` must have the same name as `<embedding_folder_path>`.
> - `-z` can be used only if there is exactly one embedding zip folder.
After the common steps, run this command to generate the OHE dataset:
```sh
python create_datasetOHE.py -s <source_path> -d <destination_path> -l <sequence_length>
```
where:
- `-s`: the CSV file obtained at the end of the common steps
- `-d`: destination of the Dataset folder
- `-l`: length of the sequences

example:
```sh
python create_datasetOHE.py -s ..\Data\dataset100.csv -d . -l 100
```
This script creates a dataset folder named `Dataset_x`, where x is the length passed via the `-l` option.
In order to use the Transformer model, the steps are:
1. Run `extract_promoters.py` without the `-l` option:
   ```sh
   python extract_promoters.py -g GRCh38.p14.genome.fa -b human_epdnew_VgGtt.bed -o Desktop/folder
   ```
   The output file will be named `promoters_mixed.fa`.
2. Run `enhancer_preprocessing.py`; it cuts enhancers to a maximum sequence length of 1000, because in the literature enhancers and promoters have a variable length of 100-1000 bp.
   ```sh
   python enhancer_preprocessing.py
   ```
3. Run `create_csv_dataset.py` with `-l` equal to 1000 (i.e., the maximum sequence length):
   ```sh
   python create_csv_dataset.py -p promoters_mixed.fa -e enhancers.txt -o dataset.csv -l 1000
   ```
4. Now that you have the CSV file, create three folders, `Dataset`, `Dataset_validation`, and `Dataset_testing`, and in each of them create three subfolders: `ids`, `att_mask`, `labels`.
5. Run `saving_data.py`:
   ```sh
   python saving_data.py -csv=<path to the csv> -dataset=<path to the Dataset folder> -tokenizer=<type of tokenizer>
   ```
   example:
   ```sh
   python saving_data.py -csv=".../dataset_different_size.csv" -dataset=".../Dataset" -tokenizer="DnaBert"
   ```
You can select three different types of tokenizer:
- insert `Bert` for the BERT tokenizer
- insert `Deep` for the InstaDeep tokenizer
- insert `DnaBert` for DNABERT-2
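As an illustration, these options might map to Hugging Face checkpoints roughly as below; the checkpoint names are assumptions, not necessarily the ones `saving_data.py` actually loads:

```python
from transformers import AutoTokenizer

# Assumed mapping from the -tokenizer option to checkpoints (illustrative only).
CHECKPOINTS = {
    "Bert": "bert-base-uncased",
    "Deep": "InstaDeepAI/nucleotide-transformer-500m-human-ref",
    "DnaBert": "zhihan1996/DNABERT-2-117M",
}

def load_tokenizer(choice: str):
    return AutoTokenizer.from_pretrained(CHECKPOINTS[choice], trust_remote_code=True)

encoded = load_tokenizer("DnaBert")("ACGTACGT")
print(encoded["input_ids"], encoded["attention_mask"])
```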
This command processes the data in the CSV with the selected tokenizer and, for each sequence, saves three pieces of information: token id, attention mask, and label. If you want to try it, you can use the file data_different_size.csv in the data folder of the repo.
6. Run `split_dataset.py` to create the Training, Testing, and Validation sets:
   ```sh
   python split_dataset.py -train=<path to the Training folder> -test=<path to the Testing folder> -validation=<path to the Validation folder>
   ```
   example:
   ```sh
   python split_dataset.py -train=".../Dataset" -test=".../Dataset_testing" -validation=".../Dataset_validation"
   ```
This command divides the data in the Dataset folder to create the validation and test sets, using an 80/10/10 split.
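A minimal sketch of such an 80/10/10 split over per-sequence files, assuming the folder layout created in step 4 (paths, the fixed seed, and matching filenames across subfolders are assumptions):

```python
import os
import random
import shutil

files = sorted(os.listdir("Dataset/ids"))
random.seed(42)  # fixed seed for a reproducible split
random.shuffle(files)

n = len(files)
splits = {
    "Dataset_validation": files[int(0.8 * n): int(0.9 * n)],
    "Dataset_testing":    files[int(0.9 * n):],
}  # the first 80% stays in Dataset as the training set

for folder, names in splits.items():
    for sub in ("ids", "att_mask", "labels"):
        os.makedirs(os.path.join(folder, sub), exist_ok=True)
        for name in names:
            shutil.move(os.path.join("Dataset", sub, name),
                        os.path.join(folder, sub, name))
```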
Training command for the CNN that takes Embeddings as input:
```sh
python train_CNN.py -d <dataset_path>
```
example: `python train_CNN.py -d Dataset200`

Training command for the CNN that takes One Hot Encoding as input:
```sh
python train_OHE.py -d <dataset_path>
```
example: `python train_OHE.py -d Dataset200`

Training command for the Transformer:
```sh
python Training.py -train=<path to training data folder> -test=<path to testing data folder> -validation=<path to validation data folder>
```
example: `python Training.py -train=".../Dataset" -test=".../Dataset_testing" -validation=".../Dataset_validation"`
Checkpoints are saved into `lightning_logs/<version_number>/checkpoints`; the weights are stored in a .ckpt file.
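For reference, a hedged sketch of reloading trained weights from such a checkpoint with PyTorch Lightning (`PromoterEnhancerCNN` and its import path are hypothetical placeholders for whichever LightningModule the training script defines):

```python
from train_CNN import PromoterEnhancerCNN  # hypothetical import path and class name

# load_from_checkpoint restores both architecture hyperparameters and weights.
model = PromoterEnhancerCNN.load_from_checkpoint(
    "lightning_logs/version_0/checkpoints/check01.ckpt"
)
model.eval()  # switch to inference mode
```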
This command is an example of testing the Embedding model:
```sh
python test_CNN.py -c <checkpoint_file> -d <dataset_path>
```
example: `python test_CNN.py -c lightning_logs/version_0/checkpoints/check01.ckpt -d Dataset200`

This command is an example of testing the OHE model:
```sh
python test_OHE.py -c <checkpoint_file> -d <dataset_path>
```
example: `python test_OHE.py -c lightning_logs/version_0/checkpoints/check01.ckpt -d Dataset200`
To test the Transformer model in inference mode, visit the `Inference` folder. Inside the `Script` subfolder, use the following Python files in this order:

1. Extract exons from the reference genome annotation file (.gtf); the output will be a `.json` file containing all exon positions (a simplified sketch of this extraction appears after this list):
   ```sh
   python exons_extractor.py -gtf <genome_annotation_file> -out <output_file> -ch <list of chromosomes to extract>
   ```
   example:
   ```sh
   python exons_extractor.py -gtf gencode.v45.annotation.gtf -out exons.json -ch chr1,chr2,chr3
   ```
2. Extract introns from the reference genome (.fa); the output will be a `.csv` file containing the intron sequences:
   ```sh
   python <introns_extraction_script> -gr <genome_reference> -ex <exon_file.json> -out <introns_output_basename>
   ```
   example:
   ```sh
   python <introns_extraction_script> -gr GRCh38.p14.genome.fa -ex exons.json -out introns
   ```
   The output file has been released in the `Data` subfolder so you can check the final result.
3. Pass the intron sequences to the model:
   ```sh
   python transformer_inference.py -data=<path to the introns csv> -weights=<path to the checkpoints>
   ```
   example:
   ```sh
   python transformer_inference.py -data=".../introns.csv" -weights=".../epoch=3-step=1736.ckpt"
   ```
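As promised above, a simplified sketch of pulling exon records out of a GTF annotation file; it follows the standard 9-column GTF layout, but `exons_extractor.py` itself may differ:

```python
import json

exons = {}
with open("gencode.v45.annotation.gtf") as gtf:
    for line in gtf:
        if line.startswith("#"):
            continue  # skip header comments
        # Standard GTF columns: seqname, source, feature, start, end, ...
        fields = line.rstrip("\n").split("\t")
        chrom, feature, start, end = fields[0], fields[2], fields[3], fields[4]
        if feature == "exon" and chrom in {"chr1", "chr2", "chr3"}:
            exons.setdefault(chrom, []).append((int(start), int(end)))

with open("exons.json", "w") as out:
    json.dump(exons, out)
```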
Each training run produces an output folder that contains checkpoint and "events.out" files.
- In the terminal, run:
  ```sh
  pip3 install tensorboard
  ```
- After the installation, move into the results folder:
  ```sh
  cd lightning_logs
  ```
- Now run:
  ```sh
  tensorboard --logdir <name_of_version_folder>
  ```
- This command shows plots of training, test, and validation accuracy and loss.
The following table reports the confidence intervals of our models:

| Models | Accuracy | Loss | Specificity | Sensitivity | F1 score | MCC | AUC |
|---|---|---|---|---|---|---|---|
| CNN 1D with Embedding | 0.905 ± 0.007 | 0.454 ± 0.043 | 0.907 ± 0.027 | 0.905 ± 0.030 | 0.902 ± 0.006 | 0.827 ± 0.029 | 0.968 ± 0.013 |
| CNN 1D with One Hot Encoding | 0.888 ± 0.010 | 0.884 ± 0.029 | 0.905 ± 0.024 | 0.890 ± 0.009 | 0.815 ± 0.273 | 0.772 ± 0.029 | 0.947 ± 0.011 |
| Transformer BERT | 0.668 ± 0.133 | 0.562 ± 0.019 | 0.346 ± 0.070 | 0.982 ± 0.078 | 0.501 ± 0.095 | 0.448 ± 0.068 | 0.674 ± 0.035 |
| Transformer DNABERT2 | 0.961 ± 0.005 | 0.133 ± 0.009 | 0.944 ± 0.006 | 0.979 ± 0.001 | 0.960 ± 0.001 | 0.923 ± 0.012 | 0.987 ± 0.012 |
| Transformer InstaDeep | 0.958 ± 0.005 | 0.158 ± 0.007 | 0.946 ± 0.016 | 0.968 ± 0.007 | 0.956 ± 0.023 | 0.916 ± 0.005 | 0.985 ± 0.001 |
Distributed under the MIT License. See `LICENSE.txt` for more information.
- Antonio De Blasi - AntonioDeBlasi-Git
- Francesco Zampirollo - zampifre
- Stefano Politanò - StePoli-00