Promoters and enhancers are both gene-regulatory elements and are often involved in activating specific genes. They are crucial to the regulation of gene expression, playing key roles in controlling when, where, and how much of a gene is transcribed into RNA and eventually translated into protein.
Their similarity can make them difficult to distinguish for several reasons. Promoters and enhancers may share similar or identical DNA sequences.
Both contain binding sequences for regulatory proteins, such as transcription factors, which can bind and influence gene activity. These sequences can appear in different combinations and contexts in both promoters and enhancers, making it challenging to distinguish the two elements based solely on DNA sequence. Furthermore, they perform different but related functions: promoters are generally positioned near the gene they regulate and are responsible for initiating transcription, while enhancers can be located at variable distances from the target gene and act by enhancing transcription efficiency.
- Clone the repository:
  ```sh
  git clone https://github.com/StePoli-00/Promoter-and-Enhancer-Classifier.git
  ```
- After cloning, move into the folder:
  ```sh
  cd Promoter-and-Enhancer-Classifier/
  ```
- Create a conda environment with Python 3.11:
  ```sh
  conda create -n <env> python=3.11
  ```
- Install the dependencies:
  ```sh
  pip install -r requirements.txt
  ```
Before using our models, we must first create the dataset. The input data come in two forms:
- Token embeddings: generated by the DNABERT tokenizer
- One-hot encoding: a simple encoding of the DNA bases: A = {1,0,0,0}, T = {0,1,0,0}, C = {0,0,1,0} and G = {0,0,0,1}
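As an illustration, here is a minimal Python sketch of the one-hot scheme above (the repository's own encoding lives in `create_datasetOHE.py`; this helper is purely illustrative):

```python
import numpy as np

# One-hot table matching the scheme described above.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "G": [0, 0, 0, 1],
}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode a DNA sequence as a (length, 4) matrix; unknown bases map to zeros."""
    return np.array([ONE_HOT.get(base, [0, 0, 0, 0]) for base in sequence.upper()])

print(one_hot_encode("ATCG"))  # one row per base
```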
Depending on which model you want to test, you must process the data in one of two distinct ways. This section explains how to prepare both.
First of all, we must extract our data: promoters and enhancers.
- Download the following files:
  - genome file: `GRCh38.p14.genome.fa`
  - BED file: `human_epdnew_xxxxxx.bed`
- Place them wherever you prefer; we suggest using the `Data` folder to store the files used by the model.
- From `Data_Preparation`, run `extract_promoters.py`:
  ```sh
  cd Data_Preparation
  python extract_promoters.py -g <genome_file> -b <bed_file> [-l <promoters_length> -o <output_path>]
  ```
where:
- `-o` is the desired output path

example:
```sh
python extract_promoters.py -g GRCh38.p14.genome.fa -b human_epdnew_VgGtt.bed -l 100 -o Desktop/folder
```
> **Note**
> Omitting the `-l` parameter cuts the promoter sequences into a set of variable lengths: 5, 10, 20, 100, 200, 1000, 2000.
> Except for the Transformer, all the other models take an input sequence of fixed length, so for them the `-l` parameter is mandatory.
### First Enhancers Preprocessing
The first enhancer preprocessing step requires you to:
- Go to https://bio.liclab.net/ENdb/Download.php
- Download the file labeled "All the experimentally confirmed enhancers"

Now run `first_enhancer_preprocessing.py`. This script takes `enhancer_file`, removes the mm10 genes (i.e., mouse genes), and generates a file in hg19 format (necessary for the next preprocessing step).
In the terminal, run:
```sh
python first_enhancer_preprocessing.py -e <enhancer_file.txt> -hg19_list <name_of_file_for_hg19_list>.txt -mm10_list <name_of_mm10_file>.txt
```
> **Note**
> This command generates a file called `name_of_file_for_hg19_list.txt` that you will use in the second step of enhancer preprocessing.
### Second Enhancers Preprocessing
Using https://genome.ucsc.edu/cgi-bin/hgLiftOver, upload the `name_of_file_for_hg19_list.txt` obtained in the previous step; the tool returns a .BED file listing all enhancers converted to hg38 coordinates.
This file contains enhancers in the format "chr_name:<start_position> - <end_position>".
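As an illustration, a small helper for parsing that format might look like this (`parse_region` is a hypothetical helper, not part of the repository):

```python
def parse_region(line: str) -> tuple[str, int, int]:
    """Parse an entry like 'chr1:10500 - 11000' into (name, start, end)."""
    name, coords = line.strip().split(":")
    start, end = (int(x) for x in coords.replace(" ", "").split("-"))
    return name, start, end

print(parse_region("chr1:10500 - 11000"))  # ('chr1', 10500, 11000)
```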
- In the terminal, run:
  ```sh
  pip3 install biopython
  ```

The `enhancer_preprocessing.py` script takes several mandatory command-line arguments:
- `-b`: path to the BED file (hg38)
- `-f`: path to the FASTA file (GRCh38.FASTA, ~3 GB)
- `-e`: name of the output file (.txt) for the final sequences
The script first opens the .BED file and, using a Gene class (name, start, end), builds a list of Gene objects. It also opens the `GRCh38.FASTA` file and, with a FastaElem class (name, sequence), builds a list of FastaElem objects. Then, using the Biopython library, the script takes, for each Gene, the corresponding nucleotide sequence from its FastaElem (using the specified start and end positions) and saves it to the `enhancer.txt` output file.
At the end, the `enhancer.txt` file contains all the enhancers with the lengths specified in the initial hg38 .BED file.
```sh
python enhancer_preprocessing.py -b <path_bed_file> -f <path_fasta_file> -e <output_name_enhancer_file>
```
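A minimal sketch of that extraction logic, assuming simplified Gene records (illustrative only; the repository's actual Gene and FastaElem classes differ):

```python
from Bio import SeqIO

# Simplified stand-ins for the Gene records parsed from the .BED file.
genes = [("ENH_1", "chr1", 10500, 11000)]  # (name, chromosome, start, end)

# Index the genome FASTA lazily by record name (avoids loading ~3 GB into RAM).
genome = SeqIO.index("GRCh38.p14.genome.fa", "fasta")

with open("enhancer.txt", "w") as out:
    for name, chrom, start, end in genes:
        sequence = str(genome[chrom].seq[start:end])  # slice the region
        out.write(f">{name}\n{sequence}\n")
```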
The output contains the promoter and enhancer sequences. From the previous folder, run `create_csv_dataset.py`:
```sh
python create_csv_dataset.py -p <promoter_file> -e <enhancer_file> -o <output_file> -l <promoter_length>
```
example:
```sh
python create_csv_dataset.py -p promoters_100.fa -e enhancers.txt -o dataset.csv -l 100
```
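For reference, the resulting CSV pairs each sequence with a class label. A hedged sketch of that layout (column names and label convention are assumptions, not necessarily those of `create_csv_dataset.py`):

```python
import pandas as pd

promoters = ["ACGTTGCA"]  # sequences read from promoters_100.fa
enhancers = ["TTGACGTA"]  # sequences read from enhancers.txt

# One row per sequence, with a binary class label.
df = pd.DataFrame({
    "sequence": promoters + enhancers,
    "label": [0] * len(promoters) + [1] * len(enhancers),  # assumed: 0 = promoter, 1 = enhancer
})
df.to_csv("dataset.csv", index=False)
```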
> **Warning**
> The following script must be executed on a GPU.

- Make sure you have the einops package installed; otherwise install it:
  ```sh
  pip install einops
  ```
- Feed the CSV file into `create_embedding.py`:
  ```sh
  python create_embedding.py -s <dataset.csv> -d <destination>
  ```
  example: `python create_embedding.py -s data.csv -d Project_Folder/`
> **Note**
> - This script creates an `Embedding` folder in the provided destination path, containing the `embeddings` and `labels` of our data, converted from sequences to embeddings using the DNABERT tokenizer.
> - Before running the script, make sure the destination folder does not already contain a folder named `Embedding`, because the script will stop if it does. This behaviour is implemented to avoid overwriting data from previous executions of the code.
> - If you encounter this type of error: `assert q.is_cuda and k.is_cuda and v.is_cuda`, try uninstalling the triton package.
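For reference, a hedged sketch of tokenizing a sequence with the DNABERT-2 tokenizer through Hugging Face `transformers` (the checkpoint name is an assumption; `create_embedding.py` may configure this differently):

```python
from transformers import AutoTokenizer

# DNABERT-2 ships custom tokenizer code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

encoded = tokenizer("ACGTAGCATCGGATCTATCTATCGACACTTGG", return_tensors="pt")
print(encoded["input_ids"].shape)  # (1, number_of_tokens)
```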
- Run `split_dataset.py` to create the Dataset folder:
  ```sh
  python split_dataset.py -e <embedding_folder_path> -d <dataset_folder_path> [-z <embedding_zip_path>]
  ```
  example: `python split_dataset.py -e Project/Embedding -d Dataset`

> **Note**
> - If you want to use the `-z` option, `<embedding_zip_path>` must have the same name as `<embedding_folder_path>`.
> - `-z` can be used only if there is exactly one embedding zip folder.
After the common steps, run this command to generate the OHE dataset:
```sh
python create_datasetOHE.py -s <source_path> -d <destination_path> -l <sequence_length>
```
where:
- `-s`: the CSV file obtained at the end of the common steps
- `-d`: destination of the Dataset folder
- `-l`: length of the sequences

example:
```sh
python create_datasetOHE.py -s ..\Data\dataset100.csv -d . -l 100
```
This script creates a dataset folder named `Dataset_x`, where x is the length passed via the `-l` option.
In order to use the Transformer model, the steps are:
1. Run `extract_promoters.py` without the `-l` option:
   ```sh
   python extract_promoters.py -g GRCh38.p14.genome.fa -b human_epdnew_VgGtt.bed -o Desktop/folder
   ```
   The output file will be named `promoters_mixed.fa`.
2. Run `enhancer_preprocessing.py`; it cuts enhancers to a maximum sequence length of 1000, because in the literature enhancers and promoters have a variable length of 100-1000 bp.
   ```sh
   python enhancer_preprocessing.py
   ```
3. Run `create_csv_dataset.py` with `-l` equal to 1000 (i.e., the maximum sequence length):
   ```sh
   python create_csv_dataset.py -p promoters_mixed.fa -e enhancers.txt -o dataset.csv -l 1000
   ```
4. Now that you have the CSV file, create three folders, `Dataset`, `Dataset_validation`, and `Dataset_testing`, and in each of them create three subfolders: `ids`, `att_mask`, `labels`.
5. Run `saving_data.py`:
   ```sh
   python saving_data.py -csv=<path to the csv> -dataset=<path to the Dataset folder> -tokenizer=<type of tokenizer>
   ```
   example:
   ```sh
   python saving_data.py -csv=".../dataset_different_size.csv" -dataset=".../Dataset" -tokenizer="DnaBert"
   ```
You can select three different types of tokenizer:
- insert `Bert` for the BERT tokenizer
- insert `Deep` for the InstaDeep tokenizer
- insert `DnaBert` for DNABERT-2
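As an illustration, these options might map to Hugging Face checkpoints roughly as below; the checkpoint names are assumptions, not necessarily the ones `saving_data.py` actually loads:

```python
from transformers import AutoTokenizer

# Assumed mapping from the -tokenizer option to checkpoints (illustrative only).
CHECKPOINTS = {
    "Bert": "bert-base-uncased",
    "Deep": "InstaDeepAI/nucleotide-transformer-500m-human-ref",
    "DnaBert": "zhihan1996/DNABERT-2-117M",
}

def load_tokenizer(choice: str):
    return AutoTokenizer.from_pretrained(CHECKPOINTS[choice], trust_remote_code=True)

encoded = load_tokenizer("DnaBert")("ACGTACGT")
print(encoded["input_ids"], encoded["attention_mask"])
```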
This command processes the data in the CSV with the selected tokenizer and, for each sequence, saves three pieces of information: token id, attention mask, and label. If you want to try it, you can use the file data_different_size.csv in the data folder of the repo.
6. Run `split_dataset.py` to create the Training, Testing, and Validation sets:
   ```sh
   python split_dataset.py -train=<path to the Training folder> -test=<path to the Testing folder> -validation=<path to the Validation folder>
   ```
   example:
   ```sh
   python split_dataset.py -train=".../Dataset" -test=".../Dataset_testing" -validation=".../Dataset_validation"
   ```
This command divides the data in the Dataset folder to create the validation and test sets, using an 80/10/10 split.
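A minimal sketch of such an 80/10/10 split over per-sequence files, assuming the folder layout created in step 4 (paths, the fixed seed, and matching filenames across subfolders are assumptions):

```python
import os
import random
import shutil

files = sorted(os.listdir("Dataset/ids"))
random.seed(42)  # fixed seed for a reproducible split
random.shuffle(files)

n = len(files)
splits = {
    "Dataset_validation": files[int(0.8 * n): int(0.9 * n)],
    "Dataset_testing":    files[int(0.9 * n):],
}  # the first 80% stays in Dataset as the training set

for folder, names in splits.items():
    for sub in ("ids", "att_mask", "labels"):
        os.makedirs(os.path.join(folder, sub), exist_ok=True)
        for name in names:
            shutil.move(os.path.join("Dataset", sub, name),
                        os.path.join(folder, sub, name))
```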
Training command for the CNN that takes Embeddings as input:
```sh
python train_CNN.py -d <dataset_path>
```
example: `python train_CNN.py -d Dataset200`

Training command for the CNN that takes One Hot Encoding as input:
```sh
python train_OHE.py -d <dataset_path>
```
example: `python train_OHE.py -d Dataset200`

Training command for the Transformer:
```sh
python Training.py -train=<path to training data folder> -test=<path to testing data folder> -validation=<path to validation data folder>
```
example: `python Training.py -train=".../Dataset" -test=".../Dataset_testing" -validation=".../Dataset_validation"`
Checkpoints are saved into `lightning_logs/<version_number>/checkpoints`; the weights are stored in a .ckpt file.
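For reference, a hedged sketch of reloading trained weights from such a checkpoint with PyTorch Lightning (`PromoterEnhancerCNN` and its import path are hypothetical placeholders for whichever LightningModule the training script defines):

```python
from train_CNN import PromoterEnhancerCNN  # hypothetical import path and class name

# load_from_checkpoint restores both architecture hyperparameters and weights.
model = PromoterEnhancerCNN.load_from_checkpoint(
    "lightning_logs/version_0/checkpoints/check01.ckpt"
)
model.eval()  # switch to inference mode
```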
This command is an example of testing the Embedding model:
```sh
python test_CNN.py -c <checkpoint_file> -d <dataset_path>
```
example: `python test_CNN.py -c lightning_logs/version_0/checkpoints/check01.ckpt -d Dataset200`

This command is an example of testing the OHE model:
```sh
python test_OHE.py -c <checkpoint_file> -d <dataset_path>
```
example: `python test_OHE.py -c lightning_logs/version_0/checkpoints/check01.ckpt -d Dataset200`
To test the Transformer model in inference mode, visit the `Inference` folder. Inside the `Script` subfolder, use the following Python files in this order:

1. Extract exons from the reference genome annotation file (.gtf); the output will be a `.json` file containing all exon positions (a simplified sketch of this extraction appears after this list):
   ```sh
   python exons_extractor.py -gtf <genome_annotation_file> -out <output_file> -ch <list of chromosomes to extract>
   ```
   example:
   ```sh
   python exons_extractor.py -gtf gencode.v45.annotation.gtf -out exons.json -ch chr1,chr2,chr3
   ```
2. Extract introns from the reference genome (.fa); the output will be a `.csv` file containing the intron sequences:
   ```sh
   python <introns_extraction_script> -gr <genome_reference> -ex <exon_file.json> -out <introns_output_basename>
   ```
   example:
   ```sh
   python <introns_extraction_script> -gr GRCh38.p14.genome.fa -ex exons.json -out introns
   ```
   The output file has been released in the `Data` subfolder so you can check the final result.
3. Pass the intron sequences to the model:
   ```sh
   python transformer_inference.py -data=<path to the introns csv> -weights=<path to the checkpoints>
   ```
   example:
   ```sh
   python transformer_inference.py -data=".../introns.csv" -weights=".../epoch=3-step=1736.ckpt"
   ```
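As promised above, a simplified sketch of pulling exon records out of a GTF annotation file; it follows the standard 9-column GTF layout, but `exons_extractor.py` itself may differ:

```python
import json

exons = {}
with open("gencode.v45.annotation.gtf") as gtf:
    for line in gtf:
        if line.startswith("#"):
            continue  # skip header comments
        # Standard GTF columns: seqname, source, feature, start, end, ...
        fields = line.rstrip("\n").split("\t")
        chrom, feature, start, end = fields[0], fields[2], fields[3], fields[4]
        if feature == "exon" and chrom in {"chr1", "chr2", "chr3"}:
            exons.setdefault(chrom, []).append((int(start), int(end)))

with open("exons.json", "w") as out:
    json.dump(exons, out)
```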
Each training run produces an output folder that contains checkpoint and "events.out" files.
- In the terminal, run:
  ```sh
  pip3 install tensorboard
  ```
- After the installation, move into the results folder:
  ```sh
  cd lightning_logs
  ```
- Now run:
  ```sh
  tensorboard --logdir <name_of_version_folder>
  ```
- This command shows plots of training, test, and validation accuracy and loss.
The following table reports the confidence intervals of our models:

| Models | Accuracy | Loss | Specificity | Sensitivity | F1 score | MCC | AUC |
|---|---|---|---|---|---|---|---|
| CNN 1D with Embedding | 0.905 ± 0.007 | 0.454 ± 0.043 | 0.907 ± 0.027 | 0.905 ± 0.030 | 0.902 ± 0.006 | 0.827 ± 0.029 | 0.968 ± 0.013 |
| CNN 1D with One Hot Encoding | 0.888 ± 0.010 | 0.884 ± 0.029 | 0.905 ± 0.024 | 0.890 ± 0.009 | 0.815 ± 0.273 | 0.772 ± 0.029 | 0.947 ± 0.011 |
| Transformer BERT | 0.668 ± 0.133 | 0.562 ± 0.019 | 0.346 ± 0.070 | 0.982 ± 0.078 | 0.501 ± 0.095 | 0.448 ± 0.068 | 0.674 ± 0.035 |
| Transformer DNABERT2 | 0.961 ± 0.005 | 0.133 ± 0.009 | 0.944 ± 0.006 | 0.979 ± 0.001 | 0.960 ± 0.001 | 0.923 ± 0.012 | 0.987 ± 0.012 |
| Transformer InstaDeep | 0.958 ± 0.005 | 0.158 ± 0.007 | 0.946 ± 0.016 | 0.968 ± 0.007 | 0.956 ± 0.023 | 0.916 ± 0.005 | 0.985 ± 0.001 |
Distributed under the MIT License. See `LICENSE.txt` for more information.
- Antonio De Blasi - AntonioDeBlasi-Git
- Francesco Zampirollo - zampifre
- Stefano Politanò - StePoli-00