
Promoter & Enhancer Classifier

Abstract

Promoters and enhancers are both gene regulatory elements and are often involved in activating specific genes. They are crucial to the regulation of gene expression, playing key roles in controlling when, where, and how much of a gene is transcribed into RNA and eventually translated into protein. Their similarity can make them difficult to distinguish for several reasons. Promoters and enhancers may share similar or identical DNA sequences.
Both contain binding sites for regulatory proteins, such as transcription factors, which can bind and influence gene activity. These sequences can appear in different combinations and contexts in both promoters and enhancers, making it challenging to distinguish the two elements based solely on DNA sequence. Furthermore, they perform different but related functions: promoters are generally positioned near the gene they regulate and are responsible for initiating transcription, while enhancers can be located at variable distances from the target gene and act by enhancing transcription efficiency.

Table of Contents

  1. Installation
  2. Data Preparation
  3. Training
  4. Inference
  5. View Result
  6. Performance

(back to top)

Installation

  1. Clone the repository:
     git clone https://github.com/StePoli-00/Promoter-and-Enhancer-Classifier.git
  2. After cloning, move into the folder:
     cd Promoter-and-Enhancer-Classifier/
  3. Create and activate a conda environment with Python 3.11:
     conda create -n <env> python=3.11
     conda activate <env>
  4. Install the dependencies:
     pip install -r requirements.txt

(back to top)

Data Preparation

Before using our models, we must first create the dataset. The input data are:

  • Token Embeddings: generated by the DNABERT tokenizer
  • One Hot Encoding: a simple encoding of the DNA bases: A = {1,0,0,0}, T = {0,1,0,0}, C = {0,0,1,0} and G = {0,0,0,1}

Depending on which model you want to test, you must process the data in one of two distinct ways. In this section you will learn how to prepare both.
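
As an illustration, the one-hot scheme above can be sketched as follows. Note that one_hot_encode is a hypothetical helper written for this README, not a function from the repository:

```python
# Hypothetical sketch of the one-hot scheme described above,
# not the repository's actual encoding code.
import numpy as np

ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "G": [0, 0, 0, 1],
}

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a (len(seq), 4) one-hot matrix."""
    return np.array([ONE_HOT[base] for base in seq.upper()])

print(one_hot_encode("ATCG"))  # one row per base, one column per letter
```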

Common Steps

The first thing to do is extract our data: promoters and enhancers.

Promoter Preprocessing

  1. Download the following files: the reference genome FASTA (e.g. GRCh38.p14.genome.fa) and the EPD promoters BED file (e.g. human_epdnew_VgGtt.bed)
  2. Place them wherever you prefer; we suggest using the Data folder to store the files used by the model
  3. From Data_Preparation, run extract_promoters.py:
cd Data_Preparation 
python extract_promoters.py -g <genome_file> -b <bed_file> [-l <promoters_length> -o <output_path>]

where: 
-o is the desired location of the output file
example: 
python extract_promoters.py -g GRCh38.p14.genome.fa -b human_epdnew_VgGtt.bed -l 100 -o Desktop/folder

Note

Omitting the -l parameter will cut promoter sequences at the variable lengths 5, 10, 20, 100, 200, 1000 and 2000.
Except for the Transformer, all the other models take an input sequence of fixed length.
In that case the -l parameter is mandatory.

Enhancer Preprocessing

First Enhancer Preprocessing

The first enhancer preprocessing step requires the following:

  1. Go to https://bio.liclab.net/ENdb/Download.php
  2. Download the file labeled "All the experimentally confirmed enhancers"

Next, run first_enhancer_preprocessing.py.

This script takes the enhancer file, removes the mm10 genes (i.e. mouse genes), and generates a file in hg19 format (necessary for the next preprocessing step).

In the terminal, run:

 python first_enhancer_preprocessing.py -e <enhancer_file.txt> -hg19_list <name_of_file_for_hg19_list>.txt -mm10_list <name_of_mm10_file>.txt

Note

This command generates a file called name_of_file_for_hg19_list.txt, which you will use in the second enhancer preprocessing step.

Second Enhancer Preprocessing

Upload the name_of_file_for_hg19_list.txt obtained in the previous step to https://genome.ucsc.edu/cgi-bin/hgLiftOver to obtain a .BED file listing all enhancers converted to the hg38 format. This file contains enhancers in the format "chr_name:<start_position>-<end_position>".

  1. In the terminal, run:
 pip3 install biopython

The enhancer_preprocessing.py script takes several mandatory command-line arguments:

  1. option -b: path to the bed file (hg38)
  2. option -f: path to the fasta file (GRCh38.FASTA, ~3 GB)
  3. option -e: name of the output file (.txt) for the final sequences

The script opens the .BED file and, using a Gene class (name, start, end), generates a list of Gene objects.
It also opens the GRCh38.FASTA file and, with a FastaElem class (name, sequence), generates a list of FastaElem objects.
Then, using the Biopython library, for each Gene the script extracts the corresponding nucleotide sequence from its FastaElem (using the specific start/end positions) and saves it to the enhancer.txt output file. At the end, the enhancer.txt file contains all the enhancers with the lengths specified in the initial hg38 .BED file.

python enhancer_preprocessing.py -b <path_bed_file> -f <path_fasta_file> -e <output_name_enhancer_file>
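
To make the extraction logic above concrete, here is a minimal sketch, assuming a simple BED-like file with "chr_name:start-end" entries; it is illustrative and does not reproduce the repository's Gene and FastaElem classes:

```python
# Minimal sketch of extracting enhancer sequences from the reference genome,
# assuming "chr_name:start-end" lines in the BED-like input (an assumption,
# not the repository's exact parsing logic).
from Bio import SeqIO

# Load every chromosome of the reference genome into memory.
# (GRCh38 is ~3 GB; the real script may handle this more carefully.)
genome = {rec.id: rec.seq for rec in SeqIO.parse("GRCh38.p14.genome.fa", "fasta")}

with open("enhancers_hg38.bed") as bed, open("enhancer.txt", "w") as out:
    for line in bed:
        chrom, positions = line.strip().split(":")
        start, end = (int(p.strip()) for p in positions.split("-"))
        if chrom in genome:
            out.write(str(genome[chrom][start:end]) + "\n")
```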

Create csv dataset

The csv dataset contains the promoter and enhancer sequences. From the previous folder, run create_csv_dataset.py:

python create_csv_dataset.py -p <promoter_file> -e <enhancer_file> -o <output_file> -l <promoter_length>

example:
python create_csv_dataset.py -p promoters_100.fa -e enhancers.txt -o dataset.csv -l 100
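
The exact csv layout is defined by create_csv_dataset.py; below is a plausible sketch of what it might produce, assuming one sequence per row with a binary label (1 = promoter, 0 = enhancer is an assumption, not confirmed by the source):

```python
# Hypothetical sketch of building a labeled csv from the two input files.
# The 1 = promoter / 0 = enhancer convention is assumed, not confirmed.
import csv
from Bio import SeqIO

with open("dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sequence", "label"])
    for rec in SeqIO.parse("promoters_100.fa", "fasta"):
        writer.writerow([str(rec.seq), 1])      # promoters
    with open("enhancers.txt") as enh:
        for line in enh:
            writer.writerow([line.strip(), 0])  # enhancers
```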

Embedding Tokenizer

Warning

The following script must be executed on a GPU.

  1. Make sure you have the einops package installed; otherwise install it:
    pip install einops
  2. Feed the csv file to create_embedding.py:
    python create_embedding.py -s <dataset.csv> -d <destination>
    
    example:
    python create_embedding.py -s data.csv -d Project_Folder/

Note

  • This script will create an Embedding folder in the provided destination path, containing the embeddings and labels of our data, converted from sequences to embeddings using the DNABERT tokenizer (see the sketch after this note).
  • Before running the script, make sure the destination folder does not already contain a folder named Embedding, or the script will stop. This behaviour is implemented to avoid overwriting data from previous executions of the code.
  • If you encounter an error like "assert q.is_cuda and k.is_cuda and v.is_cuda", try uninstalling the triton package.
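
A minimal sketch of how such embeddings can be produced with the DNABERT-2 tokenizer and model from Hugging Face; the exact checkpoint and pooling used by create_embedding.py are assumptions:

```python
# Sketch of sequence-to-embedding conversion with DNABERT-2 (assumed checkpoint;
# create_embedding.py may differ in model choice and pooling).
import torch
from transformers import AutoModel, AutoTokenizer

name = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

seq = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
input_ids = tokenizer(seq, return_tensors="pt")["input_ids"]
with torch.no_grad():
    hidden_states = model(input_ids)[0]      # (1, tokens, 768)
embedding = hidden_states.mean(dim=1)        # mean-pool to (1, 768)
```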
  3. Run split_dataset.py to create the Dataset folder:
    python split_dataset.py -e <embedding_folder_path> -d <dataset_folder_path> [ -z <embedding_zip_path>]  
    example:
    python split_dataset.py -e Project/Embedding -d Dataset   

Note

  • If you want to use the -z option, <embedding_zip_path> must have the same name as embedding_folder_path
  • -z can be used only if there is exactly one embedding zip folder

(back to top)

One Hot Encoding

After the common steps, run this command to generate the OHE dataset:

python create_datasetOHE.py -s <source_path> -d <destination_path> -l <sequence_length>

where:
- s: the csv file obtained at the end of the common steps
- d: destination of the Dataset folder
- l: length of the sequences

example: python create_datasetOHE.py -s ..\Data\dataset100.csv -d . -l 100

This script will create a dataset folder named Dataset_x, where x is the length passed via the -l option.

Transformer

In order to use the Transformer model, the steps are:

  1. Run extract_promoters.py without the -l option:
example: 
python extract_promoters.py -g GRCh38.p14.genome.fa -b human_epdnew_VgGtt.bed -o Desktop/folder

the output file will be named promoters_mixed.fa

  2. Run enhancer_preprocessing.py, which will cut enhancers to a maximum sequence length of 1000, because in the literature enhancers and promoters have a variable length of 100-1000 bp:
python enhancer_preprocessing.py
  3. Run create_csv_dataset.py with -l equal to 1000 (i.e. the maximum sequence length):
example: 
python create_csv_dataset.py -p promoters_mixed.fa -e enhancers.txt -o dataset.csv -l 1000
  4. Now that you have the csv file, create three folders:
    Dataset
    Dataset_validation
    Dataset_testing

and in each of them create three subfolders:

    ids
    att_mask
    labels
  5. Run saving_data.py:
python saving_data.py -csv=<path to the csv> -dataset=<path to the Dataset folder> -tokenizer=<type of tokenizer>
example:
python saving_data.py -csv=".../dataset_different_size.csv" -dataset=".../Dataset" -tokenizer="DnaBert" 

You can select three different types of tokenizer:

insert Bert, for the BERT tokenizer
insert Deep, for InstaDeep
insert DnaBert, for DnaBert2

This command will process the data in the csv using the selected tokenizer and, for each sequence, save three pieces of information: token ids, attention mask, and label. If you want to try this, you can use the file data_different_size.csv present in the data folder of the repo.

  6. Run split_dataset.py to create the Training, Testing and Validation sets:

python split_dataset.py -train=<path to the Training folder> -test=<path to the Testing folder> -validation=<path to the Validation folder>
example:
python split_dataset.py -train=".../Dataset" -test=".../Dataset_testing"  -validation=".../Dataset_validation" 

This command will divide the data present in the Dataset folder to create the validation and test sets, considering an 80/10/10 split.
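
A minimal sketch of such an 80/10/10 split, assuming the three subfolders hold one file per sample with matching names (an assumption; the real split_dataset.py may differ):

```python
# Hypothetical 80/10/10 split over per-sample files; assumes ids/, att_mask/
# and labels/ contain identically named files, which is an assumption.
import random
import shutil
from pathlib import Path

def split_dataset(train_dir: str, test_dir: str, val_dir: str, seed: int = 0) -> None:
    files = sorted(Path(train_dir, "ids").glob("*"))
    random.Random(seed).shuffle(files)
    n = len(files)
    chunks = {test_dir: files[: n // 10], val_dir: files[n // 10 : n // 5]}
    for dest, chunk in chunks.items():   # the remaining ~80% stays for training
        for f in chunk:
            for sub in ("ids", "att_mask", "labels"):
                shutil.move(str(Path(train_dir, sub, f.name)),
                            str(Path(dest, sub, f.name)))
```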

(back to top)

Training

Training command for the CNN that takes Embeddings as input:

   python train_CNN.py -d <dataset_path>
   example:  python train_CNN.py -d Dataset200

Training command for the CNN that takes One Hot Encoding as input:

   python train_OHE.py -d <dataset_path>
   example:  python train_OHE.py -d Dataset200

Training command for the Transformer:

    python Training.py -train=<path to training data folder> -test=<path to testing data folder> -validation=<path to validation data folder>
    example:  python Training.py -train=".../Dataset" -test=".../Dataset_testing" -validation=".../Dataset_validation"   
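
The checkpoints land in lightning_logs/ (see Inference below), which suggests the training scripts are built on PyTorch Lightning. A minimal sketch of such a training loop, with a hypothetical stand-in model that is not the repository's actual architecture:

```python
# Minimal PyTorch Lightning training sketch; the model below is a hypothetical
# stand-in, not the repository's CNN.
import pytorch_lightning as pl
import torch
from torch import nn
from torch.nn import functional as F

class StandInCNN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=7), nn.ReLU(),   # 4 one-hot channels
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),
            nn.Linear(32, 2),                             # promoter vs enhancer
        )

    def training_step(self, batch, batch_idx):
        x, y = batch                   # x: (B, 4, L) one-hot, y: class labels
        loss = F.cross_entropy(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# trainer = pl.Trainer(max_epochs=10)    # logs go to lightning_logs/
# trainer.fit(StandInCNN(), train_loader, val_loader)
```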

(back to top)

Inference

CNN

Checkpoints are saved into lightning_logs/<version_number>/checkpoints; weights are stored in .ckpt files.

python test_CNN.py -c <checkpoint_file> -d <dataset_path>
example:  python test_CNN.py -c lightning_logs/version_0/checkpoints/check01.ckpt -d Dataset200

The following command is an example of testing the OHE model:

python test_OHE.py -c <checkpoint_file> -d <dataset_path>
example:  python test_OHE.py -c lightning_logs/version_0/checkpoints/check01.ckpt -d Dataset200

Transformer

To test the Transformer model in inference mode, visit the Inference folder. Inside the Script subfolder, use the following Python files in this order:

  1. Extract exons from the reference genome annotation file (.gtf): the output will be a .json file containing all exon positions (a minimal sketch of this step appears after this list).

    python exons_extractor.py -gtf <genome_annotation_file> -out <output_file> -ch <list of chromosomes to extract>
    example:
    python exons_extractor.py -gtf gencode.v45.annotation.gtf -out exons.json -ch chr1,chr2,chr3
  2. Extract introns from the reference genome (.fa): the output will be a .csv file containing the intron sequences.

    python <intron_extraction_script> -gr <genome_reference> -ex <exon_file.json> -out <introns_output_basename>
    
    example: 
    python <intron_extraction_script> -gr GRCh38.p14.genome.fa -ex exons.json -out introns

    The output file has been released in the Data subfolder so you can check the final result.

  3. Pass the intron sequences to the model:

    python transformer_inference.py -data=<path to the introns csv> -weights=<path to the checkpoints> 
    
    example:
    python transformer_inference.py -data=".../introns.csv" -weights=".../epoch=3-step=1736.ckpt" 
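
As promised in step 1, here is a minimal sketch of the exon-extraction idea, assuming the standard tab-separated GTF layout with the feature type in column 3; it is illustrative, not the repository's exons_extractor.py:

```python
# Sketch of collecting exon intervals from a GTF file into JSON
# (standard GTF columns assumed: seqname, source, feature, start, end, ...).
import json

wanted = {"chr1", "chr2", "chr3"}
exons = {ch: [] for ch in wanted}

with open("gencode.v45.annotation.gtf") as gtf:
    for line in gtf:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, feature, start, end = fields[0], fields[2], fields[3], fields[4]
        if feature == "exon" and chrom in wanted:
            exons[chrom].append((int(start), int(end)))

with open("exons.json", "w") as out:
    json.dump(exons, out)
```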

(back to top)

View Result

Each training run produces an output folder that contains checkpoint and "events.out" files.

  1. In the terminal, run:
     pip3 install tensorboard
  2. After the installation, move into the results folder:
    cd lightning_logs
  3. Now run:
     tensorboard --logdir <name_of_version_folder> 
  4. This command shows graphs of the training, test and validation accuracy and loss.

(back to top)

Performance

The following table reports the confidence intervals of our models.

| Models | Accuracy | Loss | Specificity | Sensitivity | F1 score | MCC | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CNN 1D with Embedding | 0.905 ± 0.007 | 0.454 ± 0.043 | 0.907 ± 0.027 | 0.905 ± 0.030 | 0.902 ± 0.006 | 0.827 ± 0.029 | 0.968 ± 0.013 |
| CNN 1D with One Hot Encoding | 0.888 ± 0.01 | 0.884 ± 0.029 | 0.905 ± 0.024 | 0.890 ± 0.009 | 0.815 ± 0.273 | 0.772 ± 0.029 | 0.947 ± 0.011 |
| Transformer BERT | 0.668 ± 0.133 | 0.562 ± 0.019 | 0.346 ± 0.070 | 0.982 ± 0.078 | 0.501 ± 0.095 | 0.448 ± 0.068 | 0.674 ± 0.035 |
| Transformer DNABERT2 | 0.961 ± 0.005 | 0.133 ± 0.009 | 0.944 ± 0.006 | 0.979 ± 0.001 | 0.960 ± 0.001 | 0.923 ± 0.012 | 0.987 ± 0.012 |
| Transformer InstaDeep | 0.958 ± 0.005 | 0.158 ± 0.007 | 0.946 ± 0.016 | 0.968 ± 0.007 | 0.956 ± 0.023 | 0.916 ± 0.005 | 0.985 ± 0.001 |
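
For reference, the reported metrics follow the standard definitions, where TP, TN, FP and FN are true/false positives/negatives:

```latex
\begin{align*}
\text{Sensitivity} &= \frac{TP}{TP + FN} &
\text{Specificity} &= \frac{TN}{TN + FP} \\[4pt]
F_1 &= \frac{2\,TP}{2\,TP + FP + FN} &
MCC &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\end{align*}
```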

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

About

Project from "AI for BioInformatics" @ Unimore