- Explore_matrices_and_get_data_for_multi_model.ipynb - in this notebook we explore the UniBind+Remap matrix and collect data to train a 50 TF multi-model;
- Train_multi_model.ipynb - in this notebook we train a CNN multi-model with 50 TFs and generate the results;
- Train_multi_model_DanQ.ipynb - in this notebook we train a DanQ multi-model with 50 TFs and generate the results;
- TL_exploring_pentad_TFs.ipynb - in this notebook we explore the results of training with biologically relevant groups (with TF pentad) using a CNN model;
- TL_exploring_pentad_TFs_DanQ.ipynb - in this notebook we explore the results of training with biologically relevant groups (with TF pentad) using a DanQ model;
- TL_effect_of_multimodels.ipynb - effect of different multi-models (5 vs 50) on the individual model performance (for pentad TFs);
- TL_individual_models.ipynb - plotting the results for the individual models vs multi-model; plotting the box plots for the effect of TL;
- TL_with_freezed_layers.ipynb - inspecting TL performance when the convolutional layers are frozen;
- Exploring_effect_of_BM_on_15_TFs.ipynb - in this notebook we plot the effect of BM on TL for 15 TFs (3 TFs for each pentad family);
- Model_interpretation.ipynb - here we interpret the multi-model and the individual models by converting first-layer convolutional filters into PWMs;
- Interpretation_of_models_finetuned_with_cofactors.ipynb - interpreting individual models that were preinitialized with weights from multi-models trained on cofactors;
- Data_size_effect_on_TL.ipynb - exploring how TL affects the performance for different sub-sampled data sets;
- UMAP_Binding_heatmap_and_selecting_groups.ipynb - in this notebook we analyze the binding matrix by plotting a UMAP projection and a heatmap of binding pattern similarities; we also select biologically relevant groups for TL;
- run_training_of_individual_models.sh - run to train 148 individual TF models from scratch or using the 50 TF multi-model to initialize weights;
- run_training_of_individual_models_FREEZING_LAYERS.sh - run to train 148 individual TF models from scratch or using the 50 TF multi-model to initialize weights; this time the convolutional layers are frozen;
- run_single_tf_refined_subsample_experiment.sh - run to test TL boundaries by sub-sampling different numbers of positive regions;
- run_BM_real_TFs_last_exp.sh (runs run_BM_tl_last_exp_corrected_remove.sh) - trains models with TL for 15 TFs;
- run_BM_multimodel_TFs.sh (runs run_BM_multimodel.sh) - performs TL using either the 50 TF or the 5 TF multi-model;
- run_BM_real_TFs.sh (runs run_BM_tl_subsample.sh) - train individual models using TL with TFs from the same BM; for speed, subsample data sets to 1000 positives/negatives; also runs run_BM_tl_subsample_DanQ.sh - same as above but for DanQ;
- run_cofactors_real_TFs.sh (runs run_cofactor_tl_subsample.sh) - train individual models using TL with TFs that are cofactors/STRING partners/low correlated TFs with the same BM; for speed, subsample data sets to 1000 positives/negatives;
- get_indiv_data_for_each_TF.sh - get data splits for individual TFs;
- split_the_dataset.py - script that takes as input fasta files and labels and splits the data into train/validation/test sets; one-hot encodes (and reverse complements if required) the sequences;
- Run_Analysis_Training.py - trains a multi-model;
- Run_String_Analysis_Training.py - same as above, but saves class labels in a separate file;
- remove_training_data.py - script for removing data used to train a multi-model from the original TF binding matrix;
- Run_Analysis_Transfer_Learning.py - trains individual TF models with/without TL and with/without testing;
- Run_Analysis_String_Transfer_Learning.py - same as above, but accepts class labels to use during the testing;
- Run_Analysis_Transfer_Learning_Subsampling.py - same as above, but takes as input a specified test data set (used during testing TL boundaries);
- get_data_for_TF.py - script to build fasta and labels files for a specific TF;
- get_data_for_TF_subsample_positives_old.py - same as above, but subsamples the data to a certain number of positives/negatives;
- get_data_for_TF_subsample_positives.py - same as above, but also subsamples a certain number of test sequences;
- Run_BM_Analysis.py - generates fasta and labels files for a specific TF and binding mode; the final subsampled data cannot be less than 70,000 sequences;
- Run_BM_Analysis_LE.py - same as above, but no restriction on subsampled data size;
- Run_BM_Multimodel_Analysis.py - same as above, but randomly samples 40,000 regions and saves labels for the 50-class and 5-class multi-models;
- Run_Cofactor_Analysis.py - generates fasta and labels files for a specific TF and its biological group (cofactors, STRING partners, low correlated BM); the final subsampled data cannot be less than 70,000 sequences;
- Run_Cofactor_Multimodel_Analysis.py - same as above, but randomly samples 40,000 regions and saves labels for the 50-class and 5-class multi-models;
- split_the_dataset_bm_multimodel.py - splits data set for different multimodels;
- models.py - python script with model architectures;
- tools.py - python script with functions used to analyze the data;
- deeplift_scores.py - python script to compute DeepLIFT importance scores using the Captum library;
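
As a rough illustration of the preprocessing that split_the_dataset.py is described as doing (one-hot encoding, optional reverse complementing, and a train/validation/test split), here is a minimal sketch. It is not the repository's code: the function names, the A/C/G/T channel order, the all-zero encoding of `N`, and the 80/10/10 split fractions are all assumptions made for the example, and plain Python lists are used only to keep it dependency-free.

```python
import random

# Assumed channel order; the actual script may order the bases differently.
ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def one_hot_encode(seq):
    """Return a len(seq) x 4 matrix; bases outside ACGT (e.g. N) become all-zero rows."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq.upper()]


def reverse_complement(seq):
    """Complement each base (A<->T, C<->G), then reverse the sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def split_dataset(sequences, labels, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (sequence, label) pairs and cut them into train/validation/test lists."""
    pairs = list(zip(sequences, labels))
    random.Random(seed).shuffle(pairs)  # seeded so the split is reproducible
    n_test = int(len(pairs) * test_frac)
    n_val = int(len(pairs) * val_frac)
    test = pairs[:n_test]
    val = pairs[n_test:n_test + n_val]
    train = pairs[n_test + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Hypothetical toy data: ten short sequences with binary labels.
    seqs = ["ACGT", "TTGA", "CCGN", "GATC", "AAAA", "CGCG", "TATA", "GGCC", "ATAT", "TGCA"]
    labs = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
    train, val, test = split_dataset(seqs, labs)
    print(len(train), len(val), len(test))
    print(one_hot_encode(reverse_complement(train[0][0])))
```

Fixing the seed keeps the split reproducible, which matters when a multi-model and the individual models need to be evaluated on the same held-out regions.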