KnowEnG's Gene Signature Pipeline

This is the Signature Analysis Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence.

This pipeline performs network-based signature analysis on the columns of a given spreadsheet, where the spreadsheet's columns correspond to sample labels and its rows correspond to gene labels. The signature is based on correlating gene expression data (optionally network-enriched) against known gene signature data.

There are four similarity "signature" methods that one can choose from:

  • similarity (traditional method)
  • net_similarity (with network enrichment)
  • cc_similarity (with bootstraps)
  • cc_net_similarity (with bootstraps and network enrichment)

and two correlation measures:

  • spearman
  • cosine
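
As a rough illustration of the two measures, the sketch below computes Spearman and cosine similarity between one sample column and one signature column using only the Python standard library. The function names and the toy vectors are hypothetical and are not taken from the pipeline's source, which may treat ties and missing values differently.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman_similarity(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; undefined for all-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


# Hypothetical values, one entry per gene, in the same gene order.
sample = [2.0, 4.0, 6.0, 8.0]
signature = [1.0, 3.0, 2.0, 10.0]
print(spearman_similarity(sample, signature))
print(cosine_similarity(sample, signature))
```

Spearman compares only the ordering of genes, so it is insensitive to monotone rescaling of either column, whereas cosine similarity depends on the actual magnitudes.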

How to run this pipeline with Our data


1. Clone the Gene_Signature_Pipeline Repo

 git clone https://github.com/KnowEnG-Research/Gene_Signature_Pipeline.git

2. Install the following (Ubuntu or Linux)

apt-get install -y python3-pip
apt-get install -y libfreetype6-dev libxft-dev
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran

pip3 install pyyaml
pip3 install knpackage
pip3 install scipy==0.18.0
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install matplotlib==1.4.2
pip3 install scikit-learn==0.17.1

3. Change directory to Gene_Signature_Pipeline

cd Gene_Signature_Pipeline

4. Change directory to test

cd test

5. Create a local directory "run_dir" and place all the run files in it

make env_setup

6. Use one of the following "make" commands to select and run a similarity option:

| Command | Option |
|---|---|
| make run_spearman | spearman similarity |
| make run_net_spearman | spearman similarity with network enrichment |
| make run_cc_spearman | spearman similarity with bootstraps |
| make run_cc_net_spearman | spearman similarity with bootstraps & network enrichment |

How to run this pipeline with Your data


Follow steps 1-3 above, then do the following:

* Create your run directory

mkdir run_directory

* Change directory to the run_directory

cd run_directory

* Create your results directory

mkdir results_directory

* Create run_parameters file (YAML Format)

Example run_parameters files are in Gene_Signature_Pipeline/data/run_files, for instance zTEMPLATE_cc_net_spearman.yml.

* Modify run_parameters file (YAML Format)

Set processing_method to serial or parallel, depending on your machine:

processing_method: serial

Set the data file targets to the files you want to run, and adjust the other parameters as appropriate for your data.
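
Before launching a run, it can help to sanity-check the edited parameters. The sketch below is one possible check, written against a plain dictionary such as the one a YAML loader would return. The key names and allowed values come from the run_parameters description in this README; which keys count as mandatory, the rule that network methods need a network file, the function name, and the example paths are all assumptions made for illustration.

```python
# Allowed values as listed in this README's run_parameters description.
VALID_METHODS = {"similarity", "cc_similarity", "net_similarity", "cc_net_similarity"}
VALID_PROCESSING = {"serial", "parallel", "distribute"}

# Assumption: these keys are treated as mandatory for every method.
REQUIRED_KEYS = (
    "method",
    "similarity_measure",
    "spreadsheet_name_full_path",
    "signature_name_full_path",
    "results_directory",
)

# Assumption: only the network-enriched methods need the network file.
NETWORK_METHODS = {"net_similarity", "cc_net_similarity"}


def check_run_parameters(params):
    """Return a list of problems found; an empty list means none were detected."""
    problems = ["missing key: " + key for key in REQUIRED_KEYS if key not in params]

    method = params.get("method")
    if method is not None and method not in VALID_METHODS:
        problems.append("unknown method: " + str(method))
    if method in NETWORK_METHODS and "gg_network_name_full_path" not in params:
        problems.append("network method without gg_network_name_full_path")

    processing = params.get("processing_method")
    if processing is not None and processing not in VALID_PROCESSING:
        problems.append("unknown processing_method: " + str(processing))
    return problems


# Hypothetical parameter set; every path below is a placeholder.
example_params = {
    "method": "cc_net_similarity",
    "similarity_measure": "spearman",
    "gg_network_name_full_path": "./my_network.edge",
    "spreadsheet_name_full_path": "./my_spreadsheet.tsv",
    "signature_name_full_path": "./my_signatures.tsv",
    "results_directory": "./results_directory",
    "processing_method": "serial",
}
print(check_run_parameters(example_params))
```

The check only inspects names and values; it does not confirm that the files exist or that their gene labels match.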

* Run the Gene Signature Pipeline:

  • Update PYTHONPATH environment variable
export PYTHONPATH='../src':$PYTHONPATH
  • Run
python3 ../src/gene_signature.py -run_directory ./run_dir -run_file zTEMPLATE_cc_net_spearman.yml

Description of "run_parameters" file


| Key | Value | Comments |
|---|---|---|
| method | similarity, cc_similarity, net_similarity or cc_net_similarity | Choose similarity method |
| similarity_measure | spearman, cos | Choose correlation measure |
| gg_network_name_full_path | directory+gg_network_name | Path and file name of the 4-column network file |
| spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user-supplied gene sets |
| signature_name_full_path | directory+signature_data_name | Path and file name of user-supplied signature data |
| results_directory | directory | Directory to save the output files |
| tmp_directory | directory | Directory to save the intermediate files |
| rwr_max_iterations | 100 | Maximum number of iterations without convergence in random walk with restart |
| rwr_convergence_tolerence | 1.0e-8 | Frobenius norm tolerance of spreadsheet vector in random walk |
| rwr_restart_probability | 0.7 | alpha in V_(n+1) = alpha * N * Vn + (1-alpha) * Vo |
| rows_sampling_fraction | 0.8 | Select 80% of spreadsheet rows |
| number_of_bootstraps | 4 | Number of random samplings |
| processing_method | serial, parallel or distribute | Choose processing method |

gg_network_name = STRING_experimental_gene_gene.edge
spreadsheet_name = ProGENI_rwr20_STExp_GDSC_500.rname.gxc.tsv
signature_data_name =
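
The three rwr_* parameters interact through the update rule V_(n+1) = alpha * N * Vn + (1-alpha) * Vo given above. The following is a minimal sketch of that iteration for a single column vector, following the formula as written in this README. The function name, the three-gene toy network, and the choice of a column-normalized N are assumptions for illustration; the pipeline itself may normalize the network and apply the tolerance differently.

```python
import math


def random_walk_with_restart(network, restart_vector, alpha=0.7,
                             max_iterations=100, tolerance=1.0e-8):
    """Iterate V_(n+1) = alpha * N * V_n + (1 - alpha) * V_0.

    network is a square matrix given as a list of rows.
    Returns the final vector and the number of iterations performed.
    """
    v0 = list(restart_vector)
    v = list(v0)
    iterations = 0
    for _ in range(max_iterations):
        iterations += 1
        # Matrix-vector product N * V_n.
        walked = [sum(n_ij * v_j for n_ij, v_j in zip(row, v)) for row in network]
        v_next = [alpha * w + (1.0 - alpha) * s for w, s in zip(walked, v0)]
        # For a single vector the Frobenius norm is the Euclidean norm.
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(v_next, v)))
        v = v_next
        if change < tolerance:
            break
    return v, iterations


# Hypothetical 3-gene path network (gene0 - gene1 - gene2), columns sum to 1.
toy_network = [
    [0.0, 0.5, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
start = [1.0, 0.0, 0.0]  # all initial weight on gene0

smoothed, used = random_walk_with_restart(toy_network, start)
print(smoothed, used)
```

With a column-normalized network the total weight of the vector is preserved, and rwr_max_iterations acts only as a safety cap for cases where the tolerance is not reached.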


Description of Output files saved in results directory


  • All four methods save a samples-by-signatures similarity ("correlation") matrix to a file named similarity_matrix_{method}{measure}{timestamp}_viz.tsv.

| | signature 1 | ... | signature m |
|---|---|---|---|
| sample 1 | float | ... | float |
| ... | ... | ... | ... |
| sample n | float | ... | float |
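
A result file with this layout can be post-processed with the standard csv module. The sketch below picks, for each sample, the signature with the highest similarity. It assumes a tab-separated file whose first header cell is empty or holds an index label; the function name and the in-memory example data are hypothetical.

```python
import csv
import io


def best_signature_per_sample(tsv_file):
    """Map each sample label to its (signature label, score) pair with the highest score."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader)
    signatures = header[1:]  # the first header cell belongs to the sample column
    best = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        sample = row[0]
        scores = [float(cell) for cell in row[1:]]
        top = max(range(len(scores)), key=scores.__getitem__)
        best[sample] = (signatures[top], scores[top])
    return best


# Hypothetical file content standing in for a real similarity matrix file.
example = io.StringIO(
    "\tsignature_1\tsignature_2\n"
    "sample_1\t0.12\t0.87\n"
    "sample_2\t0.55\t-0.30\n"
)
result = best_signature_per_sample(example)
print(result)
```

For a file on disk, pass a handle opened with `open(path, newline="")` in place of the StringIO object.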