Repository Initialization (#1)

* initializing README
* edit language in README
* add relative link to gestalt
* update relative links in README
* modify README for correct links and instructions
* update README instructions
* add ignore file
* add functionality scripts
  these scripts are copied over from https://github.com/greenelab/tad_pathways and are only slightly modified
* add example files and example results
* add conda environment file
* add example pipelines
* add repo initialization script
* add R sessionInfo
* update gestalt path in readme
* set error to exit bash scripts
* add info to preprint
* add correct bmd evidence files
greenelab · Jan 6, 2017 · e69df96
1 parent 86a678e
commit e69df96

Showing 37 changed files with 20,146 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,3 @@
+.Rhistory
+data/
+tad_pathways_data.tar.gz
diff --git a/README.md b/README.md
@@ -0,0 +1,155 @@
+# TAD_Pathways
+
+## Leveraging TADs to identify candidate genes at GWAS signals
+
+**Gregory P. Way and Casey S. Greene - 2017**
+
+### Summary
+
+The repository contains data and instructions to implement a "TAD_Pathways"
+analysis for over 300 different trait/disease GWAS or custom SNP lists.
+
+TAD_Pathways uses the principles of topologically associating domains (TADs) to
+define where an association signal (typically a GWAS signal) can most likely
+impact gene function. We use TAD boundaries as defined by
+[Dixon et al. 2012](https://doi.org/10.1038/nature11082) and
+[hg19 Gencode genes](ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/)
+to identify which genes may be implicated. We then input this list into a
+[WebGestalt Pathways Analysis](http://webgestalt.org/) to output
+significantly associated pathways implicated by the input TAD-defined geneset.
+
+For more specific details about the method, refer to our
+[preprint](https://doi.org/10.1101/087718 "Determining causal genes from GWAS signals using topologically associating domains").
+
+### Setup
+
+Before you begin, download the necessary TAD-based index files and GWAS
+curation files and set up the Python environment:
+
+```bash
+bash initialize.sh
+
+source activate tad_pathways
+```
+
+### Examples
+
+We provide three different examples for a TAD pathways analysis pipeline. To run
+each of the analyses:
+
+```bash
+# Example using Bone Mineral Density GWAS
+bash example_pipeline_bmd.sh
+
+# Example using Type 2 Diabetes GWAS
+bash example_pipeline_t2d.sh
+
+# Example using custom input SNPs
+bash example_pipeline_custom.sh
+```
+
+### General Usage
+
+There are two ways to implement a TAD_Pathways analysis:
+
+1. GWAS
+2. Custom
+
+#### GWAS
+
+Browse the `data/gwas_tad_genes/` directory to select a GWAS file. Each file in
+this directory is a tab separated text file that includes information regarding
+each gene located within a signal TAD. The column `gene_name` is the
+comprehensive list of all implicated genes. For complete information on how
+these lists were constructed, refer to
+https://github.com/greenelab/tad_pathways. 
+
+Input this gene list directly into a
+[WebGestalt Pathway Analysis](http://webgestalt.org/) and skip to the
+[WebGestalt step](#webgestalt-pathway-analysis).
+
+#### Custom
+
+Create a comma-separated file in which the first row names each SNP list and
+the rows below hold the SNPs for that list. The file can contain many columns,
+and the columns may differ in length.
+
+E.g.: `custom_example.csv`
+
+| Group 1 | Group 2 |
+| ------- | ------- |
+| rs12345 | rs67891 |
+| rs19876 | rs54321 |
+| ...     | ...     |
+
+Then, perform the following steps:
+
+```bash
+# Map custom SNPs to genomic locations
+Rscript --vanilla scripts/build_snp_list.R \
+        --snp_file "custom_example.csv" \
+        --output_file "mapped_results.tsv"
+
+# Build TAD based genelists for each group
+python scripts/build_custom_TAD_genelist.py \
+       --snp_data_file "mapped_results.tsv" \
+       --output_file "custom_tad_genelist.tsv"
+```
+
+Skip now to the [WebGestalt step](#webgestalt-pathway-analysis).
+
+### WebGestalt Pathway Analysis
+
+Insert either the GWAS curated genelist or a column from the custom genelist
+with the following parameters:
+
+| Parameter | Input |
+| --------- | ----- |
+| Select gene ID type | *hsapiens__gene_symbol* |
+| Enrichment Analysis | *GO Analysis* |
+| GO Slim Classification | *Yes* |
+| Reference Set | *hsapiens__genome* |
+| Statistical Method | *Hypergeometric* |
+| Multiple Test Adjustment | *BH* |
+| Significance Level | *Top10* |
+| Minimum Number of Genes for a Category | *4* |
+
+Once the analysis is complete, click `Export TSV Only` and save the file as
+`gestalt/<INSERT_TRAIT_HERE>_gestalt.tsv`. 
+
+### Curation
+
+Clean and tidy the output files and summarize into convenient lists of
+candidate genes. These genes may or may not be the nearest gene to the GWAS
+signal and will require experimental validation.
+
+```bash
+# An example for Bone Mineral Density (see `example_pipeline_bmd.sh` as well)
+
+# Process WebGestalt Output saved in `data/gestalt/bmd_gestalt.tsv`
+python scripts/parse_gestalt.py --trait 'bmd' --process
+
+# Output evidence tables
+python scripts/construct_evidence.py \
+        --trait 'bmd' \
+        --gwas 'data/gwas_catalog/Bone_mineral_density_hg19.tsv' \
+        --pathway 'skeletal system development'
+
+# Summarize evidence
+python scripts/summarize_evidence.py \
+        --evidence 'results/bmd_gene_evidence.csv' \
+        --snps 'data/gwas_tad_snps/Bone_mineral_density_hg19_SNPs.tsv' \
+        --output_file 'results/BMD_evidence_summary.tsv'
+
+# Output venn diagram
+R --no-save --args 'results/bmd_gene_evidence.csv' \
+        'BMD' < scripts/integrative_summary.R
+```
+
+### Contact
+
+For questions and bug reports, please file a
+[GitHub issue](https://github.com/greenelab/tad_pathways/issues).
+
+For all other questions contact Casey Greene at [email protected] or
+Struan Grant at [email protected]
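Review note on the WebGestalt settings in the README above: the table selects a hypergeometric test with BH adjustment. WebGestalt performs both steps itself, so nothing like this ships in the repository. As a rough sketch of what those two settings compute, assuming the usual upper-tail over-representation form, the standard-library Python below may help. The function names and the toy numbers are illustrative assumptions, and `math.comb` needs Python 3.8 or newer, which is later than the version pinned in `environment.yml`.

```python
from math import comb  # available from Python 3.8


def hypergeom_sf(k, M, n, N):
    """Upper-tail probability P(X >= k).

    M: genes in the reference set, n: reference genes in the category,
    N: genes in the input list, k: input genes that fall in the category.
    """
    total = comb(M, N)
    upper = min(n, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy numbers only; these are not results from the repository.
p = hypergeom_sf(k=1, M=10, n=5, N=2)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

With SciPy available, `scipy.stats.hypergeom.sf(k - 1, M, n, N)` should give the same tail probability as `hypergeom_sf(k, M, n, N)`.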
diff --git a/custom_example.csv b/custom_example.csv
@@ -0,0 +1,103 @@
+prostate_cancer
+rs10009409
+rs1016343
+rs10187424
+rs103294
+rs1041449
+rs10486567
+rs10875943
+rs10896449
+rs10934853
+rs10936632
+rs10993994
+rs11135910
+rs11214775
+rs11228565
+rs115306967
+rs115457135
+rs11568818
+rs11650494
+rs11902236
+rs12051443
+rs12155172
+rs1218582
+rs12480328
+rs12500426
+rs12621278
+rs12653946
+rs1270884
+rs130067
+rs1327301
+rs13385191
+rs1447295
+rs1456315
+rs1465618
+rs1512268
+rs16901979
+rs17021918
+rs17181170
+rs17599629
+rs17694493
+rs1775148
+rs1859962
+rs188140481
+rs1894292
+rs1933488
+rs1983891
+rs2121875
+rs2238776
+rs2242652
+rs2273669
+rs2405942
+rs2427345
+rs2660753
+rs2735839
+rs2807031
+rs3096702
+rs3123078
+rs339331
+rs3771570
+rs3850699
+rs4242382
+rs4245739
+rs4430796
+rs4713266
+rs4844289
+rs4962416
+rs56232506
+rs5759167
+rs5919432
+rs5945572
+rs6062509
+rs636291
+rs6465657
+rs651164
+rs6545977
+rs6625711
+rs6763931
+rs684232
+rs6869841
+rs6983267
+rs7127900
+rs7130881
+rs7141529
+rs7153648
+rs7210100
+rs721048
+rs7241993
+rs7501939
+rs7584330
+rs7611694
+rs76934034
+rs7837688
+rs7931342
+rs8008270
+rs80130819
+rs8014671
+rs8102476
+rs817826
+rs9284813
+rs9287719
+rs9364554
+rs9443189
+rs9600079
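Review note on `custom_example.csv` above: the file is a single-column instance of the format the README describes, with a header naming the group and one SNP per row. The sketch below shows one way to read that layout with several columns of unequal length into a mapping of group name to SNP IDs. It is not the parsing done by `scripts/build_snp_list.R`; the function name `read_snp_groups` and the `second_group` column are hypothetical, and the second column reuses the placeholder IDs from the README table.

```python
import csv
import io


def read_snp_groups(handle):
    """Return {group_name: [rsid, ...]} from a column-oriented CSV."""
    reader = csv.reader(handle)
    header = [name.strip() for name in next(reader)]
    groups = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            cell = cell.strip()
            if cell:  # shorter columns leave empty cells in later rows
                groups[name].append(cell)
    return groups


# First column: the first rows of custom_example.csv.
# Second column: hypothetical group built from the README placeholder IDs.
example = io.StringIO(
    "prostate_cancer,second_group\n"
    "rs10009409,rs12345\n"
    "rs1016343,rs67891\n"
    "rs10187424,\n"
)
groups = read_snp_groups(example)
```

Passing an open file handle instead of the `io.StringIO` object reads a real file the same way.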
diff --git a/environment.yml b/environment.yml
@@ -0,0 +1,7 @@
+name: tad_pathways
+dependencies:
+- python=3.5.2
+- pandas=0.18.0
+- numexpr=2.5.2
+- numpy=1.11.1
+- scipy=0.17.1
diff --git a/example_pipeline_bmd.sh b/example_pipeline_bmd.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+set -o errexit
+
+# Example of a TAD_Pathways Analysis applied to Bone Mineral Density GWAS
+
+# After saving WebGestalt tsv file, parse its contents
+python scripts/parse_gestalt.py --trait 'bmd'
+
+# Construct an evidence file: is each gene the nearest gene to a GWAS signal or not
+python scripts/construct_evidence.py \
+        --trait 'bmd' \
+        --gwas 'data/gwas_catalog/Bone_mineral_density_hg19.tsv' \
+        --pathway 'skeletal system development'
+
+# Summarize the evidence file
+python scripts/summarize_evidence.py \
+        --evidence 'results/bmd_gene_evidence.csv' \
+        --snps 'data/gwas_tad_snps/Bone_mineral_density_hg19_SNPs.tsv' \
+        --output_file 'results/bmd_gene_evidence_summary.tsv'
+
+# Visualize overlap in TAD pathways curation
+R --no-save --args 'results/bmd_gene_evidence.csv' \
+        < scripts/integrative_summary.R 
diff --git a/example_pipeline_custom.sh b/example_pipeline_custom.sh
@@ -0,0 +1,37 @@
+#!/bin/bash
+
+set -o errexit
+
+# Example of a TAD_Pathways Analysis applied to a Custom SNP list
+# For this example, the custom SNP list is the GWAS findings for
+# Prostate Cancer. The data is used as a custom input.
+
+# Map SNPs to genomic location
+Rscript --vanilla scripts/build_snp_list.R \
+        --snp_file 'custom_example.csv' \
+        --output_file 'results/custom_example_location.tsv'
+
+# Build a customized genelist to input into WebGestalt
+python scripts/build_custom_tad_genelist.py \
+        --snp_data_file 'results/custom_example_location.tsv' \
+        --output_file 'results/custom_example_tad_results.tsv'
+
+# After saving WebGestalt tsv file, parse its contents
+python scripts/parse_gestalt.py --trait 'custom'
+
+# Construct an evidence file: is each gene the nearest gene to a GWAS signal or not
+python scripts/construct_evidence.py \
+            --trait 'custom' \
+            --gwas 'results/custom_example_tad_results_nearest_gene.tsv' \
+            --pathway 'epidermis development,antigen processing and presentation'
+
+# Summarize the evidence file
+python scripts/summarize_evidence.py \
+            --evidence 'results/custom_gene_evidence.csv' \
+            --snps 'results/custom_example_tad_results.tsv' \
+            --output_file 'results/custom_gene_evidence_summary.tsv'
+
+# Visualize overlap in TAD pathways curation
+R --no-save --args 'results/custom_gene_evidence.csv' \
+                < scripts/integrative_summary.R 
+
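Review note on the custom pipeline above: the "Build a customized genelist" step relies on the idea stated in the README, that a SNP implicates the genes sharing its TAD. The sketch below illustrates that lookup on toy coordinates. It is not the code in `scripts/build_custom_tad_genelist.py`; the `Tad` and `Gene` records, the function names, the half-open intervals, and the choice to count any gene that overlaps the TAD are all assumptions made for illustration, and TADs are assumed not to overlap each other.

```python
import bisect
from collections import namedtuple

Tad = namedtuple("Tad", ["chrom", "start", "end"])
Gene = namedtuple("Gene", ["name", "chrom", "start", "end"])


def build_index(tads):
    """Group TADs by chromosome and sort each group by start position."""
    index = {}
    for tad in tads:
        index.setdefault(tad.chrom, []).append(tad)
    for chrom_tads in index.values():
        chrom_tads.sort(key=lambda t: t.start)
    return index


def find_tad(index, chrom, pos):
    """Return the TAD containing pos, or None if pos lies outside every TAD."""
    chrom_tads = index.get(chrom, [])
    starts = [t.start for t in chrom_tads]
    i = bisect.bisect_right(starts, pos) - 1
    if i >= 0 and chrom_tads[i].start <= pos < chrom_tads[i].end:
        return chrom_tads[i]
    return None


def genes_in_tad(tad, genes):
    """Names of genes on the TAD's chromosome that overlap its interval."""
    return [g.name for g in genes
            if g.chrom == tad.chrom and g.start < tad.end and g.end > tad.start]


# Toy coordinates and gene names, not real hg19 annotations.
tads = [Tad("chr1", 0, 1000), Tad("chr1", 1000, 2500), Tad("chr2", 0, 500)]
genes = [
    Gene("GENE_A", "chr1", 100, 300),
    Gene("GENE_B", "chr1", 900, 1200),
    Gene("GENE_C", "chr1", 2000, 2400),
    Gene("GENE_D", "chr2", 50, 80),
]
index = build_index(tads)
signal_tad = find_tad(index, "chr1", 1500)
implicated = genes_in_tad(signal_tad, genes)
```

Rebuilding the `starts` list on every lookup keeps the sketch short; for many SNPs it would be cheaper to store the start positions once per chromosome.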
diff --git a/example_pipeline_t2d.sh b/example_pipeline_t2d.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+set -o errexit
+
+# Example of a TAD_Pathways Analysis applied to Type 2 Diabetes GWAS
+
+# After saving WebGestalt tsv file, parse its contents
+python scripts/parse_gestalt.py --trait 't2d'
+
+# Construct an evidence file: is each gene the nearest gene to a GWAS signal or not
+python scripts/construct_evidence.py \
+        --trait 't2d' \
+        --gwas 'data/gwas_catalog/Type_2_diabetes_hg19.tsv' \
+        --pathway 'peptide hormone secretion'
+
+# Summarize the evidence file
+python scripts/summarize_evidence.py \
+        --evidence 'results/t2d_gene_evidence.csv' \
+        --snps 'data/gwas_tad_snps/Type_2_diabetes_hg19_SNPs.tsv' \
+        --output_file 'results/t2d_gene_evidence_summary.tsv'
+
+# Visualize overlap in TAD pathways curation
+R --no-save --args 'results/t2d_gene_evidence.csv' \
+        < scripts/integrative_summary.R