update documentation README

epigen · Feb 21, 2024 · ecf5ef5 · ecf5ef5
1 parent 7d6eb5a
commit ecf5ef5
Show file tree

Hide file tree

Showing 14 changed files with 96 additions and 40 deletions.
diff --git a/README.md b/README.md
@@ -1,10 +1,12 @@
-# scRNA-seq Differential Expression Analysis & Visualization Snakemake Workflow powered by Seurat
-A [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow for performing differential expression analyses (DEA) of (multimodal) sc/snRNA-seq data powered by the R package [Seurat's](https://satijalab.org/seurat/index.html) functions [FindMarkers](https://satijalab.org/seurat/reference/findmarkers) and [FindAllMarkers](https://satijalab.org/seurat/reference/findallmarkers).
+# Single-cell RNA sequencing (scRNA-seq) Differential Expression Analysis & Visualization Snakemake Workflow
+A [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow for performing differential expression analyses (DEA) of processed (multimodal) scRNA-seq data powered by the R package [Seurat's](https://satijalab.org/seurat/index.html) functions [FindMarkers](https://satijalab.org/seurat/reference/findmarkers) and [FindAllMarkers](https://satijalab.org/seurat/reference/findallmarkers).
 
 This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details, instructions and modules check out the project's repository. Please consider starring and sharing modules that are useful to you, this helps me in prioritizing my efforts!
 
 **If you use this workflow in a publication, don't forget to give credits to the authors by citing the URL of this (original) repository (and its DOI, see Zenodo badge above -> coming soon).**
 
+**If you use this workflow in a publication, please don't forget to give credits to the authors by citing it using this DOI [10.5281/zenodo.XXXX]().**
+
 ![Workflow Rulegraph](./workflow/dags/rulegraph.svg)
 
 Table of contents
@@ -17,65 +19,109 @@ Table of contents
   * [Configuration](#configuration)
   * [Examples](#examples)
   * [Links](#links)
+  * [Resources](#resources)
+  * [Publications](#publications)
 
 # Authors
 - [Stephan Reichl](https://github.com/sreichl)
-
+- [Christoph Bock](https://github.com/chrbock)
 
 # Software
 This project wouldn't be possible without the following software and their dependencies:
 
 | Software       | Reference (DOI)                                   |
 | :------------: | :-----------------------------------------------: |
+| data.table     | https://r-datatable.com                           |
 | EnhancedVolcano| https://doi.org/10.18129/B9.bioc.EnhancedVolcano  |
+| future         | https://doi.org/10.32614/RJ-2021-048              |
 | ggplot2        | https://ggplot2.tidyverse.org/                    |
-| patchwork      | https://CRAN.R-project.org/package=patchwork      |
 | pheatmap       | https://cran.r-project.org/package=pheatmap       |
 | Seurat         | https://doi.org/10.1016/j.cell.2021.04.048        |
 | Snakemake      | https://doi.org/10.12688/f1000research.29032.2    |
 
 # Methods
-This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (.yaml file) or post execution. Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g. [X].
+This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (workflow/envs/\*.yaml file) or post execution in the result directory (/envs/scrnaseq_processing_seurat/\*.yaml). Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g., [X].
 
 The outlined analyses were performed using the R package Seurat (ver) [ref] unless stated otherwise.
 
-**Differential Expression Analysis (DEA).** DEA was performed on the assay [X] and data slot [X] with Seurat's FindMarkers/FindAllMarkers function using the statistical test [X] with the parameters log2(fold change) threshold of [X] and minimal percentage of expression [X]. The results were filtered for relevant features by adjusted p-value of [X], absolute log2(fold change) of [X] and minimum percentage of expression [X].
+**Differential Expression Analysis (DEA).** DEA was performed on the assay [X] and data slot [X] with Seurat's [FindMarkers|FindAllMarkers] function using the statistical test [X] with the parameters log2(fold change) threshold of [X] and minimal percentage of expression [X]. The results were filtered for relevant features by adjusted p-value of [X], absolute log2(fold change) of [X] and minimum percentage of expression [X].
 
-**Visualization.** All and filtered result statistics, i.e., number of statistically significant results split by positive (up) and negative (down) effect-sizes, were separately visualized with stacked bar plots using ggplot (ver) [ref]. 
+**Visualization.** All filtered result statistics, i.e., number of statistically significant results split by positive (up) and negative (down) effect-sizes, were separately visualized with stacked bar plots using ggplot (ver) [ref]. 
 To visually summarize results of the same analysis the filtered log2(fold change) values of features that were found to be at least in one comparison statistically significantly differentially expressed were visualized in a hierarchically clustered heatmap using pheatmap (ver) [ref]. 
 Volcano plots were generated for each analysis using EnhancedVolcano (ver) [ref] with adjusted p-value threshold of [X] and log2(fold change) threshold of [X] as visual cut-offs for the y- and x-axis, respectively.
 
 **The analysis and visualizations described here were performed using a publicly available Snakemake [ver] (ref) workflow [ref - cite this workflow here].**
 
 # Features
-The workflow perfroms the following steps.
+The workflow performs the following steps to produce the outlined results (`dea_seurat/{analysis}/`).
 - Differential Expression Analysis (DEA)
-  - using Seurat's [FindMarkers](https://satijalab.org/seurat/reference/findmarkers) or [FindAllMarkers](https://satijalab.org/seurat/reference/findallmarkers) depending on the configuration (CSV)
-  - feature list per comparison group and direction (up/down) for downstream analysis (eg enrichment analysis) (TXT)
-  - (optional) feature score tables (with two columns: "feature" and "score") per comparison group using {score_formula} for downstream analyses (eg preranked enrichment analysis) (CSV).
-- DEA result statistics: number of statistically significant results split by positive (up) and negative (down) change (CSV)
-- DEA result filtering by 
-  - statistical significance (adjusted p-value)
-  - effect-size (log 2 fold change)
-  - expression (minimum percentage of expression) in one of the comparison groups
-- Log Fold Change (LFC) matrix of filtered features by comparison groups (CSV)
-- Visualizations
-  - all and filtered DEA result statistics: number of features and direction (stacked Bar plots)
-  - Volanco plot per comparison with configured cutoffs for statistical significance and effect-size
-  - Clustered Heatmaps of the LFC matrix
+  - using Seurat's [FindMarkers](https://satijalab.org/seurat/reference/findmarkers) or [FindAllMarkers](https://satijalab.org/seurat/reference/findallmarkers) depending on the configuration (`results.csv`). This step is parallelized using the R package `future`.
+  - feature list per comparison group and direction (up/down) for downstream analysis (e.g., enrichment analysis) (`feature_lists/{group}_{up/down}_features.txt`)
+  - (optional) feature score tables (with two columns: "feature" and "score") per comparison group using `{score_formula}` for downstream analyses (e.g., preranked enrichment analysis) (`feature_lists/{group}_featureScores.csv`).
+  - results filtered according to the configured thresholds:
+    - statistical significance (adjusted p-value).
+    - effect-size (log2 fold change).
+    - expression (minimum percentage of expression) in one of the comparison groups.
+  - filtered result statistics: number of statistically significant results split by positive (up) and negative (down) change (`stats.csv`).
+- Visualizations (`dea_seurat/{analysis}/plots`)
+  - filtered result statistics: number of features and direction as bar plot (`stats.png`).
+  - volanco plots per comparison group with effect size on the x-axis and raw p-value(`*_rawp`)/adjusted p-value (`_adjp`) on the y-axis (`volcano/{feature_list}/{group}.png`).
+      - highlighting features according to configured cut-offs for statistical significance (`pCutoff`) and effect size (`FCcutoff`).
+      - (optional) highlighting features according to configured feature lists.
+  - hierarchically clustered heatmap of effect sizes (LFC) per comparison (features x comparisons) indicating statistical significance with a star '\*' (`heatmap/{feature_list}.png`).
+      - using all filtered features (FILTERED)
+      - (optional) using configured feature lists
+      - in case of more than 100 features the row labels and significance indicators (\*) are removed
 
 # Usage
 Here are some tips for the usage of this workflow:
-- perform your first run with loose filtering options/cut-offs and set all the same to see if further filtering is even necessar or useful
+- Perform your first run with loose filtering options/cut-offs and set the same for filtering and plotting to see if further filtering is even necessar or useful.
+- Try one small/simple analysis first before running all desired analyses.
 
 # Configuration
 Detailed specifications can be found here [./config/README.md](./config/README.md)
 
 # Examples
---- COMING SOON ---
+We selected a scRNA-seq data set consisting of 15 CRC samples from [Lee et al (2020) Lineage-dependent gene expression programs influence the immune landscape of colorectal cancer. Nature Genetics](https://doi.org/10.1038/s41588-020-0636-z). Downloaded from the [Weizmann Institute - Curated Cancer Cell Atlas (3CA) - Colorectal Cancer](https://www.weizmann.ac.il/sites/3CA/colorectal) section.
+- samples/patients: 15
+- cells: 21657
+- features (genes): 22276
+- preprocessed using the compatible MR.PARETO module for [scRNA-seq data processing & visualization](https://github.com/epigen/scrnaseq_processing_seurat)
+- We performed a 1 vs rest analysis using the cell type annotation ("ALL").
+- total runtime on HPC w/ SLURM (32GB RAM; only DEA with 8 cores otherwise 1 core): <25 minutes for X jobs in total
+
+A comparison of the cell type marker expression split by cell types visualized as a dot plot with the DEA results as hierarchically clustered heatmap of the effect sizes.
+
+[data source/authors](https://www.weizmann.ac.il/sites/3CA/study-data/cell-types/20086)             |  Workflow Output
+:-------------------------:|:-------------------------:
+![Cell Type Marker Dot plot](./test/data/Lee2020NatGenet/CellType_DotPlot.png)  |  ![Cell Type Marker Dot plot](./test/results/Lee2020NatGenet/dea_seurat/Lee2020NatGenet_cellTypes/plots/heatmap/CellTypeMarkers.png)
+
+We provide metadata, annotation and configuration files for this data set in ./test. The processed and prepared [Seurat RDS object](https://doi.org/10.5281/zenodo.XXXXX) has to be downloaded from Zenodo by following the instructions below.
+  ```console
+  # download Zenodo records using zenodo_get
+
+  # install zenodo_get v1.3.4
+  conda install -c conda-forge zenodo_get=1.3.4
+
+  # download the prepare Seurat RDS object
+  zenodo_get --record XXXXX --output-dir=test/data/Lee2020NatGenet/
+  ```
 
 # Links
 - [GitHub Repository](https://github.com/epigen/dea_seurat/)
 - [GitHub Page](https://epigen.github.io/dea_seurat/)
 - [Zenodo Repository (coming soon)]()
 - [Snakemake Workflow Catalog Entry](https://snakemake.github.io/snakemake-workflow-catalog?usage=epigen/dea_seurat)
+
+# Resources
+- Recommended compatible [MR.PARETO](https://github.com/epigen/mr.pareto) modules:
+  - for upstream processing (before)
+    - [scRNA-seq Data Processing & Visualization](https://github.com/epigen/scrnaseq_processing_seurat) for processing and preparing a Seurat object as input.
+  - for downstream analyses (after)
+    - [Unsupervised Analysis](https://github.com/epigen/unsupervised_analysis) to understand and visualize similarities and variations between groups using DEA results, including dimensionality reduction and cluster analysis. Useful for both group and gene level analyses.
+  - [Enrichment Analysis](https://github.com/epigen/enrichment_analysis) for biomedical interpretation of differential analysis results using prior knoweledge.
+
+
+# Publications
+The following publications successfully used this module for their analyses.
+- ...
diff --git a/config/README.md b/config/README.md
@@ -1,11 +1,11 @@
 # Configuration
 
-You need one configuration file and one annotation file to run the complete workflow. You can use the provided example as starting point. Always use absolute paths. If in doubt read the comments in the config and/or try the default values.
+You need one configuration file and one annotation file to run the complete workflow. You can use the provided example as starting point. If in doubt read the comments in the config and/or try the default values.
 
-- project configuration (config/config.yaml): different for every project/dataset and configures the analyses to be performed
+- project configuration (config/config.yaml): different for every project/dataset and configures the analyses to be performed.
 - sample annotation (sample_annotation): CSV file consisting of five columns
-    -  name: name of the dataset/analysis (tip: keep it short, but descriptive and distinctive)
-    -  data: absolute path to the input Seurat object as .rds
-    -  assay: the Seurat assay to be used (eg SCT or RNA)
-    -  metadata: column name of the metadata that should be used to group cells for comparison (eg condition)
-    -  control: name of the class/level that should be used as control in the comparison (eg untreated) or "ALL" to compare every class against the rest (eg useful to find cluster markers; one vs all)
+    -  name: name of the dataset/analysis (tip: keep it short, but descriptive and distinctive).
+    -  data: path to the input Seurat object as .rds.
+    -  assay: the Seurat assay to be used (e.g., SCT or RNA).
+    -  metadata: column name of the metadata that should be used to group cells for comparison (e.g., condition or cell_type).
+    -  control: name of the class/level that should be used as control in the comparison (e.g., untreated) or "ALL" to compare every class against the rest (e.g., useful to find cluster markers; one vs all)
diff --git a/test/data/Lee2020NatGenet/CellType_DotPlot.png b/test/data/Lee2020NatGenet/CellType_DotPlot.png
diff --git a/...NatGenet/dea_seurat/Lee2020NatGenet_cellTypes/plots/heatmap/CellTypeMarkers.png b/...NatGenet/dea_seurat/Lee2020NatGenet_cellTypes/plots/heatmap/CellTypeMarkers.png
diff --git a/workflow/Snakefile b/workflow/Snakefile
@@ -48,7 +48,7 @@ rule all:
                                  analysis = analyses,
                                  feature_list = feature_lists + ['FILTERED'],
                          ),
-        envs = expand(os.path.join(config["result_path"],'envs',module_name,'{env}.yaml'),env=['seurat','visualization']), # 'volcanos','ggplot','heatmap'
+        envs = expand(os.path.join(config["result_path"],'envs',module_name,'{env}.yaml'),env=['seurat','volcanos','ggplot','heatmap']),
         configs = os.path.join(config["result_path"],'configs',module_name,'{}_config.yaml'.format(config["project_name"])),
         annotations = os.path.join(config["result_path"],'configs',module_name,'{}_annot.csv'.format(config["project_name"])),
         feature_lists = expand(os.path.join(config["result_path"],'configs','dea_seurat','{feature_list}.txt'), feature_list = feature_lists),

diff --git a/workflow/envs/ggplot.yaml b/workflow/envs/ggplot.yaml
@@ -0,0 +1,6 @@
+channels:
+  - conda-forge
+  - defaults
+dependencies:
+  - r-ggplot2=3.4.2
+  - r-data.table=1.14.8
diff --git a/workflow/envs/visualization.yaml → workflow/envs/heatmap.yaml b/workflow/envs/visualization.yaml → workflow/envs/heatmap.yaml
@@ -3,7 +3,6 @@ channels:
   - bioconda
   - defaults
 dependencies:
-  - bioconductor-enhancedvolcano=1.12.0
   - r-ggplot2=3.4.2
   - r-ggplotify=0.1.0
   - r-pheatmap=1.0.12

diff --git a/workflow/envs/volcanos.yaml b/workflow/envs/volcanos.yaml
@@ -0,0 +1,8 @@
+channels:
+  - conda-forge
+  - bioconda
+  - defaults
+dependencies:
+  - bioconductor-enhancedvolcano=1.20.0
+  - r-data.table=1.14.10
+  - r-ggplot2=3.4.4
diff --git a/workflow/rules/dea.smk b/workflow/rules/dea.smk
@@ -48,7 +48,7 @@ rule aggregate:
         mem_mb=config.get("mem", "16000"),
     threads: config.get("threads", 1)
     conda:
-        "../envs/visualization.yaml"
+        "../envs/ggplot.yaml"
     log:
         os.path.join("logs","rules","aggregate_{analysis}.log"),
     params:

diff --git a/workflow/rules/visualize.smk b/workflow/rules/visualize.smk
@@ -18,7 +18,7 @@ rule volcanos:
         mem_mb=config.get("mem", "16000"),
     threads: config.get("threads", 1)
     conda:
-        "../envs/visualization.yaml"
+        "../envs/volcanos.yaml"
     log:
         os.path.join("logs","rules","volcanos_{analysis}_{feature_list}.log"),
     params:
@@ -47,7 +47,7 @@ rule heatmap:
         mem_mb=config.get("mem", "16000"),
     threads: config.get("threads", 1)
     conda:
-        "../envs/visualization.yaml"
+        "../envs/heatmap.yaml"
     log:
         os.path.join("logs","rules","lfc_heatmap_{analysis}_{feature_list}.log"),
     params:

diff --git a/workflow/scripts/aggregate.R b/workflow/scripts/aggregate.R
@@ -1,5 +1,4 @@
 #### load libraries & utility function 
-library("ggplot2")
 
 # source utility functions
 # source("workflow/scripts/utils.R")

diff --git a/workflow/scripts/heatmap.R b/workflow/scripts/heatmap.R
@@ -1,7 +1,5 @@
 #### load libraries & utility function 
 library("pheatmap")
-library("patchwork")
-library("ggplot2")
 library("ggplotify")
 library("reshape2")
 

diff --git a/workflow/scripts/utils.R b/workflow/scripts/utils.R
@@ -1,4 +1,5 @@
 ### Utility Functions
+library("ggplot2")
 library("data.table")
 
 # extended ggsave

diff --git a/workflow/scripts/volcanos.R b/workflow/scripts/volcanos.R
@@ -1,6 +1,5 @@
 #### load libraries & utility function 
-library("EnhancedVolcano", quietly=TRUE)
-library("patchwork", quietly=TRUE)
+library("EnhancedVolcano")
 library("ggplot2")
 
 options(ragg.max_dim = 100000) # required for large volcano panels