Merge pull request #808 from AlexsLemonade/jaclyn-taroni/711-use-rms-de

Add scAdvanced GSEA notebook that uses pseudobulk RMS DE results
AlexsLemonade · Nov 25, 2024 · 6d4bf3c · 6d4bf3c
2 parents 8138bf4 + fd4e209
commit 6d4bf3c
Show file tree

Hide file tree

Showing 8 changed files with 478 additions and 5,738 deletions.
diff --git a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd
@@ -0,0 +1,289 @@
+---
+title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)"
+output:
+  html_notebook:
+    toc: true
+    toc_float: true
+author: CCDL for ALSF
+date: 2024
+---
+
+## Objectives
+
+This notebook will demonstrate how to:
+
+- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA)
+- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package
+- Perform GSEA with the `clusterProfiler` package
+- Visualize GSEA results with the `enrichplot` package
+
+---
+
+In this notebook, we'll analyze the differential expression results from the last notebook.
+
+GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).
+The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.
+
+There are 3 general steps in FCS methods ([Khatri _et al._ 2012](https://doi.org/10.1371/journal.pcbi.1002375)):
+
+1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our DESeq2 results)
+2. Aggregate gene-level statistics into a pathway-level statistic
+3. Assess the statistical significance of the pathway-level statistic
+
+#### Other resources
+
+* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/Training-modules/DGE-functional-analysis/lessons/02_functional_analysis.html)
+* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood.
+You can read more about `fgsea` in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012).
+* See the [refine.bio examples for "Gene set enrichment analysis - RNA-seq"](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted.
+
+## Set up
+
+### Libraries
+
+```{r libraries}
+# Package to run GSEA
+library(clusterProfiler)
+# Package that contains the MSigDB gene sets in tidy format
+library(msigdbr)
+```
+
+### Directories and Files
+
+#### Directories
+
+```{r create_dir, live = TRUE}
+# We'll use the marker genes as GSEA input
+rms_analysis_dir <- file.path("analysis", "rms")
+
+# We'll create a directory to specifically hold the pathway results if it doesn't
+# exist yet
+results_dir <- file.path(rms_analysis_dir, "pathway-analysis")
+fs::dir_create(results_dir)
+```
+
+#### Input files
+
+
+```{r input_files}
+input_file <- file.path(rms_analysis_dir,
+                        "deseq",
+                        "rms_myoblast_deseq_results.tsv")
+```
+
+#### Output files
+
+We'll save our table of GSEA results as a TSV.
+
+```{r output_files}
+output_file <- file.path(results_dir,
+                         "rms_myoblast_gsea_results.tsv")
+```
+
+## Gene sets
+
+We will use gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) from the Broad Institute ([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102)).
+The [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package contains MSigDB datasets already in the tidy format required by `clusterProfiler` and supports multiple organisms.
+
+Let's take a look at what organisms the package supports.
+
+```{r show_species}
+msigdbr_species()
+```
+
+MSigDB contains 8 different gene set collections.
+
+    H: hallmark gene sets
+    C1: positional gene sets
+    C2: curated gene sets
+    C3: motif gene sets
+    C4: computational gene sets
+    C5: GO gene sets
+    C6: oncogenic signatures
+    C7: immunologic signatures
+
+We'll use the Hallmark collection for GSEA.
+Here's an excerpt of the [collection description](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H):
+
+> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.
+
+Notably, there are only 50 gene sets included in this collection.
+The fewer gene sets we test, the lower our multiple hypothesis testing burden.
+
+We can retrieve only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function.
+
+```{r immunologic_sets, live = TRUE}
+hs_hallmarks_df <- msigdbr(species = "Homo sapiens",
+                           category = "H")
+```
+
+## Gene Set Enrichment Analysis
+
+_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_
+
+![](diagrams/subramanian_fig1.jpg)
+
+**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).**
+
+GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic.
+This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis))
+
+Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated:
+Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway.
+The enrichment score for a pathway is the running sum's maximum deviation from zero.
+GSEA also assesses statistical significance of the scores for each pathway through permutation testing.
+As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).
+
+The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets.
+When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.
+
+## DESeq2 results
+
+```{r read_in_markers, live = TRUE}
+deseq_df <- readr::read_tsv(input_file)
+```
+
+```{r deseq_head}
+head(deseq_df)
+```
+
+This data frame uses Ensembl gene identifiers.
+We'll need to make sure our gene sets use the same identifiers.
+Let's take a look at the first few rows of the data frame that contains the hallmark gene sets.
+
+```{r hallmark_head, live = TRUE}
+head(hs_hallmarks_df)
+```
+
+We can see that the gene sets from `msigdbr` have Ensembl gene identifiers associated with them, so we don't need to do any conversion.
+However, we'll need to pass the correct column to the function that runs GSEA.
+
+If we needed to do gene identifier conversion, we would likely use the `AnnotationDbi` package.
+You can see an example in this Harvard Chan Bioinformatics Core Training material: <https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/AnnotationDbi_lesson.html>
+
+One good thing about Ensembl gene identifiers is that they are less likely to be duplicated than, for example, gene symbols.
+(Multiple Ensembl gene identifiers can map to the same symbol.)
+
+The GSEA approach requires on discriminating between genes that are in a gene set and those that are not.
+Practically speaking, gene sets are just collections of gene identifiers!
+When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.
+So, let's check for duplicates in the data frame of DESeq2 results.
+
+```{r check_duplicates, live = TRUE}
+any(duplicated(deseq_df$ensembl_id))
+```
+
+There are no duplicates for us to worry about!
+
+### Pre-ranked list
+
+The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
+This is step 1 -- gene-level statistics.
+
+```{r lfc_vector}
+lfc_vector <- deseq_df |>
+  # Extract a vector of `log2FoldChange` named by `ensembl_id`
+  dplyr::pull(log2FoldChange, name = ensembl_id)
+lfc_vector <- sort(lfc_vector, decreasing = TRUE)
+```
+
+Let's look at the top ranked values.
+
+```{r head_lfc, live = TRUE}
+# Look at first entries of the log fold change vector
+head(lfc_vector)
+```
+
+And the bottom of the list.
+
+```{r tail_lfc, live = TRUE}
+# Look at the last entries of the log fold change vector
+tail(lfc_vector)
+```
+
+## Run GSEA
+
+Now for the analysis!
+
+We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`).
+
+```{r run_gsea}
+gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
+                     minGSSize = 25,  # minimum gene set size
+                     maxGSSize = 500,  # maximum gene set set
+                     pvalueCutoff = 0.05,
+                     pAdjustMethod = "BH",  # correction for multiple hypothesis testing
+                     TERM2GENE = dplyr::select(hs_hallmarks_df,
+                                               gs_name,
+                                               ensembl_gene))  # pass the correct identifier column
+```
+Let's take a look at the GSEA results.
+
+```{r view_gsea, live = TRUE, eval = FALSE}
+View(gsea_results@result |>
+       dplyr::arrange(dplyr::desc(NES))
+)
+```
+
+Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.
+
+Pathways with significant, highly positive NES are enriched in ERMS myoblasts, whereas pathways with significant, highly negative NES are enriched in ARMS myoblasts.
+
+Let's write these results to file.
+
+```{r write_gsea}
+gsea_results@result |> readr::write_tsv(output_file)
+```
+
+### Visualizing GSEA results
+
+We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`.
+Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated.
+
+#### Highly Positive NES
+
+Let's take look at a pathway with a highly positive NES (`HALLMARK_MYC_TARGETS_V2`) using a GSEA plot.
+
+```{r highly_pos}
+enrichplot::gseaplot(gsea_results,
+                     geneSetID = "HALLMARK_MYC_TARGETS_V2",
+                     title = "HALLMARK_MYC_TARGETS_V2",
+                     color.line = "#0066FF")
+```
+
+Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.
+
+#### Highly Negative NES
+
+The gene set `HALLMARK_MYOGENESIS` had a highly negative NES.
+
+```{r highly_neg}
+enrichplot::gseaplot(gsea_results,
+                     geneSetID = "HALLMARK_MYOGENESIS",
+                     title = "HALLMARK_MYOGENESIS",
+                     color.line = "#0066FF")
+```
+
+This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph.
+
+#### A non-significant result
+
+The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name.
+Let's look at `HALLMARK_P53_PATHWAY`, which was not in the results we viewed earlier.
+
+```{r p53, live = TRUE}
+enrichplot::gseaplot(gsea_results,
+                     geneSetID = "HALLMARK_P53_PATHWAY",
+                     title = "HALLMARK_P53_PATHWAY",
+                     color.line = "#0066FF")
+```
+
+Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score.
+
+*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.*
+
+## Session Info
+
+```{r session_info}
+sessionInfo()
+```