From a555c2264da9fb217b83324872c2d2307d707e68 Mon Sep 17 00:00:00 2001 From: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com> Date: Mon, 25 Nov 2024 09:56:10 -0500 Subject: [PATCH 1/7] Add renumbered version of GSEA notebook that uses pseudobulk DE results --- .../04-gene_set_enrichment_analysis.Rmd | 286 ++ .../04-gene_set_enrichment_analysis.nb.html | 3501 +++++++++++++++++ 2 files changed, 3787 insertions(+) create mode 100644 scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd create mode 100644 scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html diff --git a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd new file mode 100644 index 00000000..0f300979 --- /dev/null +++ b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd @@ -0,0 +1,286 @@ +--- +title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)" +output: + html_notebook: + toc: true + toc_float: true +author: CCDL for ALSF +date: 2024 +--- + +## Objectives + +This notebook will demonstrate how to: + +- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA) +- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package +- Perform GSEA with the `clusterProfiler` package +- Visualize GSEA results with the `enrichplot` package + +--- + +In this notebook, we'll analyze the differential expression results from the last notebook. + +GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102). +The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest. + +There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)): + +1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our DESeq2 results) +2. Gene-level statistics are aggregated into a pathway-level statistic +3. Assess the statistical significance of the pathway-level statistic + +#### Other resources + +* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html) +* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood. +You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012). +* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted. + +## Set up + +### Libraries + +```{r libraries} +# Package to run GSEA +library(clusterProfiler) +# Package that contains the MSigDB gene sets in tidy format +library(msigdbr) +``` + +### Directories and Files + +#### Directories + +```{r create_dir, live = TRUE} +# We'll use the marker genes as GSEA input +rms_analysis_dir <- file.path("analysis", "rms") + +# We'll create a directory to specifically hold the pathway results if it doesn't +# exist yet +results_dir <- file.path(rms_analysis_dir, "pathway-analysis") +fs::dir_create(results_dir) +``` + +#### Input files + + +```{r input_files} +input_file <- file.path(rms_analysis_dir, + "deseq", + "rms_myoblast_deseq_results.tsv") +``` + +#### Output files + +We'll save our table of GSEA results as a TSV. + +```{r output_files} +output_file <- file.path(results_dir, + "rms_myoblast_gsea_results.tsv") +``` + +## Gene sets + +We will use gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) from the Broad Institute ([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102)). +The [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package contains MSigDB datasets already in the tidy format required by `clusterProfiler` and supports multiple organisms. + +Let's take a look at what organisms the package supports. + +```{r show_species} +msigdbr_species() +``` + +MSigDB contains 8 different gene set collections. + + H: hallmark gene sets + C1: positional gene sets + C2: curated gene sets + C3: motif gene sets + C4: computational gene sets + C5: GO gene sets + C6: oncogenic signatures + C7: immunologic signatures + +We'll use the Hallmark collection for GSEA. +Here's an excerpt of the [collection description](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H): + +> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA. + +Notably, there are only 50 gene sets included in this collection. +The fewer gene sets we test, the lower our multiple hypothesis testing burden. + +We can retrieve only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function. + +```{r immunologic_sets, live = TRUE} +hs_hallmarks_df <- msigdbr(species = "Homo sapiens", + category = "H") +``` + +## Gene Set Enrichment Analysis + +_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_ + +![](diagrams/subramanian_fig1.jpg) + +**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).** + +GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. +This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)) + +Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated: +Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway. +The enrichment score for a pathway is the running sum's maximum deviation from zero. +GSEA also assesses statistical significance of the scores for each pathway through permutation testing. +As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)). + +The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets. +When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked. + +## DESeq2 results + +```{r read_in_markers, live = TRUE} +deseq_df <- readr::read_tsv(input_file) +``` + +```{r deseq_head} +head(deseq_df) +``` + +This data frame uses Ensembl gene identifiers. +We'll need to make sure our gene sets use the same identifiers. +Let's take a look at the first few rows of the data frame that contains the hallmark genesets. + +```{r hallmark_head, live = TRUE} +head(hs_hallmarks_df) +``` + +We can see that the gene sets from `msigdbr` have Ensembl gene identifiers associated with them, so we don't need to do any conversion. +However, we'll need to pass the correct column to the function that runs GSEA. + +One good thing about Ensembl gene identifiers is that they are less likely to be duplicated than, for example, gene symbols. +(Multiple Ensembl gene identifiers can map to the same symbol.) + +The GSEA approach requires on discriminating between genes that are in a gene set and those that are not. +Practically speaking, gene sets are just collections of gene identifiers! +When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results. +So, let's check for duplicates in the data frame of DESeq2 results. + +```{r check_duplicates, live = TRUE} +any(duplicated(deseq_df$ensembl_id)) +``` + +There are no duplicates for us to worry about! + +### Pre-ranked list + +The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers. +This is step 1 -- gene-level statistics. + +```{r lfc_vector} +lfc_vector <- deseq_df |> + # Extract a vector of `log2FoldChange` named by `ensembl_id` + dplyr::pull(log2FoldChange, name = ensembl_id) +lfc_vector <- sort(lfc_vector, decreasing = TRUE) +``` + +Let's look at the top ranked values. + +```{r head_lfc, live = TRUE} +# Look at first entries of the log fold change vector +head(lfc_vector) +``` + +And the bottom of the list. + +```{r tail_lfc, live = TRUE} +# Look at the last entries of the log fold change vector +tail(lfc_vector) +``` + +## Run GSEA + +Now for the analysis! + +We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`). + +```{r run_gsea} +gsea_results <- GSEA(geneList = lfc_vector, # ordered ranked gene list + minGSSize = 25, # minimum gene set size + maxGSSize = 500, # maximum gene set set + pvalueCutoff = 0.05, + pAdjustMethod = "BH", # correction for multiple hypothesis testing + TERM2GENE = dplyr::select(hs_hallmarks_df, + gs_name, + ensembl_gene)) # pass the correct identifier column +``` +Let's take a look at the GSEA results. + +```{r view_gsea, live = TRUE, eval = FALSE} +View(gsea_results@result |> + dplyr::arrange(dplyr::desc(NES)) +) +``` + +Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable. + +Pathways with significant, highly positive NES are enriched in ERMS myoblasts, whereas pathways with significant, highly negative NES are enriched in ARMS myoblasts. + +Let's write these results to file. + +```{r write_gsea} +gsea_results@result |> readr::write_tsv(output_file) +``` + +### Visualizing GSEA results + +We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`. +Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated. + +#### Highly Positive NES + +Let's take look at a pathway with a highly positive NES (`HALLMARK_MYC_TARGETS_V2`) using a GSEA plot. + +```{r highly_pos} +enrichplot::gseaplot(gsea_results, + geneSetID = "HALLMARK_MYC_TARGETS_V2", + title = "HALLMARK_MYC_TARGETS_V2", + color.line = "#0066FF") +``` + +Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores. + +#### Highly Negative NES + +The gene set `HALLMARK_MYOGENESIS` had a highly negative NES. + +```{r highly_neg} +enrichplot::gseaplot(gsea_results, + geneSetID = "HALLMARK_MYOGENESIS", + title = "HALLMARK_MYOGENESIS", + color.line = "#0066FF") +``` + +This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph. + +#### A non-significant result + +The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name. +Let's look at `HALLMARK_P53_PATHWAY`, which was not in the results we viewed earlier. + +```{r p53, live = TRUE} +enrichplot::gseaplot(gsea_results, + geneSetID = "HALLMARK_P53_PATHWAY", + title = "HALLMARK_P53_PATHWAY", + color.line = "#0066FF") +``` + +Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score. + +*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.* + +## Session Info + +```{r session_info} +sessionInfo() +``` diff --git a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html new file mode 100644 index 00000000..618e4c60 --- /dev/null +++ b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html @@ -0,0 +1,3501 @@ + + + + + + + + + + + + + + + +Pathway analysis: Gene Set Enrichment Analysis (GSEA) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + +
+

Objectives

+

This notebook will demonstrate how to:

+
    +
  • Prepare tabular data of gene-level statistics for use with Gene Set +Enrichment Analysis (GSEA)
  • +
  • Access Molecular +Signatures Database gene set collections via the +msigdbr package
  • +
  • Perform GSEA with the clusterProfiler package
  • +
  • Visualize GSEA results with the enrichplot package
  • +
+
+

In this notebook, we’ll analyze the differential expression results +from the last notebook.

+

GSEA is a functional class scoring (FCS) approach to pathway analysis +that was first introduced in Subramanian et +al. (2005). The rationale behind FCS approaches is that small +changes in individual genes that participate in the same biological +process or pathway can be significant and of biological interest.

+

There are 3 general steps in FCS methods (Khatri et +al. 2012):

+
    +
  1. Calculate a gene-level statistic (here, we’ll use the summary log +fold changes in our DESeq2 results)
  2. +
  3. Gene-level statistics are aggregated into a pathway-level +statistic
  4. +
  5. Assess the statistical significance of the pathway-level +statistic
  6. +
+
+

Other resources

+ +
+
+
+

Set up

+
+

Libraries

+ + + +
# Package to run GSEA
+ + +
Warning message:
+replacing previous import ‘S4Arrays::makeNindexFromArrayViewport’ by ‘DelayedArray::makeNindexFromArrayViewport’ when loading ‘SummarizedExperiment’ 
+ + +
library(clusterProfiler)
+ + +
clusterProfiler v4.12.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
+
+If you use clusterProfiler in published research, please cite:
+T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
+
+Attaching package: ‘clusterProfiler’
+
+The following object is masked from ‘package:stats’:
+
+    filter
+ + +
# Package that contains the MSigDB gene sets in tidy format
+library(msigdbr)
+ + + +
+
+

Directories and Files

+
+

Directories

+ + + +
# We'll use the marker genes as GSEA input
+rms_analysis_dir <- file.path("analysis", "rms")
+
+# We'll create a directory to specifically hold the pathway results if it doesn't
+# exist yet
+results_dir <- file.path(rms_analysis_dir, "pathway-analysis")
+fs::dir_create(results_dir)
+ + + +
+
+

Input files

+ + + +
input_file <- file.path(rms_analysis_dir,
+                        "deseq",
+                        "rms_myoblast_deseq_results.tsv")
+ + + +
+
+

Output files

+

We’ll save our table of GSEA results as a TSV.

+ + + +
output_file <- file.path(results_dir,
+                         "rms_myoblast_gsea_results.tsv")
+ + + +
+
+
+
+

Gene sets

+

We will use gene sets from the Molecular +Signatures Database (MSigDB) from the Broad Institute (Subramanian, Tamayo +et al. 2005). The msigdbr +package contains MSigDB datasets already in the tidy format required by +clusterProfiler and supports multiple organisms.

+

Let’s take a look at what organisms the package supports.

+ + + +
msigdbr_species()
+ + +
+ +
+ + + +

MSigDB contains 8 different gene set collections.

+
H: hallmark gene sets
+C1: positional gene sets
+C2: curated gene sets
+C3: motif gene sets
+C4: computational gene sets
+C5: GO gene sets
+C6: oncogenic signatures
+C7: immunologic signatures
+

We’ll use the Hallmark collection for GSEA. Here’s an excerpt of the +collection +description:

+
+

Hallmark gene sets summarize and represent specific well-defined +biological states or processes and display coherent expression. These +gene sets were generated by a computational methodology based on +identifying gene set overlaps and retaining genes that display +coordinate expression. The hallmarks reduce noise and redundancy and +provide a better delineated biological space for GSEA.

+
+

Notably, there are only 50 gene sets included in this collection. The +fewer gene sets we test, the lower our multiple hypothesis testing +burden.

+

We can retrieve only the Hallmark gene sets by specifying +category = "H" to the msigdbr() function.

+ + + +
hs_hallmarks_df <- msigdbr(species = "Homo sapiens",
+                           category = "H")
+ + + +
+
+

Gene Set Enrichment Analysis

+

Adapted from refine.bio +examples

+

+

Figure 1. Subramanian et +al. (2005).

+

GSEA calculates a pathway-level metric, called an enrichment score +(sometimes abbreviated as ES), by ranking genes by a gene-level +statistic. This score reflects whether or not a gene set or pathway is +over-represented at the top or bottom of the gene rankings (Subramanian et +al. 2005; Yu)

+

Specifically, all genes are ranked from most positive to most +negative based on their statistic and a running sum is calculated: +Starting with the most highly ranked genes, the running sum increases +for each gene in the pathway and decreases for each gene not in the +pathway. The enrichment score for a pathway is the running sum’s maximum +deviation from zero. GSEA also assesses statistical significance of the +scores for each pathway through permutation testing. As a result, each +input pathway will have a p-value associated with it that is then +corrected for multiple hypothesis testing (Subramanian et +al. 2005; Yu).

+

The implementation of GSEA we use in here examples requires a gene +list ordered by some statistic and input gene sets. When you use +previously computed gene-level statistics with GSEA, it is called GSEA +pre-ranked.

+
+
+

DESeq2 results

+ + + +
deseq_df <- readr::read_tsv(input_file)
+ + +
Rows: 60319 Columns: 27
+── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+Delimiter: "\t"
+chr  (2): ensembl_id, gene_symbol
+dbl (25): baseMean, log2FoldChange, lfcSE, pvalue, padj, SCPCL000478.mean, SCPCL000478.detected, SCPCL000479.mean, SCPCL000479.detected, SCPCL000480.mean, SCPCL000480.detected, SCPCL000481.m...
+
+ℹ Use `spec()` to retrieve the full column specification for this data.
+ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+ + + + + + +
head(deseq_df)
+ + +
+ +
+ + + +

This data frame uses Ensembl gene identifiers. We’ll need to make +sure our gene sets use the same identifiers. Let’s take a look at the +first few rows of the data frame that contains the hallmark +genesets.

+ + + +
head(hs_hallmarks_df)
+ + +
+ +
+ + + +

We can see that the gene sets from msigdbr have Ensembl +gene identifiers associated with them, so we don’t need to do any +conversion. However, we’ll need to pass the correct column to the +function that runs GSEA.

+

One good thing about Ensembl gene identifiers is that they are less +likely to be duplicated than, for example, gene symbols. (Multiple +Ensembl gene identifiers can map to the same symbol.)

+

The GSEA approach requires on discriminating between genes that are +in a gene set and those that are not. Practically speaking, gene sets +are just collections of gene identifiers! When the function we use for +GSEA pre-ranked gets a list with duplicated gene identifiers, it can +produce unexpected results. So, let’s check for duplicates in the data +frame of DESeq2 results.

+ + + +
any(duplicated(deseq_df$ensembl_id))
+ + +
[1] FALSE
+ + + +

There are no duplicates for us to worry about!

+
+

Pre-ranked list

+

The GSEA() function takes a pre-ranked (sorted) named +vector of statistics, where the names in the vector are gene +identifiers. This is step 1 – gene-level statistics.

+ + + +
lfc_vector <- deseq_df |>
+  # Extract a vector of `log2FoldChange` named by `ensembl_id`
+  dplyr::pull(log2FoldChange, name = ensembl_id)
+lfc_vector <- sort(lfc_vector, decreasing = TRUE)
+ + + +

Let’s look at the top ranked values.

+ + + +
# Look at first entries of the log fold change vector
+head(lfc_vector)
+ + +
ENSG00000263366 ENSG00000223760 ENSG00000253377 ENSG00000265843 ENSG00000104722 ENSG00000228835 
+      11.364714       10.573752       10.476990       10.199449       10.019651        9.898818 
+ + + +

And the bottom of the list.

+ + + +
# Look at the last entries of the log fold change vector
+tail(lfc_vector)
+ + +
ENSG00000269186 ENSG00000268388 ENSG00000165606 ENSG00000285640 ENSG00000118432 ENSG00000184221 
+      -10.93216       -11.35119       -11.36925       -11.90034       -11.92082       -12.12577 
+ + + +
+
+
+

Run GSEA

+

Now for the analysis!

+

We can use the GSEA() function to perform GSEA with any +generic set of gene sets, but there are several functions for using +specific, commonly used gene sets (e.g., gseKEGG()).

+ + + +
gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
+                     minGSSize = 25,  # minimum gene set size
+                     maxGSSize = 500,  # maximum gene set set
+                     pvalueCutoff = 0.05,
+                     pAdjustMethod = "BH",  # correction for multiple hypothesis testing
+                     TERM2GENE = dplyr::select(hs_hallmarks_df,
+                                               gs_name,
+                                               ensembl_gene))  # pass the correct identifier column
+ + +
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
+
+preparing geneSet collections...
+GSEA analysis...
+Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
+  There are ties in the preranked stats (20.32% of the list).
+The order of those tied genes will be arbitrary, which may produce unexpected results.
+leading edge analysis...
+done...
+ + + +

Let’s take a look at the GSEA results.

+ + + +
View(gsea_results@result |>
+       dplyr::arrange(dplyr::desc(NES))
+)
+ + + +

Normalized enrichment scores (NES) are enrichment scores that are +scaled to make gene sets that contain different number of genes +comparable.

+

Pathways with significant, highly positive NES are enriched in ERMS +myoblasts, whereas pathways with significant, highly negative NES are +enriched in ARMS myoblasts.

+

Let’s write these results to file.

+ + + +
gsea_results@result |> readr::write_tsv(output_file)
+ + + +
+

Visualizing GSEA results

+

We can visualize GSEA results for individual pathways or gene sets +using enrichplot::gseaplot(). Let’s take a look at 3 +different pathways – one with a highly positive NES, one with a highly +negative NES, and one that was not a significant result – to get more +insight into how ES are calculated.

+
+

Highly Positive NES

+

Let’s take look at a pathway with a highly positive NES +(HALLMARK_MYC_TARGETS_V2) using a GSEA plot.

+ + + +
enrichplot::gseaplot(gsea_results,
+                     geneSetID = "HALLMARK_MYC_TARGETS_V2",
+                     title = "HALLMARK_MYC_TARGETS_V2",
+                     color.line = "#0066FF")
+ + +

+ + + +

Notice how the genes that are in the gene set, indicated by the black +bars, tend to be on the left side of the graph indicating that they have +positive gene-level scores.

+
+
+

Highly Negative NES

+

The gene set HALLMARK_MYOGENESIS had a highly negative +NES.

+ + + +
enrichplot::gseaplot(gsea_results,
+                     geneSetID = "HALLMARK_MYOGENESIS",
+                     title = "HALLMARK_MYOGENESIS",
+                     color.line = "#0066FF")
+ + +

+ + + +

This gene set shows the opposite pattern – genes in the pathway tend +to be on the right side of the graph.

+
+
+

A non-significant result

+

The @results slot will only show gene sets that pass the +pvalueCutoff threshold we supplied to GSEA(), +but we can plot any gene set so long as we know its name. Let’s look at +HALLMARK_P53_PATHWAY, which was not in the results we +viewed earlier.

+ + + +
enrichplot::gseaplot(gsea_results,
+                     geneSetID = "HALLMARK_P53_PATHWAY",
+                     title = "HALLMARK_P53_PATHWAY",
+                     color.line = "#0066FF")
+ + +

+ + + +

Genes in the pathway are distributed more evenly throughout the +ranked list, resulting in a more “middling” score.

+

Note: The plots returned by enrichplot::gseaplot are +ggplots, so we could use ggplot2::ggsave() to save them to +file if we wanted to.

+
+
+
+
+

Session Info

+ + + +
sessionInfo()
+ + +
R version 4.4.0 (2024-04-24)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.4 LTS
+
+Matrix products: default
+BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
+ [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats     graphics  grDevices datasets  utils     methods   base     
+
+other attached packages:
+[1] msigdbr_7.5.1          clusterProfiler_4.12.0
+
+loaded via a namespace (and not attached):
+  [1] RColorBrewer_1.1-3          rstudioapi_0.16.0           jsonlite_1.8.8              magrittr_2.0.3              estimability_1.5            ggbeeswarm_0.7.2           
+  [7] farver_2.1.1                rmarkdown_2.26              fs_1.6.4                    zlibbioc_1.50.0             vctrs_0.6.5                 memoise_2.0.1              
+ [13] DelayedMatrixStats_1.26.0   ggtree_3.12.0               htmltools_0.5.8.1           S4Arrays_1.4.0              BiocNeighbors_1.22.0        SparseArray_1.4.0          
+ [19] gridGraphics_0.5-1          plyr_1.8.9                  emmeans_1.10.1              cachem_1.0.8                igraph_2.0.3                lifecycle_1.0.4            
+ [25] pkgconfig_2.0.3             gson_0.1.0                  rsvd_1.0.5                  Matrix_1.7-0                R6_2.5.1                    fastmap_1.1.1              
+ [31] GenomeInfoDbData_1.2.12     MatrixGenerics_1.16.0       digest_0.6.35               aplot_0.2.2                 enrichplot_1.24.0           colorspace_2.1-0           
+ [37] patchwork_1.2.0             AnnotationDbi_1.66.0        S4Vectors_0.42.0            scater_1.32.0               qusage_2.38.0               irlba_2.3.5.1              
+ [43] GenomicRanges_1.56.0        RSQLite_2.3.6               beachmat_2.20.0             labeling_0.4.3              fansi_1.0.6                 httr_1.4.7                 
+ [49] polyclip_1.10-6             abind_1.4-5                 compiler_4.4.0              bit64_4.0.5                 withr_3.0.0                 BiocParallel_1.38.0        
+ [55] viridis_0.6.5               DBI_1.2.2                   ggforce_0.4.2               MASS_7.3-60.2               DelayedArray_0.30.0         HDO.db_0.99.1              
+ [61] tools_4.4.0                 vipor_0.4.7                 beeswarm_0.4.0              ape_5.8                     scatterpie_0.2.2            fftw_1.0-8                 
+ [67] glue_1.7.0                  nlme_3.1-164                GOSemSim_2.30.0             grid_4.4.0                  shadowtext_0.1.3            reshape2_1.4.4             
+ [73] fgsea_1.30.0                generics_0.1.3              gtable_0.3.5                tzdb_0.4.0                  tidyr_1.3.1                 data.table_1.15.4          
+ [79] hms_1.1.3                   ScaledMatrix_1.12.0         BiocSingular_1.20.0         tidygraph_1.3.1             utf8_1.2.4                  XVector_0.44.0             
+ [85] BiocGenerics_0.50.0         ggrepel_0.9.5               pillar_1.9.0                stringr_1.5.1               vroom_1.6.5                 babelgene_22.9             
+ [91] limma_3.60.0                yulab.utils_0.1.4           splines_4.4.0               dplyr_1.1.4                 tweenr_2.0.3                treeio_1.28.0              
+ [97] lattice_0.22-6              renv_1.0.7                  bit_4.0.5                   tidyselect_1.2.1            GO.db_3.19.1                SingleCellExperiment_1.26.0
+[103] Biostrings_2.72.0           scuttle_1.14.0              knitr_1.46                  gridExtra_2.3               IRanges_2.38.0              SummarizedExperiment_1.34.0
+[109] stats4_4.4.0                xfun_0.43                   graphlayouts_1.1.1          Biobase_2.64.0              statmod_1.5.0               matrixStats_1.3.0          
+[115] stringi_1.8.3               UCSC.utils_1.0.0            lazyeval_0.2.2              ggfun_0.1.4                 yaml_2.3.8                  evaluate_0.23              
+[121] codetools_0.2-20            ggraph_2.2.1                tibble_3.2.1                qvalue_2.36.0               BiocManager_1.30.22         ggplotify_0.1.2            
+[127] cli_3.6.2                   xtable_1.8-4                munsell_0.5.1               Rcpp_1.0.12                 GenomeInfoDb_1.40.0         coda_0.19-4.1              
+[133] png_0.1-8                   parallel_4.4.0              ggplot2_3.5.1               readr_2.1.5                 blob_1.2.4                  DOSE_3.30.0                
+[139] sparseMatrixStats_1.16.0    mvtnorm_1.2-4               viridisLite_0.4.2           tidytree_0.4.6              scales_1.3.0                purrr_1.0.2                
+[145] crayon_1.5.2                rlang_1.1.3                 cowplot_1.1.3               fastmatch_1.1-4             KEGGREST_1.44.0            
+ + +
+ +
---
title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)"
output:
  html_notebook:
    toc: true
    toc_float: true
author: CCDL for ALSF
date: 2024
---

## Objectives

This notebook will demonstrate how to:

- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA)
- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package
- Perform GSEA with the `clusterProfiler` package
- Visualize GSEA results with the `enrichplot` package

---

In this notebook, we'll analyze the differential expression results from the last notebook.

GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).
The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.

There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)):

1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our DESeq2 results)
2. Gene-level statistics are aggregated into a pathway-level statistic
3. Assess the statistical significance of the pathway-level statistic

#### Other resources

* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html)
* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood.
You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012).
* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted.

## Set up

### Libraries

```{r libraries}
# Package to run GSEA
library(clusterProfiler)
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
```

### Directories and Files

#### Directories

```{r create_dir, live = TRUE}
# We'll use the marker genes as GSEA input
rms_analysis_dir <- file.path("analysis", "rms")

# We'll create a directory to specifically hold the pathway results if it doesn't
# exist yet
results_dir <- file.path(rms_analysis_dir, "pathway-analysis")
fs::dir_create(results_dir)
```

#### Input files


```{r input_files}
input_file <- file.path(rms_analysis_dir,
                        "deseq",
                        "rms_myoblast_deseq_results.tsv")
```

#### Output files

We'll save our table of GSEA results as a TSV.

```{r output_files}
output_file <- file.path(results_dir,
                         "rms_myoblast_gsea_results.tsv")
```

## Gene sets

We will use gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) from the Broad Institute ([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102)).
The [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package contains MSigDB datasets already in the tidy format required by `clusterProfiler` and supports multiple organisms.

Let's take a look at what organisms the package supports.

```{r show_species}
msigdbr_species()
```

MSigDB contains 8 different gene set collections.

    H: hallmark gene sets
    C1: positional gene sets
    C2: curated gene sets
    C3: motif gene sets
    C4: computational gene sets
    C5: GO gene sets
    C6: oncogenic signatures
    C7: immunologic signatures

We'll use the Hallmark collection for GSEA.
Here's an excerpt of the [collection description](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H):

> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

Notably, there are only 50 gene sets included in this collection.
The fewer gene sets we test, the lower our multiple hypothesis testing burden.

We can retrieve only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function.

```{r immunologic_sets, live = TRUE}
hs_hallmarks_df <- msigdbr(species = "Homo sapiens",
                           category = "H")
```

## Gene Set Enrichment Analysis

_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_

![](diagrams/subramanian_fig1.jpg)

**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).**

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic.
This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis))

Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated:
Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway.
The enrichment score for a pathway is the running sum's maximum deviation from zero.
GSEA also assesses statistical significance of the scores for each pathway through permutation testing.
As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).

The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets.
When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

## DESeq2 results

```{r read_in_markers, live = TRUE}
deseq_df <- readr::read_tsv(input_file)
```

```{r deseq_head}
head(deseq_df)
```

This data frame uses Ensembl gene identifiers.
We'll need to make sure our gene sets use the same identifiers.
Let's take a look at the first few rows of the data frame that contains the hallmark genesets.

```{r hallmark_head, live = TRUE}
head(hs_hallmarks_df)
```

We can see that the gene sets from `msigdbr` have Ensembl gene identifiers associated with them, so we don't need to do any conversion.
However, we'll need to pass the correct column to the function that runs GSEA.

One good thing about Ensembl gene identifiers is that they are less likely to be duplicated than, for example, gene symbols.
(Multiple Ensembl gene identifiers can map to the same symbol.)

The GSEA approach requires on discriminating between genes that are in a gene set and those that are not.
Practically speaking, gene sets are just collections of gene identifiers!
When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.
So, let's check for duplicates in the data frame of DESeq2 results.

```{r check_duplicates, live = TRUE}
any(duplicated(deseq_df$ensembl_id))
```

There are no duplicates for us to worry about!

### Pre-ranked list

The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
This is step 1 -- gene-level statistics.

```{r lfc_vector}
lfc_vector <- deseq_df |>
  # Extract a vector of `log2FoldChange` named by `ensembl_id`
  dplyr::pull(log2FoldChange, name = ensembl_id)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
```

Let's look at the top ranked values.

```{r head_lfc, live = TRUE}
# Look at first entries of the log fold change vector
head(lfc_vector)
```

And the bottom of the list.

```{r tail_lfc, live = TRUE}
# Look at the last entries of the log fold change vector
tail(lfc_vector)
```

## Run GSEA

Now for the analysis!

We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`).

```{r run_gsea}
gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
                     minGSSize = 25,  # minimum gene set size
                     maxGSSize = 500,  # maximum gene set set
                     pvalueCutoff = 0.05,
                     pAdjustMethod = "BH",  # correction for multiple hypothesis testing
                     TERM2GENE = dplyr::select(hs_hallmarks_df,
                                               gs_name,
                                               ensembl_gene))  # pass the correct identifier column
```
Let's take a look at the GSEA results.

```{r view_gsea, live = TRUE, eval = FALSE}
View(gsea_results@result |>
       dplyr::arrange(dplyr::desc(NES))
)
```

Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.

Pathways with significant, highly positive NES are enriched in ERMS myoblasts, whereas pathways with significant, highly negative NES are enriched in ARMS myoblasts.

Let's write these results to file.

```{r write_gsea}
gsea_results@result |> readr::write_tsv(output_file)
```

### Visualizing GSEA results

We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`.
Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated.

#### Highly Positive NES

Let's take look at a pathway with a highly positive NES (`HALLMARK_MYC_TARGETS_V2`) using a GSEA plot.

```{r highly_pos}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYC_TARGETS_V2",
                     title = "HALLMARK_MYC_TARGETS_V2",
                     color.line = "#0066FF")
```

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.

#### Highly Negative NES

The gene set `HALLMARK_MYOGENESIS` had a highly negative NES.

```{r highly_neg}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYOGENESIS",
                     title = "HALLMARK_MYOGENESIS",
                     color.line = "#0066FF")
```

This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph.

#### A non-significant result

The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name.
Let's look at `HALLMARK_P53_PATHWAY`, which was not in the results we viewed earlier.

```{r p53, live = TRUE}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_P53_PATHWAY",
                     title = "HALLMARK_P53_PATHWAY",
                     color.line = "#0066FF")
```

Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score.

*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.*

## Session Info

```{r session_info}
sessionInfo()
```

+ + +
+
+ +
+ + + + + + + + + + + + + + + + + From c4ae9f947afb5de3b764fb5b03492ea491566d7b Mon Sep 17 00:00:00 2001 From: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com> Date: Mon, 25 Nov 2024 09:56:34 -0500 Subject: [PATCH 2/7] Remove 05 version of GSEA that uses marker genes --- .../05-gene_set_enrichment_analysis-live.Rmd | 344 -- .../05-gene_set_enrichment_analysis.Rmd | 356 -- .../05-gene_set_enrichment_analysis.nb.html | 3635 ----------------- 3 files changed, 4335 deletions(-) delete mode 100644 scRNA-seq-advanced/05-gene_set_enrichment_analysis-live.Rmd delete mode 100644 scRNA-seq-advanced/05-gene_set_enrichment_analysis.Rmd delete mode 100644 scRNA-seq-advanced/05-gene_set_enrichment_analysis.nb.html diff --git a/scRNA-seq-advanced/05-gene_set_enrichment_analysis-live.Rmd b/scRNA-seq-advanced/05-gene_set_enrichment_analysis-live.Rmd deleted file mode 100644 index 0435b77e..00000000 --- a/scRNA-seq-advanced/05-gene_set_enrichment_analysis-live.Rmd +++ /dev/null @@ -1,344 +0,0 @@ ---- -title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)" -output: - html_notebook: - toc: true - toc_float: true -author: CCDL for ALSF -date: 2021 ---- - -## Objectives - -This notebook will demonstrate how to: - -- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA) -- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package -- Perform GSEA with the `clusterProfiler` package -- Visualize GSEA results with the `enrichplot` package - ---- - -In this notebook, we'll analyze the marker genes from cluster 1, just as we did in the previous notebook. -Unlike ORA, GSEA allows us to use the full list of genes we could reliably measure, rather than picking some cutoff ourselves. - -GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102). -The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest. -FCS methods are better suited for identifying these pathways that show coordinated changes than ORA. -In ORA, we pick a cutoff that _typically_ only captures genes with large individual changes. - -There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)): - -1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our marker genes table) -2. Gene-level statistics are aggregated into a pathway-level statistic -3. Assess the statistical significance of the pathway-level statistic - -We'll note here that GSEA was designed to detect small coordinated changes between _conditions_ in (necessarily bulk!) microarray data. -It may be more difficult to suss out small coordinated changes when we assay individual cells, due to technical or biological dropout. - -Individual cells may also differently sized ([Blasi *et al.* 2017](https://doi.org/10.1088/1478-3975/aa609a)), be at different stages of the cell cycle ([Satija lab](https://satijalab.org/seurat/archive/v3.1/cell_cycle_vignette.html)), or experiencing different levels of cellular stress ([Luecken and Theis. 2019](https://doi.org/10.15252/msb.20188746)), some of which may be interesting in the context of cancer and therefore should not be corrected for. -(In practice, it is common to correct for cell cycle stage but we have not done that upstream.) -These aspects also have the potential to cloud our ability to detect small coordinated changes in other pathways. - -There is some literature that suggests that dropout (excessive zeros) observed in single cell data is a function of cell type heterogeneity ([Kim *et al.* 2020](https://doi.org/10.1186/s13059-020-02096-y)). -We're using GSEA to compare clusters of cells, which we (ideally) expect to be different biological states and/or cell types. -If many genes "drop out" between cell types, we may expect our statistics used as input to GSEA to be noisy. - -With all this in mind, it may be unsurprising that ORA, which generally captures genes with large individual changes and lower within-group variation (the most "statistically significant" genes), is well-suited to analyzing scRNA-seq data and the advantages of GSEA in a bulk analysis setting may not be fully available here. -Nonetheless, GSEA is used for pathway analysis of scRNA-seq data. - -Methods specifically for pathway analysis of scRNA-seq (e.g., [Ma *et al.* 2020](https://doi.org/10.1038/s41467-020-15298-6)) are being developed, so stay tuned! - -#### Other resources - -* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html) -* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood. -You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012). -* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted. - -## Set up - -### Libraries - -```{r libraries} -# Package to run GSEA -library(clusterProfiler) -# Package that contains the MSigDB gene sets in tidy format -library(msigdbr) -``` - -### Directories and Files - -#### Directories - -```{r create_dir, live = TRUE} -# We'll use the marker genes as GSEA input - -# We'll create a directory to specifically hold the pathway results if it doesn't -# exist yet - -``` - -#### Input files - -We're going to use the version of the cluster 1 marker genes table that has gene symbols, so we don't have to do gene identifier conversion again! - -```{r input_files} -input_file <- file.path(hodgkins_analysis_dir, - "markers", - "cluster01_markers_with_gene_symbols.tsv") -``` - -#### Output files - -We'll save our table of GSEA results as a TSV. - -```{r output_files} -output_file <- file.path(results_dir, - "cluster01_immunologic_gsea_results.tsv") -``` - -## Gene sets - -We will use gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) from the Broad Institute ([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102)). -The [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package contains MSigDB datasets already in the tidy format required by `clusterProfiler` and supports multiple organisms. - -Let's take a look at what organisms the package supports. - -```{r show_species} -msigdbr_species() -``` - -MSigDB contains 8 different gene set collections. - - H: hallmark gene sets - C1: positional gene sets - C2: curated gene sets - C3: motif gene sets - C4: computational gene sets - C5: GO gene sets - C6: oncogenic signatures - C7: immunologic signatures - - -Because we're working with a sample that's partly comprised of immune cells, let's use the immunologic signatures to analyze our marker genes. -These signatures are generally derived from gene expression experiments of a broad set of cell types, perturbations, and in some cases responses to specific vaccines ([Godec *et al.* 2016](https://doi.org/10.1016/j.immuni.2015.12.006)); they usually have an "up" and "down" component. -You can [read more about the C7 collection at MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#C7). - -We might expect that analyzing our marker genes with this collection could give us some information not only about cell identity or type, but also phenotype or activation state. - -We can retrieve only the Immunologic gene sets by specifying `category = "C7"` to the `msigdbr()` function. -Again, we only want human gene sets here so we specify with that with the `species` argument. - -```{r immunologic_sets, live = TRUE} - -``` - -## Gene Set Enrichment Analysis - -_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_ - -![](diagrams/subramanian_fig1.jpg) - -**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).** - -GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. -This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)) - -Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated: -Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway. -The enrichment score for a pathway is the running sum's maximum deviation from zero. -GSEA also assesses statistical significance of the scores for each pathway through permutation testing. -As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)). - -The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets. -When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked. - -## Marker gene results - -```{r read_in_markers, live = TRUE} - -``` - -```{r} -head(markers_df) -``` - -Since this data frame of marker genes includes gene symbols, we do not need to perform any kind of gene identifier conversion. -We do, however, need to check for duplicate gene symbols. -We can accomplish this with `duplicated()`, which returns a logical vector (e.g., `TRUE` or `FALSE`). -The function `sum()` will count `TRUE` values as 1s and `FALSE` as 0s, so using it with `duplicated()` will count the number of duplicate values. - -```{r any_duplicated, live = TRUE} - -``` - -Luckily this number is very low, such that we do not expect it to impact our results much, but this will still cause a problem when we go to run GSEA. - -### Removing duplicates - -The GSEA approach requires on discriminating between genes that are in a gene set and those that are not. -Practically speaking, gene sets are just collections of gene identifiers! -When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results. - -Compared to the total number of genes that are in our results, there are not a lot of duplicates but we'll still need to make a decision about how to handle them. - -Let's get a vector of the duplicated gene symbols so we can use it to explore our filtering steps. - -```{r gene_dups, live = TRUE} - -``` - -Now we'll look at the values for the duplicated gene symbols. - -```{r show_gene_dups} -markers_df |> - dplyr::filter(gene_symbol %in% duplicated_gene_symbols) |> - dplyr::arrange(gene_symbol) -``` - -We can see that the associated values vary for each row. - -Let's keep the gene symbols associated with the higher absolute value of the gene-level statistic we'll use for GSEA (`summary.logFC`). - -Retaining the instance of the gene symbols with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. -We should keep this decision in mind when interpreting our results. -For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking. - -In the next chunk, we are going to filter out the duplicated rows using the `dplyr::distinct()` function after sorting by absolute value of the summary log fold change. -This will keep the first row with the duplicated value thus keeping the row with the largest absolute value. - -```{r filter_dup_markers} -filtered_markers_df <- markers_df |> - # Sort so that the highest absolute values of the summary log2 fold change are - # at the top - dplyr::arrange(dplyr::desc(abs(summary.logFC))) |> - # Filter out the duplicated rows using `dplyr::distinct()` - dplyr::distinct(gene_symbol, .keep_all = TRUE) -``` - -Let's see what happened to our duplicate identifiers. - -```{r show_filtered_markers, live = TRUE} -# Subset to & arrange by gene symbols that were duplicated in the original -# data frame of results - -``` - -Now we're ready to prep our pre-ranked list for GSEA. - -### Pre-ranked list - -The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers. -This is step 1 -- gene-level statistics. - -Here, we're using the summary log fold change, which summarizes of all of the log fold changes from the pairwise comparisons between cluster 1 and other clusters into a single value ([ref](https://rdrr.io/bioc/scran/man/combineMarkers.html)). -As such, this statistic gives us information about the relative magnitude and directionality of each gene's expression in cluster 1 relative to all other clusters. - -```{r lfc_vector} -lfc_vector <- filtered_markers_df |> - # Extract a vector of `summary.logFC` named by `gene_symbol` - dplyr::pull(summary.logFC, name = gene_symbol) -lfc_vector <- sort(lfc_vector, decreasing = TRUE) -``` - -Let's look at the top ranked values. - -```{r head_lfc, live = TRUE} -# Look at first entries of the log fold change vector - -``` - -And the bottom of the list. - -```{r tail_lfc, live = TRUE} -# Look at the last entries of the log fold change vector - -``` - -## Run GSEA - -Now for the analysis! - -We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`). - -```{r run_gsea} -gsea_results <- GSEA(geneList = lfc_vector, # ordered ranked gene list - minGSSize = 25, # minimum gene set size - maxGSSize = 500, # maximum gene set set - pvalueCutoff = 0.05, - pAdjustMethod = "BH", # correction for multiple hypothesis testing - TERM2GENE = dplyr::select(hs_immunologic_df, - gs_name, - gene_symbol)) -``` -Let's take a look at the GSEA results. - -```{r view_gsea, live = TRUE, eval = FALSE} - -``` - -Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable. - -Let's write these results to file. - -```{r write_gsea} -gsea_results@result |> readr::write_tsv(output_file) -``` - -### Visualizing GSEA results - -We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`. -Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated. - -#### Highly Positive NES - -Let's take look at a pathway with a highly positive NES (`GSE29618_BCELL_VS_PDC_UP`) using a GSEA plot. -This gene set is comprised of ([ref](https://www.gsea-msigdb.org/gsea/msigdb/cards/GSE29618_BCELL_VS_PDC_UP)): - -> Genes up-regulated in comparison of B cells versus plasmacytoid dendritic cells (pDC) . - -```{r highly_pos} -enrichplot::gseaplot(gsea_results, - geneSetID = "GSE29618_BCELL_VS_PDC_UP", - title = "GSE29618_BCELL_VS_PDC_UP", - color.line = "#0066FF") -``` - -Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores. - -#### Highly Negative NES - -The gene set `GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN` had a highly negative NES, and is comprised of ([ref](https://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?geneSetName=GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN)) - -> Genes down-regulated in comparison of B cells from influenza vaccinee at day 7 post-vaccination versus plasmacytoid dendritic cells (pDC) at day 7 post-vaccination. - -This is the "down" signature that corresponds to one of the "up" pathways with a highly positive NES, so in a way this is not adding new information! - -```{r highly_neg} -enrichplot::gseaplot(gsea_results, - geneSetID = "GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN", - title = "GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN", - color.line = "#0066FF") -``` - -This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph. - -#### A non-significant result - -The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name. -Let's look at [`GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN`](http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?geneSetName=GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_UP&keywords=GOLDRATH), which examines the difference between effector CD8 T cells and memory CD8 T cells and was not in the results we viewed earlier. - -```{r nonsig, live = TRUE} - -``` - -Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score. - -*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.* - -## Session Info - -```{r session_info} -sessionInfo() -``` diff --git a/scRNA-seq-advanced/05-gene_set_enrichment_analysis.Rmd b/scRNA-seq-advanced/05-gene_set_enrichment_analysis.Rmd deleted file mode 100644 index ce9a36b2..00000000 --- a/scRNA-seq-advanced/05-gene_set_enrichment_analysis.Rmd +++ /dev/null @@ -1,356 +0,0 @@ ---- -title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)" -output: - html_notebook: - toc: true - toc_float: true -author: CCDL for ALSF -date: 2021 ---- - -## Objectives - -This notebook will demonstrate how to: - -- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA) -- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package -- Perform GSEA with the `clusterProfiler` package -- Visualize GSEA results with the `enrichplot` package - ---- - -In this notebook, we'll analyze the marker genes from cluster 1, just as we did in the previous notebook. -Unlike ORA, GSEA allows us to use the full list of genes we could reliably measure, rather than picking some cutoff ourselves. - -GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102). -The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest. -FCS methods are better suited for identifying these pathways that show coordinated changes than ORA. -In ORA, we pick a cutoff that _typically_ only captures genes with large individual changes. - -There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)): - -1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our marker genes table) -2. Gene-level statistics are aggregated into a pathway-level statistic -3. Assess the statistical significance of the pathway-level statistic - -We'll note here that GSEA was designed to detect small coordinated changes between _conditions_ in (necessarily bulk!) microarray data. -It may be more difficult to suss out small coordinated changes when we assay individual cells, due to technical or biological dropout. - -Individual cells may also differently sized ([Blasi *et al.* 2017](https://doi.org/10.1088/1478-3975/aa609a)), be at different stages of the cell cycle ([Satija lab](https://satijalab.org/seurat/archive/v3.1/cell_cycle_vignette.html)), or experiencing different levels of cellular stress ([Luecken and Theis. 2019](https://doi.org/10.15252/msb.20188746)), some of which may be interesting in the context of cancer and therefore should not be corrected for. -(In practice, it is common to correct for cell cycle stage but we have not done that upstream.) -These aspects also have the potential to cloud our ability to detect small coordinated changes in other pathways. - -There is some literature that suggests that dropout (excessive zeros) observed in single cell data is a function of cell type heterogeneity ([Kim *et al.* 2020](https://doi.org/10.1186/s13059-020-02096-y)). -We're using GSEA to compare clusters of cells, which we (ideally) expect to be different biological states and/or cell types. -If many genes "drop out" between cell types, we may expect our statistics used as input to GSEA to be noisy. - -With all this in mind, it may be unsurprising that ORA, which generally captures genes with large individual changes and lower within-group variation (the most "statistically significant" genes), is well-suited to analyzing scRNA-seq data and the advantages of GSEA in a bulk analysis setting may not be fully available here. -Nonetheless, GSEA is used for pathway analysis of scRNA-seq data. - -Methods specifically for pathway analysis of scRNA-seq (e.g., [Ma *et al.* 2020](https://doi.org/10.1038/s41467-020-15298-6)) are being developed, so stay tuned! - -#### Other resources - -* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html) -* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood. -You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012). -* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted. - -## Set up - -### Libraries - -```{r libraries} -# Package to run GSEA -library(clusterProfiler) -# Package that contains the MSigDB gene sets in tidy format -library(msigdbr) -``` - -### Directories and Files - -#### Directories - -```{r create_dir, live = TRUE} -# We'll use the marker genes as GSEA input -hodgkins_analysis_dir <- file.path("analysis", "hodgkins") - -# We'll create a directory to specifically hold the pathway results if it doesn't -# exist yet -results_dir <- file.path(hodgkins_analysis_dir, "pathway-analysis") -fs::dir_create(results_dir) -``` - -#### Input files - -We're going to use the version of the cluster 1 marker genes table that has gene symbols, so we don't have to do gene identifier conversion again! - -```{r input_files} -input_file <- file.path(hodgkins_analysis_dir, - "markers", - "cluster01_markers_with_gene_symbols.tsv") -``` - -#### Output files - -We'll save our table of GSEA results as a TSV. - -```{r output_files} -output_file <- file.path(results_dir, - "cluster01_immunologic_gsea_results.tsv") -``` - -## Gene sets - -We will use gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) from the Broad Institute ([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102)). -The [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package contains MSigDB datasets already in the tidy format required by `clusterProfiler` and supports multiple organisms. - -Let's take a look at what organisms the package supports. - -```{r show_species} -msigdbr_species() -``` - -MSigDB contains 8 different gene set collections. - - H: hallmark gene sets - C1: positional gene sets - C2: curated gene sets - C3: motif gene sets - C4: computational gene sets - C5: GO gene sets - C6: oncogenic signatures - C7: immunologic signatures - - -Because we're working with a sample that's partly comprised of immune cells, let's use the immunologic signatures to analyze our marker genes. -These signatures are generally derived from gene expression experiments of a broad set of cell types, perturbations, and in some cases responses to specific vaccines ([Godec *et al.* 2016](https://doi.org/10.1016/j.immuni.2015.12.006)); they usually have an "up" and "down" component. -You can [read more about the C7 collection at MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#C7). - -We might expect that analyzing our marker genes with this collection could give us some information not only about cell identity or type, but also phenotype or activation state. - -We can retrieve only the Immunologic gene sets by specifying `category = "C7"` to the `msigdbr()` function. -Again, we only want human gene sets here so we specify with that with the `species` argument. - -```{r immunologic_sets, live = TRUE} -hs_immunologic_df <- msigdbr(species = "Homo sapiens", - category = "C7") -``` - -## Gene Set Enrichment Analysis - -_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_ - -![](diagrams/subramanian_fig1.jpg) - -**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).** - -GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. -This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)) - -Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated: -Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway. -The enrichment score for a pathway is the running sum's maximum deviation from zero. -GSEA also assesses statistical significance of the scores for each pathway through permutation testing. -As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)). - -The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets. -When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked. - -## Marker gene results - -```{r read_in_markers, live = TRUE} -markers_df <- readr::read_tsv(input_file) -``` - -```{r} -head(markers_df) -``` - -Since this data frame of marker genes includes gene symbols, we do not need to perform any kind of gene identifier conversion. -We do, however, need to check for duplicate gene symbols. -We can accomplish this with `duplicated()`, which returns a logical vector (e.g., `TRUE` or `FALSE`). -The function `sum()` will count `TRUE` values as 1s and `FALSE` as 0s, so using it with `duplicated()` will count the number of duplicate values. - -```{r any_duplicated, live = TRUE} -sum(duplicated(markers_df$gene_symbol)) -``` - -Luckily this number is very low, such that we do not expect it to impact our results much, but this will still cause a problem when we go to run GSEA. - -### Removing duplicates - -The GSEA approach requires on discriminating between genes that are in a gene set and those that are not. -Practically speaking, gene sets are just collections of gene identifiers! -When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results. - -Compared to the total number of genes that are in our results, there are not a lot of duplicates but we'll still need to make a decision about how to handle them. - -Let's get a vector of the duplicated gene symbols so we can use it to explore our filtering steps. - -```{r gene_dups, live = TRUE} -duplicated_gene_symbols <- markers_df |> - dplyr::filter(duplicated(gene_symbol)) |> - dplyr::pull(gene_symbol) -``` - -Now we'll look at the values for the duplicated gene symbols. - -```{r show_gene_dups} -markers_df |> - dplyr::filter(gene_symbol %in% duplicated_gene_symbols) |> - dplyr::arrange(gene_symbol) -``` - -We can see that the associated values vary for each row. - -Let's keep the gene symbols associated with the higher absolute value of the gene-level statistic we'll use for GSEA (`summary.logFC`). - -Retaining the instance of the gene symbols with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. -We should keep this decision in mind when interpreting our results. -For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking. - -In the next chunk, we are going to filter out the duplicated rows using the `dplyr::distinct()` function after sorting by absolute value of the summary log fold change. -This will keep the first row with the duplicated value thus keeping the row with the largest absolute value. - -```{r filter_dup_markers} -filtered_markers_df <- markers_df |> - # Sort so that the highest absolute values of the summary log2 fold change are - # at the top - dplyr::arrange(dplyr::desc(abs(summary.logFC))) |> - # Filter out the duplicated rows using `dplyr::distinct()` - dplyr::distinct(gene_symbol, .keep_all = TRUE) -``` - -Let's see what happened to our duplicate identifiers. - -```{r show_filtered_markers, live = TRUE} -# Subset to & arrange by gene symbols that were duplicated in the original -# data frame of results -filtered_markers_df |> - dplyr::filter(gene_symbol %in% duplicated_gene_symbols) |> - dplyr::arrange(gene_symbol) -``` - -Now we're ready to prep our pre-ranked list for GSEA. - -### Pre-ranked list - -The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers. -This is step 1 -- gene-level statistics. - -Here, we're using the summary log fold change, which summarizes of all of the log fold changes from the pairwise comparisons between cluster 1 and other clusters into a single value ([ref](https://rdrr.io/bioc/scran/man/combineMarkers.html)). -As such, this statistic gives us information about the relative magnitude and directionality of each gene's expression in cluster 1 relative to all other clusters. - -```{r lfc_vector} -lfc_vector <- filtered_markers_df |> - # Extract a vector of `summary.logFC` named by `gene_symbol` - dplyr::pull(summary.logFC, name = gene_symbol) -lfc_vector <- sort(lfc_vector, decreasing = TRUE) -``` - -Let's look at the top ranked values. - -```{r head_lfc, live = TRUE} -# Look at first entries of the log fold change vector -head(lfc_vector) -``` - -And the bottom of the list. - -```{r tail_lfc, live = TRUE} -# Look at the last entries of the log fold change vector -tail(lfc_vector) -``` - -## Run GSEA - -Now for the analysis! - -We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`). - -```{r run_gsea} -gsea_results <- GSEA(geneList = lfc_vector, # ordered ranked gene list - minGSSize = 25, # minimum gene set size - maxGSSize = 500, # maximum gene set set - pvalueCutoff = 0.05, - pAdjustMethod = "BH", # correction for multiple hypothesis testing - TERM2GENE = dplyr::select(hs_immunologic_df, - gs_name, - gene_symbol)) -``` -Let's take a look at the GSEA results. - -```{r view_gsea, live = TRUE, eval = FALSE} -View(gsea_results@result |> - dplyr::arrange(dplyr::desc(NES)) -) -``` - -Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable. - -Let's write these results to file. - -```{r write_gsea} -gsea_results@result |> readr::write_tsv(output_file) -``` - -### Visualizing GSEA results - -We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`. -Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated. - -#### Highly Positive NES - -Let's take look at a pathway with a highly positive NES (`GSE29618_BCELL_VS_PDC_UP`) using a GSEA plot. -This gene set is comprised of ([ref](https://www.gsea-msigdb.org/gsea/msigdb/cards/GSE29618_BCELL_VS_PDC_UP)): - -> Genes up-regulated in comparison of B cells versus plasmacytoid dendritic cells (pDC) . - -```{r highly_pos} -enrichplot::gseaplot(gsea_results, - geneSetID = "GSE29618_BCELL_VS_PDC_UP", - title = "GSE29618_BCELL_VS_PDC_UP", - color.line = "#0066FF") -``` - -Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores. - -#### Highly Negative NES - -The gene set `GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN` had a highly negative NES, and is comprised of ([ref](https://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?geneSetName=GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN)) - -> Genes down-regulated in comparison of B cells from influenza vaccinee at day 7 post-vaccination versus plasmacytoid dendritic cells (pDC) at day 7 post-vaccination. - -This is the "down" signature that corresponds to one of the "up" pathways with a highly positive NES, so in a way this is not adding new information! - -```{r highly_neg} -enrichplot::gseaplot(gsea_results, - geneSetID = "GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN", - title = "GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN", - color.line = "#0066FF") -``` - -This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph. - -#### A non-significant result - -The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name. -Let's look at [`GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN`](http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?geneSetName=GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_UP&keywords=GOLDRATH), which examines the difference between effector CD8 T cells and memory CD8 T cells and was not in the results we viewed earlier. - -```{r nonsig, live = TRUE} -enrichplot::gseaplot(gsea_results, - geneSetID = "GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN", - title = "GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN", - color.line = "#0066FF") -``` - -Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score. - -*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.* - -## Session Info - -```{r session_info} -sessionInfo() -``` diff --git a/scRNA-seq-advanced/05-gene_set_enrichment_analysis.nb.html b/scRNA-seq-advanced/05-gene_set_enrichment_analysis.nb.html deleted file mode 100644 index 57cce690..00000000 --- a/scRNA-seq-advanced/05-gene_set_enrichment_analysis.nb.html +++ /dev/null @@ -1,3635 +0,0 @@ - - - - - - - - - - - - - - - -Pathway analysis: Gene Set Enrichment Analysis (GSEA) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - -
-
-
-
-
- -
- - - - - - - - -
-

Objectives

-

This notebook will demonstrate how to:

-
    -
  • Prepare tabular data of gene-level statistics for use with Gene Set -Enrichment Analysis (GSEA)
  • -
  • Access Molecular -Signatures Database gene set collections via the -msigdbr package
  • -
  • Perform GSEA with the clusterProfiler package
  • -
  • Visualize GSEA results with the enrichplot package
  • -
-
-

In this notebook, we’ll analyze the marker genes from cluster 1, just -as we did in the previous notebook. Unlike ORA, GSEA allows us to use -the full list of genes we could reliably measure, rather than picking -some cutoff ourselves.

-

GSEA is a functional class scoring (FCS) approach to pathway analysis -that was first introduced in Subramanian et -al. (2005). The rationale behind FCS approaches is that small -changes in individual genes that participate in the same biological -process or pathway can be significant and of biological interest. FCS -methods are better suited for identifying these pathways that show -coordinated changes than ORA. In ORA, we pick a cutoff that -typically only captures genes with large individual -changes.

-

There are 3 general steps in FCS methods (Khatri et -al. 2012):

-
    -
  1. Calculate a gene-level statistic (here, we’ll use the summary log -fold changes in our marker genes table)
  2. -
  3. Gene-level statistics are aggregated into a pathway-level -statistic
  4. -
  5. Assess the statistical significance of the pathway-level -statistic
  6. -
-

We’ll note here that GSEA was designed to detect small coordinated -changes between conditions in (necessarily bulk!) microarray -data. It may be more difficult to suss out small coordinated changes -when we assay individual cells, due to technical or biological -dropout.

-

Individual cells may also differently sized (Blasi et al. -2017), be at different stages of the cell cycle (Satija -lab), or experiencing different levels of cellular stress (Luecken and Theis. -2019), some of which may be interesting in the context of cancer and -therefore should not be corrected for. (In practice, it is common to -correct for cell cycle stage but we have not done that upstream.) These -aspects also have the potential to cloud our ability to detect small -coordinated changes in other pathways.

-

There is some literature that suggests that dropout (excessive zeros) -observed in single cell data is a function of cell type heterogeneity -(Kim et -al. 2020). We’re using GSEA to compare clusters of cells, which -we (ideally) expect to be different biological states and/or cell types. -If many genes “drop out” between cell types, we may expect our -statistics used as input to GSEA to be noisy.

-

With all this in mind, it may be unsurprising that ORA, which -generally captures genes with large individual changes and lower -within-group variation (the most “statistically significant” genes), is -well-suited to analyzing scRNA-seq data and the advantages of GSEA in a -bulk analysis setting may not be fully available here. Nonetheless, GSEA -is used for pathway analysis of scRNA-seq data.

-

Methods specifically for pathway analysis of scRNA-seq (e.g., Ma et al. -2020) are being developed, so stay tuned!

-
-

Other resources

- -
-
-
-

Set up

-
-

Libraries

- - - -
# Package to run GSEA
-library(clusterProfiler)
- - -
- - -
clusterProfiler v4.12.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
-
-If you use clusterProfiler in published research, please cite:
-T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
- - -

-Attaching package: 'clusterProfiler'
- - -
The following object is masked from 'package:stats':
-
-    filter
- - -
# Package that contains the MSigDB gene sets in tidy format
-library(msigdbr)
- - - -
-
-

Directories and Files

-
-

Directories

- - - -
# We'll use the marker genes as GSEA input
-hodgkins_analysis_dir <- file.path("analysis", "hodgkins")
-
-# We'll create a directory to specifically hold the pathway results if it doesn't
-# exist yet
-results_dir <- file.path(hodgkins_analysis_dir, "pathway-analysis")
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir, recursive = TRUE)
-}
- - - -
-
-

Input files

-

We’re going to use the version of the cluster 1 marker genes table -that has gene symbols, so we don’t have to do gene identifier conversion -again!

- - - -
input_file <- file.path(hodgkins_analysis_dir,
-                        "markers",
-                        "cluster01_markers_with_gene_symbols.tsv")
- - - -
-
-

Output files

-

We’ll save our table of GSEA results as a TSV.

- - - -
output_file <- file.path(results_dir,
-                         "cluster01_immunologic_gsea_results.tsv")
- - - -
-
-
-
-

Gene sets

-

We will use gene sets from the Molecular -Signatures Database (MSigDB) from the Broad Institute (Subramanian, Tamayo -et al. 2005). The msigdbr -package contains MSigDB datasets already in the tidy format required by -clusterProfiler and supports multiple organisms.

-

Let’s take a look at what organisms the package supports.

- - - -
msigdbr_species()
- -
- -
- - -

MSigDB contains 8 different gene set collections.

-
H: hallmark gene sets
-C1: positional gene sets
-C2: curated gene sets
-C3: motif gene sets
-C4: computational gene sets
-C5: GO gene sets
-C6: oncogenic signatures
-C7: immunologic signatures
-

Because we’re working with a sample that’s partly comprised of immune -cells, let’s use the immunologic signatures to analyze our marker genes. -These signatures are generally derived from gene expression experiments -of a broad set of cell types, perturbations, and in some cases responses -to specific vaccines (Godec et -al. 2016); they usually have an “up” and “down” component. You -can read -more about the C7 collection at MSigDB.

-

We might expect that analyzing our marker genes with this collection -could give us some information not only about cell identity or type, but -also phenotype or activation state.

-

We can retrieve only the Immunologic gene sets by specifying -category = "C7" to the msigdbr() function. -Again, we only want human gene sets here so we specify with that with -the species argument.

- - - -
hs_immunologic_df <- msigdbr(species = "Homo sapiens",
-                             category = "C7")
- - - -
-
-

Gene Set Enrichment Analysis

-

Adapted from refine.bio -examples

-

-

Figure 1. Subramanian et -al. (2005).

-

GSEA calculates a pathway-level metric, called an enrichment score -(sometimes abbreviated as ES), by ranking genes by a gene-level -statistic. This score reflects whether or not a gene set or pathway is -over-represented at the top or bottom of the gene rankings (Subramanian et -al. 2005; Yu)

-

Specifically, all genes are ranked from most positive to most -negative based on their statistic and a running sum is calculated: -Starting with the most highly ranked genes, the running sum increases -for each gene in the pathway and decreases for each gene not in the -pathway. The enrichment score for a pathway is the running sum’s maximum -deviation from zero. GSEA also assesses statistical significance of the -scores for each pathway through permutation testing. As a result, each -input pathway will have a p-value associated with it that is then -corrected for multiple hypothesis testing (Subramanian et -al. 2005; Yu).

-

The implementation of GSEA we use in here examples requires a gene -list ordered by some statistic and input gene sets. When you use -previously computed gene-level statistics with GSEA, it is called GSEA -pre-ranked.

-
-
-

Marker gene results

- - - -
markers_df <- readr::read_tsv(input_file)
- - -
Rows: 7158 Columns: 17
-── Column specification ────────────────────────────────────────────────────────
-Delimiter: "\t"
-chr  (2): ensembl_id, gene_symbol
-dbl (15): p.value, FDR, summary.logFC, logFC.2, logFC.3, logFC.4, logFC.5, l...
-
-ℹ Use `spec()` to retrieve the full column specification for this data.
-ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- - - - - - -
head(markers_df)
- -
- -
- - -

Since this data frame of marker genes includes gene symbols, we do -not need to perform any kind of gene identifier conversion. We do, -however, need to check for duplicate gene symbols. We can accomplish -this with duplicated(), which returns a logical vector -(e.g., TRUE or FALSE). The function -sum() will count TRUE values as 1s and -FALSE as 0s, so using it with duplicated() -will count the number of duplicate values.

- - - -
sum(duplicated(markers_df$gene_symbol))
- - -
[1] 1
- - - -

Luckily this number is very low, such that we do not expect it to -impact our results much, but this will still cause a problem when we go -to run GSEA.

-
-

Removing duplicates

-

The GSEA approach requires on discriminating between genes that are -in a gene set and those that are not. Practically speaking, gene sets -are just collections of gene identifiers! When the function we use for -GSEA pre-ranked gets a list with duplicated gene identifiers, it can -produce unexpected results.

-

Compared to the total number of genes that are in our results, there -are not a lot of duplicates but we’ll still need to make a decision -about how to handle them.

-

Let’s get a vector of the duplicated gene symbols so we can use it to -explore our filtering steps.

- - - -
duplicated_gene_symbols <- markers_df |>
-  dplyr::filter(duplicated(gene_symbol)) |>
-  dplyr::pull(gene_symbol)
- - - -

Now we’ll look at the values for the duplicated gene symbols.

- - - -
markers_df |>
-  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) |>
-  dplyr::arrange(gene_symbol)
- -
- -
- - -

We can see that the associated values vary for each row.

-

Let’s keep the gene symbols associated with the higher absolute value -of the gene-level statistic we’ll use for GSEA -(summary.logFC).

-

Retaining the instance of the gene symbols with the higher absolute -value of a gene-level statistic means that we will retain the value that -is likely to be more highly- or lowly-ranked or, put another way, the -values less likely to be towards the middle of the ranked gene list. We -should keep this decision in mind when interpreting our results. For -example, if all the duplicate identifiers happened to be in a particular -gene set, we may get an overly optimistic view of how perturbed that -gene set is because we preferentially selected instances of the -identifier that have a higher absolute value of the statistic used for -ranking.

-

In the next chunk, we are going to filter out the duplicated rows -using the dplyr::distinct() function after sorting by -absolute value of the summary log fold change. This will keep the first -row with the duplicated value thus keeping the row with the largest -absolute value.

- - - -
filtered_markers_df <- markers_df |>
-  # Sort so that the highest absolute values of the summary log2 fold change are
-  # at the top
-  dplyr::arrange(dplyr::desc(abs(summary.logFC))) |>
-  # Filter out the duplicated rows using `dplyr::distinct()`
-  dplyr::distinct(gene_symbol, .keep_all = TRUE)
- - - -

Let’s see what happened to our duplicate identifiers.

- - - -
# Subset to & arrange by gene symbols that were duplicated in the original
-# data frame of results
-filtered_markers_df |>
-  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) |>
-  dplyr::arrange(gene_symbol)
- -
- -
- - -

Now we’re ready to prep our pre-ranked list for GSEA.

-
-
-

Pre-ranked list

-

The GSEA() function takes a pre-ranked (sorted) named -vector of statistics, where the names in the vector are gene -identifiers. This is step 1 – gene-level statistics.

-

Here, we’re using the summary log fold change, which summarizes of -all of the log fold changes from the pairwise comparisons between -cluster 1 and other clusters into a single value (ref). As -such, this statistic gives us information about the relative magnitude -and directionality of each gene’s expression in cluster 1 relative to -all other clusters.

- - - -
lfc_vector <- filtered_markers_df |>
-  # Extract a vector of `summary.logFC` named by `gene_symbol`
-  dplyr::pull(summary.logFC, name = gene_symbol)
-lfc_vector <- sort(lfc_vector, decreasing = TRUE)
- - - -

Let’s look at the top ranked values.

- - - -
# Look at first entries of the log fold change vector
-head(lfc_vector)
- - -
    MS4A1      IGHM     CD79A      CD69      CD37 TNFRSF13C 
- 3.652490  3.244756  3.037044  2.776535  2.128713  2.029677 
- - - -

And the bottom of the list.

- - - -
# Look at the last entries of the log fold change vector
-tail(lfc_vector)
- - -
    RNF19A        ID3     PDLIM1   CYB561A3   HLA-DPB1       IRF8 
--0.7687903 -0.8241992 -0.8295044 -0.9030800 -1.0339844 -1.2398847 
- - - -
-
-
-

Run GSEA

-

Now for the analysis!

-

We can use the GSEA() function to perform GSEA with any -generic set of gene sets, but there are several functions for using -specific, commonly used gene sets (e.g., gseKEGG()).

- - - -
gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
-                     minGSSize = 25,  # minimum gene set size
-                     maxGSSize = 500,  # maximum gene set set
-                     pvalueCutoff = 0.05,
-                     pAdjustMethod = "BH",  # correction for multiple hypothesis testing
-                     TERM2GENE = dplyr::select(hs_immunologic_df,
-                                               gs_name,
-                                               gene_symbol))
- - -
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
- - -
preparing geneSet collections...
- - -
GSEA analysis...
- - -
leading edge analysis...
- - -
done...
- - - -

Let’s take a look at the GSEA results.

- - - -
View(gsea_results@result |>
-       dplyr::arrange(dplyr::desc(NES))
-)
- - - -

Normalized enrichment scores (NES) are enrichment scores that are -scaled to make gene sets that contain different number of genes -comparable.

-

Let’s write these results to file.

- - - -
gsea_results@result |> readr::write_tsv(output_file)
- - - -
-

Visualizing GSEA results

-

We can visualize GSEA results for individual pathways or gene sets -using enrichplot::gseaplot(). Let’s take a look at 3 -different pathways – one with a highly positive NES, one with a highly -negative NES, and one that was not a significant result – to get more -insight into how ES are calculated.

-
-

Highly Positive NES

-

Let’s take look at a pathway with a highly positive NES -(GSE29618_BCELL_VS_PDC_UP) using a GSEA plot. This gene set -is comprised of (ref):

-
-

Genes up-regulated in comparison of B cells versus plasmacytoid -dendritic cells (pDC) .

-
- - - -
enrichplot::gseaplot(gsea_results,
-                     geneSetID = "GSE29618_BCELL_VS_PDC_UP",
-                     title = "GSE29618_BCELL_VS_PDC_UP",
-                     color.line = "#0066FF")
- - -

- - - -

Notice how the genes that are in the gene set, indicated by the black -bars, tend to be on the left side of the graph indicating that they have -positive gene-level scores.

-
-
-

Highly Negative NES

-

The gene set GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN -had a highly negative NES, and is comprised of (ref)

-
-

Genes down-regulated in comparison of B cells from influenza vaccinee -at day 7 post-vaccination versus plasmacytoid dendritic cells (pDC) at -day 7 post-vaccination.

-
-

This is the “down” signature that corresponds to one of the “up” -pathways with a highly positive NES, so in a way this is not adding new -information!

- - - -
enrichplot::gseaplot(gsea_results,
-                     geneSetID = "GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN",
-                     title = "GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN",
-                     color.line = "#0066FF")
- - -

- - - -

This gene set shows the opposite pattern – genes in the pathway tend -to be on the right side of the graph.

-
-
-

A non-significant result

-

The @results slot will only show gene sets that pass the -pvalueCutoff threshold we supplied to GSEA(), -but we can plot any gene set so long as we know its name. Let’s look at -GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN, -which examines the difference between effector CD8 T cells and memory -CD8 T cells and was not in the results we viewed earlier.

- - - -
enrichplot::gseaplot(gsea_results,
-                     geneSetID = "GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN",
-                     title = "GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN",
-                     color.line = "#0066FF")
- - -

- - - -

Genes in the pathway are distributed more evenly throughout the -ranked list, resulting in a more “middling” score.

-

Note: The plots returned by enrichplot::gseaplot are -ggplots, so we could use ggplot2::ggsave() to save them to -file if we wanted to.

-
-
-
-
-

Session Info

- - - -
sessionInfo()
- - -
R version 4.4.1 (2024-06-14)
-Platform: x86_64-pc-linux-gnu
-Running under: Ubuntu 22.04.4 LTS
-
-Matrix products: default
-BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
-LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
-
-locale:
- [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
- [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
- [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
- [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
- [9] LC_ADDRESS=C               LC_TELEPHONE=C            
-[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
-
-time zone: Etc/UTC
-tzcode source: system (glibc)
-
-attached base packages:
-[1] stats     graphics  grDevices utils     datasets  methods   base     
-
-other attached packages:
-[1] msigdbr_7.5.1          clusterProfiler_4.12.0 optparse_1.7.5        
-
-loaded via a namespace (and not attached):
-  [1] DBI_1.2.2               gson_0.1.0              shadowtext_0.1.3       
-  [4] gridExtra_2.3           rlang_1.1.3             magrittr_2.0.3         
-  [7] DOSE_3.30.0             compiler_4.4.1          RSQLite_2.3.6          
- [10] png_0.1-8               vctrs_0.6.5             reshape2_1.4.4         
- [13] stringr_1.5.1           pkgconfig_2.0.3         crayon_1.5.2           
- [16] fastmap_1.1.1           XVector_0.44.0          labeling_0.4.3         
- [19] ggraph_2.2.1            utf8_1.2.4              HDO.db_0.99.1          
- [22] rmarkdown_2.26          tzdb_0.4.0              enrichplot_1.24.0      
- [25] UCSC.utils_1.0.0        purrr_1.0.2             bit_4.0.5              
- [28] xfun_0.43               zlibbioc_1.50.0         cachem_1.0.8           
- [31] aplot_0.2.2             GenomeInfoDb_1.40.0     jsonlite_1.8.8         
- [34] blob_1.2.4              highr_0.10              BiocParallel_1.38.0    
- [37] tweenr_2.0.3            parallel_4.4.1          R6_2.5.1               
- [40] bslib_0.7.0             stringi_1.8.3           RColorBrewer_1.1-3     
- [43] jquerylib_0.1.4         GOSemSim_2.30.0         Rcpp_1.0.12            
- [46] knitr_1.46              readr_2.1.5             IRanges_2.38.0         
- [49] Matrix_1.7-0            splines_4.4.1           igraph_2.0.3           
- [52] tidyselect_1.2.1        qvalue_2.36.0           yaml_2.3.8             
- [55] viridis_0.6.5           codetools_0.2-20        lattice_0.22-6         
- [58] tibble_3.2.1            plyr_1.8.9              treeio_1.28.0          
- [61] Biobase_2.64.0          withr_3.0.0             KEGGREST_1.44.0        
- [64] evaluate_0.23           gridGraphics_0.5-1      scatterpie_0.2.2       
- [67] getopt_1.20.4           polyclip_1.10-6         Biostrings_2.72.0      
- [70] ggtree_3.12.0           pillar_1.9.0            stats4_4.4.1           
- [73] ggfun_0.1.4             generics_0.1.3          vroom_1.6.5            
- [76] hms_1.1.3               S4Vectors_0.42.0        ggplot2_3.5.1          
- [79] tidytree_0.4.6          munsell_0.5.1           scales_1.3.0           
- [82] glue_1.7.0              lazyeval_0.2.2          tools_4.4.1            
- [85] data.table_1.15.4       fgsea_1.30.0            babelgene_22.9         
- [88] fs_1.6.4                graphlayouts_1.1.1      fastmatch_1.1-4        
- [91] tidygraph_1.3.1         cowplot_1.1.3           grid_4.4.1             
- [94] ape_5.8                 tidyr_1.3.1             AnnotationDbi_1.66.0   
- [97] colorspace_2.1-0        nlme_3.1-164            patchwork_1.2.0        
-[100] GenomeInfoDbData_1.2.12
- [ reached getOption("max.print") -- omitted 20 entries ]
- - -
- -
---
title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)"
output:
  html_notebook:
    toc: true
    toc_float: true
author: CCDL for ALSF
date: 2021
---

## Objectives

This notebook will demonstrate how to:

- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA)
- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package
- Perform GSEA with the `clusterProfiler` package
- Visualize GSEA results with the `enrichplot` package

---

In this notebook, we'll analyze the marker genes from cluster 1, just as we did in the previous notebook.
Unlike ORA, GSEA allows us to use the full list of genes we could reliably measure, rather than picking some cutoff ourselves.

GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).
The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.
FCS methods are better suited for identifying these pathways that show coordinated changes than ORA.
In ORA, we pick a cutoff that _typically_ only captures genes with large individual changes.

There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)):

1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our marker genes table)
2. Gene-level statistics are aggregated into a pathway-level statistic
3. Assess the statistical significance of the pathway-level statistic

We'll note here that GSEA was designed to detect small coordinated changes between _conditions_ in (necessarily bulk!) microarray data.
It may be more difficult to suss out small coordinated changes when we assay individual cells, due to technical or biological dropout.

Individual cells may also differently sized ([Blasi *et al.* 2017](https://doi.org/10.1088/1478-3975/aa609a)), be at different stages of the cell cycle ([Satija lab](https://satijalab.org/seurat/archive/v3.1/cell_cycle_vignette.html)), or experiencing different levels of cellular stress ([Luecken and Theis. 2019](https://doi.org/10.15252/msb.20188746)), some of which may be interesting in the context of cancer and therefore should not be corrected for.
(In practice, it is common to correct for cell cycle stage but we have not done that upstream.)
These aspects also have the potential to cloud our ability to detect small coordinated changes in other pathways.

There is some literature that suggests that dropout (excessive zeros) observed in single cell data is a function of cell type heterogeneity ([Kim *et al.* 2020](https://doi.org/10.1186/s13059-020-02096-y)).
We're using GSEA to compare clusters of cells, which we (ideally) expect to be different biological states and/or cell types.
If many genes "drop out" between cell types, we may expect our statistics used as input to GSEA to be noisy.

With all this in mind, it may be unsurprising that ORA, which generally captures genes with large individual changes and lower within-group variation (the most "statistically significant" genes), is well-suited to analyzing scRNA-seq data and the advantages of GSEA in a bulk analysis setting may not be fully available here.
Nonetheless, GSEA is used for pathway analysis of scRNA-seq data.

Methods specifically for pathway analysis of scRNA-seq (e.g., [Ma *et al.* 2020](https://doi.org/10.1038/s41467-020-15298-6)) are being developed, so stay tuned!

#### Other resources

* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html)
* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood.
You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012).
* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted.

## Set up

### Libraries

```{r libraries}
# Package to run GSEA
library(clusterProfiler)
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
```

### Directories and Files

#### Directories

```{r create_dir, live = TRUE}
# We'll use the marker genes as GSEA input
hodgkins_analysis_dir <- file.path("analysis", "hodgkins")

# We'll create a directory to specifically hold the pathway results if it doesn't
# exist yet
results_dir <- file.path(hodgkins_analysis_dir, "pathway-analysis")
if (!dir.exists(results_dir)) {
  dir.create(results_dir, recursive = TRUE)
}
```

#### Input files

We're going to use the version of the cluster 1 marker genes table that has gene symbols, so we don't have to do gene identifier conversion again!

```{r input_files}
input_file <- file.path(hodgkins_analysis_dir,
                        "markers",
                        "cluster01_markers_with_gene_symbols.tsv")
```

#### Output files

We'll save our table of GSEA results as a TSV.

```{r output_files}
output_file <- file.path(results_dir,
                         "cluster01_immunologic_gsea_results.tsv")
```

## Gene sets

We will use gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) from the Broad Institute ([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102)).
The [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package contains MSigDB datasets already in the tidy format required by `clusterProfiler` and supports multiple organisms.

Let's take a look at what organisms the package supports.

```{r show_species}
msigdbr_species()
```

MSigDB contains 8 different gene set collections.

    H: hallmark gene sets
    C1: positional gene sets
    C2: curated gene sets
    C3: motif gene sets
    C4: computational gene sets
    C5: GO gene sets
    C6: oncogenic signatures
    C7: immunologic signatures


Because we're working with a sample that's partly comprised of immune cells, let's use the immunologic signatures to analyze our marker genes.
These signatures are generally derived from gene expression experiments of a broad set of cell types, perturbations, and in some cases responses to specific vaccines ([Godec *et al.* 2016](https://doi.org/10.1016/j.immuni.2015.12.006)); they usually have an "up" and "down" component.
You can [read more about the C7 collection at MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#C7).

We might expect that analyzing our marker genes with this collection could give us some information not only about cell identity or type, but also phenotype or activation state.

We can retrieve only the Immunologic gene sets by specifying `category = "C7"` to the `msigdbr()` function.
Again, we only want human gene sets here so we specify with that with the `species` argument.

```{r immunologic_sets, live = TRUE}
hs_immunologic_df <- msigdbr(species = "Homo sapiens",
                             category = "C7")
```

## Gene Set Enrichment Analysis

_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_

![](diagrams/subramanian_fig1.jpg)

**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).**

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic.
This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis))

Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated:
Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway.
The enrichment score for a pathway is the running sum's maximum deviation from zero.
GSEA also assesses statistical significance of the scores for each pathway through permutation testing.
As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).

The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets.
When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

## Marker gene results

```{r read_in_markers, live = TRUE}
markers_df <- readr::read_tsv(input_file)
```

```{r}
head(markers_df)
```

Since this data frame of marker genes includes gene symbols, we do not need to perform any kind of gene identifier conversion.
We do, however, need to check for duplicate gene symbols.
We can accomplish this with `duplicated()`, which returns a logical vector (e.g., `TRUE` or `FALSE`).
The function `sum()` will count `TRUE` values as 1s and `FALSE` as 0s, so using it with `duplicated()` will count the number of duplicate values.

```{r any_duplicated, live = TRUE}
sum(duplicated(markers_df$gene_symbol))
```

Luckily this number is very low, such that we do not expect it to impact our results much, but this will still cause a problem when we go to run GSEA.

### Removing duplicates

The GSEA approach requires on discriminating between genes that are in a gene set and those that are not.
Practically speaking, gene sets are just collections of gene identifiers!
When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.

Compared to the total number of genes that are in our results, there are not a lot of duplicates but we'll still need to make a decision about how to handle them.

Let's get a vector of the duplicated gene symbols so we can use it to explore our filtering steps.

```{r gene_dups, live = TRUE}
duplicated_gene_symbols <- markers_df |>
  dplyr::filter(duplicated(gene_symbol)) |>
  dplyr::pull(gene_symbol)
```

Now we'll look at the values for the duplicated gene symbols.

```{r show_gene_dups}
markers_df |>
  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) |>
  dplyr::arrange(gene_symbol)
```

We can see that the associated values vary for each row.

Let's keep the gene symbols associated with the higher absolute value of the gene-level statistic we'll use for GSEA (`summary.logFC`).

Retaining the instance of the gene symbols with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list.
We should keep this decision in mind when interpreting our results.
For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking.

In the next chunk, we are going to filter out the duplicated rows using the `dplyr::distinct()` function after sorting by absolute value of the summary log fold change.
This will keep the first row with the duplicated value thus keeping the row with the largest absolute value.

```{r filter_dup_markers}
filtered_markers_df <- markers_df |>
  # Sort so that the highest absolute values of the summary log2 fold change are
  # at the top
  dplyr::arrange(dplyr::desc(abs(summary.logFC))) |>
  # Filter out the duplicated rows using `dplyr::distinct()`
  dplyr::distinct(gene_symbol, .keep_all = TRUE)
```

Let's see what happened to our duplicate identifiers.

```{r show_filtered_markers, live = TRUE}
# Subset to & arrange by gene symbols that were duplicated in the original
# data frame of results
filtered_markers_df |>
  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) |>
  dplyr::arrange(gene_symbol)
```

Now we're ready to prep our pre-ranked list for GSEA.

### Pre-ranked list

The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
This is step 1 -- gene-level statistics.

Here, we're using the summary log fold change, which summarizes of all of the log fold changes from the pairwise comparisons between cluster 1 and other clusters into a single value ([ref](https://rdrr.io/bioc/scran/man/combineMarkers.html)).
As such, this statistic gives us information about the relative magnitude and directionality of each gene's expression in cluster 1 relative to all other clusters.

```{r lfc_vector}
lfc_vector <- filtered_markers_df |>
  # Extract a vector of `summary.logFC` named by `gene_symbol`
  dplyr::pull(summary.logFC, name = gene_symbol)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
```

Let's look at the top ranked values.

```{r head_lfc, live = TRUE}
# Look at first entries of the log fold change vector
head(lfc_vector)
```

And the bottom of the list.

```{r tail_lfc, live = TRUE}
# Look at the last entries of the log fold change vector
tail(lfc_vector)
```

## Run GSEA

Now for the analysis!

We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`).

```{r run_gsea}
gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
                     minGSSize = 25,  # minimum gene set size
                     maxGSSize = 500,  # maximum gene set set
                     pvalueCutoff = 0.05,
                     pAdjustMethod = "BH",  # correction for multiple hypothesis testing
                     TERM2GENE = dplyr::select(hs_immunologic_df,
                                               gs_name,
                                               gene_symbol))
```
Let's take a look at the GSEA results.

```{r view_gsea, live = TRUE, eval = FALSE}
View(gsea_results@result |>
       dplyr::arrange(dplyr::desc(NES))
)
```

Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.

Let's write these results to file.

```{r write_gsea}
gsea_results@result |> readr::write_tsv(output_file)
```

### Visualizing GSEA results

We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`.
Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated.

#### Highly Positive NES

Let's take look at a pathway with a highly positive NES (`GSE29618_BCELL_VS_PDC_UP`) using a GSEA plot.
This gene set is comprised of ([ref](https://www.gsea-msigdb.org/gsea/msigdb/cards/GSE29618_BCELL_VS_PDC_UP)):

> Genes up-regulated in comparison of B cells versus plasmacytoid dendritic cells (pDC) .

```{r highly_pos}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "GSE29618_BCELL_VS_PDC_UP",
                     title = "GSE29618_BCELL_VS_PDC_UP",
                     color.line = "#0066FF")
```

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.

#### Highly Negative NES

The gene set `GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN` had a highly negative NES, and is comprised of ([ref](https://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?geneSetName=GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN))

>  	Genes down-regulated in comparison of B cells from influenza vaccinee at day 7 post-vaccination versus plasmacytoid dendritic cells (pDC) at day 7 post-vaccination.

This is the "down" signature that corresponds to one of the "up" pathways with a highly positive NES, so in a way this is not adding new information!

```{r highly_neg}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN",
                     title = "GSE29618_BCELL_VS_PDC_DAY7_FLU_VACCINE_DN",
                     color.line = "#0066FF")
```

This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph.

#### A non-significant result

The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name.
Let's look at [`GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN`](http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?geneSetName=GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_UP&keywords=GOLDRATH), which examines the difference between effector CD8 T cells and memory CD8 T cells and was not in the results we viewed earlier.

```{r nonsig, live = TRUE}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN",
                     title = "GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN",
                     color.line = "#0066FF")
```

Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score.

*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.*

## Session Info

```{r session_info}
sessionInfo()
```

- - -
-
- -
- - - - - - - - - - - - - - - - - From 6afb72011832657260f5ac77daec4a4795a033dd Mon Sep 17 00:00:00 2001 From: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com> Date: Mon, 25 Nov 2024 09:56:58 -0500 Subject: [PATCH 3/7] Remove ORA notebook --- .../04-overrepresentation_analysis-live.Rmd | 416 -- .../04-overrepresentation_analysis.Rmd | 429 -- .../04-overrepresentation_analysis.nb.html | 3868 ----------------- 3 files changed, 4713 deletions(-) delete mode 100644 scRNA-seq-advanced/04-overrepresentation_analysis-live.Rmd delete mode 100644 scRNA-seq-advanced/04-overrepresentation_analysis.Rmd delete mode 100644 scRNA-seq-advanced/04-overrepresentation_analysis.nb.html diff --git a/scRNA-seq-advanced/04-overrepresentation_analysis-live.Rmd b/scRNA-seq-advanced/04-overrepresentation_analysis-live.Rmd deleted file mode 100644 index 8c4edd8a..00000000 --- a/scRNA-seq-advanced/04-overrepresentation_analysis-live.Rmd +++ /dev/null @@ -1,416 +0,0 @@ ---- -title: "Pathway analysis: Over-representation analysis (ORA)" -output: - html_notebook: - toc: true - toc_float: true -author: CCDL for ALSF -date: 2020 ---- - -## Objectives - -This notebook will demonstrate how to: - -- Perform gene identifier conversion with [`AnnotationDBI` annotation packages](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) -- Prepare gene sets for over-representation analysis, including an appropriate background set -- Perform over-representation analysis with the `clusterProfiler` package -- Introduce resources for ORA of scRNA-seq data, such as Gene Ontology - ---- - -In this notebook, we'll cover a type of pathway or gene set analysis called over-representation analysis (ORA). -The idea behind ORA is relatively straightforward: given a set of genes, do these genes overlap with a pathway more than we expect by chance? -The simplicity of only requiring an input gene set (sort of, more on that below) can be attractive. - -ORA has some limitations, outlined nicely (and more extensively!) in [Khatri _et al._ (2012)]( https://doi.org/10.1371/journal.pcbi.1002375). -One of the main issues with ORA is that typically all genes are treated as equal -- the context of the magnitude of a change we may be measuring is removed and each gene is treated as independent, which can sometimes result in an incorrect estimate of significance. - -We will use the [`clusterProfiler` package](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) ([Yu *et al.* 2012](https://doi.org/10.1089/omi.2011.0118.)) to perform ORA. -`clusterProfiler` has many built-in functions that will run a specific type of analysis using a specific source of pathways/gene sets automatically, but for our purposes we're going to keep things as general as possible. -See the [`clusterProfiler` book](https://yulab-smu.github.io/clusterProfiler-book/index.html) for more information about the package's full suite of functionality. - -Because different bioinformatics tools often require different types of gene identifiers, we'll also cover how to convert between gene identifiers using [`AnnotationDbi`](https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html) Bioconductor packages in this notebook. -Check out the [_AnnotationDbi: Introduction To Bioconductor Annotation Packages_ (Carlson 2020.) vignette](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) for more information. - -#### Other resources - -* For another example using `clusterProfiler`, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html) -* [`WebGestaltR`](https://cran.r-project.org/web/packages/WebGestaltR/) is another R package that can be used for ORA. -* OSCA has [a section on using Gene Ontology sets to annotate clusters](https://bioconductor.org/books/3.14/OSCA.basic/cell-type-annotation.html#assigning-cluster-labels-from-markers) using a different Bioconductor package. - -## Set up - -### Libraries - -```{r libraries} -# Package we'll use to run ORA -library(clusterProfiler) -# Package that contains MSigDB gene sets in tidy format -library(msigdbr) -# Homo sapiens annotation package we'll use for gene identifier conversion -library(org.Hs.eg.db) -``` - -### Directories and files - -#### Directories - -```{r create_ora_directory, live = TRUE} - -# We'll use the output of the clustering notebook for ORA - -# We'll create a directory to specifically hold the pathway results - -# Create a directory for an updated marker list file - -``` - -#### Input files - -We're going to perform ORA on the marker genes from cluster 1 we looked at in the previous notebook. - -```{r cluster_input} -input_file <- file.path(hodgkins_data_dir, - "markers", - "cluster01_markers.tsv") -``` - -We're going to use a file that is a cleaned version of the [CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)) human markers resources. -You can see how we obtained and cleaned this file in [this notebook](https://github.com/AlexsLemonade/training-modules/blob/master/scRNA-seq/setup/cellmarker_genes.Rmd). - -```{r cm_input} -cellmarker_file <- file.path("gene-sets", - "CellMarker_cleaned_human_markers.tsv") -``` - -#### Output files - -In the next notebook, we'll use the cluster 1 results. -Saving the post-gene identifier conversion results in this notebook will prevent us from duplicating effort (and code)! - -```{r gene_id_file} -cluster1_output_file <- file.path(hodgkins_analysis_dir, - "markers", - "cluster01_markers_with_gene_symbols.tsv") -``` - -We'll save the table of ORA results (e.g., p-values). - -```{r output_files, live = TRUE} - -``` - -## Read in marker genes - -Let's read in our marker genes from cluster 1. - -```{r read_in_markers, live = TRUE} - -``` - -And let's take a peek! - -```{r marker_head, live = TRUE} - -``` - -### Gene identifier conversion - -Our data frame of marker genes contains Ensembl gene identifiers. -We will need to convert from these identifiers into gene symbols or Entrez IDs for our use with cell marker data. - -We're going to convert our identifiers to gene symbols because they are a bit more human readable, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers! - -The annotation package `org.Hs.eg.db` contains information for different identifiers. -`org.Hs.eg.db` is specific to _Homo sapiens_ -- this is what the `Hs` in the package name is referencing. -To perform gene identifier conversion in mouse (_Mus musculus_) we could use `org.Mm.eg.db`; -we would use `org.Dr.eg.db` for zebrafish (_Danio rerio_). - -We can see what types of IDs are available to us in an annotation package with `keytypes()`. - -```{r keytypes, live = TRUE} - -``` - -Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to gene symbols (`SYMBOL`), we could just as easily use it to convert from an Ensembl transcript ID (`ENSEMBLTRANS`) to Entrez IDs (`ENTREZID`). - -The function we will use to map from Ensembl gene IDs to gene symbols is called `mapIds()`. - -```{r map_to_symbol} -# This returns a named vector which we can convert to a data frame, where -# the keys (Ensembl IDs) are the names -symbols_vector <- mapIds(org.Hs.eg.db, # Specify the annotation package - # The vector of gene identifiers we want to - # map - keys = markers_df$gene, - # What type of gene identifiers we're starting - # with - keytype = "ENSEMBL", - # The type of gene identifier we want returned - column = "SYMBOL", - # In the case of 1:many mappings, return the - # first one. This is default behavior! - multiVals = "first") - -# We would like a data frame we can join to the DGE results -symbols_df <- data.frame( - ensembl_id = names(symbols_vector), - gene_symbol = symbols_vector -) |> - # Drop genes that don't have a gene symbol! - tidyr::drop_na() -``` - -This message is letting us know that sometimes Ensembl gene identifiers will map to multiple gene symbols. -In this case, it's also possible that a gene symbol will map to multiple Ensembl IDs. - -Now we are ready to add the gene symbols to our data frame with the marker genes. -We can use a _join_ function from the `dplyr` package to do this, which will use the Ensembl gene IDs in both data frames to determine how to join the rows. - -```{r add_symbols, live = TRUE} - - # An *inner* join will only return rows that are in both data frames - - # The name of the column that contains the Ensembl gene IDs - # in the left data frame and right data frame, effectively - # dropping genes that don't have gene symbols - -``` - -We're going to use this table again, with the gene symbols, in the next notebook so let's write it to a file. - -```{r save_gene_symbols, live = TRUE} - -``` - -## Over-representation Analysis (ORA) - -To test for over-representation, we can calculate a p-value with a hypergeometric test ([ref](https://yulab-smu.github.io/clusterProfiler-book/chapter2.html#over-representation-analysis)). - -\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{ {M \choose i}{ {N-M} \choose {n-i} } } { {N \choose n} }\) - -Where `N` is the number of genes in the background distribution, `M` is the number of genes in a pathway or gene set, `n` is the number of genes we are interested in (our marker genes), and `k` is the number of genes that overlap between the pathway or gene set and our marker genes. - -Borrowing an example from [_clusterProfiler: universal enrichment tool for functional and comparative study_ (Yu )](http://yulab-smu.top/clusterProfiler-book/chapter2.html#over-representation-analysis): - -> **Example**: Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differential expressed genes, 28 are annotated to a gene set. - -We'll call genes that are differentially expressed `gene_in_interest` and genes that are in the gene set `in_gene_set`. - -```{r gene_table} -gene_table <- data.frame( - gene_not_interest = c(2613, 15310), - gene_in_interest = c(28, 29) -) -rownames(gene_table) <- c("in_gene_set", "not_in_gene_set") - -gene_table -``` - -We can assess if the 28 overlapping genes mean that the differentially expressed genes are over-represented in the gene set with the hypergeometric distribution. -This corresponds to a one-sided Fisher's exact test. - -```{r fisher_test} -fisher.test(gene_table, alternative = "greater") -``` - -When we test **multiple pathways or gene sets**, the p-values then need to be **adjusted** for multiple hypothesis testing. - -### Marker gene preparation - -#### Top 100 cluster 1 genes - -We're interested in what pathways are over-represented in genes that mark, or are more highly expressed in, cluster 1. -We'll select the top 100 genes for this comparison. - -```{r cluster01_genes} -cluster01_markers <- markers_df |> - # Positive fold changes - dplyr::filter(summary.logFC > 0) |> - # Take the "top 100" when ordering by FDR (smallest FDR values) - dplyr::slice_min(order_by = FDR, n = 100) |> - # Extract a vector of the gene symbols - dplyr::pull(gene_symbol) -``` - -You may think that selecting the top 100 genes is pretty arbitrary and you are not wrong! -This also may seem strict, but let's think about our goal for this particular analysis: we hope that using ORA with resources that encode some knowledge about cell types will help us annotate cluster 1. -We're using the very top ranked genes in service of this goal. - -An alternative approach would be to have filtered to genes that are below a specific FDR threshold, but that would also be an arbitrary threshold that we chose. -This is a weakness of over-representation analysis; we generally need to pick a threshold for the genes that we include in our tests. - -As we mentioned in the clustering notebook, our statistical calculations are circular here. -We identified clusters based on gene expression and then identified genes that are differentially expressed between clusters. -We can use the FDR values for ranking, like we are here, but we shouldn't use them to define "significance." -You can read a little more about this in the [_Invalidity of p-values_ section of the OSCA book](https://bioconductor.org/books/3.14/OSCA.advanced/marker-detection-redux.html#p-value-invalidity). - -#### Background set - -As we saw above, calculating the p-value for ORA relies on the number of genes in the background distribution. -Sometimes folks consider genes from the entire genome to comprise the background, but in the example borrowed from the `clusterProfiler` authors, they state: - -> 17,980 genes detected in a Microarray study - -Where the key phrase is **genes detected**. - -If we were unable to include a gene in our marker gene comparisons because, for example, we could not reliably measure it, we shouldn't include in our background set! - -Our markers data frame has all of the genes that we did not filter out and have gene symbols corresponding to the Ensembl gene IDs, so we can use the `gene_symbol` column as our background set. - -```{r get_background_set, live = TRUE} - -``` - -### Gene Ontology ORA - -#### A note on GO - -The Gene Ontology (GO) ([Ashburner *et al.* 2000](https://dx.doi.org/10.1038/75556); [The Gene Ontology Consortium. 2021](https://doi.org/10.1093/nar/gkaa1113))is [an ontology](https://en.wikipedia.org/wiki/Ontology_(information_science)) that documents or describes how genes function in the biological domain and is comprised of 3 parts ([GO ontology documentation](http://geneontology.org/docs/ontology-documentation/)): - -* **Molecular function (MF):** activities that occur on the molecular level, often can be performed by individual genes -* **Biological process (BP):** programs carried out by multiple molecular entities; similar to what is sometimes thought of as a pathway, but with no notion of _dynamics_ -* **Cellular component (CC):** cellular structures and compartments - -GO is "loosely hierarchical" ([GO ontology documentation](http://geneontology.org/docs/ontology-documentation/)). -There are parent and child terms, where the "higher up" a term is, the more general it is. -This is somewhat abstract, so let's take a look at a GO term that's likely to be relevant to our Hodgkin's lymphoma data to get an idea of what this is like in practice: [`leukocyte activation`](http://www.informatics.jax.org/vocab/gene_ontology/GO:0045321). - -#### Run `enrichGO()` - -Now that we have our background set and our genes of interest, we're ready to run ORA using the `enrichGO()` function. - -```{r go_ora} -go_ora_results <- enrichGO(gene = cluster01_markers, # Top 100 genes - universe = background_set, # Background genes - keyType = "SYMBOL", # Our genes are gene symbols - OrgDb = org.Hs.eg.db, # Supports conversion, if needed - ont = "BP", # Biological process - pAdjustMethod = "BH", # FDR - pvalueCutoff = 0.00001) # Very stringent for viz -``` - -What is returned by `enrichGO()`? - -```{r view_go_ora, eval = FALSE} -View(go_ora_results) -``` - -The information we're most likely interested in is in the `result` slot. -Let's convert this into a data frame that we can write to file. - -```{r go_df} -go_result_df <- data.frame(go_ora_results@result) -``` - -Let's take a look at the GO sets with an adjusted p-value less than 0.00001. - -```{r filter_padj, live = TRUE} - -``` - -Some of these columns have names that are somewhat opaque and are used for visualization, so let's briefly talk about them here! - -| Column name | Numerator | Denominator | -|-------------|-----------|-------------| -| `GeneRatio` | Number of genes in the top 100 marker genes and the specific gene set (also in `Count` column) | Number of genes in the top 100 marker genes and _any_ gene set | -| `BgRatio` | Number of genes in the background set and the specific gene set | Number of genes in the background set and _any_ gene set | - -#### Visualizing results - -We can use a dot plot to visualize our significant enrichment results. - -```{r dotplot, live = TRUE} - -``` - -**What cell type do you think the cells from cluster 1 are?** - -We can use an [UpSet plot](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720993/) to visualize the **overlap** between the gene sets that were returned as significant. - -```{r upsetplot, live = TRUE} - -``` - -It's important to keep in mind that gene sets or pathways aren't independent, either! -This is particularly true of GO terms, which are hierarchical in nature. - -### CellMarker ORA - -Next, we'll use another resource, CellMarker, as additional support for our conclusions based on GO. - -#### About CellMarker - -[CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) is a manually curated database of cell marker data for human and mouse ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)). -Studies/evidence of the following nature were included in the manual curation process ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)): - -* scRNA-seq -* Flow cytometry -* Immunostaining - -And there are both gene and protein markers. - -Let's read in the CellMarker data we've already cleaned: - -```{r read_in_cm, live = TRUE} - -``` -#### Run `enricher()` - -`clusterProfiler::enricher()` is a more general way to perform ORA -- we specify the pathways or gene sets ourselves. -We can use the it to run ORA with the CellMarker data, which contains gene symbols. -That's why we needed to perform gene identifier conversion earlier! - -```{r cm_ora, live = TRUE} - - # The pathway information should be a data frame with a term name or - # identifier and the gene identifiers - -``` - -#### Visualizing results - -Let's look at a bar plot of the significant results. - -```{r cm_barplot} -barplot(cellmarker_ora_results) + - # This is a ggplot, so we can use the following to label the x-axis! - ggplot2::labs(x = "count") -``` - -One thing that the [web version of CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) allows us to do is to get a better idea of _marker prevalence_. -Some of the genes that show up in our top 100 may not be very specific. -It's also likely that kidney is not very relevant to our analysis of Hodgkin's lymphoma, but lymph node is highly relevant. -However, blood cells in kidney are not _unexpected_ and, if we view the [CellMarker tissue type browser](http://bio-bigdata.hrbmu.edu.cn/CellMarker/browse.jsp), we can see that there are _a lot_ of genes annotated to kidney. -Every resource will have some limitations "baked in." - -We could do additional filtering of the gene sets to remove irrelevant tissues and to remove marker genes that are not particularly specific. -(Fewer gene sets would mean a lower multiple hypothesis testing burden.) -If some of the CellMarker marker genes come from immunostaining of terminally differentiated cells, we might not expect their transcripts to show up in scRNA-seq data, for example. -All things to keep in mind if you use this resource for your own work! - -### Write results to file - -Write the results to file for both GO and CellMarker gene sets. - -```{r write_results, live = TRUE} - -``` - -## Parting thoughts - -The goal of analyzing marker genes with ORA, like we did in this notebook, is to help us interpret our clustering results or annotate clusters. -We shouldn't use marker genes or any downstream analysis we perform with them to _justify_ cluster assignments; we generally want our cluster assignments to be data-driven. - -Just like it's important to be skeptical about p-values returned from the marker gene analysis itself, we should not get too attached to the idea of statistical significance here. -If we were to tweak the cutoffs (e.g., take the top 2500 genes), we might expect to get completely different significant results. -This is a limitation of ORA: we have to make some choices that are somewhat arbitrary. -It's a good idea to think about the goal of your analysis _upfront_ and pick cutoffs based on your goals. -That can help you avoid the temptation of trying a bunch of cutoffs until you get results that "look good." - -## Session Info - -```{r session_info} -sessionInfo() -``` diff --git a/scRNA-seq-advanced/04-overrepresentation_analysis.Rmd b/scRNA-seq-advanced/04-overrepresentation_analysis.Rmd deleted file mode 100644 index 277c0c71..00000000 --- a/scRNA-seq-advanced/04-overrepresentation_analysis.Rmd +++ /dev/null @@ -1,429 +0,0 @@ ---- -title: "Pathway analysis: Over-representation analysis (ORA)" -output: - html_notebook: - toc: true - toc_float: true -author: CCDL for ALSF -date: 2020 ---- - -## Objectives - -This notebook will demonstrate how to: - -- Perform gene identifier conversion with [`AnnotationDBI` annotation packages](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) -- Prepare gene sets for over-representation analysis, including an appropriate background set -- Perform over-representation analysis with the `clusterProfiler` package -- Introduce resources for ORA of scRNA-seq data, such as Gene Ontology - ---- - -In this notebook, we'll cover a type of pathway or gene set analysis called over-representation analysis (ORA). -The idea behind ORA is relatively straightforward: given a set of genes, do these genes overlap with a pathway more than we expect by chance? -The simplicity of only requiring an input gene set (sort of, more on that below) can be attractive. - -ORA has some limitations, outlined nicely (and more extensively!) in [Khatri _et al._ (2012)]( https://doi.org/10.1371/journal.pcbi.1002375). -One of the main issues with ORA is that typically all genes are treated as equal -- the context of the magnitude of a change we may be measuring is removed and each gene is treated as independent, which can sometimes result in an incorrect estimate of significance. - -We will use the [`clusterProfiler` package](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) ([Yu *et al.* 2012](https://doi.org/10.1089/omi.2011.0118.)) to perform ORA. -`clusterProfiler` has many built-in functions that will run a specific type of analysis using a specific source of pathways/gene sets automatically, but for our purposes we're going to keep things as general as possible. -See the [`clusterProfiler` book](https://yulab-smu.github.io/clusterProfiler-book/index.html) for more information about the package's full suite of functionality. - -Because different bioinformatics tools often require different types of gene identifiers, we'll also cover how to convert between gene identifiers using [`AnnotationDbi`](https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html) Bioconductor packages in this notebook. -Check out the [_AnnotationDbi: Introduction To Bioconductor Annotation Packages_ (Carlson 2020.) vignette](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) for more information. - -#### Other resources - -* For another example using `clusterProfiler`, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html) -* [`WebGestaltR`](https://cran.r-project.org/web/packages/WebGestaltR/) is another R package that can be used for ORA. -* OSCA has [a section on using Gene Ontology sets to annotate clusters](https://bioconductor.org/books/3.14/OSCA.basic/cell-type-annotation.html#assigning-cluster-labels-from-markers) using a different Bioconductor package. - -## Set up - -### Libraries - -```{r libraries} -# Package we'll use to run ORA -library(clusterProfiler) -# Package that contains MSigDB gene sets in tidy format -library(msigdbr) -# Homo sapiens annotation package we'll use for gene identifier conversion -library(org.Hs.eg.db) -``` - -### Directories and files - -#### Directories - -```{r create_ora_directory, live = TRUE} -hodgkins_data_dir <- file.path("data", "hodgkins") -# We'll use the output of the clustering notebook for ORA -hodgkins_analysis_dir <- file.path("analysis", "hodgkins") - -# We'll create a directory to specifically hold the pathway results -results_dir <- file.path(hodgkins_analysis_dir, "pathway-analysis") -fs::dir_create(results_dir) - -# Create a directory for an updated marker list file -fs::dir_create(file.path(hodgkins_analysis_dir, "markers")) -``` - -#### Input files - -We're going to perform ORA on the marker genes from cluster 1 we looked at in the previous notebook. - -```{r cluster_input} -input_file <- file.path(hodgkins_data_dir, - "markers", - "cluster01_markers.tsv") -``` - -We're going to use a file that is a cleaned version of the [CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)) human markers resources. -You can see how we obtained and cleaned this file in [this notebook](https://github.com/AlexsLemonade/training-modules/blob/master/scRNA-seq/setup/cellmarker_genes.Rmd). - -```{r cm_input} -cellmarker_file <- file.path("gene-sets", - "CellMarker_cleaned_human_markers.tsv") -``` - -#### Output files - -In the next notebook, we'll use the cluster 1 results. -Saving the post-gene identifier conversion results in this notebook will prevent us from duplicating effort (and code)! - -```{r gene_id_file} -cluster1_output_file <- file.path(hodgkins_analysis_dir, - "markers", - "cluster01_markers_with_gene_symbols.tsv") -``` - -We'll save the table of ORA results (e.g., p-values). - -```{r output_files, live = TRUE} -go_results_file <- file.path(results_dir, "cluster01_go_ora_results.tsv") -cm_results_file <- file.path(results_dir, - "cluster01_CellMarker_ora_results.tsv") -``` - -## Read in marker genes - -Let's read in our marker genes from cluster 1. - -```{r read_in_markers, live = TRUE} -markers_df <- readr::read_tsv(input_file) -``` - -And let's take a peek! - -```{r marker_head, live = TRUE} -head(markers_df) -``` - -### Gene identifier conversion - -Our data frame of marker genes contains Ensembl gene identifiers. -We will need to convert from these identifiers into gene symbols or Entrez IDs for our use with cell marker data. - -We're going to convert our identifiers to gene symbols because they are a bit more human readable, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers! - -The annotation package `org.Hs.eg.db` contains information for different identifiers. -`org.Hs.eg.db` is specific to _Homo sapiens_ -- this is what the `Hs` in the package name is referencing. -To perform gene identifier conversion in mouse (_Mus musculus_) we could use `org.Mm.eg.db`; -we would use `org.Dr.eg.db` for zebrafish (_Danio rerio_). - -We can see what types of IDs are available to us in an annotation package with `keytypes()`. - -```{r keytypes, live = TRUE} -keytypes(org.Hs.eg.db) -``` - -Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to gene symbols (`SYMBOL`), we could just as easily use it to convert from an Ensembl transcript ID (`ENSEMBLTRANS`) to Entrez IDs (`ENTREZID`). - -The function we will use to map from Ensembl gene IDs to gene symbols is called `mapIds()`. - -```{r map_to_symbol} -# This returns a named vector which we can convert to a data frame, where -# the keys (Ensembl IDs) are the names -symbols_vector <- mapIds(org.Hs.eg.db, # Specify the annotation package - # The vector of gene identifiers we want to - # map - keys = markers_df$gene, - # What type of gene identifiers we're starting - # with - keytype = "ENSEMBL", - # The type of gene identifier we want returned - column = "SYMBOL", - # In the case of 1:many mappings, return the - # first one. This is default behavior! - multiVals = "first") - -# We would like a data frame we can join to the DGE results -symbols_df <- data.frame( - ensembl_id = names(symbols_vector), - gene_symbol = symbols_vector -) |> - # Drop genes that don't have a gene symbol! - tidyr::drop_na() -``` - -This message is letting us know that sometimes Ensembl gene identifiers will map to multiple gene symbols. -In this case, it's also possible that a gene symbol will map to multiple Ensembl IDs. - -Now we are ready to add the gene symbols to our data frame with the marker genes. -We can use a _join_ function from the `dplyr` package to do this, which will use the Ensembl gene IDs in both data frames to determine how to join the rows. - -```{r add_symbols, live = TRUE} -markers_df <- symbols_df |> - # An *inner* join will only return rows that are in both data frames - dplyr::inner_join(markers_df, - # The name of the column that contains the Ensembl gene IDs - # in the left data frame and right data frame, effectively - # dropping genes that don't have gene symbols - by = c("ensembl_id" = "gene")) -``` - -We're going to use this table again, with the gene symbols, in the next notebook so let's write it to a file. - -```{r save_gene_symbols, live = TRUE} -readr::write_tsv(markers_df, file = cluster1_output_file) -``` - -## Over-representation Analysis (ORA) - -To test for over-representation, we can calculate a p-value with a hypergeometric test ([ref](https://yulab-smu.github.io/clusterProfiler-book/chapter2.html#over-representation-analysis)). - -\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{ {M \choose i}{ {N-M} \choose {n-i} } } { {N \choose n} }\) - -Where `N` is the number of genes in the background distribution, `M` is the number of genes in a pathway or gene set, `n` is the number of genes we are interested in (our marker genes), and `k` is the number of genes that overlap between the pathway or gene set and our marker genes. - -Borrowing an example from [_clusterProfiler: universal enrichment tool for functional and comparative study_ (Yu )](http://yulab-smu.top/clusterProfiler-book/chapter2.html#over-representation-analysis): - -> **Example**: Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differential expressed genes, 28 are annotated to a gene set. - -We'll call genes that are differentially expressed `gene_in_interest` and genes that are in the gene set `in_gene_set`. - -```{r gene_table} -gene_table <- data.frame( - gene_not_interest = c(2613, 15310), - gene_in_interest = c(28, 29) -) -rownames(gene_table) <- c("in_gene_set", "not_in_gene_set") - -gene_table -``` - -We can assess if the 28 overlapping genes mean that the differentially expressed genes are over-represented in the gene set with the hypergeometric distribution. -This corresponds to a one-sided Fisher's exact test. - -```{r fisher_test} -fisher.test(gene_table, alternative = "greater") -``` - -When we test **multiple pathways or gene sets**, the p-values then need to be **adjusted** for multiple hypothesis testing. - -### Marker gene preparation - -#### Top 100 cluster 1 genes - -We're interested in what pathways are over-represented in genes that mark, or are more highly expressed in, cluster 1. -We'll select the top 100 genes for this comparison. - -```{r cluster01_genes} -cluster01_markers <- markers_df |> - # Positive fold changes - dplyr::filter(summary.logFC > 0) |> - # Take the "top 100" when ordering by FDR (smallest FDR values) - dplyr::slice_min(order_by = FDR, n = 100) |> - # Extract a vector of the gene symbols - dplyr::pull(gene_symbol) -``` - -You may think that selecting the top 100 genes is pretty arbitrary and you are not wrong! -This also may seem strict, but let's think about our goal for this particular analysis: we hope that using ORA with resources that encode some knowledge about cell types will help us annotate cluster 1. -We're using the very top ranked genes in service of this goal. - -An alternative approach would be to have filtered to genes that are below a specific FDR threshold, but that would also be an arbitrary threshold that we chose. -This is a weakness of over-representation analysis; we generally need to pick a threshold for the genes that we include in our tests. - -As we mentioned in the clustering notebook, our statistical calculations are circular here. -We identified clusters based on gene expression and then identified genes that are differentially expressed between clusters. -We can use the FDR values for ranking, like we are here, but we shouldn't use them to define "significance." -You can read a little more about this in the [_Invalidity of p-values_ section of the OSCA book](https://bioconductor.org/books/3.14/OSCA.advanced/marker-detection-redux.html#p-value-invalidity). - -#### Background set - -As we saw above, calculating the p-value for ORA relies on the number of genes in the background distribution. -Sometimes folks consider genes from the entire genome to comprise the background, but in the example borrowed from the `clusterProfiler` authors, they state: - -> 17,980 genes detected in a Microarray study - -Where the key phrase is **genes detected**. - -If we were unable to include a gene in our marker gene comparisons because, for example, we could not reliably measure it, we shouldn't include in our background set! - -Our markers data frame has all of the genes that we did not filter out and have gene symbols corresponding to the Ensembl gene IDs, so we can use the `gene_symbol` column as our background set. - -```{r get_background_set, live = TRUE} -background_set <- markers_df$gene_symbol -``` - -### Gene Ontology ORA - -#### A note on GO - -The Gene Ontology (GO) ([Ashburner *et al.* 2000](https://dx.doi.org/10.1038/75556); [The Gene Ontology Consortium. 2021](https://doi.org/10.1093/nar/gkaa1113))is [an ontology](https://en.wikipedia.org/wiki/Ontology_(information_science)) that documents or describes how genes function in the biological domain and is comprised of 3 parts ([GO ontology documentation](http://geneontology.org/docs/ontology-documentation/)): - -* **Molecular function (MF):** activities that occur on the molecular level, often can be performed by individual genes -* **Biological process (BP):** programs carried out by multiple molecular entities; similar to what is sometimes thought of as a pathway, but with no notion of _dynamics_ -* **Cellular component (CC):** cellular structures and compartments - -GO is "loosely hierarchical" ([GO ontology documentation](http://geneontology.org/docs/ontology-documentation/)). -There are parent and child terms, where the "higher up" a term is, the more general it is. -This is somewhat abstract, so let's take a look at a GO term that's likely to be relevant to our Hodgkin's lymphoma data to get an idea of what this is like in practice: [`leukocyte activation`](http://www.informatics.jax.org/vocab/gene_ontology/GO:0045321). - -#### Run `enrichGO()` - -Now that we have our background set and our genes of interest, we're ready to run ORA using the `enrichGO()` function. - -```{r go_ora} -go_ora_results <- enrichGO(gene = cluster01_markers, # Top 100 genes - universe = background_set, # Background genes - keyType = "SYMBOL", # Our genes are gene symbols - OrgDb = org.Hs.eg.db, # Supports conversion, if needed - ont = "BP", # Biological process - pAdjustMethod = "BH", # FDR - pvalueCutoff = 0.00001) # Very stringent for viz -``` - -What is returned by `enrichGO()`? - -```{r view_go_ora, eval = FALSE} -View(go_ora_results) -``` - -The information we're most likely interested in is in the `result` slot. -Let's convert this into a data frame that we can write to file. - -```{r go_df} -go_result_df <- data.frame(go_ora_results@result) -``` - -Let's take a look at the GO sets with an adjusted p-value less than 0.00001. - -```{r filter_padj, live = TRUE} -go_result_df |> - dplyr::filter(p.adjust < 0.00001) -``` - -Some of these columns have names that are somewhat opaque and are used for visualization, so let's briefly talk about them here! - -| Column name | Numerator | Denominator | -|-------------|-----------|-------------| -| `GeneRatio` | Number of genes in the top 100 marker genes and the specific gene set (also in `Count` column) | Number of genes in the top 100 marker genes and _any_ gene set | -| `BgRatio` | Number of genes in the background set and the specific gene set | Number of genes in the background set and _any_ gene set | - -#### Visualizing results - -We can use a dot plot to visualize our significant enrichment results. - -```{r dotplot, live = TRUE} -enrichplot::dotplot(go_ora_results) -``` - -**What cell type do you think the cells from cluster 1 are?** - -We can use an [UpSet plot](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720993/) to visualize the **overlap** between the gene sets that were returned as significant. - -```{r upsetplot, live = TRUE} -enrichplot::upsetplot(go_ora_results) -``` - -It's important to keep in mind that gene sets or pathways aren't independent, either! -This is particularly true of GO terms, which are hierarchical in nature. - -### CellMarker ORA - -Next, we'll use another resource, CellMarker, as additional support for our conclusions based on GO. - -#### About CellMarker - -[CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) is a manually curated database of cell marker data for human and mouse ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)). -Studies/evidence of the following nature were included in the manual curation process ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)): - -* scRNA-seq -* Flow cytometry -* Immunostaining - -And there are both gene and protein markers. - -Let's read in the CellMarker data we've already cleaned: - -```{r read_in_cm, live = TRUE} -cellmarker_df <- readr::read_tsv(cellmarker_file) -``` -#### Run `enricher()` - -`clusterProfiler::enricher()` is a more general way to perform ORA -- we specify the pathways or gene sets ourselves. -We can use the it to run ORA with the CellMarker data, which contains gene symbols. -That's why we needed to perform gene identifier conversion earlier! - -```{r cm_ora, live = TRUE} -cellmarker_ora_results <- enricher( - gene = cluster01_markers, # Genes of interest - pvalueCutoff = 0.05, # More permissive cut off - pAdjustMethod = "BH", # FDR - universe = background_set, # Background set - # The pathway information should be a data frame with a term name or - # identifier and the gene identifiers - TERM2GENE = cellmarker_df -) -``` - -#### Visualizing results - -Let's look at a bar plot of the significant results. - -```{r cm_barplot} -barplot(cellmarker_ora_results) + - # This is a ggplot, so we can use the following to label the x-axis! - ggplot2::labs(x = "count") -``` - -One thing that the [web version of CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) allows us to do is to get a better idea of _marker prevalence_. -Some of the genes that show up in our top 100 may not be very specific. -It's also likely that kidney is not very relevant to our analysis of Hodgkin's lymphoma, but lymph node is highly relevant. -However, blood cells in kidney are not _unexpected_ and, if we view the [CellMarker tissue type browser](http://bio-bigdata.hrbmu.edu.cn/CellMarker/browse.jsp), we can see that there are _a lot_ of genes annotated to kidney. -Every resource will have some limitations "baked in." - -We could do additional filtering of the gene sets to remove irrelevant tissues and to remove marker genes that are not particularly specific. -(Fewer gene sets would mean a lower multiple hypothesis testing burden.) -If some of the CellMarker marker genes come from immunostaining of terminally differentiated cells, we might not expect their transcripts to show up in scRNA-seq data, for example. -All things to keep in mind if you use this resource for your own work! - -### Write results to file - -Write the results to file for both GO and CellMarker gene sets. - -```{r write_results, live = TRUE} -readr::write_tsv(go_result_df, file = go_results_file) -data.frame(cellmarker_ora_results@result) |> - readr::write_tsv(file = cm_results_file) -``` - -## Parting thoughts - -The goal of analyzing marker genes with ORA, like we did in this notebook, is to help us interpret our clustering results or annotate clusters. -We shouldn't use marker genes or any downstream analysis we perform with them to _justify_ cluster assignments; we generally want our cluster assignments to be data-driven. - -Just like it's important to be skeptical about p-values returned from the marker gene analysis itself, we should not get too attached to the idea of statistical significance here. -If we were to tweak the cutoffs (e.g., take the top 2500 genes), we might expect to get completely different significant results. -This is a limitation of ORA: we have to make some choices that are somewhat arbitrary. -It's a good idea to think about the goal of your analysis _upfront_ and pick cutoffs based on your goals. -That can help you avoid the temptation of trying a bunch of cutoffs until you get results that "look good." - -## Session Info - -```{r session_info} -sessionInfo() -``` diff --git a/scRNA-seq-advanced/04-overrepresentation_analysis.nb.html b/scRNA-seq-advanced/04-overrepresentation_analysis.nb.html deleted file mode 100644 index d064c476..00000000 --- a/scRNA-seq-advanced/04-overrepresentation_analysis.nb.html +++ /dev/null @@ -1,3868 +0,0 @@ - - - - - - - - - - - - - - - -Pathway analysis: Over-representation analysis (ORA) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - -
-
-
-
-
- -
- - - - - - - - -
-

Objectives

-

This notebook will demonstrate how to:

-
    -
  • Perform gene identifier conversion with AnnotationDBI -annotation packages
  • -
  • Prepare gene sets for over-representation analysis, including an -appropriate background set
  • -
  • Perform over-representation analysis with the -clusterProfiler package
  • -
  • Introduce resources for ORA of scRNA-seq data, such as Gene -Ontology
  • -
-
-

In this notebook, we’ll cover a type of pathway or gene set analysis -called over-representation analysis (ORA). The idea behind ORA is -relatively straightforward: given a set of genes, do these genes overlap -with a pathway more than we expect by chance? The simplicity of only -requiring an input gene set (sort of, more on that below) can be -attractive.

-

ORA has some limitations, outlined nicely (and more extensively!) in -Khatri et -al. (2012). One of the main issues with ORA is that typically -all genes are treated as equal – the context of the magnitude of a -change we may be measuring is removed and each gene is treated as -independent, which can sometimes result in an incorrect estimate of -significance.

-

We will use the clusterProfiler -package (Yu et -al. 2012) to perform ORA. clusterProfiler has many -built-in functions that will run a specific type of analysis using a -specific source of pathways/gene sets automatically, but for our -purposes we’re going to keep things as general as possible. See the clusterProfiler -book for more information about the package’s full suite of -functionality.

-

Because different bioinformatics tools often require different types -of gene identifiers, we’ll also cover how to convert between gene -identifiers using AnnotationDbi -Bioconductor packages in this notebook. Check out the AnnotationDbi: -Introduction To Bioconductor Annotation Packages (Carlson 2020.) -vignette for more information.

-
-

Other resources

- -
-
-
-

Set up

-
-

Libraries

- - - -
# Package we'll use to run ORA
-library(clusterProfiler)
- - -
- - -
clusterProfiler v4.12.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
-
-If you use clusterProfiler in published research, please cite:
-T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
- - -

-Attaching package: 'clusterProfiler'
- - -
The following object is masked from 'package:stats':
-
-    filter
- - -
# Package that contains MSigDB gene sets in tidy format
-library(msigdbr)
-# Homo sapiens annotation package we'll use for gene identifier conversion
-library(org.Hs.eg.db)
- - -
Loading required package: AnnotationDbi
- - -
Loading required package: stats4
- - -
Loading required package: BiocGenerics
- - -

-Attaching package: 'BiocGenerics'
- - -
The following objects are masked from 'package:stats':
-
-    IQR, mad, sd, var, xtabs
- - -
The following objects are masked from 'package:base':
-
-    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
-    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
-    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
-    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
-    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
-    tapply, union, unique, unsplit, which.max, which.min
- - -
Loading required package: Biobase
- - -
Welcome to Bioconductor
-
-    Vignettes contain introductory material; view with
-    'browseVignettes()'. To cite Bioconductor, see
-    'citation("Biobase")', and for packages 'citation("pkgname")'.
- - -
Loading required package: IRanges
- - -
Loading required package: S4Vectors
- - -

-Attaching package: 'S4Vectors'
- - -
The following object is masked from 'package:clusterProfiler':
-
-    rename
- - -
The following object is masked from 'package:utils':
-
-    findMatches
- - -
The following objects are masked from 'package:base':
-
-    expand.grid, I, unname
- - -

-Attaching package: 'IRanges'
- - -
The following object is masked from 'package:clusterProfiler':
-
-    slice
- - -

-Attaching package: 'AnnotationDbi'
- - -
The following object is masked from 'package:clusterProfiler':
-
-    select
- - -
- - - -
-
-

Directories and files

-
-

Directories

- - - -
hodgkins_data_dir <- file.path("data", "hodgkins")
-# We'll use the output of the clustering notebook for ORA
-hodgkins_analysis_dir <- file.path("analysis", "hodgkins")
-
-# We'll create a directory to specifically hold the pathway results
-results_dir <- file.path(hodgkins_analysis_dir, "pathway-analysis")
-fs::dir_create(results_dir)
-
-# Create a directory for an updated marker list file
-fs::dir_create(file.path(hodgkins_analysis_dir, "markers"))
- - - -
-
-

Input files

-

We’re going to perform ORA on the marker genes from cluster 1 we -looked at in the previous notebook.

- - - -
input_file <- file.path(hodgkins_data_dir,
-                        "markers",
-                        "cluster01_markers.tsv")
- - - -

We’re going to use a file that is a cleaned version of the CellMarker (Zhang et al. -2018) human markers resources. You can see how we obtained and -cleaned this file in this -notebook.

- - - -
cellmarker_file <- file.path("gene-sets",
-                             "CellMarker_cleaned_human_markers.tsv")
- - - -
-
-

Output files

-

In the next notebook, we’ll use the cluster 1 results. Saving the -post-gene identifier conversion results in this notebook will prevent us -from duplicating effort (and code)!

- - - -
cluster1_output_file <- file.path(hodgkins_analysis_dir,
-                                  "markers",
-                                  "cluster01_markers_with_gene_symbols.tsv")
- - - -

We’ll save the table of ORA results (e.g., p-values).

- - - -
go_results_file <- file.path(results_dir, "cluster01_go_ora_results.tsv")
-cm_results_file <- file.path(results_dir,
-                             "cluster01_CellMarker_ora_results.tsv")
- - - -
-
-
-
-

Read in marker genes

-

Let’s read in our marker genes from cluster 1.

- - - -
markers_df <- readr::read_tsv(input_file)
- - -
Rows: 7264 Columns: 16
-── Column specification ────────────────────────────────────────────────────────
-Delimiter: "\t"
-chr  (1): gene
-dbl (15): p.value, FDR, summary.logFC, logFC.2, logFC.3, logFC.4, logFC.5, l...
-
-ℹ Use `spec()` to retrieve the full column specification for this data.
-ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- - - -

And let’s take a peek!

- - - -
head(markers_df)
- -
- -
- - -
-

Gene identifier conversion

-

Our data frame of marker genes contains Ensembl gene identifiers. We -will need to convert from these identifiers into gene symbols or Entrez -IDs for our use with cell marker data.

-

We’re going to convert our identifiers to gene symbols because they -are a bit more human readable, but you can, with the change of a single -argument, use the same code to convert to many other types of -identifiers!

-

The annotation package org.Hs.eg.db contains information -for different identifiers. org.Hs.eg.db is specific to -Homo sapiens – this is what the Hs in the package -name is referencing. To perform gene identifier conversion in mouse -(Mus musculus) we could use org.Mm.eg.db; we would -use org.Dr.eg.db for zebrafish (Danio rerio).

-

We can see what types of IDs are available to us in an annotation -package with keytypes().

- - - -
keytypes(org.Hs.eg.db)
- - -
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
- [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
-[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
-[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
-[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
-[26] "UNIPROT"     
- - -
ACCNUM
-ALIAS
-ENSEMBL
-ENSEMBLPROT
-ENSEMBLTRANS
-ENTREZID
-ENZYME
-EVIDENCE
-EVIDENCEALL
-GENENAME
-GENETYPE
-GO
-GOALL
-IPI
-MAP
-OMIM
-ONTOLOGY
-ONTOLOGYALL
-PATH
-PFAM
-PMID
-PROSITE
-REFSEQ
-SYMBOL
-UCSCKG
-UNIPROT
- - - -

Even though we’ll use this package to convert from Ensembl gene IDs -(ENSEMBL) to gene symbols (SYMBOL), we could -just as easily use it to convert from an Ensembl transcript ID -(ENSEMBLTRANS) to Entrez IDs (ENTREZID).

-

The function we will use to map from Ensembl gene IDs to gene symbols -is called mapIds().

- - - -
# This returns a named vector which we can convert to a data frame, where
-# the keys (Ensembl IDs) are the names
-symbols_vector <- mapIds(org.Hs.eg.db,  # Specify the annotation package
-                         # The vector of gene identifiers we want to
-                         # map
-                         keys = markers_df$gene,
-                         # What type of gene identifiers we're starting
-                         # with
-                         keytype = "ENSEMBL",
-                         # The type of gene identifier we want returned
-                         column = "SYMBOL",
-                         # In the case of 1:many mappings, return the
-                         # first one. This is default behavior!
-                         multiVals = "first")
- - -
'select()' returned 1:many mapping between keys and columns
- - -
# We would like a data frame we can join to the DGE results
-symbols_df <- data.frame(
-  ensembl_id = names(symbols_vector),
-  gene_symbol = symbols_vector
-) |>
-  # Drop genes that don't have a gene symbol!
-  tidyr::drop_na()
- - - -

This message is letting us know that sometimes Ensembl gene -identifiers will map to multiple gene symbols. In this case, it’s also -possible that a gene symbol will map to multiple Ensembl IDs.

-

Now we are ready to add the gene symbols to our data frame with the -marker genes. We can use a join function from the -dplyr package to do this, which will use the Ensembl gene -IDs in both data frames to determine how to join the rows.

- - - -
markers_df <- symbols_df |>
-  # An *inner* join will only return rows that are in both data frames
-  dplyr::inner_join(markers_df,
-                    # The name of the column that contains the Ensembl gene IDs
-                    # in the left data frame and right data frame, effectively
-                    # dropping genes that don't have gene symbols
-                    by = c("ensembl_id" = "gene"))
- - - -

We’re going to use this table again, with the gene symbols, in the -next notebook so let’s write it to a file.

- - - -
readr::write_tsv(markers_df, file = cluster1_output_file)
- - - -
-
-
-

Over-representation Analysis (ORA)

-

To test for over-representation, we can calculate a p-value with a -hypergeometric test (ref).

-

\(p = 1 - \displaystyle\sum_{i = -0}^{k-1}\frac{ {M \choose i}{ {N-M} \choose {n-i} } } { {N \choose n} -}\)

-

Where N is the number of genes in the background -distribution, M is the number of genes in a pathway or gene -set, n is the number of genes we are interested in (our -marker genes), and k is the number of genes that overlap -between the pathway or gene set and our marker genes.

-

Borrowing an example from clusterProfiler: -universal enrichment tool for functional and comparative study (Yu -):

-
-

Example: Suppose we have 17,980 genes detected in a -Microarray study and 57 genes were differentially expressed. Among the -differential expressed genes, 28 are annotated to a gene set.

-
-

We’ll call genes that are differentially expressed -gene_in_interest and genes that are in the gene set -in_gene_set.

- - - -
gene_table <- data.frame(
-  gene_not_interest = c(2613, 15310),
-  gene_in_interest = c(28, 29)
-)
-rownames(gene_table) <- c("in_gene_set", "not_in_gene_set")
-
-gene_table
- -
- -
- - -

We can assess if the 28 overlapping genes mean that the -differentially expressed genes are over-represented in the gene set with -the hypergeometric distribution. This corresponds to a one-sided -Fisher’s exact test.

- - - -
fisher.test(gene_table, alternative = "greater")
- - -

-    Fisher's Exact Test for Count Data
-
-data:  gene_table
-p-value = 1
-alternative hypothesis: true odds ratio is greater than 1
-95 percent confidence interval:
- 0.110242      Inf
-sample estimates:
-odds ratio 
- 0.1767937 
- - - -

When we test multiple pathways or gene sets, the -p-values then need to be adjusted for multiple -hypothesis testing.

-
-

Marker gene preparation

-
-

Top 100 cluster 1 genes

-

We’re interested in what pathways are over-represented in genes that -mark, or are more highly expressed in, cluster 1. We’ll select the top -100 genes for this comparison.

- - - -
cluster01_markers <- markers_df |>
-  # Positive fold changes
-  dplyr::filter(summary.logFC > 0) |>
-  # Take the "top 100" when ordering by FDR (smallest FDR values)
-  dplyr::slice_min(order_by = FDR, n = 100) |>
-  # Extract a vector of the gene symbols
-  dplyr::pull(gene_symbol)
- - - -

You may think that selecting the top 100 genes is pretty arbitrary -and you are not wrong! This also may seem strict, but let’s think about -our goal for this particular analysis: we hope that using ORA with -resources that encode some knowledge about cell types will help us -annotate cluster 1. We’re using the very top ranked genes in service of -this goal.

-

An alternative approach would be to have filtered to genes that are -below a specific FDR threshold, but that would also be an arbitrary -threshold that we chose. This is a weakness of over-representation -analysis; we generally need to pick a threshold for the genes that we -include in our tests.

-

As we mentioned in the clustering notebook, our statistical -calculations are circular here. We identified clusters based on gene -expression and then identified genes that are differentially expressed -between clusters. We can use the FDR values for ranking, like we are -here, but we shouldn’t use them to define “significance.” You can read a -little more about this in the Invalidity -of p-values section of the OSCA book.

-
-
-

Background set

-

As we saw above, calculating the p-value for ORA relies on the number -of genes in the background distribution. Sometimes folks consider genes -from the entire genome to comprise the background, but in the example -borrowed from the clusterProfiler authors, they state:

-
-

17,980 genes detected in a Microarray study

-
-

Where the key phrase is genes detected.

-

If we were unable to include a gene in our marker gene comparisons -because, for example, we could not reliably measure it, we shouldn’t -include in our background set!

-

Our markers data frame has all of the genes that we did not filter -out and have gene symbols corresponding to the Ensembl gene IDs, so we -can use the gene_symbol column as our background set.

- - - -
background_set <- markers_df$gene_symbol
- - - -
-
-
-

Gene Ontology ORA

-
-

A note on GO

-

The Gene Ontology (GO) (Ashburner et al. -2000; The Gene -Ontology Consortium. 2021)is an -ontology that documents or describes how genes function in the -biological domain and is comprised of 3 parts (GO ontology -documentation):

-
    -
  • Molecular function (MF): activities that occur on -the molecular level, often can be performed by individual genes
  • -
  • Biological process (BP): programs carried out by -multiple molecular entities; similar to what is sometimes thought of as -a pathway, but with no notion of dynamics
  • -
  • Cellular component (CC): cellular structures and -compartments
  • -
-

GO is “loosely hierarchical” (GO ontology -documentation). There are parent and child terms, where the “higher -up” a term is, the more general it is. This is somewhat abstract, so -let’s take a look at a GO term that’s likely to be relevant to our -Hodgkin’s lymphoma data to get an idea of what this is like in practice: -leukocyte activation.

-
-
-

Run enrichGO()

-

Now that we have our background set and our genes of interest, we’re -ready to run ORA using the enrichGO() function.

- - - -
go_ora_results <- enrichGO(gene = cluster01_markers,  # Top 100 genes
-                           universe = background_set,  # Background genes
-                           keyType = "SYMBOL",  # Our genes are gene symbols
-                           OrgDb = org.Hs.eg.db,  # Supports conversion, if needed
-                           ont = "BP",  # Biological process
-                           pAdjustMethod = "BH",  # FDR
-                           pvalueCutoff = 0.00001) # Very stringent for viz
- - - -

What is returned by enrichGO()?

- - - -
View(go_ora_results)
- - - -

The information we’re most likely interested in is in the -result slot. Let’s convert this into a data frame that we -can write to file.

- - - -
go_result_df <- data.frame(go_ora_results@result)
- - - -

Let’s take a look at the GO sets with an adjusted p-value less than -0.00001.

- - - -
go_result_df |>
-  dplyr::filter(p.adjust < 0.00001)
- -
- -
- - -

Some of these columns have names that are somewhat opaque and are -used for visualization, so let’s briefly talk about them here!

- ----- - - - - - - - - - - - - - - - - - - - -
Column nameNumeratorDenominator
GeneRatioNumber of genes in the top 100 marker genes and the specific gene -set (also in Count column)Number of genes in the top 100 marker genes and any gene -set
BgRatioNumber of genes in the background set and the specific gene setNumber of genes in the background set and any gene set
-
-
-

Visualizing results

-

We can use a dot plot to visualize our significant enrichment -results.

- - - -
enrichplot::dotplot(go_ora_results)
- - -

- - - -

What cell type do you think the cells from cluster 1 -are?

-

We can use an UpSet -plot to visualize the overlap between the gene sets -that were returned as significant.

- - - -
enrichplot::upsetplot(go_ora_results)
- - -

- - - -

It’s important to keep in mind that gene sets or pathways aren’t -independent, either! This is particularly true of GO terms, which are -hierarchical in nature.

-
-
-
-

CellMarker ORA

-

Next, we’ll use another resource, CellMarker, as additional support -for our conclusions based on GO.

-
-

About CellMarker

-

CellMarker -is a manually curated database of cell marker data for human and mouse -(Zhang et al. -2018). Studies/evidence of the following nature were included in the -manual curation process (Zhang et al. -2018):

-
    -
  • scRNA-seq
  • -
  • Flow cytometry
  • -
  • Immunostaining
  • -
-

And there are both gene and protein markers.

-

Let’s read in the CellMarker data we’ve already cleaned:

- - - -
cellmarker_df <- readr::read_tsv(cellmarker_file)
- - -
Rows: 30208 Columns: 2
-── Column specification ────────────────────────────────────────────────────────
-Delimiter: "\t"
-chr (2): cell_marker, gene_symbol
-
-ℹ Use `spec()` to retrieve the full column specification for this data.
-ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- - - -
-
-

Run enricher()

-

clusterProfiler::enricher() is a more general way to -perform ORA – we specify the pathways or gene sets ourselves. We can use -the it to run ORA with the CellMarker data, which contains gene symbols. -That’s why we needed to perform gene identifier conversion earlier!

- - - -
cellmarker_ora_results <- enricher(
-  gene = cluster01_markers,  # Genes of interest
-  pvalueCutoff = 0.05,  # More permissive cut off
-  pAdjustMethod = "BH",  # FDR
-  universe = background_set,  # Background set
-  # The pathway information should be a data frame with a term name or
-  # identifier and the gene identifiers
-  TERM2GENE = cellmarker_df
-)
- - - -
-
-

Visualizing results

-

Let’s look at a bar plot of the significant results.

- - - -
barplot(cellmarker_ora_results) +
-  # This is a ggplot, so we can use the following to label the x-axis!
-  ggplot2::labs(x = "count")
- - -

- - - -

One thing that the web version of -CellMarker allows us to do is to get a better idea of marker -prevalence. Some of the genes that show up in our top 100 may not -be very specific. It’s also likely that kidney is not very relevant to -our analysis of Hodgkin’s lymphoma, but lymph node is highly relevant. -However, blood cells in kidney are not unexpected and, if we -view the CellMarker -tissue type browser, we can see that there are a lot of -genes annotated to kidney. Every resource will have some limitations -“baked in.”

-

We could do additional filtering of the gene sets to remove -irrelevant tissues and to remove marker genes that are not particularly -specific. (Fewer gene sets would mean a lower multiple hypothesis -testing burden.) If some of the CellMarker marker genes come from -immunostaining of terminally differentiated cells, we might not expect -their transcripts to show up in scRNA-seq data, for example. All things -to keep in mind if you use this resource for your own work!

-
-
-
-

Write results to file

-

Write the results to file for both GO and CellMarker gene sets.

- - - -
readr::write_tsv(go_result_df, file = go_results_file)
-data.frame(cellmarker_ora_results@result) |>
-  readr::write_tsv(file = cm_results_file)
- - - -
-
-
-

Parting thoughts

-

The goal of analyzing marker genes with ORA, like we did in this -notebook, is to help us interpret our clustering results or annotate -clusters. We shouldn’t use marker genes or any downstream analysis we -perform with them to justify cluster assignments; we generally -want our cluster assignments to be data-driven.

-

Just like it’s important to be skeptical about p-values returned from -the marker gene analysis itself, we should not get too attached to the -idea of statistical significance here. If we were to tweak the cutoffs -(e.g., take the top 2500 genes), we might expect to get completely -different significant results. This is a limitation of ORA: we have to -make some choices that are somewhat arbitrary. It’s a good idea to think -about the goal of your analysis upfront and pick cutoffs based -on your goals. That can help you avoid the temptation of trying a bunch -of cutoffs until you get results that “look good.”

-
-
-

Session Info

- - - -
sessionInfo()
- - -
R version 4.4.1 (2024-06-14)
-Platform: x86_64-pc-linux-gnu
-Running under: Ubuntu 22.04.4 LTS
-
-Matrix products: default
-BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
-LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
-
-locale:
- [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
- [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
- [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
- [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
- [9] LC_ADDRESS=C               LC_TELEPHONE=C            
-[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
-
-time zone: Etc/UTC
-tzcode source: system (glibc)
-
-attached base packages:
-[1] stats4    stats     graphics  grDevices utils     datasets  methods  
-[8] base     
-
-other attached packages:
-[1] org.Hs.eg.db_3.19.1    AnnotationDbi_1.66.0   IRanges_2.38.0        
-[4] S4Vectors_0.42.0       Biobase_2.64.0         BiocGenerics_0.50.0   
-[7] msigdbr_7.5.1          clusterProfiler_4.12.0 optparse_1.7.5        
-
-loaded via a namespace (and not attached):
-  [1] DBI_1.2.2               gson_0.1.0              shadowtext_0.1.3       
-  [4] gridExtra_2.3           rlang_1.1.3             magrittr_2.0.3         
-  [7] DOSE_3.30.0             compiler_4.4.1          RSQLite_2.3.6          
- [10] png_0.1-8               vctrs_0.6.5             reshape2_1.4.4         
- [13] stringr_1.5.1           pkgconfig_2.0.3         crayon_1.5.2           
- [16] fastmap_1.1.1           XVector_0.44.0          labeling_0.4.3         
- [19] ggraph_2.2.1            utf8_1.2.4              HDO.db_0.99.1          
- [22] rmarkdown_2.26          tzdb_0.4.0              enrichplot_1.24.0      
- [25] UCSC.utils_1.0.0        purrr_1.0.2             bit_4.0.5              
- [28] xfun_0.43               zlibbioc_1.50.0         cachem_1.0.8           
- [31] aplot_0.2.2             GenomeInfoDb_1.40.0     jsonlite_1.8.8         
- [34] blob_1.2.4              highr_0.10              BiocParallel_1.38.0    
- [37] tweenr_2.0.3            parallel_4.4.1          R6_2.5.1               
- [40] bslib_0.7.0             stringi_1.8.3           RColorBrewer_1.1-3     
- [43] jquerylib_0.1.4         GOSemSim_2.30.0         Rcpp_1.0.12            
- [46] knitr_1.46              readr_2.1.5             Matrix_1.7-0           
- [49] splines_4.4.1           igraph_2.0.3            tidyselect_1.2.1       
- [52] qvalue_2.36.0           yaml_2.3.8              viridis_0.6.5          
- [55] codetools_0.2-20        lattice_0.22-6          tibble_3.2.1           
- [58] plyr_1.8.9              treeio_1.28.0           withr_3.0.0            
- [61] KEGGREST_1.44.0         evaluate_0.23           gridGraphics_0.5-1     
- [64] scatterpie_0.2.2        getopt_1.20.4           polyclip_1.10-6        
- [67] ggupset_0.3.0           Biostrings_2.72.0       ggtree_3.12.0          
- [70] pillar_1.9.0            ggfun_0.1.4             generics_0.1.3         
- [73] vroom_1.6.5             hms_1.1.3               ggplot2_3.5.1          
- [76] tidytree_0.4.6          munsell_0.5.1           scales_1.3.0           
- [79] glue_1.7.0              lazyeval_0.2.2          tools_4.4.1            
- [82] data.table_1.15.4       fgsea_1.30.0            babelgene_22.9         
- [85] fs_1.6.4                graphlayouts_1.1.1      fastmatch_1.1-4        
- [88] tidygraph_1.3.1         cowplot_1.1.3           grid_4.4.1             
- [91] ape_5.8                 tidyr_1.3.1             colorspace_2.1-0       
- [94] nlme_3.1-164            patchwork_1.2.0         GenomeInfoDbData_1.2.12
- [97] ggforce_0.4.2           cli_3.6.2               fansi_1.0.6            
-[100] viridisLite_0.4.2      
- [ reached getOption("max.print") -- omitted 15 entries ]
- - -
- -
---
title: "Pathway analysis: Over-representation analysis (ORA)"
output:
  html_notebook:
    toc: true
    toc_float: true
author: CCDL for ALSF
date: 2020
---

## Objectives

This notebook will demonstrate how to:

- Perform gene identifier conversion with [`AnnotationDBI` annotation packages](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf)
- Prepare gene sets for over-representation analysis, including an appropriate background set
- Perform over-representation analysis with the `clusterProfiler` package
- Introduce resources for ORA of scRNA-seq data, such as Gene Ontology

---

In this notebook, we'll cover a type of pathway or gene set analysis called over-representation analysis (ORA).
The idea behind ORA is relatively straightforward: given a set of genes, do these genes overlap with a pathway more than we expect by chance?
The simplicity of only requiring an input gene set (sort of, more on that below) can be attractive.

ORA has some limitations, outlined nicely (and more extensively!) in [Khatri _et al._ (2012)]( https://doi.org/10.1371/journal.pcbi.1002375).
One of the main issues with ORA is that typically all genes are treated as equal -- the context of the magnitude of a change we may be measuring is removed and each gene is treated as independent, which can sometimes result in an incorrect estimate of significance.

We will use the [`clusterProfiler` package](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) ([Yu *et al.* 2012](https://doi.org/10.1089/omi.2011.0118.)) to perform ORA.
`clusterProfiler` has many built-in functions that will run a specific type of analysis using a specific source of pathways/gene sets automatically, but for our purposes we're going to keep things as general as possible.
See the [`clusterProfiler` book](https://yulab-smu.github.io/clusterProfiler-book/index.html) for more information about the package's full suite of functionality.

Because different bioinformatics tools often require different types of gene identifiers, we'll also cover how to convert between gene identifiers using [`AnnotationDbi`](https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html) Bioconductor packages in this notebook.
Check out the [_AnnotationDbi: Introduction To Bioconductor Annotation Packages_ (Carlson 2020.) vignette](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) for more information.

#### Other resources

* For another example using `clusterProfiler`, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html)
* [`WebGestaltR`](https://cran.r-project.org/web/packages/WebGestaltR/) is another R package that can be used for ORA.
* OSCA has [a section on using Gene Ontology sets to annotate clusters](https://bioconductor.org/books/3.14/OSCA.basic/cell-type-annotation.html#assigning-cluster-labels-from-markers) using a different Bioconductor package.

## Set up

### Libraries

```{r libraries}
# Package we'll use to run ORA
library(clusterProfiler)
# Package that contains MSigDB gene sets in tidy format
library(msigdbr)
# Homo sapiens annotation package we'll use for gene identifier conversion
library(org.Hs.eg.db)
```

### Directories and files

#### Directories

```{r create_ora_directory, live = TRUE}
hodgkins_data_dir <- file.path("data", "hodgkins")
# We'll use the output of the clustering notebook for ORA
hodgkins_analysis_dir <- file.path("analysis", "hodgkins")

# We'll create a directory to specifically hold the pathway results
results_dir <- file.path(hodgkins_analysis_dir, "pathway-analysis")
fs::dir_create(results_dir)

# Create a directory for an updated marker list file
fs::dir_create(file.path(hodgkins_analysis_dir, "markers"))
```

#### Input files

We're going to perform ORA on the marker genes from cluster 1 we looked at in the previous notebook.

```{r cluster_input}
input_file <- file.path(hodgkins_data_dir,
                        "markers",
                        "cluster01_markers.tsv")
```

We're going to use a file that is a cleaned version of the [CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)) human markers resources.
You can see how we obtained and cleaned this file in [this notebook](https://github.com/AlexsLemonade/training-modules/blob/master/scRNA-seq/setup/cellmarker_genes.Rmd).

```{r cm_input}
cellmarker_file <- file.path("gene-sets",
                             "CellMarker_cleaned_human_markers.tsv")
```

#### Output files

In the next notebook, we'll use the cluster 1 results.
Saving the post-gene identifier conversion results in this notebook will prevent us from duplicating effort (and code)!

```{r gene_id_file}
cluster1_output_file <- file.path(hodgkins_analysis_dir,
                                  "markers",
                                  "cluster01_markers_with_gene_symbols.tsv")
```

We'll save the table of ORA results (e.g., p-values).

```{r output_files, live = TRUE}
go_results_file <- file.path(results_dir, "cluster01_go_ora_results.tsv")
cm_results_file <- file.path(results_dir,
                             "cluster01_CellMarker_ora_results.tsv")
```

## Read in marker genes

Let's read in our marker genes from cluster 1.

```{r read_in_markers, live = TRUE}
markers_df <- readr::read_tsv(input_file)
```

And let's take a peek!

```{r marker_head, live = TRUE}
head(markers_df)
```

### Gene identifier conversion

Our data frame of marker genes contains Ensembl gene identifiers.
We will need to convert from these identifiers into gene symbols or Entrez IDs for our use with cell marker data.

We're going to convert our identifiers to gene symbols because they are a bit more human readable, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

The annotation package `org.Hs.eg.db` contains information for different identifiers.
`org.Hs.eg.db` is specific to _Homo sapiens_ -- this is what the `Hs` in the package name is referencing.
To perform gene identifier conversion in mouse (_Mus musculus_) we could use `org.Mm.eg.db`;
we would use `org.Dr.eg.db` for zebrafish (_Danio rerio_).

We can see what types of IDs are available to us in an annotation package with `keytypes()`.

```{r keytypes, live = TRUE}
keytypes(org.Hs.eg.db)
```

Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to gene symbols (`SYMBOL`), we could just as easily use it to convert from an Ensembl transcript ID (`ENSEMBLTRANS`) to Entrez IDs (`ENTREZID`).

The function we will use to map from Ensembl gene IDs to gene symbols is called `mapIds()`.

```{r map_to_symbol}
# This returns a named vector which we can convert to a data frame, where
# the keys (Ensembl IDs) are the names
symbols_vector <- mapIds(org.Hs.eg.db,  # Specify the annotation package
                         # The vector of gene identifiers we want to
                         # map
                         keys = markers_df$gene,
                         # What type of gene identifiers we're starting
                         # with
                         keytype = "ENSEMBL",
                         # The type of gene identifier we want returned
                         column = "SYMBOL",
                         # In the case of 1:many mappings, return the
                         # first one. This is default behavior!
                         multiVals = "first")

# We would like a data frame we can join to the DGE results
symbols_df <- data.frame(
  ensembl_id = names(symbols_vector),
  gene_symbol = symbols_vector
) |>
  # Drop genes that don't have a gene symbol!
  tidyr::drop_na()
```

This message is letting us know that sometimes Ensembl gene identifiers will map to multiple gene symbols.
In this case, it's also possible that a gene symbol will map to multiple Ensembl IDs.

Now we are ready to add the gene symbols to our data frame with the marker genes.
We can use a _join_ function from the `dplyr` package to do this, which will use the Ensembl gene IDs in both data frames to determine how to join the rows.

```{r add_symbols, live = TRUE}
markers_df <- symbols_df |>
  # An *inner* join will only return rows that are in both data frames
  dplyr::inner_join(markers_df,
                    # The name of the column that contains the Ensembl gene IDs
                    # in the left data frame and right data frame, effectively
                    # dropping genes that don't have gene symbols
                    by = c("ensembl_id" = "gene"))
```

We're going to use this table again, with the gene symbols, in the next notebook so let's write it to a file.

```{r save_gene_symbols, live = TRUE}
readr::write_tsv(markers_df, file = cluster1_output_file)
```

## Over-representation Analysis (ORA)

To test for over-representation, we can calculate a p-value with a hypergeometric test ([ref](https://yulab-smu.github.io/clusterProfiler-book/chapter2.html#over-representation-analysis)).

\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{ {M \choose i}{ {N-M} \choose {n-i} } } { {N \choose n} }\)

Where `N` is the number of genes in the background distribution, `M` is the number of genes in a pathway or gene set, `n` is the number of genes we are interested in (our marker genes), and `k` is the number of genes that overlap between the pathway or gene set and our marker genes.

Borrowing an example from [_clusterProfiler: universal enrichment tool for functional and comparative study_ (Yu )](http://yulab-smu.top/clusterProfiler-book/chapter2.html#over-representation-analysis):

> **Example**: Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differential expressed genes, 28 are annotated to a gene set.

We'll call genes that are differentially expressed `gene_in_interest` and genes that are in the gene set `in_gene_set`.

```{r gene_table}
gene_table <- data.frame(
  gene_not_interest = c(2613, 15310),
  gene_in_interest = c(28, 29)
)
rownames(gene_table) <- c("in_gene_set", "not_in_gene_set")

gene_table
```

We can assess if the 28 overlapping genes mean that the differentially expressed genes are over-represented in the gene set with the hypergeometric distribution.
This corresponds to a one-sided Fisher's exact test.

```{r fisher_test}
fisher.test(gene_table, alternative = "greater")
```

When we test **multiple pathways or gene sets**, the p-values then need to be **adjusted** for multiple hypothesis testing.

### Marker gene preparation

#### Top 100 cluster 1 genes

We're interested in what pathways are over-represented in genes that mark, or are more highly expressed in, cluster 1.
We'll select the top 100 genes for this comparison.

```{r cluster01_genes}
cluster01_markers <- markers_df |>
  # Positive fold changes
  dplyr::filter(summary.logFC > 0) |>
  # Take the "top 100" when ordering by FDR (smallest FDR values)
  dplyr::slice_min(order_by = FDR, n = 100) |>
  # Extract a vector of the gene symbols
  dplyr::pull(gene_symbol)
```

You may think that selecting the top 100 genes is pretty arbitrary and you are not wrong!
This also may seem strict, but let's think about our goal for this particular analysis: we hope that using ORA with resources that encode some knowledge about cell types will help us annotate cluster 1.
We're using the very top ranked genes in service of this goal.

An alternative approach would be to have filtered to genes that are below a specific FDR threshold, but that would also be an arbitrary threshold that we chose.
This is a weakness of over-representation analysis; we generally need to pick a threshold for the genes that we include in our tests.

As we mentioned in the clustering notebook, our statistical calculations are circular here.
We identified clusters based on gene expression and then identified genes that are differentially expressed between clusters.
We can use the FDR values for ranking, like we are here, but we shouldn't use them to define "significance."
You can read a little more about this in the [_Invalidity of p-values_ section of the OSCA book](https://bioconductor.org/books/3.14/OSCA.advanced/marker-detection-redux.html#p-value-invalidity).

#### Background set

As we saw above, calculating the p-value for ORA relies on the number of genes in the background distribution.
Sometimes folks consider genes from the entire genome to comprise the background, but in the example borrowed from the `clusterProfiler` authors, they state:

> 17,980 genes detected in a Microarray study

Where the key phrase is **genes detected**.

If we were unable to include a gene in our marker gene comparisons because, for example, we could not reliably measure it, we shouldn't include in our background set!

Our markers data frame has all of the genes that we did not filter out and have gene symbols corresponding to the Ensembl gene IDs, so we can use the `gene_symbol` column as our background set.

```{r get_background_set, live = TRUE}
background_set <- markers_df$gene_symbol
```

### Gene Ontology ORA

#### A note on GO

The Gene Ontology (GO) ([Ashburner *et al.* 2000](https://dx.doi.org/10.1038/75556); [The Gene Ontology Consortium. 2021](https://doi.org/10.1093/nar/gkaa1113))is [an ontology](https://en.wikipedia.org/wiki/Ontology_(information_science)) that documents or describes how genes function in the biological domain and is comprised of 3 parts ([GO ontology documentation](http://geneontology.org/docs/ontology-documentation/)):

* **Molecular function (MF):** activities that occur on the molecular level, often can be performed by individual genes
* **Biological process (BP):** programs carried out by multiple molecular entities; similar to what is sometimes thought of as a pathway, but with no notion of _dynamics_
* **Cellular component (CC):** cellular structures and compartments

GO is "loosely hierarchical" ([GO ontology documentation](http://geneontology.org/docs/ontology-documentation/)).
There are parent and child terms, where the "higher up" a term is, the more general it is.
This is somewhat abstract, so let's take a look at a GO term that's likely to be relevant to our Hodgkin's lymphoma data to get an idea of what this is like in practice: [`leukocyte activation`](http://www.informatics.jax.org/vocab/gene_ontology/GO:0045321).

#### Run `enrichGO()`

Now that we have our background set and our genes of interest, we're ready to run ORA using the `enrichGO()` function.

```{r go_ora}
go_ora_results <- enrichGO(gene = cluster01_markers,  # Top 100 genes
                           universe = background_set,  # Background genes
                           keyType = "SYMBOL",  # Our genes are gene symbols
                           OrgDb = org.Hs.eg.db,  # Supports conversion, if needed
                           ont = "BP",  # Biological process
                           pAdjustMethod = "BH",  # FDR
                           pvalueCutoff = 0.00001) # Very stringent for viz
```

What is returned by `enrichGO()`?

```{r view_go_ora, eval = FALSE}
View(go_ora_results)
```

The information we're most likely interested in is in the `result` slot.
Let's convert this into a data frame that we can write to file.

```{r go_df}
go_result_df <- data.frame(go_ora_results@result)
```

Let's take a look at the GO sets with an adjusted p-value less than 0.00001.

```{r filter_padj, live = TRUE}
go_result_df |>
  dplyr::filter(p.adjust < 0.00001)
```

Some of these columns have names that are somewhat opaque and are used for visualization, so let's briefly talk about them here!

| Column name | Numerator | Denominator |
|-------------|-----------|-------------|
| `GeneRatio` | Number of genes in the top 100 marker genes and the specific gene set (also in `Count` column) | Number of genes in the top 100 marker genes and _any_ gene set |
| `BgRatio`   | Number of genes in the background set and the specific gene set | Number of genes in the background set and _any_ gene set |

#### Visualizing results

We can use a dot plot to visualize our significant enrichment results.

```{r dotplot, live = TRUE}
enrichplot::dotplot(go_ora_results)
```

**What cell type do you think the cells from cluster 1 are?**

We can use an [UpSet plot](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720993/) to visualize the **overlap** between the gene sets that were returned as significant.

```{r upsetplot, live = TRUE}
enrichplot::upsetplot(go_ora_results)
```

It's important to keep in mind that gene sets or pathways aren't independent, either!
This is particularly true of GO terms, which are hierarchical in nature.

### CellMarker ORA

Next, we'll use another resource, CellMarker, as additional support for our conclusions based on GO.

#### About CellMarker

[CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) is a manually curated database of cell marker data for human and mouse ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)).
Studies/evidence of the following nature were included in the manual curation process ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)):

* scRNA-seq
* Flow cytometry
* Immunostaining

And there are both gene and protein markers.

Let's read in the CellMarker data we've already cleaned:

```{r read_in_cm, live = TRUE}
cellmarker_df <- readr::read_tsv(cellmarker_file)
```
#### Run `enricher()`

`clusterProfiler::enricher()` is a more general way to perform ORA -- we specify the pathways or gene sets ourselves.
We can use the it to run ORA with the CellMarker data, which contains gene symbols.
That's why we needed to perform gene identifier conversion earlier!

```{r cm_ora, live = TRUE}
cellmarker_ora_results <- enricher(
  gene = cluster01_markers,  # Genes of interest
  pvalueCutoff = 0.05,  # More permissive cut off
  pAdjustMethod = "BH",  # FDR
  universe = background_set,  # Background set
  # The pathway information should be a data frame with a term name or
  # identifier and the gene identifiers
  TERM2GENE = cellmarker_df
)
```

#### Visualizing results

Let's look at a bar plot of the significant results.

```{r cm_barplot}
barplot(cellmarker_ora_results) +
  # This is a ggplot, so we can use the following to label the x-axis!
  ggplot2::labs(x = "count")
```

One thing that the [web version of CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) allows us to do is to get a better idea of _marker prevalence_.
Some of the genes that show up in our top 100 may not be very specific.
It's also likely that kidney is not very relevant to our analysis of Hodgkin's lymphoma, but lymph node is highly relevant.
However, blood cells in kidney are not _unexpected_ and, if we view the [CellMarker tissue type browser](http://bio-bigdata.hrbmu.edu.cn/CellMarker/browse.jsp), we can see that there are _a lot_ of genes annotated to kidney.
Every resource will have some limitations "baked in."

We could do additional filtering of the gene sets to remove irrelevant tissues and to remove marker genes that are not particularly specific.
(Fewer gene sets would mean a lower multiple hypothesis testing burden.)
If some of the CellMarker marker genes come from immunostaining of terminally differentiated cells, we might not expect their transcripts to show up in scRNA-seq data, for example.
All things to keep in mind if you use this resource for your own work!

### Write results to file

Write the results to file for both GO and CellMarker gene sets.

```{r write_results, live = TRUE}
readr::write_tsv(go_result_df, file = go_results_file)
data.frame(cellmarker_ora_results@result) |>
  readr::write_tsv(file = cm_results_file)
```

## Parting thoughts

The goal of analyzing marker genes with ORA, like we did in this notebook, is to help us interpret our clustering results or annotate clusters.
We shouldn't use marker genes or any downstream analysis we perform with them to _justify_ cluster assignments; we generally want our cluster assignments to be data-driven.

Just like it's important to be skeptical about p-values returned from the marker gene analysis itself, we should not get too attached to the idea of statistical significance here.
If we were to tweak the cutoffs (e.g., take the top 2500 genes), we might expect to get completely different significant results.
This is a limitation of ORA: we have to make some choices that are somewhat arbitrary.
It's a good idea to think about the goal of your analysis _upfront_ and pick cutoffs based on your goals.
That can help you avoid the temptation of trying a bunch of cutoffs until you get results that "look good."

## Session Info

```{r session_info}
sessionInfo()
```

- - -
-
- -
- - - - - - - - - - - - - - - - - From 086d1aefdcc6787b81fbd9fcd1f996775682a99f Mon Sep 17 00:00:00 2001 From: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com> Date: Mon, 25 Nov 2024 15:17:17 +0000 Subject: [PATCH 4/7] geneset -> gene sets and rerun --- .../04-gene_set_enrichment_analysis.Rmd | 2 +- .../04-gene_set_enrichment_analysis.nb.html | 100 +++++++++++------- 2 files changed, 64 insertions(+), 38 deletions(-) diff --git a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd index 0f300979..0ae64adb 100644 --- a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd +++ b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd @@ -149,7 +149,7 @@ head(deseq_df) This data frame uses Ensembl gene identifiers. We'll need to make sure our gene sets use the same identifiers. -Let's take a look at the first few rows of the data frame that contains the hallmark genesets. +Let's take a look at the first few rows of the data frame that contains the hallmark gene sets. ```{r hallmark_head, live = TRUE} head(hs_hallmarks_df) diff --git a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html index 618e4c60..a825edd6 100644 --- a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html +++ b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html @@ -2949,7 +2949,7 @@

Libraries

library(clusterProfiler)
- +
clusterProfiler v4.12.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
 
 If you use clusterProfiler in published research, please cite:
@@ -2957,6 +2957,18 @@ 

Libraries

Attaching package: ‘clusterProfiler’ +The following object is masked from ‘package:AnnotationDbi’: + + select + +The following object is masked from ‘package:IRanges’: + + slice + +The following object is masked from ‘package:S4Vectors’: + + rename + The following object is masked from ‘package:stats’: filter
@@ -3102,12 +3114,12 @@

DESeq2 results

deseq_df <- readr::read_tsv(input_file)
- +
Rows: 60319 Columns: 27
-── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Delimiter: "\t"
 chr  (2): ensembl_id, gene_symbol
-dbl (25): baseMean, log2FoldChange, lfcSE, pvalue, padj, SCPCL000478.mean, SCPCL000478.detected, SCPCL000479.mean, SCPCL000479.detected, SCPCL000480.mean, SCPCL000480.detected, SCPCL000481.m...
+dbl (25): baseMean, log2FoldChange, lfcSE, pvalue, padj, SCPCL000478.mean, SCPCL000478.detected, SCPCL000479.mean, SCPCL000479.det...
 
 ℹ Use `spec()` to retrieve the full column specification for this data.
 ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
@@ -3130,8 +3142,8 @@

DESeq2 results

This data frame uses Ensembl gene identifiers. We’ll need to make sure our gene sets use the same identifiers. Let’s take a look at the -first few rows of the data frame that contains the hallmark -genesets.

+first few rows of the data frame that contains the hallmark gene +sets.

@@ -3351,7 +3363,7 @@

Session Info

sessionInfo()
- +
R version 4.4.0 (2024-04-24)
 Platform: x86_64-pc-linux-gnu
 Running under: Ubuntu 22.04.4 LTS
@@ -3361,49 +3373,63 @@ 

Session Info

LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0 locale: - [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C - [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C + [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 + [6] LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C +[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C time zone: Etc/UTC tzcode source: system (glibc) attached base packages: -[1] stats graphics grDevices datasets utils methods base +[1] stats4 stats graphics grDevices utils datasets methods base other attached packages: -[1] msigdbr_7.5.1 clusterProfiler_4.12.0 + [1] clusterProfiler_4.12.0 msigdbr_7.5.1 GSEABase_1.66.0 graph_1.82.0 + [5] annotate_1.82.0 XML_3.99-0.16.1 AnnotationDbi_1.66.0 AUCell_1.26.0 + [9] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0 Biobase_2.64.0 GenomicRanges_1.56.0 +[13] GenomeInfoDb_1.40.0 IRanges_2.38.0 S4Vectors_0.42.0 BiocGenerics_0.50.0 +[17] MatrixGenerics_1.16.0 matrixStats_1.3.0 loaded via a namespace (and not attached): - [1] RColorBrewer_1.1-3 rstudioapi_0.16.0 jsonlite_1.8.8 magrittr_2.0.3 estimability_1.5 ggbeeswarm_0.7.2 - [7] farver_2.1.1 rmarkdown_2.26 fs_1.6.4 zlibbioc_1.50.0 vctrs_0.6.5 memoise_2.0.1 - [13] DelayedMatrixStats_1.26.0 ggtree_3.12.0 htmltools_0.5.8.1 S4Arrays_1.4.0 BiocNeighbors_1.22.0 SparseArray_1.4.0 - [19] gridGraphics_0.5-1 plyr_1.8.9 emmeans_1.10.1 cachem_1.0.8 igraph_2.0.3 lifecycle_1.0.4 - [25] pkgconfig_2.0.3 gson_0.1.0 rsvd_1.0.5 Matrix_1.7-0 R6_2.5.1 fastmap_1.1.1 - [31] GenomeInfoDbData_1.2.12 MatrixGenerics_1.16.0 digest_0.6.35 aplot_0.2.2 enrichplot_1.24.0 colorspace_2.1-0 - [37] patchwork_1.2.0 AnnotationDbi_1.66.0 S4Vectors_0.42.0 scater_1.32.0 qusage_2.38.0 irlba_2.3.5.1 - [43] GenomicRanges_1.56.0 RSQLite_2.3.6 beachmat_2.20.0 labeling_0.4.3 fansi_1.0.6 httr_1.4.7 - [49] polyclip_1.10-6 abind_1.4-5 compiler_4.4.0 bit64_4.0.5 withr_3.0.0 BiocParallel_1.38.0 - [55] viridis_0.6.5 DBI_1.2.2 ggforce_0.4.2 MASS_7.3-60.2 DelayedArray_0.30.0 HDO.db_0.99.1 - [61] tools_4.4.0 vipor_0.4.7 beeswarm_0.4.0 ape_5.8 scatterpie_0.2.2 fftw_1.0-8 - [67] glue_1.7.0 nlme_3.1-164 GOSemSim_2.30.0 grid_4.4.0 shadowtext_0.1.3 reshape2_1.4.4 - [73] fgsea_1.30.0 generics_0.1.3 gtable_0.3.5 tzdb_0.4.0 tidyr_1.3.1 data.table_1.15.4 - [79] hms_1.1.3 ScaledMatrix_1.12.0 BiocSingular_1.20.0 tidygraph_1.3.1 utf8_1.2.4 XVector_0.44.0 - [85] BiocGenerics_0.50.0 ggrepel_0.9.5 pillar_1.9.0 stringr_1.5.1 vroom_1.6.5 babelgene_22.9 - [91] limma_3.60.0 yulab.utils_0.1.4 splines_4.4.0 dplyr_1.1.4 tweenr_2.0.3 treeio_1.28.0 - [97] lattice_0.22-6 renv_1.0.7 bit_4.0.5 tidyselect_1.2.1 GO.db_3.19.1 SingleCellExperiment_1.26.0 -[103] Biostrings_2.72.0 scuttle_1.14.0 knitr_1.46 gridExtra_2.3 IRanges_2.38.0 SummarizedExperiment_1.34.0 -[109] stats4_4.4.0 xfun_0.43 graphlayouts_1.1.1 Biobase_2.64.0 statmod_1.5.0 matrixStats_1.3.0 -[115] stringi_1.8.3 UCSC.utils_1.0.0 lazyeval_0.2.2 ggfun_0.1.4 yaml_2.3.8 evaluate_0.23 -[121] codetools_0.2-20 ggraph_2.2.1 tibble_3.2.1 qvalue_2.36.0 BiocManager_1.30.22 ggplotify_0.1.2 -[127] cli_3.6.2 xtable_1.8-4 munsell_0.5.1 Rcpp_1.0.12 GenomeInfoDb_1.40.0 coda_0.19-4.1 -[133] png_0.1-8 parallel_4.4.0 ggplot2_3.5.1 readr_2.1.5 blob_1.2.4 DOSE_3.30.0 -[139] sparseMatrixStats_1.16.0 mvtnorm_1.2-4 viridisLite_0.4.2 tidytree_0.4.6 scales_1.3.0 purrr_1.0.2 -[145] crayon_1.5.2 rlang_1.1.3 cowplot_1.1.3 fastmatch_1.1-4 KEGGREST_1.44.0
+ [1] RColorBrewer_1.1-3 rstudioapi_0.16.0 jsonlite_1.8.8 magrittr_2.0.3 + [5] ggbeeswarm_0.7.2 farver_2.1.1 rmarkdown_2.26 fs_1.6.4 + [9] zlibbioc_1.50.0 vctrs_0.6.5 memoise_2.0.1 DelayedMatrixStats_1.26.0 + [13] ggtree_3.12.0 htmltools_0.5.8.1 S4Arrays_1.4.0 BiocNeighbors_1.22.0 + [17] gridGraphics_0.5-1 SparseArray_1.4.0 sass_0.4.9 bslib_0.7.0 + [21] htmlwidgets_1.6.4 plyr_1.8.9 cachem_1.0.8 igraph_2.0.3 + [25] lifecycle_1.0.4 pkgconfig_2.0.3 gson_0.1.0 rsvd_1.0.5 + [29] Matrix_1.7-0 R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.12 + [33] aplot_0.2.2 digest_0.6.35 enrichplot_1.24.0 colorspace_2.1-0 + [37] patchwork_1.2.0 scater_1.32.0 irlba_2.3.5.1 RSQLite_2.3.6 + [41] beachmat_2.20.0 labeling_0.4.3 fansi_1.0.6 httr_1.4.7 + [45] polyclip_1.10-6 abind_1.4-5 compiler_4.4.0 bit64_4.0.5 + [49] withr_3.0.0 BiocParallel_1.38.0 viridis_0.6.5 DBI_1.2.2 + [53] ggforce_0.4.2 R.utils_2.12.3 MASS_7.3-60.2 DelayedArray_0.30.0 + [57] HDO.db_0.99.1 tools_4.4.0 vipor_0.4.7 scatterpie_0.2.2 + [61] ape_5.8 beeswarm_0.4.0 R.oo_1.26.0 glue_1.7.0 + [65] nlme_3.1-164 GOSemSim_2.30.0 shadowtext_0.1.3 grid_4.4.0 + [69] reshape2_1.4.4 fgsea_1.30.0 generics_0.1.3 gtable_0.3.5 + [73] tzdb_0.4.0 R.methodsS3_1.8.2 tidyr_1.3.1 data.table_1.15.4 + [77] hms_1.1.3 BiocSingular_1.20.0 tidygraph_1.3.1 ScaledMatrix_1.12.0 + [81] utf8_1.2.4 XVector_0.44.0 stringr_1.5.1 ggrepel_0.9.5 + [85] pillar_1.9.0 vroom_1.6.5 yulab.utils_0.1.4 babelgene_22.9 + [89] splines_4.4.0 dplyr_1.1.4 tweenr_2.0.3 treeio_1.28.0 + [93] lattice_0.22-6 renv_1.0.7 bit_4.0.5 tidyselect_1.2.1 + [97] GO.db_3.19.1 Biostrings_2.72.0 scuttle_1.14.0 knitr_1.46 +[101] gridExtra_2.3 xfun_0.43 graphlayouts_1.1.1 stringi_1.8.3 +[105] UCSC.utils_1.0.0 lazyeval_0.2.2 ggfun_0.1.4 yaml_2.3.8 +[109] evaluate_0.23 codetools_0.2-20 ggraph_2.2.1 qvalue_2.36.0 +[113] tibble_3.2.1 ggplotify_0.1.2 cli_3.6.2 xtable_1.8-4 +[117] jquerylib_0.1.4 munsell_0.5.1 Rcpp_1.0.12 png_0.1-8 +[121] parallel_4.4.0 ggplot2_3.5.1 readr_2.1.5 blob_1.2.4 +[125] DOSE_3.30.0 sparseMatrixStats_1.16.0 tidytree_0.4.6 viridisLite_0.4.2 +[129] scales_1.3.0 purrr_1.0.2 crayon_1.5.2 rlang_1.1.3 +[133] fastmatch_1.1-4 cowplot_1.1.3 KEGGREST_1.44.0 -
---
title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)"
output:
  html_notebook:
    toc: true
    toc_float: true
author: CCDL for ALSF
date: 2024
---

## Objectives

This notebook will demonstrate how to:

- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA)
- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package
- Perform GSEA with the `clusterProfiler` package
- Visualize GSEA results with the `enrichplot` package

---

In this notebook, we'll analyze the differential expression results from the last notebook.

GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).
The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.

There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)):

1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our DESeq2 results)
2. Gene-level statistics are aggregated into a pathway-level statistic
3. Assess the statistical significance of the pathway-level statistic

#### Other resources

* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html)
* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood.
You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012).
* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted.

## Set up

### Libraries

```{r libraries}
# Package to run GSEA
library(clusterProfiler)
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
```

### Directories and Files

#### Directories

```{r create_dir, live = TRUE}
# We'll use the marker genes as GSEA input
rms_analysis_dir <- file.path("analysis", "rms")

# We'll create a directory to specifically hold the pathway results if it doesn't
# exist yet
results_dir <- file.path(rms_analysis_dir, "pathway-analysis")
fs::dir_create(results_dir)
```

#### Input files


```{r input_files}
input_file <- file.path(rms_analysis_dir,
                        "deseq",
                        "rms_myoblast_deseq_results.tsv")
```

#### Output files

We'll save our table of GSEA results as a TSV.

```{r output_files}
output_file <- file.path(results_dir,
                         "rms_myoblast_gsea_results.tsv")
```

## Gene sets

We will use gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) from the Broad Institute ([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102)).
The [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package contains MSigDB datasets already in the tidy format required by `clusterProfiler` and supports multiple organisms.

Let's take a look at what organisms the package supports.

```{r show_species}
msigdbr_species()
```

MSigDB contains 8 different gene set collections.

    H: hallmark gene sets
    C1: positional gene sets
    C2: curated gene sets
    C3: motif gene sets
    C4: computational gene sets
    C5: GO gene sets
    C6: oncogenic signatures
    C7: immunologic signatures

We'll use the Hallmark collection for GSEA.
Here's an excerpt of the [collection description](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H):

> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

Notably, there are only 50 gene sets included in this collection.
The fewer gene sets we test, the lower our multiple hypothesis testing burden.

We can retrieve only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function.

```{r immunologic_sets, live = TRUE}
hs_hallmarks_df <- msigdbr(species = "Homo sapiens",
                           category = "H")
```

## Gene Set Enrichment Analysis

_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_

![](diagrams/subramanian_fig1.jpg)

**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).**

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic.
This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis))

Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated:
Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway.
The enrichment score for a pathway is the running sum's maximum deviation from zero.
GSEA also assesses statistical significance of the scores for each pathway through permutation testing.
As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).

The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets.
When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

## DESeq2 results

```{r read_in_markers, live = TRUE}
deseq_df <- readr::read_tsv(input_file)
```

```{r deseq_head}
head(deseq_df)
```

This data frame uses Ensembl gene identifiers.
We'll need to make sure our gene sets use the same identifiers.
Let's take a look at the first few rows of the data frame that contains the hallmark genesets.

```{r hallmark_head, live = TRUE}
head(hs_hallmarks_df)
```

We can see that the gene sets from `msigdbr` have Ensembl gene identifiers associated with them, so we don't need to do any conversion.
However, we'll need to pass the correct column to the function that runs GSEA.

One good thing about Ensembl gene identifiers is that they are less likely to be duplicated than, for example, gene symbols.
(Multiple Ensembl gene identifiers can map to the same symbol.)

The GSEA approach requires on discriminating between genes that are in a gene set and those that are not.
Practically speaking, gene sets are just collections of gene identifiers!
When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.
So, let's check for duplicates in the data frame of DESeq2 results.

```{r check_duplicates, live = TRUE}
any(duplicated(deseq_df$ensembl_id))
```

There are no duplicates for us to worry about!

### Pre-ranked list

The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
This is step 1 -- gene-level statistics.

```{r lfc_vector}
lfc_vector <- deseq_df |>
  # Extract a vector of `log2FoldChange` named by `ensembl_id`
  dplyr::pull(log2FoldChange, name = ensembl_id)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
```

Let's look at the top ranked values.

```{r head_lfc, live = TRUE}
# Look at first entries of the log fold change vector
head(lfc_vector)
```

And the bottom of the list.

```{r tail_lfc, live = TRUE}
# Look at the last entries of the log fold change vector
tail(lfc_vector)
```

## Run GSEA

Now for the analysis!

We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`).

```{r run_gsea}
gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
                     minGSSize = 25,  # minimum gene set size
                     maxGSSize = 500,  # maximum gene set set
                     pvalueCutoff = 0.05,
                     pAdjustMethod = "BH",  # correction for multiple hypothesis testing
                     TERM2GENE = dplyr::select(hs_hallmarks_df,
                                               gs_name,
                                               ensembl_gene))  # pass the correct identifier column
```
Let's take a look at the GSEA results.

```{r view_gsea, live = TRUE, eval = FALSE}
View(gsea_results@result |>
       dplyr::arrange(dplyr::desc(NES))
)
```

Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.

Pathways with significant, highly positive NES are enriched in ERMS myoblasts, whereas pathways with significant, highly negative NES are enriched in ARMS myoblasts.

Let's write these results to file.

```{r write_gsea}
gsea_results@result |> readr::write_tsv(output_file)
```

### Visualizing GSEA results

We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`.
Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated.

#### Highly Positive NES

Let's take look at a pathway with a highly positive NES (`HALLMARK_MYC_TARGETS_V2`) using a GSEA plot.

```{r highly_pos}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYC_TARGETS_V2",
                     title = "HALLMARK_MYC_TARGETS_V2",
                     color.line = "#0066FF")
```

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.

#### Highly Negative NES

The gene set `HALLMARK_MYOGENESIS` had a highly negative NES.

```{r highly_neg}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYOGENESIS",
                     title = "HALLMARK_MYOGENESIS",
                     color.line = "#0066FF")
```

This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph.

#### A non-significant result

The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name.
Let's look at `HALLMARK_P53_PATHWAY`, which was not in the results we viewed earlier.

```{r p53, live = TRUE}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_P53_PATHWAY",
                     title = "HALLMARK_P53_PATHWAY",
                     color.line = "#0066FF")
```

Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score.

*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.*

## Session Info

```{r session_info}
sessionInfo()
```

+
---
title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)"
output:
  html_notebook:
    toc: true
    toc_float: true
author: CCDL for ALSF
date: 2024
---

## Objectives

This notebook will demonstrate how to:

- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA)
- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package
- Perform GSEA with the `clusterProfiler` package
- Visualize GSEA results with the `enrichplot` package

---

In this notebook, we'll analyze the differential expression results from the last notebook.

GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).
The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.

There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)):

1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our DESeq2 results)
2. Gene-level statistics are aggregated into a pathway-level statistic
3. Assess the statistical significance of the pathway-level statistic

#### Other resources

* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html)
* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood.
You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012).
* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted.

## Set up

### Libraries

```{r libraries}
# Package to run GSEA
library(clusterProfiler)
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
```

### Directories and Files

#### Directories

```{r create_dir, live = TRUE}
# We'll use the marker genes as GSEA input
rms_analysis_dir <- file.path("analysis", "rms")

# We'll create a directory to specifically hold the pathway results if it doesn't
# exist yet
results_dir <- file.path(rms_analysis_dir, "pathway-analysis")
fs::dir_create(results_dir)
```

#### Input files


```{r input_files}
input_file <- file.path(rms_analysis_dir,
                        "deseq",
                        "rms_myoblast_deseq_results.tsv")
```

#### Output files

We'll save our table of GSEA results as a TSV.

```{r output_files}
output_file <- file.path(results_dir,
                         "rms_myoblast_gsea_results.tsv")
```

## Gene sets

We will use gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) from the Broad Institute ([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102)).
The [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package contains MSigDB datasets already in the tidy format required by `clusterProfiler` and supports multiple organisms.

Let's take a look at what organisms the package supports.

```{r show_species}
msigdbr_species()
```

MSigDB contains 8 different gene set collections.

    H: hallmark gene sets
    C1: positional gene sets
    C2: curated gene sets
    C3: motif gene sets
    C4: computational gene sets
    C5: GO gene sets
    C6: oncogenic signatures
    C7: immunologic signatures

We'll use the Hallmark collection for GSEA.
Here's an excerpt of the [collection description](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H):

> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

Notably, there are only 50 gene sets included in this collection.
The fewer gene sets we test, the lower our multiple hypothesis testing burden.

We can retrieve only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function.

```{r immunologic_sets, live = TRUE}
hs_hallmarks_df <- msigdbr(species = "Homo sapiens",
                           category = "H")
```

## Gene Set Enrichment Analysis

_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_

![](diagrams/subramanian_fig1.jpg)

**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).**

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic.
This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis))

Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated:
Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway.
The enrichment score for a pathway is the running sum's maximum deviation from zero.
GSEA also assesses statistical significance of the scores for each pathway through permutation testing.
As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).

The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets.
When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

## DESeq2 results

```{r read_in_markers, live = TRUE}
deseq_df <- readr::read_tsv(input_file)
```

```{r deseq_head}
head(deseq_df)
```

This data frame uses Ensembl gene identifiers.
We'll need to make sure our gene sets use the same identifiers.
Let's take a look at the first few rows of the data frame that contains the hallmark gene sets.

```{r hallmark_head, live = TRUE}
head(hs_hallmarks_df)
```

We can see that the gene sets from `msigdbr` have Ensembl gene identifiers associated with them, so we don't need to do any conversion.
However, we'll need to pass the correct column to the function that runs GSEA.

One good thing about Ensembl gene identifiers is that they are less likely to be duplicated than, for example, gene symbols.
(Multiple Ensembl gene identifiers can map to the same symbol.)

The GSEA approach requires on discriminating between genes that are in a gene set and those that are not.
Practically speaking, gene sets are just collections of gene identifiers!
When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.
So, let's check for duplicates in the data frame of DESeq2 results.

```{r check_duplicates, live = TRUE}
any(duplicated(deseq_df$ensembl_id))
```

There are no duplicates for us to worry about!

### Pre-ranked list

The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
This is step 1 -- gene-level statistics.

```{r lfc_vector}
lfc_vector <- deseq_df |>
  # Extract a vector of `log2FoldChange` named by `ensembl_id`
  dplyr::pull(log2FoldChange, name = ensembl_id)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
```

Let's look at the top ranked values.

```{r head_lfc, live = TRUE}
# Look at first entries of the log fold change vector
head(lfc_vector)
```

And the bottom of the list.

```{r tail_lfc, live = TRUE}
# Look at the last entries of the log fold change vector
tail(lfc_vector)
```

## Run GSEA

Now for the analysis!

We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`).

```{r run_gsea}
gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
                     minGSSize = 25,  # minimum gene set size
                     maxGSSize = 500,  # maximum gene set set
                     pvalueCutoff = 0.05,
                     pAdjustMethod = "BH",  # correction for multiple hypothesis testing
                     TERM2GENE = dplyr::select(hs_hallmarks_df,
                                               gs_name,
                                               ensembl_gene))  # pass the correct identifier column
```
Let's take a look at the GSEA results.

```{r view_gsea, live = TRUE, eval = FALSE}
View(gsea_results@result |>
       dplyr::arrange(dplyr::desc(NES))
)
```

Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.

Pathways with significant, highly positive NES are enriched in ERMS myoblasts, whereas pathways with significant, highly negative NES are enriched in ARMS myoblasts.

Let's write these results to file.

```{r write_gsea}
gsea_results@result |> readr::write_tsv(output_file)
```

### Visualizing GSEA results

We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`.
Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated.

#### Highly Positive NES

Let's take look at a pathway with a highly positive NES (`HALLMARK_MYC_TARGETS_V2`) using a GSEA plot.

```{r highly_pos}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYC_TARGETS_V2",
                     title = "HALLMARK_MYC_TARGETS_V2",
                     color.line = "#0066FF")
```

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.

#### Highly Negative NES

The gene set `HALLMARK_MYOGENESIS` had a highly negative NES.

```{r highly_neg}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYOGENESIS",
                     title = "HALLMARK_MYOGENESIS",
                     color.line = "#0066FF")
```

This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph.

#### A non-significant result

The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name.
Let's look at `HALLMARK_P53_PATHWAY`, which was not in the results we viewed earlier.

```{r p53, live = TRUE}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_P53_PATHWAY",
                     title = "HALLMARK_P53_PATHWAY",
                     color.line = "#0066FF")
```

Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score.

*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.*

## Session Info

```{r session_info}
sessionInfo()
```

From fb4a2a8b4bb70f1f3b5e44fd17c26c95cb7f9cbd Mon Sep 17 00:00:00 2001 From: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com> Date: Mon, 25 Nov 2024 10:24:55 -0500 Subject: [PATCH 5/7] Make corresponding changes to render live script --- scripts/render-live.sh | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/scripts/render-live.sh b/scripts/render-live.sh index 2cb07fc4..c95e12f4 100644 --- a/scripts/render-live.sh +++ b/scripts/render-live.sh @@ -35,8 +35,7 @@ files=( scRNA-seq-advanced/01-read_filter_normalize_scRNA.Rmd scRNA-seq-advanced/02-dataset_integration.Rmd scRNA-seq-advanced/03-differential_expression.Rmd - scRNA-seq-advanced/04-overrepresentation_analysis.Rmd - scRNA-seq-advanced/05-gene_set_enrichment_analysis.Rmd + scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd # machine-learning/01-openpbta_heatmap.Rmd # machine-learning/02-openpbta_consensus_clustering.Rmd # machine-learning/03-openpbta_PLIER.Rmd From fff81bf861600a31f96a211d352253c9a8709442 Mon Sep 17 00:00:00 2001 From: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com> Date: Mon, 25 Nov 2024 12:22:14 -0500 Subject: [PATCH 6/7] Apply suggestions from code review Co-authored-by: Stephanie Spielman --- scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd index 0ae64adb..5b519495 100644 --- a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd +++ b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd @@ -24,18 +24,18 @@ In this notebook, we'll analyze the differential expression results from the las GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102). The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest. -There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)): +There are 3 general steps in FCS methods ([Khatri _et al._ 2012](https://doi.org/10.1371/journal.pcbi.1002375)): 1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our DESeq2 results) -2. Gene-level statistics are aggregated into a pathway-level statistic +2. Aggregate gene-level statistics into a pathway-level statistic 3. Assess the statistical significance of the pathway-level statistic #### Other resources * For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html) * The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood. -You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012). -* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted. +You can read more about `fgsea` in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012). +* See the [refine.bio examples for "Gene set enrichment analysis - RNA-seq"](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted. ## Set up From 44f3b761366124417510894ef28aa3e02f595d88 Mon Sep 17 00:00:00 2001 From: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com> Date: Mon, 25 Nov 2024 17:54:11 +0000 Subject: [PATCH 7/7] Respond to review and rerun --- .../04-gene_set_enrichment_analysis.Rmd | 5 +- .../04-gene_set_enrichment_analysis.nb.html | 123 +++++++----------- 2 files changed, 52 insertions(+), 76 deletions(-) diff --git a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd index 5b519495..5029c88d 100644 --- a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd +++ b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd @@ -32,7 +32,7 @@ There are 3 general steps in FCS methods ([Khatri _et al._ 2012](https://doi.org #### Other resources -* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html) +* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/Training-modules/DGE-functional-analysis/lessons/02_functional_analysis.html) * The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood. You can read more about `fgsea` in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012). * See the [refine.bio examples for "Gene set enrichment analysis - RNA-seq"](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted. @@ -158,6 +158,9 @@ head(hs_hallmarks_df) We can see that the gene sets from `msigdbr` have Ensembl gene identifiers associated with them, so we don't need to do any conversion. However, we'll need to pass the correct column to the function that runs GSEA. +If we needed to do gene identifier conversion, we would likely use the `AnnotationDbi` package. +You can see an example in this Harvard Chan Bioinformatics Core Training material: + One good thing about Ensembl gene identifiers is that they are less likely to be duplicated than, for example, gene symbols. (Multiple Ensembl gene identifiers can map to the same symbol.) diff --git a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html index a825edd6..5197164f 100644 --- a/scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html +++ b/scRNA-seq-advanced/04-gene_set_enrichment_analysis.nb.html @@ -2911,8 +2911,7 @@

Objectives

  1. Calculate a gene-level statistic (here, we’ll use the summary log fold changes in our DESeq2 results)
  2. -
  3. Gene-level statistics are aggregated into a pathway-level -statistic
  4. +
  5. Aggregate gene-level statistics into a pathway-level statistic
  6. Assess the statistical significance of the pathway-level statistic
@@ -2920,16 +2919,16 @@

Objectives

Other resources

@@ -2939,36 +2938,22 @@

Set up

Libraries

- -
# Package to run GSEA
- - -
Warning message:
-replacing previous import ‘S4Arrays::makeNindexFromArrayViewport’ by ‘DelayedArray::makeNindexFromArrayViewport’ when loading ‘SummarizedExperiment’ 
- - -
library(clusterProfiler)
+ +
# Package to run GSEA
+library(clusterProfiler)
- -
clusterProfiler v4.12.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
+
+

+Registered S3 method overwritten by 'data.table':
+  method           from
+  print.data.table     
+clusterProfiler v4.12.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
 
 If you use clusterProfiler in published research, please cite:
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
 
 Attaching package: ‘clusterProfiler’
 
-The following object is masked from ‘package:AnnotationDbi’:
-
-    select
-
-The following object is masked from ‘package:IRanges’:
-
-    slice
-
-The following object is masked from ‘package:S4Vectors’:
-
-    rename
-
 The following object is masked from ‘package:stats’:
 
     filter
@@ -3114,12 +3099,12 @@

DESeq2 results

deseq_df <- readr::read_tsv(input_file)
- +
Rows: 60319 Columns: 27
-── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
+── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Delimiter: "\t"
 chr  (2): ensembl_id, gene_symbol
-dbl (25): baseMean, log2FoldChange, lfcSE, pvalue, padj, SCPCL000478.mean, SCPCL000478.detected, SCPCL000479.mean, SCPCL000479.det...
+dbl (25): baseMean, log2FoldChange, lfcSE, pvalue, padj, SCPCL000478.mean, SCPCL000478.detected, SCPCL000479.mean, SCPCL000479.de...
 
 ℹ Use `spec()` to retrieve the full column specification for this data.
 ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
@@ -3162,6 +3147,9 @@

DESeq2 results

gene identifiers associated with them, so we don’t need to do any conversion. However, we’ll need to pass the correct column to the function that runs GSEA.

+

If we needed to do gene identifier conversion, we would likely use +the AnnotationDbi package. You can see an example in this +Harvard Chan Bioinformatics Core Training material: https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/AnnotationDbi_lesson.html

One good thing about Ensembl gene identifiers is that they are less likely to be duplicated than, for example, gene symbols. (Multiple Ensembl gene identifiers can map to the same symbol.)

@@ -3363,7 +3351,7 @@

Session Info

sessionInfo()
- +
R version 4.4.0 (2024-04-24)
 Platform: x86_64-pc-linux-gnu
 Running under: Ubuntu 22.04.4 LTS
@@ -3381,55 +3369,40 @@ 

Session Info

tzcode source: system (glibc) attached base packages: -[1] stats4 stats graphics grDevices utils datasets methods base +[1] stats graphics grDevices datasets utils methods base other attached packages: - [1] clusterProfiler_4.12.0 msigdbr_7.5.1 GSEABase_1.66.0 graph_1.82.0 - [5] annotate_1.82.0 XML_3.99-0.16.1 AnnotationDbi_1.66.0 AUCell_1.26.0 - [9] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0 Biobase_2.64.0 GenomicRanges_1.56.0 -[13] GenomeInfoDb_1.40.0 IRanges_2.38.0 S4Vectors_0.42.0 BiocGenerics_0.50.0 -[17] MatrixGenerics_1.16.0 matrixStats_1.3.0 +[1] msigdbr_7.5.1 clusterProfiler_4.12.0 loaded via a namespace (and not attached): - [1] RColorBrewer_1.1-3 rstudioapi_0.16.0 jsonlite_1.8.8 magrittr_2.0.3 - [5] ggbeeswarm_0.7.2 farver_2.1.1 rmarkdown_2.26 fs_1.6.4 - [9] zlibbioc_1.50.0 vctrs_0.6.5 memoise_2.0.1 DelayedMatrixStats_1.26.0 - [13] ggtree_3.12.0 htmltools_0.5.8.1 S4Arrays_1.4.0 BiocNeighbors_1.22.0 - [17] gridGraphics_0.5-1 SparseArray_1.4.0 sass_0.4.9 bslib_0.7.0 - [21] htmlwidgets_1.6.4 plyr_1.8.9 cachem_1.0.8 igraph_2.0.3 - [25] lifecycle_1.0.4 pkgconfig_2.0.3 gson_0.1.0 rsvd_1.0.5 - [29] Matrix_1.7-0 R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.12 - [33] aplot_0.2.2 digest_0.6.35 enrichplot_1.24.0 colorspace_2.1-0 - [37] patchwork_1.2.0 scater_1.32.0 irlba_2.3.5.1 RSQLite_2.3.6 - [41] beachmat_2.20.0 labeling_0.4.3 fansi_1.0.6 httr_1.4.7 - [45] polyclip_1.10-6 abind_1.4-5 compiler_4.4.0 bit64_4.0.5 - [49] withr_3.0.0 BiocParallel_1.38.0 viridis_0.6.5 DBI_1.2.2 - [53] ggforce_0.4.2 R.utils_2.12.3 MASS_7.3-60.2 DelayedArray_0.30.0 - [57] HDO.db_0.99.1 tools_4.4.0 vipor_0.4.7 scatterpie_0.2.2 - [61] ape_5.8 beeswarm_0.4.0 R.oo_1.26.0 glue_1.7.0 - [65] nlme_3.1-164 GOSemSim_2.30.0 shadowtext_0.1.3 grid_4.4.0 - [69] reshape2_1.4.4 fgsea_1.30.0 generics_0.1.3 gtable_0.3.5 - [73] tzdb_0.4.0 R.methodsS3_1.8.2 tidyr_1.3.1 data.table_1.15.4 - [77] hms_1.1.3 BiocSingular_1.20.0 tidygraph_1.3.1 ScaledMatrix_1.12.0 - [81] utf8_1.2.4 XVector_0.44.0 stringr_1.5.1 ggrepel_0.9.5 - [85] pillar_1.9.0 vroom_1.6.5 yulab.utils_0.1.4 babelgene_22.9 - [89] splines_4.4.0 dplyr_1.1.4 tweenr_2.0.3 treeio_1.28.0 - [93] lattice_0.22-6 renv_1.0.7 bit_4.0.5 tidyselect_1.2.1 - [97] GO.db_3.19.1 Biostrings_2.72.0 scuttle_1.14.0 knitr_1.46 -[101] gridExtra_2.3 xfun_0.43 graphlayouts_1.1.1 stringi_1.8.3 -[105] UCSC.utils_1.0.0 lazyeval_0.2.2 ggfun_0.1.4 yaml_2.3.8 -[109] evaluate_0.23 codetools_0.2-20 ggraph_2.2.1 qvalue_2.36.0 -[113] tibble_3.2.1 ggplotify_0.1.2 cli_3.6.2 xtable_1.8-4 -[117] jquerylib_0.1.4 munsell_0.5.1 Rcpp_1.0.12 png_0.1-8 -[121] parallel_4.4.0 ggplot2_3.5.1 readr_2.1.5 blob_1.2.4 -[125] DOSE_3.30.0 sparseMatrixStats_1.16.0 tidytree_0.4.6 viridisLite_0.4.2 -[129] scales_1.3.0 purrr_1.0.2 crayon_1.5.2 rlang_1.1.3 -[133] fastmatch_1.1-4 cowplot_1.1.3 KEGGREST_1.44.0
+ [1] DBI_1.2.2 gson_0.1.0 shadowtext_0.1.3 gridExtra_2.3 rlang_1.1.3 + [6] magrittr_2.0.3 DOSE_3.30.0 compiler_4.4.0 RSQLite_2.3.6 png_0.1-8 + [11] vctrs_0.6.5 reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3 crayon_1.5.2 + [16] fastmap_1.1.1 XVector_0.44.0 labeling_0.4.3 ggraph_2.2.1 utf8_1.2.4 + [21] HDO.db_0.99.1 tzdb_0.4.0 enrichplot_1.24.0 UCSC.utils_1.0.0 purrr_1.0.2 + [26] bit_4.0.5 xfun_0.43 zlibbioc_1.50.0 cachem_1.0.8 aplot_0.2.2 + [31] GenomeInfoDb_1.40.0 jsonlite_1.8.8 blob_1.2.4 BiocParallel_1.38.0 tweenr_2.0.3 + [36] parallel_4.4.0 R6_2.5.1 stringi_1.8.3 RColorBrewer_1.1-3 GOSemSim_2.30.0 + [41] Rcpp_1.0.12 knitr_1.46 readr_2.1.5 IRanges_2.38.0 Matrix_1.7-0 + [46] splines_4.4.0 igraph_2.0.3 tidyselect_1.2.1 qvalue_2.36.0 rstudioapi_0.16.0 + [51] yaml_2.3.8 viridis_0.6.5 codetools_0.2-20 lattice_0.22-6 tibble_3.2.1 + [56] plyr_1.8.9 treeio_1.28.0 Biobase_2.64.0 withr_3.0.0 KEGGREST_1.44.0 + [61] gridGraphics_0.5-1 scatterpie_0.2.2 polyclip_1.10-6 Biostrings_2.72.0 pillar_1.9.0 + [66] BiocManager_1.30.22 ggtree_3.12.0 renv_1.0.7 stats4_4.4.0 ggfun_0.1.4 + [71] generics_0.1.3 vroom_1.6.5 hms_1.1.3 S4Vectors_0.42.0 ggplot2_3.5.1 + [76] tidytree_0.4.6 munsell_0.5.1 scales_1.3.0 glue_1.7.0 lazyeval_0.2.2 + [81] tools_4.4.0 data.table_1.15.4 fgsea_1.30.0 babelgene_22.9 fs_1.6.4 + [86] graphlayouts_1.1.1 fastmatch_1.1-4 tidygraph_1.3.1 cowplot_1.1.3 grid_4.4.0 + [91] ape_5.8 tidyr_1.3.1 AnnotationDbi_1.66.0 colorspace_2.1-0 nlme_3.1-164 + [96] GenomeInfoDbData_1.2.12 patchwork_1.2.0 ggforce_0.4.2 cli_3.6.2 fansi_1.0.6 +[101] viridisLite_0.4.2 dplyr_1.1.4 gtable_0.3.5 yulab.utils_0.1.4 digest_0.6.35 +[106] BiocGenerics_0.50.0 ggrepel_0.9.5 ggplotify_0.1.2 farver_2.1.1 memoise_2.0.1 +[111] lifecycle_1.0.4 httr_1.4.7 GO.db_3.19.1 bit64_4.0.5 MASS_7.3-60.2
-
---
title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)"
output:
  html_notebook:
    toc: true
    toc_float: true
author: CCDL for ALSF
date: 2024
---

## Objectives

This notebook will demonstrate how to:

- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA)
- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package
- Perform GSEA with the `clusterProfiler` package
- Visualize GSEA results with the `enrichplot` package

---

In this notebook, we'll analyze the differential expression results from the last notebook.

GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).
The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.

There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)):

1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our DESeq2 results)
2. Gene-level statistics are aggregated into a pathway-level statistic
3. Assess the statistical significance of the pathway-level statistic

#### Other resources

* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html)
* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood.
You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012).
* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted.

## Set up

### Libraries

```{r libraries}
# Package to run GSEA
library(clusterProfiler)
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
```

### Directories and Files

#### Directories

```{r create_dir, live = TRUE}
# We'll use the marker genes as GSEA input
rms_analysis_dir <- file.path("analysis", "rms")

# We'll create a directory to specifically hold the pathway results if it doesn't
# exist yet
results_dir <- file.path(rms_analysis_dir, "pathway-analysis")
fs::dir_create(results_dir)
```

#### Input files


```{r input_files}
input_file <- file.path(rms_analysis_dir,
                        "deseq",
                        "rms_myoblast_deseq_results.tsv")
```

#### Output files

We'll save our table of GSEA results as a TSV.

```{r output_files}
output_file <- file.path(results_dir,
                         "rms_myoblast_gsea_results.tsv")
```

## Gene sets

We will use gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) from the Broad Institute ([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102)).
The [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package contains MSigDB datasets already in the tidy format required by `clusterProfiler` and supports multiple organisms.

Let's take a look at what organisms the package supports.

```{r show_species}
msigdbr_species()
```

MSigDB contains 8 different gene set collections.

    H: hallmark gene sets
    C1: positional gene sets
    C2: curated gene sets
    C3: motif gene sets
    C4: computational gene sets
    C5: GO gene sets
    C6: oncogenic signatures
    C7: immunologic signatures

We'll use the Hallmark collection for GSEA.
Here's an excerpt of the [collection description](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H):

> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

Notably, there are only 50 gene sets included in this collection.
The fewer gene sets we test, the lower our multiple hypothesis testing burden.

We can retrieve only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function.

```{r immunologic_sets, live = TRUE}
hs_hallmarks_df <- msigdbr(species = "Homo sapiens",
                           category = "H")
```

## Gene Set Enrichment Analysis

_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_

![](diagrams/subramanian_fig1.jpg)

**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).**

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic.
This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis))

Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated:
Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway.
The enrichment score for a pathway is the running sum's maximum deviation from zero.
GSEA also assesses statistical significance of the scores for each pathway through permutation testing.
As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).

The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets.
When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

## DESeq2 results

```{r read_in_markers, live = TRUE}
deseq_df <- readr::read_tsv(input_file)
```

```{r deseq_head}
head(deseq_df)
```

This data frame uses Ensembl gene identifiers.
We'll need to make sure our gene sets use the same identifiers.
Let's take a look at the first few rows of the data frame that contains the hallmark gene sets.

```{r hallmark_head, live = TRUE}
head(hs_hallmarks_df)
```

We can see that the gene sets from `msigdbr` have Ensembl gene identifiers associated with them, so we don't need to do any conversion.
However, we'll need to pass the correct column to the function that runs GSEA.

One good thing about Ensembl gene identifiers is that they are less likely to be duplicated than, for example, gene symbols.
(Multiple Ensembl gene identifiers can map to the same symbol.)

The GSEA approach requires on discriminating between genes that are in a gene set and those that are not.
Practically speaking, gene sets are just collections of gene identifiers!
When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.
So, let's check for duplicates in the data frame of DESeq2 results.

```{r check_duplicates, live = TRUE}
any(duplicated(deseq_df$ensembl_id))
```

There are no duplicates for us to worry about!

### Pre-ranked list

The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
This is step 1 -- gene-level statistics.

```{r lfc_vector}
lfc_vector <- deseq_df |>
  # Extract a vector of `log2FoldChange` named by `ensembl_id`
  dplyr::pull(log2FoldChange, name = ensembl_id)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
```

Let's look at the top ranked values.

```{r head_lfc, live = TRUE}
# Look at first entries of the log fold change vector
head(lfc_vector)
```

And the bottom of the list.

```{r tail_lfc, live = TRUE}
# Look at the last entries of the log fold change vector
tail(lfc_vector)
```

## Run GSEA

Now for the analysis!

We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`).

```{r run_gsea}
gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
                     minGSSize = 25,  # minimum gene set size
                     maxGSSize = 500,  # maximum gene set set
                     pvalueCutoff = 0.05,
                     pAdjustMethod = "BH",  # correction for multiple hypothesis testing
                     TERM2GENE = dplyr::select(hs_hallmarks_df,
                                               gs_name,
                                               ensembl_gene))  # pass the correct identifier column
```
Let's take a look at the GSEA results.

```{r view_gsea, live = TRUE, eval = FALSE}
View(gsea_results@result |>
       dplyr::arrange(dplyr::desc(NES))
)
```

Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.

Pathways with significant, highly positive NES are enriched in ERMS myoblasts, whereas pathways with significant, highly negative NES are enriched in ARMS myoblasts.

Let's write these results to file.

```{r write_gsea}
gsea_results@result |> readr::write_tsv(output_file)
```

### Visualizing GSEA results

We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`.
Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated.

#### Highly Positive NES

Let's take look at a pathway with a highly positive NES (`HALLMARK_MYC_TARGETS_V2`) using a GSEA plot.

```{r highly_pos}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYC_TARGETS_V2",
                     title = "HALLMARK_MYC_TARGETS_V2",
                     color.line = "#0066FF")
```

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.

#### Highly Negative NES

The gene set `HALLMARK_MYOGENESIS` had a highly negative NES.

```{r highly_neg}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYOGENESIS",
                     title = "HALLMARK_MYOGENESIS",
                     color.line = "#0066FF")
```

This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph.

#### A non-significant result

The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name.
Let's look at `HALLMARK_P53_PATHWAY`, which was not in the results we viewed earlier.

```{r p53, live = TRUE}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_P53_PATHWAY",
                     title = "HALLMARK_P53_PATHWAY",
                     color.line = "#0066FF")
```

Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score.

*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.*

## Session Info

```{r session_info}
sessionInfo()
```

+
---
title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)"
output:
  html_notebook:
    toc: true
    toc_float: true
author: CCDL for ALSF
date: 2024
---

## Objectives

This notebook will demonstrate how to:

- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA)
- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package
- Perform GSEA with the `clusterProfiler` package
- Visualize GSEA results with the `enrichplot` package

---

In this notebook, we'll analyze the differential expression results from the last notebook.

GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).
The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.

There are 3 general steps in FCS methods ([Khatri _et al._ 2012](https://doi.org/10.1371/journal.pcbi.1002375)):

1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our DESeq2 results)
2. Aggregate gene-level statistics into a pathway-level statistic
3. Assess the statistical significance of the pathway-level statistic

#### Other resources

* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/Training-modules/DGE-functional-analysis/lessons/02_functional_analysis.html)
* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood.
You can read more about `fgsea` in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012).
* See the [refine.bio examples for "Gene set enrichment analysis - RNA-seq"](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted.

## Set up

### Libraries

```{r libraries}
# Package to run GSEA
library(clusterProfiler)
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
```

### Directories and Files

#### Directories

```{r create_dir, live = TRUE}
# We'll use the marker genes as GSEA input
rms_analysis_dir <- file.path("analysis", "rms")

# We'll create a directory to specifically hold the pathway results if it doesn't
# exist yet
results_dir <- file.path(rms_analysis_dir, "pathway-analysis")
fs::dir_create(results_dir)
```

#### Input files


```{r input_files}
input_file <- file.path(rms_analysis_dir,
                        "deseq",
                        "rms_myoblast_deseq_results.tsv")
```

#### Output files

We'll save our table of GSEA results as a TSV.

```{r output_files}
output_file <- file.path(results_dir,
                         "rms_myoblast_gsea_results.tsv")
```

## Gene sets

We will use gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) from the Broad Institute ([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102)).
The [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package contains MSigDB datasets already in the tidy format required by `clusterProfiler` and supports multiple organisms.

Let's take a look at what organisms the package supports.

```{r show_species}
msigdbr_species()
```

MSigDB contains 8 different gene set collections.

    H: hallmark gene sets
    C1: positional gene sets
    C2: curated gene sets
    C3: motif gene sets
    C4: computational gene sets
    C5: GO gene sets
    C6: oncogenic signatures
    C7: immunologic signatures

We'll use the Hallmark collection for GSEA.
Here's an excerpt of the [collection description](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H):

> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

Notably, there are only 50 gene sets included in this collection.
The fewer gene sets we test, the lower our multiple hypothesis testing burden.

We can retrieve only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function.

```{r immunologic_sets, live = TRUE}
hs_hallmarks_df <- msigdbr(species = "Homo sapiens",
                           category = "H")
```

## Gene Set Enrichment Analysis

_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_

![](diagrams/subramanian_fig1.jpg)

**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).**

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic.
This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis))

Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated:
Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway.
The enrichment score for a pathway is the running sum's maximum deviation from zero.
GSEA also assesses statistical significance of the scores for each pathway through permutation testing.
As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).

The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets.
When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

## DESeq2 results

```{r read_in_markers, live = TRUE}
deseq_df <- readr::read_tsv(input_file)
```

```{r deseq_head}
head(deseq_df)
```

This data frame uses Ensembl gene identifiers.
We'll need to make sure our gene sets use the same identifiers.
Let's take a look at the first few rows of the data frame that contains the hallmark gene sets.

```{r hallmark_head, live = TRUE}
head(hs_hallmarks_df)
```

We can see that the gene sets from `msigdbr` have Ensembl gene identifiers associated with them, so we don't need to do any conversion.
However, we'll need to pass the correct column to the function that runs GSEA.

If we needed to do gene identifier conversion, we would likely use the `AnnotationDbi` package.
You can see an example in this Harvard Chan Bioinformatics Core Training material: <https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/AnnotationDbi_lesson.html>

One good thing about Ensembl gene identifiers is that they are less likely to be duplicated than, for example, gene symbols.
(Multiple Ensembl gene identifiers can map to the same symbol.)

The GSEA approach requires on discriminating between genes that are in a gene set and those that are not.
Practically speaking, gene sets are just collections of gene identifiers!
When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.
So, let's check for duplicates in the data frame of DESeq2 results.

```{r check_duplicates, live = TRUE}
any(duplicated(deseq_df$ensembl_id))
```

There are no duplicates for us to worry about!

### Pre-ranked list

The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
This is step 1 -- gene-level statistics.

```{r lfc_vector}
lfc_vector <- deseq_df |>
  # Extract a vector of `log2FoldChange` named by `ensembl_id`
  dplyr::pull(log2FoldChange, name = ensembl_id)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
```

Let's look at the top ranked values.

```{r head_lfc, live = TRUE}
# Look at first entries of the log fold change vector
head(lfc_vector)
```

And the bottom of the list.

```{r tail_lfc, live = TRUE}
# Look at the last entries of the log fold change vector
tail(lfc_vector)
```

## Run GSEA

Now for the analysis!

We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`).

```{r run_gsea}
gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
                     minGSSize = 25,  # minimum gene set size
                     maxGSSize = 500,  # maximum gene set set
                     pvalueCutoff = 0.05,
                     pAdjustMethod = "BH",  # correction for multiple hypothesis testing
                     TERM2GENE = dplyr::select(hs_hallmarks_df,
                                               gs_name,
                                               ensembl_gene))  # pass the correct identifier column
```
Let's take a look at the GSEA results.

```{r view_gsea, live = TRUE, eval = FALSE}
View(gsea_results@result |>
       dplyr::arrange(dplyr::desc(NES))
)
```

Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.

Pathways with significant, highly positive NES are enriched in ERMS myoblasts, whereas pathways with significant, highly negative NES are enriched in ARMS myoblasts.

Let's write these results to file.

```{r write_gsea}
gsea_results@result |> readr::write_tsv(output_file)
```

### Visualizing GSEA results

We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`.
Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated.

#### Highly Positive NES

Let's take look at a pathway with a highly positive NES (`HALLMARK_MYC_TARGETS_V2`) using a GSEA plot.

```{r highly_pos}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYC_TARGETS_V2",
                     title = "HALLMARK_MYC_TARGETS_V2",
                     color.line = "#0066FF")
```

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.

#### Highly Negative NES

The gene set `HALLMARK_MYOGENESIS` had a highly negative NES.

```{r highly_neg}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYOGENESIS",
                     title = "HALLMARK_MYOGENESIS",
                     color.line = "#0066FF")
```

This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph.

#### A non-significant result

The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name.
Let's look at `HALLMARK_P53_PATHWAY`, which was not in the results we viewed earlier.

```{r p53, live = TRUE}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_P53_PATHWAY",
                     title = "HALLMARK_P53_PATHWAY",
                     color.line = "#0066FF")
```

Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score.

*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.*

## Session Info

```{r session_info}
sessionInfo()
```
