Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scAdvanced GSEA notebook that uses pseudobulk RMS DE results #808

Merged
merged 8 commits into from
Nov 25, 2024
286 changes: 286 additions & 0 deletions scRNA-seq-advanced/04-gene_set_enrichment_analysis.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,286 @@
---
title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)"
output:
html_notebook:
toc: true
toc_float: true
author: CCDL for ALSF
date: 2024
---

## Objectives

This notebook will demonstrate how to:

- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA)
- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package
- Perform GSEA with the `clusterProfiler` package
- Visualize GSEA results with the `enrichplot` package

---

In this notebook, we'll analyze the differential expression results from the last notebook.

GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).
The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.

There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)):
jaclyn-taroni marked this conversation as resolved.
Show resolved Hide resolved

1. Calculate a gene-level statistic (here, we'll use the summary log fold changes in our DESeq2 results)
2. Gene-level statistics are aggregated into a pathway-level statistic
jaclyn-taroni marked this conversation as resolved.
Show resolved Hide resolved
3. Assess the statistical significance of the pathway-level statistic

#### Other resources

* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized this training has been read-only for a little while.

I think this is probably their more maintained version? https://hbctraining.github.io/Training-modules/DGE-functional-analysis/lessons/02_functional_analysis.html

* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood.
You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012).
jaclyn-taroni marked this conversation as resolved.
Show resolved Hide resolved
* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted.
jaclyn-taroni marked this conversation as resolved.
Show resolved Hide resolved

## Set up

### Libraries

```{r libraries}
# Package to run GSEA
library(clusterProfiler)
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
```

### Directories and Files

#### Directories

```{r create_dir, live = TRUE}
# We'll use the marker genes as GSEA input
rms_analysis_dir <- file.path("analysis", "rms")

# We'll create a directory to specifically hold the pathway results if it doesn't
# exist yet
results_dir <- file.path(rms_analysis_dir, "pathway-analysis")
fs::dir_create(results_dir)
```

#### Input files


```{r input_files}
input_file <- file.path(rms_analysis_dir,
"deseq",
"rms_myoblast_deseq_results.tsv")
```

#### Output files

We'll save our table of GSEA results as a TSV.

```{r output_files}
output_file <- file.path(results_dir,
"rms_myoblast_gsea_results.tsv")
```

## Gene sets

We will use gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) from the Broad Institute ([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102)).
The [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package contains MSigDB datasets already in the tidy format required by `clusterProfiler` and supports multiple organisms.

Let's take a look at what organisms the package supports.

```{r show_species}
msigdbr_species()
```

MSigDB contains 8 different gene set collections.

H: hallmark gene sets
C1: positional gene sets
C2: curated gene sets
C3: motif gene sets
C4: computational gene sets
C5: GO gene sets
C6: oncogenic signatures
C7: immunologic signatures

We'll use the Hallmark collection for GSEA.
Here's an excerpt of the [collection description](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H):

> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

Notably, there are only 50 gene sets included in this collection.
The fewer gene sets we test, the lower our multiple hypothesis testing burden.

We can retrieve only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function.

```{r immunologic_sets, live = TRUE}
hs_hallmarks_df <- msigdbr(species = "Homo sapiens",
category = "H")
```

## Gene Set Enrichment Analysis

_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_

![](diagrams/subramanian_fig1.jpg)

**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).**

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic.
This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis))

Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated:
Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway.
The enrichment score for a pathway is the running sum's maximum deviation from zero.
GSEA also assesses statistical significance of the scores for each pathway through permutation testing.
As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).

The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets.
When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

## DESeq2 results

```{r read_in_markers, live = TRUE}
deseq_df <- readr::read_tsv(input_file)
```

```{r deseq_head}
head(deseq_df)
```

This data frame uses Ensembl gene identifiers.
We'll need to make sure our gene sets use the same identifiers.
Let's take a look at the first few rows of the data frame that contains the hallmark genesets.

```{r hallmark_head, live = TRUE}
head(hs_hallmarks_df)
```

We can see that the gene sets from `msigdbr` have Ensembl gene identifiers associated with them, so we don't need to do any conversion.
However, we'll need to pass the correct column to the function that runs GSEA.

One good thing about Ensembl gene identifiers is that they are less likely to be duplicated than, for example, gene symbols.
(Multiple Ensembl gene identifiers can map to the same symbol.)

The GSEA approach requires on discriminating between genes that are in a gene set and those that are not.
Practically speaking, gene sets are just collections of gene identifiers!
When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.
So, let's check for duplicates in the data frame of DESeq2 results.

```{r check_duplicates, live = TRUE}
any(duplicated(deseq_df$ensembl_id))
```

There are no duplicates for us to worry about!

### Pre-ranked list

The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
This is step 1 -- gene-level statistics.

```{r lfc_vector}
lfc_vector <- deseq_df |>
# Extract a vector of `log2FoldChange` named by `ensembl_id`
dplyr::pull(log2FoldChange, name = ensembl_id)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
```

Let's look at the top ranked values.

```{r head_lfc, live = TRUE}
# Look at first entries of the log fold change vector
head(lfc_vector)
```

And the bottom of the list.

```{r tail_lfc, live = TRUE}
# Look at the last entries of the log fold change vector
tail(lfc_vector)
```

## Run GSEA

Now for the analysis!

We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`).

```{r run_gsea}
gsea_results <- GSEA(geneList = lfc_vector, # ordered ranked gene list
minGSSize = 25, # minimum gene set size
maxGSSize = 500, # maximum gene set set
pvalueCutoff = 0.05,
pAdjustMethod = "BH", # correction for multiple hypothesis testing
TERM2GENE = dplyr::select(hs_hallmarks_df,
gs_name,
ensembl_gene)) # pass the correct identifier column
```
Let's take a look at the GSEA results.

```{r view_gsea, live = TRUE, eval = FALSE}
View(gsea_results@result |>
dplyr::arrange(dplyr::desc(NES))
)
```

Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.

Pathways with significant, highly positive NES are enriched in ERMS myoblasts, whereas pathways with significant, highly negative NES are enriched in ARMS myoblasts.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before figuring out who to request for a full review, I'm going to ask @jashapiro to take a look at this interpretation as the instructor of the 03 notebook in the upcoming workshop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct. The scores are relative to ARMS, so:

  • Positive values ---> ERMS is higher than ARMS ---> Enriched for ERMS.
  • Negative values ---> ERMS is lower than ARMS ---> Enriched for ARMS.


Let's write these results to file.

```{r write_gsea}
gsea_results@result |> readr::write_tsv(output_file)
```

### Visualizing GSEA results

We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`.
Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated.

#### Highly Positive NES

Let's take look at a pathway with a highly positive NES (`HALLMARK_MYC_TARGETS_V2`) using a GSEA plot.

```{r highly_pos}
enrichplot::gseaplot(gsea_results,
geneSetID = "HALLMARK_MYC_TARGETS_V2",
title = "HALLMARK_MYC_TARGETS_V2",
color.line = "#0066FF")
```

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.

#### Highly Negative NES

The gene set `HALLMARK_MYOGENESIS` had a highly negative NES.

```{r highly_neg}
enrichplot::gseaplot(gsea_results,
geneSetID = "HALLMARK_MYOGENESIS",
title = "HALLMARK_MYOGENESIS",
color.line = "#0066FF")
```

This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph.

#### A non-significant result

The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name.
Let's look at `HALLMARK_P53_PATHWAY`, which was not in the results we viewed earlier.

```{r p53, live = TRUE}
enrichplot::gseaplot(gsea_results,
geneSetID = "HALLMARK_P53_PATHWAY",
title = "HALLMARK_P53_PATHWAY",
color.line = "#0066FF")
```

Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score.

*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.*

## Session Info

```{r session_info}
sessionInfo()
```
Loading
Loading