From ab14226cd49ca104dc650f6710e46df0a2c74bbe Mon Sep 17 00:00:00 2001 From: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com> Date: Wed, 7 Aug 2024 18:18:29 +0000 Subject: [PATCH 1/3] Add a magic files doc for RNA-seq --- RNA-seq/setup/magic-files.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) create mode 100644 RNA-seq/setup/magic-files.md diff --git a/RNA-seq/setup/magic-files.md b/RNA-seq/setup/magic-files.md new file mode 100644 index 00000000..29755782 --- /dev/null +++ b/RNA-seq/setup/magic-files.md @@ -0,0 +1,15 @@ +# Cooking show magic + +To allow participants who might get bogged down to quickly recover and continue with later notebook sections or exercises, we like to keep a set of output files on hand that they can quickly access as needed. +This document is for tracking where these files are created and stored. + +In general, the files for this module will be stored in `/shared/data/training-data//RNA-seq/`, and should be created before each training using the latest versions of the referenced notebooks. +The sub-path for each file should be the same as that for `training-modules/RNA-seq`. + +Files are listed below by the notebook that produces them: + +- `setup/ref_notebooks/04-nb_cell_line_tximeta.Rmd` (participants create their own notebook for this step) + - `data/NB-cell/txi/NB-cell_tximeta.rds` +- `05-nb_cell_line_DESeq2.Rmd` + - `results/NB-cell/NB-cell_DESeq_amplified_v_nonamplified.rds` + - `results/NB-cell/NB-cell_DESeq_amplified_v_nonamplified_results.tsv` From fc6af47346e809f54d6528a10cd59817350208e9 Mon Sep 17 00:00:00 2001 From: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com> Date: Wed, 7 Aug 2024 18:26:35 +0000 Subject: [PATCH 2/3] Write a section on dependence on NB DGE notebook --- pathway-analysis/setup/README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/pathway-analysis/setup/README.md b/pathway-analysis/setup/README.md index 8dfbcaca..3d9eef7c 100644 --- a/pathway-analysis/setup/README.md +++ b/pathway-analysis/setup/README.md @@ -6,6 +6,9 @@ This document provides background for preparing data for use in the guided noteb ⚠️ **The setup of this module relies on data or results prepared as part of other modules.** +For the GSEA notebook, we use differential expression results from the neuroblastoma cell line data generated in `RNA-seq/05-nb_cell_line_DESeq2.Rmd`. +The DGE notebook relies on the output of `tximeta`, which can be generated by running `RNA-seq/setup/ref_notebooks/04-nb_cell_line_tximeta.Rmd` after RNA-seq files have been linked appropriately. + For the GSVA notebook, we use a subset of the OpenPBTA stranded dataset (medulloblastoma samples) that we download and process in `machine-learning/setup`. It also depends on a step, performed in `machine-learning/03-openpbta_PLIER.Rmd`, where duplicate gene symbols are collapsed to the row that has the highest mean value. ### Data Preparation From cf87c4e111f891d0ee9d34a4ebbe6c57423ae980 Mon Sep 17 00:00:00 2001 From: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com> Date: Wed, 7 Aug 2024 18:27:26 +0000 Subject: [PATCH 3/3] Name chunk with head --- .../02-gene_set_enrichment_analysis.Rmd | 3 +- .../02-gene_set_enrichment_analysis.nb.html | 3332 +++++++++++++++-- 2 files changed, 3002 insertions(+), 333 deletions(-) diff --git a/pathway-analysis/02-gene_set_enrichment_analysis.Rmd b/pathway-analysis/02-gene_set_enrichment_analysis.Rmd index 02c08413..3c79f68b 100644 --- a/pathway-analysis/02-gene_set_enrichment_analysis.Rmd +++ b/pathway-analysis/02-gene_set_enrichment_analysis.Rmd @@ -132,8 +132,9 @@ hs_hallmark_df <- msigdbr(species = "Homo sapiens", # Read in the DGE results dge_results_df <- readr::read_tsv(dge_results_file) ``` +Let's take a peek at the contents of this file. -```{r} +```{r head_dge, live = TRUE} head(dge_results_df) ``` diff --git a/pathway-analysis/02-gene_set_enrichment_analysis.nb.html b/pathway-analysis/02-gene_set_enrichment_analysis.nb.html index a46c6641..ea09ec8e 100644 --- a/pathway-analysis/02-gene_set_enrichment_analysis.nb.html +++ b/pathway-analysis/02-gene_set_enrichment_analysis.nb.html @@ -15,39 +15,2657 @@ Pathway analysis: Gene Set Enrichment Analysis (GSEA) - - + + - - - - - - - - - - - - - + + + + + + + + + + + + + + +code{white-space: pre-wrap;} +span.smallcaps{font-variant: small-caps;} +span.underline{text-decoration: underline;} +div.column{display: inline-block; vertical-align: top; width: 50%;} +div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} +ul.task-list{list-style: none;} + - + -

Since this data frame of DGE results includes gene symbols, we do not need to perform any kind of gene identifier conversion. We do, however, need to check for duplicate gene symbols. We can accomplish this with duplicated(), which returns a logical vector (e.g., TRUE or FALSE). The function sum() will count TRUE values as 1s and FALSE as 0s, so using it with duplicated() will count the number of duplicate values.

+

Since this data frame of DGE results includes gene symbols, we do not +need to perform any kind of gene identifier conversion. We do, however, +need to check for duplicate gene symbols. We can accomplish this with +duplicated(), which returns a logical vector (e.g., +TRUE or FALSE). The function +sum() will count TRUE values as 1s and +FALSE as 0s, so using it with duplicated() +will count the number of duplicate values.

- +
sum(duplicated(dge_results_df$gene_symbol))
@@ -501,14 +3129,21 @@

Differential gene expression results

This will cause a problem when we go to run GSEA.

Removing duplicates

-

The GSEA approach requires on discriminating between genes that are in a gene set and those that are not. Practically speaking, gene sets are just collections of gene identifiers! When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.

-

Compared to the total number of genes that are in our results, there are not a lot of duplicates but we’ll still need to make a decision about how to handle them.

-

Let’s get a vector of the duplicated gene symbols so we can use it to explore our filtering steps.

+

The GSEA approach requires on discriminating between genes that are +in a gene set and those that are not. Practically speaking, gene sets +are just collections of gene identifiers! When the function we use for +GSEA pre-ranked gets a list with duplicated gene identifiers, it can +produce unexpected results.

+

Compared to the total number of genes that are in our results, there +are not a lot of duplicates but we’ll still need to make a decision +about how to handle them.

+

Let’s get a vector of the duplicated gene symbols so we can use it to +explore our filtering steps.

- -
duplicated_gene_symbols <- dge_results_df %>%
-  dplyr::filter(duplicated(gene_symbol)) %>%
+
+
duplicated_gene_symbols <- dge_results_df |>
+  dplyr::filter(duplicated(gene_symbol)) |>
   dplyr::pull(gene_symbol)
@@ -516,29 +3151,45 @@

Removing duplicates

Now we’ll look at the values for the the duplicated gene symbols.

- -
dge_results_df %>%
-  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) %>%
+
+
dge_results_df |>
+  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) |>
   dplyr::arrange(gene_symbol)
+
+

We can see that the associated values vary for each row.

-

Let’s keep the gene symbols associated with the higher absolute value of the log2 fold change.

-

Retaining the instance of the gene symbols with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. We should keep this decision in mind when interpreting our results. For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking.

-

In the next chunk, we are going to filter out the duplicated rows using the dplyr::distinct() function after sorting by absolute value of the log2 fold change. This will keep the first row with the duplicated value thus keeping the row with the largest absolute value.

+

Let’s keep the gene symbols associated with the higher absolute value +of the log2 fold change.

+

Retaining the instance of the gene symbols with the higher absolute +value of a gene-level statistic means that we will retain the value that +is likely to be more highly- or lowly-ranked or, put another way, the +values less likely to be towards the middle of the ranked gene list. We +should keep this decision in mind when interpreting our results. For +example, if all the duplicate identifiers happened to be in a particular +gene set, we may get an overly optimistic view of how perturbed that +gene set is because we preferentially selected instances of the +identifier that have a higher absolute value of the statistic used for +ranking.

+

In the next chunk, we are going to filter out the duplicated rows +using the dplyr::distinct() function after sorting by +absolute value of the log2 fold change. This will keep the first row +with the duplicated value thus keeping the row with the largest absolute +value.

- -
filtered_dge_df <- dge_results_df %>%
+
+
filtered_dge_df <- dge_results_df |>
   # Sort so that the highest absolute values of the log2 fold change are at the
   # top
-  dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>%
+  dplyr::arrange(dplyr::desc(abs(log2FoldChange))) |>
   # Filter out the duplicated rows using `dplyr::distinct()`
   dplyr::distinct(gene_symbol, .keep_all = TRUE)
@@ -547,29 +3198,33 @@

Removing duplicates

Let’s see what happened to our duplicate identifiers.

- -
# Subset to & arrange by gene symbols that were duplicated in the original 
+
+
# Subset to & arrange by gene symbols that were duplicated in the original
 # data frame of results
-filtered_dge_df %>%
-  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) %>%
+filtered_dge_df |>
+  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) |>
   dplyr::arrange(gene_symbol)
+
+

Now we’re ready to prep our pre-ranked list for GSEA.

Pre-ranked list

-

The GSEA() function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers. This is step 1 – gene-level statistics.

+

The GSEA() function takes a pre-ranked (sorted) named +vector of statistics, where the names in the vector are gene +identifiers. This is step 1 – gene-level statistics.

- -
lfc_vector <- filtered_dge_df %>%
+
+
lfc_vector <- filtered_dge_df |>
   # Extract a vector of `log2FoldChange` named by `gene_symbol`
   dplyr::pull(log2FoldChange, name = gene_symbol)
 lfc_vector <- sort(lfc_vector, decreasing = TRUE)
@@ -579,26 +3234,26 @@

Pre-ranked list

Let’s look at the top ranked values.

- +
# Look at first entries of the log2 fold change vector
 head(lfc_vector)
- +
  GAGE12D    GABBR1   FABP5P7  KIAA0355      MNX1    GAGE2A 
-23.620487 23.130046 22.682641 22.562941  6.813764  6.801245 
+23.620487 23.130027 22.682631 22.562940 6.813764 6.801245

And the bottom of the list.

- +
# Look at the last entries of the log2 fold change vector
 tail(lfc_vector)
- +
     MUC15      TFPI2      LENG1       MT1M       DAZ2   CSPG4P4Y 
- -8.017004  -8.768669 -22.754783 -25.318887 -25.444288 -26.095612 
+ -8.017004 -8.768669 -22.754783 -25.318887 -25.444428 -26.095612
@@ -607,10 +3262,12 @@

Pre-ranked list

Run GSEA

Now for the analysis!

-

We can use the GSEA() function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., gseKEGG()).

+

We can use the GSEA() function to perform GSEA with any +generic set of gene sets, but there are several functions for using +specific, commonly used gene sets (e.g., gseKEGG()).

- +
gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
                      minGSSize = 25,  # minimum gene set size
                      maxGSSize = 500,  # maximum gene set set
@@ -620,101 +3277,117 @@ 

Run GSEA

gs_name, gene_symbol))
- -
preparing geneSet collections...
- - -
GSEA analysis...
- - -
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.06% of the list).
-The order of those tied genes will be arbitrary, which may produce unexpected results.
- - -
leading edge analysis...
- - -
done...
- + +
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
+
+preparing geneSet collections...
+GSEA analysis...
+Warning: There are ties in the preranked stats (0.1% of the list).
+The order of those tied genes will be arbitrary, which may produce unexpected results.leading edge analysis...
+done...
+ -

The warning about ties means that there are multiple genes that have the same log2 fold change value. This percentage is small and unlikely to impact our results. A large number of ties might tell us there’s something wrong with our DGE results (Ballereau et al. 2018).

+

The warning about ties means that there are multiple genes that have +the same log2 fold change value. This percentage is small and unlikely +to impact our results. A large number of ties might tell us there’s +something wrong with our DGE results (Ballereau +et al. 2018).

Let’s take a look at the GSEA results.

- -
View(gsea_results@result %>%
+
+
View(gsea_results@result |>
        dplyr::arrange(dplyr::desc(NES))
 )
-

Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.

+

Normalized enrichment scores (NES) are enrichment scores that are +scaled to make gene sets that contain different number of genes +comparable.

Let’s write these results to file.

- -
gsea_results@result %>% readr::write_tsv(gsea_results_file)
+ +
gsea_results@result |> readr::write_tsv(gsea_results_file)

Visualizing GSEA results

-

We can visualize GSEA results for individual pathways or gene sets using enrichplot::gseaplot(). Let’s take a look at 3 different pathways – one with a highly positive NES, one with a highly negative NES, and one that was not a significant result – to get more insight into how ES are calculated.

+

We can visualize GSEA results for individual pathways or gene sets +using enrichplot::gseaplot(). Let’s take a look at 3 +different pathways – one with a highly positive NES, one with a highly +negative NES, and one that was not a significant result – to get more +insight into how ES are calculated.

Highly Positive NES

-

The gene set HALLMARK_MYC_TARGETS_V1 had high positive log2 fold changes. Recall a positive log2 fold change means a it had a higher expression value in MYCN amplified cell lines.

+

The gene set HALLMARK_MYC_TARGETS_V1 had high positive +log2 fold changes. Recall a positive log2 fold change means a it had a +higher expression value in MYCN amplified cell lines.

- +
enrichplot::gseaplot(gsea_results,
                      geneSetID = "HALLMARK_MYC_TARGETS_V1",
                      title = "HALLMARK_MYC_TARGETS_V1",
                      color.line = "#0066FF")
- -

+ +

-

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.

+

Notice how the genes that are in the gene set, indicated by the black +bars, tend to be on the left side of the graph indicating that they have +positive gene-level scores.

Highly Negative NES

-

The gene set HALLMARK_INTERFERON_ALPHA_RESPONSE had a highly negative NES.

+

The gene set HALLMARK_INTERFERON_ALPHA_RESPONSE had a +highly negative NES.

- +
enrichplot::gseaplot(gsea_results,
                      geneSetID = "HALLMARK_INTERFERON_ALPHA_RESPONSE",
                      title = "HALLMARK_INTERFERON_ALPHA_RESPONSE",
                      color.line = "#0066FF")
- -

+ +

-

This gene set shows the opposite pattern – genes in the pathway tend to be on the right side of the graph.

+

This gene set shows the opposite pattern – genes in the pathway tend +to be on the right side of the graph.

A non-significant result

-

The @results slot will only show gene sets that pass the pvalueCutoff threshold we supplied to GSEA(), but we can plot any gene set so long as we know its name. Let’s look at HALLMARK_P53_PATHWAY, which was not in the results we viewed earlier.

+

The @results slot will only show gene sets that pass the +pvalueCutoff threshold we supplied to GSEA(), +but we can plot any gene set so long as we know its name. Let’s look at +HALLMARK_P53_PATHWAY, which was not in the results we +viewed earlier.

- +
enrichplot::gseaplot(gsea_results,
                      geneSetID = "HALLMARK_P53_PATHWAY",
                      title = "HALLMARK_P53_PATHWAY",
                      color.line = "#0066FF")
- -

+ +

-

Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more “middling” score.

-

Note: The plots returned by enrichplot::gseaplot are ggplots, so we could use ggplot2::ggsave() to save them to file if we wanted to.

+

Genes in the pathway are distributed more evenly throughout the +ranked list, resulting in a more “middling” score.

+

Note: The plots returned by enrichplot::gseaplot are +ggplots, so we could use ggplot2::ggsave() to save them to +file if we wanted to.

@@ -722,69 +3395,61 @@

A non-significant result

Session Info

- +
sessionInfo()
- -
R version 4.0.3 (2020-10-10)
-Platform: x86_64-pc-linux-gnu (64-bit)
-Running under: Ubuntu 20.04 LTS
+
+
R version 4.4.0 (2024-04-24)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.4 LTS
 
 Matrix products: default
-BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
+BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
 
 locale:
- [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
- [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
- [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
- [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
- [9] LC_ADDRESS=C               LC_TELEPHONE=C            
-[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
+ [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8   
+ [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C        
+[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
 
 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base     
 
 other attached packages:
-[1] msigdbr_7.2.1          clusterProfiler_3.18.1 optparse_1.6.6        
+[1] msigdbr_7.5.1          clusterProfiler_4.12.0
 
 loaded via a namespace (and not attached):
- [1] enrichplot_1.10.2    bit64_4.0.5          RColorBrewer_1.1-2  
- [4] tools_4.0.3          R6_2.5.0             DBI_1.1.1           
- [7] BiocGenerics_0.36.0  colorspace_2.0-0     tidyselect_1.1.0    
-[10] gridExtra_2.3        bit_4.0.4            compiler_4.0.3      
-[13] cli_2.2.0            Biobase_2.50.0       scatterpie_0.1.5    
-[16] labeling_0.4.2       shadowtext_0.0.7     scales_1.1.1        
-[19] readr_1.4.0          stringr_1.4.0        digest_0.6.27       
-[22] rmarkdown_2.6        DOSE_3.16.0          pkgconfig_2.0.3     
-[25] htmltools_0.5.1.1    fastmap_1.1.0        rlang_0.4.10        
-[28] rstudioapi_0.13      RSQLite_2.2.3        farver_2.0.3        
-[31] generics_0.1.0       jsonlite_1.7.2       BiocParallel_1.24.1 
-[34] GOSemSim_2.16.1      dplyr_1.0.3          magrittr_2.0.1      
-[37] GO.db_3.12.1         Matrix_1.3-2         fansi_0.4.2         
-[40] Rcpp_1.0.6           munsell_0.5.0        S4Vectors_0.28.1    
-[43] viridis_0.5.1        lifecycle_0.2.0      stringi_1.5.3       
-[46] yaml_2.2.1           ggraph_2.0.4         MASS_7.3-53         
-[49] plyr_1.8.6           qvalue_2.22.0        grid_4.0.3          
-[52] blob_1.2.1           parallel_4.0.3       ggrepel_0.9.1       
-[55] DO.db_2.9            crayon_1.3.4         lattice_0.20-41     
-[58] graphlayouts_0.7.1   cowplot_1.1.1        splines_4.0.3       
-[61] hms_1.0.0            ps_1.5.0             knitr_1.30          
-[64] pillar_1.4.7         fgsea_1.16.0         igraph_1.2.6        
-[67] reshape2_1.4.4       stats4_4.0.3         fastmatch_1.1-0     
-[70] glue_1.4.2           evaluate_0.14        downloader_0.4      
-[73] data.table_1.13.6    BiocManager_1.30.10  vctrs_0.3.6         
-[76] tweenr_1.0.1         gtable_0.3.0         getopt_1.20.3       
-[79] purrr_0.3.4          polyclip_1.10-0      tidyr_1.1.2         
-[82] assertthat_0.2.1     cachem_1.0.1         ggplot2_3.3.3       
-[85] xfun_0.20            ggforce_0.3.2        tidygraph_1.2.0     
-[88] viridisLite_0.3.0    tibble_3.0.5         rvcheck_0.1.8       
-[91] AnnotationDbi_1.52.0 memoise_1.1.0        IRanges_2.24.1      
-[94] ellipsis_0.3.1      
+ [1] DBI_1.2.2 shadowtext_0.1.3 gson_0.1.0 gridExtra_2.3 rlang_1.1.3 + [6] magrittr_2.0.3 DOSE_3.30.0 compiler_4.4.0 RSQLite_2.3.6 png_0.1-8 + [11] vctrs_0.6.5 reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3 crayon_1.5.2 + [16] fastmap_1.1.1 XVector_0.44.0 labeling_0.4.3 ggraph_2.2.1 utf8_1.2.4 + [21] HDO.db_0.99.1 tzdb_0.4.0 enrichplot_1.24.0 UCSC.utils_1.0.0 purrr_1.0.2 + [26] bit_4.0.5 xfun_0.43 zlibbioc_1.50.0 cachem_1.0.8 aplot_0.2.2 + [31] GenomeInfoDb_1.40.0 jsonlite_1.8.8 blob_1.2.4 BiocParallel_1.38.0 tweenr_2.0.3 + [36] parallel_4.4.0 R6_2.5.1 stringi_1.8.3 RColorBrewer_1.1-3 GOSemSim_2.30.0 + [41] knitr_1.46 Rcpp_1.0.12 readr_2.1.5 IRanges_2.38.0 Matrix_1.7-0 + [46] splines_4.4.0 igraph_2.0.3 tidyselect_1.2.1 qvalue_2.36.0 rstudioapi_0.16.0 + [51] viridis_0.6.5 codetools_0.2-20 lattice_0.22-6 tibble_3.2.1 plyr_1.8.9 + [56] Biobase_2.64.0 treeio_1.28.0 withr_3.0.0 KEGGREST_1.44.0 gridGraphics_0.5-1 + [61] scatterpie_0.2.2 polyclip_1.10-6 Biostrings_2.72.0 pillar_1.9.0 ggtree_3.12.0 + [66] stats4_4.4.0 ggfun_0.1.4 generics_0.1.3 vroom_1.6.5 hms_1.1.3 + [71] S4Vectors_0.42.0 ggplot2_3.5.1 munsell_0.5.1 scales_1.3.0 tidytree_0.4.6 + [76] glue_1.7.0 lazyeval_0.2.2 tools_4.4.0 data.table_1.15.4 fgsea_1.30.0 + [81] babelgene_22.9 fs_1.6.4 graphlayouts_1.1.1 fastmatch_1.1-4 tidygraph_1.3.1 + [86] cowplot_1.1.3 grid_4.4.0 tidyr_1.3.1 ape_5.8 AnnotationDbi_1.66.0 + [91] colorspace_2.1-0 nlme_3.1-164 GenomeInfoDbData_1.2.12 patchwork_1.2.0 ggforce_0.4.2 + [96] cli_3.6.2 fansi_1.0.6 viridisLite_0.4.2 dplyr_1.1.4 gtable_0.3.5 +[101] yulab.utils_0.1.4 digest_0.6.35 BiocGenerics_0.50.0 ggrepel_0.9.5 ggplotify_0.1.2 +[106] farver_2.1.1 memoise_2.0.1 lifecycle_1.0.4 httr_1.4.7 GO.db_3.19.1 +[111] bit64_4.0.5 MASS_7.3-60.2
-
---
title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)"
output: 
  html_notebook:
    toc: true
    toc_float: true
author: CCDL for ALSF
date: 2020
---

## Objectives

This notebook will demonstrate how to:

- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA), including how to remove duplicate gene identifiers
- Perform GSEA with the `clusterProfiler` package
- Visualize GSEA results with the `enrichplot` package

---

In this notebook, we will perform Gene Set Enrichment Analysis (GSEA) on the neuroblastoma cell line differential gene expression (DGE) results we generated during the RNA-seq module.

To refresh our memory, we analyzed data from [Harenza *et al.* (2017)](https://doi.org/10.1038/sdata.2017.33) and specifically tested for DGE between with _MYCN_ amplified cell lines and non-amplified cell lines using `DESeq2`.
We have a table of results that contains our log2 fold changes and adjusted p-values.

GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).
The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.
FCS methods are better suited for identifying these pathways that show coordinated changes than ORA.
In ORA, we pick a cutoff that _typically_ only captures genes with large individual changes.

There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)):
 
1. Calculate a gene-level statistic (we'll use log2 fold change from DESeq2 here)
2. Gene-level statistics are aggregated into a pathway-level statistic
3. Assess the statistical significance of the pathway-level statistic

#### Other resources

* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html)
* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood.
You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012).
* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted.

## Set up

### Libraries

```{r libraries}
# Package to run GSEA
library(clusterProfiler)
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
```

### Directories and Files

#### Directories

```{r}
# Where the DGE results are stored
input_dir <- file.path("..", "RNA-seq", "results", "NB-cell")

# We will create a directory to specifically hold our GSEA results if it does 
# not yet exist
output_dir <- file.path("results", "NB-cell")
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}
```

#### Input files

```{r input_files}
# DGE results
dge_results_file <- file.path(input_dir,
                             "NB-cell_DESeq_amplified_v_nonamplified_results.tsv")
```

#### Output files

```{r output_files}
# GSEA pathway-level scores and statistics
gsea_results_file <- file.path(output_dir,
                               "NB-cell_gsea_results.tsv")
```

## Gene Set Enrichment Analysis

_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_

![](diagrams/subramanian_fig1.jpg)

**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).**

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic.
This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)) 

Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated:
Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway.
The enrichment score for a pathway is the running sum's maximum deviation from zero.
GSEA also assesses statistical significance of the scores for each pathway through permutation testing. 
As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).

The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets. 
When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

### Gene sets

In the previous notebook, we used KEGG pathways for over-representation analysis.
We identified pathways that were significantly over-represented and found that the significant pathways shared genes.
FCS methods analyze each pathway independently and essentially ignore this overlap between gene sets.

MSigDB includes a collection called Hallmark gene sets ([Liberzon *et al.* 2015](https://dx.doi.org/10.1016%2Fj.cels.2015.12.004))
Here's an excerpt of the [collection description](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H):

> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

We'll use the Hallmark collection for GSEA.
Notably, there are only 50 gene sets included in this collection.
The fewer gene sets we test, the lower our multiple hypothesis testing burden.

We can retrieve only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function.

```{r hallmark_sets, live = TRUE}
hs_hallmark_df <- msigdbr(species = "Homo sapiens",
                          category = "H")
```

## Differential gene expression results

```{r read_in_dge, live = TRUE}
# Read in the DGE results
dge_results_df <- readr::read_tsv(dge_results_file)
```

```{r}
head(dge_results_df)
```

Since this data frame of DGE results includes gene symbols, we do not need to perform any kind of gene identifier conversion.
We do, however, need to check for duplicate gene symbols. 
We can accomplish this with `duplicated()`, which returns a logical vector (e.g., `TRUE` or `FALSE`).
The function `sum()` will count `TRUE` values as 1s and `FALSE` as 0s, so using it with `duplicated()` will count the number of duplicate values.

```{r any_duplicated, live = TRUE}
sum(duplicated(dge_results_df$gene_symbol))
```

This will cause a problem when we go to run GSEA.

### Removing duplicates

The GSEA approach requires on discriminating between genes that are in a gene set and those that are not. 
Practically speaking, gene sets are just collections of gene identifiers!
When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.

Compared to the total number of genes that are in our results, there are not a lot of duplicates but we'll still need to make a decision about how to handle them.

Let's get a vector of the duplicated gene symbols so we can use it to explore our filtering steps.

```{r gene_dups, live = TRUE}
duplicated_gene_symbols <- dge_results_df %>%
  dplyr::filter(duplicated(gene_symbol)) %>%
  dplyr::pull(gene_symbol)
```

Now we'll look at the values for the the duplicated gene symbols.

```{r show_gene_dups}
dge_results_df %>%
  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) %>%
  dplyr::arrange(gene_symbol)
```

We can see that the associated values vary for each row.

Let's keep the gene symbols associated with the higher absolute value of the log2 fold change.

Retaining the instance of the gene symbols with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list.
We should keep this decision in mind when interpreting our results.
For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking.

In the next chunk, we are going to filter out the duplicated rows using the `dplyr::distinct()` function after sorting by absolute value of the log2 fold change.
This will keep the first row with the duplicated value thus keeping the row with the largest absolute value.

```{r filter_dge}
filtered_dge_df <- dge_results_df %>%
  # Sort so that the highest absolute values of the log2 fold change are at the
  # top
  dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>%
  # Filter out the duplicated rows using `dplyr::distinct()`
  dplyr::distinct(gene_symbol, .keep_all = TRUE)
```

Let's see what happened to our duplicate identifiers.

```{r show_filtered_dge, live = TRUE}
# Subset to & arrange by gene symbols that were duplicated in the original 
# data frame of results
filtered_dge_df %>%
  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) %>%
  dplyr::arrange(gene_symbol)
```

Now we're ready to prep our pre-ranked list for GSEA.

### Pre-ranked list

The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
This is step 1 -- gene-level statistics.

```{r lfc_vector}
lfc_vector <- filtered_dge_df %>%
  # Extract a vector of `log2FoldChange` named by `gene_symbol`
  dplyr::pull(log2FoldChange, name = gene_symbol)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
```

Let's look at the top ranked values.

```{r head_lfc, live = TRUE}
# Look at first entries of the log2 fold change vector
head(lfc_vector)
```

And the bottom of the list.

```{r tail_lfc, live = TRUE}
# Look at the last entries of the log2 fold change vector
tail(lfc_vector)
```

## Run GSEA

Now for the analysis!

We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`).

```{r run_gsea}
gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
                     minGSSize = 25,  # minimum gene set size
                     maxGSSize = 500,  # maximum gene set set
                     pvalueCutoff = 0.05,
                     pAdjustMethod = "BH",  # correction for multiple hypothesis testing
                     TERM2GENE = dplyr::select(hs_hallmark_df,
                                               gs_name,
                                               gene_symbol))
```

The warning about ties means that there are multiple genes that have the same log2 fold change value. 
This percentage is small and unlikely to impact our results.
A large number of ties might tell us there's something wrong with our DGE results ([Ballereau _et al._ 2018](https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/RNASeq2018/html/06_Gene_set_testing.nb.html)).

Let's take a look at the GSEA results.

```{r view_gsea, live = TRUE, eval = FALSE}
View(gsea_results@result %>%
       dplyr::arrange(dplyr::desc(NES))
)
```

Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.

Let's write these results to file.

```{r write_gsea}
gsea_results@result %>% readr::write_tsv(gsea_results_file)
```

### Visualizing GSEA results

We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`.
Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated.

#### Highly Positive NES

The gene set `HALLMARK_MYC_TARGETS_V1` had high positive log2 fold changes.
Recall a positive log2 fold change means a it had a higher expression value in _MYCN_ amplified cell lines.

```{r myc_v1}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYC_TARGETS_V1",
                     title = "HALLMARK_MYC_TARGETS_V1",
                     color.line = "#0066FF")
```

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.

#### Highly Negative NES

The gene set `HALLMARK_INTERFERON_ALPHA_RESPONSE` had a highly negative NES.

```{r inflammatory}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_INTERFERON_ALPHA_RESPONSE",
                     title = "HALLMARK_INTERFERON_ALPHA_RESPONSE",
                     color.line = "#0066FF")
```

This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph.

#### A non-significant result

The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name.
Let's look at `HALLMARK_P53_PATHWAY`, which was not in the results we viewed earlier.

```{r p53, live = TRUE}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_P53_PATHWAY",
                     title = "HALLMARK_P53_PATHWAY",
                     color.line = "#0066FF")
```

Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score.

*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.*

## Session Info

```{r session_info}
sessionInfo()
```

+
---
title: "Pathway analysis: Gene Set Enrichment Analysis (GSEA)"
output:
  html_notebook:
    toc: true
    toc_float: true
author: CCDL for ALSF
date: 2020
---

## Objectives

This notebook will demonstrate how to:

- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA), including how to remove duplicate gene identifiers
- Perform GSEA with the `clusterProfiler` package
- Visualize GSEA results with the `enrichplot` package

---

In this notebook, we will perform Gene Set Enrichment Analysis (GSEA) on the neuroblastoma cell line differential gene expression (DGE) results we generated during the RNA-seq module.

To refresh our memory, we analyzed data from [Harenza *et al.* (2017)](https://doi.org/10.1038/sdata.2017.33) and specifically tested for DGE between with _MYCN_ amplified cell lines and non-amplified cell lines using `DESeq2`.
We have a table of results that contains our log2 fold changes and adjusted p-values.

GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).
The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.
FCS methods are better suited for identifying these pathways that show coordinated changes than ORA.
In ORA, we pick a cutoff that _typically_ only captures genes with large individual changes.

There are 3 general steps in FCS methods ([Khatri _et al._ 2012]( https://doi.org/10.1371/journal.pcbi.1002375)):

1. Calculate a gene-level statistic (we'll use log2 fold change from DESeq2 here)
2. Gene-level statistics are aggregated into a pathway-level statistic
3. Assess the statistical significance of the pathway-level statistic

#### Other resources

* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html)
* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood.
You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012).
* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted.

## Set up

### Libraries

```{r libraries}
# Package to run GSEA
library(clusterProfiler)
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
```

### Directories and Files

#### Directories

```{r}
# Where the DGE results are stored
input_dir <- file.path("..", "RNA-seq", "results", "NB-cell")

# We will create a directory to specifically hold our GSEA results if it does
# not yet exist
output_dir <- file.path("results", "NB-cell")
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}
```

#### Input files

```{r input_files}
# DGE results
dge_results_file <- file.path(input_dir,
                             "NB-cell_DESeq_amplified_v_nonamplified_results.tsv")
```

#### Output files

```{r output_files}
# GSEA pathway-level scores and statistics
gsea_results_file <- file.path(output_dir,
                               "NB-cell_gsea_results.tsv")
```

## Gene Set Enrichment Analysis

_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_

![](diagrams/subramanian_fig1.jpg)

**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).**

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic.
This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis))

Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated:
Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway.
The enrichment score for a pathway is the running sum's maximum deviation from zero.
GSEA also assesses statistical significance of the scores for each pathway through permutation testing.
As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).

The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets.
When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

### Gene sets

In the previous notebook, we used KEGG pathways for over-representation analysis.
We identified pathways that were significantly over-represented and found that the significant pathways shared genes.
FCS methods analyze each pathway independently and essentially ignore this overlap between gene sets.

MSigDB includes a collection called Hallmark gene sets ([Liberzon *et al.* 2015](https://dx.doi.org/10.1016%2Fj.cels.2015.12.004))
Here's an excerpt of the [collection description](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H):

> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

We'll use the Hallmark collection for GSEA.
Notably, there are only 50 gene sets included in this collection.
The fewer gene sets we test, the lower our multiple hypothesis testing burden.

We can retrieve only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function.

```{r hallmark_sets, live = TRUE}
hs_hallmark_df <- msigdbr(species = "Homo sapiens",
                          category = "H")
```

## Differential gene expression results

```{r read_in_dge, live = TRUE}
# Read in the DGE results
dge_results_df <- readr::read_tsv(dge_results_file)
```
Let's take a peek at the contents of this file.

```{r head_dge, live = TRUE}
head(dge_results_df)
```

Since this data frame of DGE results includes gene symbols, we do not need to perform any kind of gene identifier conversion.
We do, however, need to check for duplicate gene symbols.
We can accomplish this with `duplicated()`, which returns a logical vector (e.g., `TRUE` or `FALSE`).
The function `sum()` will count `TRUE` values as 1s and `FALSE` as 0s, so using it with `duplicated()` will count the number of duplicate values.

```{r any_duplicated, live = TRUE}
sum(duplicated(dge_results_df$gene_symbol))
```

This will cause a problem when we go to run GSEA.

### Removing duplicates

The GSEA approach requires on discriminating between genes that are in a gene set and those that are not.
Practically speaking, gene sets are just collections of gene identifiers!
When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.

Compared to the total number of genes that are in our results, there are not a lot of duplicates but we'll still need to make a decision about how to handle them.

Let's get a vector of the duplicated gene symbols so we can use it to explore our filtering steps.

```{r gene_dups, live = TRUE}
duplicated_gene_symbols <- dge_results_df |>
  dplyr::filter(duplicated(gene_symbol)) |>
  dplyr::pull(gene_symbol)
```

Now we'll look at the values for the the duplicated gene symbols.

```{r show_gene_dups}
dge_results_df |>
  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) |>
  dplyr::arrange(gene_symbol)
```

We can see that the associated values vary for each row.

Let's keep the gene symbols associated with the higher absolute value of the log2 fold change.

Retaining the instance of the gene symbols with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list.
We should keep this decision in mind when interpreting our results.
For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking.

In the next chunk, we are going to filter out the duplicated rows using the `dplyr::distinct()` function after sorting by absolute value of the log2 fold change.
This will keep the first row with the duplicated value thus keeping the row with the largest absolute value.

```{r filter_dge}
filtered_dge_df <- dge_results_df |>
  # Sort so that the highest absolute values of the log2 fold change are at the
  # top
  dplyr::arrange(dplyr::desc(abs(log2FoldChange))) |>
  # Filter out the duplicated rows using `dplyr::distinct()`
  dplyr::distinct(gene_symbol, .keep_all = TRUE)
```

Let's see what happened to our duplicate identifiers.

```{r show_filtered_dge, live = TRUE}
# Subset to & arrange by gene symbols that were duplicated in the original
# data frame of results
filtered_dge_df |>
  dplyr::filter(gene_symbol %in% duplicated_gene_symbols) |>
  dplyr::arrange(gene_symbol)
```

Now we're ready to prep our pre-ranked list for GSEA.

### Pre-ranked list

The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
This is step 1 -- gene-level statistics.

```{r lfc_vector}
lfc_vector <- filtered_dge_df |>
  # Extract a vector of `log2FoldChange` named by `gene_symbol`
  dplyr::pull(log2FoldChange, name = gene_symbol)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
```

Let's look at the top ranked values.

```{r head_lfc, live = TRUE}
# Look at first entries of the log2 fold change vector
head(lfc_vector)
```

And the bottom of the list.

```{r tail_lfc, live = TRUE}
# Look at the last entries of the log2 fold change vector
tail(lfc_vector)
```

## Run GSEA

Now for the analysis!

We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`).

```{r run_gsea}
gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
                     minGSSize = 25,  # minimum gene set size
                     maxGSSize = 500,  # maximum gene set set
                     pvalueCutoff = 0.05,
                     pAdjustMethod = "BH",  # correction for multiple hypothesis testing
                     TERM2GENE = dplyr::select(hs_hallmark_df,
                                               gs_name,
                                               gene_symbol))
```

The warning about ties means that there are multiple genes that have the same log2 fold change value.
This percentage is small and unlikely to impact our results.
A large number of ties might tell us there's something wrong with our DGE results ([Ballereau _et al._ 2018](https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/RNASeq2018/html/06_Gene_set_testing.nb.html)).

Let's take a look at the GSEA results.

```{r view_gsea, live = TRUE, eval = FALSE}
View(gsea_results@result |>
       dplyr::arrange(dplyr::desc(NES))
)
```

Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.

Let's write these results to file.

```{r write_gsea}
gsea_results@result |> readr::write_tsv(gsea_results_file)
```

### Visualizing GSEA results

We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`.
Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated.

#### Highly Positive NES

The gene set `HALLMARK_MYC_TARGETS_V1` had high positive log2 fold changes.
Recall a positive log2 fold change means a it had a higher expression value in _MYCN_ amplified cell lines.

```{r myc_v1}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_MYC_TARGETS_V1",
                     title = "HALLMARK_MYC_TARGETS_V1",
                     color.line = "#0066FF")
```

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.

#### Highly Negative NES

The gene set `HALLMARK_INTERFERON_ALPHA_RESPONSE` had a highly negative NES.

```{r inflammatory}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_INTERFERON_ALPHA_RESPONSE",
                     title = "HALLMARK_INTERFERON_ALPHA_RESPONSE",
                     color.line = "#0066FF")
```

This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph.

#### A non-significant result

The `@results` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name.
Let's look at `HALLMARK_P53_PATHWAY`, which was not in the results we viewed earlier.

```{r p53, live = TRUE}
enrichplot::gseaplot(gsea_results,
                     geneSetID = "HALLMARK_P53_PATHWAY",
                     title = "HALLMARK_P53_PATHWAY",
                     color.line = "#0066FF")
```

Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score.

*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.*

## Session Info

```{r session_info}
sessionInfo()
```

@@ -825,7 +3490,7 @@

Session Info

$(document).ready(function () { $('.tabset-dropdown > .nav-tabs > li').click(function () { - $(this).parent().toggleClass('nav-tabs-open') + $(this).parent().toggleClass('nav-tabs-open'); }); }); @@ -841,6 +3506,9 @@

Session Info