Add AUCell notebook for scAdvanced pathway analysis #809

jaclyn-taroni · 2024-11-25T18:06:28Z

Stacked on #808
Closes #806

I am adding a notebook on AUCell as the second pathway analysis notebook in the scRNA-seq-advanced module. It uses SCPCS000490 and the two Ewing-related MSigDB gene sets that yield the cleanest results.

I want to demonstrate the intuition behind the AUC values and the difference between high and low AUC values. That is my rationale for running the individual steps and making recovery curves instead of using `AUCell::AUCell_run()`.
I am using ggplot2 instead of the built-in plotting functions. Please take a look at the threshold data frame calculation; I am rusty.
I also want to make sure I show getting the AUC values back into the SCE and using them with `scater::plotUMAP()`.

allyhawkins

I think the order of things and general content of plots look good! My biggest comment is that we need more explanatory text. In particular, I think it would be helpful to talk about why you might use AUCell at the beginning and then, before calculating any ranks, describe the AUCell process so that we have some context about why the ranks are important and how they are used in the overall AUC calculation. I would also mention earlier on that a higher AUC value points to higher expression of the gene sets.

I wasn't sure if you are planning to expand on text here or in a later PR though?

allyhawkins · 2024-11-25T19:11:38Z

scRNA-seq-advanced/05-aucell.Rmd

+In this notebook, we'll demonstrate how to use the AUCell method, introduced in [Aibar _et al_. 2017.](https://doi.org/10.1038/nmeth.4463), to calculate numeric values that estimate the activity of a gene set in an individual cell.
+
+AUCell calculates the area under the recovery curve (AUC), which "represents the proportion of expressed genes in the signature and their relative expression value compared to the other genes within the cell" ([Aibar _et al_. 2017.](https://doi.org/10.1038/nmeth.4463)).
+
+We will use an snRNA-seq of a Ewing sarcoma sample from the [`SCPCP000015` project](https://scpca.alexslemonade.org/projects/SCPCP000015) on the Single-cell Pediatric Cancer Atlas Portal and two relevant gene sets from the Molecular Signatures Database (MSigDB) to demonstrate this method.


Not sure if you are planning to expand on text in this PR or later, but I might include some context here on when you would use this package and why it might be helpful.

allyhawkins · 2024-11-25T19:12:30Z

scRNA-seq-advanced/05-aucell.Rmd

+# We will be loading a SingleCellExperiment object into our environment but 
+# don't need to see the startup messages
+suppressPackageStartupMessages({
+  library(SingleCellExperiment)
+})


Minor, but I noticed that the two lines with comments are broken up by a message in the rendered HTML which seems weird. I might make that comment on one line to avoid that?

allyhawkins · 2024-11-25T19:22:33Z

scRNA-seq-advanced/05-aucell.Rmd

+```{r sparse_matrix}
+# Extract counts matrix and save in sparse format
+counts_matrix <- counts(sce) |>
+  as("dgCMatrix")


Do you need to do this? When I ran it using `AUCell_run()`, I provided the counts matrix directly, but I'm not sure if you need to convert it because of how you use it later.

jaclyn-taroni · 2024-11-26T14:00:57Z

No, it does not appear to be necessary. However, I view this as the same kind of lesson as "never uncompress a gzipped file if your tool can take the compressed version directly," which is to say it might come in handy someday. I will add text about how it can help us save memory if that is a concern and how, if that were true, we'd probably remove `sce` from the environment, too.

allyhawkins · 2024-11-25T19:23:28Z

scRNA-seq-advanced/05-aucell.Rmd

+
+### Cell ranking
+
+The first step in AUCell is to rank genes for each cell from highest to lowest values.


Suggested change

-The first step in AUCell is to rank genes for each cell from highest to lowest values.
+The first step in AUCell is to rank the number of genes for each cell from highest to lowest values.

I would clarify that we are ranking the total number of genes detected.

jaclyn-taroni · 2024-11-26T14:04:04Z

> we are ranking the total number of genes detected.

I don't think we are, though? We are ranking genes from highest to lowest value in an individual cell and producing a plot that describes the number of genes detected in each cell to help us pick an AUC max rank value. (I will write this.)
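
For illustration only, here is a minimal base R sketch of the distinction discussed above: genes are ranked within each cell, and the number of genes detected per cell is a separate summary. The toy counts and object names are made up, and `rank()` stands in for AUCell's own ranking function, so this is not the notebook's code.

```r
# Ties (including undetected genes) are ordered at random, so set a seed
set.seed(2024)

# Toy counts: 6 genes (rows) x 2 cells (columns); values are made up
toy_counts <- matrix(
  c(5, 0, 2, 2, 0, 9,
    0, 3, 3, 1, 0, 0),
  nrow = 6,
  dimnames = list(paste0("gene", 1:6), c("cell_a", "cell_b"))
)

# Rank genes within each cell: rank 1 is the highest expressed gene,
# and ties are broken at random
toy_ranks <- apply(
  toy_counts, 2,
  function(x) rank(-x, ties.method = "random")
)

# Separate summary: how many genes were detected in each cell
genes_detected <- colSums(toy_counts > 0)

toy_ranks
genes_detected
```

The ranks of the tied genes (for example, `gene3` and `gene4` in `cell_a`) can change between runs unless the seed is fixed, which is why the seed matters for reproducibility.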

allyhawkins · 2024-11-25T19:24:30Z

scRNA-seq-advanced/05-aucell.Rmd

+AUCell relies on gene rankings, but we expect a proportion of genes are not detected.
+Genes can also have the same expression level (i.e., ties).
+These undetected genes and ties will be randomly ordered.
+To make our results reproducible, we will set a seed.


Same comment about expanding on the text a little bit. I would have a little intro here that talks about the method itself and then at the end state that because ties will be randomly ordered we first need to set a seed.

allyhawkins · 2024-11-25T19:40:37Z

scRNA-seq-advanced/05-aucell.Rmd

+We want to make sure most cells have at least the number of genes we will use to calculate the AUC.
+By default, the max rank is the top 5% highest expressed genes.
+We can calculate the default max rank by taking into account the number of genes.


I would talk a little more here about how we are going to use max rank to get the AUC values; otherwise, I think there's going to be confusion about why we are doing this.
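
As a rough sketch of the max rank check described in the quoted text, assuming the default is the top 5% of genes: the gene total and the per-cell detection counts below are hypothetical numbers, not values from the notebook.

```r
# Hypothetical total number of genes in the ranking
n_genes <- 20000

# Default max rank: the top 5% of genes
default_max_rank <- ceiling(0.05 * n_genes)

# Hypothetical number of genes detected in five cells
genes_detected <- c(800, 1500, 2300, 950, 3100)

# Proportion of cells that detect at least max rank genes;
# if this is low, many cells would have randomly ordered (undetected)
# genes inside the window used for the AUC
prop_cells_ok <- mean(genes_detected >= default_max_rank)

default_max_rank
prop_cells_ok
```

If too few cells clear the threshold, lowering the max rank is one option to consider.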

allyhawkins · 2024-11-25T19:47:11Z

scRNA-seq-advanced/05-aucell.Rmd

+ewing_gene_set_names <- c("ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION",
+                          "RIGGI_EWING_SARCOMA_PROGENITOR_UP")


I might make this a named list so that we can avoid all the `ewing_gene_set_names[1]` and use `ewing_gene_set_names["zhang"]`. Or I think you could rename the msigdbr results to have shorter names so you can easily use them in later code.
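
A minimal sketch of the naming suggestion, using a named character vector; the short names `zhang` and `riggi` are illustrative choices, not necessarily what the notebook ended up using.

```r
# Named character vector: short names point to the full MSigDB names
ewing_gene_set_names <- c(
  zhang = "ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION",
  riggi = "RIGGI_EWING_SARCOMA_PROGENITOR_UP"
)

# Index by name instead of by position
ewing_gene_set_names["zhang"]

# Double brackets drop the name and return just the string
ewing_gene_set_names[["riggi"]]
```

Indexing by name keeps later code readable and does not break if the order of the gene sets changes.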

allyhawkins · 2024-11-25T19:49:30Z

scRNA-seq-advanced/05-aucell.Rmd

+      # Use shorter names
+      dplyr::rename(zhang_auc = ewing_gene_set_names[1],
+                    riggi_auc = ewing_gene_set_names[2]),
+    by = "barcodes"


If you rename from the beginning you could avoid the renaming here and in the other data frames you make.

jaclyn-taroni · 2024-11-26T14:06:02Z

That is what I did initially, but I then chose to leave the original (very untypeable) names in the facets of the density plot.

jaclyn-taroni

Not even close to done addressing comments but returning replies

jaclyn-taroni · 2024-11-26T15:28:14Z

Thank you for your feedback, @allyhawkins! I've added a lot of text, and now I'm concerned it is too repetitive – let me know what you think!

It is unfortunate that we can't show the recovery curve until we have a max rank value. However, slides (forthcoming) will also go over the recovery curve, so the visualizations in the notebook will be the second time we cover/visualize the concept.
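
To make the dependence on max rank concrete, here is a simplified base R illustration of a recovery curve and the area under it. The gene names, gene sets, and the helper `recovery_auc()` are all hypothetical, and the normalization is a simplification for intuition rather than AUCell's exact computation.

```r
# Hypothetical ranking for one cell: genes ordered from highest to lowest expression
ranked_genes <- paste0("gene", 1:20)

# Two hypothetical gene sets: one near the top of the ranking, one further down
gene_set_high <- c("gene1", "gene3", "gene4")
gene_set_low <- c("gene9", "gene15", "gene19")

# Only the top max_rank genes contribute to the curve
max_rank <- 10

recovery_auc <- function(ranked_genes, gene_set, max_rank) {
  # Recovery curve: cumulative number of gene set genes found at each rank
  in_set <- ranked_genes[seq_len(max_rank)] %in% gene_set
  recovered <- cumsum(in_set)
  # Best possible curve: all gene set genes sit at the very top of the ranking
  best <- cumsum(seq_len(max_rank) <= min(length(gene_set), max_rank))
  # Area under the step curve, scaled by the best possible area
  sum(recovered) / sum(best)
}

auc_high <- recovery_auc(ranked_genes, gene_set_high, max_rank)
auc_low <- recovery_auc(ranked_genes, gene_set_low, max_rank)
```

Genes ranked below `max_rank` never contribute to the curve, which is why the curve cannot be drawn before a max rank is chosen, and why gene sets whose genes sit near the top of a cell's ranking get higher values.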

allyhawkins

The added text looks good! I just had a few more minor comments. I think the most substantial comment is about not having the extra step of converting to a sparse matrix since the counts matrix is already sparse when you use the counts() function.

allyhawkins · 2024-11-26T16:22:12Z

scRNA-seq-advanced/05-aucell.Rmd

+
+---
+
+In this notebook, we'll demonstrate how to use the AUCell method, introduced in [Aibar _et al_. 2017.](https://doi.org/10.1038/nmeth.4463).


I like this new intro a lot!

allyhawkins · 2024-11-26T16:30:49Z

scRNA-seq-advanced/05-aucell.Rmd

+
+## Set up gene sets
+
+We are going to use two gene sets pertaining to Ewing sarcoma.


I keep going back and forth on if we need to expand on the explanation here and include more biology on why we are doing this, but I think I've decided that it's probably too much and not the point anyways.

jaclyn-taroni · 2024-11-26T17:03:15Z

The very last instruction notebook doesn't get any biology as a treat!

allyhawkins · 2024-11-26T16:37:39Z

scRNA-seq-advanced/05-aucell.Rmd

+We can save a counts matrix in sparse format for use with `AUCell`.
+If tools can take sparse matrices directly, saving matrices in sparse format can help us save on memory if that's a concern.
+If we were truly concerned about memory, we could remove `sce` from the environment with `rm()` after this step.
+


Okay, sorry for the confusion about my earlier comment. I didn't think you needed it because `counts()` pulls out the counts matrix already in sparse form. So you should be able to remove this text and the `as()` in the chunk below.

jaclyn-taroni · 2024-11-26T17:18:49Z

I'm curious if one would want to use `as("dgCMatrix")` if extracting counts from a Seurat object, but certainly not enough to 1) investigate or 2) leave this in.
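
One way to sidestep the question, sketched here as an assumption rather than anything from the PR, is to check whether a matrix is already sparse and convert only when it is not. The helper name `ensure_sparse()` and the toy matrix are hypothetical; the sketch relies on the Matrix package.

```r
library(Matrix)

# Hypothetical helper: return the input unchanged if it is already sparse,
# otherwise convert it to a compressed sparse column format
ensure_sparse <- function(mat) {
  if (methods::is(mat, "sparseMatrix")) {
    mat
  } else {
    methods::as(mat, "CsparseMatrix")
  }
}

# Toy dense matrix with mostly zeros; values are made up
dense_counts <- matrix(c(0, 0, 3, 0, 1, 0), nrow = 2)

sparse_counts <- ensure_sparse(dense_counts)

# Calling it again on a sparse matrix is a no-op
same_counts <- ensure_sparse(sparse_counts)
```

With a check like this, the conversion step does nothing when the counts come out of the object already sparse.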

allyhawkins · 2024-11-26T16:44:47Z

scRNA-seq-advanced/05-aucell.Rmd

+To save this to a file, we'll likely want this in a tabular format.
+
+```{r auc_to_table}
+# Extract AUC
+auc_df <- cell_auc@assays@data$AUC |>
+  # Transpose
+  t() |>
+  # Convert to data frame
+  as.data.frame() |>
+  # Make the barcodes a column
+  tibble::rownames_to_column("barcodes") 
+
+# Look at first few rows
+head(auc_df)


I might move saving the file to the end of the notebook and save both the auc results and the assignment as one table.
Part of my reasoning for making this comment is that once you calculate it I want to look at the results and see what you can do with them. It makes more sense to me to have saving as the last step.

jaclyn-taroni · 2024-11-26T17:25:55Z

I am now saving the table used for plotting, which does include assignment information.
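
For reference, the reshaping in the quoted chunk can be sketched in base R on a toy matrix. The AUC values and barcodes below are made up, and `data.frame()` stands in for `tibble::rownames_to_column()` so the sketch has no package dependencies.

```r
# Toy AUC matrix: gene sets in rows, cell barcodes in columns (values are made up)
toy_auc <- matrix(
  c(0.30, 0.05, 0.12,
    0.25, 0.02, 0.20),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION", "RIGGI_EWING_SARCOMA_PROGENITOR_UP"),
    c("AAAC", "AAAG", "AAAT")
  )
)

# Transpose so each row is a cell, then convert to a data frame
auc_df <- as.data.frame(t(toy_auc))

# Make the barcodes a regular column instead of row names
auc_df <- data.frame(barcodes = rownames(auc_df), auc_df, row.names = NULL)

head(auc_df)
```

A table in this shape, with one row per cell, is straightforward to join with per-cell assignments by barcode before saving.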

jaclyn-taroni

Will push new changes in just a second

allyhawkins

LGTM!

jaclyn-taroni added 5 commits November 25, 2024 11:31

Add initial draft of the AUCell notebook

d6e5dfc

Document plot_recovery_code()

0ffb4c2

Polish AUCell notebook

ce363da

Merge branch 'jaclyn-taroni/711-use-rms-de' into jaclyn-taroni/806-au…

339881e

…cell

Add AUCell notebook to render live script

d0b8e3d

jaclyn-taroni requested a review from allyhawkins November 25, 2024 18:06

Update renv lockfile

4497a77

Base automatically changed from jaclyn-taroni/711-use-rms-de to master November 25, 2024 18:19

jaclyn-taroni added 3 commits November 25, 2024 13:19

Merge branch 'master' into jaclyn-taroni/806-aucell

d33e941

Spell check fixes; rerun notebook

596049f

Actually save HTML version

dfdf159

jaclyn-taroni mentioned this pull request Nov 25, 2024

Move renv changes for AUCell to own branch #811

Merged

allyhawkins reviewed Nov 25, 2024

jaclyn-taroni and others added 2 commits November 25, 2024 14:59

Merge branch 'master' into jaclyn-taroni/806-aucell

f9efb86

Apply suggestions from code review

150c17f

Co-authored-by: Ally Hawkins <[email protected]>

jaclyn-taroni commented Nov 26, 2024

jaclyn-taroni added 2 commits November 26, 2024 15:22

Respond to review

501abb2

Add outsized to dictionary

917ac6e

jaclyn-taroni requested a review from allyhawkins November 26, 2024 15:28

allyhawkins reviewed Nov 26, 2024

jaclyn-taroni and others added 2 commits November 26, 2024 12:04

Apply suggestions from code review

9afc87e

Co-authored-by: Ally Hawkins <[email protected]>

Merge branch 'master' into jaclyn-taroni/806-aucell

0d3ab00

jaclyn-taroni commented Nov 26, 2024

Respond to round 2 of review

a6aa692

jaclyn-taroni requested a review from allyhawkins November 26, 2024 17:30

allyhawkins approved these changes Nov 26, 2024

Apply suggestions from code review

b7e92ac

Co-authored-by: Ally Hawkins <[email protected]>

jaclyn-taroni merged commit 875d310 into master Nov 26, 2024
2 checks passed

jaclyn-taroni deleted the jaclyn-taroni/806-aucell branch November 26, 2024 18:28
