Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AUCell notebook for scAdvanced pathway analysis #809

Merged
merged 17 commits into from
Nov 26, 2024

Conversation

jaclyn-taroni
Copy link
Member

Stacked on #808
Closes #806

I am adding a notebook on AUCell as the second pathway analysis notebook in the scRNA-seq-advanced module. It uses SCPCS000490 and the two Ewing-related MSigDB gene sets that yield the cleanest results.

  • I want to demonstrate intuition behind the AUC values and the difference between high and low AUC values. That is my rationale for not using AUCell::AUCell_run and instead running the individual steps + making recovery curves.
  • I am not using the built-in plotting functions and am using ggplot2 instead. Please take a look at the threshold data frame calculation; I am rusty.
  • I also want to make sure I show getting the AUC values back into the SCE and using that with scater::plotUMAP().

Base automatically changed from jaclyn-taroni/711-use-rms-de to master November 25, 2024 18:19
Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the order of things and general content of plots look good! My biggest comments are just that we need to have some more explanatory text, in particular I think it would be helpful to talk about why you might use AUCell in the beginning and then before calculating any ranks talk about the process of AUCell so that we have some context about why the ranks are important and how they are used in the overall AUC calculation. Also mention earlier on that a higher AUC value points to higher expression of the gene sets.

I wasn't sure if you are planning to expand on text here or in a later PR though?

Comment on lines 21 to 25
In this notebook, we'll demonstrate how to use the AUCell method, introduced in [Aibar _et al_. 2017.](https://doi.org/10.1038/nmeth.4463), to calculate numeric values that estimate the activity of a gene set in an individual cell.

AUCell calculates the area under the recovery curve (AUC), which "represents the proportion of expressed genes in the signature and their relative expression value compared to the other genes within the cell" ([Aibar _et al_. 2017.](https://doi.org/10.1038/nmeth.4463)).

We will use an snRNA-seq of a Ewing sarcoma sample from the [`SCPCP000015` project](https://scpca.alexslemonade.org/projects/SCPCP000015) on the Single-cell Pediatric Cancer Atlas Portal and two relevant gene sets from the Molecular Signatures Database (MSigDB) to demonstrate this method.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you are planning to expand on text in this PR or later, but I might include some context here on when you would use this package and why it might be helpful.

Comment on lines 32 to 36
# We will be loading a SingleCellExperiment object into our environment but
# don't need to see the startup messages
suppressPackageStartupMessages({
library(SingleCellExperiment)
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but I noticed that the two lines with comments are broken up by a message in the rendered HTML which seems weird. I might make that comment on one line to avoid that?

scRNA-seq-advanced/05-aucell.Rmd Outdated Show resolved Hide resolved
```{r sparse_matrix}
# Extract counts matrix and save in sparse format
counts_matrix <- counts(sce) |>
as("dgCMatrix")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to do this? When I had run it using AUCell_run I provided the counts matrix directly but not sure if you need to convert it because of how you use it later?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it does not appear to be necessary. However, I view this as the same kind of lesson as never uncompress a gzipped file if your tool can take the compressed version directly, which is to say it might come in handy someday. I will add text about how it can help us save memory if that is a concern and how, if that were true, we'd probably remove sce from the environment, too.


### Cell ranking

The first step in AUCell is to rank genes for each cell from highest to lowest values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The first step in AUCell is to rank genes for each cell from highest to lowest values.
The first step in AUCell is to rank the number of genes for each cell from highest to lowest values.

I would clarify that we are ranking the total number of genes detected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are ranking the total number of genes detected.

I don't think we are, though? We are ranking genes from highest to lowest value in an individual cell and producing a plot that describes the number of genes detected in each cell to help us pick an AUC max rank value. (I will write this.)

Comment on lines 159 to 162
AUCell relies on gene rankings, but we expect a proportion of genes are not detected.
Genes can also have the same expression level (i.e., ties).
These undetected genes and ties will be randomly ordered.
To make our results reproducible, we will set a seed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about expanding on the text a little bit. I would have a little intro here that talks about the method itself and then at the end state that because ties will be randomly ordered we first need to set a seed.

scRNA-seq-advanced/05-aucell.Rmd Outdated Show resolved Hide resolved
Comment on lines 175 to 177
We want to make sure most cells have at least the number of genes we will use to calculate the AUC.
By default, the max rank is the top 5% highest expressed genes.
We can calculate the default max rank by taking into account the number of genes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would talk a little more about how we are going to use max rank to get the AUC values here otherwise I think there's going to be confusion about why we are doing this.

Comment on lines 99 to 100
ewing_gene_set_names <- c("ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION",
"RIGGI_EWING_SARCOMA_PROGENITOR_UP")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might make this a named list so that we can avoid all the ewing_gene_set_names[1] and use ewing_gene_set_names["zhang"]. Or I think you could rename the msigdbr results to have shorter names so you can easily use them in later code.

Comment on lines 359 to 362
# Use shorter names
dplyr::rename(zhang_auc = ewing_gene_set_names[1],
riggi_auc = ewing_gene_set_names[2]),
by = "barcodes"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you rename from the beginning you could avoid the renaming here and in the other data frames you make.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what I did initially, but then chose to leave the original (very untypeable) name in the facets of the density plot.

Copy link
Member Author

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not even close to done addressing comments but returning replies

```{r sparse_matrix}
# Extract counts matrix and save in sparse format
counts_matrix <- counts(sce) |>
as("dgCMatrix")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it does not appear to be necessary. However, I view this as the same kind of lesson as never uncompress a gzipped file if your tool can take the compressed version directly, which is to say it might come in handy someday. I will add text about how it can help us save memory if that is a concern and how, if that were true, we'd probably remove sce from the environment, too.


### Cell ranking

The first step in AUCell is to rank genes for each cell from highest to lowest values.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are ranking the total number of genes detected.

I don't think we are, though? We are ranking genes from highest to lowest value in an individual cell and producing a plot that describes the number of genes detected in each cell to help us pick an AUC max rank value. (I will write this.)

Comment on lines 359 to 362
# Use shorter names
dplyr::rename(zhang_auc = ewing_gene_set_names[1],
riggi_auc = ewing_gene_set_names[2]),
by = "barcodes"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what I did initially, but then chose to leave the original (very untypeable) name in the facets of the density plot.

@jaclyn-taroni
Copy link
Member Author

Thank you for your feedback, @allyhawkins! I've added a lot of text, and now I'm concerned it is too repetitive – let me know what you think!

It is unfortunate that we can't show the recovery curve until we have a max rank value. However, slides (forthcoming) will also go over the recovery curve, so the visualizations in the notebook will be the second time we cover/visualize the concept.

Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added text looks good! I just had a few more minor comments. I think the most substantial comment is about not having the extra step of converting to a sparse matrix since the counts matrix is already sparse when you use the counts() function.


---

In this notebook, we'll demonstrate how to use the AUCell method, introduced in [Aibar _et al_. 2017.](https://doi.org/10.1038/nmeth.4463).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this new intro a lot!


## Set up gene sets

We are going to use two gene sets pertaining to Ewing sarcoma.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep going back and forth on if we need to expand on the explanation here and include more biology on why we are doing this, but I think I've decided that it's probably too much and not the point anyways.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The very last instruction notebook doesn't get any biology as a treat!

Comment on lines 161 to 164
We can save a counts matrix in sparse format for use with `AUCell`.
If tools can take sparse matrices directly, saving matrices in sparse format can help us save on memory if that's a concern.
If we were truly concerned about memory, we could remove `sce` from the environment with `rm()` after this step.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, sorry for the confusion about my earlier comment. I didn't think you needed it because counts() pulls out the counts matrix already in sparse form. So you should be able to remove this text and the as() in the chunk below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if one would want to as("dgCMatrix") if extracting counts from a Seurat object but certainly not enough to 1) investigate 2) leave this in.

scRNA-seq-advanced/05-aucell.Rmd Outdated Show resolved Hide resolved
scRNA-seq-advanced/05-aucell.Rmd Outdated Show resolved Hide resolved
scRNA-seq-advanced/05-aucell.Rmd Outdated Show resolved Hide resolved
scRNA-seq-advanced/05-aucell.Rmd Outdated Show resolved Hide resolved
scRNA-seq-advanced/05-aucell.Rmd Outdated Show resolved Hide resolved
Comment on lines 282 to 295
To save this to a file, we'll likely want this in a tabular format.

```{r auc_to_table}
# Extract AUC
auc_df <- cell_auc@assays@data$AUC |>
# Transpose
t() |>
# Convert to data frame
as.data.frame() |>
# Make the barcodes a column
tibble::rownames_to_column("barcodes")

# Look at first few rows
head(auc_df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might move saving the file to the end of the notebook and save both the auc results and the assignment as one table.
Part of my reasoning for making this comment is that once you calculate it I want to look at the results and see what you can do with them. It makes more sense to me to have saving as the last step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am now saving the table used for plotting, which does include assignment information.

scRNA-seq-advanced/05-aucell.Rmd Outdated Show resolved Hide resolved
Copy link
Member Author

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will push new changes in just a second


## Set up gene sets

We are going to use two gene sets pertaining to Ewing sarcoma.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The very last instruction notebook doesn't get any biology as a treat!

Comment on lines 161 to 164
We can save a counts matrix in sparse format for use with `AUCell`.
If tools can take sparse matrices directly, saving matrices in sparse format can help us save on memory if that's a concern.
If we were truly concerned about memory, we could remove `sce` from the environment with `rm()` after this step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if one would want to as("dgCMatrix") if extracting counts from a Seurat object but certainly not enough to 1) investigate 2) leave this in.

Comment on lines 282 to 295
To save this to a file, we'll likely want this in a tabular format.

```{r auc_to_table}
# Extract AUC
auc_df <- cell_auc@assays@data$AUC |>
# Transpose
t() |>
# Convert to data frame
as.data.frame() |>
# Make the barcodes a column
tibble::rownames_to_column("barcodes")

# Look at first few rows
head(auc_df)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am now saving the table used for plotting, which does include assignment information.

Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

scRNA-seq-advanced/05-aucell.Rmd Outdated Show resolved Hide resolved
@jaclyn-taroni jaclyn-taroni merged commit 875d310 into master Nov 26, 2024
2 checks passed
@jaclyn-taroni jaclyn-taroni deleted the jaclyn-taroni/806-aucell branch November 26, 2024 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pathway analysis for scAdvanced: Write an AUCell notebook
2 participants