-
-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AUCell notebook for scAdvanced pathway analysis #809
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the order of things and general content of plots look good! My biggest comments are just that we need to have some more explanatory text, in particular I think it would be helpful to talk about why you might use AUCell in the beginning and then before calculating any ranks talk about the process of AUCell so that we have some context about why the ranks are important and how they are used in the overall AUC calculation. Also mention earlier on that a higher AUC value points to higher expression of the gene sets.
I wasn't sure if you are planning to expand on text here or in a later PR though?
scRNA-seq-advanced/05-aucell.Rmd
Outdated
In this notebook, we'll demonstrate how to use the AUCell method, introduced in [Aibar _et al_. 2017.](https://doi.org/10.1038/nmeth.4463), to calculate numeric values that estimate the activity of a gene set in an individual cell. | ||
|
||
AUCell calculates the area under the recovery curve (AUC), which "represents the proportion of expressed genes in the signature and their relative expression value compared to the other genes within the cell" ([Aibar _et al_. 2017.](https://doi.org/10.1038/nmeth.4463)). | ||
|
||
We will use an snRNA-seq of a Ewing sarcoma sample from the [`SCPCP000015` project](https://scpca.alexslemonade.org/projects/SCPCP000015) on the Single-cell Pediatric Cancer Atlas Portal and two relevant gene sets from the Molecular Signatures Database (MSigDB) to demonstrate this method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if you are planning to expand on text in this PR or later, but I might include some context here on when you would use this package and why it might be helpful.
scRNA-seq-advanced/05-aucell.Rmd
Outdated
# We will be loading a SingleCellExperiment object into our environment but | ||
# don't need to see the startup messages | ||
suppressPackageStartupMessages({ | ||
library(SingleCellExperiment) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor, but I noticed that the two lines with comments are broken up by a message in the rendered HTML which seems weird. I might make that comment on one line to avoid that?
scRNA-seq-advanced/05-aucell.Rmd
Outdated
```{r sparse_matrix} | ||
# Extract counts matrix and save in sparse format | ||
counts_matrix <- counts(sce) |> | ||
as("dgCMatrix") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to do this? When I had run it using AUCell_run
I provided the counts matrix directly but not sure if you need to convert it because of how you use it later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it does not appear to be necessary. However, I view this as the same kind of lesson as never uncompress a gzipped file if your tool can take the compressed version directly, which is to say it might come in handy someday. I will add text about how it can help us save memory if that is a concern and how, if that were true, we'd probably remove sce
from the environment, too.
scRNA-seq-advanced/05-aucell.Rmd
Outdated
|
||
### Cell ranking | ||
|
||
The first step in AUCell is to rank genes for each cell from highest to lowest values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first step in AUCell is to rank genes for each cell from highest to lowest values. | |
The first step in AUCell is to rank the number of genes for each cell from highest to lowest values. |
I would clarify that we are ranking the total number of genes detected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are ranking the total number of genes detected.
I don't think we are, though? We are ranking genes from highest to lowest value in an individual cell and producing a plot that describes the number of genes detected in each cell to help us pick an AUC max rank value. (I will write this.)
scRNA-seq-advanced/05-aucell.Rmd
Outdated
AUCell relies on gene rankings, but we expect a proportion of genes are not detected. | ||
Genes can also have the same expression level (i.e., ties). | ||
These undetected genes and ties will be randomly ordered. | ||
To make our results reproducible, we will set a seed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about expanding on the text a little bit. I would have a little intro here that talks about the method itself and then at the end state that because ties will be randomly ordered we first need to set a seed.
scRNA-seq-advanced/05-aucell.Rmd
Outdated
We want to make sure most cells have at least the number of genes we will use to calculate the AUC. | ||
By default, the max rank is the top 5% highest expressed genes. | ||
We can calculate the default max rank by taking into account the number of genes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would talk a little more about how we are going to use max rank to get the AUC values here otherwise I think there's going to be confusion about why we are doing this.
scRNA-seq-advanced/05-aucell.Rmd
Outdated
ewing_gene_set_names <- c("ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION", | ||
"RIGGI_EWING_SARCOMA_PROGENITOR_UP") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might make this a named list so that we can avoid all the ewing_gene_set_names[1]
and use ewing_gene_set_names["zhang"]
. Or I think you could rename the msigdbr
results to have shorter names so you can easily use them in later code.
scRNA-seq-advanced/05-aucell.Rmd
Outdated
# Use shorter names | ||
dplyr::rename(zhang_auc = ewing_gene_set_names[1], | ||
riggi_auc = ewing_gene_set_names[2]), | ||
by = "barcodes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you rename from the beginning you could avoid the renaming here and in the other data frames you make.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is what I did initially, but then chose to leave the original (very untypeable) name in the facets of the density plot.
Co-authored-by: Ally Hawkins <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not even close to done addressing comments but returning replies
scRNA-seq-advanced/05-aucell.Rmd
Outdated
```{r sparse_matrix} | ||
# Extract counts matrix and save in sparse format | ||
counts_matrix <- counts(sce) |> | ||
as("dgCMatrix") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it does not appear to be necessary. However, I view this as the same kind of lesson as never uncompress a gzipped file if your tool can take the compressed version directly, which is to say it might come in handy someday. I will add text about how it can help us save memory if that is a concern and how, if that were true, we'd probably remove sce
from the environment, too.
scRNA-seq-advanced/05-aucell.Rmd
Outdated
|
||
### Cell ranking | ||
|
||
The first step in AUCell is to rank genes for each cell from highest to lowest values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are ranking the total number of genes detected.
I don't think we are, though? We are ranking genes from highest to lowest value in an individual cell and producing a plot that describes the number of genes detected in each cell to help us pick an AUC max rank value. (I will write this.)
scRNA-seq-advanced/05-aucell.Rmd
Outdated
# Use shorter names | ||
dplyr::rename(zhang_auc = ewing_gene_set_names[1], | ||
riggi_auc = ewing_gene_set_names[2]), | ||
by = "barcodes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is what I did initially, but then chose to leave the original (very untypeable) name in the facets of the density plot.
Thank you for your feedback, @allyhawkins! I've added a lot of text, and now I'm concerned it is too repetitive – let me know what you think! It is unfortunate that we can't show the recovery curve until we have a max rank value. However, slides (forthcoming) will also go over the recovery curve, so the visualizations in the notebook will be the second time we cover/visualize the concept. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The added text looks good! I just had a few more minor comments. I think the most substantial comment is about not having the extra step of converting to a sparse matrix since the counts matrix is already sparse when you use the counts()
function.
|
||
--- | ||
|
||
In this notebook, we'll demonstrate how to use the AUCell method, introduced in [Aibar _et al_. 2017.](https://doi.org/10.1038/nmeth.4463). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this new intro a lot!
|
||
## Set up gene sets | ||
|
||
We are going to use two gene sets pertaining to Ewing sarcoma. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I keep going back and forth on if we need to expand on the explanation here and include more biology on why we are doing this, but I think I've decided that it's probably too much and not the point anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The very last instruction notebook doesn't get any biology as a treat!
scRNA-seq-advanced/05-aucell.Rmd
Outdated
We can save a counts matrix in sparse format for use with `AUCell`. | ||
If tools can take sparse matrices directly, saving matrices in sparse format can help us save on memory if that's a concern. | ||
If we were truly concerned about memory, we could remove `sce` from the environment with `rm()` after this step. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, sorry for the confusion about my earlier comment. I didn't think you needed it because counts()
pulls out the counts matrix already in sparse form. So you should be able to remove this text and the as()
in the chunk below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious if one would want to as("dgCMatrix")
if extracting counts from a Seurat object but certainly not enough to 1) investigate 2) leave this in.
scRNA-seq-advanced/05-aucell.Rmd
Outdated
To save this to a file, we'll likely want this in a tabular format. | ||
|
||
```{r auc_to_table} | ||
# Extract AUC | ||
auc_df <- cell_auc@assays@data$AUC |> | ||
# Transpose | ||
t() |> | ||
# Convert to data frame | ||
as.data.frame() |> | ||
# Make the barcodes a column | ||
tibble::rownames_to_column("barcodes") | ||
|
||
# Look at first few rows | ||
head(auc_df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might move saving the file to the end of the notebook and save both the auc results and the assignment as one table.
Part of my reasoning for making this comment is that once you calculate it I want to look at the results and see what you can do with them. It makes more sense to me to have saving as the last step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am now saving the table used for plotting, which does include assignment information.
Co-authored-by: Ally Hawkins <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will push new changes in just a second
|
||
## Set up gene sets | ||
|
||
We are going to use two gene sets pertaining to Ewing sarcoma. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The very last instruction notebook doesn't get any biology as a treat!
scRNA-seq-advanced/05-aucell.Rmd
Outdated
We can save a counts matrix in sparse format for use with `AUCell`. | ||
If tools can take sparse matrices directly, saving matrices in sparse format can help us save on memory if that's a concern. | ||
If we were truly concerned about memory, we could remove `sce` from the environment with `rm()` after this step. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious if one would want to as("dgCMatrix")
if extracting counts from a Seurat object but certainly not enough to 1) investigate 2) leave this in.
scRNA-seq-advanced/05-aucell.Rmd
Outdated
To save this to a file, we'll likely want this in a tabular format. | ||
|
||
```{r auc_to_table} | ||
# Extract AUC | ||
auc_df <- cell_auc@assays@data$AUC |> | ||
# Transpose | ||
t() |> | ||
# Convert to data frame | ||
as.data.frame() |> | ||
# Make the barcodes a column | ||
tibble::rownames_to_column("barcodes") | ||
|
||
# Look at first few rows | ||
head(auc_df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am now saving the table used for plotting, which does include assignment information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Co-authored-by: Ally Hawkins <[email protected]>
Stacked on #808
Closes #806
I am adding a notebook on AUCell as the second pathway analysis notebook in the
scRNA-seq-advanced
module. It usesSCPCS000490
and the two Ewing-related MSigDB gene sets that yield the cleanest results.AUCell::AUCell_run
and instead running the individual steps + making recovery curves.ggplot2
instead. Please take a look at the threshold data frame calculation; I am rusty.scater::plotUMAP()
.