Results update and seeking feedback on annotating tumor cells in Ewing sarcoma samples #537
Replies: 3 comments
-
This is an incredible effort and mirrors the challenges that many of us face when investigating single-cell data. I applaud your detailed methodology and critical reflection at each step. Reviewing your analyses, one thing that may be critical is that Ewing sarcoma tends to have quieter genomes, which you acknowledged could be a factor in why the classifications from the CNV methods do not agree. I am more inclined to use gene expression to classify the tumor cells, given the presence of the pathognomonic EWS::FLI1 fusion protein. In terms of methodology, perhaps instead of looking at the raw sum, you can look at clusters and investigate the logFC for the EWS::FLI1 marker genes. It should be quite high compared to cell clusters that contain normal cells. Of course, the other assumption is that you do have normal cells in your data. Were the other cells in SCPCL000824 identified as endothelial or other immune cells? Sorry if there were other analyses elsewhere that I missed. The
-
I want to echo Danh and commend you on a very thorough job! Here are a few thoughts:
-
@danhtruong and @patelgrp thank you so much for your input! This is really helpful and valuable.
@danhtruong this is a great question! We are definitely assuming that at least some portion of the cells are normal cells, but I also expect that most cells are tumor cells and that the proportion of normal cells will vary per sample. I think SCPCL000822 has a higher proportion of normal cells, so the differences between the two populations are clearer, but my inclination is that SCPCL000824 is mostly tumor cells. Included in all samples on the Portal are cell type annotations determined by
As for looking at expression within clusters, I do think that's a great idea. We would need some time to make sure we have the right parameters for clustering, as we have not done that with this dataset yet. But I do think that might be helpful in identifying which groups of cells are more likely to be tumor than normal.
I totally agree about the heterogeneity of the EWS-FLI1 target expression! This is actually why my initial analysis didn't rely on using gene set scores because I was worried we would miss tumor cells that have low EWS-FLI1 expression. Considering this, are there marker genes that you would trust to be consistently expressed independently of EWS-FLI1? I had previously used things like CD99, but that gene is expressed at a fairly high level in normal cells making it hard to differentiate the two populations. Any opinions on marker genes that you might trust would be really helpful.
@patelgrp thanks for this suggestion! I had not heard of
Yes, I have the same feeling about the
This would be really helpful! We don't have that information so any additional information that you may want to share about these samples would be great.
-
In an effort to complete the analysis proposed in #292, I have been working to first identify tumor and normal (non-malignant) cells in the Ewing sarcoma samples from SCPCP000015. I am aiming to be able to confidently identify tumor cells in a handful of samples first and then use those samples as a reference to annotate the remaining samples.
However, I am currently having trouble identifying a group of tumor cells that we feel confident in for some of the samples. I am posting here to hopefully receive feedback from others about how they might approach identifying tumor cells.
Continue reading for an outline of the approach we have taken and a summary of the results thus far. You can also see the current code and results in analyses/cell-type-ewings.
Analysis approach for identifying tumor cells
I used SCPCS000490/SCPCL000822 for original exploration of different classification methods. Then for a handful of samples, I annotated cells as tumor cells using three distinct methods:
Marker gene expression: Using the marker genes listed in the analyses/cell-type-ewings/references/tumor-marker-genes.tsv file, we grabbed the expression of each gene in the list across all cells in the sample, z-scaled the individual gene expression vectors, and then summed the scaled expression of all marker genes for each cell. For SCPCL000822, this showed a bimodal distribution centered around 0, so we labeled any cell with a summed marker gene score > 0 as a tumor cell.
CopyKAT: I ran CopyKAT using the default arguments without specifying a group of reference cells as the normal cells. The output of this method includes annotation of each cell as either aneuploid, diploid, or not.defined. Any cells labeled as aneuploid were categorized as tumor cells.
InferCNV: I ran InferCNV with cutoff = 0.1 (recommended for 10x samples) and specified any cells that were identified as endothelial cells when using normal references with both SingleR and CellAssign (output available on the ScPCA Portal) as the reference cells to use as a baseline for CNV detection. InferCNV produces a scaled value indicating the proportion of each chromosome with a detected copy number gain or loss. To classify cells as tumor or normal, I first calculated the CNV proportion for each cell across all chromosomes, weighted by the number of genes in a chromosome. Cells with a genomic CNV proportion greater than the mean CNV proportion across all cells are called as tumor cells.
Annotating tumor cells when methods agree
After identifying tumor cells using these three methods, I looked at the results across all three methods and aimed to find a consensus on which group of cells are actually tumor cells. For SCPCS000490/SCPCL000822, this was fairly straightforward as we saw a lot of agreement between the three methods.
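As a rough illustration of the marker gene and InferCNV thresholds described in the approach above, here is a minimal Python sketch. It is not the code in analyses/cell-type-ewings: the function names, the toy gene names, and the input layout (plain lists and dicts) are assumptions made for illustration, and it assumes the per-chromosome CNV proportions have already been pulled out of the InferCNV output.

```python
from statistics import mean, stdev


def zscale(values):
    """Center and scale one gene's expression vector across cells."""
    mu, sd = mean(values), stdev(values)
    # A gene with no variation carries no signal, so score it as 0 everywhere.
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


def marker_sum_calls(expr_by_gene, cutoff=0.0):
    """expr_by_gene maps gene -> list of expression values (one per cell).

    Returns the per-cell sum of z-scaled marker expression and a tumor/normal
    call using the `> cutoff` rule (0 was chosen for SCPCL000822).
    """
    scaled = [zscale(values) for values in expr_by_gene.values()]
    sums = [sum(per_cell) for per_cell in zip(*scaled)]
    calls = ["tumor" if s > cutoff else "normal" for s in sums]
    return sums, calls


def cnv_calls(chrom_props, genes_per_chrom):
    """chrom_props holds one {chromosome: CNV proportion} dict per cell.

    Each cell's genome-wide proportion is the gene-count-weighted mean of its
    chromosome proportions; cells above the mean across all cells are tumor.
    """
    total_genes = sum(genes_per_chrom.values())
    weighted = [
        sum(props[c] * genes_per_chrom[c] for c in genes_per_chrom) / total_genes
        for props in chrom_props
    ]
    cutoff = mean(weighted)
    return weighted, ["tumor" if w > cutoff else "normal" for w in weighted]


# Hypothetical toy input: two marker genes measured in four cells.
toy_expr = {"GENE_A": [0.0, 0.0, 4.0, 4.0], "GENE_B": [1.0, 1.0, 3.0, 3.0]}
toy_sums, toy_calls = marker_sum_calls(toy_expr)
print(toy_calls)  # ['normal', 'normal', 'tumor', 'tumor']
```

Note that a mean-based cutoff always splits the cells into two groups, even when a sample is almost entirely tumor, which may be part of why this rule is harder to trust in samples without a bimodal distribution.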
The below UMAP shows cells that were labeled as either normal or tumor using each of the three classification methods. The reference cells are those that were used as a baseline reference when running InferCNV.
We called cells that were identified as tumor by both CopyKAT and InferCNV as tumor, cells identified as normal by both CopyKAT and InferCNV as normal, and then all other cells were labeled as "Ambiguous". The combined classifications are shown in the UMAP below.
I then validated these annotations by looking at the expression of EWS-FLI1 target gene sets obtained from MSigDB and looking for known CNVs that may be present in Ewing sarcoma samples.
Gene set scores
For each of the following gene sets, I calculated the mean expression of all genes in the gene set to obtain a gene set score:
ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION
RIGGI_EWING_SARCOMA_PROGENITOR_UP
SILIGAN_TARGETS_OF_EWS_FLI1_FUSION_DN
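A minimal sketch of the gene set score as described (the mean expression of all genes in a set, per cell) could look like the following. The function names, the toy gene names, and the choice to skip genes that are missing from a cell's profile are assumptions for illustration, not necessarily what the analysis code does.

```python
from statistics import mean


def gene_set_score(cell_expr, gene_set):
    """Mean expression of the gene set's genes in a single cell.

    cell_expr maps gene -> expression value. Genes absent from the profile
    are skipped (an assumption); an empty overlap yields NaN.
    """
    values = [cell_expr[gene] for gene in gene_set if gene in cell_expr]
    return mean(values) if values else float("nan")


def score_cells(cells, gene_sets):
    """Return {gene set name: [score for each cell, in input order]}."""
    return {
        name: [gene_set_score(cell, genes) for cell in cells]
        for name, genes in gene_sets.items()
    }


# Hypothetical toy data: two cells, one made-up gene set.
toy_cells = [
    {"G1": 2.0, "G2": 4.0, "G3": 0.0},
    {"G1": 0.0, "G2": 0.0, "G3": 6.0},
]
print(score_cells(toy_cells, {"SET_X": ["G1", "G2"]}))  # {'SET_X': [3.0, 0.0]}
```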
The heatmap below shows the gene set score for all cells (columns) across all gene sets (rows). The annotation bar at the bottom identifies which cells were classified as tumor vs normal cells using our combined classification (intersection between CopyKAT and InferCNV).
Looking at the gene set scores across all cells in the sample, we see that they are higher in tumor cells than in normal cells.
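For anyone who wants to reproduce the comparison, the combined classification and a per-class summary of a gene set score can be sketched as below. This assumes the InferCNV calls have already been thresholded to "tumor"/"normal" and that CopyKAT calls use its aneuploid/diploid/not.defined labels; the function names are hypothetical.

```python
def consensus_label(copykat_call, infercnv_call):
    """Combine the two CNV-based calls into tumor / normal / Ambiguous.

    copykat_call: "aneuploid", "diploid", or "not.defined".
    infercnv_call: "tumor" or "normal" (assumed to be thresholded already).
    """
    if copykat_call == "aneuploid" and infercnv_call == "tumor":
        return "tumor"
    if copykat_call == "diploid" and infercnv_call == "normal":
        return "normal"
    # Any disagreement, or an undefined CopyKAT call, is left unresolved.
    return "Ambiguous"


def mean_score_by_label(labels, scores):
    """Average a per-cell score (e.g. a gene set score) within each label."""
    grouped = {}
    for label, score in zip(labels, scores):
        grouped.setdefault(label, []).append(score)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}


# Hypothetical calls for four cells.
copykat = ["aneuploid", "aneuploid", "diploid", "not.defined"]
infercnv = ["tumor", "normal", "normal", "tumor"]
labels = [consensus_label(c, i) for c, i in zip(copykat, infercnv)]
print(labels)  # ['tumor', 'Ambiguous', 'normal', 'Ambiguous']
```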
Validation of common CNVs
There are a few known copy number variations in Ewing's sarcoma:
Although these are the most frequent, there are patients who do not have any of these alterations and patients that only have some of these alterations.
See Tirode et al. and Crompton et al.
The heatmap below specifically looks at the proportion of each chromosome with a copy number gain as reported by InferCNV. It appears that there is a gain in Chr8 in tumor cells that is not as prevalent in normal cells. There also appears to be a general increase in chromosomes affected by CNVs in the tumor cells as compared to normal cells, validating that those are in fact tumor cells.
Annotating tumor cells when methods disagree
The above sample was fairly straightforward to annotate, as all methods had a similar consensus on which cells should be tumor cells. These were also validated by looking at EWS-FLI1 target gene sets and known copy number variations. However, this sample appears to be abnormal, and the few other samples I have looked at don't seem to have a clear consensus between copy number inference and gene expression-based classification.
The below results are from SCPCS000492/SCPCL000824, but similar results can be seen across multiple other samples. The analysis shown below is in progress and you can follow along with #532.
In the following UMAP, each cell is colored by the classification and each panel is a different classification method that was used.
We then looked at marker gene expression and gene set scores for all cells, where we expect cells classified as tumor cells to have higher expression and scores.
The plots below show the distribution of either marker gene expression or gene set scores, colored by classification, and faceted by the method that was used to classify the cells as either tumor or normal. Reference cells are those that were used as a baseline reference when running InferCNV.
Marker gene expression
Gene set scores
If you look at both the expression of marker genes (top plot) and EWS-FLI1 gene set scores (bottom plot) across all cells, we see:
Tumor cells classified by CopyKAT have lower marker gene expression and gene set scores than normal cells, which is the opposite of what we would expect.
Tumor and normal cells classified by InferCNV have similar marker gene expression and gene set scores, both of which are greater than those of the cells used as references for running InferCNV. Perhaps many of the "normal" cells should actually be classified as "tumor" here.
Another thing we noticed when comparing this sample to the other sample (SCPCS000490/SCPCL000822) is that we don't quite see a bimodal distribution in either marker genes or gene set scores. This is something we did see in the other sample, making it easy to determine cutoffs to use for classifying cells as tumor cells.
Additionally, we do not see quite the same distinct CNV profiles for both normal and tumor cells and cannot identify any known CNVs in tumor cells. The below heatmap shows the proportion of each chromosome with a copy number gain as calculated by InferCNV. The annotations shown are the classifications obtained from both CopyKAT and InferCNV. It's possible that we are misclassifying many normal cells, but we should also acknowledge that these genomes are very quiet, so it may be hard to distinguish tumor cells using CNVs alone.
The major concern I have with taking the results from the CNV methods and using those to classify tumor cells is that those same tumor cells don't show any increase in marker gene expression or gene set scores. My inclination is that most of the cells in these samples will be tumor cells and that they should show some elevated level of EWS-FLI1 gene set expression. However, we need to identify a cutoff for when we can reliably call a cell as tumor rather than normal in this sample (and other similar samples).
Next steps
Moving forward, I believe the best approach would be to use information learned from SCPCS000490/SCPCL000822 to classify cells in additional samples. We feel very confident in which cells from SCPCL000822 are tumor cells, so can we take the characteristics of those tumor cells and define tumor cells in other samples?
Below are a few ideas that have been discussed but not fully implemented:
SingleR: SingleR is a method that takes a reference dataset and identifies the top feature genes for each cell type in that reference dataset. Then the expression of those genes in the reference is correlated with the expression of those same genes in the query sample to identify which cells are likely to correspond to a given cell type in the reference. We could use SCPCL000822 as a reference to identify tumor cells in the difficult sample (SCPCL000824). I have already done this with just trying to classify cells as tumor or normal (see plot below). However, it's possible that by providing only two cell types we are overfitting the model. The next thing I would like to do is use the multi-reference feature in SingleR and provide both a reference with possible normal cell types (from celldex) and SCPCL000822 as a reference.
I would appreciate any and all feedback on how someone might approach this issue or thoughts on additional analysis that might lead us in the right direction.
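For readers less familiar with the correlation idea behind the SingleR proposal above, here is a deliberately simplified Python sketch of reference-based label transfer. It is not SingleR and does not reproduce its marker selection or fine-tuning steps. It only ranks a query cell against one expression profile per reference label using Spearman correlation, and every name and value in it is a made-up assumption.

```python
from statistics import mean


def ranks(values):
    """Average ranks (ties share the mean rank), as used by Spearman correlation."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0


def spearman(x, y):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def transfer_label(query, reference_profiles):
    """Assign the reference label whose profile correlates best with the query.

    query: expression of the chosen feature genes in one query cell.
    reference_profiles: {label: expression of the same genes, in the same order}.
    """
    scores = {label: spearman(query, profile) for label, profile in reference_profiles.items()}
    return max(scores, key=scores.get), scores


# Hypothetical profiles over four feature genes.
reference = {"tumor": [5.0, 4.0, 1.0, 0.0], "normal": [0.0, 1.0, 4.0, 5.0]}
best_label, label_scores = transfer_label([9.0, 7.0, 2.0, 1.0], reference)
print(best_label)  # tumor
```

With only two labels the best match always wins, however weak the correlation, which is one way to see why adding normal cell type references could help.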