Results update and seeking feedback on annotating tumor cells in Ewing sarcoma samples #537

allyhawkins (Maintainer) · 2024-06-20T16:59:42Z

In an effort to complete the analysis proposed in #292, I have been working to first identify tumor and normal (non-malignant) cells in the Ewing sarcoma samples from SCPCP000015. I am aiming to be able to confidently identify tumor cells in a handful of samples first and then use those samples as a reference to annotate the remaining samples.

However, I am currently having trouble identifying a group of tumor cells that we feel confident in for some of the samples. I am posting here to hopefully receive feedback from others about how they might approach identifying tumor cells.

Would others take different approaches to classifying tumor cells that are not outlined here? Or should we modify any of our approaches?
How would you best suggest validating which cells are tumor cells?
Based on the data from both samples shown here, do you have recommendations for classifying tumor cells specifically in SCPCL000824 (the second sample shown here)? Are there ways we can use information obtained from SCPCL000822 to annotate additional samples like SCPCL000824?

Continue reading for an outline of the approach we have taken and a summary of the results thus far. You can also see the current code and results in analyses/cell-type-ewings.

Analysis approach for identifying tumor cells

I used SCPCS000490/SCPCL000822 for original exploration of different classification methods. Then for a handful of samples, I annotated cells as tumor cells using three distinct methods:

Marker genes only: A list of marker genes for tumor cells was curated from Visser, Bleijs, et al. (2023) and Goodspeed et al. (2024). The compiled marker gene list can be found in the analyses/cell-type-ewings/references/tumor-marker-genes.tsv file. We grabbed the expression of each of the genes in the list across all cells in the sample, z-scaled the individual gene expression vectors, and then summed the scaled expression of all marker genes for each cell. For SCPCL000822, this showed a bimodal distribution centered around 0, so we took any cells with the sum of all marker genes > 0 and labeled those as tumor cells.
CopyKAT: I ran CopyKAT using the default arguments without specifying a group of reference cells as the normal cells. The output of this method includes annotation of each cell as either aneuploid, diploid, or not.defined. Any cells labeled as aneuploid were categorized as tumor cells.
InferCNV: I ran InferCNV with cutoff = 0.1 (recommended for 10x samples) and specified any cells that were identified as endothelial cells when using normal references with both SingleR and CellAssign (output available on the ScPCA Portal) as the reference cells to use as a baseline for CNV detection. InferCNV produces a scaled value indicating the proportion of each chromosome with a detected copy number gain or loss. To classify cells as tumor or normal, I first calculated the CNV proportion for each cell across all chromosomes, weighted by the number of genes in a chromosome. Cells with a genomic CNV proportion greater than the mean CNV proportion across all cells are called as tumor cells.
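To make the marker-gene-only method concrete, here is a minimal Python sketch of the z-scale-and-sum step. It is not the code in analyses/cell-type-ewings: the function names, the gene names (`GENE_A`, `GENE_B`), and the toy values are hypothetical, and it scales with the population standard deviation, which may differ from the scaling used in the actual analysis.

```python
from statistics import mean, pstdev


def zscale(values):
    """Z-scale one gene's expression vector across all cells."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD; the real analysis may differ
    if sd == 0:
        # A gene with no variation carries no information; score it as 0.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def marker_sum_calls(expr_by_gene, cutoff=0.0):
    """Sum z-scaled marker expression per cell and call tumor above `cutoff`.

    expr_by_gene maps gene name -> list of expression values, one per cell.
    """
    scaled = [zscale(values) for values in expr_by_gene.values()]
    n_cells = len(scaled[0])
    scores = [sum(gene[i] for gene in scaled) for i in range(n_cells)]
    calls = ["tumor" if s > cutoff else "normal" for s in scores]
    return scores, calls


# Hypothetical toy data: two marker genes measured in four cells.
expr = {
    "GENE_A": [5.0, 4.0, 0.0, 1.0],
    "GENE_B": [3.0, 4.0, 1.0, 0.0],
}
scores, calls = marker_sum_calls(expr)
print(calls)
```

The cutoff of 0 only makes sense when the summed scores split into two clear groups, as they did for SCPCL000822; it is exposed as a parameter because that will not hold for every sample.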

Annotating tumor cells when methods agree

After identifying tumor cells using these three methods, I looked at the results across all three methods and aimed to find a consensus on which group of cells are actually tumor cells. For SCPCS000490/SCPCL000822, this was fairly straightforward as we saw a lot of agreement between the three methods.

The below UMAP shows cells that were labeled as either normal or tumor using each of the three classification methods. The reference cells are those that were used as a baseline reference when running InferCNV.

We labeled cells identified as tumor by both CopyKAT and InferCNV as tumor, cells identified as normal by both methods as normal, and all other cells as "Ambiguous". The combined classifications are shown in the UMAP below.
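The combination rule can be written as a small function. This is a sketch rather than the project's implementation. The label strings for the InferCNV-derived calls (`"tumor"`, `"normal"`) are assumed names, and treating CopyKAT's `not.defined` cells and any other label (for example, reference cells) as "Ambiguous" is an assumption not stated above.

```python
def consensus_call(copykat_label, infercnv_label):
    """Combine a CopyKAT prediction with an InferCNV-derived call for one cell.

    copykat_label: "aneuploid", "diploid", or "not.defined" (CopyKAT output).
    infercnv_label: "tumor" or "normal" (assumed names for the thresholded calls).
    """
    if copykat_label == "aneuploid" and infercnv_label == "tumor":
        return "tumor"
    if copykat_label == "diploid" and infercnv_label == "normal":
        return "normal"
    # Assumption: disagreements and undefined cells are all left unresolved.
    return "Ambiguous"


# Hypothetical per-cell labels from the two methods.
copykat = ["aneuploid", "diploid", "aneuploid", "not.defined"]
infercnv = ["tumor", "normal", "normal", "tumor"]
combined = [consensus_call(c, i) for c, i in zip(copykat, infercnv)]
print(combined)
```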

I then validated these annotations by looking at the expression of EWS-FLI1 target gene sets obtained from MSigDB and looking for known CNVs that may be present in Ewing sarcoma samples.

Gene set scores

For each of the following gene sets, I calculated the mean expression of all genes in the gene set to obtain a gene set score:

The heatmap below shows the gene set score for all cells (columns) across all gene sets (rows). The annotation bar at the bottom identifies which cells were classified as tumor vs normal cells using our combined classification (intersection between CopyKAT and InferCNV).
Looking at the gene set scores across all cells in the sample, we see that they are higher in tumor cells than in normal cells.
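For clarity, the gene set score used here is just a per-cell mean over the genes in the set. A minimal sketch, with hypothetical gene names and values, and with the assumption that genes missing from the expression data are skipped:

```python
from statistics import mean


def gene_set_score(expr_by_gene, gene_set):
    """Mean expression of the gene set's genes in each cell.

    expr_by_gene maps gene name -> list of expression values, one per cell.
    Genes absent from expr_by_gene are skipped (an assumption of this sketch).
    """
    present = [g for g in gene_set if g in expr_by_gene]
    if not present:
        raise ValueError("none of the genes in the set were measured")
    n_cells = len(expr_by_gene[present[0]])
    return [mean(expr_by_gene[g][i] for g in present) for i in range(n_cells)]


# Hypothetical toy data: three genes measured in two cells.
expr = {"A": [2.0, 0.0], "B": [4.0, 2.0], "C": [9.0, 9.0]}
set_scores = gene_set_score(expr, ["A", "B", "NOT_MEASURED"])
print(set_scores)
```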

Validation of common CNVs

There are a few known copy number variations in Ewing sarcoma:

Gain of Chr8
Gain of Chr12
Gain of Chr1p
Loss of Chr16q

Although these are the most frequent, there are patients who do not have any of these alterations and patients that only have some of these alterations.
See Tirode et al. and Crompton et al.

The heatmap below specifically looks at the proportion of each chromosome with a copy number gain as reported by InferCNV. It appears that there is a gain in Chr8 in tumor cells that is not as prevalent in normal cells. There also appears to be a general increase in chromosomes affected by CNVs in the tumor cells as compared to normal cells, validating that those are in fact tumor cells.
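For reference, the per-cell genomic CNV proportion behind the InferCNV tumor/normal calls (the gene-weighted average described in the approach section) could be computed roughly as follows. This is a sketch under stated assumptions: the function names, the chromosome gene counts, and the per-chromosome proportions are all made up for illustration.

```python
from statistics import mean


def genomic_cnv_proportion(chr_props, genes_per_chr):
    """Average per-chromosome CNV proportions, weighted by gene count."""
    total_genes = sum(genes_per_chr[c] for c in chr_props)
    weighted = sum(chr_props[c] * genes_per_chr[c] for c in chr_props)
    return weighted / total_genes


def call_by_mean(proportions):
    """Call cells above the mean proportion across all cells as tumor."""
    threshold = mean(proportions)
    return ["tumor" if p > threshold else "normal" for p in proportions]


# Hypothetical inputs: two chromosomes and three cells.
genes_per_chr = {"chr1": 200, "chr8": 100}
cells = [
    {"chr1": 0.1, "chr8": 0.7},
    {"chr1": 0.0, "chr8": 0.0},
    {"chr1": 0.3, "chr8": 0.3},
]
props = [genomic_cnv_proportion(c, genes_per_chr) for c in cells]
cnv_calls = call_by_mean(props)
print(cnv_calls)
```

Because the threshold is the mean across all cells, some cells will always fall above it, so this rule cannot report a sample as containing no tumor cells.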

Annotating tumor cells when methods disagree

The above sample was fairly straightforward to annotate, as all methods largely agreed on which cells should be tumor cells. These calls were also validated by looking at EWS-FLI1 target gene sets and known copy number variations. However, this sample appears to be atypical, and the few other samples I have looked at don't seem to have a clear consensus between copy number inference and gene expression-based classification.

The below results are from SCPCS000492/SCPCL000824, but similar results can be seen across multiple other samples. The analysis shown below is in progress and you can follow along with #532.

In the following UMAP, each cell is colored by the classification and each panel is a different classification method that was used.

We then looked at marker gene expression and gene set scores for all cells, where we expect cells classified as tumor cells to have higher expression and scores.

The plots below show the distribution of either marker gene expression or gene set scores, colored by classification, and faceted by the method that was used to classify the cells as either tumor or normal. Reference cells are those that were used as a baseline reference when running InferCNV.

Marker gene expression

Gene set scores

Looking at both the expression of marker genes (top plot) and EWS-FLI1 gene set scores (bottom plot) across all cells, we see:

Tumor cells identified by CopyKAT have lower marker gene expression and gene set scores than normal cells, which is the opposite of what we would expect.
Tumor cells and normal cells identified by InferCNV have similar marker gene expression and gene set scores, both of which are higher than in the cells used as references for running InferCNV. Perhaps many of the "normal" cells should actually be classified as "tumor" here.
Tumor cells classified using marker gene expression only have higher expression of marker genes (as expected, since those genes are used directly for classification) and higher gene set scores.

Another thing we noticed when comparing this sample to the other sample (SCPCS000490/SCPCL000822) is that we don't see a clear bimodal distribution in either marker gene expression or gene set scores. We did see one in the other sample, which made it easy to determine cutoffs for classifying cells as tumor cells.

Additionally, we do not see the same distinct CNV profiles for normal and tumor cells, and we cannot identify any known CNVs in tumor cells. The below heatmap shows the proportion of each chromosome with a copy number gain as calculated by InferCNV. The annotations shown are the classifications obtained from both CopyKAT and InferCNV. It's possible that we are misclassifying many normal cells, but we should also acknowledge that these genomes are very quiet, so it may be hard to distinguish tumor cells using CNVs alone.

The major concern I have with using the results from the CNV methods to classify tumor cells is that those same tumor cells don't show any increase in marker gene expression or gene set scores. My inclination is that most of the cells in these samples are tumor cells and that they should show some elevated level of EWS-FLI1 gene set expression. However, we need to identify a cutoff at which we can reliably call a cell as tumor rather than normal in this sample (and other similar samples).

Next steps

Moving forward, I believe the best approach would be to use information learned from SCPCS000490/SCPCL000822 to classify cells in additional samples. We feel very confident in which cells from SCPCL000822 are tumor cells, so can we take the characteristics of those tumor cells and define tumor cells in other samples?

Below are a few ideas that have been discussed but not fully implemented:

SingleR: SingleR is a method that takes a reference dataset and identifies the top feature genes for each cell type in that reference dataset. The expression of those genes in the reference is then correlated with the expression of the same genes in the query sample to identify which cells are likely to correspond to a given cell type in the reference. We could use SCPCL000822 as a reference to identify tumor cells in the difficult sample (SCPCL000824). I have already tried this, classifying cells only as tumor or normal (see plot below). However, it's possible that by providing only two cell types we are overfitting the model. The next thing I would like to do is use the multi-reference feature in SingleR and provide both a reference with possible normal cell types (from celldex) and SCPCL000822 as a reference.

Use expression cutoffs from SCPCL000822 to classify tumor cells in SCPCL000824. Specifically, if we look at the raw sum of all marker genes expressed and determine a cutoff for SCPCL000822, can we apply that same cutoff to SCPCL000824?
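As a rough illustration of the cutoff-transfer idea, the sketch below derives a cutoff from a labeled reference sample and applies it to a second sample. How the cutoff should actually be chosen is still an open question; the midpoint between the mean tumor and mean normal raw sums is a placeholder rule chosen for this example only. All function names and values are hypothetical.

```python
from statistics import mean


def cutoff_from_reference(ref_sums, ref_labels):
    """Placeholder rule: midpoint between mean tumor and mean normal raw sums."""
    tumor = [s for s, lab in zip(ref_sums, ref_labels) if lab == "tumor"]
    normal = [s for s, lab in zip(ref_sums, ref_labels) if lab == "normal"]
    return (mean(tumor) + mean(normal)) / 2


def apply_cutoff(query_sums, cutoff):
    """Call query cells with a raw marker sum above the cutoff as tumor."""
    return ["tumor" if s > cutoff else "normal" for s in query_sums]


# Hypothetical raw marker gene sums: a labeled reference and an unlabeled query.
ref_sums = [10.0, 12.0, 2.0, 4.0]
ref_labels = ["tumor", "tumor", "normal", "normal"]
query_sums = [8.0, 1.0, 7.0]

cutoff = cutoff_from_reference(ref_sums, ref_labels)
query_calls = apply_cutoff(query_sums, cutoff)
print(cutoff, query_calls)
```

Raw sums are only comparable between samples if both were normalized the same way, so any transferred cutoff would need to be checked against differences in sequencing depth and library composition.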

I would appreciate any and all feedback on how someone might approach this issue or thoughts on additional analysis that might lead us in the right direction.

danhtruong · 2024-06-20T20:50:48Z

This is an incredible effort and mirrors the challenges that many of us face when investigating single-cell data. I applaud your detailed methodology and critical reflection at each step.

Reviewing your analyses, one thing that may be critical is that Ewing sarcoma tends to have quieter genomes, which you acknowledged could be a factor in why the classifications from the CNV methods do not agree. I am more inclined to use gene expression to classify the tumor cells, given the presence of the pathognomonic EWS::FLI1 fusion protein.

In terms of methodology, perhaps instead of looking at raw sum, you can look at clusters and investigate the logFC for the EWS::FLI1 marker genes. It should be quite high compared to cell clusters that contain normal cells. Of course the other assumption is that you do have normal cells in your data. Were the other cells in SCPCL000824 identified as endothelial or other immune cells? Sorry if there were other analyses elsewhere that I missed.
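One way to sketch the cluster-level comparison: for each cluster, compute the log2 fold change of mean marker expression inside the cluster versus all other cells. This is only an illustration, not a differential expression test. The function name, the pseudocount of 1, and the toy values are assumptions, and the input is taken to be non-negative expression (for example, mean marker expression per cell) rather than z-scores.

```python
from math import log2
from statistics import mean


def cluster_log2fc(marker_expr, clusters, pseudocount=1.0):
    """log2 fold change of mean marker expression, each cluster vs. the rest.

    marker_expr: non-negative per-cell marker expression values.
    clusters: cluster label for each cell, in the same order.
    """
    labels = sorted(set(clusters))
    if len(labels) < 2:
        raise ValueError("need at least two clusters to compare")
    result = {}
    for label in labels:
        inside = [e for e, c in zip(marker_expr, clusters) if c == label]
        outside = [e for e, c in zip(marker_expr, clusters) if c != label]
        ratio = (mean(inside) + pseudocount) / (mean(outside) + pseudocount)
        result[label] = log2(ratio)
    return result


# Hypothetical toy data: four cells in two clusters.
marker_expr = [7.0, 7.0, 1.0, 1.0]
clusters = ["c1", "c1", "c2", "c2"]
log_fc = cluster_log2fc(marker_expr, clusters)
print(log_fc)
```

A cluster with a clearly positive value relative to the others would be a candidate tumor cluster, provided at least some of the remaining cells really are normal.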

The SingleR method may be adequate, but with a limited reference like SCPCL000822, it may overfit as you say. One thing you may come across is heterogeneity in EWS::FLI1 target expression low and high, as described in Wrenn et al. Another issue is whether there are Ewing sarcoma cells at all or the case was misdiagnosed. For instance, the tumor could have been any number of undifferentiated round cell diagnoses.

patelgrp · 2024-06-21T13:27:30Z

I want to echo Danh and commend you on a very thorough job! Here are a few thoughts:

Like Danh, we have struggled with malignant cell inference based off CNVs alone. CopyKAT, in particular, seems to underperform with low-mutation burden tumors (like Ewing). That's not a knock on CopyKAT - it really was intended to identify aneuploid cells, and it does a good job in tumors with lots of copy-number alterations. InferCNV is quite slow, which makes it a challenge to optimize parameters. My group has transitioned to using tumor inference methods like Numbat, which use both single-nucleotide variants and inferred copy number variation to predict which cells are malignant versus non-malignant. In our hands, the allele-based calling is very specific but has low sensitivity (i.e., if a Leiden cluster has many cells with SNVs, you can be very confident that that cluster is malignant).
Danh mentions normalizing the raw sum for the EWS::FLI1 marker gene expression. I agree. Seurat has a convenient "AddModuleScore" function - it's a little murky what the command does but the math checks out. Recently, we've been using UCell in rhabdomyosarcoma, and have found it to be very robust.
For the Ewing samples submitted from the Dyer group, we should've provided EWS-FLI1 fusion status in the metadata (let me know if that was missed).

allyhawkins (Maintainer, Author) · 2024-06-21T14:48:53Z

@danhtruong and @patelgrp thank you so much for your input! This is really helpful and valuable.

In terms of methodology, perhaps instead of looking at raw sum, you can look at clusters and investigate the logFC for the EWS::FLI1 marker genes. It should be quite high compared to cell clusters that contain normal cells. Of course the other assumption is that you do have normal cells in your data. Were the other cells in SCPCL000824 identified as endothelial or other immune cells? Sorry if there were other analyses elsewhere that I missed.

@danhtruong this is a great question! We are definitely assuming that at least some portion of the cells are normal cells, but I also expect that most cells are tumor cells and that the proportion of normal cells will vary per sample. I think SCPCL000822 has a higher proportion of normal cells, so the differences between the two populations are clearer, but my inclination is that SCPCL000824 is mostly tumor cells. Included in all samples on the Portal are cell type annotations determined by SingleR and CellAssign. However, we use normal tissue as references for both of those methods, so the results there aren't going to be perfect. I am considering any cells that were annotated as the same normal cell type by both methods to be more likely normal than tumor. For example, in both of these samples there was a group of endothelial cells identified by both SingleR and CellAssign.

As for looking at expression within clusters, I do think that's a great idea. We would need some time to make sure we have the right parameters for clustering as we have not done that with this dataset yet. But I do think that might be helpful in identifying which groups of cells are more likely to be tumor than normal.

The SingleR method may be adequate, but with a limited reference like SCPCL000822, it may overfit as you say. One thing you may come across is heterogeneity in EWS::FLI1 target expression low and high, as described in Wrenn et al. Another issue is whether there are Ewing sarcoma cells at all or the case was misdiagnosed. For instance, the tumor could have been any number of undifferentiated round cell diagnoses

I totally agree about the heterogeneity of the EWS-FLI1 target expression! This is actually why my initial analysis didn't rely on using gene set scores because I was worried we would miss tumor cells that have low EWS-FLI1 expression. Considering this, are there marker genes that you would trust to be consistently expressed independently of EWS-FLI1? I had previously used things like CD99, but that gene is expressed at a fairly high level in normal cells making it hard to differentiate the two populations. Any opinions on marker genes that you might trust would be really helpful.

Like Danh, we have struggled with malignant cell inference based off CNVs alone. CopyKAT, in particular, seems to underperform with low-mutation burden tumors (like Ewing). That's not a knock on CopyKAT - it really was intended to identify aneuploid cells, and it does a good job in tumors with lots of copy-number alterations. InferCNV is quite slow, which makes it a challenge to optimize parameters. My group has transitioned to using tumor inference methods like Numbat, which use both single-nucleotide variants and inferred copy number variation to predict which cells are malignant versus non-malignant. In our hands, the allele-based calling is very specific but has low sensitivity (i.e., if a Leiden cluster has many cells with SNVs, you can be very confident that that cluster is malignant).

@patelgrp thanks for this suggestion! I had not heard of Numbat, so I will definitely look into it and see how that performs here.

Danh mentions normalizing the raw sum for the EWS::FLI1 marker gene expression. I agree. Seurat has a convenient "AddModuleScore" function - it's a little murky what the command does but the math checks out. Recently, we've been using UCell in rhabdomyosarcoma, and have found it to be very robust.

Yes, I have the same feeling about the AddModuleScore function, which is part of the reason I had not chosen to use it. I will check out UCell though. I did try and play around with AUCell, but without a clear bimodal distribution in the expression of the gene sets, it did not seem to work well.

For the Ewing samples submitted from the Dyer group, we should've provided EWS-FLI1 fusion status in the metadata (let me know if that was missed).

This would be really helpful! We don't have that information so any additional information that you may want to share about these samples would be great.
