Split out notebook for individual cell predictions visualization #88

Open
jaclyn-taroni opened this issue Dec 18, 2024 · 2 comments

@jaclyn-taroni
Member

We are now predicting labels for individual cells in two datasets: Smart-seq2 and 10X. We should split the individual cell portions of analysis_notebooks/pseudobulk_and_single_cells.Rmd into a new notebook that visualizes the results.

jaclyn-taroni self-assigned this Dec 18, 2024
@jaclyn-taroni
Member Author

I'm posting some preliminary results from the 10X dataset so I can write down some thoughts about next steps. The code is available on this branch: jaclyn-taroni/88-plot-single-cells

Here's a plot looking at the percentage of individual cells labeled for each cell type:

[Image: Screenshot 2024-12-24 at 9 07 38 AM]

The Smart-seq2 results (left-hand side) show that, for the most part, cells are labeled "correctly" for that cell type. On the right-hand side, the 10X dataset shows that many cells, regardless of the true subgroup, are labeled as G3. This is particularly true for the RF model.
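For reference, the percentages behind a plot like this can be tabulated per dataset, model, and true subgroup. This is a minimal sketch in Python for illustration only (the notebook itself is R Markdown); the field names `dataset`, `model`, `true_subgroup`, and `predicted_label` and the toy records are assumptions, not the branch's actual columns or results.

```python
from collections import Counter, defaultdict


def label_percentages(cells):
    """Percentage of cells receiving each predicted label,
    grouped by (dataset, model, true subgroup)."""
    counts = defaultdict(Counter)
    for cell in cells:
        key = (cell["dataset"], cell["model"], cell["true_subgroup"])
        counts[key][cell["predicted_label"]] += 1

    percentages = {}
    for key, label_counts in counts.items():
        total = sum(label_counts.values())
        percentages[key] = {
            label: 100.0 * n / total for label, n in label_counts.items()
        }
    return percentages


# Hypothetical toy records, not real results.
toy_cells = [
    {"dataset": "10X", "model": "RF", "true_subgroup": "SHH", "predicted_label": "G3"},
    {"dataset": "10X", "model": "RF", "true_subgroup": "SHH", "predicted_label": "G3"},
    {"dataset": "10X", "model": "RF", "true_subgroup": "SHH", "predicted_label": "SHH"},
    {"dataset": "10X", "model": "RF", "true_subgroup": "SHH", "predicted_label": "G3"},
]
result = label_percentages(toy_cells)
print(result)
```

Grouping by the true subgroup first makes the "labeled as G3 regardless of true subgroup" pattern easy to read off as one row per group.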

It is worth looking into why this might be. As with many things, there might be a technical or a biological explanation. Some thoughts on what we might look into:

Technical

  • Look at what genes are included in the pair-based rules.
    • Are they different between the two models?
    • What does the expression of the genes included in each subgroup's rules look like (e.g., do the genes in the SHH rules tend not to be detected in the 10X dataset)?
    • Look at gene lengths for the genes used in each subgroup's rules. This thought is not fully formed yet, but we use TPM for both and I believe the 10X dataset should have less of a gene length bias.
  • Look at characteristics of the cells broken down by label (e.g., UMIs detected, genes detected). It seems plausible to me that there is some "quality floor" that cells need to meet before one could apply these models to them.
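The technical checks above could be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions (the notebook itself is R Markdown): the helper names, the input shapes, the `genes_detected` field, and the placeholder gene names are all hypothetical and do not reflect how the models actually store their rules.

```python
from statistics import median


def compare_rule_genes(genes_model_a, genes_model_b):
    """Split the genes used in two models' rules into shared and model-specific sets."""
    a, b = set(genes_model_a), set(genes_model_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


def detection_rate(expression, genes):
    """Fraction of cells in which each gene has a nonzero value.

    expression: dict mapping cell ID -> dict mapping gene -> expression value.
    """
    n_cells = len(expression)
    if n_cells == 0:
        raise ValueError("expression must contain at least one cell")
    return {
        gene: sum(1 for values in expression.values() if values.get(gene, 0) > 0) / n_cells
        for gene in genes
    }


def median_metric_by_label(cells, metric):
    """Median of a per-cell QC metric (e.g., genes detected) for each predicted label."""
    by_label = {}
    for cell in cells:
        by_label.setdefault(cell["predicted_label"], []).append(cell[metric])
    return {label: median(values) for label, values in by_label.items()}


# Hypothetical placeholder inputs, not real genes or results.
overlap = compare_rule_genes(["GENE_A", "GENE_B"], ["GENE_B", "GENE_C"])
rates = detection_rate(
    {"cell1": {"GENE_A": 5.0, "GENE_B": 0.0}, "cell2": {"GENE_A": 0.0, "GENE_B": 0.0}},
    ["GENE_A", "GENE_B"],
)
qc = median_metric_by_label(
    [
        {"predicted_label": "G3", "genes_detected": 500},
        {"predicted_label": "G3", "genes_detected": 700},
        {"predicted_label": "SHH", "genes_detected": 2000},
    ],
    "genes_detected",
)
print(overlap, rates, qc)
```

A low detection rate for one subgroup's rule genes, or a clearly lower median QC metric for G3-labeled cells, would each point toward a technical explanation.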

Biological

The first thing I'd want to explore is the cell type identity of cells labeled G3 vs. others in the 10X dataset. We can get cell type labels from UCSC Cell Browser (and that's added in the branch mentioned above ☝🏻). From a cursory look, most cells are labeled malignant without a "finer" cell label. However, the cell metadata from UCSC also includes cluster labels. We can see if we can align this information with what is included in the publication to get more information about cell state.
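One way to start on this is to join the predicted labels to the cluster labels by cell ID and ask what fraction of each cluster is labeled G3. The sketch below is in Python for illustration only (the notebook itself is R Markdown); the dictionaries keyed by cell ID, the cluster names, and the function name are assumptions, not the format of the UCSC Cell Browser metadata.

```python
from collections import Counter


def fraction_labeled_by_cluster(predicted_labels, cluster_labels, label="G3"):
    """Fraction of cells in each cluster that received `label`.

    predicted_labels: dict mapping cell ID -> predicted label.
    cluster_labels: dict mapping cell ID -> cluster label from the cell metadata.
    Cells without a prediction are skipped.
    """
    totals = Counter()
    hits = Counter()
    for cell_id, cluster in cluster_labels.items():
        if cell_id not in predicted_labels:
            continue
        totals[cluster] += 1
        if predicted_labels[cell_id] == label:
            hits[cluster] += 1
    return {cluster: hits[cluster] / totals[cluster] for cluster in totals}


# Hypothetical toy inputs, not real metadata or results.
predicted = {"c1": "G3", "c2": "G3", "c3": "SHH", "c4": "G3"}
clusters = {
    "c1": "cluster_1",
    "c2": "cluster_1",
    "c3": "cluster_2",
    "c4": "cluster_2",
    "c5": "cluster_2",  # no prediction available, so it is skipped
}
fractions = fraction_labeled_by_cluster(predicted, clusters)
print(fractions)
```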

Tagging @envest for visibility.

@jaclyn-taroni
Member Author

I looked at the difference between the max subgroup score and the second-highest subgroup score within an individual cell. (This is an idea I am toying with as an alternative to "confidence" [max/total score].) I was interested in whether the G3 cell type labels in the 10X dataset were "closer calls" (i.e., the differences were smaller) in samples from other subgroups. That doesn't necessarily seem to be true in the one WNT 10X sample or the G4 10X samples for the RF model.

[Image: Screenshot 2024-12-24 at 11 02 29 AM]
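To make the two per-cell quantities concrete: the margin is the top score minus the runner-up, while "confidence" is the top score divided by the sum of all scores. A minimal Python sketch for illustration (the notebook itself is R Markdown), assuming each cell's scores come as a dictionary of non-negative subgroup scores; the function names and toy values are hypothetical.

```python
def score_margin(scores):
    """Difference between the highest and second-highest subgroup score."""
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need scores for at least two subgroups")
    return ranked[0] - ranked[1]


def confidence(scores):
    """Highest subgroup score divided by the total of all subgroup scores."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("scores must not sum to zero")
    return max(scores.values()) / total


# Hypothetical toy scores for one cell, not real results.
toy_scores = {"G3": 0.5, "G4": 0.25, "SHH": 0.125, "WNT": 0.125}
predicted_label = max(toy_scores, key=toy_scores.get)
margin = score_margin(toy_scores)
conf = confidence(toy_scores)
print(predicted_label, margin, conf)
```

The margin only depends on the top two scores, so two cells with the same confidence can still differ in how close the call was.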
