Split out notebook for individual cell predictions visualization #88

jaclyn-taroni · 2024-12-18T18:47:31Z

We now are predicting labels for individual cells on two datasets: Smart-seq2 and 10X. We should split the individual cell portions of analysis_notebooks/pseudobulk_and_single_cells.Rmd into a new notebook that visualizes the results.

jaclyn-taroni · 2024-12-24T14:26:36Z

I'm posting some preliminary results from the 10X dataset so I can write down some thoughts about next steps. The code is available on this branch: jaclyn-taroni/88-plot-single-cells

Here's a plot looking at the percentage of individual cells labeled for each cell type:

The Smart-seq2 results (left-hand side) show that most cells are labeled "correctly" for that cell type. On the right-hand side, the 10X dataset shows that many cells, regardless of the true subgroup, are labeled as G3. This is particularly true for the RF model.
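For reference, the kind of percentage summarized in the plot can be sketched as follows. This is a minimal illustration only, not the notebook's code: it assumes each cell is available as a (true subgroup, predicted label) pair, and the function name `percent_labeled` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def percent_labeled(cells):
    """Percentage of cells assigned each predicted label, per true subgroup.

    `cells` is an iterable of (true_subgroup, predicted_label) pairs,
    one pair per individual cell (an assumed input format).
    """
    counts = defaultdict(Counter)
    for true_subgroup, predicted_label in cells:
        counts[true_subgroup][predicted_label] += 1

    percents = {}
    for true_subgroup, label_counts in counts.items():
        total = sum(label_counts.values())
        percents[true_subgroup] = {
            label: 100.0 * n / total for label, n in label_counts.items()
        }
    return percents


# Toy data, made up for illustration: most SHH cells are called G3.
toy_cells = [
    ("SHH", "SHH"), ("SHH", "G3"), ("SHH", "G3"), ("SHH", "G3"),
    ("G4", "G4"), ("G4", "G3"),
]
label_percents = percent_labeled(toy_cells)
```

Grouping by the true subgroup of the sample makes a pattern like "many cells are labeled G3 regardless of subgroup" show up as a large G3 percentage in every group.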

It is worth looking into why this might be. Like many things, there might be a technical or biological explanation. Some thoughts on what we might look into:

Technical

- Look at what genes are included in the pair-based rules.
  - Are they different between the two models?
  - What does the expression of genes included in rules between subgroups look like (e.g., do the genes in the SHH rules tend not to be detected in the 10X dataset)?
  - Look at gene lengths for genes used in rules between subgroups. This thought is not fully formed yet, but we use TPM for both and I believe the 10X data set should have less of a gene length bias.
- Look at characteristics of the cells broken down by label (e.g., UMIs detected, genes detected, etc.). It seems plausible to me that there is some "quality floor" that exists for cells to which one could apply these models.
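Two of the technical checks above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the project's code: expression is assumed to be a mapping from gene to per-cell values, per-cell QC is assumed to be a list of records with `label` and `genes_detected` fields, and the gene names, field names, and function names are all hypothetical.

```python
from statistics import median


def detection_rate(expression, genes):
    """Fraction of cells in which each gene has a nonzero value.

    `expression` maps gene -> list of per-cell values (assumed format).
    Genes missing from the mapping are reported as never detected.
    """
    rates = {}
    for gene in genes:
        values = expression.get(gene, [])
        if values:
            rates[gene] = sum(1 for v in values if v > 0) / len(values)
        else:
            rates[gene] = 0.0
    return rates


def median_genes_detected_by_label(cells):
    """Median number of genes detected, grouped by predicted label."""
    by_label = {}
    for cell in cells:
        by_label.setdefault(cell["label"], []).append(cell["genes_detected"])
    return {label: median(values) for label, values in by_label.items()}


# Toy inputs, made up for illustration.
toy_expression = {"GENE_A": [0, 3, 1, 2], "GENE_B": [0, 0, 0, 5]}
rule_gene_rates = detection_rate(toy_expression, ["GENE_A", "GENE_B", "GENE_C"])

toy_cells_qc = [
    {"label": "G3", "genes_detected": 500},
    {"label": "G3", "genes_detected": 700},
    {"label": "SHH", "genes_detected": 2000},
]
median_by_label = median_genes_detected_by_label(toy_cells_qc)
```

If genes from one subgroup's rules had consistently low detection rates in the 10X data, or if cells with a given label had markedly fewer genes detected, that would point toward a technical explanation such as the "quality floor" idea.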

Biological

The first thing I'd want to explore is the cell type identity of cells labeled G3 vs. others in the 10X dataset. We can get cell type labels from UCSC Cell Browser (and that's added in the branch mentioned above ☝🏻). From a cursory look, most cells are labeled malignant without a "finer" cell label. However, the cell metadata from UCSC also includes cluster labels. We can see if we can align this information with what is included in the publication to get more information about cell state.

Tagging @envest for visibility.

jaclyn-taroni · 2024-12-24T16:10:00Z

I looked at the difference between the max subgroup score and the second highest subgroup score within an individual cell. (This is an idea I am toying with as an alternative to "confidence" [max/total score].) I was interested in whether the G3 cell type labels in the 10X dataset were "closer calls" (i.e., the differences were smaller) in samples from other subgroups. That doesn't necessarily seem to be true in the one WNT 10X sample or the G4 10X samples for the RF model.
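The two per-cell quantities described above can be written out as a small sketch. It assumes each cell's subgroup scores are available as a mapping from subgroup to a non-negative score; the function names and the example scores are hypothetical.

```python
def score_margin(scores):
    """Difference between the highest and second-highest subgroup score."""
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need scores for at least two subgroups")
    return ranked[0] - ranked[1]


def confidence(scores):
    """Max subgroup score divided by the total score ("confidence")."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("total score is zero")
    return max(scores.values()) / total


# Example scores for a single cell, made up for illustration.
cell_scores = {"G3": 0.5, "G4": 0.25, "SHH": 0.125, "WNT": 0.125}
margin = score_margin(cell_scores)      # 0.5 - 0.25
cell_confidence = confidence(cell_scores)  # 0.5 / 1.0
```

The margin depends only on the top two scores, while max/total also depends on how the remaining score is spread across the other subgroups, so the two can rank cells differently.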

jaclyn-taroni self-assigned this Dec 18, 2024

jaclyn-taroni mentioned this issue Dec 24, 2024

Track down cell type labels for single-cell RNA-seq datasets #89

Open
