Methods for cell type annotation and AnnData conversion #61

Merged
allyhawkins merged 10 commits into main from allyhawkins/cell-type-methods
Mar 7, 2024

allyhawkins
Member

Closes #42
Closes #44
Stacked on #58

This PR adds in the methods section for cell type annotation and then conversion of all objects to AnnData objects.
For the cell type annotation section, what do we think about this level of detail? I included information about the delta median statistic since that's something we calculate. Are there other details regarding either building the references or running cell typing that I'm missing?

Member

@jashapiro jashapiro left a comment

This looks good overall, but I think we need a bit more detail about the references, if that does not appear elsewhere. I also have a few smaller comments, and a clarification about CellAssign scores.

Comment on lines 115 to 116
Organ-specific references were built using all cell types in a specified organ listed in `PanglaoDB`.
References for each ScPCA project were assigned based on the tissue from which the sample was obtained.
Member

I feel like we might want a bit more detail here about how we made some of our decisions, and about the fact that we were often combining organs?

All merged `SingleCellExperiment` objects were converted to `AnnData` objects and saved as `.hdf5` files.
If a merged `SingleCellExperiment` object contained any ADT data, the RNA and ADT data were exported and saved separately as RNA (`_rna.hdf5`) and ADT (`_adt.hdf5`) files.
In contrast, if a merged `SingleCellExperiment` object contained HTO data due to the presence of any multiplexed libraries in the merged object, the HTO data was removed from the `SingleCellExperiment` object and not included in the exported `AnnData` object.
Member

Looking at this, I kind of feel like we probably should just not merge the multiplexed data... The logic written out like this seems very strange.

Member Author

Do you mean not include multiplexed libraries in any merged objects? Because we have both regular libraries and multiplexed libraries in the same project so we would need to adjust the workflow to remove any multiplexed libraries before merging.

Member

It is really only one project, right? I kind of feel like we could just skip the whole thing in that case. (This is a discussion largely for somewhere else)

Member Author

Noting that I filed https://github.com/AlexsLemonade/ScPCA-admin/issues/832 to walk through some options around this. I think we leave it for now.
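
The export rules quoted at the top of this thread can be summarized in a small sketch. This is not the workflow's code: the function name and the boolean flags are hypothetical, and the bare `.hdf5` suffix for objects without ADT data is an assumption, since the quoted text only names the `_rna.hdf5` and `_adt.hdf5` suffixes.

```python
from typing import List


def planned_anndata_suffixes(has_adt: bool, has_hto: bool) -> List[str]:
    """Return the file suffixes written for one merged object (hypothetical helper)."""
    # HTO data is removed before export, so it never adds an output file.
    _ = has_hto
    if has_adt:
        # RNA and ADT data are exported as two separate files.
        return ["_rna.hdf5", "_adt.hdf5"]
    # Assumption: without ADT data, a single `.hdf5` file is written.
    return [".hdf5"]


# A merged object with ADT data and multiplexed (HTO) libraries
# still yields only the RNA and ADT files.
print(planned_anndata_suffixes(has_adt=True, has_hto=True))
```

Written this way, the HTO flag has no effect on the outputs, which is the asymmetry the review comments call strange.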

Base automatically changed from allyhawkins/spatial-bulk-merged-methods to main March 4, 2024 19:50

github-actions bot commented Mar 4, 2024

Click the link below to download the manuscript build as a ZIP file.
This build is associated with commit f871952.

Manuscript build

github-actions bot commented Mar 4, 2024

Click the link below to download the manuscript build as a ZIP file.
This build is associated with commit e766aa4.

Manuscript build

@allyhawkins
Member Author

@jashapiro I added some wording regarding why we picked the BlueprintEncodeData reference and some more information on building the organ-specific references. It was a little hard without explaining the organs used in every reference. We could be really specific and include a table with all references used and all organs that were used to create each reference?

I'm also not sure how much detail you want on the celldex reference. I added a little bit and related it back to the delta median statistic.

@allyhawkins allyhawkins requested a review from jashapiro March 4, 2024 20:38
Member

@jashapiro jashapiro left a comment

These updates look good, but I think I probably want someone else to weigh in on my first comment here. In particular, how much of the cell typing journey do we want to present in this paper? Do we want to comment about how difficult it is on a compendia-level basis, and particularly for cancer cells? I think this is probably something worth highlighting at the very least in the discussion, but we also might want a bit of our little benchmarks in this paper.

That said, it opens an avenue for critique/suggestions of more experiments that maybe we don't want to highlight.

Comment on lines 115 to 117
The delta median statistic is helpful in evaluating how confident `SingleR` is in assigning each cell to a specific cell type, where low delta median values indicate ambiguous assignments and high delta median values indicate confident assignments.
To identify the most appropriate reference to use with `SingleR`, we annotated a handful of samples across multiple disease types with all human-specific references available in the `celldex` package.
`BlueprintEncodeData` had the most consistently high delta median statistic distribution across samples from multiple disease types and was chosen as the reference to use for all ScPCA samples.
Member

I too am not sure how much detail we want here! I think this is okay as far as it goes: I'm not sure if this maybe should actually be a result though? Probably doesn't need to be, but in some ways I think evaluating the applicability of cell typing methods to compendia is something worth talking a bit about.

Member Author

You mean including a figure around this? I already had this thought, but wanted to wait until we wrote up the text to decide what exactly to include. See AlexsLemonade/scpca-paper-figures#41.

So maybe we do want to include a supplemental figure that looks at the delta median statistic across a few samples and a few references.

Member Author

I'm going to tag in @jaclyn-taroni to take a look at this and see what she thinks about including a figure showing reference comparisons and about the level of detail presented here. Just noting that this figure might look a bit messy, but we could make one and then make a decision on if it will help prove a point or just bring more questions.

Member

Yes, I think it makes sense to try a supplemental figure showing reference comparisons, which is discussed in the cell type annotation section of the results. I propose that we split the "Annotation cell types" section into two subsections: evaluating the methods themselves and the workflow part.

Member Author

> Yes, I think it makes sense to try a supplemental figure showing reference comparisons, which is discussed in the cell type annotation section of the results. I propose that we split the "Annotation cell types" section into two subsections: evaluating the methods themselves and the workflow part.

@jaclyn-taroni do you mean creating two sections in the results or here in the methods?

Member

The one in results that uses this header

Member

Without the typo, so "Annotating cell types"
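
For readers unfamiliar with the delta median statistic discussed in this thread, here is a minimal sketch of the idea. It assumes the statistic is the assigned label's score minus the median score across all labels for that cell, and that the assigned label is simply the highest-scoring one (ignoring any fine-tuning step). The score values and the function name are made up for illustration and are not `SingleR` output.

```python
from statistics import median
from typing import Dict


def delta_median(scores: Dict[str, float]) -> float:
    """Best label score minus the median score across all labels for one cell."""
    # Assumption: the assigned label is the one with the highest score.
    best = max(scores.values())
    return best - median(scores.values())


# Illustrative per-cell scores (not real data).
confident_cell = {"label_a": 0.82, "label_b": 0.31, "label_c": 0.28}
ambiguous_cell = {"label_a": 0.52, "label_b": 0.50, "label_c": 0.49}

# The confident cell has a much larger delta median than the ambiguous one.
print(delta_median(confident_cell), delta_median(ambiguous_cell))
```

Comparing the distribution of this value across cells for each candidate reference is the kind of comparison the proposed supplemental figure would show.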


For `CellAssign`, marker gene references were created using the marker gene lists available on `PanglaoDB` [@doi:10.1093/database/baz046].
Organ-specific references were built using all cell types in a specified organ listed in `PanglaoDB` to accommodate all ScPCA projects encompassing a variety of disease and tissue types.
If a set of disease types in a given project encompassed cells that may be present in multiple organ groups, multiple organs were combined - e.g., for sarcomas that appear in bone or soft tissue, we created a reference containing bone, connective tissue, smooth muscle, and immune cells.
Member

I think this is a good level of detail for the text, but we might want a supplemental table of the organ sets we used.

Member Author

I think that's a good idea so I'm going to file an issue regarding this in the figures repo.
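
The organ-combining step described in this thread can be sketched as follows. This is only an illustration under stated assumptions: the rows, gene names, and cell type names are placeholders rather than `PanglaoDB` entries, the organ labels are taken from the sarcoma example in the quoted text, and the function name is hypothetical. The output is a binary gene-by-cell-type table, which is the general shape of marker reference that `CellAssign` works with.

```python
from typing import Dict, Iterable, List, Tuple

# Placeholder rows: (gene, cell type, organ). Not real PanglaoDB content.
MARKER_ROWS: List[Tuple[str, str, str]] = [
    ("gene_1", "cell_type_a", "bone"),
    ("gene_2", "cell_type_b", "connective tissue"),
    ("gene_2", "cell_type_c", "smooth muscle"),
    ("gene_3", "cell_type_d", "immune"),
    ("gene_4", "cell_type_e", "liver"),
]


def build_reference(
    rows: Iterable[Tuple[str, str, str]], organs: Iterable[str]
) -> Dict[str, Dict[str, int]]:
    """Build a binary gene-by-cell-type table from all cell types in `organs`."""
    wanted = set(organs)
    # Keep every (gene, cell type) pair whose organ is in the requested set.
    pairs = {(gene, cell) for gene, cell, organ in rows if organ in wanted}
    genes = sorted({gene for gene, _ in pairs})
    cell_types = sorted({cell for _, cell in pairs})
    # 1 if the gene is a marker for the cell type, otherwise 0.
    return {g: {c: int((g, c) in pairs) for c in cell_types} for g in genes}


# Combined organ set from the sarcoma example in the quoted text.
sarcoma_reference = build_reference(
    MARKER_ROWS, ["bone", "connective tissue", "smooth muscle", "immune"]
)
print(sarcoma_reference)
```

A supplemental table of organ sets, as suggested above, would amount to listing the `organs` argument used for each reference.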

Co-authored-by: Joshua Shapiro <[email protected]>

github-actions bot commented Mar 5, 2024

Click the link below to download the manuscript build as a ZIP file.
This build is associated with commit ab59eca.

Manuscript build

Member

@jaclyn-taroni jaclyn-taroni left a comment

I have already returned my comment about doing AlexsLemonade/scpca-paper-figures#41 and moving text around choices to the results. Let's remove that from this PR for now and file a ticket.

I agree with leaving the merged section as is for now.

:shipit:

Comment on lines 115 to 117
The delta median statistic is helpful in evaluating how confident `SingleR` is in assigning each cell to a specific cell type, where low delta median values indicate ambiguous assignments and high delta median values indicate confident assignments.
To identify the most appropriate reference to use with `SingleR`, we annotated a handful of samples across multiple disease types with all human-specific references available in the `celldex` package.
`BlueprintEncodeData` had the most consistently high delta median statistic distribution across samples from multiple disease types and was chosen as the reference to use for all ScPCA samples.
Member

I recommend taking this out for now and filing an issue, blocked by AlexsLemonade/scpca-paper-figures#41, to move talking about picking a cell type annotation method and references into results.

@allyhawkins
Member Author

I removed the delta median discussion and filed #70. @jashapiro did you want to take another look at this or are we good to go?
We can revisit the merged objects after our discussion today.

github-actions bot commented Mar 6, 2024

Click the link below to download the manuscript build as a ZIP file.
This build is associated with commit 91099b7.

Manuscript build

github-actions bot commented Mar 7, 2024

Click the link below to download the manuscript build as a ZIP file.
This build is associated with commit 1876413.

Manuscript build

@allyhawkins allyhawkins merged commit 1f262a0 into main Mar 7, 2024
1 check passed
@allyhawkins allyhawkins deleted the allyhawkins/cell-type-methods branch March 7, 2024 14:59
Successfully merging this pull request may close these issues.

Conversion to AnnData
Cell type annotation methods
3 participants