Draft of results - portal overview #26

Merged
merged 11 commits into from
Feb 28, 2024
51 changes: 32 additions & 19 deletions content/03.results.md
@@ -2,25 +2,37 @@

## The Single-cell Pediatric Cancer Atlas Portal

1. History and overview of the Portal
- In 2022, the Childhood Cancer Data Lab launched the Single-cell Pediatric Cancer Atlas (ScPCA) Portal to make uniformly processed, summarized single-cell and single-nuclei RNA-seq data and de-identified metadata available for download
- The Portal currently holds X amount of samples from X amount of tumor types
- Data available on the Portal was obtained using two mechanisms - accepting raw data from ALSF-funded investigators and investigators who used our open-source pipeline to produce summarized gene expression data for inclusion on the portal.
- In addition to providing summarized gene expression data, we collect a core set of metadata that is provided on the Portal for all samples, including age, sex, diagnosis, subdiagnosis (if applicable), tissue location, and disease stage.
- All metadata provided by the submitter is reviewed and standardized as much as possible. We also use ontology IDs where possible.
- Fig. 1A shows how many samples we have from each type of tumor. For each diagnosis, we also indicate what proportion of the samples come from each disease stage (e.g., initial diagnosis, recurrence, post-mortem).
- The samples obtained on the portal are mostly from patient tumors, although some are from patient-derived xenografts and human cell lines
- In addition to single-cell and single-nuclei RNA-seq, many samples have associated bulk RNA-seq, ADT data (CITE-seq), cell hashing, or spatial transcriptomics.
- Fig. 1B summarizes the total number of samples that are single-cell vs. single-nuclei. Additionally, we show how many of the samples on the portal also have either bulk, CITE, cell hashing, or spatial data.
- Supplemental Table 1 shows a breakdown of how many of each modality is found in each project.

2. Obtaining additional project information
- On the Portal, samples are organized by project. Each project is a collection of similar samples from a single investigator.
- To select projects of interest, users can filter based on diagnosis, modality included, single-cell or single-nuclei and 10X version. Additionally, users will be able to filter based on if the project includes cell line samples or xenografts.
- A summary of each project, including a list of samples found in each project, is displayed on the Portal.
- Fig. 1C shows an example of this summary, which includes an abstract, links to any external information about the project such as associated publications, and links to external repositories where data may be stored, such as SRA or GEO.
- If a project includes bulk, CITE, spatial, or multiplexing, this will also be indicated on the project card.

In March of 2022, the Childhood Cancer Data Lab launched the Single-cell Pediatric Cancer Atlas (ScPCA) Portal to make uniformly processed, summarized single-cell and single-nuclei RNA-seq data and de-identified metadata from pediatric tumor samples available for download.
Data available on the Portal was obtained through two mechanisms: either raw data was accepted from ALSF-funded investigators and processed using our open-source pipeline, `scpca-nf`, or investigators processed their own raw data using `scpca-nf` and submitted the output for inclusion on the Portal.

All samples on the Portal include a core set of metadata obtained from investigators, including age, sex, diagnosis, subdiagnosis (if applicable), tissue location, and disease stage.
Some investigators submitted additional metadata, such as treatment and tumor stage, which can also be found on the Portal.
All submitted metadata was standardized to maintain consistency across projects before being added to the Portal.
In addition to providing a human-readable value for the submitted metadata, we also provide an ontology term ID, if applicable.

Do we need more information about why it's helpful to have ontologies? Or is stating that we include them enough?

Yes, and it would be helpful to specify which ontologies are used, too, in my opinion. You're underselling the value-add (not to mention the work that went into the metadata) the way this is currently written.

Reading the first methods PR reminded me that we should probably file an issue to track methods for the ontologies!

We mapped submitted metadata for age, sex, organism, disease, tissue, and ethnicity (if applicable) to their associated ontology term IDs using the Ontology Lookup Service.

This does not fully address my comment (https://github.com/AlexsLemonade/ScPCA-manuscript/pull/26/files#r1501617420):

> Yes, and it would be helpful to specify which ontologies are used, too, in my opinion.

I meant the ontologies themselves. So, UBERON, MONDO, etc. If you want to add a TODO in an HTML comment and let others weigh in later, that would be fine. I expect that #50 would include this level of detail.

Sorry, I wasn't sure how much detail to go into here vs. the methods regarding the names of the actual ontologies. I updated this to indicate both where the ontology comes from and what metadata they are used for in d76f191 if you want to take another look.

Including ontology term IDs for each sample provides users with standardized metadata terms that can be used across all projects.

The Portal contains data from 500 samples and over 50 tumor types.
<!-- TODO: Update numbers -->
The total number of samples for each diagnosis is shown in Figure 1A, along with a breakdown of the proportion of samples from each disease stage within a diagnosis group.
Figure 1A summarizes all samples from patient tumors or patient-derived xenografts currently available on the Portal.
The majority of samples found on the Portal were obtained from patients with leukemia.
Still, the Portal also includes samples from brain and central nervous system tumors, sarcoma and soft tissue tumors, and a variety of other solid tumors.
Most samples were collected at initial diagnosis, with a smaller percentage of samples collected either at recurrence, during progressive disease, or post-mortem.
Along with the patient tumors, the Portal contains a small number of human tumor cell line samples.


Each of the available samples contains summarized gene expression data from either single-cell or single-nuclei RNA sequencing.
However, some samples also include additional data, such as quantified expression data from tagging cells with antibody-derived tags (ADT), like CITE-seq antibodies [@doi:10.1038/nmeth.4380], or from multiplexing samples with hashtag oligonucleotides (HTO) [@doi:10.1186/s13059-018-1603-1] prior to sequencing.
Out of the 500 samples, 96 have associated CITE-seq data, and 19 have associated multiplexing data.
In some cases, multiple libraries from the same sample were collected for additional sequencing, either for bulk RNA-seq or spatial transcriptomics.
Specifically, 118 samples on the Portal were sequenced using bulk RNA-seq and 94 samples were sequenced using spatial transcriptomics.
A summary of the number of samples with each additional modality is shown in Figure 1B, and a detailed summary of the total samples with each sequencing method broken down by project is available in Supplemental Table 1.

Samples on the Portal are organized by project, where each project is a collection of similar samples from an individual lab.
Users can filter projects based on diagnosis, included modalities (e.g., CITE-seq, bulk RNA-seq), 10X Genomics version (e.g., 10Xv2, 10Xv3), and whether or not a project includes samples from patient-derived xenografts or cell lines.
The project card displays an abstract, the total number of samples included, a list of diagnoses for all samples included in the project, and links to any associated external information, such as publications or data stored in external repositories like SRA or GEO (Figure 1C).
The project card also indicates the type(s) of sequencing performed, including the 10X Genomics kit version, the suspension type (cell or nucleus), and whether additional data, like bulk RNA-seq or multiplexing, are present.

## Uniform processing of data available on the ScPCA Portal

@@ -118,3 +130,4 @@
- Along with the merged objects, for each project, a merged summary report is created and output.
- This report includes a brief summary of the samples and libraries included in the merged object, including a summary of the type of libraries (e.g., single-cell, single-nuclei, with CITE-seq) and sample diagnoses included in the object.
- The report also contains a UMAP showing all cells from all libraries included in the merged object. For each library, a separate panel is shown, and cells from that library are colored while all other cells are gray (Fig. 3D).
