From dabfc49962ade2b4c46b09b7cbe902b563346006 Mon Sep 17 00:00:00 2001 From: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com> Date: Fri, 1 Mar 2024 09:08:18 -0600 Subject: [PATCH 1/6] start cell typing methods --- content/04.methods.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/content/04.methods.md b/content/04.methods.md index 8aeec76..6e79fa5 100644 --- a/content/04.methods.md +++ b/content/04.methods.md @@ -103,6 +103,19 @@ In addition to using the default parameters for `salmon quant`, we utilized the ### Cell type annotation +Cell type labels were added to processed `SingleCellExperiment` objects using both `SingleR`[REF] and `CellAssign`[REF]. +For `SingleR`, a reference dataset was obtained from the `celldex` package, `BlueprintEncode`, and used to build a `SingleR` model with `SingleR::...`. +`classifySingleR` is run using the trained SingleR model and processed `SingleCellExperiment` object and cell type annotations are obtained for each cell in the processed object. +Score matrix? +Calculation of the delta median. + +For `CellAssign`, marker gene references were created using the marker gene list available on PanglaoDB [REF]. +References were unique to the organ from which the tissue was obtained and contained marker genes for all cell types listed in the specified organ in PanglaoDB. +details on how the model was run with the reference... +Plotting probability. + +The cell type annotations from both `SingleR` and `CellAssign` are made available as part of the processed `SingleCellExpriment` object output by `scpca-nf`. 
+ - Implementation of SingleR and CellAssign - Description of metrics used (e.g., what is the delta median and where does the probability come from) From 21dac656a4bd291139bf699d45bcfe7ef22e6a47 Mon Sep 17 00:00:00 2001 From: Ally Hawkins Date: Fri, 1 Mar 2024 13:38:57 -0600 Subject: [PATCH 2/6] fill out cell typing and add zellkonverter --- content/04.methods.md | 46 +++++++++++++++++++++++++++---------------- 1 file changed, 29 insertions(+), 17 deletions(-) diff --git a/content/04.methods.md b/content/04.methods.md index 6e79fa5..d16b256 100644 --- a/content/04.methods.md +++ b/content/04.methods.md @@ -103,22 +103,26 @@ In addition to using the default parameters for `salmon quant`, we utilized the ### Cell type annotation -Cell type labels were added to processed `SingleCellExperiment` objects using both `SingleR`[REF] and `CellAssign`[REF]. -For `SingleR`, a reference dataset was obtained from the `celldex` package, `BlueprintEncode`, and used to build a `SingleR` model with `SingleR::...`. -`classifySingleR` is run using the trained SingleR model and processed `SingleCellExperiment` object and cell type annotations are obtained for each cell in the processed object. -Score matrix? -Calculation of the delta median. - -For `CellAssign`, marker gene references were created using the marker gene list available on PanglaoDB [REF]. -References were unique to the organ from which the tissue was obtained and contained marker genes for all cell types listed in the specified organ in PanglaoDB. -details on how the model was run with the reference... -Plotting probability. - -The cell type annotations from both `SingleR` and `CellAssign` are made available as part of the processed `SingleCellExpriment` object output by `scpca-nf`. 
- - - Implementation of SingleR and CellAssign - - Description of metrics used (e.g., what is the delta median and where does the probability come from) - +Cell type labels were added to processed `SingleCellExperiment` objects using both `SingleR`[@doi:10.1038/s41590-018-0276-y] and `CellAssign`[@doi:10.1038/s41592-019-0529-1]. +To build the references used for assigning cell types, a separate workflow within `scpca-nf` was run, `build-celltype-index.nf`. +For `SingleR`, an appropriate reference dataset was identified and obtained from the `celldex` package [@doi:10.18129/B9.bioc.celldex], `BlueprintEncodeData` [@doi:10.3324/haematol.2013.094243;10.1038/nature11247], and used to train the `SingleR` classification model with `SingleR::trainSingleR()`. +The model and the processed `SingleCellExperiment` object were input to `SingleR::classifySingleR()`. +The output from `SingleR` included assigned cell type labels and a score matrix with a score calculated by `SingleR` for each cell and each possible cell type. +Cell type annotations and the score matrix were added to the processed `SingleCellExperiment` object output by `scpca-nf`. +For all cell type annotations obtained from `SingleR`, a delta median statistic was calculated by subtracting the median score from the maximum score for each cell. + +For `CellAssign`, marker gene references were created using the marker gene list available on `PanglaoDB` [@doi:10.1093/database/baz046]. +Organ-specific references were built using all cell types in a specified organ listed in `PanglaoDB`. +References for each ScPCA project were assigned based on the tissue from which the sample was obtained. +`scvi.external.CellAssign` was used to train the model and predict the assigned cell type. +For each cell type in the reference, `CellAssign` calculates the likelihood that each cell is assigned to that cell type. +The output of `CellAssign` includes a matrix with the assigned probability for each cell and each cell type. 
+The cell type label with the highest probability for a given cell is assigned to that cell. +The final predictions and the probability matrix were added as cell type annotations to the processed `SingleCellExperiment` object output by `scpca-nf`. + +If cell types were obtained from the submitter of the dataset, the submitter-provided annotations were incorporated into all `SingleCellExperiment` objects (unfiltered, filtered, and processed). +In this case, cell type annotations were also determined using `SingleR` and `CellAssign` with results from all three available in the processed `SingleCellExperiment` object. +Cell type annotation was not performed for any samples derived from cell lines. ### Generating merged data Merged objects are created by running the `merge.nf` workflow within `scpca-nf`. @@ -136,6 +140,14 @@ If any libraries included in the ScPCA project contain additional ADT data, the If any libraries included in the ScPCA project are multiplexed and contain HTO data, the HTO data is not merged and will not be present in the merged `SingleCellExperiment` object. ### Converting SingleCellExperiment objects to AnnData objects - - use of zellkonverter + +All `SingleCellExperiment` objects output by `scpca-nf` were converted to `AnnData` objects and saved as `.hdf5` files. +`zellkonverter::writeH5AD()` was used to convert and export the objects as `.hdf5` files. +For any `SingleCellExperiment` objects containing an `altExp` (e.g., ADT data), the RNA and ADT data were exported and saved separately as RNA (`_rna.hdf5`) and ADT (`_adt.hdf5`) files. +Any individual libraries that were multiplexed and contained HTO data were not converted to `AnnData` objects. + +All merged `SingleCellExperiment` objects were converted to `AnnData` objects and saved as `.hdf5` files. +If a merged `SingleCellExperiment` object contains any ADT data, the RNA and ADT data was exported and saved separately as RNA (`_rna.hdf5`) and ADT (`_adt.hdf5`). 
+In contrast, if a merged `SingleCellExperiment` object contained HTO data due to the presence of any multiplexed libraries in the merged object, the HTO data was removed from the `SingleCellExperiment` object and not included in the exported `AnnData` object. ### Code and data availability From f871952f18f30ce4c9d5e9e79a2322943ed031e2 Mon Sep 17 00:00:00 2001 From: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com> Date: Mon, 4 Mar 2024 14:02:31 -0600 Subject: [PATCH 3/6] Apply suggestions from code review Co-authored-by: Joshua Shapiro --- content/04.methods.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/content/04.methods.md b/content/04.methods.md index e4ff77d..d23d373 100644 --- a/content/04.methods.md +++ b/content/04.methods.md @@ -105,13 +105,12 @@ In addition to using the default parameters for `salmon quant`, we utilized the Cell type labels were added to processed `SingleCellExperiment` objects using both `SingleR`[@doi:10.1038/s41590-018-0276-y] and `CellAssign`[@doi:10.1038/s41592-019-0529-1]. To build the references used for assigning cell types, a separate workflow within `scpca-nf` was run, `build-celltype-index.nf`. -For `SingleR`, an appropriate reference dataset was identified and obtained from the `celldex` package [@doi:10.18129/B9.bioc.celldex], `BlueprintEncodeData` [@doi:10.3324/haematol.2013.094243;10.1038/nature11247], and used to train the `SingleR` classification model with `SingleR::trainSingleR()`. +For `SingleR` we used the `BlueprintEncodeData` from the `celldex` package [@doi:10.3324/haematol.2013.094243;@doi: 10.1038/nature11247;@doi:10.18129/B9.bioc.celldex] to train the `SingleR` classification model with `SingleR::trainSingleR()`. The model and the processed `SingleCellExperiment` object were input to `SingleR::classifySingleR()`. 
-The output from `SingleR` included assigned cell type labels and a score matrix with a score calculated by `SingleR` for each cell and each possible cell type. -Cell type annotations and the score matrix were added to the processed `SingleCellExperiment` object output by `scpca-nf`. -For all cell type annotations obtained from `SingleR`, a delta median statistic was calculated by subtracting the median score from the maximum score for each cell. +The `SingleR` output of cell type annotations and a score matrix for each cell and each possible cell type were added to the processed `SingleCellExperiment` object output. +As a measure of the reliability of the `SingleR` cell type assignments, we also calculated a delta median statistic for each cell by subtracting the median cell type score from the maximum score for that cell. -For `CellAssign`, marker gene references were created using the marker gene list available on `PanglaoDB` [@doi:10.1093/database/baz046]. +For `CellAssign`, marker gene references were created using the marker gene lists available on `PanglaoDB` [@doi:10.1093/database/baz046]. Organ-specific references were built using all cell types in a specified organ listed in `PanglaoDB`. References for each ScPCA project were assigned based on the tissue from which the sample was obtained. `scvi.external.CellAssign` was used to train the model and predict the assigned cell type. @@ -142,13 +141,12 @@ By contrast, if any libraries included in the ScPCA project are multiplexed and ### Converting SingleCellExperiment objects to AnnData objects -All `SingleCellExperiment` objects output by `scpca-nf` were converted to `AnnData` objects and saved as `.hdf5` files. -`zellkonverter::writeH5AD()` was used to convert and export the objects as `.hdf5` files. +`zellkonverter::writeH5AD()` was used to convert `SingleCellExperiment` objects to `AnnData` format and export the objects as `.hdf5` files. 
For any `SingleCellExperiment` objects containing an `altExp` (e.g., ADT data), the RNA and ADT data were exported and saved separately as RNA (`_rna.hdf5`) and ADT (`_adt.hdf5`) files. -Any individual libraries that were multiplexed and contained HTO data were not converted to `AnnData` objects. +Multiplexed libraries were not converted to `AnnData` objects, due to the potential for ambiguity in sample origin assignments. All merged `SingleCellExperiment` objects were converted to `AnnData` objects and saved as `.hdf5` files. -If a merged `SingleCellExperiment` object contains any ADT data, the RNA and ADT data was exported and saved separately as RNA (`_rna.hdf5`) and ADT (`_adt.hdf5`). +If a merged `SingleCellExperiment` object contained any ADT data, the RNA and ADT data were exported and saved separately as RNA (`_rna.hdf5`) and ADT (`_adt.hdf5`). In contrast, if a merged `SingleCellExperiment` object contained HTO data due to the presence of any multiplexed libraries in the merged object, the HTO data was removed from the `SingleCellExperiment` object and not included in the exported `AnnData` object. ### Code and data availability From e766aa4cdd08ae10fa77e529dc33040065bf7259 Mon Sep 17 00:00:00 2001 From: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com> Date: Mon, 4 Mar 2024 14:33:43 -0600 Subject: [PATCH 4/6] justify singler and panglao references --- content/04.methods.md | 28 +++++++++++++++------------- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/content/04.methods.md b/content/04.methods.md index e342ba8..93b4459 100644 --- a/content/04.methods.md +++ b/content/04.methods.md @@ -103,25 +103,27 @@ In addition to using the default parameters for `salmon quant`, we applied the ` ### Cell type annotation -Cell type labels were added to processed `SingleCellExperiment` objects using both `SingleR`[@doi:10.1038/s41590-018-0276-y] and `CellAssign`[@doi:10.1038/s41592-019-0529-1]. 
+If cell types were obtained from the submitter of the dataset, the submitter-provided annotations were incorporated into all `SingleCellExperiment` objects (unfiltered, filtered, and processed). +Cell type labels determined by both `SingleR`[@doi:10.1038/s41590-018-0276-y] and `CellAssign`[@doi:10.1038/s41592-019-0529-1] were added to processed `SingleCellExperiment` objects. + To build the references used for assigning cell types, a separate workflow within `scpca-nf` was run, `build-celltype-index.nf`. For `SingleR` we used the `BlueprintEncodeData` from the `celldex` package [@doi:10.3324/haematol.2013.094243;@doi: 10.1038/nature11247;@doi:10.18129/B9.bioc.celldex] to train the `SingleR` classification model with `SingleR::trainSingleR()`. The model and the processed `SingleCellExperiment` object were input to `SingleR::classifySingleR()`. -The `SingleR` output of cell type annotations and a score matrix for each cell and each possible cell type were added to the processed `SingleCellExperiment` object output. -As a measure of the reliability of the `SingleR` cell type assignments, we also calculated a delta median statistic for each cell by subtracting the median cell type score from the maximum score for that cell. +The `SingleR` output of cell type annotations and a score matrix for each cell and all possible cell types were added to the processed `SingleCellExperiment` object output. +To evaluate confidence in `SingleR` cell type assignments, we also calculated a delta median statistic for each cell by subtracting the median cell type score from the maximum score for that cell [@url:https://bioconductor.org/books/release/SingleRBook/annotation-diagnostics.html#based-on-the-deltas-across-cells]. + +The delta median statistic is helpful in evaluating how confident `SingleR` is in assigning each cell to a specific cell type, where low delta median values indicate ambiguous assignments and high delta median values indicate confident assignments. 
+To identify the most appropriate reference to use with `SingleR`, we annotated a handful of samples across multiple disease types with all human-specific references available in the `celldex` package. +`BlueprintEncodeData` had the most consistently high delta median statistic distribution across samples from multiple disease types and was chosen as the reference to use for all ScPCA samples. For `CellAssign`, marker gene references were created using the marker gene lists available on `PanglaoDB` [@doi:10.1093/database/baz046]. -Organ-specific references were built using all cell types in a specified organ listed in `PanglaoDB`. -References for each ScPCA project were assigned based on the tissue from which the sample was obtained. -`scvi.external.CellAssign` was used to train the model and predict the assigned cell type. -For each cell type in the reference, `CellAssign` calculates the likelihood that each cell is assigned to that cell type. -The output of `CellAssign` includes a matrix with the assigned probability for each cell and each cell type. -The cell type label with the highest probability for a given cell is assigned to that cell. -The final predictions and the probability matrix were added as cell type annotations to the processed `SingleCellExperiment` object output by `scpca-nf`. +Organ-specific references were built using all cell types in a specified organ listed in `PanglaoDB` to accommodate all ScPCA projects encompassing a variety of disease and tissue type. +If a set of disease types in a given project encompassed cells that may be present in multiple organ groups, multiple organs were combined - e.g., for sarcomas that appear in bone or soft tissue, we created a reference containing bone, connective tissue, smooth muscle, and immune cells. + +Given the processed `SingleCellExperiment` object and organ-specific reference, `scvi.external.CellAssign` was used to train the model and predict the assigned cell type. 
+For each cell type in the reference, `CellAssign` calculates the probability that each cell is assigned to that cell type. +The probability matrix and a prediction based on the most likely cell type were added as cell type annotations to the processed `SingleCellExperiment` object output by `scpca-nf`. -If cell types were obtained from the submitter of the dataset, the submitter-provided annotations were incorporated into all `SingleCellExperiment` objects (unfiltered, filtered, and processed). -In this case, cell type annotations were also determined using `SingleR` and `CellAssign` with results from all three available in the processed `SingleCellExperiment` object. -Cell type annotation was not performed for any samples derived from cell lines. ### Generating merged data Merged objects are created with the `merge.nf` workflow within `scpca-nf`. From ab59ecabb47083e924bcf72c0b59bb5338a8e64a Mon Sep 17 00:00:00 2001 From: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com> Date: Tue, 5 Mar 2024 11:24:53 -0600 Subject: [PATCH 5/6] rewording Co-authored-by: Joshua Shapiro --- content/04.methods.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/04.methods.md b/content/04.methods.md index 93b4459..4c3cfbb 100644 --- a/content/04.methods.md +++ b/content/04.methods.md @@ -121,8 +121,8 @@ Organ-specific references were built using all cell types in a specified organ l If a set of disease types in a given project encompassed cells that may be present in multiple organ groups, multiple organs were combined - e.g., for sarcomas that appear in bone or soft tissue, we created a reference containing bone, connective tissue, smooth muscle, and immune cells. Given the processed `SingleCellExperiment` object and organ-specific reference, `scvi.external.CellAssign` was used to train the model and predict the assigned cell type. 
-For each cell type in the reference, `CellAssign` calculates the probability that each cell is assigned to that cell type. -The probability matrix and a prediction based on the most likely cell type were added as cell type annotations to the processed `SingleCellExperiment` object output by `scpca-nf`. +For each cell, `CellAssign` calculates a probability of assignment to each cell type in the reference. +The probability matrix and a prediction based on the most probable cell type were added as cell type annotations to the processed `SingleCellExperiment` object output. ### Generating merged data From 6d812dffb4096ccecca00d99b43274ab6501fd6f Mon Sep 17 00:00:00 2001 From: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com> Date: Wed, 6 Mar 2024 10:14:14 -0600 Subject: [PATCH 6/6] remove section about references going to results --- content/04.methods.md | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/content/04.methods.md b/content/04.methods.md index 4c3cfbb..2a75478 100644 --- a/content/04.methods.md +++ b/content/04.methods.md @@ -101,7 +101,7 @@ We created a decoy-aware reference created from spliced cDNA sequences with the The trimmed reads were then provided as input to `salmon quant` for selective alignment. In addition to using the default parameters for `salmon quant`, we applied the `--seqBias` and `--gcBias` flags to correct for sequence-specific biases due to random hexamer priming and fragment-level GC biases, respectively. -### Cell type annotation +### Cell type annotation If cell types were obtained from the submitter of the dataset, the submitter-provided annotations were incorporated into all `SingleCellExperiment` objects (unfiltered, filtered, and processed). Cell type labels determined by both `SingleR`[@doi:10.1038/s41590-018-0276-y] and `CellAssign`[@doi:10.1038/s41592-019-0529-1] were added to processed `SingleCellExperiment` objects. 
@@ -112,10 +112,6 @@ The model and the processed `SingleCellExperiment` object were input to `SingleR The `SingleR` output of cell type annotations and a score matrix for each cell and all possible cell types were added to the processed `SingleCellExperiment` object output. To evaluate confidence in `SingleR` cell type assignments, we also calculated a delta median statistic for each cell by subtracting the median cell type score from the maximum score for that cell [@url:https://bioconductor.org/books/release/SingleRBook/annotation-diagnostics.html#based-on-the-deltas-across-cells]. -The delta median statistic is helpful in evaluating how confident `SingleR` is in assigning each cell to a specific cell type, where low delta median values indicate ambiguous assignments and high delta median values indicate confident assignments. -To identify the most appropriate reference to use with `SingleR`, we annotated a handful of samples across multiple disease types with all human-specific references available in the `celldex` package. -`BlueprintEncodeData` had the most consistently high delta median statistic distribution across samples from multiple disease types and was chosen as the reference to use for all ScPCA samples. - For `CellAssign`, marker gene references were created using the marker gene lists available on `PanglaoDB` [@doi:10.1093/database/baz046]. Organ-specific references were built using all cell types in a specified organ listed in `PanglaoDB` to accommodate all ScPCA projects encompassing a variety of disease and tissue type. If a set of disease types in a given project encompassed cells that may be present in multiple organ groups, multiple organs were combined - e.g., for sarcomas that appear in bone or soft tissue, we created a reference containing bone, connective tissue, smooth muscle, and immune cells.
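
Reviewer note, not part of the patch: the two per-cell summaries described in the cell type annotation text (the `SingleR` delta median and the `CellAssign` most-probable label) can be illustrated with a minimal sketch. This is not the `scpca-nf` implementation. The function names, the toy score and probability values, and the label list below are all hypothetical, and the matrices are assumed to be laid out as one row per cell and one column per cell type.

```python
from statistics import median


def delta_median(scores):
    """Return, for each cell (row), the maximum score minus the median score.

    `scores` is assumed to be a list of rows, one per cell, with one
    score per candidate cell type. Larger values suggest a less
    ambiguous assignment.
    """
    return [max(row) - median(row) for row in scores]


def most_probable_label(probabilities, labels):
    """Return, for each cell (row), the label with the highest probability.

    `probabilities` is assumed to be a list of rows, one per cell, whose
    columns are in the same order as `labels`.
    """
    predictions = []
    for row in probabilities:
        best_index = max(range(len(row)), key=row.__getitem__)
        predictions.append(labels[best_index])
    return predictions


# Hypothetical toy inputs: two cells scored against three cell types.
toy_labels = ["T cell", "B cell", "Monocyte"]
toy_scores = [
    [0.9, 0.2, 0.1],    # one clear winner, so the delta median is large
    [0.4, 0.5, 0.45],   # scores are close, so the delta median is small
]
toy_probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

toy_deltas = delta_median(toy_scores)
toy_predictions = most_probable_label(toy_probabilities, toy_labels)
```

In this toy example the first cell has a much larger delta median than the second, matching the intended reading of the statistic as a confidence measure for the assignment.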