Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functions for Seurat conversion #15

Merged
merged 25 commits into from
Dec 13, 2024
Merged
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ repos:
exclude: '\.Rd'

- repo: https://github.com/crate-ci/typos
rev: v1.28.1
rev: v1.28.2
hooks:
- id: typos
exclude: '\.nb\.html'
Expand Down
7 changes: 6 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -21,8 +21,12 @@ LazyData: true
Suggests:
testthat (>= 3.0.0),
scater,
scran,
Seurat,
splatter
splatter,
scuttle,
Matrix,
SeuratObject
Config/testthat/edition: 3
RoxygenNote: 7.3.2
Imports:
Expand All @@ -34,6 +38,7 @@ Imports:
purrr,
S4Vectors,
SingleCellExperiment,
SummarizedExperiment,
tibble,
tidyr
biocViews:
Expand Down
3 changes: 3 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,12 @@ export(calculate_silhouette)
export(calculate_stability)
export(ensembl_to_symbol)
export(extract_pc_matrix)
export(sce_to_seurat)
export(sce_to_symbols)
export(sum_duplicate_genes)
export(sweep_clusters)
import(SingleCellExperiment)
import(SummarizedExperiment)
import(methods)
importFrom(S4Vectors,`metadata<-`)
importFrom(S4Vectors,metadata)
Expand Down
78 changes: 52 additions & 26 deletions R/convert-gene-ids.R
Original file line number Diff line number Diff line change
Expand Up @@ -13,18 +13,21 @@
#'
#'
#' @param ensembl_ids A character vector of Ensembl gene ids to translate to
#' gene symbols.
#' @param reference The reference gene list to use for translation. One of `scpca`,
#' `10x2020`, `10x2024`. The `scpca` reference is the default.
#' @param sce A SingleCellExperiment object to use as a reference for gene symbols.
#' If provided, the `reference` argument will be ignored. The `sce` object must
#' include columns with the names `gene_ids` (containing Ensembl ids) and
#' `gene_symbol` (containing the symbols) to use for conversion.
#' @param unique Whether to use unique gene symbols, as would be done if
#' data had been read in with gene symbols by Seurat. Default is FALSE.
#' @param leave_na Whether to leave NA values in the output vector.
#' If FALSE, any missing values will be replaced with the input ensembl_id value.
#' Default is FALSE.
#' gene symbols.
#' @param reference The reference gene list to use for translation. One of
#' `scpca`, `10x2020`, `10x2024`. The `scpca` reference is the default.
#' @param sce A SingleCellExperiment object to use as a reference for gene
#' symbols. If provided, the `reference` argument will be ignored. The `sce`
#' object must include columns with the names `gene_ids` (containing Ensembl
#' ids) and `gene_symbol` (containing the symbols) to use for conversion.
#' @param unique Whether to use unique gene symbols, as would be done if data
#' had been read in with gene symbols by Seurat. Default is FALSE.
#' @param leave_na Whether to leave NA values in the output vector. If FALSE,
#' any missing values will be replaced with the input ensembl_id value.
#' Default is FALSE.
#' @param seurat_compatible Whether to return a vector that is compatible with
#' Seurat, translating and underscores to dashes.
#' Default is FALSE.
#'
#' @return A vector of gene symbols corresponding to the input Ensembl ids.
#' @export
Expand All @@ -51,7 +54,8 @@ ensembl_to_symbol <- function(
reference = c("scpca", "10x2020", "10x2024"),
sce = NULL,
unique = FALSE,
leave_na = FALSE) {
leave_na = FALSE,
seurat_compatible = FALSE) {
reference <- match.arg(reference)
stopifnot(
"`ensembl_ids` must be a character vector." = is.character(ensembl_ids),
Expand Down Expand Up @@ -85,10 +89,17 @@ ensembl_to_symbol <- function(
)
}
if (!leave_na && any(missing_symbols)) {
warning("Not all input ids have corresponding gene symbols, using input ids for missing values.")
message("Not all input ids have corresponding gene symbols, using input ids for missing values.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

gene_symbols[missing_symbols] <- ensembl_ids[missing_symbols]
}

if (seurat_compatible) {
if (any(grepl("_", gene_symbols))) {
warning("Replacing underscores ('_') with dashes ('-') in gene symbols for Seurat compatibility.")
}
gene_symbols <- gsub("_", "-", gene_symbols)
}

return(gene_symbols)
}

Expand All @@ -112,15 +123,19 @@ ensembl_to_symbol <- function(
#' (and not disabled by the `convert_hvg` and `convert_pca` arguments).
#'
#'
#' @param sce A SingleCellExperiment object containing gene ids and gene symbols.
#' @param reference The reference gene list for conversion. One of `sce`, `scpca`,
#' `10x2020`, or `10x2024`. If `sce` (the default) the internal row data is used.
#' @param unique Whether to use unique gene symbols, as would be done if
#' data had been read in with gene symbols by Seurat. Default is FALSE.
#' @param convert_hvg Logical indicating whether to convert highly variable genes to gene symbols.
#' Default is TRUE.
#' @param convert_pca Logical indicating whether to convert PCA rotation matrix to gene symbols.
#' Default is TRUE.
#' @param sce A SingleCellExperiment object containing gene ids and gene
#' symbols.
#' @param reference The reference gene list for conversion. One of `sce`,
#' `scpca`, `10x2020`, or `10x2024`. If `sce` (the default) the internal row
#' data is used.
#' @param unique Whether to use unique gene symbols, as would be done if data
#' had been read in with gene symbols by Seurat. Default is FALSE.
#' @param convert_hvg Logical indicating whether to convert highly variable
#' genes to gene symbols. Default is TRUE.
#' @param convert_pca Logical indicating whether to convert PCA rotation matrix
#' to gene symbols. Default is TRUE.
#' @param seurat_compatible Logical indicating whether to make gene symbols
#' Seurat-compatible by replacing underscores with dashes. Default is FALSE.
#'
#' @return A SingleCellExperiment object with row names set as gene symbols.
#' @export
Expand All @@ -145,7 +160,8 @@ sce_to_symbols <- function(
reference = c("sce", "scpca", "10x2020", "10x2024"),
unique = FALSE,
convert_hvg = TRUE,
convert_pca = TRUE) {
convert_pca = TRUE,
seurat_compatible = FALSE) {
reference <- match.arg(reference)
stopifnot(
"`sce` must be a SingleCellExperiment object." = is(sce, "SingleCellExperiment"),
Expand All @@ -164,9 +180,19 @@ sce_to_symbols <- function(
}

if (reference == "sce") {
gene_symbols <- ensembl_to_symbol(ensembl_ids, sce = sce, unique = unique)
gene_symbols <- ensembl_to_symbol(
ensembl_ids,
sce = sce,
unique = unique,
seurat_compatible = seurat_compatible
)
} else {
gene_symbols <- ensembl_to_symbol(ensembl_ids, reference = reference, unique = unique)
gene_symbols <- ensembl_to_symbol(
ensembl_ids,
reference = reference,
unique = unique,
seurat_compatible = seurat_compatible
)
}
row_ids <- gene_symbols
# set Ensembl ids as original ids for later translations
Expand Down
167 changes: 167 additions & 0 deletions R/make-seurat.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,167 @@
#' Convert an SCE object to Seurat
#'
#' Converts an ScPCA SingleCellExperiment (SCE) object to Seurat format. This is
#' primarily a wrapper around Seurat::as.Seurat() with some additional steps to
#' include ScPCA metadata and options for converting the feature index from
#' Ensembl gene ids to gene symbols.
#'
#' If present, reduced dimensions from SCE objects will be included, renamed to
#' match Seurat default naming.
#'
#'
#' @param sce The input SCE object
#' @param use_symbols Logical indicating whether the Seurat object uses gene
#' symbols for indexing. Default is TRUE.
#' @param reference If `use_symbols` is TRUE, the reference to use for gene
#' symbols. One of `sce`, `scpca`, `10x2020`, `10x2024`, where `sce` uses the
#' symbols stored in the `gene_symbol` column of the SCE object's row data,
#' and the others use standardized translation tables. See `ensembl_to_symbol()`
#' for more information. Default is `sce`.
#' @param dedup_method Method to handle duplicated gene symbols. If `unique`,
#' the gene symbols will be made unique following standard Seurat procedures.
Comment on lines +20 to +21
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a Seurat version to specify here, or do they all (to our knowledge) follow the same approach?

#' If `sum`, the expression values for each gene symbols will be summed.
#' @param seurat_assay_version Whether to create Seurat `v3` or `v5` assay
#' objects. Default is `v3`.
jashapiro marked this conversation as resolved.
Show resolved Hide resolved
#'
#' @return a Seurat object
#' @export
#'
#' @examples
#' \dontrun{
#' # convert an ScPCA SCE to Seurat renaming with gene symbols
#' seurat_obj <- sce_to_seurat(sce)
#'
#' # convert SCE to Seurat keeping the current names (usually Ensembl)
#' seurat_obj <- sce_to_seurat(sce, use_symbols = FALSE)
#'
#' # convert SCE to Seurat, merging duplicated gene names
#' seurat_obj <- sce_to_seurat(sce, dedup_method = "sum")
#'
#' # convert SCE to Seurat with v5 assay objects
#' seurat_obj <- sce_to_seurat(sce, seurat_assay_version = "v5")
#' }
#'
sce_to_seurat <- function(
sce,
use_symbols = TRUE,
reference = c("sce", "scpca", "10x2020", "10x2024"),
dedup_method = c("unique", "sum"),
seurat_assay_version = c("v3", "v5")) {
reference <- match.arg(reference)
dedup_method <- match.arg(dedup_method)
seurat_assay_version <- match.arg(seurat_assay_version)

stopifnot(
"Package `Seurat` must be installed for SCE to Seurat conversion." =
requireNamespace("Seurat", quietly = TRUE),
"Package `SeuratObject` must be installed for SCE to Seurat conversion." =
requireNamespace("SeuratObject", quietly = TRUE),
jashapiro marked this conversation as resolved.
Show resolved Hide resolved
"`sce` must be a SingleCellExperiment object." = is(sce, "SingleCellExperiment"),
"`sce` must contain both `gene_ids` and `gene_symbol` columns in the row data if it is being used as a reference." =
reference != "sce" || all(c("gene_ids", "gene_symbol") %in% names(rowData(sce)))
)

if (use_symbols) {
sce <- suppressMessages(sce_to_symbols(
sce,
reference = reference,
unique = (dedup_method == "unique"),
seurat_compatible = TRUE
))
}

if (dedup_method == "sum") {
sce <- sum_duplicate_genes(sce)
}

mainExpName(sce) <- "RNA"

# convert reducedDims to Seurat standard naming
reducedDimNames(sce) <- tolower(reducedDimNames(sce))
reducedDims(sce) <- reducedDims(sce) |>
purrr::map(\(mat) {
# insert underscore between letters and numbers in axis labels
rd_names <- gsub("([A-Za-z]+)([0-9]+)", "\\1_\\2", colnames(mat)) |>
tolower()
colnames(mat) <- rd_names
if (!is.null(attr(mat, "rotation"))) {
colnames(attr(mat, "rotation")) <- rd_names
}
return(mat)
})

# pull out any altExps to handle manually
alt_exps <- altExps(sce)
altExps(sce) <- NULL

# Let Seurat do initial conversion to capture most things
if ("logcounts" %in% assayNames(sce)) {
data_name <- "logcounts"
} else {
data_name <- NULL
}
sobj <- Seurat::as.Seurat(sce, data = data_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests clearly pass, but I'm unable to run sce_to_seurat() standalone and the error is showing up at this line, both with a processed and filtered SCE. Attaching Seurat(Object) didn't help. Are you able to run this outside of the test suite?

Error in UseMethod(generic = "DefaultAssay<-", object = object) : 
  no applicable method for 'DefaultAssay<-' applied to an object of class "call"
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 15.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] rOpenScPCA_0.1.0            SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
 [4] Biobase_2.64.0              GenomicRanges_1.56.1        GenomeInfoDb_1.40.1        
 [7] IRanges_2.38.1              S4Vectors_0.42.1            BiocGenerics_0.50.0        
[10] MatrixGenerics_1.16.0       matrixStats_1.4.1           Seurat_5.1.0               
[13] SeuratObject_5.0.2          sp_2.1-4                    testthat_3.2.2             

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22        splines_4.4.0           later_1.3.2            
  [4] tibble_3.2.1            polyclip_1.10-7         fastDummies_1.7.4      
  [7] lifecycle_1.0.4         rprojroot_2.0.4         globals_0.16.3         
 [10] lattice_0.22-6          MASS_7.3-60.2           magrittr_2.0.3         
 [13] plotly_4.10.4           yaml_2.3.10             remotes_2.5.0          
 [16] httpuv_1.6.15           sctransform_0.4.1       spam_2.10-0            
 [19] sessioninfo_1.2.2       pkgbuild_1.4.4          spatstat.sparse_3.1-0  
 [22] reticulate_1.39.0       cowplot_1.1.3           pbapply_1.7-2          
 [25] RColorBrewer_1.1-3      abind_1.4-5             pkgload_1.4.0          
 [28] zlibbioc_1.50.0         Rtsne_0.17              purrr_1.0.2            
 [31] GenomeInfoDbData_1.2.12 ggrepel_0.9.6           irlba_2.3.5.1          
 [34] listenv_0.9.1           spatstat.utils_3.1-0    goftest_1.2-3          
 [37] RSpectra_0.16-2         spatstat.random_3.3-1   fitdistrplus_1.2-1     
 [40] parallelly_1.38.0       leiden_0.4.3.1          codetools_0.2-20       
 [43] DelayedArray_0.30.1     tidyselect_1.2.1        UCSC.utils_1.0.0       
 [46] pdfCluster_1.0-4        spatstat.explore_3.3-2  jsonlite_1.8.8         
 [49] BiocNeighbors_1.22.0    ellipsis_0.3.2          progressr_0.14.0       
 [52] ggridges_0.5.6          survival_3.7-0          tools_4.4.0            
 [55] ica_1.0-3               Rcpp_1.0.13             glue_1.7.0             
 [58] gridExtra_2.3           SparseArray_1.4.8       usethis_3.0.0          
 [61] dplyr_1.1.4             withr_3.0.2             BiocManager_1.30.25    
 [64] fastmap_1.2.0           bluster_1.14.0          fansi_1.0.6            
 [67] digest_0.6.37           R6_2.5.1                mime_0.12              
 [70] colorspace_2.1-1        scattermore_1.2         tensor_1.5             
 [73] spatstat.data_3.1-2     utf8_1.2.4              tidyr_1.3.1            
 [76] generics_0.1.3          renv_1.0.11             data.table_1.16.0      
 [79] httr_1.4.7              htmlwidgets_1.6.4       S4Arrays_1.4.1         
 [82] uwot_0.2.2              pkgconfig_2.0.3         gtable_0.3.5           
 [85] lmtest_0.9-40           XVector_0.44.0          brio_1.1.5             
 [88] htmltools_0.5.8.1       profvis_0.4.0           dotCall64_1.1-1        
 [91] scales_1.3.0            png_0.1-8               spatstat.univar_3.0-1  
 [94] geometry_0.5.0          rstudioapi_0.17.1       reshape2_1.4.4         
 [97] nlme_3.1-164            magic_1.6-1             cachem_1.1.0           
[100] zoo_1.8-12              stringr_1.5.1           KernSmooth_2.23-24     
[103] parallel_4.4.0          miniUI_0.1.1.1          desc_1.4.3             
[106] pillar_1.9.0            grid_4.4.0              vctrs_0.6.5            
[109] RANN_2.6.2              urlchecker_1.0.1        promises_1.3.0         
[112] xtable_1.8-4            cluster_2.1.6           waldo_0.6.1            
[115] cli_3.6.3               compiler_4.4.0          rlang_1.1.4            
[118] crayon_1.5.3            future.apply_1.11.2     plyr_1.8.9             
[121] fs_1.6.4                stringi_1.8.4           deldir_2.0-4           
[124] viridisLite_0.4.2       BiocParallel_1.38.0     munsell_0.5.1          
[127] lazyeval_0.2.2          devtools_2.4.5          spatstat.geom_3.3-2    
[130] Matrix_1.7-0            RcppHNSW_0.6.0          patchwork_1.2.0        
[133] future_1.34.0           ggplot2_3.5.1           shiny_1.9.1            
[136] ROCR_1.0-11             igraph_2.0.3            memoise_2.0.1 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

914e500 seems to have done it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably related to an renv bug in the latest version; do you have a variable named object in your environment? (Seurat must be doing something bad here too, because it shouldn't read a global variable wherever it is doing that, but I can't work around all the Seurat bugs...) It should not affect running the function if you load the package from an external source. I just pushed a temporary fix until renv update comes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have a variable named object in your environment?

No definitely not.

...But I just went back to that commit hash and re-ran my exact code, and uh.... magic? Clearly, black magic. But black magic from devtools or testthat.

Restarting R session...

- Project '~/ALSF/open-scpca/rOpenScPCA' loaded. [renv 1.0.11]
> devtools::load_all(".")
ℹ Loading rOpenScPCA
> sce <- readRDS("tests/testthat/data/scpca_sce.rds")
> s <- sce_to_seurat(sce)
Error in UseMethod(generic = "DefaultAssay<-", object = object) : 
  no applicable method for 'DefaultAssay<-' applied to an object of class "call"
In addition: Warning message:
In ensembl_to_symbol(ensembl_ids, sce = sce, unique = unique, seurat_compatible = seurat_compatible) :
  Replacing underscores ('_') with dashes ('-') in gene symbols for Seurat compatibility.
> 
> object
test_check("rOpenScPCA")

I restarted R again, and...

Restarting R session...

- Project '~/ALSF/open-scpca/rOpenScPCA' loaded. [renv 1.0.11]
> object
test_check("rOpenScPCA")

I have made a short film.
https://github.com/user-attachments/assets/9b96416a-83f2-4091-b8d9-7aab22fa5d05

Copy link
Member

@sjspielman sjspielman Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this was at the previous commit; this behavior no longer happens after 914e500, so at least what I was experiencing was consistent with the known bug. Wild stuff, though.


# update identity values, using cluster if present
sce_meta <- metadata(sce)
sobj$orig.ident <- sce_meta$library_id
if (is.null(sce$cluster)) {
Seurat::Idents(sobj) <- sce_meta$library_id
} else {
Seurat::Idents(sobj) <- sce$cluster
}

# add variable features if present
if (!is.null(sce_meta$highly_variable_genes)) {
sobj[["RNA"]]@var.features <- sce_meta$highly_variable_genes
sce_meta$highly_variable_genes <- NULL # remove from metadata to avoid redundancy
}

# convert and set functions depending on assay version requested
if (seurat_assay_version == "v5") {
create_seurat_assay <- SeuratObject::CreateAssay5Object
suppressWarnings(sobj[["RNA"]] <- as(sobj[["RNA"]], "Assay5"))
} else {
create_seurat_assay <- SeuratObject::CreateAssayObject
suppressWarnings(sobj[["RNA"]] <- as(sobj[["RNA"]], "Assay"))
}

# add spliced data if present
if ("spliced" %in% assayNames(sce)) {
sobj[["spliced"]] <- create_seurat_assay(counts = assay(sce, "spliced"))
}

# add altExps as needed.
for (alt_exp_name in names(alt_exps)) {
alt_exp <- alt_exps[[alt_exp_name]]
stopifnot(
"All altExps must contain a `counts` assay." = "counts" %in% assayNames(alt_exp)
)

# check name compatibility for Seurat
alt_exp_rownames <- rownames(alt_exp)
if (any(grepl("_", alt_exp_rownames))) {
warning(
"Replacing underscores ('_') with dashes ('-') in feature names from ",
alt_exp_name,
" for Seurat compatibility."
)
rownames(alt_exp) <- gsub("_", "-", alt_exp_rownames)
}

sobj[[alt_exp_name]] <- create_seurat_assay(counts = counts(alt_exp))
if ("logcounts" %in% assayNames(alt_exp)) {
sobj[[alt_exp_name]]$data <- logcounts(alt_exp)
Comment on lines +154 to +156
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting that these lines may both give the Seurat warning

Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')

I got this when running a multiplexed library through, so this was coming from the MULTI_X names. We might want to wrangle the altexps a bit more first if we want to avoid this Seurat warning. I think would be good since it's not very transparent to users that this is where the warning comes from, and we might add our own warning (message?) instead to be clearer what's changing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added conversion for altExps with a warning.

}
}

# add sample-level metadata after removing non-vector objects
sobj@misc <- sce_meta |>
purrr::discard(is.object) |>
purrr::discard(is.list)

# scale data because many Seurat functions expect this
sobj <- Seurat::ScaleData(sobj, assay = "RNA", verbose = FALSE)

return(sobj)
}
Loading