diff --git a/docs/index.html b/docs/index.html index f57792c..82cbb9e 100644 --- a/docs/index.html +++ b/docs/index.html @@ -13,7 +13,7 @@ -
In this tutorial, we will describe an R package “Mixscale” for
-analyzing Perturb-seq data. Mixscale contains functions designed to
-tackle the following tasks:
-1. calculate ‘Mixscale scores’ for cells that receives the same
-perturbation to quantify the heterogeneity in perturbation
-strength
-2. perform a scoring-based weighted differential expression (DE) tests
-to identify DE genes for each perturbation
-3. perform different levels of decomposition analysis to identify
-correlated perturbations and group them into a program
-4. perform a PCA-based permutation test to extract shared genes for the
-perturbation programs (program gene signature)
-5. identify shared and unique signature between two relevant
-programs
-6. perform gene set enrichment tests using the program signature for new
-datasets
-7. perform module score analyses using the program signature to quantify
-the program activity in new datasets
-
-The tutorial is divided into two sections. The first section will
-describe task 1 to 5 using a public Perturb-seq dataset from the
-Weissman Lab (Jost et al
-2020), which can be downloaded from GSE132080.
-The second section will describe task 6 and 7 using the pathway gene
-lists generated from our study (available at Zenodo) and an interferon-beta stimulated
-human PBMCs dataset (ifnb) from Kang et al 2017.
-This ifnb dataset is available via the SeuratData package (see section 2
-below).
options(Seurat.object.assay.version = 'v3')
-
-library(Seurat)
-library(ggridges)
-library(stringr)
-library(Mixscale)
-library(ggplot2)
-In this section we will focus on how to use Mixscale to analyze -Perturb-seq data.
-The demo dataset from Jost et al
-2020 contains CRISPRi Perturb-seq data targeting 25 key genes
-involved in essential cell biological processes. We will load in the
-count matrix to create a Seurat object and append the provided meta data
-to it.
-One special feature of this dataset is that, for each perturbation
-target gene, there are five different gRNAs designed to target it. One
-of the gRNA has the perfectly matched sequence for the target region
-(labelled with “_00”), while the others contain 1~3 nucleotide
-mismatches so that their perturbation stength is “titrated”. We will
-treat the cells that have the same target gene as the same group in our
-downstream analyses.
# load in the count matrix downloaded from GSE132080
-ct_mat = ReadMtx(mtx = "GSE132080/GSE132080_10X_matrix.mtx",
- cells = "GSE132080/GSE132080_10X_barcodes.tsv",
- features = "GSE132080/GSE132080_10X_genes.tsv")
-# load the meta_data
-meta_data = read.csv("GSE132080/GSE132080_cell_identities.csv")
-rownames(meta_data) = meta_data$cell_barcode
-
-# create a seurat object
-seurat_obj = CreateSeuratObject(counts = ct_mat, meta.data = meta_data)
-rm(ct_mat, meta_data)
-
-# retrieve the guide information for each cell
-txt = seurat_obj$guide_identity
-txt2 = str_extract(txt, "^[^_]+")
-txt3 = gsub(pattern = "^[^_]+_", replacement = "", txt)
-seurat_obj[['gene']] = txt2
-seurat_obj[['gRNA_name']] = txt3
-seurat_obj[['cell_type']] = "K562"
-rm(txt, txt2, txt3)
-
-# remove ambiguous cells
-seurat_obj = subset(seurat_obj, subset = number_of_cells == 1) # 19594 cells remain
-seurat_obj = subset(seurat_obj, subset = guide_identity != '*') # 19587 cells remain
-
-
We will first run standard pre-processing (normalization, find -variable features, etc) for the dataset. Then, we will follow the -standard Mixscape -analysis to calculate local perturbation signatures that mitigate -confounding effects. Briefly speaking, for each cell we will search for -its 20 nearest neighbors from the non-targeted (NT) cells, and then -remove all technical variation so that perturbation-specific effect can -be revealed.
-# quick check of the data
-head(seurat_obj)
-## orig.ident nCount_RNA nFeature_RNA cell_barcode
-## AAACCTGAGAGTAATC-1 SeuratProject 26560 4221 AAACCTGAGAGTAATC-1
-## AAACCTGAGGGATCTG-1 SeuratProject 11929 2935 AAACCTGAGGGATCTG-1
-## AAACCTGAGGTCATCT-1 SeuratProject 33870 5036 AAACCTGAGGTCATCT-1
-## AAACCTGCAATGGAGC-1 SeuratProject 14945 3439 AAACCTGCAATGGAGC-1
-## AAACCTGCACCAGGCT-1 SeuratProject 30713 4729 AAACCTGCACCAGGCT-1
-## AAACCTGCACTTACGA-1 SeuratProject 32122 4985 AAACCTGCACTTACGA-1
-## AAACCTGCAGTGGGAT-1 SeuratProject 11734 2945 AAACCTGCAGTGGGAT-1
-## AAACCTGCATCGATTG-1 SeuratProject 24898 3929 AAACCTGCATCGATTG-1
-## AAACCTGCATGGTCTA-1 SeuratProject 39127 5171 AAACCTGCATGGTCTA-1
-## AAACCTGGTAAGCACG-1 SeuratProject 41872 5459 AAACCTGGTAAGCACG-1
-## guide_identity read_count UMI_count
-## AAACCTGAGAGTAATC-1 RAN_RAN_+_131356438.23-P1P2_12 544 34
-## AAACCTGAGGGATCTG-1 neg_ctrl_non-targeting_00089 267 19
-## AAACCTGAGGTCATCT-1 POLR2H_POLR2H_+_184081251.23-P1P2_08 622 34
-## AAACCTGCAATGGAGC-1 TUBB_TUBB_+_30688126.23-P1_03 433 20
-## AAACCTGCACCAGGCT-1 CDC23_CDC23_-_137548987.23-P1P2_04 136 8
-## AAACCTGCACTTACGA-1 DUT_DUT_+_48624411.23-P1P2_01 143 11
-## AAACCTGCAGTGGGAT-1 HSPA5_HSPA5_+_128003624.23-P1P2_00 278 11
-## AAACCTGCATCGATTG-1 MTOR_MTOR_+_11322547.23-P1P2_05 854 43
-## AAACCTGCATGGTCTA-1 GATA1_GATA1_-_48645022.23-P1P2_12 681 42
-## AAACCTGGTAAGCACG-1 DUT_DUT_+_48624411.23-P1P2_08 1727 89
-## coverage gemgroup good_coverage number_of_cells gene
-## AAACCTGAGAGTAATC-1 16.00000 1 True 1 RAN
-## AAACCTGAGGGATCTG-1 14.05263 1 True 1 neg
-## AAACCTGAGGTCATCT-1 18.29412 1 True 1 POLR2H
-## AAACCTGCAATGGAGC-1 21.65000 1 True 1 TUBB
-## AAACCTGCACCAGGCT-1 17.00000 1 True 1 CDC23
-## AAACCTGCACTTACGA-1 13.00000 1 True 1 DUT
-## AAACCTGCAGTGGGAT-1 25.27273 1 True 1 HSPA5
-## AAACCTGCATCGATTG-1 19.86047 1 True 1 MTOR
-## AAACCTGCATGGTCTA-1 16.21429 1 True 1 GATA1
-## AAACCTGGTAAGCACG-1 19.40449 1 True 1 DUT
-## gRNA_name cell_type
-## AAACCTGAGAGTAATC-1 RAN_+_131356438.23-P1P2_12 K562
-## AAACCTGAGGGATCTG-1 ctrl_non-targeting_00089 K562
-## AAACCTGAGGTCATCT-1 POLR2H_+_184081251.23-P1P2_08 K562
-## AAACCTGCAATGGAGC-1 TUBB_+_30688126.23-P1_03 K562
-## AAACCTGCACCAGGCT-1 CDC23_-_137548987.23-P1P2_04 K562
-## AAACCTGCACTTACGA-1 DUT_+_48624411.23-P1P2_01 K562
-## AAACCTGCAGTGGGAT-1 HSPA5_+_128003624.23-P1P2_00 K562
-## AAACCTGCATCGATTG-1 MTOR_+_11322547.23-P1P2_05 K562
-## AAACCTGCATGGTCTA-1 GATA1_-_48645022.23-P1P2_12 K562
-## AAACCTGGTAAGCACG-1 DUT_+_48624411.23-P1P2_08 K562
-# standard pre-processing
-seurat_obj = NormalizeData(seurat_obj)
-seurat_obj = FindVariableFeatures(seurat_obj)
-seurat_obj = ScaleData(seurat_obj)
-seurat_obj = RunPCA(seurat_obj)
-
-# calculate Perturbation signatures
-seurat_obj <- CalcPerturbSig(
- object = seurat_obj,
- assay = "RNA",
- slot = "data",
- gd.class ="gene",
- nt.cell.class = "neg",
- reduction = "pca",
- ndims = 40,
- num.neighbors = 20,
- new.assay.name = "PRTB")
-Now we will calculate the Mixscale scores for each cell within each -perturbation group.
-# Mixscale
-seurat_obj = RunMixscale(
- object = seurat_obj,
- assay = "PRTB",
- slot = "scale.data",
- labels = "gene",
- nt.class.name = "neg",
- min.de.genes = 5,
- logfc.threshold = 0.2,
- de.assay = "RNA",
- max.de.genes = 100,
- prtb.type = "P", new.class.name = "mixscale_id", fine.mode = F)
-## [1] "Running Mixscale scoring to calculate the perturbation scores \n"
-## [1] "con1"
-We will now use some plotting functions to explore the perturbation -scores that we just calculated.
-# a. Check the distribution of the scores for the first 10 perturbations
-Mixscale_RidgePlot(object = seurat_obj,
- nt.class.name = "neg",
- PRTB = unique(seurat_obj$gene)[unique(seurat_obj$gene) != "neg"][1:10],
- facet_wrap = "gene", facet_scale = "fixed")
-
-# b. Check if the scores correlate with the expression level of the target gene itself
-Mixscale_ScatterPlot(object = seurat_obj, nt.class.name = "neg",
- PRTB = unique(seurat_obj$gene)[unique(seurat_obj$gene) != "neg"][1:10],
- facet_wrap = "gene", facet_scale = "free_y", nbin = 10)
-
-After calculating the scores, we can use the scores to enhance the -statistical power of DE analysis by using them as a “weights” in the -regression model. Briefly speaking, instead of coding the NT cells as 0 -and the targeted cells as 1, we used the standardized scores to code the -targeted cells in the regression, so that cells with stronger -perturbation strength will have higher “weights” and vice versa.
-# run score-based weighted DE test for 12 selected perturbations. It will return a list of data frames (one for each perturbation)
-de_res = Run_wtDE(object = seurat_obj, assay = "RNA", slot = "counts",
- labels = "gene", nt.class.name = "neg",
- logfc.threshold = 0.1)
-## [1] "Running scoring DE test!\n"
-## [1] "ALDOA is running weighted DE test! "
-## [1] "DE test for 'ALDOA' is completed. Number of remaining = 24"
-## [1] "ATP5E is running weighted DE test! "
-## [1] "DE test for 'ATP5E' is completed. Number of remaining = 23"
-## [1] "BCR is running weighted DE test! "
-## [1] "DE test for 'BCR' is completed. Number of remaining = 22"
-## [1] "CAD is running weighted DE test! "
-## [1] "DE test for 'CAD' is completed. Number of remaining = 21"
-## [1] "CDC23 is running weighted DE test! "
-## [1] "DE test for 'CDC23' is completed. Number of remaining = 20"
-## [1] "COX11 is running weighted DE test! "
-## [1] "DE test for 'COX11' is completed. Number of remaining = 19"
-## [1] "DBR1 is running weighted DE test! "
-## [1] "DE test for 'DBR1' is completed. Number of remaining = 18"
-## [1] "DUT is running weighted DE test! "
-## [1] "DE test for 'DUT' is completed. Number of remaining = 17"
-## [1] "EIF2S1 is running weighted DE test! "
-## [1] "DE test for 'EIF2S1' is completed. Number of remaining = 16"
-## [1] "GATA1 is running weighted DE test! "
-## [1] "DE test for 'GATA1' is completed. Number of remaining = 15"
-## [1] "GINS1 is running weighted DE test! "
-## [1] "DE test for 'GINS1' is completed. Number of remaining = 14"
-## [1] "GNB2L1 is running weighted DE test! "
-## [1] "DE test for 'GNB2L1' is completed. Number of remaining = 13"
-## [1] "HSPA5 is running weighted DE test! "
-## [1] "DE test for 'HSPA5' is completed. Number of remaining = 12"
-## [1] "HSPA9 is running weighted DE test! "
-## [1] "DE test for 'HSPA9' is completed. Number of remaining = 11"
-## [1] "HSPE1 is running weighted DE test! "
-## [1] "DE test for 'HSPE1' is completed. Number of remaining = 10"
-## [1] "MTOR is running weighted DE test! "
-## [1] "DE test for 'MTOR' is completed. Number of remaining = 9"
-## [1] "POLR1D is running weighted DE test! "
-## [1] "DE test for 'POLR1D' is completed. Number of remaining = 8"
-## [1] "POLR2H is running weighted DE test! "
-## [1] "DE test for 'POLR2H' is completed. Number of remaining = 7"
-## [1] "RAN is running weighted DE test! "
-## [1] "DE test for 'RAN' is completed. Number of remaining = 6"
-## [1] "RPL9 is running weighted DE test! "
-## [1] "DE test for 'RPL9' is completed. Number of remaining = 5"
-## [1] "RPS14 is running weighted DE test! "
-## [1] "DE test for 'RPS14' is completed. Number of remaining = 4"
-## [1] "RPS15 is running weighted DE test! "
-## [1] "DE test for 'RPS15' is completed. Number of remaining = 3"
-## [1] "RPS18 is running weighted DE test! "
-## [1] "DE test for 'RPS18' is completed. Number of remaining = 2"
-## [1] "SEC61A1 is running weighted DE test! "
-## [1] "DE test for 'SEC61A1' is completed. Number of remaining = 1"
-## [1] "TUBB is running weighted DE test! "
-## [1] "DE test for 'TUBB' is completed. Number of remaining = 0"
-# have a quick look at the DE results
-head(de_res[[1]])
-## gene_ID beta_Intercept beta_weight beta_log_ct p_Intercept p_weight
-## A1BG A1BG -8.902799 0.017826112 1.921905 8.840714e-24 0.17577534
-## AACS AACS -7.277041 0.014223140 1.314532 1.832249e-07 0.50710637
-## AAK1 AAK1 -9.488100 -0.046482197 1.885136 1.490219e-14 0.02775026
-## AAR2 AAR2 -10.687522 0.008629909 2.195599 3.863385e-22 0.59564226
-## AARS2 AARS2 -11.979437 -0.039386722 2.176107 1.209127e-07 0.29298586
-## AASDH AASDH -11.518426 0.034001181 2.351510 3.069796e-21 0.04429346
-## p_log_ct fc_con1 DE_method
-## A1BG 1.062138e-22 -0.1196471 weighted
-## AACS 2.124449e-05 0.1811394 weighted
-## AAK1 4.995289e-12 0.2841105 weighted
-## AAR2 2.355928e-19 -0.1492889 weighted
-## AARS2 1.363023e-05 0.2976730 weighted
-## AASDH 2.309025e-18 -0.1662662 weighted
-We can now explore the top DE genes for each perturbations using the -customized DoHeatmap function, where cells are ordered by Mixscale -scores.
-# select the top 20 DE genes from on of the perturbation
-top_res = de_res[["GATA1"]][order(de_res[["GATA1"]]$p_weight)[1:20], ]
-# order the DE genes based on its log-fold-change
-top_DEG = rownames(top_res[order(top_res$beta_weight), ])
-
-# heatmap for the top DE genes. cells ordered by Mixscale scores
-Mixscale_DoHeatmap(object = seurat_obj, PRTB = "GATA1",
- slct_condition = "con1",
- nt.class.name = "neg",
- labels = "gene",
- slct_features = top_DEG,
- slct_ident = "gene")
-
-In this section we will focus on how to use the pathway signatures -from our study to run gene set enrichment test in external datasets.
-We will use the interferon-beta (IFNB) stimulated human PBMCs dataset -(ifnb) from Kang et -al 2017 (available via SeuratData package) -to demonstrate how to perform gene set enrichment analyses using the -pathway gene sets from our study. We aim to show that by using our -pathway gene lists, we can correctly infer the pathway activation of -IFNB across different cell types in the human PBMCs.
-# install the ifnb dataset
-SeuratData::InstallData("ifnb")
-# load dataset
-ifnb <- SeuratData::LoadData("ifnb")
-We can then load the pathway gene sets we generated (can be -downloaded from Zenodo). There are two -versions of pathway gene lists provided. One is the standard pathway -gene list for different pathway programs we compiled, and the other one -is the pathway exclusive gene list that filtered out the shared genes -shared with other relevant pathways in the experiment.
-plist = readRDS("pathway_genelist.rds")
-exclusive_plist = readRDS("Exclusive_pathway_genelist.rds")
-
-# only extract the exclusive gene lists that are relevant to IFNB pathway
-exclusive_plist = exclusive_plist[c("IFNG_REMOVE_IFNB", "IFNB_REMOVE_IFNG",
- "IFNB_REMOVE_TNFA", "TNFA_REMOVE_IFNB")]
-We will first conduct Wilcox DE tests between the control and the -IFNB-stimulated cells in each cell types in the ifnb dataset. Then, we -will perform Fisher enrichment tests for the DE genes from each of the -cell types, testing them against the pathway gene lists we just load. -These two steps are merged by a wrapper function.
-# Normalize the counts
-ifnb = NormalizeData(ifnb)
-
-# A wrapper function to perform both DE and enrichment test
-res = Mixscale_DEenrich(object = ifnb,
- plist = plist,
- labels = "seurat_annotations",
- conditions = "stim",
- ident.1 = "STIM",
- ident.2 = "CTRL",
- direction = "up",
- logfc.threshold = 0.2,
- p.val.cutoff = 0.05,
- min.pct = 0.1)
-
-# check the enrichment results for CD14 Monocytes
-head(res$`CD14 Mono`)
-## GO_term OR Pval combined_score num_LH num_PH
-## 1 IFNB_program1_down 30.551460 1.566042e-69 4840.26233710267 148 164
-## 3 IFNG_program1_down 13.764122 6.311636e-47 1464.21496809152 125 154
-## 5 TNFA_program1_down 7.315289 8.145677e-26 422.602219842384 91 128
-## 4 IFNG_program2_down 14.092937 5.223443e-21 658.156092839788 54 64
-## 2 IFNB_program2_down 1.353308 1.792922e-01 2.3259826904708 20 57
-## 6 TNFA_program2_down 1.537152 3.469431e-01 1.62722098095663 5 11
-## n_GO_term num_DEG direction_DEG
-## 1 300 590 upDEG
-## 3 300 590 upDEG
-## 5 300 590 upDEG
-## 4 200 590 upDEG
-## 2 223 590 upDEG
-## 6 98 590 upDEG
-Gene lists from related pathways, such as IFNG, IFNB, and TNFA which -are all linked to immune responses, frequently share many genes. This -overlap makes it challenging to differentiate the activation of these -pathways. For example, as the result above shows, DE genes due to IFNB -stimulation are enriched in not just the IFNB pathway, but also in IFNG -and TNFA pathways. To overcome this challenge, we have introduced a -concept of pathway-exclusive gene lists. Essentially, for any two -related pathways, we define the exclusive genes of one pathway as those -that are absent from the gene list of the other. To refine this further, -we employed a more stringent criterion to exclude genes that, while -potentially related, are not explicitly listed in the gene list of the -other pathway (For a detailed explanation, please refer to our paper). Performing enrichment tests using the -exclusive gene lists enhances our ability to accurately distinguish -activations among closely associated pathways.
-# A wrapper function to perform both DE and enrichment test
-res_exclusive = Mixscale_DEenrich(object = ifnb,
- plist = exclusive_plist,
- labels = "seurat_annotations",
- conditions = "stim",
- ident.1 = "STIM",
- ident.2 = "CTRL",
- direction = "up",
- logfc.threshold = 0.2,
- p.val.cutoff = 0.05,
- min.pct = 0.1)
-
-# check the enrichment results for CD14 Monocytes
-head(res_exclusive$`CD14 Mono`)
-## GO_term OR Pval combined_score num_LH num_PH
-## 3 IFNB_REMOVE_TNFA 9.8573087 8.350578e-30 659.998290889661 90 117
-## 2 IFNB_REMOVE_IFNG 9.8435531 1.200209e-19 428.850346678085 58 74
-## 1 IFNG_REMOVE_IFNB 1.7696262 6.530225e-02 4.82882984756938 16 38
-## 4 TNFA_REMOVE_IFNB 0.1313289 9.999754e-01 3.22910851679664e-06 3 42
-## n_GO_term num_DEG direction_DEG
-## 3 433 590 upDEG
-## 2 289 590 upDEG
-## 1 198 590 upDEG
-## 4 188 590 upDEG
-We can see that the exclusive gene lists for IFNB (removing TNFA) and -IFNB (removing IFNG) are still enriched for IFNB-stimulated DE genes. -But we do not observe signals from IFNG (removing IFNB) or TNFA -(removing IFNB), indicating that the underlying activated pathway during -IFNB stimulation is indeed IFNB, while IFNG and TNFA are showing -enrichment just because of their substantial overlap with IFNB.
-We can now visualize the enrichment results across all the cell types -in the ifnb dataset. First we will check the results for the standard -enrichment test
-DEenrich_DotPlot(res,
- direction = "up",
- plot_title = "Standard pathway gene lists")
-
-DEenrich_DotPlot(res_exclusive,
- direction = "up",
- plot_title = "Pathway exclusive gene lists",
- OR_cutoff = 10)
-
-Apart from performing enrishment test, we can also evaluate the -pathway activity by calculating the over all expression level of all the -genes within a gene list (the so-called module score analysis). We will -use package “UCell” -for module score analysis. Alternatively, we can use the built-in -function AddModuleScore() -from Seurat as well.
-ifnb = UCell::AddModuleScore_UCell(ifnb,
- features = plist[c("IFNB_program1_down", "IFNG_program1_down",
- "TNFA_program1_down", "TGFB1_program1_down")] )
-
-# using VlnPlot to visualize the score of each cell
-VlnPlot(ifnb,
- features = grep("_UCell", names(ifnb@meta.data), value = T),
- pt.size = 0,
- group.by = "seurat_annotations",
- split.by = "stim",
- ncol = 2) &
- theme(legend.position = "NA",
- axis.title = element_text(size = 15),
- axis.text = element_text(size = 12),
- plot.title = element_text(size = 18)) &
- ylim(0.1, 0.4)
-
-We can observe very similar results as in our enrichment tests, where -all IFNB, IFNG, and TNF pathways show a high activity (and not for TGFB -pathway) in the IFNB-stimulated cells compared to the non-stimulated -cells. And if we repeat the module score analysis using the pathway -exclusive gene lists, we should be able to determine the pathway -actually being activated (i.e., IFNB pathway).
-ifnb = UCell::AddModuleScore_UCell(ifnb,
- features = exclusive_plist[c("IFNB_REMOVE_IFNG", "IFNB_REMOVE_TNFA",
- "IFNG_REMOVE_IFNB", "TNFA_REMOVE_IFNB")] )
-
-# using VlnPlot to visualize the score of each cell
-VlnPlot(ifnb,
- features = grep("_REMOVE_.*UCell", names(ifnb@meta.data), value = T),
- pt.size = 0,
- group.by = "seurat_annotations",
- split.by = "stim",
- ncol = 2) &
- theme(legend.position = "NA",
- axis.title = element_text(size = 15),
- axis.text = element_text(size = 12),
- plot.title = element_text(size = 18))
-
-