forked from NBISweden/single-cell_sib_scilifelab
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathseurat_01_qc_compiled.Rmd
232 lines (166 loc) · 12.1 KB
/
seurat_01_qc_compiled.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
---
title: "Seurat: Quality control"
author: "Åsa Björklund & Paulo Czarnewski"
date: "Sept 13, 2019"
output:
html_document:
self_contained: true
highlight: tango
df_print: paged
toc: yes
toc_float:
collapsed: false
smooth_scroll: true
toc_depth: 3
keep_md: yes
html_notebook:
self_contained: true
highlight: tango
df_print: paged
toc: yes
toc_float:
collapsed: false
smooth_scroll: true
toc_depth: 3
keep_md: yes
---
***
# Get data
In this tutorial, we will be using 3 publicly available dataset downloaded from 10X Genomics repository. They can be downloaded using the following bash commands. Simply create a folder called `data` and then use `curl` to pull the data from the 10X database.
```{bash, results='hide'}
mkdir data
curl -o data/pbmc_1k_v2_filtered_feature_bc_matrix.h5 -O http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v2/pbmc_1k_v2_filtered_feature_bc_matrix.h5
curl -o data/pbmc_1k_v3_filtered_feature_bc_matrix.h5 -O http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_filtered_feature_bc_matrix.h5
curl -o data/pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5 -O http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5
```
With data in place, now we can start loading libraries we will use in this tutorial.
```{r, message='hide',warning='hide',results='hold'}
suppressMessages(require(Seurat))
suppressMessages(require(scater))
suppressMessages(require(Matrix))
```
We can first load the data individually by reading directly from HDF5 file format (.h5). Note that among those , the dataset p3.1k actually has both gene expression and CITE-seq data, so we will use only the `Gene Expression` here.
```{r,message='hide',warning='hide',results='hold'}
v3.1k <- Read10X_h5("data/pbmc_1k_v3_filtered_feature_bc_matrix.h5", use.names = T)
v2.1k <- Read10X_h5("data/pbmc_1k_v2_filtered_feature_bc_matrix.h5", use.names = T)
p3.1k <- Read10X_h5("data/pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5", use.names = T)
p3.1k <- p3.1k$`Gene Expression`
```
***
# Create Object
We can now load the expression matricies into objects and then merge them into a single merged object. Each analysis workflow (Seurat, Scater, Scranpy, etc) has its own way of storing data. We will add dataset labels as cell.ids just in case you have overlapping barcodes between the datasets. After that we add a column `Chemistry` in the metadata for plotting later on.
```{r}
sdata.v2.1k <- CreateSeuratObject(v2.1k, project = "v2.1k")
sdata.v3.1k <- CreateSeuratObject(v3.1k, project = "v3.1k")
sdata.p3.1k <- CreateSeuratObject(p3.1k, project = "p3.1k")
# Merge datasets into one single seurat object
alldata <- merge(sdata.v2.1k, c(sdata.v3.1k,sdata.p3.1k), add.cell.ids=c("v2.1k","v3.1k","p3.1k"))
# Add in a metadata column that indicates v2 vs v3 chemistry
alldata$Chemistry <- ifelse(alldata$orig.ident == "v2.1k","v2","v3")
```
Here it is how the count matrix and the metatada look like for every cell.
```{r , results='hold'}
as.data.frame(alldata@assays$RNA@counts[1:10,1:2])
head([email protected],10)
```
***
# Calculate QC
Having the data in a suitable format, we can start calculating some quality metrics. We can for example calculate the percentage of mitocondrial and ribossomal genes per cell and add to the metadata. This will be helpfull to visualize them across different metadata parameteres (i.e. datasetID and chemistry version). There are several ways of doing this, and here manually calculate the proportion of mitochondrial reads and add to the metadata table.
Citing from "Simple Single Cell" workflows (Lun, McCarthy & Marioni, 2017): "High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane."
```{r, results='hold'}
# Way1: Doing it using Seurat function
alldata <- PercentageFeatureSet(alldata, "^MT-", col.name = "percent_mito")
# Way2: Doing it manually
total_counts_per_cell <- colSums( alldata@assays$RNA@counts )
mito_genes <- rownames(alldata)[grep("^MT-",rownames(alldata))]
head(mito_genes,10)
alldata$percent_mito <- colSums( alldata@assays$RNA@counts[mito_genes,] ) / total_counts_per_cell
```
In the same manner we will calculate the proportion gene expression that comes from ribosomal proteins.
```{r, results='hold'}
# Way1: Doing it using Seurat function
alldata <- PercentageFeatureSet(alldata, "^RP[SL]", col.name = "percent_ribo")
# Way2: Doing it manually
ribo_genes <- rownames(alldata)[grep("^RP[SL]",rownames(alldata))]
head(ribo_genes,10)
alldata$percent_ribo <- colSums( alldata@assays$RNA@counts[ribo_genes,] ) / total_counts_per_cell
```
***
# Plot QC
Now we can plot some of the QC-features as violin plots.
```{r,message='hide',warning='hide', results='hold',results='hold',fig.height=5,fig.width=16}
feats <- c("nFeature_RNA","nCount_RNA","percent_mito","percent_ribo")
VlnPlot(alldata, group.by= "orig.ident", features = feats, pt.size = 0.1,ncol = 4) + NoLegend()
```
As you can see, the v2 chemistry gives lower gene detection, but higher detection of ribosomal proteins. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. And we can plot the different QC-measures as scatter plots.
```{r,message='hide',warning='hide', results='hold',results='hold',fig.height=5,fig.width=16}
cowplot::plot_grid(ncol = 4,
FeatureScatter(alldata, "nCount_RNA" , "nFeature_RNA", group.by = "orig.ident", pt.size = .1),
FeatureScatter(alldata, "percent_mito", "nFeature_RNA", group.by = "orig.ident", pt.size = .1),
FeatureScatter(alldata, "percent_ribo", "nFeature_RNA", group.by = "orig.ident", pt.size = .1),
FeatureScatter(alldata, "percent_ribo", "percent_mito", group.by = "orig.ident", pt.size = .1)
)
```
***
# Filtering
## Detection-based filtering
A standard approach is to filter cells with low amount of reads as well as genes that are present in at least a certain amount of cells. Here we will only consider cells with at least 200 detected genes and genes need to be expressed in at least 3 cells. Please note that those values are highly dependent on the library preparation method used.
```{r,message='hide',warning='hide', results='hold',results='hold',fig.height=5,fig.width=4}
selected_c <- WhichCells(alldata, expression = nFeature_RNA > 200)
selected_f <- rownames(alldata)[ Matrix::rowSums(alldata) > 3]
data.filt <- subset(alldata, features=selected_f, cells=selected_c)
dim(data.filt)
```
Additionaly, Extremely high number of detected genes could indicate doublets. However, depending on the cell type composition in your sample, you may have cells with higher number of genes (and also higher counts) from one cell type. In these datasets, there is also a clear difference between the v2 vs v3 10x chemistry with regards to gene detection, so it may not be fair to apply the same cutoffs to all of them. Also, in the protein assay data there is a lot of cells with few detected genes giving a bimodal distribution. This type of distribution is not seen in the other 2 datasets. Considering that they are all PBMC datasets it makes sense to regard this distribution as low quality libraries. Filter the cells with high gene detection (putative doublets) with cutoffs 4100 for v3 chemistry and 2000 for v2. Here, we will filter the cells with low gene detection (low quality libraries) with less than 1000 genes for v2 and < 500 for v2.
```{r,message='hide',warning='hide', results='hold',results='hold',fig.height=5,fig.width=16}
#start with cells with many genes detected.
high.det.v3 <- WhichCells(data.filt, expression = nFeature_RNA > 4100)
high.det.v2 <- WhichCells(data.filt, expression = nFeature_RNA > 2000 & orig.ident == "v2.1k")
# remove these cells
data.filt <- subset(data.filt, cells=setdiff(WhichCells(data.filt),c(high.det.v2,high.det.v3)))
# check number of cells
ncol(data.filt)
```
Additionally, we can also see which genes contribute the most to such reads. We can for instance plot the percentage of counts per gene.
```{r,message='hide',warning='hide', results='hold',results='hold',fig.height=5,fig.width=12}
#Compute the relative expression of each gene per cell
rel_expression <- t( t(data.filt@assays$RNA@counts) / Matrix::colSums(data.filt@assays$RNA@counts)) * 100
most_expressed <- sort(Matrix::rowSums( rel_expression ),T)[20:1] / ncol(data.filt)
par(mfrow=c(1,2),mar=c(4,6,1,1))
boxplot( as.matrix(t(rel_expression[names(most_expressed),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(20)[20:1],horizontal=TRUE)
```
As you can see, MALAT1 constitutes up to 30% of the UMIs from a single cell and the other top genes are mitochondrial and ribosomal genes. Let us assemble some information about such genes, which are important for quality control and downstream filtering.
## Mito/Ribo filtering
We also have quite a lot of cells with high proportion of mitochondrial and ribosomal reads. It could be wise to remove those cells, if we have enough cells left after filtering. Another option would be to either remove all mitochondrial reads from the dataset and hope that the remaining genes still have enough biological signal. A third option would be to just regress out the `percent_mito` variable during scaling. In this case we had as much as 99.7% mitochondrial reads in some of the cells, so it is quite unlikely that there is much cell type signature left in those. Looking at the plots, make reasonable decisions on where to draw the cutoff. In this case, the bulk of the cells are below 25% mitochondrial reads and that will be used as a cutoff.
```{r,message='hide',warning='hide', results='hold',results='hold',fig.height=5,fig.width=16}
selected_mito <- WhichCells(data.filt, expression = percent_mito < 0.25)
selected_ribo <- WhichCells(data.filt, expression = percent_ribo > 0.05)
# and subset the object to only keep those cells
data.filt <- subset(data.filt, cells = selected_mito)
data.filt <- subset(data.filt, cells = selected_ribo)
```
As you can see, there is still quite a lot of variation in `percent_mito`, so it will have to be dealt with in the data analysis step. We can also notice that the `percent_ribo` are also highly variable, but that is expected since different cell types have different proportions of ribosomal content, according to their function.
## Plot filtered QC
Lets plot the same QC-stats another time.
```{r,message='hide',warning='hide', results='hold',results='hold',fig.height=5,fig.width=16}
feats <- c("nFeature_RNA","nCount_RNA","percent_mito","percent_ribo")
cowplot::plot_grid(ncol = 1,
VlnPlot(data.filt, group.by= "orig.ident", features = feats, pt.size = 0.1,ncol = 4) + NoLegend())
```
# Calculate cell-cycle scores
We here perform cell cycle scoring. To score a gene list, the algorithm calculates the difference of mean expression of the given list and the mean expression of reference genes. To build the reference, the function randomly chooses a bunch of genes matching the distribution of the expression of the given list. Cell cycle scoring adds three slots in data, a score for S phase, a score for G2M phase and the predicted cell cycle phase.
```{r,message='hide',warning='hide', results='hold',results='hold',fig.height=4,fig.width=8}
data.filt <- NormalizeData(data.filt)
data.filt <- CellCycleScoring(object = data.filt,
g2m.features = cc.genes$g2m.genes,
s.features = cc.genes$s.genes)
```
We can now plot a violin plot for the cell cycle scores as well.
```{r,message='hide',warning='hide', results='hold',results='hold',fig.height=5,fig.width=16}
VlnPlot(data.filt, features = c("S.Score","G2M.Score"), group.by= "orig.ident",ncol = 4, pt.size = .1)
```
In this case it looks like we only have a few cycling cells in the datasets.
Finally, lets save the QC-filtered data for further analysis.
```{r,message='hide',warning='hide', results='hold',results='hold',fig.height=5,fig.width=16}
saveRDS(data.filt,"data/3pbmc_qc.rds")
```