forked from NBISweden/single-cell_sib_scilifelab
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathseurat_02_dim_reduction_compiled.Rmd
281 lines (204 loc) · 10 KB
/
seurat_02_dim_reduction_compiled.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
---
title: "Dimensionality reduction"
author: "Åsa Björklund & Paulo Czarnewski"
date: "Sept 13, 2019"
output:
html_document:
self_contained: true
highlight: tango
df_print: paged
toc: yes
toc_float:
collapsed: false
smooth_scroll: true
toc_depth: 3
keep_md: yes
html_notebook:
self_contained: true
highlight: tango
df_print: paged
toc: yes
toc_float:
collapsed: false
smooth_scroll: true
toc_depth: 3
---
# Dimensionality reduction
Paulo Czarnewski
```{r setup, include=FALSE}
knitr::opts_chunk$set(message=FALSE, warning=FALSE, result='hold',fig.width=16)
```
<br/>
## Data preparation
***
First, let's load all necessary libraries and the QC-filtered dataset from the previous step.
```{r}
suppressWarnings(suppressMessages(require(Seurat)))
suppressWarnings(suppressMessages(require(scran)))
suppressWarnings(suppressMessages(require(ggplot2)))
suppressWarnings(suppressMessages(require(cowplot)))
alldata <- readRDS("data/3pbmc_qc.rds")
```
<br/>
### Feature selection
Next, we first need to define which features/genes are important in our dataset to distinguish cell types. For this purpose, we need to find genes that are highly variable across cells, which in turn will also provide a good separation of the cell clusters.
```{r}
suppressWarnings(suppressMessages(alldata <- FindVariableFeatures(alldata, selection.method = "vst", nfeatures = 850,verbose = FALSE,assay = "RNA")))
top20 <- head(VariableFeatures(alldata), 20)
LabelPoints(plot = VariableFeaturePlot(alldata), points = top20, repel = TRUE)
```
<br/>
### Z-score transformation
Now that the data is prepared, we now proceed with PCA. Since each gene has a different expression level, it means that genes with higher expression values will naturally have higher variation that will be captured by PCA. This means that we need to somehow give each gene a similar weight when performing PCA (see below). The common practice is to center and scale each gene before performing PCA. This exact scaling is called Z-score normalization it is very useful for PCA, clustering and plotting heatmaps. Additionally, we can use this function to remove any unwanted sources of variation from the dataset, such as `cell cycle`, `sequencing depth`, `percent mitocondria`. This is achieved by doing a generalized linear regression using these parameters as covariates in the model. Then the residuals of the model are taken as the "regressed data". Although not in the best way, batch effect regression can also be done here.
```{r}
alldata <- ScaleData(alldata, vars.to.regress = "percent_mito", assay = "RNA")
```
<br/>
## PCA
***
Performing PCA has many useful applications and interpretations, which much depends on the data used. In the case of life sciences, we want to segregate samples based on gene expression patterns in the data.
```{r}
alldata <- RunPCA(alldata, npcs = 50, reduction.name = "PCA_on_RNA", assay = "RNA",verbose = F)
```
We can plot the first 6 dimensions like so.
```{r, fig.asp=.28}
plot_grid(ncol = 3,
DimPlot(alldata, reduction = "PCA_on_RNA", group.by = "orig.ident",dims = 1:2),
DimPlot(alldata, reduction = "PCA_on_RNA", group.by = "orig.ident",dims = 3:4),
DimPlot(alldata, reduction = "PCA_on_RNA", group.by = "orig.ident",dims = 5:6) )
```
To identify which genes (Seurat) or metadata paramters (Scater/Scran) contribute the most to each PC, one can retreive the loading matrix information. Unfortunatelly this is not implemented in Scater/Scran, so you will need to compute PCA using `logcounts`.
```{r,fig.asp=.5}
VizDimLoadings(alldata, dims = 1:5, reduction = "PCA_on_RNA",ncol = 5,balanced = T)
```
We can also plot the amount of variance explained by each PC.
```{r,fig.asp=.3}
ElbowPlot(alldata, reduction = "PCA_on_RNA",ndims = 50)
```
Based on this plot, we can see that the top 7 PCs retain a lot of information, while other PCs contain pregressivelly less. However, it is still advisable to use more PCs since they might contain informaktion about rare cell types (such as platelets and DCs in this dataset)
<br/>
## tSNE
***
We can now run [BH-tSNE](https://arxiv.org/abs/1301.3342).
```{r,fig.asp=1}
alldata <- RunTSNE(alldata, reduction = "PCA_on_RNA", dims = 1:30, reduction.name = "TSNE_on_RNA",
perplexity=30,
max_iter=1000,
theta=0.5,
eta=200,
num_threads=0 )
#see ?Rtsne and ?RunTSNE for more info
```
We can now plot the tSNE colored per dataset. We can start now clearly see the effect of batches present in the dataset.
```{r,fig.asp=.28}
plot_grid(ncol = 3,DimPlot(alldata, reduction = "TSNE_on_RNA", group.by = "orig.ident"))
```
***
<br/>
## UMAP
***
We can now run [UMAP](https://arxiv.org/abs/1802.03426).
```{r}
alldata <- RunUMAP(alldata, reduction = "PCA_on_RNA", dims = 1:30,reduction.name = "UMAP_on_RNA",
n.components=2,
n.neighbors=30,
n.epochs=200,
min.dist=0.3,
learning.rate=1,
spread=1 )
#see ?RunUMAP for more info
```
Another usefullness of UMAP is that it is not limitted by the number of dimensions the data cen be reduced into (unlike tSNE). We can simply reduce the dimentions altering the `n.components` parameter.
```{r}
alldata <- RunUMAP(alldata, reduction.name = "UMAP10_on_RNA",
reduction = "PCA_on_RNA",
dims = 1:30,
n.components=10,
n.neighbors=30,
n.epochs=200,
min.dist=0.3,
learning.rate=1,
spread=1 )
#see ?RunUMAP for more info
```
We can now plot the UMAP colored per dataset. Although less distinct as in the tSNE, we still see quite an effect of the different batches in the data.
```{r,fig.asp=.28}
plot_grid(ncol = 3,
DimPlot(alldata, reduction = "UMAP_on_RNA", group.by = "orig.ident")+ ggplot2::ggtitle(label ="UMAP_on_RNA"),
DimPlot(alldata, reduction = "UMAP10_on_RNA", group.by = "orig.ident",dims = 1:2)+ ggplot2::ggtitle(label ="UMAP10_on_RNA"),
DimPlot(alldata, reduction = "UMAP10_on_RNA", group.by = "orig.ident",dims = 3:4)+ ggplot2::ggtitle(label ="UMAP10_on_RNA")
)
```
We can now plot PCA, UMAP and tSNE side by side for comparison. Here, we can conclude that our dataset contains a batch effect that needs to be corrected before proceeding to clustering and differential gene expression analysis.
```{r,fig.asp=.28}
plot_grid(ncol = 3,
DimPlot(alldata, reduction = "PCA_on_RNA", group.by = "orig.ident"),
DimPlot(alldata, reduction = "TSNE_on_RNA", group.by = "orig.ident"),
DimPlot(alldata, reduction = "UMAP_on_RNA", group.by = "orig.ident")
)
```
<br/>
## Using ScaledData and graphs for DR
***
Althought running a sencond dimmensionality reduction (i.e tSNE or UMAP) on PCA would be a standard approach (because it allows higher computation efficiency), the options are actually limiteless. Below we will show a couple of other common options such as running directly on the scaled data (which was used for PCA) or on a graph built from scaled data. We will show from now on only UMAP, but the same applies for tSNE.
<br/>
### Using ScaledData for UMAP
To run tSNE or UMAP on the scaled data, one firts needs to select the number of variables to use. This is because including dimentions that do contribute to the separation of your cell types will in the end mask those differences. Another reason for it is because running with all genes/features also will take longer or might be computationally unfeasible. Therefore we will use the scaled data of the highly variable genes.
```{r}
alldata <- RunUMAP(alldata, reduction.name = "UMAP_on_ScaleData",
features = alldata@[email protected],
assay = "RNA",
n.components=2,
n.neighbors=30,
n.epochs=200,
min.dist=0.3,
learning.rate=1,
spread=1 )
```
<br/>
### Using a Graph for UMAP
To run tSNE or UMAP on the a graph, we first need to build a graph from the data. In fact, both tSNE and UMAP first build a graph from the data using a specified distance metrix and then optimize the embedding. Since a graph is just a matrix containing distances from cell to cell and as such, you can run either UMAP or tSNE using any other distance metric desired. Euclidean and Correlation are ususally the most commonly used.
```{r}
#Build Graph
alldata <- FindNeighbors(alldata,
reduction = "PCA_on_RNA",
graph.name = "SNN",
assay = "RNA",
k.param = 20,
features = alldata@[email protected])
#Run UMAP on a graph
alldata <- RunUMAP(alldata, reduction.name = "UMAP_on_Graph",
graph = "SNN",
assay = "RNA" )
```
We can now plot the UMAP comparing both on PCA vs ScaledSata vs Graph.
```{r, fig.asp=.28}
plot_grid(ncol = 3,
DimPlot(alldata, reduction = "UMAP_on_RNA", group.by = "orig.ident")+ ggplot2::ggtitle(label ="UMAP_on_RNA"),
DimPlot(alldata, reduction = "UMAP_on_ScaleData", group.by = "orig.ident")+ ggplot2::ggtitle(label ="UMAP_on_ScaleData"),
DimPlot(alldata, reduction = "UMAP_on_Graph", group.by = "orig.ident")+ ggplot2::ggtitle(label ="UMAP_on_Graph")
)
```
<br/>
## Ploting genes of interest
***
Let's plot some marker genes for different celltypes onto the embedding. Some genes are:
Markers | Cell Type
--- | ---
CD3E | T cells
CD3E CD4 | CD4+ T cells
CD3E CD8A | CD8+ T cells
GNLY, NKG7 | NK cells
MS4A1 | B cells
CD14, LYZ, CST3, MS4A7 | CD14+ Monocytes
FCGR3A, LYZ, CST3, MS4A7 | FCGR3A+ Monocytes
FCER1A, CST3 | DCs
```{r,fig.asp=1.1}
myfeatures <- c("CD3E","CD4","CD8A","NKG7","GNLY","MS4A1","CD14","LYZ","MS4A7","FCGR3A","CST3","FCER1A")
FeaturePlot(alldata, reduction = "UMAP_on_RNA",dims = 1:2,
features = myfeatures,ncol = 3,order = T)
```
We can finally save the object for use in future steps.
```{r}
saveRDS(alldata,"data/3pbmc_qc_dm.rds")
```