-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathlab_kallisto.Rmd
226 lines (140 loc) · 7.48 KB
/
lab_kallisto.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
---
title: 'Kallisto '
subtitle: Workshop on RNA-Seq
output:
bookdown::html_document2:
highlight: textmate
toc: true
toc_float:
collapsed: true
smooth_scroll: true
print: false
toc_depth: 4
number_sections: true
df_print: default
code_folding: none
self_contained: false
keep_md: false
encoding: 'UTF-8'
css: "assets/lab.css"
include:
after_body: assets/footer-lab.html
---
```{r,child="assets/header-lab.Rmd"}
```
# Kallisto
**Kallisto** is an "alignment-free" RNA-Seq quantification method that runs very fast with a small memory footprint, so that it can be run on most laptops. It is a command-line program that can be downloaded as binary executables for Linux or Mac, or in source code format. For a first insight into the program, read [here](https://liorpachter.wordpress.com/2015/05/10/near-optimal-rna-seq-quantification-with-kallisto/) and for the published article, see [here](http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.3519.html).
Kallisto is geared towards quantification on the transcript (isoform) level, rather than the gene-level (although the latter can also be done by post-processing Kallisto output.) However, read assignment to transcript isoforms cannot (in general) be done unambiguously, so there is an intrinsic "quantification noise" or variability in this process. Kallisto can thus be run either in a single step (which is very fast) or in "bootstrap" mode (which takes longer, but can be done on several processors in parallel) in order to get uncertainty estimates for the expression levels - a kind of error bars for the quantification process. Running with bootstraps is mandatory if you want to perform differential expression analysis of isoforms with Sleuth (see below).
Kallisto is primarily meant for quantification of an existing set of FASTA sequences, that is, it does not perform transcript assembly and it cannot quantify the expression of novel transcripts that are not in the transcript index that you provide to it. With that said, you can of course use contigs from an assembly that you have produced in some other program in your Kallisto index. It would also be possible to use the software for eg: metagenomics or metatranscriptomics quantification.
## Build index
Download the cDNA reference and create a index file to run on (This has already been done!). But if you want to do it yourself you can uncomment the code and run it yourselves.
```{sh,eval=FALSE,chunk.title=TRUE}
#cd ~/RNAseq/
#mkdir rawData/references/ref_kallisto
#cd rawData/references/ref_kallisto
#wget ftp://ftp.ensembl.org/pub/release-101/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
#kallisto index \
# --index kallisto_mm10_transcriptome_index.idx \
# Mus_musculus.GRCm38.cdna.all.fa.gz
#cd ~/RNAseq/
```
It takes less than 10 minutes.
## Pseudoalign reads
Now you can pseudoalign all your reads to the Kallisto index using Kallisto quantification
```{sh,eval=FALSE,chunk.title=TRUE}
cd ~/RNAseq/
for sample_name in $(ls rawData/fastqFiles/*.clean.fq.gz.1M_reads.fq.gz); do
file=${sample_name#rawData/fastqFiles/}
sample=${file%.clean.fq.gz.1M_reads.fq.gz}
echo $sample
mkdir results/05-kallisto/$sample
kallisto quant \
--index rawData/references/ref_kallisto/kallisto_mm10_transcriptome_index.idx \
--output-dir results/05-kallisto/$sample \
--threads 3 \
--single \
--fragment-length 200 \
--sd 20 \
$sample_name
done
```
# Quantification
## Get gene quantifications
Since Kallisto uses transcripts estimates and many programs, like DEseq and EdgeR, works better with gene counts there are programs to get gene counts from transcript estimates in our case we will use txImport. TxImport can import estimates from a lot of different sources. You can read more about it in the [txImport vignette](https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html)
For the program to work, it needs a two-column-file that assigns each transcript to a gene. This can found using ensembl database or other public databases. In our case the relationship between the transcripts and gene is found in the sequencenames in the cDNA file that we downloaded. By using awk we can extract the transcript gene dependency and write it to a file.
```{sh,eval=FALSE,chunk.title=FALSE}
# Example of a sequence name in file
# >ENSMUST00000177564.1 cdna chromosome:GRCm38:14:54122226:54122241:1 gene:ENSMUSG00000096176.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trdd2 description:T cell receptor delta diversity 2 [Source:MGI Symbol;Acc:MGI:4439546]
# Extract all transcriptnames (1st) and genenames (4th) from sequence names and write to a file. If "gzcat" does not work, you can try replacing it by "gunzip -c"
gzcat rawData/references/ref_kallisto/Mus_musculus.GRCm38.cdna.all.fa.gz| \
grep '>' |
awk '{FS= " "}BEGIN{ print "TXNAME,GENEID"};{print substr($1,2) "," substr($4,6)};' \
>rawData/references/ref_kallisto/tx2gene.mm.GRCm38.cdna.csv
```
## TxImport
The next steps are carried out in R so make sure you start R in the conda environment.
```{r ,eval=FALSE,chunk.title=FALSE}
library(dplyr) # data wrangling
library(ggplot2) # plotting
library(DESeq2) # rna-seq
library(edgeR) # rna-seq
library(tximport) # importing kalisto transcript counts to geneLevels
library(readr) # Fast readr of files.
library(rhdf5) # read/convert kalisto output files.
```
If *tximport* and *rhdf5* are not installed you can install them now.
```{r ,eval=FALSE,chunk.title=FALSE}
#if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("tximport")
#BiocManager::install("rhdf5")
```
Load the metadata for the samples.
```{r ,eval=FALSE,chunk.title=FALSE}
# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")
# download metadata
download_data("data/metadata_raw.csv")
mr <- read.csv("data/metadata_raw.csv",header=TRUE,stringsAsFactors=F,row.names=1)
mr
```
### Convert to gene counts
Read in the Kallisto transcript abundances and convert the TxImport.
TxImport is created to
```{r ,eval=FALSE,chunk.title=FALSE}
setwd("~/RNAseq")
files <- paste("results/05-kallisto" ,
list.files(path = "results/05-kallisto",
pattern = "abundance.tsv",
recursive = TRUE),
sep = "/")
names(files) <- mr$SampleName
tx2gene <- read_csv("rawData/references/ref_kallisto/tx2gene.mm.GRCm38.cdna.csv")
txi.kallisto.tsv <- tximport(files,
type = "kallisto",
tx2gene = tx2gene,
ignoreAfterBar = TRUE)
```
### Convert to DEseq2 object for analysis
```{r ,eval=FALSE,chunk.title=FALSE}
mr = mr %>% mutate(Day = as.factor(Day))
dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, mr, ~Day)
test = DESeq(dds)
```
and now you can go on with the differential gene expression in a similar way as the one you did with the feature count table.
### Convert to a edgeR object for analysis
```{r,eval=FALSE,chunk.title=FALSE}
cts <- txi.kallisto.tsv$counts
normMat <- txi.kallisto.tsv$length
normMat <- normMat/exp(rowMeans(log(normMat)))
library(edgeR)
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o)
# y is now ready for estimate dispersion functions see edgeR User's Guide
```
# Session info
```{r,echo=FALSE}
sessionInfo()
```
***