-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathlab_mapping.Rmd
239 lines (150 loc) · 8.27 KB
/
lab_mapping.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
---
title: "Mapping"
subtitle: "Workshop on RNA-Seq"
output:
bookdown::html_document2:
highlight: textmate
toc: true
toc_float:
collapsed: true
smooth_scroll: true
print: false
toc_depth: 4
number_sections: true
df_print: default
code_folding: none
self_contained: false
keep_md: false
encoding: 'UTF-8'
css: "assets/lab.css"
include:
after_body: assets/footer-lab.html
---
```{r,child="assets/header-lab.Rmd"}
```
In this tutorial we will use a splice aware aligner HISAT2 that is the best solution when using a laptop or a personal computer with less than 16GB. If you have a computer with memory >32GB or access to computer cluster it is worthwhile to consider using STAR and compare the results.
# HISAT2
## Data
All FASTQ files that you will need for this exercise can be found in the `fastq` subfolder of your data folder.
A pre-built mouse genome index for HISAT2 can be found in the `references/hisat2_build` subfolder of your data folder.
## Mapping
Here, you will map the reads to the mouse reference genome using the RNA-seq aligner **HISAT2**.
There are many features that can be tweaked using HISAT2. For more information on all flags that can be used go [here](https://daehwankimlab.github.io/hisat2/manual/).
Read below for the flags we use for this exercise. Remember to change filenames accordingly so that you can run your program properly and know which files you have used.
Now you can map the reads from one of the samples (or several; it's up to you which ones) using a command such as the one below.
```{sh,eval=FALSE,chunk.title=TRUE}
mkdir outDir
# for paired end reads
hisat2 -p N -x path/to/index/fileName \
-1 path/to/reads/sample_1.fastq \
-2 path/to/reads/sample_2.fastq \
-S outDir/hisat2.sam \
--summary-file outDir/hisat2_summary.txt
# for single end reads
hisat2 -p N -x path/to/index/fileName \
-U path/to/reads/sample.fastq \
-S outDir/hisat2.sam \
--summary-file outDir/hisat2_summary.txt
```
The flags used are:
* ``-p N`` specifies the number of threads that will be used by the program
* ``-x /path/to/index/fileName`` specifies the path to the pre-built genome index. Note that the index consists of multiple files ending in ``.ht2``, and only the shared part of the filename should be indicated (e.g. ``genome`` if the files are called ``genome.1.ht2``, ``genome.2.ht2``, etc).
* `` -1 /path/to/reads/sample_1.fastq `` is where you should list the first-read FASTQ file of paired end reads that you wish to map
* `` -2 /path/to/reads/sample_2.fastq `` is where you should list the second-read FASTQ file of paired end reads that you wish to map
* `` -U /path/to/reads/sample.fastq `` is where you should list the single end read FASTQ file that you wish to map
* ``-S outDir/hisat2.sam`` is the name of the result file that will be created
* ``--summary-file outDir/hisat2_summary.txt`` is the name of a file for summary information about the alignments
This should run fairly quickly and create the files you specified with ``-S`` and ``--summary-file``.
If everything worked, HISAT2 should report some statistics about how many reads were mapped, on your terminal and in the summary file.
<i class="fas fa-comments"></i> **Try to answer the following:**
* How many RNA-seq reads were provided as input to HISAT2?
* How many of those reads were mapped by HISAT2?
* How many reads were uniquely mapped, i.e. mapped to one genomic location?
* In general, do the alignments seem to be good? I.e. do they cover the entire reads and contain few mismatches?
To answer these questions, you should look at the input and output from HISAT2. You may also need to consult the [HISAT2 manual](https://daehwankimlab.github.io/hisat2/manual/), [information about the FASTQ format](https://en.wikipedia.org/wiki/FASTQ_format) and the [SAM format specification](https://github.com/samtools/hts-specs).
# Convert SAM to BAM
If you were able to run HISAT2, this should have produced files with mapped reads in SAM format. These files need to be converted to *sorted* and *indexed* BAM files for efficient downstream analysis.
You should try to give the BAM files representable names, in order to make it easier to manage your files. A good naming scheme for BAM files is to use names that indicate what you mapped and how. As an example, if you mapped sample 12 using HISAT2 you could create a file named `sample12.mmusculus.HISAT2.bam`.
The most commonly used tool for converting from SAM to BAM is [Samtools](http://www.htslib.org/doc/samtools.html) (follow the link for more information about Samtools).
The Samtools command to convert from SAM to a sorted BAM file is:
```{sh,eval=FALSE,chunk.title=TRUE}
samtools sort -o output.bam input.sam
```
Remember to use an appropriate filename instead of `output.bam`! Next, we need to index the BAM file.
```{sh,eval=FALSE,chunk.title=TRUE}
samtools index properName.bam
```
The indexing step should create an index file with the suffix `.bai`. This sorted, indexed BAM file can be viewed in the Integrative Genomics Viewer (IGV).
You can also get a report on your mapped reads using the samtools command *flagstat*:
```{sh,eval=FALSE,chunk.title=TRUE}
samtools flagstat properName.sorted.bam > properName.sorted.flagstat
```
<i class="fas fa-lightbulb"></i> Since the BAM file contains all the information from the original SAM file, remember to remove the SAM file once you are finished, in order to free up disk space.
```{sh,eval=FALSE,chunk.title=TRUE}
rm input.sam
```
***
# Code to run all the steps above on data on course data
```{sh ,eval=FALSE,chunk.title=TRUE}
###############################################################################
### MAP READS TO REFERENCE #####
###############################################################################
######################################
### BUILD A HISAT2 REFERENCE ####
######################################
###########################
### ALIGN WITH HISAT2 ###
###########################
cd ~/RNAseq/
#map all the small files
for i in $(ls rawData/fastqFiles/*.clean.fq.gz.1M_reads.fq.gz); do
file=${i#rawData/fastqFiles/} # This line removes "rawData/fastqFiles/" from the paht stoded in "i".
sample=${file%.clean.fq.gz.1M_reads.fq.gz} # Here the suffix ".clean.fq.gz.1M_reads.fq.gz" is removed from "file" variable and assigned to the variable "sample"
echo -e "mapping sample $sample"
#Single-end read
hisat2 -p 4 \
-x rawData/references/hisat2_build/mm10 \
-U $i \
-S results/02-mapping/$sample.1M.sam \
--summary-file results/02-mapping/$sample.hisat2_summary.txt
done
#map the large file
for i in $(ls rawData/fastqFiles/*.clean.fq.gz); do
file=${i#rawData/fastqFiles/}
sample=${file%.clean.fq.gz}
echo -e "mapping sample $sample"
#Single-end read
hisat2 -p 4 \
-x rawData/references/hisat2_build/mm10 \
-U $i \
-S results/02-mapping/$sample.10M.sam \
--summary-file results/02-mapping/$sample.10M.hisat2_summary.txt
done
############################
#### CONVERT TO BAM ####
############################
for i in $(ls results/02-mapping/*.sam); do
file=${i#results/02-mapping/}
sample=${file%.sam}
echo -e "converting sam to bam, sample $sample"
# This is how it was done in a older version of samtools (Version: 0.1.19-44428cd)
#samtools view -bS -o results/02-mapping/$sample.bam results/02-mapping/$sample.sam
#samtools sort results/02-mapping/$sample.bam results/02-mapping/$sample.sorted
# This is how it should be done in the version that is used in our conda enviroment (Version: 1.10 (using htslib 1.10.2))
samtools sort -o results/02-mapping/$sample.sorted.bam results/02-mapping/$sample.sam
samtools index results/02-mapping/$sample.sorted.bam
samtools flagstat results/02-mapping/$sample.sorted.bam > results/02-mapping/$sample.sorted.flagstat
done
############################
#### REMOVE SAM FILES ####
############################
# Remove the intermediate sam and bamfiles.
for i in $(ls results/02-mapping/*.sorted.bam); do
file=${i#results/02-mapping/}
sample=${file%.sorted.bam}
if test -f "results/02-mapping/$sample.sorted.bam"; then
echo "$sample.sorted.bam exists. Removing $sample.sam and $sample.bam"
rm results/02-mapping/$sample.sam
fi
done
```