-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathAlignment.Rmd
215 lines (168 loc) · 8.12 KB
/
Alignment.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
---
title: "Alignment"
output: github_document
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
eval = FALSE,
fig.path = "figures/",
out.width = "50%")
```
**Author**: [Léo-Paul Dagallier](https://github.com/LPDagallier)
**Last update**: `r format(Sys.Date())`
***
After extracting the sequences (exons, supercontigs or multicopies exons), we need to align them.
Here we'll use [MAFFT](https://mafft.cbrc.jp/alignment/software/algorithms/algorithms.html) to align, but other program do exist (e.g. [MUSCLE](https://drive5.com/muscle5/manual/commands.html), [Clustal](http://www.clustal.org/), [MACSE](https://www.agap-ge2pop.org/macse/?menu=releases)).
Sequences can be aligned "naively" or "informed" by the locus reference sequence. As a rule of thumb, I would advise to align with the locus reference sequence, because it is conceptually less prone to alignment errors.
Here I'll present only the "informed" method, but the extra steps this method requires can easily be removed if you want to go with the "naive" method.
## Multiple sequences alignment (MSA)
### Define the paths and variables
- for the inputs:
```{bash}
analysis_ID="my_analysis_ID"
path_to_dir_in="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/"$analysis_ID"_hybpiper2_extract/retrieved_exons/formatted_fastas";
path_to_ref="<base_directory>/DATASETS/PHYLOGENOMICS/target_references"
reference_fasta_file="target_reference.FNA"
```
- for the outputs:
```{bash}
step_ID="hybpiper2_exons_align_w_refs"
path_to_dir_out="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/$analysis_ID"_"$step_ID"/";
```
- temporary directory for local users:
```{bash}
path_to_tmp=$path_to_dir_out
```
- temporary directory for cluster users: depends on the cluster manager and its setup (please see with your cluster documentation), e.g. for SLURM on HiperGator:
```{bash}
path_to_tmp=$SLURM_TMPDIR
```
### Copy all the files in the working directory
Go to working directory.
```{bash}
cd $path_to_tmp
```
Copy the fasta files.
```{bash}
scp -r $path_to_dir_in/* $path_to_tmp
```
Remove the empty files that are sometimes created by HybPiper extract.
```{bash}
find . -type f -empty -delete
```
The working directory should only contain sequences file.
### Gather reference sequences for each locus
Append the suffix "w_refs" to all the fasta files.
```{bash}
rename '.FNA' '_w_refs.FNA' *
```
Extract the loci names from the list of _w_refs.FNA files, that is list the files (`ls -1`), remove the suffix "_w_refs.FNA" ([`sed`](https://www.gnu.org/software/sed/manual/sed.html)), `sort` and remove duplicate (`uniq`). For each of these extracted locus name, extract all the sequences that contain these extracted locus name in their own sequence name from the reference file ([`seqkit grep`](https://bioinf.shenwei.me/seqkit/usage/#grep)), and append these extracted sequences to the corresponding _w_refs.FNA file (`>>`). Be careful of the names of your .FNA files; in the case other prefix or suffix have been introduced earlier, you would need to remove them using the [`sed`](https://www.gnu.org/software/sed/manual/sed.html) command (`sed 's/<character string to remove>//g'`).
```{bash}
for locus in $(ls -1 *_w_refs.FNA | sed 's/_w_refs.FNA//g' | sort | uniq)
do
seqkit grep -w0 -nrp -$locus $path_to_ref/$reference_fasta_file >> $locus"_w_refs.FNA";
done
```
Note that this can be done in a faster way using the `parallel` command (e.g. here paralleled on 4 different cores):
```{bash}
ls -1 *_w_refs.FNA | sed 's/_w_refs.FNA//g' | sort | uniq | \
parallel -j4 "echo {}; seqkit grep -w0 -nrp -{} $path_to_ref/$reference_fasta_file >> {}_w_refs.FNA;"
```
Move all the "with ref" fastas to a separate subdirectory.
```{bash}
mkdir w_refs
mv *w_refs.FNA w_refs
```
### Run the alignment with MAFFT
Go to the subdirectory.
```{bash}
cd $path_to_tmp/w_refs
```
Alignment can be run one after the other...
```{bash}
for file in $(ls -1 *_w_refs.FNA)
do
mafft --thread 2 --quiet --auto $file > aligned.$file
done
```
... or in parallel (here on 4 different cores):
```{bash}
for file in $(ls -1 *_w_refs.FNA)
do
echo "mafft --thread 2 --quiet --auto $file > aligned.$file"
done | parallel -j4
```
... which is also equivalent to:
```{bash}
ls -1 *_w_refs.FNA | \
parallel -j4 "echo {}; mafft --thread 2 --quiet --auto {} > aligned.{}"
```
In this step, MAFFT is run with an automatic selection of the alignment algorithm (`--auto`). This can be changed according to your needs, for more details, see the [MAFFT documentation](https://mafft.cbrc.jp/alignment/software/algorithms/algorithms.html). In the above commands, MAFFT is run on 2 threads (`--threads`). I would not recommend using more threads to avoid over-threading. Also, be aware of the number of threads requested when running in parallel (2 x 4 in the above examples).
### Remove reference sequences
Now that our sequences are aligned, we need to remove the reference sequences that we gathered, because we don't want to reconstruct phylogenetic trees with reference sequences in them.
For each locus name (i.e. locus name extracted from the `aligned.<locus>_w_refs.FNA` file names): extract all the sequences that DO NOT contain the locus name in their own sequence name, and append these extracted sequences into a corresponding newly created `aligned.<locus>.FNA file`.
```{bash}
for locus in $(ls -1 aligned.*_w_refs.FNA | sed 's/_w_refs.FNA//g' | sed 's/aligned.//g')
do
seqkit grep -w0 -v -nrp -$locus aligned.$locus_w_refs.FNA > aligned.$locus.FNA
done
```
... do the same in parallel:
```{bash}
ls -1 aligned.*_w_refs.FNA | sed 's/_w_refs.FNA//g' | sed 's/aligned.//g' | \
parallel -j4 "echo {}; seqkit grep -w0 -v -nrp -{} aligned.{}_w_refs.FNA > aligned.{}.FNA"
```
Move the files that include the reference sequence in a separate subdirectory.
```{bash}
mkdir alignments_including_refs
mv aligned.*w_refs.FNA alignments_including_refs
rm *_w_refs.FNA
```
## Alignment trimming
After the sequence have been aligned, poorly aligned regions have to be trimmed from multiple sequence alignments (MSAs).
Several programs exist to do so, such as [ClipKIT](https://jlsteenwyk.com/ClipKIT/index.html), [TrimAl](http://trimal.cgenomics.org/trimal) or [Gblocks](https://home.cc.umanitoba.ca/~psgendb/doc/Castresana/Gblocks_documentation.html). These programs usually trim the alignments based on the quality of the alignment at a given position. Other approaches such as [HmmCleaner](https://bioinformaticshome.com/tools/msa/descriptions/HmmCleaner.html) remove poorly aligned regions on a sequence by sequence basis. Both approaches can be combined.
Here I present alignment trimming with ClipKIT and TrimAl.
### Trimming using ClipKIT
Here we used the `smart-gap` mode, but this can be changed depending on your needs (see details of the different modes in the [ClipKit documentation](https://jlsteenwyk.com/ClipKIT/advanced/index.html)).
For each aligned file, trim using `clipkit`:
```{bash}
for file in $(ls -1 aligned.*.FNA)
do
clipkit $file -m smart-gap;
done
```
... same command running in parallel:
```{bash}
ls -1 aligned.*.FNA | \
parallel -j4 "echo Trimming {}; clipkit {} -m smart-gap;"
```
Create a subdirectory, move the trimmed alignment in there and rename the ClipKit-trimmed alignment to .FNA.
```{bash}
mkdir clipkit
mv *clipkit clipkit
cd clipkit
rename -v '.FNA.clipkit' '.FNA' *
```
### Trimming using TrimAl
Here we used the `automated1` mode, but this can be changed depending on your needs (see details of the different modes in the [TrimAl documentation](http://trimal.cgenomics.org/getting_started_with_trimal_v1.2)).
For each aligned file, trim using `trimal`:
```{bash}
for file in $(ls -1 aligned.*.FNA)
do
trimal -in $file -out $file.trimal -automated1;
done
```
... same command running in parallel:
```{bash}
ls -1 aligned.*.FNA | \
parallel -j$SLURM_NTASKS "echo Trimming {}; trimal -in {} -out {}.trimal -automated1;"
```
Create a subdirectory, move the trimmed alignment in there and rename the TrimAl-trimmed alignment to .FNA.
```{bash}
mkdir trimal
mv *.trimal trimal
cd trimal
rename -v '.FNA.trimal' '.FNA' *
```