-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathGene_trees_approach.Rmd
388 lines (308 loc) · 15.3 KB
/
Gene_trees_approach.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
---
title: "Gene Trees Approach"
output: github_document
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
eval = FALSE,
fig.path = "figures/",
out.width = "50%")
```
**Author**: [Léo-Paul Dagallier](https://github.com/LPDagallier)
**Last update**: `r format(Sys.Date())`
***
The gene trees are first inferred in a maximum likelihood (ML) framework. Here we'll present tree inferences with [RAxML](https://cme.h-its.org/exelixis/web/software/raxml/) and [IQ-TREE](http://www.iqtree.org/). You'll find plenty of information and details on the algorithms used and parameters in their documentation and publications. Other ML programs exist and include e.g. [PAML](http://abacus.gene.ucl.ac.uk/software/paml.html), [PhyML](http://www.atgc-montpellier.fr/phyml/), or [FastTree](http://www.microbesonline.org/fasttree/).
The inferred gene tress are then summarized in a species tree using a pseudo-coalescent framework, as implemented in ASTRAL. Several recent improvements and variations of the ASTRAL-III program have been published and are now embedded in the package [ASTER](https://github.com/chaoszhang/ASTER). We will present here 3 different variations:
- [ASTRAL](https://github.com/chaoszhang/ASTER/blob/master/tutorial/astral.md): similar to ASTRAL-III, requires to define a threshold to collapse poorly supported branches in the gene trees
- [Weigthed ASTRAL](https://github.com/chaoszhang/ASTER/blob/master/tutorial/astral-hybrid.md): similar to ASTRAL, but weights the species tree inference with the branch support inferred in the gene trees, supposedly providing better accuracy ([Zhang & Mirarab, 2022](https://doi.org/10.1093/molbev/msac215))
- [ASTRAL-Pro2](https://github.com/chaoszhang/ASTER/blob/master/tutorial/astral-pro.md): similar to ASTRAL but allows for multi-copy genes (i.e. putative paralogs)
# Gene trees inference
For the examples here, we will infer gene trees based on the exons sequences, aligned with MAFFT and cleaned with TrimAl. You can easily change the code in the "inputs" chunks to use supercontigs or multi-copy exons (paralogs) instead.
## Infer gene trees with RAxML
### Define the paths and variables
- for the inputs:
```{bash}
analysis_ID="my_analysis_ID"
path_to_dir_in="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/"$analysis_ID"_hybpiper2_exons_align_w_refs/w_refs/trimal";
```
- for the outputs:
```{bash}
step_ID="hybpiper2_exons_trimal_genetrees_raxml"
path_to_dir_out="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/$analysis_ID"_"$step_ID"/";
```
- temporary directory for local users:
```{bash}
path_to_tmp=$path_to_dir_out
```
- temporary directory for cluster users: depends on the cluster manager and its setup (please see with your cluster documentation), e.g. for SLURM on HiperGator:
```{bash}
path_to_tmp=$SLURM_TMPDIR
```
### Copy all the files in the working directory
Go to working directory.
```{bash}
cd $path_to_tmp
```
Copy the fasta files.
```{bash}
scp $path_to_dir_in/*.FNA $path_to_tmp
rename -v '.FNA' '.fna' *
```
### Run RAxML
RAxML can be run on one locus after another...
```{bash}
for file in $(ls -1 *.fna)
do
raxmlHPC-PTHREADS-SSE3 -f a -x 12345 -p 12345 -# 100 -T 2 -m GTRGAMMA -O -s $file -n $file
done
```
... or in parallel (here on 4 different cores):
```{bash}
ls -1 *.fna | \
parallel -j 4 "echo Starting RAXML for alignment {}; raxmlHPC-PTHREADS-SSE3 -f a -x 12345 -p 12345 -# 100 -T 2 -m GTRGAMMA -O -s {} -n {}"
```
In the above command:
- `-f a` tells RAxML to conduct a rapid Bootstrap analysis and search for the best-scoring ML tree in one single program run
- `-x 12345` turns on rapid bootstrapping and set 12345 as seed number
- `-p 12345` specifies a random number seed for the parsimony inferences
- `-# 100` tells RAxML to conduct 100 bootstrap
- `-T 2` tells RAxML to use 2 threads
- `-m GTRGAMMA` use the following model of substitution: GTR + Optimization of substitution rates + GAMMA model of rate heterogeneity
- `-O` disables check for completely undetermined sequence in alignment
- `-s` specifies the name of the alignment data file
- `-n` specifies the name of the output file
Note that the `raxmlHPC-PTHREADS-SSE3` command only works on some HPC systems, check with your own system and installation what command should be used to run RAxML.
### Collect gene trees
Create a subdirectory and move the inferred trees in it.
```{bash}
mkdir trees
find . -name '*bipartitions.*' -exec mv -t trees {} +
```
### Collect infos on the gene trees (optional)
Create a subdirectory and move the info files in it.
```{bash}
mkdir trees_info
find . -name '*info.*' -exec mv -t trees_info {} +
```
## Infer gene trees with IQ-TREE
### Define the paths and variables
- for the inputs:
```{bash}
analysis_ID="my_analysis_ID"
path_to_dir_in="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/"$analysis_ID"_hybpiper2_exons_align_w_refs/w_refs/trimal";
```
- for the outputs:
```{bash}
step_ID="hybpiper2_exons_trimal_genetrees_iqtree"
path_to_dir_out="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/$analysis_ID"_"$step_ID"/";
```
- temporary directory for local users:
```{bash}
path_to_tmp=$path_to_dir_out
```
- temporary directory for cluster users: depends on the cluster manager and its setup (please see with your cluster documentation), e.g. for SLURM on HiperGator:
```{bash}
path_to_tmp=$SLURM_TMPDIR
```
### Copy all the files in the working directory
Go to working directory.
```{bash}
cd $path_to_tmp
```
Copy the fasta files into a directory called alignments.
```{bash}
mkdir alignments
scp $path_to_dir_in/*.FNA $path_to_tmp/alignments
cd $path_to_tmp/alignments
rename -v '.FNA' '.fna' *
```
### Run IQ-TREE
IQ-TREE can be run on one locus after another...
```{bash}
for file in $(ls -1 alignments/*.fna )
do
iqtree2 -s $file -m MFP+MERGE -B 1000 -bnni -alrt 1000 -T AUTO -ntmax 2 -mem 8G
done
```
... or in parallel (here on 4 different cores):
```{bash}
ls -1 alignments/*.fna | \
parallel -j 4 "echo Starting IQ-TREE for alignment {}; iqtree2 -s {} -m MFP+MERGE -B 1000 -bnni -alrt 1000 -T AUTO -ntmax 2 -mem 8G"
```
In the command above:
- `-s` specifies the name of the alignment file
- `-m MFP+MERGE` specifies to carry on a model selection step with ModelFinderPlus (across all available models)
- `-B 1000` specifies to assess branch support with ultrafast bootstrap
- `-bnni` tells to optimize each bootstrap using a hill-climbing nearest neighbor interchange search
- `-alrt 1000` specifies to assess branch support with SH-like approximate likelihood ratio test and 1000 replicates
- `-T AUTO` specifies to select automatically the number of threads to use
- `-ntmax 2` specifies the maximum number of threads to use
- `-mem 8G` specifies the maximum amount of RAM to use
### Collect gene trees
Create a subdirectory and move the inferred trees in it.
```{bash}
mkdir trees
find . -name '*.fna.treefile' -exec mv -t trees {} +
rename -v '.fna.treefile' '.fna' trees/*
```
### Collect infos on the gene trees (optional)
Create a subdirectory and move the info files in it.
```{bash}
mkdir trees_info
find . -name '*.fna.iqtree' -exec mv -t trees_info {} +
```
# Species tree inference
The use of subsets of loci (as generated in [Loci filtering](Loci_filtering.md)) will generally happen just before reconstructing the species tree from gene trees. The subsets of loci are listed in different `.txt` files stored in `<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/DATA/my_analysis_ID`.
## ASTRAL
#### Define the path variables for downstream file manipulation
Specify the path to the directory where the ASTRAL output will be stored.
```{bash}
path_to_data="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/DATA/my_analysis_ID"
analysis_ID="my_analysis_ID"
cd $path_to_data
```
Specify the path to the directory where the gene trees were output from RAxML or IQ-TREE.
```{bash}
path_to_jobs_out="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS"
```
Specify the directory we want to retrieve the gene trees from.
```{bash}
folder=my_analysis_ID_hybpiper2_exons_trimal_genetrees_raxml
```
For `.txt` files generated on a Windows machine and transferred to a Linux machine (e.g. the lists of subsets of loci), this commands formats the `.txt` files accordingly.
```{bash}
dos2unix $path_to_data/*.txt
```
#### Collect the subsets of trees
Rename tree files (.fna) to .FNA. (Side note: this step is not mandatory *per se*, I just happened to add it to homogenize the file format between all my scripts).
```{bash}
cd $path_to_jobs_out/$folder/trees ;
rename '.fna' '.FNA' *
```
Concatenate **all** the trees in a `.trees` file.
```{bash}
cat *.FNA > $folder"_ALL.trees";
```
Concatenate a subset of trees. This step uses the text files that contain subsets of loci (generated in [Loci filtering](Loci_filtering.md)) to retrieve the gene trees to append to the final `.trees` file.
```{bash}
echo "" > $folder"_L_N.trees"
while read locus;
do
cat *aligned.$locus*.FNA >> $folder"_L_N.trees"
done < $path_to_data/list_L_N.txt
```
#### Collapse branches with low support bootstrap < 10
Collapsing branches with low support in the gene trees helps ASTRAL to infer branch support in the species tree. As a rule of thumb, a threshold of 10 for regular bootstrap is generally used, but this can be changed (e.g increased in the case of ultrafast bootstrap).
We will use [Newick Utils](https://github.com/tjunier/newick_utils) to collapse branch with a support below 10, in both file with all the trees and the file with the subset of trees.
```{bash}
cd $path_to_jobs_out/$folder/trees
tree=$folder"_ALL.trees"
nw_ed $tree.trees 'i & b<=10' o > $tree"_bs10.trees";
tree=$folder"_L_N.trees"
nw_ed $tree.trees 'i & b<=10' o > $tree"_bs10.trees";
```
In case you have more subsets (you can run ASTRAL for as many subsets as you want!), it can be cumbersome to write the command for every subset, so let's write a loop to run Newick Utils on all of them:
```{bash}
cd $path_to_jobs_out/$folder/trees
trees=$(ls -1 *.trees | grep -v "_bs" | sed 's/\.trees//g')
for tree in $trees
do
echo Collapsing branches for $tree...
nw_ed $tree.trees 'i & b<=10' o > $tree"_bs10.trees";
done
```
#### Run ASTRAL
Prepare the subdirectory that will receive the ASTRAL outputs:
```{bash}
path_to_out="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/DATA/"$analysis_ID"/Astral/"
mkdir $path_to_out
```
For all the loci:
```{bash}
path_to_input=$path_to_jobs_out/$folder/trees/
input_file=$folder"_ALL_bs10.trees"
output_file="astral_"$folder"_ALL_bs10.tree"
output_log="astral_"$folder"_ALL_bs10.log"
echo "astral -t 2 -i $path_to_input$input_file -o $path_to_out$output_file 2> $path_to_out$output_log"
```
For the L_N subset of loci:
```{bash}
path_to_input=$path_to_jobs_out/$folder/trees/
input_file=$folder"_L_N_bs10.trees"
output_file="astral_"$folder"_L_N_bs10.tree"
output_log="astral_"$folder"_L_N_bs10.log"
echo "astral -t 2 -i $path_to_input$input_file -o $path_to_out$output_file 2> $path_to_out$output_log"
```
Similarly as above, writing manually the command for every subset can be cumbersome, so let's write a loop to run ASTRAL on all the files in the folder:
```{bash}
path_to_input=$path_to_jobs_out/$folder/trees/
cd $path_to_input
ls -1 *bs10.trees | \
while read input_file;
do
output_file="astral_"$input_file".tree"
output_log="astral_"$input_file".log"
echo "astral -t 2 -i $path_to_input$input_file -o $path_to_out$output_file 2> $path_to_out$output_log"
done
```
#### Root the tree(s)
ASTRAL inferes unrooted species tree, so we need to root the tree. This can be done in different ways, we'll here use the unix-based program [phyx](https://github.com/FePhyFoFum/phyx). As an alternative you can also root the trees in R (e.g. function `root()` in the package `ape`).
Go to the directory where ASTRAL trees are stored and define a variable `TREES` containing all the ASTRAL tree files in the directory.
```{bash}
cd $path_to_out
TREES=astral*.tree
```
Root all the trees in a loop using `pxrr`.
```{bash}
for t in $TREES
do
echo $t
pxrr -t $t -g outgroup_sample1,outgroup_sample2 > rooted.$t
done
```
- `-t` defines the tree file name
- `-g` defines the name(s) of the sample(s) to be set as outgroup(s); it can be a list of samples
#### Visualize the trees and export figures
Once trees are rooted, you may want to download them locally (in case you were running ASTRAL, phyx, etc. on a cluster). You can visualize the trees with [any visualization software](https://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software), such as [FigTree](http://tree.bio.ed.ac.uk/software/figtree/) or [Dendroscope](https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/dendroscope/).
In R, you can use your own function, or use the `plot_astral()` function that is available here in the [`plot_astral_tree.R`](PHYLOGENY_RECONSTRUCTION/R/plot_astral_tree.R) script. This function is a wrapper of different functions including the `geom_tree()` function (package `ggtree`), and allows to plot different branch support, different branch length configurations, and to rename the tips in your tree. The plots are directly saved into a .pdf file.
First set your R working directory to the path where you have your ASTRAL trees.
```{r warning=FALSE}
setwd(dir = "<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/DATA/my_analysis_ID")
```
Load relevant packages.
```{r warning=FALSE}
library(treeio)
library(ggplot2)
library(ggtree)
library(ggimage)
library(tidyr)
```
Create a list of the files of the trees you want to plot.
```{r warning=FALSE}
trees_list <- list.files(pattern = "rooted.*.tree$")
```
If you want to rename the tip labels in your tree, you must prepare a table with old names in one columna and the corresponding new names in another column (see e.g. the [`tips_rename.txt`](PHYLOGENY_RECONSTRUCTION/DATA/example_analysis_01/tips_rename.txt) text file).
```{r warning=FALSE}
tiplabels <- read.table(file = "<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/DATA/my_analysis_ID/tips_rename.txt", col.names = c("OLD", "NEW"))
```
Define the path to the script and source the script. This will load the `plot_astral()` function into your R environment.
```{r warning=FALSE}
path_to_plot_astral_script <- "<base_directory>DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/plot_astral_tree.R"
source(path_to_plot_astral_script)
```
Loop over all the tree file names in the `trees_list` vector defined earlier.
```{r warning=FALSE}
for (tree_file in trees_list){
print(tree_file)
plot_astral(tree_file, rename = T, tiplabels = tiplabels, annotations = "LPP")
}
```
`plot_astral()` has several options, see details in the [`plot_astral_tree.R`](PHYLOGENY_RECONSTRUCTION/R/plot_astral_tree.R) script.
**Important note:** the trees are expected to be annotated with the `-t 2` option in ASTRAL. Specifically, it is expected to have node support named 'pp1' (for local posterior probabilities, LPP) and 'q1','q2' and 'q3' (for quartet scores, QS).
## ASTRAL-weighted
The steps to inferring species tree with ASTRAL-weighted are exactly the same as for ASTRAL, except that you don't need to collapse branches with low support.
🚧🚧🚧 Under construction, for more details see: https://github.com/chaoszhang/ASTER
## ASTRAL-Pro2
🚧🚧🚧