Author: Léo-Paul Dagallier
Last update: 2023-10-18
Resource material for the plant phylogenomics workshop led at NYBG (May 8th - 10th 2023).
The presentation can be found here
To go from a set of plant specimens to a phylogenetic inference, the main steps are ordered as follows:
- (DNA extraction, not covered here)
- Sequence recovery
  - Reads cleaning
  - Assembly
  - Loci extraction
- Phylogenetic reconstruction
  - Alignment
  - Alignment trimming
  - Tree inferences
The details of these steps may vary depending on the type of data analysed and the methods used. The details presented here are for targeted sequencing data and different methods, with the associated commands. A simple presentation of the commands for each step is given in the presentation .pdf file.
Advanced workflow details, with associated commands and scripts, are
provided here on separate pages. Please read the notes for the advanced
workflow. For SLURM users (cluster),
additional .sh scripts are provided for each step (to be run with sbatch).
Targeted sequencing focuses on low-copy elements of the genome, ideally single-copy orthologous loci.
The downstream analyses require a set of target sequences.
From this clean and updated probe set, it is also recommended to remove
the “outlier” loci and to further remove short
sequences.
✨ The final clean probe set (outlier loci and short sequences
removed) can be found in the present repo: .FNA file and .FAA file. ✨
For ‘universal’ probe sets, see e.g.:
- Angiosperm353 (Johnson et al. 2019)
- Mega353 (McLay et al. 2021)
The reads obtained from the sequencing have to be cleaned before they
can be used for downstream analysis.
🔎 See the Reads Cleaning document for a detailed
workflow.
HybPiper uses clean reads to create per-sample assemblies and to extract the targeted sequences.
🔎 See HybPiper’s original tutorial for basic use.
🔎 See HybPiper2 for advanced HybPiper workflow
details, and associated:
👉 💻 scripts for local use (assembly and extraction)
👉 👩💻 scripts for cluster (SLURM) use (assembly and extraction).
At the end of the HybPiper steps (including the paralogy extraction step, see below), you should usually end up with sequences extracted for the successful loci in the following directories:
- retrieved_exons: extracted exons
- retrieved_supercontigs: extracted exons + (partial) introns
- retrieved_aa: extracted exons translated as amino acids
- paralogs_all and paralogs_no_chimeras: extracted multi-copy exons
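A quick way to check that these directories were actually populated is to count the files they contain. The snippet below is a minimal sketch, not part of the workshop scripts: the function name `count_sequence_files` is hypothetical, and it assumes that all five directories sit directly under one base directory and that each file holds the sequences of one locus.

```python
from pathlib import Path

# Directory names as listed above; adjust them if your run used other names.
OUTPUT_DIRS = [
    "retrieved_exons",
    "retrieved_supercontigs",
    "retrieved_aa",
    "paralogs_all",
    "paralogs_no_chimeras",
]


def count_sequence_files(base_dir):
    """Return {directory name: number of files}, with None for a missing directory."""
    base = Path(base_dir)
    counts = {}
    for name in OUTPUT_DIRS:
        folder = base / name
        if folder.is_dir():
            # Count regular files only (assumed here to be one file per locus).
            counts[name] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            counts[name] = None
    return counts
```

Calling `count_sequence_files(".")` from the directory that holds the HybPiper outputs returns one count per directory; a `None` or a count far below the number of targeted loci is a hint that a step failed or was skipped.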
HybPiper also allows you to assess paralogy and to extract putative paralogous sequences. You can then either assess the putative paralogs one by one and decide whether they should be discarded, or use ParaGone to run a phylogeny-aware paralogy resolution step.
🔎 See Paralogs for more details.
See the associated:
👉 💻 scripts for local use
👉 👩💻 scripts for cluster (SLURM) use
🚧 🚧 🚧 👉 (TO DO) See the associated 💻 scripts for local use and 👩💻
scripts for cluster (SLURM) use. 🚧 🚧 🚧
Loci can be filtered. As we have seen earlier, they can be filtered on their putative paralogy status (i.e. by completely removing putative paralogous loci).
Loci can also be filtered on their assembly statistics. Specifically, they can be filtered on the percentage of samples (N) in which a given percentage of the locus length (L) has been assembled. For simplicity, let’s call these the L_N subsets. For example, a 75_75 subset will only include those loci that have been recovered for at least 75% of their length in at least 75% of the samples.
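The L_N rule can be sketched in a few lines of Python. This is only an illustration of the logic, not the script used in the loci filtering page: the function name `l_n_subset` and the input structure (a dictionary giving, for each locus, the fraction of the locus length recovered in each sample) are assumptions, and a sample absent from a locus is counted as 0% recovered.

```python
def l_n_subset(recovered_fraction, samples, min_length_pct=75, min_samples_pct=75):
    """Return the sorted loci recovered for >= L% of their length in >= N% of samples.

    recovered_fraction: {locus: {sample: fraction of locus length recovered (0-1)}}
    samples: list of all sample names (the denominator for N)
    """
    kept = []
    n_samples = len(samples)
    for locus, per_sample in recovered_fraction.items():
        # Number of samples passing the length threshold (L) for this locus.
        n_ok = sum(
            1 for s in samples
            if per_sample.get(s, 0.0) >= min_length_pct / 100
        )
        # Keep the locus if enough samples pass (N).
        if n_ok * 100 >= min_samples_pct * n_samples:
            kept.append(locus)
    return sorted(kept)


samples = ["s1", "s2", "s3", "s4"]
recovered = {
    "locusA": {"s1": 0.9, "s2": 0.8, "s3": 0.76, "s4": 0.2},  # 3 of 4 samples pass 75%
    "locusB": {"s1": 0.9, "s2": 0.5, "s3": 0.6, "s4": 1.0},   # 2 of 4 samples pass 75%
    "locusC": {"s1": 1.0, "s2": 1.0, "s3": 1.0},              # missing in s4
}
subset_75_75 = l_n_subset(recovered, samples, 75, 75)
subset_50_50 = l_n_subset(recovered, samples, 50, 50)
```

With this toy data, the 75_75 subset keeps locusA and locusC, while the more permissive 50_50 subset keeps all three loci.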
Additional lists of loci can be drawn from the paralogy statistics. These are exploratory and should be used with caution. They include, for example, lists of loci with at most 1 copy (no paralogy at all), or loci with a median of 2 copies per sample.
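Such lists can be derived from a table of copy numbers per locus and per sample. The sketch below is illustrative only: the function name `loci_by_copy_stat` and the input dictionary are hypothetical, and reading "a median of 2 copies" as "a median of at most 2 copies" is my assumption.

```python
from statistics import median


def loci_by_copy_stat(copies, stat, threshold):
    """Return the sorted loci for which stat(copy numbers across samples) <= threshold.

    copies: {locus: {sample: number of copies recovered}}
    stat: a function summarising a list of numbers (e.g. max or statistics.median)
    """
    return sorted(
        locus for locus, per_sample in copies.items()
        if stat(list(per_sample.values())) <= threshold
    )


copies = {
    "locusA": {"s1": 1, "s2": 1, "s3": 1},
    "locusB": {"s1": 1, "s2": 2, "s3": 3},
    "locusC": {"s1": 4, "s2": 3, "s3": 5},
}
single_copy = loci_by_copy_stat(copies, max, 1)        # no paralogy at all
median_two = loci_by_copy_stat(copies, median, 2)      # median of at most 2 copies
```

Here only locusA has at most 1 copy in every sample, while locusA and locusB both have a median of at most 2 copies.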
🔎 See loci filtering for full details.
As a general rule, I advise filtering the loci after the gene tree reconstruction step.
After extracting the sequences (exons, supercontigs or multi-copy exons), we need to align them.
Here we’ll use MAFFT to align, but other programs exist (e.g. MUSCLE, Clustal, MACSE).
Sequences can be aligned “naively” or informed by the locus reference sequence. I would advise aligning with the locus reference sequence, because it is conceptually less prone to alignment errors.
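As a rough illustration of the “naive” option only, the sketch below loops over the per-locus files of a directory and aligns each one with `mafft --auto`, which lets MAFFT pick a strategy and writes the alignment to standard output. The function names, the `*.FNA` file pattern and the output file naming are assumptions made for this example; the reference-informed alignment and the parameters actually used in the workshop are described in the Alignment page and are not reproduced here.

```python
import subprocess
from pathlib import Path


def mafft_command(input_fasta):
    """Build the MAFFT command for a naive alignment of one FASTA file."""
    return ["mafft", "--auto", str(input_fasta)]


def align_directory(input_dir, output_dir, pattern="*.FNA"):
    """Align every file matching `pattern` in input_dir; return the output paths.

    Requires MAFFT to be installed and available on the PATH.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for fasta in sorted(Path(input_dir).glob(pattern)):
        target = out / f"{fasta.stem}.aligned.fasta"
        with open(target, "w") as handle:
            # MAFFT prints the alignment to stdout, so redirect it to the file.
            subprocess.run(mafft_command(fasta), stdout=handle, check=True)
        written.append(target)
    return written
```

For example, `align_directory("retrieved_exons", "alignments_exons")` would write one aligned file per locus into a new `alignments_exons` directory (a hypothetical name).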
Several programs exist to trim alignments, such as ClipKIT, TrimAl or Gblocks. These programs usually trim the alignments based on the quality of the alignment at a given position. Other approaches such as HmmCleaner remove poorly aligned regions on a sequence-by-sequence basis. Both approaches can be combined.
Here I present alignment trimming with both ClipKIT and TrimAl.
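To make the idea of position-based trimming concrete, here is a deliberately simplified sketch that drops the alignment columns in which too many sequences have a gap. It is not a reimplementation of ClipKIT or TrimAl, which use more elaborate criteria; the function name `trim_gappy_columns` and the default threshold are assumptions chosen for the example.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove the columns whose fraction of gaps exceeds max_gap_fraction.

    alignment: {sequence name: aligned sequence}, all sequences of equal length
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same aligned length")
    # Indices of the columns that are kept.
    keep = [
        i for i in range(length)
        if sum(seq[i] == "-" for seq in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


msa = {
    "s1": "ATG-CA",
    "s2": "AT--CA",
    "s3": "A-G-CT",
}
trimmed = trim_gappy_columns(msa, max_gap_fraction=0.5)
```

In this toy alignment only the fourth column (a gap in all three sequences) exceeds the threshold, so the trimmed sequences are `ATGCA`, `AT-CA` and `A-GCT`.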
🔎 See Alignment for details on the alignment and
trimming steps, and see the associated:
👉 💻 script for local use
👉 👩💻 script for cluster (SLURM) use.
In Alignment and the associated scripts cited above, the
alignment and trimming are run for the extracted exons (retrieved_exons
folder). The exact same steps can be carried out for the extracted
supercontigs (retrieved_supercontigs) and paralogs (paralogs_all or
paralogs_no_chimeras). You would just need to change the input
directories in your scripts.
IMPORTANT NOTE. Automating the alignment and cleaning steps does not prevent alignment errors from occurring. Always have a look at your alignments. This is how you can detect errors and tweak the alignment parameters to limit alignment errors as much as possible. For example, I was first super-confident in ClipKIT, but finally decided to go with TrimAl because the alignments looked better with it. You can also modify the alignments manually to improve them, but this is very time-consuming and prone to subjectivity. Programs such as AliView or Seaview can help visualize alignments.
Once the sequences are aligned in multiple sequence alignments (MSAs), the phylogenetic reconstruction can be undertaken.
It can be done using the gene trees approach and/or using the concatenation approach.
In the gene trees approach, we first infer a tree for each locus separately, and then summarize the gene trees into a species tree using a pseudo-coalescent model implemented in ASTRAL (or related programs/algorithms). This approach accommodates incomplete lineage sorting (ILS).
🔎 See Gene Trees Approach for more details.
🚧🚧🚧 🚧🚧🚧 In the concatenation approach we will first concatenate the MSAs of all the loci into a single MSA (or ‘supermatrix’), and then infer the species tree from the supermatrix.
🔎 See Concatenation Approach for more details. 🚧🚧🚧 🚧🚧🚧
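The concatenation step itself is simple enough to sketch. The example below is a minimal illustration and not the workshop's script: the function name `concatenate_alignments`, the input structure and the choice to pad samples missing from a locus with gap characters are all assumptions. It also returns the coordinates of each locus in the supermatrix, which can be useful if the loci are later treated as separate partitions.

```python
def concatenate_alignments(alignments, missing_char="-"):
    """Concatenate per-locus MSAs into a supermatrix.

    alignments: {locus: {sample: aligned sequence}}
    Returns (supermatrix, partitions) where supermatrix is {sample: sequence}
    and partitions is a list of (locus, start, end) with 1-based inclusive coordinates.
    """
    samples = sorted({s for aln in alignments.values() for s in aln})
    pieces = {s: [] for s in samples}
    partitions = []
    start = 1
    for locus in sorted(alignments):
        aln = alignments[locus]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{locus}: sequences are empty or of unequal length")
        length = lengths.pop()
        for s in samples:
            # Samples missing from this locus are padded with missing_char.
            pieces[s].append(aln.get(s, missing_char * length))
        partitions.append((locus, start, start + length - 1))
        start += length
    supermatrix = {s: "".join(parts) for s, parts in pieces.items()}
    return supermatrix, partitions


loci = {
    "locus1": {"s1": "ATG", "s2": "A-G"},
    "locus2": {"s1": "CC", "s3": "CT"},
}
supermatrix, partitions = concatenate_alignments(loci)
```

With this toy input the supermatrix is 5 columns long: s1 is `ATGCC`, s2 is `A-G--` and s3 is `---CT`, with locus1 spanning columns 1 to 3 and locus2 columns 4 to 5.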
🚧🚧🚧 [to be completed…] 🚧🚧🚧