LPDagallier/Phylogenomics_Workshop

Phylogenomics Workshop

Workshop Material | License: GPL (>= 2) | Project Status: Active

Author: Léo-Paul Dagallier
Last update: 2023-10-18


Resource material for the plant phylogenomics workshop led at NYBG (May 8th - 10th 2023).

The presentation can be found here

To go from a set of plant specimens to a phylogenetic inference, the main steps are as follows:

  • (DNA extraction, not covered here)
  • Sequence recovery
    • Reads cleaning
    • Assembly
    • Loci extraction
  • Phylogenetic reconstruction
    • Alignment
    • Alignment trimming
    • Tree inferences

The details of these steps may vary depending on the type of data analysed and the methods used. This page presents the steps for targeted sequencing data, using different methods, with the associated commands. A simple presentation of the commands for each step is given in the presentation .pdf file.

Advanced workflow details, with associated commands and scripts, are provided here on separate pages. Please read the notes for the advanced workflow. For SLURM users (cluster), additional .sh scripts are provided for each step (to be run with sbatch).

Targeted sequencing

Targeted sequencing targets low-copy elements of the genome, ideally single-copy orthologous loci.

Target sequences and probe sets

The downstream analyses require a set of target sequences.

Melastomataceae

⚠️ For Melastomataceae a probe set was designed (Jantzen et al. 2020), but due to several concerns, it was cleaned and updated (Dagallier & Michelangeli, in press.). 👉 See here for more details.

From this clean and updated probe set, it is also recommended to remove the “outlier” loci and to further remove short sequences.
✨ The final clean probe set (outlier loci and short sequences removed) can be found in the present repo: .FNA file and .FAA file. ✨

Others

For ‘universal’ probe sets, see e.g.:

Reads cleaning

The reads obtained from the sequencing have to be cleaned before they can be used for downstream analysis.
🔎 See the Reads Cleaning document for a detailed workflow.

HybPiper 2

HybPiper uses clean reads to create per-sample assemblies and to extract the targeted sequences.

🔎 See HybPiper’s original tutorial for basic use.

🔎 See HybPiper2 for advanced HybPiper workflow details, and associated:
👉 💻 scripts for local use (assembly and extraction)
👉 👩‍💻 scripts for cluster (SLURM) use (assembly and extraction).

At the end of the HybPiper steps (including the paralogy extraction step, see below), you should usually end up with sequences extracted for the successful loci in the following directories:

  • retrieved_exons: extracted exons
  • retrieved_supercontigs: extracted exons + (partial) introns
  • retrieved_aa: extracted exons translated as amino acids
  • paralogs_all and paralogs_no_chimeras: extracted multi-copy exons
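
To get a quick overview of the recovery in one of these directories, one can count how many sequences each locus file contains. The sketch below is only an illustration: it assumes one FASTA file per locus, and the function name `count_sequences_per_locus` and the default `.FNA` extension are assumptions to adapt to your own files.

```python
from pathlib import Path


def count_sequences_per_locus(directory, extension=".FNA"):
    """Return {locus_name: number_of_sequences} for each FASTA file in `directory`.

    Assumes one FASTA file per locus; `extension` is an assumption to adapt.
    """
    counts = {}
    for fasta in sorted(Path(directory).glob(f"*{extension}")):
        with open(fasta) as handle:
            # Each FASTA record starts with a '>' header line.
            counts[fasta.stem] = sum(1 for line in handle if line.startswith(">"))
    return counts
```

For example, `count_sequences_per_locus("retrieved_exons")` would return a dictionary with one entry per locus file found in that folder.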

Paralog assessment and resolution

HybPiper also allows you to assess paralogy and to extract putative paralogous sequences. You can then either assess the putative paralogs one by one and decide whether they should be discarded, or use ParaGone to run a phylogeny-aware paralogy resolution step.

🔎 See Paralogs for more details.

Paralogy assessment with HybPiper

See the associated:
👉 💻 scripts for local use
👉 👩‍💻 scripts for cluster (SLURM) use

Paralogy resolution with ParaGone

🚧 🚧 🚧
👉 (TO DO) See the associated 💻 scripts for local use and 👩‍💻 scripts for cluster (SLURM) use. 🚧 🚧 🚧

Loci filtering

Loci can be filtered. As we have seen earlier, they can be filtered on their putative paralogy status (i.e. putatively paralogous loci are removed entirely).

Loci can also be filtered on their assembly statistics. Specifically, a locus is kept if at least a given percentage of its length (L) has been assembled in at least a given percentage of the samples (N). For simplicity, let’s call these the L_N subsets. For example, a 75_75 subset will only include those loci that have been recovered for at least 75% of their length in at least 75% of the samples.
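
The L_N rule can be sketched in a few lines of Python. This is a minimal illustration, not the script used in this repo: the function name `filter_loci` and the input layout (a dictionary of recovered length percentages per locus and per sample) are assumptions.

```python
def filter_loci(recovery, min_length_pct=75, min_samples_pct=75):
    """Return the loci belonging to the L_N subset.

    `recovery` maps locus -> {sample: percentage of the target length recovered}.
    Every sample should appear for every locus (use 0 when nothing was recovered).
    """
    kept = []
    for locus, per_sample in recovery.items():
        n_samples = len(per_sample)
        # Number of samples in which enough of the locus length was recovered (L).
        n_ok = sum(1 for pct in per_sample.values() if pct >= min_length_pct)
        # Keep the locus if those samples make up a large enough share (N).
        if n_samples and 100 * n_ok / n_samples >= min_samples_pct:
            kept.append(locus)
    return kept


recovery = {
    "locusA": {"s1": 80, "s2": 90, "s3": 76, "s4": 10},
    "locusB": {"s1": 80, "s2": 90, "s3": 50, "s4": 10},
}
subset_75_75 = filter_loci(recovery, 75, 75)
```

Here `locusA` reaches 75% of its length in 3 of 4 samples (75%) and is kept in the 75_75 subset, while `locusB` reaches it in only 2 of 4 samples and is dropped.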

Additional lists of loci can be drawn from the paralogy statistics. These are exploratory and should be used with caution. They include e.g. lists of loci with a maximum of 1 copy (no paralogy at all), or loci with a median of 2 copies per sample.
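
As an illustration of how such lists could be derived, here is a small sketch. The function name `loci_by_copy_number` and the input layout (copy counts per sample for each locus) are assumptions; the actual paralogy statistics produced in your workflow will need to be parsed into this form first.

```python
from statistics import median


def loci_by_copy_number(copies):
    """Split loci into exploratory lists based on their copy numbers.

    `copies` maps locus -> list of copy counts, one per sample.
    Returns (loci with at most 1 copy everywhere, loci with a median of 2 copies).
    """
    single_copy = [locus for locus, c in copies.items() if c and max(c) <= 1]
    median_two = [locus for locus, c in copies.items() if c and median(c) == 2]
    return single_copy, median_two


copies = {"g1": [1, 1, 1], "g2": [1, 2, 3], "g3": [1, 1, 2]}
single_copy, median_two = loci_by_copy_number(copies)
```

In this toy example `g1` never has more than one copy, `g2` has a median of 2 copies per sample, and `g3` falls in neither list.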

🔎 See loci filtering for full details.

As a general rule, I advise filtering the loci after the gene tree reconstruction step.

Alignment

After extracting the sequences (exons, supercontigs or multi-copy exons), we need to align them.

Here we’ll use MAFFT to align, but other programs exist (e.g. MUSCLE, Clustal, MACSE).

Sequences can be aligned “naively” or informed by the locus reference sequence. I advise aligning with the locus reference sequence, because it is conceptually less prone to alignment errors.

Once aligned, the alignments need to be trimmed. Several programs exist to do so, such as ClipKIT, TrimAl or Gblocks. These programs usually trim the alignments based on the quality of the alignment at a given position. Other approaches such as HmmCleaner remove poorly aligned regions on a sequence-by-sequence basis. Both approaches can be combined.

Here I present alignment trimming with both ClipKIT and TrimAl.
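
To give an intuition of what position-based trimming does, here is a deliberately simple sketch that removes the alignment columns containing too many gaps. It is not what ClipKIT or TrimAl do internally (their criteria are more elaborate); the function name `trim_gappy_columns` and the threshold value are assumptions chosen for illustration.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop the columns whose gap fraction exceeds `max_gap_fraction`.

    `alignment` maps sequence name -> aligned sequence (all the same length).
    """
    if not alignment:
        return {}
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    # Indices of the columns that are kept.
    keep = [
        i for i in range(length)
        if sum(s[i] == "-" for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


trimmed = trim_gappy_columns({"a": "AC-T", "b": "A--T", "c": "AG-T"})
```

In this toy alignment the third column is all gaps and is removed, while the second column (one gap out of three sequences) is kept.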

🔎 See Alignment for details on the alignment and trimming steps, and see the associated:
👉 💻 script for local use
👉 👩‍💻 script for cluster (SLURM) use.

In Alignment and the associated scripts cited above, the alignment and trimming steps are run for the extracted exons (retrieved_exons folder). The exact same steps can be carried out for the extracted supercontigs (retrieved_supercontigs) and paralogs (paralogs_all or paralogs_no_chimeras). You would just need to change the input directories in your scripts.

IMPORTANT NOTE. Automating the alignment and cleaning steps does not prevent alignment errors from occurring. Always have a look at your alignments. This is how you can detect errors and tweak the alignment parameters to limit alignment errors as much as possible. For example, I was first super-confident in ClipKIT, but finally decided to go with TrimAl because the alignments looked better with it. You can also edit the alignments manually to improve them, but this is very time-consuming and prone to subjectivity. Programs such as AliView or Seaview can help visualize alignments.

Phylogenetic reconstruction

Once the sequences are aligned in multiple sequence alignments (MSAs), the phylogenetic reconstruction can be undertaken.

It can be done using the gene trees approach and/or using the concatenation approach.

Gene trees approach

In the gene trees approach we will first infer a tree for each locus separately, and then summarize the gene trees into a species tree using a summary method consistent with the multispecies coalescent, as implemented in ASTRAL (or related programs/algorithms). This approach accommodates incomplete lineage sorting (ILS).
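
Summary programs such as ASTRAL take as input a single file containing all the gene trees in Newick format, one tree per line. A minimal sketch for gathering the per-locus trees into such a file is given below. The function name `collect_gene_trees` is an assumption, and the default `*.treefile` pattern assumes the gene trees were inferred with IQ-TREE, which writes its best tree to a file with that suffix; adapt the pattern to your own outputs.

```python
from pathlib import Path


def collect_gene_trees(tree_dir, output_file, pattern="*.treefile"):
    """Write all gene trees found in `tree_dir` to `output_file`, one per line.

    Returns the number of trees written. Empty tree files are skipped.
    """
    trees = []
    for path in sorted(Path(tree_dir).glob(pattern)):
        newick = path.read_text().strip()
        if newick:
            trees.append(newick)
    Path(output_file).write_text("\n".join(trees) + "\n")
    return len(trees)
```

The returned count is worth comparing with the number of loci you expected, since a failed gene tree inference would otherwise go unnoticed.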

🔎 See Gene Trees Approach for more details.

Concatenation approach

🚧🚧🚧 🚧🚧🚧 In the concatenation approach we will first concatenate the MSAs of all the loci into a single MSA (or ‘supermatrix’), and then infer the species tree from the supermatrix.

🔎 See Concatenation Approach for more details. 🚧🚧🚧 🚧🚧🚧
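
The concatenation step itself can be sketched as follows. This is a simplified illustration rather than the workflow of this repo (dedicated tools exist for this task): the function name `concatenate_alignments` and the input layout are assumptions. Samples missing from a locus are filled with gaps, and the coordinates of each locus in the supermatrix are returned so that they can be used as partitions.

```python
def concatenate_alignments(alignments):
    """Concatenate per-locus alignments into a supermatrix.

    `alignments` maps locus -> {sample: aligned sequence}.
    Returns (supermatrix, partitions) where supermatrix maps sample -> sequence
    and partitions is a list of (locus, start, end), 1-based and inclusive.
    """
    samples = sorted({s for aln in alignments.values() for s in aln})
    pieces = {s: [] for s in samples}
    partitions = []
    position = 0
    for locus, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{locus}: empty alignment or unequal sequence lengths")
        length = lengths.pop()
        for s in samples:
            # A sample absent from this locus is filled with gaps.
            pieces[s].append(aln.get(s, "-" * length))
        partitions.append((locus, position + 1, position + length))
        position += length
    return {s: "".join(p) for s, p in pieces.items()}, partitions


supermatrix, partitions = concatenate_alignments({
    "g1": {"A": "ACGT", "B": "AC-T"},
    "g2": {"A": "GG", "C": "GA"},
})
```

In this toy example sample `B` is missing from `g2` and sample `C` from `g1`, so both receive gaps for the corresponding block of the supermatrix.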

Genome skimming

🚧🚧🚧 [to be completed…] 🚧🚧🚧

