Skip to content

Astral III and dataset filtration tutorial

Carl R Hutter edited this page Nov 25, 2023 · 2 revisions

Contents

A) ASTRAL-III setup and analysis B) ASTRAL-III batch analysis C) Alignment and genetree dataset filtration

A) ASTRAL-III setup, analysis, and batch analysis

I have included an R script in the main repository with some examples. It is also described here in detail.

  1. first install and load the R package. Its a good idea to install new (or check) every time as this package is being updated frequently.
devtools::install_github("chutter/PhyloConfigR")
library(PhyloConfigR)
  1. You will want a character variable that includes your full path to the astral jar file. NOTE: if you move this file, you will need to move the lib/ directory along with it, as astral depends on it.
astral.path = "/usr/local/bin/Astral-5-14/astral.5.14.2.jar"

3)Setup your working directory and create if necessary

work.dir = "/Test_Astral"
dir.create(work.dir)
setwd(work.dir)
  1. Next, this step will walk through a single dataset example.

First, you will want to add a character variable with the path to your gene tree directory. The gene trees here were generated using IQTree and the script should work with other tree types as well, as long as they have branch lengths and support values. Also indicate your outgroups for rooting the tree later. Finally, the output name can be put in a variable or directly entered, your choice.

tree.dir = "/Trees/Gene_Trees/trimmed_exons"
outgroups = c("Species_A", "Species_B")
save.name = "test-dataset"
  1. the setupAstral function is used to take your folder of gene trees, apply some filters to the gene trees, and then save them in a single file that can be read by ASTRAL-III. This should take about a minute to run.
setupAstral(genetree.folder = tree.dir,
            output.name = save.name,
            min.n.samples = 4,
            min.sample.prop = 0,
            make.polytomy = TRUE,
            polytomy.limit = 10)

Parameter explanations:

genetree.folder: a folder of genetrees to prepare for astral analyses
output.name: the save name for your concatenated gene tree file
overwrite: whether to overwrite an existing dataset
taxa.remove: species that you would like removed from each gene tree
min.n.samples: the minimum number of samples to keep a gene tree
min.sample.prop: the minimum proportion of samples to keep a gene tree
make.polytomy: whether to collapse poorly supported nodes into polytomies
polytomy.limit: if make.polytomy = TRUE, the threshold value for node collapsing
  1. When the setup function finishes running, you can now run ASTRAL-III using the runAstral function. This uses the astral jar directly, and should a minute or two depending on your number of gene trees using multi-threading and around 10 without the multi-threading option.
runAstral(input.genetrees = save.name,
          output.name = save.name,
          astral.path = astral.path,
          astral.t = 2,
          quiet = FALSE,
          load.tree = FALSE,
          multi.thread = TRUE,
          memory = "8g")

Parameter explanations:

input.genetrees: a file of genetrees from setupAstral
output.name: the save name for the ASTRAL-III file
astral.path: the absolute file path to the ASTRAL-III jar file
astral.t: the ASTRAL-III "t" parameter for different annotations, t = 2 is all annotation
quiet: hides the screen output from astral if desired
load.tree: TRUE to laod the tree into R
overwrite: TRUE to overwrite an existing dataset
multi.thread: TRUE to use Astral-MP multithreading 
memory: memory value to be passed to java. Should be in "Xg" format, X = an integer
  1. Next, you can read in the astral data using the astralPlane S4 Object class, which organizes all the analysis data into different slots in the object that can be accessed using the @ symbol.
astral.data = createAstralPlane(astral.tree = save.name,
                                outgroups = outgroups,
                                tip.length = 1)

Parameter explanations:

astral.tree: phylogenetic tree from ape read.tree or a file path to a tree file
outgroups: a vector of outgroups to root the tree
tip.length: arbitrary value for the terminal tip lengths for plotting purposes
  1. Finally, you can plot your results using the astralProjection function. You give the function the astralPlane object from the previous step, and select your settings for plotting, and what you would like to plot. An example plot is provided in the main Github repository.
astralProjection(astral.plane = astral.data,
                 local.posterior = TRUE,
                 pie.plot = TRUE,
                 pie.data = "genetree",
                 save.file = "example_plot.pdf",
                 pie.colors = c("purple", "blue", "green"),
                 node.color.text = c("white"),
                 node.color.bg = c("black"),
                 node.label.size = 0.5,
                 tip.label.size = 0.75,
                 pie.chart.size = 1)

Parameter explanations:

astral.plane: AstralPlane S4 object of data generated from AstralPlane function
local.posterior: plot the local posterior support = TRUE
pie.plot: TRUE to plot pie charts on branches. FALSE ignores whatever is selected for "pie.data"
pie.data: 'qscore' the quartet support or 'genetree' proportion of gene trees that support a branch
save.file: if you wish to save to file, put file name. Saves as PDF
pie.colors: select three colors to plot your pie.plot
node.color.text: if local.posterior = TRUE, select the color of posterior support text
node.color.bg: if local.posterior = TRUE, select the color of posterior support background
node.label.size: size of the node labels, passed to cex in plotting function
tip.label.size: size of the tip labels, passed to cex in plotting function
pie.chart.size: size of pie chart, passed to edgelabel plotting function

B) ASTRAL-III batch analysis

This section is a summary of batch ASTRAL-III data analysis, which incorporates the previous functions in a wrapper to analyze across multiple datasets. By multiple datasets, this means any sets of data you would like to analyze separately, such as exons, introns or UCEs.

1). To begin, the working directory and paths can be setup like in the previous section

library(PhyloConfigR)
astral.path = "/usr/local/bin/Astral-5-14/astral.5.14.2.jar"

work.dir = "/Test_Astral"
genetree.folder = "Genetree_Directory"

dir.create(work.dir)
setwd(work.dir)

2). Next, all of you datasets should have gene trees estimated from the alignments, and saved to their own directory (i.e. folder called "exons" or "uces"). There should only be the gene trees in this folder, as other files could cause the script to crash. These gene tree data types should be placed in an over-arching "genetree.folder" directory in order to be analyzed together.

batchAstral(genetree.datasets = genetree.folder,
            astral.t = 2,
            output.dir = "test-dataset",
            min.n.samples = 4,
            min.sample.prop = 0.1,
            taxa.remove = NULL,
            overwrite = TRUE,
            quiet = F,
            astral.path = astral.path,
            make.polytomy = TRUE,
            polytomy.limit = 10,
            multi.thread = TRUE,
            memory = "8g")

Parameter explanations:

genetree.datasets: a folder of genetrees to prepare for astral analyses
astral.t: the ASTRAL-III "t" parameter for different annotations, t = 2 is all annotation
output.dir: the save name for your concatenated gene tree file
min.n.samples: the minimum number of samples to keep a gene tree
min.sample.prop: the minimum proportion of samples to keep a gene tree
taxa.remove: species that you would like removed from each gene tree
overwrite: whether to overwrite an existing dataset
quiet: hides the screen output from astral if desired
astral.path: the absolute file path to the ASTRAL-III jar file
make.polytomy: whether to collapse poorly supported nodes into polytomies
polytomy.limit: if make.polytomy = TRUE, the threshold value for node collapsing
multi.thread: TRUE to use Astral-MP multithreading 
memory: memory value to be passed to java. Should be in "Xg" format, X = an integer

After this single function is run, the "test-dataset" output directory will be created with all of your astral results in them! You can create figures and plot the results as above. An example below shows a loop saving the plots for all the datasets:

#Obtains dataset names
datasets = list.dirs(genetree.folder, full.names = F, recursive = F)

for (i in 1:length(datasets)){

  #Read in the astral data and tree and organize it into different slots
  dataset.name = paste0("test-dataset/", datasets[i], "_astral.tre")
  astral.data = createAstralPlane(astral.tree = dataset.name,
                                  outgroups = outgroups,
                                  tip.length = 1)

  #Plots the astral data
  astralProjection(astral.plane = astral.data,
                   local.posterior = TRUE,
                   pie.plot = TRUE,
                   pie.data = "qscore",
                   save.file = paste0("test-dataset/", datasets[i], ".pdf"),
                   pie.colors = c("purple", "blue", "green"),
                   node.color.text = c("white"),
                   node.color.bg = c("black"),
                   tip.label.size = 0.75,
                   pie.chart.size = 1)

}#end i loop

C) Alignment filtration

  1. To begin, you will first need a folder of alignments in phylip format and a folder of gene trees from IQTREE (other programs will probably work; if not, let me know and I can add them in). Create your working directory first (or use an existing directory). tree.files and align.files link to the gene tree files and alignments that estimated them. The names must match between the genes and alignments (except for the file extension).
library(PhyloConfigR)
work.dir = "WorkingDirectory"
align.files = "WorkingDirectory/alignments"
  1. Next, you will want to select your filters to use, adjusted to the features of your dataset. Here is an example:
filter.length = c(100, 500, 1000, 1500, 2000, 2500) #number of base pairs
filter.sample = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1) #proportion samples
filter.prop.pis = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1) #proportion pis
filter.count.pis = c(10, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500) #count of pis

This is for many different filtered datasets, for a single filtered dataset you can use:

filter.length = c(100) #number of base pairs
filter.sample = c(0.5) #proportion samples (higher, more samples filtered out)
filter.prop.pis = c(0.1) #proportion pis (higher, less informative alignments filtered out)
filter.count.pis = c(10) #count of pis (lower, less informative alignments filtered out)
  1. To obtain a table of statistics for each alignment, run the summarizeAlignments function. The inputs are the alignment directory path and the file export name. This function is general and can be used for other purposes.
#Estimated run time: 10 minutes
align.summary = summarizeAlignments(alignment.path = align.files,
                                    file.export = "alignment_stats",
                                    alignment.type = "phylip",
                                    dataset.name = "exons")

Parameter explanations:

alignment.path: path to a folder of multiple sequence alignments in phylip format
file.export: if a name is provided, the table is saved to file
overwrite: if TRUE overwrites file if it exists; FALSE the dataset is skipped
dataset.name: a unique name for your dataset. i.e. exons, introns, UCEs
alignment.type: select the format of the alignment. Phylip is avaialble, will be expanded in the future.

The summary table created by the function has the following columns:

alignment name = the file name of the alignment
number_samples = number of samples or taxa in the alignment (number of rows) 
proportion_samples = the proportion of samples in the alignment (number_samples / max samples)
alignment_length = total number of base pairs long for the alignment
count_pis = number of parsimony informative sites in the alignment
proportion_pis = proportion of sites that are informative (count_pis / alignment_length)
count_missing_bp = total number of bases missing from the alignment matrix
proportion_missing_bp = proportion of bases missing from the alignment matrix (count_missing_bp / total bp)
  1. Now that alignment statistics have been calculated, the filterSummary function can be used to obtain a quick summary of the datasets that will be generated using your selected filters (from above) and the alignment statistics.
#Estimated run time: 10 seconds
filt.summary = filterSummary(alignment.data = align.summary,
                             alignment.folder = align.dir,
                             dataset.name  = "exons",
                             file.out = "filter_summary",
                             length.filters = filter.length,
                             sample.filters = filter.sample,
                             prop.pis.filters = filter.prop.pis,
                             count.pis.filters = filter.count.pis)

Parameter explanations:

alignment.data: Alignment summary stats calculcated from summarizeAlignments
alignment.folder: The alignment folder from which the stats were calculated from in alignment.data
dataset.name: The name of your dataset, where all filtered datasets will be placed in this folder
file.out: if you wish to save to file, provide a file name for the summary
overwrite: whether to overwrite an existing dataset
length.filters: Your selected length filters as a vector of values for the alignment length
sample.filters: Your selected sampling fraction filters as a vector of values between 0-1
prop.pis.filters: Your selected parsimony informatives sites filter as a vector of values between 0-1
count.pis.filters: Your selected parsimony informatives sites filter as a vector of base pair counts
  1. After alignment and filtered datasets summary statistics have been calculated, these two can be combined together to create concatenated alignments and parittion files for each filtered dataset. These concatenated alignments can be used in concatenation-based phylogenetics software (e.g. IQTREE, RAXML) to test out the impact of filtering on concatenation analyses.
#Estimated run time: 10 minutes
filterAlignments(filter.summary = filt.summary,
                 alignment.data = align.summary,
                 alignment.folder = align.dir,
                 format = "concatenated",
                 min.alignments = 5,
                 overwrite = TRUE)

Parameter explanations:

filter.summary: summary data file from filterSummary
alignment.data: summary data file from alignmentSummary
alignment.folder: folder of alignments to be filtered
format: save format for alignments
min.alignments: minimum number of alignments for a filtered set of data
overwrite: if TRUE overwrites file if it exists; FALSE the dataset is skipped
  1. Prior to filtered summary species tree analysis in ASTRAL-III, the next function creates gene tree datasets for each filtered dataset, prepared for input in ASTRAL-III. Each filtered gene tree dataset is concatenated together (or placed in a folder with format = "folder") and saved to a folder called "filtered-genetrees-concatenated" for the concatenated gene trees or "filtered-genetrees-folders" for directories of gene trees for each filtered dataset.
#Estimated run time: 10 minutes
filterGeneTrees(filter.summary = filt.summary,
                alignment.data = align.summary,
                genetree.folder = tree.dir,
                format = "concatenated",
                overwrite = TRUE,
                min.trees = 5,
                min.n.samples = 4,
                make.polytomy = TRUE,
                polytomy.limit = 10,
                remove.node.labels = FALSE)

Parameter explanations:

filter.summary: summary data file from filterSummary
alignment.data: summary data file from alignmentSummary
genetree.folder: your target folder of gene trees that correspond to the alignments being filtered
format: save format for genetrees
overwrite: if TRUE overwrites file if it exists; FALSE the dataset is skipped
taxa.remove: species that you would like removed from each gene tree
min.trees: mimimum number of trees to keep filtered set
min.n.samples: the minimum number of samples to keep a gene tree
min.sample.prop: the minimum proportion of samples to keep a gene tree
make.polytomy: collapses polytomies in the gene trees
polytomy.limit: the value at which to collapse a node into a polytomy
remove.node.labels: strips trees of node labels if downstream analyses give you trouble (not recommended)
  1. To run the new filtered sets of gene trees through ASTRAL-III efficiently, the AstralPlane package function astralRunner can be used on a folder of concateanted gene trees. the concatenated gene tree dataset can be provided to the astralRunner function, which runs ASTRAL-III for each concatenated set of gene trees from each filtered dataset in the "filtered-genetrees-concatenated" folder. Each summary species tree is saved in the output.dir.
#Estimated run time: hours, depending on how many filters selected
astralRunner(concat.genetree.folder = "filtered-genetrees-concatenated",
             output.dir = "filtered-astral",
             overwrite = TRUE,
             astral.path = astral.path,
             astral.t = 2,
             quiet = TRUE,
             multi.thread = TRUE,
             memory = "8g")               

Parameter explanations:

concat.genetree.folder: a folder of genetree files that are concatenated.
output.dir: the output directory name for the astral file
overwrite: overwrite = TRUE to overwrite existing files
astral.path: the absolute file path to the ASTRAL-III jar file
astral.t: the ASTRAL-III "t" parameter for different annotations, t = 2 is all annotation
quiet: TRUE hides the screen output from astral
multi.thread: TRUE to use Astral-MP multithreading 
memory: memory value to be passed to java. Should be in "Xg" format, X = an integer
  1. Finally, concordance factors from IQTREE 2 can be computed on the species tree using the filtered gene trees, filtered alignments, and their corresponding species trees. The concordance factors for sites and genes will computed for every node in every filtered replicate, and these results can be plotted in the next section to display how filtering affects concordance factors across the entire tree or a focal node.
concordanceRunner(alignment.dir = "filtered-alignments-concatenated",
                  species.tree.dir = "filtered-Astral",
                  genetree.dir = "filtered-genetrees-concatenated",
                  output.dir = "concordance-factors",
                  iqtree.path = "iqtree2",
                  overwrite = TRUE,
                  quiet = TRUE,
                  threads  = 6)         

Parameter explanations:

alignment.dir: The alignment folder from which the stats were calculated from in alignment.data
species.tree.dir = output directory from previous step that contains filtered ASTRAL-III species trees
genetree.dir: a folder of genetree files that are concatenated
output.dir: the output directory name to save the concordance factors results from filtration replicates
iqtree.path: path to IQTREE 2 if R cannot find it
overwrite: overwrite = TRUE to overwrite existing files
quiet: TRUE hides the screen output from astral
threads: the number of threads to used, passed to IQTREE

More coming soon!