Specifications and example usage

Evan Staton edited this page Nov 9, 2017 · 32 revisions

This guide describes the files and formats produced by Transposome. The naming convention for the output files is chosen by the user and taken from the configuration file. First, we will look at a configuration file and describe each field. Next, we will run the transposome program with that configuration, and then look at each file it generates.


Configuration

Here is an example configuration file:

blast_input:
  - sequence_file:      SRR486236_trimmed_filteredall_100k_interl.fasta
  - sequence_format:    fasta
  - sequence_num:       10000
  - cpu:                2
  - thread:             2
  - output_directory:   maize_transposome_results_out_100k_8-21
clustering_options:
  - in_memory:          YES
  - percent_identity:   90
  - fraction_coverage:  0.55
  - merge_threshold:    100
annotation_input:
  - repeat_database:    maizeTEdb_23-Apr-2013_20-45_rfmt.fasta
annotation_options:
  - cluster_size:       100
  - blast_evalue:       10
output:
  - run_log_file:       maize_log_100k_8-21.txt
  - cluster_log_file:   maize_cluster_report_100k_8-21.txt

This file is in YAML format and has five sections: blast_input, clustering_options, annotation_input, annotation_options, and output. Each of these sections contains a list of key/value pairs which describe the options for part of the analysis or input/output files. Only the value (the right-hand field) should be modified.
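As a rough illustration of that two-level layout (section names, each holding a list of key/value pairs), the sketch below reads the structure with only the Python standard library. It is not Transposome's own parser, the function name `parse_transposome_config` is made up for this example, and it assumes exactly the "section:" / "- key: value" shape shown above. All values are kept as strings.

```python
def parse_transposome_config(text):
    """Parse the 'section:' / '- key: value' layout into nested dicts."""
    config = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("-"):
            # An option line such as "- cpu:   2" inside the current section.
            if section is None:
                raise ValueError(f"option outside of a section: {raw!r}")
            key, _, value = line[1:].partition(":")
            config[section][key.strip()] = value.strip()
        else:
            # A section header such as "blast_input:".
            section = line.rstrip(":").strip()
            config[section] = {}
    return config


# A shortened copy of the example configuration above.
EXAMPLE = """\
blast_input:
  - sequence_file:      SRR486236_trimmed_filteredall_100k_interl.fasta
  - cpu:                2
  - thread:             2
clustering_options:
  - in_memory:          YES
  - merge_threshold:    100
"""

config = parse_transposome_config(EXAMPLE)
print(sorted(config))
print(config["blast_input"]["cpu"])
```

A general YAML library would be the more robust choice for real use; this version only shows how the sections and options nest.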

Here is a detailed description of this example configuration broken down by section:

blast_input

  • sequence_file - (string) The file of sequence reads to be used for finding repeats. Typically, these data would be quality trimmed, and contaminants (DNA from non-target species, organelles, etc.) would be removed.
  • sequence_format - (string) This specifies the input sequence format, which may be fastq or fasta (Default is fasta).
  • sequence_num - (int) The number of reads to be analyzed for similarity by each thread in the all vs. all BLAST step. It should be set to a number less than the number of reads, ideally one that divides the total read count evenly so that the work is balanced across threads and all CPUs are used. If running a serial analysis (i.e., 1 CPU and 1 thread), just leave this as the default value.
  • cpu - (int) The number of CPUs to be used by each thread in the all vs. all BLAST step, and the number of CPUs that will be used during the annotation step.
  • thread - (int) The number of threads to use during the all vs. all BLAST step. IMPORTANT: the number of CPUs used will be thread x cpu so choose this number carefully. See the wiki entry on setting the appropriate thread level for more information.
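The arithmetic behind these settings can be sketched as follows. This is only an illustration under the assumption that the reads are processed in batches of sequence_num; the function name `blast_plan` and the returned keys are hypothetical and not part of Transposome. The read count of 100,000 matches the example data set used in this guide.

```python
import math


def blast_plan(total_reads, sequence_num, cpu, thread):
    """Summarize how the BLAST settings relate to each other."""
    return {
        # Number of batches if reads are split sequence_num at a time (assumption).
        "chunks": math.ceil(total_reads / sequence_num),
        # True when sequence_num divides the read count with no remainder.
        "even_split": total_reads % sequence_num == 0,
        # Total CPUs in use during the BLAST step: thread x cpu.
        "total_cpus": cpu * thread,
    }


plan = blast_plan(total_reads=100_000, sequence_num=10_000, cpu=2, thread=2)
print(plan)
```

The total_cpus value is the number to request from a queuing system when submitting the job.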

clustering_options

  • in_memory - (Bool - 1/0 or yes/no) This specifies whether the method for finding pairs in the graph should perform all calculations in RAM or use an on-disk index. The options are 1 or "YES" for yes, and 0 or "NO" for no (YES/NO are case insensitive, so writing "yes", for example, is perfectly fine). This should almost always be left set to YES, because the analysis is much faster when calculations are performed in memory; switch to NO only if memory limitations become an issue.
  • percent_identity - (int) The percent identity between two reads required to keep a match. This option typically has little impact on the results, but it may be informative to vary it and compare against the results from other parameter settings.
  • fraction_coverage - (float) The fraction of overlap between two reads required to keep a match. As with percent_identity, this usually has little impact on the results, but it may be informative to vary this from the default of 0.55.
  • merge_threshold - (int) The number of paired-end reads required to be split between two clusters in order for them to be merged. This should be set to a small fraction of the total reads, such as 0.0001 or 0.001 of the read count (the value of 100 in the example above is 0.001 of 100,000 reads). This option can often have a large impact on the clustering results, and evaluating its effect on the results is advised.
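Since merge_threshold is an integer read count while the guidance is given as a fraction of the total reads, a small conversion may help. The sketch below assumes the threshold is simply the read count multiplied by the chosen fraction, which is consistent with the example configuration (100 for a 100,000-read data set); the helper name `merge_threshold_for` is made up for illustration.

```python
def merge_threshold_for(total_reads, fraction=0.001):
    """Convert a fraction of the total reads into an integer threshold."""
    # Never return less than 1, since the option is a count of read pairs.
    return max(1, round(total_reads * fraction))


print(merge_threshold_for(100_000))          # fraction of 0.001
print(merge_threshold_for(100_000, 0.0001))  # fraction of 0.0001
```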

annotation_input

  • repeat_database - (string) The FASTA file of sequences, typically transposable elements from a closely related species, to be used for annotation.

annotation_options

  • cluster_size - (int) The minimum cluster size, in terms of read number, to be used for annotation. Setting this parameter to a larger value will result in faster run times, but fewer clusters being annotated, while setting this parameter to less than 100 has very little impact on the results.
  • blast_evalue - (double) The expectation value cutoff used for the BLAST annotation step. This can be set to a smaller value to refine the results, but has little impact. The default of 10 is advised.

output

  • run_log_file - (string) The file to log the progress and results of Transposome. It is advised to set this to something unique reflecting the date, species, and parameter set.
  • cluster_log_file - (string) The file used to store the clustering and annotation results. Note that all the annotation files will be named from this file's base name (explained below).

Running Transposome

The input sequence file and the repeat database are assumed to be in the working directory in the above example. The full PATH to those files must be specified if they are not in the working directory. Here is how we would execute transposome with the above configuration (which we will call "transposome_maize.yml"):

#!/bin/bash

cd `pwd`

transposome --config transposome_maize.yml

We put this command in a Bash script (and call it "transposome.sh") so that it can be submitted to a queuing system, or executed at the command line. Here is how we would submit this to the queue (using SGE):

qsub -q queue -pe thread 4 -o transposome.out -e transposome.err transposome.sh

Note that "queue" would have to be a real queue name, and we specify 4 threads since we chose 2 CPUs X 2 threads in the configuration. When the process completes, which takes about 10 min. for 100k reads with maize, we should see our results in the output directory we specified in the configuration:

$ cd maize_transposome_results_out_100k_8-21
$ ls | xargs stat --printf "%n \n"
maize_cluster_report_100k_8-21_annotations_summary.tsv 
maize_cluster_report_100k_8-21_annotations.tsv 
maize_cluster_report_100k_8-21_singletons_annotations_summary.tsv 
maize_cluster_report_100k_8-21_singletons_annotations.tsv 
maize_cluster_report_100k_8-21.txt 
maize_log_100k_8-21.txt 
SRR486236_trimmed_filteredall_100k_interl_allvall_blast.bln 
SRR486236_trimmed_filteredall_100k_interl_allvall_blast_louvain_cls_fasta_files_08_21_2014_17_09_25 
SRR486236_trimmed_filteredall_100k_interl_allvall_blast_louvain_cls_fasta_files_08_21_2014_17_09_25_annotations 
SRR486236_trimmed_filteredall_100k_interl_allvall_blast_louvain.cls.membership.txt 
SRR486236_trimmed_filteredall_100k_interl_allvall_blast_louvain.hs 
SRR486236_trimmed_filteredall_100k_interl_allvall_blast_louvain.idx 
SRR486236_trimmed_filteredall_100k_interl_allvall_blast_louvain.int 
SRR486236_trimmed_filteredall_100k_interl_allvall_blast_louvain_merged_08_21_2014_17_09_25.cls 
SRR486236_trimmed_filteredall_100k_interl_allvall_blast_louvain.tree.log

Transposome annotations

The above command shows that the transposome command will produce 13 files and 2 directories in the output directory specified. This may seem overwhelming at first, but note that one of these files contains a summary of the repeat content of the genome and all the other files are there for analyzing features of specific clusters, or inspecting the annotation results, for example. First, let's look at the log file for the annotation results:

$ grep Results maize_transposome_results_out_100k_8-21/maize_log_100k_8-21.txt
INFO - Results - Total sequences:                        100000
INFO - Results - Total sequences clustered:              70003
INFO - Results - Total sequences unclustered:            29997
INFO - Results - Repeat fraction from clusters:          0.70003
INFO - Results - Singleton repeat fraction:              0.430576390972431
INFO - Results - Total repeat fraction:                  0.82919
INFO - Results - Total repeat fraction from annotations: 0.627816414749592    

Most of these metrics are self-explanatory, but two numbers should be considered carefully. The "repeat fraction from annotations" tells you what fraction of repeats was identified based on the reference library, while the "repeat fraction from clusters" tells you the repeat fraction in the genome based solely on similarity within the genome. The difference between these two numbers tells you the amount of potential repeats identified by clustering but not matched in the reference library. Since we used a repeat library for maize, these numbers are close (about 0.63 and 0.70), as you would expect. If these numbers are very different, that indicates the genome being analyzed is quite different from the reference set.
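To compare the two fractions without reading the log by eye, the "Results" lines can be pulled into a dictionary. This is a minimal sketch that assumes the log lines keep the "Results - name: value" shape shown above; the regular expression and the function name `parse_results` are illustrative, not part of Transposome.

```python
import re

# The "Results" lines from the example log above.
LOG_LINES = """\
INFO - Results - Total sequences:                        100000
INFO - Results - Total sequences clustered:              70003
INFO - Results - Total sequences unclustered:            29997
INFO - Results - Repeat fraction from clusters:          0.70003
INFO - Results - Singleton repeat fraction:              0.430576390972431
INFO - Results - Total repeat fraction:                  0.82919
INFO - Results - Total repeat fraction from annotations: 0.627816414749592
"""

RESULT_RE = re.compile(r"Results - (?P<name>[^:]+):\s+(?P<value>\S+)")


def parse_results(text):
    """Map each metric name in the log to its numeric value."""
    results = {}
    for line in text.splitlines():
        match = RESULT_RE.search(line)
        if match:
            results[match.group("name")] = float(match.group("value"))
    return results


results = parse_results(LOG_LINES)
gap = (results["Repeat fraction from clusters"]
       - results["Total repeat fraction from annotations"])
print(f"cluster fraction minus annotation fraction: {gap:.4f}")
```

For this run the gap is roughly 0.07, which is the quantity the paragraph above describes as the difference between the two numbers.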

Now let's look at the annotation summary file, since this is the file that contains the most interesting results.

$ head -3 maize_cluster_report_100k_8-21_annotations_summary.tsv
ReadNum	Superfamily	Family	GenomeFraction
100000	Copia	RLC_ji	0.135699572482705
100000	Copia	RLC_opie	0.102157247355668

The annotation summary file contains a list of repeat families ordered by genomic abundance, with the first line giving the most abundant family. Below is an explanation of each field in this file.

  • ReadNum - The number of input reads to the analysis.
  • Superfamily - The TE superfamily for that level of genomic abundance.
  • Family - The TE family for that level of genomic abundance.
  • GenomeFraction - The genomic abundance (fraction) of that family, corrected for by the number of singletons.
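Because the summary is a tab-delimited file with a header line, it can be aggregated with a few lines of standard-library Python. The sketch below sums GenomeFraction per superfamily using the two sample rows shown above; it assumes the columns are tab-separated, and the function name `genome_fraction_by_superfamily` is made up for this example.

```python
import csv
import io

# The header and first two rows of the example summary file.
SAMPLE = (
    "ReadNum\tSuperfamily\tFamily\tGenomeFraction\n"
    "100000\tCopia\tRLC_ji\t0.135699572482705\n"
    "100000\tCopia\tRLC_opie\t0.102157247355668\n"
)


def genome_fraction_by_superfamily(handle):
    """Sum the GenomeFraction column for each superfamily."""
    totals = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name = row["Superfamily"]
        totals[name] = totals.get(name, 0.0) + float(row["GenomeFraction"])
    return totals


totals = genome_fraction_by_superfamily(io.StringIO(SAMPLE))
for name, fraction in totals.items():
    print(f"{name}\t{fraction:.4f}")
```

To run this on a real file, pass an open file handle in place of the `io.StringIO` object.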

The next file is the cluster annotations file.

$ head -3 maize_cluster_report_100k_8-21_annotations.tsv
Cluster	Read_count	Type	Class	Superfamily	Family	Top_hit	Top_hit_genome_fraction
singletons	13430	transposable_element	ltr_retrotransposon	Copia	RLC_giepum	RLC_giepum_AC211251-11074	0.04
CL1	1566	transposable_element	ltr_retrotransposon	Copia	RLC_ji	RLC_ji_AC213834-12382	0.11

This file gives a listing of annotations for each cluster in descending order by size, with the singleton sequences typically being more abundant than the largest cluster (as shown above). Below are the specifications for this file:

  • Cluster - The cluster name, where CL stands for "cluster" and the number is the cluster number.
  • Read_count - The number of sequence reads in the cluster.
  • Type - The most common repeat type in that cluster.
  • Class - The TE class.
  • Superfamily - The TE superfamily.
  • Family - The TE family.
  • Top_hit - The top BLAST hit of the cluster.
  • Top_hit_genome_fraction - The fraction of the cluster that is the top hit specifically.
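Since the singletons row sits alongside the numbered clusters in this file, it is often convenient to separate the two before ranking clusters. The following sketch does that for the sample rows above, assuming tab-separated columns with the header shown; `split_clusters` is a hypothetical helper name.

```python
import csv
import io

# The header and first two rows of the example cluster annotations file.
SAMPLE = (
    "Cluster\tRead_count\tType\tClass\tSuperfamily\tFamily\tTop_hit\t"
    "Top_hit_genome_fraction\n"
    "singletons\t13430\ttransposable_element\tltr_retrotransposon\tCopia\t"
    "RLC_giepum\tRLC_giepum_AC211251-11074\t0.04\n"
    "CL1\t1566\ttransposable_element\tltr_retrotransposon\tCopia\t"
    "RLC_ji\tRLC_ji_AC213834-12382\t0.11\n"
)


def split_clusters(handle):
    """Return (singletons_row, cluster_rows) with Read_count as an int."""
    singletons = None
    clusters = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["Read_count"] = int(row["Read_count"])
        if row["Cluster"] == "singletons":
            singletons = row
        else:
            clusters.append(row)
    return singletons, clusters


singletons, clusters = split_clusters(io.StringIO(SAMPLE))
largest = max(clusters, key=lambda row: row["Read_count"])
print(largest["Cluster"], largest["Family"], largest["Read_count"])
print("singleton reads:", singletons["Read_count"])
```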

The output files with "singleton" in the name provide descriptions of singleton, or non-clustered, sequences. These files tell us about the low-copy TE sequences in the genome. The singleton file ending with "*annotations.tsv" is simply a tab-delimited BLAST report of all singleton reads, so we will skip the description of that file. The singletons summary file (ending with "*annotations_summary.tsv") provides a listing of TE family abundance in the singleton sequences, listed in descending rank order.

$ head -3 maize_cluster_report_100k_8-21_singletons_annotations_summary.tsv 
RLG_doke_AC197224-5479	433	0.03
RLC_ji_AC209892-10463	428	0.03
RLC_giepum_AC211251-11074	390	0.03

The specification of the singletons annotation summary file is shown below.

  • TE family - The TE family name.
  • ReadCt - The number of singleton reads.
  • SingletonPerc - The percentage of all the singleton reads.
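The excerpt of the singletons summary shown above starts directly with data, so the sketch below assumes the file has no header line and assigns column names by position. The key names used here (`family`, `read_count`, `singleton_perc`) are hypothetical labels chosen for this example, as is the function name.

```python
import csv
import io

# The first three rows of the example singletons summary file.
SAMPLE = (
    "RLG_doke_AC197224-5479\t433\t0.03\n"
    "RLC_ji_AC209892-10463\t428\t0.03\n"
    "RLC_giepum_AC211251-11074\t390\t0.03\n"
)


def read_singleton_summary(handle):
    """Read the three-column summary, naming the fields by position."""
    rows = []
    for family, read_count, perc in csv.reader(handle, delimiter="\t"):
        rows.append({
            "family": family,
            "read_count": int(read_count),
            "singleton_perc": float(perc),
        })
    return rows


rows = read_singleton_summary(io.StringIO(SAMPLE))
print(rows[0]["family"], rows[0]["read_count"])
print("reads in these rows:", sum(row["read_count"] for row in rows))
```

If a given version of the file does include a header line, the first row would need to be skipped before converting the counts.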

The annotation files described thus far are of primary interest, and what is described next are files related to the clustering results. Note that unless you are interested in the implementation details of Transposome, or are interested in diagnosing your results, what follows will be of limited use.


Clustering results

There are seven files that are produced during the all vs. all BLAST comparison and clustering process. The first file (ending with "*.bln") is simply a tab-delimited BLAST file, and this format is described elsewhere. This file is provided for advanced users who may wish to adjust the pair-finding process without redoing the entire BLAST process.

Three files are produced by Transposome from the BLAST output. The first ends with ".hs" and contains the sequence pairs and edge weights. The next two files (".int" and ".idx") are formatted for Louvain clustering, and allow the clustering input to be mapped back to the actual read names. The exact details of the clustering process are stored in the ".tree.log" file, and the ".cls.membership.txt" and ".cls" files contain the clustering results. The latter files are meant to be consumed computationally, and are thus not human-readable. Easy-to-read clustering progress and results are listed in the log file specified in the configuration and the clustering report (also specified in the configuration), respectively.

Lastly, there are two directories of results. One contains the FASTA files of each cluster, and the other contains annotations of each cluster. Note that these annotation results are already described above, and summarized, so there is no need to script a solution to perform this task. These results are included for diagnosing the results, and to easily perform additional analyses on specific clusters without the need to write any code.
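For readers who do want a quick programmatic check of the per-cluster FASTA directory, counting header lines gives the number of reads in each file. The sketch below is generic FASTA handling rather than anything specific to Transposome; the file names in the demonstration (CL1.fas, CL2.fas) are invented, and it assumes every regular file in the directory is a FASTA file.

```python
import tempfile
from pathlib import Path


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def cluster_sizes(directory):
    """Map each file name in the directory to its number of sequences."""
    return {
        path.name: count_fasta_records(path)
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "CL1.fas").write_text(">read1\nACGT\n>read2\nTTGA\n")
    Path(tmp, "CL2.fas").write_text(">read3\nGGCC\n")
    sizes = cluster_sizes(tmp)

print(sizes)
```

In practice, `cluster_sizes` would be pointed at the directory ending in "cls_fasta_files" plus the run timestamp inside the output directory.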