Preparing sequence data for Transposome

Evan Staton edited this page Nov 1, 2017 · 9 revisions

QUALITY TRIMMING

All sequence data contain regions of low quality, and it is strongly advised that the data be quality trimmed prior to analysis. There are many published methods for quality trimming, but I personally prefer PRINSEQ. A standalone trimming script can be downloaded from the PRINSEQ page, which makes it easy to trim many files in parallel on a computer cluster.

LENGTH TRIMMING

The PRINSEQ tool allows you to specify a lower length threshold when trimming. It is a good idea to remove very short sequences, e.g., those shorter than 50 bp, because these short reads will produce lower confidence connections. Very short sequences may also produce warnings from BLAST (if the sequence is shorter than the word size).
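To make the length threshold concrete, here is a minimal Python sketch of a length filter. It is not PRINSEQ and does not reproduce its behavior; the function names `read_fastq` and `filter_by_length` are hypothetical, the reads are made up, and the code assumes a plain FASTQ stream with exactly four lines per record.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

# One FASTQ record: (header, sequence, separator, quality)
Record = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        seq = handle.readline().rstrip("\n")
        sep = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, sep, qual


def filter_by_length(records: Iterable[Record], min_len: int = 50) -> Iterator[Record]:
    """Keep only records whose sequence is at least min_len bases long."""
    for record in records:
        if len(record[1]) >= min_len:
            yield record


# Tiny in-memory demonstration with made-up reads of 60 bp and 20 bp.
demo_lines = [
    "@read1/1", "A" * 60, "+", "I" * 60,
    "@read2/1", "C" * 20, "+", "I" * 20,
]
demo = io.StringIO("\n".join(demo_lines) + "\n")

kept = list(filter_by_length(read_fastq(demo), min_len=50))
for header, seq, _, _ in kept:
    print(header, len(seq))
```

Note that filtering the forward and reverse files independently like this can leave reads without a mate, so paired data would need to be re-paired before interleaving.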

CONTAMINANT REMOVAL

Another issue with WGS data is that the read set will typically contain non-target sequences, such as those derived from organelles. Also, many genomes have a high proportion of ribosomal DNA and simple repeats. It is advised that all of these sequences be removed, both to reduce artifacts in the analysis and to reduce the computational resources required to run it. For example, leaving ribosomal DNA reads in the analysis is okay, but it will lead to much larger file sizes and greater memory usage.

PAIRED-END vs SINGLE-END READS

The indexing and clustering steps in Transposome use the read pair information in the FASTA/Q header. That means paired-end data must have FASTA/Q headers in one of the expected formats. Transposome expects either the commonly used Casava 1.8+ format, as shown here:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

or the Casava 1.3+ format, as shown here:

@EAS139:136:FC706VJ:2:2104:15343:197393/1

Almost all sequencing platforms generate this format, so unless your data are in a custom format (such as files from the SRA), no reformatting is required. If the read pair information is missing, you can easily add the information for each pair with the Pairfq tool as shown below. For the forward (first) pair,

pairfq addinfo -i s_1_1_sample_500k.fq.gz -o s_1_1_sample_500k_pair.fq.gz -p 1 -c gzip

And, for the reverse (second) pair,

pairfq addinfo -i s_1_2_sample_500k.fq.gz -o s_1_2_sample_500k_pair.fq.gz -p 2 -c gzip

Next, interleave for the analysis (again, using Pairfq):

pairfq joinpairs -f s_1_1_sample_500k_pair.fq.gz -r s_1_2_sample_500k_pair.fq.gz -o s_1_sample_1m_interl.fq.gz -c gzip

You can find out more about this tool on the Pairfq GitHub page or the site wiki page.

Note that this step is not necessary for single-end data. However, for short, paired-end data, the cluster merging process of Transposome, which is one of the main improvements over RepeatExplorer, will not work without the pair information provided in the sequence name. There will also be warnings during the cluster merging process if this information is not found.
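Before starting a run, it can be useful to confirm that the headers in an interleaved file actually carry pair information. The following Python sketch does this for the two header styles shown above. It is an illustration only and is not how Transposome itself parses headers: the regular expressions are assumptions based on those two example headers, and the names `pair_info` and `check_interleaved` are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed shape of a Casava 1.8+ header: "@<id> <read>:<filtered>:<control>:<index>"
CASAVA_18 = re.compile(r"^[@>]?(\S+)\s+([12]):[YN]:\d+:\S*$")
# Assumed shape of the older style: "@<id>/<read>"
CASAVA_13 = re.compile(r"^[@>]?(\S+)/([12])$")


def pair_info(header: str) -> Optional[Tuple[str, int]]:
    """Return (read_id, pair_number) if the header carries pair info, else None."""
    header = header.strip()
    for pattern in (CASAVA_18, CASAVA_13):
        match = pattern.match(header)
        if match:
            return match.group(1), int(match.group(2))
    return None


def check_interleaved(headers: Iterable[str]) -> List[str]:
    """Return a list of problems found in the header lines of an interleaved file."""
    headers = list(headers)
    problems = []
    if len(headers) % 2:
        problems.append("odd number of reads; at least one read has no mate")
    # Walk the headers two at a time: forward read, then reverse read.
    for i in range(0, len(headers) - 1, 2):
        fwd, rev = pair_info(headers[i]), pair_info(headers[i + 1])
        if fwd is None or rev is None:
            problems.append(f"reads {i + 1}-{i + 2}: missing pair information")
        elif fwd[0] != rev[0] or (fwd[1], rev[1]) != (1, 2):
            problems.append(f"reads {i + 1}-{i + 2}: not a forward/reverse pair")
    return problems


example_headers = [
    "@EAS139:136:FC706VJ:2:2104:15343:197393/1",
    "@EAS139:136:FC706VJ:2:2104:15343:197393/2",
]
print(check_interleaved(example_headers) or "pair information looks consistent")
```

An empty list from `check_interleaved` means every consecutive pair of headers shares an identifier and is numbered 1 then 2; anything else suggests the Pairfq steps above are needed.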