Disclaimer

This is a pretty specific quick script that can be improved if there is interest.

What do I do?

This program:

Compiles and concatenates sequence output from RAST annotation TSV file downloads from any genome annotation project if given a comparison TSV file for each gene.
Takes all compiled gene files and aligns them using MUSCLE
Concatenates all aligned gene files in order
Trims out all columns that contain a gap

Output

This program outputs a common gene set, aligned and ready for phylogenetic tree construction based on common nucleotide sites across common genes in all isolates.

Input

Genome/isolate annotation TSV formatted file.
Comparison TSV file generated from RAST genome comparison program.
Make sure you do not edit the information inside the cells of the TSV files.

Comparison TSV files are generated in RAST when you select the compare genomes option and select your genome contents to compare export the results to a TSV file. The TSV file will start with columns: Contig, Gene, Length, gene id and function for the reference you selected followed by Hit, contig, gene, gene id, percent id and function for each of your isolates.

Preprocessing steps to facilitate script working

Remove the first 4 columns from the RAST comparison file TSV, the Contig, gene, length and gene id for your reference in the RAST TSV comparison output.
Make a directory called RAST where you store all of the individual isolate genome annotation TSV files you download from each project.
Ensure you have no missing genes in your comparison TSV. This program is only meant to work with common gene sets among genomes.

Setup

Install Python 3.x version of miniconda
git clone https://github.com/Vallendrez-Bio/gene_rast_thing.git
If you do not have git installed, download and unzip the directory

gene_rast_thing

Do gene thing

Compilation of all gene files given a TSV comparison file and a directory of all annotation files TSV format from RAST. This script relys on default RAST outputs including file names and column inputs so do not change anything from the downloads. The script uses the gene_id column to fine the correct annotation file and then file the correct gene within feature_id of the RAST isolate annotation files.
```
./gene.py genes.tsv  RAST/*
```
Align

Aligns using MUSCLE (https://www.drive5.com/muscle/) at default settings. For right now this script is picky and needs to be in the same directory as your output directory from the previous step. AND since this script calls on the muscle.sh script you will need to make sure the muscle.sh script is also in the same directory as your output directory from the previous step. Don't worry about the fact you don't see muscle.sh in the command below, the script will deal with muscle.
```
./align.sh output
```
MUSCLE citations:

Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput Nucleic Acids Res. 32(5):1792-1797 [Link to PubMed].

Edgar, R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity BMC Bioinformatics, (5) 113 [Link to PubMed].
Rename output files so they sort gooder

This renames all output alignment files such that they can be sorted using natural order. That is, the files should be named 1.aln.fasta, 2.aln.fasta, 3.aln.fasta where the digit comes from the column in the fasta filename fig_321327_43_peg_<somedigits>.aln.fasta.
```
ls -1 output_aln/* | while read f; do nn=$(echo $f | sed 's|output_aln/fig_321327_43_peg_$[0-9]\+$.aln.fasta|\1.aln.fasta|'); mv $f output_aln/$nn; done
```
Make big-ass fasta concatenated file

Takes all of your aligned gene files and concatenates them into one long "common gene set - genome". See the identifer pattern? That's why you cannot deviate from what RAST outputs.
```
./sequence_files_concat.py --identifier-pattern "(fig\|\d+\.\d+\.peg)" output_aln/* > out.fasta
```
Trim out all '-' columns

Removes any site (column) from the entire alignment if it contains a dash.

Remember: This script trashes frame - so don't count of changing anything to amino acids, translation will be incorrect!

./trim_alignment.py out.fasta > out.trim.fasta

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
tests/sequence_concat		tests/sequence_concat
.gitignore		.gitignore
README.md		README.md
align.sh		align.sh
fasta.py		fasta.py
genes.py		genes.py
muscle.sh		muscle.sh
sequence_files_concat.py		sequence_files_concat.py
test_genes.tsv		test_genes.tsv
test_index.tsv		test_index.tsv
trim_alignment.py		trim_alignment.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Disclaimer

What do I do?

Output

Input

Preprocessing steps to facilitate script working

Setup

gene_rast_thing

About

Releases

Packages

Contributors 2

Languages

Vallendrez-Bio/gene_rast_thing

Folders and files

Latest commit

History

Repository files navigation

Disclaimer

What do I do?

Output

Input

Preprocessing steps to facilitate script working

Setup

gene_rast_thing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages