Tutorial

In the following examples we will start with a file of DNA sequences for a set of genes of interest and we will end up with a mapping of GO terms for each sequence.

Finding Open Reading Frames (ORFs)

Starting with a file of DNA sequences (called "genes.fasta"), we first want to get the longest ORF for each gene and translate those sequences. In the simplest case, we can just specify the input and output files. For example,

hmmer2go getorf -i genes.fasta -o genes_orfs_aa.faa --verbose

We can easily get the nucleotides for those translated ORFs by specifying the output type with the "-t" option:

hmmer2go getorf -i genes.fasta -o genes_orfs_nt.fas -t 2 --verbose

There are additional options that we could specify with this command. One useful option is to specify the minimum length of ORFs to report. The default minimum ORF length is 80 nt, but would could increase this.

hmmer2go getorf -i genes.fasta -o genes_orfs.faa -l 100 --verbose

For some data sets, it may be of interest to find all the ORFs for a particular sequence, like a genomic contig, for example. That can be accomplished with the option to specify all ORFs should be returned.

hmmer2go getorf -i contigs.fasta -o contigs_orfs_all.faa --all --verbose

For more information on this command, and for an explanation of each option, read the documentation with:

hmmer2go getorf --man

Creating a custom HMM database

Often we have a specific question in mind and don't want to 1) search all of Pfam because it's time consuming, and 2) parse through all the results to get the interesting bits. The command below will run a Pfam search of one or more terms (separated by a comma) and log the results.

hmmer2go pfamsearch -t mads,mads-box -o mads_mads-box_pfamsearch.out

The log file is tab-delimited and contains three fields:

Pfam-accession Pfam-ID Description

This command should run very fast. If the results are relevant (determined by analyzing the log file), you can easily create a custom HMM database formatted for hmmer2go run (described below) with one extra flag.

hmmer2go pfamsearch -t mads,mads-box -o mads_mads-box_pfamsearch.out -d

This command will create a formatted database in a separate directory. To run the HMMER search on this specific custom database, you would do the following:

hmmer2go run -i genes_orfs.faa -d mads+mads-box_hmms/mads+mads-box.hmm

Note the database will be in a directory with the search term postfixed with "_hmms" and the database will be the search term postfixed with ".hmm" (also note that in both the directory and database name, the commas have be replaced with the plus ("+") sign.)

The examples below follow the typical case of searching against Pfam-A, instead of a custom database.

Finding protein domain matches

With the ORFs obtained from the hmmer2go getorf command we can now search for domain matches.

hmmer2go run -i genes_orfs.faa -d Pfam-A.hmm -n 4

The above command specifies 4 CPUs to be used, and when completed will produce three files:

genes_orfs_Pfam-A.out,
genes_orfs_Pfam-A.domtblout, 
and genes_orfs_Pfam-A.tblout

Obtaining Pfam->GO mappings (Optional)

We will now use the table of domain matches to map GO terms. To do this we first need to download the Pfam->Gene Ontology mappings. This can be done with a single command:

hmmer2go fetchmap -o pfam2go

The above command creates the file: pfam2go. Note that this step isn't strictly necessary, but it may save a little time in a pipeline to avoid retrieving the mapping file every iteration of an analysis.

Mapping domain matches to GO terms

Now we can map the protein domain matches to GO terms.

hmmer2go mapterms -i genes_orfs_Pfam-A.tblout -p pfam2go -o genes_orfs_Pfam-A_GO.tsv --map

Note that in the above command a 'pfam2go' mapping file is provided, though it isn't necessary. The same result is obtained with the following command:

hmmer2go mapterms -i genes_orfs_Pfam-A.tblout -o genes_orfs_Pfam-A_GO.tsv --map

This last command will create two output files:

genes_orfs_Pfam-A_GO.tsv, 
and genes_orfs_Pfam-A_GO_GOterm_mapping.tsv

The first output file is a tab-delimited table with a description of each domain, including the GO terms and the associated functions. The last file is a two column table with the sequence name in the first column and the GO terms associated with that sequence in the second column.

Some statistical analysis packages, such as Ontologizer, require a GAF file as input. This can be created with the hmmer2go map2gaf command.

hmmer2go map2gaf -i genes_orfs_Pfam-A_GO.tsv -o genes_orfs_Pfam-A_GO_GOterm_mapping.gaf -s 'Helianthus annuus'

The species name is required and it should be given in quotes as shown above.

Provide feedback

Saved searches