Adding support for partial genes #88

aleimba · 2015-03-17T09:53:57Z

I would like to raise the issue of including partial genes again. See the pull request (#37) from @lguy, where he implemented it.
I think it would be good for draft genomes and even more so for highly fragmented metagenomes.

Also, as @sjackman observed, the Prodigal modes have been renamed (#16). It might be useful to change 'single' to 'normal' and 'meta' to 'anon' in line 664 to avoid confusing users who look up the Prodigal docs.

EDIT: Just realized the newest Prodigal version still states 'single' and 'meta' in its command-line help text ... The Wiki however has only the new terms (https://github.com/hyattpd/prodigal/wiki/cheat-sheet and https://github.com/hyattpd/prodigal/wiki/Gene-Prediction-Modes)

aleimba · 2015-03-17T14:16:47Z

@hyattpd just replied about the mode name changes: they will be implemented from Prodigal v3.0.0 onward, and the Wiki already uses the new names in preparation for v3.x (hyattpd/Prodigal#11). My bad.

hyattpd · 2015-03-17T15:01:44Z

I haven't really followed this discussion, but I would not recommend the -c option for anything except finished chromosomes. With prokaryotic genomes being 85% coding, the likelihood of a partial gene running off either edge is extremely high (there is an 85% chance that an edge base lies inside a gene, and a somewhat lower chance that at least 60bp of coding sequence is present). You're going to miss more than half the genes in some data sets (those with only small contigs) using -c.

Just as an example: I have a data set in which E. coli is randomly sampled into thousands of 1200bp contigs, along with the coordinates of the Genbank-annotated genes in those contigs.

36296 Genbank genes contained in 19331 fragments
13345 genes run off the left edge
13233 genes run off the right edge
2475 genes run off both edges

With the -c option on, you'd miss at least half those genes (the ones missing stop codons). The ones missing start codons would be truncated and you'd be reporting less of the protein than you actually could be.
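As a rough check on the counts above, the arithmetic can be sketched in Python. The comment does not say whether the "both edges" genes are also included in the left and right counts, so the hypothetical helper `edge_summary` below takes that as an explicit assumption and reports either reading.

```python
# Counts quoted in the comment above (1200bp E. coli fragments).
TOTAL_GENES = 36296
FRAGMENTS = 19331
OFF_LEFT = 13345
OFF_RIGHT = 13233
OFF_BOTH = 2475


def edge_summary(total, left, right, both, both_counted_separately=True):
    """Return (genes touching an edge, fraction of all genes).

    both_counted_separately=True assumes the three counts are disjoint
    (left only, right only, both). False assumes the 'both' genes are
    already included in the left and right counts.
    """
    if both_counted_separately:
        touching = left + right + both
    else:
        touching = left + right - both
    return touching, touching / total


if __name__ == "__main__":
    for disjoint in (True, False):
        touching, frac = edge_summary(
            TOTAL_GENES, OFF_LEFT, OFF_RIGHT, OFF_BOTH, disjoint
        )
        print(f"disjoint={disjoint}: {touching} genes ({frac:.1%}) touch an edge")
    print(f"genes per fragment: {TOTAL_GENES / FRAGMENTS:.2f}")
```

Under either reading, well over half of the annotated genes in this data set touch at least one contig edge.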

aleimba · 2015-03-17T15:16:40Z

Thanks for your insight, @hyattpd. Although I think it's very unlikely that you'll get a bacterial draft assembly with a maximum contig length of 1200bp, you'll definitely miss genes in draft genomes and especially in metagenomes. Of course, as you said, how many depends on the number of small contigs.

hyattpd · 2015-03-17T18:09:41Z

Even in long contigs, there will be a partial gene at each edge ~75% of the time, so you wind up missing ~1.5 genes per contig. I guess it's a question of how much one cares about partial proteins. I think it is better just to always call Prodigal without -c, and have an option to Prokka to not report partial genes below some length (rather than passing this option on to Prodigal).
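The post-filtering idea could look roughly like the Python sketch below. It assumes Prodigal was run without -c and with GFF output, where each CDS line carries a `partial=` attribute (`00` for complete, `10`, `01` or `11` when an end runs off the contig). The function name `filter_partial_genes` and the `min_partial_len` threshold are hypothetical and are not existing Prokka options.

```python
def filter_partial_genes(gff_lines, min_partial_len=300):
    """Keep complete CDS lines and partial ones of at least min_partial_len bp.

    gff_lines: iterable of lines from Prodigal GFF output.
    min_partial_len: hypothetical cutoff in bp, applied to partial genes only.
    """
    kept = []
    for line in gff_lines:
        # Skip comments, blank lines, and anything that is not a CDS record.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])
        # Parse the key=value attribute column.
        attrs = dict(
            field.split("=", 1)
            for field in cols[8].strip(";").split(";")
            if "=" in field
        )
        is_partial = attrs.get("partial", "00") != "00"
        length = end - start + 1
        if is_partial and length < min_partial_len:
            continue  # short partial gene: do not report
        kept.append(line)
    return kept
```

With this approach Prodigal still predicts every edge gene, and the decision about which partial genes to report stays on the Prokka side.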

tseemann · 2015-03-18T08:38:44Z

@hyattpd I had considered the -c option originally when implementing Prokka and agreed with @aleimba but I am having second thoughts now.

In general, genome assemblies break at repeats that are longer than the read length or span. In bacteria this is nearly always duplicate / paralogous genes, such as rRNA islands and insertion sequences. The break often occurs in the intergenic region, and the repeated gene gets its own contig.
An N50 of 1200bp is rare in modern bacterial genomics, and such an assembly would simply be discarded.

But I think the idea of post-filtering is a good one. I will think more about it.

aleimba · 2015-03-18T15:19:39Z

I agree with @tseemann on gaps in repeats. I've also found (through cumbersome manual gap finishing) that many short-read assemblers don't cope very well at contig ends, producing overlaps with other contigs, incorrectly resolved repeats, etc.

Apart from that, including partial genes in Prokka would be a big plus, and thanks to @tseemann for taking it up! As you mentioned, a test suite would be sweet for future code integration, but sadly I have no experience with that. I'm guessing it's quite some overhead.

tseemann added the enhancement label Jul 4, 2015

tseemann self-assigned this Jul 4, 2015

jvollme mentioned this issue Mar 12, 2018

Partial v Complete Genes for Metagenomic Analysis #283

Open
