Adding support for partial genes #88

Open
aleimba opened this issue Mar 17, 2015 · 6 comments

@aleimba

aleimba commented Mar 17, 2015

I would like to bring up the issue again to include partial genes. See the pull request (#37) from @lguy where he implemented it.
I think it would be good for draft genomes and even more for highly fractionated metagenomes.

Also, as @sjackman observed, the modes of Prodigal are now called differently (#16). It might be useful to change 'single' to 'normal' and 'meta' to 'anon' in line 664 to not confuse users who look up the Prodigal docs.

EDIT: Just realized the newest Prodigal version still states 'single' and 'meta' in its command-line help text ... The Wiki however has only the new terms (https://github.com/hyattpd/prodigal/wiki/cheat-sheet and https://github.com/hyattpd/prodigal/wiki/Gene-Prediction-Modes)

@aleimba
Author

aleimba commented Mar 17, 2015

@hyattpd just answered regarding the mode name changes: they'll be implemented from Prodigal v3.0.0 onward, and the Wiki already has the new names in preparation for v3.x (hyattpd/Prodigal#11). My bad.

@hyattpd

hyattpd commented Mar 17, 2015

I haven't really followed this discussion, but I would not recommend the -c option for anything except finished chromosomes. With prokaryotic genomes being 85% coding, the likelihood of a partial gene running off either edge is extremely high (it is 85% likely that the edge bases fall inside a gene, and somewhat less likely that you have at least 60bp of coding sequence). You're going to miss more than half the genes in some data sets (those with only small contigs) using -c.

Just as an example. I have a data set that has E. coli randomly sampled in thousands of 1200bp contigs, and the coordinates of the Genbank-annotated genes in those contigs.

  • 36296 Genbank genes contained in 19331 fragments
  • 13345 genes run off the left edge
  • 13233 genes run off the right edge
  • 2475 genes run off both edges

With the -c option on, you'd miss at least half those genes (the ones missing stop codons). The ones missing start codons would be truncated and you'd be reporting less of the protein than you actually could be.
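The counts above can be turned into a rough fraction of genes that touch a contig edge. The post does not say whether the 2475 both-edge genes are already included in the left-edge and right-edge counts, so this minimal Python sketch computes both readings. The function name and the `both_counted_separately` flag are made up for illustration.

```python
def partial_gene_count(left, right, both, both_counted_separately):
    """Number of genes touching at least one contig edge."""
    if both_counted_separately:
        # Assumption A: the three categories are disjoint, so add them up.
        return left + right + both
    # Assumption B: 'both' genes appear in the left AND the right count,
    # so subtract them once to avoid double counting.
    return left + right - both


# Counts quoted in the comment above.
TOTAL, LEFT, RIGHT, BOTH = 36296, 13345, 13233, 2475

for separate in (True, False):
    n = partial_gene_count(LEFT, RIGHT, BOTH, separate)
    print(f"both counted separately={separate}: {n} partial genes "
          f"({n / TOTAL:.1%} of {TOTAL})")
```

Under either reading, well over half of the annotated genes in this data set touch at least one edge of their 1200bp fragment.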

@aleimba
Author

aleimba commented Mar 17, 2015

Thanks for your insight, @hyattpd. Although I think it's very unlikely that you'll get a bacterial draft assembly with a maximum contig length of 1200bp, you'll definitely miss out on genes in draft genomes and especially metagenomes. Of course, as you said, it depends on the number of small contigs.

@hyattpd

hyattpd commented Mar 17, 2015

Even in long contigs, there will be a partial gene at each edge ~75% of the time, so you wind up missing ~1.5 genes per contig. I guess it's a question of how much one cares about partial proteins. I think it is better just to always call Prodigal without -c, and have an option to Prokka to not report partial genes below some length (rather than passing this option on to Prodigal).
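The ~1.5 figure is just the two contig edges multiplied by the ~75% chance that each edge falls inside a gene. A small Python sketch of that arithmetic follows; the function names are hypothetical, and treating every edge-spanning gene as lost under -c is a simplifying assumption.

```python
def expected_edge_genes(p_edge_in_gene, edges_per_contig=2):
    """Expected number of partial genes per contig.

    By linearity of expectation this is the per-edge probability times
    the number of edges; the edges need not be independent.
    """
    return p_edge_in_gene * edges_per_contig


def expected_missed_genes(n_contigs, p_edge_in_gene=0.75):
    """Rough estimate of genes affected across an assembly when -c is used.

    Assumes (for illustration only) that every edge-spanning gene is lost.
    """
    return n_contigs * expected_edge_genes(p_edge_in_gene)


print(expected_edge_genes(0.75))   # per contig
print(expected_missed_genes(100))  # for a 100-contig draft assembly
```

For a draft genome of 100 contigs this rough estimate gives about 150 affected genes.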

@tseemann
Owner

@hyattpd I had considered the -c option originally when implementing Prokka and agreed with @aleimba but I am having second thoughts now.

In general, genome assemblies break at repeats that are longer than the read length or span. In bacteria this is nearly always duplicate / paralogous genes, such as rRNA islands and insertion sequences. The break often occurs in the intergenic region, and the repeated gene gets its own contig.
An N50 of 1200bp is rare in modern bacterial genomics, and such an assembly would simply be discarded.

But I think the idea of post-filtering is a good one. I will think more about it.
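One way the post-filtering idea could look is sketched below in Python. It is not Prokka's implementation (Prokka is written in Perl), and the function names and the `min_partial_len` threshold are hypothetical. It assumes Prodigal is run without -c and writes GFF (`-f gff`), where each feature carries a two-character `partial=` attribute in which a `1` marks a gene running off that edge.

```python
def keep_feature(gff_line, min_partial_len=300):
    """Return True if a Prodigal GFF line should be reported.

    Complete genes are always kept; partial genes are kept only if they
    span at least min_partial_len bases (hypothetical threshold).
    """
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return True  # header/comment or malformed line: pass through
    start, end = int(cols[3]), int(cols[4])
    # Parse the 'key=value;key=value' attribute column into a dict.
    attrs = dict(
        field.split("=", 1) for field in cols[8].split(";") if "=" in field
    )
    is_partial = "1" in attrs.get("partial", "00")
    length = end - start + 1  # GFF coordinates are 1-based, inclusive
    return (not is_partial) or length >= min_partial_len


def filter_gff(lines, min_partial_len=300):
    """Yield only the lines that pass keep_feature."""
    for line in lines:
        if keep_feature(line, min_partial_len):
            yield line
```

Filtering after prediction keeps the decision in one place: Prodigal still sees the contig edges and can extend genes to them, and the threshold can be changed without re-running gene prediction.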

@aleimba
Author

aleimba commented Mar 18, 2015

I agree with @tseemann on gaps in repeats. It has also been my experience (through cumbersome manual gap finishing) that many short-read assemblers don't cope very well at contig ends, producing overlaps with other contigs, incorrectly resolved repeats, etc.

Apart from that, including partial genes in Prokka would be a big plus, and thanks to @tseemann for taking it up! As you mentioned, a test suite would be sweet for future code integration, but sadly I have no experience with that. I'm guessing it's quite some overhead.
