Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multi-transcript cross compatibility #1

Open
AnotherSimon opened this issue May 30, 2018 · 4 comments
Open

multi-transcript cross compatibility #1

AnotherSimon opened this issue May 30, 2018 · 4 comments

Comments

@AnotherSimon
Copy link
Contributor

AnotherSimon commented May 30, 2018

Hi Jon,

After you referred me to this project from funannotate I'm getting some promising results but ran into a snag:

Traceback (most recent call last):
File "/data_n2/home/simon/software/phyloma/bin/busco4phylogeny.py", line 165, in
Proteins = SeqIO.index(os.path.join(tmpdir, file_list[i]), 'fasta')
File "/home/simon/.local/lib/python2.7/site-packages/Bio/SeqIO/init.py", line 885, in index
key_function, repr, "SeqRecord")
File "/home/simon/.local/lib/python2.7/site-packages/Bio/File.py", line 306, in init
raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'MyBug_000031_1'

This particular gene is from an aneuploid. This bug has mRNA-seq data so it's the only one that has multiple predicted transcripts per gene. The other fungi being compared, are purely in silico predictions so they "behave nicely".

This is the first gene from the aneuploid that has multiple transcripts from funannotate 1.3 so they're named with suffixes "-T1", "-T2" and "-T3". However in the working directory of phyloma the suffixes are lost and all genes with more than one transcript have an identical name like "MyBug_000031_1".

Is there a workaround for this? Could I manually edit the PhylomaWorkingDir/Mybug.prots.fa and restart the pipeline or will it not detect the previous progress? Should I keep only the longest isoform for each gene?

Update:
I worked around the multiple transcript issue by taking a copy of the prots.fa file and runnning sed -i -e '/_1$/,+1d'. Further had to make a small edit (see pull request) to copy fasta files to the temp dir.

@nextgenusfs
Copy link
Owner

Okay, I this was written before the multiple transcripts was supported in funannotate, so I think I just need to update some of the code to work with those multiple proteins. I think the selection is currently to take the "best-hit" from the BUSCO hits, so it should be behaving okay, I think we just need it to maintain the -TX at the end of the naming and then should be working.

@nextgenusfs
Copy link
Owner

When you got this original error, did you supply GBK files or protein fasta files? I think I have it fixed - although I'm not sure as I don't have a test set with multiple transcripts in front of me at the moment, but the parsing of the GBK files should now get the proteinID correctly (before it was defaulting to locusID which would mean the -T1, -T2, etc would be stripped). Let me know if latest commits fix the issue.

@AnotherSimon
Copy link
Contributor Author

Original error was with GBK files.

Are you still actively developing this tool? I went through the code and came to the conclusion that it is not quite what I'm looking for. We have some hybrids and we're trying to identify the closest parental strains. Ideally there exists a tool that would draw two trees for each gene, one with the candidate parentals and allele A and one with allele B. Then you could go back and summarize how often a certain parent or cluster of parents is closest in the tree. "Randomly" picking the best hit for the BUSCO group doesn't quite achieve this effect. (Although with enough genes you might end up with a similar result.)

In a nutshell, don't reopen development just on my account.

@nextgenusfs
Copy link
Owner

I do still use it for various projects, so gets updated when I need a slight modification here and there.

I would think it might make more sense to use DNA-based identity for identifying parental strains - maybe something like kmers/minhash could do this? It seems to me there would be more info using just the assembly as opposed to protein-coding regions -- but yes this tool is focused on protein sequences and likely not what you are looking for.

This might be something to look into -- on surface it deals with assembly, but the underlying method for what they call 'trio binning' could be helpful? (I've not read it).... https://www.biorxiv.org/content/biorxiv/early/2018/02/26/271486.full.pdf. But I think the idea is simply that if you had sequence for each of the parents you could chop up the data into unique kmers for each parent and then map those to your hybrid assembly - this should identify which regions are from which parent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants