multi-transcript cross compatibility #1

AnotherSimon · 2018-05-30T04:24:02Z

Hi Jon,

After you referred me to this project from funannotate I'm getting some promising results but ran into a snag:

Traceback (most recent call last):
File "/data_n2/home/simon/software/phyloma/bin/busco4phylogeny.py", line 165, in
Proteins = SeqIO.index(os.path.join(tmpdir, file_list[i]), 'fasta')
File "/home/simon/.local/lib/python2.7/site-packages/Bio/SeqIO/init.py", line 885, in index
key_function, repr, "SeqRecord")
File "/home/simon/.local/lib/python2.7/site-packages/Bio/File.py", line 306, in init
raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'MyBug_000031_1'

This particular gene is from an aneuploid. This bug has mRNA-seq data so it's the only one that has multiple predicted transcripts per gene. The other fungi being compared, are purely in silico predictions so they "behave nicely".

This is the first gene from the aneuploid that has multiple transcripts from funannotate 1.3 so they're named with suffixes "-T1", "-T2" and "-T3". However in the working directory of phyloma the suffixes are lost and all genes with more than one transcript have an identical name like "MyBug_000031_1".

Is there a workaround for this? Could I manually edit the PhylomaWorkingDir/Mybug.prots.fa and restart the pipeline or will it not detect the previous progress? Should I keep only the longest isoform for each gene?

Update:
I worked around the multiple transcript issue by taking a copy of the prots.fa file and runnning sed -i -e '/_1$/,+1d'. Further had to make a small edit (see pull request) to copy fasta files to the temp dir.

The text was updated successfully, but these errors were encountered:

nextgenusfs · 2018-05-30T13:19:24Z

Okay, I this was written before the multiple transcripts was supported in funannotate, so I think I just need to update some of the code to work with those multiple proteins. I think the selection is currently to take the "best-hit" from the BUSCO hits, so it should be behaving okay, I think we just need it to maintain the -TX at the end of the naming and then should be working.

nextgenusfs · 2018-05-30T13:48:55Z

When you got this original error, did you supply GBK files or protein fasta files? I think I have it fixed - although I'm not sure as I don't have a test set with multiple transcripts in front of me at the moment, but the parsing of the GBK files should now get the proteinID correctly (before it was defaulting to locusID which would mean the -T1, -T2, etc would be stripped). Let me know if latest commits fix the issue.

AnotherSimon · 2018-05-31T03:34:19Z

Original error was with GBK files.

Are you still actively developing this tool? I went through the code and came to the conclusion that it is not quite what I'm looking for. We have some hybrids and we're trying to identify the closest parental strains. Ideally there exists a tool that would draw two trees for each gene, one with the candidate parentals and allele A and one with allele B. Then you could go back and summarize how often a certain parent or cluster of parents is closest in the tree. "Randomly" picking the best hit for the BUSCO group doesn't quite achieve this effect. (Although with enough genes you might end up with a similar result.)

In a nutshell, don't reopen development just on my account.

nextgenusfs · 2018-05-31T13:21:30Z

I do still use it for various projects, so gets updated when I need a slight modification here and there.

I would think it might make more sense to use DNA-based identity for identifying parental strains - maybe something like kmers/minhash could do this? It seems to me there would be more info using just the assembly as opposed to protein-coding regions -- but yes this tool is focused on protein sequences and likely not what you are looking for.

This might be something to look into -- on surface it deals with assembly, but the underlying method for what they call 'trio binning' could be helpful? (I've not read it).... https://www.biorxiv.org/content/biorxiv/early/2018/02/26/271486.full.pdf. But I think the idea is simply that if you had sequence for each of the parents you could chop up the data into unique kmers for each parent and then map those to your hybrid assembly - this should identify which regions are from which parent.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

multi-transcript cross compatibility #1

multi-transcript cross compatibility #1

AnotherSimon commented May 30, 2018 •

edited

Loading

nextgenusfs commented May 30, 2018

nextgenusfs commented May 30, 2018

AnotherSimon commented May 31, 2018

nextgenusfs commented May 31, 2018

multi-transcript cross compatibility #1

multi-transcript cross compatibility #1

Comments

AnotherSimon commented May 30, 2018 • edited Loading

nextgenusfs commented May 30, 2018

nextgenusfs commented May 30, 2018

AnotherSimon commented May 31, 2018

nextgenusfs commented May 31, 2018

AnotherSimon commented May 30, 2018 •

edited

Loading