running for a long time; output stopped in the middle of a warning #12

jfass · 2020-03-27T22:43:34Z

I've run all the preliminary steps without problem (HapCut2, prep) but the final 'call' step has been running for over a week now, using 1 cpu, and stable at a couple GB of RAM. It also output (not sure if stdout or err) a number of stop codon warnings, then it looks to have paused in the middle of outputting a final warning:

[...]
  Warning,
/home/ubuntu/neoepiscope/neoepiscope-venv/lib/python3.5/site-packages/neoepiscope/transcript.py:2408: Warning: Stop codon not detected prior to end of transcript ENST00000339679.7; this transcript may undergo degradation
  Warning,
/home/ubuntu/neoepiscope/neoepiscope-venv/lib/python3.5/site-packages/neoepiscope/transcript.py:2408: Warning: Stop codon not detected prior to end of transcript ENST00000421682.1; this transcript may undergo degradation
  Warning,
/home/ubuntu/neoepiscope/neoepiscope-venv/lib/python3.5/site-packages/neoepiscope/transcript.py:2408: Warning: Stop codon not detected prior to end of transcript ENST00000597972.1; this transcript may undergo degradation
  Warning,
/home/ubuntu/neoepiscope/neoepiscope-venv/lib/python3.5/site-packages/neoepiscope/transcript.py:2408: Warning: Stop codon not detected prior to end of transcript ENST00000401510.1; this transcript may undergo degradation
  Warning,

... and that's it. There aren't any other processes running, and it is running, but it seems like something may be off. Any ideas how to troubleshoot this? I don't think the stop codon issues are the issue since ... I've seen discussion of that warning somewhere - maybe another issue? not sure ... but I could remove offending transcripts or cut down to a single chromosome ... just wondering if it looks like something you've seen before.

The calls:

./HapCUT2-1.2/build/extractHAIRS --indels 1 --bam tumor.bam --VCF tumor.vcf --out fragments_file
./HapCUT2-1.2/build/HAPCUT2 --fragments fragments_file --VCF tumor.vcf --output haplotype_output_file
neoepiscope prep -v tumor.vcf -c haplotype_output_file -o adjusted_haplotype_output
neoepiscope call -b /home/ubuntu/neoepiscope.data/hg19 -c adjusted_haplotype_output --no-affinity

The text was updated successfully, but these errors were encountered:

maryawood · 2020-03-27T22:48:35Z

Sorry you're running into this issue! The only time I've seen something like that before was when there were very, very long indel variants being processed - did you happen to use a tool like Pindel to create your VCF file? Either way, if it's possible for you to share your data with me, I'd be happy to try it myself and see if I can recreate the issue/figure out what's causing it!

jfass · 2020-03-30T16:32:44Z

Thanks @maryawood! There are some complex, long variants, up to ~40-60bp long. This particular set is from VarScan, I believe, but I also may want to test mutect, strelka, somaticsniper ...

What are some guidelines on lengths / types of variants that neoepiscope may have trouble handling?

maryawood · 2020-03-30T16:40:11Z

Hmm, I don't think variants of that size should be an issue - when I've had trouble in the past, it's been with very long indels of greater than 1000 bp in length. I'm not sure what your restrictions are on data sharing, but would it be possible for you to share your adjusted_haplotype_output file with me so I can try to recreate the problem? You're welcome to send it to the [email protected] account if you don't want to post it here. (I see now that you tried to email about this problem there - I'm sorry that we didn't see it more quickly!)

jfass · 2020-03-30T17:20:56Z

Thanks Mary ... looking into data-sharing restrictions right now ... I've got another question - maybe a dumb question. We run analysis on normal (blood) and tumor samples, separately, and call variants separately. Then we also use somatic callers on the combined data. I've run the separate VCFs through HAPCUT and the neoepiscope 'prep' step ... but is neoepiscope doing its own somatic 'status' calls by comparing the two? In other words: ./HapCUT2-1.2/build/extractHAIRS --indels 1 --bam tumor.bam --VCF tumor.vcf --out fragments_file ./HapCUT2-1.2/build/HAPCUT2 --fragments fragments_file --VCF tumor.vcf --output haplotype_output_file neoepiscope prep -v tumor.vcf -c haplotype_output_file -o adjusted_haplotype_output neoepiscope call -b /home/ubuntu/neoepiscope.data/hg19 -c adjusted_haplotype_output --no-affinity How would I go about incorporating germline variants in the neighborhood of tumor neo-epitopes? Where does the normal/germline data enter in? I've got a VarScan vcf file, but it has a "tumor" and a "normal" column / calls ... with the "somatic status" annotated in the INFO field ... but is that something that neoepiscope will pick up? I'm guessing not ... and I'm not sure what the VCF looks like in the cases of the other somatic callers: mutect, strelka ... ~Joe

…

On Mon, Mar 30, 2020 at 9:40 AM Mary Wood ***@***.***> wrote: Hmm, I don't think variants of that size should be an issue - when I've had trouble in the past, it's been with very long indels of greater than 1000 bp in length. I'm not sure what your restrictions are on data sharing, but would it be possible for you to share your adjusted_haplotype_output file with me so I can try to recreate the problem? You're welcome to send it to the ***@***.*** account if you don't want to post it here. (I see now that you tried to email about this problem there - I'm sorry that we didn't see it more quickly!) — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#12 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAEILKDS4Q4W5FWZ74OIYSTRKDDPVANCNFSM4LVKU4VA> .

-- Joseph Fass Bioinformatics Scientist Providence Health & Services | Robert W. Franz Cancer Center | Earl A. Chiles Research Institute <[email protected]> | <[email protected]>

maryawood · 2020-03-30T17:27:49Z

Okay, keep me posted on data sharing!

Re: incorporating germline variants, you can use neoepiscope merge before running the HapCUT2 steps to generate a merged VCF file that has somatic and germline variants together. That way, the haplotypes can be assigned incorporating both variant types simultaneously, and you can call neoepitopes that have the appropriate germline context included! One important thing to note is that if your somatic VCF lists the "normal" column before the "tumor" column, you'll want to run neoepiscope swap first to make sure you're getting tumor sample information for the variants, not normal sample information.

The README has information about running the swap and merge modes, but let me know if you have any questions after looking through that!

jfass · 2020-03-30T17:54:04Z

Sounds like I can't share VCFs, since they do contain potentially-identifiable germline variants. On the germline-context issue, I'm just concerned how neoepiscope is interpreting the somatic evidence in the VCFs. For example, Strelka has a QSS_NT ("Quality score reflecting the joint probability of a somatic variant and NT") among others, in the INFO field; VarScan has SS and SSC ("Somatic status of variant (0=Reference,1=Germline,2=Somatic,3=LOH, or 5=Unknown)" and "Somatic score in Phred scale (0-255) derived from somatic p-value") ... and I can't imagine neoepiscope is actually parsing these from different software. So do I just need to filter variants and only include only the somatic variants I'm confident of? Are there mandatory fields in the INFO field or call fields? Would you have an example of tumor and normal VCFs that definitely work with neoepiscope? Thanks, ~Joe

…

On Mon, Mar 30, 2020 at 10:28 AM Mary Wood ***@***.***> wrote: Okay, keep me posted on data sharing! Re: incorporating germline variants, you can use neoepiscope merge before running the HapCUT2 steps to generate a merged VCF file that has somatic and germline variants together. That way, the haplotypes can be assigned incorporating both variant types simultaneously, and you can call neoepitopes that have the appropriate germline context included! One important thing to note is that if your somatic VCF lists the "normal" column before the "tumor" column, you'll want to run neoepiscope swap first to make sure you're getting tumor sample information for the variants, not normal sample information. The README has information about running the swap and merge modes, but let me know if you have any questions after looking through that! — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#12 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAEILKB24KQIEIOP3SKHMMLRKDJC5ANCNFSM4LVKU4VA> .

-- Joseph Fass Bioinformatics Scientist Providence Health & Services | Robert W. Franz Cancer Center | Earl A. Chiles Research Institute <[email protected]> | <[email protected]>

maryawood · 2020-03-30T18:35:35Z

Ahh, yes, that's a good point! neoepiscope doesn't handle that at present, so filtering variants first to separate somatic variants would be necessary! There aren't any required INFO fields, but if you want to report VAF with your neoepitopes, your VCF will need to have a FREQ, AF, or FA field - definitely not mandatory though, and if those are missing, neoepiscope will just report an NA for VAF. I think most VCFs should work for neoepiscope, as long as they have the standard columns, and somatic and germline variants are kept separate at the beginning (and then merged through neoepiscope merge if desired).

As far as the initial problem, we might have to get a little creative if you can't share the data - would you feel comfortable editing the code of your neoepiscope install a bit to add some print statements? If so, I can give you some spots to start adding some so we can try to track down exactly where the problem is.

jfass · 2020-03-30T19:25:03Z

Thanks Mary, Yah, feel free to send me some edits, diffs, whatever..

…

On Mon, Mar 30, 2020 at 11:35 AM Mary Wood ***@***.***> wrote: Ahh, yes, that's a good point! neoepiscope doesn't handle that at present, so filtering variants first to separate somatic variants would be necessary! There aren't any required INFO fields, but if you want to report VAF with your neoepitopes, your VCF will need to have a FREQ, AF, or FA field - definitely not mandatory though, and if those are missing, neoepiscope will just report an NA for VAF. I think most VCFs should work for neoepiscope, as long as they have the standard columns, and somatic and germline variants are kept separate at the beginning (and then merged through neoepiscope merge if desired). As far as the initial problem, we might have to get a little creative if you can't share the data - would you feel comfortable editing the code of your neoepiscope install a bit to add some print statements? If so, I can give you some spots to start adding some so we can try to track down exactly where the problem is. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#12 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAEILKC3JDUG6QS5C5FM7KTRKDRARANCNFSM4LVKU4VA> .

-- Joseph Fass Bioinformatics Scientist Providence Health & Services | Robert W. Franz Cancer Center | Earl A. Chiles Research Institute <[email protected]> | <[email protected]>

maryawood · 2020-03-30T19:56:10Z

Okay, let's just start with a few in the file transcript.py and see what happens...

First can you add this import statement after the other import statements?

from datetime import datetime

After the docstring in the get_peptides_from_transcripts function (so after line 3108 in the latest commit, although depending on your version could be a bit different), could you add the following statements:

print('Affected transcripts:', len(relevant_transcripts))

print('Homozygous variants:', len(homozygous_variants))

I just want to get an idea of how many variants/transcripts we're working with.

Next, after the line for affected_transcript in relevant_transcripts: (line 3112 in the latest commit), could you add these lines:

print(datetime.now(), affected_transcript, len(relevant_transcripts[affected_transcript]), 'haplotypes')

i = 0

Similarly, after the line for ht in haplotypes: (line 3146 in the latest commit), could you add these lines:

i += 1

print(datetime.now(), 'haplotype', i)

Also, after the line for transcript in homozygous_variants: (line 3208 in the latest commit), could you add this line:

print(datetime.now(), transcript)

I'd like to get an idea of about how long it's taking to process the relevant variants for each transcript/different haplotypes that are being applied to those transcripts.

Finally, just above the return statement for the get_peptides_from_transcripts function, could you add this line:

print(datetime.now(), 'Finished getting peptides')

I'm hoping we can get an idea of whether things are just taking a long time to process, or if the program is getting caught on a specific variant. If we get to the last print statement, then we can look elsewhere for the problem. I think these print statements should keep things relatively private (i.e. no variants should be printed), so if you could send me the output that would be great!

jfass · 2020-03-30T22:06:59Z

Cool ... listing ENST ids, with single or a handful of haplotypes ... I'll keep an eye on it for a while and update you later this evening; feel free to ignore my emails when you need to! Thanks for your help! ~Joe

…

On Mon, Mar 30, 2020 at 12:56 PM Mary Wood ***@***.***> wrote: Okay, let's just start with a few in the file transcript.py and see what happens... First can you add this import statement after the other import statements? from datetime import datetime After the docstring in the get_peptides_from_transcripts function (so after line 3108 in the latest commit, although depending on your version could be a bit different), could you add the following statements: print('Affected transcripts:', len(relevant_transcripts)) print('Homozygous variants:', len(homozygous_variants)) I just want to get an idea of how many variants/transcripts we're working with. Next, after the line for affected_transcript in relevant_transcripts: (line 3112 in the latest commit), could you add these lines: print(datetime.now(), affected_transcript, len(relevant_transcripts[affected_transcript]), 'haplotypes') i = 0 Similarly, after the line for ht in haplotypes: (line 3146 in the latest commit), could you add these lines: i += 1 print(datetime.now(), 'haplotype', i) Also, after the line for transcript in homozygous_variants: (line 3208 in the latest commit), could you add this line: print(datetime.now(), transcript) I'd like to get an idea of about how long it's taking to process the relevant variants for each transcript/different haplotypes that are being applied to those transcripts. Finally, just above the return statement for the get_peptides_from_transcripts function, could you add this line: print(datetime.now(), 'Finished getting peptides') I'm hoping we can get an idea of whether things are just taking a long time to process, or if the program is getting caught on a specific variant. If we get to the last print statement, then we can look elsewhere for the problem. I think these print statements should keep things relatively private (i.e. no variants should be printed), so if you could send me the output that would be great! — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#12 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAEILKD5QCPPYXUKN5SOFOTRKD2ORANCNFSM4LVKU4VA> .

-- Joseph Fass Bioinformatics Scientist Providence Health & Services | Robert W. Franz Cancer Center | Earl A. Chiles Research Institute <[email protected]> | <[email protected]>

jfass · 2020-03-31T17:52:56Z

Hey Mary! Seems stalled again - here are the stdout and err's... ~Joe

…

On Mon, Mar 30, 2020 at 3:06 PM Joseph Fass ***@***.***> wrote: Cool ... listing ENST ids, with single or a handful of haplotypes ... I'll keep an eye on it for a while and update you later this evening; feel free to ignore my emails when you need to! Thanks for your help! ~Joe On Mon, Mar 30, 2020 at 12:56 PM Mary Wood ***@***.***> wrote: > Okay, let's just start with a few in the file transcript.py and see what > happens... > > First can you add this import statement after the other import statements? > > from datetime import datetime > > After the docstring in the get_peptides_from_transcripts function (so > after line 3108 in the latest commit, although depending on your version > could be a bit different), could you add the following statements: > > print('Affected transcripts:', len(relevant_transcripts)) > > print('Homozygous variants:', len(homozygous_variants)) > > I just want to get an idea of how many variants/transcripts we're working > with. > > Next, after the line for affected_transcript in relevant_transcripts: > (line 3112 in the latest commit), could you add these lines: > > print(datetime.now(), affected_transcript, > len(relevant_transcripts[affected_transcript]), 'haplotypes') > > i = 0 > > Similarly, after the line for ht in haplotypes: (line 3146 in the latest > commit), could you add these lines: > > i += 1 > > print(datetime.now(), 'haplotype', i) > > Also, after the line for transcript in homozygous_variants: (line 3208 > in the latest commit), could you add this line: > > print(datetime.now(), transcript) > > I'd like to get an idea of about how long it's taking to process the > relevant variants for each transcript/different haplotypes that are being > applied to those transcripts. > > Finally, just above the return statement for the > get_peptides_from_transcripts function, could you add this line: > > print(datetime.now(), 'Finished getting peptides') > > I'm hoping we can get an idea of whether things are just taking a long > time to process, or if the program is getting caught on a specific variant. > If we get to the last print statement, then we can look elsewhere for the > problem. I think these print statements should keep things relatively > private (i.e. no variants should be printed), so if you could send me the > output that would be great! > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#12 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AAEILKD5QCPPYXUKN5SOFOTRKD2ORANCNFSM4LVKU4VA> > . > -- Joseph Fass Bioinformatics Scientist Providence Health & Services | Robert W. Franz Cancer Center | Earl A. Chiles Research Institute ***@***.***> | ***@***.***>

-- Joseph Fass Bioinformatics Scientist Providence Health & Services | Robert W. Franz Cancer Center | Earl A. Chiles Research Institute <[email protected]> | <[email protected]>

maryawood · 2020-03-31T18:06:06Z

Hi Joe! I'm not seeing the stdout/stderr here - would you want to try emailing them to the [email protected] address? GitHub may not like the size or something...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

running for a long time; output stopped in the middle of a warning #12

running for a long time; output stopped in the middle of a warning #12

jfass commented Mar 27, 2020 •

edited

Loading

maryawood commented Mar 27, 2020

jfass commented Mar 30, 2020

maryawood commented Mar 30, 2020

jfass commented Mar 30, 2020 via email

maryawood commented Mar 30, 2020

jfass commented Mar 30, 2020 via email

maryawood commented Mar 30, 2020

jfass commented Mar 30, 2020 via email

maryawood commented Mar 30, 2020

jfass commented Mar 30, 2020 via email

jfass commented Mar 31, 2020 via email

maryawood commented Mar 31, 2020

running for a long time; output stopped in the middle of a warning #12

running for a long time; output stopped in the middle of a warning #12

Comments

jfass commented Mar 27, 2020 • edited Loading

maryawood commented Mar 27, 2020

jfass commented Mar 30, 2020

maryawood commented Mar 30, 2020

jfass commented Mar 30, 2020 via email

maryawood commented Mar 30, 2020

jfass commented Mar 30, 2020 via email

maryawood commented Mar 30, 2020

jfass commented Mar 30, 2020 via email

maryawood commented Mar 30, 2020

jfass commented Mar 30, 2020 via email

jfass commented Mar 31, 2020 via email

maryawood commented Mar 31, 2020

jfass commented Mar 27, 2020 •

edited

Loading