TMTintegrator cannot find "custom contaminant" and stops analysis #22

Open
jjGG opened this issue Sep 23, 2022 · 26 comments

@jjGG

jjGG commented Sep 23, 2022

Dear Developers,

I would like to understand why TMTintegrator has issues with our list of contaminants that we usually concatenate to all our databases.
https://fgcz-proteomics.uzh.ch/fasta/fgcz_contaminants2022_20220511.fasta

If I use the "download fasta" option, everything works fine and I get the TMTintegrator reports.
If I use my own fasta file (a proper uniprot database concatenated with our fgcz_contaminants2022), FragPipe runs until the TMTintegrator step (v3.3.3) and then throws the error:

TMT-Integrator v3.3.3
UpdateColumns--- 0.14370 min.
Start to process GroupBy=0
Error: could not find protein ID zz|Y-FGCZCont00298| P41361 SWISS-PROT:P41361 (Bos taurus) Antithrombin-III precursor in database. Stopping analysis
Process 'TmtIntegrator' finished, exit code: 0

If I delete this particular protein from my fasta file, it throws the same error at the next identified FGCZ contaminant.

All other reports are present and look fine. I already tried several ways of "adapting" the header or the accession of our contaminants, but I still always get this error from TMTintegrator.

Can you see what TMTintegrator does not like about our fasta headers?

Best regards

jonas

@prvst

prvst commented Sep 23, 2022

Hi @jjGG. Several tools in our pipeline rely on pattern matching to locate things like the protein ID, the gene symbol, and the description. The main reason we ask people to use the NCBI or UniProt patterns is that they are known and well documented, so it's easy for us to know where the information is. A custom pattern can make things difficult for these tools, because we have no way to predict how you write that information, especially if you move certain elements around or reorganize them. I can't tell what the problem is just by looking at your header, but you'll likely have issues with other headers too, and even worse, you might be susceptible to silent errors if one of the tools mixes up two or more FASTA entries. So even though we can track down the problem and let you fix it, I strongly suggest that you adapt your FASTA file to a common format, like the one UniProt uses (https://www.uniprot.org/help/fasta-headers).
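
To illustrate the kind of pattern matching described above, here is a minimal Python sketch. It is not the code used by TMT-Integrator or any other FragPipe tool; the regular expression and the `parse_header` helper are assumptions made up for illustration. It shows how a parser that only accepts the `sp` or `tr` database tag would read a UniProt-style header but reject one starting with `zz|`.

```python
import re

# UniProt-style header: >db|ACCESSION|ENTRY_NAME description ...
# Assumption: this hypothetical parser accepts only "sp" and "tr" as the tag.
UNIPROT_HEADER = re.compile(r"^>?(sp|tr)\|([^|\s]+)\|(\S*)\s*(.*)$")


def parse_header(header):
    """Return (db, accession, entry_name, description), or None if no match."""
    match = UNIPROT_HEADER.match(header.strip())
    return match.groups() if match else None


uniprot = (">sp|P38507|SPA_STAAU Immunoglobulin G-binding protein A "
           "OS=Staphylococcus aureus OX=1280 GN=spa PE=1 SV=1")
custom = (">zz|Y-FGCZCont00298| P41361 SWISS-PROT:P41361 (Bos taurus) "
          "Antithrombin-III precursor")

print(parse_header(uniprot))  # fields are found
print(parse_header(custom))   # None: the "zz" tag is not recognised
```

With a parser like this, the protein ID of the custom entry is never extracted, so a later lookup by protein ID would fail.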

@jjGG
Author

jjGG commented Sep 23, 2022

Hello @prvst
Thanks for your swift reply. I partially agree that things are easier (especially from a software development point of view) when they are standardised or harmonised.

On the other hand - as you may know - there are so many different resources for AA-fastas that it is really difficult to get everyone on the same page. Also, uniprot is (depending on the organism) NOT always the best resource to start with (e.g. Arabidopsis and also other model organisms with resources like Flybase and Wormpeps). Researchers in these areas are usually much more familiar with their community resource.
Also, coming from a sequencing experiment you may generate AA-sequences and search these, and then you will usually NOT have uniprot headers. Or, in other cases, you want a "custom sequence" to be added to a uniprot fasta because you have a recombinant protein in your extracts.
In my case it is important that we can have our contaminant list in our local uniprot databases, and we even try to keep the "accession" relatively close to a uniprot accession (e.g. zz|Y-FGCZCont0001|Name).
We are aware, and I think this is a really critical thing, that sometimes you may even "lose" a protein, without any error message, if the fasta header is not formatted the right way. E.g. we had issues with previous versions: if the protein accession was alone on the header line (>MyProteinX\r), the protein would not show up in any result file! (It took us a while to figure this out.)

Meanwhile (after quite some testing!) I do have the TMTintegrator results again. My last change to the fasta-db was a replacement of zz|FGCZ... -> sp|FGCZ...

I assume that TMTintegrator wants it in this form! (Could someone confirm this, and consider whether this should be kept, since all tools before were "less sensitive"?)

Best regards
jonas
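
For readers who want to reproduce the zz| -> sp| replacement mentioned above, a minimal Python sketch could look like this. The `retag_headers` helper is hypothetical, and the sequences in the example are placeholders, not real protein sequences. Only header lines that start with the old tag are changed; sequence lines and other headers pass through untouched.

```python
def retag_headers(lines, old_tag="zz|", new_tag="sp|"):
    """Yield FASTA lines, replacing old_tag at the start of header lines only."""
    marker = ">" + old_tag
    for line in lines:
        if line.startswith(marker):
            yield ">" + new_tag + line[len(marker):]
        else:
            yield line


fasta = [
    ">zz|Y-FGCZCont00298| P41361 SWISS-PROT:P41361 (Bos taurus) Antithrombin-III precursor",
    "ACDEFGHIK",  # placeholder sequence, not the real one
    ">sp|P38507|SPA_STAAU Immunoglobulin G-binding protein A",
    "KLMNPQRST",  # placeholder sequence, not the real one
]

for line in retag_headers(fasta):
    print(line)
```

The same helper can apply a different prefix by changing `new_tag`.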

@huiyinc
Collaborator

huiyinc commented Sep 24, 2022 via email

@jjGG
Author

jjGG commented Sep 27, 2022

Hello Huiyin,

Thanks a lot for your email.
I confirm that this new TMTintegrator 4.0.2 is working with our custom contaminants headers and I get all the expected reports and outputs.
A quick check shows that I have almost 20% fewer phospho-peptides (I only checked the abundance_single-site_MD.tsv).
I will do another quick check to see whether I messed up the parameter settings, but I don't think so.
Is there anything different in the "new" TMTintegrator with respect to filtering or stringency?

Best regards
jonas

@jjGG
Author

jjGG commented Sep 27, 2022

Hello Huiyin,

I double checked - all fragger.params and TMT-integrator-conf.yml are identical (apart from the fasta).
Any idea where the discrepancy is coming from? 20% is quite a difference.

best regards
jonas

@huiyinc
Collaborator

huiyinc commented Sep 27, 2022 via email

@jjGG
Author

jjGG commented Sep 27, 2022

Hello Huiyin,

Please find attached a zip file with fragger.params, TMT-conf-yaml and two TMTintegrator outputs.
best regards & thanks for looking into this.
jonas
troubleshoot_forHuiyin.zip

@huiyinc
Collaborator

huiyinc commented Sep 28, 2022 via email

@jjGG
Author

jjGG commented Sep 28, 2022

Hei Huiyin,

Yes - correct. The "mod_fgcz..." is the one fasta-file where I tried to make some changes on our list of contaminant proteins sequences. In one attempt, there is one single "previously identified" protein deleted from this fasta. All the rest are changes on the Accession or Description lines.
All the human-uniprot entries are identical.

I assume this cannot make the up to 20% difference on "all levels".
Where shall I check the psm-tables?
Any other idea what I could test?

best regards
jonas

//////////////////
Hi Jonas,

According to your parameter files, two different fasta files were used
(fgcz_9606_reviewed_cnl_d_20220429.fasta and
mod_fgcz_9606_reviewed_cnl_d_20220429.fasta).
So, it is expected that the PSM tables and TMT-Integrator reports might be
different.
I think you might have to first check the PSM tables.
Can you please tell me what the differences between the two fasta files
are?
Thanks.

Huiyin

@anesvi

anesvi commented Sep 28, 2022 via email

@jjGG
Author

jjGG commented Sep 28, 2022

Hello FragPipe Team and Alexey,

I see there is already a difference in the PSM.tsv tables. So I assume that this is less a TMT-Integrator issue than a general "philosopher" issue?

I would expect/accept a "small" difference because in one fasta one of the identified contaminants is deleted (now w/ new TMT-I identified by 2 psms).

But I see quite a different number of lines in psm.tsv (FP18, TMT-I 3.3.3 = 53677 psms vs FP18, TMT-I 3.3.4 = 47050 psms)
-> this is no longer 20% but closer to 10%, which is still quite a difference, and it is unclear to me how this happens by only changing the fasta headers.
-> one explanation I can see is that contaminants labeled as sp|FGCZCont..| (as in the old TMT-I run) are taken into account for fdr-filtering and/or mass correction (as they may look like regular proteins from the organism under investigation), while contaminants labeled as zz|FGCZCont..| are left out of fdr-filtering and/or mass filtering, so the "thresholds" might change?

@anesvi: there is only 1 protein deleted in the database where all zz|FGCZCont are labeled as sp|FGCZCont. The one protein that is missing is only identified w/ 2 psms. Why should philosopher remove so many peptides?

best regards
jonas

@prvst

prvst commented Sep 28, 2022

When inspecting the two log files from your runs, I noticed that they have different numbers of identifications. This is from the files before filtering:

oldTMTintegrator_FP18

INFO[17:10:32] 1+ Charge profile                             decoy=0 target=0
INFO[17:10:32] 2+ Charge profile                             decoy=778 target=13116
INFO[17:10:32] 3+ Charge profile                             decoy=1863 target=29261
INFO[17:10:32] 4+ Charge profile                             decoy=1685 target=13825
INFO[17:10:32] 5+ Charge profile                             decoy=883 target=3477
INFO[17:10:32] 6+ Charge profile                             decoy=284 target=623
INFO[17:10:32] Database search results                       ions=31965 peptides=21594 psms=65795

TMTintegrator402_FP18

INFO[10:31:31] 1+ Charge profile                             decoy=0 target=0
INFO[10:31:31] 2+ Charge profile                             decoy=256 target=10605
INFO[10:31:31] 3+ Charge profile                             decoy=396 target=24275
INFO[10:31:31] 4+ Charge profile                             decoy=207 target=10802
INFO[10:31:31] 5+ Charge profile                             decoy=76 target=2280
INFO[10:31:31] 6+ Charge profile                             decoy=30 target=338
INFO[10:31:32] Database search results                       ions=21837 peptides=13391 psms=49265

Please correct me if I'm wrong, but I had the impression from your details above that the only difference was one protein that you removed. Could you confirm that you are using the same parameters or input files?
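
The two log excerpts above can be summed with a few lines of Python. This is only a sketch for checking the numbers quoted in this thread; the `summarize` helper and its regular expression are made up for illustration and are not part of Philosopher.

```python
import re

# Matches lines such as "2+ Charge profile   decoy=778 target=13116".
CHARGE_LINE = re.compile(r"Charge profile\s+decoy=(\d+)\s+target=(\d+)")


def summarize(log_text):
    """Return (decoys, targets, decoy_fraction) summed over all charge states."""
    decoys = targets = 0
    for decoy, target in CHARGE_LINE.findall(log_text):
        decoys += int(decoy)
        targets += int(target)
    return decoys, targets, decoys / (decoys + targets)


old_log = """
INFO[17:10:32] 1+ Charge profile                             decoy=0 target=0
INFO[17:10:32] 2+ Charge profile                             decoy=778 target=13116
INFO[17:10:32] 3+ Charge profile                             decoy=1863 target=29261
INFO[17:10:32] 4+ Charge profile                             decoy=1685 target=13825
INFO[17:10:32] 5+ Charge profile                             decoy=883 target=3477
INFO[17:10:32] 6+ Charge profile                             decoy=284 target=623
"""

new_log = """
INFO[10:31:31] 1+ Charge profile                             decoy=0 target=0
INFO[10:31:31] 2+ Charge profile                             decoy=256 target=10605
INFO[10:31:31] 3+ Charge profile                             decoy=396 target=24275
INFO[10:31:31] 4+ Charge profile                             decoy=207 target=10802
INFO[10:31:31] 5+ Charge profile                             decoy=76 target=2280
INFO[10:31:31] 6+ Charge profile                             decoy=30 target=338
"""

for name, log in [("oldTMTintegrator_FP18", old_log),
                  ("TMTintegrator402_FP18", new_log)]:
    decoys, targets, fraction = summarize(log)
    print(f"{name}: decoys={decoys} targets={targets} "
          f"total={decoys + targets} decoy_fraction={fraction:.3f}")
```

The per-charge counts add up to the reported psms totals (65795 and 49265). The decoy share is about 8.3% in the first run and about 2.0% in the second, so the two runs already differ before any filtering.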

@anesvi

anesvi commented Sep 28, 2022 via email

@prvst

prvst commented Sep 28, 2022

@anesvi see here #22 (comment)

@jjGG
Author

jjGG commented Sep 28, 2022

Hello Felipe,

Yes - I confirm that I used (or at least intended to use) identical parameters and the same input files.

I wanted to load the TMT-16-phospho workflow, adjusted the fasta, changed in "QuantIsobaric" to TMT-18 method and loaded my annotation file and redirected the output to a new folder!

Meanwhile I did another test with the downloaded uniprot again (w/ the new TMT-I) and I do get another different number of PSMs. I am even more confused now.
My next test is a rerun with the "old" modified fasta-file to see if I get the "high numbers" back!

My only explanation would be that when I label my contaminants as sp|, they are maybe suddenly included in the mass calibration and decoy filtering step, thereby lowering the acceptance thresholds?

best regards - jonas

@anesvi

anesvi commented Sep 30, 2022 via email

@anesvi

anesvi commented Oct 11, 2022 via email

@anesvi

anesvi commented Oct 11, 2022 via email

@huiyinc
Collaborator

huiyinc commented Oct 11, 2022 via email

@jjGG
Author

jjGG commented Oct 13, 2022

Hello everyone,

Thanks a lot for coming back to me on this. I am still pretty puzzled.
Find attached a zip with the 2 psm-tables and the fragger.params files as well as the 2 fasta files that I used.

So here are again the differences between the 2 searches:

fgcz9606_newTMTi:

  • here I used "our" standard uniprot databases that we usually "patch" with 504 contaminants that are labelled e.g:
  • zz|Y-FGCZCont00500| sp_P38507_SPA_STAAU Immunoglobulin G-binding protein A OS=Staphylococcus aureus OX=1280 GN=spa PE=1 SV=1
    LERRRGALLAAGLALSL...

-> this means we mark our contaminants with zz|Y-FGCZCont_anyNumber| bla bla text
-> also, we usually use REV_ for decoy proteins!

modifiedDB (or ModifiedDB):

  • this is the database I started to modify in order to get it running with the "old TMTintegrator", which was in the end successful once I labelled my FGCZ contaminants not as zz|Y-FGCZCont_anyNumber but as sp|Y-FGCZCont_anyNumber
  • until I figured this out, I was deleting the one protein that gave the error in the log (only to get the error at the next one!)

My only explanation would be:
somehow my zz-proteins are treated differently and probably not used for decoy filtering!
--> whereas if I have the sp-label in front, my contaminants (some of which are of course identified) change the fdr filters in such a way that suddenly we have many more accepted psms.

do you have another idea?

best regards
jonas

@huiyinc
Collaborator

huiyinc commented Jan 12, 2023

Hi Jonas,

Is the issue solved?
Thanks.

Huiyin

@huiyinc huiyinc closed this as completed Jan 12, 2023
@huiyinc huiyinc reopened this Jan 12, 2023
@jjGG
Author

jjGG commented Jan 12, 2023

Hello Huiyin,

Thanks again for coming back to me and having a look at it.
While TMT-integrator is now running successfully and also quantifying my reporter channels, there is still an issue when it comes to the number of IDs (see above).
The issue is that the number of identified proteins (and of course psms) is still quite different depending on the "accessions" of my contaminant proteins.

I am not sure how this is possible.
My only explanation would be:
somehow my zz-contaminant-proteins are treated differently and probably not used for decoy filtering!
--> whereas if I have the sp-label in front, my contaminants (some of which are of course identified) change the fdr filters in such a way that suddenly we have many more accepted psms.

do you have another idea?

@anesvi

anesvi commented Jan 12, 2023 via email

@prvst

prvst commented Jan 12, 2023

Changing sp| to zz| changes the numbers because, when a PSM maps to a decoy protein but has an alternative protein classified as reviewed, the program swaps their positions. When you change sp to zz, you prevent that swap from happening. I also suggest tagging your contaminants with contam_, because that matches what we do here.
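
A toy Python sketch of the swap described above is given here. It is not Philosopher's actual implementation: the `resolve_protein` helper, the `rev_` decoy prefix, the use of `sp|` as the marker for "reviewed", and the decoy identifier in the example are all assumptions chosen for illustration.

```python
def resolve_protein(primary, alternatives, decoy_prefix="rev_", reviewed_tag="sp|"):
    """Return (primary, alternatives) after the decoy/reviewed swap.

    If the primary protein is a decoy and one of the alternatives is a
    reviewed target, that alternative becomes the primary protein and the
    decoy takes its place in the list of alternatives.
    """
    if not primary.startswith(decoy_prefix):
        return primary, alternatives
    for i, alt in enumerate(alternatives):
        if alt.startswith(reviewed_tag):
            rest = alternatives[:i] + [primary] + alternatives[i + 1:]
            return alt, rest
    return primary, alternatives


decoy = "rev_sp|EXAMPLE|"  # hypothetical decoy entry

# Contaminant tagged sp|: treated as reviewed, so the PSM becomes a target hit.
print(resolve_protein(decoy, ["sp|Y-FGCZCont00298|"]))

# Same contaminant tagged zz|: no swap, so the PSM stays assigned to the decoy.
print(resolve_protein(decoy, ["zz|Y-FGCZCont00298|"]))
```

Under these assumptions, every PSM that is rescued this way counts as a target instead of a decoy, which would shift the numbers in the direction reported in this thread.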

@anesvi

anesvi commented Jan 12, 2023 via email

@prvst

prvst commented Jan 12, 2023

Since the program checks every PSM, it really depends on how many targets you have as an alternative to decoys. It will also reflect this in the number of PSMs because the program will only keep PSMs with identified proteins, which can range from anything between 1 PSM to potentially thousands.
