Missing analyses from an assembly ids #359

Open
zdk123 opened this issue May 2, 2024 · 3 comments

Comments

@zdk123

zdk123 commented May 2, 2024

Hi MGnify team.

Apologies if this is not the correct place to report this issue. I am interested in building a protein -> contig -> assembly map and figured I could rely on the mgy_assemblies.tsv file on the FTP server, which contains the protein -> ERZ relationships, and then use the API to pull the contigs from the most recently available analysis (e.g. https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ509256/analyses seems sufficient).

However, I noticed that a substantial number of the ERZ ids do not have any analysis data. For example, https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ1744782/analyses.
By my count, 5840 of the 33345 ids are missing from the API.
Is there another way to get a complete protein -> contig map?

thanks!
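
For reference, the workflow described above could be sketched roughly as follows. This is a minimal sketch, not MGnify tooling: it assumes each line of mgy_assemblies.tsv holds a protein accession, then whitespace, then a semicolon-separated list of ERZ accessions, and it only builds the analyses URLs following the pattern of the links in this post. The function names are hypothetical.

```python
# Base URL copied from the example links in this thread.
API_ROOT = "https://www.ebi.ac.uk/metagenomics/api/v1"


def parse_protein_assembly_lines(lines):
    """Map each protein accession to its list of ERZ assembly accessions.

    Assumed layout (not confirmed by the maintainers): one protein per line,
    followed by whitespace and a semicolon-separated list of assemblies.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment-style headers
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # protein with no assemblies listed
        protein, assemblies = parts
        mapping[protein] = [a.strip() for a in assemblies.split(";") if a.strip()]
    return mapping


def analyses_url(assembly_id):
    """Build the analyses URL for one assembly, following the links above."""
    return f"{API_ROOT}/assemblies/{assembly_id}/analyses"


if __name__ == "__main__":
    example = ["MGYP1\tERZ1;ERZ99", "MGYP2\tERZ509256"]
    for protein, erz_ids in parse_protein_assembly_lines(example).items():
        for erz in erz_ids:
            print(protein, erz, analyses_url(erz))
```

With the real file, the `lines` argument would simply be the open file handle.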

@SandyRogers
Member

Hello @zdk123, thanks for your question.

There are various reasons why assemblies you find in the protein database releases may not be accessible in the EMG API.

One of the most common reasons is that data have been "suppressed" in ENA (or, more rarely, in MGnify) after the protein database snapshot was released but before you query the EMG API.

We regularly/continuously reflect the suppression state of datasets from ENA into MGnify's live database (behind the API). (Suppression is usually at the request of data submitters.) We do not retrospectively remove proteins that were derived from suppressed assemblies from the released protein database snapshots though.

In the case of your example, this wasn't quite what happened, but the scenario was similar: the assembly was produced, and proteins were predicted and ingested into the protein database, but that assembly analysis was not uploaded to the MGnify database behind the EMG API. This usually happens when we notice that the assembly or annotation quality is not as good as it could be. In the case you linked, we reassembled that dataset with a different assembler, and that reassembly is the one available on the MGnify API/website.

So in general, you should expect that there are assemblies referenced in the protein database that are not available on the API and, conversely, that there will be assemblies/analyses on the API that are not (yet) in a protein database release, since these data products have very different release cadences.
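
In practice this means client code should treat a missing assembly as a normal outcome rather than an error. A minimal sketch of that idea follows. It rests on assumptions that should be checked against the live API: that a missing assembly yields either HTTP 404 or a payload whose top-level `data` list is empty, and that the analyses URL follows the pattern of the links earlier in this thread. The helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Base URL copied from the example links in this thread.
API_ROOT = "https://www.ebi.ac.uk/metagenomics/api/v1"


def fetch_analyses(assembly_id, timeout=30):
    """Return the parsed JSON payload, or None if the API answers 404."""
    url = f"{API_ROOT}/assemblies/{assembly_id}/analyses"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # assumed signal for "not on the API"
        raise


def has_analyses(payload):
    """True if the payload lists at least one analysis.

    Assumption: analyses are listed under a top-level "data" key.
    """
    return bool(payload) and bool(payload.get("data"))


def split_by_availability(assembly_ids, fetch=fetch_analyses):
    """Partition assembly ids into (available, missing), preserving order."""
    available, missing = [], []
    for assembly_id in assembly_ids:
        if has_analyses(fetch(assembly_id)):
            available.append(assembly_id)
        else:
            missing.append(assembly_id)
    return available, missing
```

Passing `fetch` as a parameter keeps the partitioning logic testable offline, e.g. `split_by_availability(ids, fetch=cached_payloads.get)` with a dictionary of previously downloaded responses.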

@zdk123
Author

zdk123 commented May 2, 2024

Thanks for the quick reply and detailed comments. I have a followup based on what you just said here.

  1. Do you make the retired/hidden contigs/analyses available via any mechanism other than the API (e.g. a different part of the FTP site)?
  2. In the scenario described above, if the assembly gets retired/redone, approximately when will the assembly map on the FTP site be updated to reflect the new protein -> assembly mapping?

@SandyRogers
Member

  1. Do you make the retired/hidden contigs/analyses available via any mechanism other than the API (e.g. a different part of the FTP site)?

Not currently. In future, we may well enable FTP (etc.) access to analysis data products, e.g. GFF files associated with assemblies. It is still likely that we would only serve "current and public" (i.e. not suppressed or embargoed) data products in this way though.

  2. In the scenario described above, if the assembly gets retired/redone, approximately when will the assembly map on the FTP site be updated to reflect the new protein -> assembly mapping?
  • Assuming an identical MGYP was found in the first assembly (ERZ1) and the later re-assembly (ERZ2), the next release of the protein database would include ERZ2 in the mgy_assemblies.tsv. E.g. a line may go from:
- MGYP1    ERZ1;ERZ99
+ MGYP1    ERZ1;ERZ99;ERZ2
  • MGYPs can and do change with new assemblies though, since the contigs change.
  • The old ERZ1 may or may not be removed as well, depending on the reasons for reassembling.
  • The cadence for new protein database releases should generally be multiple times per year (releases have been less frequent recently due to substantial refactoring needed to handle the increasing scale).
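
To see which protein -> assembly links changed between two protein database releases, the mappings can be compared per MGYP. The sketch below is illustrative only: it assumes each release has already been parsed into a dictionary of protein accession to list of ERZ accessions, and the function name `diff_assemblies` is hypothetical. The example data mirrors the placeholder MGYP1/ERZ1/ERZ2 line above.

```python
def diff_assemblies(old, new):
    """Report, per protein, the assemblies added or removed between releases.

    `old` and `new` map protein accessions to lists of assembly accessions.
    Proteins whose assembly set is unchanged are omitted from the result.
    """
    changes = {}
    for protein in sorted(set(old) | set(new)):
        before = set(old.get(protein, ()))
        after = set(new.get(protein, ()))
        if before != after:
            changes[protein] = {
                "added": sorted(after - before),
                "removed": sorted(before - after),
            }
    return changes


if __name__ == "__main__":
    old_release = {"MGYP1": ["ERZ1", "ERZ99"]}
    new_release = {"MGYP1": ["ERZ1", "ERZ99", "ERZ2"]}
    print(diff_assemblies(old_release, new_release))
```

A protein that exists in only one release shows up with all of its assemblies listed as added or removed, which covers the case where the MGYPs themselves change after a reassembly.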
