Missing analyses from an assembly ids #359
Comments
Hello @zdk123, thanks for your question. There are various reasons why assemblies you find in the protein database releases may not be accessible in the EMG API.

One of the most common reasons is that data have been "suppressed" in ENA (or, more rarely, in MGnify) after the protein database snapshot was released but before you query the EMG API. We regularly/continuously reflect the suppression state of datasets from ENA into MGnify's live database (behind the API). (Suppression is usually at the request of data submitters.) However, we do not retrospectively remove proteins derived from suppressed assemblies from the released protein database snapshots.

Your example is not quite this case, but it is a similar scenario: the assembly was produced, and proteins were predicted and ingested into the protein database, but that assembly analysis was not uploaded to the MGnify database behind the EMG API. This usually happens when we notice that the assembly or annotation quality is not as good as it could be. In the case you linked, we reassembled that dataset with a different assembler, and that reassembly is the one on the MGnify API/website.

So in general, you should expect that there are assemblies referenced in the protein database that are not available on the API. Conversely, there will be assemblies/analyses on the API that are not (yet) in a protein database release, since these data products have very different release cadences.
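To make the two-way mismatch described above concrete, here is a minimal sketch that compares two sets of assembly accessions. The function name `compare_releases` and the example accessions are hypothetical and only illustrate the idea; this is not part of any MGnify tooling.

```python
from typing import Dict, Iterable, List


def compare_releases(snapshot_ids: Iterable[str],
                     api_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Split accessions into snapshot-only, API-only, and shared groups."""
    snapshot, api = set(snapshot_ids), set(api_ids)
    return {
        # Referenced in the protein database snapshot but not served by the API
        # (e.g. suppressed or superseded by a reassembly).
        "snapshot_only": sorted(snapshot - api),
        # On the API but not (yet) in a protein database release.
        "api_only": sorted(api - snapshot),
        "shared": sorted(snapshot & api),
    }


if __name__ == "__main__":
    # Placeholder accessions, not real query results.
    print(compare_releases(["ERZ1", "ERZ99"], ["ERZ1", "ERZ2"]))
```

Both difference groups can be non-empty at the same time, which is the expected state given the different release cadences.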
Thanks for the quick reply and detailed comments. I have a follow-up based on what you just said here.
Not currently. In future, we may well enable FTP (etc.) access to analysis data products, e.g. GFF files associated with assemblies. It is still likely that we would only serve "current and public" (i.e. not suppressed or embargoed) data products in this way, though.
- MGYP1 ERZ1;ERZ99
+ MGYP1 ERZ1;ERZ99;ERZ2
Hi MGnify team.
Apologies if this is not the correct place to report this issue, but I am interested in getting a protein -> contig -> assembly map and figured I could rely on the mgy_assemblies.tsv file on the FTP server. It contains the protein -> ERZ relationships, and I could then use the API to pull the contigs from the most recently available analysis (e.g. https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ509256/analyses seems sufficient).
However, I noticed that a substantial number of the ERZ ids do not have any analysis data. For example, https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ1744782/analyses.
By my count, 5840 out of the 33345 ids are missing in the API. Is there another way to get a complete protein -> contig map?
thanks!
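For anyone wanting to reproduce a count like this, here is a rough sketch of the check. It assumes each line of mgy_assemblies.tsv holds a protein accession followed by a semicolon-separated list of ERZ ids (as in `MGYP1 ERZ1;ERZ99`), and that an assembly without analyses yields either an HTTP 404 or a JSON response with an empty `data` list. Both are assumptions and have not been verified against the live API. The function names are hypothetical.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List, Set

API_ROOT = "https://www.ebi.ac.uk/metagenomics/api/v1"


def parse_protein_assemblies(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse 'MGYP1<whitespace>ERZ1;ERZ99' lines into {protein: [ERZ ids]}."""
    mapping: Dict[str, List[str]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # Note: a two-column header row, if present, would need skipping by the caller.
        mapping[parts[0]] = [erz for erz in parts[1].split(";") if erz]
    return mapping


def has_analyses(erz: str) -> bool:
    """Ask the API whether an assembly has analyses (response shape is assumed)."""
    url = f"{API_ROOT}/assemblies/{erz}/analyses"
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            payload = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
    return bool(payload.get("data"))


def missing_assemblies(mapping: Dict[str, List[str]],
                       checker: Callable[[str], bool] = has_analyses) -> Set[str]:
    """Return the ERZ ids for which `checker` reports no analyses."""
    ids = {erz for assemblies in mapping.values() for erz in assemblies}
    return {erz for erz in ids if not checker(erz)}
```

The `checker` argument is injectable so the logic can be tried offline with a stub, for example `missing_assemblies(mapping, checker=lambda erz: erz != "ERZ99")`. With tens of thousands of ids, the real checker should be rate-limited and its results cached.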