proposal for handling xml -> chado mappings #15
eutils module goals

Assembly
Things to keep in mind about the analysis table: program and programversion won't be available, and they are not nullable. program + programversion + sourcename must be unique. With this in mind, we'll make the sourcename the ACCESSION, and the program / programversion be eutils v 1.0.

Base:
name: assemblyname

Missing:
program - won't be available. -- use eutils

Standard metadata
Encoded in the <Stats> block:
<Stats>
<Stat category="alt_loci_count" sequence_tag="all">0</Stat>
<Stat category="total_length" sequence_tag="all">1373527118</Stat>
<Stat category="ungapped_length" sequence_tag="all">1293730791</Stat>
</Stats>
So the category = the term to look for. Note that some of these props can be happily ignored.

Additional metadata
There's a lot, this is a big one to tackle...

Project
The project table just has a name and a description, and the name is unique.
name - do we use Name or Title?

Additional metadata
Some interesting ones. Some are in:
<Relevance>
<Agricultural>yes</Agricultural>
<Evolution>yes</Evolution>
</Relevance>
<AnnotationSource>
<Name>NCBI annotation pipeline</Name>
</AnnotationSource>
Others are in:
<RepliconSet>
<Replicon order="1">
<Type location="ePlastid">eChromosome</Type>
<Name>CHL</Name>
<Size units="Mb">0.155691</Size>
</Replicon>
<Count repliconType="eOther">1</Count>
</RepliconSet>

Biosample
Look to the analysis expression loader for base mappings.
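As a rough illustration of the <Stats> idea above (the category attribute is the term to look for, and some props can be ignored), here is a minimal Python sketch. The function name `stats_to_props` and the `ignored` parameter are hypothetical, not part of the module, and keying on category alone assumes a single sequence_tag per category, as in the excerpt.

```python
import xml.etree.ElementTree as ET

# Sample copied from the <Stats> excerpt in this issue.
STATS_XML = """<Stats>
<Stat category="alt_loci_count" sequence_tag="all">0</Stat>
<Stat category="total_length" sequence_tag="all">1373527118</Stat>
<Stat category="ungapped_length" sequence_tag="all">1293730791</Stat>
</Stats>"""


def stats_to_props(xml_text, ignored=frozenset()):
    """Return {category: value} for each <Stat>, skipping ignored categories.

    Hypothetical helper: the category attribute is treated as the term name.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for stat in root.iter("Stat"):
        category = stat.get("category")
        if category is None or category in ignored:
            continue
        props[category] = (stat.text or "").strip()
    return props


# Example: treat alt_loci_count as a prop we are happy to ignore.
props = stats_to_props(STATS_XML, ignored={"alt_loci_count"})
```

If the same category can appear with several sequence_tag values, the key would need to include the sequence_tag as well.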
@mpoelchau I think you'll want to talk about this with me, particularly the issue with analyses to assembly.
To clarify: we don't have clear mappings for the following chado analysis fields:
I guess the alternative is to require these fields to be provided by the user when running the importer.
program - won't be available. -- use eutils
Obviously the program should be the assembly software used, but that isn't reliably found in the XML (I don't see it in any of the examples I've assembled here:
Right, I remember that we had to do some gymnastics to get the program from the NCBI ftp site. I have no idea why it's there and not in the eutils-supplied metadata. If you look into our internal issue on this and search for ftp, you'll find the comments on this and our workaround: https://gitlab.com/i5k_Workspace/workspace_roadmap/issues/533
It did not occur to me we would have to do something so awful. OK, I'm making a child issue for this.
child of #8
Here's the idea.
The XML parser is split into creating the base record, dbxrefs, linked records, and props (and whatever other stuff we need).
The base record stuff is hard-coded. We look for a hard-coded attribute for each column of the base record, with advanced logic to check for all possible attributes and use the "best" one.
dbxrefs are also hard-coded.
Linked records... I'm not there yet. Let's ignore them for now.
For everything else: it looks up the tag in an API. The API returns whether the tag should be ignored, added as a prop, or something else.
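The tag lookup step could look something like the following Python sketch. Everything named here (`TagAction`, `TAG_REGISTRY`, `lookup_tag`, `collect_props`, and the cvterm id 101) is made up for illustration; the real API and its return values are not defined in this issue, and defaulting unknown tags to "ignore" is an assumption.

```python
import xml.etree.ElementTree as ET
from enum import Enum


class TagAction(Enum):
    IGNORE = "ignore"
    PROP = "prop"


# Hypothetical stand-in for the lookup API: tag name -> (action, cvterm_id).
# The cvterm id below is invented for the example.
TAG_REGISTRY = {
    "Agricultural": (TagAction.PROP, 101),
    "Evolution": (TagAction.IGNORE, None),
}


def lookup_tag(tag):
    """Return (action, cvterm_id); unknown tags are ignored in this sketch."""
    return TAG_REGISTRY.get(tag, (TagAction.IGNORE, None))


def collect_props(xml_text):
    """Return (cvterm_id, value) pairs for leaf tags that map to a prop."""
    props = []
    for elem in ET.fromstring(xml_text).iter():
        if len(elem) > 0:
            continue  # container element, no value of its own
        action, cvterm_id = lookup_tag(elem.tag)
        if action is TagAction.PROP:
            props.append((cvterm_id, (elem.text or "").strip()))
    return props


result = collect_props(
    "<Relevance><Agricultural>yes</Agricultural>"
    "<Evolution>yes</Evolution></Relevance>"
)
```

The "something else" outcome mentioned above would be an extra `TagAction` member once we know what it needs to do.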
We have a schema that stores:
ALL encountered tags. It keeps the tag name, the NCBI db type for that tag, and whether the tag is assigned to a term or not. If it's assigned, it's just the cvterm_id for easy lookup. We also have a list of all the possible matching cvterms that aren't necessarily assigned (probably a separate, mview-type table).
How does the schema get populated? Read on...
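To make the two tables concrete, here is one possible layout, sketched with Python's sqlite3 so it can be run as-is. The table and column names (`xml_tag`, `xml_tag_match`, `ncbi_db`) and the sample values are assumptions for illustration only; the real thing would live alongside Chado rather than in SQLite.

```python
import sqlite3

# Hypothetical layout: one row per encountered tag, plus a table of candidate
# cvterms per tag (the "possible matches" list).
SCHEMA = """
CREATE TABLE xml_tag (
    tag_id    INTEGER PRIMARY KEY,
    tag_name  TEXT NOT NULL,
    ncbi_db   TEXT NOT NULL,
    cvterm_id INTEGER,              -- NULL means the tag is unassigned
    UNIQUE (tag_name, ncbi_db)
);
CREATE TABLE xml_tag_match (
    tag_id    INTEGER NOT NULL REFERENCES xml_tag (tag_id),
    cvterm_id INTEGER NOT NULL,
    PRIMARY KEY (tag_id, cvterm_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# A tag starts out unassigned, with one candidate match (ids are invented).
cur = conn.execute(
    "INSERT INTO xml_tag (tag_name, ncbi_db) VALUES (?, ?)",
    ("Agricultural", "bioproject"),
)
tag_id = cur.lastrowid
conn.execute(
    "INSERT INTO xml_tag_match (tag_id, cvterm_id) VALUES (?, ?)", (tag_id, 101)
)

# The admin "assigns" the tag by copying a candidate cvterm_id onto the tag row.
conn.execute("UPDATE xml_tag SET cvterm_id = ? WHERE tag_id = ?", (101, tag_id))

assigned = conn.execute(
    "SELECT tag_name, cvterm_id FROM xml_tag WHERE cvterm_id IS NOT NULL"
).fetchall()
```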
schema population
We have a job that reads an XML file and compiles all the attribute tags: each tag is stored in the schema as unassigned. It then looks each one up in your chado.cvterm table. All exact and "close enough" matches go in the possible matches schema. The admin then goes to an admin area and sees a list of all XML terms with matches. From there they can "assign" the attribute, which means that when the XML gets parsed for real, it will create a property. If an attribute is not assigned a term, it gets ignored. If no terms match an attribute, the admin is instructed to find one, with a button to automatically create a local term instead.
Furthermore, on install, we can hard-code some suggested attribute -> cvterm mappings. This is tricky because everyone's site is different, but maybe there are some attributes we would expect in ALL biosamples across plants, animals, fungi, etc.
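A minimal sketch of the population job, assuming "close enough" means a fuzzy string match. The issue doesn't say how closeness is measured, so `difflib.get_close_matches` with a 0.8 cutoff is just one option, and `collect_tags` / `suggest_matches` are hypothetical names. The cvterm names are passed in as a plain list here instead of being read from chado.cvterm.

```python
import difflib
import xml.etree.ElementTree as ET


def collect_tags(xml_text):
    """Return the sorted, de-duplicated leaf tag names found in the XML."""
    root = ET.fromstring(xml_text)
    return sorted({elem.tag for elem in root.iter() if len(elem) == 0})


def suggest_matches(tags, cvterm_names, cutoff=0.8):
    """Map each tag to exact and close-enough cvterm names (case-insensitive)."""
    by_lower = {name.lower(): name for name in cvterm_names}
    suggestions = {}
    for tag in tags:
        close = difflib.get_close_matches(
            tag.lower(), list(by_lower), n=5, cutoff=cutoff
        )
        suggestions[tag] = [by_lower[c] for c in close]
    return suggestions


tags = collect_tags(
    "<Relevance><Agricultural>yes</Agricultural>"
    "<Evolution>yes</Evolution></Relevance>"
)
# Stand-in cvterm names; a real run would query chado.cvterm.
suggestions = suggest_matches(tags, ["agricultural", "evolution", "genome size"])
```

Tags that come back with an empty list are the ones where the admin would be told to find a term or create a local one.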
When someone imports a new XML, it can be configured to ignore new attribute tags (but add them to its schema as an unmatched, ignored attribute) OR to abort the load -> the admin can then assign a term and re-attempt the load.
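The two import behaviours could be driven by a single setting, roughly like this. The names `check_tags`, `on_unknown`, and `UnknownTagError` are hypothetical and only meant to show the control flow.

```python
class UnknownTagError(Exception):
    """Raised when the import is configured to abort on unseen tags."""


def check_tags(found_tags, known_tags, on_unknown="ignore"):
    """Return tags not yet in the schema, or abort if configured to do so.

    on_unknown="ignore": the caller records the returned tags as unmatched,
    ignored attributes and continues the load.
    on_unknown="abort": raise so the admin can assign terms and retry.
    """
    if on_unknown not in ("ignore", "abort"):
        raise ValueError(f"unsupported on_unknown value: {on_unknown!r}")
    new_tags = sorted(set(found_tags) - set(known_tags))
    if new_tags and on_unknown == "abort":
        raise UnknownTagError("unassigned new tags: " + ", ".join(new_tags))
    return new_tags


# Default mode: the load continues and "Size" gets recorded as ignored.
recorded = check_tags(["Name", "Size"], {"Name"})
```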