proposal for handling xml -> chado mappings #15

bradfordcondon · 2018-11-27T16:38:11Z

child of #8

Here's the idea.

The XML parser is split into creating the base record, dbxrefs, linked records, and props. (and whatever other stuff we need).

The base record stuff is hard coded. We look for a hardcoded attribute for each column of the base record, with advanced logic to check for all possible attributes and use the "best" one.

dbxrefs are also hard-coded.

Linked records... I'm not there yet. Let's ignore them for now.

For everything else, the parser looks up the tag in an API. The API returns whether the tag should be ignored, added as a prop, or handled some other way.

We have a schema that stores:

ALL encountered tags. It keeps the tag name, the NCBI db type for that tag, and whether the tag is assigned to a term or not. If it's assigned, it's just the cvtermid for easy lookup. We also have a list of all the possible matching cvterms that aren't necessarily assigned (probably a separate, mview-type table).
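To make the lookup concrete, here is a minimal in-memory sketch of that schema and the tag lookup. Everything in it (the `TagRecord` and `TagRegistry` names, the `"prop"` / `"ignore"` return values) is hypothetical and only illustrates the idea; the real thing would be database tables, and the "something else" outcome is not modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TagRecord:
    """One encountered XML tag (hypothetical structure, not a real table)."""
    tag: str
    db_type: str                      # NCBI db the tag came from, e.g. "biosample"
    cvterm_id: Optional[int] = None   # set only once an admin assigns a term
    candidate_cvterm_ids: List[int] = field(default_factory=list)


class TagRegistry:
    """Stores every encountered tag, keyed by (db_type, tag)."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], TagRecord] = {}

    def register(self, tag: str, db_type: str) -> TagRecord:
        # New tags start out unassigned; existing records are returned as-is.
        return self._records.setdefault((db_type, tag), TagRecord(tag, db_type))

    def assign(self, tag: str, db_type: str, cvterm_id: int) -> None:
        self.register(tag, db_type).cvterm_id = cvterm_id

    def lookup(self, tag: str, db_type: str) -> str:
        """Return "prop" if the tag has an assigned term, otherwise "ignore"."""
        record = self._records.get((db_type, tag))
        if record is not None and record.cvterm_id is not None:
            return "prop"
        return "ignore"
```

Keying on both the db type and the tag name means the same tag name can be mapped differently for, say, biosamples and assemblies.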

how does the schema get populated? read on...

schema population

We have a job that reads an XML file and compiles all the attribute tags: each tag is stored in the schema as unassigned. It then looks each one up in your chado.cvterm. All exact and "close enough" matches go in the possible-matches schema. The admin then goes to an admin area and sees a list of all XML terms with matches. From there they can "assign" the attribute, which means that when the XML gets parsed for real, it will create a property. If an attribute is not assigned a term, it gets ignored. If no terms match an attribute, the admin is instructed to find one, with a button to automatically create a local term instead.
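A rough sketch of that population job, under some loud assumptions: attributes are taken to look like BioSample-style `<Attribute attribute_name="...">` elements, the cvterm table is stood in for by a plain `name -> cvterm_id` dict, and "close enough" is approximated with `difflib` string similarity. The function names and the 0.8 cutoff are made up for illustration.

```python
import difflib
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List, Set


def collect_attribute_names(xml_text: str) -> Set[str]:
    """Collect every distinct attribute_name found on <Attribute> elements."""
    root = ET.fromstring(xml_text)
    return {
        el.get("attribute_name")
        for el in root.iter("Attribute")
        if el.get("attribute_name")
    }


def suggest_cvterms(
    names: Iterable[str], cvterms: Dict[str, int], cutoff: float = 0.8
) -> Dict[str, List[int]]:
    """Map each attribute name to candidate cvterm ids, best match first.

    `cvterms` stands in for chado.cvterm as a name -> cvterm_id dict.
    An empty list means no match: the admin has to find or create a term.
    """
    lowered = {name.lower(): cvterm_id for name, cvterm_id in cvterms.items()}
    suggestions: Dict[str, List[int]] = {}
    for name in sorted(names):
        close = difflib.get_close_matches(
            name.lower(), list(lowered), n=5, cutoff=cutoff
        )
        suggestions[name] = [lowered[match] for match in close]
    return suggestions
```

Nothing here assigns a term; it only fills the "possible matches" side, which leaves the actual assignment to the admin.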

Furthermore, on install, we can hardcode some suggested attribute -> cvterm mappings. This is tricky because everyone's site is different, but maybe there are some attributes we would expect in ALL biosamples across plants, animals, fungi, etc.

When someone imports a new XML, it can be configured to ignore new attribute tags (but add them to its schema as an unmatched, ignored attribute) OR to abort the load -> the admin can then assign a term and re-attempt the load.
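The two import modes could look something like the sketch below. The names (`resolve_tags`, `UnmappedTagError`, the `on_new` setting) are hypothetical. It also assumes that new tags get recorded in the schema in both modes, so that after an aborted load the admin has something to assign a term to.

```python
from typing import Dict, Iterable, List, Optional


class UnmappedTagError(Exception):
    """Raised in "abort" mode when the XML contains tags never seen before."""


def resolve_tags(
    tags: Iterable[str],
    assigned: Dict[str, Optional[int]],
    on_new: str = "ignore",
) -> Dict[str, int]:
    """Return tag -> cvterm_id for every tag that should become a prop.

    `assigned` maps each known tag to its cvterm id, or None if the tag is
    known but unassigned. New tags are added to it as unassigned.
    """
    if on_new not in ("ignore", "abort"):
        raise ValueError("on_new must be 'ignore' or 'abort'")

    props: Dict[str, int] = {}
    new_tags: List[str] = []
    for tag in tags:
        if tag not in assigned:
            new_tags.append(tag)
            assigned[tag] = None  # remember it as unmatched and ignored
            continue
        cvterm_id = assigned[tag]
        if cvterm_id is not None:
            props[tag] = cvterm_id

    if new_tags and on_new == "abort":
        raise UnmappedTagError("new tags need review: " + ", ".join(new_tags))
    return props
```

Checking for new tags before anything is written means an aborted load leaves no partial records behind, only the newly recorded tags.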

bradfordcondon · 2018-11-27T20:55:52Z

eutils module goals

Assembly

Things to keep in mind about the analysis table:

program and programversion won't be available, and are not nullable.

program + programversion + sourcename must be unique.

With this in mind, we'll make the sourcename the ACCESSION, and the program / programversion eutils v 1.0.

Base:

name: assemblyname
description: assemblydescription
timeexecuted: we get this from either asmreleasedate_genbank or asmreleasedate_refseq. Use the earlier of the two?
sourcename: the accession, since it must be unique. Therefore, the AssemblyAccession tag.

missing

program - won't be available. -- use eutils
programversion - won't be available. -- use eutils
algorithm - null
sourceversion - null
sourceuri - null
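Putting the base and missing fields together, the mapping might be sketched as below. This assumes the summary values have already been pulled out of the XML into a dict keyed by the names used above, that the release dates arrive as `YYYY/MM/DD HH:MM` strings (an assumption about the format), and that we go with the earlier of the two dates, which is still an open question. `build_analysis_record` is a hypothetical helper.

```python
from datetime import datetime
from typing import Dict, Optional

# Assumed date format for the release-date fields; adjust if it differs.
DATE_FORMAT = "%Y/%m/%d %H:%M"


def build_analysis_record(summary: Dict[str, str]) -> Dict[str, Optional[object]]:
    """Map assembly summary values onto chado analysis columns."""
    dates = [
        datetime.strptime(summary[key], DATE_FORMAT)
        for key in ("asmreleasedate_genbank", "asmreleasedate_refseq")
        if summary.get(key)
    ]
    return {
        "name": summary["assemblyname"],
        "description": summary.get("assemblydescription"),
        # Open question in the proposal: this sketch takes the earlier date.
        "timeexecuted": min(dates) if dates else None,
        # The accession keeps program + programversion + sourcename unique.
        "sourcename": summary["AssemblyAccession"],
        # Placeholders, because both columns are not nullable.
        "program": "eutils",
        "programversion": "1.0",
        "algorithm": None,
        "sourceversion": None,
        "sourceuri": None,
    }
```

Because program and programversion are constants here, uniqueness rests entirely on the accession in sourcename.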

standard metadata

Encoded in the <Meta><Stats> tag. For example:

<Stats>
    <Stat category="alt_loci_count" sequence_tag="all">0</Stat>
    <Stat category="total_length" sequence_tag="all">1373527118</Stat>
    <Stat category="ungapped_length" sequence_tag="all">1293730791</Stat>
</Stats>

So the category is the term to look for. Note that some of these props can be happily ignored.
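Pulling the stats out is straightforward once you have the `<Stats>` fragment as a string. A minimal sketch, assuming the fragment is already extracted from `<Meta>`; `parse_stats` is a made-up name, and keying on the `(category, sequence_tag)` pair is a choice made here, not something settled in the proposal.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

SAMPLE_STATS = """
<Stats>
    <Stat category="alt_loci_count" sequence_tag="all">0</Stat>
    <Stat category="total_length" sequence_tag="all">1373527118</Stat>
    <Stat category="ungapped_length" sequence_tag="all">1293730791</Stat>
</Stats>
"""


def parse_stats(stats_xml: str) -> Dict[Tuple[str, str], str]:
    """Return {(category, sequence_tag): value} for every <Stat> element."""
    root = ET.fromstring(stats_xml.strip())
    stats: Dict[Tuple[str, str], str] = {}
    for stat in root.iter("Stat"):
        key = (stat.get("category", ""), stat.get("sequence_tag", ""))
        stats[key] = (stat.text or "").strip()
    return stats


stats = parse_stats(SAMPLE_STATS)
print(stats[("total_length", "all")])
```

Each category would then go through the same tag lookup as any other attribute, so stats without an assigned term are simply skipped.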

Additional metadata

There's a lot; this is a big one to tackle...

Project

The project table just has a name and a description, and the name is unique.

name - do we use Name or Title?
description - Description tag

Additional metadata

Some interesting ones. Some are in the ProjectDescr tag:

<Relevance>
    <Agricultural>yes</Agricultural>
    <Evolution>yes</Evolution>
</Relevance>

<AnnotationSource>
    <Name>NCBI annotation pipeline</Name>
</AnnotationSource>

others in the ProjectType tag:

<RepliconSet>
    <Replicon order="1">
        <Type location="ePlastid">eChromosome</Type>
        <Name>CHL</Name>
        <Size units="Mb">0.155691</Size>
    </Replicon>
    <Count repliconType="eOther">1</Count>
</RepliconSet>

Biosample

look to analysis expression loader for base mappings.

bradfordcondon · 2018-11-28T15:11:19Z

@mpoelchau I think you'll want to talk about this with me, particularly the issue with analyses to assembly.

To clarify:

We don't have clear mappings for the following chado analysis fields:
https://github.com/NAL-i5K/tripal_eutils/tree/master/examples/assembly

I guess the alternative is to require these fields to be provided by the user when running the importer.

program - won't be available. -- use eutils
programversion - won't be available. -- use eutils
algorithm - null
sourceversion - null
sourceuri - null

Obviously the program should be the assembly software used, but that isn't reliably found in the XML (I don't see it in any of the examples I've assembled in the examples/assembly directory linked above).

mpoelchau · 2018-11-29T14:24:14Z

Right, I remember that we had to do some gymnastics to get the program from the NCBI ftp site. I have no idea why it's there and not in the eutils-supplied metadata.

If you look into our internal issue on this and search for ftp, you'll find the comments on this and our workaround:

https://gitlab.com/i5k_Workspace/workspace_roadmap/issues/533

bradfordcondon · 2018-11-29T15:04:22Z

It did not occur to me we would have to do something so awful. OK, I'm making a child issue for this.

This was referenced Nov 29, 2018

assembly -> chado analysis mapping: we need to download FTP stuff... #28

Closed

supported databases #12

Closed

bradfordcondon closed this as completed Dec 14, 2018
