JSON Export Format

Kim Rutherford edited this page Sep 3, 2024 · 25 revisions

Exporting from Canto to a JSON file

The JSON export command is documented on the main Canto site.

A technical specification of the JSON export file (as a JSON Schema) is available at etc/export.schema.json on the Canto repository.

The output has this structure:

  
{
  "curation_sessions" : {
    "<session_key>": {  see note below

      "genes" : {  see note below
        "<organism_name_and_gene_uniquename>" : { see note below
          "organism" : "Drosophila melanogaster",
          "uniquename" : "FBgn0040505"
        },
        "<another_organism_name_and_gene_uniquename>" : {
          ...
        }
      },

      "alleles" : {   see note below
        "<unique_allele_id>": {    see note below
          "allele_type" : "<allele_type>",
          "gene" : "<organism_name_and_gene_uniquename>",  see note below
          "name" : "<gene_name>",
          "primary_identifier" : "<unique_allele_id>",
          "synonyms" : []
        },
        "<another_unique_allele_id>" : {
          ...
        }
      },

      "genotypes" : {   see note below
        "<unique_genotype_id>": {   see note below
          "loci": [
            [
              {
                "id" : "FBal0157358"
              },
              {
                "id" : "FBal0157359"
              }
            ]
          ],
          "organism_taxonid" : 7227
        },
        "<another_unique_genotype_id>": {
          ...
        }
      },

      "metagenotypes": {  optional - see note below
        "<a_unique_metagenotype_id>": {
          "type": "pathogen-host",
          "host_genotype": "<unique_genotype_id>",
          "pathogen_genotype": "<another_unique_genotype_id>"
        }
      },

      "annotations" : [   see note below
        {
          "conditions" : [],
          "creation_date" : "2019-06-11",
          "curator" : {
            "community_curated" : false,
            "email" : "[email protected]",
            "name" : "Kim Rutherford"
          },
          "evidence_code" : "",
          "extension" : [],
          "genotype" : "<unique_genotype_id>",
          "publication" : "<some_pubmed_id>",
          "status" : "new",
          "submitter_comment" : null,
          "term" : "<some_term_id>",
          "type" : "phenotype",
          "with_gene_id" : null
        },
        {
          ...
        }
      ],
      "metadata": {   see note below
        "accepted_timestamp" : "2019-05-31 10:00:22",
        "annotation_status" : "APPROVED",
        "annotation_status_datestamp" : "2019-06-12 10:48:22",
        "approval_in_progress_timestamp" : "2019-06-12 10:47:39",
        "approved_timestamp" : "2019-06-12 10:48:22",
        "approver_email" : "[email protected]",
        "approver_name" : "Admin Person",
        "canto_session" : "4e77f8cbed7cd6c6",
        "curation_accepted_date" : "2019-05-31 10:00:22",
        "curation_pub_id" : "PMID:17285636",
        "curator_email" : "[email protected]",
        "curator_name" : "Kim Rutherford",
        "curator_role" : "FlyBase-test",
        "first_approved_timestamp" : "2019-06-12 10:48:22",
        "needs_approval_timestamp" : "2019-06-12 10:47:19",
        "session_created_timestamp" : "2019-05-27 05:53:39",
        "session_first_submitted_timestamp" : "2019-06-12 10:47:19",
        "session_genes_count" : "4",
        "session_term_suggestions_count" : "0",
        "session_unknown_conditions_count" : "0",
        "term_suggestion_count" : "0",
        "unknown_conditions_count" : "0"
      },
      "organisms" : {
        "7227" : {
          "full_name" : "Drosophila melanogaster"
        }
      },
      "publications": {   see note below
        "<some_pubmed_id>" : {}
      }
    },
    "<some_other_session_key>": {
      ...
    },
    ...
  }
}
  

<session_key>

The Canto session key is a unique 16 character hexadecimal ID for the session.
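
As a minimal sketch of reading the file, the Python below walks the keys under curation_sessions and checks that each looks like a session key. The function name session_keys and the lower-case hexadecimal pattern are assumptions made for illustration (the example key in this page is lower-case); etc/export.schema.json remains the authority on the format.

```python
import json
import re

# Assumption: session keys are 16 lower-case hexadecimal characters,
# as in the example "4e77f8cbed7cd6c6".
SESSION_KEY_RE = re.compile(r"[0-9a-f]{16}")


def session_keys(export):
    """Return the session keys of a parsed export, checking their format."""
    keys = list(export["curation_sessions"])
    for key in keys:
        if not SESSION_KEY_RE.fullmatch(key):
            raise ValueError(f"unexpected session key: {key!r}")
    return keys


# A tiny hand-written export, used only for illustration.
example = json.loads('{"curation_sessions": {"4e77f8cbed7cd6c6": {}}}')
print(session_keys(example))
```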

<organism_name_and_gene_uniquename>

These keys are used in the allele section to refer to genes in the genes section. The current format of organism + gene_uniquename may change in future.

genes section

The genes have two fields:

  • organism: the genus + species (+ optional strain)
  • uniquename: the gene primary ID

<unique_allele_id>

Used to uniquely refer to alleles in the alleles section from the genotypes section. If the allele has been loaded from an external JSON file (see JSON import file) then this ID will be the imported ID. Alleles that are added in a Canto session are assigned a unique ID for use when exporting.

alleles section

Every allele will have an allele_type and a primary_identifier field.

  • The primary_identifier is equivalent to the uniquename from Chado and will match the key of this allele (see unique_allele_id)
  • Example allele types are "deletion", "wild_type" and "aberration"
  • gene is the ID of the gene of this allele unless the allele has type "aberration"
  • name and description are optional, depending on the allele type
  • notes is an optional map of notes attached to this allele
  • synonyms is a list of synonyms that have been added in this session
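
The link from an allele to its gene can be followed with a plain dictionary lookup. The sketch below is illustrative only: gene_for_allele is a hypothetical helper, the session fragment is hand-written, and the gene key is treated as an opaque string because its format may change.

```python
def gene_for_allele(session, allele_id):
    """Return the gene record for an allele, or None if it has no gene."""
    allele = session["alleles"][allele_id]
    # Alleles of type "aberration" carry no "gene" field.
    gene_key = allele.get("gene")
    if gene_key is None:
        return None
    return session["genes"][gene_key]


# Hand-written session fragment; the keys and IDs are illustrative only.
session = {
    "genes": {
        "<organism_name_and_gene_uniquename>": {
            "organism": "Drosophila melanogaster",
            "uniquename": "FBgn0040505",
        }
    },
    "alleles": {
        "FBal0157358": {
            "allele_type": "deletion",
            "gene": "<organism_name_and_gene_uniquename>",
            "primary_identifier": "FBal0157358",
            "synonyms": [],
        },
        "example-aberration": {
            "allele_type": "aberration",
            "primary_identifier": "example-aberration",
            "synonyms": [],
        },
    },
}

print(gene_for_allele(session, "FBal0157358")["uniquename"])
```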

<unique_genotype_id>

A unique ID created for a genotype by Canto. These IDs are used in the annotations section to refer to genotypes.

genotypes section

  • organism_taxonid: the NCBI taxon ID of the organism this genotype comes from
  • loci: a list of loci
  • comment: a genotype-specific comment

Each locus is a list of allele IDs, each with an optional expression, eg.

  [
     {
        "id" : "FBal0157358"
     }
  ]

or:

  [
     {
        "expression" : "Overexpression",
        "id" : "SPBC28E12.06c:00e0a3ede15887bf-2"
     }
  ]

A diploid locus will look like:

  [
     {
        "id" : "FBal0157358"
     },
     {
        "id" : "FBal0157359"
     }
  ]

And for a multi-locus genotype the loci list will have 2 or more parts. eg.

  "4e77f8cbed7cd6c6-genotype-26" : {
     "loci" : [
        [
           {
              "id" : "FBal0157358"
           },
           {
              "id" : "FBal0157359"
           }
        ],
        [
           {
              "id" : "FBal0322737"
           },
           {
              "id" : "FBal0322736"
           }
        ]
     ],
     "organism_taxonid" : 7227
  },

Haploid and diploid loci can be mixed. eg.

  "4e77f8cbed7cd6c6-genotype-20" : {
     "loci" : [
        [
           {
              "id" : "FBal0125507"
           }
        ],
        [
           {
              "id" : "FBal0288220"
           }
        ],
        [
           {
              "id" : "FBal0157358"
           },
           {
              "id" : "FBal0157359"
           }
        ]
     ],
     "organism_taxonid" : 7227
  },
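
A consumer can walk the loci list to recover the alleles at each locus. The sketch below assumes, based on the examples above, that a locus with one allele is haploid and a locus with two is diploid; describe_loci is a hypothetical helper name.

```python
def describe_loci(genotype):
    """Return a (ploidy, allele_ids) pair for each locus of a genotype."""
    described = []
    for locus in genotype["loci"]:
        allele_ids = [allele["id"] for allele in locus]
        if len(allele_ids) == 1:
            ploidy = "haploid"
        elif len(allele_ids) == 2:
            ploidy = "diploid"
        else:
            # Not covered by the examples on this page.
            ploidy = "other"
        described.append((ploidy, allele_ids))
    return described


# The mixed haploid/diploid genotype from the example above.
genotype = {
    "loci": [
        [{"id": "FBal0125507"}],
        [{"id": "FBal0288220"}],
        [{"id": "FBal0157358"}, {"id": "FBal0157359"}],
    ],
    "organism_taxonid": 7227,
}

for ploidy, allele_ids in describe_loci(genotype):
    print(ploidy, allele_ids)
```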

metagenotypes section

This section contains two types of objects (differentiated with the type field):

  • pathogen-host to represent the "metagenotype" of a host genotype and a pathogen genotype when Canto is used in pathogen-host mode
  • interaction for a genetic interaction

pathogen-host type

All three fields are required:

  • type - "pathogen-host"
  • host_genotype - an ID (from the genotypes section) of a host genotype
  • pathogen_genotype - a pathogen genotype ID

interaction type

Used to export genetic interactions

  • type - "interaction"
  • genotype_a - a genotype ID from the genotypes section
  • genotype_b - a genotype ID

annotations section

A list of annotations with these fields:

  • type: the annotation type (eg. "phenotype" or "molecular_function")
  • creation_date: when the annotation was made
  • curator:
    • community_curated: true if the curator is a non-admin user
    • name
    • email
  • evidence_code: eg. "Inferred from Physical Interaction" or "Microscopy"
  • publication: the publication/PubMed ID
  • status: currently always "new"
  • submitter_comment: a note from the curator for this annotation (if any)
  • term: the ID for the term that the gene or genotype was annotated with
  • extension: the extension for this annotation, if any. See Annotation extension below.
  • gene: the unique gene ID of the gene that was annotated

GO annotations only:

  • with_gene_id: for GO IPI/IGI annotations this is the value for GAF column 8

Physical Interaction annotations only:

  • interacting_genes: for interaction annotations, a list of the IDs of the interacting genes

Phenotype/genotype annotations only:

  • conditions: an optional list of IDs from an experimental condition ontology (eg. PECO)

  • genotype: the unique genotype ID of the genotype that was annotated

  • genotype_interactions_no_phenotype: a list of genotype to genotype interactions associated with this phenotype annotation (see Genotype to genotype interactions below)

  • genotype_interactions_with_phenotype: a list of genotype to genotype interactions including details about single allele phenotype and extension (for example the rescued phenotype)

Not all of these fields have a value for all annotation types. These fields will always be present:

  • creation_date
  • curator
  • one of gene or genotype
  • publication
  • term (for ontology annotations) or interacting_genes (for interactions)
  • type
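
The always-present fields can be checked mechanically. The sketch below is one possible reading of the list above, not the official schema check; missing_fields is a hypothetical helper and the sample annotation uses placeholder values.

```python
ALWAYS_PRESENT = ("creation_date", "curator", "publication", "type")


def missing_fields(annotation):
    """Return a list describing always-present fields that are absent."""
    missing = [name for name in ALWAYS_PRESENT if name not in annotation]
    if "gene" not in annotation and "genotype" not in annotation:
        missing.append("gene or genotype")
    if "term" not in annotation and "interacting_genes" not in annotation:
        missing.append("term or interacting_genes")
    return missing


# Placeholder values, shaped like the phenotype annotation shown earlier.
annotation = {
    "creation_date": "2019-06-11",
    "curator": {"community_curated": False, "name": "Kim Rutherford"},
    "genotype": "<unique_genotype_id>",
    "publication": "<some_pubmed_id>",
    "term": "<some_term_id>",
    "type": "phenotype",
}

print(missing_fields(annotation))
print(missing_fields({"type": "phenotype"}))
```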

Genotype to genotype interactions

Double mutant phenotypes can have associated (inferred) genetic interactions. These are attached to the phenotype annotation in the fields:

  • genotype_interactions_no_phenotype
  • genotype_interactions_with_phenotype

Common fields:

These fields are required in all genotype-genotype interactions:

The alleles from genotype_a and genotype_b are also the two alleles in the double mutant of the phenotype annotation.

Single phenotype fields

If the interaction has details about the phenotype and extension of the single locus phenotype that is rescued, these additional fields are required:

  • genotype_a_phenotype_termid: a term ID (example: "FYPO:0000091")
  • genotype_a_phenotype_extension: the extension for the term, in the same format as the extension field; can be empty ([])

Annotation extension

The extension field of an annotation is a list of lists. Each part of the extension is a relation and a range.

GO annotations will generally use these relations: http://wiki.geneontology.org/index.php/Annotation_usage_examples_for_each_annotation_extension_relation

The possible range types (rangeType) constrain the rangeValue:

  • "Ontology" - an ontology term ID
  • "Gene" - a gene uniquename/ID
  • "Metagenotype" - a metagenotype ID from the metagenotypes section
  • "Text" - a text field for other cases

Each extension part will also have a rangeDisplayName when appropriate.

Examples for GO annotations:

  {
     "rangeDisplayName" : "rsd1",
     "rangeType" : "Gene",
     "rangeValue" : "PomBase:rsd1",
     "relation" : "has_direct_input"
  }
  {
     "rangeDisplayName" : "cellular response to nitrogen starvation",
     "rangeType" : "Ontology",
     "rangeValue" : "GO:0006995",
     "relation" : "exists_during"
  }

Example for a phenotype annotation:

  {
     "rangeDisplayName" : "high",
     "rangeType" : "Ontology",
     "rangeValue" : "FYPO_EXT:0000001",
     "relation" : "has_expressivity"
  }

Extension field structure

The dependent and independent extension parts are written using a list-of-lists structure. The top level list contains independent extensions and the sub-lists hold the dependent parts.

The overview is:

"extension": [
  [
    {some_range_and_relation},
    {another_range_and_relation}
  ],
  [
    {an_independent_range_and_relation},
    ...
  ],
  ...
]

The top level list will be empty if the current annotation has no extension. The sub-lists (if any) must contain at least one element.
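
These structural rules can be checked with a few lines of Python. This is a sketch under assumptions: check_extension is a hypothetical helper, and treating relation, rangeType and rangeValue as required in every part is inferred from the examples on this page rather than stated by it.

```python
RANGE_TYPES = {"Ontology", "Gene", "Metagenotype", "Text"}
# Assumption: every extension part carries these three keys.
REQUIRED_KEYS = ("relation", "rangeType", "rangeValue")


def check_extension(extension):
    """Raise ValueError if the extension list-of-lists is malformed."""
    if not isinstance(extension, list):
        raise ValueError("extension must be a list")
    for group in extension:
        if not isinstance(group, list) or not group:
            raise ValueError("each sub-list needs at least one part")
        for part in group:
            for key in REQUIRED_KEYS:
                if key not in part:
                    raise ValueError(f"extension part lacks {key!r}")
            if part["rangeType"] not in RANGE_TYPES:
                raise ValueError(f"unknown rangeType {part['rangeType']!r}")


valid = [[{
    "rangeDisplayName": "high",
    "rangeType": "Ontology",
    "rangeValue": "FYPO_EXT:0000001",
    "relation": "has_expressivity",
}]]

check_extension([])     # no extension: the top level list is empty
check_extension(valid)  # one independent extension with one part
```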

In the simple case, an extension field with just one part looks like:

"extension": [
  [
    {
     "rangeDisplayName" : "high",
     "rangeType" : "Ontology",
     "rangeValue" : "FYPO_EXT:0000001",
     "relation" : "has_expressivity"
    }
  ]
]

An extension with two dependent parts like:

has_substrate(PomBase:SPATRNAASP.01), happens_during(cellular response to nitrogen starvation)

is represented as:

"extension": [
  [
    {
      "rangeValue": "PomBase:SPATRNAASP.01",
      "rangeType": "Gene",
      "relation": "has_substrate"
    },
    {
      "relation": "happens_during",
      "rangeValue": "GO:0006995",
      "rangeType": "Ontology",
      "rangeDisplayName": "cellular response to nitrogen starvation"
    }
  ]
]

The nested list contains the two dependent parts.

To represent two independent extensions on the same annotation the top level list will contain multiple elements. For example:

has_substrate PomBase:SPATRNAASP.01 , happens_during cellular response to nitrogen starvation |
has_substrate PomBase:SPATRNAASP.02 , happens_during cellular response to nitrogen starvation

is written:

  "extension": [
    [
      {
        "rangeValue": "PomBase:SPATRNAASP.01",
        "rangeType": "Gene",
        "relation": "has_substrate"
      },
      {
        "relation": "happens_during",
        "rangeValue": "GO:0006995",
        "rangeType": "Ontology",
        "rangeDisplayName": "cellular response to nitrogen starvation"
      }
    ],
    [
      {
        "relation": "has_substrate",
        "rangeValue": "PomBase:SPATRNAASP.02",
        "rangeType": "Gene"
      },
      {
        "rangeDisplayName": "cellular response to nitrogen starvation",
        "rangeType": "Ontology",
        "rangeValue": "GO:0006995",
        "relation": "happens_during"
      }
    ]
  ]

(See https://curation.pombase.org/pombe/curs/4a7f9665ed7386e8/ro for an example of this)
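
To make the list-of-lists layout concrete, the sketch below turns an extension back into a text form similar to the one used above: dependent parts joined with commas, independent extensions joined with "|". The helper name format_extension and the choice to prefer rangeDisplayName over rangeValue are illustrative assumptions, not part of Canto.

```python
def format_extension(extension):
    """Render an extension list-of-lists as a single line of text."""
    groups = []
    for group in extension:
        parts = []
        for part in group:
            # Prefer the display name when the export provides one.
            label = part.get("rangeDisplayName", part["rangeValue"])
            parts.append(f'{part["relation"]}({label})')
        groups.append(", ".join(parts))
    return " | ".join(groups)


# The two independent extensions from the example above.
extension = [
    [
        {"rangeValue": "PomBase:SPATRNAASP.01", "rangeType": "Gene",
         "relation": "has_substrate"},
        {"relation": "happens_during", "rangeValue": "GO:0006995",
         "rangeType": "Ontology",
         "rangeDisplayName": "cellular response to nitrogen starvation"},
    ],
    [
        {"relation": "has_substrate", "rangeValue": "PomBase:SPATRNAASP.02",
         "rangeType": "Gene"},
        {"rangeDisplayName": "cellular response to nitrogen starvation",
         "rangeType": "Ontology", "rangeValue": "GO:0006995",
         "relation": "happens_during"},
    ],
]

print(format_extension(extension))
```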

metadata

  • accepted_timestamp: when the session was accepted by the curator
  • annotation_status: the current annotation status, will always be "APPROVED" if the --dump-approved option was passed to the export script
  • annotation_status_datestamp: when the status last changed
  • approval_in_progress_timestamp: when the approval process started
  • approved_timestamp: when the session was approved, may be different from first_approved_timestamp if the session went through the approval process more than once
  • approver_email
  • approver_name: who approved the session
  • canto_session: the 16 character hexadecimal session ID
  • curation_pub_id: the PubMed ID
  • curator_email
  • curator_name: who curated the session
  • curator_role: "community" or the organisation name (eg. "PomBase" or "FlyBase")
  • first_approved_timestamp: when the session was first approved
  • needs_approval_timestamp: when the session was submitted to the curators for approval, might be different to session_first_submitted_timestamp if the session was re-submitted
  • session_created_timestamp: when the session was created, either by the admins or when the user enters a PMID on the front page
  • session_first_submitted_timestamp
  • session_genes_count: number of genes in the session
  • term_suggestion_count: the number of terms that have outstanding term suggestions, should be 0 for approved sessions
  • unknown_conditions_count: the number of conditions in the session that haven't been assigned a condition ontology ID, should be 0 for approved sessions
  • has_community_curation: true if and only if there are any annotations in this session made by a community curator
  • annotation_curators: if the flag --export-curator-names is off this won't be exported. If --export-curator-names is set, this is an array of hashes with the keys:
    • name: the name of the curator (admin or community)
    • orcid: the ORCID of the curator, or null if not known
    • community_curator: true if this curator is a community curator
    • annotation_count: the number of annotations by this curator in this session
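
In the example metadata earlier on this page, timestamps are strings of the form "YYYY-MM-DD HH:MM:SS" and counts are quoted strings. The sketch below assumes that layout holds in general, which this page does not guarantee; the helper names are hypothetical.

```python
from datetime import datetime

# Assumption: all timestamps use the layout seen in the example metadata.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(value):
    """Parse a metadata timestamp string into a datetime."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def approved_more_than_once(metadata):
    """True if the latest approval differs from the first approval."""
    latest = parse_timestamp(metadata["approved_timestamp"])
    first = parse_timestamp(metadata["first_approved_timestamp"])
    return latest != first


def genes_count(metadata):
    """Return session_genes_count as an integer (exported as a string)."""
    return int(metadata["session_genes_count"])


# Values copied from the example metadata on this page.
metadata = {
    "approved_timestamp": "2019-06-12 10:48:22",
    "first_approved_timestamp": "2019-06-12 10:48:22",
    "session_genes_count": "4",
}

print(approved_more_than_once(metadata), genes_count(metadata))
```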