Skip to content

Latest commit

 

History

History
381 lines (295 loc) · 14.2 KB

format-analysis.org

File metadata and controls

381 lines (295 loc) · 14.2 KB

Encoding clones with V(D)J recombinations (2016b)

The following .json format allows to encode a set of clones with V(D)J immune recombinations, possibly with user annotations.

In Vidjil, this format is used by both the .analysis and the .vidjil files. The .vidjil file represents the actual data on clones (and that can reach megabytes, or even more), usually produced by processing reads by some RepSeq software. (for example with detailed information on the 100 or 1000 top clones). The .analysis file describes customizations done by the user (or by some automatic pre-processing) on the Vidjil web application. The web application can load or save such files (and possibly from/to the patient/sample database). It is intended to be very small (a few kilobytes). All settings in the .analysis file override the settings that could be present in the .vidjil file.

What is a clone ?

There are several definitions of what may be a clonotype, depending on different RepSeq software or studies. This format accept any kind of definition: Clones are identified by a id string that may be an arbitrary identifier such as clone-072a. Software computing clones may choose some relevant identifiers:

  • CGAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTAC, Vidjil algorithm, 50 nt window centered on the CDR3
  • CARPRDWNTYYYYGMDVW, a CDR3 AA sequence
  • CARPRDWNTYYYYGMDVW IGHV3-11*00 IGHJ6*00, a CDR3 AA sequence with additional V/J gene information (MiXCR)
  • the ‘clone sequence’ as computed by the ARReST in .clntab files (processed by fuse.py)
  • see also ‘IMGT clonotype (AA) or (nt)’

Examples

.vidjil file – one sample

This is an almost minimal .vidjil file, describing clones in one sample. The seg element is optional: clones without seg elements will be shown on the grid with ‘?/?’. All other elements are required. The reads.germlines list can have only one element the case of data on a unique locus. There is here one clone on the TRG locus with a designation TRGV5*01 5/CC/0 TRGJ1*02. Note that other elements could be added by some program (such as tag, to identify some clones, or clusters, to further cluster some clones, see below).

  {
      "producer": "program xyz version xyz",
      "timestamp": "2014-10-01 12:00:11",
      "vidjil_json_version": "2016b",

      "samples": {
           "number": 1, 
           "original_names": ["T8045-BC081-Diag.fastq"]
      },

      "reads" : {
          "total" :           [ 437164 ] ,
          "segmented" :       [ 335662 ] ,
          "germline" : {
              "TRG" :         [ 250000 ] ,
              "IGH" :         [ 85662  ]
          }
      },

      "clones": [
          {
              "id": "clone-001",
              "sequence": "CTCATACACCCAGGAGGTGGAGCTGGATATTGATACTACGAAATCTAATTGAAAATGATTCTGGGGTCTATTACTGTGCCACCTGGGCCTTATTATAAGAAACTCTTTGGCAGTGGAAC",
		"reads" : [ 243241 ],
              "germline": "TRG",
              "top": 1,
              "seg":
              {
		    "5": {"name": "TRGV5*01",  "start": 1,   "stop": 86},
		    "3": {"name": "TRGJ1*02",  "start": 89,  "stop": 118},
                  "cdr3": { "start": 77, "stop": 104, "seq": "gccacctgggccttattataagaaactc" }
		}

          }
      ]
  }

.vidjil file – several related samples

This a .vidjil file obtained by merging with fuse.py two .vidjil files corresponding to two samples. Clones that have a same id are gathered (see ‘What is a clone?’, above). It is the responsibility of the program generating the initial .vidjil files to choose these id to do a correct gathering.

  {
      "producer": "program xyz version xyz / fuse.py version xyz",
      "timestamp": "2014-10-01 14:00:11",
      "vidjil_json_version": "2016b",

      "samples": {
           "number": 2, 
           "original_names": ["T8045-BC081-Diag.fastq", "T8045-BC082-fu1.fastq"]
      },

      "reads" : {
          "total" :           [ 437164, 457810 ] ,
          "segmented" :       [ 335662, 410124 ] ,
          "germline" : {
              "TRG" :         [ 250000, 300000 ] ,
              "IGH" :         [ 85662,   10124 ]
          }
      },

      "clones": [
          {
              "id": "clone-001",
              "sequence": "CTCATACACCCAGGAGGTGGAGCTGGATATTGATACTACGAAATCTAATTGAAAATGATTCTGGGGTCTATTACTGTGCCACCTGGGCCTTATTATAAGAAACTCTTTGGCAGTGGAAC",
		"reads" : [ 243241, 14717 ],
              "germline": "TRG",
              "top": 1,
              "seg":
              {
		    "5": {"name": "TRGV5*01",  "start": 1,   "stop": 86},
		    "3": {"name": "TRGJ1*02",  "start": 89,  "stop": 118}
		}
          },
          {
              "id": "clone2",
              "sequence": "GATACA",
              "reads": [ 153, 10221 ],
              "germline": "TRG",
              "top": 2
          },
          {
              "id": "clone3",
              "sequence": "ATACAGA",
              "reads": [ 521, 42 ],
              "germline": "TRG",
              "top": 3,
              "seg":
              {
                  "5": {"start": 1, "stop": 100},
                  "3": {"start": 101, "stop": 200}
              }
          }
      ]
  }

.analysis file

This file reflects the annotations a user could have done within the Vidjil web application or some other tool. She has manually set sample names (names), tagged (tag, tags), named (name) and clustered (clusters) some clones, and added external data (data).

{
    "producer": "user Bob, via Vidjil webapp",
    "timestamp": "2014-10-01 12:00:11",
    "vidjil_json_version": "2016b",

    "samples": {
         "number": 2, 
         "names": ["diag", "fu1"],
         "original_names": ["file1.fastq", "file2.fastq"],
         "order": [1, 0]
    },

    "clones": [
        {
            "id": "clone-001",
            "name": "Main ALL clone",
            "tag": "0"
        },
        {
            "id": "spikeE",
            "label": "spike",
            "sequence": "ATGACTCTGGAGTCTATTACTGTGCCACCTGGGATGTGAGTATTATAAGAAAC",
            "tag": "3",
            "expected": "0.1"
        }

    ],

    "clusters": [
        [ "clone2", "clone3"],
        [ "clone-5", "clone-10", "clone-179" ]
    ],

    "data": {
         "qPCR": [0.83, 0.024],
         "spikeZ": [0.01, 0.02]
    },

    "tags": {
        "names": {
            "0" : "main clone",
            "3" : "spike",
            "5" : "custom tag"
        },
        "hide": [4, 5]
    }
}

The order field defines the order in which order the points should be considered. In that case we should first consider the second point (whose name is fu1) and the point to be considered in second should be the first one in the file (whose name is diag).

The clusters field indicate clones (by their id) that have been further clustered. Usually, these clones were defined in a related .vidjil file (as clone2 and clone3, see the .vidjil file in the previous section). If these clones do not exist, the clusters are just ignored. The first item of the cluster is considered as the representative clone of the cluster.

Detailed specification

Generic information for traceability [required]

"producer": "my-repseq-software -z -k (v. 123)",    // arbitrary string, user/software/version/options producing this file [required]
"timestamp": "2014-10-01 12:00:11",                 // last modification date [required]
"vidjil_json_version": "2016b",                     // version of the .json format  [required]

Statistics: the reads element [.vidjil only, required]

The number of analyzed reads (segmented) may be higher than the sum of the read number of all clones, when one choose to report only the ‘top’ clones (-t option for fuse).

{
    "total" : [],          // total number of reads per sample (with samples.number elements)
    "segmented" : [],      // number of analyzed/segmented reads per sample (with samples.number elements)
    "germline" : {         // number of analyzed/segmented reads per sample/germline (with samples.number elements)
        "TRG" : [],
        "IGH" : []
    }
}

samples element [required]

{
  "number": 2,      // number of samples [required]

  "original_names": [],  // original sample names (with samples.number elements) [required]

  "names": [],      // custom sample names (with samples.number elements) [optional]
                    // These names are editable and will be used on the graphs

  "order": [],      // custom sample order (lexicographic order by default) [optional]


  // traceability on each sample (with sample.number elements)
  "producer": [],
  "timestamp": [],
  "log": []
}

clones list, with read count, tags, V(D)J designation and other sequence features

Each element in the clones list describes properties of a clone.

In a .vidjil file, this is the main part, describing all clones. In the .analysis file, this section is intended to describe some specific clones.

 {
   "id": "",        // clone identifier, must be unique [required] [see above, 'What is a clone ?']
                    // the clone identifier in the .vidjil file and in .analysis file must match

   "germline": ""   // [required for .vidjil]
                    // (should match a germline defined in germline/germline.data)

   "name": "",      // clone custom name [optional]
                    // (the default name, in .vidjil, is computed from V/D/J information)

   "label": "",     // clone labels, separed by spaces [optional]
                    // These labels may add some information entered with a controled vocabulary

   "sequence": "",  // reference nt sequence [required for .vidjil]
                    // (for .analysis, not really used now in the web application,
                    //  for special clones/sequences that are known,
                    //  such as standard/spikes or know patient clones)

   "tag": "",       // tag id from 0 to 7 (see below) [optional]

   "expected": ""   // expected abundance of this clone (between 0 and 1) [optional]
                    // this will create a normalization option in the 
                    // settings web application menu

   "seg":           // detailed V(D)J designation/segmentation and other sequences features or values [optional]
                    // on the web application, clones that are not segmented will be shown on the grid with '?/?'
                    // positions are related to the 'sequence'
                    // names of V/D/J genes should match the ones in files referenced in germline/germline.data
                    // Positions on the sequence start at 1.
     {
        "5": {"name": "IGHV5*01", "start": 1, "stop": 120},    // V (or 5') segment
        "4": {"name": "IGHD1*01", "start": 124, "stop": 135, "info": "unsure designation"},  // D (or middle) segment
                    // Recombination with several D may use "4a", "4b"...
        "3": {"name": "IGHJ3*02", "start": 136, "stop": 171},  // J (or 3') segment

                    // any feature to be highlighted in the sequence, with optional fields related to this feature:
                    //  - "start"/"stop" : positions on the clone sequence (starting at 1)
                    //  - "seq" : a sequence
                    //  - "val" : a numerical value
                    //  - "info" : a textual vlaue
                    //
                    // JUNCTION//CDR3 should be stored that way (in fields called "junction" of "cdr3"),
                    // its productivity must be stored in a boolean field called "productive".
        "somefeature": { "start": 56, "stop": 61, "seq": "ACTGTA", "val": 145.7, "info": "analyzed with xyz" },

                    // Numerical or textual features concerning all the sequence or its analysis (such as 'evalue')
                    // can be provided by omitting "start" and "stop" elements.
        "someotherfeature": {"val": 0.004521},
        "anotherfeature": {"info": "VH CDR3 stereotypy"},
     }


   "reads": [],      // number of reads in this clones [.vidjil only, required] 
                     // (with samples.number elements)

   "top": 0,         // (not documented now) [required] threshold to display/hide the clone
   "stats": []       // (not documented now) [.vidjil only] (with sample.number elements)


}

germlines list [optional][work in progress, to be documented]

extend the germline.data default file with a custom germline

"germlines" : {
    "custom" : {
        "shortcut": "B",
        "5": ["TRBV.fa"],
        "4": ["TRBD.fa"],
        "3": ["TRBJ.fa"]
    }
}

Further clustering of clones: the clusters list [optional]

Each element in the ‘clusters’ list describe a list of clones that are ‘merged’. In the web application, it will be still possible to see them or to unmerge them. The first clone of each line is used as a representative for the cluster.

data list [optional][work in progress, to be documented]

Each element in the data list is a list of values (of size samples.number) showing additional data for each sample, as for example qPCR levels or spike information.

In the browser, it will be possible to display these data and to normalize against them (not implemented now).

Tagging some clones: tags list [optional]

The tags list describe the custom tag names as well as tags that should be hidden by default. The default tag names are defined in ../browser/js/vidjil-style.js.

"key" : "value"  // "key" is the tag id from 0 to 7 and "value" is the custom tag name attributed