diff --git a/docs/categories/index.xml b/docs/categories/index.xml index 02b45be..4b26c88 100644 --- a/docs/categories/index.xml +++ b/docs/categories/index.xml @@ -1,11 +1,11 @@ - Categories on - /categories/ - Recent content in Categories on + Categories on Sifter + https://bmeg.github.io/sifter/categories/ + Recent content in Categories on Sifter Hugo -- gohugo.io - en - + en-us + diff --git a/docs/docs/example/index.html b/docs/docs/example/index.html index a489c50..552309e 100644 --- a/docs/docs/example/index.html +++ b/docs/docs/example/index.html @@ -57,13 +57,6 @@ >Overview -
  • - - Sifter Pipeline File
  • - -
  • @@ -245,6 +238,15 @@ +
  • + +
  • +
  • +
  • + +
  • + +
  • + +
  • + diff --git a/docs/docs/index.html b/docs/docs/index.html index 5bc6ab4..6a0af11 100644 --- a/docs/docs/index.html +++ b/docs/docs/index.html @@ -57,13 +57,6 @@ >Overview -
  • - - Sifter Pipeline File
  • - -
  • @@ -245,6 +238,15 @@ +
  • + +
  • +
  • +
  • + +
  • + +
  • + +
  • + @@ -345,15 +365,15 @@

    Sifter pipelines

Sifter pipelines process streams of nested JSON messages. Sifter comes with a number of file extractors that operate as inputs to these pipelines. The pipeline engine -connects togeather arrays of transform steps into direct acylic graph that is processed +connects together arrays of transform steps into a directed acyclic graph that is processed in parallel.

    Example Message:

    -
    {
    -  "firstName" : "bob",
    -  "age" : "25"
    -  "friends" : [ "Max", "Alex"]
    -}
    -

Once a stream of messages is produced, it can be run through a transform +

    {
    +  "firstName" : "bob",
    +  "age" : "25"
    +  "friends" : [ "Max", "Alex"]
    +}
    +

Once a stream of messages is produced, it can be run through a transform pipeline. A transform pipeline is an array of transform steps; each transform step can represent a different way to alter the data. The array of transforms links together into a pipe that makes multiple alterations to messages as they are @@ -366,7 +386,132 @@
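The idea of chaining transform steps into a pipe can be sketched as plain Python generators. This is an illustrative sketch only, not Sifter's actual implementation; the step names `project` and `filter_rows` are hypothetical stand-ins for real transform steps.

```python
# Conceptual sketch: a transform pipeline as chained generator steps,
# each altering messages as they stream through.
def project(stream, mapping):
    # Add fields to each message (simplified: static values, no templating).
    for row in stream:
        row.update(mapping)
        yield row

def filter_rows(stream, field, match):
    # Pass through only messages whose field equals the match value.
    for row in stream:
        if row.get(field) == match:
            yield row

def run_pipeline(messages, steps):
    # Chain the steps: the output stream of one feeds the next.
    stream = iter(messages)
    for step in steps:
        stream = step(stream)
    return list(stream)

messages = [
    {"firstName": "bob", "age": "25"},
    {"firstName": "alice", "age": "30"},
]
out = run_pipeline(messages, [
    lambda s: filter_rows(s, "firstName", "bob"),
    lambda s: project(s, {"type": "person"}),
])
```

Because each step is lazy, messages flow through the whole chain one at a time rather than being materialized between steps.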

    Sifter pipelines

  • Table based field translation
Outputting the message as a JSON Schema checked object
  • - +

    Script structure

    +

    Pipeline File

    +

A sifter pipeline file is in YAML format and describes an entire processing pipeline. +It is composed of the following sections: config, inputs, pipelines, outputs. In addition, +for tracking, the file will also include name and class entries.

    +
    
    +class: sifter
    +name: <script name>
    +outdir: <where output files should go, relative to this file>
    +
    +config:
    +  <config key>: <config value>
    +  <config key>: <config value> 
    +  # values that are referenced in pipeline parameters for 
    +  # files will be treated like file paths and be 
    +  # translated to full paths
    +
    +inputs:
    +  <input name>:
    +    <input driver>:
    +      <driver config>
    +
    +pipelines:
    +  <pipeline name>:
    +    # all pipelines must start with a from step
    +    - from: <name of input or pipeline> 
    +    - <transform name>:
    +       <transform parameters>
    +
    +outputs:
    +  <output name>:
    +    <output driver>:
    +      <driver config>
    +
    +

Each sifter file starts with a set of fields to let the software know this is a sifter script, and not some random YAML file. There is also a name field for the script. This name will be used for output file creation and logging. Finally, there is an outdir that defines the directory where all output files will be placed. All paths are relative to the script file, so an outdir set to my-results will create the directory my-results in the same directory as the script file, regardless of where the sifter command is invoked.

    +
class: sifter
    +name: <name of script>
    +outdir: <where files should be stored>
    +

    Config and templating

    +

    The config section is a set of defined keys that are used throughout the rest of the script.

    +

    Example config:

    +
    config:
    +  sqlite:  ../../source/chembl/chembl_33/chembl_33_sqlite/chembl_33.db
    +  uniprot2ensembl: ../../tables/uniprot2ensembl.tsv
    +  schema: ../../schema/
    +

Various fields in the script file will be parsed using a Mustache template engine. For example, to access values within the config block, use the template {{config.sqlite}}.
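The lookup behind a reference like {{config.sqlite}} can be sketched in a few lines. This is a minimal illustration of dotted-path resolution, not Sifter's real template engine, which is far more capable.

```python
import re

# Minimal sketch: resolve {{a.b.c}} references against a nested dict by
# walking the dotted path one key at a time.
def render(template, context):
    def lookup(match):
        value = context
        for part in match.group(1).strip().split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{(.*?)\}\}", lookup, template)

context = {"config": {"sqlite": "../../source/chembl/chembl_33.db"}}
resolved = render("input: {{config.sqlite}}", context)
```

Here `resolved` becomes the literal path from the config block, which is how config values flow into input and transform parameters.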

    +

    Inputs

    +

    The input block defines the various data extractors that will be used to open resources and create streams of JSON messages for processing. The possible input engines include:

    + +

    For any other file types, there is also a plugin option to allow the user to call their own code for opening files.

    +

    Pipeline

    +

The pipelines section defines a set of named processing pipelines that can be used to transform data. Each pipeline starts with a from statement that defines where its data comes from. It then defines a linear set of transforms that are chained together to do the processing. Pipelines may use emit steps to output messages to disk. The possible data transform steps include:

    + +

Additionally, users are able to define their own transform step types using the plugin step.

    +

    Example script

    +
    class: sifter
    +
    +name: go
    +outdir: ../../output/go/
    +
    +config:
    +  oboFile: ../../source/go/go.obo
    +  schema: ../../schema
    +
    +inputs:
    +  oboData:
    +    plugin:
    +      commandLine: ../../util/obo_reader.py {{config.oboFile}}
    +
    +pipelines:
    +  transform:
    +    - from: oboData
    +    - project:
    +        mapping:
    +          submitter_id: "{{row.id[0]}}"
    +          case_id: "{{row.id[0]}}"
    +          id: "{{row.id[0]}}"
    +          go_id: "{{row.id[0]}}"
    +          project_id: "gene_onotology"
    +          namespace: "{{row.namespace[0]}}"
    +          name: "{{row.name[0]}}"
    +    - map: 
    +        method: fix
    +        gpython: | 
    +          def fix(row):
    +            row['definition'] = row['def'][0].strip('"')
    +            if 'xref' not in row:
    +              row['xref'] = []
    +            if 'synonym' not in row:
    +              row['synonym'] = []
    +            return row
    +    - objectValidate:
    +        title: GeneOntologyTerm
    +        schema: "{{config.schema}}"
    +    - emit:
    +        name: term
    +
    diff --git a/docs/docs/index.xml b/docs/docs/index.xml index 36e5687..fd78822 100644 --- a/docs/docs/index.xml +++ b/docs/docs/index.xml @@ -5,323 +5,252 @@ https://bmeg.github.io/sifter/docs/ Recent content in Docs on Sifter Hugo -- gohugo.io - en-us + en-us + accumulate https://bmeg.github.io/sifter/docs/transforms/accumulate/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/accumulate/ - accumulate Gather sequential rows into a single record, based on matching a field -Parameters name Type Description field string (field path) Field used to match rows dest string field to store accumulated records Example - accumulate: field: model_id dest: rows + accumulate Gather sequential rows into a single record, based on matching a field Parameters name Type Description field string (field path) Field used to match rows dest string field to store accumulated records Example - accumulate: field: model_id dest: rows - avroLoad https://bmeg.github.io/sifter/docs/inputs/avroload/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/inputs/avroload/ - avroLoad Load an AvroFile -Parameters name Description input Path to input file + avroLoad Load an AvroFile Parameters name Description input Path to input file - clean https://bmeg.github.io/sifter/docs/transforms/clean/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/clean/ - clean Remove fields that don&rsquo;t appear in the desingated list. -Parameters name Type Description fields [] string Fields to keep removeEmpty bool Fields with empty values will also be removed storeExtra string Field name to store removed fields Example - clean: fields: - id - synonyms + clean Remove fields that don&rsquo;t appear in the desingated list. 
Parameters name Type Description fields [] string Fields to keep removeEmpty bool Fields with empty values will also be removed storeExtra string Field name to store removed fields Example - clean: fields: - id - synonyms - debug https://bmeg.github.io/sifter/docs/transforms/debug/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/debug/ - debug Print out copy of stream to logging -Parameters name Type Description label string Label for log output format bool Use multiline spaced output Example - debug: {} + debug Print out copy of stream to logging Parameters name Type Description label string Label for log output format bool Use multiline spaced output Example - debug: {} - distinct https://bmeg.github.io/sifter/docs/transforms/distinct/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/distinct/ - distinct Using templated value, allow only the first record for each distinct key -Parameters name Type Description value string Key used for distinct value Example - distinct: value: &#34;{{row.key}}&#34; + distinct Using templated value, allow only the first record for each distinct key Parameters name Type Description value string Key used for distinct value Example - distinct: value: &#34;{{row.key}}&#34; - embedded https://bmeg.github.io/sifter/docs/inputs/embedded/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/inputs/embedded/ - embedded Load data from embedded structure -Example inputs: data: embedded: - { &#34;name&#34; : &#34;Alice&#34;, &#34;age&#34;: 28 } - { &#34;name&#34; : &#34;Bob&#34;, &#34;age&#34;: 27 } + embedded Load data from embedded structure Example inputs: data: embedded: - { &#34;name&#34; : &#34;Alice&#34;, &#34;age&#34;: 28 } - { &#34;name&#34; : &#34;Bob&#34;, &#34;age&#34;: 27 } - emit https://bmeg.github.io/sifter/docs/transforms/emit/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/emit/ - emit Send data to output file. 
The naming of the file is outdir/script name.pipeline name.emit name.json.gz -Parameters name Type Description name string Name of emit value example - emit: name: protein_compound_association + emit Send data to output file. The naming of the file is outdir/script name.pipeline name.emit name.json.gz Parameters name Type Description name string Name of emit value example - emit: name: protein_compound_association - Example https://bmeg.github.io/sifter/docs/example/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/example/ - Example Pipeline Our first task will be to convert a ZIP code TSV into a set of county level entries. -The input file looks like: -ZIP,COUNTYNAME,STATE,STCOUNTYFP,CLASSFP 36003,Autauga County,AL,01001,H1 36006,Autauga County,AL,01001,H1 36067,Autauga County,AL,01001,H1 36066,Autauga County,AL,01001,H1 36703,Autauga County,AL,01001,H1 36701,Autauga County,AL,01001,H1 36091,Autauga County,AL,01001,H1 First is the header of the pipeline. This declares the unique name of the pipeline and it&rsquo;s output directory. -name: zipcode_map outdir: ./ docs: Converts zipcode TSV into graph elements Next the configuration is declared. + Example Pipeline Our first task will be to convert a ZIP code TSV into a set of county level entries. The input file looks like: ZIP,COUNTYNAME,STATE,STCOUNTYFP,CLASSFP 36003,Autauga County,AL,01001,H1 36006,Autauga County,AL,01001,H1 36067,Autauga County,AL,01001,H1 36066,Autauga County,AL,01001,H1 36703,Autauga County,AL,01001,H1 36701,Autauga County,AL,01001,H1 36091,Autauga County,AL,01001,H1 First is the header of the pipeline. This declares the unique name of the pipeline and it&rsquo;s output directory. name: zipcode_map outdir: ./ docs: Converts zipcode TSV into graph elements Next the configuration is declared. 
- fieldParse https://bmeg.github.io/sifter/docs/transforms/fieldparse/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/fieldparse/ - fieldProcess https://bmeg.github.io/sifter/docs/transforms/fieldprocess/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/fieldprocess/ - fieldProcess Create stream of objects based on the contents of a field. If the selected field is an array each of the items in the array will become an independent row. -Parameters name Type Description field string Name of field to be processed mapping map[string]string Project templated values into child element itemField string If processing an array of non-dict elements, create a dict as {itemField:element} example - fieldProcess: field: portions mapping: sample: &#34;{{row. + fieldProcess Create stream of objects based on the contents of a field. If the selected field is an array each of the items in the array will become an independent row. Parameters name Type Description field string Name of field to be processed mapping map[string]string Project templated values into child element itemField string If processing an array of non-dict elements, create a dict as {itemField:element} example - fieldProcess: field: portions mapping: sample: &#34;{{row. 
- fieldType https://bmeg.github.io/sifter/docs/transforms/fieldtype/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/fieldtype/ - fieldType Set field to specific type, ie cast as float or integer -example - fieldType: t_depth: int t_ref_count: int t_alt_count: int n_depth: int n_ref_count: int n_alt_count: int start: int + fieldType Set field to specific type, ie cast as float or integer example - fieldType: t_depth: int t_ref_count: int t_alt_count: int n_depth: int n_ref_count: int n_alt_count: int start: int - filter https://bmeg.github.io/sifter/docs/transforms/filter/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/filter/ - filter Filter rows in stream using a number of different methods -Parameters name Type Description field string (field path) Field used to match rows value string (template string) Template string to match against match string String to match against check string How to check value, &rsquo;exists&rsquo; or &lsquo;hasValue&rsquo; method string Method name python string Python code string gpython string Python code string run using (https://github.com/go-python/gpython) Example Field based match -- filter: field: table match: source_statistics Check based match + filter Filter rows in stream using a number of different methods Parameters name Type Description field string (field path) Field used to match rows value string (template string) Template string to match against match string String to match against check string How to check value, &rsquo;exists&rsquo; or &lsquo;hasValue&rsquo; method string Method name python string Python code string gpython string Python code string run using (https://github.com/go-python/gpython) Example Field based match - filter: field: table match: source_statistics Check based match + + + flatMap + https://bmeg.github.io/sifter/docs/transforms/flatmap/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/flatmap/ + - 
from https://bmeg.github.io/sifter/docs/transforms/from/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/from/ - from Parmeters Name of data source -Example inputs: profileReader: tableLoad: input: &#34;{{config.profiles}}&#34; pipelines: profileProcess: - from: profileReader + from Parmeters Name of data source Example inputs: profileReader: tableLoad: input: &#34;{{config.profiles}}&#34; pipelines: profileProcess: - from: profileReader - glob https://bmeg.github.io/sifter/docs/inputs/glob/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/inputs/glob/ - glob Scan files using * based glob statement and open all files as input. -Parameters Name Description storeFilename Store value of filename in parameter each row input Path of avro object file to transform xmlLoad xmlLoad configutation tableLoad Run transform pipeline on a TSV or CSV jsonLoad Run a transform pipeline on a multi line json file avroLoad Load data from avro file Example inputs: pubmedRead: glob: input: &#34;{{config.baseline}}/*.xml.gz&#34; xmlLoad: {} + glob Scan files using * based glob statement and open all files as input. Parameters Name Description storeFilename Store value of filename in parameter each row input Path of avro object file to transform xmlLoad xmlLoad configutation tableLoad Run transform pipeline on a TSV or CSV jsonLoad Run a transform pipeline on a multi line json file avroLoad Load data from avro file Example inputs: pubmedRead: glob: input: &#34;{{config.baseline}}/*.xml.gz&#34; xmlLoad: {} - graphBuild https://bmeg.github.io/sifter/docs/transforms/graphbuild/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/graphbuild/ - graphBuild Build graph elements from JSON objects using the JSON Schema graph extensions. -example - graphBuild: schema: &#34;{{config.allelesSchema}}&#34; title: Allele + graphBuild Build graph elements from JSON objects using the JSON Schema graph extensions. 
example - graphBuild: schema: &#34;{{config.allelesSchema}}&#34; title: Allele - - - gripperLoad - https://bmeg.github.io/sifter/docs/inputs/gripperload/ - Mon, 01 Jan 0001 00:00:00 +0000 - - https://bmeg.github.io/sifter/docs/inputs/gripperload/ - - - hash https://bmeg.github.io/sifter/docs/transforms/hash/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/hash/ hash Parameters name Type Description field string Field to store hash value value string Templated string of value to be hashed method string Hashing method: sha1/sha256/md5 example - hash: value: &#34;{{row.contents}}&#34; field: contents-sha1 method: sha1 - Inputs https://bmeg.github.io/sifter/docs/inputs/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/inputs/ Every playbook consists of a series of inputs. - jsonLoad https://bmeg.github.io/sifter/docs/inputs/jsonload/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/inputs/jsonload/ - jsonLoad Load data from a JSON file. Default behavior expects a single dictionary per line. Each line is a seperate entry. The multiline parameter reads all of the lines of the files and returns a single object. -Parameters name Description input Path of JSON file to transform multiline Load file as a single multiline JSON object Example inputs: caseData: jsonLoad: input: &#34;{{config.casesJSON}}&#34; + jsonLoad Load data from a JSON file. Default behavior expects a single dictionary per line. Each line is a seperate entry. The multiline parameter reads all of the lines of the files and returns a single object. 
Parameters name Description input Path of JSON file to transform multiline Load file as a single multiline JSON object Example inputs: caseData: jsonLoad: input: &#34;{{config.casesJSON}}&#34; - lookup https://bmeg.github.io/sifter/docs/transforms/lookup/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/lookup/ - lookup Using key from current row, get values from a reference source -Parameters name Type Description replace string (field path) Field to replace lookup string (template string) Key to use for looking up data copy map[string]string Copy values from record that was found by lookup. The Key/Value record uses the Key as the destination field and copies the field from the retrieved records using the field named in Value tsv TSVTable TSV translation table file json JSONTable JSON data file table LookupTable Inline lookup table pipeline PipelineLookup Use output of a pipeline as a lookup table Example JSON file based lookup The JSON file defined by config. + lookup Using key from current row, get values from a reference source Parameters name Type Description replace string (field path) Field to replace lookup string (template string) Key to use for looking up data copy map[string]string Copy values from record that was found by lookup. The Key/Value record uses the Key as the destination field and copies the field from the retrieved records using the field named in Value tsv TSVTable TSV translation table file json JSONTable JSON data file table LookupTable Inline lookup table pipeline PipelineLookup Use output of a pipeline as a lookup table Example JSON file based lookup The JSON file defined by config. 
- map https://bmeg.github.io/sifter/docs/transforms/map/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/map/ - map Run function on every row -Parameters name Description method Name of function to call python Python code to be run gpython Python code to be run using GPython Example - map: method: response gpython: | def response(x): s = sorted(x[&#34;curve&#34;].items(), key=lambda x:float(x[0])) x[&#39;dose_um&#39;] = [] x[&#39;response&#39;] = [] for d, r in s: try: dn = float(d) rn = float(r) x[&#39;dose_um&#39;].append(dn) x[&#39;response&#39;].append(rn) except ValueError: pass return x + map Run function on every row Parameters name Description method Name of function to call python Python code to be run gpython Python code to be run using GPython Example - map: method: response gpython: | def response(x): s = sorted(x[&#34;curve&#34;].items(), key=lambda x:float(x[0])) x[&#39;dose_um&#39;] = [] x[&#39;response&#39;] = [] for d, r in s: try: dn = float(d) rn = float(r) x[&#39;dose_um&#39;].append(dn) x[&#39;response&#39;].append(rn) except ValueError: pass return x - objectValidate https://bmeg.github.io/sifter/docs/transforms/objectvalidate/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/objectvalidate/ - objectValidate Use JSON schema to validate row contents -parameters name Type Description title string Title of object to use for validation schema string Path to JSON schema definition example - objectValidate: title: Aliquot schema: &#34;{{config.schema}}&#34; + objectValidate Use JSON schema to validate row contents parameters name Type Description title string Title of object to use for validation schema string Path to JSON schema definition example - objectValidate: title: Aliquot schema: &#34;{{config.schema}}&#34; - Pipeline Steps https://bmeg.github.io/sifter/docs/transforms/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/ Transforms alter the data - + 
+ plugin + https://bmeg.github.io/sifter/docs/inputs/plugin/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/inputs/plugin/ + plugin Run user program for customized data extraction. Example inputs: oboData: plugin: commandLine: ../../util/obo_reader.py {{config.oboFile}} The plugin program is expected to output JSON messages, one per line, to STDOUT that will then be passed to the transform pipelines. Example Plugin The obo_reader.py plugin, it reads a OBO file, such as the kind the describe the GeneOntology, and emits the records as single line JSON messages. #!/usr/bin/env python import re import sys import json re_section = re. + + + plugin + https://bmeg.github.io/sifter/docs/transforms/plugin/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/plugin/ + + project https://bmeg.github.io/sifter/docs/transforms/project/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/project/ - project Populate row with templated values -parameters name Type Description mapping map[string]any New fields to be generated from template rename map[string]string Rename field (no template engine) Example - project: mapping: type: sample id: &#34;{{row.sample_id}}&#34; + project Populate row with templated values parameters name Type Description mapping map[string]any New fields to be generated from template rename map[string]string Rename field (no template engine) Example - project: mapping: type: sample id: &#34;{{row.sample_id}}&#34; - reduce https://bmeg.github.io/sifter/docs/transforms/reduce/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/reduce/ - reduce Using key from rows, reduce matched records into a single entry -Parameters name Type Description field string (field path) Field used to match rows method string Method name python string Python code string gpython string Python code string run using (https://github.com/go-python/gpython) init map[string]any 
Data to use for first reduce Example - reduce: field: dataset_name method: merge init: { &#34;compounds&#34; : [] } gpython: | def merge(x,y): x[&#34;compounds&#34;] = list(set(y[&#34;compounds&#34;]+x[&#34;compounds&#34;])) return x + reduce Using key from rows, reduce matched records into a single entry Parameters name Type Description field string (field path) Field used to match rows method string Method name python string Python code string gpython string Python code string run using (https://github.com/go-python/gpython) init map[string]any Data to use for first reduce Example - reduce: field: dataset_name method: merge init: { &#34;compounds&#34; : [] } gpython: | def merge(x,y): x[&#34;compounds&#34;] = list(set(y[&#34;compounds&#34;]+x[&#34;compounds&#34;])) return x - regexReplace https://bmeg.github.io/sifter/docs/transforms/regexreplace/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/regexreplace/ - - - Sifter Pipeline File - https://bmeg.github.io/sifter/docs/playbook/ - Mon, 01 Jan 0001 00:00:00 +0000 - - https://bmeg.github.io/sifter/docs/playbook/ - Pipeline File An sifter pipeline file is in YAML format and describes an entire processing pipelines. If is composed of the following sections: config, inputs, pipelines, outputs. In addition, for tracking, the file will also include name and class entries. 
-class: sifter name: &lt;script name&gt; outdir: &lt;where output files should go, relative to this file&gt; config: &lt;config key&gt;: &lt;config value&gt; &lt;config key&gt;: &lt;config value&gt; # values that are referenced in pipeline parameters for # files will be treated like file paths and be # translated to full paths inputs: &lt;input name&gt;: &lt;input driver&gt;: &lt;driver config&gt; pipelines: &lt;pipeline name&gt;: # all pipelines must start with a from step - from: &lt;name of input or pipeline&gt; - &lt;transform name&gt;: &lt;transform parameters&gt; outputs: &lt;output name&gt;: &lt;output driver&gt;: &lt;driver config&gt; - - split https://bmeg.github.io/sifter/docs/transforms/split/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/transforms/split/ - split Split a field using string sep -Parameters name Type Description field string Field to the split sep string String to use for splitting Example - split: field: methods sep: &#34;;&#34; + split Split a field using string sep Parameters name Type Description field string Field to the split sep string String to use for splitting Example - split: field: methods sep: &#34;;&#34; - sqldump https://bmeg.github.io/sifter/docs/inputs/sqldump/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/inputs/sqldump/ - sqlDump Scan file produced produced from sqldump. -Parameters Name Type Description input string Path to the SQL dump file tables []string Names of tables to read out Example inputs: database: sqldumpLoad: input: &#34;{{config.sql}}&#34; tables: - cells - cell_tissues - dose_responses - drugs - drug_annots - experiments - profiles + sqlDump Scan file produced produced from sqldump. 
Parameters Name Type Description input string Path to the SQL dump file tables []string Names of tables to read out Example inputs: database: sqldumpLoad: input: &#34;{{config.sql}}&#34; tables: - cells - cell_tissues - dose_responses - drugs - drug_annots - experiments - profiles - sqliteLoad https://bmeg.github.io/sifter/docs/inputs/sqliteload/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/inputs/sqliteload/ - sqliteLoad Extract data from an sqlite file -Parameters Name Type Description input string Path to the SQLite file query string SQL select statement based input Example inputs: sqlQuery: sqliteLoad: input: &#34;{{config.sqlite}}&#34; query: &#34;select * from drug_mechanism as a LEFT JOIN MECHANISM_REFS as b on a.MEC_ID=b.MEC_ID LEFT JOIN TARGET_COMPONENTS as c on a.TID=c.TID LEFT JOIN COMPONENT_SEQUENCES as d on c.COMPONENT_ID=d.COMPONENT_ID LEFT JOIN MOLECULE_DICTIONARY as e on a.MOLREGNO=e.MOLREGNO&#34; + sqliteLoad Extract data from an sqlite file Parameters Name Type Description input string Path to the SQLite file query string SQL select statement based input Example inputs: sqlQuery: sqliteLoad: input: &#34;{{config.sqlite}}&#34; query: &#34;select * from drug_mechanism as a LEFT JOIN MECHANISM_REFS as b on a.MEC_ID=b.MEC_ID LEFT JOIN TARGET_COMPONENTS as c on a.TID=c.TID LEFT JOIN COMPONENT_SEQUENCES as d on c.COMPONENT_ID=d.COMPONENT_ID LEFT JOIN MOLECULE_DICTIONARY as e on a.MOLREGNO=e.MOLREGNO&#34; - tableLoad https://bmeg.github.io/sifter/docs/inputs/tableload/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/inputs/tableload/ - tableLoad Extract data from tabular file, includiong TSV and CSV files. 
-Parameters Name Type Description input string File to be transformed rowSkip int Number of header rows to skip columns []string Manually set names of columns extraColumns string Columns beyond originally declared columns will be placed in this array sep string Separator \t for TSVs or , for CSVs Example config: gafFile: ../../source/go/goa_human.gaf.gz inputs: gafLoad: tableLoad: input: &#34;{{config.gafFile}}&#34; columns: - db - id - symbol - qualifier - goID - reference - evidenceCode - from - aspect - name - synonym - objectType - taxon - date - assignedBy - extension - geneProduct + tableLoad Extract data from tabular file, includiong TSV and CSV files. Parameters Name Type Description input string File to be transformed rowSkip int Number of header rows to skip columns []string Manually set names of columns extraColumns string Columns beyond originally declared columns will be placed in this array sep string Separator \t for TSVs or , for CSVs Example config: gafFile: ../../source/go/goa_human.gaf.gz inputs: gafLoad: tableLoad: input: &#34;{{config.gafFile}}&#34; columns: - db - id - symbol - qualifier - goID - reference - evidenceCode - from - aspect - name - synonym - objectType - taxon - date - assignedBy - extension - geneProduct + + + tableWrite + https://bmeg.github.io/sifter/docs/transforms/tablewrite/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/tablewrite/ + + + + uuid + https://bmeg.github.io/sifter/docs/transforms/uuid/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/uuid/ + - xmlLoad https://bmeg.github.io/sifter/docs/inputs/xmlload/ Mon, 01 Jan 0001 00:00:00 +0000 - https://bmeg.github.io/sifter/docs/inputs/xmlload/ - xmlLoad Load an XML file -Parameters name Description input Path to input file Example inputs: loader: xmlLoad: input: &#34;{{config.xmlPath}}&#34; + xmlLoad Load an XML file Parameters name Description input Path to input file Example inputs: loader: xmlLoad: 
input: &#34;{{config.xmlPath}}&#34; - diff --git a/docs/docs/inputs/avroload/index.html b/docs/docs/inputs/avroload/index.html index c6c90e0..19bba10 100644 --- a/docs/docs/inputs/avroload/index.html +++ b/docs/docs/inputs/avroload/index.html @@ -57,13 +57,6 @@ >Overview -
  • - - Sifter Pipeline File
  • - -
  • @@ -245,6 +238,15 @@ +
  • + +
  • +
  • +
  • + +
  • + +
  • + +
  • + diff --git a/docs/docs/inputs/embedded/index.html b/docs/docs/inputs/embedded/index.html index 6219874..63084d3 100644 --- a/docs/docs/inputs/embedded/index.html +++ b/docs/docs/inputs/embedded/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/inputs/glob/index.html b/docs/docs/inputs/glob/index.html index 566ba4c..3bc2c33 100644 --- a/docs/docs/inputs/glob/index.html +++ b/docs/docs/inputs/glob/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/inputs/index.html b/docs/docs/inputs/index.html index 98cd3d4..8290b48 100644 --- a/docs/docs/inputs/index.html +++ b/docs/docs/inputs/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/inputs/jsonload/index.html b/docs/docs/inputs/jsonload/index.html index 3120406..47ad543 100644 --- a/docs/docs/inputs/jsonload/index.html +++ b/docs/docs/inputs/jsonload/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/inputs/plugin/index.html b/docs/docs/inputs/plugin/index.html new file mode 100644 index 0000000..3bd574a --- /dev/null +++ b/docs/docs/inputs/plugin/index.html @@ -0,0 +1,428 @@ + + + + + + + + + + + plugin · Sifter + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +

    plugin

    +

Run a user program for customized data extraction.

    +

    Example

    +
    inputs:
    +  oboData:
    +    plugin:
    +      commandLine: ../../util/obo_reader.py {{config.oboFile}}
    +

The plugin program is expected to output JSON messages, one per line, to STDOUT; these will then +be passed to the transform pipelines.

    +

    Example Plugin

    +

The obo_reader.py plugin reads an OBO file, such as those that describe the Gene Ontology, and emits the +records as single-line JSON messages.

    +
     #!/usr/bin/env python
    +
    +import re
    +import sys
    +import json
    +
    +re_section = re.compile(r'^\[(.*)\]')
    +re_field = re.compile(r'^(\w+): (.*)$')
    +
    +def obo_parse(handle):
    +    rec = None
    +    for line in handle:
    +        res = re_section.search(line)
    +        if res:
    +            if rec is not None:
    +                yield rec
    +            rec = None
    +            if res.group(1) == "Term":
    +                rec = {"type": res.group(1)}
    +        else:
    +            if rec is not None:
    +                res = re_field.search(line)
    +                if res:
    +                    key = res.group(1)
    +                    val = res.group(2)
    +                    val = re.split(" ! | \(|\)", val)
    +                    val = ":".join(val[0:3])
    +                    if key in rec:
    +                        rec[key].append(val)
    +                    else:
    +                        rec[key] = [val]
    +
    +    if rec is not None:
    +        yield rec
    +
    +
    +def unquote(s):
    +    res = re.search(r'"(.*)"', s)
    +    if res:
    +        return res.group(1)
    +    return s
    +
    +
    +with open(sys.argv[1]) as handle:
    +    for rec in obo_parse(handle):
    +        print(json.dumps(rec))
    +
    +
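To make the one-message-per-line contract concrete, here is a minimal sketch of a plugin core. The `tsv_to_messages` function and its field names are illustrative assumptions, not part of Sifter's API:

```python
import io
import json

def tsv_to_messages(handle):
    # Yield one single-line JSON message per input row, matching the
    # one-message-per-line contract a Sifter plugin is expected to follow.
    for line in handle:
        key, _, value = line.rstrip("\n").partition("\t")
        yield json.dumps({"id": key, "name": value})

# Illustrative two-column input; a real plugin would read the file named
# on its command line (sys.argv), as obo_reader.py does.
sample = io.StringIO("GO:0008150\tbiological_process\n")
messages = list(tsv_to_messages(sample))
```

Printing each yielded string to STDOUT is all Sifter requires of a plugin's output.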
    + +
    + + diff --git a/docs/docs/inputs/sqldump/index.html b/docs/docs/inputs/sqldump/index.html index 99d9ecc..bad87f6 100644 --- a/docs/docs/inputs/sqldump/index.html +++ b/docs/docs/inputs/sqldump/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/inputs/sqliteload/index.html b/docs/docs/inputs/sqliteload/index.html index c3c4c01..bd4cd3c 100644 --- a/docs/docs/inputs/sqliteload/index.html +++ b/docs/docs/inputs/sqliteload/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/inputs/tableload/index.html b/docs/docs/inputs/tableload/index.html index abcf3ab..29d4ff7 100644 --- a/docs/docs/inputs/tableload/index.html +++ b/docs/docs/inputs/tableload/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/inputs/xmlload/index.html b/docs/docs/inputs/xmlload/index.html index d95459c..b935725 100644 --- a/docs/docs/inputs/xmlload/index.html +++ b/docs/docs/inputs/xmlload/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/accumulate/index.html b/docs/docs/transforms/accumulate/index.html index cc10e90..94452f0 100644 --- a/docs/docs/transforms/accumulate/index.html +++ b/docs/docs/transforms/accumulate/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/clean/index.html b/docs/docs/transforms/clean/index.html index a1a7609..1d2b108 100644 --- a/docs/docs/transforms/clean/index.html +++ b/docs/docs/transforms/clean/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/debug/index.html b/docs/docs/transforms/debug/index.html index 84785b0..f3528d1 100644 --- a/docs/docs/transforms/debug/index.html +++ b/docs/docs/transforms/debug/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/distinct/index.html b/docs/docs/transforms/distinct/index.html index 0877ee0..dd6a9cb 100644 --- a/docs/docs/transforms/distinct/index.html +++ b/docs/docs/transforms/distinct/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/emit/index.html b/docs/docs/transforms/emit/index.html index e09e094..4547e42 100644 --- a/docs/docs/transforms/emit/index.html +++ b/docs/docs/transforms/emit/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/fieldparse/index.html b/docs/docs/transforms/fieldparse/index.html index de593a2..8653473 100644 --- a/docs/docs/transforms/fieldparse/index.html +++ b/docs/docs/transforms/fieldparse/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/fieldprocess/index.html b/docs/docs/transforms/fieldprocess/index.html index e740c6b..e2bffa4 100644 --- a/docs/docs/transforms/fieldprocess/index.html +++ b/docs/docs/transforms/fieldprocess/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/fieldtype/index.html b/docs/docs/transforms/fieldtype/index.html index 6b1d126..0f4eebc 100644 --- a/docs/docs/transforms/fieldtype/index.html +++ b/docs/docs/transforms/fieldtype/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/filter/index.html b/docs/docs/transforms/filter/index.html index b1e5156..1aa627e 100644 --- a/docs/docs/transforms/filter/index.html +++ b/docs/docs/transforms/filter/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/flatmap/index.html b/docs/docs/transforms/flatmap/index.html new file mode 100644 index 0000000..29fc1b1 --- /dev/null +++ b/docs/docs/transforms/flatmap/index.html @@ -0,0 +1,370 @@ + + + + + + + + + + + flatMap · Sifter + + + + + + + + + + + + + + + + + + + + + + + +
    + + diff --git a/docs/docs/transforms/from/index.html b/docs/docs/transforms/from/index.html index 5098871..bbdc204 100644 --- a/docs/docs/transforms/from/index.html +++ b/docs/docs/transforms/from/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/graphbuild/index.html b/docs/docs/transforms/graphbuild/index.html index 9b84821..bb491fb 100644 --- a/docs/docs/transforms/graphbuild/index.html +++ b/docs/docs/transforms/graphbuild/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/hash/index.html b/docs/docs/transforms/hash/index.html index 39d1cc9..9672d81 100644 --- a/docs/docs/transforms/hash/index.html +++ b/docs/docs/transforms/hash/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/index.html b/docs/docs/transforms/index.html index dd20f11..7fb9613 100644 --- a/docs/docs/transforms/index.html +++ b/docs/docs/transforms/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/lookup/index.html b/docs/docs/transforms/lookup/index.html index 3225c3b..eb4c6ad 100644 --- a/docs/docs/transforms/lookup/index.html +++ b/docs/docs/transforms/lookup/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/map/index.html b/docs/docs/transforms/map/index.html index 4ccfabd..14f7abb 100644 --- a/docs/docs/transforms/map/index.html +++ b/docs/docs/transforms/map/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/objectvalidate/index.html b/docs/docs/transforms/objectvalidate/index.html index 87e2592..d318599 100644 --- a/docs/docs/transforms/objectvalidate/index.html +++ b/docs/docs/transforms/objectvalidate/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/plugin/index.html b/docs/docs/transforms/plugin/index.html new file mode 100644 index 0000000..dc86b99 --- /dev/null +++ b/docs/docs/transforms/plugin/index.html @@ -0,0 +1,370 @@ + + + + + + + + + + + plugin · Sifter + + + + + + + + + + + + + + + + + + + + + + + +
    + + diff --git a/docs/docs/transforms/project/index.html b/docs/docs/transforms/project/index.html index c5aaaa2..9cb432d 100644 --- a/docs/docs/transforms/project/index.html +++ b/docs/docs/transforms/project/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/reduce/index.html b/docs/docs/transforms/reduce/index.html index 6b90649..378b3b6 100644 --- a/docs/docs/transforms/reduce/index.html +++ b/docs/docs/transforms/reduce/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/regexreplace/index.html b/docs/docs/transforms/regexreplace/index.html index dda8941..a22a736 100644 --- a/docs/docs/transforms/regexreplace/index.html +++ b/docs/docs/transforms/regexreplace/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/split/index.html b/docs/docs/transforms/split/index.html index a65587d..69b03a0 100644 --- a/docs/docs/transforms/split/index.html +++ b/docs/docs/transforms/split/index.html @@ -57,13 +57,6 @@ >Overview -
  • + diff --git a/docs/docs/transforms/tablewrite/index.html b/docs/docs/transforms/tablewrite/index.html new file mode 100644 index 0000000..e937dbc --- /dev/null +++ b/docs/docs/transforms/tablewrite/index.html @@ -0,0 +1,370 @@ + + + + + + + + + + + tableWrite · Sifter + + + + + + + + + + + + + + + + + + + + + + + +
    + + diff --git a/docs/docs/transforms/uuid/index.html b/docs/docs/transforms/uuid/index.html new file mode 100644 index 0000000..a1769da --- /dev/null +++ b/docs/docs/transforms/uuid/index.html @@ -0,0 +1,370 @@ + + + + + + + + + + + uuid · Sifter + + + + + + + + + + + + + + + + + + + + + + + +
    + + diff --git a/docs/index.html b/docs/index.html index 8b58546..120ebf1 100644 --- a/docs/index.html +++ b/docs/index.html @@ -2,7 +2,7 @@ - + @@ -45,6 +45,9 @@ + + +
    @@ -57,10 +60,11 @@

    SIFTER

files and external databases. It includes a pipeline description language to define a set of Transform steps to create object messages that can be validated using a JSON schema.

    +

    Example of sifter code

    - +
diff --git a/docs/index.xml index 86fd22d..b64b623 100644 --- a/docs/index.xml +++ b/docs/index.xml @@ -1,11 +1,263 @@ - - / - Recent content on + Sifter + https://bmeg.github.io/sifter/ + Recent content on Sifter Hugo -- gohugo.io - en - + en-us + + + accumulate + https://bmeg.github.io/sifter/docs/transforms/accumulate/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/accumulate/ + accumulate Gather sequential rows into a single record, based on matching a field Parameters name Type Description field string (field path) Field used to match rows dest string field to store accumulated records Example - accumulate: field: model_id dest: rows + + + avroLoad + https://bmeg.github.io/sifter/docs/inputs/avroload/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/inputs/avroload/ + avroLoad Load an AvroFile Parameters name Description input Path to input file + + + clean + https://bmeg.github.io/sifter/docs/transforms/clean/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/clean/ + clean Remove fields that don&rsquo;t appear in the designated list.
Parameters name Type Description fields [] string Fields to keep removeEmpty bool Fields with empty values will also be removed storeExtra string Field name to store removed fields Example - clean: fields: - id - synonyms + + + debug + https://bmeg.github.io/sifter/docs/transforms/debug/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/debug/ + debug Print out copy of stream to logging Parameters name Type Description label string Label for log output format bool Use multiline spaced output Example - debug: {} + + + distinct + https://bmeg.github.io/sifter/docs/transforms/distinct/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/distinct/ + distinct Using templated value, allow only the first record for each distinct key Parameters name Type Description value string Key used for distinct value Example - distinct: value: &#34;{{row.key}}&#34; + + + embedded + https://bmeg.github.io/sifter/docs/inputs/embedded/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/inputs/embedded/ + embedded Load data from embedded structure Example inputs: data: embedded: - { &#34;name&#34; : &#34;Alice&#34;, &#34;age&#34;: 28 } - { &#34;name&#34; : &#34;Bob&#34;, &#34;age&#34;: 27 } + + + emit + https://bmeg.github.io/sifter/docs/transforms/emit/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/emit/ + emit Send data to output file. The naming of the file is outdir/script name.pipeline name.emit name.json.gz Parameters name Type Description name string Name of emit value example - emit: name: protein_compound_association + + + Example + https://bmeg.github.io/sifter/docs/example/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/example/ + Example Pipeline Our first task will be to convert a ZIP code TSV into a set of county level entries. 
The input file looks like: ZIP,COUNTYNAME,STATE,STCOUNTYFP,CLASSFP 36003,Autauga County,AL,01001,H1 36006,Autauga County,AL,01001,H1 36067,Autauga County,AL,01001,H1 36066,Autauga County,AL,01001,H1 36703,Autauga County,AL,01001,H1 36701,Autauga County,AL,01001,H1 36091,Autauga County,AL,01001,H1 First is the header of the pipeline. This declares the unique name of the pipeline and its output directory. name: zipcode_map outdir: ./ docs: Converts zipcode TSV into graph elements Next the configuration is declared. + + + fieldParse + https://bmeg.github.io/sifter/docs/transforms/fieldparse/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/fieldparse/ + + + + fieldProcess + https://bmeg.github.io/sifter/docs/transforms/fieldprocess/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/fieldprocess/ + fieldProcess Create a stream of objects based on the contents of a field. If the selected field is an array, each of the items in the array will become an independent row. Parameters name Type Description field string Name of field to be processed mapping map[string]string Project templated values into child element itemField string If processing an array of non-dict elements, create a dict as {itemField:element} example - fieldProcess: field: portions mapping: sample: &#34;{{row.
+ + + fieldType + https://bmeg.github.io/sifter/docs/transforms/fieldtype/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/fieldtype/ + fieldType Set field to specific type, i.e. cast as float or integer example - fieldType: t_depth: int t_ref_count: int t_alt_count: int n_depth: int n_ref_count: int n_alt_count: int start: int + + + filter + https://bmeg.github.io/sifter/docs/transforms/filter/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/filter/ + filter Filter rows in the stream using a number of different methods Parameters name Type Description field string (field path) Field used to match rows value string (template string) Template string to match against match string String to match against check string How to check value, &lsquo;exists&rsquo; or &lsquo;hasValue&rsquo; method string Method name python string Python code string gpython string Python code string run using (https://github.com/go-python/gpython) Example Field based match - filter: field: table match: source_statistics Check based match + + + flatMap + https://bmeg.github.io/sifter/docs/transforms/flatmap/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/flatmap/ + + + + from + https://bmeg.github.io/sifter/docs/transforms/from/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/from/ + from Parameters Name of data source Example inputs: profileReader: tableLoad: input: &#34;{{config.profiles}}&#34; pipelines: profileProcess: - from: profileReader + + + glob + https://bmeg.github.io/sifter/docs/inputs/glob/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/inputs/glob/ + glob Scan files using * based glob statement and open all files as input.
Parameters Name Description storeFilename Store value of filename in a parameter for each row input Path of avro object file to transform xmlLoad xmlLoad configuration tableLoad Run transform pipeline on a TSV or CSV jsonLoad Run a transform pipeline on a multi-line JSON file avroLoad Load data from avro file Example inputs: pubmedRead: glob: input: &#34;{{config.baseline}}/*.xml.gz&#34; xmlLoad: {} + + + graphBuild + https://bmeg.github.io/sifter/docs/transforms/graphbuild/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/graphbuild/ + graphBuild Build graph elements from JSON objects using the JSON Schema graph extensions. example - graphBuild: schema: &#34;{{config.allelesSchema}}&#34; title: Allele + + + hash + https://bmeg.github.io/sifter/docs/transforms/hash/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/hash/ + hash Parameters name Type Description field string Field to store hash value value string Templated string of value to be hashed method string Hashing method: sha1/sha256/md5 example - hash: value: &#34;{{row.contents}}&#34; field: contents-sha1 method: sha1 + + + Inputs + https://bmeg.github.io/sifter/docs/inputs/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/inputs/ + Every playbook consists of a series of inputs. + + + jsonLoad + https://bmeg.github.io/sifter/docs/inputs/jsonload/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/inputs/jsonload/ + jsonLoad Load data from a JSON file. Default behavior expects a single dictionary per line. Each line is a separate entry. The multiline parameter reads all of the lines of the file and returns a single object.
Parameters name Description input Path of JSON file to transform multiline Load file as a single multiline JSON object Example inputs: caseData: jsonLoad: input: &#34;{{config.casesJSON}}&#34; + + + lookup + https://bmeg.github.io/sifter/docs/transforms/lookup/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/lookup/ + lookup Using key from current row, get values from a reference source Parameters name Type Description replace string (field path) Field to replace lookup string (template string) Key to use for looking up data copy map[string]string Copy values from record that was found by lookup. The Key/Value record uses the Key as the destination field and copies the field from the retrieved records using the field named in Value tsv TSVTable TSV translation table file json JSONTable JSON data file table LookupTable Inline lookup table pipeline PipelineLookup Use output of a pipeline as a lookup table Example JSON file based lookup The JSON file defined by config. 
+ + + map + https://bmeg.github.io/sifter/docs/transforms/map/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/map/ + map Run a function on every row Parameters name Description method Name of function to call python Python code to be run gpython Python code to be run using GPython Example - map: method: response gpython: | def response(x): s = sorted(x[&#34;curve&#34;].items(), key=lambda x:float(x[0])) x[&#39;dose_um&#39;] = [] x[&#39;response&#39;] = [] for d, r in s: try: dn = float(d) rn = float(r) x[&#39;dose_um&#39;].append(dn) x[&#39;response&#39;].append(rn) except ValueError: pass return x + + + objectValidate + https://bmeg.github.io/sifter/docs/transforms/objectvalidate/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/objectvalidate/ + objectValidate Use JSON schema to validate row contents parameters name Type Description title string Title of object to use for validation schema string Path to JSON schema definition example - objectValidate: title: Aliquot schema: &#34;{{config.schema}}&#34; + + + Overview + https://bmeg.github.io/sifter/docs/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/ + Sifter pipelines Sifter pipelines process streams of nested JSON messages. Sifter comes with a number of file extractors that operate as inputs to these pipelines. The pipeline engine connects together arrays of transform steps into a directed acyclic graph that is processed in parallel. Example Message: { &#34;firstName&#34; : &#34;bob&#34;, &#34;age&#34; : &#34;25&#34; &#34;friends&#34; : [ &#34;Max&#34;, &#34;Alex&#34;] } Once a stream of messages is produced, it can be run through a transform pipeline.
+ + + Pipeline Steps + https://bmeg.github.io/sifter/docs/transforms/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/ + Transforms alter the data + + + plugin + https://bmeg.github.io/sifter/docs/inputs/plugin/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/inputs/plugin/ + plugin Run a user program for customized data extraction. Example inputs: oboData: plugin: commandLine: ../../util/obo_reader.py {{config.oboFile}} The plugin program is expected to output JSON messages, one per line, to STDOUT; these will then be passed to the transform pipelines. Example Plugin The obo_reader.py plugin reads an OBO file, such as those that describe the Gene Ontology, and emits the records as single-line JSON messages. #!/usr/bin/env python import re import sys import json re_section = re. + + + plugin + https://bmeg.github.io/sifter/docs/transforms/plugin/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/plugin/ + + + + project + https://bmeg.github.io/sifter/docs/transforms/project/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/project/ + project Populate row with templated values parameters name Type Description mapping map[string]any New fields to be generated from template rename map[string]string Rename field (no template engine) Example - project: mapping: type: sample id: &#34;{{row.sample_id}}&#34; + + + reduce + https://bmeg.github.io/sifter/docs/transforms/reduce/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/reduce/ + reduce Using key from rows, reduce matched records into a single entry Parameters name Type Description field string (field path) Field used to match rows method string Method name python string Python code string gpython string Python code string run using (https://github.com/go-python/gpython) init map[string]any Data to use for first reduce Example - reduce: field: dataset_name method:
merge init: { &#34;compounds&#34; : [] } gpython: | def merge(x,y): x[&#34;compounds&#34;] = list(set(y[&#34;compounds&#34;]+x[&#34;compounds&#34;])) return x + + + regexReplace + https://bmeg.github.io/sifter/docs/transforms/regexreplace/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/regexreplace/ + + + + split + https://bmeg.github.io/sifter/docs/transforms/split/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/split/ + split Split a field using string sep Parameters name Type Description field string Field to split sep string String to use for splitting Example - split: field: methods sep: &#34;;&#34; + + + sqldump + https://bmeg.github.io/sifter/docs/inputs/sqldump/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/inputs/sqldump/ + sqlDump Scan a file produced by sqldump. Parameters Name Type Description input string Path to the SQL dump file tables []string Names of tables to read out Example inputs: database: sqldumpLoad: input: &#34;{{config.sql}}&#34; tables: - cells - cell_tissues - dose_responses - drugs - drug_annots - experiments - profiles + + + sqliteLoad + https://bmeg.github.io/sifter/docs/inputs/sqliteload/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/inputs/sqliteload/ + sqliteLoad Extract data from an SQLite file Parameters Name Type Description input string Path to the SQLite file query string SQL select statement based input Example inputs: sqlQuery: sqliteLoad: input: &#34;{{config.sqlite}}&#34; query: &#34;select * from drug_mechanism as a LEFT JOIN MECHANISM_REFS as b on a.MEC_ID=b.MEC_ID LEFT JOIN TARGET_COMPONENTS as c on a.TID=c.TID LEFT JOIN COMPONENT_SEQUENCES as d on c.COMPONENT_ID=d.COMPONENT_ID LEFT JOIN MOLECULE_DICTIONARY as e on a.MOLREGNO=e.MOLREGNO&#34; + + + tableLoad + https://bmeg.github.io/sifter/docs/inputs/tableload/ + Mon, 01 Jan 0001 00:00:00 +0000 + 
https://bmeg.github.io/sifter/docs/inputs/tableload/ + tableLoad Extract data from a tabular file, including TSV and CSV files. Parameters Name Type Description input string File to be transformed rowSkip int Number of header rows to skip columns []string Manually set names of columns extraColumns string Columns beyond originally declared columns will be placed in this array sep string Separator \t for TSVs or , for CSVs Example config: gafFile: ../../source/go/goa_human.gaf.gz inputs: gafLoad: tableLoad: input: &#34;{{config.gafFile}}&#34; columns: - db - id - symbol - qualifier - goID - reference - evidenceCode - from - aspect - name - synonym - objectType - taxon - date - assignedBy - extension - geneProduct + + + tableWrite + https://bmeg.github.io/sifter/docs/transforms/tablewrite/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/tablewrite/ + + + + uuid + https://bmeg.github.io/sifter/docs/transforms/uuid/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/transforms/uuid/ + + + + xmlLoad + https://bmeg.github.io/sifter/docs/inputs/xmlload/ + Mon, 01 Jan 0001 00:00:00 +0000 + https://bmeg.github.io/sifter/docs/inputs/xmlload/ + xmlLoad Load an XML file Parameters name Description input Path to input file Example inputs: loader: xmlLoad: input: &#34;{{config.xmlPath}}&#34; + diff --git a/docs/sifter_example.png b/docs/sifter_example.png new file mode 100644 index 0000000..284e0dd Binary files /dev/null and b/docs/sifter_example.png differ diff --git a/docs/sitemap.xml b/docs/sitemap.xml index cd5ab70..e78f061 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -2,10 +2,84 @@ - / + https://bmeg.github.io/sifter/ - /categories/ + https://bmeg.github.io/sifter/docs/transforms/accumulate/ - /tags/ + https://bmeg.github.io/sifter/docs/inputs/avroload/ + + https://bmeg.github.io/sifter/categories/ + + https://bmeg.github.io/sifter/docs/transforms/clean/ + + https://bmeg.github.io/sifter/docs/transforms/debug/ + 
+ https://bmeg.github.io/sifter/docs/transforms/distinct/ + + https://bmeg.github.io/sifter/docs/ + + https://bmeg.github.io/sifter/docs/inputs/embedded/ + + https://bmeg.github.io/sifter/docs/transforms/emit/ + + https://bmeg.github.io/sifter/docs/example/ + + https://bmeg.github.io/sifter/docs/transforms/fieldparse/ + + https://bmeg.github.io/sifter/docs/transforms/fieldprocess/ + + https://bmeg.github.io/sifter/docs/transforms/fieldtype/ + + https://bmeg.github.io/sifter/docs/transforms/filter/ + + https://bmeg.github.io/sifter/docs/transforms/flatmap/ + + https://bmeg.github.io/sifter/docs/transforms/from/ + + https://bmeg.github.io/sifter/docs/inputs/glob/ + + https://bmeg.github.io/sifter/docs/transforms/graphbuild/ + + https://bmeg.github.io/sifter/docs/transforms/hash/ + + https://bmeg.github.io/sifter/docs/inputs/ + + https://bmeg.github.io/sifter/docs/inputs/jsonload/ + + https://bmeg.github.io/sifter/docs/transforms/lookup/ + + https://bmeg.github.io/sifter/docs/transforms/map/ + + https://bmeg.github.io/sifter/docs/transforms/objectvalidate/ + + https://bmeg.github.io/sifter/docs/ + + https://bmeg.github.io/sifter/docs/transforms/ + + https://bmeg.github.io/sifter/docs/inputs/plugin/ + + https://bmeg.github.io/sifter/docs/transforms/plugin/ + + https://bmeg.github.io/sifter/docs/transforms/project/ + + https://bmeg.github.io/sifter/docs/transforms/reduce/ + + https://bmeg.github.io/sifter/docs/transforms/regexreplace/ + + https://bmeg.github.io/sifter/docs/transforms/split/ + + https://bmeg.github.io/sifter/docs/inputs/sqldump/ + + https://bmeg.github.io/sifter/docs/inputs/sqliteload/ + + https://bmeg.github.io/sifter/docs/inputs/tableload/ + + https://bmeg.github.io/sifter/docs/transforms/tablewrite/ + + https://bmeg.github.io/sifter/tags/ + + https://bmeg.github.io/sifter/docs/transforms/uuid/ + + https://bmeg.github.io/sifter/docs/inputs/xmlload/ diff --git a/docs/tags/index.xml b/docs/tags/index.xml index 99ec5b4..41241ef 100644 --- 
a/docs/tags/index.xml +++ b/docs/tags/index.xml @@ -1,11 +1,11 @@ - Tags on - /tags/ - Recent content in Tags on + Tags on Sifter + https://bmeg.github.io/sifter/tags/ + Recent content in Tags on Sifter Hugo -- gohugo.io - en - + en-us + diff --git a/transform/reduce.go b/transform/reduce.go index 8243f90..2f8a450 100644 --- a/transform/reduce.go +++ b/transform/reduce.go @@ -60,6 +60,8 @@ func (rp *reduceProcess) GetKey(i map[string]any) string { if xStr, ok := x.(string); ok { return xStr } + } else { + log.Printf("Missing field in reduce") } return "" } diff --git a/website/content/_index.md b/website/content/_index.md index 6ad7988..5c37e23 100644 --- a/website/content/_index.md +++ b/website/content/_index.md @@ -7,3 +7,5 @@ files and external databases. It includes a pipeline description language to define a set of Transform steps to create object messages that can be validated using a JSON schema. + +![Example of sifter code](sifter_example.png) \ No newline at end of file diff --git a/website/content/docs.md b/website/content/docs.md index d047861..d45d044 100644 --- a/website/content/docs.md +++ b/website/content/docs.md @@ -40,6 +40,45 @@ be done in a transform pipeline these include: # Script structure +# Pipeline File + +A Sifter pipeline file is in YAML format and describes an entire processing pipeline. +It is composed of the following sections: `config`, `inputs`, `pipelines`, `outputs`. In addition, +for tracking, the file will also include `name` and `class` entries. + +```yaml + +class: sifter +name: