docs(readme): Edit for writing clarity

dbohdan committed Nov 3, 2017
1 parent dcfc2f9 commit 1b59fca
Showing 1 changed file with 22 additions and 24 deletions: README.md
@@ -28,7 +28,7 @@ or, equivalently,
sqawk 'select distinct a7 from a order by a7' FS=: /etc/passwd
```

-Sqawk allows you to be verbose to better document your script but aims to provide good defaults that save you keystrokes in interactive use.
+Sqawk lets you be verbose to better document your script but aims to provide good defaults that save you keystrokes in interactive use.

[Skip down](#more-examples) for more examples.

@@ -51,13 +51,13 @@ On **FreeBSD** with [pkgng](https://wiki.freebsd.org/pkgng):
sudo pkg install tcl86 tcllib tcl-sqlite3
sudo ln -s /usr/local/bin/tclsh8.6 /usr/local/bin/tclsh

-On **Windows** install [Magicsplat Tcl/Tk for Windows](http://www.magicsplat.com/tcl-installer/) or [ActiveTcl](https://www.activestate.com/activetcl/downloads) from ActiveState.
+On **Windows** install [Magicsplat Tcl/Tk for Windows](http://www.magicsplat.com/tcl-installer/) (Windows 7 or later) or [ActiveTcl](https://www.activestate.com/activetcl/downloads) 8.5 from ActiveState (Windows XP/Vista).

On **macOS** use [MacPorts](https://www.macports.org/) or install [ActiveTcl](https://www.activestate.com/activetcl/downloads) for the Mac. With MacPorts:

sudo port install tcllib tcl-sqlite3

-Once you have the dependencies installed on *nix, run
+Once you have the dependencies installed on \*nix, run

git clone https://github.com/dbohdan/sqawk
cd sqawk
@@ -85,9 +85,9 @@ One of the filenames can be `-` for the standard input.

## SQL

-A Sqawk `script` consist of one of more SQL statements in the SQLite version 3 [dialect](https://www.sqlite.org/lang.html) of SQL.
+A Sqawk `script` consists of one or more statements in the SQLite version 3 [dialect](https://www.sqlite.org/lang.html) of SQL.

-The default table names are `a` for the first input file, `b` for the second, `c` for the third, etc. You can change the table name for any one file with a file option. The table name is used as a prefix in its columns' names; by default, the columns are named `a1`, `a2`, etc. in the table `a`; `b1`, `b2`, etc. in `b`; and so on. `a0` is the raw input text of the whole record for each record (i.e., one line of input with the default record separator of `\n`). `anr` in `a`, `bnr` in `b`, and so on contain the record number and is the primary key of its respective table. `anf`, `bnf`, and so on contain the field count for a given record.
+The default table names are `a` for the first input file, `b` for the second, `c` for the third, etc. You can change the table name for any file with a file option. The table name is used as a prefix in its columns' names; by default, the columns are named `a1`, `a2`, etc. in the table `a`; `b1`, `b2`, etc. in `b`; and so on. `a0` is the raw input text of the whole record for each record (i.e., one line of input with the default record separator of `\n`). `anr` in `a`, `bnr` in `b`, and so on contains the record number and is the primary key of its respective table. `anf`, `bnf`, and so on contain the field count for a given record.
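The column scheme above can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the documented table layout (`a0`, `anr`, `anf`, `a1`, ...) under the `/etc/passwd` example's `FS=:` setting, not Sqawk's actual implementation; the sample lines and the seven-column width are assumptions for the sketch.

```python
import sqlite3

# Two sample /etc/passwd-style records (illustrative data).
lines = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
]

db = sqlite3.connect(":memory:")
# `anr` is the record number and primary key; `anf` is the field count;
# `a0` holds the raw record; `a1`, `a2`, ... hold the individual fields.
db.execute("""CREATE TABLE a (anr INTEGER PRIMARY KEY, anf INTEGER,
                              a0 TEXT, a1, a2, a3, a4, a5, a6, a7)""")
for nr, line in enumerate(lines, start=1):
    fields = line.split(":")               # FS=: as in the passwd example
    row = fields + [None] * (7 - len(fields))
    db.execute("INSERT INTO a VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
               [nr, len(fields), line] + row)

print(db.execute("SELECT a1, anf FROM a ORDER BY anr").fetchall())
```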

## Options

@@ -97,25 +97,25 @@ These options affect all files.

| Option | Example | Comment |
|--------|---------|---------|
-| -FS value | `-FS '[ \t]+'` | The input field separator for the default parser (for all input files). |
-| -RS value | `-RS '\n'` | The input record separator for the default parser (for all input files). |
-| -OFS value | `-OFS ' '` | The output field separator for the default serializer. |
-| -ORS value | `-ORS '\n'` | The output record separator for the default serializer. |
+| -FS value | `-FS '[ \t]+'` | The input field separator for the default `awk` parser (for all input files). |
+| -RS value | `-RS '\n'` | The input record separator for the default `awk` parser (for all input files). |
+| -OFS value | `-OFS ' '` | The output field separator for the default `awk` serializer. |
+| -ORS value | `-ORS '\n'` | The output record separator for the default `awk` serializer. |
| -NF value | `-NF 10` | The maximum number of fields per record. The corresponding number of columns is added to the target table at the start (e.g., `a0`, `a1`, `a2`, ... , `a10` for ten fields). Increase this if you get errors like `table x has no column named x51` with `MNF` set to `error`. |
| -MNF value | `-MNF expand`, `-MNF crop`, `-MNF error` | The NF mode. This option tells Sqawk what to do if a record exceeds the maximum number of fields: `expand`, the default, will increase `NF` automatically and add columns to the table during import if the record contains more fields than available; `crop` will truncate the record to `NF` fields (i.e., the fields for which there aren't enough table columns will be omitted); `error` makes Sqawk quit with an error message like `table x has no column named x11`. |
| -output value | `-output awk` | The output format. See [Output formats](#output-formats). |
| -v | | Print the Sqawk version and exit. |
-| -1 | | Do not split records into fields. The same as `-F 'x^'`. Improves the performance somewhat for when you only want to operate on whole records (lines). |
+| -1 | | Do not split records into fields. The same as `-FS 'x^'` (`x^` is a regular expression that matches nothing). Improves the performance somewhat for when you only want to operate on whole records (lines). |
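The `FS`/`-1` behavior described in the table can be sketched with Python's `re` module, assuming Sqawk treats the field separator as a regular expression to split each record on (the record string here is made up for illustration):

```python
import re

record = "foo   bar\tbaz"
# -FS '[ \t]+': split the record on runs of spaces and tabs.
fields = re.split(r"[ \t]+", record)
print(fields)  # ['foo', 'bar', 'baz']

# `x^` ("x" before the start of the string) can never match, so
# splitting on it leaves the record whole -- the effect of -1.
whole = re.split(r"x^", record)
print(whole)   # ['foo   bar\tbaz']
```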

#### Output formats

The following are the possible values for the command line option `-output`. Some formats have format options to further customize the output. The options are appended to the format name and separated from the format name and each other with commas, e.g., `-output json,arrays=0,indent=1`.

| Format name | Format options | Examples | Comment |
|-------------|----------------|----------|---------|
-| awk | none | `-output awk` | The default serializer, `awk`, works similarly to Awk. When it is selected, the output consists of the rows returned by your query separated with the output record separator (-ORS). Each row in turn consists of columns separated with the output field separator (-OFS). |
+| awk | none | `-output awk` | The default serializer, `awk`, mimics its namesake, Awk. When it is selected, the output consists of the rows returned by your query separated with the output record separator (-ORS). Each row in turn consists of columns separated with the output field separator (-OFS). |
| csv | none | `-output csv` | Output CSV. |
-| json | `arrays` (defaults to `0`), `indent` (defaults to `0`) | `-output json,indent=0,arrays=1` | Output the result of the query as JSON. If `arrays` is `0`, the result is an array of JSON objects with the column names as keys; if `arrays` is `1`, the result is an array of arrays. The values are all represented as strings in either case. If `indent` is `1`, each object will be indented for readability. |
+| json | `arrays` (defaults to `0`), `indent` (defaults to `0`) | `-output json,indent=0,arrays=1` | Output the result of the query as JSON. If `arrays` is `0`, the result is an array of JSON objects with the column names as keys; if `arrays` is `1`, the result is an array of arrays. The values are all represented as strings in either case. If `indent` is `1`, each object (but not array) will be indented for readability. |
| table | `alignments` or `align`, `margins`, `style` | `-output table,align=center left right`, `-output table,alignments=c l r` | Output plain text tables. The `table` serializer uses [Tabulate](https://tcl.wiki/41682) to format the output as a table using box-drawing characters. Note that the default Unicode table output will not display correctly in `cmd.exe` on Windows even after `chcp 65001`. Use `style=loFi` to draw tables with plain ASCII characters instead. |
| tcl | `dicts` (defaults to `0`) | `-output tcl,dicts=1` | Dump raw Tcl data structures. With the `tcl` serializer Sqawk outputs a list of lists if `dicts` is `0` and a list of dictionaries with the column names as keys if `dicts` is `1`. |
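The difference between `arrays=0` and `arrays=1` for the `json` format can be sketched in Python, assuming a query that returned two rows with columns `a1` and `a2` (the sample values are made up). As the table notes, all values are represented as strings:

```python
import json

columns = ["a1", "a2"]
rows = [["foo", "1"], ["bar", "2"]]

# arrays=0: an array of objects keyed by column name.
objects = [dict(zip(columns, row)) for row in rows]
# arrays=1: an array of arrays.
arrays = rows

print(json.dumps(objects))
# [{"a1": "foo", "a2": "1"}, {"a1": "bar", "a2": "2"}]
print(json.dumps(arrays))
# [["foo", "1"], ["bar", "2"]]
```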

@@ -125,25 +125,25 @@ These options are set before a filename and only affect one input source (file).

| Option | Example | Comment |
|--------|---------|---------|
-| columns | `columns=id,name,sum`, `columns=id,a long name with spaces` | Give the columns for the next file custom names. If there are more columns than custom names, the columns after the last one with a custom name will be named automatically in the same manner as with the option `header=1`. Custom column names override names taken from the header. If you give a column an empty name, it will be named automatically or will retain its name from the header. |
-| datatypes | `datatypes=integer,real,text` | Set the [datatypes](https://www.sqlite.org/datatype3.html) for the columns, starting with `a1` if your table is named `a`. The datatype for each column for which the datatype is not explicitly given is `INTEGER`. The datatype of `a0` is always `TEXT`. |
+| columns | `columns=id,name,sum`, `columns=id,a long name with spaces` | Give custom names to the table columns for the next file. If there are more columns than custom names, the columns after the last one with a custom name will be named automatically in the same manner as with the option `header=1` (see below). Custom column names override names taken from the header. If you give a column an empty name, it will be named automatically or will retain its name from the header. |
+| datatypes | `datatypes=integer,real,text` | Set the [datatypes](https://www.sqlite.org/datatype3.html) for the columns, starting with the first (`a1` if your table is named `a`). The datatype for each column for which the datatype is not explicitly given is `INTEGER`. The datatype of `a0` is always `TEXT`. |
| format | `format=csv csvsep=;` | Set the input format for the next source of input. See [Input formats](#input-formats). |
| header | `header=1` | Can be `0`/`false`/`no`/`off` or `1`/`true`/`yes`/`on`. Use the first row of the file as a source of column names. If the first row has five fields, then the first five columns will have custom names, and all the following columns will have automatically generated names (e.g., `name`, `surname`, `title`, `office`, `phone`, `a6`, `a7`, ...). |
-| prefix | `prefix=x` | The column name prefix in the table. Defaults to the table name. For example, with `table=foo` and `prefix=bar` you need to use queries like `select bar1, bar2 from foo` to access the table `foo`. |
+| prefix | `prefix=x` | The column name prefix in the table. Defaults to the table name. For example, with `table=foo` and `prefix=bar` you will have columns named `bar1`, `bar2`, `bar3`, etc. in the table `foo`. |
| table | `table=foo` | The table name. By default, the tables are named `a`, `b`, `c`, ... Specifying, e.g., `table=foo` for the second file only will result in the tables having the names `a`, `foo`, `c`, ... |
-| F0 | `F0=no`, `F0=1` | Can be `0`/`false`/`no`/`off` or `1`/`true`/`yes`/`on`. Enable the zeroth column of the table that stores the input verbatim. Disabling this column can save memory. |
-| NF | `NF=20` | The same as -NF, but for one file (table). |
-| MNF | `MNF=crop` | The same as -MNF, but for one file (table). |
+| F0 | `F0=no`, `F0=1` | Can be `0`/`false`/`no`/`off` or `1`/`true`/`yes`/`on`. Enable the zeroth column of the table that stores the input verbatim. Disabling this column lowers memory usage. |
+| NF | `NF=20` | The same as the global option -NF, but for one file (table). |
+| MNF | `MNF=crop` | The same as the global option -MNF, but for one file (table). |
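The naming rules for `columns` and `header=1` described above can be sketched as one precedence chain: an explicit custom name wins, then the header row, then an automatic `a6`-style name. `name_columns` is a hypothetical helper written for this sketch, not part of Sqawk:

```python
def name_columns(header_fields, custom, total, prefix="a"):
    """Return `total` column names per the documented precedence."""
    names = []
    for i in range(total):
        if i < len(custom) and custom[i]:
            names.append(custom[i])            # explicit custom name wins
        elif i < len(header_fields):
            names.append(header_fields[i])     # then the header row
        else:
            names.append(f"{prefix}{i + 1}")   # then automatic names
    return names

# The header=1 example from the table: five header fields, seven columns.
print(name_columns(["name", "surname", "title", "office", "phone"], [], 7))
# ['name', 'surname', 'title', 'office', 'phone', 'a6', 'a7']
```

An empty custom name (`columns=id,`) falls through to the header name or an automatic name, as documented.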

#### Input formats

-A format option (`format=x`) selects the input parser with which Sqawk will parse the next input source. Formats can have multiple synonymous names or multiple names that configure the parser in different ways. Selecting an input format can enable additional per-file options that only work with that format.
+A format option (`format=x`) selects the input parser with which Sqawk will parse the next input source. Formats can have multiple synonymous names or multiple names that configure the parser in different ways. Selecting an input format can enable additional per-file options that only work for that format.

| Format | Additional options | Examples | Comment |
|--------|--------------------|--------- |---------|
-| `awk` | `FS`, `RS`, `trim`, `fields` | `RS=\n`, `FS=:`, `trim=left`, `fields=1,2,3-5,auto` | The default input parser. Splits input into records then fields using regular expressions. The options `FS` and `RS` work the same as -FS and -RS respectively, but only apply to one file. The option `trim` removes whitespace at the beginning of each line of input (`trim=left`), at its end (`trim=right`), both (`trim=both`), or neither (`trim=none`). The option `fields` configures how the fields of the input are mapped to the columns of the corresponding database table. This option lets you discard some of the fields, which can save memory, and to merge the contents of others. For example, `fields=1,2,3-5,auto` tells Sqawk to insert the contents of the first field into the column `a1` (assuming table `a`), the second field into `a2`, the third through the fifth field into `a3`, and the rest of the fields starting with the sixth into the columns `a4`, 'a5', and so on, one field per column. If you merge several fields, the whitespace between them is preserved. |
+| `awk` | `FS`, `RS`, `trim`, `fields` | `RS=\n`, `FS=:`, `trim=left`, `fields=1,2,3-5,auto` | The default input parser. Splits the input first into records then into fields using regular expressions. The options `FS` and `RS` work the same as -FS and -RS respectively, but only apply to one file. The option `trim` removes whitespace at the beginning of each line of input (`trim=left`), at its end (`trim=right`), both (`trim=both`), or neither (`trim=none`). The option `fields` configures how the fields of the input are mapped to the columns of the corresponding database table. This option lets you discard some of the fields, which can save memory, and to merge the contents of others. For example, `fields=1,2,3-5,auto` tells Sqawk to insert the contents of the first field into the column `a1` (assuming table `a`), the second field into `a2`, the third through the fifth field into `a3`, and the rest of the fields starting with the sixth into the columns `a4`, `a5`, and so on, one field per column. If you merge several fields, the whitespace between them is preserved. |
| `csv`, `csv2`, `csvalt` | `csvsep`, `csvquote` | `format=csv csvsep=, 'csvquote="'` | Parse the input as CSV. Using `format=csv2` or `format=csvalt` enables the [alternate mode](http://core.tcl.tk/tcllib/doc/trunk/embedded/www/tcllib/files/modules/csv/csv.html#section3) meant for parsing CSV files exported by Microsoft Excel. `csvsep` sets the field separator; it defaults to `,`. `csvquote` selects the character with which the fields that contain the field separator are quoted; it defaults to `"`. Note that some characters (e.g., numbers and most letters) can't be be used as `csvquote`. |
-| `tcl` | `dicts` | `format=tcl dicts=true` | The value for `dicts` can be `0`/`false`/`no`/`off` or `1`/`true`/`yes`/`on`. The input is read as a Tcl list of either lists (`dicts=0`, the default) or dictionaries (`dicts=1`). When `dicts` is `0`, each list becomes a row in the corresponding database table. If that table is `a`, its column `a0` contains the full list, `a1` contains the first element, `a2` contains the second element, and so on. When `dicts` is `1`, the first row of the table contains every unique key found in all of the dictionaries. It is intended as a table header for use with the [option](#per-file-options) `header=1`. The keys are in the same order they are in the first dictionary of the input (Tcl dictionaries are ordered). If some keys that aren't in the first dictionary but are in the subsequent ones, they follow those that are in the first dictionary in alphabetical order. From the second row on the table contains the input data with the values mapped to columns in the same way that the keys are in the first row. |
+| `tcl` | `dicts` | `format=tcl dicts=true` | The value for `dicts` can be `0`/`false`/`no`/`off` or `1`/`true`/`yes`/`on`. The input is read as a Tcl list of either lists (`dicts=0`, the default) or dictionaries (`dicts=1`). When `dicts` is `0`, each list becomes a row in the database table. If that table is `a`, its column `a0` will contain the full list, `a1` will contain the first element, `a2` the second element, and so on. When `dicts` is `1`, the first row of the table will contain every unique key found in all of the dictionaries. It is intended as a table header for use with the [option](#per-file-options) `header=1`. The keys are in the same order they are in the first dictionary of the input (Tcl dictionaries are ordered). If some keys that aren't in the first dictionary but are in the subsequent ones, they follow those that are in the first dictionary in alphabetical order. From the second row on the table contains the input data with the values mapped to columns in the same way that the keys are in the first row. |
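The `fields=1,2,3-5,auto` mapping for the `awk` parser can be sketched as follows. This is an illustration with made-up field values; joining the merged range with a single space is a simplification, since Sqawk preserves the original whitespace between merged fields:

```python
fields = ["f1", "f2", "f3", "f4", "f5", "f6", "f7"]

columns = [
    fields[0],              # field 1       -> a1
    fields[1],              # field 2       -> a2
    " ".join(fields[2:5]),  # fields 3-5    -> a3 (merged)
] + fields[5:]              # fields 6, ... -> a4, a5, ... (one per column)

print(columns)
# ['f1', 'f2', 'f3 f4 f5', 'f6', 'f7']
```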


# More examples
@@ -261,7 +261,7 @@ This is the equivalent of the Awk code

### Commands

-This example uses the files from the [happypenguin.com 2013 data dump](https://archive.org/details/happypenguin_xml_dump_2013) to generate metadata.
+This example joins the data from two metadata files generated from the [happypenguin.com 2013 data dump](https://archive.org/details/happypenguin_xml_dump_2013). You do not need to download the data dump to try the query; `MD5SUMS` and `du-bytes` are included in the directory [`examples/hp/`](./examples/hp/).

# Generate input files -- see below
cd happypenguin_dump/screenshots
@@ -270,8 +270,6 @@
# Perform query
sqawk 'select a1, b1, a2 from a inner join b on a2 = b2 where b1 < 10000 order by b1' MD5SUMS du-bytes

-You don't need to download the data yourself to recreate `MD5SUMS` and `du-bytes`; the files can be found in the directory [`examples/`](./examples/).

### Input files

#### MD5SUMS
