tsv-append first release (#13)
* First release of tsv-append.

* Readme updates for tsv-append.
jondegenhardt authored Jan 4, 2017
1 parent 1d77814 commit 3101585
Showing 18 changed files with 954 additions and 14 deletions.
56 changes: 53 additions & 3 deletions README.md
@@ -24,6 +24,7 @@ A short description of each tool follows. There is more detail in the [tool refe
* [tsv-uniq](#tsv-uniq) - Filter out duplicate lines using fields as a key.
* [tsv-select](#tsv-select) - Keep a subset of the columns in the input.
* [tsv-summarize](#tsv-summarize) - Aggregate field values, summarizing across the entire file or grouped by key.
* [tsv-append](#tsv-append) - Concatenate TSV files. Header aware; supports source file tracking.
* [csv2tsv](#csv2tsv) - Convert CSV files to TSV.
* [number-lines](#number-lines) - Number the input lines.
* [Useful bash aliases](#useful-bash-aliases)
@@ -84,7 +85,7 @@ See the [tsv-select reference](#tsv-select-reference) for details.

### tsv-summarize

tsv-summarize runs aggregation operations on fields. For example, generating the sum or median of a field's values. Summarization calculations can be run across the entire input or can be grouped by key fields. As an example, consider the file `data.tsv`:
`tsv-summarize` runs aggregation operations on fields. For example, generating the sum or median of a field's values. Summarization calculations can be run across the entire input or can be grouped by key fields. As an example, consider the file `data.tsv`:
```
color weight
red 6
@@ -109,6 +110,18 @@ Multiple fields can be used as the `--group-by` key. The file's sort order does

See the [tsv-summarize reference](#tsv-summarize-reference) for the list of statistical and other aggregation operations available.

### tsv-append

`tsv-append` concatenates multiple TSV files, similar to the Unix `cat` utility. It is header aware, writing the header from only the first file. It also supports source tracking, adding a column indicating the original file to each row.

Concatenation with header support is useful when preparing data for traditional Unix utilities like `sort` and `sed` or applications that read a single file.
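
To make the header-aware behavior concrete, here is a minimal sketch that approximates it with standard Unix tools. The file names and contents are hypothetical, and the `awk` one-liner is only a stand-in for illustration. It is not how `tsv-append` is implemented. With the tool installed, the equivalent call would be along the lines of `tsv-append -H file1.tsv file2.tsv`.

```shell
# Hypothetical sample files; both start with the same header line.
workdir=$(mktemp -d)
printf 'color\tweight\nred\t6\n' > "$workdir/file1.tsv"
printf 'color\tweight\nblue\t15\n' > "$workdir/file2.tsv"

# Plain 'cat' repeats the header once per input file.
cat_out=$(cat "$workdir/file1.tsv" "$workdir/file2.tsv")

# Header-aware concatenation (emulated): keep line 1 of the first file only.
# NR counts lines across all files, FNR counts lines within the current file.
append_out=$(awk 'FNR > 1 || NR == 1' "$workdir/file1.tsv" "$workdir/file2.tsv")

printf '%s\n' "$append_out"
rm -rf "$workdir"
```

The `cat` result contains the header twice, which would end up mixed into the data if the output were piped to `sort`. The header-aware result contains it once.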

Source tracking is useful when creating long/narrow form tabular data. This format is used by many statistics and data mining packages. (See [Wide & Long Data - Stanford University](http://stanford.edu/~ejdemyr/r-tutorials/wide-and-long/) or Hadley Wickham's [Tidy data](http://vita.had.co.nz/papers/tidy-data.html) for more info.)

In this scenario, files have been used to capture related data sets, the difference between data sets being a condition represented by the file. For example, results from different variants of an experiment might each be recorded in their own files. Retaining the source file as an output column preserves the condition represented by the file. The source values default to the file names, but this can be customized.
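
The long-form result of source tracking can be sketched as follows, again emulated with `awk` rather than by running the tool. The file names and data are hypothetical. The `file` header and the use of the file name without its extension follow the tool reference; placing the source column first is an assumption of this sketch.

```shell
# Hypothetical experiment variants, one file per condition.
workdir=$(mktemp -d)
printf 'item\tscore\na\t10\n' > "$workdir/variant1.tsv"
printf 'item\tscore\na\t12\n' > "$workdir/variant2.tsv"

# Emulated source tracking: write the header once with an extra 'file' column,
# then prefix every data row with the name of the file it came from.
tracked=$(awk '
    BEGIN { FS = OFS = "\t" }
    FNR == 1 { if (NR == 1) print "file", $0; next }   # header: first file only
    {
        src = FILENAME
        sub(/.*\//, "", src)        # drop the directory part
        sub(/\.[^.]*$/, "", src)    # drop the extension
        print src, $0
    }' "$workdir/variant1.tsv" "$workdir/variant2.tsv")

printf '%s\n' "$tracked"
rm -rf "$workdir"
```

Each output row now carries its condition (`variant1` or `variant2`) as a value, so the combined data can be grouped or filtered by condition downstream.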

See the [tsv-append reference](#tsv-append-reference) for the complete list of options available.

### csv2tsv

Sometimes you have a CSV file. This program does what you expect: convert CSV data to TSV. Example:
@@ -205,7 +218,9 @@ There is directory for each tool, plus one directory for shared code (`common`).

Documentation for each tool is found near the top of the main file, both in the help text and the option documentation.

The simplest tool is `number-lines`. It is useful as an illustration of the code outline followed by the other tools. `tsv-select` and `tsv-uniq` also have straightforward functionality, but employ a few more D programming concepts. `tsv-select` uses templates and compile-time programming in a somewhat less common way, it may be clearer after gaining some familiarity with D templates. A non-templatized version of the source code is included for comparison.
The simplest tool is `number-lines`. It is useful as an illustration of the code outline followed by the other tools. `tsv-select` and `tsv-uniq` also have straightforward functionality, but employ a few more D programming concepts. `tsv-select` uses templates and compile-time programming in a somewhat less common way, it may be clearer after gaining some familiarity with D templates. A non-templatized version of the source code is included for comparison.

`tsv-append` has a simple code structure. It's one of the newer tools. Its only additional complexity is that it writes to an 'output range' rather than directly to standard output. This enables better encapsulation for unit testing.

`tsv-join` and `tsv-filter` also have relatively straightforward functionality, but support more use cases resulting in more code. `tsv-filter` in particular has more elaborate setup steps that take a bit more time to understand. `tsv-filter` uses several features like delegates (closures) and regular expressions not used in the other tools.

@@ -258,7 +273,7 @@ $ make test-nobuild

### Unit tests

D has an excellent facility for adding unit tests right with the code. The `common` utility functions in this package take advantage of built-in unit tests. However, most of the command line executables do not, and instead use more traditional invocation of the command line executables, diffing the output against a "gold" result set. The exceptions are `csv2tsv` and `tsv-summarize`. These use both built-in unit tests and tests against the executable. The built-in unit tests are much nicer, and also have the advantage of being naturally cross-platform. The command line executable tests assume a Unix shell.
D has an excellent facility for adding unit tests right with the code. The `common` utility functions in this package take advantage of built-in unit tests. However, most of the command line executables do not, and instead use more traditional invocation of the command line executables, diffing the output against a "gold" result set. The exceptions are `csv2tsv`, `tsv-summarize` and `tsv-append`. These use both built-in unit tests and tests against the executable. The built-in unit tests are much nicer, and also have the advantage of being naturally cross-platform. The command line executable tests assume a Unix shell.

Tests for the command line executables are in the `tests` directory of each tool. Overall the tests cover a fair number of cases and are quite useful checks when modifying the code. They may also be helpful as examples of command line tool invocations. See the `tests.sh` file in each `test` directory, and the `test` makefile target in `makeapp.mk`.
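
The "gold" file approach described above can be sketched in a few lines of shell. This is an illustration only: `sort` stands in for the executable under test, all file names are hypothetical, and the repository's actual `tests.sh` scripts may be organized differently.

```shell
workdir=$(mktemp -d)

# Test input, plus a "gold" file holding the output known to be correct.
printf 'b\na\n' > "$workdir/input.txt"
printf 'a\nb\n' > "$workdir/gold.txt"

# Run the command under test ('sort' is a stand-in) and capture its output.
sort "$workdir/input.txt" > "$workdir/latest.txt"

# The test passes when the fresh output is identical to the gold file.
if diff "$workdir/gold.txt" "$workdir/latest.txt" > /dev/null; then
    status="pass"
else
    status="fail"
fi

echo "basic test: $status"
rm -rf "$workdir"
```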

@@ -364,6 +379,8 @@ This section provides more detailed documentation about the different tools as w
* [tsv-join reference](#tsv-join-reference)
* [tsv-uniq reference](#tsv-uniq-reference)
* [tsv-select reference](#tsv-select-reference)
* [tsv-summarize reference](#tsv-summarize-reference)
* [tsv-append reference](#tsv-append-reference)
* [csv2tsv reference](#csv2tsv-reference)
* [number-lines reference](#number-lines-reference)

@@ -722,6 +739,39 @@ Calculations hold onto the minimum data needed while reading data. A few operati
* `--mode n[,n...][:STR]` - Mode. The most frequent value. (Reads all values into memory.)
* `--values n[,n...][:STR]` - All the values, separated by --v|values-delimiter. (Reads all values into memory.)

### tsv-append reference

**Synopsis:** tsv-append [options] [file...]

tsv-append concatenates multiple TSV files, similar to the Unix 'cat' utility. Unlike 'cat', it is header aware ('--H|header'), writing the header from only the first file. It also supports source tracking, adding a column indicating the original file to each row. Results are written to standard output.

Concatenation with header support is useful when preparing data for traditional Unix utilities like 'sort' and 'sed' or applications that read a single file.

Source tracking is useful when creating long/narrow form tabular data, a format used by many statistics and data mining packages. In this scenario, files have been used to capture related data sets, the difference between data sets being a condition represented by the file. For example, results from different variants of an experiment might each be recorded in their own files. Retaining the source file as an output column preserves the condition represented by the file.

The file-name (without extension) is used as the source value. This can be customized using the --f|file option.

Example: Header processing:

$ tsv-append -H file1.tsv file2.tsv file3.tsv

Example: Header processing and source tracking:

$ tsv-append -H -t file1.tsv file2.tsv file3.tsv

Example: Source tracking with custom values:

$ tsv-append -H -s test_id -f test1=file1.tsv -f test2=file2.tsv
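
The effect of the custom-value form can be approximated with a short shell sketch. It emulates the `STR=FILE` pairing with parameter expansion and `awk` rather than invoking the tool. The file contents are hypothetical, and putting the source column first is an assumption of this sketch.

```shell
workdir=$(mktemp -d)
printf 'metric\tvalue\nlatency\t20\n' > "$workdir/file1.tsv"
printf 'metric\tvalue\nlatency\t35\n' > "$workdir/file2.tsv"

source_header="test_id"   # plays the role of '-s test_id'
custom=$(
    header_done=no
    # Each spec mimics one '-f STR=FILE' argument.
    for spec in "test1=$workdir/file1.tsv" "test2=$workdir/file2.tsv"; do
        label=${spec%%=*}   # text before the first '='
        path=${spec#*=}     # text after the first '='
        awk -v label="$label" -v hdr="$source_header" -v header_done="$header_done" '
            BEGIN { OFS = "\t" }
            NR == 1 { if (header_done == "no") print hdr, $0; next }
            { print label, $0 }' "$path"
        header_done=yes
    done
)

printf '%s\n' "$custom"
rm -rf "$workdir"
```

The rows are labeled `test1` and `test2` rather than `file1` and `file2`, and the source column is headed `test_id` instead of the default `file`.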

**Options:**
* `--h|help` - Print help.
* `--help-verbose` - Print detailed help.
* `--H|header` - Treat the first line of each file as a header.
* `--t|track-source` - Track the source file. Adds a column with the source name.
* `--s|source-header STR` - Use STR as the header for the source column. Implies --H|header and --t|track-source. Default: 'file'
* `--f|file STR=FILE` - Read file FILE, using STR as the 'source' value. Implies --t|track-source.
* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)

### csv2tsv reference

**Synopsis:** csv2tsv [options] [file...]
2 changes: 1 addition & 1 deletion common/dub.json
@@ -3,7 +3,7 @@
"description": "Routines used by applications in the tsv-utils-dlang package.",
"homepage": "https://github.com/eBay/tsv-utils-dlang",
"authors": ["Jon Degenhardt"],
"copyright": "Copyright (c) 2015-2016, eBay Software Foundation",
"copyright": "Copyright (c) 2015-2017, eBay Software Foundation",
"license": "BSL-1.0",
"targetType": "sourceLibrary"
}
115 changes: 115 additions & 0 deletions common/src/unittest_utils.d
@@ -0,0 +1,115 @@
/**
Helper functions for tsv-utils-dlang unit tests.
Copyright (c) 2017, eBay Software Foundation
Initially written by Jon Degenhardt
License: Boost License 1.0 (http://boost.org/LICENSE_1_0.txt)
*/

version(unittest)
{
    /* Creates a temporary directory for writing unit test files. The path of the created
     * directory is returned. The 'toolDirName' argument will be included in the directory
     * name, and should consist of generic filename characters. e.g. "tsv_append". This
     * name will also be used in assert error messages.
     *
     * The caller should delete the temporary directory and all its contents when tests
     * are finished. This can be done using std.file.rmdirRecurse. For example:
     *
     *     unittest
     *     {
     *         import std.file : rmdirRecurse;
     *         auto testDir = makeUnittestTempDir("tsv_append");
     *         scope(exit) testDir.rmdirRecurse;
     *         ... test code
     *     }
     *
     * An assert is triggered if the directory cannot be created. There are two typical
     * reasons:
     * - Unable to find an available directory name. A number of unique names are tried
     *   (currently 1000). If they are all taken, it will normally be because the directories
     *   haven't been properly cleaned up from previous unit test runs.
     * - Directory creation failed. e.g. Permission denied.
     *
     * This routine is intended to be run in 'unittest' mode, so that an assert is triggered
     * on failure. However, if run with asserts disabled, the returned path will be empty in
     * event of a failure.
     */
    string makeUnittestTempDir(string toolDirName)
    {
        import std.conv;
        import std.file : exists, mkdir, tempDir;
        import std.format;
        import std.path : buildPath;
        import std.range;

        string dirNamePrefix = "tsv_utils_dlang__" ~ toolDirName ~ "_unittest_";
        string systemTempDirPath = tempDir();
        string newTempDirPath = "";

        for (auto i = 0; i < 1000 && newTempDirPath.empty; i++)
        {
            string path = buildPath(systemTempDirPath, dirNamePrefix ~ i.to!string);
            if (!path.exists) newTempDirPath = path;
        }
        assert (!newTempDirPath.empty,
                format("Unable to obtain a new temp directory, paths tried already exist.\nPath prefix: %s",
                       buildPath(systemTempDirPath, dirNamePrefix)));

        if (!newTempDirPath.empty)
        {
            try mkdir(newTempDirPath);
            catch (Exception exc)
            {
                assert(false, format("Failed to create temp directory: %s\n Error: %s",
                                     newTempDirPath, exc.msg));
            }
        }

        return newTempDirPath;
    }

    /* Write a TSV file. The 'tsvData' argument is a 2-dimensional array of rows and
     * columns. Asserts if the file cannot be written.
     *
     * This routine is intended to be run in 'unittest' mode, so that it will assert
     * if the write fails. However, if run in a mode with asserts disabled, it will
     * return false if the write failed.
     */
    bool writeUnittestTsvFile(string filepath, string[][] tsvData, char delimiter = '\t')
    {
        import std.algorithm : each, joiner, map;
        import std.conv;
        import std.format: format;
        import std.stdio : File;

        try
        {
            auto file = File(filepath, "w");
            tsvData
                .map!(row => row.joiner(delimiter.to!string))
                .each!(str => file.writeln(str));
        }
        catch (Exception exc)
        {
            assert(false, format("Failed to write TSV file: %s.\n Error: %s",
                                 filepath, exc.msg));
            return false;
        }

        return true;
    }

    /* Convert a 2-dimensional array of values to an in-memory string. */
    string tsvDataToString(string[][] tsvData, char delimiter = '\t')
    {
        import std.algorithm : joiner, map;
        import std.conv;

        return tsvData
            .map!(row => row.joiner(delimiter.to!string).to!string ~ "\n")
            .joiner
            .to!string;
    }
}
6 changes: 4 additions & 2 deletions dub.json
@@ -11,19 +11,21 @@
"tsv-utils-dlang:common": "*",
"tsv-utils-dlang:csv2tsv": "*",
"tsv-utils-dlang:number-lines": "*",
"tsv-utils-dlang:tsv-select": "*",
"tsv-utils-dlang:tsv-append": "*",
"tsv-utils-dlang:tsv-filter": "*",
"tsv-utils-dlang:tsv-join": "*",
"tsv-utils-dlang:tsv-select": "*",
"tsv-utils-dlang:tsv-summarize": "*",
"tsv-utils-dlang:tsv-uniq": "*"
},
"subPackages": [
"./common/",
"./csv2tsv/",
"./number-lines/",
"./tsv-select/",
"./tsv-append/",
"./tsv-filter/",
"./tsv-join/",
"./tsv-select/",
"./tsv-summarize/",
"./tsv-uniq/"
],
4 changes: 2 additions & 2 deletions dub_build.d
@@ -13,7 +13,7 @@ Another use-case:
dub fetch --local <package>
cd <package>
dub build
dub run
This executable is intended to handle these cases. It also has one additional function:
inform the user where the binaries are stored so they can be added to the path.
@@ -56,7 +56,7 @@ int main(string[] args) {

// Note: At present 'common' is a source library and does not need a standalone compilation step.
auto packageName = "tsv-utils-dlang";
auto subPackages = ["csv2tsv", "number-lines", "tsv-filter", "tsv-join", "tsv-select", "tsv-summarize", "tsv-uniq"];
auto subPackages = ["csv2tsv", "number-lines", "tsv-append", "tsv-filter", "tsv-join", "tsv-select", "tsv-summarize", "tsv-uniq"];
auto buildCmdArgs = ["dub", "build", "<package>", "--force", "-b"];
buildCmdArgs ~= debugBuild ? "debug" : "release";
if (compiler.length > 0) {
2 changes: 1 addition & 1 deletion makeapp.mk
@@ -1,5 +1,5 @@
app ?= $(notdir $(basename $(CURDIR)))
common_srcs ?= $(common_srcdir)/tsvutil.d $(common_srcdir)/getopt_inorder.d
common_srcs ?= $(common_srcdir)/tsvutil.d $(common_srcdir)/getopt_inorder.d $(common_srcdir)/unittest_utils.d
app_src ?= src/$(app).d
srcs ?= $(app_src) $(common_srcs)
imports ?= -I$(common_srcdir)
2 changes: 1 addition & 1 deletion makefile
@@ -1,4 +1,4 @@
appdirs = csv2tsv number-lines tsv-filter tsv-join tsv-select tsv-uniq tsv-summarize
appdirs = csv2tsv number-lines tsv-filter tsv-join tsv-select tsv-uniq tsv-summarize tsv-append
subdirs = common $(appdirs)

release: make_subdirs
28 changes: 28 additions & 0 deletions tsv-append/dub.json
@@ -0,0 +1,28 @@
{
    "name": "tsv-append",
    "description": "Concatenate TSV files. Header aware, with support for source file tracking.",
    "homepage": "https://github.com/eBay/tsv-utils-dlang",
    "authors": ["Jon Degenhardt"],
    "copyright": "Copyright (c) 2017, eBay Software Foundation",
    "license": "BSL-1.0",
    "targetType": "executable",
    "configurations": [
        {
            "name" : "executable",
            "targetName": "tsv-append",
            "targetPath": "../bin/",
            "mainSourceFile": "src/tsv-append.d",
            "dependencies": {
                "tsv-utils-dlang:common": "*"
            }
        },
        {
            "name": "unittest",
            "targetType": "none"
        }
    ],
    "buildTypes": {
        "debug": { "buildOptions": ["debugMode", "optimize"] },
        "release": { "buildOptions": ["releaseMode", "optimize", "inline"], "dflags": ["-boundscheck=off"] }
    }
}
2 changes: 2 additions & 0 deletions tsv-append/makefile
@@ -0,0 +1,2 @@
include ../makedefs.mk
include ../makeapp.mk