tsv-append first release (#13)

* First release of tsv-append.
* Readme updates for tsv-append.
eBay · Jan 4, 2017 · 3101585 · 3101585
1 parent 1d77814

Showing 18 changed files with 954 additions and 14 deletions.
diff --git a/README.md b/README.md
@@ -24,6 +24,7 @@ A short description of each tool follows. There is more detail in the [tool refe
 * [tsv-uniq](#tsv-uniq) - Filter out duplicate lines using fields as a key.
 * [tsv-select](#tsv-select) - Keep a subset of the columns in the input.
 * [tsv-summarize](#tsv-summarize) - Aggregate field values, summarizing across the entire file or grouped by key.
+* [tsv-append](#tsv-append) - Concatenate TSV files. Header aware; supports source file tracking.
 * [csv2tsv](#csv2tsv) - Convert CSV files to TSV.
 * [number-lines](#number-lines) - Number the input lines.
 * [Useful bash aliases](#useful-bash-aliases)
@@ -84,7 +85,7 @@ See the [tsv-select reference](#tsv-select-reference) for details.
 
 ### tsv-summarize
 
-tsv-summarize runs aggregation operations on fields. For example, generating the sum or median of a field's values. Summarization calculations can be run across the entire input or can be grouped by key fields. As an example, consider the file `data.tsv`:
+`tsv-summarize` runs aggregation operations on fields. For example, generating the sum or median of a field's values. Summarization calculations can be run across the entire input or can be grouped by key fields. As an example, consider the file `data.tsv`:
 ```
 color   weight
 red     6
@@ -109,6 +110,18 @@ Multiple fields can be used as the `--group-by` key. The file's sort order does
 
 See the [tsv-summarize reference](#tsv-summarize-reference) for the list of statistical and other aggregation operations available.
 
+### tsv-append
+
+`tsv-append` concatenates multiple TSV files, similar to the Unix `cat` utility. It is header aware, writing the header from only the first file. It also supports source tracking, adding a column indicating the original file to each row.
+
+Concatenation with header support is useful when preparing data for traditional Unix utilities like `sort` and `sed` or applications that read a single file.
+
+Source tracking is useful when creating long/narrow form tabular data. This format is used by many statistics and data mining packages. (See [Wide & Long Data - Stanford University](http://stanford.edu/~ejdemyr/r-tutorials/wide-and-long/) or Hadley Wickham's [Tidy data](http://vita.had.co.nz/papers/tidy-data.html) for more info.)
+
+In this scenario, files have been used to capture related data sets, the difference between data sets being a condition represented by the file. For example, results from different variants of an experiment might each be recorded in their own files. Retaining the source file as an output column preserves the condition represented by the file. The source values default to the file names, but this can be customized.
+
+See the [tsv-append reference](#tsv-append-reference) for the complete list of options available.
+
 ### csv2tsv
 
 Sometimes you have a CSV file. This program does what you expect: convert CSV data to TSV. Example:
@@ -205,7 +218,9 @@ There is directory for each tool, plus one directory for shared code (`common`).
 
 Documentation for each tool is found near the top of the main file, both in the help text and the option documentation.
 
-The simplest tool is `number-lines`. It is useful as an illustration of the code outline followed by the other tools.  `tsv-select` and `tsv-uniq` also have straightforward functionality, but employ a few more D programming concepts. `tsv-select` uses templates and compile-time programming in a somewhat less common way, it may be clearer after gaining some familiarity with D templates. A non-templatized version of the source code is included for comparison. 
+The simplest tool is `number-lines`. It is useful as an illustration of the code outline followed by the other tools. `tsv-select` and `tsv-uniq` also have straightforward functionality, but employ a few more D programming concepts. `tsv-select` uses templates and compile-time programming in a somewhat less common way; it may be clearer after gaining some familiarity with D templates. A non-templatized version of the source code is included for comparison.
+
+`tsv-append` has a simple code structure. It's one of the newer tools. Its only additional complexity is that it writes to an 'output range' rather than directly to standard output. This enables better encapsulation for unit testing.
 
 `tsv-join` and `tsv-filter` also have relatively straightforward functionality, but support more use cases resulting in more code. `tsv-filter` in particular has more elaborate setup steps that take a bit more time to understand. `tsv-filter` uses several features like delegates (closures) and regular expressions not used in the other tools.
 
@@ -258,7 +273,7 @@ $ make test-nobuild
 
 ### Unit tests
 
-D has an excellent facility for adding unit tests right with the code. The `common` utility functions in this package take advantage of built-in unit tests. However, most of the command line executables do not, and instead use more traditional invocation of the command line executables and diffs the output against a "gold" result set. The exceptions are `csv2tsv` and `tsv-summarize`. These use both built-in unit tests and tests against the executable. The built-in unit tests are much nicer, and also the advantage of being naturally cross-platform. The command line executable tests assume a Unix shell.
+D has an excellent facility for adding unit tests right with the code. The `common` utility functions in this package take advantage of built-in unit tests. However, most of the command line executables do not, and instead use the more traditional approach of invoking the command line executables and diffing the output against a "gold" result set. The exceptions are `csv2tsv`, `tsv-summarize` and `tsv-append`. These use both built-in unit tests and tests against the executable. The built-in unit tests are much nicer, and also have the advantage of being naturally cross-platform. The command line executable tests assume a Unix shell.
 
 Tests for the command line executables are in the `tests` directory of each tool. Overall the tests cover a fair number of cases and are quite useful checks when modifying the code. They may also be helpful as examples of command line tool invocations. See the `tests.sh` file in each `test` directory, and the `test` makefile target in `makeapp.mk`.
 
@@ -364,6 +379,8 @@ This section provides more detailed documentation about the different tools as w
 * [tsv-join reference](#tsv-join-reference)
 * [tsv-uniq reference](#tsv-uniq-reference)
 * [tsv-select reference](#tsv-select-reference)
+* [tsv-summarize reference](#tsv-summarize-reference)
+* [tsv-append reference](#tsv-append-reference)
 * [csv2tsv reference](#csv2tsv-reference)
 * [number-lines reference](#number-lines-reference)
 
@@ -722,6 +739,39 @@ Calculations hold onto the minimum data needed while reading data. A few operati
 * `--mode n[,n...][:STR]` - Mode. The most frequent value. (Reads all values into memory.)
 * `--values n[,n...][:STR]` - All the values, separated by --v|values-delimiter. (Reads all values into memory.)
 
+### tsv-append reference
+
+**Synopsis:** tsv-append [options] [file...]
+
+tsv-append concatenates multiple TSV files, similar to the Unix 'cat' utility. Unlike 'cat', it is header aware ('--H|header'), writing the header from only the first file. It also supports source tracking, adding a column indicating the original file to each row. Results are written to standard output.
+
+Concatenation with header support is useful when preparing data for traditional Unix utilities like 'sort' and 'sed' or applications that read a single file.
+
+Source tracking is useful when creating long/narrow form tabular data, a format used by many statistics and data mining packages. In this scenario, files have been used to capture related data sets, the difference between data sets being a condition represented by the file. For example, results from different variants of an experiment might each be recorded in their own files. Retaining the source file as an output column preserves the condition represented by the file.
+
+The file-name (without extension) is used as the source value. This can be customized using the --f|file option.
+
+Example: Header processing:
+
+   $ tsv-append -H file1.tsv file2.tsv file3.tsv
+
+Example: Header processing and source tracking:
+
+   $ tsv-append -H -t file1.tsv file2.tsv file3.tsv
+
+Example: Source tracking with custom values:
+
+   $ tsv-append -H -s test_id -f test1=file1.tsv -f test2=file2.tsv
+
+**Options:**
+* `--h|help` - Print help.
+* `--help-verbose` - Print detailed help.
+* `--H|header` - Treat the first line of each file as a header.
+* `--t|track-source` - Track the source file. Adds a column with the source name.
+* `--s|source-header STR` - Use STR as the header for the source column. Implies --H|header and --t|track-source. Default: 'file'
+* `--f|file STR=FILE` - Read file FILE, using STR as the 'source' value. Implies --t|track-source.
+* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)
+
 ### csv2tsv reference
 
 **Synopsis:** csv2tsv [options] [file...]
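
Note on the README changes above: they state that `tsv-append` writes the header from only the first file, and that it writes to an 'output range' rather than directly to standard output so the logic can be unit tested. The following is a minimal sketch of that idea, not the tool's actual code. The function `appendLines` and its parameters are hypothetical names, and the "files" are in-memory arrays of lines rather than real files.

```d
import std.array : appender;
import std.range : isOutputRange, put;
import std.stdio : stdout;

/* Hypothetical helper (not taken from tsv-append): concatenates in-memory
 * "files", each an array of lines. When hasHeader is true, the first line of
 * each file is treated as a header and only the first header is written.
 */
void appendLines(OutputRange)(auto ref OutputRange sink, string[][] files, bool hasHeader)
    if (isOutputRange!(OutputRange, char))
{
    bool headerWritten = false;
    foreach (lines; files)
    {
        foreach (i, line; lines)
        {
            if (hasHeader && i == 0)
            {
                if (headerWritten) continue;   // Skip headers after the first.
                headerWritten = true;
            }
            put(sink, line);
            put(sink, '\n');
        }
    }
}

unittest
{
    // An appender serves as the output range, so no files or stdout are needed.
    auto file1 = ["color\tweight", "red\t6"];
    auto file2 = ["color\tweight", "blue\t15"];

    auto buf = appender!string();
    appendLines(buf, [file1, file2], true);
    assert(buf.data == "color\tweight\nred\t6\nblue\t15\n");
}

void main()
{
    // The same function can write to standard output via its output range.
    appendLines(stdout.lockingTextWriter, [["a\tb", "1\t2"]], false);
}
```

Because the function only requires something satisfying `isOutputRange`, the unit test can inspect the result in memory while the executable passes `stdout.lockingTextWriter`.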

diff --git a/common/dub.json b/common/dub.json
@@ -3,7 +3,7 @@
     "description": "Routines used by applications in the tsv-utils-dlang package.",
     "homepage": "https://github.com/eBay/tsv-utils-dlang",
     "authors": ["Jon Degenhardt"],
-    "copyright": "Copyright (c) 2015-2016, eBay Software Foundation",
+    "copyright": "Copyright (c) 2015-2017, eBay Software Foundation",
     "license": "BSL-1.0",
     "targetType": "sourceLibrary"
 }
diff --git a/common/src/unittest_utils.d b/common/src/unittest_utils.d
@@ -0,0 +1,115 @@
+/**
+Helper functions for tsv-utils-dlang unit tests.
+
+Copyright (c) 2017, eBay Software Foundation
+Initially written by Jon Degenhardt
+
+License: Boost License 1.0 (http://boost.org/LICENSE_1_0.txt) 
+*/
+
+version(unittest)
+{
+    /* Creates a temporary directory for writing unit test files. The path of the created
+     * directory is returned. The 'toolDirName' argument will be included in the directory
+     * name, and should consist of generic filename characters. e.g. "tsv_append". This
+     * name will also be used in assert error messages.
+     *
+     * The caller should delete the temporary directory and all its contents when tests
+     * are finished. This can be done using std.file.rmdirRecurse. For example:
+     *
+     *     unittest
+     *     {
+     *         import std.file : rmdirRecurse;
+     *         auto testDir = makeUnittestTempDir("tsv_append");
+     *         scope(exit) testDir.rmdirRecurse;
+     *         ... test code
+     *     }
+     *
+     * An assert is triggered if the directory cannot be created. There are two typical
+     * reasons:
+     * - Unable to find an available directory name. A number of unique names are tried
+     *   (currently 1000). If they are all taken, it will normally be because the directories
+     *   haven't been properly cleaned up from previous unit test runs.
+     * - Directory creation failed. e.g. Permission denied.
+     *
+     * This routine is intended to be run in 'unittest' mode, so that an assert is triggered
+     * on failure. However, if run with asserts disabled, the returned path will be empty in
+     * event of a failure.
+     */
+    string makeUnittestTempDir(string toolDirName)
+    {
+        import std.conv;
+        import std.file : exists, mkdir, tempDir;
+        import std.format;
+        import std.path : buildPath;
+        import std.range;
+
+        string dirNamePrefix = "tsv_utils_dlang__" ~ toolDirName ~ "_unittest_";
+        string systemTempDirPath = tempDir();
+        string newTempDirPath = "";
+
+        for (auto i = 0; i < 1000 && newTempDirPath.empty; i++)
+        {
+            string path = buildPath(systemTempDirPath, dirNamePrefix ~ i.to!string);
+            if (!path.exists) newTempDirPath = path;
+        }
+        assert (!newTempDirPath.empty,
+                format("Unable to obtain a new temp directory, paths tried already exist.\nPath prefix: %s",
+                       buildPath(systemTempDirPath, dirNamePrefix)));
+
+        if (!newTempDirPath.empty)
+        {
+            try mkdir(newTempDirPath);
+            catch (Exception exc)
+            {
+                assert(false, format("Failed to create temp directory: %s\n   Error: %s",
+                                     newTempDirPath, exc.msg));
+            }
+        }
+
+        return newTempDirPath;
+    }
+
+    /* Write a TSV file. The 'tsvData' argument is a 2-dimensional array of rows and
+     * columns. Asserts if the file cannot be written.
+     *
+     * This routine is intended to be run in 'unittest' mode, so that it will assert
+     * if the write fails. However, if run in a mode with asserts disabled, it will
+     * return false if the write failed.
+     */ 
+    bool writeUnittestTsvFile(string filepath, string[][] tsvData, char delimiter = '\t')
+    {
+        import std.algorithm : each, joiner, map;
+        import std.conv;
+        import std.format: format;
+        import std.stdio : File;
+
+        try
+        {
+            auto file = File(filepath, "w");
+            tsvData
+                .map!(row => row.joiner(delimiter.to!string))
+                .each!(str => file.writeln(str));
+        }
+        catch (Exception exc)
+        {
+            assert(false, format("Failed to write TSV file: %s.\n  Error: %s",
+                                 filepath, exc.msg));
+            return false;
+        }
+
+        return true;
+    }
+
+    /* Convert a 2-dimensional array of values to an in-memory string. */
+    string tsvDataToString(string[][] tsvData, char delimiter = '\t')
+    {
+        import std.algorithm : joiner, map;
+        import std.conv;
+
+        return tsvData
+            .map!(row => row.joiner(delimiter.to!string).to!string ~ "\n")
+            .joiner
+            .to!string;
+    }
+ }
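
Note on `unittest_utils.d` above: a minimal usage sketch of the three helpers together, extending the example in the doc comment. It assumes the file is importable as module `unittest_utils` (that is, `common/src` is on the import path, as `makeapp.mk` arranges with `-I$(common_srcdir)`) and that the code is compiled with `-unittest`. The file name `file1.tsv` and the data values are illustrative only.

```d
unittest
{
    import unittest_utils;   // Assumed module name, derived from the file name.
    import std.file : exists, rmdirRecurse;
    import std.path : buildPath;

    // Create a scratch directory and make sure it is removed when the test ends.
    auto testDir = makeUnittestTempDir("tsv_append");
    scope(exit) testDir.rmdirRecurse;

    // Rows and columns, written out as a tab-separated file.
    string[][] data = [["field_a", "field_b"], ["1", "2"]];
    string filepath = buildPath(testDir, "file1.tsv");

    assert(writeUnittestTsvFile(filepath, data));
    assert(filepath.exists);

    // The in-memory form of the same data, e.g. for comparing against tool output.
    assert(tsvDataToString(data) == "field_a\tfield_b\n1\t2\n");
}
```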
diff --git a/dub.json b/dub.json
@@ -11,19 +11,21 @@
         "tsv-utils-dlang:common": "*",
         "tsv-utils-dlang:csv2tsv": "*",
         "tsv-utils-dlang:number-lines": "*",
-        "tsv-utils-dlang:tsv-select": "*",
+        "tsv-utils-dlang:tsv-append": "*",
         "tsv-utils-dlang:tsv-filter": "*",
         "tsv-utils-dlang:tsv-join": "*",
+        "tsv-utils-dlang:tsv-select": "*",
         "tsv-utils-dlang:tsv-summarize": "*",
         "tsv-utils-dlang:tsv-uniq": "*"
     },
     "subPackages": [
         "./common/",
         "./csv2tsv/",
         "./number-lines/",
-        "./tsv-select/",
+        "./tsv-append/",
         "./tsv-filter/",
         "./tsv-join/",
+        "./tsv-select/",
         "./tsv-summarize/",
         "./tsv-uniq/"
     ],

diff --git a/dub_build.d b/dub_build.d
@@ -13,7 +13,7 @@ Another use-case:
 
     dub fetch --local <package>
     cd <package>
-    dub build
+    dub run
 
 This executable is intended to handle these cases. It also has one additional function:
 inform the user where the binaries are stored so they can be added to the path.
@@ -56,7 +56,7 @@ int main(string[] args) {
 
     // Note: At present 'common' is a source library and does not need a standalone compilation step.
     auto packageName = "tsv-utils-dlang";
-    auto subPackages = ["csv2tsv", "number-lines", "tsv-filter", "tsv-join", "tsv-select", "tsv-summarize", "tsv-uniq"];
+    auto subPackages = ["csv2tsv", "number-lines", "tsv-append", "tsv-filter", "tsv-join", "tsv-select", "tsv-summarize", "tsv-uniq"];
     auto buildCmdArgs = ["dub", "build", "<package>", "--force", "-b"];
     buildCmdArgs ~= debugBuild ? "debug" : "release";
     if (compiler.length > 0) {

diff --git a/makeapp.mk b/makeapp.mk
@@ -1,5 +1,5 @@
 app ?= $(notdir $(basename $(CURDIR)))
-common_srcs ?= $(common_srcdir)/tsvutil.d $(common_srcdir)/getopt_inorder.d
+common_srcs ?= $(common_srcdir)/tsvutil.d $(common_srcdir)/getopt_inorder.d $(common_srcdir)/unittest_utils.d
 app_src ?= src/$(app).d
 srcs ?= $(app_src) $(common_srcs)
 imports ?= -I$(common_srcdir)

diff --git a/makefile b/makefile
@@ -1,4 +1,4 @@
-appdirs =  csv2tsv number-lines tsv-filter tsv-join tsv-select tsv-uniq tsv-summarize
+appdirs =  csv2tsv number-lines tsv-filter tsv-join tsv-select tsv-uniq tsv-summarize tsv-append
 subdirs = common $(appdirs)
 
 release: make_subdirs

diff --git a/tsv-append/dub.json b/tsv-append/dub.json
@@ -0,0 +1,28 @@
+{
+    "name": "tsv-append",
+    "description": "Concatenate TSV files. Header aware, with support for source file tracking.",
+    "homepage": "https://github.com/eBay/tsv-utils-dlang",
+    "authors": ["Jon Degenhardt"],
+    "copyright": "Copyright (c) 2017, eBay Software Foundation",
+    "license": "BSL-1.0",
+    "targetType": "executable",
+    "configurations": [
+        {
+            "name" : "executable",
+            "targetName": "tsv-append",
+            "targetPath": "../bin/",
+            "mainSourceFile": "src/tsv-append.d",
+            "dependencies": {
+                "tsv-utils-dlang:common": "*"
+            }
+        },
+        {
+            "name": "unittest",
+            "targetType": "none"
+        }
+    ],
+    "buildTypes": {
+        "debug": { "buildOptions": ["debugMode", "optimize"] },
+        "release": { "buildOptions": ["releaseMode", "optimize", "inline"], "dflags": ["-boundscheck=off"] }
+    }
+}
diff --git a/tsv-append/makefile b/tsv-append/makefile
@@ -0,0 +1,2 @@
+include ../makedefs.mk
+include ../makeapp.mk