Job setup and testing; new transforms (#33)

* make application configurable here, or from projects using it
* handle special options for destinations
* handle namespaced file registry keys
* add FileRegistryEntry class to handle settings in one place and validate
* use pry for console
* implement dry-container as file registry and customize it to create/return FileRegistryEntry objects
* separate Job from BaseJob; add TestingJob
* add Deduplicate::Table and Extract::Fields transforms
* add documentation and update tests to new format
* add ExampleFormatter method to help create documentation
lyrasis · Aug 31, 2021 · 1462959
1 parent baed926
commit 1462959

Showing 63 changed files with 3,041 additions and 198 deletions.
diff --git a/.gitignore b/.gitignore
@@ -8,3 +8,6 @@
 
 # rspec failure tracking
 .rspec_status
+.byebug_history
+
+**/.~lock*
diff --git a/.rubocop.yml b/.rubocop.yml
@@ -9,3 +9,7 @@ Metrics/BlockLength:
     - spec/kiba/extend/**/*
 Naming/MethodParameterName:
   AllowedNames: i, v
+Style/ModuleFunction:
+  Exclude:
+    # alias_method doesn't work if extend self is changed to module_function
+    - lib/kiba/extend/utils/lookup.rb
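
The comment in the `.rubocop.yml` exclusion above can be reproduced in plain Ruby. This is a minimal sketch, not code from `lookup.rb`; the module and method names (`WithExtendSelf`, `WithModuleFunction`, `fetch_value`, `get`) are hypothetical. It shows that an alias created with `alias_method` is callable on the module under `extend self`, but not under `module_function`.

```ruby
# Hypothetical modules for illustration only; not taken from lookup.rb.
module WithExtendSelf
  extend self

  def fetch_value(key)
    "value for #{key}"
  end
  # The alias is an ordinary instance method, so `extend self` exposes it.
  alias_method :get, :fetch_value
end

module WithModuleFunction
  module_function

  def fetch_value(key)
    "value for #{key}"
  end
  # module_function copies methods created with `def` onto the module,
  # but the alias is not copied, so it is not callable on the module.
  alias_method :get, :fetch_value
end

puts WithExtendSelf.get(:a)
puts WithModuleFunction.fetch_value(:a)
puts WithModuleFunction.respond_to?(:get)
```
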
diff --git a/.yardopts b/.yardopts
@@ -5,4 +5,5 @@
 --no-private
 --markup markdown
 -
-LICENSE.txt
+LICENSE.txt
+doc/file_registry_entry.md
diff --git a/Gemfile.lock b/Gemfile.lock
@@ -1,40 +1,51 @@
 PATH
   remote: .
   specs:
-    kiba-extend (2.1.1)
+    kiba-extend (2.2.0)
       activesupport (~> 6.1.4)
       csv (~> 3.0)
+      dry-configurable (~> 0.11)
+      dry-container (~> 0.8)
       kiba (~> 4.0.0)
       kiba-common (~> 1.5.0)
       xxhash (~> 0.4)
 
 GEM
   remote: https://rubygems.org/
   specs:
-    activesupport (6.1.4)
+    activesupport (6.1.4.1)
       concurrent-ruby (~> 1.0, >= 1.0.2)
       i18n (>= 1.6, < 2)
       minitest (>= 5.1)
       tzinfo (~> 2.0)
       zeitwerk (~> 2.3)
     ast (2.4.2)
-    coderay (1.1.2)
+    byebug (11.1.3)
+    coderay (1.1.3)
     concurrent-ruby (1.1.9)
     csv (3.2.0)
     diff-lcs (1.3)
+    dry-configurable (0.12.1)
+      concurrent-ruby (~> 1.0)
+      dry-core (~> 0.5, >= 0.5.0)
+    dry-container (0.8.0)
+      concurrent-ruby (~> 1.0)
+      dry-configurable (~> 0.1, >= 0.1.3)
+    dry-core (0.7.1)
+      concurrent-ruby (~> 1.0)
     i18n (1.8.10)
       concurrent-ruby (~> 1.0)
     kiba (4.0.0)
     kiba-common (1.5.0)
       kiba (>= 3.0.0, < 5)
-    method_source (0.9.2)
+    method_source (1.0.0)
     minitest (5.14.4)
     parallel (1.20.1)
     parser (3.0.2.0)
       ast (~> 2.4.1)
-    pry (0.12.2)
-      coderay (~> 1.1.0)
-      method_source (~> 0.9.0)
+    pry (0.14.1)
+      coderay (~> 1.1)
+      method_source (~> 1.0)
     rainbow (3.0.0)
     rake (13.0.1)
     regexp_parser (2.1.1)
@@ -78,8 +89,9 @@ PLATFORMS
 
 DEPENDENCIES
   bundler (>= 1.17)
+  byebug (~> 11.0)
   kiba-extend!
-  pry (~> 0.12.2)
+  pry (~> 0.14)
   rake (~> 13.0)
   rspec (~> 3.0)
   rubocop (~> 1.18.4)

diff --git a/bin/console b/bin/console
@@ -7,9 +7,5 @@ require 'kiba/extend'
 # You can add fixtures and/or initialization code here to make experimenting
 # with your gem easier. You can also use a different console, if you like.
 
-# (If you use this, don't forget to add pry to your Gemfile!)
-# require "pry"
-# Pry.start
-
-require 'irb'
-IRB.start(__FILE__)
+require 'pry'
+Pry.start
diff --git a/doc/file_registry_entry.md b/doc/file_registry_entry.md
@@ -0,0 +1,101 @@
+# File Registry Entry
+
+## PATH_REQ
+
+Constant registering the known source/destination classes and whether each requires a file path for read/write.
+
+If you create or incorporate a new source/destination class, you will get a warning if you use it and do not register it here. 
+
+## reghash
+
+A file registry entry is initialized with a Hash of data about the file. This Hash will be sent from your ETL application. 
+
+The allowable Hash keys, expected Hash value formats, and expectations about them are described below.
+
+**`:path` [String] full or expandable relative path to the expected location of the file**
+
+* default: `nil`
+* required if either `:src_class` or `:dest_class` requires a path (in `PATH_REQ`)
+
+`:src_class` [Class] the Ruby class used to read in data
+
+* default: value of `Kiba::Extend.source` (`Kiba::Common::Sources::CSV` unless overridden by your ETL app)
+* required, but default supplied if not given
+
+`:src_opt` [Hash] file options used when reading in source
+
+* default: value of `Kiba::Extend.csvopts`
+* required, but default supplied if not given
+
+`:dest_class` [Class] the Ruby class used to write out the data
+
+* default: value of `Kiba::Extend.destination` (`Kiba::Extend::Destinations::CSV` unless overridden by your ETL app)
+* required, but default supplied if not given
+
+`:dest_opt` [Hash] file options used when writing data
+
+* default: value of `Kiba::Extend.csvopts`
+* required, but default supplied if not given
+
+`:dest_special_opts` [Hash] additional options for writing out the data
+
+* Not all destination classes support extra options. If you provide unsupported extra options, they are not passed through to the destination class, and a warning is written to STDOUT. Currently the most common use is passing `initial_headers` (i.e. which columns should come first in the file) to `Kiba::Extend::Destinations::CSV`.
+* optional
+
+```ruby
+reghash = {
+  path: '/path/to/file.csv',
+  dest_class: Kiba::Extend::Destinations::CSV,
+  dest_special_opts: { initial_headers: %i[objectnumber briefdescription] }
+}
+```
+
+**`:creator` [Method] Ruby method that generates this file**
+
+* Used to run ETL jobs to create necessary files, if said files do not exist
+* required unless file is supplied
+
+**`:supplied` [true, false] whether the file/data is supplied from outside the ETL**
+
+* default: `false`
+* Manually set to `true` for:
+  * original data files from client
+  * mappings/reconciliations to be merged into the ETL/migration
+  * any other files created external to the ETL, which only need to be read from and never generated by the ETL process
+
+Both of the following are valid:
+
+```ruby
+reghash = {
+  path: '/project/working/objects_prep.csv',
+  creator: Project::ClientData::ObjectTable.method(:prep)
+}
+
+reghash = {
+  path: '/project/clientData/objects.csv',
+  supplied: true
+}
+```
+
+Note the pattern used for `:creator` values:
+
+    Class or Module constant name + `.method` + method name **as symbol**
+
+**`:lookup_on` [Symbol] column to use as keys in lookup table created from file data**
+
+* required if file is used as a lookup source
+* You can register the same file multiple times under different file keys with different `:lookup_on` values if you need to use the data for different lookup purposes
+
+`:desc` [String] description of what the file is/what it is used for. Used when post-processing reports results to STDOUT
+
+* optional
+
+`:tags` [Array<Symbol>] list of arbitrary tags useful for categorizing data/jobs in your ETL
+
+* optional
+* If set, you can filter to run only jobs tagged with a given tag
+* Tags I commonly use: 
+  * :report_problems - reports that indicate something unexpected or that I need to do more work
+  * :report_fyi - informational reports
+  * :cspace - final files ready to import
+
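
The requirement rules documented in `doc/file_registry_entry.md` above (a path is needed when a source/destination class requires one; a creator is needed unless the file is supplied) can be sketched in plain Ruby. This is an illustration under stated assumptions, not the `FileRegistryEntry` implementation from this commit: `RegistrySketch`, its stand-in classes, `errors_for`, and `ObjectTableSketch` are all hypothetical names.

```ruby
# Hypothetical stand-ins; the real classes live in kiba-common and kiba-extend.
module RegistrySketch
  class CsvSource; end
  class InMemorySource; end
  class CsvDestination; end

  # Mirrors the idea of PATH_REQ: does the class need a file path?
  PATH_REQ = {
    CsvSource => true,
    InMemorySource => false,
    CsvDestination => true
  }.freeze

  # Assumed defaults, standing in for Kiba::Extend.source / .destination.
  DEFAULTS = { src_class: CsvSource, dest_class: CsvDestination, supplied: false }.freeze

  module_function

  # Returns an array of problem symbols; an empty array means the hash passes.
  def errors_for(reghash)
    entry = DEFAULTS.merge(reghash)
    needs_path = PATH_REQ[entry[:src_class]] || PATH_REQ[entry[:dest_class]]
    errors = []
    errors << :missing_path if needs_path && entry[:path].nil?
    errors << :missing_creator unless entry[:supplied] || entry[:creator]
    errors << :creator_not_a_method if entry[:creator] && !entry[:creator].is_a?(Method)
    errors
  end
end

# Hypothetical job module; `.method(:prep)` returns a Method object.
module ObjectTableSketch
  def self.prep; end
end

p RegistrySketch.errors_for(path: '/project/working/objects_prep.csv',
                            creator: ObjectTableSketch.method(:prep))
p RegistrySketch.errors_for(path: '/project/clientData/objects.csv', supplied: true)
p RegistrySketch.errors_for({})
```

Passing a bare symbol such as `creator: :prep` fails the `Method` check in this sketch, which is why the `Constant.method(:name)` pattern matters.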
diff --git a/kiba-extend.gemspec b/kiba-extend.gemspec
@@ -37,12 +37,15 @@ Gem::Specification.new do |spec|
 
   spec.add_dependency 'activesupport', '~> 6.1.4'
   spec.add_dependency 'csv', '~> 3.0'
+  spec.add_dependency 'dry-configurable', '~> 0.11'
+  spec.add_dependency 'dry-container', '~> 0.8'
   spec.add_dependency 'kiba', '~> 4.0.0'
   spec.add_dependency 'kiba-common', '~> 1.5.0'
   spec.add_dependency 'xxhash', '~> 0.4'
 
   spec.add_development_dependency 'bundler', '>= 1.17'
-  spec.add_development_dependency 'pry', '~> 0.12.2'
+  spec.add_development_dependency 'byebug', '~> 11.0'
+  spec.add_development_dependency 'pry', '~> 0.14'
   spec.add_development_dependency 'rake', '~> 13.0'
   spec.add_development_dependency 'rspec', '~> 3.0'
   spec.add_development_dependency 'rubocop', '~> 1.18.4'

diff --git a/lib/kiba/extend.rb b/lib/kiba/extend.rb
@@ -2,13 +2,17 @@
 
 require 'active_support'
 require 'active_support/core_ext/object'
+require 'dry-configurable'
 require 'kiba'
 require 'kiba-common/sources/csv'
+require 'kiba-common/sources/enumerable'
 require 'kiba-common/destinations/csv'
+require 'kiba-common/destinations/lambda'
 require 'pry'
+require 'byebug'
 require 'xxhash'
 
-require 'kiba/extend/version'
+# require 'kiba/extend/version'
 
 # Default CSV options
 CSVOPT = { headers: true, header_converters: :symbol }.freeze
@@ -24,13 +28,49 @@
 module Kiba
   # Provides a suite of abstract, reusable, well-tested data transformations for use in Kiba ETL pipelines
   module Extend
-    puts "kiba-extend version: #{Kiba::Extend::VERSION}"
+    module_function
+    extend Dry::Configurable
 
     # Require application files
     Dir.glob("#{__dir__}/**/*").sort.select { |path| path.match?(/\.rb$/) }.each do |rbfile|
       require rbfile.delete_prefix("#{File.expand_path(__dir__)}/lib/")
     end
 
+    # So we can call Kiba.job_segment
+    Kiba.extend(Kiba::Extend::Jobs::JobSegmenter)
+
+    # Default options for reading/writing CSVs
+    setting :csvopts, { headers: true, header_converters: %i[symbol downcase] }, reader: true
+
+    # Default settings for Lambda destination
+    setting :lambdaopts, { on_write: ->(r) { accumulator << r } }, reader: true
+
+    # Default delimiter for splitting/joining values in multi-valued fields
+    setting :delim, ';', reader: true
+
+    # Default source class for jobs
+    setting :source, Kiba::Common::Sources::CSV, reader: true
+
+    # Default destination class for jobs
+    setting :destination, Kiba::Extend::Destinations::CSV, reader: true
+
+    # Prefix for warnings from the ETL
+    setting :warning_label, 'KIBA WARNING', reader: true
+
+    setting :registry, Kiba::Extend::FileRegistry.new, reader: true
+
+    setting :job, reader: true do
+      # Whether to output results to STDOUT for debugging
+      setting :show_me, false, reader: true
+      # Whether to have computer say something when job is complete
+      setting :tell_me, false, reader: true
+      # How much output about jobs to output to STDOUT
+      # :debug - tells you A LOT - helpful when developing pipelines and debugging
+      # :normal - reports what is running, from where, and the results
+      # :minimal - bare minimum
+      setting :verbosity, :normal, reader: true
+    end
+
     # strips, collapses multiple spaces, removes terminal commas, strips again
     CSV::Converters[:stripplus] = lambda { |s|
       begin
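
The new `:csvopts` default in the hunk above changes `header_converters` from `:symbol` to `%i[symbol downcase]`. A minimal sketch of what those options do when handed to Ruby's standard `csv` library follows; the constant names and sample data are hypothetical, and only the option hash is taken from the diff.

```ruby
require 'csv'

# Same option values as the :csvopts default in the diff.
SKETCH_CSVOPTS = { headers: true, header_converters: %i[symbol downcase] }.freeze

# Hypothetical sample data.
SKETCH_DATA = "Object Number,Brief Description\n2021.1,Blue vase\n"

SKETCH_TABLE = CSV.parse(SKETCH_DATA, **SKETCH_CSVOPTS)

# Headers are converted to downcased, underscored symbols.
p SKETCH_TABLE.headers
# Rows can then be addressed by those symbol keys.
p SKETCH_TABLE.first[:object_number]
```
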

diff --git a/lib/kiba/extend/jobs.rb b/lib/kiba/extend/jobs.rb
@@ -0,0 +1,12 @@
+# frozen_string_literal: true
+
+require_relative 'jobs/parser'
+
+Kiba::Extend::Jobs.extend(Kiba::Extend::Jobs::Parser)
+
+module Kiba
+  module Extend
+    module Jobs
+    end
+  end
+end
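
Both `Kiba.extend(Kiba::Extend::Jobs::JobSegmenter)` and `Kiba::Extend::Jobs.extend(Kiba::Extend::Jobs::Parser)` in this commit rely on `Object#extend` to make a module's instance methods callable directly on another module. A minimal sketch of that mechanism, using hypothetical names (`SegmenterSketch`, `PipelineSketch`, `segment_name`) rather than the real Kiba modules:

```ruby
# Hypothetical mixin standing in for something like JobSegmenter or Parser.
module SegmenterSketch
  def segment_name(label)
    "segment: #{label}"
  end
end

# Hypothetical host module standing in for Kiba or Kiba::Extend::Jobs.
module PipelineSketch
end

# The host must already be defined when extend is called on it.
PipelineSketch.extend(SegmenterSketch)

puts PipelineSketch.segment_name(:prep)
```

In `jobs.rb` the `extend` call precedes the `module Jobs` block, so it presumably depends on `jobs/parser` having already defined the `Kiba::Extend::Jobs` namespace.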