Job setup and testing; new transforms (#33)
* make application configurable here, or from projects using it

* handle special options for destinations

* handle namespaced file registry keys

* add FileRegistryEntry class to handle settings in one place and validate

* use pry for console

* implement dry-container as file registry and customize it to create/return FileRegistryEntry objects

* separate Job from BaseJob; add TestingJob

* add Deduplicate::Table and Extract::Fields transforms

* add documentation and update tests to new format

* add ExampleFormatter method to help create documentation
kspurgin authored Aug 31, 2021
1 parent baed926 commit 1462959
Showing 63 changed files with 3,041 additions and 198 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -8,3 +8,6 @@

# rspec failure tracking
.rspec_status
.byebug_history

**/.~lock*
4 changes: 4 additions & 0 deletions .rubocop.yml
@@ -9,3 +9,7 @@ Metrics/BlockLength:
    - spec/kiba/extend/**/*
Naming/MethodParameterName:
  AllowedNames: i, v
Style/ModuleFunction:
  Exclude:
    # alias_method doesn't work if extend self is changed to module_function
    - lib/kiba/extend/utils/lookup.rb
3 changes: 2 additions & 1 deletion .yardopts
@@ -5,4 +5,5 @@
--no-private
--markup markdown
-
LICENSE.txt
LICENSE.txt
doc/file_registry_entry.md
28 changes: 20 additions & 8 deletions Gemfile.lock
@@ -1,40 +1,51 @@
PATH
remote: .
specs:
kiba-extend (2.1.1)
kiba-extend (2.2.0)
activesupport (~> 6.1.4)
csv (~> 3.0)
dry-configurable (~> 0.11)
dry-container (~> 0.8)
kiba (~> 4.0.0)
kiba-common (~> 1.5.0)
xxhash (~> 0.4)

GEM
remote: https://rubygems.org/
specs:
activesupport (6.1.4)
activesupport (6.1.4.1)
concurrent-ruby (~> 1.0, >= 1.0.2)
i18n (>= 1.6, < 2)
minitest (>= 5.1)
tzinfo (~> 2.0)
zeitwerk (~> 2.3)
ast (2.4.2)
coderay (1.1.2)
byebug (11.1.3)
coderay (1.1.3)
concurrent-ruby (1.1.9)
csv (3.2.0)
diff-lcs (1.3)
dry-configurable (0.12.1)
concurrent-ruby (~> 1.0)
dry-core (~> 0.5, >= 0.5.0)
dry-container (0.8.0)
concurrent-ruby (~> 1.0)
dry-configurable (~> 0.1, >= 0.1.3)
dry-core (0.7.1)
concurrent-ruby (~> 1.0)
i18n (1.8.10)
concurrent-ruby (~> 1.0)
kiba (4.0.0)
kiba-common (1.5.0)
kiba (>= 3.0.0, < 5)
method_source (0.9.2)
method_source (1.0.0)
minitest (5.14.4)
parallel (1.20.1)
parser (3.0.2.0)
ast (~> 2.4.1)
pry (0.12.2)
coderay (~> 1.1.0)
method_source (~> 0.9.0)
pry (0.14.1)
coderay (~> 1.1)
method_source (~> 1.0)
rainbow (3.0.0)
rake (13.0.1)
regexp_parser (2.1.1)
@@ -78,8 +89,9 @@ PLATFORMS

DEPENDENCIES
bundler (>= 1.17)
byebug (~> 11.0)
kiba-extend!
pry (~> 0.12.2)
pry (~> 0.14)
rake (~> 13.0)
rspec (~> 3.0)
rubocop (~> 1.18.4)
8 changes: 2 additions & 6 deletions bin/console
@@ -7,9 +7,5 @@ require 'kiba/extend'
# You can add fixtures and/or initialization code here to make experimenting
# with your gem easier. You can also use a different console, if you like.

# (If you use this, don't forget to add pry to your Gemfile!)
# require "pry"
# Pry.start

require 'irb'
IRB.start(__FILE__)
require 'pry'
Pry.start
101 changes: 101 additions & 0 deletions doc/file_registry_entry.md
@@ -0,0 +1,101 @@
# File Registry Entry

## PATH_REQ

Constant registering the known source/destination classes and whether each requires a file path for read/write.

If you create or incorporate a new source/destination class, you will get a warning if you use it and do not register it here.

## reghash

A file registry entry is initialized with a Hash of data about the file. This Hash will be sent from your ETL application.

The allowable Hash keys, expected Hash value formats, and expectations about them are described below.

**`:path` [String] full or expandable relative path to the expected location of the file**

* default: `nil`
* required if either `:src_class` or `:dest_class` requires a path (in `PATH_REQ`)

`:src_class` [Class] the Ruby class used to read in data

* default: value of `Kiba::Extend.source` (`Kiba::Common::Sources::CSV` unless overridden by your ETL app)
* required, but default supplied if not given

`:src_opt` [Hash] file options used when reading in source

* default: value of `Kiba::Extend.csvopts`
* required, but default supplied if not given

`:dest_class` [Class] the Ruby class used to write out the data

* default: value of `Kiba::Extend.destination` (`Kiba::Extend::Destinations::CSV` unless overridden by your ETL app)
* required, but default supplied if not given

`:dest_opt` [Hash] file options used when writing data

* default: value of `Kiba::Extend.csvopts`
* required, but default supplied if not given

`:dest_special_opts` [Hash] additional options for writing out the data

* Not all destination classes support extra options. If you provide unsupported extra options, they will not be sent through to the destination class, and you will receive a warning in STDOUT. The current most common use is to define `initial_headers` (i.e. which columns should be first in file) to `Kiba::Extend::Destinations::CSV`.
* optional

```ruby
reghash = {
  path: '/path/to/file.csv',
  dest_class: Kiba::Extend::Destinations::CSV,
  dest_special_opts: { initial_headers: %i[objectnumber briefdescription] }
}
```
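
As a rough illustration of what "which columns should be first in file" means, the sketch below reorders a header list. The `order_headers` helper is hypothetical and is not part of kiba-extend; the destination class may implement this differently.

```ruby
# Hypothetical helper, not part of kiba-extend: put the requested columns
# first, then keep the remaining columns in their existing order.
def order_headers(headers, initial_headers)
  # Ignore requested columns that do not exist in the data.
  present = initial_headers & headers
  present + (headers - present)
end

headers = %i[id briefdescription objectnumber title]
ordered = order_headers(headers, %i[objectnumber briefdescription])
```

Under these assumptions, `ordered` begins with `:objectnumber` and `:briefdescription`, followed by the other columns in their original order.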

**`:creator` [Method] Ruby method that generates this file**

* Used to run ETL jobs to create necessary files, if said files do not exist
* required unless file is supplied

**`:supplied` [true, false] whether the file/data is supplied from outside the ETL**

- default: false
- Manually set to true for:
  - original data files from client
  - mappings/reconciliations to be merged into the ETL/migration
  - any other files created external to the ETL, which only need to be read from and never generated by the ETL process

Both of the following are valid:

```ruby
reghash = {
  path: '/project/working/objects_prep.csv',
  creator: Project::ClientData::ObjectTable.method(:prep)
}

reghash = {
  path: '/project/clientData/objects.csv',
  supplied: true
}
```

Note the following pattern for `:creator` values:

Class or Module constant name + `.method` + method name **as symbol**
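
A minimal sketch of why this pattern is used: `.method(:name)` returns a Ruby `Method` object without running the method, so the job can be called later, only if the file is needed. The module body here is a hypothetical stand-in; only the constant and method names come from the example above.

```ruby
# Hypothetical stand-in for a project module that prepares client data.
module Project
  module ClientData
    module ObjectTable
      module_function

      # In a real project this would run the ETL job that writes the file.
      def prep
        :objects_prep_written
      end
    end
  end
end

# `.method(:prep)` returns a Method object; `prep` is not called here.
creator = Project::ClientData::ObjectTable.method(:prep)

reghash = {
  path: '/project/working/objects_prep.csv',
  creator: creator
}

# The stored Method can be invoked later, when the file has to be created.
result = reghash[:creator].call
```

Writing `Project::ClientData::ObjectTable.prep` instead would call the method immediately and store its return value rather than something callable.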

**`:lookup_on` [Symbol] column to use as keys in lookup table created from file data**

* required if file is used as a lookup source
* You can register the same file multiple times under different file keys with different `:lookup_on` values if you need to use the data for different lookup purposes
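
The sketch below shows one file registered under two keys with different `:lookup_on` values. The registry keys, the rows, and the `build_lookup` helper are all hypothetical; kiba-extend's own lookup construction may differ.

```ruby
# Hypothetical registry data: the same supplied file under two file keys.
registry = {
  objects_by_number: {
    path: '/project/clientData/objects.csv',
    supplied: true,
    lookup_on: :objectnumber
  },
  objects_by_title: {
    path: '/project/clientData/objects.csv',
    supplied: true,
    lookup_on: :title
  }
}

# Hypothetical helper: group rows by the value in the lookup column.
def build_lookup(rows, lookup_on)
  rows.group_by { |row| row[lookup_on] }
end

# Stand-in for rows read from the file.
rows = [
  { objectnumber: '2021.1', title: 'Vase' },
  { objectnumber: '2021.2', title: 'Vase' }
]

by_number = build_lookup(rows, registry[:objects_by_number][:lookup_on])
by_title = build_lookup(rows, registry[:objects_by_title][:lookup_on])
```

Here `by_number` has one row per key, while `by_title` maps `'Vase'` to both rows, which is why the choice of `:lookup_on` column matters.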

`:desc` [String] description of what the file is/what it is used for. Used when post-processing reports results to STDOUT

* optional

`:tags` [Array<Symbol>] list of arbitrary tags useful for categorizing data/jobs in your ETL

* optional
* If set, you can filter to run only jobs tagged with a given tag
* Tags I commonly use:
  * `:report_problems` - reports that indicate something unexpected or that I need to do more work
  * `:report_fyi` - informational reports
  * `:cspace` - final files ready to import
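
A small sketch of tag-based filtering, assuming a plain Hash of reghashes. The `keys_tagged` helper and the registry keys are hypothetical and only illustrate the idea; they are not kiba-extend's API.

```ruby
# Hypothetical registry data using the tags listed above.
registry = {
  objects_prep: { path: '/project/working/objects_prep.csv', tags: %i[cspace] },
  objects_dupes: { path: '/project/working/objects_dupes.csv', tags: %i[report_problems] },
  objects_counts: { path: '/project/working/objects_counts.csv', tags: %i[report_fyi] },
  objects_src: { path: '/project/clientData/objects.csv', supplied: true }
}

# Hypothetical helper: return the file keys whose entries carry the given tag.
# Entries without :tags are treated as having no tags.
def keys_tagged(registry, tag)
  registry.select { |_key, reghash| reghash.fetch(:tags, []).include?(tag) }.keys
end

problem_reports = keys_tagged(registry, :report_problems)
```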

5 changes: 4 additions & 1 deletion kiba-extend.gemspec
@@ -37,12 +37,15 @@ Gem::Specification.new do |spec|

spec.add_dependency 'activesupport', '~> 6.1.4'
spec.add_dependency 'csv', '~> 3.0'
spec.add_dependency 'dry-configurable', '~> 0.11'
spec.add_dependency 'dry-container', '~> 0.8'
spec.add_dependency 'kiba', '~> 4.0.0'
spec.add_dependency 'kiba-common', '~> 1.5.0'
spec.add_dependency 'xxhash', '~> 0.4'

spec.add_development_dependency 'bundler', '>= 1.17'
spec.add_development_dependency 'pry', '~> 0.12.2'
spec.add_development_dependency 'byebug', '~>11.0'
spec.add_development_dependency 'pry', '~> 0.14'
spec.add_development_dependency 'rake', '~> 13.0'
spec.add_development_dependency 'rspec', '~> 3.0'
spec.add_development_dependency 'rubocop', '~> 1.18.4'
44 changes: 42 additions & 2 deletions lib/kiba/extend.rb
@@ -2,13 +2,17 @@

require 'active_support'
require 'active_support/core_ext/object'
require 'dry-configurable'
require 'kiba'
require 'kiba-common/sources/csv'
require 'kiba-common/sources/enumerable'
require 'kiba-common/destinations/csv'
require 'kiba-common/destinations/lambda'
require 'pry'
require 'byebug'
require 'xxhash'

require 'kiba/extend/version'
# require 'kiba/extend/version'

# Default CSV options
CSVOPT = { headers: true, header_converters: :symbol }.freeze
@@ -24,13 +28,49 @@
module Kiba
  # Provides a suite of abstract, reusable, well-tested data transformations for use in Kiba ETL pipelines
  module Extend
    puts "kiba-extend version: #{Kiba::Extend::VERSION}"
    module_function
    extend Dry::Configurable

    # Require application files
    Dir.glob("#{__dir__}/**/*").sort.select { |path| path.match?(/\.rb$/) }.each do |rbfile|
      require rbfile.delete_prefix("#{File.expand_path(__dir__)}/lib/")
    end

    # So we can call Kiba.job_segment
    Kiba.extend(Kiba::Extend::Jobs::JobSegmenter)

    # Default options for reading/writing CSVs
    setting :csvopts, { headers: true, header_converters: %i[symbol downcase] }, reader: true

    # Default settings for Lambda destination
    setting :lambdaopts, { on_write: ->(r) { accumulator << r } }, reader: true

    # Default delimiter for splitting/joining values in multi-valued fields
    setting :delim, ';', reader: true

    # Default source class for jobs
    setting :source, Kiba::Common::Sources::CSV, reader: true

    # Default destination class for jobs
    setting :destination, Kiba::Extend::Destinations::CSV, reader: true

    # Prefix for warnings from the ETL
    setting :warning_label, 'KIBA WARNING', reader: true

    setting :registry, Kiba::Extend::FileRegistry.new, reader: true

    setting :job, reader: true do
      # Whether to output results to STDOUT for debugging
      setting :show_me, false, reader: true
      # Whether to have computer say something when job is complete
      setting :tell_me, false, reader: true
      # How much output about jobs to output to STDOUT
      # :debug - tells you A LOT - helpful when developing pipelines and debugging
      # :normal - reports what is running, from where, and the results
      # :minimal - bare minimum
      setting :verbosity, :normal, reader: true
    end

    # strips, collapses multiple spaces, removes terminal commas, strips again
    CSV::Converters[:stripplus] = lambda { |s|
      begin
12 changes: 12 additions & 0 deletions lib/kiba/extend/jobs.rb
@@ -0,0 +1,12 @@
# frozen_string_literal: true

require_relative 'jobs/parser'

Kiba::Extend::Jobs.extend(Kiba::Extend::Jobs::Parser)

module Kiba
  module Extend
    module Jobs
    end
  end
end