Merge pull request #50 from leoebfolsom/lf/issue-49--compare_all_columns_macro_for_testing

lf/issue-49  compare all columns macro for testing
joellabes authored Sep 7, 2022
2 parents dd5d2ed + 7677cac commit bd58775
Showing 17 changed files with 404 additions and 83 deletions.
8 changes: 7 additions & 1 deletion .circleci/config.yml
@@ -5,7 +5,13 @@ jobs:
build:
docker:
- image: cimg/python:3.9.9
- image: circleci/postgres:9.6.5-alpine-ram
- image: cimg/postgres:14.0
auth:
username: dbt-labs
password: ''
environment:
POSTGRES_USER: root
POSTGRES_DB: circle_test

steps:
- checkout
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,4 +1,4 @@

target/
dbt_packages/
logs/
logfile
183 changes: 122 additions & 61 deletions README.md
@@ -7,6 +7,8 @@ Useful macros when performing data audits
* [compare_queries](#compare_queries-source)
* [compare_column_values](#compare_column_values-source)
* [compare_relation_columns](#compare_relation_columns-source)
* [compare_all_columns](#compare_all_columns-source)
* [compare_column_values_verbose](#compare_column_values_verbose-source)

# Installation instructions
New to dbt packages? Read more about them [here](https://docs.getdbt.com/docs/building-a-dbt-project/package-management/).
@@ -160,67 +162,6 @@ number of your records don't match.
work as expected.


### Advanced usage:
Got a wide table, and want to iterate through all the columns? Try something
like this:
```
{%- set columns_to_compare=adapter.get_columns_in_relation(ref('dim_product')) -%}
{% set old_etl_relation_query %}
select * from public.dim_product
where is_latest
{% endset %}
{% set new_etl_relation_query %}
select * from {{ ref('dim_product') }}
{% endset %}
{% if execute %}
{% for column in columns_to_compare %}
{{ log('Comparing column "' ~ column.name ~'"', info=True) }}
{% set audit_query = audit_helper.compare_column_values(
a_query=old_etl_relation_query,
b_query=new_etl_relation_query,
primary_key="product_id",
column_to_compare=column.name
) %}
{% set audit_results = run_query(audit_query) %}
{% do audit_results.print_table() %}
{{ log("", info=True) }}
{% endfor %}
{% endif %}
```

This will give you an output like:
```
Comparing column "name"
| match_status | count_records | percent_of_total |
| -------------------- | ------------- | ---------------- |
| ✅: perfect match | 41,573 | 99.43 |
| 🤷: missing from b | 26 | 0.06 |
| 🙅: ‍values do not... | 212 | 0.51 |
Comparing column "msrp"
| match_status | count_records | percent_of_total |
| -------------------- | ------------- | ---------------- |
| ✅: perfect match | 31,145 | 74.49 |
| ✅: both are null | 10,557 | 25.25 |
| 🤷: missing from b | 22 | 0.05 |
| 🤷: value is null ... | 31 | 0.07 |
| 🤷: value is null ... | 4 | 0.01 |
| 🙅: ‍values do not... | 52 | 0.12 |
Comparing column "status"
| match_status | count_records | percent_of_total |
| -------------------- | ------------- | ---------------- |
| ✅: perfect match | 37,715 | 90.20 |
| 🤷: missing from b | 26 | 0.06 |
| 🙅: ‍values do not... | 4,070 | 9.73 |
```

### Advanced usage - dbt Cloud:
The ``.print_table()`` function is not compatible with dbt Cloud so an adjustment needs to be made in order to print the results. Replace the following section of code:
```
@@ -280,5 +221,125 @@ it is a date in our "b" relation.

```

## compare_all_columns ([source](macros/compare_all_columns.sql))
This macro is designed to be added to a dbt test suite as a custom test. A
`compare_all_columns` test monitors changes in data values when code is changed
as part of a PR or during development. It sets up a test that will fail
if any column values do not match.

You can configure what exactly constitutes a value match or failure. If
there is a test failure, results can be inspected in the warehouse. The primary key
and the column name can be included in the test output that gets written to the warehouse.
This lets you join test results to the relevant tables in your dev or prod schema to investigate the failure.

### Usage:

_Note: this test should only be used on (and will only work on) models that have a primary key that is reliably `unique` and `not_null`. [Generic dbt tests](https://docs.getdbt.com/docs/building-a-dbt-project/tests#generic-tests) should be used to ensure the model being tested meets the requirements of `unique` and `not_null`._

To create a test for the `stg_customers` model, create a custom test
in the `tests` subdirectory of your dbt project that looks like this:

```
{{
audit_helper.compare_all_columns(
a_relation=ref('stg_customers'), -- in a test, this ref will compile as your dev or PR schema.
b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'), -- you can explicitly write a relation to select your production schema, or any other db/schema/table you'd like to use for comparison testing.
exclude_columns=['updated_at'],
primary_key='id'
)
}}
where not perfect_match
```
The `where not perfect_match` statement is an example of a filter you can apply to define what
constitutes a test failure. The test will fail if any rows don't meet the
requirement of a perfect match. Failures would include:

* If the primary key exists in both relations, but one model has a null value in a column.
* If a primary key is missing from one relation.
* If the primary key exists in both relations, but the values conflict.
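
The failure categories above can be illustrated outside the warehouse. The following is a minimal Python sketch, not the macro's SQL: it uses the two seed tables from this commit's integration tests and the flag names from the expected-results fixtures (`perfect_match`, `null_in_a`, `null_in_b`, `missing_from_a`, `missing_from_b`, `conflicting_values`). The helper names `load`, `classify` and `summarize` are hypothetical. Treating two nulls as a perfect match is an assumption, since that case does not occur in the fixtures.

```python
import csv
import io

# Seed rows copied from this commit's integration-test fixtures.
# Relation "a" holds id 8 and relation "b" holds id 9, which is consistent
# with the expected results (id 8 is missing_from_b, id 9 is missing_from_a).
A_CSV = """id,fruit,ripeness
1,banana,yellow
2,banana,green
3,banana,brown
4,orange,green
5,orange,orange
6,orange,brown
7,orange,
8,apple,mushy
"""
B_CSV = """id,fruit,ripeness
1,banana,yellow
2,banana,brown
3,banana,brown
4,orange,green
5,orange,orange
6,,brown
7,orange,orange
9,apple,mushy
"""

FLAGS = ["perfect_match", "null_in_a", "null_in_b",
         "missing_from_a", "missing_from_b", "conflicting_values"]


def load(text):
    """Index rows by primary key; empty CSV fields become None (SQL NULL)."""
    rows = csv.DictReader(io.StringIO(text))
    return {row["id"]: {k: (v or None) for k, v in row.items()} for row in rows}


def classify(a_row, b_row, column):
    """Return the one flag that applies to a (primary key, column) cell."""
    if a_row is None:
        return "missing_from_a"
    if b_row is None:
        return "missing_from_b"
    a_val, b_val = a_row[column], b_row[column]
    if a_val == b_val:
        return "perfect_match"  # also covers both-null: an assumption
    if a_val is None:
        return "null_in_a"
    if b_val is None:
        return "null_in_b"
    return "conflicting_values"


def summarize(a, b, columns):
    """Count flags per column across every primary key seen in a or b."""
    keys = sorted(set(a) | set(b))
    summary = {}
    for column in columns:
        counts = dict.fromkeys(FLAGS, 0)
        for key in keys:
            counts[classify(a.get(key), b.get(key), column)] += 1
        summary[column] = counts
    return summary


a, b = load(A_CSV), load(B_CSV)
summary = summarize(a, b, ["id", "fruit", "ripeness"])
for column, counts in summary.items():
    print(column, counts)
```

For these seeds the sketch counts 7, 6 and 5 perfect matches for `id`, `fruit` and `ripeness`, which agrees with the summarized expected-results fixture in this commit.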

If you'd like the test to only fail when there are conflicting values, you could configure it like this:

```
{{
audit_helper.compare_all_columns(
a_relation=ref('stg_customers'),
b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'),
primary_key='id'
)
}}
where conflicting_values
```
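
To see how much the choice of filter matters, here is a small Python sketch that applies both filters to the nine non-matching rows listed in this commit's where-clause expected-results fixture. The names `NON_MATCHING`, `fails_strict` and `fails_conflicts_only` are hypothetical, and each tuple records only the flag that is true for that row.

```python
# (primary_key, column_name, flag that is true) for each non-matching row,
# copied from the where-clause expected-results fixture in this commit.
NON_MATCHING = [
    (8, "ID", "missing_from_b"),
    (9, "ID", "missing_from_a"),
    (6, "FRUIT", "null_in_b"),
    (8, "FRUIT", "missing_from_b"),
    (9, "FRUIT", "missing_from_a"),
    (2, "RIPENESS", "conflicting_values"),
    (7, "RIPENESS", "null_in_a"),
    (8, "RIPENESS", "missing_from_b"),
    (9, "RIPENESS", "missing_from_a"),
]

# "where not perfect_match" keeps every non-matching row.
fails_strict = list(NON_MATCHING)

# "where conflicting_values" keeps only rows where both sides have a
# value and the values disagree.
fails_conflicts_only = [
    row for row in NON_MATCHING if row[2] == "conflicting_values"
]

print(len(fails_strict), len(fails_conflicts_only))
```

Either filter would fail the test on this data, because a dbt test fails when its query returns any rows. The difference is in what gets reported: nine rows under the strict filter, against the single `RIPENESS` conflict for primary key 2 under the narrower one.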

#### Arguments:

* `a_relation` and `b_relation`: The [relations](https://docs.getdbt.com/reference#relation)
you want to compare. Any two relations that have the same columns can be used. In the
example above, two different approaches to writing relations, using `ref` and
using `api.Relation.create`, are demonstrated. (When writing one-off code, it might make sense to
hard-code a relation, like this: `analytics_prod.stg_customers`. A hard-coded relation
is not recommended when building this macro into a CI cycle.)
* `exclude_columns` (optional): Any columns you wish to exclude from the
validation.
* `primary_key`: The primary key of the model. Used to sort unmatched
results for row-by-row validation.
* `summarize` (optional): Not used in the examples above. The integration tests
in this commit pass `summarize=false` to get one row per primary key and column,
and expect one row of counts per column when the argument is omitted.

If you want to create test results that include columns from the model itself
for easier inspection, that can be written into the test:

```
{{
audit_helper.compare_all_columns(
a_relation=ref('stg_customers'),
b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'),
exclude_columns=['updated_at'],
primary_key='id'
)
}}
left join {{ ref('stg_customers') }} using(id)
```

This structure also allows for the test to group or filter by any attribute in the model or in
the macro's output as part of the test, for example:

```
with base_test_cte as (
{{
audit_helper.compare_all_columns(
a_relation=ref('stg_customers'),
b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'),
exclude_columns=['updated_at'],
primary_key='id'
)
}}
left join {{ ref('stg_customers') }} using(id)
where conflicting_values
)
select
status, -- assume there's a "status" column in stg_customers
count(distinct case when conflicting_values then id end) as conflicting_values
from base_test_cte
group by 1
```

You can write a `compare_all_columns` test for an individual table, and the test will run
as part of a full test suite run.

```
dbt test --select stg_customers
```

If you want to [store results in the warehouse for further analysis](https://docs.getdbt.com/docs/building-a-dbt-project/tests#storing-test-failures), add the `--store-failures`
flag.

```
dbt test --select stg_customers --store-failures
```

## compare_column_values_verbose ([source](macros/compare_column_values_verbose.sql))
This macro will return a query that, when executed, returns the same information as
`compare_column_values`, but not summarized. `compare_column_values_verbose` enables `compare_all_columns` to give the user more flexibility around what will result in a test failure.


# To-do:
* Macro to check if two schemas contain the same relations
11 changes: 11 additions & 0 deletions integration_tests/models/compare_all_columns_where_clause.sql
@@ -0,0 +1,11 @@
{% set a_relation=ref('data_compare_all_columns__market_of_choice_produce')%}

{% set b_relation=ref('data_compare_all_columns__albertsons_produce') %}

{{ audit_helper.compare_all_columns(
a_relation=a_relation,
b_relation=b_relation,
primary_key="id",
summarize=false
) }}
where not perfect_match
9 changes: 9 additions & 0 deletions integration_tests/models/compare_all_columns_with_summary.sql
@@ -0,0 +1,9 @@
{% set a_relation=ref('data_compare_all_columns__market_of_choice_produce')%}

{% set b_relation=ref('data_compare_all_columns__albertsons_produce') %}

{{ audit_helper.compare_all_columns(
a_relation=a_relation,
b_relation=b_relation,
primary_key="id"
) }}
@@ -0,0 +1,10 @@
{% set a_relation=ref('data_compare_all_columns__market_of_choice_produce')%}

{% set b_relation=ref('data_compare_all_columns__albertsons_produce') %}

{{ audit_helper.compare_all_columns(
a_relation=a_relation,
b_relation=b_relation,
primary_key="id",
exclude_columns=['ripeness']
) }}
10 changes: 10 additions & 0 deletions integration_tests/models/compare_all_columns_without_summary.sql
@@ -0,0 +1,10 @@
{% set a_relation=ref('data_compare_all_columns__market_of_choice_produce')%}

{% set b_relation=ref('data_compare_all_columns__albertsons_produce') %}

{{ audit_helper.compare_all_columns(
a_relation=a_relation,
b_relation=b_relation,
primary_key="id",
summarize=false
) }}
21 changes: 21 additions & 0 deletions integration_tests/models/schema.yml
@@ -35,3 +35,24 @@ models:
tests:
- dbt_utils.equality:
compare_model: ref('expected_results__compare_relations_without_exclude')

- name: compare_all_columns_with_summary
tests:
- dbt_utils.equality:
compare_model: ref('expected_results__compare_all_columns_with_summary')

- name: compare_all_columns_without_summary
tests:
- dbt_utils.equality:
compare_model: ref('expected_results__compare_all_columns_without_summary')


- name: compare_all_columns_with_summary_and_exclude
tests:
- dbt_utils.equality:
compare_model: ref('expected_results__compare_all_columns_with_summary_and_exclude')

- name: compare_all_columns_where_clause
tests:
- dbt_utils.equality:
compare_model: ref('expected_results__compare_all_columns_where_clause')
@@ -0,0 +1,9 @@
id,fruit,ripeness
1,banana,yellow
2,banana,brown
3,banana,brown
4,orange,green
5,orange,orange
6,,brown
7,orange,orange
9,apple,mushy
@@ -0,0 +1,9 @@
id,fruit,ripeness
1,banana,yellow
2,banana,green
3,banana,brown
4,orange,green
5,orange,orange
6,orange,brown
7,orange,
8,apple,mushy
@@ -0,0 +1,10 @@
primary_key,column_name,perfect_match,null_in_a,null_in_b,missing_from_a,missing_from_b,conflicting_values
8,ID,false,false,false,false,true,false
9,ID,false,false,false,true,false,false
6,FRUIT,false,false,true,false,false,false
8,FRUIT,false,false,false,false,true,false
9,FRUIT,false,false,false,true,false,false
2,RIPENESS,false,false,false,false,false,true
7,RIPENESS,false,true,false,false,false,false
8,RIPENESS,false,false,false,false,true,false
9,RIPENESS,false,false,false,true,false,false
@@ -0,0 +1,4 @@
column_name,perfect_match,null_in_a,null_in_b,missing_from_a,missing_from_b,conflicting_values
ID,7,0,0,1,1,0
FRUIT,6,0,1,1,1,0
RIPENESS,5,1,0,1,1,1
@@ -0,0 +1,3 @@
column_name,perfect_match,null_in_a,null_in_b,missing_from_a,missing_from_b,conflicting_values
ID,7,0,0,1,1,0
FRUIT,6,0,1,1,1,0
@@ -0,0 +1,28 @@
primary_key,column_name,perfect_match,null_in_a,null_in_b,missing_from_a,missing_from_b,conflicting_values
1,ID,true,false,false,false,false,false
2,ID,true,false,false,false,false,false
3,ID,true,false,false,false,false,false
4,ID,true,false,false,false,false,false
5,ID,true,false,false,false,false,false
6,ID,true,false,false,false,false,false
7,ID,true,false,false,false,false,false
8,ID,false,false,false,false,true,false
9,ID,false,false,false,true,false,false
1,FRUIT,true,false,false,false,false,false
2,FRUIT,true,false,false,false,false,false
3,FRUIT,true,false,false,false,false,false
4,FRUIT,true,false,false,false,false,false
5,FRUIT,true,false,false,false,false,false
6,FRUIT,false,false,true,false,false,false
7,FRUIT,true,false,false,false,false,false
8,FRUIT,false,false,false,false,true,false
9,FRUIT,false,false,false,true,false,false
1,RIPENESS,true,false,false,false,false,false
2,RIPENESS,false,false,false,false,false,true
3,RIPENESS,true,false,false,false,false,false
4,RIPENESS,true,false,false,false,false,false
5,RIPENESS,true,false,false,false,false,false
6,RIPENESS,true,false,false,false,false,false
7,RIPENESS,false,true,false,false,false,false
8,RIPENESS,false,false,false,false,true,false
9,RIPENESS,false,false,false,true,false,false