Collocating materialized test results into a single model #8929

dehume · 2023-10-26T18:02:44Z

It would be helpful to have a way to combine materialized tests (store_failures_as) into a single object. For example, if a dbt model had generic tests on a column for not null, unique, and accepted values (leaving out relationship tests, which would be harder), then rather than storing each of those queries separately, they would be combined into something like:

SELECT
-- Not Null
SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS {column}__is_null,
-- Unique
COUNT(DISTINCT {column}) = COUNT({column}) AS {column}__is_unique,
-- Accepted Values
SUM(CASE WHEN {column} NOT IN ('{allowed value}') THEN 1 ELSE 0 END) AS {column}__accepted_values
FROM {model}

Then tests/generic/builtin would query that aggregate table to look up each of the three tests. So checking the not null test for the model would be:

SELECT {column}__is_null FROM {test_model}

The problem is that this breaks the 1:1 nature of dbt tests. I understand how to override the generic test queries, but I am less clear on how to first aggregate the tests into a single DDL command at the beginning of the test command.
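To make the proposal concrete, here is a minimal sketch of the combined query run against an in-memory SQLite table from Python. The table `orders`, the column `status`, the accepted values, and the results table name `orders__test_results` are all made up for illustration. dbt is not involved; the SQL is just a hand-expanded version of the template above.

```python
import sqlite3

# Hypothetical model and data, standing in for {model} and {column}.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("placed",), ("shipped",), ("shipped",), (None,), ("lost",)],
)

# One scan of the model produces one column per generic test.
conn.execute(
    """
    CREATE TABLE orders__test_results AS
    SELECT
        SUM(CASE WHEN status IS NULL THEN 1 ELSE 0 END) AS status__is_null,
        COUNT(DISTINCT status) = COUNT(status) AS status__is_unique,
        SUM(CASE WHEN status NOT IN ('placed', 'shipped') THEN 1 ELSE 0 END)
            AS status__accepted_values
    FROM orders
    """
)

# Each individual test then becomes a cheap lookup on the aggregate table.
is_null, is_unique, bad_values = conn.execute(
    "SELECT status__is_null, status__is_unique, status__accepted_values "
    "FROM orders__test_results"
).fetchone()
print(is_null, is_unique, bad_values)
```

Note that the NULL row is not counted by the accepted-values column, because `NULL NOT IN (...)` evaluates to NULL rather than true, so only the not-null column reports it.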

rsanjabi · 2023-10-26T18:47:42Z

One solution we are trying for reporting on and aggregating test results is to create a model for each test type. Basically, we start from what we want the final test schema to look like for reporting purposes, then rewrite each test to match it. Then we use pattern matching on test names to figure out what needs to be unioned together.

This meant rewriting the default test. For not_null it looked something like the following. primary_key isn't necessary, but passing it in as metadata lets us generate a drill-down query for folks who want to investigate further. Failure results in JSON format could also work for maintaining a consistent schema for unioning across all test types.

{% test expect_column_to_be_not_null(model, column_name, primary_key) %}

{#
    Used in lieu of the standard not_null test in order to provide additional meta-data information 
    in the result set. 
#}

select
    '{{ model.name }}' as model_name
    , '{{ primary_key }}' as primary_key_name
    , {{ primary_key }} as primary_key_value
    , '{{ column_name }}' as column_name
    , 'expect {{ column_name }} to be not null' as failure_reason
from {{ model }}
where {{ column_name }} is null

{% endtest %}

The test is set in the schema.yml file. Note that the name config might not be necessary, but with Redshift we were running into situations where the default test results name was too long and ended up hashed. Setting it by hand lets us ensure it's unique, so it doesn't overwrite another test's results, while still being short enough not to be hashed (so we can find it in the unioning step).

    columns:
      - name: customer_address
        description: "Customer's street address"
        tests:
          - expect_column_to_be_not_null:
              name: test_not_null_<table_name>_customer_address
              primary_key: 'customer_id'

We also set store_failures: true in the tests section of dbt_project.yml.

Next, a helper macro returns a list of nodes whose names start with a search string:

/*
This macro can be used to generate a list of nodes (models or tests) that use a prefixed
naming convention. This macro is typically paired with a model that dynamically unions test
results or models together.
*/

{% macro get_list_of_nodes(search_pattern, node_type='model', log_enabled=false) %}

{%- set node_list = [] -%}

{%- for node in graph.nodes.values() | selectattr("resource_type", "equalto", node_type) -%}
    {%- if node.name.startswith(search_pattern) -%}
        {%- do node_list.append(node.name) -%}
    {%- endif -%}
{%- endfor %}

{{ return(node_list) }}

{% endmacro %}
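For readers who want to check the filtering logic outside of dbt, here is a rough Python equivalent of the macro. The `graph_nodes` dict is a hypothetical stand-in for dbt's `graph.nodes` with only the two fields the macro reads, and the project and node names are invented.

```python
# Hypothetical stand-in for dbt's `graph.nodes`: node dicts keyed by unique id.
graph_nodes = {
    "model.proj.customers": {"name": "customers", "resource_type": "model"},
    "test.proj.test_not_null_customers_customer_address": {
        "name": "test_not_null_customers_customer_address",
        "resource_type": "test",
    },
    "test.proj.test_unique_customers_customer_id": {
        "name": "test_unique_customers_customer_id",
        "resource_type": "test",
    },
    # A model whose name happens to share the prefix; the type filter excludes it.
    "model.proj.test_not_null_summary": {
        "name": "test_not_null_summary",
        "resource_type": "model",
    },
}


def get_list_of_nodes(nodes, search_pattern, node_type="model"):
    """Return names of nodes of `node_type` whose name starts with `search_pattern`."""
    return [
        node["name"]
        for node in nodes.values()
        if node["resource_type"] == node_type
        and node["name"].startswith(search_pattern)
    ]


not_null_tests = get_list_of_nodes(graph_nodes, "test_not_null_", "test")
print(not_null_tests)
```

The sketch shows why both conditions matter: the prefix alone would also pick up a model that happens to be named like a test.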

Finally, a model unions all the not_null test results together:

{#
    This model uses the dbt graph to create a list of all possible test models for not_null.
#}

{% if execute %}
    {%- set not_null_test_results = get_list_of_nodes('test_not_null_', 'test') %}

    {# Loop through the list and union the results together #}
    {%- for result in not_null_test_results -%}
        {%- set test_table=api.Relation.create(database='warehouse',
                                                schema=generate_schema_name('test_results'),
                                                identifier=result)
                                            -%}

        select * from {{ test_table }}

        {% if not loop.last %}
        union all
        {% endif -%}

    {%- endfor -%}

{% endif %}
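The SQL this model renders can be previewed with a small Python sketch. It assumes two hypothetical stored-failure tables that share the same columns, and it uses SQLite in place of the warehouse; `build_union_sql` is an illustrative helper, not part of dbt.

```python
import sqlite3


def build_union_sql(relations):
    """Render one `select *` per relation, joined by `union all`."""
    return "\nunion all\n".join(f"select * from {rel}" for rel in relations)


# Hypothetical stored-failure tables, all sharing the same three columns.
failures = {
    "test_not_null_customers_customer_address": [("customers", "customer_address", 7)],
    "test_not_null_orders_status": [("orders", "status", 3), ("orders", "status", 9)],
}

conn = sqlite3.connect(":memory:")
for table, rows in failures.items():
    conn.execute(
        f"CREATE TABLE {table} "
        "(model_name TEXT, column_name TEXT, primary_key_value INTEGER)"
    )
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)

union_sql = build_union_sql(list(failures))
all_failures = conn.execute(union_sql).fetchall()
print(union_sql)
print(len(all_failures))
```

An empty list yields an empty string here, just as the Jinja loop would render nothing when no test names match the prefix, so a real model probably needs a fallback for that case. The union also only works because every test returns the same columns, which is the reason for rewriting the tests to a shared schema.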

From here we can create a model by hand that sums the results of the 8 or so tests we are using across the project and organizes them by model, which sounds like where you need to end up. We were looking at potentially a thousand or more tests, and modeling each one by hand would be prohibitive. Ultimately, I think some of the observability tools might be a better fit for our needs, but this seems to be working. We're only on the initial pass at this, so I'm not sure about scalability. And I welcome feedback on improvements!

If there were a way to do this natively with dbt core as a test materialization that would be awesome! This isn't the first time I've worked with a client where we want to have one materialization of all test results, and ideally we wouldn't be modeling any of this after the fact.


ttusing · 2023-11-15T09:08:01Z

I had a similar idea.

Some of my models have many, many generic tests! What if, instead of each test submitting its own SQL query, they were submitted together as a single unioned query?

I suspect this may be much faster in many circumstances.

I am imagining this as (pseudocode):

select count(*) as failing_rows, 'test1' as test_name from (test1sql)
union all
select count(*) as failing_rows, 'test2' as test_name from (test2sql)
union all
...

And a model with a config like union_tests would have this applied to it, so that the many generic tests are compiled into a single test (on compile).

If persisting test results is turned on, then on failure a post-run hook or something similar could perhaps rerun the failing test to persist its results.
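A minimal sketch of that unioned query, run from Python against SQLite rather than through dbt. The `union_tests` helper, the table, and the two per-test queries are all hypothetical; each query is assumed to return the rows that fail its test, which is how dbt generic tests are written.

```python
import sqlite3


def union_tests(test_sql_by_name):
    """Wrap each test query in a failing-row count and union the counts."""
    parts = [
        f"select count(*) as failing_rows, '{name}' as test_name from ({sql}) as t"
        for name, sql in test_sql_by_name.items()
    ]
    return "\nunion all\n".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("placed",), ("shipped",), ("shipped",), (None,), (None,)],
)

# Hypothetical per-test queries; each returns the rows that fail the test.
tests = {
    "not_null_orders_status": "select status from orders where status is null",
    "unique_orders_status": (
        "select status from orders where status is not null "
        "group by status having count(*) > 1"
    ),
}

# One round trip returns a failing-row count for every test.
failing_rows = {
    name: count for count, name in conn.execute(union_tests(tests))
}
print(failing_rows)
```

Any test with a nonzero count could then be rerun on its own to persist the failing rows. Whether the single query is actually faster will depend on the warehouse, so that part would need measuring.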


morsapaes · 2023-11-29T21:07:57Z

Linking to #4613.
