Dbt setup #4011
base: main
Conversation
environments/conda-linux-64.lock.yml
Outdated
- importlib-metadata=8.5.0=pyha770c72_1
- importlib-metadata=6.10.0=pyha770c72_0
For some reason dbt-semantic-interfaces is stuck between 6 and 7.
- isodate=0.7.2=pyhd8ed1ab_1
- isodate=0.6.1=pyhd8ed1ab_0
dbt-common is stuck between 0.6 and 0.7.
src/pudl/dbt/models/schema.yml
Outdated
tables:
  - name: out_eia923__boiler_fuel
  - name: out_eia923__monthly_boiler_fuel
  - name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402
  - name: out_vcerare__hourly_available_capacity_factor
In a data warehouse with hundreds of tables, would this file be created and managed by hand? Or would there be some rule-based way to generate it, or parts of it, along the lines of what we're doing with the Pandera schema checks right now? For example, the not_null tests here are a second place where that restriction is being specified -- it's already present in our table metadata, which seems like a recipe for them getting out of sync.
Or in the case of row counts, is there a clean, non-manual way to update the row counts to reflect whatever the currently observed counts are? Especially if we're trying to regenerate expected row counts for each individual year, filling it all in manually could be pretty tedious and error-prone. We've moved toward specifying per-year row counts on the newer assets so that they work transparently in either the fast or full ETL cases, and the asset checks don't need to be aware of which kind of job they're being run in, which seems both more specific and more robust.
Looks like the "X column is not null" checks are currently defined in fields.py under the field constraints; is that what you're thinking about?
I think it would be nice to have auto-generated tests like the non-null tests & row counts defined alongside manually added tests. Then all the tests will be defined in one place, except for the tests that we need to write custom Python code for.
That seems pretty doable - YAML is easy to work with, and dbt lets us tag tests, so we could easily tag all the auto-generated tests so our generation scripts know to replace them but leave the manually-added tests alone.
In addition to the field-specific constraints, I think we automatically add NOT NULL check constraints to the PK fields when we construct the SQLite database -- but more generally I'm just saying that we need to get all of these generated tests integrated non-duplicatively into the dbt tests somehow.
It seems totally possible to auto-generate tests, but there are also probably many ways to accomplish this, so we should figure out what we want from it. For example, when we talk about auto-generating row count / not null tests, will these be generated once and committed into the repo, or will some/all of them be dynamically generated at runtime?
It definitely seems tricky to minimize duplication between dbt and our existing Python schema info. I also wonder how this plays into any refactoring of our metadata system?
It feels like we may need to clearly define the data tests that are ready to be migrated in a straightforward way, and the things that still need design work, so we can point Margay folks at the stuff that's ready to go and keep thinking about the things that still need some scaffolding?
If we do end up needing to define these intermediate tables it seems like we would want to have some kind of clear naming convention for them?
Yeah, I think that seems like a good idea. Maybe just use a validation_ prefix and otherwise follow existing naming conventions?
src/pudl/dbt/models/schema.yml
Outdated
- dbt_expectations.expect_compound_columns_to_be_unique:
    column_list: ["county_id_fips", "datetime_utc"]
Could be generated based on the PK that's defined for every table?
Should be possible. We can also probably come up with a way to generate foreign key checks, so we can actually verify foreign keys for tables that only exist in Parquet.
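For reference, dbt ships a built-in `relationships` test that does this kind of check, and a hand-rolled version is only a few lines of SQL. Here's a minimal sketch (the macro name and arguments are illustrative, not part of this PR); a dbt test passes only when the query returns zero rows:

```sql
-- Hypothetical generic test: return any child value with no match in the
-- parent table. Zero rows returned means the foreign key check passes.
{% test foreign_key_exists(model, column_name, to, field) %}
SELECT child.{{ column_name }} AS orphaned_value
FROM {{ model }} AS child
LEFT JOIN {{ to }} AS parent
    ON child.{{ column_name }} = parent.{{ field }}
WHERE child.{{ column_name }} IS NOT NULL
  AND parent.{{ field }} IS NULL
{% endtest %}
```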
src/pudl/dbt/models/schema.yml
Outdated
- dbt_expectations.expect_table_row_count_to_equal:
    value: |
      {%- if target.name == "etl-fast" -%} 27287400
      {%- else -%} 136437000
      {%- endif -%}
Is there a clean way to specify the expected row counts for each year of data (or some other meaningful subset) within a table, as we've started doing for the newer assets in Dagster asset checks, so we don't have to differentiate between fast and full validations, and can identify where the changes are?
We'd probably need to create a custom macro for this, but that seems totally doable. Big question is how we want to generate/store all of those tests.
The row count tests have functionally become regression tests -- we want to know when they change, and verify that the magnitude and nature of the change is expected based on the code or data that we've changed. Given that there are hundreds of tables (and thousands of table-years) it doesn't seem practical to hand-code all of the expected row counts.
It would be nice to have the per table-year row counts stored in (say) YAML somewhere, and be able to generate a new version of that file based on current ETL outputs. Then we could look at the diffs between the old and the new versions of the file when trying to assess changes in the lengths of the outputs.
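As a sketch of what that regeneration step could look like (assuming a DuckDB environment, an illustrative Parquet path, and an assumed seed layout of table_name / partition_key / row_count columns -- none of which are pinned down in this PR):

```sql
-- Recompute per-partition row counts from the current outputs and overwrite
-- the committed seed file, so a plain diff of the CSV shows exactly which
-- table/partition counts changed. Paths and column names are assumptions.
COPY (
    SELECT
        'out_eia923__boiler_fuel' AS table_name,
        report_date AS partition_key,
        count(*) AS row_count
    FROM read_parquet('path/to/out_eia923__boiler_fuel.parquet')
    GROUP BY 1, 2
    ORDER BY partition_key
) TO 'seeds/row_counts.csv' (HEADER, DELIMITER ',');
```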
src/pudl/dbt/models/schema.yml
Outdated
- dbt_expectations.expect_column_quantile_values_to_be_between:
    quantile: 0.05
    min_value: 1.5
I'm guessing these are not using the weighted quantiles?
Yeah, these are just basic quantiles. It's not too hard to get a SQL query that can do a version of weighted quantiles, but the existing vs_historical tests are hard because they're computing a bunch of quantiles, then comparing them all.
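For what it's worth, a single weighted quantile in plain DuckDB-flavored SQL can look roughly like this (a sketch, not the PR's macro; the table and column names are illustrative):

```sql
-- Weighted 95th percentile: order by the value, accumulate the weights,
-- and take the first value whose cumulative weight crosses 95% of the total.
WITH ordered AS (
    SELECT
        fuel_cost_per_mmbtu AS val,
        fuel_consumed_mmbtu AS wgt,
        SUM(fuel_consumed_mmbtu) OVER (ORDER BY fuel_cost_per_mmbtu) AS cum_wgt,
        SUM(fuel_consumed_mmbtu) OVER () AS total_wgt
    FROM fuel_by_plant  -- illustrative table name
    WHERE fuel_cost_per_mmbtu IS NOT NULL
      AND fuel_consumed_mmbtu IS NOT NULL
)
SELECT MIN(val) AS weighted_quantile
FROM ordered
WHERE cum_wgt >= 0.95 * total_wgt;
```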
dbt/README.md
Outdated
### Adding new tables
To add a new table to the project, you must add it as a
[dbt source](https://docs.getdbt.com/docs/build/sources). The standard way to do
this is to create a new file `models/{data_source}/{table_name}.yml`. If the
@zschira I am confused by the difference between this instruction that tells users to add the table schema as an individual yml file with the table name, and the instructions below for adding tests, where you mention a schema.yml file that would presumably contain all models.
I think we're close to being able to convert all the existing tests using the validate.py framework. It seems like there are currently 3 categories of tests in this framework:
- vs_historical / vs_self tests, which we've decided are out of scope and could be replaced with simpler tests. @zaneselvans we should decide what to do with these tests.
- Bounds checks on un-weighted quantiles. These are immediately convertible using dbt_expectations.
- Bounds checks on weighted quantiles. These can be converted if we get the weighted_quantile.sql macro working.
Beyond this, the script I've added to generate scaffolding for new tables will add row count checks per partition. If there are any other checks we want to autogenerate from our metadata we should decide on that now.
@@ -0,0 +1,267 @@
"""A basic CLI to autogenerate dbt yml.""" |
Here's the script to add a new table to the project. It will generate a file called dbt/models/{data_source}/{table_name}/schema.yml that defines the basic schema for a table for dbt, and adds a test for row counts per partition. I've added some documentation for usage in dbt/README.md.
We should decide if there are any other tests we want to automatically add based on our metadata at this point.
@@ -0,0 +1,26 @@
{% macro weighted_quantile(model, column_name, weight_col, quantile) %} |
@marianneke this is the macro that's currently broken. It gets called in expect_column_weighted_quantile_values_to_be_between.sql. The idea is that this will compute a weighted quantile on a table, then that test will check the quantile vs bounds. I think the issue is with the syntax using WITH here. The macro dbt/macros/check_row_counts_per_partition.sql has an example where I fixed this to use the proper DuckDB syntax.
Can you clarify what the syntax issue was and how you solved it in the row count macro?
Hmm, I'm actually not sure that was the issue after all. I was getting a similar error message with the row count macro and I did have a syntax error there, but this looks OK to me. It might also be something to do with the expect_column_weighted_quantile_values_to_be_between macro, which is calling this under the hood.
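As background on the pattern being discussed (a sketch, not the PR's actual expect_column_weighted_quantile_values_to_be_between macro): a dbt generic test is just a query whose returned rows are failures, so "compute a quantile, then compare it to bounds" can be structured roughly like this, with DuckDB's built-in quantile_cont standing in for the weighted version:

```sql
-- Hypothetical generic test: compute the quantile once, then emit a row
-- only if it falls outside the allowed range. Zero rows = test passes.
{% test quantile_in_bounds(model, column_name, quantile, min_value, max_value) %}
WITH q AS (
    SELECT quantile_cont({{ column_name }}, {{ quantile }}) AS q_value
    FROM {{ model }}
)
SELECT q_value
FROM q
WHERE q_value < {{ min_value }} OR q_value > {{ max_value }}
{% endtest %}
```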
@@ -0,0 +1,18 @@
{% test check_row_counts_per_partition(model, table_name, partition_column) %} |
Compares row counts in the table to expected counts, which are stored in seeds/row_counts.csv.
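One way a test like this can be structured (a sketch that mirrors the macro's signature but is not necessarily how the PR implements it; the seed's column names here are assumptions):

```sql
-- Count rows per partition value in the model and return any partition
-- whose observed count disagrees with the expected count in the seed.
-- Zero rows returned means the test passes.
{% test row_counts_match_seed(model, table_name, partition_column) %}
WITH observed AS (
    SELECT
        {{ partition_column }} AS partition_key,
        count(*) AS observed_count
    FROM {{ model }}
    GROUP BY 1
),
expected AS (
    SELECT partition_key, row_count AS expected_count
    FROM {{ ref("row_counts") }}
    WHERE table_name = '{{ table_name }}'
)
SELECT *
FROM observed
FULL OUTER JOIN expected USING (partition_key)
WHERE observed_count IS DISTINCT FROM expected_count
{% endtest %}
```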
@@ -0,0 +1,28 @@
version: 2 |
@zaneselvans do you know why out_eia923__boiler_fuel isn't using the monthly naming convention?
There are 3 out_eia923__*_boiler_fuel tables -- this one hasn't been aggregated by month, so there are probably some cases in which there are multiple records for a single month that get combined in the __monthly version. But I'd need to compare the monthly and unaggregated tables to be sure. I know there's at least one case in the generation tables where the monthly and unaggregated tables are literally identical.
There's more SQL than I was hoping we'd need to have. It makes sense for the data tests / macros that we're defining for ourselves, but it seems like there's a fair amount that's also required to do the calculations / create ephemeral tables for the tests to run on, which seems like it could be a source of friction given our generally moderate level of SQL knowledge.
If a table is not partitioned by year, you can add the option
`--partition-column {column_name}` to the command. This will find row counts per
unique value in the column. This is common for monthly and hourly tables that are
often partitioned by `report_date` and `datetime_utc` respectively.
Even most annual tables (outside of FERC?) don't really have a dedicated column with just the year or an annual-frequency report_date, and the annual row counts that we've been using have been more linked to the frequency with which the data is released / the chunks it's bundled into, rather than the frequency of the data reported in the tables. Having finer partitions will give us more information about where things are changing if they're changing, but with monthly it'll be hundreds of partitions and hourly it'll be tens of thousands. Is that really maintainable? Would it be easy to allow a groupby that can count rows in a time period even if it's not an explicitly unique column value (years or months within a date column)?
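For illustration, grouping on a truncated date expression rather than on raw unique values would look something like this in DuckDB SQL (just a sketch of the idea; the table name is one from this PR):

```sql
-- Count rows per calendar year derived from an hourly timestamp column,
-- so an hourly table yields one partition per year instead of tens of
-- thousands of distinct datetime_utc values.
SELECT
    date_trunc('year', datetime_utc) AS report_year,
    count(*) AS row_count
FROM out_vcerare__hourly_available_capacity_factor
GROUP BY 1
ORDER BY 1;
```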
The second method is to create a [model](https://docs.getdbt.com/docs/build/models)
which will produce the intermediate table you want to execute tests on. To use this
approach, add a SQL file named `validate_{table_name}` to
`dbt/models/{data_source}/{table_name}/` and define your model for producing the
intermediate table there. Finally, add the model to the `schema.yml` file
and define tests exactly as you would for a `source` table. See
`models/ferc1/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402` for an example of this
pattern.
When new models are defined, does dbt persist them somewhere, e.g. using duckdb?
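For concreteness, an intermediate model of the kind the README excerpt above describes is just a SELECT in a SQL file. A minimal sketch, where the aggregation, column names, and source name are illustrative rather than the PR's actual model:

```sql
-- dbt/models/ferc1/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402/
--   validate_out_ferc1__yearly_steam_plants_fuel_by_plant_sched402.sql
-- Derive per-year, per-fuel aggregates from the source table so that tests
-- (e.g. quantile bounds) can run against the derived values.
SELECT
    report_year,
    fuel_type_code_pudl,
    SUM(fuel_cost) / NULLIF(SUM(fuel_mmbtu), 0) AS fuel_cost_per_mmbtu
FROM {{ source('pudl', 'out_ferc1__yearly_steam_plants_fuel_by_plant_sched402') }}
GROUP BY 1, 2
```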
This command will first run any models, then execute all tests.

For more fine-grained control, first run:
"run" here meaning not "run the tests" but "run dbt in the traditional sense which we're not really using it -- to build database tables"? To the folks unfamiliar with dbt (most of us right now) I think "run the models" will be confusing.
# Load table of interest
if not use_local_tables:
    df = pd.read_parquet(_get_nightly_url(table_name))
Rather than downloading the entire parquet file to get the row counts per partition (which will mean several GB every time this is run on all tables), can we potentially send a query that only has to access the Parquet metadata? That would require almost no data transfer, but might mean we need to change how the parquet files are indexed, and possibly use s3:// rather than https:// URLs to access the nightly outputs.
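One possibility is to let DuckDB do the remote reads (a sketch, assuming the httpfs extension and an illustrative URL; not something this PR implements): a bare count(*) can be answered from the Parquet footer metadata, and a GROUP BY on the partition column only needs that single column fetched via range requests rather than the whole file.

```sql
INSTALL httpfs;
LOAD httpfs;

-- Total row count: answered from Parquet metadata, with almost no transfer.
SELECT count(*) AS total_rows
FROM read_parquet('https://.../out_eia923__boiler_fuel.parquet');

-- Rows per report_date: only the report_date column needs to be read.
SELECT report_date, count(*) AS row_count
FROM read_parquet('https://.../out_eia923__boiler_fuel.parquet')
GROUP BY 1
ORDER BY 1;
```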
I mentioned this in our call, but I'm nervous that calculating row counts based on the contents of the tables may sometimes result in different values than we get by restricting the input data to certain years, which we need to work correctly -- since one of the main reasons we need per-partition counts is so that the integration tests / fast ETL can succeed even when only a subset of the years of data are available in the DB.
We should check that the partitions and the restricted datasets always give the same number of rows, at least for the case of the fast ETL, before we merge this in.
out_eia923__boiler_fuel,2008-01-01,6389
out_eia923__boiler_fuel,2008-02-01,6358
out_eia923__boiler_fuel,2008-03-01,6435
out_eia923__boiler_fuel,2008-04-01,6443
out_eia923__boiler_fuel,2008-05-01,6455
out_eia923__boiler_fuel,2008-06-01,6457
Monthly partitions will provide a lot of granular information about where changes have happened, but using only unique values of a single column seems like it may not be very adaptable, e.g. in hourly tables, or tables where we need to select on more than one column to get meaningful chunks, and the number of groups we're tracking in this system will be totally unmanageable by hand -- it'll have to be entirely automatable (with manual review / approval).
If we want to get much more granular data regression / diff testing than what we've been doing with the row counts, then we may want to revisit the more generalized database diff work that @rousik was doing before he vanished, especially now that he's back.
- importlib-metadata=8.6.1=pyha770c72_0
- importlib-metadata=6.10.0=pyha770c72_0
Whew that's a big backrev.
columns:
  - name: gas_cost_per_mmbtu
    data_tests:
      - dbt_expectations.expect_column_quantile_values_to_be_between:
My intuition is that these should probably really be weighted quantiles, or that they are relatively low-value checks because the FERC fuel reporting is such a mess.
Overview
This PR sets up a dbt project within the PUDL repo that will be used for data testing. Details on setup and usage can all be found in the Readme.md. This PR also includes several data validations that have been converted to dbt tests. The tests currently converted are all vcerare asset_checks and the FERC fuel by plant cost per mmbtu range checks.
In scope
- validate.py for any other obvious candidates
- readme with directions for migration (how to add new asset, types of test in scope for immediate migration)
Out of scope
- vs_historical
closes #3997 #3971