Dbt setup #4011

Draft: wants to merge 33 commits into main from dbt_setup

Commits (33)
bf40ffb  Add basic dbt setup (zschira, Jan 9, 2025)
9aac625  Update to dagster 1.9.7 & grpcio 1.67.1 (zaneselvans, Jan 10, 2025)
415a113  Setup multiple dbt profiles (zschira, Jan 10, 2025)
ba32bd8  Merge remote-tracking branch 'refs/remotes/origin/dbt_setup' into dbt… (zaneselvans, Jan 10, 2025)
dc51c8f  Add all vcerare dbt tests (zschira, Jan 10, 2025)
590b02a  Add more example dbt tests (zschira, Jan 13, 2025)
63e663a  Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into… (zaneselvans, Jan 13, 2025)
d428b5d  Merge changes from main and revert to python 3.12 (zaneselvans, Jan 13, 2025)
784cf96  Bump gdal to v3.10.1 bugfix release. (zaneselvans, Jan 14, 2025)
48a16e1  Merge branch 'main' into dbt_setup (zaneselvans, Jan 14, 2025)
6f45ba5  Merge branch 'main' into dbt_setup (zaneselvans, Jan 15, 2025)
ac41a41  Update to dagster 1.9.9 (zaneselvans, Jan 19, 2025)
0ce1648  Merge branch 'main' into dbt_setup (zaneselvans, Jan 20, 2025)
c19cfd8  Merge branch 'main' into dbt_setup (zaneselvans, Jan 20, 2025)
6335e94  Reorganize dbt into multiple schema.yml files (zschira, Jan 21, 2025)
2585eca  Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into… (zschira, Jan 21, 2025)
e24af8c  Move dbt project to top level of repo (zschira, Jan 22, 2025)
1ed85b3  Only set parquet path in dbt project once (zschira, Jan 30, 2025)
e92f5be  Standardize dbt maning scheme (zschira, Jan 30, 2025)
5de9ebe  Add more detail to README (zschira, Jan 30, 2025)
da9ae93  Add script to generate dbt scaffolding and row count tests (zschira, Feb 5, 2025)
7461786  Add documentation for dbt helper script (zschira, Feb 5, 2025)
0d120c6  Add out_ferc1__yearly_steam_plants_fuel_by_plant_sched402 to yearly r… (zschira, Feb 5, 2025)
3666360  Add weighted quantile test (broken) (zschira, Feb 5, 2025)
a3579dc  Change row count test name (zschira, Feb 5, 2025)
c98219c  Update dbt initialization process (zschira, Feb 5, 2025)
f9b3fa7  Make dbt helper script work properly with non-yearly partitioned tables (zschira, Feb 5, 2025)
79e2153  Update dbt readme (zschira, Feb 5, 2025)
012ba4a  Regenerate ferc dbt schemas (zschira, Feb 6, 2025)
ff766b3  Merge branch 'main' into dbt_setup (zaneselvans, Feb 10, 2025)
94267a5  Improve dbt_helper command line usability (zschira, Feb 10, 2025)
8f660fd  Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into… (zschira, Feb 10, 2025)
70e6895  Flesh out test migration command (zschira, Feb 13, 2025)
dbt/README.md (121 additions, 0 deletions)
@@ -0,0 +1,121 @@
## Overview
This directory contains an initial setup of a `dbt` project intended for writing
[data tests](https://docs.getdbt.com/docs/build/data-tests) against PUDL data. The
project is set up with profiles that let you run tests on `nightly`
builds, `etl-full` outputs, or `etl-fast` outputs. The `nightly` profile operates
directly on parquet files in our S3 bucket, while the `etl-full` and `etl-fast`
profiles look for parquet files based on your `PUDL_OUTPUT` environment
variable. See the `Usage` section below for examples using these profiles.


## Development
To set up the `dbt` project, install the PUDL `conda` environment as usual,
then run the following commands from this directory.

```
dbt deps   # install dbt package dependencies (dbt-expectations, dbt-utils)
dbt seed   # load seed tables, e.g. the expected row counts in seeds/row_counts.csv
```

### Adding new tables
#### Helper script
To add a new table to the project, you must add it as a
[dbt source](https://docs.getdbt.com/docs/build/sources). We've included a
helper script, `devtools/dbt_helper.py`, to automate the process.

#### Usage
Basic usage of the helper script looks like:

```
python devtools/dbt_helper.py add-tables --tables {table_name(s)}
```

This will add a file called `dbt/models/{data_source}/{table_name}/schema.yml` which
tells `dbt` about the table and its schema. It will also apply the test
`check_row_counts_per_partition`, which by default checks row counts per year.
To support this, it adds expected row counts to the file `seeds/row_counts.csv`;
these get compared to the observed row counts in the table when the tests run.

If a table is not partitioned by year, you can add the option
`--partition-column {column_name}` to the command. This will count rows per
unique value in that column, which is common for monthly and hourly tables that
are often partitioned by `report_date` and `datetime_utc` respectively, as in
the sketch below.
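
For example, a hypothetical invocation for one of the hourly tables in this PR
(a sketch based on the options described above, not a verified command line):

```
# Count rows per unique datetime_utc value instead of per year
python devtools/dbt_helper.py add-tables \
    --tables out_vcerare__hourly_available_capacity_factor \
    --partition-column datetime_utc
```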
> **Review comment (Member) on lines +39 to +42:** Even most annual tables (outside of FERC?) don't really have a dedicated column with just the year or annual frequency report_date, and the annual row-counts that we've been using have been more linked to the frequency with which the data is released / the chunks it's bundled into, rather than the frequency of the data reported in the tables. Having finer partitions will give us more information about where things are changing if they're changing, but with monthly it'll be hundreds of partitions and hourly it'll be tens of thousands. Is that really maintainable? Would it be easy to allow a groupby that can count rows in a time period even if it's not an explicitly unique column value (years or months within a date column?)


To see all options for this command, run:

```
python devtools/dbt_helper.py add-tables --help
```

### Adding tests
#### Default case
Once a table is included as a `source`, you can add tests for it. You can
either add a generic test directly in `dbt/models/{data_source}/{table_name}/schema.yml`,
or create a SQL file in the `dbt/tests/` directory which references the `source`.
When adding SQL tests like this, construct a query that `SELECT`s rows
indicating a failure: if the query returns any rows, `dbt` will fail
that test.
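
For example, a minimal SQL test might look like the sketch below. The file name
and threshold are hypothetical; the source name, table, and columns come from
the schemas added in this PR:

```
-- dbt/tests/eia923/no_negative_fuel_consumed.sql (hypothetical)
-- The test passes only if this query returns zero rows.
SELECT plant_id_eia, report_date, fuel_consumed_mmbtu
FROM {{ source('pudl', 'out_eia923__boiler_fuel') }}
WHERE fuel_consumed_mmbtu < 0
```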

The project includes [dbt-expectations](https://github.com/calogica/dbt-expectations)
and [dbt-utils](https://github.com/dbt-labs/dbt-utils) as dependencies. These
packages provide useful out-of-the-box tests that can be applied to any table
in the project. There are several examples in
`dbt/models/vcerare/out_vcerare__hourly_available_capacity_factor/schema.yml` which
use `dbt-expectations`.
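
As a sketch, applying a `dbt-expectations` test to a source column in a
`schema.yml` looks roughly like this (the bounds are illustrative, not vetted
validation values):

```
version: 2
sources:
  - name: pudl
    tables:
      - name: out_eia923__boiler_fuel
        columns:
          - name: sulfur_content_pct
            data_tests:
              - dbt_expectations.expect_column_values_to_be_between:
                  min_value: 0
                  max_value: 100
```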

#### Modifying a table before testing
In some cases you may want to modify a table before applying tests. There are two
ways to accomplish this. First, you can add the table as a `source` as described
above, then create a SQL file in the `tests/` directory like
`tests/{data_source}/{table_name}.sql`. There you can construct a SQL query that
modifies the table and tests the intermediate result. `dbt`
expects a SQL test to be a query that returns 0 rows for a successful test. See
the `dbt` [source function](https://docs.getdbt.com/reference/dbt-jinja-functions/source)
for guidance on how to reference a `source` from a SQL file.
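
A minimal sketch of this pattern (the aggregation and threshold are hypothetical):

```
-- Aggregate the source table first, then test the intermediate result.
-- Any returned rows count as failures.
WITH monthly_totals AS (
    SELECT report_date, SUM(fuel_consumed_mmbtu) AS total_mmbtu
    FROM {{ source('pudl', 'out_eia923__boiler_fuel') }}
    GROUP BY report_date
)
SELECT report_date, total_mmbtu
FROM monthly_totals
WHERE total_mmbtu < 0
```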

The second method is to create a [model](https://docs.getdbt.com/docs/build/models)
which produces the intermediate table you want to test. To use this
approach, add a SQL file named `validate_{table_name}.sql` to the directory
`dbt/models/{data_source}/{table_name}/` and define the model that produces
the intermediate table there. Finally, add the model to the `schema.yml` file
and define tests exactly as you would for a `source` table. See
`models/ferc1/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402` for an example of this
pattern.
> **Review comment (Member) on lines +76 to +83:** When new models are defined, does dbt persist them somewhere e.g. using duckdb?


### Usage
There are a few ways to execute tests. To run all tests with a single command:

```
dbt build
```

This command will first run any models, then execute all tests.

For more fine-grained control, first run:
> **Review comment (Member):** "run" here meaning not "run the tests" but "run dbt in the traditional sense in which we're not really using it -- to build database tables"? To the folks unfamiliar with dbt (most of us right now) I think "run the models" will be confusing.


```
dbt run
```

This will run all models, preparing any SQL views that will be referenced in
tests. Once you've done this, you can run all tests with:

```
dbt test
```

To run all tests for a single source table:

```
dbt test --select source:pudl.{table_name}
```

To run all tests for a model table:

```
dbt test --select {model_name}
```

#### Selecting target profile
To select between `nightly`, `etl-full`, and `etl-fast` profiles, append
`--target {target_name}` to any of the previous commands.
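
For example, to run a single source table's tests against the fast ETL outputs
(this combines the flags shown above; it assumes `PUDL_OUTPUT` is set):

```
dbt test --select source:pudl.{table_name} --target etl-fast
```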
dbt/dbt_project.yml (23 additions, 0 deletions)
@@ -0,0 +1,23 @@
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "pudl_dbt"
version: "1.0.0"

# This setting configures which "profile" dbt uses for this project.
profile: "pudl_dbt"

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
macro-paths: ["macros"]
seed-paths: ["seeds"]
test-paths: ["tests"]

sources:
  pudl_dbt:
    +external_location: |
      {%- if target.name == "nightly" -%} 'https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet'
      {%- else -%} '{{ env_var('PUDL_OUTPUT') }}/parquet/{name}.parquet'
      {%- endif -%}
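
As an illustration (not from the PR itself), the `{name}` placeholder resolves to
the source table's name, so a table like `out_eia923__boiler_fuel` would be read
from:

```
# target = nightly:
https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_eia923__boiler_fuel.parquet
# any other target:
$PUDL_OUTPUT/parquet/out_eia923__boiler_fuel.parquet
```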
dbt/macros/check_row_counts_per_partition.sql (18 additions, 0 deletions)
@@ -0,0 +1,18 @@
{% test check_row_counts_per_partition(model, table_name, partition_column) %}
> **Review comment (Member Author):** Compares row counts in the table to expected counts which are stored in seeds/row_counts.csv


WITH
expected AS (
SELECT table_name, partition, row_count as expected_count
FROM {{ ref("row_counts") }} WHERE table_name = '{{ table_name }}'
),
observed AS (
SELECT {{ partition_column }} as partition, COUNT(*) as observed_count
FROM {{ model }}
GROUP BY {{ partition_column }}
)
SELECT expected.partition, expected.expected_count, observed.observed_count
FROM expected
INNER JOIN observed ON expected.partition=observed.partition
WHERE expected.expected_count != observed.observed_count

{% endtest %}
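
The macro assumes the `row_counts` seed has columns `table_name`, `partition`, and
`row_count`. A hypothetical `seeds/row_counts.csv` (the counts are made up for
illustration) would look like:

```
table_name,partition,row_count
out_ferc1__yearly_steam_plants_sched402,2020,1042
out_ferc1__yearly_steam_plants_sched402,2021,1057
```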
dbt/macros/expect_column_weighted_quantile_values_to_be_between.sql (21 additions, 0 deletions)
@@ -0,0 +1,21 @@
{% test expect_column_weighted_quantile_values_to_be_between(model, column_name,
quantile,
weight_column,
min_value=None,
max_value=None,
group_by=None,
row_condition=None,
strictly=False
) %}
{% set expression %}
{{ weighted_quantile(column_name, weight_column, quantile) }}
{% endset %}
{{ dbt_expectations.expression_between(model,
expression=expression,
min_value=min_value,
max_value=max_value,
group_by_columns=group_by,
row_condition=row_condition,
strictly=strictly
) }}
{% endtest %}
dbt/macros/weighted_quantile.sql (26 additions, 0 deletions)
@@ -0,0 +1,26 @@
{% macro weighted_quantile(model, column_name, weight_col, quantile) %}
> **Review comment (Member Author):** @marianneke this is the macro that's currently broken. It gets called in expect_column_weighted_quantile_values_to_be_between.sql. The idea is that this will compute a weighted quantile on a table, then that test will check the quantile vs bounds. I think the issue is with the syntax using WITH here. The macro dbt/macros/check_row_counts_per_partition.sql has an example where I fixed this to use the proper duckdb syntax.
>
> **Reply (Member):** Can you clarify what the syntax issue was and how you solved it in the row count macro?
>
> **Reply (Member Author):** Hmm, I'm actually not sure that it was the issue after all. I was getting a similar error message with the row count macro and I did have a syntax error there, but this looks ok to me. It might also be something to do with the expect_column_weighted_quantile_values_to_be_between macro which is calling this under the hood.


WITH CumulativeWeights AS (
SELECT
{{ column_name }},
{{ weight_col }},
SUM({{ weight_col }}) OVER (ORDER BY {{ column_name }}) AS cumulative_weight,
SUM({{ weight_col }}) OVER () AS total_weight
FROM bf
),
QuantileData AS (
SELECT
{{ column_name }},
{{ weight_col }},
cumulative_weight,
total_weight,
cumulative_weight / total_weight AS cumulative_probability
FROM CumulativeWeights
)
SELECT {{ column_name }}
FROM QuantileData
WHERE cumulative_probability >= {{ quantile }} AND {{ column_name }} < {{ lower_bound }}
ORDER BY {{ column_name }}
LIMIT 1

{% endmacro %}
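
For reference, a possible repair is sketched below. This is an assumption, not
the PR's eventual fix: it selects from the passed-in `model` instead of the
stray `bf` alias, drops the undefined `lower_bound` reference, and wraps the
query in a parenthesized scalar subquery so it can be embedded in a larger
expression (DuckDB allows a WITH clause inside a subquery):

```
{% macro weighted_quantile(model, column_name, weight_col, quantile) %}
(
    -- Cumulative share of total weight, ordered by the value column.
    WITH cumulative_weights AS (
        SELECT
            {{ column_name }},
            SUM({{ weight_col }}) OVER (ORDER BY {{ column_name }})
                / SUM({{ weight_col }}) OVER () AS cumulative_probability
        FROM {{ model }}
    )
    -- First value whose cumulative weight fraction reaches the target quantile.
    SELECT {{ column_name }}
    FROM cumulative_weights
    WHERE cumulative_probability >= {{ quantile }}
    ORDER BY {{ column_name }}
    LIMIT 1
)
{% endmacro %}
```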
dbt/models/eia923/out_eia923__boiler_fuel/schema.yml (28 additions, 0 deletions)
@@ -0,0 +1,28 @@
version: 2
> **Review comment (Member Author):** @zaneselvans do you know why out_eia923__boiler_fuel isn't using the monthly naming convention?
>
> **Reply (Member):** There are 3 out_eia923__*_boiler_fuel tables -- this one hasn't been aggregated by month, so there are probably some cases in which there are multiple records for a single month that get combined in the __monthly version. But I'd need to compare the monthly and unaggregated tables to be sure. I know there's at least one case in the generation tables where the monthly and unaggregated tables are literally identical.

sources:
  - name: pudl
    tables:
      - name: out_eia923__boiler_fuel
        data_tests:
          - check_row_counts_per_partition:
              table_name: out_eia923__boiler_fuel
              partition_column: report_date
        columns:
          - name: report_date
          - name: plant_id_eia
          - name: plant_id_pudl
          - name: plant_name_eia
          - name: utility_id_eia
          - name: utility_id_pudl
          - name: utility_name_eia
          - name: boiler_id
          - name: unit_id_pudl
          - name: energy_source_code
          - name: prime_mover_code
          - name: fuel_type_code_pudl
          - name: fuel_consumed_units
          - name: fuel_mmbtu_per_unit
          - name: fuel_consumed_mmbtu
          - name: sulfur_content_pct
          - name: ash_content_pct
          - name: data_maturity
dbt/models/ferc1/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402/schema.yml (67 additions, 0 deletions)
@@ -0,0 +1,67 @@
version: 2
sources:
  - name: pudl
    tables:
      - name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402
        data_tests:
          - check_row_counts_per_partition:
              table_name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402
              partition_column: report_year
        columns:
          - name: report_year
          - name: utility_id_ferc1
          - name: utility_id_pudl
          - name: utility_name_ferc1
          - name: plant_id_pudl
          - name: plant_name_ferc1
          - name: coal_fraction_cost
          - name: coal_fraction_mmbtu
          - name: fuel_cost
          - name: fuel_mmbtu
          - name: gas_fraction_cost
          - name: gas_fraction_mmbtu
          - name: nuclear_fraction_cost
          - name: nuclear_fraction_mmbtu
          - name: oil_fraction_cost
          - name: oil_fraction_mmbtu
          - name: primary_fuel_by_cost
          - name: primary_fuel_by_mmbtu
          - name: waste_fraction_cost
          - name: waste_fraction_mmbtu

models:
  - name: validate_ferc1__yearly_steam_plants_fuel_by_plant_sched402
    columns:
      - name: gas_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
> **Review comment (Member):** My intuition is that these should probably really be weighted quantiles, or that they are relatively low-value checks because the FERC fuel reporting is such a mess.

              quantile: 0.05
              min_value: 1.5
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 15.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 2.0
              max_value: 10.0
      - name: oil_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.10
              min_value: 3.5
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 25.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 6.5
              max_value: 17.0
      - name: coal_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.10
              min_value: 0.75
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 4.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 1.0
              max_value: 2.5
dbt/models/ferc1/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402/validate_ferc1__yearly_steam_plants_fuel_by_plant_sched402.sql (6 additions, 0 deletions)
@@ -0,0 +1,6 @@

select
{% for fuel_type in ["gas", "oil", "coal"] %}
{{ fuel_type }}_fraction_cost * fuel_cost / ({{ fuel_type }}_fraction_mmbtu * fuel_mmbtu) as {{ fuel_type }}_cost_per_mmbtu,
{% endfor %}
from {{ source('pudl', 'out_ferc1__yearly_steam_plants_fuel_by_plant_sched402') }}
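
For clarity (this expansion is deterministic, not additional code in the PR), the
Jinja loop above renders one derived column per fuel type:

```
select
    gas_fraction_cost * fuel_cost / (gas_fraction_mmbtu * fuel_mmbtu) as gas_cost_per_mmbtu,
    oil_fraction_cost * fuel_cost / (oil_fraction_mmbtu * fuel_mmbtu) as oil_cost_per_mmbtu,
    coal_fraction_cost * fuel_cost / (coal_fraction_mmbtu * fuel_mmbtu) as coal_cost_per_mmbtu,
from {{ source('pudl', 'out_ferc1__yearly_steam_plants_fuel_by_plant_sched402') }}
```

DuckDB tolerates the trailing comma left by the loop's final iteration.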
dbt/models/ferc1/out_ferc1__yearly_steam_plants_sched402/schema.yml (76 additions, 0 deletions)
@@ -0,0 +1,76 @@
version: 2
sources:
  - name: pudl
    tables:
      - name: out_ferc1__yearly_steam_plants_sched402
        data_tests:
          - check_row_counts_per_partition:
              table_name: out_ferc1__yearly_steam_plants_sched402
              partition_column: report_year
        columns:
          - name: report_year
          - name: utility_id_ferc1
          - name: utility_id_pudl
          - name: utility_name_ferc1
          - name: plant_id_pudl
          - name: plant_id_ferc1
          - name: plant_name_ferc1
          - name: asset_retirement_cost
          - name: avg_num_employees
          - name: capacity_factor
          - name: capacity_mw
          - name: capex_annual_addition
          - name: capex_annual_addition_rolling
          - name: capex_annual_per_kw
          - name: capex_annual_per_mw
          - name: capex_annual_per_mw_rolling
          - name: capex_annual_per_mwh
          - name: capex_annual_per_mwh_rolling
          - name: capex_equipment
          - name: capex_land
          - name: capex_per_mw
          - name: capex_structures
          - name: capex_total
          - name: capex_wo_retirement_total
          - name: construction_type
          - name: construction_year
          - name: installation_year
          - name: net_generation_mwh
          - name: not_water_limited_capacity_mw
          - name: opex_allowances
          - name: opex_boiler
          - name: opex_coolants
          - name: opex_electric
          - name: opex_engineering
          - name: opex_fuel
          - name: opex_fuel_per_mwh
          - name: opex_misc_power
          - name: opex_misc_steam
          - name: opex_nonfuel_per_mwh
          - name: opex_operations
          - name: opex_per_mwh
          - name: opex_plants
          - name: opex_production_total
          - name: opex_rents
          - name: opex_steam
          - name: opex_steam_other
          - name: opex_structures
          - name: opex_total_nonfuel
          - name: opex_transfer
          - name: peak_demand_mw
          - name: plant_capability_mw
          - name: plant_hours_connected_while_generating
          - name: plant_type
          - name: record_id
          - name: water_limited_capacity_mw

models:
  - name: validate_ferc1__yearly_steam_plants_sched402
    columns:
      - name: gas_cost_per_mmbtu
        data_tests:
          - expect_column_weighted_quantile_values_to_be_between:
              quantile: 0.5
              min_value: 200000
              max_value: 600000
              weight_column: capacity_mw