Dbt setup #4011

Draft: wants to merge 33 commits into base: main. Changes shown are from 20 of the 33 commits.

Commits
- bf40ffb Add basic dbt setup (zschira, Jan 9, 2025)
- 9aac625 Update to dagster 1.9.7 & grpcio 1.67.1 (zaneselvans, Jan 10, 2025)
- 415a113 Setup multiple dbt profiles (zschira, Jan 10, 2025)
- ba32bd8 Merge remote-tracking branch 'refs/remotes/origin/dbt_setup' into dbt… (zaneselvans, Jan 10, 2025)
- dc51c8f Add all vcerare dbt tests (zschira, Jan 10, 2025)
- 590b02a Add more example dbt tests (zschira, Jan 13, 2025)
- 63e663a Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into… (zaneselvans, Jan 13, 2025)
- d428b5d Merge changes from main and revert to python 3.12 (zaneselvans, Jan 13, 2025)
- 784cf96 Bump gdal to v3.10.1 bugfix release. (zaneselvans, Jan 14, 2025)
- 48a16e1 Merge branch 'main' into dbt_setup (zaneselvans, Jan 14, 2025)
- 6f45ba5 Merge branch 'main' into dbt_setup (zaneselvans, Jan 15, 2025)
- ac41a41 Update to dagster 1.9.9 (zaneselvans, Jan 19, 2025)
- 0ce1648 Merge branch 'main' into dbt_setup (zaneselvans, Jan 20, 2025)
- c19cfd8 Merge branch 'main' into dbt_setup (zaneselvans, Jan 20, 2025)
- 6335e94 Reorganize dbt into multiple schema.yml files (zschira, Jan 21, 2025)
- 2585eca Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into… (zschira, Jan 21, 2025)
- e24af8c Move dbt project to top level of repo (zschira, Jan 22, 2025)
- 1ed85b3 Only set parquet path in dbt project once (zschira, Jan 30, 2025)
- e92f5be Standardize dbt maning scheme (zschira, Jan 30, 2025)
- 5de9ebe Add more detail to README (zschira, Jan 30, 2025)
- da9ae93 Add script to generate dbt scaffolding and row count tests (zschira, Feb 5, 2025)
- 7461786 Add documentation for dbt helper script (zschira, Feb 5, 2025)
- 0d120c6 Add out_ferc1__yearly_steam_plants_fuel_by_plant_sched402 to yearly r… (zschira, Feb 5, 2025)
- 3666360 Add weighted quantile test (broken) (zschira, Feb 5, 2025)
- a3579dc Change row count test name (zschira, Feb 5, 2025)
- c98219c Update dbt initialization process (zschira, Feb 5, 2025)
- f9b3fa7 Make dbt helper script work properly with non-yearly partitioned tables (zschira, Feb 5, 2025)
- 79e2153 Update dbt readme (zschira, Feb 5, 2025)
- 012ba4a Regenerate ferc dbt schemas (zschira, Feb 6, 2025)
- ff766b3 Merge branch 'main' into dbt_setup (zaneselvans, Feb 10, 2025)
- 94267a5 Improve dbt_helper command line usability (zschira, Feb 10, 2025)
- 8f660fd Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into… (zschira, Feb 10, 2025)
- 70e6895 Flesh out test migration command (zschira, Feb 13, 2025)
6 changes: 6 additions & 0 deletions .gitignore
@@ -49,3 +49,9 @@ devtools/datasette/fly/Dockerfile
devtools/datasette/fly/inspect-data.json
devtools/datasette/fly/metadata.yml
devtools/datasette/fly/all_dbs.tar.zst

# dbt specific ignores
dbt/dbt_packages/
dbt/target/
dbt/logs/
dbt/.user.yml
97 changes: 97 additions & 0 deletions dbt/README.md
@@ -0,0 +1,97 @@
## Overview
This directory contains an initial setup of a `dbt` project meant to write
[data tests](https://docs.getdbt.com/docs/build/data-tests) for PUDL data. The
project is set up with profiles that let you run tests against `nightly`
builds, `etl-full` outputs, or `etl-fast` outputs. The `nightly` profile operates
directly on parquet files in our S3 bucket, while both the `etl-full` and `etl-fast`
profiles look for parquet files based on your `PUDL_OUTPUT` environment
variable. See the `Usage` section below for examples using these profiles.


## Development
To set up the `dbt` project, simply install the PUDL `conda` environment as usual,
then run the following command from this directory.

```
dbt deps
```

### Adding new tables
To add a new table to the project, you must add it as a
[dbt source](https://docs.getdbt.com/docs/build/sources). The standard way to do
this is to create a new file `models/{data_source}/{table_name}.yml`. If the
`data_source` doesn't already have a directory within `models/`, you should first
create one and add the yaml file there.

> **Review comment (Member):** @zschira I am confused by the difference between this instruction that tells users to add the table schema as an individual yml file with the table name and the instructions below for adding tests, where you mention a schema.yml file that would presumably contain all models.
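
As a minimal sketch, such a source definition file might look like the following. The table name is borrowed from one this PR actually adds; the real schema files in the PR also attach tests and column definitions here.

```
version: 2

sources:
  - name: pudl
    tables:
      - name: out_vcerare__hourly_available_capacity_factor
```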

### Adding tests
#### Default case
Once a table is included as a `source`, you can add tests for the table. You can
either add a generic test directly in `src/pudl/dbt/models/schema.yml`, or create
a `sql` file in the directory `src/pudl/dbt/tests/`, which references the `source`.
When adding `sql` tests like this, you should construct a query that selects rows
indicating a failure. That is, if the query returns any rows, `dbt` will mark the
test as failed.
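
As an illustrative sketch (the file name is made up, but the table and column come from this PR), such a test file might look like:

```
-- tests/vcerare/no_negative_solar_capacity_factor.sql (hypothetical file name)
-- dbt treats every returned row as a failure, so select only the "bad" rows.
select county_id_fips, datetime_utc, capacity_factor_solar_pv
from {{ source('pudl', 'out_vcerare__hourly_available_capacity_factor') }}
where capacity_factor_solar_pv < 0
```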

The project includes [dbt-expectations](https://github.com/calogica/dbt-expectations)
and [dbt-utils](https://github.com/dbt-labs/dbt-utils) as dependencies. These
packages include useful tests out of the box that can be applied to any tables
in the project. There are several examples in `src/pudl/dbt/models/schema.yml` which
use `dbt-expectations`.

#### Modifying a table before testing
In some cases you may want to modify the table before applying tests. There are two
ways to accomplish this. First, you can add the table as a `source` as described
above, then create a SQL file in the `tests/` directory, like
`tests/{data_source}/{table_name}.sql`. There you can construct a SQL query that
modifies the table and executes a test on the intermediate result. `dbt`
expects a SQL test to be a query that returns 0 rows for a successful test. See
the `dbt` [source function](https://docs.getdbt.com/reference/dbt-jinja-functions/source)
for guidance on how to reference a `source` from a SQL file.

The second method is to create a [model](https://docs.getdbt.com/docs/build/models)
which produces the intermediate table you want to test. To use this
approach, first create a directory named `tests/{data_source}/{table_name}/` and move
your yaml file defining the `source` table to `tests/{data_source}/{table_name}/schema.yml`.
Next, add a SQL file to this directory named `validate_{table_name}.sql` and define
the model that produces the intermediate table there. Finally, add the model to the
`schema.yml` file and define tests exactly as you would for a `source` table. See
`models/ferc1/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402` for an example of this
pattern.

### Usage
There are a few ways to execute tests. To run all tests with a single command:

```
dbt build
```

This command will first run any models, then execute all tests.

For more fine-grained control, first run:

> **Review comment (Member):** "run" here meaning not "run the tests" but "run dbt in the traditional sense which we're not really using it -- to build database tables"? To the folks unfamiliar with dbt (most of us right now) I think "run the models" will be confusing.

```
dbt run
```

This will run all models, preparing any `sql` views that will be referenced in
tests. Once you've done this, you can run all tests with:

```
dbt test
```

To run all tests for a single source table:

```
dbt test --select source:pudl.{table_name}
```

To run all tests for a model table:

```
dbt test --select {model_name}
```

#### Selecting target profile
To select between `nightly`, `etl-full`, and `etl-fast` profiles, append
`--target {target_name}` to any of the previous commands.
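
For example, to run only the tests on the FERC 1 fuel-cost model defined in this PR against your local fast-ETL outputs (assuming the `etl-fast` profile is configured as in `profiles.yml`):

```
dbt test --select validate_ferc1__yearly_steam_plants_fuel_by_plant_sched402 --target etl-fast
```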
22 changes: 22 additions & 0 deletions dbt/dbt_project.yml
@@ -0,0 +1,22 @@
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "pudl_dbt"
version: "1.0.0"

# This setting configures which "profile" dbt uses for this project.
profile: "pudl_dbt"

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
macro-paths: ["macros"]
test-paths: ["tests"]

sources:
  pudl_dbt:
    +external_location: |
      {%- if target.name == "nightly" -%} 'https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet'
      {%- else -%} '{{ env_var('PUDL_OUTPUT') }}/parquet/{name}.parquet'
      {%- endif -%}
@@ -0,0 +1,43 @@
version: 2

sources:
  - name: pudl
    tables:
      - name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402

models:
  - name: validate_ferc1__yearly_steam_plants_fuel_by_plant_sched402
    columns:
      - name: gas_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.05
              min_value: 1.5
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 15.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 2.0
              max_value: 10.0
      - name: oil_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.10
              min_value: 3.5
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 25.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 6.5
              max_value: 17.0
      - name: coal_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.10
              min_value: 0.75
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 4.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 1.0
              max_value: 2.5

> **Review comment (Member)** (on the quantile tests above): My intuition is that these should probably really be weighted quantiles, or that they are relatively low-value checks because the FERC fuel reporting is such a mess.
@@ -0,0 +1,6 @@

select
{% for fuel_type in ["gas", "oil", "coal"] %}
{{ fuel_type }}_fraction_cost * fuel_cost / ({{ fuel_type }}_fraction_mmbtu * fuel_mmbtu) as {{ fuel_type }}_cost_per_mmbtu,
{% endfor %}
from {{ source('pudl', 'out_ferc1__yearly_steam_plants_fuel_by_plant_sched402') }}
@@ -0,0 +1,46 @@
version: 2

sources:
  - name: pudl
    tables:
      - name: out_vcerare__hourly_available_capacity_factor
        data_tests:
          - dbt_expectations.expect_table_row_count_to_equal:
              value: |
                {%- if target.name == "etl-fast" -%} 27287400
                {%- else -%} 136437000
                {%- endif -%}
          - dbt_expectations.expect_compound_columns_to_be_unique:
              column_list: ["county_id_fips", "datetime_utc"]
              row_condition: "county_id_fips is not null"
        columns:
          - name: capacity_factor_solar_pv
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  max_value: 1.02
              - dbt_expectations.expect_column_min_to_be_between:
                  min_value: 0.00
          - name: capacity_factor_offshore_wind
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  max_value: 1.00
              - dbt_expectations.expect_column_min_to_be_between:
                  min_value: 0.00
          - name: hour_of_year
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  min_value: 8759
                  max_value: 8761
          - name: datetime_utc
            data_tests:
              - not_null
              - dbt_expectations.expect_column_values_to_not_be_in_set:
                  value_set: ["{{ dbt_date.date(2020, 12, 31) }}"]
          - name: county_or_lake_name
            data_tests:
              - not_null
              - dbt_expectations.expect_column_values_to_not_be_in_set:
                  value_set: ["bedford_city", "clifton_forge_city"]
8 changes: 8 additions & 0 deletions dbt/package-lock.yml
@@ -0,0 +1,8 @@
packages:
  - package: calogica/dbt_expectations
    version: 0.10.4
  - package: dbt-labs/dbt_utils
    version: 1.3.0
  - package: calogica/dbt_date
    version: 0.10.1
sha1_hash: 29571f46f50e6393ca399c3db7361c22657f2d6b
5 changes: 5 additions & 0 deletions dbt/packages.yml
@@ -0,0 +1,5 @@
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
  - package: dbt-labs/dbt_utils
    version: [">=1.3.0", "<1.4.0"]
17 changes: 17 additions & 0 deletions dbt/profiles.yml
@@ -0,0 +1,17 @@
pudl_dbt:
  outputs:
    # Define targets for nightly builds, and local ETL full/fast
    # See models/schema.yml for further configuration
    nightly:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
      filesystems:
        - fs: s3
    etl-full:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
    etl-fast:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
  target: nightly