Dbt setup #4011
base: main
@@ -0,0 +1,97 @@
## Overview

This directory contains an initial setup of a `dbt` project meant to write
[data tests](https://docs.getdbt.com/docs/build/data-tests) for PUDL data. The
project is set up with profiles that let you run tests against `nightly`
builds, `etl-full` outputs, or `etl-fast` outputs. The `nightly` profile
operates directly on parquet files in our S3 bucket, while the `etl-full` and
`etl-fast` profiles look for parquet files based on your `PUDL_OUTPUT`
environment variable. See the `Usage` section below for examples using these
profiles.
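
For example, to point the local profiles at your own ETL outputs (a sketch; the
output path is a placeholder for wherever your PUDL outputs actually live):

```
export PUDL_OUTPUT=~/pudl-work/output
dbt build --target etl-full
```
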
## Development

To set up the `dbt` project, install the PUDL `conda` environment as normal,
then run the following command from this directory:

```
dbt deps
```

### Adding new tables

To add a new table to the project, you must add it as a
[dbt source](https://docs.getdbt.com/docs/build/sources). The standard way to do
this is to create a new file `models/{data_source}/{table_name}.yml`. If the
`data_source` doesn't already have a directory within `models/`, first create
one and add the yaml file there.
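
For example, a minimal source definition follows the same shape as the schema
files in this PR (the table name here is only illustrative):

```
version: 2

sources:
  - name: pudl
    tables:
      - name: out_eia__yearly_generators
```
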
### Adding tests

#### Default case

Once a table is included as a `source`, you can add tests for it. You can
either add a generic test directly in `src/pudl/dbt/models/schema.yml`, or create
a `sql` file in the directory `src/pudl/dbt/tests/` which references the `source`.
When adding `sql` tests like this, construct a query that selects rows
indicating a failure: if the query returns any rows, `dbt` will report the test
as failed.
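
For instance, a test that fails whenever a solar capacity factor is negative
might look like this sketch (table and column taken from the VCE RARE schema
later in this PR):

```
-- Any rows returned by this query are reported as test failures.
select *
from {{ source('pudl', 'out_vcerare__hourly_available_capacity_factor') }}
where capacity_factor_solar_pv < 0
```
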
The project includes [dbt-expectations](https://github.com/calogica/dbt-expectations)
and [dbt-utils](https://github.com/dbt-labs/dbt-utils) as dependencies. These
packages provide useful out-of-the-box tests that can be applied to any tables
in the project. There are several examples in `src/pudl/dbt/models/schema.yml`
which use `dbt-expectations`.
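
A generic test from these packages attaches to a column's `data_tests` key in
the schema file, e.g. (bounds copied from the FERC 1 example later in this PR):

```
columns:
  - name: coal_cost_per_mmbtu
    data_tests:
      - dbt_expectations.expect_column_median_to_be_between:
          min_value: 1.0
          max_value: 2.5
```
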
#### Modifying a table before testing

In some cases you may want to modify the table before applying tests. There are
two ways to accomplish this. First, you can add the table as a `source` as
described above, then create a SQL file in the `tests/` directory like
`tests/{data_source}/{table_name}.sql`. There you can construct a SQL query
that modifies the table and executes a test on the intermediate table you've
created. `dbt` expects a SQL test to be a query that returns zero rows for a
successful test. See the `dbt`
[source function](https://docs.getdbt.com/reference/dbt-jinja-functions/source)
for guidance on how to reference a `source` from a SQL file.
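
A sketch of this pattern (the derived column and the threshold are illustrative,
reusing names from the FERC 1 model later in this PR):

```
-- Derive a per-mmbtu cost, then select only rows that violate the check.
with fuel_costs as (
    select
        gas_fraction_cost * fuel_cost / (gas_fraction_mmbtu * fuel_mmbtu)
            as gas_cost_per_mmbtu
    from {{ source('pudl', 'out_ferc1__yearly_steam_plants_fuel_by_plant_sched402') }}
)
select *
from fuel_costs
where gas_cost_per_mmbtu < 0
```
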
The second method is to create a [model](https://docs.getdbt.com/docs/build/models)
which produces the intermediate table you want to execute tests on. To use this
approach, first create a directory named `models/{data_source}/{table_name}/` and
move the yaml file defining the `source` table to
`models/{data_source}/{table_name}/schema.yml`. Next, add a SQL file to this
directory named `validate_{table_name}.sql` and define the model producing the
intermediate table there. Finally, add the model to the `schema.yml` file and
define tests exactly as you would for a `source` table. See
`models/ferc1/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402` for an
example of this pattern, sketched below.
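
Assuming the FERC 1 schema and model files later in this PR live in that
directory, the resulting layout looks roughly like:

```
models/ferc1/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402/
├── schema.yml
└── validate_ferc1__yearly_steam_plants_fuel_by_plant_sched402.sql
```
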
### Usage

There are a few ways to execute tests. To run all tests with a single command:

```
dbt build
```

This command will first run any models, then execute all tests.

For more fine-grained control, first run:

> **Review comment:** "run" here meaning not "run the tests" but "run dbt in the traditional sense which we're not really using it -- to build database tables"? To the folks unfamiliar with dbt (most of us right now) I think "run the models" will be confusing.

```
dbt run
```

This will run all models, thus preparing any `sql` views that will be referenced
in tests. Once you've done this, you can run all tests with:

```
dbt test
```

To run all tests for a single source table:

```
dbt test --select source:pudl.{table_name}
```

To run all tests for a model table:

```
dbt test --select {model_name}
```

#### Selecting target profile

To select between the `nightly`, `etl-full`, and `etl-fast` profiles, append
`--target {target_name}` to any of the previous commands.
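
For example, to run the VCE RARE table's tests against fast ETL outputs:

```
dbt test --select source:pudl.out_vcerare__hourly_available_capacity_factor --target etl-fast
```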
@@ -0,0 +1,22 @@
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models.
name: "pudl_dbt"
version: "1.0.0"

# This setting configures which "profile" dbt uses for this project.
profile: "pudl_dbt"

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
macro-paths: ["macros"]
test-paths: ["tests"]

sources:
  pudl_dbt:
    +external_location: |
      {%- if target.name == "nightly" -%} 'https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet'
      {%- else -%} '{{ env_var('PUDL_OUTPUT') }}/parquet/{name}.parquet'
      {%- endif -%}
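
Assuming dbt-duckdb's `external_location` templating, where `{name}` expands to
the source table name, the `nightly` target resolves a table like
`out_ferc1__yearly_steam_plants_fuel_by_plant_sched402` to:

```
https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402.parquet
```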
@@ -0,0 +1,43 @@
version: 2

sources:
  - name: pudl
    tables:
      - name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402

models:
  - name: validate_ferc1__yearly_steam_plants_fuel_by_plant_sched402
    columns:
      - name: gas_cost_per_mmbtu
        data_tests:
          # Review comment: My intuition is that these should probably really
          # be weighted quantiles, or that they are relatively low-value checks
          # because the FERC fuel reporting is such a mess.
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.05
              min_value: 1.5
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 15.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 2.0
              max_value: 10.0
      - name: oil_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.10
              min_value: 3.5
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 25.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 6.5
              max_value: 17.0
      - name: coal_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.10
              min_value: 0.75
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 4.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 1.0
              max_value: 2.5
@@ -0,0 +1,6 @@
select
    {% for fuel_type in ["gas", "oil", "coal"] %}
    {{ fuel_type }}_fraction_cost * fuel_cost / ({{ fuel_type }}_fraction_mmbtu * fuel_mmbtu) as {{ fuel_type }}_cost_per_mmbtu,
    {% endfor %}
from {{ source('pudl', 'out_ferc1__yearly_steam_plants_fuel_by_plant_sched402') }}
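
For reference, the Jinja loop above renders to roughly the following (DuckDB
tolerates the trailing comma the loop leaves before `from`):

```
select
    gas_fraction_cost * fuel_cost / (gas_fraction_mmbtu * fuel_mmbtu) as gas_cost_per_mmbtu,
    oil_fraction_cost * fuel_cost / (oil_fraction_mmbtu * fuel_mmbtu) as oil_cost_per_mmbtu,
    coal_fraction_cost * fuel_cost / (coal_fraction_mmbtu * fuel_mmbtu) as coal_cost_per_mmbtu,
from out_ferc1__yearly_steam_plants_fuel_by_plant_sched402
```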
@@ -0,0 +1,46 @@
version: 2

sources:
  - name: pudl
    tables:
      - name: out_vcerare__hourly_available_capacity_factor
        data_tests:
          - dbt_expectations.expect_table_row_count_to_equal:
              value: |
                {%- if target.name == "etl-fast" -%} 27287400
                {%- else -%} 136437000
                {%- endif -%}
          - dbt_expectations.expect_compound_columns_to_be_unique:
              column_list: ["county_id_fips", "datetime_utc"]
              row_condition: "county_id_fips is not null"
        columns:
          - name: capacity_factor_solar_pv
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  max_value: 1.02
              - dbt_expectations.expect_column_min_to_be_between:
                  min_value: 0.00
          - name: capacity_factor_offshore_wind
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  max_value: 1.00
              - dbt_expectations.expect_column_min_to_be_between:
                  min_value: 0.00
          - name: hour_of_year
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  min_value: 8759
                  max_value: 8761
          - name: datetime_utc
            data_tests:
              - not_null
              - dbt_expectations.expect_column_values_to_not_be_in_set:
                  value_set: ["{{ dbt_date.date(2020, 12, 31) }}"]
          - name: county_or_lake_name
            data_tests:
              - not_null
              - dbt_expectations.expect_column_values_to_not_be_in_set:
                  value_set: ["bedford_city", "clifton_forge_city"]
@@ -0,0 +1,8 @@
packages:
  - package: calogica/dbt_expectations
    version: 0.10.4
  - package: dbt-labs/dbt_utils
    version: 1.3.0
  - package: calogica/dbt_date
    version: 0.10.1
sha1_hash: 29571f46f50e6393ca399c3db7361c22657f2d6b
@@ -0,0 +1,5 @@
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
  - package: dbt-labs/dbt_utils
    version: [">=1.3.0", "<1.4.0"]
@@ -0,0 +1,17 @@
pudl_dbt:
  outputs:
    # Define targets for nightly builds, and local ETL full/fast.
    # See models/schema.yml for further configuration.
    nightly:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
      filesystems:
        - fs: s3
    etl-full:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
    etl-fast:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"

  target: nightly
> **Review comment:** @zschira I am confused by the difference between this instruction that tells users to add the table schema as an individual `yml` file with the table name, and the instructions below for adding tests, where you mention a `schema.yml` file that would presumably contain all models.