Merge pull request #12 from dwreeves/dev-pg
[Draft] Support Postgres
dwreeves authored Oct 30, 2023
2 parents fbb61d8 + 73b8fa6 commit 47c5b88
Showing 53 changed files with 360 additions and 140 deletions.
77 changes: 51 additions & 26 deletions .github/workflows/tests.yml
Original file line number Diff line number Diff line change
@@ -7,33 +7,58 @@ on:
branches:
- main
jobs:
test:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
- uses: pre-commit/[email protected]
integration-tests:
runs-on: ubuntu-latest
services:
postgres:
image: postgres
env:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: dbt_linreg
ports:
- 5432:5432
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
strategy:
matrix:
dbt_core: [1.3.*, 1.4.*, 1.5.*, 1.6.*]
dbt_core: [1.4.*, 1.6.*]
db_target: [dbt-duckdb, dbt-postgres]
steps:
- uses: actions/checkout@v1
- uses: actions/setup-python@v1
with:
python-version: "3.10"
architecture: x64
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: 1.4.0
virtualenvs-create: true
virtualenvs-in-project: true
- name: Install dependencies
run: |
sudo apt-get update
sudo apt-get install
chmod +x ./run
./run setup
pip install -U "dbt-core==$DBT_CORE_VERSION" "dbt-duckdb"
env:
DBT_CORE_VERSION: ${{ matrix.dbt_core }}
- name: Lint
run: ./run lint
- name: Test
run: ./run test
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
with:
python-version: "3.10"
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: 1.4.0
virtualenvs-create: true
virtualenvs-in-project: true
- name: Install dependencies
run: |
sudo apt-get update
sudo apt-get install
chmod +x ./run
./run setup
pip install -U "dbt-core==$DBT_CORE_VERSION" "${DBT_PROVIDER_PACKAGE}"
env:
DBT_CORE_VERSION: ${{ matrix.dbt_core }}
DBT_PROVIDER_PACKAGE: ${{ matrix.db_target }}
- name: Test
run: ./run test "${DBT_TARGET}"
env:
DBT_TARGET: ${{ matrix.db_target }}
POSTGRES_HOST: localhost
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: dbt_linreg
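
A note on the updated `strategy.matrix`: GitHub Actions runs one job per combination of matrix values, so two `dbt_core` versions and two `db_target` packages yield four `integration-tests` jobs. The Python sketch below only mimics that expansion and the resulting `pip install` line from the "Install dependencies" step. The helper names `expand_matrix` and `install_command` are hypothetical and are not part of the repository.

```python
from itertools import product

# Values copied from the updated `strategy.matrix` in tests.yml.
matrix = {
    "dbt_core": ["1.4.*", "1.6.*"],
    "db_target": ["dbt-duckdb", "dbt-postgres"],
}


def expand_matrix(matrix):
    """Return one dict per job: the cross product of all matrix axes."""
    keys = list(matrix)
    combos = product(*(matrix[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def install_command(job):
    """Render the pip command from the install step for one job (illustrative)."""
    return f'pip install -U "dbt-core=={job["dbt_core"]}" "{job["db_target"]}"'


jobs = expand_matrix(matrix)
for job in jobs:
    print(install_command(job))
```

Each job also passes its `db_target` value to `./run test` as `DBT_TARGET`, which is how the same workflow exercises both DuckDB and Postgres.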
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
dbt.duckdb
dbt.duckdb.wal
.user.yml
docs/site/
integration_tests/seeds/*.csv
5 changes: 5 additions & 0 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
@@ -17,3 +17,8 @@ repos:
hooks:
- id: shellcheck
args: [-x, run]

- repo: https://github.com/rhysd/actionlint
rev: v1.6.26
hooks:
- id: actionlint
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,9 @@
# Changelog

### `0.2.3`

- Added Postgres support in integration tests + fixed bugs that prevented Postgres from working.

### `0.2.2`

- Added dbt documentation of the `ols()` macro.
20 changes: 15 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
@@ -30,7 +30,7 @@ Add this the `packages:` list your dbt project's `packages.yml`:

```yaml
- package: "dwreeves/dbt_linreg"
version: "0.2.2"
version: "0.2.3"
```
The full file will look something like this:
@@ -41,7 +41,7 @@ packages:
# Other packages here
# ...
- package: "dwreeves/dbt_linreg"
version: "0.2.2"
version: "0.2.3"
```
# Examples
@@ -64,7 +64,7 @@ select * from {{
format='long',
format_options={'round': 5}
)
}}
}} as linreg
```

Output:
@@ -169,9 +169,12 @@ group by

- Snowflake
- DuckDB
- Postgres\*

If `dbt_linreg` does not work in your database tool, please let me know in a bug report and I can make sure it is supported.

> _* Minimal support. Postgres is syntactically supported, but is not performant under certain circumstances._

# API

The only function available in the public API is the `dbt_linreg.ols()` macro.
@@ -255,7 +258,7 @@ This method calculates regression coefficients using the Moore-Penrose pseudo-in
Specify these in a dict using the `method_options=` kwarg:

- **safe** (default = `True`): If True, returns null coefficients instead of an error when X is perfectly multicollinear. If False, a negative value will be passed into a SQRT(), and most SQL engines will raise an error when this happens.
- **subquery_optimization** (default = `True`): If True, nested subqueries are used during some of the steps to optimize the query speed. If false, the query is flattened. Note that turning this off can significantly degrade performance.
- **subquery_optimization** (default: `True`): If True, nested subqueries are used during some of the steps to optimize the query speed. If false, the query is flattened.
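
To illustrate the `safe` option described above: a Cholesky factorization takes the square root of each diagonal pivot, and that pivot stops being positive when the matrix comes from perfectly multicollinear columns. The following is a minimal Python sketch of that failure mode, not the package's SQL. The function name `cholesky` is hypothetical, and treating an exactly zero pivot like a negative one is a choice made for this sketch.

```python
import math


def cholesky(a, safe=True):
    """Lower-triangular Cholesky factor of a symmetric matrix `a`.

    With safe=True, return None when a pivot is not positive (for example
    when `a` is X'X for perfectly multicollinear columns). With safe=False,
    raise instead, loosely mirroring a SQL engine rejecting SQRT of a
    negative value.
    """
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = a[i][i] - partial
                if pivot <= 0:
                    if safe:
                        return None
                    raise ValueError("non-positive pivot: matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (a[i][j] - partial) / lower[j][j]
    return lower


well_conditioned = [[4.0, 2.0], [2.0, 5.0]]
collinear = [[4.0, 8.0], [8.0, 16.0]]  # X'X for columns x and 2 * x

print(cholesky(well_conditioned))
print(cholesky(collinear))
```

Returning nothing rather than raising is what lets a grouped regression keep the groups that are well conditioned while the degenerate ones come back null.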

## `fwl` method

@@ -269,10 +272,12 @@ Ridge regression is implemented using the augmentation technique described in Ex

There are a few reasons why this method is discouraged over the `chol` method:

- 🐌 It tends to be much slower, and struggles to efficiently calculate large number of columns.
- 🐌 It tends to be much slower in OLAP systems, and struggles to efficiently calculate large number of columns.
- 📊 It does not calculate standard errors.
- 😕 For ridge regression, coefficients are not accurate; they tend to be off by a magnitude of ~0.01%.

So when should you use `fwl`? The main use case is in OLTP systems (e.g. Postgres) for unregularized coefficient estimation. Long story short, the `chol` method relies on subquery optimization to be more performant than `fwl`; however, OLTP systems do not benefit at all from subquery optimization. This means that `fwl` is slightly more performant in this context.
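
For readers unfamiliar with the idea behind `fwl`, here is a small Python sketch. It assumes the method name refers to the Frisch-Waugh-Lovell approach of building multiple regression out of simple univariate regressions; this diff does not show the package's SQL, so treat the code as an illustration of the theorem only. All function names are hypothetical, and the example omits the intercept and uses just two regressors.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def residualize(target, other):
    """Residual of a no-intercept simple regression of `target` on `other`."""
    gamma = dot(target, other) / dot(other, other)
    return [t - gamma * o for t, o in zip(target, other)]


def fwl_coefficient(y, x, control):
    """Coefficient on `x` in the regression of y on (x, control).

    By the Frisch-Waugh-Lovell theorem this equals the simple regression
    of y on the part of x that is orthogonal to `control`.
    """
    r = residualize(x, control)
    return dot(r, y) / dot(r, r)


x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 1.0, 0.0]
y = [2.0 * a + 3.0 * b for a, b in zip(x1, x2)]  # true coefficients: 2 and 3

beta1 = fwl_coefficient(y, x1, x2)
beta2 = fwl_coefficient(y, x2, x1)
print(beta1, beta2)
```

With more than two regressors each column has to be residualized against all the others, so the number of univariate steps grows quickly as columns are added.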

# Notes

- ⚠️ **If your coefficients are null, it does not mean dbt_linreg is broken, it most likely means your feature columns are perfectly multicollinear.** If you are 100% sure that is not the issue, please file a bug report with a minimally reproducible example.
@@ -282,6 +287,11 @@ There are a few reasons why this method is discouraged over the `chol` method:
- An array input (e.g. `alpha=[0.01, 0.02, 0.03, 0.04, 0.05]`) will apply an alpha of `0.01` to the first column, `0.02` to the second column, etc.
- `alpha` is equivalent to what TEoSL refers to as "lambda," times the sample size N. That is to say: `α ≡ λ * N`.

- Regularization as currently implemented for the `chol` method tends to be very slow in OLTP systems (e.g. Postgres), but is very performant in OLAP systems (e.g. Snowflake, DuckDB, BigQuery, Redshift). As dbt is more commonly used in OLAP contexts, the code base is optimized for the OLAP use case.
- That said, it may be possible to make regularization in OLTP more performant (e.g. with augmentation of the design matrix), so PRs are welcome.

- Regression coefficients in Postgres are always `numeric` types.
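
Two of the notes above lend themselves to a short worked example: the `α ≡ λ * N` relationship, and ridge regression by augmenting the design matrix. The Python sketch below is a one-regressor, no-intercept illustration with hypothetical function names. The `penalty` argument is a generic ridge penalty added to X'X; how the package scales `alpha` internally is not shown in this diff, so no mapping between the two is claimed here.

```python
import math


def alpha_from_lambda(lam, n):
    """Convert a TEoSL-style lambda to alpha using alpha = lambda * N."""
    return lam * n


def ols_no_intercept(x, y):
    """Simple regression through the origin: (x . y) / (x . x)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def ridge_closed_form(x, y, penalty):
    """One-regressor ridge estimate: (x . y) / (x . x + penalty)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + penalty)


def ridge_by_augmentation(x, y, penalty):
    """Same estimate via plain OLS on augmented data.

    Appending sqrt(penalty) to x and 0 to y adds `penalty` to x . x
    while leaving x . y unchanged.
    """
    x_aug = x + [math.sqrt(penalty)]
    y_aug = y + [0.0]
    return ols_no_intercept(x_aug, y_aug)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
print(ridge_closed_form(x, y, 1.0), ridge_by_augmentation(x, y, 1.0))
print(alpha_from_lambda(0.01, 500))
```

With several regressors the augmentation appends one extra row per column, a diagonal block of square-rooted penalties, which is why the technique only needs an OLS solver.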

### Possible future features

Some things I am thinking about working on down the line:
2 changes: 1 addition & 1 deletion dbt_project.yml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
name: "dbt_linreg"
version: "0.2.2"
version: "0.2.3"

# 1.2 is required because of modules.itertools.
require-dbt-version: [">=1.2.0", "<2.0.0"]
2 changes: 1 addition & 1 deletion integration_tests/dbt_project.yml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
name: "dbt_linreg_tests"
version: "0.2.1"
version: "0.2.3"

require-dbt-version: [">=1.0.0", "<2.0.0"]

Original file line number Diff line number Diff line change
@@ -11,4 +11,4 @@ select * from {{
format='long',
add_constant=False
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -12,4 +12,4 @@ select * from {{
format='long',
add_constant=False
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -11,4 +11,4 @@ select * from {{
format='long',
add_constant=False
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -11,4 +11,4 @@ select * from {{
format='long',
add_constant=False
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -11,4 +11,4 @@ select * from {{
format='long',
add_constant=False
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -11,4 +11,4 @@ select * from {{
format='long',
add_constant=False
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
{{
config(
materialized="table"
materialized="table",
tags=["skip-postgres"]
)
}}
select * from {{
@@ -12,4 +13,4 @@ select * from {{
format='long',
add_constant=False
)
}}
}} as linreg
5 changes: 3 additions & 2 deletions integration_tests/models/collinear_matrix_regression_chol.sql
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
{{
config(
materialized="table"
materialized="table",
tags=["skip-postgres"]
)
}}
select * from {{
@@ -11,4 +12,4 @@ select * from {{
format='long',
method='chol'
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
{{
config(
materialized="table"
materialized="table",
tags=["skip-postgres"]
)
}}
select * from {{
@@ -12,4 +13,4 @@ select * from {{
method='chol',
method_options={'subquery_optimization': False}
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -11,4 +11,4 @@ select * from {{
format='long',
method='fwl'
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
{{
config(
materialized="table"
materialized="table",
tags=["skip-postgres"]
)
}}
select * from {{
@@ -12,4 +13,4 @@ select * from {{
alpha=0.01,
method='chol'
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
{{
config(
materialized="table"
materialized="table",
tags=["skip-postgres"]
)
}}
select * from {{
@@ -13,4 +14,4 @@ select * from {{
method='chol',
method_options={'subquery_optimization': False}
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -12,4 +12,4 @@ select * from {{
alpha=0.01,
method='fwl'
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
{{
config(
materialized="table"
materialized="table",
tags=["skip-postgres"]
)
}}
select * from {{
@@ -10,7 +11,8 @@ select * from {{
exog=['x1', 'x2', 'x3'],
group_by=['gb_var'],
format='long',
method='chol'
method='chol',
method_options={'subquery_optimization': True}
)
}}
}} as linreg
order by gb_var, variable_name
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
{{
config(
materialized="table"
materialized="table",
tags=["skip-postgres"]
)
}}
select * from {{
@@ -13,5 +14,5 @@ select * from {{
method='chol',
method_options={'subquery_optimization': False}
)
}}
}} as linreg
order by gb_var, variable_name
2 changes: 1 addition & 1 deletion integration_tests/models/groups_matrix_regression_fwl.sql
Original file line number Diff line number Diff line change
@@ -12,5 +12,5 @@ select * from {{
format='long',
method='fwl'
)
}}
}} as linreg
order by gb_var, variable_name
4 changes: 2 additions & 2 deletions integration_tests/models/long_format_options.sql
Original file line number Diff line number Diff line change
@@ -18,7 +18,7 @@ select
'strip_quotes': True
}
)
}}
}} as linreg1

union all

@@ -37,4 +37,4 @@ select
'strip_quotes': False
}
)
}}
}} as linreg2
Original file line number Diff line number Diff line change
@@ -17,4 +17,4 @@ select * from {{
endog='y',
exog=['xa', 'xb']
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -11,4 +11,4 @@ select * from {{
format='long',
format_options={'round': 5}
)
}}
}} as linreg
Original file line number Diff line number Diff line change
@@ -11,4 +11,4 @@ select * from {{
format='long',
format_options={'round': 5}
)
}}
}} as linreg