Commit
Merge pull request #23 from dwreeves/support-clickhouse-and-rename-var
Add support for Clickhouse + bump to 0.3.0
dwreeves authored Jan 7, 2025
2 parents 2881c7d + d36580f commit 6c92344
Showing 88 changed files with 2,130 additions and 2,496 deletions.
34 changes: 16 additions & 18 deletions .github/workflows/tests.yml
@@ -15,9 +15,13 @@ jobs:
- uses: pre-commit/[email protected]
integration-tests:
runs-on: ubuntu-latest
strategy:
matrix:
dbt_core: [1.5.*, 1.8.*]
db_target: [duckdb, postgres, clickhouse]
services:
postgres:
image: postgres
image: ${{ (matrix.db_target == 'postgres') && 'postgres' || '' }}
env:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
@@ -29,34 +33,28 @@ jobs:
--health-interval 10s
--health-timeout 5s
--health-retries 5
strategy:
matrix:
dbt_core: [1.4.*, 1.7.*]
db_target: [dbt-duckdb, dbt-postgres]
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
with:
python-version: "3.10"
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: 1.4.0
virtualenvs-create: true
virtualenvs-in-project: true
- name: Install dependencies
- name: Install uv
uses: astral-sh/setup-uv@v5
- name: Setup
run: |
sudo apt-get update
sudo apt-get install
chmod +x ./run
./run setup
pip install -U "dbt-core==$DBT_CORE_VERSION" "${DBT_PROVIDER_PACKAGE}"
uv venv
uv sync --group python-dev
uv pip install -U "dbt-core==$DBT_CORE_VERSION" "dbt-${DBT_TARGET}==$DBT_CORE_VERSION"
env:
UV_NO_SYNC: true
DO_NOT_TRACK: 1
DBT_CORE_VERSION: ${{ matrix.dbt_core }}
DBT_PROVIDER_PACKAGE: ${{ matrix.db_target }}
DBT_TARGET: ${{ matrix.db_target }}
- name: Test
run: ./run test "${DBT_TARGET}"
env:
UV_NO_SYNC: true
DO_NOT_TRACK: 1
DBT_TARGET: ${{ matrix.db_target }}
POSTGRES_HOST: localhost
POSTGRES_USER: postgres
3 changes: 0 additions & 3 deletions .idea/.gitignore

This file was deleted.

34 changes: 0 additions & 34 deletions .idea/dbt_linreg.iml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion .python-version
@@ -1 +1 @@
3.10.4
3.11
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
# Changelog

### `0.3.0`

- Official support for Clickhouse!
- Rename `format=` and `format_options=` to `output=` and `output_options=` to make the API consistent with **dbt_pca**.

### `0.2.6`

- Fix bug with `group_by` on multiple variables; contributed by [@svkohler](https://github.com/dwreeves/dbt_linreg/issues/21).
52 changes: 29 additions & 23 deletions README.md
@@ -14,7 +14,7 @@

# Overview

**dbt_linreg** is an easy way to perform linear regression and ridge regression in SQL (Snowflake, DuckDB, and more) with OLS using dbt's Jinja2 templating.
**dbt_linreg** is an easy way to perform linear regression and ridge regression in SQL (Snowflake, DuckDB, Clickhouse, and more) with OLS using dbt's Jinja2 templating.

Reasons to use **dbt_linreg**:

@@ -32,7 +32,7 @@ Add this the `packages:` list your dbt project's `packages.yml`:

```yaml
- package: "dwreeves/dbt_linreg"
version: "0.2.6"
version: "0.3.0"
```
The full file will look something like this:
@@ -43,7 +43,7 @@ packages:
# Other packages here
# ...
- package: "dwreeves/dbt_linreg"
version: "0.2.6"
version: "0.3.0"
```
# Examples
@@ -63,8 +63,8 @@ select * from {{
table=ref('simple_matrix'),
endog='y',
exog=['xa', 'xb', 'xc'],
format='long',
format_options={'round': 5}
output='long',
output_options={'round': 5}
)
}} as linreg
```
@@ -171,9 +171,10 @@ group by

- Snowflake
- DuckDB
- Clickhouse
- Postgres\*

If `dbt_linreg` does not work in your database tool, please let me know in a bug report and I can make sure it is supported.
If **dbt_linreg** does not work in your database tool, please let me know in a bug report.

> _* Minimal support. Postgres is syntactically supported, but is not performant under certain circumstances._

@@ -189,8 +190,8 @@ def ols(
endog: str,
exog: Union[str, list[str]],
add_constant: bool = True,
format: Literal['wide', 'long'] = 'wide',
format_options: Optional[dict[str, Any]] = None,
output: Literal['wide', 'long'] = 'wide',
output_options: Optional[dict[str, Any]] = None,
group_by: Optional[Union[str, list[str]]] = None,
alpha: Optional[Union[float, list[float]]] = None,
method: Literal['chol', 'fwl'] = 'chol',
@@ -205,38 +206,39 @@ Where:
- **endog**: The endogenous variable / y variable / target variable of the regression. (You can also specify `y=...` instead of `endog=...` if you prefer.)
- **exog**: The exogenous variables / X variables / features of the regression. (You can also specify `x=...` instead of `exog=...` if you prefer.)
- **add_constant**: If true, a constant term is added automatically to the regression.
- **format**: Either "wide" or "long" format for coefficients. See **Formats and format options** for more.
- **output**: Either "wide" or "long" output format for coefficients. See **Outputs and output options** for more.
- If `wide`, the variables span the columns with their original variable names, and the coefficients fill a single row.
- If `long`, the coefficients are in a single column called `coefficient`, and the variable names are in a single column called `variable_name`.
- **format_options**: See **Formats and format options** section for more.
- **output_options**: See **Outputs and output options** section for more.
- **group_by**: If specified, the regression will be grouped by these variables, and individual regressions will run on each group.
- **alpha**: If not null, the regression will be run as a ridge regression with a penalty of `alpha`. See **Notes** section for more information.
- **method**: The method used to calculate the regression. See **Methods and method options** for more.
- **method_options**: Options specific to the estimation method. See **Methods and method options** for more.
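
To make the two output shapes concrete, here is a small Python sketch of how one `wide` row maps onto `long` rows. The helper `wide_to_long` and the coefficient values are hypothetical and only illustrate the layout described above; they are not part of **dbt_linreg** and were not produced by it.

```python
from typing import Any


def wide_to_long(
    wide_row: dict[str, float],
    variable_column_name: str = "variable_name",
    coefficient_column_name: str = "coefficient",
) -> list[dict[str, Any]]:
    """Reshape one wide coefficient row into long rows (illustration only)."""
    return [
        {variable_column_name: name, coefficient_column_name: value}
        for name, value in wide_row.items()
    ]


# Hypothetical coefficients for exog=['xa', 'xb'] plus the constant term.
# In the wide shape, the variables span the columns of a single row.
wide = {"const": 10.0, "xa": 5.0, "xb": 7.0}

# In the long shape, each variable becomes its own row.
long_rows = wide_to_long(wide)
for row in long_rows:
    print(row)
```

The default column names in the sketch mirror the documented defaults `variable_name` and `coefficient`.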

# Formats and format options
# Outputs and output options

Outputs can be returned either in `format='long'` or `format='wide'`.
Outputs can be returned either in `output='long'` or `output='wide'`.

(In the future, I might add one or two more formats, notably a summary table format.)
All outputs have their own output options, which can be passed into the `output_options=` arg as a dict, e.g. `output_options={'foo': 'bar'}`.

All formats have their own format options, which can be passed into the `format_options=` arg as a dict, e.g. `format_options={'foo': 'bar'}`.
`output=` and `output_options=` were formerly named `format=` and `format_options=` respectively.
The old names have been deprecated to make **dbt_linreg**'s API more consistent with **dbt_pca**'s API.
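
For projects with many models, the rename can be applied mechanically. The following is a rough Python sketch, not a tool shipped with **dbt_linreg**: the function name `rename_format_args` is hypothetical, and the regex is deliberately blunt, so it would also rewrite an unrelated `format=` keyword argument passed to some other macro. Review the result before committing it.

```python
import re

# Matches `format=` or `format_options=` used as a keyword argument,
# but not `==` comparisons or longer names such as `date_format=`.
_OLD_ARG = re.compile(r"\bformat(_options)?(\s*)=(?!=)")


def rename_format_args(sql: str) -> str:
    """Rewrite the old argument names to the new ones (illustration only)."""
    return _OLD_ARG.sub(
        lambda m: "output" + (m.group(1) or "") + m.group(2) + "=",
        sql,
    )


old_model = """select * from {{
  dbt_linreg.ols(
    table=ref('simple_matrix'),
    endog='y',
    exog=['xa', 'xb', 'xc'],
    format='long',
    format_options={'round': 5}
  )
}} as linreg"""

print(rename_format_args(old_model))
```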

### Options for `format='long'`
### Options for `output='long'`

- **round** (default = `None`): If not None, round all coefficients to `round` number of digits.
- **constant_name** (default = `'const'`): String name that refers to constant term.
- **variable_column_name** (default = `'variable_name'`): Column name storing strings of variable names.
- **coefficient_column_name** (default = `'coefficient'`): Column name storing model coefficients.
- **strip_quotes** (default = `True`): If true, strip outer quotes from column names if provided; if false, always use string literals.

These options are available for `format='long'` only when `method='chol'`:
These options are available for `output='long'` only when `method='chol'`:

- **calculate_standard_error** (default = `True if not alpha else False`): If true, provide the standard error in the output.
- **standard_error_column_name** (default = `'standard_error'`): Column name storing the standard error for the parameter.
- **t_statistic_column_name** (default = `'t_statistic'`): Column name storing the t-statistic for the parameter.

### Options for `format='wide'`
### Options for `output='wide'`

- **round** (default = `None`): If not None, round all coefficients to `round` number of digits.
- **constant_name** (default = `'const'`): String name that refers to constant term.
@@ -290,6 +292,7 @@ So when should you use `fwl`? The main use case is in OLTP systems (e.g. Postgre
- A scalar input (e.g. `alpha=0.01`) will apply an alpha of `0.01` to all features.
- An array input (e.g. `alpha=[0.01, 0.02, 0.03, 0.04, 0.05]`) will apply an alpha of `0.01` to the first column, `0.02` to the second column, etc.
- `alpha` is equivalent to what TEoSL refers to as "lambda," times the sample size N. That is to say: `α ≡ λ * N`.
- (Of course, you can regularize the constant term by DIYing your own constant term and doing `add_constant=false`.)

- Regularization as currently implemented for the `chol` method tends to be very slow in OLTP systems (e.g. Postgres), but is very performant in OLAP systems (e.g. Snowflake, DuckDB, BigQuery, Redshift). As dbt is more commonly used in OLAP contexts, the code base is optimized for the OLAP use case.
- That said, it may be possible to make regularization in OLTP more performant (e.g. with augmentation of the design matrix), so PRs are welcome.
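
The scalar and array forms of `alpha`, and the `α ≡ λ * N` relationship, can be sketched in Python as follows. Both helper names are hypothetical and exist only for illustration. The length check is this sketch's own choice, since the text here does not say how the macro handles a list of the wrong length.

```python
from typing import Sequence, Union


def alpha_from_lambda(lam: float, n_obs: int) -> float:
    """Convert a TEoSL-style lambda to alpha via alpha = lambda * N."""
    return lam * n_obs


def expand_alpha(
    alpha: Union[float, Sequence[float]],
    exog: Sequence[str],
) -> dict[str, float]:
    """Map each feature to its penalty.

    A scalar applies to every feature; a list applies position by position.
    """
    if isinstance(alpha, (int, float)):
        return {name: float(alpha) for name in exog}
    if len(alpha) != len(exog):
        # Assumption of this sketch, not documented behavior of the macro.
        raise ValueError("alpha list must have one entry per feature")
    return dict(zip(exog, (float(a) for a in alpha)))


features = ["x1", "x2", "x3"]
print(expand_alpha(0.01, features))                # same penalty everywhere
print(expand_alpha([0.01, 0.02, 0.03], features))  # one penalty per column
print(alpha_from_lambda(0.001, 500))               # lambda = 0.001, N = 500
```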
@@ -298,19 +301,22 @@ So when should you use `fwl`? The main use case is in OLTP systems (e.g. Postgre

### Possible future features

Some things I am thinking about working on down the line:
Some things that could happen in the future:

- **Optimization:** Given access to Jinja2 templating, there may be more efficient ways to calculate a closed-form OLS solution than the approach taken in this code base.
- Weighted least squares (WLS)
- P-values
- Heteroskedasticity robust standard errors
- Recursive CTE implementations + long formatted inputs

- **Standard errors and t-stats:** For the `format='long'` output (or perhaps a new format?), there is space to sensibly add t-stats and standard errors. The main challenge is that this necessitates inverting a covariance matrix, although this is theoretically doable using Jinja2 templating.
Note that although I maintain this library (as of writing in 2025), I do not actively add new features, so the items on this wish list are unlikely to be implemented unless I personally need them or someone else contributes them.

# FAQ

### How does this work?

See **Methods and method options** section for a full breakdown of each linear regression implementation.

All approaches were validated using Statsmodels `sm.OLS()`. Note that the ridge regression coefficients differ very slightly from Statsmodels's outputs for currently unknown reasons, but the coefficients are very close (I enforce a `<0.01%` deviation from Statsmodels's ridge regression coefficients in my integration tests).
All approaches were validated using Statsmodels `sm.OLS()`.

### BigQuery (or other database) has linear regression implemented natively. Why should I use `dbt_linreg` over that?

@@ -334,11 +340,11 @@ I opt to leave out dummy variable support because it's tricky, and I want to kee

Note that you couldn't simply add categorical variables in the same list as numeric variables because Jinja2 templating is not natively aware of the types you're feeding through it, nor does Jinja2 know the values that a string variable can take on. The way you would actually implement categorical variables is with `group by` trickery (i.e. center both y and X by categorical variable group means), although I am not sure how to do that efficiently for more than one categorical variable column.

If you'd like to regress on a categorical variable, for now you'll need to do your own feature engineering, e.g. `(foo = 'bar')::int as foo_bar`
If you'd like to regress on a categorical variable, for now you'll need to do your own feature engineering, e.g. `(foo = 'bar')::int as foo_bar, (foo = 'baz')::int as foo_baz`.
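
If a categorical column has many levels, the dummy expressions can be generated rather than typed by hand. This Python sketch builds the same `(foo = 'bar')::int as foo_bar` strings shown above. The helper `dummy_columns` is hypothetical, and it assumes every level is safe to use as part of a column alias. The `drop_first` flag is an addition of the sketch: when the regression includes a constant term, one level is usually left out to avoid perfect collinearity.

```python
def dummy_columns(
    column: str,
    levels: list[str],
    drop_first: bool = False,
) -> list[str]:
    """Build `(col = 'level')::int as col_level` expressions (illustration only)."""
    used = levels[1:] if drop_first else levels
    expressions = []
    for level in used:
        # Double any single quotes so the level is a valid SQL string literal.
        escaped = level.replace("'", "''")
        expressions.append(f"({column} = '{escaped}')::int as {column}_{level}")
    return expressions


print(",\n".join(dummy_columns("foo", ["bar", "baz"])))
```

The generated column aliases (here `foo_bar` and `foo_baz`) are what you would then pass in `exog=[...]`.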

### Why are there no p-values?

This is planned for the future, so stay tuned! P-values would require a lookup on a dimension table, which is a significant amount of work to manage nicely, but I hope to get to it soon.
This is something that might happen in the future. P-values would require a lookup on a dimension table, which is a significant amount of work to manage nicely.

In the meantime, you can implement this yourself: create a dimension table of half-open t-statistic intervals with their p-values, then left join your t-statistics to it to look up a p-value.
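
The half-open interval lookup can be prototyped in Python before building the table in SQL. This is a minimal sketch under a loud assumption: it uses a standard normal approximation for the two-sided p-value, which is only reasonable for large samples. A proper t-distribution table would also need degrees of freedom as a key. The names `lower_bounds`, `p_values`, and `lookup_p_value` are hypothetical.

```python
import math
from bisect import bisect_right

MAX_HUNDREDTHS = 1000  # the table covers |t| in [0, 10) in steps of 0.01


def two_sided_p_normal(t: float) -> float:
    """Two-sided p-value under a standard normal approximation (assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Each row stands for the half-open interval [lower, lower + 0.01)
# and stores the p-value evaluated at the interval's lower bound.
lower_bounds = [i / 100 for i in range(MAX_HUNDREDTHS)]
p_values = [two_sided_p_normal(lower) for lower in lower_bounds]


def lookup_p_value(t_statistic: float) -> float:
    """Return the p-value of the row whose interval contains |t|.

    This plays the role of the SQL join condition
    `abs(t) >= t_lower and abs(t) < t_upper`. Values at or beyond the end
    of the table fall into the last row.
    """
    index = bisect_right(lower_bounds, abs(t_statistic)) - 1
    return p_values[index]


print(lookup_p_value(1.96))
print(lookup_p_value(-2.58))
```

In SQL, `lower_bounds` and `p_values` would become the columns of the dimension table, with an explicit upper-bound column for the join.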

2 changes: 1 addition & 1 deletion dbt_project.yml
@@ -1,5 +1,5 @@
name: "dbt_linreg"
version: "0.2.6"
version: "0.3.0"

# 1.2 is required because of modules.itertools.
require-dbt-version: [">=1.2.0", "<2.0.0"]
Binary file removed docs/src/img/dbt-linreg-banner.png
Binary file not shown.
12 changes: 11 additions & 1 deletion integration_tests/dbt_project.yml
@@ -1,5 +1,5 @@
name: "dbt_linreg_tests"
version: "0.2.3"
version: "0.3.0"

require-dbt-version: [">=1.0.0", "<2.0.0"]

@@ -10,5 +10,15 @@ clean-targets: ["target", "dbt_modules", "dbt_packages"]
macro-paths: ["macros"]
log-path: "logs"

vars:
_test_precision_simple_matrix: '{{ "10e-8" if target.name == "clickhouse" else 0.0 }}'
_test_precision_collinear_matrix: '{{ "10e-6" if target.name == "clickhouse" else "10e-7" }}'

models:
+materialized: table

tests:
+store_failures: true

# During dev only!
profile: "dbt_linreg_profile"
@@ -8,7 +8,7 @@ select * from {{
table=ref('collinear_matrix'),
endog='y',
exog=['x1'],
format='long',
add_constant=False
output='long',
add_constant=false
)
}} as linreg
@@ -9,7 +9,7 @@ select * from {{
endog='y',
exog=['x1'],
alpha=2.0,
format='long',
add_constant=False
output='long',
add_constant=false
)
}} as linreg
@@ -8,7 +8,7 @@ select * from {{
table=ref('collinear_matrix'),
endog='y',
exog=['x1', 'x2'],
format='long',
add_constant=False
output='long',
add_constant=false
)
}} as linreg
@@ -8,7 +8,7 @@ select * from {{
table=ref('collinear_matrix'),
endog='y',
exog=['x1', 'x2', 'x3'],
format='long',
add_constant=False
output='long',
add_constant=false
)
}} as linreg
@@ -8,7 +8,7 @@ select * from {{
table=ref('collinear_matrix'),
endog='y',
exog=['x1', 'x2', 'x3', 'x4'],
format='long',
add_constant=False
output='long',
add_constant=false
)
}} as linreg
@@ -8,7 +8,7 @@ select * from {{
table=ref('collinear_matrix'),
endog='y',
exog=['x1', 'x2', 'x3', 'x4', 'x5'],
format='long',
add_constant=False
output='long',
add_constant=false
)
}} as linreg
@@ -10,7 +10,7 @@ select * from {{
endog='y',
exog=['x1', 'x2', 'x3', 'x4', 'x5'],
alpha=1.0,
format='long',
add_constant=False
output='long',
add_constant=false
)
}} as linreg
@@ -9,7 +9,7 @@ select * from {{
table=ref('collinear_matrix'),
endog='y',
exog=['x1', 'x2', 'x3', 'x4', 'x5'],
format='long',
output='long',
method='chol'
)
}} as linreg
@@ -1,15 +1,15 @@
{{
config(
materialized="table",
tags=["skip-postgres"]
tags=["skip-postgres", "skip-clickhouse"]
)
}}
select * from {{
dbt_linreg.ols(
table=ref('collinear_matrix'),
endog='y',
exog=['x1', 'x2', 'x3', 'x4', 'x5'],
format='long',
output='long',
method='chol',
method_options={'subquery_optimization': False}
)