Feature: cost-effective merge for partitioned incremental models on BigQuery #1971
Conversation
🤞 @jtcohen6 Thanks a lot. I hope it goes through. Can't wait for it. I believe it makes dbt better prepared for bigger data sets and stronger to withstand the test of time. 😃
This is groovy, thanks for sending over the PR @jtcohen6!
I have some tiny thoughts about the code - I've dropped them inline here. I think the approach is a good one though. Let me spend some time testing this out :)
see comments
Hey, I was looking to do something like this and stumbled upon your PR. I tried it and it works well. I have a couple of questions/comments:

```sql
(
    select * from {{tmp_relation}}
)
```

… in situations where …

Anyway, thanks for doing this, it's really helpful 👍
@inve1 Thanks for trying it out, and for the great ideas/comments!

**Even cheaper!**

Between your two suggestions, I prefer the second, where we set `source_sql` to:

```sql
{% set source_sql -%}
(
    select * from {{tmp_relation}}
)
{%- endset -%}
```

After some quick testing, I think this would be the most cost-effective:

1. Create partitioned temp table: charged for a full scan of the data needed by the model SQL
2. Get the `min` and `max` partition values from that temp table
3. Merge from the temp table into the destination table

Per your first suggestion, there may be some cases where it's extremely cheap to get the partition min/max values from the model SQL as

```sql
select min({{partition_by}}), max({{partition_by}}) from (
    {{ sql }}
)
```

but then we would be charged the entire cost of the model SQL over again when we want to merge in step 3.

For unpartitioned tables, we should still elide the need to create a temp table and use the model SQL directly, because we're only going to be using the model SQL once, in the merge statement.

**Partition overwrite**

This is a great question—and I think it's asking after a genuinely different materialization strategy, one that performs a counterintuitive (but highly effective) merge like:

```sql
merge into {{ target }} as DBT_INTERNAL_DEST
using {{ source }} as DBT_INTERNAL_SOURCE
on
    FALSE

when not matched by source
    and DBT_INTERNAL_DEST.{{partition_by}} between {{partition_min}} and {{partition_max}}
    then delete

when not matched then insert
    ({{ dest_cols_csv }})
values
    ({{ dest_cols_csv }})
```

This to me is an …

@drewbanin What do you think about adding first-class support for this as an alternative strategy for incremental models on BQ, in the same way that we support two incremental strategies on Snowflake?
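If dbt were to expose that as a first-class strategy, a model might opt into it with a config along these lines. This is only a sketch: the `incremental_strategy` config name and the `insert_overwrite` value are assumptions for illustration, not something this PR adds.

```sql
{#- Hypothetical model config: choose a partition-replacement strategy
    instead of the default merge-on-unique-key behavior. Names here are
    illustrative only. -#}
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = 'date(created_at)'
  )
}}

select * from {{ ref('events') }}
{% if is_incremental() %}
  -- on incremental runs, only rebuild recent partitions
  where date(created_at) >= date_sub(current_date, interval 3 day)
{% endif %}
```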
Yeah I agree the "selecting the temp table" approach seems a little better. On that note, you could create a real BigQuery temp table and do the whole merge job in one request:

```sql
DECLARE partition_min, partition_max TIMESTAMP;

CREATE TEMP TABLE scans__dbt_tmp  -- scans is just the name of my model
PARTITION BY DATE(start_time) AS (
  SELECT
    ......
  FROM
    .....
);

SET (partition_min, partition_max) = (
  SELECT AS STRUCT
    MIN(start_time),  -- partition_by column
    MAX(start_time)
  FROM scans__dbt_tmp
);

MERGE INTO
  .... AS DBT_INTERNAL_DEST
USING (
  SELECT * FROM `scans__dbt_tmp`
) AS DBT_INTERNAL_SOURCE
ON
  DBT_INTERNAL_DEST.start_time BETWEEN partition_min AND partition_max
  AND {{unique_key_constraint_here}}
WHEN MATCHED THEN UPDATE SET ....
WHEN NOT MATCHED THEN ..... VALUES ......
```

This has the advantage of not creating a table you can see in your datasets, not getting charged for storing the temp table for 12 hours, and …

Link to docs in case you haven't seen these.
That's really interesting stuff! I've only done a little reading about "true" temp tables and scripting in BQ, since they seem to be quite new as features. I'll definitely talk more about it with some of the folks here. For the moment, I think the use of …

Edit: @drewbanin and I had a chance to play around with scripting quickly. Very cool. It seems like this is a much more straightforward way to code up exactly what we want here. I'll give this a spin soon.
I've rewritten this update to the BQ incremental materialization such that it now leverages the beta scripting feature. This required some additional code in …

**Extensions to incremental materialization**

Now that this script-based framework has helped in making the code more straightforward, I think it will be very simple to enable the …

**Extensions to scripting**

Based on the work in #1967, I'd love to allow users to specify their … In the simplest case, that looks like:

```sql
{%- if is_incremental() -%}
{%- call sql_header -%}
DECLARE
    max_from_this date;
SET
    (max_from_this) = (select as struct max(my_date) from {{this}});
{%- endcall -%}
{%- endif -%}

with base as (
    select * from source_table
    {%- if is_incremental() -%}
    where my_date >= max_from_this
    {%- endif -%}
)
```

The trickiness comes with integrating the script-based work I've done here. Since each BQ script can only have one …

**Edit: Caveat**

I just reread the BQ docs and saw that they have beta support for integer-based partitioning. There are a couple of places in this new script-based approach that presume the partitioning column to be either …
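To make the "one script per run" idea concrete, here is a rough sketch of how the pieces could be ordered in a single generated BigQuery script. All names (`my_dataset.my_model`, `max_from_this`, and so on) are illustrative, and this is not the code the materialization actually emits.

```sql
-- Sketch only: a single BigQuery script per incremental run.
-- BigQuery requires all DECLAREs at the top of the script, so user
-- sql_header declarations and dbt's own declarations share one block.
DECLARE max_from_this DATE;                 -- from the user's sql_header
DECLARE partition_min, partition_max DATE;  -- added by dbt

SET max_from_this = (SELECT MAX(my_date) FROM my_dataset.my_model);

-- materialize the model SQL once, into a temp table
CREATE TEMP TABLE my_model__dbt_tmp AS
SELECT id, my_date, amount
FROM my_dataset.source_table
WHERE my_date >= max_from_this;

-- read the partition bounds from the temp table
SET (partition_min, partition_max) = (
  SELECT AS STRUCT MIN(my_date), MAX(my_date) FROM my_model__dbt_tmp
);

-- merge, scanning only the affected partitions of the destination
MERGE INTO my_dataset.my_model AS DBT_INTERNAL_DEST
USING my_model__dbt_tmp AS DBT_INTERNAL_SOURCE
ON DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id
  AND DBT_INTERNAL_DEST.my_date BETWEEN partition_min AND partition_max
WHEN MATCHED THEN UPDATE SET
  my_date = DBT_INTERNAL_SOURCE.my_date,
  amount = DBT_INTERNAL_SOURCE.amount
WHEN NOT MATCHED THEN INSERT (id, my_date, amount) VALUES (id, my_date, amount);
```

The open question raised in the comment above, how a user-supplied DECLARE block coexists with dbt's own, is visible here: both sets of declarations have to be merged into the single block at the top of the script.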
Very excited about this! Do you know if this will allow us to run this incremental strategy on tables that have partition filter requirements? https://getdbt.slack.com/archives/C2JRRQDTL/p1579289882011400
@clausherther Yes! These …

Out of curiosity, how are you configuring that flag with dbt? I don't think it's explicitly supported as a …
@jtcohen6 we set that option via an …
@drewbanin @beckjake I'm ready to hand this over for expert appraisal.
This looks great! I have a couple of style comments.

Can you add a test that the deprecated version works properly? The `012_deprecation_tests` folder has some tests; `TestDeprecations` should be a decent example. Set up a test that attempts to use the old behavior and runs with `strict=True`, which should fail. With `strict=False`, it should pass and generate a properly partitioned table.
lgtm!
@jtcohen6 I just changed the base branch to `dev/barbara-gittings` (0.16.0) - looks like there are some merge conflicts to account for as a result.

I'll do a quick pass here too, but this looks really stellar!
The changes are good to merge into …
Hi @jtcohen6 and @drewbanin, this is exciting stuff. Thank you for sharing it. I ran into some critical blockers in our dbt production that are related to this issue, and we would like to patch the fix quickly. Could I ask a few questions?

For dbt-labs/dbt-bigquery#5, as @inve1 mentioned, …

Thank you
This is looking really good! I left some comments here that I'd love to catch up with you about IRL!
I also cooked up a docs link as a placeholder, so we can merge this thing as soon as it's ready :)
core/dbt/deprecations.py

```
dbt inferred: {inferred_partition_by}

For more information, see:
[dbt docs link]
```
I made a placeholder guide here, to be completed before 0.16.0 is released. Can you add this link accordingly? https://docs.getdbt.com/docs/upgrading-to-0-16-0
```sql
{% endif %}

{# BigQuery only #}
{%- if partition_by -%}
```
I don't love that we're encoding this BigQuery-specific logic in the common merge implementation. Further, the `pprint_partition_field` macro used below is only defined in the BigQuery plugin. This probably won't be an issue, as the Snowflake code path won't be providing a `partition_by` arg here, but it does feel out of place to me.

I'd be comfortable just leaving the existing `common_get_merge_sql` as-is and instead baking this logic directly into the BigQuery implementation of `get_merge_sql`.

Alternatively, we could make this a little bit more generic and change the `partition_by` arg for this macro to `predicates` (or similar). The BigQuery implementation of `get_merge_sql` could supply a list of predicates to this macro, but other implementations (like Snowflake) could provide an empty list.

Either way, let's move the partitioning-specific logic out of this macro.
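As a rough illustration of the second option, a shared merge macro with a generic `predicates` argument might look something like this. The signature and names are assumptions for the sake of the sketch, and `dest_columns` is treated as a plain list of column names:

```sql
{#- Sketch only: an adapter-agnostic merge macro that accepts extra join
    predicates. BigQuery's get_merge_sql could pass partition-pruning
    predicates; other adapters (e.g. Snowflake) would pass an empty list. -#}
{% macro common_get_merge_sql(target, source, unique_key, dest_columns, predicates=[]) -%}
  {%- set dest_cols_csv = dest_columns | join(', ') -%}

  merge into {{ target }} as DBT_INTERNAL_DEST
  using {{ source }} as DBT_INTERNAL_SOURCE
  on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
  {%- for predicate in predicates %}
    and {{ predicate }}
  {%- endfor %}

  when matched then update set
  {%- for column in dest_columns %}
    {{ column }} = DBT_INTERNAL_SOURCE.{{ column }}{{ "," if not loop.last }}
  {%- endfor %}

  when not matched then insert ({{ dest_cols_csv }})
  values ({{ dest_cols_csv }})
{%- endmacro %}
```

The point is just the division of labor: the BigQuery adapter builds partition-pruning predicates and passes them in, while the common macro stays adapter-agnostic.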
```sql
merge into {{ target }} as DBT_INTERNAL_DEST
using {{ source }} as DBT_INTERNAL_SOURCE
on
{% if conditions|length == 0 %}
```
This is a slick way of doing this 👍
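For readers skimming the diff, the pattern being praised is roughly the following (a sketch, assuming the empty-list fallback is a constant-false join, as in the insert/overwrite example earlier in the thread):

```sql
{#- Sketch: build the merge ON clause from a list of conditions. With an
    empty list, join on FALSE so no rows match and every source row takes
    the "when not matched" branch. -#}
on
{%- if conditions | length == 0 %}
  FALSE
{%- else %}
  {%- for condition in conditions %}
  {{ 'and ' if not loop.first }}{{ condition }}
  {%- endfor %}
{%- endif %}
```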
```python
if 'range_bucket' in raw_partition_by.lower():
    dbt.exceptions.CompilerException('''
        BigQuery integer range partitioning (beta) is supported \
        by the new `partition_by` config, which accepts a \
```
> BigQuery integer range partitioning (beta) is only supported with the dictionary format of the `partition_by` config.
>
> More information: [Link from above]

^ I like using modifiers like "only", as they help clarify the nature of these validation errors. These error messages should:

- Explain what went wrong
- Explain how to fix it

Also, this error message will invariably stick around for longer than we intend, so I try to avoid temporal words like "new", as things like that tend to cause chuckles a couple of years down the line :p
```python
        dictionary. See: [dbt docs link TK]
        ''')  # TODO
else:
    p = re.compile(
```
This regex is a little too complicated for my liking... can we try something a little simpler instead? How about something like:

```python
partition_by = raw_partition_by.strip()

if partition_by.lower().startswith('date('):
    # pull the column name out of date(<column>)
    match = re.match(r'date\((.*)\)', partition_by, re.IGNORECASE)
    partition_by = match.group(1)
    data_type = 'date'
else:
    data_type = 'timestamp'

inferred_partition_by = ...
```

Let me know if I'm missing something important in the logic here!
```sql
{%- if partition_by_type in ('date','timestamp','datetime') -%}
  partition by {{pprint_partition_field(partition_by_dict)}}
{%- elif partition_by_type in ('int64') -%}
  {%- set pbr = partition_by_dict.range -%}
```
wouldn't mind a `pbr` …
```
@@ -62,18 +77,25 @@
{%- set raw_kms_key_name = config.get('kms_key_name', none) -%}
{%- set raw_labels = config.get('labels', []) -%}
{%- set sql_header = config.get('sql_header', none) -%}

{%- set partition_by_dict = adapter.parse_partition_by(raw_partition_by) -%}
{%- set is_scripting = (temporary == 'scripting') -%} {# "true" temp tables only possible when scripting #}
```
Can you just remind me where we ended up on this? BigQuery temp tables can't be used outside of scripting?
Closing in favor of #2140
Would there be any interest in something like this, so that we can get some improvement for integer keys that otherwise exceed the BQ script memory limits (e.g. dbt-labs/dbt-adapters#605)? It's extending the BQ implementation of the …
Attempt an implementation of the solution suggested by @jarlainnix in dbt-labs/dbt-bigquery#1034
When running an `incremental` model on BigQuery incrementally, IFF the model is partitioned, dbt should:

- create a temporary table with the results of the model SQL (with the `is_incremental()` input set limit)
- get the `min` and `max` partition values from that temporary table
- merge from that temporary table into the destination table, using those partition values to limit the scan of the destination

The way I've implemented this is quite verbose and requires adding a keyword arg to the contract of `get_merge_sql`. A better approach may leverage `**kwargs`. I'm very open to feedback.
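For illustration, the keyword-argument shape described above might look roughly like this on the BigQuery side. The macro name, the `partition_by` argument, and the `partition_min`/`partition_max` variables are assumptions of this sketch, not the merged implementation:

```sql
{#- Sketch: a BigQuery-specific merge that accepts one extra keyword
    argument and uses it to restrict the join to the destination table's
    affected partitions. dest_columns is treated as a list of column names. -#}
{% macro bigquery__get_merge_sql(target, source, unique_key, dest_columns, partition_by=none) -%}
  {%- set dest_cols_csv = dest_columns | join(', ') -%}

  merge into {{ target }} as DBT_INTERNAL_DEST
  using {{ source }} as DBT_INTERNAL_SOURCE
  on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
  {%- if partition_by is not none %}
    {#- partition_min / partition_max would be DECLAREd earlier in the script -#}
    and DBT_INTERNAL_DEST.{{ partition_by }} between partition_min and partition_max
  {%- endif %}

  when matched then update set
  {%- for column in dest_columns %}
    {{ column }} = DBT_INTERNAL_SOURCE.{{ column }}{{ "," if not loop.last }}
  {%- endfor %}

  when not matched then insert ({{ dest_cols_csv }})
  values ({{ dest_cols_csv }})
{%- endmacro %}
```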