Implement a delete+insert incremental_strategy for Google BigQuery #2020
Comments
I'd be a big fan of that! Just ran into this the other day. I'm still figuring out how to best profile BQ queries (vs. Snowflake), but it seems this would help a lot with table scans when the source table is partitioned by date.
@clausherther I know that we had talked about this before - I was surprised that we didn't already have an issue to track this. Thanks for creating this one @amcarvalho. I know @jtcohen6 has been giving some thought to the BigQuery incremental materialization. Check out the "partition overwrite" section of this comment: #1971 (comment) Really the whole thread is pretty good and worth the read :) We're imagining that the BigQuery version of this strategy looks more like "insert_overwrite" than "delete+insert". The two are pretty similar in practice, but we don't actually want to run a delete + insert on BigQuery, as that would not be atomic (there are no transactions on BQ). Curious to hear what you all think!
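For a sense of why a single statement can stand in for delete + insert here, below is a rough sketch of the kind of MERGE an insert_overwrite-style strategy could issue on BigQuery. It is one DML statement, so removing the old rows and inserting the new ones succeed or fail together. The table names, column name, and partition values are made up for illustration; this is not the exact SQL dbt generates.

```sql
-- Illustrative only: a "replace these partitions" MERGE on BigQuery.
-- A single DML statement, so the delete and the insert are applied atomically.
merge into `my-project.analytics.fct_events` as target
using `my-project.analytics.fct_events__tmp` as source
on false
-- remove existing rows in the partitions being reprocessed
when not matched by source
  and target.event_date in (date '2020-01-01', date '2020-01-02')
  then delete
-- insert the freshly built rows for those partitions
when not matched then insert row
```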
@drewbanin ah super interesting, thanks! I hadn't realized there were no transactions in BQ. I'll read through that thread more, but so far this seems great!
@drewbanin that's a great thread, thanks for pointing it out! I understand that the non-atomic operation might be a problem if, say, the delete succeeds but the insert fails.
See the docs on insert_overwrite for usage info. Closing this one :)
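For anyone finding this later, here is a minimal sketch of what an insert_overwrite model can look like. The model, source, column names, and the 3-day lookback are placeholders, and the exact form of the partition_by config may vary by dbt version.

```sql
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'event_date', 'data_type': 'date'}
  )
}}

select
    event_date,
    user_id,
    count(*) as events
from {{ source('app', 'raw_events') }}  -- placeholder source
{% if is_incremental() %}
  -- only rebuild the most recent partitions; dbt swaps them in for the old ones
  where event_date >= date_sub(current_date(), interval 3 day)
{% endif %}
group by 1, 2
```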
Similar to what has been done for Snowflake (#1556), implement a delete+insert incremental strategy for Google BigQuery. This would allow us, as also described in #1556, to design pipelines for fact tables based entirely on a date.
As an alternative, we would have to define a unique_key based on multiple columns. That wouldn't work for cases where the source for the fact data now contains fewer rows, as the merge statement would leave the no-longer-present rows in the target table.
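To make that concrete, here is a rough sketch of the requested delete+insert pattern, with made-up table and column names. Step 1 clears everything for the date being reprocessed, so rows that have since disappeared from the source don't linger in the target, which is exactly what a merge keyed only on a unique_key can't guarantee. (As noted in the comments above, on BigQuery these would be two separate, non-atomic statements.)

```sql
-- Hypothetical delete+insert reprocessing of a single date in a fact table.
-- Step 1: remove everything previously loaded for that date, including rows
-- that no longer exist in the source.
delete from analytics.fct_orders
where order_date = date '2020-01-01';

-- Step 2: insert the current snapshot for that date.
insert into analytics.fct_orders (order_id, order_date, amount)
select order_id, order_date, amount
from analytics.stg_orders
where order_date = date '2020-01-01';
```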
This is a database-specific feature for Google BigQuery only, similar to what is already supported for Snowflake.
This would benefit the use case where we are looking to entirely reprocess data for a specific date/partition based on only a subset of the primary key columns. More details on the specific use case were provided by @drewbanin in #1556