
Implement a delete+insert incremental_strategy for Google BigQuery #2020

Closed
amcarvalho opened this issue Dec 20, 2019 · 5 comments
Labels
bigquery enhancement New feature or request

Comments

@amcarvalho

Similar to what has been done for Snowflake (#1556), implement a delete+insert incremental strategy for Google BigQuery. As also described in #1556, this would allow us to design pipelines for fact tables keyed entirely on a date.

The only alternative today is to define a unique_key spanning multiple columns. That doesn't work when the source for the fact data now contains fewer rows, because the merge statement would leave the stale rows in the target table.
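To make the limitation concrete, here is a sketch of the composite-key workaround (the model, source, and column names are all illustrative, not from a real project). The merge matches on the key, so existing rows are updated and new rows are inserted, but rows that have disappeared from the source are never deleted:

```sql
-- Hypothetical incremental model using a surrogate key over multiple columns
{{
  config(
    materialized = 'incremental',
    unique_key = 'surrogate_key'
  )
}}

select
    {{ dbt_utils.surrogate_key(['event_date', 'account_id']) }} as surrogate_key,
    event_date,
    account_id,
    sum(amount) as total_amount
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  -- only reprocess recent dates on incremental runs
  where event_date >= date_sub(current_date, interval 3 day)
{% endif %}
group by 1, 2, 3
```

If yesterday's rerun of the source produces fewer (date, account) combinations than the first run did, the extra rows from the first run survive the merge, which is exactly the problem described above.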

This is a database-specific feature for Google BigQuery only, similar to what is already supported for Snowflake.

This would benefit the use case where we want to entirely reprocess the data for a specific date/partition based on only a subset of the primary key columns. More details on the specific use case were provided by @drewbanin under #1556.

@clausherther
Contributor

I'd be a big fan of that! Just ran into this the other day. I'm still figuring out how to best profile BQ queries (vs Snowflake), but it seems this would help a lot with table scans when the source table is partitioned by date.

@drewbanin drewbanin added bigquery and removed triage labels Dec 20, 2019
@drewbanin
Contributor

@clausherther I know that we had talked about this before - I was surprised that we didn't already have an issue to track this. Thanks for creating this one @amcarvalho.

I know @jtcohen6 has been giving some thought to the BigQuery incremental materialization. Check out the "partition overwrite" section of this comment: #1971 (comment)

Really the whole thread is pretty good and worth the read :)

We're imagining that the BigQuery version of this strategy looks more like "insert_overwrite" than "delete+insert". The two are pretty similar in practice, but we don't actually want to run a delete followed by an insert on BigQuery, as that would not be atomic (there are no transactions on BQ).
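One way to get an atomic overwrite without transactions is to fold the delete and the insert into a single MERGE statement. A rough sketch of the idea (table names, partition values, and the temp-table convention here are illustrative assumptions, not necessarily what dbt generates):

```sql
-- Atomic "insert_overwrite" expressed as one BigQuery MERGE.
-- ON FALSE means no rows ever match, so:
--   * every source row is "not matched"            -> inserted
--   * every target row in the listed partitions is
--     "not matched by source"                      -> deleted
merge into analytics.fct_events as target
using (
    select * from analytics.fct_events__tmp
) as source
on false
when not matched by source
    and target.event_date in ('2019-12-19', '2019-12-20')
    then delete
when not matched then insert row
```

Because it is a single DML statement, BigQuery applies the delete and the insert together, so readers never observe the table with the partitions deleted but not yet repopulated.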

Curious to hear what you all think!

@clausherther
Contributor

@drewbanin ah super interesting, thanks! I hadn't realized there were no transactions in BQ. I'll read through that thread more, but so far this seems great!

@amcarvalho
Author

amcarvalho commented Dec 20, 2019

@drewbanin that's a great thread, thanks for pointing it out! I think the partition_overwrite incremental strategy will likely cover most of the cases, but I still think there would be a use case for this one, specifically for small fact tables that are not partitioned but where we still process a full date (or any other subset of columns) per pipeline execution.

I understand that the non-atomic operation could be a problem if the delete succeeds but the subsequent insert fails.

@drewbanin
Contributor

See the docs on insert_overwrite for usage info. Closing this one :)
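For anyone landing here later, enabling the strategy is a one-line config change on an incremental model. A minimal sketch, with illustrative model and column names (check the insert_overwrite docs for the full set of options, e.g. static partition lists):

```sql
-- Hypothetical partitioned incremental model using insert_overwrite
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'event_date', 'data_type': 'date'}
  )
}}

select
    event_date,
    count(*) as event_count
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  where event_date >= date_sub(current_date, interval 3 day)
{% endif %}
group by 1
```

On incremental runs, the partitions touched by the new data are replaced wholesale, so rows that vanished from the source for those dates are dropped rather than left behind.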
