
Implement a delete+insert incremental_strategy for Google BigQuery #2020

Closed
amcarvalho opened this issue Dec 20, 2019 · 5 comments
Labels
bigquery enhancement New feature or request

Comments

@amcarvalho

Similar to what has been done for Snowflake (#1556), implement a delete+insert incremental strategy for Google BigQuery. As also described in #1556, this would allow us to design pipelines for fact tables keyed entirely on a date.

The only alternative today is to define a unique_key spanning multiple columns. That doesn't work when the source for the fact data now contains fewer rows, because the merge statement would leave the stale rows in the target table.
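To make the limitation concrete, here is a sketch of the composite-key workaround (the model, source, and column names are all illustrative, not from a real project). The merge matches on the key, so existing rows are updated and new rows are inserted, but rows that have disappeared from the source are never deleted:

```sql
-- Hypothetical incremental model using a surrogate key over multiple columns
{{
  config(
    materialized = 'incremental',
    unique_key = 'surrogate_key'
  )
}}

select
    {{ dbt_utils.surrogate_key(['event_date', 'account_id']) }} as surrogate_key,
    event_date,
    account_id,
    sum(amount) as total_amount
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  -- only reprocess recent dates on incremental runs
  where event_date >= date_sub(current_date, interval 3 day)
{% endif %}
group by 1, 2, 3
```

If yesterday's rerun of the source produces fewer (date, account) combinations than the first run did, the extra rows from the first run survive the merge, which is exactly the problem described above.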

This is a database-specific feature for Google BigQuery only, similar to what is already supported for Snowflake.

This would benefit the use case where we want to entirely reprocess the data for a specific date/partition based on only a subset of the primary key columns. More details on the specific use case were provided by @drewbanin under #1556.

@clausherther
Contributor

I'd be a big fan of that! Just ran into this the other day. I'm still figuring out how to best profile BQ queries (vs Snowflake), but it seems this would help a lot with table scans when the source table is partitioned by date.

@drewbanin drewbanin added bigquery and removed triage labels Dec 20, 2019
@drewbanin
Contributor

@clausherther I know that we had talked about this before - I was surprised that we didn't already have an issue to track this. Thanks for creating this one @amcarvalho.

I know @jtcohen6 has been giving some thought to the BigQuery incremental materialization. Check out the "partition overwrite" section of this comment: #1971 (comment)

Really the whole thread is pretty good and worth the read :)

We're imagining that the BigQuery version of this strategy looks more like "insert_overwrite" than "delete+insert". The two are pretty similar in practice, but we don't actually want to run a delete followed by an insert on BigQuery, as that would not be atomic (there are no transactions on BQ).
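One way to get an atomic overwrite without transactions is to fold the delete and the insert into a single MERGE statement. A rough sketch of the idea (table names, partition values, and the temp-table convention here are illustrative assumptions, not necessarily what dbt generates):

```sql
-- Atomic "insert_overwrite" expressed as one BigQuery MERGE.
-- ON FALSE means no rows ever match, so:
--   * every source row is "not matched"            -> inserted
--   * every target row in the listed partitions is
--     "not matched by source"                      -> deleted
merge into analytics.fct_events as target
using (
    select * from analytics.fct_events__tmp
) as source
on false
when not matched by source
    and target.event_date in ('2019-12-19', '2019-12-20')
    then delete
when not matched then insert row
```

Because it is a single DML statement, BigQuery applies the delete and the insert together, so readers never observe the table with the partitions deleted but not yet repopulated.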

Curious to hear what you all think!

@clausherther
Contributor

@drewbanin ah super interesting, thanks! I hadn't realized there were no transactions in BQ. I'll read through that thread more, but so far this seems great!

@amcarvalho
Author

amcarvalho commented Dec 20, 2019

@drewbanin that's a great thread, thanks for pointing it out! I think the partition_overwrite incremental strategy will likely cover most of the cases, but I still think there would be a use case for this one, specifically for small fact tables that are not partitioned but where we still process a full date (or any other subset of columns) per pipeline execution.

I understand that the non-atomic operation could be a problem if the delete succeeds but the subsequent insert fails.

@drewbanin
Contributor

See the docs on insert_overwrite for usage info. Closing this one :)
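For anyone landing here later, enabling the strategy is a one-line config change on an incremental model. A minimal sketch, with illustrative model and column names (check the insert_overwrite docs for the full set of options, e.g. static partition lists):

```sql
-- Hypothetical partitioned incremental model using insert_overwrite
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'event_date', 'data_type': 'date'}
  )
}}

select
    event_date,
    count(*) as event_count
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  where event_date >= date_sub(current_date, interval 3 day)
{% endif %}
group by 1
```

On incremental runs, the partitions touched by the new data are replaced wholesale, so rows that vanished from the source for those dates are dropped rather than left behind.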
