[ADAP-976] [Feature] BigQuery truncate specific partition on incremental, insert_overwrite strategy #998
Comments
What about adding a macro override in your project?
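The macro snippet appears to have been stripped from this page. A minimal, untested sketch of such an override, assuming dbt-core's `make_temp_relation` macro and the `invocation_id` context variable (placed in your project's `macros/` directory):

```sql
{# Hypothetical project-level override: suffix temp relations with the
   invocation_id so parallel runs of the same model don't collide on
   the same __dbt_tmp table name #}
{% macro make_temp_relation(base_relation, suffix='__dbt_tmp') %}
    {{ return(default__make_temp_relation(
        base_relation,
        suffix ~ '_' ~ invocation_id | replace('-', '_')
    )) }}
{% endmacro %}
```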
If I follow the method you suggested, multiple different temp_{invocation_id} tables would be created simultaneously when running a single model, right? Instead of making it work that way, I thought it would be better if there were a way to truncate and insert at the same time.
Isn't that pretty much what you can already do with `copy_partitions`?
If the same model runs in parallel, wouldn't it have a concurrency problem? For instance, in my case I pass the date value through vars.
If that is already guaranteed with `copy_partitions`, then it's simply my misunderstanding of that functionality.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue, or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Is this your first time submitting a feature request?
Describe the feature
When using incremental materialization with the insert_overwrite strategy, dbt currently only supports running the query, saving the result into a separate __temp table, and then copying it into the partition. This is fine if I run one model at a time, but if I want to run the same model with different partitions in parallel, it doesn't work well, since the __temp table name collides across jobs.
I want to truncate a specific partition and run the query to save its result into that partition. This is needed in my case because I am trying to run backfill jobs for the same model in parallel, each for specific partitions only, so using full-refresh is not an option. I also want to pass the partition date dynamically (I was considering passing it through vars).
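As a sketch of passing the partition date dynamically, the insert_overwrite strategy's static `partitions` config can read a run-time var (model, source, and column names below are illustrative):

```sql
-- models/events_backfill.sql (illustrative)
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    partitions=["date('" ~ var('partition_date') ~ "')"]
) }}

select * from {{ source('raw', 'events') }}
where event_date = date('{{ var("partition_date") }}')
```

Each backfill job would then invoke something like `dbt run -s events_backfill --vars '{partition_date: "2024-01-01"}'` with its own date.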
This is possible when using the BigQuery client directly, as below.
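The original snippet appears to have been stripped from this page; it presumably resembled the following sketch, untested against a live project, which uses the google-cloud-bigquery client to atomically replace a single date partition (table id and helper names are placeholders):

```python
# Sketch: overwrite one partition of a day-partitioned BigQuery table
# by writing to its partition decorator with WRITE_TRUNCATE.
from __future__ import annotations

from datetime import date


def partition_ref(table_id: str, day: date) -> str:
    """Build a partition decorator id like `project.dataset.table$20240101`."""
    return f"{table_id}${day:%Y%m%d}"


def overwrite_partition(table_id: str, day: date, sql: str) -> None:
    # Requires `pip install google-cloud-bigquery` and GCP credentials.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        # Writing to `table$YYYYMMDD` with WRITE_TRUNCATE replaces only
        # that partition; all other partitions are left untouched.
        destination=partition_ref(table_id, day),
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.query(sql, job_config=job_config).result()
```

Because each parallel job targets a distinct partition decorator, the jobs never contend for a shared __temp table name.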
Describe alternatives you've considered
Instead of running backfill jobs in parallel, I considered running the tasks one by one, so the __temp table is not overwritten by other jobs. But I am using Airflow with dbt, and if I run only one job at a time, it takes too long for all the jobs to be scheduled when the backfill range is large.
Who will this benefit?
This would benefit anyone who wants to run the same incremental model in parallel.
Are you interested in contributing this feature?
I am interested.
Anything else?
I was thinking this would be possible by adding an incremental truncate method to adapters/bigquery/connections.py, since the BigQuery client is used in that file, and then calling that function on job run.
I wonder if this violates dbt's philosophy, though, since dbt seems to try to run jobs using only SQL.