
[ADAP-976] [Feature] BigQuery truncate specific partition on incremental, insert_overwrite strategy #998

Closed
dojinkimm opened this issue Nov 1, 2023 · 7 comments


dojinkimm commented Nov 1, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

When using incremental materialization with the insert_overwrite strategy, dbt currently only supports running a query, saving the results into a separate __temp table, and then copying them into the target partition. This is fine if I run one model at a time, but running the same model with different partitions in parallel does not work well, because the __temp table name collides across jobs.

I want to truncate a specific partition and run a query that writes its result directly to that partition. I need this because I am running backfill jobs for the same model in parallel, each for a specific partition, so full-refresh is not an option. I also want to pass the partition date dynamically (I was considering passing it through vars).

This is possible when using the BigQuery client directly, as below.

# For instance, I want to pass a target_date and replace that partition's data
from google.cloud import bigquery

target_date = "2023-12-01"  # in dbt this would come from {{ var('target_date') }}
partition_suffix = target_date.replace("-", "")  # partition decorators use YYYYMMDD

job_config = bigquery.QueryJobConfig(
        destination=f"{project_id}.{dataset_id}.{table_id}${partition_suffix}",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace partition contents
        create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
        time_partitioning=bigquery.TimePartitioning(
            bigquery.TimePartitioningType.DAY, "date"
        ),
        priority=bigquery.QueryPriority.BATCH,
        ...
)
bq_client.query(
    f"SELECT .... FROM table WHERE DATE(_PARTITIONTIME) = '{target_date}'",
    job_config=job_config,
)

Describe alternatives you've considered

Instead of running backfill jobs in parallel, I considered running them one by one, so that the __temp table is not overwritten by other jobs. But I am using Airflow with dbt, and if I run only one job at a time, scheduling all the jobs takes too long when the backfill date range is large.

Who will this benefit?

This would benefit anyone who wants to run the same incremental model in parallel.

Are you interested in contributing this feature?

I am interested.

Anything else?

I was thinking it could be done by adding an incremental truncate method to adapters/bigquery/connections.py, since the BigQuery client is used in that file, and then calling that function on job run.

I wonder if this violates dbt's philosophy, since dbt seems to try to run jobs using SQL only.

dojinkimm added the type:enhancement (New feature or request) and triage:product labels Nov 1, 2023
github-actions bot changed the title from [Feature] BigQuery truncate specific partition on incremental, insert_overwrite strategy to [ADAP-976] [Feature] BigQuery truncate specific partition on incremental, insert_overwrite strategy Nov 1, 2023
@github-christophe-oudar
Contributor

What about adding a macro override in your project:

-- override the temp_relation macro to change the suffix used for temporary tables
-- the goal is to avoid collisions with concurrent runs
{% macro make_temp_relation(base_relation, suffix='__dbt_tmp') %}
  {%- set suffix_with_invocation_id = suffix ~ '__' ~ invocation_id -%}
  {{ return(adapter.dispatch('make_temp_relation', 'dbt')(base_relation, suffix_with_invocation_id)) }}
{% endmacro %}

?
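The collision-avoidance idea behind this override can be sketched in plain Python (not dbt code; `make_temp_name` and the use of `uuid4` in place of dbt's `invocation_id` are illustrative assumptions): appending a per-run unique id to the suffix makes concurrent runs stage into distinct temp tables.

```python
import uuid

def make_temp_name(base: str, suffix: str = "__dbt_tmp") -> str:
    # dbt exposes a per-run invocation_id; uuid4 stands in for it here.
    invocation_id = uuid.uuid4().hex
    return f"{base}{suffix}__{invocation_id}"

run_a = make_temp_name("my_model")
run_b = make_temp_name("my_model")
print(run_a != run_b)  # True: two concurrent runs get distinct temp tables
```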

@dojinkimm
Author

If I follow the method you suggested, multiple different temp tables with {invocation_id} suffixes would be created simultaneously when running a single model, right? Instead of making it work that way, I thought it would be better to have a way to truncate and insert at the same time.

@github-christophe-oudar
Contributor

Isn't that pretty much what you can already do with the copy_partitions option?
https://docs.getdbt.com/blog/bigquery-ingestion-time-partitioning-and-partition-copy-with-dbt

@dojinkimm
Copy link
Author

dojinkimm commented Jan 3, 2024

Does copy_partitions work when models run in parallel? My understanding is that with copy_partitions, dbt creates a temp table -> removes the partition from the model table (truncate) -> copies the temp table into the model table, and dbt always names the temp table {model_name}__dbt_tmp.

If the same model runs in parallel, wouldn't that cause a concurrency problem? For instance, in my case I pass a date value as vars, so the model can run for the 2023-12-01, 2023-12-02, and 2023-12-03 date partitions, and all the temp tables are created with the same __dbt_tmp suffix. I want the 2023-12-01 run to insert-overwrite only the 2023-12-01 partition, but since the temp table name is always the same, the 2023-12-01 data could be removed from the temp table before being copied.

SELECT ... FROM ... WHERE date = {{ vars('date') }}
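The race described above can be sketched with a toy simulation (pure Python; a deliberately simplified model where each job "stages" its rows under a shared temp-table name before the copy step):

```python
# Shared temp-table namespace: every backfill job uses the same __dbt_tmp name.
temp_tables = {}

def stage(partition_date: str, temp_name: str = "my_model__dbt_tmp"):
    # Analogous to CREATE OR REPLACE of the temp table for one backfill job.
    temp_tables[temp_name] = partition_date

stage("2023-12-01")  # job 1 stages its partition's rows
stage("2023-12-02")  # job 2 clobbers job 1's staged rows
# Job 1's copy step would now read job 2's data:
print(temp_tables["my_model__dbt_tmp"])  # 2023-12-02
```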

If copy_partitions already guarantees this, then it's simply my misunderstanding of that functionality.

By the way, #998 (comment) — your idea would prevent collisions with concurrent runs, so I will try it as well. Thanks for the idea @github-christophe-oudar

@github-christophe-oudar
Contributor

The copy_partitions option without the provided macro doesn't support parallel runs (just as it doesn't without the option).
That's why I linked it: you have to combine the macro override with copy_partitions to get the best performance.
I guess it might interfere with other patterns, like dbt retry, if you have some stateful setup that could reuse the temporary tables in case of failures (though in my production environment I hardly have any use for that).


github-actions bot commented Jul 2, 2024

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the Stale label Jul 2, 2024

github-actions bot commented Jul 9, 2024

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 9, 2024