
[ADAP-976] [Feature] BigQuery truncate specific partition on incremental, insert_overwrite strategy #998

Closed
dojinkimm opened this issue Nov 1, 2023 · 7 comments


dojinkimm commented Nov 1, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

When using incremental materialization with the insert_overwrite strategy, dbt currently only supports running a query, saving the results into a separate __temp table, and then copying them into the target partition. This is fine if I run one model at a time, but running the same model with different partitions in parallel does not work well, because the __temp table name collides across jobs.

I want to truncate a specific partition and run a query that writes its result directly to that partition. I need this because I am running backfill jobs for the same model in parallel, each for a specific partition, so full-refresh is not an option. I also want to pass the partition date dynamically (I was considering passing it through vars).

This is possible when using the BigQuery client directly, as below.

# For instance, I want to pass a target_date and replace that partition's data
from google.cloud import bigquery

target_date = "2023-12-01"  # in dbt this would come from {{ var('target_date') }}
partition_suffix = target_date.replace("-", "")  # partition decorators use YYYYMMDD

job_config = bigquery.QueryJobConfig(
        destination=f"{project_id}.{dataset_id}.{table_id}${partition_suffix}",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace partition contents
        create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
        time_partitioning=bigquery.TimePartitioning(
            bigquery.TimePartitioningType.DAY, "date"
        ),
        priority=bigquery.QueryPriority.BATCH,
        ...
)
bq_client.query(
    f"SELECT .... FROM table WHERE DATE(_PARTITIONTIME) = '{target_date}'",
    job_config=job_config,
)

Describe alternatives you've considered

Instead of running backfill jobs in parallel, I considered running them one by one, so that the __temp table is not overwritten by other jobs. But I am using Airflow with dbt, and if I run only one job at a time, scheduling all the jobs takes too long when the backfill date range is large.

Who will this benefit?

This would benefit anyone who wants to run the same incremental model in parallel.

Are you interested in contributing this feature?

I am interested.

Anything else?

I was thinking it could be done by adding an incremental truncate method to adapters/bigquery/connections.py, since the BigQuery client is used in that file, and then calling that function on job run.

I wonder if this violates dbt's philosophy, since dbt seems to try to run jobs using SQL only.

dojinkimm added the type:enhancement (New feature or request) and triage:product labels Nov 1, 2023
github-actions bot changed the title from [Feature] BigQuery truncate specific partition on incremental, insert_overwrite strategy to [ADAP-976] [Feature] BigQuery truncate specific partition on incremental, insert_overwrite strategy Nov 1, 2023
@github-christophe-oudar
Contributor

What about adding a macro override in your project:

-- override the temp_relation macro to change the suffix used for temporary tables
-- the goal is to avoid collisions with concurrent runs
{% macro make_temp_relation(base_relation, suffix='__dbt_tmp') %}
  {%- set suffix_with_invocation_id = suffix ~ '__' ~ invocation_id -%}
  {{ return(adapter.dispatch('make_temp_relation', 'dbt')(base_relation, suffix_with_invocation_id)) }}
{% endmacro %}

?
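The collision-avoidance idea behind this override can be sketched in plain Python (not dbt code; `make_temp_name` and the use of `uuid4` in place of dbt's `invocation_id` are illustrative assumptions): appending a per-run unique id to the suffix makes concurrent runs stage into distinct temp tables.

```python
import uuid

def make_temp_name(base: str, suffix: str = "__dbt_tmp") -> str:
    # dbt exposes a per-run invocation_id; uuid4 stands in for it here.
    invocation_id = uuid.uuid4().hex
    return f"{base}{suffix}__{invocation_id}"

run_a = make_temp_name("my_model")
run_b = make_temp_name("my_model")
print(run_a != run_b)  # True: two concurrent runs get distinct temp tables
```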

@dojinkimm
Author

If I follow the method you suggested, multiple different temp tables with {invocation_id} suffixes would be created simultaneously when running a single model, right? Instead of making it work that way, I thought it would be better to have a way to truncate and insert at the same time.

@github-christophe-oudar
Contributor

Isn't that pretty much what you can already do with the copy_partitions option?
https://docs.getdbt.com/blog/bigquery-ingestion-time-partitioning-and-partition-copy-with-dbt

@dojinkimm
Copy link
Author

dojinkimm commented Jan 3, 2024

Does copy_partitions work when models run in parallel? My understanding is that with copy_partitions, dbt creates a temp table -> removes the partition from the model table (truncate) -> copies the temp table into the model table, and dbt always names the temp table {model_name}__dbt_tmp.

If the same model runs in parallel, wouldn't that cause a concurrency problem? For instance, in my case I pass a date value as vars, so the model can run for the 2023-12-01, 2023-12-02, and 2023-12-03 date partitions, and all the temp tables are created with the same __dbt_tmp suffix. I want the 2023-12-01 run to insert-overwrite only the 2023-12-01 partition, but since the temp table name is always the same, the 2023-12-01 data could be removed from the temp table before being copied.

SELECT ... FROM ... WHERE date = {{ vars('date') }}
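The race described above can be sketched with a toy simulation (pure Python; a deliberately simplified model where each job "stages" its rows under a shared temp-table name before the copy step):

```python
# Shared temp-table namespace: every backfill job uses the same __dbt_tmp name.
temp_tables = {}

def stage(partition_date: str, temp_name: str = "my_model__dbt_tmp"):
    # Analogous to CREATE OR REPLACE of the temp table for one backfill job.
    temp_tables[temp_name] = partition_date

stage("2023-12-01")  # job 1 stages its partition's rows
stage("2023-12-02")  # job 2 clobbers job 1's staged rows
# Job 1's copy step would now read job 2's data:
print(temp_tables["my_model__dbt_tmp"])  # 2023-12-02
```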

If copy_partitions already guarantees this, then it's simply my misunderstanding of that functionality.

By the way, #998 (comment) — your idea would prevent collisions with concurrent runs, so I will try it as well. Thanks for the idea @github-christophe-oudar

@github-christophe-oudar
Contributor

The copy_partitions option without the provided macro doesn't support parallel runs (just as it doesn't without the option).
That's why I linked it: you have to combine the macro override with copy_partitions to get the best performance.
I guess it might interfere with other patterns, like dbt retry, if you have some stateful setup that could reuse the temporary tables in case of failures (though in my production environment I hardly have any use for that).


github-actions bot commented Jul 2, 2024

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the Stale label Jul 2, 2024

github-actions bot commented Jul 9, 2024

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 9, 2024