
dbt seed truncate tables #182

Closed
wants to merge 1 commit into from

Conversation

machov

@machov machov commented Jun 22, 2021

resolves dbt-labs/dbt-adapters#514

Description

  • Fixes the dbt seed command to match the expected behavior of the dbt global project: truncate the existing seed table in order to remove all rows and replace its values. As explained in issue #112, the current seed command in dbt-spark appends to existing seeded tables instead of overwriting them.
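
As a hedged illustration (not the actual dbt-spark macro; the helper name here is hypothetical), the intended change can be sketched as the SQL statement sequence emitted before the seed rows are re-inserted:

```python
# Illustrative sketch only: reset_seed_sql is a hypothetical helper,
# not a dbt-spark API. It shows the statement sequence this PR aims for.
def reset_seed_sql(relation, exists):
    if exists:
        # Truncate first, so rerunning `dbt seed` replaces the rows
        # instead of appending to the existing table.
        return ["truncate table " + relation]
    # No existing table: create it before the insert.
    return ["create table " + relation + " (...)"]

print(reset_seed_sql("default.my_seed", exists=True))
# ['truncate table default.my_seed']
```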

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot

cla-bot bot commented Jun 22, 2021

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement, and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @mv1742

@machov
Author

machov commented Jun 22, 2021

Hello @drewbanin, I signed the CLA with my username mv1742 and my email, but the check still fails. Any ideas?

Contributor

@jtcohen6 jtcohen6 left a comment


Hey @mv1742, thanks for contributing! I see you submitted the CLA form with your full profile URL (https://github.com/mv1742/). Could you try re-submitting with just the username mv1742?

Comment on lines +40 to +42
{{ adapter.truncate_relation(old_relation) }}
{% set sql = "truncate table " ~ old_relation %}
{{ return(sql) }}
Contributor


  1. I see in the TRUNCATE docs that "The table must not be a view or an external/temporary table." Given that #112 was initially prompted by the case of a seed being an external table, are we sure this approach will work?
  2. It looks like we're not appropriately handling the full-refresh case today. We want to drop the relation and truncate/remove the data, to enable a new column schema to take its place.
  3. If this ends up being the same code as in the default version, we could just delete spark__reset_csv_table entirely and fall back on default__reset_csv_table.

Author


  1. Hmm, I see. Truncate works for me when the external table already exists, even though the docs say the table must not be a view or external table. When the table does not exist in the unmanaged database, I'm getting "('The SQL contains 0 parameter markers, but 2 parameters were supplied', 'HY000')", not sure why...
  2. A full-refresh workaround for external tables could be truncate > drop > create > truncate?
  3. default__reset_csv_table won't work, as drop_relation doesn't delete the underlying data for external tables

Is there a guide on how to set up tests for dbt-spark, like there is for dbt? https://github.com/dbt-labs/dbt/blob/HEAD/CONTRIBUTING.md#testing
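
To make the ordering in point 2 concrete, the proposed workaround could be sketched as below (a hypothetical helper, not dbt-spark code; whether the trailing truncate is needed is an open question in this thread):

```python
# Hypothetical sketch of the proposed full-refresh sequence for external
# tables: truncate first so the underlying data files are removed (a bare
# drop on an external table leaves them behind), then drop and recreate
# so a new column schema can take effect.
def full_refresh_sql(relation):
    return [
        "truncate table " + relation,           # remove data files
        "drop table if exists " + relation,     # remove table metadata
        "create table " + relation + " (...)",  # recreate with new schema
        "truncate table " + relation,           # as proposed; likely a no-op
    ]

print(full_refresh_sql("default.my_seed"))
```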

Author


For more detail, this is the dbt.log message when running dbt seed while the table does not exist in an unmanaged database (something to do with insert into default.gc_branch_correction_test values (cast(%s as bigint),cast(%s as bigint))):

2021-09-16 06:46:36.950408 (ThreadPoolExecutor-1_0): Opening a new connection, currently in state closed
2021-09-16 06:46:42.488937 (ThreadPoolExecutor-1_0): SQL status: OK in 5.54 seconds
2021-09-16 06:46:42.492460 (ThreadPoolExecutor-1_0): On list_None_default: ROLLBACK
2021-09-16 06:46:42.492723 (ThreadPoolExecutor-1_0): NotImplemented: rollback
2021-09-16 06:46:42.492850 (ThreadPoolExecutor-1_0): On list_None_default: Close
2021-09-16 06:46:42.495528 (MainThread): NotImplemented: add_begin_query
2021-09-16 06:46:42.495681 (MainThread): NotImplemented: commit
2021-09-16 06:46:42.496011 (MainThread): 02:46:42 | Concurrency: 1 threads (target='prod')
2021-09-16 06:46:42.496201 (MainThread): 02:46:42 | 
2021-09-16 06:46:42.498050 (Thread-1): Began running node seed.er_silver_gold.gc_branch_correction_test
2021-09-16 06:46:42.498375 (Thread-1): 02:46:42 | 1 of 1 START seed file default.gc_branch_correction_test............. [RUN]
2021-09-16 06:46:42.498717 (Thread-1): Acquiring new spark connection "seed.er_silver_gold.gc_branch_correction_test".
2021-09-16 06:46:42.630036 (Thread-1): finished collecting timing info
2021-09-16 06:46:42.659872 (Thread-1): Using spark connection "seed.er_silver_gold.gc_branch_correction_test".
2021-09-16 06:46:42.660005 (Thread-1): On seed.er_silver_gold.gc_branch_correction_test: /* {"app": "dbt", "dbt_version": "0.19.1", "profile_name": "dbt_databricks", "target_name": "prod", "node_id": "seed.er_silver_gold.gc_branch_correction_test"} */
drop table if exists default.gc_branch_correction_test
2021-09-16 06:46:42.660116 (Thread-1): Opening a new connection, currently in state closed
2021-09-16 06:46:45.129582 (Thread-1): SQL status: OK in 2.47 seconds
2021-09-16 06:46:45.143552 (Thread-1): 'soft_unicode' has been renamed to 'soft_str'. The old name will be removed in MarkupSafe 2.1.
2021-09-16 06:46:45.143772 (Thread-1): 'soft_unicode' has been renamed to 'soft_str'. The old name will be removed in MarkupSafe 2.1.
2021-09-16 06:46:45.162986 (Thread-1): NotImplemented: add_begin_query
2021-09-16 06:46:45.163100 (Thread-1): Using spark connection "seed.er_silver_gold.gc_branch_correction_test".
2021-09-16 06:46:45.163197 (Thread-1): On seed.er_silver_gold.gc_branch_correction_test: /* {"app": "dbt", "dbt_version": "0.19.1", "profile_name": "dbt_databricks", "target_name": "prod", "node_id": "seed.er_silver_gold.gc_branch_correction_test"} */

    create table default.gc_branch_correction_test (Program_ID bigint,Branch_ID_Corrected bigint)
    
    using delta
   
2021-09-16 06:46:52.098564 (Thread-1): SQL status: OK in 6.94 seconds
2021-09-16 06:46:52.104950 (Thread-1): Using spark connection "seed.er_silver_gold.gc_branch_correction_test".
2021-09-16 06:46:52.105085 (Thread-1): On seed.er_silver_gold.gc_branch_correction_test: /* {"app": "dbt", "dbt_version": "0.19.1", "profile_name": "dbt_databricks", "target_name": "prod", "node_id": "seed.er_silver_gold.gc_branch_correction_test"} */
truncate table default.gc_branch_correction_test
2021-09-16 06:46:53.726390 (Thread-1): SQL status: OK in 1.62 seconds
2021-09-16 06:46:53.743346 (Thread-1): Using spark connection "seed.er_silver_gold.gc_branch_correction_test".
2021-09-16 06:46:53.743491 (Thread-1): On seed.er_silver_gold.gc_branch_correction_test: 
            insert into default.gc_branch_correction_test values
            (cast(%s as bigint),cast(%s as bigint))
        ...
2021-09-16 06:46:53.744413 (Thread-1): Error while running:

            insert into default.gc_branch_correction_test values
            (cast(%s as bigint),cast(%s as bigint))
        
2021-09-16 06:46:53.744535 (Thread-1): ('The SQL contains 0 parameter markers, but 2 parameters were supplied', 'HY000')
2021-09-16 06:46:53.744696 (Thread-1): finished collecting timing info
2021-09-16 06:46:53.744847 (Thread-1): On seed.er_silver_gold.gc_branch_correction_test: ROLLBACK
2021-09-16 06:46:53.744951 (Thread-1): NotImplemented: rollback
2021-09-16 06:46:53.745045 (Thread-1): On seed.er_silver_gold.gc_branch_correction_test: Close
2021-09-16 06:46:53.745350 (Thread-1): Runtime Error in seed gc_branch_correction_test (data/default/gc_branch_correction_test.csv)
  ('The SQL contains 0 parameter markers, but 2 parameters were supplied', 'HY000')
Traceback (most recent call last):
  File "/Users/<userid>/dbt_spark-0.19.1-py3.7.egg/dbt/adapters/spark/connections.py", line 276, in exception_handler
    yield
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/adapters/sql/connections.py", line 80, in add_query
    cursor.execute(sql, bindings)
  File "/Users/<userid>/dbt_spark-0.19.1-py3.7.egg/dbt/adapters/spark/connections.py", line 261, in execute
    self._cursor.execute(sql, *bindings)
pyodbc.ProgrammingError: ('The SQL contains 0 parameter markers, but 2 parameters were supplied', 'HY000')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/task/base.py", line 344, in safe_run
    result = self.compile_and_execute(manifest, ctx)
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/task/base.py", line 287, in compile_and_execute
    result = self.run(ctx.node, manifest)
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/task/base.py", line 389, in run
    return self.execute(compiled_node, manifest)
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/task/run.py", line 248, in execute
    result = MacroGenerator(materialization_macro, context)()
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/clients/jinja.py", line 332, in __call__
    return self.call_macro(*args, **kwargs)
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/clients/jinja.py", line 259, in call_macro
    return macro(*args, **kwargs)
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/runtime.py", line 675, in __call__
    return self._invoke(arguments, autoescape)
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/runtime.py", line 679, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 54, in macro
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/sandbox.py", line 462, in call
    return __context.call(__obj, *args, **kwargs)
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/runtime.py", line 290, in call
    return __obj(*args, **kwargs)
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/clients/jinja.py", line 332, in __call__
    return self.call_macro(*args, **kwargs)
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/clients/jinja.py", line 259, in call_macro
    return macro(*args, **kwargs)
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/runtime.py", line 675, in __call__
    return self._invoke(arguments, autoescape)
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/runtime.py", line 679, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 21, in macro
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/sandbox.py", line 462, in call
    return __context.call(__obj, *args, **kwargs)
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/runtime.py", line 290, in call
    return __obj(*args, **kwargs)
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/clients/jinja.py", line 332, in __call__
    return self.call_macro(*args, **kwargs)
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/clients/jinja.py", line 259, in call_macro
    return macro(*args, **kwargs)
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/runtime.py", line 675, in __call__
    return self._invoke(arguments, autoescape)
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/runtime.py", line 679, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 110, in macro
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/sandbox.py", line 462, in call
    return __context.call(__obj, *args, **kwargs)
  File "/Users/<userid>/Jinja2-2.11.2-py3.7.egg/jinja2/runtime.py", line 290, in call
    return __obj(*args, **kwargs)
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/adapters/sql/impl.py", line 64, in add_query
    abridge_sql_log)
  File "/Users/<userid>/dbt_core-0.19.1-py3.7.egg/dbt/adapters/sql/connections.py", line 87, in add_query
    return connection, cursor
  File "/Users/<userid>/.pyenv/versions/3.7.8/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/<userid>/dbt_spark-0.19.1-py3.7.egg/dbt/adapters/spark/connections.py", line 289, in exception_handler
    raise dbt.exceptions.RuntimeException(str(exc))
dbt.exceptions.RuntimeException: Runtime Error in seed gc_branch_correction_test (data/default/gc_branch_correction_test.csv)
  ('The SQL contains 0 parameter markers, but 2 parameters were supplied', 'HY000')

Contributor

Sorry for the delayed response from me!

  • Glad to hear this works for external tables as well in local testing.
  • Agree, truncate + insert should be the standard approach, and then full-refresh mode should just add in drop + create as well. And you're right, we'll need to keep this logic around in spark__reset_csv_table, rather than using the default.
  • Ugh, parameter markers are one of those things I have a tremendously hard time debugging. Which connection method are you using? %s is the standard parameter marker, but ? is the one used by pyodbc, and we don't do a great job in our logs today of making clear which one dbt is actually using, since the switch for pyodbc happens at runtime.
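
For illustration, the mismatch can be reproduced in spirit with a tiny placeholder rewrite (a sketch only, not how dbt-spark actually performs the switch): pyodbc expects the DB-API qmark style (?), so a statement containing %s literally has zero markers from pyodbc's point of view, which matches the "The SQL contains 0 parameter markers, but 2 parameters were supplied" error in the log above.

```python
# Sketch: convert %s-style placeholders to the ? markers pyodbc expects.
# Note: a real implementation must not touch %s inside string literals.
def to_qmark(sql):
    return sql.replace("%s", "?")

sql = "insert into t values (cast(%s as bigint), cast(%s as bigint))"
print(to_qmark(sql))
# insert into t values (cast(? as bigint), cast(? as bigint))
```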


@binhnefits binhnefits Feb 10, 2022


I believe this does not work in the case where a seed table already exists and we rename a column in the CSV file. The drop statement does not seem to drop the underlying schema as well.

@github-actions
Contributor

This PR has been marked as Stale because it has been open for 180 days with no activity. If you would like the PR to remain open, please remove the stale label or comment on the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 10, 2022
@github-actions github-actions bot closed this Aug 17, 2022
ueshin added a commit to databricks/dbt-databricks that referenced this pull request Aug 25, 2022
resolves #114

### Description

Uses `insert overwrite` for the first batch of seed to help seeds in external tables.

This must be a temporary fix and should follow dbt-labs/dbt-spark#182.
ueshin added a commit to ueshin/dbt-databricks that referenced this pull request Aug 30, 2022
resolves databricks#114

### Description

Uses `insert overwrite` for the first batch of seed to help seeds in external tables.

This must be a temporary fix and should follow dbt-labs/dbt-spark#182.
ueshin added a commit to databricks/dbt-databricks that referenced this pull request Aug 31, 2022
resolves #114

### Description

Uses `insert overwrite` for the first batch of seed to help seeds in external tables.

This must be a temporary fix and should follow dbt-labs/dbt-spark#182.
@Fleid Fleid reopened this Feb 11, 2023
@Fleid Fleid added triage:ready-for-review Externally contributed PR has functional approval, ready for code review from Core engineering and removed Stale labels Feb 11, 2023
@Fleid
Contributor

Fleid commented Feb 11, 2023

Let's see if we can push that over the line.

@github-actions
Contributor

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 14, 2023
@github-actions
Contributor

Although we are closing this PR as stale, it can still be reopened to continue development. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this Sep 21, 2023
Labels
Stale triage:ready-for-review Externally contributed PR has functional approval, ready for code review from Core engineering
Development

Successfully merging this pull request may close these issues.

[CT-2084] Rerun dbt seed append data instead of refresh data if seed is stored in external table
5 participants