Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performance when materialization as table (SELECT * INTO Model__dbt_tmp FROM Model__dbt_tmp_temp_view) #410

Open
romiof opened this issue May 29, 2023 · 5 comments

Comments

@romiof
Copy link

romiof commented May 29, 2023

Hello!

This is a bug and a suggestion for improvement...
I'm suffering with slow performance when materialization as table.

My environment

I started this month to use dbt for MS SQL. My database is a SQL Server 2014. Using the last version of dbt 1.4.3

My issue

I'm having problems whit materialization macro which make the tables at our SQL Server. Its performance is very, very slow.

More specific, my project consist in:

  • Using source yml files, to map my tables to dbt objects
  • Created views for stage layer, with this views querying my source tables. as suggested for dbt team here
  • Created my Dimensions and Fact entities as tables
  • Here, when I simply make a SELECT * on one model SQL, in the worst case, it takes less than 1 minute to return all lines.
  • But when using dbt run / dbt build it performs very poor.

What I investigated

  • In my case, the _Model__dbt_tmp_temp_view_ is a nested view, and this is the origin of my problem.
  • With a simple SELECT * FROM _Model__dbt_tmp_temp_view_ , everything works fine, the performance is equal when I select without the temp view.
  • But when using the INTO Model__dbt_tmp (https://github.com/dbt-msft/dbt-sqlserver/blob/master/dbt/include/sqlserver/macros/materializations/models/table/create_table_as.sql#LL23-L25) my SQL change its plan and it takes about 10 minutes to execute this materialization.
  • In last two days, I make a lot of tests, changed my JOINs, refactorated my SELECT but when I use nested views (because I'm following dbt best pratice) I got this slow performance. I'm not a expert to debug my execution plan, just I realized that these two queries make different plans ( SELECT * FROM _Model__dbt_tmp_temp_view_ vs SELECT * INTO Model__dbt_tmp FROM _Model__dbt_tmp_temp_view_)
  • Maybe it is a glitch in our old SQL Server 2014

A possible workaround

I also discovered, that there is a hint which make SELECT * INTO Model__dbt_tmp FROM _Model__dbt_tmp_temp_view_ runs with its original perfomance:
OPTION (FORCE ORDER)

I founded it from this SO post.

So, if I run this query SELECT * INTO Model__dbt_tmp FROM _Model__dbt_tmp_temp_view_ OPTION (FORCE ORDER) from my MS SQL Studio, it create my Model table in less than 1 minute.

Setting this option at dbt-sqlserver

I'd like to know, how can we change this macro to have a possibility to include this hint ?
Maybe a configuration, like as_columnstore ?

@matsonj
Copy link

matsonj commented May 29, 2023

For what it’s worth, nested views perform horribly in SQL Server because of how the query planner works.

That being said, I really like the idea of supporting query hints as config in the macros.

@edvald-kvika
Copy link

edvald-kvika commented Jul 6, 2023

Came in here from google. We're experiencing the same issue, dreadfully slow performance in dbt, We're on sql server 2019 and dbt 1.4.6 (since dbt-sqlserver isn't 1.5 yet). adding option (force order) into the create_table_as macro unfortunately did not fix the issue though:

{% macro sqlserver__create_table_as(temporary, relation, sql) -%}
   {#- TODO: add contracts here when in dbt 1.5 -#}
   {%- set sql_header = config.get('sql_header', none) -%}
   {%- set as_columnstore = config.get('as_columnstore', default=true) -%}
   {%- set temp_view_sql = sql.replace("'", "''") -%}
   {%- set tmp_relation = relation.incorporate(
        path={"identifier": relation.identifier.replace("#", "") ~ '_temp_view'},
        type='view') -%}

   {{- sql_header if sql_header is not none -}}

    -- drop previous temp view
   {{- sqlserver__drop_relation_script(tmp_relation) }}

    -- create temp view
   USE [{{ relation.database }}];
   EXEC('create view {{ tmp_relation.include(database=False) }} as
    {{ temp_view_sql }}
    ');

   -- select into the table and create it that way
   {# TempDB schema is ignored, always goes to dbo #}
   SELECT *
   INTO {{ relation.include(database=False, schema=(not temporary))  }}
   FROM {{ tmp_relation }} option (force order)
   -- drop temp view
   {{ sqlserver__drop_relation_script(tmp_relation) }}

   {%- if not temporary and as_columnstore -%}
        -- add columnstore index
        {{ sqlserver__create_clustered_columnstore_index(relation) }}
   {%- endif -%}

{% endmacro %}

@romiof
Copy link
Author

romiof commented Jul 11, 2023

Came in here from google. We're experiencing the same issue, dreadfully slow performance in dbt, We're on sql server 2019 and dbt 1.4.6 (since dbt-sqlserver isn't 1.5 yet). adding option (force order) into the create_table_as macro unfortunately did not fix the issue though:

@edvald-kvika

Here, I've experienced some models with slow performance without option (force order).
About 10~15% of my models are in this situation.

For them, I created a config at model code, and had created a copy of macro sqlserver__create_table_as with an if statement to implement force order in case my config == true.

At my job, our SQL Server is not exclusive to our DW. I've about a dozen of other DBs used by other workloads (in general OLTP, but also have a couple of DBs for OLAP, where our users connect their Excel's at SQL tables and analyze data with Pivot Table).

So, I've moments during the workday, where dbt run hang with slows performance, but it's during resource concurrency in SQL Server.

In general, without bottlenecks in SQL Server, a dbt run model_abc is about 1.5x to 2x the time needed to SELECT * the model_abc SQL code.

@edvald-kvika
Copy link

@romiof we just patched sqlserver__create_table_as in a larger dbt project (adding option (force order)) and model performance time greatly improved, from around 900 seconds down to 30.

@edvald-kvika
Copy link

edvald-kvika commented Jul 20, 2023

Now we have a bit more experience with option (force order). In some cases it drastically improves performance and in other cases is does the exact opposite. Looking at the query plans for the select into query showed that in those cases option (force order) put the optimizer into a heavy (and early) table spool operation. Our solution was to add a config parameter for force_order:

{% macro sqlserver__create_table_as(temporary, relation, sql) -%}
   {#- TODO: add contracts here when in dbt 1.5 -#}
   {%- set sql_header = config.get('sql_header', none) -%}
   {%- set as_columnstore = config.get('as_columnstore', default=true) -%}
   {%- set force_order = config.get('force_order', default=true) -%}
   {%- set temp_view_sql = sql.replace("'", "''") -%}
   {%- set tmp_relation = relation.incorporate(
        path={"identifier": relation.identifier.replace("#", "") ~ '_temp_view'},
        type='view') -%}

   {{- sql_header if sql_header is not none -}}

    -- drop previous temp view
   {{- sqlserver__drop_relation_script(tmp_relation) }}

    -- create temp view
   USE [{{ relation.database }}];
   EXEC('create view {{ tmp_relation.include(database=False) }} as
    {{ temp_view_sql }}
    ');

   -- select into the table and create it that way
   {# TempDB schema is ignored, always goes to dbo #}
   SELECT *
   INTO {{ relation.include(database=False, schema=(not temporary))  }}
   {%- if force_order %}
   -- add option (force order) to improve performance in nested views:
   -- https://github.com/dbt-msft/dbt-sqlserver/issues/410
   {%- endif %}
   FROM {{ tmp_relation }} {%- if force_order %} option (force order) {%- endif %}
   -- drop temp view
   {{ sqlserver__drop_relation_script(tmp_relation) }}

   {%- if not temporary and as_columnstore -%}
        -- add columnstore index
        {{ sqlserver__create_clustered_columnstore_index(relation) }}
   {%- endif -%}

{% endmacro %}

then in the table config we can turn it off where needed:

{{
  config(
    materialized = 'incremental',
    unique_key = ['account', 'level', 'month'],
    force_order = False
    )
}}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants