
Athena support POC #1

Open · wants to merge 26 commits into master
Conversation

artem-garmash (Owner) commented Jun 13, 2023

POC of Athena support for elementary, limited to Athena engine version 3 and Iceberg tables for incremental elementary tables. Tested with dbt-core 1.5.0, dbt-athena-community 1.5.0 and lalalilo/athena_utils 0.3.0.
Working features based on ad-hoc tests:

  • elementary dbt artifact tables (dbt_*, elementary_test_results)
  • dbt and elementary tests (volume_anomalies, column_anomalies, dimension_anomalies)
  • one page report (edr report)
  • alert tables and sending slack alerts (edr monitor)

cast({{ timestamp_field }} as {{ elementary.edr_type_timestamp() }})
{%- endmacro -%}

{# Athena needs explicit conversion for ISO8601 timestamps used in buckets_cte #}
artem-garmash (Owner Author):
agate_to_dicts converts timestamps to ISO8601 format and Athena needs explicit handling
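As an illustration of the issue (a hypothetical query, not code from this PR): Athena will not implicitly coerce an ISO8601 string literal when comparing it against a timestamp column, so the literal needs an explicit cast. The exact target type (`timestamp` vs `timestamp(6)`) depends on the table format:

```sql
-- Hypothetical example: a string literal as produced by agate_to_dicts
-- must be cast explicitly before comparison with a timestamp column.
select *
from buckets_cte
where bucket_start >= cast('2023-06-13 00:00:00.000' as timestamp(6))
```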

@@ -121,3 +130,13 @@
{% macro bigquery__edr_type_date() %}
date
{% endmacro %}

{% macro athena__edr_type_timestamp() %}
artem-garmash (Owner Author):
based on cast_timestamp from dbt-athena
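For reference, a minimal sketch of what such a macro could look like; the timestamp(6) precision is an assumption modeled on dbt-athena's cast_timestamp handling of Iceberg tables:

```sql
{% macro athena__edr_type_timestamp() %}
    {# Assumed: Iceberg tables in Athena store microsecond-precision
       timestamps, hence timestamp(6) rather than plain timestamp. #}
    timestamp(6)
{% endmacro %}
```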

@@ -25,3 +25,12 @@
{% do run_query(dbt.create_table_as(temporary, relation, sql_query)) %}
{% do adapter.commit() %}
{% endmacro %}

{% macro athena__create_or_replace(temporary, relation, sql_query) %}
{% set drop_query %}
artem-garmash (Owner Author):
TODO: relation doesn't have type for some reason when this is called during the initial run to build elementary tables. drop_relation fails as it checks for type.
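One possible workaround, sketched here as an assumption rather than the PR's final code, is to bypass drop_relation and issue the DROP statement directly, since a plain DROP TABLE IF EXISTS does not depend on the relation type:

```sql
{% macro athena__create_or_replace(temporary, relation, sql_query) %}
    {# Sketch: relation.type can be undefined on the initial run that builds
       the elementary tables, and adapter.drop_relation checks it, so drop
       via raw SQL instead. #}
    {% set drop_query %}
        drop table if exists {{ relation }}
    {% endset %}
    {% do run_query(drop_query) %}
    {% do run_query(dbt.create_table_as(temporary, relation, sql_query)) %}
    {% do adapter.commit() %}
{% endmacro %}
```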

{%- macro render_value(value) -%}
{%- if value is defined and value is not none -%}
{%- if value is number -%}
{{- value -}}
{%- elif value is string -%}
'{{- elementary.escape_special_chars(value) -}}'
artem-garmash (Owner Author):
Inserts with timestamp strings should be cast explicitly in Athena. How should this be handled properly?

haritamar:

I think that the clean thing would be to pass to this function the actual type of the column we are rendering to from insert_rows:

  1. insert_rows already calls adapter.get_columns_in_relation so we can pass column.type as another parameter to render_value
  2. We can normalize the DB data type with the elementary.normalize_data_type and then treat the value differently if it's a timestamp.
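The two steps above could be sketched roughly as follows; the parameter name and the normalized type value are assumptions, not the agreed implementation:

```sql
{%- macro render_value(value, data_type=none) -%}
  {%- if value is defined and value is not none -%}
    {%- if value is number -%}
      {{- value -}}
    {%- elif data_type == 'timestamp' -%}
      {# Assumed: data_type is column.type passed down from insert_rows,
         normalized with elementary.normalize_data_type #}
      cast('{{- value -}}' as {{ elementary.edr_type_timestamp() }})
    {%- elif value is string -%}
      '{{- elementary.escape_special_chars(value) -}}'
    {%- endif -%}
  {%- endif -%}
{%- endmacro -%}
```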

artem-garmash (Owner Author):
Thanks @haritamar, that looks really straightforward. Do you see any problem with using edr_cast_as_timestamp for timestamp literals? It's needed for Athena to add the timestamp type, but it will also change timestamp literals for other connectors. Or should I add a dedicated macro to handle timestamp literals that would be a no-op for all but Athena?

@@ -19,3 +19,29 @@
{% do return(macro(target_relation, tmp_relation, unique_key, dest_columns)) %}
{{ return(merge_sql) }}
{% endmacro %}

{% macro athena__merge_sql(target_relation, tmp_relation, unique_key, dest_columns, incremental_predicates) %}
artem-garmash (Owner Author):

Ideally, this should be done in dbt-athena
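For context, a rough sketch of what such a macro could generate (column handling simplified; this is not the PR's exact code, and Athena engine 3 supports MERGE INTO for Iceberg tables only):

```sql
{% macro athena__merge_sql(target_relation, tmp_relation, unique_key, dest_columns, incremental_predicates) %}
    {%- set dest_cols = dest_columns | map(attribute="name") | list -%}
    merge into {{ target_relation }} as target
    using {{ tmp_relation }} as src
        on target.{{ unique_key }} = src.{{ unique_key }}
    when matched then update set
        {%- for col in dest_cols %}
        {{ col }} = src.{{ col }}{{ "," if not loop.last }}
        {%- endfor %}
    when not matched then insert ({{ dest_cols | join(", ") }})
        values ({% for col in dest_cols %}src.{{ col }}{{ ", " if not loop.last }}{% endfor %})
{% endmacro %}
```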

nicor88 commented Aug 22, 2023
artem-garmash (Owner Author):

Sure, I was using it for reference. But that iceberg_merge runs the merge query, while here we need the SQL only. https://github.com/artem-garmash/dbt-data-reliability/blob/master/macros/utils/table_operations/merge_sql.sql#L7 refers to dbt.merge_sql, which is part of dbt-core (https://github.com/dbt-labs/dbt-core/blob/7eedfcd2742fcf789200a4b829c8c2eb98369089/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L4). Is it somehow available from dbt-athena?

nicor88:
It doesn't seem that we override get_merge_sql, so maybe in this case it's fine to leave athena__merge_sql here.

@@ -4,8 +4,10 @@
unique_key='id',
on_schema_change='append_new_columns',
full_refresh=elementary.get_config_var('elementary_full_refresh'),
meta={"timestamp_column": "updated_at"}
meta={"timestamp_column": "updated_at"},
table_type="iceberg",
artem-garmash (Owner Author):

Is there a better way to provide connector-specific config properties for some models?


Nothing special, I guess this should work:

table_type="iceberg" if target.type == 'athena' else none

artem-garmash (Owner Author):

Okay, I thought maybe there is some way to avoid having table_type=None for other connectors (e.g. to avoid possible conflicts). As an example, dbt_columns.sql has materialized=elementary.get_dbt_columns_materialization() to get a connector-specific value, but here it would have to work for a set of properties, something like kwargs.
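One way to sketch the kwargs idea (the macro name here is hypothetical): a helper that returns a dict of connector-specific config properties, so non-Athena targets never see the key at all. Whether config(**...) expansion is accepted by dbt's Jinja context would still need to be verified:

```sql
{# Hypothetical helper: returns connector-specific config properties #}
{% macro get_elementary_incremental_config() %}
    {% if target.type == 'athena' %}
        {% do return({"table_type": "iceberg"}) %}
    {% endif %}
    {% do return({}) %}
{% endmacro %}
```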

@@ -1,7 +1,7 @@
{{
config(
materialized = 'view',
enabled = elementary.get_config_var('enable_dbt_columns') and target.type != 'databricks' and target.type != 'spark' | as_bool()
enabled = elementary.get_config_var('enable_dbt_columns') and target.type != 'databricks' and target.type != 'spark' and target.type != 'athena' | as_bool()
artem-garmash (Owner Author):

This probably would work fine; I need to re-check it.

Comment on lines +89 to +90
aws_access_key_id: "<AWS_ACCESS_KEY_ID>"
aws_secret_access_key: "<AWS_SECRET_ACCESS_KEY>"
nicor88:

What if I run the CLI from a setup that requires the usage of an AWS session token? E.g. I'm running the CLI from a setup where I use AWS SSO. That might need to be revisited to support these cases.

artem-garmash (Owner Author):

What do you mean? It's just a copy of the dbt-athena config from https://github.com/dbt-athena/dbt-athena#configuring-your-profile, and dbt is used by the CLI to query elementary tables. The CLI is used the same way as dbt with dbt-athena. E.g. if I'm using AWS SSO, aws_profile_name is used and the login is outside the scope of dbt/elementary.

nicor88:

When working with AWS SSO setups, sometimes an extra variable needs to be passed: if we get the keys from something like aws-vault, the key is aws_session_token, and I was wondering if we need to include that.
But you are right, in most cases, if the user uses aws sso login, using the AWS profile is enough.

upper(column_name) as column_name,
data_type
from information_schema.columns
nicor88:

FYI: this query is going to be super slow. Did you think about using a dbt-athena method to make this work?
E.g. we can add another method in the adapter to leverage the Glue APIs; those are going to be a few orders of magnitude faster.
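A sketch of how the dispatch hook could look (macro names are assumptions; the Athena implementation would depend on dbt-athena exposing a Glue-backed adapter method, which is exactly the point raised here):

```sql
{% macro get_columns_data_types(schema_name, table_name) %}
    {{ return(adapter.dispatch('get_columns_data_types', 'elementary')(schema_name, table_name)) }}
{% endmacro %}

{# Default: query information_schema (slow on Athena) #}
{% macro default__get_columns_data_types(schema_name, table_name) %}
    select upper(column_name) as column_name, data_type
    from information_schema.columns
    where table_schema = '{{ schema_name }}' and table_name = '{{ table_name }}'
{% endmacro %}
```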

artem-garmash (Owner Author):

Thanks for pointing this out. I noticed dbt-athena switched to the Glue APIs some time ago. It should definitely be checked once the port is working properly overall.

nicor88:

The issue is that in order to use the Glue API you have to do an adapter.dispatch call from your macro, as it's not possible to make Python invocations that are outside the scope of the adapter.
As said, we can expose all that we need in the adapter if necessary.

nicor88 commented Aug 22, 2023

@artem-garmash great job! I left some minor comments; overall it looks great 💯

@@ -1,6 +1,6 @@
{{
config(
materialized = 'view',
materialized = 'table' if target.type == 'athena' else 'view',


Why does it need to be a table in Athena?

artem-garmash (Owner Author):

information_schema cannot be used from a view in Athena (https://docs.aws.amazon.com/athena/latest/ug/querying-glue-catalog.html):

> You cannot use CREATE VIEW to create a view on the information_schema database.

artem-garmash pushed a commit that referenced this pull request Aug 31, 2023: Moss adding hour of day and week seasonality
@artem-garmash artem-garmash changed the title Athena support Athena support POC Sep 6, 2023
@artem-garmash artem-garmash mentioned this pull request Sep 6, 2023