Athena support POC #1
base: master
Conversation
Views don't work against information_schema, maybe change them to tables
TODO: handle properly to allow casts for insert expressions without quoting all strings
using drop_relation() doesn't work in this context as type is null use explicit drop table for now
it should be done in athena-connector properly
cast({{ timestamp_field }} as {{ elementary.edr_type_timestamp() }})
{%- endmacro -%}

{# Athena needs explicit conversion for ISO8601 timestamps used in buckets_cte #}
agate_to_dicts converts timestamps to ISO8601 format, so Athena needs explicit handling.
@@ -121,3 +130,13 @@
 {% macro bigquery__edr_type_date() %}
 date
 {% endmacro %}
+
+{% macro athena__edr_type_timestamp() %}
Based on cast_timestamp from dbt-athena.
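For context, a minimal sketch of what such a macro might look like. The timestamp(6) precision is an assumption borrowed from dbt-athena's cast_timestamp, which casts to microsecond precision for Iceberg tables; this is illustrative, not the PR's actual code:

```jinja
{# Sketch only: timestamp(6) is an assumption based on dbt-athena's
   cast_timestamp macro, which uses microsecond precision for Iceberg. #}
{% macro athena__edr_type_timestamp() %}
    timestamp(6)
{% endmacro %}
```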
@@ -25,3 +25,12 @@
 {% do run_query(dbt.create_table_as(temporary, relation, sql_query)) %}
 {% do adapter.commit() %}
 {% endmacro %}
+
+{% macro athena__create_or_replace(temporary, relation, sql_query) %}
+{% set drop_query %}
TODO: relation doesn't have a type for some reason when this is called during the initial run to build the elementary tables, so drop_relation fails as it checks for the type.
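Given that workaround, the completed macro body might look roughly like this. It is a sketch under the assumption that an explicit drop table if exists is safe here, not necessarily the PR's exact code:

```jinja
{% macro athena__create_or_replace(temporary, relation, sql_query) %}
    {# drop_relation() checks relation.type, which is null during the initial
       run that builds the elementary tables, so drop explicitly instead. #}
    {% set drop_query %}
        drop table if exists {{ relation }}
    {% endset %}
    {% do run_query(drop_query) %}
    {% do run_query(dbt.create_table_as(temporary, relation, sql_query)) %}
    {% do adapter.commit() %}
{% endmacro %}
```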
{%- macro render_value(value) -%}
    {%- if value is defined and value is not none -%}
        {%- if value is number -%}
            {{- value -}}
        {%- elif value is string -%}
            '{{- elementary.escape_special_chars(value) -}}'
Inserts with timestamp strings should be cast explicitly in Athena. How should this be handled properly?
I think the clean thing would be to pass to this function the actual type of the column we are rendering, from insert_rows:

- insert_rows already calls adapter.get_columns_in_relation, so we can pass column.type as another parameter to render_value.
- We can normalize the DB data type with elementary.normalize_data_type and then treat the value differently if it's a timestamp.
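A rough sketch of that suggestion. The data_type parameter and the 'timestamp' comparison are assumptions about how the pieces would fit together, not the final implementation:

```jinja
{%- macro render_value(value, data_type=none) -%}
    {%- if value is defined and value is not none -%}
        {%- if value is number -%}
            {{- value -}}
        {%- elif value is string -%}
            {# Assumed: data_type comes from column.type, obtained via
               adapter.get_columns_in_relation in insert_rows. #}
            {%- if data_type is not none and elementary.normalize_data_type(data_type) == 'timestamp' -%}
                {{- elementary.edr_cast_as_timestamp("'" ~ elementary.escape_special_chars(value) ~ "'") -}}
            {%- else -%}
                '{{- elementary.escape_special_chars(value) -}}'
            {%- endif -%}
        {%- endif -%}
    {%- endif -%}
{%- endmacro -%}
```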
Thanks @haritamar, that looks really straightforward. Do you see any problem with using edr_cast_as_timestamp for timestamp literals? It's needed for Athena to add the timestamp type, but it will also change timestamp literals for other connectors. Or should I add a dedicated macro to handle timestamp literals, which would be a no-op for all connectors but Athena?
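The dedicated-macro option could be sketched like this; the macro name edr_timestamp_literal is hypothetical:

```jinja
{# Hypothetical macro: a plain quoted literal for all adapters,
   overridden with an explicit cast for Athena. #}
{% macro edr_timestamp_literal(value) -%}
    {{ adapter.dispatch('edr_timestamp_literal', 'elementary')(value) }}
{%- endmacro %}

{% macro default__edr_timestamp_literal(value) -%}
    '{{ value }}'
{%- endmacro %}

{% macro athena__edr_timestamp_literal(value) -%}
    cast('{{ value }}' as timestamp)
{%- endmacro %}
```

With dbt's dispatch mechanism the athena__ implementation is picked automatically when target.type is athena, while every other adapter keeps plain quoted literals.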
@@ -19,3 +19,29 @@
 {% do return(macro(target_relation, tmp_relation, unique_key, dest_columns)) %}
 {{ return(merge_sql) }}
 {% endmacro %}
+
+{% macro athena__merge_sql(target_relation, tmp_relation, unique_key, dest_columns, incremental_predicates) %}
Ideally, this should be done in dbt-athena
@artem-garmash currently we have https://github.com/dbt-athena/dbt-athena/blob/main/dbt/include/athena/macros/materializations/models/incremental/merge.sql#L50 - did you have a look?
Sure, I was using it for reference. But that iceberg_merge runs the merge query, whereas here we need the SQL only. https://github.com/artem-garmash/dbt-data-reliability/blob/master/macros/utils/table_operations/merge_sql.sql#L7 refers to dbt.merge_sql, which is part of dbt (https://github.com/dbt-labs/dbt-core/blob/7eedfcd2742fcf789200a4b829c8c2eb98369089/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L4). Is it somehow available from dbt-athena?
It doesn't seem that we override get_merge_sql, so maybe in this case it's fine to leave athena__merge_sql here.
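For reference, the SQL that such a macro would need to emit is along these lines. This is a simplified sketch; the column iteration and the omitted incremental_predicates handling are assumptions, not the PR's exact implementation:

```jinja
{% macro athena__merge_sql(target_relation, tmp_relation, unique_key, dest_columns, incremental_predicates) %}
    {%- set dest_cols = dest_columns | map(attribute='name') | list -%}
    merge into {{ target_relation }} as target
    using {{ tmp_relation }} as src
    on target.{{ unique_key }} = src.{{ unique_key }}
    when matched then update set
    {%- for col in dest_cols %}
        {{ col }} = src.{{ col }}{{ "," if not loop.last }}
    {%- endfor %}
    when not matched then insert ({{ dest_cols | join(', ') }})
    values (
    {%- for col in dest_cols %}
        src.{{ col }}{{ "," if not loop.last }}
    {%- endfor %}
    )
{% endmacro %}
```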
@@ -4,8 +4,10 @@
     unique_key='id',
     on_schema_change='append_new_columns',
     full_refresh=elementary.get_config_var('elementary_full_refresh'),
-    meta={"timestamp_column": "updated_at"}
+    meta={"timestamp_column": "updated_at"},
+    table_type="iceberg",
Is there a better way to provide connector-specific config properties for some models?
Nothing special, I guess this should work:
table_type="iceberg" if target.type == 'athena' else none
Okay, I thought maybe there is some way to avoid having table_type=None for other connectors (e.g. to avoid possible conflicts). As an example, dbt_columns.sql has materialized=elementary.get_dbt_columns_materialization() to get a connector-specific value, but here it would need to work for a set of properties, something like kwargs.
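One way to keep the property out of other connectors' configs is a per-property helper, mirroring the get_dbt_columns_materialization pattern mentioned above; the macro name here is hypothetical:

```jinja
{# Hypothetical helper: returns 'iceberg' on Athena and none elsewhere,
   so other adapters simply ignore the property. #}
{% macro get_elementary_table_type() %}
    {% if target.type == 'athena' %}
        {% do return('iceberg') %}
    {% endif %}
    {% do return(none) %}
{% endmacro %}
```

Usage would then be table_type=elementary.get_elementary_table_type() in the model config. Jinja has no dict unpacking in call expressions, which is presumably why elementary uses one helper macro per property rather than a kwargs-style bundle.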
@@ -1,7 +1,7 @@
 {{
   config(
     materialized = 'view',
-    enabled = elementary.get_config_var('enable_dbt_columns') and target.type != 'databricks' and target.type != 'spark' | as_bool()
+    enabled = elementary.get_config_var('enable_dbt_columns') and target.type != 'databricks' and target.type != 'spark' and target.type != 'athena' | as_bool()
This should probably work fine; I need to re-check it.
aws_access_key_id: "<AWS_ACCESS_KEY_ID>"
aws_secret_access_key: "<AWS_SECRET_ACCESS_KEY>"
What if I run the CLI from a setup that requires the use of an AWS session token? E.g. I'm running the CLI from a setup where I use AWS SSO. This might need to be revisited to support these cases.
What do you mean? It's just a copy of the dbt-athena config from https://github.com/dbt-athena/dbt-athena#configuring-your-profile, and dbt is used by the CLI to query the elementary tables. The CLI is used the same way as dbt with dbt-athena. E.g. if I'm using AWS SSO, aws_profile_name is used and the login is outside the scope of dbt/elementary.
When working with AWS SSO setups, sometimes an extra variable needs to be passed if we get the keys from something like aws-vault; the key is aws_session_token, and I was wondering if we need to include it. But you are right: in most cases, if the user uses aws sso login, using the AWS profile is enough.
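For the SSO case, the profile can delegate credential resolution to a named AWS profile instead of static keys. A sketch based on the dbt-athena profile options already referenced in this thread; values are placeholders:

```yaml
# profiles.yml sketch: aws_profile_name lets the AWS SDK resolve
# credentials (including SSO sessions) from the local AWS config.
elementary:
  target: athena
  outputs:
    athena:
      type: athena
      aws_profile_name: "<AWS_PROFILE>"
      region_name: "<AWS_REGION>"
      s3_staging_dir: "<S3_STAGING_DIR>"
      database: "<DATABASE>"
      schema: "<SCHEMA>"
```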
upper(column_name) as column_name,
data_type
from information_schema.columns
FYI: this query is going to be super slow. Did you think about using a dbt-athena method to make this work? E.g. we can add another method to the adapter to leverage the Glue APIs; those are going to be a few orders of magnitude faster.
Thanks for pointing this out. I noticed dbt-athena switched to the Glue APIs some time ago. It should definitely be checked once the port is working properly overall.
The issue is that in order to use the Glue API you have to do an adapter.dispatch call from your macro, as it's not possible to make Python invocations that are outside the scope of the adapter. As said, we can expose everything we need in the adapter if necessary.
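The dispatch hook described here could be sketched like this on the elementary side. The macro name is hypothetical; a Glue-backed athena__ override would come from the adapter package:

```jinja
{# Hypothetical dispatch hook: the default implementation queries
   information_schema; an athena__ override could call a Glue-backed
   adapter method instead. #}
{% macro get_columns_and_types(schema_name) %}
    {{ return(adapter.dispatch('get_columns_and_types', 'elementary')(schema_name)) }}
{% endmacro %}

{% macro default__get_columns_and_types(schema_name) %}
    {% set query %}
        select upper(column_name) as column_name, data_type
        from information_schema.columns
        where lower(table_schema) = lower('{{ schema_name }}')
    {% endset %}
    {{ return(run_query(query)) }}
{% endmacro %}
```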
@artem-garmash great job. I left some minor comments; overall it looks great 💯
@@ -1,6 +1,6 @@
 {{
   config(
-    materialized = 'view',
+    materialized = 'table' if target.type == 'athena' else 'view',
Why does it need to be a table in Athena?
information_schema cannot be used from a view in Athena (https://docs.aws.amazon.com/athena/latest/ug/querying-glue-catalog.html): "You cannot use CREATE VIEW to create a view on the information_schema database."
POC Athena support for elementary. Limited to Athena 3 and Iceberg tables for incremental elementary tables. Testing with dbt-core 1.5.0, dbt-athena-community 1.5.0 and lalalilo/athena_utils 0.3.0.

Working features based on ad-hoc tests:
- incremental elementary tables (dbt_*, elementary_test_results)
- reports (edr report)
- alerts (edr monitor)