Describe the bug
When dbt_utils.generate_surrogate_key takes in timestamp fields, it casts them to strings in order to concatenate them with the other parts of the compound key. In doing so, it truncates timestamps to millisecond precision, which can remove the precision a compound key needs to stay unique.
Steps to reproduce
We can see here that casting these two timestamps as text removes the level of precision that makes them unique:
with final as (
    select '2023-10-23 21:00:02.884636000'::timestamp as ts_field
    union
    select '2023-10-23 21:00:02.884637000'::timestamp as ts_field
)

select
    ts_field as ts_field_precise,
    cast(ts_field as text) as ts_field_text
from final
When our grouping includes these timestamps, they show as separate rows, but if they are also passed into the surrogate key macro (simulated here), the truncation results in the same hash for both rows.
with final as (
    select
        1234 as client_id,
        '2023-10-23 21:00:02.884636000'::timestamp as event_at,
        123 as amount
    union
    select
        1234 as client_id,
        '2023-10-23 21:00:02.884637000'::timestamp as event_at,
        456 as amount
)

-- Returns two rows since the timestamps are different, but the surrogate key is the same
select
    md5(cast(client_id as text) || '-' || cast(event_at as text)) as surrogate_key,
    client_id,
    event_at,
    sum(amount) as sum_amount
from final
group by 1, 2, 3
Expected results
Timestamps should be included in the surrogate key hash at their full precision rather than being truncated in order to keep it a reliable producer of unique keys.
Actual results
Timestamps are truncated, which can result in surrogate key duplication if multiple timestamps are within a millisecond of each other.
Screenshots and log output
See above output.
Additional context
As the macro is currently structured, the truncation appears to happen at the typecasting HERE.
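For reference, the relevant step of the default macro looks roughly like this (paraphrased from dbt_utils 1.x, not quoted verbatim): every field, timestamps included, is cast to the adapter's string type before being concatenated and hashed.

{%- for field in field_list -%}
    {#- timestamps are cast to the adapter's string type here, which is where precision is lost -#}
    {%- do fields.append(
        "coalesce(cast(" ~ field ~ " as " ~ dbt.type_string() ~ "), '" ~ default_null_value ~ "')"
    ) -%}
    {%- if not loop.last %}
        {%- do fields.append("'-'") -%}
    {%- endif -%}
{%- endfor -%}

{{ dbt.hash(dbt.concat(fields)) }}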
It seems like the solution may be to convert timestamps to their nanosecond unix time format before concatenating, but I was not able to get that working locally. Since columns are passed in as string value arguments rather than column objects, it seems a bit less straightforward than using jinja to pre-transform columns based on their type before concatenating.
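As a sketch of that idea (assuming a warehouse such as Snowflake that exposes an epoch_nanosecond date part), the simulated query above keeps the two rows distinct if the timestamp is reduced to epoch nanoseconds before concatenation:

-- Sketch only: hash on the epoch-nanosecond representation instead of the
-- string-cast timestamp (Snowflake-style date_part shown here)
select
    md5(
        cast(client_id as text)
        || '-'
        || cast(date_part(epoch_nanosecond, event_at) as text)
    ) as surrogate_key,
    client_id,
    event_at,
    sum(amount) as sum_amount
from final
group by 1, 2, 3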
Are you interested in contributing the fix?
I have tried a couple of potential solutions, but was not able to get it working the way I wanted. I'll be curious to hear other thoughts on how this can be addressed.
I have encountered the same issue and fixed it (in Snowflake) by casting to string with nanosecond precision prior to using the macro. You should be able to do this as well as long as you are explicitly listing the fields in your surrogate key macro call:
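One way that can look, assuming Snowflake's to_varchar with a nanosecond format string and the example columns from the report above (illustrative, not the commenter's exact snippet):

-- Illustrative: pre-format the timestamp to nanosecond precision before it reaches the macro
{{ dbt_utils.generate_surrogate_key([
    'client_id',
    "to_varchar(event_at, 'YYYY-MM-DD HH24:MI:SS.FF9')"
]) }} as surrogate_key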
I suspect this may not be implemented as a change to the default key-generation algorithm, because it would introduce a "breaking change" of sorts: the generated keys would then be different for everybody using sub-millisecond timestamps in their keys. Maybe it could be another dbt_project.yml parameter, though?
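A purely hypothetical sketch of what such an opt-in could look like in dbt_project.yml, following the pattern of the existing surrogate_key_treat_nulls_as_empty_strings variable (the flag name below is invented for illustration and is not an existing dbt_utils option):

vars:
  # Hypothetical flag, not an existing dbt_utils option:
  # opt in to full-precision timestamp handling in generate_surrogate_key
  surrogate_key_preserve_timestamp_precision: true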
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.