
generate_surrogate_key macro truncates timestamps to ms #850

Closed
bjwilliford opened this issue Oct 24, 2023 · 3 comments
Labels
bug, triage, Stale

Comments

@bjwilliford

Describe the bug

When dbt_utils.generate_surrogate_key takes in timestamp fields, it casts them to strings in order to concatenate them with the other parts of the compound key. In doing so, it truncates the timestamps to millisecond precision, discarding precision that may be necessary for compound keys to stay unique.

Steps to reproduce

We can see here that casting these two timestamps as text removes the level of precision that makes them unique:

with final as (
    select
        '2023-10-23 21:00:02.884636000'::timestamp as ts_field
    union
    select
        '2023-10-23 21:00:02.884637000'::timestamp as ts_field
)

select
    ts_field as ts_field_precise,
    cast(ts_field as text) as ts_field_text
from final
[screenshot: query output showing ts_field_text truncated relative to ts_field_precise]

When our grouping includes these timestamps, they show as separate rows, but if they are also passed into the surrogate key macro (simulated here), the truncation results in the same hash for both rows.

with final as (
    select
        1234 as client_id,
        '2023-10-23 21:00:02.884636000'::timestamp as event_at,
        123 as amount
    union
    select
        1234 as client_id,
        '2023-10-23 21:00:02.884637000'::timestamp as event_at,
        456 as amount
)

--Returns two rows since timestamps are different, but the surrogate key is the same
select
    md5(cast(client_id as text) || '-' || cast(event_at as text)) as surrogate_key,
    client_id,
    event_at,
    sum(amount) as sum_amount
from final
group by 1,2,3
[screenshot: two rows are returned, but both rows share the same surrogate_key]

Expected results

Timestamps should be included in the surrogate key hash at their full precision rather than being truncated, so that the macro remains a reliable producer of unique keys.

Actual results

Timestamps are truncated, which can result in surrogate key duplication if multiple timestamps are within a millisecond of each other.

Screenshots and log output

See above output.

System information

The contents of your packages.yml file:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.0.0

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.6.4
  - latest:    1.6.6

Additional context

As the macro is currently structured, it appears to be happening with the typecasting HERE.

It seems like the solution may be to convert timestamps to their nanosecond Unix time representation before concatenating, but I was not able to get that working locally. Since columns are passed in as string arguments rather than column objects, it is a bit less straightforward than using Jinja to pre-transform columns based on their type before concatenating.
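As a rough sketch of that idea (not a tested fix), one model-level workaround is to derive an integer epoch column yourself and pass that to the macro instead of the raw timestamp. The snippet below assumes Snowflake, where date_part(epoch_nanosecond, ...) is available; the model and column names are only illustrative:

with events as (

    select
        client_id,
        event_at,
        -- integer nanosecond representation, so no precision is lost when the
        -- macro casts this column to a string
        date_part(epoch_nanosecond, event_at) as event_at_ns
    from {{ ref('stg_events') }}  -- hypothetical upstream model

)

select
    {{ dbt_utils.generate_surrogate_key(['client_id', 'event_at_ns']) }} as surrogate_key,
    client_id,
    event_at
from events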

Are you interested in contributing the fix?

I have tried a couple of potential solutions, but was not able to get it working the way I wanted. I'll be curious to hear other thoughts on how this can be addressed.

@bjwilliford bjwilliford added bug Something isn't working triage labels Oct 24, 2023
@raphaelvarieras

I have encountered the same issue and fixed it (in Snowflake) by casting to string with nanosecond precision prior to using the macro. You should be able to do this as well as long as you are explicitly listing the fields in your surrogate key macro call:

In other words, replace:

{{ dbt_utils.generate_surrogate_key(['field_a', 'field_b',...,'timestamp_field'[,...]]) }}

with:

{{ dbt_utils.generate_surrogate_key(['field_a', 'field_b',...,'to_varchar(timestamp_field,\'YYYY-MM-DD HH24:MI:SS.FF9\')'[,...]]) }}
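For example, in a Snowflake model (column and model names here are purely illustrative), the call could look like the following; FF9 formats all nine fractional-second digits, so sub-millisecond differences survive the cast to string:

select
    {{ dbt_utils.generate_surrogate_key([
        'client_id',
        "to_varchar(event_at, 'YYYY-MM-DD HH24:MI:SS.FF9')"
    ]) }} as surrogate_key,
    client_id,
    event_at
from {{ ref('stg_events') }}  -- hypothetical upstream model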

I suspect this may not be implemented as a new version of the key-generation algorithm because it would introduce a "breaking change" of sorts, in the sense that the generated keys would change for everyone using sub-millisecond timestamps in their keys. Maybe it could be another dbt_project.yml parameter, though?


This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 27, 2024

github-actions bot commented May 4, 2024

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned May 4, 2024