
generate_surrogate_key macro truncates timestamps to ms #850

Closed
bjwilliford opened this issue Oct 24, 2023 · 3 comments
Labels
bug, triage, Stale

Comments

@bjwilliford

Describe the bug

When dbt_utils.generate_surrogate_key takes in timestamp fields, it casts them to strings in order to concatenate them with the other parts of the compound key. In doing so, it truncates the timestamps to millisecond precision, discarding precision that may be necessary for compound keys to stay unique.

Steps to reproduce

We can see here that casting these two timestamps as text removes the level of precision that makes them unique:

with final as (
    select
        '2023-10-23 21:00:02.884636000'::timestamp as ts_field
    union
    select
        '2023-10-23 21:00:02.884637000'::timestamp as ts_field
)

select
    ts_field as ts_field_precise,
    cast(ts_field as text) as ts_field_text
from final
[screenshot: query output showing ts_field_text truncated relative to ts_field_precise]

When our grouping includes these timestamps, they show as separate rows, but if they are also passed into the surrogate key macro (simulated here), the truncation results in the same hash for both rows.

with final as (
    select
        1234 as client_id,
        '2023-10-23 21:00:02.884636000'::timestamp as event_at,
        123 as amount
    union
    select
        1234 as client_id,
        '2023-10-23 21:00:02.884637000'::timestamp as event_at,
        456 as amount
)

--Returns two rows since timestamps are different, but the surrogate key is the same
select
    md5(cast(client_id as text) || '-' || cast(event_at as text)) as surrogate_key,
    client_id,
    event_at,
    sum(amount) as sum_amount
from final
group by 1,2,3
[screenshot: two rows are returned, but both rows share the same surrogate_key]

Expected results

Timestamps should be included in the surrogate key hash at their full precision rather than being truncated, so that the macro remains a reliable producer of unique keys.

Actual results

Timestamps are truncated, which can result in surrogate key duplication if multiple timestamps are within a millisecond of each other.

Screenshots and log output

See above output.

System information

The contents of your packages.yml file:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.0.0

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.6.4
  - latest:    1.6.6

Additional context

As the macro is currently structured, it appears to be happening with the typecasting HERE.

It seems like the solution may be to convert timestamps to their nanosecond Unix time representation before concatenating, but I was not able to get that working locally. Since columns are passed in as string arguments rather than column objects, it is a bit less straightforward than using Jinja to pre-transform columns based on their type before concatenating.
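As a rough sketch of that idea (not a tested fix), one model-level workaround is to derive an integer epoch column yourself and pass that to the macro instead of the raw timestamp. The snippet below assumes Snowflake, where date_part(epoch_nanosecond, ...) is available; the model and column names are only illustrative:

with events as (

    select
        client_id,
        event_at,
        -- integer nanosecond representation, so no precision is lost when the
        -- macro casts this column to a string
        date_part(epoch_nanosecond, event_at) as event_at_ns
    from {{ ref('stg_events') }}  -- hypothetical upstream model

)

select
    {{ dbt_utils.generate_surrogate_key(['client_id', 'event_at_ns']) }} as surrogate_key,
    client_id,
    event_at
from events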

Are you interested in contributing the fix?

I have tried a couple of potential solutions, but was not able to get it working the way I wanted. I'll be curious to hear other thoughts on how this can be addressed.

@bjwilliford bjwilliford added bug Something isn't working triage labels Oct 24, 2023
@raphaelvarieras

I have encountered the same issue and fixed it (in Snowflake) by casting to string with nanosecond precision prior to using the macro. You should be able to do this as well as long as you are explicitly listing the fields in your surrogate key macro call:

In other words, replace:

{{ dbt_utils.generate_surrogate_key(['field_a', 'field_b',...,'timestamp_field'[,...]]) }}

with:

{{ dbt_utils.generate_surrogate_key(['field_a', 'field_b',...,'to_varchar(timestamp_field,\'YYYY-MM-DD HH24:MI:SS.FF9\')'[,...]]) }}
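For example, in a Snowflake model (column and model names here are purely illustrative), the call could look like the following; FF9 formats all nine fractional-second digits, so sub-millisecond differences survive the cast to string:

select
    {{ dbt_utils.generate_surrogate_key([
        'client_id',
        "to_varchar(event_at, 'YYYY-MM-DD HH24:MI:SS.FF9')"
    ]) }} as surrogate_key,
    client_id,
    event_at
from {{ ref('stg_events') }}  -- hypothetical upstream model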

I suspect this may not be implemented as a new version of the key-generation algorithm because it would introduce a "breaking change" of sorts, in the sense that the generated keys would change for everyone using sub-millisecond timestamps in their keys. Maybe it could be another dbt_project.yml parameter, though?


This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 27, 2024

github-actions bot commented May 4, 2024

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned May 4, 2024